The Standard Pipeline
The recommended path from a trained PyTorch model to optimised edge inference has three steps: export the model to ONNX; parse the ONNX graph with the TensorRT builder and serialise the resulting engine to disk; then load the engine at runtime. The engine file is platform-specific — a Jetson Nano engine will not run on a Jetson Orin, and vice versa.
We use this pipeline on several production deployments. Below is a battle-tested implementation followed by the failure modes we have encountered on real hardware.
Step 1 — Export to ONNX
The most important decision here is the ONNX opset version. TensorRT 8.x supports up to
opset 17 reliably. Newer PyTorch versions may default to opset 18 or 19. Always pin the
opset explicitly — and avoid dynamic_axes unless you genuinely need variable
batch sizes at runtime (more on this below).
import torch
import onnx

def export_to_onnx(model: torch.nn.Module,
                   input_shape: tuple,
                   onnx_path: str,
                   opset: int = 16) -> bool:
    """
    Export a PyTorch model to ONNX.
    Returns True on success, prints the error on failure.
    """
    model.eval().cuda()
    dummy = torch.randn(input_shape, device='cuda')
    try:
        torch.onnx.export(
            model,
            dummy,
            onnx_path,
            opset_version=opset,        # Pin this. Do NOT use the default.
            input_names=['input'],
            output_names=['output'],
            do_constant_folding=True,   # Fold constants for smaller graph
            # Omit dynamic_axes unless you specifically need variable batch size.
            # Dynamic axes cause TensorRT to generate slower, larger engines.
        )
        # Always validate the exported graph before attempting a TRT build
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        print(f"ONNX export OK → {onnx_path}")
        return True
    except Exception as e:
        print(f"ONNX export failed: {e}")
        return False

# Example: ResNet-50 at fixed batch size 1
if __name__ == '__main__':
    from torchvision.models import resnet50
    model = resnet50(weights='IMAGENET1K_V1')
    export_to_onnx(model, (1, 3, 224, 224), 'resnet50.onnx')
Step 2 — Build the TensorRT Engine
Engine building is slow — 30 to 120 seconds on Jetson Nano, 10 to 30 seconds on Orin. Always serialise the built engine to a file and load from disk on subsequent runs. Never build from ONNX at application startup in production.
import tensorrt as trt
from pathlib import Path

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path: str,
                 engine_path: str,
                 fp16: bool = True,
                 workspace_gb: int = 2) -> bool:
    """
    Build and serialise a TensorRT engine from an ONNX file.
    fp16=True is safe for most classification/detection models on Jetson AGX/Orin.
    On Jetson Nano, check accuracy before enabling — Nano's FP16 throughput gain
    is smaller and precision loss more noticeable.
    """
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)
    config = builder.create_builder_config()
    config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE,
        workspace_gb * (1 << 30)
    )
    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
        print("FP16 enabled")
    with open(onnx_path, 'rb') as f:
        ok = parser.parse(f.read())
    if not ok:
        for i in range(parser.num_errors):
            print(f"  TRT parse error [{i}]: {parser.get_error(i)}")
        return False
    print("Building engine (this takes 30–120 s on Jetson Nano)...")
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        print("Engine build failed — check ONNX graph for unsupported ops")
        return False
    with open(engine_path, 'wb') as f:
        f.write(serialized)
    size_mb = Path(engine_path).stat().st_size / (1024 * 1024)
    print(f"Engine saved → {engine_path} ({size_mb:.1f} MB)")
    return True
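To follow the load-don't-rebuild advice above, it helps to wrap the build behind an existence check. This is a minimal sketch using the build_engine function just defined; the get_engine name is ours, not part of any TensorRT API.

def get_engine(onnx_path: str, engine_path: str, **build_kwargs) -> str:
    """Return the path to a serialised engine, building from ONNX only if it is missing."""
    if not Path(engine_path).exists():
        if not build_engine(onnx_path, engine_path, **build_kwargs):
            raise RuntimeError(f"TensorRT engine build failed for {onnx_path}")
    return engine_path

# At application startup this is just a cheap file-existence check; the expensive
# ONNX parse and engine build only runs once (ideally in a provisioning step).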
Step 3 — Runtime Inference
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # initialises CUDA context
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class TRTInferencer:
    """Minimal TensorRT inferencer. One input, one output."""

    def __init__(self, engine_path: str):
        runtime = trt.Runtime(TRT_LOGGER)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    def infer(self, x: np.ndarray) -> np.ndarray:
        # Allocate device memory
        d_input = cuda.mem_alloc(x.nbytes)
        out_shape = tuple(self.engine.get_binding_shape(1))
        output = np.empty(out_shape, dtype=np.float32)
        d_output = cuda.mem_alloc(output.nbytes)
        # H2D copy, execute, D2H copy
        cuda.memcpy_htod(d_input, np.ascontiguousarray(x))
        self.context.execute_v2(bindings=[int(d_input), int(d_output)])
        cuda.memcpy_dtoh(output, d_output)
        return output

# Usage
if __name__ == '__main__':
    trt_model = TRTInferencer('resnet50.engine')
    frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
    logits = trt_model.infer(frame)
    print("Top-1 class:", logits.argmax())
Pitfall 1 — Dynamic Axes
If your ONNX export includes dynamic_axes={'input': {0: 'batch'}}, TensorRT
requires you to set optimisation profiles to tell the builder what batch sizes to plan for.
Skip this and the build will either fail or produce an engine that errors out at runtime when
you pass a concrete tensor.
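If you do keep dynamic_axes, the profile setup looks roughly like the sketch below, added to the Step 2 build_engine before build_serialized_network(). It assumes the input tensor is named 'input' and only the batch dimension is dynamic; adjust names and shapes to your model.

    profile = builder.create_optimization_profile()
    profile.set_shape('input',
                      (1, 3, 224, 224),   # min: smallest batch the engine must accept
                      (4, 3, 224, 224),   # opt: shape TensorRT tunes kernels for
                      (8, 3, 224, 224))   # max: largest batch the engine must accept
    config.add_optimization_profile(profile)
    # At runtime, set the concrete input shape before execute_v2(), e.g.
    #   context.set_binding_shape(0, (batch, 3, 224, 224))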
Unless you genuinely need variable batch size at inference time, remove dynamic_axes
entirely. Fix your batch size to 1 (or whatever you run at the edge) and let TensorRT optimise
for that exact shape. The resulting engine is faster and the build process is simpler.
Pitfall 2 — Unsupported ONNX Ops
Some PyTorch operations — particularly custom activation functions, certain attention mechanisms, and operations added after ONNX opset 17 — do not have a TensorRT implementation. The parser will report an error like:
TRT parse error [0]: No importer registered for op: HardSwish
TRT parse error [1]: While parsing node number 47 [HardSwish -> "456"]
Options in order of preference: (1) replace the unsupported op with a supported equivalent in PyTorch before exporting, (2) downgrade the opset, or (3) write a custom TensorRT plugin. The third option is rarely worth it for edge deployments — prefer model architecture changes.
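As a sketch of option (1), the HardSwish case from the error above can be handled by swapping the module for a numerically equivalent composition of ops that every opset and parser version handles. The helper name and recursion strategy here are illustrative, not a fixed recipe.

import torch
import torch.nn as nn

class HardSwishDecomposed(nn.Module):
    """x * relu6(x + 3) / 6, expressed with basic ops instead of HardSwish."""
    def forward(self, x):
        return x * torch.clamp(x + 3.0, 0.0, 6.0) / 6.0

def replace_hardswish(model: nn.Module) -> nn.Module:
    """Recursively swap every nn.Hardswish module before calling export_to_onnx()."""
    for name, child in model.named_children():
        if isinstance(child, nn.Hardswish):
            setattr(model, name, HardSwishDecomposed())
        else:
            replace_hardswish(child)
    return model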
Pitfall 3 — FP16 Precision Loss
FP16 reduces memory bandwidth and often increases throughput, but not all models tolerate it. The failure mode is subtle: the engine builds fine, inference runs, but accuracy metrics degrade silently. Always validate FP16 accuracy against your ground truth dataset before deploying. A 200-sample spot check is usually sufficient to catch problems.
def compare_fp32_fp16(val_loader, fp32_engine, fp16_engine, threshold=0.01):
    """
    Run validation set through both engines.
    Returns True if mean absolute output difference is below threshold.
    """
    fp32_model = TRTInferencer(fp32_engine)
    fp16_model = TRTInferencer(fp16_engine)
    diffs = []
    for images, _ in val_loader:
        x = images.numpy().astype(np.float32)
        out_fp32 = fp32_model.infer(x)
        out_fp16 = fp16_model.infer(x)
        diffs.append(np.abs(out_fp32 - out_fp16).mean())
    mean_diff = np.mean(diffs)
    print(f"Mean absolute FP32/FP16 output diff: {mean_diff:.5f}")
    return mean_diff < threshold
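A typical way to wire this in, assuming the ONNX file from Step 1, the helpers above, and a val_loader (assumed defined elsewhere) whose batch size matches the fixed batch the engines were built with:

# val_loader: any DataLoader over your validation set (assumed to exist)
build_engine('resnet50.onnx', 'resnet50_fp32.engine', fp16=False)
build_engine('resnet50.onnx', 'resnet50_fp16.engine', fp16=True)
if not compare_fp32_fp16(val_loader, 'resnet50_fp32.engine', 'resnet50_fp16.engine'):
    print("FP16 drift above threshold; deploy the FP32 engine instead")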
Platform Notes
- Jetson Nano (Maxwell): FP16 throughput gain is modest. INT8 requires calibration data (a minimal calibrator sketch follows this list). Memory is the main constraint — keep engine + runtime under 2 GB.
- Jetson Orin (Ampere): FP16 and INT8 both significantly faster than FP32. Tensor cores are available. Engine build times are much shorter. Start with FP16.
- Engine portability: Engines are not portable across JetPack versions. If you update JetPack, rebuild the engine. Automate this in your deployment pipeline.
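Since the Nano note mentions INT8 calibration, here is a minimal sketch of an entropy calibrator that plugs into the Step 2 builder config. calib_batches is a hypothetical iterable of preprocessed float32 arrays matching the engine input shape, and the cache filename is arbitrary.

import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit
import tensorrt as trt

class EntropyCalibrator(trt.IInt8EntropyCalibrator2):
    def __init__(self, calib_batches, cache_file='calib.cache'):
        super().__init__()
        self.batches = list(calib_batches)   # keep this small on Nano's limited RAM
        self.index = 0
        self.cache_file = cache_file
        self.d_input = cuda.mem_alloc(self.batches[0].nbytes)

    def get_batch_size(self):
        return self.batches[0].shape[0]

    def get_batch(self, names):
        if self.index >= len(self.batches):
            return None                      # None signals the end of calibration
        cuda.memcpy_htod(self.d_input,
                         np.ascontiguousarray(self.batches[self.index]))
        self.index += 1
        return [int(self.d_input)]

    def read_calibration_cache(self):
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None

    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)

# Wire it into the Step 2 builder config before build_serialized_network():
#   config.set_flag(trt.BuilderFlag.INT8)
#   config.int8_calibrator = EntropyCalibrator(calib_batches)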