The Standard Pipeline

The recommended path from a trained PyTorch model to optimised edge inference has three steps: export the model to ONNX, parse the ONNX graph with the TensorRT builder and serialise the resulting engine to disk, then load the serialised engine at runtime. The engine file is platform-specific — a Jetson Nano engine will not run on a Jetson Orin, and vice versa.

We use this pipeline on several production deployments. Below is a battle-tested implementation followed by the failure modes we have encountered on real hardware.

Step 1 — Export to ONNX

The most important decision here is the ONNX opset version. TensorRT 8.x supports up to opset 17 reliably. Newer PyTorch versions may default to opset 18 or 19. Always pin the opset explicitly — and avoid dynamic_axes unless you genuinely need variable batch sizes at runtime (more on this below).

Python — export.py
import torch
import onnx

def export_to_onnx(model: torch.nn.Module,
                   input_shape: tuple,
                   onnx_path: str,
                   opset: int = 16) -> bool:
    """
    Export a PyTorch model to ONNX.
    Returns True on success, prints parse errors on failure.
    """
    model.eval().cuda()
    dummy = torch.randn(input_shape, device='cuda')

    try:
        torch.onnx.export(
            model,
            dummy,
            onnx_path,
            opset_version=opset,           # Pin this. Do NOT use the default.
            input_names=['input'],
            output_names=['output'],
            do_constant_folding=True,      # Fold constants for smaller graph
            # Omit dynamic_axes unless you specifically need variable batch size.
            # Dynamic axes cause TensorRT to generate slower, larger engines.
        )
        # Always validate the exported graph before attempting a TRT build
        onnx_model = onnx.load(onnx_path)
        onnx.checker.check_model(onnx_model)
        print(f"ONNX export OK  →  {onnx_path}")
        return True

    except Exception as e:
        print(f"ONNX export failed: {e}")
        return False


# Example: ResNet-50 at fixed batch size 1
if __name__ == '__main__':
    from torchvision.models import resnet50
    model = resnet50(weights='IMAGENET1K_V1')
    export_to_onnx(model, (1, 3, 224, 224), 'resnet50.onnx')

Step 2 — Build the TensorRT Engine

Engine building is slow — 30 to 120 seconds on Jetson Nano, 10 to 30 seconds on Orin. Always serialise the built engine to a file and load from disk on subsequent runs. Never build from ONNX at application startup in production.

Python — build_engine.py
import tensorrt as trt
from pathlib import Path

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def build_engine(onnx_path: str,
                 engine_path: str,
                 fp16: bool = True,
                 workspace_gb: int = 2) -> bool:
    """
    Build and serialise a TensorRT engine from an ONNX file.
    fp16=True is safe for most classification/detection models on Jetson AGX/Orin.
    On Jetson Nano, check accuracy before enabling — Nano's FP16 throughput gain
    is smaller and precision loss more noticeable.
    """
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser  = trt.OnnxParser(network, TRT_LOGGER)
    config  = builder.create_builder_config()

    config.set_memory_pool_limit(
        trt.MemoryPoolType.WORKSPACE,
        workspace_gb * (1 << 30)
    )

    if fp16 and builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)
        print("FP16 enabled")

    with open(onnx_path, 'rb') as f:
        ok = parser.parse(f.read())

    if not ok:
        for i in range(parser.num_errors):
            print(f"  TRT parse error [{i}]: {parser.get_error(i)}")
        return False

    print(f"Building engine (this takes 30–120 s on Jetson Nano)...")
    serialized = builder.build_serialized_network(network, config)

    if serialized is None:
        print("Engine build failed — check ONNX graph for unsupported ops")
        return False

    with open(engine_path, 'wb') as f:
        f.write(serialized)

    size_mb = Path(engine_path).stat().st_size / (1024 * 1024)
    print(f"Engine saved  →  {engine_path}  ({size_mb:.1f} MB)")
    return True
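
To make the load-from-disk rule painless during development, a thin caching wrapper helps: build once when the engine file is missing, deserialise from disk on every subsequent run. A minimal sketch reusing build_engine() from above; the name load_or_build_engine is ours, and in production you would ship the pre-built engine rather than rely on the build fallback.

Python — engine caching (sketch)
def load_or_build_engine(onnx_path: str, engine_path: str, **build_kwargs):
    """Deserialise a cached engine if present; otherwise build it once."""
    if not Path(engine_path).exists():
        if not build_engine(onnx_path, engine_path, **build_kwargs):
            raise RuntimeError(f"Engine build failed for {onnx_path}")
    runtime = trt.Runtime(TRT_LOGGER)
    with open(engine_path, 'rb') as f:
        return runtime.deserialize_cuda_engine(f.read())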

Step 3 — Runtime Inference

Python — infer.py
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit          # initialises CUDA context
import numpy as np

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


class TRTInferencer:
    """Minimal TensorRT inferencer. One input, one output."""

    def __init__(self, engine_path: str):
        runtime = trt.Runtime(TRT_LOGGER)
        with open(engine_path, 'rb') as f:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()

    def infer(self, x: np.ndarray) -> np.ndarray:
        # Allocate device memory per call (simple, but see the preallocated
        # variant below for a production-friendly version)
        d_input  = cuda.mem_alloc(x.nbytes)
        # Binding 0 is the input, binding 1 the output; this assumes a
        # single-input, single-output engine
        out_shape = tuple(self.engine.get_binding_shape(1))
        output   = np.empty(out_shape, dtype=np.float32)
        d_output = cuda.mem_alloc(output.nbytes)

        # H2D copy, execute, D2H copy
        cuda.memcpy_htod(d_input, np.ascontiguousarray(x))
        self.context.execute_v2(bindings=[int(d_input), int(d_output)])
        cuda.memcpy_dtoh(output, d_output)

        return output


# Usage
if __name__ == '__main__':
    trt_model = TRTInferencer('resnet50.engine')
    frame = np.random.rand(1, 3, 224, 224).astype(np.float32)
    logits = trt_model.infer(frame)
    print("Top-1 class:", logits.argmax())

Pitfall 1 — Dynamic Axes

If your ONNX export includes dynamic_axes={'input': {0: 'batch'}}, TensorRT requires an optimisation profile that tells the builder which batch sizes to plan for. Skip this and the build will either fail or produce an engine that errors out at runtime when you pass a concrete tensor.

Unless you genuinely need variable batch size at inference time, remove dynamic_axes entirely. Fix your batch size to 1 (or whatever you run at the edge) and let TensorRT optimise for that exact shape. The resulting engine is faster and the build process is simpler. If you truly do need dynamic batching, register an optimisation profile, as sketched below.
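
If you do fall into the dynamic-batch case, the profile registration belongs in build_engine(), after parsing succeeds and before build_serialized_network() is called. A minimal sketch; the min/opt/max shapes are illustrative and should match your real deployment range.

Python — optimisation profile (sketch)
    # Inside build_engine(), after parser.parse() succeeds:
    profile = builder.create_optimization_profile()
    profile.set_shape('input',                  # must match the ONNX input name
                      min=(1, 3, 224, 224),     # smallest batch you will ever pass
                      opt=(4, 3, 224, 224),     # shape TensorRT optimises hardest for
                      max=(8, 3, 224, 224))     # largest batch you will ever pass
    config.add_optimization_profile(profile)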

Pitfall 2 — Unsupported ONNX Ops

Some PyTorch operations — particularly custom activation functions, certain attention mechanisms, and operations added after ONNX opset 17 — do not have a TensorRT implementation. The parser will report an error like:

Error output
TRT parse error [0]: No importer registered for op: HardSwish
TRT parse error [1]: While parsing node number 47 [HardSwish -> "456"]

Options in order of preference: (1) replace the unsupported op with a supported equivalent in PyTorch before exporting, (2) downgrade the opset, or (3) write a custom TensorRT plugin. The third option is rarely worth it for edge deployments — prefer model architecture changes.
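
As an example of option (1), the HardSwish failure above can usually be fixed by swapping nn.Hardswish modules for the equivalent expression x * relu6(x + 3) / 6 (HardSwish's standard definition), which uses only widely supported ops. A minimal sketch; HardSwishCompat and replace_hardswish are our names.

Python — op replacement (sketch)
import torch.nn as nn
import torch.nn.functional as F

class HardSwishCompat(nn.Module):
    """HardSwish built from ops every TRT/opset combination handles."""
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

def replace_hardswish(model: nn.Module) -> nn.Module:
    """Recursively swap nn.Hardswish modules before ONNX export."""
    for name, child in model.named_children():
        if isinstance(child, nn.Hardswish):
            setattr(model, name, HardSwishCompat())
        else:
            replace_hardswish(child)
    return model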

Pitfall 3 — FP16 Precision Loss

FP16 reduces memory bandwidth and often increases throughput, but not all models tolerate it. The failure mode is subtle: the engine builds fine, inference runs, but accuracy metrics degrade silently. Always validate FP16 accuracy against your ground truth dataset before deploying. A 200-sample spot check is usually sufficient to catch problems.

Python — fp16 validation
import numpy as np
# TRTInferencer is the class defined in infer.py above

def compare_fp32_fp16(val_loader, fp32_engine, fp16_engine,
                      threshold=0.01, max_samples=200):
    """
    Run up to max_samples of the validation set through both engines.
    Returns True if the mean absolute output difference is below threshold.
    """
    fp32_model = TRTInferencer(fp32_engine)
    fp16_model = TRTInferencer(fp16_engine)
    diffs = []
    seen = 0

    for images, _ in val_loader:
        x = images.numpy().astype(np.float32)
        out_fp32 = fp32_model.infer(x)
        out_fp16 = fp16_model.infer(x)
        diffs.append(np.abs(out_fp32 - out_fp16).mean())
        seen += x.shape[0]
        if seen >= max_samples:   # a ~200-sample spot check is usually enough
            break

    mean_diff = np.mean(diffs)
    print(f"Mean absolute FP32/FP16 output diff: {mean_diff:.5f}")
    return mean_diff < threshold
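
Usage, assuming two engines built from the same ONNX file, one with fp16=False and one with fp16=True (the engine filenames are illustrative):

Python — fp16 validation usage
ok = compare_fp32_fp16(val_loader,
                       'resnet50_fp32.engine',
                       'resnet50_fp16.engine')
if not ok:
    print("FP16 drift exceeds threshold; ship the FP32 engine")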

Platform Notes

JetPack version lock: Pin your JetPack version in production. Updating JetPack updates TensorRT, which means your engine files are invalid until rebuilt. Document the JetPack version alongside every engine file you ship.
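
A lightweight way to enforce this is to write a metadata sidecar next to every engine at build time. A minimal sketch; the sidecar format and write_engine_metadata are ours, and the JetPack version must be supplied from your provisioning system.

Python — engine metadata (sketch)
import json
import tensorrt as trt
from pathlib import Path

def write_engine_metadata(engine_path: str, jetpack_version: str):
    """Record the TensorRT/JetPack versions an engine was built against."""
    meta = {
        'engine': Path(engine_path).name,
        'tensorrt_version': trt.__version__,
        'jetpack_version': jetpack_version,  # e.g. '5.1.2', from your provisioning records
    }
    Path(engine_path).with_suffix('.json').write_text(json.dumps(meta, indent=2))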