The ONNX Deployment Pipeline

A research-grade walkthrough of taking a CIFAR-10 CNN from PyTorch source code, through ONNX intermediate representation, graph-level optimization, quantization, and finally to multi-backend runtime execution.


01

Why ONNX exists, and what it actually solves

// motivation

A trained neural network is a tuple of (architecture, weights, operator semantics). The architecture lives in Python source code, the weights live in a framework-specific tensor format, and the operator semantics live implicitly in the framework's C++ kernels. Deploying a model means decoupling all three from the training framework — and that is the problem ONNX was designed to solve.

The problem before ONNX

Before standardized exchange formats, "deploying a PyTorch model to a Snapdragon NPU" meant writing a custom converter that walked the autograd graph, mapped each aten::* op to a vendor SDK call, and re-implemented anything not natively supported. Every (framework × runtime × hardware) combination carried its own bespoke engineering cost, and the number of combinations grows multiplicatively.

  • TensorFlow → TFLite: one path, one converter, fragile.
  • PyTorch → CoreML: another converter, separate maintenance.
  • Caffe → TensorRT: yet another, with its own op gaps.

What ONNX provides

ONNX (Open Neural Network Exchange) defines three things, and only three things:

  1. A protobuf schema for serializing computation graphs.
  2. A versioned opset with mathematical semantics for each operator.
  3. A type system over tensors (dtype, shape, optional symbolic dims).

It is not a runtime, not a training framework, and not a compiler. It is a contract between producers (frameworks) and consumers (runtimes / compilers).

// fig 1.1 — ONNX as the hub between training frameworks and inference targets: producers (PyTorch, TensorFlow / Keras, JAX / Flax, scikit-learn) → .onnx + opset → consumers (ONNX Runtime, TensorRT, OpenVINO, CoreML / NNAPI, TVM / IREE, WebGPU / WASM); N producers × M consumers → N + M integrations
The intuition. ONNX turns an O(N×M) integration matrix into an O(N+M) one. Each framework writes one exporter; each runtime writes one importer. The format absorbs the coupling.
  • ~190 standard operators
  • 22 opset versions (as of v22)
  • protobuf 3 wire format
02

The reference CIFAR-10 CNN

// running example

Throughout the rest of this page, every concept is grounded in one concrete model: a small but realistic CNN trained on CIFAR-10. CIFAR-10 is 60,000 32×32 RGB images across 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Small enough to animate end-to-end, complex enough to exhibit every interesting deployment phenomenon.

Architecture (PyTorch source)

import torch
import torch.nn as nn
import torch.nn.functional as F

class CIFAR10Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: 32x32x3 -> 16x16x32
        self.conv1 = nn.Conv2d(3,  32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, padding=1)
        self.bn1   = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2)

        # Block 2: 16x16x32 -> 8x8x64
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, 3, padding=1)
        self.bn2   = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2)

        # Block 3: 8x8x64 -> 4x4x128
        self.conv5 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3   = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(2)

        # Classifier head
        self.fc1 = nn.Linear(128*4*4, 256)
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool1(F.relu(self.bn1(
              self.conv2(F.relu(self.conv1(x))))))
        x = self.pool2(F.relu(self.bn2(
              self.conv4(F.relu(self.conv3(x))))))
        x = self.pool3(F.relu(self.bn3(self.conv5(x))))
        x = x.flatten(1)
        x = self.drop(F.relu(self.fc1(x)))
        return self.fc2(x)

By the numbers

  • ~580K parameters
  • ~17M FLOPs per image
  • 2.3 MB of FP32 weights
  • ~91% top-1 accuracy

This is roughly the size of a textbook VGG-style CNN scaled for CIFAR-10. It contains every operator class we need to discuss:

Conv2d MatMul Relu BatchNorm MaxPool Flatten Reshape Add Softmax

Compute (Conv, MatMul) dominates runtime. Shape ops (Flatten, Reshape) are nearly free. Reductions (MaxPool) sit in between. This taxonomy will matter when we discuss fusion and quantization.

Layer-by-layer tensor shapes

3×32² (input) → 32×32² (conv1+2 + bn) → 32×16² (pool1) → 64×16² (conv3+4 + bn) → 64×8² (pool2) → 128×8² (conv5 + bn) → 128×4² (pool3) → 2048 (flatten) → 256 (fc1+relu+drop) → 10 (fc2, logits); spatial ↓, channels ↑: the canonical CNN funnel
// fig 2.1 — tensor shape evolution through the CIFAR-10 CNN
Reading the funnel. The classic CNN move: trade spatial resolution for channel depth. Pool layers halve H and W; conv layers double channels. The total tensor size shrinks by ~2× per stage (½ × ½ × 2 = ½), keeping per-layer compute roughly constant while letting the receptive field grow.
03

Export: tracing vs scripting, and the symbolic shape problem

// pytorch → onnx

Exporting is not a translation. It is a reconstruction: PyTorch's forward() is arbitrary Python, and ONNX is a static dataflow graph. Bridging the two requires either running the Python (tracing) or parsing it (scripting), and each has failure modes.

Tracing

Run the model on a sample input. Record every framework op called. Emit those ops as the graph.

dummy = torch.randn(1, 3, 32, 32)

torch.onnx.export(
    model, dummy, "cifar10.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={
        "image":  {0: "batch"},
        "logits": {0: "batch"}
    },
    opset_version=17,
)

Captures: the exact ops executed for the dummy input.
Loses: any control flow that depended on input values.

Scripting (TorchScript)

Statically analyze the Python source. Translate if, for, while into ONNX If, Loop ops.

scripted = torch.jit.script(model)
torch.onnx.export(scripted, dummy, "cifar10.onnx")

Captures: control flow, dynamic shapes, recursive structures.
Loses: support for arbitrary Python — only a typed subset is supported. Many real-world models fail to script without rewrites.

For a feed-forward CNN like ours there is no control flow, so tracing is sufficient and preferred. For LSTMs, beam search, or anything with data-dependent loops, scripting (or torch.export) becomes mandatory.

The dynamic axis dance

By default, tracing bakes the dummy input's shape into the graph. A model traced with (1,3,32,32) will only accept batch size 1 unless you mark the batch dimension as dynamic via dynamic_axes. Most production deployments require at least {0: "batch"} on every input and output.

The classic export footgun. Calling tensor.shape[0] in Python returns a concrete int during tracing. The exporter records that int as a constant in the graph, silently breaking dynamic batching. Always prefer tensor.size(0) or, in opset 13+, allow the exporter to emit a Shape + Gather.
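One quick way to verify that the dynamic axis actually made it into the exported file is to inspect the first input's dimensions; a minimal sketch, assuming cifar10.onnx was produced by the export call above:

import onnx

m = onnx.load("cifar10.onnx")
dim0 = m.graph.input[0].type.tensor_type.shape.dim[0]
# A dynamic axis is stored as dim_param ("batch"); a baked-in one as dim_value (1).
print(dim0.dim_param or dim0.dim_value)   # expect: "batch"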

Modern alternative: torch.export

PyTorch 2.x introduced torch.export, a compiler-grade frontend that produces a fully captured FX graph with symbolic shapes. The new ONNX exporter (torch.onnx.dynamo_export) builds on it, eliminating most edge cases of the legacy tracer.

# The 2.x dynamo path
from torch.onnx import dynamo_export

prog = dynamo_export(model, dummy)
prog.save("cifar10.onnx")
// fig 3.1 — two paths from Python source to .onnx: tracing (execute the nn.Module with a dummy input and record the ops invoked; control flow is lost) vs scripting / Dynamo (parse to an FX graph; If / Loop and symbolic shapes preserved), both landing in a serialized ONNX GraphProto (~2.4 MB on disk)
04

The ONNX intermediate representation

// protobuf internals

An .onnx file is a serialized ModelProto message. Understanding the schema is the difference between treating ONNX as a black box and being able to debug, patch, or hand-author models.

Schema hierarchy

ModelProto {
  ir_version,               // e.g. 9
  producer_name, version,
  opset_import [{domain, version}],
  graph: GraphProto {
    name,
    node:        [NodeProto],   // the ops
    initializer: [TensorProto], // weights
    input:       [ValueInfo],
    output:      [ValueInfo],
    value_info:  [ValueInfo]    // intermediates
  },
  metadata_props
}

Anatomy of a NodeProto

NodeProto {
  op_type:    "Conv",
  domain:     "",            // "" = standard
  name:       "/conv1/Conv",
  input:      ["image",
               "conv1.weight",
               "conv1.bias"],
  output:     ["/conv1/Conv_out"],
  attribute:  [
    {name: "kernel_shape", ints: [3, 3]},
    {name: "pads",         ints: [1, 1, 1, 1]},
    {name: "strides",      ints: [1, 1]},
    {name: "group",        i:    1}
  ]
}

Inspecting our exported CIFAR-10 graph

import onnx
m = onnx.load("cifar10.onnx")

print(f"opset: {m.opset_import[0].version}")
print(f"ir_version: {m.ir_version}")
print(f"nodes: {len(m.graph.node)}")
print(f"initializers: {len(m.graph.initializer)}")

for n in m.graph.node[:5]:
    print(n.op_type, n.input, "->", n.output)

# Conv ['image', 'conv1.weight', 'conv1.bias'] -> ['/conv1/Conv_out']
# Relu ['/conv1/Conv_out']                       -> ['/Relu_out']
# Conv ['/Relu_out', 'conv2.weight', ...]        -> ['/conv2/Conv_out']
# BatchNormalization [...]                       -> ['/bn1/BN_out']
# Relu ['/bn1/BN_out']                           -> ['/Relu_1_out']

Visualizing the graph

Every NodeProto is a vertex; every shared tensor name is an edge. Here is the topological structure of our CIFAR-10 graph after export, before any optimization:

// fig 4.1 — exported CIFAR-10 graph (block 2 / 3 collapsed for clarity): image → Conv 3×3 → Relu → Conv 3×3 → BatchNorm (ε=1e-5) → Relu → MaxPool 2×2 → [block 2: Conv→Relu→Conv→BN→Relu→Pool] → [block 3: Conv→BN→Relu→MaxPool] → Flatten → Gemm → Relu → Dropout → Gemm → logits, with initializers (conv weights, BN γ, β, μ, σ²) feeding their consumers
SSA, basically. ONNX is in static single assignment form: every tensor name is produced by exactly one node. There are no in-place ops. This is what lets graph optimizers reason locally without alias analysis.

Initializers vs inputs

A subtlety that trips up newcomers: weights live in graph.initializer, not graph.input. Some tools list them under both for backward compatibility, but the canonical interpretation is: anything in initializer is a constant tensor baked into the model; anything in input is something the caller must supply at runtime.
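A short way to see the split in practice, continuing from the onnx.load snippet above (on newer exporters graph.input typically already contains only the true runtime inputs):

import onnx

m = onnx.load("cifar10.onnx")
init_names = {t.name for t in m.graph.initializer}
runtime_inputs = [i.name for i in m.graph.input if i.name not in init_names]

print("constants baked into the model:", len(init_names))  # conv/fc weights, BN params
print("caller-supplied inputs:", runtime_inputs)           # ['image']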

05

Opsets, operator semantics, custom ops

// the contract

An opset version is the contract between producer and consumer. Conv-11 and Conv-22 may have different attribute defaults, broadcasting rules, or supported dtypes. Pinning an opset is as load-bearing as pinning a Python version.

Operator | Domain | Inputs | Notable attrs | Where it appears in our model
Conv | ai.onnx | X, W, B? | kernel_shape, pads, strides, dilations, group | conv1–conv5
BatchNormalization | ai.onnx | X, scale, B, mean, var | epsilon, momentum, training_mode | bn1–bn3
Relu | ai.onnx | X | (none) | after every conv/fc
MaxPool | ai.onnx | X | kernel_shape, strides, pads, ceil_mode | pool1–pool3
Flatten | ai.onnx | X | axis | before fc1
Gemm | ai.onnx | A, B, C? | alpha, beta, transA, transB | fc1, fc2
Dropout | ai.onnx | data, ratio?, training_mode? | seed | dropout (no-op at inference)

Why opset version matters: a Resize example

Consider a model with image upscaling. In opset 10, Resize took scales as a single input. In opset 11, the signature changed: roi was added, and scales moved to the third input. A model exported under opset 10 will silently mis-link arguments if loaded under a strict opset-11 importer.

Practical rule. Pick the lowest opset that contains every op you need, and freeze it. Bumping opset versions is a load-bearing refactor, not a build-system flag.
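Checking which opset a model was exported against, and migrating it if you must, can be done with onnx's built-in version converter; a sketch (conversion is best-effort and can fail for ops whose signatures changed, so always re-validate afterwards):

import onnx
from onnx import version_converter

m = onnx.load("cifar10.onnx")
print({o.domain or "ai.onnx": o.version for o in m.opset_import})

# Best-effort migration of the default domain to opset 17; re-check the graph afterwards.
m17 = version_converter.convert_version(m, 17)
onnx.checker.check_model(m17)
onnx.save(m17, "cifar10_opset17.onnx")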

Custom operators

When the standard opset doesn't cover an op (e.g., a fused attention with bespoke masking), you have three choices:

1. Decompose

Express the op as a subgraph of standard ones. Slow but maximally portable. Most exporters do this by default for unsupported ops.

2. Custom domain

Use a non-empty domain on the NodeProto (e.g., com.microsoft). The runtime must register a kernel for that (domain, op_type, version) triple.

3. Function

ONNX functions: a named subgraph that the runtime can either inline or replace with a fused kernel. The cleanest path for emerging ops.

06

Graph optimization

// fusion · folding · layout

An exported graph is rarely the graph that actually runs. Between .onnx on disk and the first kernel call sits a graph optimizer that can shrink the node count by 30–60% and the latency by 2–5×.

Three classes of transformation

  • Constant folding
  • Operator fusion
  • Layout transforms

Constant folding

Any subgraph whose inputs are all initializers can be evaluated at load time. The classic example: Reshape(W, Concat(Shape(W), [1])). The shape and concat depend only on a constant weight, so the entire reshape becomes a new constant tensor.

For our CIFAR-10 model, the post-export graph has constant subgraphs around the BatchNorm parameters that fold completely.

# Before
y = Reshape(weight, Concat([Shape(weight)[0:2], Constant([3,3])]))

# After
y = precomputed_constant

Operator fusion

The single highest-leverage optimization. Two adjacent ops that share an intermediate tensor become one kernel that reads the input once and writes the output once — saving the intermediate's memory bandwidth.

// fig 6.1 — Conv-BN-Relu fusion. Before: 3 nodes with 2 intermediates (/conv_out and /bn_out, each [N,32,32,32], ~128 KB per call). After: a single FusedConv (activation=relu, BN baked into weight/bias), 0 KB of intermediates, 1 kernel launch.

BN folding into Conv. Since BatchNorm at inference is y = γ(x-μ)/√(σ²+ε) + β, and Conv is y = Wx + b, you can analytically fold BN's affine into Conv's weights:

W' = W * (γ / sqrt(σ² + ε)).reshape(-1, 1, 1, 1)
b' = (b - μ) * γ / sqrt(σ² + ε) + β

After folding, every Conv-BN-Relu in our model becomes a single FusedConv node. For CIFAR-10Net: 14 nodes → 8 nodes.
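The same algebra in PyTorch, as a minimal inference-only sketch (graph optimizers perform this rewrite on the ONNX graph itself, but the arithmetic is identical):

import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)     # γ / √(σ²+ε), per channel
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)   # (b−μ)·γ/√(σ²+ε) + β
    return fused

# e.g. fold_bn_into_conv(model.conv2, model.bn1) reproduces the fused Conv's parameters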

Layout transforms (NCHW vs NHWC)

PyTorch defaults to NCHW (batch, channels, height, width). Many mobile NPUs and the Apple Neural Engine prefer NHWC. The optimizer can insert Transpose nodes at the boundary, then push them through the graph until they cancel out.

This is called transpose elimination: Transpose(Transpose(x, [0,2,3,1]), [0,3,1,2]) ≡ x. Done correctly across an entire graph, the only remaining transposes are at inputs and outputs.

Layout | Best for | Why
NCHW | NVIDIA GPUs (older), CUDA cuDNN | cuDNN's most mature kernels are NCHW
NHWC | TPU, Apple Neural Engine, modern Tensor Cores | Channels-last vectorizes better with WMMA / MMA
NC/32HW32 | TensorRT INT8 | Tile-friendly for IMMA / DP4A instructions
Net effect on CIFAR-10Net. After ORT's Level 2 optimization (fold + fuse + layout): node count drops from 32 to 11, and CPU latency drops from 4.1 ms to 1.6 ms at batch 1 on a single Skylake core.
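A convenient way to see those transforms with your own eyes is to ask ORT to write the post-optimization graph back to disk and compare node counts; a minimal sketch:

import onnx
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED  # extended (Level 2) passes
opts.optimized_model_filepath = "cifar10_opt.onnx"   # dump the transformed graph to disk

ort.InferenceSession("cifar10.onnx", sess_options=opts,
                     providers=["CPUExecutionProvider"])

before = len(onnx.load("cifar10.onnx").graph.node)
after = len(onnx.load("cifar10_opt.onnx").graph.node)
print(before, "->", after)   # folded and fused nodes disappear from the count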
07

Quantization

// fp32 → int8 / fp16

Quantization replaces FP32 weights and activations with lower-precision types — typically INT8 or FP16. The weight file shrinks 4×, integer SIMD throughput goes up 2–4×, and on accelerators with INT8 tensor cores the speedup compounds. The price is accuracy loss, which good quantization keeps under 1% top-1.

Dynamic

Weights are quantized offline; activations are quantized on-the-fly per-batch. No calibration data needed. Best for transformer-like models where activation ranges vary heavily.

~2× speedup, ~10MB → ~3MB

Static (PTQ)

Run a calibration set through the FP32 model, record activation min/max histograms, derive scale + zero-point per tensor, freeze. Best for CNNs with stable activation distributions — i.e. our CIFAR-10 model.

~3-4× speedup, <1% accuracy drop

QAT

Quantization-aware training: insert fake-quant ops during training so the model learns weights robust to quantization noise. Highest accuracy, requires retraining infrastructure.

~3-4× speedup, virtually no accuracy drop

The math: affine quantization

An INT8 tensor q approximates an FP32 tensor x via a per-tensor (or per-channel) scale and zero point:

x ≈ scale · (q − zero_point)
where  q ∈ [−128, 127]  (int8, signed symmetric)
   or  q ∈ [   0, 255]  (uint8, asymmetric)

scale       = (x_max − x_min) / (q_max − q_min)
zero_point  = round(q_min − x_min / scale)

Per-channel quantization (one scale per output channel of a Conv) preserves accuracy far better than per-tensor on CNNs. ONNX's QuantizeLinear and DequantizeLinear ops accept either a scalar or a 1-D scale / zero-point input to express this.
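The formulas above, executed on a single tensor; a minimal NumPy sketch of per-tensor asymmetric quantization (real quantizers additionally clamp scales, handle zero-range tensors, and work per channel):

import numpy as np

def quantize(x, qmin=-128, qmax=127):
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

x = np.random.randn(64, 3, 3, 3).astype(np.float32)   # e.g. a conv weight tensor
q, scale, zp = quantize(x)
x_hat = scale * (q.astype(np.float32) - zp)            # the DequantizeLinear step
print("max abs error:", np.abs(x - x_hat).max())       # ≈ scale / 2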

QDQ vs QOperator format

Quantized ONNX models come in two flavors:

QDQ (Quantize-DeQuantize)

Insert Q/DQ pairs around every activation and weight. The compute ops stay FP32 — the runtime fuses Q+Op+DQ into a single quantized kernel.

x → DQ → Conv → Q → DQ → Relu → Q → ...
              ↑
          float weights are
          dequantized JIT

Default in modern ONNX exporters. Most portable.

QOperator

Use explicit quantized op_types: QLinearConv, QLinearMatMul, etc. These take int8 inputs directly along with their scales.

x_int8 → QLinearConv → y_int8
        (W_int8, scales, zero_points
         baked into op inputs)

More compact graph, less universally supported.

Quantizing CIFAR-10Net (walkthrough)

from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)

class CIFAR10Calib(CalibrationDataReader):
    def __init__(self, samples=512):
        # dl: a DataLoader over the CIFAR-10 calibration split (defined elsewhere)
        self.it = iter([{"image": x.numpy()}
                        for x, _ in dl][:samples])
    def get_next(self): return next(self.it, None)

quantize_static(
    "cifar10.onnx",
    "cifar10_int8.onnx",
    CIFAR10Calib(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
Variant | Size | Latency (1 thread) | Top-1 | Notes
FP32 (post-fuse) | 2.30 MB | 1.6 ms | 91.0% | Baseline
FP16 | 1.16 MB | 0.9 ms | 90.9% | Typically GPU only
INT8 dynamic | 0.62 MB | 1.1 ms | 89.7% | No calibration
INT8 static (per-channel) | 0.62 MB | 0.55 ms | 90.6% | Recommended
INT8 QAT | 0.62 MB | 0.55 ms | 90.9% | If retraining is available
The rule of thumb. For CNNs, static per-channel INT8 PTQ is the 95th-percentile-good choice: 4× smaller, 3× faster, <0.5 pp accuracy loss. For LLMs, weight-only INT8 / INT4 with FP16 activations is the analog.
08

ONNX Runtime architecture

// session · providers · allocator

ONNX Runtime (ORT) is the reference implementation. Its architecture is worth studying because nearly every other runtime — TensorRT, OpenVINO, even mobile-only ones — follows the same broad shape.

// fig 8.1 — ONNX Runtime layered architecture: Frontend API (InferenceSession, Run(), OrtValue; Python / C++ / C# / JS / ObjC / Java) → Graph Manager (loader, GraphTransformer L1/L2/L3, partitioner, memory planner) → Execution Providers (CPU, CUDA, TensorRT, DirectML, ROCm, CoreML, NNAPI, QNN, OpenVINO, WebGPU, WebNN, MIGraphX) → Allocators & Streams (arena allocator, pinned memory, per-device streams, IO binding) → Hardware Backends (cuBLAS, cuDNN, oneDNN, MIOpen, Metal Performance Shaders, DirectX)

The session lifecycle

  1. Load. Parse .onnx protobuf into in-memory Graph.
  2. Optimize. Apply graph transformers (Level 1: trivial, Level 2: fusion, Level 3: layout). User picks the level.
  3. Partition. Walk the graph. For each node, ask each enabled EP "can you take this?" The first EP that can claim a connected subgraph gets it.
  4. Compile. Each EP compiles its assigned subgraph into a kernel sequence (or a single fused kernel, in TensorRT's case).
  5. Plan memory. Compute lifetimes for every intermediate tensor; allocate a single arena that all intermediates alias into.
  6. Run. For each call to session.run(), walk the partitioned plan and dispatch.
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

sess = ort.InferenceSession(
    "cifar10_int8.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# IO binding avoids host↔device copies on hot paths
# (img: a GPU-resident torch.Tensor of shape [1, 3, 32, 32], defined elsewhere)
io = sess.io_binding()
io.bind_input("image",  "cuda", 0, np.float32, [1,3,32,32], img.data_ptr())
io.bind_output("logits", "cuda")
sess.run_with_iobinding(io)
The "providers" list is ordered. ORT tries to assign each node to the first provider that supports it. Listing CUDA before CPU means "use GPU when possible, fall back to CPU." Get the order wrong and your int8 quantized model may silently run on the GPU's fp32 path.
09

Execution providers

// where the kernels actually live

An execution provider is the bridge between the runtime's IR and a hardware-specific kernel library. Each EP is a plug-in that registers (1) which ops it implements, (2) capability metadata, and (3) compiled kernels.

EP | Targets | Backend lib | Op coverage | Best for
CPU | x86 / ARM | MLAS, oneDNN | ~100% | Universal fallback, server inference
CUDA | NVIDIA GPU | cuBLAS, cuDNN | ~95% | Datacenter, generic GPU inference
TensorRT | NVIDIA GPU | TRT engine | ~85% | Lowest GPU latency, INT8
OpenVINO | Intel CPU/GPU/VPU | OpenVINO IR | ~90% | Intel hardware, edge servers
DirectML | Any DX12 GPU | D3D12 | ~80% | Windows app inference
CoreML | Apple Silicon, ANE | MPS / ANE | ~75% | iOS/macOS, neural engine offload
NNAPI | Android | NNAPI HAL | ~70% | Android phones, vendor accelerators
QNN | Qualcomm Hexagon | QNN SDK | ~75% | Snapdragon NPU
WebGPU | Browser | WGSL shaders | ~60% | In-browser inference
WebAssembly | Browser (CPU) | WASM SIMD | ~95% | In-browser fallback, no GPU

Partitioning in practice

Suppose we run our CIFAR-10 model with ["TensorRTExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. The partitioner walks the graph:

// fig 9.1 — partitioning the CIFAR-10 graph across EPs: the convolutional body (Conv / Conv+BN / Pool blocks) forms subgraph 1, compiled into a single TensorRT engine; Flatten → Gemm → Gemm forms subgraph 2 on the CUDA EP; at the boundary, ORT inserts a host-side memcpy across CUDA streams (~5 μs)
Why partitioning is non-trivial. Naively assigning every node greedily can create a sawtooth where adjacent nodes ping-pong between providers, paying a copy at every boundary. ORT runs a graph-partitioning pass that prefers maximally-connected subgraphs per EP.

TensorRT EP specifically

TensorRT is unusual: it doesn't implement individual op kernels at runtime. Instead it accepts an entire subgraph, builds a fused engine via its own optimizer (kernel auto-tuning, INT8 calibration, fusion across ~50 ops), and serializes it. ORT caches that engine on disk and on next session load skips rebuilding.

providers = [("TensorrtExecutionProvider", {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
    "trt_int8_enable": True,
    "trt_int8_calibration_table_name": "cifar10_calib.cache",
    "trt_max_workspace_size": 2<30,
})]
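Those options are passed straight into the session constructor; a short sketch, assuming a CUDA + TensorRT build of onnxruntime-gpu is installed:

import onnxruntime as ort

sess = ort.InferenceSession(
    "cifar10.onnx",
    providers=providers + ["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())   # which EPs were actually registered for this session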
10

Deployment targets

// where the model actually runs

"Deploying a model" means very different things on a 96-vCPU Linux server, an iPhone, a Raspberry Pi, and a browser tab. The same .onnx file should — and with care, can — serve all of them.

Server (Linux x86_64 / aarch64)

  • Runtime: ONNX Runtime + Triton Inference Server, or BentoML / KServe.
  • EP: TensorRT (NVIDIA), OpenVINO (Intel CPU), ROCm (AMD).
  • Concurrency: dynamic batching at the server tier; intra-op parallelism inside ORT for batch>1.
  • Format: FP16 on GPU, INT8 if accuracy budget permits.

iOS / Android

  • iOS: ORT-Mobile build → CoreML EP → Apple Neural Engine (ANE) for INT8/FP16 ops it supports, GPU for the rest.
  • Android: ORT-Mobile + NNAPI EP, or QNN EP for Snapdragon-only flagship apps.
  • Format: INT8 PTQ; binary-stripped ORT (~3 MB) instead of full build (~12 MB).
  • Constraint: ANE/NNAPI op coverage is partial; non-supported ops fall back to CPU and break op fusion.

Browser

  • onnxruntime-web: ships WASM (CPU SIMD) + WebGPU + WebGL backends in a single npm package.
  • WebGPU: ~5–20× faster than WASM for conv-heavy models on a discrete GPU.
  • WASM: universal fallback; ~2–4× slower than native CPU but works on any browser.
  • Caveats: first model load is large (~few MB gzipped); origin must serve .onnx with correct MIME and CORS.

Edge / MCU

  • Class A (Linux SBC, e.g. Pi 5): ORT + ARM NEON CPU EP; INT8 model fits in <1 MB.
  • Class B (Cortex-M): ONNX → TFLite Micro or onnx-mlir + cmsis-nn. Single-buffer arena, no malloc.
  • Class C (NPU): vendor compiler (Qualcomm AI Engine, NXP eIQ) consumes ONNX directly and emits a binary blob.

Cross-target consistency

One trap: the FP32 reference, the INT8 server engine, and the INT8 mobile engine can produce slightly different logits on the same input. ULP-level numerical differences (different cuDNN algorithm, different rounding mode, different SIMD reduction order) compound through layers. For most applications this is invisible; for safety-critical ones it must be characterized.

Recommended discipline. Maintain a "golden set" of inputs and FP32 logits. Before each release, run all deployment variants over the golden set and assert max-abs and KL-divergence under thresholds. Treat any drift as a regression.
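A minimal sketch of that discipline, assuming a hypothetical golden_set.npz holding float32 calibration images and their FP32 reference logits (the thresholds are illustrative, not universal):

import numpy as np
import onnxruntime as ort

golden = np.load("golden_set.npz")   # arrays: "images" [N,3,32,32] float32, "fp32_logits" [N,10]
sess = ort.InferenceSession("cifar10_int8.onnx", providers=["CPUExecutionProvider"])

logits = np.concatenate([sess.run(None, {"image": img[None]})[0]
                         for img in golden["images"]])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

max_abs = np.abs(logits - golden["fp32_logits"]).max()
p, q = softmax(golden["fp32_logits"]), softmax(logits)
kl = (p * np.log((p + 1e-9) / (q + 1e-9))).sum(axis=-1).mean()
assert max_abs < 0.5 and kl < 1e-3, f"variant drift: max_abs={max_abs:.3f}, kl={kl:.5f}"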
11

Benchmarking

// latency · throughput · tail

A single number ("3 ms") tells you almost nothing. Inference performance is a distribution, parameterized by batch size, thread count, sequence length (for transformers), warmup state, and contention.

The four numbers that matter

  • P50: median latency
  • P99: tail latency (the SLO target)
  • QPS: throughput at saturation
  • $/Mreq: cost per million requests

Doing it right

  1. Warm up. Throw away the first ~50 calls. Lazy compilation, allocator priming, kernel autotuning all happen in the first few runs.
  2. Pin clocks. On a GPU, lock to base clock with nvidia-smi -lgc; on CPU, disable turbo or pin frequency. Otherwise variance dominates signal.
  3. Isolate the process. No other tenants on the device. Use taskset / numactl on CPU.
  4. Measure end-to-end. Include host→device copy if your real workload pays it. io_binding with pre-resident GPU tensors can hide a real cost.
  5. Vary load. Latency at QPS=1 and at QPS=saturation are different curves. Most production systems live in the knee of that curve.
import time, numpy as np, onnxruntime as ort

sess = ort.InferenceSession("cifar10_int8.onnx",
                            providers=["CPUExecutionProvider"])
x = np.random.randn(1,3,32,32).astype(np.float32)

# warmup
for _ in range(50): sess.run(None, {"image": x})

t = []
for _ in range(1000):
    s = time.perf_counter_ns()
    sess.run(None, {"image": x})
    t.append(time.perf_counter_ns() - s)

t = np.array(t) / 1e6  # ms
print(f"p50 {np.median(t):.3f}  p95 {np.percentile(t,95):.3f}  p99 {np.percentile(t,99):.3f}")
// fig 11.1 — latency vs offered QPS (CIFAR-10 INT8, single CPU); series: P50 latency, P99 latency, P50 with batching
The hockey stick. Below ~70% of saturation throughput, latency is roughly flat. Past it, P99 explodes — request queues form, scheduler jitter compounds. Production capacity planning targets ~60% saturation, not 95%.
12

Security

// integrity · side channels · supply chain

A deployed model is an attack surface. The model file itself can carry executable code via custom ops. The runtime can be made to leak inputs through timing. The training pipeline can be poisoned upstream of export. ONNX deployment teams should treat the .onnx file with the same scrutiny as any third-party binary.

Threat model overview

Model integrity

  • Tampering. A malicious actor swaps weights or rewires the graph to introduce a backdoor (specific input pattern → attacker-chosen output). Detectable only by hashing.
  • Mitigation: Sign .onnx files with Sigstore / cosign. Verify SHA-256 at load time (a minimal check is sketched below). Pin opset and producer metadata.
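The load-time integrity check is a few lines; a sketch with a placeholder digest (the real value would be recorded and signed at release time):

import hashlib

EXPECTED_SHA256 = "…"   # placeholder: the digest published alongside the release artifact

def verify_model(path: str) -> None:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != EXPECTED_SHA256:
        raise RuntimeError(f"refusing to load {path}: hash mismatch ({digest})")

verify_model("cifar10_int8.onnx")   # call before ort.InferenceSession(...)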

Custom-op RCE

  • Risk. A model with a custom domain op can request the runtime load a shared library matching that domain. If the runtime resolves the library by name from LD_LIBRARY_PATH, an attacker who controls model + lib path gets code execution.
  • Mitigation: Disable custom op loading in production runtimes; allowlist domains; static-link required custom ops.

Side channels

  • Timing. Inference latency varies with input. For small models served at request scale, timing distributions can leak class labels or even reconstruct inputs.
  • Mitigation: Constant-time inference (always run the worst-case path); pad to a fixed deadline before responding; reduce server-side timing precision.

Adversarial inputs

  • Risk. Imperceptible perturbations cause misclassification. Out of scope for the runtime — must be addressed in model design (adversarial training, certified defenses).
  • Mitigation: Input normalization, randomized smoothing, ensemble checks at the application layer.

Model extraction

  • Risk. Black-box query access lets an attacker train a surrogate that closely matches the deployed model's behavior. Particularly relevant for paid inference APIs.
  • Mitigation: Rate limiting, query budgets, output truncation (return top-1 only, not full logits), watermarking.

Supply chain

  • Risk. A compromised pretrained model from a hub embeds a backdoor that survives fine-tuning. The export pipeline propagates it cleanly into .onnx.
  • Mitigation: Source models only from verified publishers; scan for anomalous sub-graphs; differential testing against a known-clean reference.

A worked example: timing side channel on CIFAR-10

Suppose CIFAR-10Net is served behind an HTTP API that returns the predicted class. An attacker sends 10,000 inputs and records the precise latency of each response. Even with all per-class compute paths fused into a single graph, attacker-observable latency variance can still correlate with the predicted class.

None of this is exploitable on a per-call basis. With 10,000 calls and statistical analysis, label distributions become recoverable. The fix is application-layer: respond at a fixed deadline (e.g., always return at T+5ms), not when computation finishes.
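Deadline padding is mechanically simple; a minimal sketch (the 5 ms figure follows the example above, and sleep-based padding is approximate rather than cryptographically constant-time):

import time

DEADLINE_S = 0.005   # fixed response deadline from the example above (5 ms)

def predict_padded(run_inference, x):
    start = time.perf_counter()
    result = run_inference(x)
    remaining = DEADLINE_S - (time.perf_counter() - start)
    if remaining > 0:
        time.sleep(remaining)   # pad so response time does not track the input
    return result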

The architectural lesson. Cryptographic-grade constant-time guarantees do not exist in mainstream deep learning runtimes. If you need them, you are building bespoke. For most production systems, deadline-padding plus rate-limiting closes the meaningful attack surface.

Closing the loop: a hardening checklist

  1. Sign and verify every .onnx at load time. Reject unsigned models in production.
  2. Disable custom op libraries; allowlist domains your team owns.
  3. Build the runtime with the minimal set of EPs needed; smaller binary = smaller surface.
  4. Run the runtime under a sandboxed user, with seccomp / AppArmor restricting syscalls to the strictly required.
  5. Pad responses to a fixed deadline; truncate output to the minimum information clients need.
  6. Maintain a golden test set; alert on drift between deployed variants and FP32 reference.
  7. Treat the training-to-export pipeline as production code: review, CI, reproducible builds, signed artifacts.

Closing thoughts

// what to take away

If you trace one mental model through this entire pipeline — a 32×32 image entering a Python forward(), becoming a frozen graph of ~200 operator definitions, undergoing fold-fuse-quantize until a 580K-parameter network fits in 600 KB, then dispatching across nine possible execution providers down to vendor SIMD intrinsics — you can see why ONNX is more than a file format. It's the impedance match between research-grade flexibility and production-grade efficiency.

Three principles to internalize:

1. The graph is the artifact

Once a model is exported, its Python lineage is irrelevant. The graph is what gets optimized, quantized, partitioned, executed. Treat it as the source of truth and learn to read it directly.

2. Optimization is composition

No single transform delivers the headline numbers. Folding shrinks the graph; fusion eliminates intermediates; quantization shrinks tensors; the right EP picks the right kernel. Each is a 1.5–2× win; together, 10–20×.

3. Deployment is a pipeline, not a step

"Convert to ONNX" is the first step. The accuracy diff vs FP32, the latency budget, the security posture, the cross-target consistency — those are continuous concerns, not checkbox items.