The ONNX Deployment Pipeline

A research-grade walkthrough of taking a CIFAR-10 CNN from PyTorch source code, through ONNX intermediate representation, graph-level optimization, quantization, and finally to multi-backend runtime execution.


01

Why ONNX exists, and what it actually solves

// motivation

A trained neural network is a tuple of (architecture, weights, operator semantics). The architecture lives in Python source code, the weights live in a framework-specific tensor format, and the operator semantics live implicitly in the framework's C++ kernels. Deploying a model means decoupling all three from the training framework — and that is the problem ONNX was designed to solve.

The problem before ONNX

Before standardized exchange formats, "deploying a PyTorch model to a Snapdragon NPU" meant writing a custom converter that walked the autograd graph, mapped each aten::* op to a vendor SDK call, and re-implemented anything not natively supported. Every (framework × runtime × hardware) combination carried its own bespoke engineering cost, and the number of combinations grows multiplicatively.

  • TensorFlow → TFLite: one path, one converter, fragile.
  • PyTorch → CoreML: another converter, separate maintenance.
  • Caffe → TensorRT: yet another, with its own op gaps.

What ONNX provides

ONNX (Open Neural Network Exchange) defines three things, and only three things:

  1. A protobuf schema for serializing computation graphs.
  2. A versioned opset with mathematical semantics for each operator.
  3. A type system over tensors (dtype, shape, optional symbolic dims).

It is not a runtime, not a training framework, and not a compiler. It is a contract between producers (frameworks) and consumers (runtimes / compilers).

// fig 1.1 — ONNX as the hub between training frameworks and inference targets: producers (PyTorch, TensorFlow / Keras, JAX / Flax, scikit-learn) → .onnx + opset → consumers (ONNX Runtime, TensorRT, OpenVINO, CoreML / NNAPI, TVM / IREE, WebGPU / WASM); N producers × M consumers → N + M integrations
The intuition. ONNX turns an O(N×M) integration matrix into an O(N+M) one. Each framework writes one exporter; each runtime writes one importer. The format absorbs the coupling.
  • ~190 standard operators
  • 22 opset versions (as of v22)
  • protobuf 3 wire format
02

The reference CIFAR-10 CNN

// running example

Throughout the rest of this page, every concept is grounded in one concrete model: a small but realistic CNN trained on CIFAR-10. CIFAR-10 is 60,000 32×32 RGB images across 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Small enough to animate end-to-end, complex enough to exhibit every interesting deployment phenomenon.

Architecture (PyTorch source)

import torch
import torch.nn as nn
import torch.nn.functional as F

class CIFAR10Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: 32x32x3 -> 16x16x32
        self.conv1 = nn.Conv2d(3,  32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, padding=1)
        self.bn1   = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2)

        # Block 2: 16x16x32 -> 8x8x64
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, 3, padding=1)
        self.bn2   = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2)

        # Block 3: 8x8x64 -> 4x4x128
        self.conv5 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3   = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(2)

        # Classifier head
        self.fc1 = nn.Linear(128*4*4, 256)
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool1(F.relu(self.bn1(
              self.conv2(F.relu(self.conv1(x))))))
        x = self.pool2(F.relu(self.bn2(
              self.conv4(F.relu(self.conv3(x))))))
        x = self.pool3(F.relu(self.bn3(self.conv5(x))))
        x = x.flatten(1)
        x = self.drop(F.relu(self.fc1(x)))
        return self.fc2(x)

By the numbers

  • ~580K parameters
  • ~17M FLOPs per image
  • 2.3 MB of FP32 weights
  • ~91% top-1 accuracy

This is roughly the size of a textbook VGG-style CNN scaled for CIFAR-10. It contains every operator class we need to discuss:

Conv2d MatMul Relu BatchNorm MaxPool Flatten Reshape Add Softmax

Compute (Conv, MatMul) dominates runtime. Shape ops (Flatten, Reshape) are nearly free. Reductions (MaxPool) sit in between. This taxonomy will matter when we discuss fusion and quantization.

Layer-by-layer tensor shapes

3×32² (input) → 32×32² (conv1+2 + bn) → 32×16² (pool1) → 64×16² (conv3+4 + bn) → 64×8² (pool2) → 128×8² (conv5 + bn) → 128×4² (pool3) → 2048 (flatten) → 256 (fc1+relu+drop) → 10 (fc2, logits); spatial ↓, channels ↑: the canonical CNN funnel
// fig 2.1 — tensor shape evolution through the CIFAR-10 CNN
Reading the funnel. The classic CNN move: trade spatial resolution for channel depth. Pool layers halve H and W; conv layers double channels. The total tensor size shrinks by ~2× per stage (½ × ½ × 2 = ½), keeping per-layer compute roughly constant while letting the receptive field grow.
03

Export: tracing vs scripting, and the symbolic shape problem

// pytorch → onnx

Exporting is not a translation. It is a reconstruction: PyTorch's forward() is arbitrary Python, and ONNX is a static dataflow graph. Bridging the two requires either running the Python (tracing) or parsing it (scripting), and each has failure modes.

Tracing

Run the model on a sample input. Record every framework op called. Emit those ops as the graph.

dummy = torch.randn(1, 3, 32, 32)

torch.onnx.export(
    model, dummy, "cifar10.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={
        "image":  {0: "batch"},
        "logits": {0: "batch"}
    },
    opset_version=17,
)

Captures: the exact ops executed for the dummy input.
Loses: any control flow that depended on input values.

Scripting (TorchScript)

Statically analyze the Python source. Translate if, for, while into ONNX If, Loop ops.

scripted = torch.jit.script(model)
torch.onnx.export(scripted, dummy, "cifar10.onnx")

Captures: control flow, dynamic shapes, recursive structures.
Loses: support for arbitrary Python — only a typed subset is supported. Many real-world models fail to script without rewrites.

For a feed-forward CNN like ours there is no control flow, so tracing is sufficient and preferred. For LSTMs, beam search, or anything with data-dependent loops, scripting (or torch.export) becomes mandatory.

The dynamic axis dance

By default, tracing bakes the dummy input's shape into the graph. A model traced with (1,3,32,32) will only accept batch size 1 unless you mark the batch dimension as dynamic via dynamic_axes. Most production deployments require at least {0: "batch"} on every input and output.

The classic export footgun. Calling tensor.shape[0] in Python returns a concrete int during tracing. The exporter records that int as a constant in the graph, silently breaking dynamic batching. Always prefer tensor.size(0) or, in opset 13+, allow the exporter to emit a Shape + Gather.
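One quick way to verify that the dynamic axis actually made it into the exported file is to inspect the first input's dimensions; a minimal sketch, assuming cifar10.onnx was produced by the export call above:

import onnx

m = onnx.load("cifar10.onnx")
dim0 = m.graph.input[0].type.tensor_type.shape.dim[0]
# A dynamic axis is stored as dim_param ("batch"); a baked-in one as dim_value (1).
print(dim0.dim_param or dim0.dim_value)   # expect: "batch"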

Modern alternative: torch.export

PyTorch 2.x introduced torch.export, a compiler-grade frontend that produces a fully captured FX graph with symbolic shapes. The new ONNX exporter (torch.onnx.dynamo_export) builds on it, eliminating most edge cases of the legacy tracer.

# The 2.x dynamo path
from torch.onnx import dynamo_export

prog = dynamo_export(model, dummy)
prog.save("cifar10.onnx")
// fig 3.1 — two paths from Python source to .onnx: tracing (execute the nn.Module with a dummy input and record the ops invoked; control flow is lost) vs scripting / Dynamo (parse to an FX graph; If / Loop and symbolic shapes preserved), both landing in a serialized ONNX GraphProto (~2.4 MB on disk)
04

The ONNX intermediate representation

// protobuf internals

An .onnx file is a serialized ModelProto message. Understanding the schema is the difference between treating ONNX as a black box and being able to debug, patch, or hand-author models.

Schema hierarchy

ModelProto {
  ir_version,               // e.g. 9
  producer_name, version,
  opset_import [{domain, version}],
  graph: GraphProto {
    name,
    node:        [NodeProto],   // the ops
    initializer: [TensorProto], // weights
    input:       [ValueInfo],
    output:      [ValueInfo],
    value_info:  [ValueInfo]    // intermediates
  },
  metadata_props
}

Anatomy of a NodeProto

NodeProto {
  op_type:    "Conv",
  domain:     "",            // "" = standard
  name:       "/conv1/Conv",
  input:      ["image",
               "conv1.weight",
               "conv1.bias"],
  output:     ["/conv1/Conv_out"],
  attribute:  [
    {name: "kernel_shape", ints: [3, 3]},
    {name: "pads",         ints: [1, 1, 1, 1]},
    {name: "strides",      ints: [1, 1]},
    {name: "group",        i:    1}
  ]
}

Inspecting our exported CIFAR-10 graph

import onnx
m = onnx.load("cifar10.onnx")

print(f"opset: {m.opset_import[0].version}")
print(f"ir_version: {m.ir_version}")
print(f"nodes: {len(m.graph.node)}")
print(f"initializers: {len(m.graph.initializer)}")

for n in m.graph.node[:5]:
    print(n.op_type, n.input, "->", n.output)

# Conv ['image', 'conv1.weight', 'conv1.bias'] -> ['/conv1/Conv_out']
# Relu ['/conv1/Conv_out']                       -> ['/Relu_out']
# Conv ['/Relu_out', 'conv2.weight', ...]        -> ['/conv2/Conv_out']
# BatchNormalization [...]                       -> ['/bn1/BN_out']
# Relu ['/bn1/BN_out']                           -> ['/Relu_1_out']

Visualizing the graph

Every NodeProto is a vertex; every shared tensor name is an edge. Here is the topological structure of our CIFAR-10 graph after export, before any optimization:

// fig 4.1 — exported CIFAR-10 graph (block 2 / 3 collapsed for clarity): image → Conv 3×3 → Relu → Conv 3×3 → BatchNorm (ε=1e-5) → Relu → MaxPool 2×2 → [block 2: Conv→Relu→Conv→BN→Relu→Pool] → [block 3: Conv→BN→Relu→MaxPool] → Flatten → Gemm → Relu → Dropout → Gemm → logits, with initializers (conv weights, BN γ, β, μ, σ²) feeding their consumers
SSA, basically. ONNX is in static single assignment form: every tensor name is produced by exactly one node. There are no in-place ops. This is what lets graph optimizers reason locally without alias analysis.

Initializers vs inputs

A subtlety that trips up newcomers: weights live in graph.initializer, not graph.input. Some tools list them under both for backward compatibility, but the canonical interpretation is: anything in initializer is a constant tensor baked into the model; anything in input is something the caller must supply at runtime.
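A short way to see the split in practice, continuing from the onnx.load snippet above (on newer exporters graph.input typically already contains only the true runtime inputs):

import onnx

m = onnx.load("cifar10.onnx")
init_names = {t.name for t in m.graph.initializer}
runtime_inputs = [i.name for i in m.graph.input if i.name not in init_names]

print("constants baked into the model:", len(init_names))  # conv/fc weights, BN params
print("caller-supplied inputs:", runtime_inputs)           # ['image']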

05

Opsets, operator semantics, custom ops

// the contract

An opset version is the contract between producer and consumer. Conv-11 and Conv-22 may have different attribute defaults, broadcasting rules, or supported dtypes. Pinning an opset is as load-bearing as pinning a Python version.

Operator | Domain | Inputs | Notable attrs | Where it appears in our model
Conv | ai.onnx | X, W, B? | kernel_shape, pads, strides, dilations, group | conv1–conv5
BatchNormalization | ai.onnx | X, scale, B, mean, var | epsilon, momentum, training_mode | bn1–bn3
Relu | ai.onnx | X | (none) | after every conv/fc
MaxPool | ai.onnx | X | kernel_shape, strides, pads, ceil_mode | pool1–pool3
Flatten | ai.onnx | X | axis | before fc1
Gemm | ai.onnx | A, B, C? | alpha, beta, transA, transB | fc1, fc2
Dropout | ai.onnx | data, ratio?, training_mode? | seed | dropout (no-op at inference)

Why opset version matters: a Resize example

Consider a model with image upscaling. In opset 10, Resize took scales as a single input. In opset 11, the signature changed: roi was added, and scales moved to the third input. A model exported under opset 10 will silently mis-link arguments if loaded under a strict opset-11 importer.

Practical rule. Pick the lowest opset that contains every op you need, and freeze it. Bumping opset versions is a load-bearing refactor, not a build-system flag.
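Checking which opset a model was exported against, and migrating it if you must, can be done with onnx's built-in version converter; a sketch (conversion is best-effort and can fail for ops whose signatures changed, so always re-validate afterwards):

import onnx
from onnx import version_converter

m = onnx.load("cifar10.onnx")
print({o.domain or "ai.onnx": o.version for o in m.opset_import})

# Best-effort migration of the default domain to opset 17; re-check the graph afterwards.
m17 = version_converter.convert_version(m, 17)
onnx.checker.check_model(m17)
onnx.save(m17, "cifar10_opset17.onnx")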

Custom operators

When the standard opset doesn't cover an op (e.g., a fused attention with bespoke masking), you have three choices:

1. Decompose

Express the op as a subgraph of standard ones. Slow but maximally portable. Most exporters do this by default for unsupported ops.

2. Custom domain

Use a non-empty domain on the NodeProto (e.g., com.microsoft). The runtime must register a kernel for that (domain, op_type, version) triple.

3. Function

ONNX functions: a named subgraph that the runtime can either inline or replace with a fused kernel. The cleanest path for emerging ops.

06

Graph optimization

// fusion · folding · layout

An exported graph is rarely the graph that actually runs. Between .onnx on disk and the first kernel call sits a graph optimizer that can shrink the node count by 30–60% and the latency by 2–5×.

Three classes of transformation

  • Constant folding
  • Operator fusion
  • Layout transforms

Constant folding

Any subgraph whose inputs are all initializers can be evaluated at load time. The classic example: Reshape(W, Concat(Shape(W), [1])). The shape and concat depend only on a constant weight, so the entire reshape becomes a new constant tensor.

For our CIFAR-10 model, the post-export graph has constant subgraphs around the BatchNorm parameters that fold completely.

# Before
y = Reshape(weight, Concat([Shape(weight)[0:2], Constant([3,3])]))

# After
y = precomputed_constant

Operator fusion

The single highest-leverage optimization. Two adjacent ops that share an intermediate tensor become one kernel that reads the input once and writes the output once — saving the intermediate's memory bandwidth.

// fig 6.1 — Conv-BN-Relu fusion. Before: 3 nodes with 2 intermediates (/conv_out and /bn_out, each [N,32,32,32], ~128 KB per call). After: a single FusedConv (activation=relu, BN baked into weight/bias), 0 KB of intermediates, 1 kernel launch.

BN folding into Conv. Since BatchNorm at inference is y = γ(x-μ)/√(σ²+ε) + β, and Conv is y = Wx + b, you can analytically fold BN's affine into Conv's weights:

W' = W * (γ / sqrt(σ² + ε)).reshape(-1, 1, 1, 1)
b' = (b - μ) * γ / sqrt(σ² + ε) + β

After folding, every Conv-BN-Relu in our model becomes a single FusedConv node. For CIFAR-10Net: 14 nodes → 8 nodes.
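The same algebra in PyTorch, as a minimal inference-only sketch (graph optimizers perform this rewrite on the ONNX graph itself, but the arithmetic is identical):

import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)     # γ / √(σ²+ε), per channel
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    b = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((b - bn.running_mean) * scale + bn.bias)   # (b−μ)·γ/√(σ²+ε) + β
    return fused

# e.g. fold_bn_into_conv(model.conv2, model.bn1) reproduces the fused Conv's parameters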

Layout transforms (NCHW vs NHWC)

PyTorch defaults to NCHW (batch, channels, height, width). Many mobile NPUs and the Apple Neural Engine prefer NHWC. The optimizer can insert Transpose nodes at the boundary, then push them through the graph until they cancel out.

This is called transpose elimination: Transpose(Transpose(x, [0,2,3,1]), [0,3,1,2]) ≡ x. Done correctly across an entire graph, the only remaining transposes are at inputs and outputs.

Layout | Best for | Why
NCHW | NVIDIA GPUs (older), CUDA cuDNN | cuDNN's most mature kernels are NCHW
NHWC | TPU, Apple Neural Engine, modern Tensor Cores | Channels-last vectorizes better with WMMA / MMA
NC/32HW32 | TensorRT INT8 | Tile-friendly for IMMA / DP4A instructions
Net effect on CIFAR-10Net. After ORT's Level 2 optimization (fold + fuse + layout): node count drops from 32 to 11, and CPU latency drops from 4.1 ms to 1.6 ms at batch 1 on a single Skylake core.
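A convenient way to see those transforms with your own eyes is to ask ORT to write the post-optimization graph back to disk and compare node counts; a minimal sketch:

import onnx
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED  # extended (Level 2) passes
opts.optimized_model_filepath = "cifar10_opt.onnx"   # dump the transformed graph to disk

ort.InferenceSession("cifar10.onnx", sess_options=opts,
                     providers=["CPUExecutionProvider"])

before = len(onnx.load("cifar10.onnx").graph.node)
after = len(onnx.load("cifar10_opt.onnx").graph.node)
print(before, "->", after)   # folded and fused nodes disappear from the count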
07

Quantization

// fp32 → int8 / fp16

Quantization replaces FP32 weights and activations with lower-precision types — typically INT8 or FP16. The weight file shrinks 4×, integer SIMD throughput goes up 2–4×, and on accelerators with INT8 tensor cores the speedup compounds. The price is accuracy loss, which good quantization keeps under 1% top-1.

Dynamic

Weights are quantized offline; activations are quantized on-the-fly per-batch. No calibration data needed. Best for transformer-like models where activation ranges vary heavily.

~2× speedup, ~10MB → ~3MB

Static (PTQ)

Run a calibration set through the FP32 model, record activation min/max histograms, derive scale + zero-point per tensor, freeze. Best for CNNs with stable activation distributions — i.e. our CIFAR-10 model.

~3-4× speedup, <1% accuracy drop

QAT

Quantization-aware training: insert fake-quant ops during training so the model learns weights robust to quantization noise. Highest accuracy, requires retraining infrastructure.

~3-4× speedup, virtually no accuracy drop

The math: affine quantization

An INT8 tensor q approximates an FP32 tensor x via a per-tensor (or per-channel) scale and zero point:

x ≈ scale · (q − zero_point)
where  q ∈ [−128, 127]  (int8, signed symmetric)
   or  q ∈ [   0, 255]  (uint8, asymmetric)

scale       = (x_max − x_min) / (q_max − q_min)
zero_point  = round(q_min − x_min / scale)

Per-channel quantization (one scale per output channel of a Conv) preserves accuracy far better than per-tensor on CNNs. ONNX's QuantizeLinear and DequantizeLinear ops accept either a scalar or a 1-D scale / zero-point input to express this.
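The formulas above, executed on a single tensor; a minimal NumPy sketch of per-tensor asymmetric quantization (real quantizers additionally clamp scales, handle zero-range tensors, and work per channel):

import numpy as np

def quantize(x, qmin=-128, qmax=127):
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

x = np.random.randn(64, 3, 3, 3).astype(np.float32)   # e.g. a conv weight tensor
q, scale, zp = quantize(x)
x_hat = scale * (q.astype(np.float32) - zp)            # the DequantizeLinear step
print("max abs error:", np.abs(x - x_hat).max())       # ≈ scale / 2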

QDQ vs QOperator format

Quantized ONNX models come in two flavors:

QDQ (Quantize-DeQuantize)

Insert Q/DQ pairs around every activation and weight. The compute ops stay FP32 — the runtime fuses Q+Op+DQ into a single quantized kernel.

x → DQ → Conv → Q → DQ → Relu → Q → ...
              ↑
          float weights are
          dequantized JIT

Default in modern ONNX exporters. Most portable.

QOperator

Use explicit quantized op_types: QLinearConv, QLinearMatMul, etc. These take int8 inputs directly along with their scales.

x_int8 → QLinearConv → y_int8
        (W_int8, scales, zero_points
         baked into op inputs)

More compact graph, less universally supported.

Quantizing CIFAR-10Net (walkthrough)

from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)

class CIFAR10Calib(CalibrationDataReader):
    def __init__(self, samples=512):
        # dl: a DataLoader over the CIFAR-10 calibration split (defined elsewhere)
        self.it = iter([{"image": x.numpy()}
                        for x, _ in dl][:samples])
    def get_next(self): return next(self.it, None)

quantize_static(
    "cifar10.onnx",
    "cifar10_int8.onnx",
    CIFAR10Calib(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
Variant | Size | Latency (1 thread) | Top-1 | Notes
FP32 (post-fuse) | 2.30 MB | 1.6 ms | 91.0% | Baseline
FP16 | 1.16 MB | 0.9 ms | 90.9% | Typically GPU only
INT8 dynamic | 0.62 MB | 1.1 ms | 89.7% | No calibration
INT8 static (per-channel) | 0.62 MB | 0.55 ms | 90.6% | Recommended
INT8 QAT | 0.62 MB | 0.55 ms | 90.9% | If retraining is available
The rule of thumb. For CNNs, static per-channel INT8 PTQ is the 95th-percentile-good choice: 4× smaller, 3× faster, <0.5 pp accuracy loss. For LLMs, weight-only INT8 / INT4 with FP16 activations is the analog.
08

ONNX Runtime architecture

// session · providers · allocator

ONNX Runtime (ORT) is the reference implementation. Its architecture is worth studying because nearly every other runtime — TensorRT, OpenVINO, even mobile-only ones — follows the same broad shape.

// fig 8.1 — ONNX Runtime layered architecture: Frontend API (InferenceSession, Run(), OrtValue; Python / C++ / C# / JS / ObjC / Java) → Graph Manager (loader, GraphTransformer L1/L2/L3, partitioner, memory planner) → Execution Providers (CPU, CUDA, TensorRT, DirectML, ROCm, CoreML, NNAPI, QNN, OpenVINO, WebGPU, WebNN, MIGraphX) → Allocators & Streams (arena allocator, pinned memory, per-device streams, IO binding) → Hardware Backends (cuBLAS, cuDNN, oneDNN, MIOpen, Metal Performance Shaders, DirectX)

The session lifecycle

  1. Load. Parse .onnx protobuf into in-memory Graph.
  2. Optimize. Apply graph transformers (Level 1: trivial, Level 2: fusion, Level 3: layout). User picks the level.
  3. Partition. Walk the graph. For each node, ask each enabled EP "can you take this?" The first EP that can claim a connected subgraph gets it.
  4. Compile. Each EP compiles its assigned subgraph into a kernel sequence (or a single fused kernel, in TensorRT's case).
  5. Plan memory. Compute lifetimes for every intermediate tensor; allocate a single arena that all intermediates alias into.
  6. Run. For each call to session.run(), walk the partitioned plan and dispatch.
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

sess = ort.InferenceSession(
    "cifar10_int8.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# IO binding avoids host↔device copies on hot paths
# (img: a GPU-resident torch.Tensor of shape [1, 3, 32, 32], defined elsewhere)
io = sess.io_binding()
io.bind_input("image",  "cuda", 0, np.float32, [1,3,32,32], img.data_ptr())
io.bind_output("logits", "cuda")
sess.run_with_iobinding(io)
The "providers" list is ordered. ORT tries to assign each node to the first provider that supports it. Listing CUDA before CPU means "use GPU when possible, fall back to CPU." Get the order wrong and your int8 quantized model may silently run on the GPU's fp32 path.
09

Execution providers

// where the kernels actually live

An execution provider is the bridge between the runtime's IR and a hardware-specific kernel library. Each EP is a plug-in that registers (1) which ops it implements, (2) capability metadata, and (3) compiled kernels.

EP | Targets | Backend lib | Op coverage | Best for
CPU | x86 / ARM | MLAS, oneDNN | ~100% | Universal fallback, server inference
CUDA | NVIDIA GPU | cuBLAS, cuDNN | ~95% | Datacenter, generic GPU inference
TensorRT | NVIDIA GPU | TRT engine | ~85% | Lowest GPU latency, INT8
OpenVINO | Intel CPU/GPU/VPU | OpenVINO IR | ~90% | Intel hardware, edge servers
DirectML | Any DX12 GPU | D3D12 | ~80% | Windows app inference
CoreML | Apple Silicon, ANE | MPS / ANE | ~75% | iOS/macOS, neural engine offload
NNAPI | Android | NNAPI HAL | ~70% | Android phones, vendor accelerators
QNN | Qualcomm Hexagon | QNN SDK | ~75% | Snapdragon NPU
WebGPU | Browser | WGSL shaders | ~60% | In-browser inference
WebAssembly | Browser (CPU) | WASM SIMD | ~95% | In-browser fallback, no GPU

Partitioning in practice

Suppose we run our CIFAR-10 model with ["TensorRTExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. The partitioner walks the graph:

// fig 9.1 — partitioning the CIFAR-10 graph across EPs: the convolutional body (Conv / Conv+BN / Pool blocks) forms subgraph 1, compiled into a single TensorRT engine; Flatten → Gemm → Gemm forms subgraph 2 on the CUDA EP; at the boundary, ORT inserts a host-side memcpy across CUDA streams (~5 μs)
Why partitioning is non-trivial. Naively assigning every node greedily can create a sawtooth where adjacent nodes ping-pong between providers, paying a copy at every boundary. ORT runs a graph-partitioning pass that prefers maximally-connected subgraphs per EP.

TensorRT EP specifically

TensorRT is unusual: it doesn't implement individual op kernels at runtime. Instead it accepts an entire subgraph, builds a fused engine via its own optimizer (kernel auto-tuning, INT8 calibration, fusion across ~50 ops), and serializes it. ORT caches that engine on disk and on next session load skips rebuilding.

providers = [("TensorrtExecutionProvider", {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
    "trt_int8_enable": True,
    "trt_int8_calibration_table_name": "cifar10_calib.cache",
    "trt_max_workspace_size": 2<30,
})]
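Those options are passed straight into the session constructor; a short sketch, assuming a CUDA + TensorRT build of onnxruntime-gpu is installed:

import onnxruntime as ort

sess = ort.InferenceSession(
    "cifar10.onnx",
    providers=providers + ["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())   # which EPs were actually registered for this session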
10

Deployment targets

// where the model actually runs

"Deploying a model" means very different things on a 96-vCPU Linux server, an iPhone, a Raspberry Pi, and a browser tab. The same .onnx file should — and with care, can — serve all of them.

Server (Linux x86_64 / aarch64)

  • Runtime: ONNX Runtime + Triton Inference Server, or BentoML / KServe.
  • EP: TensorRT (NVIDIA), OpenVINO (Intel CPU), ROCm (AMD).
  • Concurrency: dynamic batching at the server tier; intra-op parallelism inside ORT for batch>1.
  • Format: FP16 on GPU, INT8 if accuracy budget permits.

iOS / Android

  • iOS: ORT-Mobile build → CoreML EP → Apple Neural Engine (ANE) for INT8/FP16 ops it supports, GPU for the rest.
  • Android: ORT-Mobile + NNAPI EP, or QNN EP for Snapdragon-only flagship apps.
  • Format: INT8 PTQ; binary-stripped ORT (~3 MB) instead of full build (~12 MB).
  • Constraint: ANE/NNAPI op coverage is partial; non-supported ops fall back to CPU and break op fusion.

Browser

  • onnxruntime-web: ships WASM (CPU SIMD) + WebGPU + WebGL backends in a single npm package.
  • WebGPU: ~5–20× faster than WASM for conv-heavy models on a discrete GPU.
  • WASM: universal fallback; ~2–4× slower than native CPU but works on any browser.
  • Caveats: first model load is large (~few MB gzipped); origin must serve .onnx with correct MIME and CORS.

Edge / MCU

  • Class A (Linux SBC, e.g. Pi 5): ORT + ARM NEON CPU EP; INT8 model fits in <1 MB.
  • Class B (Cortex-M): ONNX → TFLite Micro or onnx-mlir + cmsis-nn. Single-buffer arena, no malloc.
  • Class C (NPU): vendor compiler (Qualcomm AI Engine, NXP eIQ) consumes ONNX directly and emits a binary blob.

Cross-target consistency

One trap: the FP32 reference, the INT8 server engine, and the INT8 mobile engine can produce slightly different logits on the same input. ULP-level numerical differences (different cuDNN algorithm, different rounding mode, different SIMD reduction order) compound through layers. For most applications this is invisible; for safety-critical ones it must be characterized.

Recommended discipline. Maintain a "golden set" of inputs and FP32 logits. Before each release, run all deployment variants over the golden set and assert max-abs and KL-divergence under thresholds. Treat any drift as a regression.
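A minimal sketch of that discipline, assuming a hypothetical golden_set.npz holding float32 calibration images and their FP32 reference logits (the thresholds are illustrative, not universal):

import numpy as np
import onnxruntime as ort

golden = np.load("golden_set.npz")   # arrays: "images" [N,3,32,32] float32, "fp32_logits" [N,10]
sess = ort.InferenceSession("cifar10_int8.onnx", providers=["CPUExecutionProvider"])

logits = np.concatenate([sess.run(None, {"image": img[None]})[0]
                         for img in golden["images"]])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

max_abs = np.abs(logits - golden["fp32_logits"]).max()
p, q = softmax(golden["fp32_logits"]), softmax(logits)
kl = (p * np.log((p + 1e-9) / (q + 1e-9))).sum(axis=-1).mean()
assert max_abs < 0.5 and kl < 1e-3, f"variant drift: max_abs={max_abs:.3f}, kl={kl:.5f}"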
11

Benchmarking

// latency · throughput · tail

A single number ("3 ms") tells you almost nothing. Inference performance is a distribution, parameterized by batch size, thread count, sequence length (for transformers), warmup state, and contention.

The four numbers that matter

  • P50: median latency
  • P99: tail latency (the SLO target)
  • QPS: throughput at saturation
  • $/Mreq: cost per million requests

Doing it right

  1. Warm up. Throw away the first ~50 calls. Lazy compilation, allocator priming, kernel autotuning all happen in the first few runs.
  2. Pin clocks. On a GPU, lock to base clock with nvidia-smi -lgc; on CPU, disable turbo or pin frequency. Otherwise variance dominates signal.
  3. Isolate the process. No other tenants on the device. Use taskset / numactl on CPU.
  4. Measure end-to-end. Include host→device copy if your real workload pays it. io_binding with pre-resident GPU tensors can hide a real cost.
  5. Vary load. Latency at QPS=1 and at QPS=saturation are different curves. Most production systems live in the knee of that curve.
import time, numpy as np, onnxruntime as ort

sess = ort.InferenceSession("cifar10_int8.onnx",
                            providers=["CPUExecutionProvider"])
x = np.random.randn(1,3,32,32).astype(np.float32)

# warmup
for _ in range(50): sess.run(None, {"image": x})

t = []
for _ in range(1000):
    s = time.perf_counter_ns()
    sess.run(None, {"image": x})
    t.append(time.perf_counter_ns() - s)

t = np.array(t) / 1e6  # ms
print(f"p50 {np.median(t):.3f}  p95 {np.percentile(t,95):.3f}  p99 {np.percentile(t,99):.3f}")
// fig 11.1 — latency vs offered QPS (CIFAR-10 INT8, single CPU); series: P50 latency, P99 latency, P50 with batching
The hockey stick. Below ~70% of saturation throughput, latency is roughly flat. Past it, P99 explodes — request queues form, scheduler jitter compounds. Production capacity planning targets ~60% saturation, not 95%.
12

Security

// integrity · side channels · supply chain

A deployed model is an attack surface. The model file itself can carry executable code via custom ops. The runtime can be made to leak inputs through timing. The training pipeline can be poisoned upstream of export. ONNX deployment teams should treat the .onnx file with the same scrutiny as any third-party binary.

Threat model overview

Model integrity

  • Tampering. A malicious actor swaps weights or rewires the graph to introduce a backdoor (specific input pattern → attacker-chosen output). Detectable only by hashing.
  • Mitigation: Sign .onnx files with Sigstore / cosign. Verify SHA-256 at load time (a minimal check is sketched below). Pin opset and producer metadata.
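The load-time integrity check is a few lines; a sketch with a placeholder digest (the real value would be recorded and signed at release time):

import hashlib

EXPECTED_SHA256 = "…"   # placeholder: the digest published alongside the release artifact

def verify_model(path: str) -> None:
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    if digest != EXPECTED_SHA256:
        raise RuntimeError(f"refusing to load {path}: hash mismatch ({digest})")

verify_model("cifar10_int8.onnx")   # call before ort.InferenceSession(...)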

Custom-op RCE

  • Risk. A model with a custom domain op can request the runtime load a shared library matching that domain. If the runtime resolves the library by name from LD_LIBRARY_PATH, an attacker who controls model + lib path gets code execution.
  • Mitigation: Disable custom op loading in production runtimes; allowlist domains; static-link required custom ops.

Side channels

  • Timing. Inference latency varies with input. For small models served at request scale, timing distributions can leak class labels or even reconstruct inputs.
  • Mitigation: Constant-time inference (always run the worst-case path); pad to a fixed deadline before responding; reduce server-side timing precision.

Adversarial inputs

  • Risk. Imperceptible perturbations cause misclassification. Out of scope for the runtime — must be addressed in model design (adversarial training, certified defenses).
  • Mitigation: Input normalization, randomized smoothing, ensemble checks at the application layer.

Model extraction

  • Risk. Black-box query access lets an attacker train a surrogate that closely matches the deployed model's behavior. Particularly relevant for paid inference APIs.
  • Mitigation: Rate limiting, query budgets, output truncation (return top-1 only, not full logits), watermarking.

Supply chain

  • Risk. A compromised pretrained model from a hub embeds a backdoor that survives fine-tuning. The export pipeline propagates it cleanly into .onnx.
  • Mitigation: Source models only from verified publishers; scan for anomalous sub-graphs; differential testing against a known-clean reference.

A worked example: timing side channel on CIFAR-10

Suppose CIFAR-10Net is served behind an HTTP API that returns the predicted class. An attacker sends 10,000 inputs and records the precise latency of each response. Even with all per-class compute paths fused into a single graph, attacker-observable latency variance can still correlate with the predicted class.

None of this is exploitable on a per-call basis. With 10,000 calls and statistical analysis, label distributions become recoverable. The fix is application-layer: respond at a fixed deadline (e.g., always return at T+5ms), not when computation finishes.
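Deadline padding is mechanically simple; a minimal sketch (the 5 ms figure follows the example above, and sleep-based padding is approximate rather than cryptographically constant-time):

import time

DEADLINE_S = 0.005   # fixed response deadline from the example above (5 ms)

def predict_padded(run_inference, x):
    start = time.perf_counter()
    result = run_inference(x)
    remaining = DEADLINE_S - (time.perf_counter() - start)
    if remaining > 0:
        time.sleep(remaining)   # pad so response time does not track the input
    return result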

The architectural lesson. Cryptographic-grade constant-time guarantees do not exist in mainstream deep learning runtimes. If you need them, you are building bespoke. For most production systems, deadline-padding plus rate-limiting closes the meaningful attack surface.

Closing the loop: a hardening checklist

  1. Sign and verify every .onnx at load time. Reject unsigned models in production.
  2. Disable custom op libraries; allowlist domains your team owns.
  3. Build the runtime with the minimal set of EPs needed; smaller binary = smaller surface.
  4. Run the runtime under a sandboxed user, with seccomp / AppArmor restricting syscalls to the strictly required.
  5. Pad responses to a fixed deadline; truncate output to the minimum information clients need.
  6. Maintain a golden test set; alert on drift between deployed variants and FP32 reference.
  7. Treat the training-to-export pipeline as production code: review, CI, reproducible builds, signed artifacts.

Closing thoughts

// what to take away

If you trace one mental model through this entire pipeline — a 32×32 image entering a Python forward(), becoming a frozen graph of ~200 operator definitions, undergoing fold-fuse-quantize until a 580K-parameter network fits in 600 KB, then dispatching across nine possible execution providers down to vendor SIMD intrinsics — you can see why ONNX is more than a file format. It's the impedance match between research-grade flexibility and production-grade efficiency.

Three principles to internalize:

1. The graph is the artifact

Once a model is exported, its Python lineage is irrelevant. The graph is what gets optimized, quantized, partitioned, executed. Treat it as the source of truth and learn to read it directly.

2. Optimization is composition

No single transform delivers the headline numbers. Folding shrinks the graph; fusion eliminates intermediates; quantization shrinks tensors; the right EP picks the right kernel. Each is a 1.5–2× win; together, 10–20×.

3. Deployment is a pipeline, not a step

"Convert to ONNX" is the first step. The accuracy diff vs FP32, the latency budget, the security posture, the cross-target consistency — those are continuous concerns, not checkbox items.