Why ONNX exists, and what it actually solves
// motivation
A trained neural network is a tuple of (architecture, weights, operator semantics). The architecture lives in Python source code, the weights live in a framework-specific tensor format, and the operator semantics live implicitly in the framework's C++ kernels. Deploying a model means decoupling all three from the training framework — and that is the problem ONNX was designed to solve.
The problem before ONNX
Before standardized exchange formats, "deploying a PyTorch model to a Snapdragon NPU" meant writing a custom converter that walked the autograd graph, mapped each aten::* op to a vendor SDK call, and re-implemented anything not natively supported. Every (framework × runtime × hardware) combination carried its own engineering cost, a combinatorial explosion of one-off converters.
- TensorFlow → TFLite: one path, one converter, fragile.
- PyTorch → CoreML: another converter, separate maintenance.
- Caffe → TensorRT: yet another, with its own op gaps.
What ONNX provides
ONNX (Open Neural Network Exchange) defines three things, and only three things:
- A protobuf schema for serializing computation graphs.
- A versioned opset with mathematical semantics for each operator.
- A type system over tensors (dtype, shape, optional symbolic dims).
It is not a runtime, not a training framework, and not a compiler. It is a contract between producers (frameworks) and consumers (runtimes / compilers).
The reference CIFAR-10 CNN
// running example
Throughout the rest of this page, every concept is grounded in one concrete model: a small but realistic CNN trained on CIFAR-10. CIFAR-10 is 60,000 32×32 RGB images across 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). Small enough to animate end-to-end, complex enough to exhibit every interesting deployment phenomenon.
Architecture (PyTorch source)
```python
import torch.nn as nn
import torch.nn.functional as F

class CIFAR10Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # Block 1: 32x32x3 -> 16x16x32
        self.conv1 = nn.Conv2d(3, 32, 3, padding=1)
        self.conv2 = nn.Conv2d(32, 32, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)
        self.pool1 = nn.MaxPool2d(2)
        # Block 2: 16x16x32 -> 8x8x64
        self.conv3 = nn.Conv2d(32, 64, 3, padding=1)
        self.conv4 = nn.Conv2d(64, 64, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)
        self.pool2 = nn.MaxPool2d(2)
        # Block 3: 8x8x64 -> 4x4x128
        self.conv5 = nn.Conv2d(64, 128, 3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)
        self.pool3 = nn.MaxPool2d(2)
        # Classifier head
        self.fc1 = nn.Linear(128 * 4 * 4, 256)
        self.drop = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.pool1(F.relu(self.bn1(self.conv2(F.relu(self.conv1(x))))))
        x = self.pool2(F.relu(self.bn2(self.conv4(F.relu(self.conv3(x))))))
        x = self.pool3(F.relu(self.bn3(self.conv5(x))))
        x = x.flatten(1)
        x = self.drop(F.relu(self.fc1(x)))
        return self.fc2(x)
```
By the numbers
This is roughly the size of a textbook VGG-style CNN scaled for CIFAR-10. It contains every operator class we need to discuss:
Compute (Conv, MatMul) dominates runtime. Shape ops (Flatten, Reshape) are nearly free. Reductions (MaxPool) sit in between. This taxonomy will matter when we discuss fusion and quantization.
Layer-by-layer tensor shapes
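The per-stage shapes can be traced with a short sketch that replays the blocks above as plain Sequentials (same layers as CIFAR10Net; the grouping into named stages is ours, for printing):

```python
import torch
import torch.nn as nn

# Replay CIFAR10Net's stages and print the tensor shape after each one.
stages = {
    "block1": nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32),
        nn.ReLU(), nn.MaxPool2d(2)),
    "block2": nn.Sequential(
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.BatchNorm2d(64),
        nn.ReLU(), nn.MaxPool2d(2)),
    "block3": nn.Sequential(
        nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128),
        nn.ReLU(), nn.MaxPool2d(2)),
    "flatten": nn.Flatten(1),
    "fc1": nn.Linear(128 * 4 * 4, 256),
    "fc2": nn.Linear(256, 10),
}
x = torch.randn(1, 3, 32, 32)
for name, stage in stages.items():
    x = stage(x)
    print(f"{name:8s} -> {tuple(x.shape)}")
# block1   -> (1, 32, 16, 16)
# block2   -> (1, 64, 8, 8)
# block3   -> (1, 128, 4, 4)
# flatten  -> (1, 2048)
# fc1      -> (1, 256)
# fc2      -> (1, 10)
```

Each 3×3 conv with padding=1 preserves height and width; each MaxPool(2) halves them, which is why the classifier head sees exactly 128·4·4 = 2048 features.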
Export: tracing vs scripting, and the symbolic shape problem
// pytorch → onnx
Exporting is not a translation. It is a reconstruction: PyTorch's forward() is arbitrary Python, and ONNX is a static dataflow graph. Bridging the two requires either running the Python (tracing) or parsing it (scripting), and each has failure modes.
Tracing
Run the model on a sample input. Record every framework op called. Emit those ops as the graph.
```python
dummy = torch.randn(1, 3, 32, 32)
torch.onnx.export(
    model, dummy, "cifar10.onnx",
    input_names=["image"],
    output_names=["logits"],
    dynamic_axes={
        "image": {0: "batch"},
        "logits": {0: "batch"},
    },
    opset_version=17,
)
```
Captures: the exact ops executed for the dummy input.
Loses: any control flow that depended on input values.
Scripting (TorchScript)
Statically analyze the Python source. Translate if, for, while into ONNX If, Loop ops.
```python
scripted = torch.jit.script(model)
torch.onnx.export(scripted, dummy, "cifar10.onnx")
```
Captures: control flow, dynamic shapes, recursive structures.
Loses: support for arbitrary Python — only a typed subset is supported. Many real-world models fail to script without rewrites.
For a feed-forward CNN like ours there is no control flow, so tracing is sufficient and preferred. For LSTMs, beam search, or anything with data-dependent loops, scripting (or torch.export) becomes mandatory.
The dynamic axis dance
By default, tracing bakes the dummy input's shape into the graph. A model traced with (1,3,32,32) will only accept batch size 1 unless you mark the batch dimension as dynamic via dynamic_axes. Most production deployments require at least {0: "batch"} on every input and output.
tensor.shape[0] in Python returns a concrete int during tracing. The exporter records that int as a constant in the graph, silently breaking dynamic batching. Always prefer tensor.size(0) or, in opset 13+, allow the exporter to emit a Shape + Gather.
Modern alternative: torch.export
PyTorch 2.x introduced torch.export, a compiler-grade frontend that produces a fully captured FX graph with symbolic shapes. The new ONNX exporter (torch.onnx.dynamo_export) builds on it, eliminating most edge cases of the legacy tracer.
```python
# The 2.x dynamo path
from torch.onnx import dynamo_export

prog = dynamo_export(model, dummy)
prog.save("cifar10.onnx")
```
The ONNX intermediate representation
// protobuf internals
An .onnx file is a serialized ModelProto message. Understanding the schema is the difference between treating ONNX as a black box and being able to debug, patch, or hand-author models.
Schema hierarchy
```
ModelProto {
  ir_version,                      // e.g. 9
  producer_name, producer_version,
  opset_import [{domain, version}],
  graph: GraphProto {
    name,
    node: [NodeProto],             // the ops
    initializer: [TensorProto],    // weights
    input: [ValueInfo],
    output: [ValueInfo],
    value_info: [ValueInfo]        // intermediates
  },
  metadata_props
}
```
Anatomy of a NodeProto
```
NodeProto {
  op_type: "Conv",
  domain: "",                  // "" = standard
  name: "/conv1/Conv",
  input: ["image",
          "conv1.weight",
          "conv1.bias"],
  output: ["/conv1/Conv_out"],
  attribute: [
    {name: "kernel_shape", ints: [3, 3]},
    {name: "pads", ints: [1, 1, 1, 1]},
    {name: "strides", ints: [1, 1]},
    {name: "group", i: 1}
  ]
}
```
Inspecting our exported CIFAR-10 graph
```python
import onnx

m = onnx.load("cifar10.onnx")
print(f"opset: {m.opset_import[0].version}")
print(f"ir_version: {m.ir_version}")
print(f"nodes: {len(m.graph.node)}")
print(f"initializers: {len(m.graph.initializer)}")
for n in m.graph.node[:5]:
    print(n.op_type, n.input, "->", n.output)
# Conv ['image', 'conv1.weight', 'conv1.bias'] -> ['/conv1/Conv_out']
# Relu ['/conv1/Conv_out'] -> ['/Relu_out']
# Conv ['/Relu_out', 'conv2.weight', ...] -> ['/conv2/Conv_out']
# BatchNormalization [...] -> ['/bn1/BN_out']
# Relu ['/bn1/BN_out'] -> ['/Relu_1_out']
```
Visualizing the graph
Every NodeProto is a vertex; every shared tensor name is an edge. After export and before any optimization, our CIFAR-10 graph is a single chain: five Conv nodes interleaved with Relu, BatchNormalization, and MaxPool, then a Flatten and two Gemm nodes, with Dropout left in as an inference-time no-op.
Initializers vs inputs
A subtlety that trips up newcomers: weights live in graph.initializer, not graph.input. Some tools list them under both for backward compatibility, but the canonical interpretation is: anything in initializer is a constant tensor baked into the model; anything in input is something the caller must supply at runtime.
Opsets, operator semantics, custom ops
// the contract
An opset version is the contract between producer and consumer. Conv-11 and Conv-22 may have different attribute defaults, broadcasting rules, or supported dtypes. Pinning an opset is as load-bearing as pinning a Python version.
| Operator | Domain | Inputs | Notable attrs | Where it appears in our model |
|---|---|---|---|---|
| Conv | ai.onnx | X, W, B? | kernel_shape, pads, strides, dilations, group | conv1–conv5 |
| BatchNormalization | ai.onnx | X, scale, B, mean, var | epsilon, momentum, training_mode | bn1–bn3 |
| Relu | ai.onnx | X | — | after every conv/fc |
| MaxPool | ai.onnx | X | kernel_shape, strides, pads, ceil_mode | pool1–pool3 |
| Flatten | ai.onnx | X | axis | before fc1 |
| Gemm | ai.onnx | A, B, C? | alpha, beta, transA, transB | fc1, fc2 |
| Dropout | ai.onnx | data, ratio?, training_mode? | seed | dropout (no-op at inference) |
Why opset version matters: a Resize example
Consider a model with image upscaling. In opset 10, Resize took scales as a single input. In opset 11, the signature changed: roi was added, and scales moved to the third input. A model exported under opset 10 will silently mis-link arguments if loaded under a strict opset-11 importer.
Custom operators
When the standard opset doesn't cover an op (e.g., a fused attention with bespoke masking), you have three choices:
1. Decompose
Express the op as a subgraph of standard ones. Slow but maximally portable. Most exporters do this by default for unsupported ops.
2. Custom domain
Use a non-empty domain on the NodeProto (e.g., com.microsoft). The runtime must register a kernel for that (domain, op_type, version) triple.
3. Function
ONNX functions: a named subgraph that the runtime can either inline or replace with a fused kernel. The cleanest path for emerging ops.
Graph optimization
// fusion · folding · layout
An exported graph is rarely the graph that actually runs. Between .onnx on disk and the first kernel call sits a graph optimizer that can shrink the node count by 30–60% and the latency by 2–5×.
Three classes of transformation
Constant folding
Any subgraph whose inputs are all initializers can be evaluated at load time. The classic example: Reshape(W, Concat(Shape(W), [1])). The shape and concat depend only on a constant weight, so the entire reshape becomes a new constant tensor.
For our CIFAR-10 model, the post-export graph has constant subgraphs around the BatchNorm parameters that fold completely.
```
# Before
y = Reshape(weight, Concat([Shape(weight)[0:2], Constant([3, 3])]))
# After
y = precomputed_constant
```
Operator fusion
The single highest-leverage optimization. Two adjacent ops that share an intermediate tensor become one kernel that reads the input once and writes the output once — saving the intermediate's memory bandwidth.
BN folding into Conv. Since BatchNorm at inference is y = γ(x-μ)/√(σ²+ε) + β, and Conv is y = Wx + b, you can analytically fold BN's affine into Conv's weights:
```
W' = W · (γ / √(σ² + ε)).reshape(-1, 1, 1, 1)
b' = (b − μ) · γ / √(σ² + ε) + β
```
After folding, every Conv-BN-Relu in our model becomes a single FusedConv node. For CIFAR-10Net: 14 nodes → 8 nodes.
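The folding algebra is easy to verify numerically. A toy NumPy sketch, with a matrix standing in for a 1×1 conv (the same per-output-channel scaling applies to a real Conv weight):

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, C_in = 4, 3
W = rng.normal(size=(C_out, C_in))
b = rng.normal(size=C_out)
gamma, beta = rng.normal(size=C_out), rng.normal(size=C_out)
mu, var, eps = rng.normal(size=C_out), rng.random(C_out) + 0.1, 1e-5

x = rng.normal(size=C_in)
# Reference: Conv then BatchNorm, as two separate steps
reference = gamma * ((W @ x + b) - mu) / np.sqrt(var + eps) + beta

# Folded: BN's affine absorbed into the conv weights and bias
s = gamma / np.sqrt(var + eps)      # per-channel scale
W_folded = W * s[:, None]
b_folded = (b - mu) * s + beta
fused = W_folded @ x + b_folded

assert np.allclose(reference, fused)
```

Because the fold is exact algebra (no approximation), FP32 outputs are bit-for-bit equal up to floating-point reassociation.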
Layout transforms (NCHW vs NHWC)
PyTorch defaults to NCHW (batch, channels, height, width). Many mobile NPUs and the Apple Neural Engine prefer NHWC. The optimizer can insert Transpose nodes at the boundary, then push them through the graph until they cancel out.
This is called transpose elimination: Transpose(Transpose(x, [0,2,3,1]), [0,3,1,2]) ≡ x. Done correctly across an entire graph, the only remaining transposes are at inputs and outputs.
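That cancellation is trivial to check directly in NumPy:

```python
import numpy as np

# NCHW -> NHWC followed by NHWC -> NCHW is the identity permutation.
x = np.random.rand(1, 3, 32, 32)
y = x.transpose(0, 2, 3, 1).transpose(0, 3, 1, 2)
assert np.array_equal(x, y)
```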
| Layout | Best for | Why |
|---|---|---|
| NCHW | NVIDIA GPUs (older), CUDA cuDNN | cuDNN's most mature kernels are NCHW |
| NHWC | TPU, Apple Neural Engine, modern Tensor Cores | Channels-last vectorizes better with WMMA / MMA |
| NC/32HW32 | TensorRT INT8 | Tile-friendly for IMMA / DP4A instructions |
Quantization
// fp32 → int8 / fp16
Quantization replaces FP32 weights and activations with lower-precision types — typically INT8 or FP16. The weight file shrinks 4×, integer SIMD throughput goes up 2–4×, and on accelerators with INT8 tensor cores the speedup compounds. The price is accuracy loss, which good quantization keeps under 1% top-1.
Dynamic
Weights are quantized offline; activations are quantized on-the-fly per-batch. No calibration data needed. Best for transformer-like models where activation ranges vary heavily.
~2× speedup, ~10MB → ~3MB
Static (PTQ)
Run a calibration set through the FP32 model, record activation min/max histograms, derive scale + zero-point per tensor, freeze. Best for CNNs with stable activation distributions — i.e. our CIFAR-10 model.
~3-4× speedup, <1% accuracy drop
QAT
Quantization-aware training: insert fake-quant ops during training so the model learns weights robust to quantization noise. Highest accuracy, requires retraining infrastructure.
~3-4× speedup, virtually no accuracy drop
The math: affine quantization
An INT8 tensor q approximates an FP32 tensor x via a per-tensor (or per-channel) scale and zero point:
```
x ≈ scale · (q − zero_point)

q ∈ [−128, 127]   (int8, signed symmetric)
q ∈ [   0, 255]   (uint8, asymmetric)

scale      = (x_max − x_min) / (q_max − q_min)
zero_point = round(q_min − x_min / scale)
```
Per-channel quantization (one scale per output channel of a Conv) preserves accuracy far better than per-tensor on CNNs. ONNX's QuantizeLinear and DequantizeLinear ops accept either a scalar (per-tensor) or a 1-D (per-axis) scale and zero_point to express this.
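The affine formulas above can be exercised directly. A toy per-tensor uint8 round-trip in NumPy (illustrative only; real calibrators derive ranges from histograms, not raw min/max):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000).astype(np.float32)

q_min, q_max = 0, 255
scale = float(x.max() - x.min()) / (q_max - q_min)
zero_point = int(round(q_min - x.min() / scale))

# Quantize, clip to the representable range, then dequantize
q = np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.uint8)
x_hat = scale * (q.astype(np.float32) - zero_point)
print("max abs error:", np.abs(x - x_hat).max())  # on the order of scale/2
```

The reconstruction error is bounded by the step size, which is why a narrower calibrated range (smaller scale) buys accuracy until clipping starts to dominate.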
QDQ vs QOperator format
Quantized ONNX models come in two flavors:
QDQ (Quantize-DeQuantize)
Insert Q/DQ pairs around every activation and weight. The compute ops stay FP32 — the runtime fuses Q+Op+DQ into a single quantized kernel.
```
x → Q → DQ → Conv → Q → DQ → Relu → Q → ...
         ↑
   int8 weights enter through their
   own DQ, dequantized just in time
```
Default in modern ONNX exporters. Most portable.
QOperator
Use explicit quantized op_types: QLinearConv, QLinearMatMul, etc. These take int8 inputs directly along with their scales.
```
x_int8 → QLinearConv → y_int8
         (W_int8, scales, zero_points
          baked into op inputs)
```
More compact graph, less universally supported.
Quantizing CIFAR-10Net (live walkthrough)
```python
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static)

class CIFAR10Calib(CalibrationDataReader):
    def __init__(self, samples=512):
        # `dl` is the CIFAR-10 DataLoader from training
        self.it = iter([{"image": x.numpy()} for x, _ in dl][:samples])

    def get_next(self):
        return next(self.it, None)

quantize_static(
    "cifar10.onnx",
    "cifar10_int8.onnx",
    CIFAR10Calib(),
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    per_channel=True,
)
```
| Variant | Size | Latency (1 thread) | Top-1 | Notes |
|---|---|---|---|---|
| FP32 (post-fuse) | 2.30 MB | 1.6 ms | 91.0% | Baseline |
| FP16 | 1.16 MB | 0.9 ms | 90.9% | GPU only typically |
| INT8 dynamic | 0.62 MB | 1.1 ms | 89.7% | No calibration |
| INT8 static (per-channel) | 0.62 MB | 0.55 ms | 90.6% | Recommended |
| INT8 QAT | 0.62 MB | 0.55 ms | 90.9% | If retraining is available |
ONNX Runtime architecture
// session · providers · allocator
ONNX Runtime (ORT) is the reference implementation. Its architecture is worth studying because nearly every other runtime — TensorRT, OpenVINO, even mobile-only ones — follows the same broad shape.
The session lifecycle
1. Load. Parse the .onnx protobuf into an in-memory Graph.
2. Optimize. Apply graph transformers (Level 1: trivial, Level 2: fusion, Level 3: layout). The user picks the level.
3. Partition. Walk the graph. For each node, ask each enabled EP "can you take this?" The first EP that can claim a connected subgraph gets it.
4. Compile. Each EP compiles its assigned subgraph into a kernel sequence (or a single fused kernel, in TensorRT's case).
5. Plan memory. Compute lifetimes for every intermediate tensor; allocate a single arena that all intermediates alias into.
6. Run. For each call to session.run(), walk the partitioned plan and dispatch.
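Step 5 can be sketched as a greedy first-fit over tensor lifetimes (a toy planner for intuition, not ORT's actual algorithm):

```python
# Each intermediate tensor has a lifetime (first/last step it is live,
# in topological order) and a size; tensors whose lifetimes don't
# overlap may alias the same bytes of the arena.
def plan_arena(tensors):
    """tensors: list of (name, start, end, size). Returns offsets, arena size."""
    placed = []                      # (offset, size, start, end)
    offsets = {}
    for name, start, end, size in sorted(tensors, key=lambda t: t[1]):
        # arena blocks whose lifetimes overlap this tensor's
        live = sorted((o, s) for (o, s, a, b) in placed
                      if not (end < a or b < start))
        offset = 0
        for o, s in live:
            if offset + size <= o:   # fits in the gap before this block
                break
            offset = max(offset, o + s)
        placed.append((offset, size, start, end))
        offsets[name] = offset
    arena = max((o + s for o, s, _, _ in placed), default=0)
    return offsets, arena

offsets, arena = plan_arena([
    ("A", 0, 1, 100),   # live during steps 0-1
    ("B", 1, 2, 100),   # overlaps A at step 1 -> placed above it
    ("C", 2, 3, 100),   # A is dead by step 2 -> reuses A's bytes
])
print(offsets, arena)   # {'A': 0, 'B': 100, 'C': 0} 200
```

Three 100-byte tensors fit in a 200-byte arena because C can alias A, which is exactly why a single pre-planned arena beats per-tensor malloc on the hot path.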
```python
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.intra_op_num_threads = 4
opts.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

sess = ort.InferenceSession(
    "cifar10_int8.onnx",
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# IO binding avoids host<->device copies on hot paths
# (`img` is a CUDA-resident torch tensor)
io = sess.io_binding()
io.bind_input("image", "cuda", 0, np.float32, [1, 3, 32, 32], img.data_ptr())
io.bind_output("logits", "cuda")
sess.run_with_iobinding(io)
```
Execution providers
// where the kernels actually live
An execution provider is the bridge between the runtime's IR and a hardware-specific kernel library. Each EP is a plug-in that registers (1) which ops it implements, (2) capability metadata, and (3) compiled kernels.
| EP | Targets | Backend lib | Op coverage | Best for |
|---|---|---|---|---|
| CPU | x86 / ARM | MLAS, oneDNN | ~100% | Universal fallback, server inference |
| CUDA | NVIDIA GPU | cuBLAS, cuDNN | ~95% | Datacenter, generic GPU inference |
| TensorRT | NVIDIA GPU | TRT engine | ~85% | Lowest GPU latency, INT8 |
| OpenVINO | Intel CPU/GPU/VPU | OpenVINO IR | ~90% | Intel hardware, edge servers |
| DirectML | Any DX12 GPU | D3D12 | ~80% | Windows app inference |
| CoreML | Apple Silicon, ANE | MPS / ANE | ~75% | iOS/macOS, neural engine offload |
| NNAPI | Android | NNAPI HAL | ~70% | Android phones, vendor accelerators |
| QNN | Qualcomm Hexagon | QNN SDK | ~75% | Snapdragon NPU |
| WebGPU | Browser | WGSL shaders | ~60% | In-browser inference |
| WebAssembly | Browser (CPU) | WASM SIMD | ~95% | In-browser fallback, no GPU |
Partitioning in practice
Suppose we run our CIFAR-10 model with ["TensorRTExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]. The partitioner walks the graph: TensorRT claims the Conv/Gemm backbone as one connected subgraph, CUDA picks up any node TensorRT rejects, and the CPU EP is the guaranteed fallback for whatever remains. For a feed-forward CNN like ours, TensorRT typically claims the entire graph.
TensorRT EP specifically
TensorRT is unusual: it doesn't implement individual op kernels at runtime. Instead it accepts an entire subgraph, builds a fused engine via its own optimizer (kernel auto-tuning, INT8 calibration, fusion across ~50 ops), and serializes it. ORT caches that engine on disk and on next session load skips rebuilding.
```python
providers = [("TensorrtExecutionProvider", {
    "trt_engine_cache_enable": True,
    "trt_engine_cache_path": "./trt_cache",
    "trt_int8_enable": True,
    "trt_int8_calibration_table_name": "cifar10_calib.cache",
    "trt_max_workspace_size": 2 << 30,  # 2 GiB
})]
```
Deployment targets
// where the model actually runs
"Deploying a model" means very different things on a 96-vCPU Linux server, an iPhone, a Raspberry Pi, and a browser tab. The same .onnx file should — and with care, can — serve all of them.
Server (Linux x86_64 / aarch64)
- Runtime: ONNX Runtime + Triton Inference Server, or BentoML / KServe.
- EP: TensorRT (NVIDIA), OpenVINO (Intel CPU), ROCm (AMD).
- Concurrency: dynamic batching at the server tier; intra-op parallelism inside ORT for batch>1.
- Format: FP16 on GPU, INT8 if accuracy budget permits.
iOS / Android
- iOS: ORT-Mobile build → CoreML EP → Apple Neural Engine (ANE) for INT8/FP16 ops it supports, GPU for the rest.
- Android: ORT-Mobile + NNAPI EP, or QNN EP for Snapdragon-only flagship apps.
- Format: INT8 PTQ; binary-stripped ORT (~3 MB) instead of full build (~12 MB).
- Constraint: ANE/NNAPI op coverage is partial; non-supported ops fall back to CPU and break op fusion.
Browser
- onnxruntime-web: ships WASM (CPU SIMD) + WebGPU + WebGL backends in a single npm package.
- WebGPU: ~5–20× faster than WASM for conv-heavy models on a discrete GPU.
- WASM: universal fallback; ~2–4× slower than native CPU but works on any browser.
- Caveats: first model load is large (a few MB gzipped); the origin must serve .onnx with correct MIME and CORS headers.
Edge / MCU
- Class A (Linux SBC, e.g. Pi 5): ORT + ARM NEON CPU EP; INT8 model fits in <1 MB.
- Class B (Cortex-M): ONNX → TFLite Micro or onnx-mlir + cmsis-nn. Single-buffer arena, no malloc.
- Class C (NPU): vendor compiler (Qualcomm AI Engine, NXP eIQ) consumes ONNX directly and emits a binary blob.
Cross-target consistency
One trap: the FP32 reference, the INT8 server engine, and the INT8 mobile engine can produce slightly different logits on the same input. ULP-level numerical differences (different cuDNN algorithm, different rounding mode, different SIMD reduction order) compound through layers. For most applications this is invisible; for safety-critical ones it must be characterized.
Benchmarking
// latency · throughput · tail
A single number ("3 ms") tells you almost nothing. Inference performance is a distribution, parameterized by batch size, thread count, sequence length (for transformers), warmup state, and contention.
The four numbers that matter: p50 (the typical request), p95 and p99 (the tail), and throughput at saturation.
Doing it right
- Warm up. Throw away the first ~50 calls. Lazy compilation, allocator priming, and kernel autotuning all happen in the first few runs.
- Pin clocks. On a GPU, lock to base clock with nvidia-smi -lgc; on CPU, disable turbo or pin frequency. Otherwise variance dominates signal.
- Isolate the process. No other tenants on the device. Use taskset / numactl on CPU.
- Measure end-to-end. Include host→device copy if your real workload pays it. io_binding with pre-resident GPU tensors can hide a real cost.
- Vary load. Latency at QPS=1 and at QPS=saturation are different curves. Most production systems live in the knee of that curve.
```python
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("cifar10_int8.onnx",
                            providers=["CPUExecutionProvider"])
x = np.random.randn(1, 3, 32, 32).astype(np.float32)

# warmup
for _ in range(50):
    sess.run(None, {"image": x})

t = []
for _ in range(1000):
    s = time.perf_counter_ns()
    sess.run(None, {"image": x})
    t.append(time.perf_counter_ns() - s)

t = np.array(t) / 1e6  # ms
print(f"p50 {np.median(t):.3f} "
      f"p95 {np.percentile(t, 95):.3f} "
      f"p99 {np.percentile(t, 99):.3f}")
```
Security
// integrity · side channels · supply chain
A deployed model is an attack surface. The model file itself can carry executable code via custom ops. The runtime can be made to leak inputs through timing. The training pipeline can be poisoned upstream of export. ONNX deployment teams should treat the .onnx file with the same scrutiny as any third-party binary.
Threat model overview
Model integrity
- Tampering. A malicious actor swaps weights or rewires the graph to introduce a backdoor (specific input pattern → attacker-chosen output). Detectable only by hashing.
- Mitigation: Sign .onnx files with Sigstore / cosign. Verify SHA-256 at load time. Pin opset and producer metadata.
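The load-time digest check is a few lines (a sketch; in practice the expected digest comes from your signing pipeline, not a hard-coded string):

```python
import hashlib

# Refuse to load any model whose SHA-256 doesn't match the pinned digest.
def verify_model(path, expected_hex):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    if h.hexdigest() != expected_hex:
        raise ValueError(f"model digest mismatch for {path}")
```

Run this before handing the path to the inference session, so a tampered file never reaches the protobuf parser.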
Custom-op RCE
- Risk. A model with a custom-domain op can request that the runtime load a shared library matching that domain. If the runtime resolves the library by name from LD_LIBRARY_PATH, an attacker who controls model + lib path gets code execution.
- Mitigation: Disable custom op loading in production runtimes; allowlist domains; static-link required custom ops.
Side channels
- Timing. Inference latency varies with input. For small models served at request scale, timing distributions can leak class labels or even reconstruct inputs.
- Mitigation: Constant-time inference (always run the worst-case path); pad to a fixed deadline before responding; reduce server-side timing precision.
Adversarial inputs
- Risk. Imperceptible perturbations cause misclassification. Out of scope for the runtime — must be addressed in model design (adversarial training, certified defenses).
- Mitigation: Input normalization, randomized smoothing, ensemble checks at the application layer.
Model extraction
- Risk. Black-box query access lets an attacker train a surrogate that closely matches the deployed model's behavior. Particularly relevant for paid inference APIs.
- Mitigation: Rate limiting, query budgets, output truncation (return top-1 only, not full logits), watermarking.
Supply chain
- Risk. A compromised pretrained model from a hub embeds a backdoor that survives fine-tuning. The export pipeline propagates it cleanly into .onnx.
- Mitigation: Source models only from verified publishers; scan for anomalous subgraphs; differential testing against a known-clean reference.
A worked example: timing side channel on CIFAR-10
Suppose CIFAR-10Net is served behind an HTTP API that returns predicted class. An attacker sends 10,000 inputs and records the precise latency of each response. Even with all per-class compute paths fused into a single graph, attacker-observable latency variance correlates with predicted class because:
- The post-softmax argmax branch on the host returns earlier when one logit clearly dominates.
- Cache effects: classes whose decision boundaries hit hot regions of the FC weight matrix have lower L2 misses.
- The HTTP serializer's response length depends on label string length ("airplane" vs "cat").
None of this is exploitable on a per-call basis. With 10,000 calls and statistical analysis, label distributions become recoverable. The fix is application-layer: respond at a fixed deadline (e.g., always return at T+5ms), not when computation finishes.
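The fixed-deadline response fits in a few lines (a sketch; `infer` stands for whatever variable-latency call serves the request):

```python
import time

# Always respond at T + deadline, regardless of when inference finished.
def respond_at_deadline(infer, request, deadline_s=0.005):
    start = time.perf_counter()
    result = infer(request)                       # variable-latency work
    remaining = deadline_s - (time.perf_counter() - start)
    if remaining > 0:
        time.sleep(remaining)                     # pad to the deadline
    return result
```

The deadline must be set above the worst-case latency; otherwise the slowest classes still leak through the occasional deadline miss.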
Closing the loop: a hardening checklist
- Sign and verify every .onnx at load time. Reject unsigned models in production.
- Disable custom op libraries; allowlist domains your team owns.
- Build the runtime with the minimal set of EPs needed; smaller binary = smaller surface.
- Run the runtime under a sandboxed user, with seccomp / AppArmor restricting syscalls to the strictly required set.
- Pad responses to a fixed deadline; truncate output to the minimum information clients need.
- Maintain a golden test set; alert on drift between deployed variants and FP32 reference.
- Treat the training-to-export pipeline as production code: review, CI, reproducible builds, signed artifacts.
Closing thoughts
// what to take away
If you trace one mental model through this entire pipeline — a 32×32 image entering a Python forward(), becoming a frozen graph of ~200 operator definitions, undergoing fold-fuse-quantize until a 580K-parameter network fits in 600 KB, then dispatching across ten possible execution providers down to vendor SIMD intrinsics — you can see why ONNX is more than a file format. It's the impedance match between research-grade flexibility and production-grade efficiency.
Three principles to internalize:
1. The graph is the artifact
Once a model is exported, its Python lineage is irrelevant. The graph is what gets optimized, quantized, partitioned, executed. Treat it as the source of truth and learn to read it directly.
2. Optimization is composition
No single transform delivers the headline numbers. Folding shrinks the graph; fusion eliminates intermediates; quantization shrinks tensors; the right EP picks the right kernel. Each is a 1.5–2× win; together, 10–20×.
3. Deployment is a pipeline, not a step
"Convert to ONNX" is the first step. The accuracy diff vs FP32, the latency budget, the security posture, the cross-target consistency — those are continuous concerns, not checkbox items.