Jetson Orin Nano Super — Edge AI Hardware Deep Dive
An interactive, research-level tour of NVIDIA's $249 generative-AI edge computer:
the Ampere GPU, 6-core ARM CPU, vision pipeline, and how a real camera-to-decision
inference loop runs end-to-end inside a 25 W power envelope.
Ampere GA10B · 8 nm Samsung · 8 GB LPDDR5 · 67 TOPS (INT8 sparse) · 7–25 W
The "Super" badge is a firmware unlock, not a new chip. NVIDIA raised GPU, CPU and memory clocks on
the existing Orin Nano module via JetPack 6.1, lifting AI throughput from 40 TOPS to 67 TOPS and memory
bandwidth from 68 GB/s to 102 GB/s — the same silicon, just permitted more power (15 W → 25 W).
Architecture: Ampere GA10B SoC
Process node: Samsung 8 nm (8LPP)
CUDA cores: 1024
Tensor cores: 32 (3rd-gen Ampere)
SM count: 8 (1 GPC × 4 TPCs × 2 SMs)
GPU clock: 1.02 GHz (was 625 MHz)
CPU: 6× Cortex-A78AE @ 1.7 GHz (was 1.5 GHz)
CPU cache: 1.5 MB L2 + 4 MB shared L3
Memory: 8 GB LPDDR5, 128-bit, 3200 MHz
Memory BW: 102 GB/s (was 68 GB/s)
AI perf: 67 TOPS (INT8 sparse)
Power envelope: 7–25 W (configurable)
What's missing vs Orin NX / AGX: the Orin Nano variant has the
two NVDLA v2 deep-learning accelerators and the PVA v2 vision DSP physically fused off.
All AI work runs on the GPU. There is also no NVENC — H.264/H.265 encode is done on the CPU cores.
NVDEC is present for decode.
Numbers from NVIDIA's Jetson Orin Nano Super Developer Kit datasheet (Dec 2024) and Jetson AGX Orin Series Technical Brief v1.2.
02 Developer Kit Carrier Board (100 × 79 mm)
The reference carrier board is what you actually unbox. The SoM (the small heat-sinked module on top) plugs into a
260-pin SO-DIMM-style connector. Everything else — USB, M.2 slots, camera connectors, the 40-pin GPIO header — is
routed off the SoM through this carrier PCB. Hover the labels to see what each port does.
The credit-card-sized SoM is the actual computer. It packs the Orin SoC, four LPDDR5 memory chips, a PMIC,
a small EEPROM with board-ID, and the 260-pin SO-DIMM connector that talks to the carrier. Storage is
external on Orin Nano — there is no on-module eMMC, so you boot from microSD or NVMe.
Why no on-module storage? Removing the eMMC saved board area and ~$10 BOM cost.
The trade-off: microSD boot is slow (UHS-I tops out around 100 MB/s), so most builders drop an NVMe SSD into
the M.2 Key M slot and reflash to boot from PCIe, a 10–40× I/O boost.
04 Orin SoC Floorplan (T234 / GA10B)
The Orin die is a heterogeneous SoC, not a monolithic GPU. Compute is split across an Ampere GPU island, an ARM CPU
cluster (12 cores on AGX, 8 on NX, 6 on Nano), an image signal processor for cameras, NVDEC for video, and a "Safety Cluster"
of Cortex-R52 cores for functional-safety housekeeping. Greyed-out blocks are present in the silicon but
fused off on the Nano variant — they ship enabled on Orin NX and AGX.
The Orin Nano GPU is a tiny slice of Ampere — just one Graphics Processing Cluster.
Compare that to a desktop RTX 3060 with 3 GPCs (28 SMs), or a full Orin AGX with 16 SMs across 2 GPCs.
Inside the single GPC, 4 Texture Processing Clusters each hold 2 SMs that share one PolyMorph engine.
How the math works out
Level | Count | Holds | Math
GPU | 1 | 1 GPC | —
GPC | 1 | 4 TPCs · 1 raster engine · 16 ROPs | —
TPC | 4 | 2 SMs · 1 PolyMorph engine | —
SM | 8 | 128 CUDA · 4 Tensor · 1 RT | —
CUDA cores | 1024 | FP32 / INT32 lanes | 8 SMs × 128
Tensor cores | 32 | 3rd-gen Ampere matrix engines | 8 SMs × 4
RT cores | 8 | 2nd-gen ray-traversal units | 8 SMs × 1
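A quick way to sanity-check those counts on the device itself (not part of the original interactive page, just a minimal CUDA sketch): query the runtime for the SM count and clock, then derive the CUDA-core total from Ampere's 128 lanes per SM, a multiplier the runtime does not report directly.

// check_gpu.cu — verify the SM / CUDA-core math from the table above.
// Build on the Jetson: nvcc -o check_gpu check_gpu.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    const int cores_per_sm = 128;   // Ampere: 128 FP32 lanes per SM (hard-coded, not queryable)
    std::printf("GPU           : %s\n", prop.name);
    std::printf("SM count      : %d\n", prop.multiProcessorCount);                 // expect 8
    std::printf("CUDA cores    : %d\n", prop.multiProcessorCount * cores_per_sm);  // expect 1024
    std::printf("GPU clock     : %.2f GHz\n", prop.clockRate / 1e6);               // ~1.02 in MAXN
    std::printf("Mem bus width : %d-bit\n", prop.memoryBusWidth);                  // expect 128
    return 0;
}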
06 Inside one Ampere SM
Each Streaming Multiprocessor is split into 4 sub-cores (processing blocks). Each sub-core has
its own warp scheduler, dispatch unit, register file slice, and a dedicated Tensor Core. The L1 / shared
memory cache and the RT core are shared across the SM. Press ▶ run warp to watch a 32-thread warp
travel through one sub-core.
Ready. A warp = 32 threads executing the same instruction in lock-step (SIMT).
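To make that lock-step picture concrete, here is a small sketch (mine, not part of the demo): one warp sums 32 values with __shfl_down_sync, so data hops between the 32 lanes through the sub-core's register file without ever touching shared memory.

// warp_sum.cu — one 32-thread warp reduces 32 floats entirely in registers.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];                              // one value per lane (32 lanes = 1 warp)
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);      // lane i grabs lane i+offset's value
    if (threadIdx.x == 0) *out = v;                         // lane 0 now holds the warp-wide sum
}

int main() {
    float h_in[32], *d_in, *d_out, h_out = 0.f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;            // expected sum: 32
    cudaMalloc((void**)&d_in, 32 * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);                        // launch exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("warp sum = %f\n", h_out);                  // prints 32.000000
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}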
07 3rd-Gen Tensor Core (Ampere) · the heart of 67 TOPS
Each Tensor Core is a fixed-function matrix-multiply-accumulate engine. In one clock it computes
D = A · B + C on small matrix tiles. Ampere added two huge wins: BF16 / TF32
for training-friendly precision, and structured 2:4 sparsity — if half your weights are
zero in a regular pattern, throughput doubles for free. That sparsity trick is where the headline
"67 TOPS" comes from (dense INT8 is 33 TOPS).
Precision / throughput ladder (FP16 dense = 1×)
FP16 / BF16 dense: 1× (~17 TFLOPS)
TF32 (training): 0.5× (training-friendly)
INT8 dense: 2× (33 TOPS)
INT8 + 2:4 sparsity: 4× (67 TOPS, the headline number)
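For the curious, a hedged sketch of what driving a Tensor Core directly looks like in CUDA (TensorRT normally generates this for you): the warp-level wmma API performs the D = A · B + C operation described above on one 16×16×16 FP16 tile per mma_sync. Host-side data fill is omitted to keep the sketch short.

// wmma_tile.cu — one warp computes a 16x16x16 FP16 matrix tile on a Tensor Core.
// Build: nvcc -arch=sm_87 -o wmma_tile wmma_tile.cu   (sm_87 = Orin's Ampere GPU)
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tile_mma(const half* A, const half* B, float* D) {
    // Fragments live in the warp's registers, spread across all 32 lanes.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);           // C = 0
    wmma::load_matrix_sync(a, A, 16);         // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);           // D = A·B + C in one Tensor Core operation
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

int main() {
    half *dA, *dB; float *dD;
    cudaMalloc((void**)&dA, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&dB, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&dD, 16 * 16 * sizeof(float));
    // (fill dA/dB from the host as needed; omitted here)
    tile_mma<<<1, 32>>>(dA, dB, dD);          // a single warp drives the Tensor Core
    cudaDeviceSynchronize();
    return 0;
}

Real kernels tile a large GEMM over many warps and use INT8 fragments to hit the 33 / 67 TOPS paths.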
08 ARM CPU Cluster — 6× Cortex-A78AE
The CPU is a real out-of-order ARMv8.2 design — same family that ran the original Pixel 6 and many automotive
SoCs. The "AE" (Automotive Enhanced) variant adds split-lock ECC: any two cores can be paired into
lock-step mode for safety-critical code. On Orin Nano you get 6 cores in two clusters of 3, each cluster sharing
a 2 MB L3.
Pipeline: 13-stage, out-of-order superscalar
Issue width: 4-wide (up to 6 µops)
Frequency: 1.7 GHz (was 1.5 GHz pre-Super)
SIMD: NEON 128-bit (FP16/32/64, INT8 dot-product)
L1 I / L1 D: 64 KB / 64 KB per core
L2 per core: 256 KB, private, 8-way
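That "INT8 dot-product" entry is the ARMv8.2 dot-product extension (SDOT/UDOT), which the A78AE implements. A hedged host-side sketch, compiled for the CPU rather than the GPU: each vdotq_s32 folds sixteen INT8 multiplies into four INT32 accumulators in a single instruction.

// neon_dot.cpp — INT8 dot product on the Cortex-A78AE using the SDOT instruction.
// Build on the Jetson: g++ -O2 -march=armv8.2-a+dotprod -o neon_dot neon_dot.cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int32_t dot_int8(const int8_t* a, const int8_t* b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb);     // 16 INT8 multiplies folded into 4 INT32 lanes
    }
    return vaddvq_s32(acc);               // horizontal add of the 4 accumulator lanes
}

int main() {
    int8_t a[64], b[64];
    for (int i = 0; i < 64; ++i) { a[i] = 1; b[i] = 2; }
    std::printf("dot = %d\n", dot_int8(a, b, 64));   // prints 128
    return 0;
}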
09 Unified Memory Hierarchy
Unlike a discrete GPU, Jetson has physically unified memory: the CPU and GPU literally share the
same 8 GB LPDDR5. There is no PCIe-style host-to-device copy — a tensor allocated by the CPU is already visible to
a CUDA kernel. This is huge for edge AI: zero-copy from the camera buffer straight into a YOLO inference call.
Press ▶ animate load to watch a single 64-byte cache line walk down the hierarchy.
Ready. The cache line will walk LPDDR5 → System Cache → GPU L2 → L1 → register.
Why unified memory matters at the edge: on a desktop RTX, every camera frame must be
DMA'd over PCIe before the GPU can see it (a ~6 MB 1080p RGB frame costs roughly half a millisecond on PCIe 3.0 x16, plus driver overhead, on every frame). On Jetson, the ISP writes the frame
straight into a buffer the GPU can already address: CUDA kernels can read it with
cudaHostGetDevicePointer() and 0 bytes are copied. The catch: GPU and CPU
are now competing for the same 102 GB/s, so memory-bound kernels feel each other.
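A hedged sketch of that zero-copy path (illustrative, not the production camera route): pinned, mapped host memory from cudaHostAlloc is visible to the GPU through cudaHostGetDevicePointer(), so a buffer the CPU fills can be read by a kernel with no copy. Real camera pipelines usually hand you an NVMM/libargus buffer instead of one you allocate yourself.

// zero_copy.cu — CPU writes a frame-sized buffer, GPU reads it with no memcpy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void invert(unsigned char* px, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) px[i] = 255 - px[i];           // trivial "preprocessing" on the shared buffer
}

int main() {
    const int n = 1920 * 1080 * 3;            // one 1080p RGB frame, ~6 MB
    unsigned char *h_frame, *d_frame;

    cudaSetDeviceFlags(cudaDeviceMapHost);                     // allow mapped host memory
    cudaHostAlloc((void**)&h_frame, n, cudaHostAllocMapped);   // pinned + GPU-visible
    cudaHostGetDevicePointer((void**)&d_frame, h_frame, 0);    // same memory, device address

    for (int i = 0; i < n; ++i) h_frame[i] = 100;              // "camera" fills the buffer
    invert<<<(n + 255) / 256, 256>>>(d_frame, n);              // GPU works in place, 0 bytes copied
    cudaDeviceSynchronize();

    std::printf("first pixel after GPU pass: %d\n", h_frame[0]);   // prints 155
    cudaFreeHost(h_frame);
    return 0;
}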
10 The Edge AI Loop — Camera → ISP → GPU → Action
This is the canonical Jetson workload: a MIPI camera streams raw Bayer pixels into the SoC, the ISP cooks them into
RGB, the GPU runs an inference network, and the result either drives a display or fires an event over UART/CAN to a
robot's motor controller. Press ▶ run pipeline to watch one frame cross every block.
Ready. One frame will travel from sensor to action.
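A hedged sketch of that loop in code, assuming OpenCV built with GStreamer support (as shipped in JetPack) and the stock nvarguscamerasrc element for a MIPI CSI camera; run_inference() is a placeholder for the TensorRT engine built in section 12, and the exact pipeline caps may need tweaking for your sensor.

// capture_loop.cpp — camera → ISP (via nvarguscamerasrc) → frame buffer → act.
// Build: g++ -O2 capture_loop.cpp -o capture_loop $(pkg-config --cflags --libs opencv4)
#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    // The ISP debayers and scales the raw sensor stream before it ever reaches us.
    const std::string pipeline =
        "nvarguscamerasrc ! video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1 ! "
        "nvvidconv ! video/x-raw,format=BGRx ! videoconvert ! video/x-raw,format=BGR ! appsink";

    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    if (!cap.isOpened()) { std::fprintf(stderr, "camera open failed\n"); return 1; }

    cv::Mat frame;
    while (cap.read(frame)) {
        // run_inference() is a placeholder for the TensorRT engine call (section 12);
        // its result would normally fire an event over UART/CAN to the motor controller.
        // bool person = run_inference(frame);
        std::printf("got frame %dx%d\n", frame.cols, frame.rows);
    }
    return 0;
}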
11 Power Modes & Thermal Envelope
The "Super" software unlock didn't change the silicon — it just permitted the existing PMIC to run the
chip harder. NVIDIA exposes power presets through the nvpmodel tool. The throttling logic is
enforced by the BPMP (Boot & Power Management Processor), a tiny ARM Cortex-R5 that sits
inside the SoC and watches the INA3221 current sensor on the carrier board. Cross 25 W average and the BPMP
starts dropping GPU clocks within microseconds.
Power mode | TDP | CPU cores | CPU clk | GPU clk | Mem clk | AI perf
15 W (legacy) | 15 W | 4 of 6 | 1.5 GHz | 625 MHz | 2133 MHz | 40 TOPS
25 W "Super" (MAXN) | 25 W | 6 of 6 | 1.7 GHz | 1.02 GHz | 3200 MHz | 67 TOPS
7 W (silent) | 7 W | 2 of 6 | 1.1 GHz | 408 MHz | 2133 MHz | ~10 TOPS
Power-vs-throughput knobs
The same chip can be a fanless 7 W silent computer (nvpmodel -m 1) or a 25 W generative-AI
workstation. Throttle behaviour is graceful: clocks step down in 50 MHz increments rather than hard thermal shutdown.
Below a 60 °C junction temperature you get full clocks; between 60 °C and 100 °C the BPMP linearly de-rates the GPU.
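To watch that de-rating happen, a small sketch under the assumption that JetPack exposes the standard Linux thermal sysfs (it does, though zone names vary by release): print every thermal zone's temperature and correlate it with the clocks tegrastats reports.

// read_thermal.cpp — dump every thermal zone the kernel exposes (values are millidegrees C).
// Build: g++ -O2 read_thermal.cpp -o read_thermal
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (int zone = 0; ; ++zone) {
        const std::string base = "/sys/class/thermal/thermal_zone" + std::to_string(zone);
        std::ifstream type_f(base + "/type"), temp_f(base + "/temp");
        if (!type_f || !temp_f) break;             // no more zones
        std::string type; long milli_c = 0;
        std::getline(type_f, type);
        temp_f >> milli_c;
        // On Jetson one of these zones is the GPU junction sensor the BPMP watches.
        std::cout << type << ": " << milli_c / 1000.0 << " °C\n";
    }
    return 0;
}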
12 JetPack Software Stack
Hardware is half the story. JetPack is the BSP + libraries that make the silicon usable. Every
user-facing AI framework (PyTorch, TensorFlow, ONNX Runtime) eventually compiles to a TensorRT engine and dispatches
to CUDA kernels. Below is the layered view from your Python script all the way down to the SoC.
The TensorRT optimization step (where the magic happens)
$ trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine --int8
# (no --useDLACore flag: Orin Nano has no DLA, so every layer targets the GPU)
# TRT will:
# 1. Fuse Conv+BN+ReLU into a single kernel
# 2. Pick the fastest tactic per layer (Winograd? im2col? implicit GEMM?)
# 3. Calibrate INT8 scales using a small representative dataset
# 4. Layout-transform tensors to NHWC for tensor cores
# 5. Emit a serialized .engine file (~6 MB for YOLOv8n)
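Once trtexec has written yolov8n.engine, the runtime side is short. A hedged C++ sketch against the enqueueV3 API (available since TensorRT 8.5, including the versions JetPack 6.x ships); the tensor names and shapes below are typical for a YOLOv8n ONNX export and are assumptions, as is all the omitted pre/post-processing.

// run_engine.cpp — load a serialized TensorRT engine and run one inference.
// Build: g++ -O2 run_engine.cpp -o run_engine -lnvinfer -lcudart
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <fstream>
#include <iostream>
#include <vector>

// TensorRT requires a logger; this one only surfaces warnings and errors.
class Logger : public nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) std::cerr << msg << "\n";
    }
} gLogger;

int main() {
    // 1. Read the engine file produced by trtexec.
    std::ifstream f("yolov8n.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());

    // 2. Deserialize and create an execution context.
    auto* runtime = nvinfer1::createInferRuntime(gLogger);
    auto* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size());
    auto* ctx     = engine->createExecutionContext();

    // 3. Bind device buffers by tensor name (names depend on how the ONNX was exported).
    void *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  1 * 3 * 640 * 640 * sizeof(float));   // typical YOLOv8n input shape
    cudaMalloc(&d_out, 1 * 84 * 8400 * sizeof(float));       // typical YOLOv8n raw output
    ctx->setTensorAddress("images",  d_in);    // assumed input tensor name
    ctx->setTensorAddress("output0", d_out);   // assumed output tensor name

    // 4. Enqueue the whole network; on Orin Nano every layer runs on the GPU.
    cudaStream_t stream; cudaStreamCreate(&stream);
    ctx->enqueueV3(stream);
    cudaStreamSynchronize(stream);
    std::cout << "inference done\n";           // post-processing (NMS etc.) would follow
    return 0;
}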
13 Live demo — YOLO-style inference on the GPU
Click anywhere in the scene to drop a target object. The page simulates what the Orin Nano does for every
frame: capture → preprocess → run a CNN → post-process → draw boxes. The "frame timeline" shows where each
millisecond is spent. This isn't a real YOLO model (we don't ship 6 MB of weights in an HTML file 😉), but the
pipeline timings and behaviour mirror real benchmarks.
Frame timeline
Add a few objects, then press run inference.
14 Sources & further reading
All numbers in this page are pulled from primary NVIDIA documentation. If a figure looks off, the source-of-truth wins.
Jetson Orin Nano Super Developer Kit datasheet — NVIDIA, December 2024
Jetson AGX Orin Series Technical Brief v1.2 — NVIDIA, July 2022 (the only public Orin SoC architecture document)
"Maximizing Deep Learning Performance on NVIDIA Jetson Orin with DLA" — NVIDIA Technical Blog, 2023
Cortex-A78AE Technical Reference Manual — Arm, for CPU pipeline details