Jetson Orin Nano Super — Edge AI Hardware Deep Dive
An interactive, research-level tour of NVIDIA's $249 generative-AI edge computer:
the Ampere GPU, 6-core ARM CPU, vision pipeline, and how a real camera-to-decision
inference loop runs end-to-end inside a 25 W power envelope.
Ampere GA10B · 8 nm Samsung · 8 GB LPDDR5 · 67 TOPS (INT8 sparse) · 7–25 W
The "Super" badge is a firmware unlock, not a new chip. NVIDIA raised GPU, CPU and memory clocks on
the existing Orin Nano module via JetPack 6.1, lifting AI throughput from 40 TOPS to 67 TOPS and memory
bandwidth from 68 GB/s to 102 GB/s — the same silicon, just permitted more power (15 W → 25 W).
Architecture: Ampere GA10B SoC
Process node: Samsung 8 nm (8LPP)
CUDA cores: 1024
Tensor cores: 32 (3rd-gen Ampere)
SM count: 8 (1 GPC × 4 TPCs × 2 SMs)
GPU clock: 1.02 GHz (was 625 MHz)
CPU: 6× Cortex-A78AE @ 1.7 GHz (was 1.5 GHz)
CPU cache: 1.5 MB L2 + 4 MB shared L3
Memory: 8 GB LPDDR5, 128-bit, 3200 MHz
Memory BW: 102 GB/s (was 68 GB/s)
AI perf: 67 TOPS (INT8 sparse)
Power envelope: 7–25 W (configurable)
What's missing vs Orin NX / AGX: the Orin Nano variant has the
two NVDLA v2 deep-learning accelerators and the PVA v2 vision DSP physically fused off.
All AI work runs on the GPU. There is also no NVENC — H.264/H.265 encode is done on the CPU cores.
NVDEC is present for decode.
Numbers from NVIDIA's Jetson Orin Nano Super Developer Kit datasheet (Dec 2024) and Jetson AGX Orin Series Technical Brief v1.2.
02 Developer Kit Carrier Board (100 × 79 mm)
The reference carrier board is what you actually unbox. The SoM (the small heat-sinked module on top) plugs into a
260-pin SO-DIMM-style connector. Everything else — USB, M.2 slots, camera connectors, the 40-pin GPIO header — is
routed off the SoM through this carrier PCB. Hover the labels to see what each port does.
The credit-card-sized SoM is the actual computer. It packs the Orin SoC, four LPDDR5 memory chips, a PMIC,
a small EEPROM with board-ID, and the 260-pin SO-DIMM connector that talks to the carrier. Storage is
external on Orin Nano — there is no on-module eMMC, so you boot from microSD or NVMe.
Why no on-module storage? Removing the eMMC saved board area and ~$10 BOM cost.
The trade-off: microSD boot is slow (UHS-I tops out around 100 MB/s), so most builders drop an NVMe SSD into
the M.2 Key M slot and reflash to boot from PCIe, a 10–40× I/O boost.
04 Orin SoC Floorplan (T234 / GA10B)
The Orin die is a heterogeneous SoC, not a monolithic GPU. Compute is split across an Ampere GPU island, an ARM CPU
cluster (12 cores on AGX, 8 on NX, 6 on Nano), an image signal processor for cameras, NVDEC for video, and a "Safety Cluster"
of Cortex-R52 cores for functional-safety housekeeping. Greyed-out blocks are present in the silicon but
fused off on the Nano variant — they ship enabled on Orin NX and AGX.
The Orin Nano GPU is a tiny slice of Ampere — just one Graphics Processing Cluster.
Compare that to a desktop RTX 3060 with 3 GPCs (28 SMs), or a full Orin AGX with 16 SMs across 2 GPCs.
Inside the single GPC, 4 Texture Processing Clusters each hold 2 SMs that share one PolyMorph engine.
How the math works out
Level | Count | Holds | Math
GPU | 1 | 1 GPC | —
GPC | 1 | 4 TPCs · 1 raster engine · 16 ROPs | —
TPC | 4 | 2 SMs · 1 PolyMorph engine | —
SM | 8 | 128 CUDA · 4 Tensor · 1 RT | —
CUDA cores | 1024 | FP32 / INT32 lanes | 8 SMs × 128
Tensor cores | 32 | 3rd-gen Ampere matrix engines | 8 SMs × 4
RT cores | 8 | 2nd-gen ray-traversal units | 8 SMs × 1
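A quick way to sanity-check those counts on the device itself (not part of the original interactive page, just a minimal CUDA sketch): query the runtime for the SM count and clock, then derive the CUDA-core total from Ampere's 128 lanes per SM, a multiplier the runtime does not report directly.

// check_gpu.cu — verify the SM / CUDA-core math from the table above.
// Build on the Jetson: nvcc -o check_gpu check_gpu.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop{};
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        std::printf("no CUDA device found\n");
        return 1;
    }
    const int cores_per_sm = 128;   // Ampere: 128 FP32 lanes per SM (hard-coded, not queryable)
    std::printf("GPU           : %s\n", prop.name);
    std::printf("SM count      : %d\n", prop.multiProcessorCount);                 // expect 8
    std::printf("CUDA cores    : %d\n", prop.multiProcessorCount * cores_per_sm);  // expect 1024
    std::printf("GPU clock     : %.2f GHz\n", prop.clockRate / 1e6);               // ~1.02 in MAXN
    std::printf("Mem bus width : %d-bit\n", prop.memoryBusWidth);                  // expect 128
    return 0;
}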
06 Inside one Ampere SM
Each Streaming Multiprocessor is split into 4 sub-cores (processing blocks). Each sub-core has
its own warp scheduler, dispatch unit, register file slice, and a dedicated Tensor Core. The L1 / shared
memory cache and the RT core are shared across the SM. Press ▶ run warp to watch a 32-thread warp
travel through one sub-core.
Ready. A warp = 32 threads executing the same instruction in lock-step (SIMT).
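To make that lock-step picture concrete, here is a small sketch (mine, not part of the demo): one warp sums 32 values with __shfl_down_sync, so data hops between the 32 lanes through the sub-core's register file without ever touching shared memory.

// warp_sum.cu — one 32-thread warp reduces 32 floats entirely in registers.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];                              // one value per lane (32 lanes = 1 warp)
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);      // lane i grabs lane i+offset's value
    if (threadIdx.x == 0) *out = v;                         // lane 0 now holds the warp-wide sum
}

int main() {
    float h_in[32], *d_in, *d_out, h_out = 0.f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.0f;            // expected sum: 32
    cudaMalloc((void**)&d_in, 32 * sizeof(float));
    cudaMalloc((void**)&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 32 * sizeof(float), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);                        // launch exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("warp sum = %f\n", h_out);                  // prints 32.000000
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}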
07 3rd-Gen Tensor Core (Ampere) · the heart of 67 TOPS
Each Tensor Core is a fixed-function matrix-multiply-accumulate engine. In one clock it computes
D = A · B + C on small matrix tiles. Ampere added two huge wins: BF16 / TF32
for training-friendly precision, and structured 2:4 sparsity — if half your weights are
zero in a regular pattern, throughput doubles for free. That sparsity trick is where the headline
"67 TOPS" comes from (dense INT8 is 33 TOPS).
Precision / throughput ladder (FP16 dense = 1×)
FP16 / BF16 dense: 1× (~17 TFLOPS)
TF32 (training): 0.5× (training-friendly)
INT8 dense: 2× (33 TOPS)
INT8 + 2:4 sparsity: 4× (67 TOPS, the headline number)
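For the curious, a hedged sketch of what driving a Tensor Core directly looks like in CUDA (TensorRT normally generates this for you): the warp-level wmma API performs the D = A · B + C operation described above on one 16×16×16 FP16 tile per mma_sync. Host-side data fill is omitted to keep the sketch short.

// wmma_tile.cu — one warp computes a 16x16x16 FP16 matrix tile on a Tensor Core.
// Build: nvcc -arch=sm_87 -o wmma_tile wmma_tile.cu   (sm_87 = Orin's Ampere GPU)
#include <cuda_runtime.h>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tile_mma(const half* A, const half* B, float* D) {
    // Fragments live in the warp's registers, spread across all 32 lanes.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::fill_fragment(acc, 0.0f);           // C = 0
    wmma::load_matrix_sync(a, A, 16);         // leading dimension 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(acc, a, b, acc);           // D = A·B + C in one Tensor Core operation
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

int main() {
    half *dA, *dB; float *dD;
    cudaMalloc((void**)&dA, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&dB, 16 * 16 * sizeof(half));
    cudaMalloc((void**)&dD, 16 * 16 * sizeof(float));
    // (fill dA/dB from the host as needed; omitted here)
    tile_mma<<<1, 32>>>(dA, dB, dD);          // a single warp drives the Tensor Core
    cudaDeviceSynchronize();
    return 0;
}

Real kernels tile a large GEMM over many warps and use INT8 fragments to hit the 33 / 67 TOPS paths.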
08 ARM CPU Cluster — 6× Cortex-A78AE
The CPU is a real out-of-order ARMv8.2 design — same family that ran the original Pixel 6 and many automotive
SoCs. The "AE" (Automotive Enhanced) variant adds split-lock ECC: any two cores can be paired into
lock-step mode for safety-critical code. On Orin Nano you get 6 cores in two clusters of 3, each cluster sharing
a 2 MB L3.
Pipeline: 13-stage, out-of-order superscalar
Issue width: 4-wide (up to 6 µops)
Frequency: 1.7 GHz (was 1.5 GHz pre-Super)
SIMD: NEON 128-bit (FP16/32/64, INT8 dot-product)
L1 I / L1 D: 64 KB / 64 KB per core
L2 per core: 256 KB, private, 8-way
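That "INT8 dot-product" entry is the ARMv8.2 dot-product extension (SDOT/UDOT), which the A78AE implements. A hedged host-side sketch, compiled for the CPU rather than the GPU: each vdotq_s32 folds sixteen INT8 multiplies into four INT32 accumulators in a single instruction.

// neon_dot.cpp — INT8 dot product on the Cortex-A78AE using the SDOT instruction.
// Build on the Jetson: g++ -O2 -march=armv8.2-a+dotprod -o neon_dot neon_dot.cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int32_t dot_int8(const int8_t* a, const int8_t* b, int n) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int i = 0; i + 16 <= n; i += 16) {
        int8x16_t va = vld1q_s8(a + i);
        int8x16_t vb = vld1q_s8(b + i);
        acc = vdotq_s32(acc, va, vb);     // 16 INT8 multiplies folded into 4 INT32 lanes
    }
    return vaddvq_s32(acc);               // horizontal add of the 4 accumulator lanes
}

int main() {
    int8_t a[64], b[64];
    for (int i = 0; i < 64; ++i) { a[i] = 1; b[i] = 2; }
    std::printf("dot = %d\n", dot_int8(a, b, 64));   // prints 128
    return 0;
}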
09 Unified Memory Hierarchy
Unlike a discrete GPU, Jetson has physically unified memory: the CPU and GPU literally share the
same 8 GB LPDDR5. There is no PCIe-style host-to-device copy — a tensor allocated by the CPU is already visible to
a CUDA kernel. This is huge for edge AI: zero-copy from the camera buffer straight into a YOLO inference call.
Press ▶ animate load to watch a single 64-byte cache line walk down the hierarchy.
Ready. The cache line will walk LPDDR5 → System Cache → GPU L2 → L1 → register.
Why unified memory matters at the edge: on a desktop RTX, every camera frame must be
DMA'd over PCIe before the GPU can see it (a ~6 MB 1080p RGB frame costs roughly half a millisecond on PCIe 3.0 x16, plus driver overhead, on every frame). On Jetson, the ISP writes the frame
straight into a buffer the GPU can already address: CUDA kernels can read it with
cudaHostGetDevicePointer() and 0 bytes are copied. The catch: GPU and CPU
are now competing for the same 102 GB/s, so memory-bound kernels feel each other.
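A hedged sketch of that zero-copy path (illustrative, not the production camera route): pinned, mapped host memory from cudaHostAlloc is visible to the GPU through cudaHostGetDevicePointer(), so a buffer the CPU fills can be read by a kernel with no copy. Real camera pipelines usually hand you an NVMM/libargus buffer instead of one you allocate yourself.

// zero_copy.cu — CPU writes a frame-sized buffer, GPU reads it with no memcpy.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void invert(unsigned char* px, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) px[i] = 255 - px[i];           // trivial "preprocessing" on the shared buffer
}

int main() {
    const int n = 1920 * 1080 * 3;            // one 1080p RGB frame, ~6 MB
    unsigned char *h_frame, *d_frame;

    cudaSetDeviceFlags(cudaDeviceMapHost);                     // allow mapped host memory
    cudaHostAlloc((void**)&h_frame, n, cudaHostAllocMapped);   // pinned + GPU-visible
    cudaHostGetDevicePointer((void**)&d_frame, h_frame, 0);    // same memory, device address

    for (int i = 0; i < n; ++i) h_frame[i] = 100;              // "camera" fills the buffer
    invert<<<(n + 255) / 256, 256>>>(d_frame, n);              // GPU works in place, 0 bytes copied
    cudaDeviceSynchronize();

    std::printf("first pixel after GPU pass: %d\n", h_frame[0]);   // prints 155
    cudaFreeHost(h_frame);
    return 0;
}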
10 The Edge AI Loop — Camera → ISP → GPU → Action
This is the canonical Jetson workload: a MIPI camera streams raw Bayer pixels into the SoC, the ISP cooks them into
RGB, the GPU runs an inference network, and the result either drives a display or fires an event over UART/CAN to a
robot's motor controller. Press ▶ run pipeline to watch one frame cross every block.
Ready. One frame will travel from sensor to action.
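A hedged sketch of that loop in code, assuming OpenCV built with GStreamer support (as shipped in JetPack) and the stock nvarguscamerasrc element for a MIPI CSI camera; run_inference() is a placeholder for the TensorRT engine built in section 12, and the exact pipeline caps may need tweaking for your sensor.

// capture_loop.cpp — camera → ISP (via nvarguscamerasrc) → frame buffer → act.
// Build: g++ -O2 capture_loop.cpp -o capture_loop $(pkg-config --cflags --libs opencv4)
#include <opencv2/opencv.hpp>
#include <cstdio>

int main() {
    // The ISP debayers and scales the raw sensor stream before it ever reaches us.
    const std::string pipeline =
        "nvarguscamerasrc ! video/x-raw(memory:NVMM),width=1920,height=1080,framerate=30/1 ! "
        "nvvidconv ! video/x-raw,format=BGRx ! videoconvert ! video/x-raw,format=BGR ! appsink";

    cv::VideoCapture cap(pipeline, cv::CAP_GSTREAMER);
    if (!cap.isOpened()) { std::fprintf(stderr, "camera open failed\n"); return 1; }

    cv::Mat frame;
    while (cap.read(frame)) {
        // run_inference() is a placeholder for the TensorRT engine call (section 12);
        // its result would normally fire an event over UART/CAN to the motor controller.
        // bool person = run_inference(frame);
        std::printf("got frame %dx%d\n", frame.cols, frame.rows);
    }
    return 0;
}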
11 Power Modes & Thermal Envelope
The "Super" software unlock didn't change the silicon — it just permitted the existing PMIC to run the
chip harder. NVIDIA exposes power presets through the nvpmodel tool. The throttling logic is
enforced by the BPMP (Boot & Power Management Processor), a tiny ARM Cortex-R5 that sits
inside the SoC and watches the INA3221 current sensor on the carrier board. Cross 25 W average and the BPMP
starts dropping GPU clocks within microseconds.
Power mode | TDP | CPU cores | CPU clk | GPU clk | Mem clk | AI perf
15 W (legacy) | 15 W | 4 of 6 | 1.5 GHz | 625 MHz | 2133 MHz | 40 TOPS
25 W "Super" (MAXN) | 25 W | 6 of 6 | 1.7 GHz | 1.02 GHz | 3200 MHz | 67 TOPS
7 W (silent) | 7 W | 2 of 6 | 1.1 GHz | 408 MHz | 2133 MHz | ~10 TOPS
Power-vs-throughput knobs
The same chip can be a fanless 7 W silent computer (nvpmodel -m 1) or a 25 W generative-AI
workstation. Throttle behaviour is graceful: clocks step down in 50 MHz increments rather than hard thermal shutdown.
Below a 60 °C junction temperature you get full clocks; between 60 °C and 100 °C the BPMP linearly de-rates the GPU.
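To watch that de-rating happen, a small sketch under the assumption that JetPack exposes the standard Linux thermal sysfs (it does, though zone names vary by release): print every thermal zone's temperature and correlate it with the clocks tegrastats reports.

// read_thermal.cpp — dump every thermal zone the kernel exposes (values are millidegrees C).
// Build: g++ -O2 read_thermal.cpp -o read_thermal
#include <fstream>
#include <iostream>
#include <string>

int main() {
    for (int zone = 0; ; ++zone) {
        const std::string base = "/sys/class/thermal/thermal_zone" + std::to_string(zone);
        std::ifstream type_f(base + "/type"), temp_f(base + "/temp");
        if (!type_f || !temp_f) break;             // no more zones
        std::string type; long milli_c = 0;
        std::getline(type_f, type);
        temp_f >> milli_c;
        // On Jetson one of these zones is the GPU junction sensor the BPMP watches.
        std::cout << type << ": " << milli_c / 1000.0 << " °C\n";
    }
    return 0;
}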
12 JetPack Software Stack
Hardware is half the story. JetPack is the BSP + libraries that make the silicon usable. Every
user-facing AI framework (PyTorch, TensorFlow, ONNX Runtime) eventually compiles to a TensorRT engine and dispatches
to CUDA kernels. Below is the layered view from your Python script all the way down to the SoC.
The TensorRT optimization step (where the magic happens)
$ trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine --int8
# (no --useDLACore flag: Orin Nano has no DLA, so every layer targets the GPU)
# TRT will:
# 1. Fuse Conv+BN+ReLU into a single kernel
# 2. Pick the fastest tactic per layer (Winograd? im2col? implicit GEMM?)
# 3. Calibrate INT8 scales using a small representative dataset
# 4. Layout-transform tensors to NHWC for tensor cores
# 5. Emit a serialized .engine file (~6 MB for YOLOv8n)
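Once trtexec has written yolov8n.engine, the runtime side is short. A hedged C++ sketch against the enqueueV3 API (available since TensorRT 8.5, including the versions JetPack 6.x ships); the tensor names and shapes below are typical for a YOLOv8n ONNX export and are assumptions, as is all the omitted pre/post-processing.

// run_engine.cpp — load a serialized TensorRT engine and run one inference.
// Build: g++ -O2 run_engine.cpp -o run_engine -lnvinfer -lcudart
#include <NvInfer.h>
#include <cuda_runtime.h>
#include <fstream>
#include <iostream>
#include <vector>

// TensorRT requires a logger; this one only surfaces warnings and errors.
class Logger : public nvinfer1::ILogger {
    void log(Severity sev, const char* msg) noexcept override {
        if (sev <= Severity::kWARNING) std::cerr << msg << "\n";
    }
} gLogger;

int main() {
    // 1. Read the engine file produced by trtexec.
    std::ifstream f("yolov8n.engine", std::ios::binary);
    std::vector<char> blob((std::istreambuf_iterator<char>(f)), std::istreambuf_iterator<char>());

    // 2. Deserialize and create an execution context.
    auto* runtime = nvinfer1::createInferRuntime(gLogger);
    auto* engine  = runtime->deserializeCudaEngine(blob.data(), blob.size());
    auto* ctx     = engine->createExecutionContext();

    // 3. Bind device buffers by tensor name (names depend on how the ONNX was exported).
    void *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  1 * 3 * 640 * 640 * sizeof(float));   // typical YOLOv8n input shape
    cudaMalloc(&d_out, 1 * 84 * 8400 * sizeof(float));       // typical YOLOv8n raw output
    ctx->setTensorAddress("images",  d_in);    // assumed input tensor name
    ctx->setTensorAddress("output0", d_out);   // assumed output tensor name

    // 4. Enqueue the whole network; on Orin Nano every layer runs on the GPU.
    cudaStream_t stream; cudaStreamCreate(&stream);
    ctx->enqueueV3(stream);
    cudaStreamSynchronize(stream);
    std::cout << "inference done\n";           // post-processing (NMS etc.) would follow
    return 0;
}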
13 Live demo — YOLO-style inference on the GPU
Click anywhere in the scene to drop a target object. The page simulates what the Orin Nano does for every
frame: capture → preprocess → run a CNN → post-process → draw boxes. The "frame timeline" shows where each
millisecond is spent. This isn't a real YOLO model (we don't ship 6 MB of weights in an HTML file 😉), but the
pipeline timings and behaviour mirror real benchmarks.
Frame timeline
Add a few objects, then press run inference.
14 Sources & further reading
All numbers in this page are pulled from primary NVIDIA documentation. If a figure looks off, the source-of-truth wins.
Jetson Orin Nano Super Developer Kit datasheet — NVIDIA, December 2024
Jetson AGX Orin Series Technical Brief v1.2 — NVIDIA, July 2022 (the only public Orin SoC architecture document)
"Maximizing Deep Learning Performance on NVIDIA Jetson Orin with DLA" — NVIDIA Technical Blog, 2023
Cortex-A78AE Technical Reference Manual — Arm, for CPU pipeline details