Jetson Orin Nano Super — Edge AI Hardware Deep Dive

An interactive, research-level tour of NVIDIA's $249 generative-AI edge computer: the Ampere GPU, 6-core ARM CPU, vision pipeline, and how a real camera-to-decision inference loop runs end-to-end inside a 25 W power envelope.

Ampere GA10B · Samsung 8 nm · 8 GB LPDDR5 · 67 TOPS (INT8 sparse) · 7–25 W

01 At a Glance — Jetson Orin Nano 8 GB Super

The "Super" badge is a firmware unlock, not a new chip. NVIDIA raised GPU, CPU and memory clocks on the existing Orin Nano module via JetPack 6.1, lifting AI throughput from 40 TOPS to 67 TOPS and memory bandwidth from 68 GB/s to 102 GB/s — the same silicon, just permitted more power (15 W → 25 W).

Architecture: Ampere GA10B SoC
Process node: Samsung 8 nm (8LPP)
CUDA cores: 1024
Tensor cores: 32 (3rd-gen Ampere)
SM count: 8 (1 GPC × 4 TPCs × 2 SMs)
GPU clock: 1.02 GHz (was 625 MHz)
CPU: 6× Cortex-A78AE @ 1.7 GHz (was 1.5)
CPU cache: 1.5 MB L2 + 4 MB shared L3
Memory: 8 GB LPDDR5, 128-bit, 6400 MT/s
Memory BW: 102 GB/s (was 68 GB/s)
AI perf: 67 TOPS (INT8 sparse)
Power envelope: 7–25 W (configurable)
What's missing vs Orin NX / AGX: the Orin Nano variant has the two NVDLA v2 deep-learning accelerators and the PVA v2 vision DSP physically fused off. All AI work runs on the GPU. There is also no NVENC — H.264/H.265 encode is done on the CPU cores. NVDEC is present for decode.

Numbers from NVIDIA's Jetson Orin Nano Super Developer Kit datasheet (Dec 2024) and Jetson AGX Orin Series Technical Brief v1.2.

02 Developer Kit Carrier Board (100 × 79 mm)

The reference carrier board is what you actually unbox. The SoM (the small heat-sinked module on top) plugs into a 260-pin SO-DIMM-style connector. Everything else — USB, M.2 slots, camera connectors, the 40-pin GPIO header — is routed off the SoM through this carrier PCB.

Reference carrier board (P3768, NVIDIA 945-13766-0005) · 100 × 79 mm · 4-layer PCB:
- Jetson Orin Nano 8 GB module (SoM): 69.6 × 45 mm, 260-pin SO-DIMM connector, heatsink + fan on top
- 2× MIPI CSI-2 camera connectors: 22-pin, up to 4 lanes each
- DC-IN 19 V / 65 W · power and reset buttons
- 40-pin expansion header: I²C · SPI · UART · I²S · GPIO · 3.3 V
- M.2 Key M (×4 PCIe Gen3): NVMe SSD, ~3.9 GB/s
- M.2 Key M (×2 PCIe Gen3): second NVMe slot
- M.2 Key E (×1 PCIe Gen3): Wi-Fi / BT card slot (empty by default)
- microSD UHS-I (boot media)
- 4× USB 3.2 Gen2 Type-A (10 Gbps each) · USB-C UFP for flashing
- DP 1.4: 8K@30 / 4K@120
- Gigabit Ethernet (RJ-45), WoL supported
- FAN PWM (4-pin) · JTAG / debug · camera-control I²C / UART · automation pins · PCIe diag LEDs · power monitor
Power tree: 19 V DC → buck regulators (5 V, 3.3 V, 1.8 V, 1.1 V, 0.85 V) → SoM via the SO-DIMM connector. A PMIC plus an INA3221 3-channel current sensor drives power throttling at the 25 W TDP.
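Everything on the 40-pin header is scriptable from userspace through NVIDIA's Jetson.GPIO package (preinstalled with JetPack). A minimal sketch that blinks whatever is wired to physical pin 7; the pin choice, and that it is unclaimed in your device tree, are assumptions:

# Blink physical pin 7 on the 40-pin header (Jetson.GPIO ships with JetPack)
import time
import Jetson.GPIO as GPIO

GPIO.setmode(GPIO.BOARD)                     # number pins as on the physical header
GPIO.setup(7, GPIO.OUT, initial=GPIO.LOW)    # assumes pin 7 is free as plain GPIO
try:
    for _ in range(5):
        GPIO.output(7, GPIO.HIGH)            # drive 3.3 V
        time.sleep(0.5)
        GPIO.output(7, GPIO.LOW)
        time.sleep(0.5)
finally:
    GPIO.cleanup()                           # release the pin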

03 SoM — System-on-Module Cross-Section

The credit-card-sized SoM is the actual computer. It packs the Orin SoC, four LPDDR5 memory chips, a PMIC, a small EEPROM with board-ID, and the 260-pin SO-DIMM connector that talks to the carrier. Storage is external on Orin Nano — there is no on-module eMMC, so you boot from microSD or NVMe.

Jetson Orin Nano SoM, 69.6 × 45 mm, top view:
- NVIDIA Orin SoC: T234 (GA10B) · FCBGA · Samsung 8 nm 8LPP · ~17 B transistors · ~25 × 25 mm package
- 4× LPDDR5 chips: 2 GB each, 32-bit, 6400 MT/s · total 8 GB on a 128-bit bus at 102 GB/s
- PMIC, board-ID EEPROM, VR / DC-DC regulators
- 260-pin SO-DIMM-style edge connector to the carrier board
- UPHY: 3× PCIe Gen3 controllers · 4× USB 3.2 · 1× GbE · 4× MGBE · 16 MIPI CSI lanes (8 active on Nano) · DP/HDMI
Why no on-module storage? Removing the eMMC saved board area and ~$10 BOM cost. The trade-off: a microSD boot is slow (UHS-I peaks at ~100 MB/s), so most builders pop an NVMe SSD into the M.2 Key M slot and reflash to boot from PCIe — a 10–40× I/O boost.

04 Orin SoC Floorplan (T234 / GA10B)

The Orin die is a heterogeneous SoC, not a monolithic GPU. Compute is split across an Ampere GPU island, a 6-core ARM cluster (12 in AGX, 8 in NX, 6 in Nano), an image signal processor for cameras, NVDEC for video, and a "Safety Cluster" of Cortex-R52 cores for functional-safety housekeeping. Greyed-out blocks are present in the silicon but fused off on the Nano variant — they ship enabled on Orin NX and AGX.

Orin SoC top-level blocks (topological; block sizes approximate):
- Ampere GPU island (GA10B): 1 GPC · 4 TPCs · 8 SMs · 1024 CUDA · 32 Tensor · 8 RT, with a 2 MB L2 cache shared by all SMs
- CPU complex: 6× Cortex-A78AE (ARMv8.2-A, 64-bit) at 1.7 GHz in 2 clusters of 3 · 256 KB L2 per core · 2 MB L3 per cluster (4 MB total) · DSU-110 (DynamIQ) · "AE" = Automotive Enhanced, split-lock ECC
- SCF (System Coherence Fabric): 4 MB system cache · 4× LPDDR5 controllers (32-bit each)
- ISP: 1.85 Gpix/s · HDR · 3A · denoise
- VIC v4.2 2D engine: scale · blend · LDC
- NVDEC: H.265 · AV1 · 1× 4K60; NVENC fused off (CPU encode)
- DLA 0 / DLA 1 (NVDLA v2): fused off · PVA v2 vision DSP: fused off
- UPHY: PCIe Gen3 ×4/×2 · USB 3.2 · GbE
- CSI / DSI: 8 MIPI CSI-2 lanes · DP 1.4
- Safety cluster: 2× Cortex-R52 (lock-step) · SE / crypto: AES, SHA, TrustZone root-of-trust · BPMP power microcontroller
- Low-speed I/O: 4× UART · 3× SPI · 8× I²C · 4× DAP (I²S) · CAN · GPIO · QSPI · SDMMC

05 GPU Hierarchy — 1 GPC · 4 TPCs · 8 SMs

The Orin Nano GPU is a tiny slice of Ampere — just one Graphics Processing Cluster. Compare that to a desktop RTX 3060 with 3 GPCs (28 SMs), or a full Orin AGX with 16 SMs across 2 GPCs. Inside the single GPC, 4 Texture Processing Clusters each hold 2 SMs that share one PolyMorph engine.

One GPC = raster engine + 4 TPCs (= 8 SMs) + ROPs. The raster engine handles triangle setup, edge/Z, tile binning, and hierarchical-Z, feeding shaded pixels into the SMs. Two ROP partitions of 8 ROPs each (blend, depth-test, color write-out) give 16 ROPs total in the single GPC.

How the math works out

Level          Count   Holds                                 Math
GPU            1       1 GPC
GPC            1       4 TPCs · 1 raster engine · 16 ROPs
TPC            4       2 SMs · 1 PolyMorph engine
SM             8       128 CUDA · 4 Tensor · 1 RT
CUDA cores     1024    FP32 / INT32 lanes                    8 SMs × 128
Tensor cores   32      3rd-gen Ampere matrix engines         8 SMs × 4
RT cores       8       2nd-gen ray-traversal units           8 SMs × 1

06 Inside one Ampere SM

Each Streaming Multiprocessor is split into 4 sub-cores (processing blocks). Each sub-core has its own warp scheduler, dispatch unit, register file slice, and a dedicated Tensor Core. The L1 / shared-memory cache and the RT core are shared across the SM.

Ampere SM: 128 CUDA cores · 4 Tensor Cores · 1 RT Core · 192 KB L1 / shared.
- L1 instruction cache + warp dispatch: issues 32-thread warps to the four sub-cores
- 2nd-gen RT Core (one per SM, shared): BVH traversal · ray-triangle intersect · concurrent ray + compute · 2× faster than Turing (1 RT core × 8 SMs = 8 total)
- L1 data + shared memory, 192 KB unified: configurable 0/8/16/32/64/100 KB as shared memory with the rest as L1$ · ~32-cycle hit latency · 16 banks · async copy from L2
- SM ↔ L2 cache (2 MB) ↔ memory crossbar ↔ LPDDR5 controllers
A warp = 32 threads executing the same instruction in lock-step (SIMT).
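The SIMT model is easy to see from software. A tiny sketch with Numba's CUDA backend (assumes a CUDA-enabled Numba install; this is not part of the page's demo) in which all 32 lanes of a warp compute the same warp index:

# Each thread records its warp index: threads 0-31 share warp 0, 32-63 warp 1, ...
import numpy as np
from numba import cuda

@cuda.jit
def warp_id(out):
    i = cuda.grid(1)            # global thread index
    if i < out.size:
        out[i] = i // 32        # 32 threads per warp, executing in lock-step

out = cuda.device_array(256, dtype=np.int32)
warp_id[2, 128](out)            # 2 blocks × 128 threads = 8 warps
print(out.copy_to_host()[:40])  # 32 zeros, then the 1s of the second warp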

07 3rd-Gen Tensor Core (Ampere)  ·  the heart of 67 TOPS

Each Tensor Core is a fixed-function matrix-multiply-accumulate engine. In one clock it computes D = A · B + C on small matrix tiles. Ampere added two huge wins: BF16 / TF32 for training-friendly precision, and structured 2:4 sparsity — if half your weights are zero in a regular pattern, throughput doubles for free. That sparsity trick is where the headline "67 TOPS" comes from (dense INT8 is 33 TOPS).

D = A · B + C, one MMA tile per clock: A (INT8 activations) × B (2:4-sparse INT8 weights), accumulated into C/D (INT32). Per clock: 512 dense INT8 MACs per Tensor Core × 4 per SM × 8 SMs = 16,384 MACs; × 1.02 GHz × 2 ops/MAC ≈ 33 TOPS dense, × 2 for 2:4 sparsity ≈ 67 TOPS. Supports FP16, BF16, TF32, INT8, INT4 · no FP8/FP4 (those landed on Hopper/Blackwell).
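The headline number is reproducible with a few lines of arithmetic. One caveat: the 512 dense INT8 MACs per Tensor Core per clock used below is inferred from the published dense/sparse TOPS, not quoted from a datasheet:

# Back-of-envelope check of the 33 / 67 TOPS figures
TENSOR_CORES = 32          # 4 per SM × 8 SMs
CLOCK_HZ     = 1.02e9      # Super-mode GPU clock
MACS_PER_TC  = 512         # dense INT8 MACs per Tensor Core per clock (inferred)
OPS_PER_MAC  = 2           # a MAC counts as multiply + add

dense  = TENSOR_CORES * CLOCK_HZ * MACS_PER_TC * OPS_PER_MAC / 1e12
sparse = dense * 2         # 2:4 structured sparsity doubles throughput
fp16   = dense / 2         # FP16 runs at half the INT8 rate

print(f"INT8 dense : {dense:.1f} TOPS")     # ≈ 33.4
print(f"INT8 sparse: {sparse:.1f} TOPS")    # ≈ 66.8 → the headline 67
print(f"FP16 dense : {fp16:.1f} TFLOPS")    # ≈ 16.7 → the ~17 in the ladder below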

Precision / throughput ladder (FP16 dense = 1×)

FP16 / BF16 dense: ~17 TFLOPS (1×)
TF32 (training-friendly): 0.5×
INT8 dense: 33 TOPS (2×)
INT8 + 2:4 sparsity: 67 TOPS (4×) · the headline

08 ARM CPU Cluster — 6× Cortex-A78AE

The CPU is a real out-of-order ARMv8.2 design — the same Cortex-A78 family found in 2021 flagship phone SoCs and many automotive chips. The "AE" (Automotive Enhanced) variant adds split-lock ECC: any two cores can be paired into lock-step mode for safety-critical code. On Orin Nano you get 6 cores in two clusters of 3, each cluster sharing a 2 MB L3.

Inside one Cortex-A78AE core (4-wide OoO, ARMv8.2):
- Front-end: 64 KB L1 I-cache · perceptron-style branch predictor · 6-wide decode · 160-entry reorder buffer
- Rename / dispatch / issue queues (4-wide issue), feeding: 3× integer ALU (add/shift/logic, 1-cycle latency) · 2× integer MUL/DIV (3-cycle latency) · 2× ASIMD/FP (128-bit NEON, FP16/32/64 FMA, crypto extensions) · 2× load + 1× store to the L1 D-cache (4-cycle hit latency) · branch unit (resolves 1 branch/cycle, updates the predictor)
- 64 KB L1 D-cache (per core, 4-way) → 256 KB L2 (per core, 8-way) → 2 MB L3 (cluster-shared) → DSU-110 → SCF → LPDDR5
Pipeline: 13-stage, OoO superscalar
Issue width: 4-wide (up to 6 µops)
Frequency: 1.7 GHz (was 1.5 pre-Super)
SIMD: NEON 128-bit · FP16/32/64 · INT8 dot-product
L1 I / L1 D: 64 / 64 KB per core
L2 / core: 256 KB (private, 8-way)
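Because each cluster has its own 2 MB L3, it can pay to keep a latency-critical process on a single cluster so its working set stays in one L3. A sketch using Linux scheduler affinity; that cluster A maps to CPUs 0–2 is an assumption to verify on your board:

# Pin this process to one A78AE cluster (assumed: cluster A = CPUs 0-2)
import os

CLUSTER_A = {0, 1, 2}
os.sched_setaffinity(0, CLUSTER_A)             # 0 = the calling process
print("running on CPUs:", os.sched_getaffinity(0))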

09 Unified Memory Hierarchy

Unlike a discrete GPU, Jetson has physically unified memory: the CPU and GPU literally share the same 8 GB LPDDR5. There is no PCIe-style host-to-device copy — a tensor allocated by the CPU is already visible to a CUDA kernel. This is huge for edge AI: zero-copy from the camera buffer straight into a YOLO inference call.

Memory hierarchy, unified across CPU and GPU, spanning ~3 orders of magnitude in latency:
- LPDDR5 (off-package, 4 chips): 8 GB · 102 GB/s · ~120 ns · 6400 MT/s, 128-bit, a single physical pool partitioned by the MMU
- System cache (SCF): 4 MB · CPU+GPU coherent · ~80 cycles · snoops between the A78AE clusters and the GPU L2
- GPU L2 cache: 2 MB, shared by all 8 SMs · ~200 cycles
- GPU L1 / shared memory: 192 KB per SM (configurable split) · ~30 cycles; CPU side: 256 KB L2 per core, 2 MB L3 per cluster
- Register file (GPU): 256 KB per SM; CPU L1 D: 64 KB per core · ~1 cycle
- Operand collectors / FU input latches: 0 cycles
Why unified memory matters at the edge: on a desktop RTX, every camera frame must be DMA'd over PCIe before the GPU can see it (on the order of a millisecond for a 1080p frame, plus staging and sync overhead). On Jetson, the ISP writes the frame straight into a buffer the GPU can already address — CUDA kernels can read it with cudaHostGetDevicePointer() and 0 bytes are copied. The catch: GPU and CPU are now competing for the same 102 GB/s, so memory-bound kernels feel each other.
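In code, zero-copy is just a mapped allocation. A sketch using Numba's mapped arrays, the same mechanism cudaHostGetDevicePointer() exposes in C (assumes a CUDA-enabled Numba install):

# The kernel reads and writes host memory in place: no cudaMemcpy anywhere
import numpy as np
from numba import cuda

@cuda.jit
def scale(buf, k):
    i = cuda.grid(1)
    if i < buf.size:
        buf[i] *= k                     # touches the CPU's buffer directly

frame = cuda.mapped_array(1920 * 1080, dtype=np.float32)  # pinned + GPU-visible
frame[:] = 1.0                          # CPU fills the "camera frame"...
scale[8100, 256](frame, 0.5)            # ...GPU sees it with zero bytes copied
cuda.synchronize()
print(frame[:4])                        # [0.5 0.5 0.5 0.5]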

10 The Edge AI Loop — Camera → ISP → GPU → Action

This is the canonical Jetson workload: a MIPI camera streams raw Bayer pixels into the SoC, the ISP cooks them into RGB, the GPU runs an inference network, and the result either drives a display or fires an event over UART/CAN to a robot's motor controller.

End-to-end frame latency budget, target 33 ms (30 fps):
- Camera (IMX477 / OV5693): 1080p 30 fps Bayer raw over MIPI CSI-2 · ~2 ms
- ISP: demosaic, 3A, white balance, denoise, tone-map, LDC → NV12 in unified RAM (zero-copy) · ~1 ms
- VIC + GPU preprocess: resize to 640², color-space convert, normalize → FP16/INT8 tensor · ~3 ms
- TensorRT engine on a CUDA stream: YOLOv8n, 640×640, INT8-quantized · ~14 ms/frame (~70 fps possible)
- Post-process + action: NMS + boxes → DisplayPort, or UART/CAN → motors · ~2 ms
Every block reads and writes the shared LPDDR5; zero-copy is possible because of the unified address space. Total: 2 + 1 + 3 + 14 + 2 ≈ 22 ms/frame ≈ 45 fps sustained at 1080p with YOLOv8n INT8.
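The capture and ISP stages of that budget are usually expressed as a GStreamer string handed to OpenCV. A sketch; it assumes a 1080p30-capable CSI sensor and an OpenCV build with GStreamer enabled:

# Pull ISP-processed 1080p30 frames from the CSI camera into OpenCV
import cv2

pipeline = (
    "nvarguscamerasrc ! "                                  # sensor → ISP (libargus)
    "video/x-raw(memory:NVMM), width=1920, height=1080, framerate=30/1 ! "
    "nvvidconv ! video/x-raw, format=BGRx ! "              # VIC color conversion
    "videoconvert ! video/x-raw, format=BGR ! "
    "appsink drop=1"                                       # drop stale frames
)
cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
ok, frame = cap.read()                  # 1080 × 1920 × 3 BGR ndarray
print(ok, frame.shape if ok else None)
cap.release()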

11 Power Modes & Thermal Envelope

The "Super" software unlock didn't change the silicon — it just permitted the existing PMIC to run the chip harder. NVIDIA exposes power presets through the nvpmodel tool. The throttling logic is enforced by the BPMP (Boot & Power Management Processor), a tiny ARM Cortex-R5 that sits inside the SoC and watches the INA3221 current sensor on the carrier board. Cross 25 W average and the BPMP starts dropping GPU clocks within microseconds.

Power mode             TDP    CPU cores   CPU clk   GPU clk    Mem clk     AI perf
15 W (legacy)          15 W   4 of 6      1.5 GHz   625 MHz    4266 MT/s   ~40 TOPS
25 W "Super" (MAXN)    25 W   6 of 6      1.7 GHz   1.02 GHz   6400 MT/s   67 TOPS
7 W (silent)           7 W    2 of 6      1.1 GHz   408 MHz    4266 MT/s   ~10 TOPS

Power-vs-throughput knobs

The same chip can be a fanless 7 W silent computer (nvpmodel -m 1) or a 25 W generative-AI workstation. Throttling is graceful: clocks step down in 50 MHz increments rather than hard-stopping in a thermal shutdown. Below 60 °C junction temperature you get full clocks; from 60–100 °C the BPMP linearly de-rates the GPU.
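Both knobs are scriptable with the stock nvpmodel and tegrastats tools. A sketch; the mode-ID-to-preset mapping varies across JetPack releases, so treat -m 0 = MAXN as an assumption and check nvpmodel -q first:

# Switch preset, then sample power/clock/temperature telemetry
import subprocess

subprocess.run(["sudo", "nvpmodel", "-m", "0"], check=True)   # assumed: 0 = MAXN
stats = subprocess.Popen(["tegrastats", "--interval", "1000"],
                         stdout=subprocess.PIPE, text=True)
for _ in range(5):                          # five 1-second samples
    print(stats.stdout.readline().strip())  # RAM, GPU/CPU loads, temps, watts
stats.terminate()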

GPU clock vs junction temperature (typical): full 1020 MHz in the SAFE zone up to ~60 °C; the BPMP de-rates linearly toward ~510 MHz across the 60–100 °C THROTTLE zone; emergency power-off at 105 °C.

12 JetPack Software Stack

Hardware is half the story. JetPack is the BSP + libraries that make the silicon usable. Every user-facing AI framework (PyTorch, TensorFlow, ONNX Runtime) eventually compiles to a TensorRT engine and dispatches to CUDA kernels. Below is the layered view from your Python script all the way down to the SoC.

Top-down: user app → JetPack → CUDA driver → silicon.
- User applications: Python · ROS 2 · GStreamer apps · llama.cpp · ollama · custom robotics code
- AI / vision frameworks: PyTorch · TensorFlow · ONNX Runtime · Hugging Face Transformers · Ultralytics YOLO · OpenCV. Verticals: NVIDIA Isaac (robotics) · DeepStream (vision) · Riva (speech) · Holoscan (sensors)
- NVIDIA accelerator libraries: TensorRT (graph optimizer + INT8 quantizer) · cuDNN · cuBLAS · NCCL · VPI · NPP · cuFFT. Camera: libargus · L4T Multimedia API · GStreamer plugins (nvarguscamerasrc, nvinfer, ...)
- CUDA runtime & driver (CUDA 12.x in JetPack 6): PTX assembler · GPU memory manager · stream/event scheduler · UVM (unified virtual memory). A single ELF kernel module talks directly to the GPU MMIO region via /dev/nvgpu
- L4T / Jetson Linux BSP: Ubuntu 22.04 · Linux 5.15 kernel (RT-patched) · Tegra device tree · NVIDIA out-of-tree drivers. Boot chain: UEFI bootloader · OP-TEE (TrustZone) · BPMP firmware · cboot fallback
- Orin SoC silicon: Ampere GPU + 6× A78AE + ISP/VIC/NVDEC + UPHY

The TensorRT optimization step (where the magic happens)

$ trtexec --onnx=yolov8n.onnx --saveEngine=yolov8n.engine \
          --int8 --useDLACore=-1  # -1 means GPU only — Orin Nano has no DLA
# TRT will:
#   1. Fuse Conv+BN+ReLU into a single kernel
#   2. Pick the fastest tactic per layer (Winograd? im2col? implicit GEMM?)
#   3. Calibrate INT8 scales using a small representative dataset
#   4. Layout-transform tensors to NHWC for tensor cores
#   5. Emit a serialized .engine file (~6 MB for YOLOv8n)
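From Python, the serialized engine loads in a handful of calls. A minimal sketch against the TensorRT Python API shipped in JetPack 6; the named-tensor inspection needs TRT 8.5+, and buffer binding is omitted because it differs between TRT 8 and TRT 10:

# Deserialize the engine built by trtexec above and list its I/O tensors
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("yolov8n.engine", "rb") as f, trt.Runtime(logger) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())

context = engine.create_execution_context()
for i in range(engine.num_io_tensors):
    name = engine.get_tensor_name(i)        # e.g. "images" / "output0" from ONNX
    print(name, engine.get_tensor_shape(name), engine.get_tensor_dtype(name))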

13 Live demo — YOLO-style inference on the GPU

Click anywhere in the scene to drop a target object. The page simulates what the Orin Nano does for every frame: capture → preprocess → run a CNN → post-process → draw boxes. The "frame timeline" shows where each millisecond is spent. This isn't a real YOLO model (we don't ship 6 MB of weights in an HTML file 😉), but the pipeline timings and behaviour mirror real benchmarks.

Demo layout: a camera view (click to add objects, then run inference) and a frame timeline plotting each stage in milliseconds against the 33 ms / 30 fps target.

14 Sources & further reading

All numbers in this page are pulled from primary NVIDIA documentation. If a figure looks off, the source-of-truth wins.