GeForce RTX 5080 — Hardware Architecture Deep Dive

An interactive, research-level tour of NVIDIA's high-end Blackwell GPU (GB203): hardware blocks, on-chip interconnect, memory hierarchy, and an animated walkthrough of how a CIFAR-10 CNN is trained and served on the silicon.


01 At a Glance — GB203 / RTX 5080

The RTX 5080 is built on TSMC's custom 4N node and uses NVIDIA's GB203 Blackwell GPU for the high-end graphics segment. It keeps the same Blackwell SM, 5th-generation Tensor Core, 4th-generation RT Core, DLSS 4, and neural-rendering feature set as the rest of the family, but scales the chip to 7 GPCs, 42 TPCs, 84 SMs, and a 256-bit GDDR7 memory subsystem.

| Specification | Value |
|---|---|
| Architecture | Blackwell (GB203) |
| Process | TSMC 4N (custom 5 nm class) |
| Transistors | 45.6 billion |
| Die size | 378 mm² |
| CUDA cores | 10,752 |
| SMs / Tensor Cores / RT Cores | 84 / 336 / 84 |
| L2 cache | 64 MB |
| Memory | 16 GB GDDR7 @ 30 Gbps |
| Bus width / bandwidth | 256-bit / 960 GB/s |
| Boost clock | 2.62 GHz |
| AI compute (FP4, sparse) | 1,801 TOPS |
| TGP | 360 W |

Numbers updated for the RTX 5080 / GB203 from NVIDIA's RTX Blackwell whitepaper and RTX 5080 specifications (see Sources).

02 Die Floorplan & Top-Level Chip Layout

The GB203 die is a monolithic high-end Blackwell GPU with 7 Graphics Processing Clusters (GPCs) wrapped in the memory/L2/IO ring. The schematic below is a topological rendering — block sizes are approximate, but positions mirror NVIDIA's published floorplan.

[Die floorplan schematic — GB203, 378 mm², 45.6 B transistors. Around the edge: the PCIe 5.0 ×16 host interface (~64 GB/s each way), the GigaThread Engine global scheduler, the AMP (AI Management Processor), the Display Engine (HDMI 2.1b, 3× DP 2.1b), and 2× NVENC (9th gen) / 2× NVDEC (6th gen) with 4:2:2 support; the 64 MB L2 cache sits between the GPC rows.]

GB203 contains 7 GPCs × 6 TPCs × 2 SMs = 84 SMs. The RTX 5080 uses the full GB203 configuration with all 84 SMs enabled.

03 GPC → TPC → SM Hierarchy

A Graphics Processing Cluster (GPC) is the largest self-contained execution tile on the chip. Each GPC houses a Raster Engine, 2 ROP partitions (16 ROPs total), and 6 Texture Processing Clusters. Each TPC is a pair of SMs sharing a PolyMorph Engine. Click any GPC to zoom.

[GPC schematic — one GPC = Raster Engine + 6 TPCs + 2 ROP partitions. The Raster Engine handles triangle setup, edge/Z, tile binning, and hierarchical-Z, feeding pixels into the SMs for shading. Each ROP partition holds 8 ROPs for blend, depth test, and color write-out: 16 ROPs per GPC × 7 GPCs = 112 ROPs.]

04 Inside a Blackwell SM (Streaming Multiprocessor)

Each SM is split into 4 sub-cores (processing blocks). Each sub-core has its own warp scheduler, register file, and a slice of the SM's compute: CUDA cores (unified INT32/FP32), tensor core, and load/store units. A single L1 / shared memory cache and the RT core are shared across the SM. Press ▶ run warp to animate a 32-thread warp flowing through one sub-core.

[Blackwell SM schematic — 128 CUDA cores, 4 Tensor Cores, 1 RT Core, 128 KB L1/shared. At the top sit the L1 instruction cache and warp dispatch. Each of the four sub-cores pairs a warp scheduler/dispatch unit and register file with INT32/FP32 CUDA cores and one 5th-gen Tensor Core. Shared across the SM: the 4th-gen RT Core (BVH traversal, ray-triangle and ray-box intersection, Mega Geometry, 2× the triangle-intersect rate of Ada) and the 128 KB L1 data/shared memory with a configurable split, 4 texture units, and the TMA (Tensor Memory Accelerator); the L1 also serves as backing store for warp register spills. In the animation, a 32-thread warp is scheduled, operands are fetched from the register file, and a fused MAC executes across all 32 lanes of one sub-core.]
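To make one sub-core's warp concrete in code, here is a minimal CUDA kernel (all names are illustrative, not from NVIDIA's materials): it launches a single 32-thread block, i.e. exactly one warp, and issues one fused multiply-add per lane — the same operation the animation shows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One warp (32 threads) per block: every lane runs the same FMA in lockstep,
// mirroring the "run warp" animation above.
__global__ void warp_fma(const float* a, const float* b, float* d, float c) {
    int lane = threadIdx.x;              // lane ID 0..31 within the warp
    d[lane] = fmaf(a[lane], b[lane], c); // one fused multiply-add per lane
}

int main() {
    const int LANES = 32;
    float ha[LANES], hb[LANES], hd[LANES];
    for (int i = 0; i < LANES; ++i) { ha[i] = float(i); hb[i] = 2.0f; }

    float *da, *db, *dd;
    cudaMalloc(&da, LANES * sizeof(float));
    cudaMalloc(&db, LANES * sizeof(float));
    cudaMalloc(&dd, LANES * sizeof(float));
    cudaMemcpy(da, ha, LANES * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, LANES * sizeof(float), cudaMemcpyHostToDevice);

    warp_fma<<<1, LANES>>>(da, db, dd, 1.0f);  // 1 block = 1 warp = 1 sub-core
    cudaMemcpy(hd, dd, LANES * sizeof(float), cudaMemcpyDeviceToHost);
    printf("lane 5: %.1f\n", hd[5]);           // 5*2+1 = 11.0
    cudaFree(da); cudaFree(db); cudaFree(dd);
}
```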

Key differences vs Ada (RTX 40-series) SM

| Feature | Ada SM | Blackwell SM |
|---|---|---|
| INT32/FP32 datapath | Split: 1× FP32, 1× FP32+INT32 | Unified — both datapaths can do INT32 or FP32 per cycle |
| Tensor core generation | 4th gen — FP16/BF16/FP8 | 5th gen — adds FP6, FP4, NVFP4, MXFP formats |
| RT core generation | 3rd gen | 4th gen — 2× triangle-intersect, Mega Geometry / LSS & subdivision-surface primitives |
| Neural shader support | No | Yes — shaders can inline small MLPs via Tensor Cores |

05 5th-Generation Tensor Core — FP4 Matrix Multiply

The Tensor Core is a systolic-array-style matrix engine. In each cycle it computes D = A·B + C over a tile of operands. On Blackwell, a single Tensor Core tile can operate on FP4 (NVFP4 / MXFP4), FP6, FP8, FP16, BF16, TF32, or FP64 — doubling throughput every time you halve the bit-width. Press ▶ step MAC to watch a 4×4 tile MAC animation.

[Tensor Core schematic — D = A·B + C over a 4×4 tile in a single cycle: A (activations, FP4) and B (weights, FP4) feed 16 multiply-accumulate lanes that sum into the C accumulator (FP16) to produce D. Blackwell adds the tcgen05.mma PTX instruction — single-thread MMA issue that removes the warp-wide barrier, with operands staged from TMEM — and micro-scaling (NVFP4), which preserves accuracy by quantizing each 16-element block with a shared scale factor.]
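The FP4 paths above are reached through the new tcgen05 PTX instructions, which ordinary CUDA C++ does not yet expose portably; the same D = A·B + C tile pattern can, however, be sketched with the long-standing WMMA intrinsics at FP16. A minimal one-warp sketch (16×16×16 tile; the API is standard nvcuda::wmma, but the kernel itself is illustrative):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A·B + C on a 16x16x16 tile — the WMMA analogue of
// the 4x4 tile in the diagram. FP16 inputs, FP32 accumulate.
__global__ void tile_mma(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a, A, 16);                         // stage A tile
    wmma::load_matrix_sync(b, B, 16);                         // stage B tile
    wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);  // preload C
    wmma::mma_sync(acc, a, b, acc);                           // acc = A·B + acc
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major); // write D
}
// Launch with exactly one warp: tile_mma<<<1, 32>>>(dA, dB, dC, dD);
```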

Precision/throughput ladder (relative to FP16 = 1×)

- FP16 / BF16 — 1× (baseline)
- FP8 — 2×
- FP6 — 2× (shares the FP8 datapath)
- FP4 / NVFP4 — 4×
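The micro-scaling idea behind NVFP4 fits in a few lines of code: each 16-element block shares one scale factor, so the 4-bit code only has to cover the block's local dynamic range. The sketch below is schematic — it rounds onto the positive FP4 (E2M1) value grid with a plain float scale, whereas real NVFP4 hardware stores the per-block scale in a compact format:

```cuda
#include <math.h>

// Schematic NVFP4-style micro-scaling quantizer: one shared scale per
// 16-element block. FP4_GRID is the positive E2M1 value set.
const float FP4_GRID[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};

float quantize_block16(const float* x, float* deq) {
    float amax = 0.f;
    for (int i = 0; i < 16; ++i) amax = fmaxf(amax, fabsf(x[i]));
    float scale = amax / 6.f;                 // map block max onto FP4 max (6.0)
    for (int i = 0; i < 16; ++i) {
        float t = (scale > 0.f) ? fabsf(x[i]) / scale : 0.f;
        int best = 0;                         // nearest point on the FP4 grid
        for (int k = 1; k < 8; ++k)
            if (fabsf(t - FP4_GRID[k]) < fabsf(t - FP4_GRID[best])) best = k;
        deq[i] = copysignf(FP4_GRID[best] * scale, x[i]);
    }
    return scale;                             // stored once per 16 elements
}
```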

06 Memory Hierarchy & Data Paths

Data in a Blackwell kernel flows through six nested levels — the closer to the SM, the faster and tinier. Press ▶ animate load to watch a single 128-byte cache line be fetched from GDDR7 all the way down into a warp register.

[Memory hierarchy schematic — six levels spanning roughly three orders of magnitude in latency:]

- GDDR7 — 16 GB off-chip · 960 GB/s · ~350 ns
- L2 cache — 64 MB, shared by all SMs · ~360 cycles
- L1 data / shared memory — 128 KB per SM · ~30–40 cycles
- TMEM — 256 KB per SM, tensor-core scratchpad · 16 TB/s read
- Register file — 256 KB per SM, split across the 4 sub-cores · ~1 cycle
- Operand collector / lane — 0 cycles

In the animation, a 128-byte cache line walks GDDR7 → L2 → L1 → TMEM → register file → operand collector.

Why the hierarchy matters for AI

A Tensor Core can burn through thousands of MACs per nanosecond — far faster than GDDR7 can deliver raw bytes. The job of every lower cache level is to keep the Tensor Cores fed with operand tiles. Blackwell added TMEM (Tensor Memory) — a dedicated 256 KB scratchpad per SM, accessed by new tcgen05.cp/ld/st instructions — specifically to stage matrix operands without thrashing shared memory or the register file.
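TMEM itself is reached through tcgen05.cp/ld/st and is not shown here, but the staging pattern it accelerates is the classic CUDA one, sketched below with shared memory as the tile buffer (tile size and names are illustrative; assumes N is a multiple of TILE):

```cuda
#include <cuda_runtime.h>

#define TILE 32  // illustrative tile edge

// Classic operand-tile staging: each block copies TILE x TILE tiles of A and B
// from GDDR7/L2 into shared memory once, then reuses each element TILE times.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // on-SM tile buffers (L1/shared level)
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // One global (GDDR7 -> L2 -> L1) load per element per tile...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        // ...then TILE reuses from shared memory keep the math units fed.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```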

07 I/O, Peripherals & Fixed-Function Blocks

Outside the compute fabric, the RTX 5080 has a rich set of fixed-function blocks that handle host communication, display, video, and — new on Blackwell — a microcontroller dedicated to AI task arbitration.

PCIe 5.0 ×16
Host interface. 32 GT/s per lane × 16 lanes ≈ 64 GB/s in each direction. DMA engines stream tensors and texture data between system RAM and GDDR7 (see the sketch after this list).
GigaThread Engine
Global work distributor. Receives CUDA kernel launches, decomposes them into cooperative thread arrays (CTAs, i.e. thread blocks), and assigns the CTAs round-robin to SMs with free resources (also covered in the sketch below).
AMP — AI Management Proc.
New on Blackwell. A small RISC microcontroller that arbitrates AI workloads (DLSS, neural shaders) vs graphics work — reduces CPU round-trips and cooperates with Windows hardware-accelerated GPU scheduling.
2× NVENC (9th-gen) · 2× NVDEC (6th-gen)
Dedicated video silicon: H.264, HEVC, AV1 encode/decode — and for the first time in GeForce, 4:2:2 chroma support for pro video.
Display Engine
1× HDMI 2.1b + 3× DisplayPort 2.1b (UHBR20). Drives 8K@165 Hz with DSC, or 4K@480 Hz.
8× GDDR7 Memory Controllers
Ring of eight 32-bit channels → 256-bit bus. PAM3 signaling at 30 Gbps per pin = 960 GB/s aggregate.
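A minimal host-side sketch ties the first two blocks together (buffer sizes and the kernel are illustrative): pin host memory so the PCIe DMA engines can stream it directly, issue an asynchronous HtoD copy, and launch a kernel whose grid is exactly the CTA inventory the GigaThread Engine spreads over the 84 SMs.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {        // stand-in for any CUDA kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1 << 24;                      // 16M floats (~64 MB), illustrative
    float* h;                                   // pinned host buffer: lets the
    cudaMallocHost(&h, N * sizeof(float));      // PCIe DMA engines copy directly
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc(&d, N * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    // HtoD over PCIe 5.0 x16 (~64 GB/s peak each way), overlappable with compute.
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, s);

    // Grid = the CTA inventory handed to the GigaThread Engine, which assigns
    // blocks to SMs with free registers/shared memory, round-robin.
    int block = 256;
    int grid = (N + block - 1) / block;         // 65,536 CTAs across 84 SMs
    scale<<<grid, block, 0, s>>>(d, N);

    cudaStreamSynchronize(s);
    cudaFreeHost(h); cudaFree(d); cudaStreamDestroy(s);
}
```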

08 Case Study — Training CIFAR-10 CNN on the RTX 5080

Let's trace one training step of a compact CNN for CIFAR-10: input 32 × 32 × 3 → Conv/ReLU/Pool → Conv/ReLU/Pool → FC → 10 classes. Press ▶ run epoch step and watch how image batches flow across the chip: CPU → PCIe → GigaThread → SMs → Tensor Cores/cuDNN → L2 → GDDR7.

[Training-step schematic — Host CPU (PyTorch, cuDNN launch) → PCIe 5.0 → GigaThread Engine (CNN kernels → CTAs), with GDDR7 (16 GB: images, labels, filters) and the 64 MB L2 cache holding feature-map and filter tiles. Pipeline: input batch 128 × 32×32×3 → Conv1 (3→32, 3×3) → ReLU + Pool (32×16×16) → Conv2 (32→64, 3×3) → FC (4096→10) → cross-entropy loss. The animation walks host launch → HtoD image batch → Conv1 → ReLU/Pool → Conv2 → FC/softmax → backward update.]

What actually runs on which block

| Phase | Where | How |
|---|---|---|
| 1. Kernel launch | Host CPU → PCIe 5.0 → GigaThread | cudaLaunchKernel and cuDNN descriptors select convolution kernels; GigaThread fans out CTAs across SMs. |
| 2. HtoD data copy | PCIe 5.0 DMA → GDDR7 → L2 | CIFAR-10 batch X: 128 × 32 × 32 × 3 images; labels and normalized tensors are staged for reuse. |
| 3. Conv1: 3 → 32 filters | Tensor Cores / CUDA cores via cuDNN | 3×3 convolution lowered to tiled GEMM or direct convolution; feature-map tiles pass through L2/shared memory. |
| 4. ReLU + MaxPool | CUDA cores, same SMs | Elementwise activation and 2×2 pooling reduce spatial size to 16 × 16 while keeping data close to the SMs (sketched below). |
| 5. Conv2: 32 → 64 filters | Tensor Cores across many SMs | The larger convolution dominates the step; input/filter tiles are staged through L2, shared memory, and TMEM. |
| 6. FC + softmax + cross-entropy | Tensor Cores + special-function units | The flattened feature map goes through a 4096 → 10 classifier; exp/sum/log compute the loss. |
| 7. Backward + optimizer update | Tensor Cores + CUDA cores → L2 → GDDR7 | Gradient convolutions compute dFilters and dActivations; SGD/Adam updates filter and FC weights in GDDR7. |
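Phase 4 is representative of the elementwise work the plain CUDA cores absorb between Tensor-Core convolutions. A minimal sketch of a fused ReLU + 2×2 max-pool kernel (NCHW layout; dimensions and names are illustrative — a real framework would call cuDNN's fused paths):

```cuda
#include <cuda_runtime.h>

// Fused ReLU + 2x2 max-pool, NCHW layout: one thread per output pixel.
// For the demo CNN: in = 32x32 per channel, out = 16x16 per channel.
// max(relu(x)) == relu(max(x)) since ReLU is monotonic, so one pass suffices.
__global__ void relu_maxpool2x2(const float* in, float* out,
                                int C, int H, int W) {
    int Ho = H / 2, Wo = W / 2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= C * Ho * Wo) return;

    int c = idx / (Ho * Wo);        // output channel
    int y = (idx / Wo) % Ho;        // output row
    int x = idx % Wo;               // output column

    const float* p = in + (c * H + 2 * y) * W + 2 * x;
    float m = fmaxf(fmaxf(p[0], p[1]), fmaxf(p[W], p[W + 1])); // 2x2 window max
    out[idx] = fmaxf(m, 0.0f);      // ReLU folded into the same pass
}
// Launch per image: relu_maxpool2x2<<<(32*16*16 + 255)/256, 256>>>(d_in, d_out, 32, 32, 32);
```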

09 Inference — Classifying One CIFAR-10 Image

Click 🎲 random CIFAR-10 image to generate a small 32 × 32 RGB-style sample. We simulate inference with a compact CNN and show how activations flow through Conv1 → ReLU/Pool → Conv2 → FC → Softmax, mirroring the GPU execution path on the RTX 5080.

Input: 32 × 32 RGB image

Note: the images and logits here are procedurally generated to visualize the CNN hardware pipeline. This is a teaching demo, not a real CIFAR-10 benchmark.

Hardware pipeline & activations

[Inference pipeline schematic — 32×32×3 input → Conv1 (3→32, 32 maps) → ReLU/Pool → Conv2 (32→64, 64 maps) → FC (4096→10, 10 logits) → softmax output over the CIFAR-10 classes.]
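The closing softmax over 10 logits is tiny on this hardware: one numerically-stabilized exp/sum/normalize pass, with __expf served by the SM's special-function units. A minimal sketch (launch with exactly CLASSES threads; names are illustrative):

```cuda
#include <cuda_runtime.h>

#define CLASSES 10  // CIFAR-10

// Single-block softmax: subtract the max before exponentiating for numerical
// stability; __expf maps to the SM's special-function units.
__global__ void softmax10(const float* logits, float* probs) {
    __shared__ float buf[CLASSES];
    int i = threadIdx.x;            // one thread per class; launch 10 threads

    float m = logits[0];            // serial max: fine for 10 values
    for (int k = 1; k < CLASSES; ++k) m = fmaxf(m, logits[k]);

    buf[i] = __expf(logits[i] - m);
    __syncthreads();

    float sum = 0.0f;
    for (int k = 0; k < CLASSES; ++k) sum += buf[k];
    probs[i] = buf[i] / sum;
}
// Launch: softmax10<<<1, CLASSES>>>(d_logits, d_probs);
```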

Sources & Further Reading