01 At a Glance — GB203 / RTX 5080
The RTX 5080 is built on TSMC's 4N process, NVIDIA's custom 5 nm-class node, and uses the GB203 Blackwell GPU for the high-end graphics segment. It keeps the same Blackwell SM, 5th-generation Tensor Core, 4th-generation RT Core, DLSS 4, and neural-rendering feature set, but scales the chip to 7 GPCs, 42 TPCs, 84 SMs, and a 256-bit GDDR7 memory subsystem.
Numbers updated for the RTX 5080 / GB203 from NVIDIA's RTX Blackwell whitepaper and RTX 5080 specifications (see Sources).
02 Die Floorplan & Top-Level Chip Layout
The GB203 die is a monolithic high-end Blackwell GPU with 7 Graphics Processing Clusters (GPCs) wrapped in a ring of memory controllers, L2 cache, and I/O. The schematic below is a topological rendering — block sizes are approximate, but relative positions mirror NVIDIA's published floorplan.
GB203 contains 7 GPCs × 6 TPCs × 2 SMs = 84 SMs. The RTX 5080 uses the full GB203 configuration with all 84 SMs enabled.
03 GPC → TPC → SM Hierarchy
A Graphics Processing Cluster (GPC) is the largest self-contained execution tile on the chip. Each GPC houses a Raster Engine, 2 ROP partitions (16 ROPs total), and 6 Texture Processing Clusters. Each TPC is a pair of SMs sharing a PolyMorph Engine. Click any GPC to zoom.
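Part of this hierarchy is visible from software. A minimal CUDA runtime sketch that queries device 0 (the GPC/TPC grouping itself is not exposed through the API, only the total SM count):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0 assumed

    // On a full GB203 / RTX 5080 this should report 84 SMs
    // (7 GPCs x 6 TPCs x 2 SMs). The runtime only sees SMs;
    // the GPC/TPC grouping is invisible at this level.
    printf("%s\n", prop.name);
    printf("SMs (multiProcessorCount): %d\n", prop.multiProcessorCount);
    printf("Compute capability:        %d.%d\n", prop.major, prop.minor);
    printf("L2 cache:                  %d KiB\n", prop.l2CacheSize >> 10);
    printf("Memory bus width:          %d-bit\n", prop.memoryBusWidth);
    return 0;
}
```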
04 Inside a Blackwell SM (Streaming Multiprocessor)
Each SM is split into 4 sub-cores (processing blocks). Each sub-core has its own warp scheduler, register file, and a slice of the SM's
compute: CUDA cores (unified INT32/FP32), tensor core, and load/store units. A single L1 / shared memory cache and the RT core are shared across
the SM. Press ▶ run warp to animate a 32-thread warp flowing through one sub-core.
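To make the 32-thread warp concrete, here is a minimal warp-level reduction: all 32 lanes exchange values through the sub-core's register file using shuffle intrinsics, never touching the shared L1. The kernel and data are illustrative, not from NVIDIA's materials.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each warp sums 32 values entirely in registers: the shuffle
// intrinsics move data between lanes of the same warp, which all
// execute on one SM sub-core.
__global__ void warpSum(const float* in, float* out) {
    float v = in[threadIdx.x];                  // one value per lane
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    if (threadIdx.x == 0) *out = v;             // lane 0 holds the total
}

int main() {
    float h_in[32], *d_in, *d_out, h_out = 0.f;
    for (int i = 0; i < 32; ++i) h_in[i] = 1.f; // expect 32.0
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    warpSum<<<1, 32>>>(d_in, d_out);            // exactly one warp
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("warp sum = %.1f\n", h_out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```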
Key differences vs Ada (RTX 40-series) SM
| Feature | Ada SM | Blackwell SM |
|---|---|---|
| INT32/FP32 datapath | Split: 1× FP32, 1× FP32+INT32 | Unified — both datapaths can do INT32 or FP32 per cycle |
| Tensor core generation | 4th gen — FP16/BF16/FP8 | 5th gen — adds FP6, FP4, NVFP4, MXFP formats |
| RT core generation | 3rd gen | 4th gen — 2× triangle-intersect, Mega Geometry / LSS & subdivision surface primitives |
| Neural shader support | No | Yes — shaders can inline small MLPs via Tensor cores |
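Why the unified datapath matters in practice: nearly every kernel interleaves INT32 address arithmetic with FP32 math. A hedged sketch of that mix (the kernel runs on any recent GPU; only the issue behavior differs):

```cuda
// A typical instruction mix: INT32 for indexing, FP32 for math.
// On Ada only one of the two datapaths per sub-core could issue the
// INT32 work; on Blackwell both can, so the integer ops no longer
// steal slots from a dedicated FP32 pipe.
__global__ void saxpyStrided(int n, int stride, float a,
                             const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // INT32 ops
    if (i < n) {
        int j = i * stride;                         // more INT32
        y[j] = fmaf(a, x[j], y[j]);                 // FP32 FMA
    }
}
```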
05 5th-Generation Tensor Core — FP4 Matrix Multiply
The Tensor Core is a systolic-array-style matrix engine: each matrix-multiply-accumulate (MMA) instruction computes D = A·B + C over a tile of operands. On Blackwell, a single Tensor Core can operate on FP4 (NVFP4 / MXFP4), FP6, FP8, INT8, FP16, BF16, or TF32, with throughput roughly doubling each time the operand width halves. Press ▶ step MAC to watch a 4×4 tile MAC animation.
Precision/throughput ladder (relative to FP16 = 1×)
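From CUDA C++ the portable way to drive the Tensor Cores is the wmma API. Below is a minimal sketch of one 16×16×16 tile MAC, D = A·B + C, in FP16 with FP32 accumulation; the FP8/FP6/FP4 paths follow the same tiling pattern at higher throughput but are only reachable through PTX or libraries such as cuBLAS/CUTLASS.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile MAC: D = A*B + C.
// A and B are FP16, accumulation is FP32 -- the baseline
// tensor-core contract since Volta.
__global__ void tileMac(const half* A, const half* B,
                        const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a, A, 16);                    // lda = 16
    wmma::load_matrix_sync(b, B, 16);                    // ldb = 16
    wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);
    wmma::mma_sync(acc, a, b, acc);                      // the tile MAC
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major);
}

// Launched as tileMac<<<1, 32>>>(A, B, C, D): exactly one warp.
```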
06 Memory Hierarchy & Data Paths
Data in a Blackwell kernel flows through six nested levels — the closer to the SM, the faster and smaller the storage.
Press ▶ animate load to watch a single 128-byte cache line be fetched from GDDR7 all the way down into a warp register.
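A hand-written version of the same journey, as a sketch: one block stages a tile from GDDR7/L2 into shared memory, then pulls it into registers before computing. Sizes and names are illustrative.

```cuda
// One block walks a 128-float tile down the hierarchy:
// GDDR7/L2 (global) -> shared memory (on-SM SRAM) -> a register.
__global__ void stagedScale(const float* g_in, float* g_out, float s) {
    __shared__ float tile[128];               // on-SM shared memory
    int i = threadIdx.x;                      // 128 threads per block

    tile[i] = g_in[blockIdx.x * 128 + i];     // global -> shared
    __syncthreads();                          // whole tile resident

    float r = tile[i];                        // shared -> register
    g_out[blockIdx.x * 128 + i] = r * s;      // compute, write back
}
```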
Why the hierarchy matters for AI
A Tensor Core can burn through thousands of MACs per nanosecond — far faster than GDDR7 can deliver raw bytes. The job of every lower cache level is to keep the Tensor Cores fed with operand tiles. Datacenter Blackwell (GB100/GB200) added TMEM (Tensor Memory) — a dedicated 256 KB per-SM scratchpad accessed by the new tcgen05.cp/ld/st instructions — specifically to stage matrix operands without thrashing shared memory or the register file. On GeForce parts like GB203, operand staging still runs through shared memory and the register file.
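On GB203 the practical tool for keeping the Tensor Cores fed is asynchronous staging into shared memory (hardware cp.async, available since Ampere). A hedged sketch using cooperative groups; the tile size and kernel are illustrative:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

// Stage one 256-float operand tile into shared memory with the
// async copy hardware (cp.async): the copy bypasses the register
// file on the way from L2/GDDR7 to shared memory, and can overlap
// with math on a previously staged tile.
__global__ void stageTile(const float* g_in, float* g_out) {
    __shared__ float tile[256];
    cg::thread_block block = cg::this_thread_block();

    cg::memcpy_async(block, tile, g_in + blockIdx.x * 256,
                     sizeof(float) * 256);    // async global -> shared
    cg::wait(block);                          // block until tile lands

    // Stand-in for tensor-core math on the staged tile:
    g_out[blockIdx.x * 256 + threadIdx.x] = tile[threadIdx.x] * 2.0f;
}

// Launched with 256 threads per block, e.g. stageTile<<<N, 256>>>.
```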
07 I/O, Peripherals & Fixed-Function Blocks
Outside the compute fabric, the RTX 5080 has a rich set of fixed-function blocks that handle host communication, display, and video, plus — new on Blackwell — the AI Management Processor (AMP), a dedicated microcontroller for AI task scheduling and arbitration.
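Some of these peripheral blocks are visible from software. A small NVML sketch that reports the PCIe link (device 0 assumed; in a Gen5 ×16 slot an RTX 5080 should report Gen5 x16):

```cuda
#include <cstdio>
#include <nvml.h>   // link with -lnvml

int main() {
    nvmlDevice_t dev;
    unsigned int gen = 0, width = 0;

    nvmlInit();
    nvmlDeviceGetHandleByIndex(0, &dev);
    nvmlDeviceGetCurrPcieLinkGeneration(dev, &gen);   // negotiated gen
    nvmlDeviceGetCurrPcieLinkWidth(dev, &width);      // negotiated lanes
    printf("PCIe link: Gen%u x%u\n", gen, width);
    nvmlShutdown();
    return 0;
}
```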
08 Case Study — Training CIFAR-10 CNN on the RTX 5080
Let's trace one training step of a compact CNN for CIFAR-10: input 32 × 32 × 3 → Conv/ReLU/Pool → Conv/ReLU/Pool → FC → 10 classes.
Press ▶ run epoch step and watch how image batches flow across the chip: CPU → PCIe → GigaThread → SMs → Tensor Cores/cuDNN → L2 → GDDR7.
What actually runs on which block
| Phase | Where | How |
|---|---|---|
| 1. Kernel launch | Host CPU → PCIe 5.0 → GigaThread | cudaLaunchKernel and cuDNN descriptors select convolution kernels; GigaThread fans out CTAs across SMs. |
| 2. HtoD data copy | PCIe 5.0 DMA → GDDR7 → L2 | CIFAR-10 batch X: 128 × 32 × 32 × 3 images; labels and normalized tensors are staged for reuse. |
| 3. Conv1: 3 → 32 filters | Tensor cores / CUDA cores via cuDNN | 3×3 convolution lowered to tiled GEMM or direct convolution; feature-map tiles pass through L2/shared memory. |
| 4. ReLU + MaxPool | CUDA cores, same SMs | Elementwise activation and 2×2 pooling reduce spatial size to 16 × 16 while keeping data close to SMs. |
| 5. Conv2: 32 → 64 filters | Tensor cores across many SMs | The larger convolution dominates the step; input and filter tiles are staged through L2 and shared memory. |
| 6. FC + Softmax + cross-entropy | Tensor cores + special-function units | Flattened feature map goes through a 4096 → 10 classifier; exp/sum/log compute the loss. |
| 7. Backward + optimizer update | Tensor cores + CUDA cores → L2 → GDDR7 | Gradient convolutions compute dFilters and dActivations; SGD/Adam updates filter and FC weights in GDDR7. |
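To ground phases 1–3 of the table above, here is a heavily abridged host-side sketch of how cuDNN's legacy API would be pointed at Conv1. Error handling, workspace sizing, and algorithm selection are omitted; shapes follow the table, and the function name is illustrative.

```cuda
#include <cudnn.h>   // link with -lcudnn

// Abridged setup for Conv1: a 128 x 3 x 32 x 32 batch convolved with
// 32 filters of 3 x 3 x 3, "same" padding. cuDNN lowers this to a
// tiled implicit GEMM that runs on the tensor cores when the math
// type and precision allow it.
void conv1Forward(cudnnHandle_t h, const float* x, const float* w,
                  float* y, void* workspace, size_t wsBytes) {
    cudnnTensorDescriptor_t xd, yd;
    cudnnFilterDescriptor_t wd;
    cudnnConvolutionDescriptor_t cd;

    cudnnCreateTensorDescriptor(&xd);
    cudnnSetTensor4dDescriptor(xd, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               128, 3, 32, 32);       // N,C,H,W
    cudnnCreateFilterDescriptor(&wd);
    cudnnSetFilter4dDescriptor(wd, CUDNN_DATA_FLOAT, CUDNN_TENSOR_NCHW,
                               32, 3, 3, 3);          // K,C,R,S
    cudnnCreateConvolutionDescriptor(&cd);
    cudnnSetConvolution2dDescriptor(cd, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION,
                                    CUDNN_DATA_FLOAT);
    cudnnSetConvolutionMathType(cd, CUDNN_TENSOR_OP_MATH_ALLOW_CONVERSION);
    cudnnCreateTensorDescriptor(&yd);
    cudnnSetTensor4dDescriptor(yd, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT,
                               128, 32, 32, 32);      // output: N,K,H,W

    const float one = 1.f, zero = 0.f;
    cudnnConvolutionForward(h, &one, xd, x, wd, w, cd,
                            CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
                            workspace, wsBytes, &zero, yd, y);
}
```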
09 Inference — Classifying One CIFAR-10 Image
Click 🎲 random CIFAR-10 image to generate a small 32 × 32 RGB-style sample. We simulate inference with a compact CNN and show how
activations flow through Conv1 → ReLU/Pool → Conv2 → FC → Softmax, mirroring the GPU execution path on the RTX 5080.
Input: 32 × 32 RGB image
Note: the images and logits here are procedurally generated to visualize the CNN hardware pipeline. This is a teaching demo, not a real CIFAR-10 benchmark.
Hardware pipeline & activations
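The last stage of that pipeline is small enough to write out in full. A hedged sketch of the 10-class softmax as it would run on the CUDA cores and special-function units (__expf maps to the SFUs); the launch configuration and names are illustrative.

```cuda
// Softmax over the 10 CIFAR-10 logits, launched as
// softmax10<<<1, 10>>>(d_logits, d_probs): one thread per class.
__global__ void softmax10(const float* logits, float* probs) {
    __shared__ float e[10];
    int c = threadIdx.x;

    float m = logits[0];                     // max logit, for stability
    for (int i = 1; i < 10; ++i) m = fmaxf(m, logits[i]);

    e[c] = __expf(logits[c] - m);            // exponentials on the SFUs
    __syncthreads();

    float sum = 0.f;
    for (int i = 0; i < 10; ++i) sum += e[i];
    probs[c] = e[c] / sum;                   // normalized probabilities
}
```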
∞ Sources & Further Reading
- NVIDIA, RTX Blackwell GPU Architecture Whitepaper v1.1 — block-level GPC / TPC / SM diagrams, RT & Tensor core generation notes.
- TechPowerUp & NotebookCheck GPU Database — SKU specs, clock/memory configuration, ROP/TMU counts.
- Jarmusch et al., "Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks" (arXiv:2507.10789, Jul 2025) — SM sub-core layout, tcgen05, TMEM, measured latency/throughput.
- SemiAnalysis, "NVIDIA Tensor Core Evolution: From Volta To Blackwell" — FP4/FP6/NVFP4 format deep-dive, datapath sharing.
- Wikipedia, GeForce RTX 50 series & Blackwell (microarchitecture) — I/O and display feature set, NVENC/NVDEC generations.
- Moor Insights & Strategy RTX Blackwell coverage — AI Management Processor (AMP) description and Windows scheduler interaction.