GeForce RTX 5080 — Hardware Architecture Deep Dive

An interactive, research-level tour of NVIDIA's high-end Blackwell GPU (GB203): hardware blocks, on-chip interconnect, memory hierarchy, and an animated walkthrough of how a CIFAR-10 CNN is trained and served on the silicon.


01 At a Glance — GB203 / RTX 5080

The RTX 5080 is built on TSMC's custom 4N node and uses NVIDIA's GB203 Blackwell GPU for the high-end graphics segment. It keeps the same Blackwell SM, 5th-generation Tensor Core, 4th-generation RT Core, DLSS 4, and neural-rendering feature set as the rest of the family, but scales the chip to 7 GPCs, 42 TPCs, 84 SMs, and a 256-bit GDDR7 memory subsystem.

| Specification | Value |
|---|---|
| Architecture | Blackwell (GB203) |
| Process | TSMC 4N (custom 5 nm class) |
| Transistors | 45.6 billion |
| Die size | 378 mm² |
| CUDA cores | 10,752 |
| SMs / Tensor Cores / RT Cores | 84 / 336 / 84 |
| L2 cache | 64 MB |
| Memory | 16 GB GDDR7 @ 30 Gbps |
| Bus width / bandwidth | 256-bit / 960 GB/s |
| Boost clock | 2.62 GHz |
| AI compute (FP4, sparse) | 1,801 TOPS |
| TGP | 360 W |

Numbers updated for the RTX 5080 / GB203 from NVIDIA's RTX Blackwell whitepaper and RTX 5080 specifications (see Sources).

02 Die Floorplan & Top-Level Chip Layout

The GB203 die is a monolithic high-end Blackwell GPU with 7 Graphics Processing Clusters (GPCs) wrapped in the memory/L2/IO ring. The schematic below is a topological rendering — block sizes are approximate, but positions mirror NVIDIA's published floorplan.

[Die floorplan schematic — GB203, 378 mm², 45.6 B transistors. Around the edge: the PCIe 5.0 ×16 host interface (~64 GB/s each way), the GigaThread Engine global scheduler, the AMP (AI Management Processor), the Display Engine (HDMI 2.1b, 3× DP 2.1b), and 2× NVENC (9th gen) / 2× NVDEC (6th gen) with 4:2:2 support; the 64 MB L2 cache sits between the GPC rows.]

GB203 contains 7 GPCs × 6 TPCs × 2 SMs = 84 SMs. The RTX 5080 uses the full GB203 configuration with all 84 SMs enabled.

03 GPC → TPC → SM Hierarchy

A Graphics Processing Cluster (GPC) is the largest self-contained execution tile on the chip. Each GPC houses a Raster Engine, 2 ROP partitions (16 ROPs total), and 6 Texture Processing Clusters. Each TPC is a pair of SMs sharing a PolyMorph Engine. Click any GPC to zoom.

[GPC schematic — one GPC = Raster Engine + 6 TPCs + 2 ROP partitions. The Raster Engine handles triangle setup, edge/Z, tile binning, and hierarchical-Z, feeding pixels into the SMs for shading. Each ROP partition holds 8 ROPs for blend, depth test, and color write-out: 16 ROPs per GPC × 7 GPCs = 112 ROPs.]

04 Inside a Blackwell SM (Streaming Multiprocessor)

Each SM is split into 4 sub-cores (processing blocks). Each sub-core has its own warp scheduler, register file, and a slice of the SM's compute: CUDA cores (unified INT32/FP32), tensor core, and load/store units. A single L1 / shared memory cache and the RT core are shared across the SM. Press ▶ run warp to animate a 32-thread warp flowing through one sub-core.

[Blackwell SM schematic — 128 CUDA cores, 4 Tensor Cores, 1 RT Core, 128 KB L1/shared. At the top sit the L1 instruction cache and warp dispatch. Each of the four sub-cores pairs a warp scheduler/dispatch unit and register file with INT32/FP32 CUDA cores and one 5th-gen Tensor Core. Shared across the SM: the 4th-gen RT Core (BVH traversal, ray-triangle and ray-box intersection, Mega Geometry, 2× the triangle-intersect rate of Ada) and the 128 KB L1 data/shared memory with a configurable split, 4 texture units, and the TMA (Tensor Memory Accelerator); the L1 also serves as backing store for warp register spills. In the animation, a 32-thread warp is scheduled, operands are fetched from the register file, and a fused MAC executes across all 32 lanes of one sub-core.]
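To make one sub-core's warp concrete in code, here is a minimal CUDA kernel (all names are illustrative, not from NVIDIA's materials): it launches a single 32-thread block, i.e. exactly one warp, and issues one fused multiply-add per lane — the same operation the animation shows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One warp (32 threads) per block: every lane runs the same FMA in lockstep,
// mirroring the "run warp" animation above.
__global__ void warp_fma(const float* a, const float* b, float* d, float c) {
    int lane = threadIdx.x;              // lane ID 0..31 within the warp
    d[lane] = fmaf(a[lane], b[lane], c); // one fused multiply-add per lane
}

int main() {
    const int LANES = 32;
    float ha[LANES], hb[LANES], hd[LANES];
    for (int i = 0; i < LANES; ++i) { ha[i] = float(i); hb[i] = 2.0f; }

    float *da, *db, *dd;
    cudaMalloc(&da, LANES * sizeof(float));
    cudaMalloc(&db, LANES * sizeof(float));
    cudaMalloc(&dd, LANES * sizeof(float));
    cudaMemcpy(da, ha, LANES * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, LANES * sizeof(float), cudaMemcpyHostToDevice);

    warp_fma<<<1, LANES>>>(da, db, dd, 1.0f);  // 1 block = 1 warp = 1 sub-core
    cudaMemcpy(hd, dd, LANES * sizeof(float), cudaMemcpyDeviceToHost);
    printf("lane 5: %.1f\n", hd[5]);           // 5*2+1 = 11.0
    cudaFree(da); cudaFree(db); cudaFree(dd);
}
```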

Key differences vs Ada (RTX 40-series) SM

| Feature | Ada SM | Blackwell SM |
|---|---|---|
| INT32/FP32 datapath | Split: 1× FP32, 1× FP32+INT32 | Unified — both datapaths can do INT32 or FP32 per cycle |
| Tensor core generation | 4th gen — FP16/BF16/FP8 | 5th gen — adds FP6, FP4, NVFP4, MXFP formats |
| RT core generation | 3rd gen | 4th gen — 2× triangle-intersect, Mega Geometry / LSS & subdivision-surface primitives |
| Neural shader support | No | Yes — shaders can inline small MLPs via Tensor Cores |

05 5th-Generation Tensor Core — FP4 Matrix Multiply

The Tensor Core is a systolic-array-style matrix engine. In each cycle it computes D = A·B + C over a tile of operands. On Blackwell, a single Tensor Core tile can operate on FP4 (NVFP4 / MXFP4), FP6, FP8, FP16, BF16, TF32, or FP64 — doubling throughput every time you halve the bit-width. Press ▶ step MAC to watch a 4×4 tile MAC animation.

[Tensor Core schematic — D = A·B + C over a 4×4 tile in a single cycle: A (activations, FP4) and B (weights, FP4) feed 16 multiply-accumulate lanes that sum into the C accumulator (FP16) to produce D. Blackwell adds the tcgen05.mma PTX instruction — single-thread MMA issue that removes the warp-wide barrier, with operands staged from TMEM — and micro-scaling (NVFP4), which preserves accuracy by quantizing each 16-element block with a shared scale factor.]
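The FP4 paths above are reached through the new tcgen05 PTX instructions, which ordinary CUDA C++ does not yet expose portably; the same D = A·B + C tile pattern can, however, be sketched with the long-standing WMMA intrinsics at FP16. A minimal one-warp sketch (16×16×16 tile; the API is standard nvcuda::wmma, but the kernel itself is illustrative):

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A·B + C on a 16x16x16 tile — the WMMA analogue of
// the 4x4 tile in the diagram. FP16 inputs, FP32 accumulate.
__global__ void tile_mma(const half* A, const half* B, const float* C, float* D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;

    wmma::load_matrix_sync(a, A, 16);                         // stage A tile
    wmma::load_matrix_sync(b, B, 16);                         // stage B tile
    wmma::load_matrix_sync(acc, C, 16, wmma::mem_row_major);  // preload C
    wmma::mma_sync(acc, a, b, acc);                           // acc = A·B + acc
    wmma::store_matrix_sync(D, acc, 16, wmma::mem_row_major); // write D
}
// Launch with exactly one warp: tile_mma<<<1, 32>>>(dA, dB, dC, dD);
```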

Precision/throughput ladder (relative to FP16 = 1×)

- FP16 / BF16 — 1× (baseline)
- FP8 — 2×
- FP6 — 2× (shares the FP8 datapath)
- FP4 / NVFP4 — 4×
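The micro-scaling idea behind NVFP4 fits in a few lines of code: each 16-element block shares one scale factor, so the 4-bit code only has to cover the block's local dynamic range. The sketch below is schematic — it rounds onto the positive FP4 (E2M1) value grid with a plain float scale, whereas real NVFP4 hardware stores the per-block scale in a compact format:

```cuda
#include <math.h>

// Schematic NVFP4-style micro-scaling quantizer: one shared scale per
// 16-element block. FP4_GRID is the positive E2M1 value set.
const float FP4_GRID[8] = {0.f, 0.5f, 1.f, 1.5f, 2.f, 3.f, 4.f, 6.f};

float quantize_block16(const float* x, float* deq) {
    float amax = 0.f;
    for (int i = 0; i < 16; ++i) amax = fmaxf(amax, fabsf(x[i]));
    float scale = amax / 6.f;                 // map block max onto FP4 max (6.0)
    for (int i = 0; i < 16; ++i) {
        float t = (scale > 0.f) ? fabsf(x[i]) / scale : 0.f;
        int best = 0;                         // nearest point on the FP4 grid
        for (int k = 1; k < 8; ++k)
            if (fabsf(t - FP4_GRID[k]) < fabsf(t - FP4_GRID[best])) best = k;
        deq[i] = copysignf(FP4_GRID[best] * scale, x[i]);
    }
    return scale;                             // stored once per 16 elements
}
```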

06 Memory Hierarchy & Data Paths

Data in a Blackwell kernel flows through six nested levels — the closer to the SM, the faster and tinier. Press ▶ animate load to watch a single 128-byte cache line be fetched from GDDR7 all the way down into a warp register.

[Memory hierarchy schematic — six levels spanning roughly three orders of magnitude in latency:]

- GDDR7 — 16 GB off-chip · 960 GB/s · ~350 ns
- L2 cache — 64 MB, shared by all SMs · ~360 cycles
- L1 data / shared memory — 128 KB per SM · ~30–40 cycles
- TMEM — 256 KB per SM, tensor-core scratchpad · 16 TB/s read
- Register file — 256 KB per SM, split across the 4 sub-cores · ~1 cycle
- Operand collector / lane — 0 cycles

In the animation, a 128-byte cache line walks GDDR7 → L2 → L1 → TMEM → register file → operand collector.

Why the hierarchy matters for AI

A Tensor Core can burn through thousands of MACs per nanosecond — far faster than GDDR7 can deliver raw bytes. The job of every lower cache level is to keep the Tensor Cores fed with operand tiles. Blackwell added TMEM (Tensor Memory) — a dedicated 256 KB scratchpad per SM, accessed by new tcgen05.cp/ld/st instructions — specifically to stage matrix operands without thrashing shared memory or the register file.
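TMEM itself is reached through tcgen05.cp/ld/st and is not shown here, but the staging pattern it accelerates is the classic CUDA one, sketched below with shared memory as the tile buffer (tile size and names are illustrative; assumes N is a multiple of TILE):

```cuda
#include <cuda_runtime.h>

#define TILE 32  // illustrative tile edge

// Classic operand-tile staging: each block copies TILE x TILE tiles of A and B
// from GDDR7/L2 into shared memory once, then reuses each element TILE times.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];   // on-SM tile buffers (L1/shared level)
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // One global (GDDR7 -> L2 -> L1) load per element per tile...
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();
        // ...then TILE reuses from shared memory keep the math units fed.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```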

07 I/O, Peripherals & Fixed-Function Blocks

Outside the compute fabric, the RTX 5080 has a rich set of fixed-function blocks that handle host communication, display, video, and — new on Blackwell — a microcontroller dedicated to AI task arbitration.

PCIe 5.0 ×16
Host interface. 32 GT/s per lane × 16 lanes ≈ 64 GB/s in each direction. DMA engines stream tensors and texture data between system RAM and GDDR7 (see the sketch after this list).
GigaThread Engine
Global work distributor. Receives CUDA kernel launches, decomposes them into cooperative thread arrays (CTAs, i.e. thread blocks), and assigns the CTAs round-robin to SMs with free resources (also covered in the sketch below).
AMP — AI Management Proc.
New on Blackwell. A small RISC microcontroller that arbitrates AI workloads (DLSS, neural shaders) vs graphics work — reduces CPU round-trips and cooperates with Windows hardware-accelerated GPU scheduling.
2× NVENC (9th-gen) · 2× NVDEC (6th-gen)
Dedicated video silicon: H.264, HEVC, AV1 encode/decode — and for the first time in GeForce, 4:2:2 chroma support for pro video.
Display Engine
1× HDMI 2.1b + 3× DisplayPort 2.1b (UHBR20). Drives 8K@165 Hz with DSC, or 4K@480 Hz.
8× GDDR7 Memory Controllers
Ring of eight 32-bit channels → 256-bit bus. PAM3 signaling at 30 Gbps per pin = 960 GB/s aggregate.
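A minimal host-side sketch ties the first two blocks together (buffer sizes and the kernel are illustrative): pin host memory so the PCIe DMA engines can stream it directly, issue an asynchronous HtoD copy, and launch a kernel whose grid is exactly the CTA inventory the GigaThread Engine spreads over the 84 SMs.

```cuda
#include <cuda_runtime.h>

__global__ void scale(float* x, int n) {        // stand-in for any CUDA kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int N = 1 << 24;                      // 16M floats (~64 MB), illustrative
    float* h;                                   // pinned host buffer: lets the
    cudaMallocHost(&h, N * sizeof(float));      // PCIe DMA engines copy directly
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float* d;
    cudaMalloc(&d, N * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    // HtoD over PCIe 5.0 x16 (~64 GB/s peak each way), overlappable with compute.
    cudaMemcpyAsync(d, h, N * sizeof(float), cudaMemcpyHostToDevice, s);

    // Grid = the CTA inventory handed to the GigaThread Engine, which assigns
    // blocks to SMs with free registers/shared memory, round-robin.
    int block = 256;
    int grid = (N + block - 1) / block;         // 65,536 CTAs across 84 SMs
    scale<<<grid, block, 0, s>>>(d, N);

    cudaStreamSynchronize(s);
    cudaFreeHost(h); cudaFree(d); cudaStreamDestroy(s);
}
```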

08 Case Study — Training CIFAR-10 CNN on the RTX 5080

Let's trace one training step of a compact CNN for CIFAR-10: input 32 × 32 × 3 → Conv/ReLU/Pool → Conv/ReLU/Pool → FC → 10 classes. Press ▶ run epoch step and watch how image batches flow across the chip: CPU → PCIe → GigaThread → SMs → Tensor Cores/cuDNN → L2 → GDDR7.

[Training-step schematic — Host CPU (PyTorch, cuDNN launch) → PCIe 5.0 → GigaThread Engine (CNN kernels → CTAs), with GDDR7 (16 GB: images, labels, filters) and the 64 MB L2 cache holding feature-map and filter tiles. Pipeline: input batch 128 × 32×32×3 → Conv1 (3→32, 3×3) → ReLU + Pool (32×16×16) → Conv2 (32→64, 3×3) → FC (4096→10) → cross-entropy loss. The animation walks host launch → HtoD image batch → Conv1 → ReLU/Pool → Conv2 → FC/softmax → backward update.]

What actually runs on which block

| Phase | Where | How |
|---|---|---|
| 1. Kernel launch | Host CPU → PCIe 5.0 → GigaThread | cudaLaunchKernel and cuDNN descriptors select convolution kernels; GigaThread fans out CTAs across SMs. |
| 2. HtoD data copy | PCIe 5.0 DMA → GDDR7 → L2 | CIFAR-10 batch X: 128 × 32 × 32 × 3 images; labels and normalized tensors are staged for reuse. |
| 3. Conv1: 3 → 32 filters | Tensor Cores / CUDA cores via cuDNN | 3×3 convolution lowered to tiled GEMM or direct convolution; feature-map tiles pass through L2/shared memory. |
| 4. ReLU + MaxPool | CUDA cores, same SMs | Elementwise activation and 2×2 pooling reduce spatial size to 16 × 16 while keeping data close to the SMs (sketched below). |
| 5. Conv2: 32 → 64 filters | Tensor Cores across many SMs | The larger convolution dominates the step; input/filter tiles are staged through L2, shared memory, and TMEM. |
| 6. FC + softmax + cross-entropy | Tensor Cores + special-function units | The flattened feature map goes through a 4096 → 10 classifier; exp/sum/log compute the loss. |
| 7. Backward + optimizer update | Tensor Cores + CUDA cores → L2 → GDDR7 | Gradient convolutions compute dFilters and dActivations; SGD/Adam updates filter and FC weights in GDDR7. |
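Phase 4 is representative of the elementwise work the plain CUDA cores absorb between Tensor-Core convolutions. A minimal sketch of a fused ReLU + 2×2 max-pool kernel (NCHW layout; dimensions and names are illustrative — a real framework would call cuDNN's fused paths):

```cuda
#include <cuda_runtime.h>

// Fused ReLU + 2x2 max-pool, NCHW layout: one thread per output pixel.
// For the demo CNN: in = 32x32 per channel, out = 16x16 per channel.
// max(relu(x)) == relu(max(x)) since ReLU is monotonic, so one pass suffices.
__global__ void relu_maxpool2x2(const float* in, float* out,
                                int C, int H, int W) {
    int Ho = H / 2, Wo = W / 2;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= C * Ho * Wo) return;

    int c = idx / (Ho * Wo);        // output channel
    int y = (idx / Wo) % Ho;        // output row
    int x = idx % Wo;               // output column

    const float* p = in + (c * H + 2 * y) * W + 2 * x;
    float m = fmaxf(fmaxf(p[0], p[1]), fmaxf(p[W], p[W + 1])); // 2x2 window max
    out[idx] = fmaxf(m, 0.0f);      // ReLU folded into the same pass
}
// Launch per image: relu_maxpool2x2<<<(32*16*16 + 255)/256, 256>>>(d_in, d_out, 32, 32, 32);
```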

09 Inference — Classifying One CIFAR-10 Image

Click 🎲 random CIFAR-10 image to generate a small 32 × 32 RGB-style sample. We simulate inference with a compact CNN and show how activations flow through Conv1 → ReLU/Pool → Conv2 → FC → Softmax, mirroring the GPU execution path on the RTX 5080.

Input: 32 × 32 RGB image

Note: the images and logits here are procedurally generated to visualize the CNN hardware pipeline. This is a teaching demo, not a real CIFAR-10 benchmark.

Hardware pipeline & activations

[Inference pipeline schematic — 32×32×3 input → Conv1 (3→32, 32 maps) → ReLU/Pool → Conv2 (32→64, 64 maps) → FC (4096→10, 10 logits) → softmax output over the CIFAR-10 classes.]
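The closing softmax over 10 logits is tiny on this hardware: one numerically-stabilized exp/sum/normalize pass, with __expf served by the SM's special-function units. A minimal sketch (launch with exactly CLASSES threads; names are illustrative):

```cuda
#include <cuda_runtime.h>

#define CLASSES 10  // CIFAR-10

// Single-block softmax: subtract the max before exponentiating for numerical
// stability; __expf maps to the SM's special-function units.
__global__ void softmax10(const float* logits, float* probs) {
    __shared__ float buf[CLASSES];
    int i = threadIdx.x;            // one thread per class; launch 10 threads

    float m = logits[0];            // serial max: fine for 10 values
    for (int k = 1; k < CLASSES; ++k) m = fmaxf(m, logits[k]);

    buf[i] = __expf(logits[i] - m);
    __syncthreads();

    float sum = 0.0f;
    for (int k = 0; k < CLASSES; ++k) sum += buf[k];
    probs[i] = buf[i] / sum;
}
// Launch: softmax10<<<1, CLASSES>>>(d_logits, d_probs);
```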

Sources & Further Reading