// DOC.ID: NV-CUDA-ARCH-0xA100 STATUS: COMPILED REV 4.6 / 2026

PARALLEL
by DESIGN.

A technical walkthrough of how GPU hardware keeps thousands of CUDA cores busy simultaneously to accelerate AI workloads, and why a CPU, with its handful of fat cores, fundamentally cannot keep up.

01 · Topology

A few smart cores vs. ten thousand simple ones.

x86 / Latency-Optimized

CPU · Central Processing Unit

CORE 0 · ALU + L1 + BRANCH
CORE 1 · ALU + L1 + BRANCH
CORE 2 · ALU + L1 + BRANCH
CORE 3 · ALU + L1 + BRANCH

Cores: 4–64 · Clock: 5.0 GHz · Cache/core: ~1 MB · Optimized for: Latency
CUDA / Throughput-Optimized

GPU · Graphics Processing Unit · AI Accelerator

CUDA Cores: 10,752 · Tensor Cores: 432 · SMs: 132 · Optimized for: Throughput
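The topology figures above vary from device to device; they can be queried at runtime through the CUDA runtime API. A minimal sketch using the standard `cudaGetDeviceProperties` call (the printed values depend on the installed GPU; note that CUDA cores per SM are an architecture-generation detail the API does not report directly):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // query device 0

    // SM count and per-SM resources are reported directly.
    std::printf("Name             : %s\n", prop.name);
    std::printf("SMs              : %d\n", prop.multiProcessorCount);
    std::printf("Warp size        : %d\n", prop.warpSize);
    std::printf("Shared mem / SM  : %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    std::printf("32-bit regs / SM : %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```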
02 · Execution Model

Sequential threads vs. SIMT swarms.

CPU · Serial: ~4 ops/cycle
GPU · SIMT: ~10⁴ ops/cycle
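The SIMT model maps directly onto CUDA source: every thread executes the same kernel body on its own data index. A minimal sketch (kernel and launcher names are illustrative):

```cuda
#include <cuda_runtime.h>

// SIMT: every thread runs the same instruction stream on its own lane.
// Where a CPU loops over n elements, the GPU launches n threads that
// each handle "one iteration" in parallel.
__global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global lane index
    if (i < n)                                       // guard the ragged tail
        y[i] = a * x[i] + y[i];
}

void launch_axpy(float a, const float* x, float* y, int n) {
    int block = 256;                      // 8 warps of 32 threads per block
    int grid  = (n + block - 1) / block;  // enough blocks to cover n
    axpy<<<grid, block>>>(a, x, y, n);
}
```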
03 · The Streaming Multiprocessor

Inside an SM.

SM // The Building Block

A modern GPU is a fabric of Streaming Multiprocessors. Each SM bundles dozens of CUDA cores, special-function units, tensor cores, register files, and shared memory into one tightly coupled compute tile.

Threads are dispatched in groups of 32 — a warp — and execute the same instruction on different data lanes (SIMT).

  • 64–128 CUDA cores per SM
  • 4 Tensor Cores per SM (FP16/BF16/FP8 GEMM)
  • 256 KB register file — zero-cost context switch
  • Warp scheduler issues 1 instruction → 32 threads
  • Memory stalls hidden by thousands of in-flight warps
STREAMING_MULTIPROCESSOR // ×16 SHOWN · WARP_SIZE = 32

L1 / TEX: 192 KB · SHARED: 228 KB · REG FILE: 256 KB
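The pieces listed above (warp-wide execution, the register file, shared memory) combine in a classic two-stage block reduction. A sketch assuming the block size is a multiple of 32; `warp_sum` and `block_sum` are illustrative names:

```cuda
#include <cuda_runtime.h>

// Stage 1: the 32 threads of a warp reduce through registers alone,
// using shuffles instead of shared memory or barriers.
__device__ float warp_sum(float v) {
    // Each step folds the upper half of the lanes onto the lower
    // half; after 5 steps lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Stage 2: one partial per warp goes through shared memory, and the
// first warp reduces those partials to a single value per block.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float partial[32];                 // one slot per warp
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;                  // position inside the warp
    int warp = threadIdx.x / 32;                  // warp index inside the block

    float v = (i < n) ? in[i] : 0.0f;
    v = warp_sum(v);                              // intra-warp, registers only
    if (lane == 0) partial[warp] = v;
    __syncthreads();                              // publish partials block-wide

    if (warp == 0) {
        v = (lane < blockDim.x / 32) ? partial[lane] : 0.0f;
        v = warp_sum(v);
        if (lane == 0) out[blockIdx.x] = v;       // one sum per block
    }
}
```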
04 · The AI Workload

Why this matters: GEMM.

Matrix Multiplication on Tensor Cores

C = A · B → ~90% of all transformer FLOPs
MATRIX A (activations) × MATRIX B (weights) = MATRIX C (output)

1 cycle · Tensor Core: 4×4×4 fused multiply-add
312 TFLOP/s · FP16 throughput per device
~250× · vs. equivalent CPU GEMM
05 · Memory Subsystem

Feeding the beast.

REGISTERS · bandwidth ~20 TB/s · latency 1 cycle
SHARED / L1 · bandwidth ~19 TB/s · latency ~30 cycles
L2 CACHE · bandwidth ~5 TB/s · latency ~200 cycles
HBM3 · bandwidth 3.35 TB/s · latency ~500 cycles
HOST DRAM · bandwidth 64 GB/s (PCIe 5.0) · latency ~10⁴ cycles
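One practical consequence of this hierarchy: a warp's 32 loads merge into a few wide HBM transactions only when adjacent threads touch adjacent addresses (coalescing). A sketch contrasting the two access patterns; kernel names are illustrative:

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// one contiguous 128-byte span and merge into few HBM transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: the same warp's loads scatter across many cache lines, so
// effective bandwidth drops even though the instruction count is equal.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```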