// DOC.ID: NV-CUDA-ARCH-0xA100 STATUS: COMPILED REV 4.6 / 2026

PARALLEL
by DESIGN.

A technical walkthrough of how GPU hardware keeps thousands of CUDA cores busy simultaneously to accelerate AI workloads, and why a CPU, with its handful of fat cores, fundamentally cannot keep up.

01 · Topology

A few smart cores vs. ten thousand simple ones.

x86 / Latency-Optimized

CPU · Central Processing Unit

CORE 0 · ALU + L1 + BRANCH
CORE 1 · ALU + L1 + BRANCH
CORE 2 · ALU + L1 + BRANCH
CORE 3 · ALU + L1 + BRANCH

Cores: 4–64 · Clock: 5.0 GHz · Cache/core: ~1 MB · Optimized for: Latency
CUDA / Throughput-Optimized

GPU · Graphics Processing Unit · AI Accelerator

CUDA Cores: 10,752 · Tensor Cores: 432 · SMs: 132 · Optimized for: Throughput
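The topology figures above vary from device to device; they can be queried at runtime through the CUDA runtime API. A minimal sketch using the standard `cudaGetDeviceProperties` call (the printed values depend on the installed GPU; note that CUDA cores per SM are an architecture-generation detail the API does not report directly):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // query device 0

    // SM count and per-SM resources are reported directly.
    std::printf("Name             : %s\n", prop.name);
    std::printf("SMs              : %d\n", prop.multiProcessorCount);
    std::printf("Warp size        : %d\n", prop.warpSize);
    std::printf("Shared mem / SM  : %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    std::printf("32-bit regs / SM : %d\n", prop.regsPerMultiprocessor);
    return 0;
}
```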
02 · Execution Model

Sequential threads vs. SIMT swarms.

CPU · Serial: ~4 ops/cycle
GPU · SIMT: ~10⁴ ops/cycle
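The SIMT model maps directly onto CUDA source: every thread executes the same kernel body on its own data index. A minimal sketch (kernel and launcher names are illustrative):

```cuda
#include <cuda_runtime.h>

// SIMT: every thread runs the same instruction stream on its own lane.
// Where a CPU loops over n elements, the GPU launches n threads that
// each handle "one iteration" in parallel.
__global__ void axpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global lane index
    if (i < n)                                       // guard the ragged tail
        y[i] = a * x[i] + y[i];
}

void launch_axpy(float a, const float* x, float* y, int n) {
    int block = 256;                      // 8 warps of 32 threads per block
    int grid  = (n + block - 1) / block;  // enough blocks to cover n
    axpy<<<grid, block>>>(a, x, y, n);
}
```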
03 · The Streaming Multiprocessor

Inside an SM.

SM // The Building Block

A modern GPU is a fabric of Streaming Multiprocessors. Each SM bundles dozens of CUDA cores, special-function units, tensor cores, register files, and shared memory into one tightly coupled compute tile.

Threads are dispatched in groups of 32 — a warp — and execute the same instruction on different data lanes (SIMT).

  • 64–128 CUDA cores per SM
  • 4 Tensor Cores per SM (FP16/BF16/FP8 GEMM)
  • 256 KB register file — zero-cost context switch
  • Warp scheduler issues 1 instruction → 32 threads
  • Memory stalls hidden by thousands of in-flight warps
STREAMING_MULTIPROCESSOR // ×16 SHOWN · WARP_SIZE = 32

L1 / TEX: 192 KB · SHARED: 228 KB · REG FILE: 256 KB
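The pieces listed above (warp-wide execution, the register file, shared memory) combine in a classic two-stage block reduction. A sketch assuming the block size is a multiple of 32; `warp_sum` and `block_sum` are illustrative names:

```cuda
#include <cuda_runtime.h>

// Stage 1: the 32 threads of a warp reduce through registers alone,
// using shuffles instead of shared memory or barriers.
__device__ float warp_sum(float v) {
    // Each step folds the upper half of the lanes onto the lower
    // half; after 5 steps lane 0 holds the sum of all 32 lanes.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, offset);
    return v;
}

// Stage 2: one partial per warp goes through shared memory, and the
// first warp reduces those partials to a single value per block.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float partial[32];                 // one slot per warp
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;                  // position inside the warp
    int warp = threadIdx.x / 32;                  // warp index inside the block

    float v = (i < n) ? in[i] : 0.0f;
    v = warp_sum(v);                              // intra-warp, registers only
    if (lane == 0) partial[warp] = v;
    __syncthreads();                              // publish partials block-wide

    if (warp == 0) {
        v = (lane < blockDim.x / 32) ? partial[lane] : 0.0f;
        v = warp_sum(v);
        if (lane == 0) out[blockIdx.x] = v;       // one sum per block
    }
}
```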
04 · The AI Workload

Why this matters: GEMM.

Matrix Multiplication on Tensor Cores

C = A · B → ~90% of all transformer FLOPs
MATRIX A (activations) × MATRIX B (weights) = MATRIX C (output)

1 cycle · Tensor Core: 4×4×4 fused multiply-add
312 TFLOP/s · FP16 throughput per device
~250× · vs. equivalent CPU GEMM
05 · Memory Subsystem

Feeding the beast.

REGISTERS · bandwidth ~20 TB/s · latency 1 cycle
SHARED / L1 · bandwidth ~19 TB/s · latency ~30 cycles
L2 CACHE · bandwidth ~5 TB/s · latency ~200 cycles
HBM3 · bandwidth 3.35 TB/s · latency ~500 cycles
HOST DRAM · bandwidth 64 GB/s (PCIe 5.0) · latency ~10⁴ cycles
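One practical consequence of this hierarchy: a warp's 32 loads merge into a few wide HBM transactions only when adjacent threads touch adjacent addresses (coalescing). A sketch contrasting the two access patterns; kernel names are illustrative:

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads element i, so a warp's 32 loads fall in
// one contiguous 128-byte span and merge into few HBM transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: the same warp's loads scatter across many cache lines, so
// effective bandwidth drops even though the instruction count is equal.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```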