Memory Hierarchy for AI

AI performance and AI security are both shaped by where tensors live and how often they move. Registers, SRAM, on-chip caches, HBM or DRAM, and host memory do not merely differ in speed; they define observability, contention, spill behavior, and trust boundaries.

Overview

What this topic covers

Modern AI execution spans several memory tiers. Small local structures near compute provide low latency and high reuse, while larger off-chip tiers hold model weights, activations, KV-cache state, and intermediate tensors that cannot stay close to the compute array. The entire stack tries to maximize reuse near the compute blocks and minimize expensive movement.
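The payoff of keeping data near compute can be made concrete with a back-of-envelope traffic model. The sketch below is illustrative only (not a real profiler), using a classic blocking estimate for a matrix multiply; tile size and dtype width are assumptions.

```python
# Illustrative sketch: estimate off-chip traffic for an M x K @ K x N
# matmul with and without tiling. The point is that reuse near compute
# cuts expensive movement by orders of magnitude.

def naive_traffic_bytes(M, K, N, dtype_bytes=2):
    # Without reuse, every multiply-accumulate refetches both operands.
    return 2 * M * N * K * dtype_bytes

def tiled_traffic_bytes(M, K, N, tile=128, dtype_bytes=2):
    # With tile x tile blocks held in SRAM, each element of A is fetched
    # once per column tile of B, and vice versa (classic blocking bound).
    return (M * K * (N // tile) + K * N * (M // tile)) * dtype_bytes

naive = naive_traffic_bytes(1024, 1024, 1024)
tiled = tiled_traffic_bytes(1024, 1024, 1024)
print(f"naive: {naive / 1e9:.2f} GB, tiled: {tiled / 1e9:.3f} GB")
```

With these assumed dimensions the tiled schedule moves roughly two orders of magnitude fewer bytes, which is exactly the reuse the hierarchy exists to exploit.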

From a security perspective, memory movement is often more revealing than arithmetic itself. Cache residency, burst length, reuse distance, bank conflicts, bandwidth saturation, paging, and host-device copies can expose model structure, sequence length, batch size, runtime phase, or multi-tenant interference. A memory-aware threat model is therefore central to hardware-aware AI security.
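To see how movement alone can be revealing, consider KV-cache traffic during autoregressive decoding. The sketch below uses hypothetical model dimensions and reads no real hardware counters; it only shows that a coarse bytes-moved trace grows with sequence state.

```python
# Illustrative sketch: per-step KV-cache traffic during decoding leaks
# sequence position even if the data itself is never observed.
# Model dimensions below are assumptions for illustration.

def kv_decode_bytes(seq_len, d_model=4096, layers=32, dtype_bytes=2):
    # Each decode step reads the whole K and V caches and appends one token.
    return (seq_len + 1) * 2 * d_model * layers * dtype_bytes

# Traffic grows linearly per generated token, so the trace alone reveals
# how far generation has progressed.
trace = [kv_decode_bytes(s) for s in range(512, 517)]
print([t // 1024 for t in trace])  # strictly increasing KiB per step
```

The arithmetic per attention step is regular, but the movement profile is not: that asymmetry is what makes a memory-aware threat model necessary.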

Why it matters

Security significance

  • Data movement often dominates energy and latency, making it a natural measurement surface.
  • Spill behavior and cache pressure can expose model size, sequence length, and execution phase.
  • HBM, DRAM, and host-memory transfers broaden the attack surface beyond on-chip arithmetic.
  • Understanding memory tiers is essential for both performance analysis and leakage interpretation.
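The first bullet can be quantified with rough per-operation energy figures. The numbers below are textbook-scale assumptions (not measurements for any specific chip), but the ratio they illustrate is robust: an off-chip access costs far more than a multiply-accumulate.

```python
# Back-of-envelope sketch of why movement dominates energy. All figures
# are rough, assumed values in picojoules, for illustration only.

ENERGY_PJ = {
    "mac": 1.0,          # one multiply-accumulate
    "sram_read": 5.0,    # on-chip buffer access
    "dram_read": 640.0,  # off-chip access
}

# Even modest off-chip traffic can dwarf the compute energy of a layer:
compute_pj = 1e6 * ENERGY_PJ["mac"]        # a million MACs
movement_pj = 1e4 * ENERGY_PJ["dram_read"] # only ten thousand DRAM reads
print(f"compute: {compute_pj:.0f} pJ, movement: {movement_pj:.0f} pJ")
```

Because movement carries most of the energy and latency, it is also what power, timing, and bandwidth probes see most clearly.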
[Figure: Memory hierarchy for AI execution, from registers to host memory, with its security observation points. Placement and movement together shape both performance bottlenecks and leakage opportunities.]

Key concepts

Key memory tiers

Each tier contributes differently to throughput, latency hiding, and the visibility of an AI workload.

Registers and local SRAM

These structures sit closest to the MAC or tensor units and support tile-level reuse. They are difficult to observe directly from software, but their occupancy and refill behavior strongly influence cycle-level timing.
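Occupancy and refill behavior at this tier are often analyzed through reuse distance: the number of distinct addresses touched between two accesses to the same data. A minimal sketch over a toy access trace (names and trace are invented):

```python
# Sketch: reuse distance over a toy access trace. Short distances mean
# data can stay resident in registers/SRAM; long ones force refills,
# which is what shapes cycle-level timing at this tier.

def reuse_distances(trace):
    last_seen = {}
    dists = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Count distinct addresses touched since the previous access.
            between = set(trace[last_seen[addr] + 1 : i])
            dists.append(len(between))
        last_seen[addr] = i
    return dists

print(reuse_distances(["a", "b", "a", "c", "b"]))  # -> [1, 2]
```

Tiling, in this view, is a transformation that shortens reuse distances until the working set fits in the nearest tier.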

On-chip cache and shared fabric buffers

Caches and shared buffers mediate access among compute engines, DMA, and coherence or fabric logic. They create contention points where co-scheduled tasks or tenants can indirectly influence one another.
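The contention effect can be modeled with a tiny shared-cache simulation in the spirit of a prime-and-probe measurement. This is purely a software model with invented capacities and addresses, not an attack on real hardware:

```python
# Simulated sketch: a small LRU "shared cache" model shows how one
# tenant's working set evicts another's lines, turning cache occupancy
# into an observable signal.

from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True   # hit
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict least recently used
        return False      # miss

cache = LRUCache(capacity=4)
probe_set = ["p0", "p1", "p2", "p3"]
for a in probe_set:                 # observer primes the shared cache
    cache.access(a)
for a in ["v0", "v1", "v2"]:        # co-tenant activity evicts probe lines
    cache.access(a)
evicted = [a for a in probe_set if a not in cache.lines]
print(evicted)  # eviction count reflects the co-tenant's footprint
```

The observer never reads the co-tenant's data; residency alone carries the information.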

HBM, DRAM, and host memory

Large memory tiers hold bulk model state and overflow tensors. They often dominate off-chip bandwidth, reveal burst structure, and become visible to system software, memory controllers, or interconnect monitors.
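Because burst duration scales with resident state, a bandwidth-level observer can estimate model size from timing alone. The sketch below assumes an illustrative HBM bandwidth figure; both the bandwidth and the parameter counts are hypothetical inputs, not measurements.

```python
# Rough sketch: time to stream a model's weights once from off-chip
# memory, as a function of parameter count. The bandwidth figure is an
# assumption for illustration, not a datasheet value.

def weight_stream_ms(params_billion, dtype_bytes=2, hbm_gb_per_s=3000):
    total_bytes = params_billion * 1e9 * dtype_bytes
    return total_bytes / (hbm_gb_per_s * 1e9) * 1e3

for p in (7, 13, 70):
    print(f"{p}B params -> ~{weight_stream_ms(p):.1f} ms per full weight pass")
```

Distinct, repeatable burst lengths of this kind are one reason model scale is hard to hide from memory controllers or interconnect monitors.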

Security lens

Memory-centric security observations

A good AI threat model should be able to explain what memory effects an attacker might observe or manipulate.

Leakage through movement

Repeated transfers can reveal prompt length, batch size, sequence growth, or operator mix. Even when arithmetic stays constant, the movement profile may change enough to identify the workload phase.
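Phase identification from a movement profile can be as simple as a threshold on per-step traffic. The trace values and threshold below are invented for illustration; the point is that no arithmetic observation is needed.

```python
# Toy sketch: classifying runtime phase from a bytes-moved trace alone.
# Traffic values and the threshold are hypothetical.

def classify_phase(bytes_moved, threshold=1e8):
    # Prefill moves the whole prompt's worth of state; decode steps are small.
    return "prefill" if bytes_moved > threshold else "decode"

trace = [4.2e8, 9.0e6, 9.1e6, 9.3e6]  # hypothetical per-step traffic
print([classify_phase(b) for b in trace])
# -> ['prefill', 'decode', 'decode', 'decode']
```

A real classifier would use richer features (burst spacing, reuse distance, bank-conflict rates), but even this one-feature version separates the phases.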

Faults through stress and placement

Timing stress, row disturbances, buffer corruption, or voltage faults can have very different impact depending on whether the target state resides in SRAM, cache, HBM, or host memory.

Isolation through residency control

Defensive design must consider buffer ownership, zeroization, memory partitioning, cache flushing, and what happens when unsupported operators force data into new memory domains.
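Zeroization is the most concrete of these controls. A minimal sketch of overwriting a buffer in place (the function name is hypothetical, and Python cannot express the cache-flush and anti-elision guarantees a real implementation needs):

```python
# Defensive sketch: explicitly zeroize a buffer before it leaves a trust
# domain, e.g. when an unsupported operator forces a spill to host memory.
# A production version must also defeat compiler elision and flush caches,
# which Python cannot express; this only shows the in-place overwrite.

import ctypes

def zeroize(buf: bytearray) -> None:
    # Overwrite the buffer's backing storage in place.
    ctypes.memset((ctypes.c_char * len(buf)).from_buffer(buf), 0, len(buf))

secret = bytearray(b"kv-cache tile")
zeroize(secret)
print(all(b == 0 for b in secret))  # -> True
```

The harder design questions, such as who owns a buffer after an operator fallback and which tier its copies landed in, have no one-line answer, which is why residency control belongs in the threat model rather than as an afterthought.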
