VMEM (Vector Memory)

Vector Memory (VMEM) is an on-chip scratchpad memory in TPUs that serves as a high-bandwidth intermediary between HBM and compute units like the MXU and VPU.

Key Characteristics

  • Capacity:

    • TPU v5e: 128 MiB per core
    • Much smaller than HBM (which is typically 16-96 GB)
  • Bandwidth:

    • ~22× higher bandwidth to compute units than HBM
    • Critical for reducing memory bandwidth bottlenecks
    • Enables compute-bound operation at lower arithmetic intensities
  • Programmer Control:

    • Explicitly managed by software (unlike automatic CPU caches)
    • Requires deliberate loading and unloading of tensors (illustrated by the sketch after this list)
    • Enables fine-grained optimization of memory access patterns
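
Because VMEM is software-managed, kernels written with frameworks such as Pallas state explicitly which block of each array is staged from HBM into VMEM at every grid step. The sketch below is a minimal illustration rather than anything from this document: it assumes a recent jax.experimental.pallas API (where BlockSpec takes the block shape followed by the index map), and the kernel name, shapes, and block sizes are arbitrary examples.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # x_ref, y_ref, and o_ref are (128, 256) blocks that pallas_call has already
    # staged from HBM into VMEM; the kernel body only touches VMEM-resident data.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    block = pl.BlockSpec((128, 256), lambda i: (i, 0))  # block shape + which block each grid step sees
    return pl.pallas_call(
        add_kernel,
        grid=(x.shape[0] // 128,),  # iterate over row blocks
        in_specs=[block, block],
        out_specs=block,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.ones((1024, 256), jnp.float32)
y = jnp.ones((1024, 256), jnp.float32)
print(add(x, y).sum())  # 524288.0
```

The BlockSpecs are what make the HBM → VMEM → compute → HBM flow described in the next section explicit: the pipeline copies one block per grid step into VMEM, runs the kernel on it, and writes the result block back out.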

Role in TPU Architecture

  • Data Flow:

    • HBM → VMEM → compute units (MXU/VPU) → VMEM → HBM
    • All computation operates on data in VMEM, never directly from HBM
  • Storage Hierarchy:

    • Conceptually similar to L1/L2 cache in CPUs
    • But much larger and programmer-controlled
    • Feeds the Vector Registers (VREGs), which directly interface with the compute units
  • Performance Impact:

    • Operations that read their data from VMEM need an arithmetic intensity of only ~10-20 FLOPs/byte to be compute-bound (see the calculation after this list)
    • Operations that read their data from HBM need ~240 FLOPs/byte to be compute-bound
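
A quick back-of-envelope calculation makes the contrast concrete. The peak-FLOPs and bandwidth figures below are illustrative, roughly TPU v5e-class numbers; exact values vary by chip and generation.

```python
peak_flops     = 1.97e14             # assumed bf16 peak, FLOPs/s (v5e-class)
hbm_bandwidth  = 8.1e11              # assumed HBM bandwidth, bytes/s
vmem_bandwidth = 22 * hbm_bandwidth  # ~22x HBM, per the figure above

# Critical arithmetic intensity = FLOPs that must be performed per byte moved
# before the chip becomes compute-bound rather than bandwidth-bound.
print(peak_flops / hbm_bandwidth)    # ~243 FLOPs/byte when streaming from HBM
print(peak_flops / vmem_bandwidth)   # ~11 FLOPs/byte when operands sit in VMEM
```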

Optimization Strategies

  • Weight Prefetching:

    • Load weights ahead of time while other operations execute
    • Hides the weight-loading cost that would otherwise dominate when the operation is memory-bandwidth-bound
    • Example: load the feed-forward weights into VMEM while the attention computation is still running
  • Data Reuse:

    • Keep frequently accessed tensors in VMEM
    • Minimize redundant loading from HBM
    • Particularly valuable for weights used across multiple batches
  • Memory Management:

    • Strategic allocation to maximize utilization of limited VMEM space
    • Prioritize tensors with highest reuse potential
    • Challenge: fitting model parameters in the limited capacity (see the sizing sketch after this list)
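
As a concrete illustration of the capacity constraint, the hypothetical check below asks whether one transformer layer's feed-forward weights fit in a 128 MiB VMEM. The model dimensions and the bf16 assumption are made up for the example.

```python
vmem_bytes      = 128 * 2**20        # 128 MiB capacity, per the figure above
d_model, d_ff   = 4096, 16384        # hypothetical example dimensions
bytes_per_param = 2                  # bf16

# Up-projection (d_model x d_ff) plus down-projection (d_ff x d_model).
ffw_bytes = 2 * d_model * d_ff * bytes_per_param
print(f"{ffw_bytes / 2**20:.0f} MiB; fits in VMEM: {ffw_bytes <= vmem_bytes}")
# -> 256 MiB; fits in VMEM: False, so only a block at a time can be resident.
```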

Limitations

  • Size Constraints: Limited capacity means selective use for critical tensors
  • Management Overhead: Requires explicit programming compared to automatic caches
  • Sharing: No direct sharing between cores (must go through HBM)

Impact on Model Design

The availability of VMEM significantly influences model architecture and sharding decisions:

  • Keeping weight and activation tensors (or at least their working blocks) within VMEM capacity improves efficiency
  • Balancing compute, HBM bandwidth, and VMEM utilization is critical for optimal performance
  • Sharding strategies often aim to keep each device's working set for the current operation within VMEM