VMEM (Vector Memory)
Vector Memory (VMEM) is an on-chip scratchpad memory in TPUs that serves as a high-bandwidth intermediary between HBM and compute units like the MXU and VPU.
Key Characteristics
- Capacity:
  - TPU v5e: 128 MiB per core
  - Much smaller than HBM (which is typically 16-96 GB)
- Bandwidth:
  - ~22× higher bandwidth to the compute units than HBM
  - Critical for reducing memory-bandwidth bottlenecks
  - Enables compute-bound operation at lower arithmetic intensities
- Programmer Control:
  - Explicitly managed by software (unlike automatic CPU caches)
  - Requires deliberate loading and unloading of tensors (see the sketch after this list)
  - Enables fine-grained optimization of memory access patterns
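A minimal sketch of what this explicit management looks like from JAX, using the Pallas TPU extension (the shapes and the add kernel are illustrative assumptions, and a recent `jax.experimental.pallas` API is assumed). With no grid specified, `pallas_call` stages the full operands from HBM into VMEM, the kernel body reads and writes the VMEM-resident references, and the result is copied back to HBM:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # x_ref, y_ref, o_ref point at VMEM-resident buffers staged by pallas_call.
    o_ref[...] = x_ref[...] + y_ref[...]

x = jnp.ones((1024, 1024), jnp.bfloat16)  # 2 MiB each, comfortably within VMEM
y = jnp.ones((1024, 1024), jnp.bfloat16)

z = pl.pallas_call(
    add_kernel,
    out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),  # output buffer, also staged via VMEM
)(x, y)
```

If the operands were too large to fit, they would instead have to be processed block by block, as in the prefetching sketch further below.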
Role in TPU Architecture
- Data Flow:
  - HBM → VMEM → compute units (MXU/VPU) → VMEM → HBM
  - All computation operates on data in VMEM, never directly on data in HBM
- Storage Hierarchy:
  - Conceptually similar to an L1/L2 cache in a CPU
  - But much larger and explicitly programmer-controlled
  - Feeds the vector registers (VREGs) that directly interface with the compute units
- Performance Impact:
  - Operations reading their data from VMEM need an arithmetic intensity of only ~10-20 FLOPs/byte to be compute-bound
  - Operations reading their data from HBM need an arithmetic intensity of ~240 FLOPs/byte to be compute-bound (see the sketch after this list)
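As a back-of-the-envelope check of those thresholds (the peak-FLOPs and HBM-bandwidth figures below are rough, assumed TPU v5e numbers; the 22× ratio comes from the characteristics above):

```python
peak_flops = 1.97e14      # assumed TPU v5e peak bf16 FLOPs/s
hbm_bw     = 8.2e11       # assumed HBM bandwidth in bytes/s
vmem_bw    = 22 * hbm_bw  # ~22x HBM, per the characteristics above

# Minimum arithmetic intensity (FLOPs per byte moved) to stay compute-bound:
print(peak_flops / hbm_bw)   # ~240 when streaming operands from HBM
print(peak_flops / vmem_bw)  # ~11 when operands already sit in VMEM
```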
Optimization Strategies
- Weight Prefetching:
  - Load weights ahead of time while other operations execute
  - Masks the loading cost when memory-bandwidth bound
  - Example: load feed-forward weights into VMEM while the attention computation runs (see the first sketch after this list)
- Data Reuse:
  - Keep frequently accessed tensors resident in VMEM
  - Minimize redundant loading from HBM
  - Particularly valuable for weights reused across multiple batches (see the second sketch after this list)
- Memory Management:
  - Strategic allocation to maximize utilization of the limited VMEM space
  - Prioritize tensors with the highest reuse potential
  - Challenge: fitting model parameters into the limited capacity
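A sketch of how such prefetching is commonly expressed with Pallas on TPU (the block size, shapes, and the matmul itself are illustrative assumptions; a recent `jax.experimental.pallas` API with `BlockSpec(block_shape, index_map)` is assumed). Once a grid and block specs are given, `pallas_call` pipelines the kernel, so the HBM→VMEM copy of the next block overlaps with compute on the current one:

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def matmul_kernel(x_ref, w_ref, o_ref):
    # Blocks arrive already staged in VMEM; Pallas double-buffers them so the
    # copy of the next block from HBM overlaps with this block's MXU work.
    o_ref[...] = jnp.dot(x_ref[...], w_ref[...],
                         preferred_element_type=jnp.float32)

def blocked_matmul(x, w, block=512):
    m, k = x.shape
    _, n = w.shape
    return pl.pallas_call(
        matmul_kernel,
        out_shape=jax.ShapeDtypeStruct((m, n), jnp.float32),
        grid=(m // block, n // block),
        in_specs=[
            pl.BlockSpec((block, k), lambda i, j: (i, 0)),  # activation block
            pl.BlockSpec((k, block), lambda i, j: (0, j)),  # weight block
        ],
        out_specs=pl.BlockSpec((block, block), lambda i, j: (i, j)),
    )(x, w)

out = blocked_matmul(jnp.ones((2048, 1024), jnp.bfloat16),
                     jnp.ones((1024, 2048), jnp.bfloat16))
```

The same pipelining is also why only a block-sized slice of the weights needs to occupy VMEM at any moment.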
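The batch-reuse point can also be made with simple roofline arithmetic (the layer dimensions below are arbitrary assumptions): the more batch elements share the same weights, the more FLOPs are performed per byte of weights loaded from HBM.

```python
def matmul_intensity(B, D=8192, F=32768, bytes_per_elem=2):
    """FLOPs per HBM byte for a bf16 matmul of a (B, D) batch with a (D, F) weight."""
    flops = 2 * B * D * F
    hbm_bytes = bytes_per_elem * (B * D + D * F + B * F)
    return flops / hbm_bytes

for B in (1, 16, 128, 1024):
    print(B, round(matmul_intensity(B), 1))
# Intensity grows roughly linearly with B while the weight term dominates,
# crossing the ~240 HBM threshold only once B reaches a few hundred.
```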
Limitations
- Size Constraints: Limited capacity means selective use for critical tensors
- Management Overhead: Requires explicit programming compared to automatic caches
- Sharing: No direct sharing between cores (must go through HBM)
Impact on Model Design
The availability of VMEM significantly influences model architecture and sharding decisions:
- Keeping the per-step working set of activations and weights within VMEM capacity improves efficiency
- Balancing compute, HBM, and VMEM utilization is critical for optimal performance
- Aggressive sharding can shrink per-device model portions enough to fit in, or be prefetched into, VMEM (see the sketch below)
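A rough sizing check makes the last point concrete (the 8B-parameter bf16 model is an assumed example; the 128 MiB figure is the per-core VMEM capacity quoted above):

```python
VMEM_BYTES = 128 * 2**20  # TPU v5e per-core VMEM, from the characteristics above

def min_ways_sharded(num_params, bytes_per_param=2, vmem_bytes=VMEM_BYTES):
    """How many ways a model must be sharded before each weight shard fits in VMEM."""
    return num_params * bytes_per_param / vmem_bytes

print(min_ways_sharded(8e9))  # ~119-way sharding for an 8B-parameter bf16 model
```

In practice only the currently active layer's shard (or a block of it) needs to be resident at once, so the real requirement is far looser; the calculation mainly shows why VMEM usually holds blocks of tensors rather than whole models.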