VMEM (Vector Memory)

Vector Memory (VMEM) is an on-chip scratchpad memory in TPUs that serves as a high-bandwidth intermediary between HBM and compute units like the MXU and VPU.

Key Characteristics

  • Capacity:

    • TPU v5e: 128 MiB per core
    • Much smaller than HBM (which is typically 16-96 GB)
  • Bandwidth:

    • ~22× higher bandwidth to compute units than HBM
    • Critical for reducing memory bandwidth bottlenecks
    • Enables compute-bound operation at lower arithmetic intensities
  • Programmer Control:

    • Explicitly managed by software (unlike automatic CPU caches)
    • Requires deliberate loading and unloading of tensors (illustrated by the sketch after this list)
    • Enables fine-grained optimization of memory access patterns
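
Because VMEM is software-managed, kernels written with frameworks such as Pallas state explicitly which block of each array is staged from HBM into VMEM at every grid step. The sketch below is a minimal illustration rather than anything from this document: it assumes a recent jax.experimental.pallas API (where BlockSpec takes the block shape followed by the index map), and the kernel name, shapes, and block sizes are arbitrary examples.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # x_ref, y_ref, and o_ref are (128, 256) blocks that pallas_call has already
    # staged from HBM into VMEM; the kernel body only touches VMEM-resident data.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    block = pl.BlockSpec((128, 256), lambda i: (i, 0))  # block shape + which block each grid step sees
    return pl.pallas_call(
        add_kernel,
        grid=(x.shape[0] // 128,),  # iterate over row blocks
        in_specs=[block, block],
        out_specs=block,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.ones((1024, 256), jnp.float32)
y = jnp.ones((1024, 256), jnp.float32)
print(add(x, y).sum())  # 524288.0
```

The BlockSpecs are what make the HBM → VMEM → compute → HBM flow described in the next section explicit: the pipeline copies one block per grid step into VMEM, runs the kernel on it, and writes the result block back out.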

Role in TPU Architecture

  • Data Flow:

    • HBM → VMEM → compute units (MXU/VPU) → VMEM → HBM
    • All computation operates on data in VMEM, never directly from HBM
  • Storage Hierarchy:

    • Conceptually similar to L1/L2 cache in CPUs
    • But much larger and programmer-controlled
    • Feeds the Vector Registers (VREGs), which directly interface with the compute units
  • Performance Impact:

    • Operations that read their data from VMEM need an arithmetic intensity of only ~10-20 FLOPs/byte to be compute-bound (see the calculation after this list)
    • Operations that read their data from HBM need ~240 FLOPs/byte to be compute-bound
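
A quick back-of-envelope calculation makes the contrast concrete. The peak-FLOPs and bandwidth figures below are illustrative, roughly TPU v5e-class numbers; exact values vary by chip and generation.

```python
peak_flops     = 1.97e14             # assumed bf16 peak, FLOPs/s (v5e-class)
hbm_bandwidth  = 8.1e11              # assumed HBM bandwidth, bytes/s
vmem_bandwidth = 22 * hbm_bandwidth  # ~22x HBM, per the figure above

# Critical arithmetic intensity = FLOPs that must be performed per byte moved
# before the chip becomes compute-bound rather than bandwidth-bound.
print(peak_flops / hbm_bandwidth)    # ~243 FLOPs/byte when streaming from HBM
print(peak_flops / vmem_bandwidth)   # ~11 FLOPs/byte when operands sit in VMEM
```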

Optimization Strategies

  • Weight Prefetching:

    • Load weights ahead of time while other operations execute
    • Hides the weight-loading cost that would otherwise dominate when the operation is memory-bandwidth-bound
    • Example: load the feed-forward weights into VMEM while the attention computation is still running
  • Data Reuse:

    • Keep frequently accessed tensors in VMEM
    • Minimize redundant loading from HBM
    • Particularly valuable for weights used across multiple batches
  • Memory Management:

    • Strategic allocation to maximize utilization of limited VMEM space
    • Prioritize tensors with highest reuse potential
    • Challenge: fitting model parameters in the limited capacity (see the sizing sketch after this list)
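
As a concrete illustration of the capacity constraint, the hypothetical check below asks whether one transformer layer's feed-forward weights fit in a 128 MiB VMEM. The model dimensions and the bf16 assumption are made up for the example.

```python
vmem_bytes      = 128 * 2**20        # 128 MiB capacity, per the figure above
d_model, d_ff   = 4096, 16384        # hypothetical example dimensions
bytes_per_param = 2                  # bf16

# Up-projection (d_model x d_ff) plus down-projection (d_ff x d_model).
ffw_bytes = 2 * d_model * d_ff * bytes_per_param
print(f"{ffw_bytes / 2**20:.0f} MiB; fits in VMEM: {ffw_bytes <= vmem_bytes}")
# -> 256 MiB; fits in VMEM: False, so only a block at a time can be resident.
```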

Limitations

  • Size Constraints: Limited capacity means selective use for critical tensors
  • Management Overhead: Requires explicit programming compared to automatic caches
  • Sharing: No direct sharing between cores (must go through HBM)

Impact on Model Design

The availability of VMEM significantly influences model architecture and sharding decisions:

  • Keeping weight and activation tensors (or at least their working blocks) within VMEM capacity improves efficiency
  • Balancing compute, HBM bandwidth, and VMEM utilization is critical for optimal performance
  • Sharding strategies often aim to keep each device's working set for the current operation within VMEM