Memory-Bound
An algorithm or operation is memory-bound (or memory-limited) when its performance is constrained primarily by memory bandwidth or communication speed rather than by the computational throughput of the hardware.
Characteristics
- Low Arithmetic Intensity: Performs relatively few operations per byte of data accessed
- Partial FLOPs Utilization: Cannot utilize the full theoretical peak FLOPs/s of the hardware
- Waiting States: Processing units frequently wait for data to arrive
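As a rough illustration of low arithmetic intensity, the sketch below counts FLOPs and bytes for an element-wise addition of two float32 vectors. The helper name `arithmetic_intensity` and the vector length are assumptions chosen for the example, and the byte count ignores caching effects.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved

# Element-wise addition c = a + b over n float32 values (illustrative case).
n = 1_000_000
flops = n                  # one addition per element
bytes_moved = 3 * n * 4    # read a, read b, write c; 4 bytes per float32

# 1/12 FLOP per byte, i.e. about 0.083
print(arithmetic_intensity(flops, bytes_moved))
```

At roughly one twelfth of a FLOP per byte, the processing units spend most of their time waiting for data.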
Types of Memory Bounds
- HBM Bound: Limited by transfers between high-bandwidth memory (HBM) and the chip's compute units
- Communication Bound: Limited by inter-chip data transfers (e.g., over ICI, DCN)
- Host-Device Bound: Limited by transfers between CPU and accelerator (e.g., over PCIe)
In Roofline Analysis
In a roofline plot, memory-bound operations fall in the sloped region of the graph, where:
- Attainable performance scales linearly with arithmetic intensity and with memory bandwidth
- Increasing compute capacity yields no performance improvement
- The arithmetic intensity is below the “critical intensity” threshold
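The roofline relationship can be sketched in a few lines: attainable throughput is the smaller of the compute peak and arithmetic intensity times memory bandwidth. The hardware numbers below are approximate figures assumed for a TPU v5e (bfloat16 peak and HBM bandwidth), not values stated in this entry, and `attainable_flops` is a hypothetical helper name.

```python
PEAK_FLOPS = 1.97e14     # assumed bf16 peak, FLOPs/s (approximate)
HBM_BANDWIDTH = 8.2e11   # assumed HBM bandwidth, bytes/s (approximate)

def attainable_flops(intensity: float,
                     peak: float = PEAK_FLOPS,
                     bandwidth: float = HBM_BANDWIDTH) -> float:
    """Roofline model: the lower of the compute roof and the bandwidth slope."""
    return min(peak, intensity * bandwidth)

# Intensity at which the sloped region meets the flat compute roof.
critical_intensity = PEAK_FLOPS / HBM_BANDWIDTH
print(f"critical intensity: {critical_intensity:.0f} FLOPs/byte")

for intensity in (1, 60, 240, 1000):
    print(f"{intensity:>5} FLOPs/byte -> {attainable_flops(intensity):.3e} FLOPs/s")
```

Below the critical intensity, the result depends only on bandwidth, so raising `PEAK_FLOPS` changes nothing; above it, the result is capped at the peak.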
Examples
- Vector operations like dot products and element-wise operations
- Matrix multiplications with small batch sizes (B < 240 on TPU v5e)
- Embedding lookups with large vocabulary sizes
- Data preprocessing operations
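To see why small batch sizes make a matrix multiplication memory-bound, the following sketch estimates the arithmetic intensity of X[B, D] @ W[D, F] with 2-byte (bfloat16) elements. The function name, the matrix dimensions, and the simplified byte count (each operand read once, output written once) are assumptions for illustration; the threshold of 240 is the TPU v5e figure quoted in the list above.

```python
def matmul_intensity(b: int, d: int, f: int, bytes_per_elem: int = 2) -> float:
    """Approximate FLOPs per byte for X[b, d] @ W[d, f]."""
    flops = 2 * b * d * f                                 # one multiply and one add per term
    bytes_moved = bytes_per_elem * (b * d + d * f + b * f)  # read X, read W, write output
    return flops / bytes_moved

CRITICAL_INTENSITY = 240  # threshold quoted above for TPU v5e

for batch in (64, 512):
    intensity = matmul_intensity(batch, 8192, 8192)
    regime = "memory-bound" if intensity < CRITICAL_INTENSITY else "compute-bound"
    print(f"B={batch}: {intensity:.1f} FLOPs/byte ({regime})")
```

When B is much smaller than D and F, the weight matrix dominates the bytes moved and the intensity is approximately B, which is why the batch size alone decides the regime.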
Identification
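An operation is memory-bound when its arithmetic intensity falls below the hardware's critical intensity:
$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes Accessed}} < \frac{\text{Peak FLOPs/s}}{\text{Memory Bandwidth}}$$
Or equivalently, when the time spent on computation is less than the time spent moving data:
$$\frac{\text{FLOPs}}{\text{Peak FLOPs/s}} < \frac{\text{Bytes Accessed}}{\text{Memory Bandwidth}}$$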
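One way to make this check concrete is to compare the estimated compute time against the estimated data-movement time. This is a minimal sketch: `classify` is a hypothetical helper, and the accelerator figures in the usage lines are made-up round numbers rather than the specification of any real device.

```python
def classify(flops: float, bytes_moved: float,
             peak_flops: float, bandwidth: float) -> str:
    """Label an operation by whichever estimated time dominates."""
    t_compute = flops / peak_flops       # seconds if compute were the only limit
    t_memory = bytes_moved / bandwidth   # seconds if data movement were the only limit
    return "memory-bound" if t_memory > t_compute else "compute-bound"

# Hypothetical accelerator: 1e14 FLOPs/s peak, 1e12 bytes/s memory bandwidth
# (critical intensity of 100 FLOPs per byte).
print(classify(flops=1e9, bytes_moved=1e9, peak_flops=1e14, bandwidth=1e12))
print(classify(flops=1e12, bytes_moved=1e9, peak_flops=1e14, bandwidth=1e12))
```

The first call describes an operation with 1 FLOP per byte and is classified as memory-bound; the second has 1000 FLOPs per byte and is classified as compute-bound.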
Optimization Strategies
- Data Reuse: Maximize computations performed on each byte loaded from memory
- Tiling/Blocking: Break operations into cache-friendly chunks
- Compression/Quantization: Reduce memory footprint with lower precision or compression
- Fusion: Combine multiple operations to avoid intermediate data transfers
- Asynchronous Prefetching: Load data before it’s needed to hide latency
- Bandwidth Optimization: Use coalesced or sequential memory access patterns
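The effect of fusion on memory traffic can be estimated with simple accounting. The sketch below assumes a chain of element-wise operations on a float32 array, where each unfused operation reads its input from memory and writes its output back, while the fused version reads the input once and writes the final result once. The function name and sizes are illustrative assumptions.

```python
def chain_traffic_bytes(n: int, num_ops: int,
                        bytes_per_elem: int = 4, fused: bool = False) -> int:
    """Bytes moved by a chain of element-wise ops over n elements."""
    passes = 1 if fused else num_ops        # fused: one read and one write in total
    return passes * 2 * n * bytes_per_elem  # each pass reads n and writes n elements

n, num_ops = 1_000_000, 4
flops = num_ops * n  # one FLOP per element per operation

unfused_intensity = flops / chain_traffic_bytes(n, num_ops)
fused_intensity = flops / chain_traffic_bytes(n, num_ops, fused=True)
print(unfused_intensity, fused_intensity)  # 0.125 0.5
```

Under these assumptions, fusing four operations cuts memory traffic by a factor of four and raises arithmetic intensity by the same factor, although the fused chain is still far below typical critical intensities.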
Performance Implications
Being memory-bound often means that expensive computational resources are underutilized. In these cases, adding more compute capacity does not improve performance; only increasing memory bandwidth or reducing memory traffic helps.