Memory-Bound

An algorithm or operation is memory-bound (or memory-limited) when its performance is constrained primarily by memory bandwidth or communication speed rather than by the computational throughput of the hardware.

Characteristics

  • Low Arithmetic Intensity: Performs relatively few operations per byte of data accessed
  • Low FLOPs Utilization: Achieves only a fraction of the hardware's theoretical peak FLOPs/s
  • Waiting States: Processing units frequently wait for data to arrive

Types of Memory Bounds

  • HBM Bound: Limited by transfers between high-bandwidth memory (HBM) and the chip's compute units
  • Communication Bound: Limited by inter-chip data transfers (e.g., over ICI, DCN)
  • Host-Device Bound: Limited by transfers between CPU and accelerator (e.g., over PCIe)

In Roofline Analysis

In a roofline plot, memory-bound operations fall in the sloped region of the graph, where:

  • Attainable performance equals arithmetic intensity times memory bandwidth, so it scales linearly with both
  • Increasing peak compute capacity yields no performance improvement
  • The arithmetic intensity is below the “critical intensity” threshold
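
The two regions of the roofline can be sketched in a few lines of Python. This is a minimal illustration, not a profiling tool: the function names and the hardware figures (PEAK, BW) are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def attainable_flops_per_s(intensity, peak_flops_per_s, bandwidth_bytes_per_s):
    """Roofline model: performance is capped by the lower of the two roofs."""
    # Sloped roof: intensity * bandwidth. Flat roof: peak compute.
    return min(peak_flops_per_s, intensity * bandwidth_bytes_per_s)


def critical_intensity(peak_flops_per_s, bandwidth_bytes_per_s):
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_flops_per_s / bandwidth_bytes_per_s


# Hypothetical accelerator, not a real part.
PEAK = 1.0e14  # FLOPs/s
BW = 1.0e12    # bytes/s

for intensity in (1, 10, 100, 1000):
    perf = attainable_flops_per_s(intensity, PEAK, BW)
    below = intensity < critical_intensity(PEAK, BW)
    regime = "memory-bound" if below else "compute-bound"
    print(f"intensity={intensity:>5} FLOPs/byte  perf={perf:.2e} FLOPs/s  {regime}")
```

In the sloped region, doubling PEAK leaves the result unchanged, while doubling BW doubles it.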

Examples

  • Vector operations like dot products and element-wise operations
  • Matrix multiplications with small batch sizes (B < 240 on TPU v5e)
  • Embedding lookups with large vocabulary sizes
  • Data preprocessing operations
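
To see why the first two examples land on the memory-bound side, it can help to estimate their arithmetic intensity directly. The sketch below assumes 2-byte elements (for example bf16), counts each operand as read once and each output as written once, and takes the threshold of 240 FLOPs/byte from the TPU v5e figure quoted above. The function names and matrix shapes are illustrative assumptions.

```python
def dot_product_intensity(n, bytes_per_element=2):
    """FLOPs per byte for a length-n dot product (scalar output ignored)."""
    flops = 2 * n                            # n multiplies plus about n adds
    bytes_moved = 2 * n * bytes_per_element  # read both input vectors once
    return flops / bytes_moved


def matmul_intensity(b, d, f, bytes_per_element=2):
    """FLOPs per byte for a [b, d] x [d, f] matrix multiplication."""
    flops = 2 * b * d * f
    # Read both inputs once, write the [b, f] output once.
    bytes_moved = bytes_per_element * (b * d + d * f + b * f)
    return flops / bytes_moved


CRITICAL_INTENSITY = 240  # FLOPs/byte, threshold quoted in the list above

print("dot product:", dot_product_intensity(1_000_000))
for batch in (128, 512):
    ai = matmul_intensity(batch, 8192, 8192)
    regime = "memory-bound" if ai < CRITICAL_INTENSITY else "compute-bound"
    print(f"matmul B={batch}: {ai:.1f} FLOPs/byte, {regime}")
```

When d and f are much larger than b, the weight matrix dominates the bytes moved and the intensity approaches b, which is why the condition is usually stated in terms of batch size.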

Identification

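An operation is memory-bound when its arithmetic intensity is below the hardware's critical intensity:

$$\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Bytes accessed}} < \frac{\text{Peak FLOPs/s}}{\text{Memory bandwidth (bytes/s)}}$$

Or equivalently, when the time spent moving data exceeds the time spent computing:

$$T_\text{memory} = \frac{\text{Bytes accessed}}{\text{Memory bandwidth (bytes/s)}} > T_\text{compute} = \frac{\text{FLOPs}}{\text{Peak FLOPs/s}}$$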

Optimization Strategies

  • Data Reuse: Maximize computations performed on each byte loaded from memory
  • Tiling/Blocking: Break operations into cache-friendly chunks
  • Compression/Quantization: Reduce memory footprint with lower precision or compression
  • Fusion: Combine multiple operations to avoid intermediate data transfers
  • Asynchronous Prefetching: Load data before it’s needed to hide latency
  • Bandwidth Optimization: Use coalesced or sequential memory access patterns so each transfer carries mostly useful data
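
The effect of fusion and of lower precision can be estimated with a simple traffic model. The sketch below is a back-of-the-envelope calculation under stated assumptions: a chain of unary element-wise operations, no cache reuse between separate operations, and a hypothetical helper name (chain_traffic_bytes). It does not measure real hardware.

```python
def chain_traffic_bytes(n_elements, n_ops, bytes_per_element=4, fused=False):
    """Estimate bytes moved by a chain of unary element-wise operations."""
    # Unfused: every op reads its input array and writes its output array.
    # Fused: one kernel reads the input once and writes the final output once.
    passes = 1 if fused else n_ops
    return passes * 2 * n_elements * bytes_per_element


N = 1_000_000  # elements in the array
unfused = chain_traffic_bytes(N, n_ops=3)
fused = chain_traffic_bytes(N, n_ops=3, fused=True)
fused_half = chain_traffic_bytes(N, n_ops=3, bytes_per_element=2, fused=True)

print(f"unfused, 4-byte elements: {unfused:,} bytes")
print(f"fused,   4-byte elements: {fused:,} bytes")
print(f"fused,   2-byte elements: {fused_half:,} bytes")
```

Because a memory-bound operation takes roughly bytes moved divided by bandwidth, cutting traffic by a factor of three would be expected to cut runtime by about the same factor under this model.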

Performance Implications

Being memory-bound often means that expensive computational resources sit underutilized. In these cases, adding more compute capacity does not improve performance; only increasing memory bandwidth or reducing memory traffic (for example, by raising arithmetic intensity) can help.