Compute-Bound

An algorithm or operation is said to be compute-bound (or compute-limited) when its performance is primarily constrained by the computational throughput of the hardware rather than by memory bandwidth or communication speed.

Characteristics

  • High Arithmetic Intensity: Performs many operations per byte of data accessed
  • Full FLOPs Utilization: Uses close to 100% of the theoretical peak FLOPs/s of the hardware
  • Minimal Waiting: Processing units rarely wait for data to arrive

In Roofline Analysis

In a roofline plot, compute-bound operations fall in the horizontal region of the graph, where:

  • Performance is capped by the hardware’s peak computational capacity
  • Increasing memory or network bandwidth yields no performance improvement
  • The arithmetic intensity is above the “critical intensity” threshold

Examples

  • Matrix multiplication with large batch sizes (B > 240 on TPU v5e)
  • Deep convolutions with many channels
  • Training large Transformer layers with sufficient batch size

Identification

An operation is compute-bound when:

Or equivalently, when:

Optimization Strategies

  • Algorithmic Optimizations: More efficient algorithms with fewer FLOPs
  • Specialized Hardware: Use of tensor cores, MXUs, or other accelerated compute units
  • Lower Precision: Using reduced precision (e.g., BF16 instead of FP32) to increase throughput
  • Code Optimization: Better use of vectorization, fusion, or specialized kernels
  • Sparsity: Leveraging structured or unstructured sparsity to reduce computation

Relationship to Efficiency

Being compute-bound is generally desirable because it means the hardware is being used efficiently. When an operation is compute-bound, the hardware achieves close to its peak theoretical performance in terms of FLOPs/s.