Compute-Bound
An algorithm or operation is said to be compute-bound (or compute-limited) when its performance is primarily constrained by the computational throughput of the hardware rather than by memory bandwidth or communication speed.
Characteristics
- High Arithmetic Intensity: Performs many operations per byte of data accessed
- Full FLOPs Utilization: Uses close to 100% of the theoretical peak FLOPs/s of the hardware
- Minimal Waiting: Processing units rarely wait for data to arrive
In Roofline Analysis
In a roofline plot, compute-bound operations fall in the horizontal region of the graph, where:
- Performance is capped by the hardware’s peak computational capacity
- Increasing memory or network bandwidth yields no performance improvement
- The arithmetic intensity is above the “critical intensity” threshold
Examples
- Matrix multiplication with large batch sizes (B > 240 on TPU v5e)
- Deep convolutions with many channels
- Training large Transformer layers with sufficient batch size
Identification
An operation is compute-bound when:
Or equivalently, when:
Optimization Strategies
- Algorithmic Optimizations: More efficient algorithms with fewer FLOPs
- Specialized Hardware: Use of tensor cores, MXUs, or other accelerated compute units
- Lower Precision: Using reduced precision (e.g., BF16 instead of FP32) to increase throughput
- Code Optimization: Better use of vectorization, fusion, or specialized kernels
- Sparsity: Leveraging structured or unstructured sparsity to reduce computation
Relationship to Efficiency
Being compute-bound is generally desirable because it means the hardware is being used efficiently. When an operation is compute-bound, the hardware achieves close to its peak theoretical performance in terms of FLOPs/s.