Roofline Model Fundamentals
When running algorithms on hardware, we’re bound by three fundamental constraints:
- Compute Speed: How fast the hardware can do math operations (FLOPs/second)
- Bandwidth: How quickly data can be moved around (bytes/second)
- Memory Capacity: Total available storage for data (bytes)
Together, these three constraints define the “roofline” model, which lets us establish upper and lower bounds on computation time.
Time Estimation Formulas
Computation Time
$T_{\text{math}} = \frac{\text{Computation FLOPs}}{\text{Accelerator FLOPs/s}}$
Example:
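For instance, an operation requiring 1e12 FLOPs on a TPU v5e (peak ~1.97e14 FLOPs/s, the figure used throughout this section) needs at least 1e12 ÷ 1.97e14 ≈ 5 ms of compute time, assuming the compute units are fully utilized; the workload size here is purely illustrative.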
Communication Time
$T_{\text{comms}} = \frac{\text{Communication Bytes}}{\text{Bandwidth (bytes/s)}}$
Communication happens in two main contexts:
- Within a chip: Transfers between high-bandwidth memory (HBM) and the compute cores
- Between chips: When distributing models across multiple accelerators
Time Bounds
- Lower bound: $T \geq \max(T_{\text{math}}, T_{\text{comms}})$ (compute and communication perfectly overlapped)
- Upper bound: $T \leq T_{\text{math}} + T_{\text{comms}}$ (fully serialized, no overlap)
In practice, we optimize toward the lower bound by overlapping communication and computation.
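To make the bounds concrete, here is a minimal Python sketch using the TPU v5e figures quoted later in this section; the example workload numbers are purely illustrative:

```python
# Minimal sketch of the roofline time bounds, assuming TPU v5e-like
# numbers (1.97e14 FLOPs/s peak compute, 8.2e11 bytes/s HBM bandwidth).

PEAK_FLOPS = 1.97e14   # FLOPs/s
HBM_BW = 8.2e11        # bytes/s

def time_bounds(computation_flops: float, communication_bytes: float):
    """Return (lower, upper) runtime bounds in seconds."""
    t_math = computation_flops / PEAK_FLOPS
    t_comms = communication_bytes / HBM_BW
    lower = max(t_math, t_comms)  # compute and comms perfectly overlapped
    upper = t_math + t_comms      # fully serialized, no overlap
    return lower, upper

# Example: 1e12 FLOPs that move 1e10 bytes through HBM (illustrative).
lo, hi = time_bounds(1e12, 1e10)
print(f"lower: {lo * 1e3:.1f} ms, upper: {hi * 1e3:.1f} ms")
```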
Arithmetic Intensity
Definition: The ratio of total FLOPs performed to the number of bytes communicated.
$\text{Arithmetic Intensity} = \frac{\text{Computation FLOPs}}{\text{Communication Bytes}}$
This measure (FLOPs/byte) determines whether an operation will be compute-bound or communication-bound:
- High intensity: Fully utilizes available FLOPs (compute-bound)
- Low intensity: Wastes FLOPs while waiting for data (communication-bound)
The crossover point is the “peak arithmetic intensity” of the hardware:
$\text{Peak Intensity} = \frac{\text{Peak FLOPs/s}}{\text{Peak Bandwidth (bytes/s)}}$
For TPU v5e MXU, this is ~240 FLOPs/byte (1.97e14 FLOPs/s ÷ 8.2e11 bytes/s).
Example: Dot Product
For a dot product operation (x • y), where x and y are bfloat16 vectors of length N: we load 2N bytes for each vector, perform N multiplications and N − 1 additions, and write back 2 bytes:
$\text{Intensity} = \frac{N + (N - 1)}{2N + 2N + 2} \rightarrow \frac{1}{2} \text{ as } N \rightarrow \infty$
With an intensity of ~0.5 FLOPs/byte, dot product operations are communication-bound on essentially all modern hardware.
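A minimal sketch of this calculation in Python (byte counts assume bfloat16, i.e. 2 bytes per element):

```python
def dot_product_intensity(n: int) -> float:
    """Arithmetic intensity of a bf16 dot product of length-n vectors."""
    flops = n + (n - 1)               # n multiplies + (n - 1) adds
    bytes_moved = 2 * n + 2 * n + 2   # read x, read y, write scalar result
    return flops / bytes_moved

for n in [1024, 1 << 20]:
    print(n, dot_product_intensity(n))  # approaches 0.5 as n grows
```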
Visualizing Rooflines
[Figure: roofline plot. Source: How To Scale Your Model, Rooflines]
A roofline plot shows the relationship between arithmetic intensity (x-axis) and achievable performance in FLOPs/s (y-axis). The plot has three regions:
- Red area: Bandwidth-bound at all bandwidths
- Yellow area: Bandwidth-bound at lower bandwidths only
- Green area: Compute-bound at all bandwidths
Performance can be improved by:
- Increasing arithmetic intensity of algorithms
- Increasing available memory bandwidth
Matrix-Multiplication Analysis
For matrix multiplication X * Y → Z where:
- X has shape bf16[B, D]
- Y has shape bf16[D, F]
- Z has shape bf16[B, F]
The computation performs $2BDF$ FLOPs and moves $2BD + 2DF + 2BF$ bytes (bf16 inputs and output). When the local batch size B is small relative to D and F (typical in Transformers), the $2DF$ term dominates the bytes moved:
$\text{Intensity} = \frac{2BDF}{2BD + 2DF + 2BF} \approx \frac{2BDF}{2DF} = B \quad \text{when } B \ll D, F$
Key Takeaway
For a bfloat16 matrix multiplication to be compute-bound on most TPUs, the local batch size in tokens needs to be greater than 240 (~300 for GPUs).
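A short Python sketch of the single-chip matmul roofline, reusing the TPU v5e figures from above. Note that the B ≈ 240 rule is the asymptotic B ≪ D, F approximation, so finite D and F shift the crossover slightly:

```python
PEAK_FLOPS = 1.97e14   # TPU v5e MXU, FLOPs/s
HBM_BW = 8.2e11        # bytes/s

def matmul_intensity(b: int, d: int, f: int) -> float:
    """Arithmetic intensity of a bf16 matmul X[b, d] @ Y[d, f]."""
    flops = 2 * b * d * f                      # multiply-adds
    bytes_moved = 2 * (b * d + d * f + b * f)  # load X, Y; store Z (bf16)
    return flops / bytes_moved

peak_intensity = PEAK_FLOPS / HBM_BW  # ~240 FLOPs/byte
for b in [64, 240, 1024]:
    i = matmul_intensity(b, d=8192, f=8192)
    bound = "compute" if i > peak_intensity else "bandwidth"
    print(f"B={b}: intensity {i:.0f} -> {bound}-bound")
```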
Network Communication Rooflines
When matrices are split across multiple accelerators, we need to consider inter-chip communication rooflines.
e.g. For a bf16 matrix multiplication X[B, D] @ Y[D, F] split along the contracting dimension D across 2 TPUs:
- Each TPU does half the work: $T_{\text{math}} = \frac{2BDF / 2}{C} = \frac{BDF}{C}$, where $C$ is the per-chip peak FLOPs/s
- Each TPU ships its bf16[B, F] partial result to be combined, so the communication time is: $T_{\text{comms}} = \frac{2BF}{W_{\text{ici}}}$, where $W_{\text{ici}}$ is the inter-chip bandwidth
The operation becomes compute-bound when $T_{\text{math}} > T_{\text{comms}}$, i.e. $\frac{BDF}{C} > \frac{2BF}{W_{\text{ici}}}$, or equivalently $D > \frac{2C}{W_{\text{ici}}}$.
Unlike the single-chip case, the critical threshold depends on D, not B. Why? Because D no longer appears in the data being moved, only in the FLOP count, while B and F appear in both the FLOPs and the bytes communicated and therefore cancel out. That leaves D as the sole factor determining the bottleneck.
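A sketch of the two-chip threshold under the assumptions above. The inter-chip (ICI) bandwidth value here is an illustrative assumption, not a spec; substitute your hardware's actual figure:

```python
PEAK_FLOPS = 1.97e14   # per-chip FLOPs/s (TPU v5e)
ICI_BW = 9e10          # assumed inter-chip bandwidth, bytes/s (illustrative)

def two_chip_times(b: int, d: int, f: int):
    """Per-chip compute time vs. partial-result transfer time for a
    bf16 matmul X[b, d] @ Y[d, f] split along D across 2 chips."""
    t_math = (2 * b * d * f / 2) / PEAK_FLOPS  # half the FLOPs per chip
    t_comms = (2 * b * f) / ICI_BW             # ship one bf16[b, f] partial
    return t_math, t_comms

t_math, t_comms = two_chip_times(b=128, d=8192, f=8192)
print(f"T_math = {t_math * 1e6:.0f} us, T_comms = {t_comms * 1e6:.0f} us")

d_threshold = 2 * PEAK_FLOPS / ICI_BW  # compute-bound once D exceeds this
print(f"compute-bound when D > {d_threshold:.0f}")
```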