Roofline Model Practice Questions

Question 1: INT8 Matrix Multiplication

Say we want to do the matmul int8[B, D] * int8[D, F] -> int8[B, F] in int8 precision (1 byte per parameter) instead of bfloat16.

a) How many bytes need to be loaded from memory? How many need to be written back to memory?

b) How many total OPs are performed?

c) What is the arithmetic intensity?

d) What is a roofline estimate for T_math and T_comms? What are reasonable upper and lower bounds for the runtime of the whole operation?

Assume our HBM bandwidth is 8.1e11 bytes/s and our int8 peak OPs/s is 3.94e14.

a. BD + DF bytes loaded, BF bytes written back (int8 is 1 byte per element).

b. Total OPs = 2BDF. The op count is the same as in bfloat16; only the bytes moved change.

c. Arithmetic Intensity = 2BDF / (BD + DF + BF)

d. Roofline: T_math = 2BDF / 3.94e14 and T_comms = (BD + DF + BF) / 8.1e11. A reasonable lower bound on the total runtime is max(T_math, T_comms) (perfect overlap of compute and comms); an upper bound is T_math + T_comms (no overlap).

Assuming B << D, F, the arithmetic intensity simplifies to 2BDF / DF = 2B, and the accelerator's peak intensity is 3.94e14 / 8.1e11 ≈ 487 OPs/byte.

For B > 487/2 ≈ 243, compute bound. For B < 243, comms bound.
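As a sanity check, here is a small Python sketch of this roofline estimate (hardware numbers from the problem statement; the B, D, F values are made-up examples):

```python
# Roofline estimate for int8[B, D] * int8[D, F] -> int8[B, F].
HBM_BW = 8.1e11     # bytes/s
INT8_OPS = 3.94e14  # peak int8 OPs/s

def int8_matmul_roofline(B, D, F):
    total_bytes = B * D + D * F + B * F  # int8: 1 byte per element
    total_ops = 2 * B * D * F
    t_math = total_ops / INT8_OPS
    t_comms = total_bytes / HBM_BW
    # Lower bound assumes perfect overlap of compute and memory traffic;
    # upper bound assumes no overlap at all.
    return max(t_math, t_comms), t_math + t_comms

lo, hi = int8_matmul_roofline(B=128, D=8192, F=8192)
print(f"runtime between {lo:.2e}s and {hi:.2e}s")  # B=128 < 243, so comms-bound
```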

Question 2: INT8 + BF16 Matrix Multiplication

In practice we often do different weight vs. activation quantization, so we might store our weights in very low precision but keep activations (and compute) in a higher precision. Say we want to quantize our weights in int8 but keep activations (and compute) in bfloat16. At what batch size do we become compute bound? Assume 1.97e14 bfloat16 FLOPs/s.

Hint: this means specifically bfloat16[B, D] * int8[D, F] -> bfloat16[B, F], where B is the “batch size”.

FLOPs = 2BDF (compute happens in bfloat16)

Memcpy = 2BD + DF + 2BF bytes (bf16 activations in, int8 weights in, bf16 activations out)

Arithmetic Intensity = 2BDF / (2BD + DF + 2BF)

Assuming B << D and B << F, the DF term dominates, so Arithmetic Intensity ≈ 2BDF / DF = 2B.

We are compute bound when 2B > 1.97e14 / 8.1e11 ≈ 243, i.e. B > 122.
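A quick sketch of the same crossover computation in Python (numbers from the problem statement):

```python
# Crossover batch size for bf16[B, D] * int8[D, F] -> bf16[B, F].
BF16_FLOPS = 1.97e14  # peak bf16 FLOPs/s (compute stays in bf16)
HBM_BW = 8.1e11       # bytes/s

# Arithmetic intensity is ~2B when B << D, F, so we become compute-bound
# once 2B exceeds the hardware's peak intensity.
peak_intensity = BF16_FLOPS / HBM_BW  # ~243 FLOPs/byte
print(f"compute-bound for B > {peak_intensity / 2:.0f}")  # ~122
```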

Question 3: Roofline Plot

For the problem in Question 2, make a roofline plot of peak FLOPs vs. B for several values of D and F.
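One way to make this plot with matplotlib, a sketch assuming the hardware numbers from Question 2 (the particular D and F values are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

BF16_FLOPS = 1.97e14  # peak bf16 FLOPs/s
HBM_BW = 8.1e11       # bytes/s

B = np.arange(1, 2048)
for D, F in [(1024, 1024), (4096, 4096), (16384, 16384)]:
    flops = 2 * B * D * F
    bytes_moved = 2 * B * D + D * F + 2 * B * F  # bf16 activations, int8 weights
    # Achieved throughput is capped by compute or by bandwidth * intensity.
    peak = np.minimum(BF16_FLOPS, HBM_BW * flops / bytes_moved)
    plt.plot(B, peak, label=f"D=F={D}")

plt.axhline(BF16_FLOPS, ls="--", color="gray")
plt.xscale("log")
plt.yscale("log")
plt.xlabel("batch size B")
plt.ylabel("peak achievable FLOPs/s")
plt.legend()
plt.show()
```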

Question 4: Batched Matrix Multiplication

What if we wanted to perform int8[B, D] * int8[B, D, F] -> int8[B, F], where we imagine having a different matrix for each batch element? What is the arithmetic intensity of this operation?

Memcpy = BD + BDF + BF bytes (a separate [D, F] matrix for every batch element)

Error

I got this wrong at first: FLOPs = 2BDF, exactly as before. So,

Arithmetic Intensity = 2BDF / (BD + BDF + BF) = 2DF / (D + DF + F) ≈ 2 for large D and F.

This is bad, since it means this operation is always comms bound.
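A tiny numeric check that the intensity saturates at 2 no matter how large B gets (dimensions are example values):

```python
# Arithmetic intensity of int8[B, D] * int8[B, D, F] -> int8[B, F].
def batched_matmul_intensity(B, D, F):
    flops = 2 * B * D * F
    bytes_moved = B * D + B * D * F + B * F  # one [D, F] matrix per batch element
    return flops / bytes_moved

for B in [1, 64, 4096]:
    print(B, batched_matmul_intensity(B, D=4096, F=4096))  # ~2.0 for every B
```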

Question 5: Memory Rooflines for GPUs

Using the spec sheet provided by NVIDIA for the H100, calculate the batch size at which a matrix multiplication will become compute-bound. Note that the Tensor Core FLOPs numbers are twice the true value since they’re only achievable with structured sparsity.

FP16 = 1979 TFLOP/s = 1.98e15 FLOPs/s (with sparsity)

Memory bandwidth = 3.35 TB/s = 3.35e12 bytes/s

Thus, Peak AI = 1.98e15 / 3.35e12 ≈ 591 FLOPs/byte

Since a pure fp16 matmul has arithmetic intensity of roughly B (the factor of 2 in the FLOPs cancels against the 2 bytes per element), we are compute bound for B > 591.

Note on FP16 throughput

Without sparsity the true FP16 throughput is 1979 / 2 ≈ 990 TFLOP/s = 9.9e14 FLOPs/s, so the actual crossover is B > 9.9e14 / 3.35e12 ≈ 296 for compute bound.
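The same crossover computation for the H100 numbers (spec-sheet values as above):

```python
# Compute-bound crossover for an fp16 matmul on an H100.
SPARSE_FP16 = 1.979e15        # FLOPs/s from the spec sheet (2:4 structured sparsity)
DENSE_FP16 = SPARSE_FP16 / 2  # ~9.9e14 FLOPs/s achievable on dense matmuls
HBM_BW = 3.35e12              # bytes/s

# For a pure fp16 matmul, arithmetic intensity ~= B, so the crossover
# batch size is simply peak FLOPs / memory bandwidth.
print(SPARSE_FP16 / HBM_BW)  # ~591 (using the sparsity-inflated number)
print(DENSE_FP16 / HBM_BW)   # ~295, i.e. compute-bound for B > 296
```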