Roofline Model Practice Questions
Question 1: INT8 Matrix Multiplication
Say we want to do our matmul in int8 precision (1 byte per parameter) instead of bfloat16, i.e. int8[B, D] * int8[D, F] -> int8[B, F].
a) How many bytes need to be loaded from memory? How many need to be written back to memory?
b) How many total OPs are performed?
c) What is the arithmetic intensity?
d) What is a roofline estimate for $T_\text{math}$ and $T_\text{comms}$? What are reasonable upper and lower bounds for the runtime of the whole operation?
Assume our HBM bandwidth is 8.1e11 bytes/s and our int8 peak OPs/s is 3.94e14.
a. $BD + DF$ bytes loaded, $BF$ bytes written back.
b. Total OPs $= 2BDF$.
c. Arithmetic Intensity $= \frac{2BDF}{BD + DF + BF}$.
d. Roofline: $T_\text{math} = \frac{2BDF}{3.94 \times 10^{14}}$ and $T_\text{comms} = \frac{BD + DF + BF}{8.1 \times 10^{11}}$. A reasonable lower bound on the runtime is $\max(T_\text{math}, T_\text{comms})$; an upper bound is $T_\text{math} + T_\text{comms}$.
Assuming $B \ll D, F$, the arithmetic intensity simplifies to $\approx 2B$, and the critical intensity is $\frac{3.94 \times 10^{14}}{8.1 \times 10^{11}} \approx 487$.
For $B > 487/2 \approx 243$, we are compute bound. For $B < 243$, we are comms bound.
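To make this concrete, here is a minimal sketch in Python that plugs values into the formulas above. The sizes of B, D, and F are hypothetical examples; the bandwidth and OPs/s figures are the ones given in the question.

```python
# Roofline estimate for int8[B, D] * int8[D, F] -> int8[B, F].
B, D, F = 128, 8192, 8192  # hypothetical example sizes
HBM_BW = 8.1e11            # bytes/s
INT8_OPS = 3.94e14         # peak int8 OPs/s

bytes_moved = B * D + D * F + B * F  # 1 byte per int8 element, loads + store
total_ops = 2 * B * D * F            # one multiply + one add per (b, d, f)

t_math = total_ops / INT8_OPS
t_comms = bytes_moved / HBM_BW
print(f"arithmetic intensity: {total_ops / bytes_moved:.1f} OPs/byte")
print(f"runtime bounds: [{max(t_math, t_comms):.2e}, {t_math + t_comms:.2e}] s")
```

With these example sizes the intensity comes out near 248 OPs/byte, below the critical ~487, consistent with B = 128 < 243 being comms bound.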
Question 2: INT8 + BF16 Matrix Multiplication
In practice we often do different weight vs. activation quantization, so we might store our weights in very low precision but keep activations (and compute) in a higher precision. Say we want to quantize our weights in int8 but keep activations (and compute) in bfloat16. At what batch size do we become compute bound? Assume
1.97e14 bfloat16 FLOPs/s. Hint: this means specifically
bfloat16[B, D] * int8[D, F] -> bfloat16[B, F], where $B$ is the "batch size".
FLOPs $= 2BDF$, Memcpy $= 2BD + DF + 2BF$ (2 bytes per bfloat16 value, 1 byte per int8 weight)
Arithmetic Intensity $= \frac{2BDF}{2BD + DF + 2BF}$
Assuming $B \ll D$ and $B \ll F$, Arithmetic Intensity $\approx \frac{2BDF}{DF} = 2B$, and the critical intensity is $\frac{1.97 \times 10^{14}}{8.1 \times 10^{11}} \approx 243$.
For $2B > 243$ we are compute bound, i.e. $B > 122$.
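As a quick arithmetic check of the crossover (same figures as above):

```python
PEAK_BF16_FLOPS = 1.97e14  # bf16 FLOPs/s
HBM_BW = 8.1e11            # bytes/s

critical_intensity = PEAK_BF16_FLOPS / HBM_BW  # ~243 FLOPs/byte
print(critical_intensity / 2)  # intensity ~ 2B, so compute bound for B > ~122
```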
Question 3: Roofline Plot
For the problem in Question 2, make a roofline plot of peak FLOPs vs. B for several values of D and F.
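One way to do this, as a sketch: assume the 1.97e14 bf16 FLOPs/s and 8.1e11 bytes/s figures from Questions 1 and 2, pick a few arbitrary values of D = F, and plot the roofline-limited throughput.

```python
import numpy as np
import matplotlib.pyplot as plt

HBM_BW = 8.1e11       # bytes/s
PEAK_FLOPS = 1.97e14  # peak bf16 FLOPs/s

B = np.logspace(0, 12, 200, base=2)  # batch sizes from 1 to 4096

for D, F in [(1024, 1024), (4096, 4096), (16384, 16384)]:
    flops = 2 * B * D * F                        # total FLOPs
    bytes_moved = 2 * B * D + D * F + 2 * B * F  # bf16 activations, int8 weights
    t = np.maximum(flops / PEAK_FLOPS, bytes_moved / HBM_BW)  # roofline runtime
    plt.loglog(B, flops / t, label=f"D = F = {D}")

plt.axhline(PEAK_FLOPS, color="gray", linestyle="--", label="peak bf16 FLOPs/s")
plt.xlabel("batch size B")
plt.ylabel("achieved FLOPs/s")
plt.legend()
plt.show()
```

Each curve climbs linearly in the bandwidth-bound regime and flattens at the peak FLOPs line; larger D and F push the crossover slightly since the activation terms matter less.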
Question 4: Batched Matrix Multiplication
What if we wanted to perform int8[B, D] * int8[B, D, F] -> int8[B, F], where we imagine having a different matrix for each batch element? What is the arithmetic intensity of this operation?
FLOPs $= 2BDF$, Memcpy $= BD + BDF + BF$ (we now load a separate $D \times F$ matrix for every batch element)
Arithmetic Intensity $= \frac{2BDF}{BD + BDF + BF}$
Error
I got this wrong at first. The FLOPs are $2BDF$ as before, so assuming $B \ll D$ and $B \ll F$, the Arithmetic Intensity $= \frac{2BDF}{BDF} = 2$.
This is bad → since the arithmetic intensity is a constant 2 regardless of batch size, we are always comms bound.
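A quick numerical sketch (with arbitrary sizes D = F = 4096) shows the contrast with Question 1: the shared-weight intensity grows with B, while the per-batch-weight intensity stays pinned near 2.

```python
D, F = 4096, 4096  # arbitrary example sizes

for B in [32, 128, 512, 2048]:
    shared = 2 * B * D * F / (B * D + D * F + B * F)       # int8[B,D] * int8[D,F]
    batched = 2 * B * D * F / (B * D + B * D * F + B * F)  # int8[B,D] * int8[B,D,F]
    print(f"B={B:5d}  shared AI = {shared:7.1f}  batched AI = {batched:.3f}")
```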
Question 5: Memory Rooflines for GPUs
Using the spec sheet provided by NVIDIA for the H100, calculate the batch size at which a matrix multiplication will become compute-bound. Note that the Tensor Core FLOPs numbers are twice the true value since they’re only achievable with structured sparsity.
FP16 Tensor Core throughput = 1979 TFLOPs/s $\approx$ 1.98e15 FLOPs/s (with sparsity)
Memory bandwidth = 3.35 TB/s = 3.35e12 bytes/s
Thus, the critical arithmetic intensity is $\frac{1.98 \times 10^{15}}{3.35 \times 10^{12}} \approx 591$ FLOPs/byte.
Since an FP16 matmul has arithmetic intensity $\approx B$ (for $B \ll D, F$), we are compute bound for $B > 591$.
Note on FP16 throughput
Without structured sparsity the true figure is $\approx 1 \times 10^{15}$ FLOPs/s, so in practice we become compute bound only for $B > 296$.
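The arithmetic, as a quick sketch (assuming a plain FP16 matmul with $B \ll D, F$, so the arithmetic intensity is $\approx B$):

```python
SPARSE_FP16_FLOPS = 1.979e15  # H100 FP16 Tensor Core, with 2:4 structured sparsity
DENSE_FP16_FLOPS = SPARSE_FP16_FLOPS / 2
HBM_BW = 3.35e12              # bytes/s

print(SPARSE_FP16_FLOPS / HBM_BW)  # ~591: compute bound for B > 591 (sparse number)
print(DENSE_FP16_FLOPS / HBM_BW)   # ~295: compute bound for B > ~296 in practice
```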