TPU Architecture Practice Questions
Question 1: LLM Latency Bounds
Say you want to sample from a 200B parameter model in bf16 that’s split across 32 TPU v4p. How long would it take to load all the parameters from HBM into the systolic array?
Model size = 200e9 params × 2 bytes/param (bf16) = 400e9 bytes
Model size per chip = 400e9 / 32 = 12.5e9 bytes
TPU v4p HBM BW = 1.23e12 bytes/s per chip
Time = 12.5e9 / 1.23e12 ≈ 1.0e-2 s ≈ 10 ms
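As a quick sanity check, here is a minimal sketch of the same arithmetic; the 1.23e12 bytes/s figure is the assumed per-chip v4p HBM bandwidth.

```python
# Roofline estimate: time to stream all weights from HBM once.
PARAMS = 200e9            # parameters
BYTES_PER_PARAM = 2       # bf16
N_CHIPS = 32              # TPU v4p chips
HBM_BW = 1.23e12          # bytes/s per v4p chip (assumed spec)

model_bytes = PARAMS * BYTES_PER_PARAM           # 4.0e11 bytes
bytes_per_chip = model_bytes / N_CHIPS           # 1.25e10 bytes
t_load = bytes_per_chip / HBM_BW                 # ~1.0e-2 s
print(f"{t_load * 1e3:.1f} ms per weight pass")  # ≈ 10.2 ms
```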
Question 2: TPU Pod Specifications
Consider a full TPU v5e pod. How many total CPU hosts are there? How many TPU TensorCores? What is the total FLOPs/s for the whole pod? What is the total HBM?
Do the same exercise for a TPU v5p pod.
| Chip Type | Pod Size | Host Size | Total Hosts | Total FLOPs/s | Total HBM |
|---|---|---|---|---|---|
| TPU v5e | 16×16 (256 chips) | 4×2 (8 chips) | 256 / 8 = 32 | 256 × 1.97e14 ≈ 5.0e16 | 256 × 16 GB = 4 TB |
| TPU v5p | 16×20×28 (8,960 chips) | 2×2×1 (4 chips) | 8960 / 4 = 2240 | 8960 × 4.59e14 ≈ 4.1e18 | 8960 × 96 GB ≈ 860 TB |

Each v5e chip has a single TensorCore (256 total for the pod), while each v5p chip has two (17,920 total).
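A small sketch of the pod-level arithmetic, assuming the per-chip figures used above (1.97e14 / 4.59e14 bf16 FLOPs/s and 16 GB / 96 GB of HBM for v5e / v5p respectively):

```python
# Pod-level totals from assumed per-chip specs.
specs = {
    #            chips/pod,  chips/host, cores/chip, bf16 FLOPs/s, HBM bytes
    "TPU v5e": (16 * 16,      4 * 2,      1,          1.97e14,      16e9),
    "TPU v5p": (16 * 20 * 28, 2 * 2 * 1,  2,          4.59e14,      96e9),
}

for name, (chips, chips_per_host, cores, flops, hbm) in specs.items():
    print(name,
          "hosts:", chips // chips_per_host,
          "TensorCores:", chips * cores,
          "FLOPs/s: %.2e" % (chips * flops),
          "HBM: %.0f TB" % (chips * hbm / 1e12))
```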
Question 3: PCIe Operational Intensity
Imagine we’re forced to store a big weight matrix of type bf16[D, F], and a batch of activations of type bf16[B, D], in host DRAM and want to do a matrix multiplication on them. This is running on a single host, and we’re using a single TPU v6e chip attached to it.
You can assume B ≪ D and B ≪ F (we’ll see in future chapters why these are reasonable assumptions). What is the smallest batch size we need to remain FLOPs-bound over PCIe? Assume a PCIe bandwidth of 1.5e10 bytes/second.
Intensity(matmul) = FLOPs / bytes over PCIe = 2BDF / (2DF + 2BD + 2BF)
Assuming B ≪ D and B ≪ F:
Intensity(matmul) ≈ 2BDF / 2DF = B
For FLOP-bound, we need Intensity(matmul) > Intensity(v6e over PCIe) = 9.2e14 / 1.5e10 ≈ 61,000, so B ≳ 61,000
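Concretely, taking 9.2e14 FLOPs/s as the assumed v6e bf16 peak, the bound works out as sketched below; both 2-byte (bf16) and 1-byte (int8) operands are shown, since the element size changes the byte count and therefore the threshold:

```python
# Critical batch size to stay FLOPs-bound when streaming a matmul over PCIe.
V6E_FLOPS = 9.2e14     # bf16 FLOPs/s (assumed v6e peak)
PCIE_BW = 1.5e10       # bytes/s (from the question)

def min_flops_bound_batch(bytes_per_elem: int = 2) -> float:
    # Intensity(matmul) ≈ 2BDF / (bytes_per_elem * DF) = 2B / bytes_per_elem;
    # FLOPs-bound once this exceeds the accelerator intensity over PCIe.
    accel_intensity = V6E_FLOPS / PCIE_BW          # ≈ 61,000 FLOPs/byte
    return accel_intensity * bytes_per_elem / 2

print(min_flops_bound_batch(2))   # bf16 operands: B ≳ 61,000
print(min_flops_bound_batch(1))   # int8 operands: B ≳ 31,000
```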
Correct answer and approach, incorrect thought process
Since the data first has to be copied from host DRAM into HBM over PCIe, the matmul never reads its operands ‘directly’ over PCIe the way the intensity calculation above implies; the cleaner framing is to compare the compute time against the PCIe transfer time.
Operations time = 2BDF / 9.2e14 seconds; weight (and activation) read + result write time over PCIe = (2DF + 2BD + 2BF) / 1.5e10 seconds
This assumes we can overlap compute with the weight loading (either overlapping HBM reads/writes with the transfer from the host, or overlapping compute with the HBM reads/writes), so the op takes max of the two times.
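A minimal sketch of that comparison, assuming the same v6e peak and PCIe bandwidth; the layer sizes in the example call are hypothetical, chosen only to illustrate how lopsided the two times are at small batch:

```python
# Is x[B, D] @ W[D, F] (bf16, streamed over PCIe) compute- or transfer-bound?
V6E_FLOPS = 9.2e14   # assumed v6e bf16 peak, FLOPs/s
PCIE_BW = 1.5e10     # bytes/s

def pcie_matmul_time(B, D, F, bytes_per_elem=2):
    t_math = 2 * B * D * F / V6E_FLOPS
    t_comms = bytes_per_elem * (D * F + B * D + B * F) / PCIE_BW
    # With perfect overlap, the op takes the max of the two times.
    return max(t_math, t_comms), ("FLOPs-bound" if t_math >= t_comms else "PCIe-bound")

# Hypothetical layer sizes, small batch: massively PCIe-bound (~2000x).
print(pcie_matmul_time(B=32, D=8192, F=28672))   # (~0.03 s, 'PCIe-bound')
```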
Question 4: General MatMul Latency
Let’s say we want to multiply a weight matrix int8[16384, 4096] by an activation matrix of size int8[B, 4096], where B is some unknown batch size. Let’s say we’re on a single TPU v5e to start.
a) How long will this multiplication take as a function of B?
b) What if we wanted to run this operation out of VMEM? How long would it take as a function of B?
Data size:
- W = 16384 × 4096 × 1 byte (int8) = 6.7e7 bytes
- X = B × 4096 × 1 byte = 4096 × B bytes
- X*W = B × 16384 × 1 byte = 16384 × B bytes (assuming an int8 output)
Ops = 2 × B × 4096 × 16384 ≈ 1.3e8 × B FLOPs
For a v5e chip (int8: 3.94e14 OPs/s):
Time (ops) = 2 × B × 4096 × 16384 / 3.94e14 ≈ 3.4e-7 × B seconds
Out of HBM (8.1e11 bytes/s):
Time (comm) = (6.7e7 + 4096 × B + 16384 × B) / 8.1e11 ≈ 8.3e-5 seconds for small B
For FLOP-bound: Time (ops) > Time (comm), i.e. B ≳ 8.3e-5 / 3.4e-7 ≈ 240
Out of VMEM (bandwidth ≈ 22× HBM):
Time (comm) ≈ 1/22 × 8.3e-5 ≈ 3.8e-6 seconds
For FLOP-bound: B ≳ 240 / 22 ≈ 11
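A sketch covering both cases, assuming the v5e figures used above (3.94e14 int8 OPs/s, 8.1e11 bytes/s of HBM bandwidth, and VMEM at roughly 22× HBM bandwidth):

```python
# Roofline for int8[16384, 4096] x int8[B, 4096] on one TPU v5e,
# reading operands from HBM vs. from VMEM.
D, F = 4096, 16384
V5E_INT8_OPS = 3.94e14   # OPs/s (assumed)
HBM_BW = 8.1e11          # bytes/s (assumed)
VMEM_BW = 22 * HBM_BW    # assumed VMEM-to-HBM bandwidth ratio

def matmul_time(B, mem_bw):
    t_math = 2 * B * D * F / V5E_INT8_OPS
    t_comms = (D * F + B * D + B * F) / mem_bw   # int8: 1 byte/element
    return max(t_math, t_comms)

# The FLOPs/comms crossover sits near B ~ 240 for HBM and B ~ 11 for VMEM.
for B in (11, 64, 240, 1024):
    print(B, f"HBM: {matmul_time(B, HBM_BW):.2e}s", f"VMEM: {matmul_time(B, VMEM_BW):.2e}s")
```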
Question 5: ICI Bandwidth
Let’s say we have a TPU v5e 4x4 slice. Let’s say we want to send an array of type bfloat16[8, 128, 8192] from TPU{0,0} to TPU{3,3}. Let’s say the per-hop latency for TPU v5e is roughly 1µs.
a) How soon will the first byte arrive at its destination?
b) How long will the total transfer take?
For a v5e 4x4 slice there are no wrap-around links (on v5e, only an axis spanning the full 16 chips gets wraparound)
We can send the data along two paths at once: down and right (2 ICI links out of TPU{0,0})
Total bytes = 2 × 8 × 128 × 8192 ≈ 1.7e7 bytes
Bytes per second = 2 × 4.5e10 = 9e10 bytes/s (two links at 4.5e10 bytes/s each, one direction)
First byte = 6 hops × 1µs = 6e-6 s
Total time = 1.7e7 / 9e10 + 6e-6 ≈ 1.9e-4 s
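A minimal sketch of the same estimate, assuming 4.5e10 bytes/s of one-way ICI bandwidth per link and ~1µs per hop:

```python
# Point-to-point transfer of bf16[8, 128, 8192] across a v5e 4x4 slice (no wraparound).
BYTES = 2 * 8 * 128 * 8192        # ~1.7e7 bytes
ICI_BW_PER_LINK = 4.5e10          # bytes/s, one direction (assumed v5e spec)
HOP_LATENCY = 1e-6                # seconds per hop (assumed)
HOPS = 3 + 3                      # TPU{0,0} -> TPU{3,3}: 3 right + 3 down

first_byte = HOPS * HOP_LATENCY                      # ≈ 6 us
total = BYTES / (2 * ICI_BW_PER_LINK) + first_byte   # 2 outgoing ports: down + right
print(f"first byte: {first_byte*1e6:.0f} us, total: {total*1e6:.0f} us")  # ~6 us, ~192 us
```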
Question 6: Multi-Component Performance (Hard)
Imagine you have a big matrix A: int8[128 * 1024, 128 * 1024] sharded evenly across a TPU v5e 4x4 slice but offloaded to host DRAM on each chip. Let’s say you want to copy the entire array to TPU{0,0} and multiply it by a vector bf16[8, 128 * 1024]. How long will this take?
Steps (a rough numeric sketch follows this list):
- Calculate the size of the array shard on each chip
- Calculate the data transfer time from each chip (host DRAM, over PCIe and then ICI) to TPU{0,0}
- Calculate the time to read both the matrix and the vector from TPU{0,0}'s HBM
- Calculate the time to perform the multiplication
- Calculate the time to write the output back to HBM
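A rough sketch of those steps under assumed v5e numbers (PCIe 1.5e10, one-way ICI 4.5e10 per link, HBM 8.1e11 bytes/s, 1.97e14 bf16 FLOPs/s); it sums the stages sequentially, so treat it as a loose upper bound rather than a tight estimate:

```python
# Rough end-to-end estimate for copying A to TPU{0,0} and computing x @ A there.
# All hardware numbers are assumed v5e specs.
N = 128 * 1024
A_BYTES = N * N                      # int8 matrix, ~1.7e10 bytes total
X_BYTES = 2 * 8 * N                  # bf16[8, N] activations
OUT_BYTES = 2 * 8 * N                # bf16[8, N] result
CHIPS = 16                           # 4x4 slice
PCIE_BW, ICI_BW, HBM_BW = 1.5e10, 4.5e10, 8.1e11
V5E_BF16_FLOPS = 1.97e14

shard_bytes = A_BYTES / CHIPS                 # ~1.1e9 bytes per chip
t_pcie = shard_bytes / PCIE_BW                # host DRAM -> each chip's HBM, in parallel
t_ici = A_BYTES / (2 * ICI_BW)                # everything funnels into TPU{0,0}'s 2 links
t_hbm_read = (A_BYTES + X_BYTES) / HBM_BW     # read A and x from TPU{0,0} HBM
                                              # (note: ~17 GB of A actually exceeds the
                                              # chip's 16 GB HBM; ignored in this sketch)
t_math = 2 * 8 * N * N / V5E_BF16_FLOPS       # 2*B*N*N FLOPs with B = 8
t_write = OUT_BYTES / HBM_BW

total = t_pcie + t_ici + t_hbm_read + t_math + t_write
for name, t in [("pcie", t_pcie), ("ici", t_ici), ("hbm read", t_hbm_read),
                ("matmul", t_math), ("write", t_write), ("total", total)]:
    print(f"{name:>8}: {t*1e3:7.2f} ms")      # dominated by the ICI funnel into one chip
```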