Systolic Array
A Systolic Array is a specialized hardware structure consisting of a grid of processing elements (PEs) that rhythmically compute and pass data to their neighbors, forming the core of the Matrix Multiply Unit (MXU) in TPUs.

Basic Structure
- Grid Organization:
  - TPU v2-v5: 128×128 (16,384 PEs)
  - TPU v6e: 256×256 (65,536 PEs)
- Processing Element: each PE contains:
  - Multiply-accumulate (MAC) unit
  - Small register for weight storage
  - Input/output connections to neighboring PEs
- Data Flow:
  - Weights (RHS matrix): Flow top to bottom
  - Activations (LHS matrix): Flow left to right
  - Partial sums: Flow diagonally and accumulate
Animation showing data flow through a systolic array with results streaming out. Source: How to Scale Your Model
Operation Sequence
1. Weight Loading: matrix loaded diagonally into the array
2. Activation Streaming: inputs fed row by row from the left
3. Computation: each PE:
   - Multiplies incoming activation with stored weight
   - Adds result to accumulated value from upstream PE
   - Passes result to next PE in pipeline
4. Result Collection: accumulated products emerge from bottom or right edge
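To make the dataflow concrete, below is a minimal NumPy sketch of a weight-stationary systolic array (the function name, skew schedule, and boundary handling are illustrative assumptions, not the actual MXU design). Activations enter each row one cycle later than the row above, partial sums march down the columns, and output rows emerge from the bottom edge with a matching skew:

```python
import numpy as np

def systolic_matmul(X, W):
    """Toy cycle-by-cycle simulation of a weight-stationary systolic
    array computing Y = X @ W (illustrative sketch, not the real MXU).

    X: [M, K] activations, streamed in from the left edge.
    W: [K, N] weights, held stationary, one per PE.
    """
    M, K = X.shape
    _, N = W.shape
    a = np.zeros((K, N))   # activation register in each PE
    p = np.zeros((K, N))   # partial-sum register in each PE
    Y = np.zeros((M, N))

    for t in range(M + K + N - 2):        # fill + steady state + drain
        # Activations shift one PE to the right; new inputs enter
        # column 0, with row i delayed by i cycles (the input skew).
        a_new = np.zeros_like(a)
        a_new[:, 1:] = a[:, :-1]
        for i in range(K):
            m = t - i
            a_new[i, 0] = X[m, i] if 0 <= m < M else 0.0
        # Each PE multiplies its incoming activation by its stored
        # weight and adds the partial sum arriving from the PE above.
        p_new = np.zeros_like(p)
        p_new[0, :] = a_new[0, :] * W[0, :]
        p_new[1:, :] = p[:-1, :] + a_new[1:, :] * W[1:, :]
        # Finished rows exit the bottom edge with a matching skew.
        for j in range(N):
            m = t - (K - 1) - j
            if 0 <= m < M:
                Y[m, j] = p_new[K - 1, j]
        a, p = a_new, p_new
    return Y

X = np.random.randn(8, 128)
W = np.random.randn(128, 128)
assert np.allclose(systolic_matmul(X, W), X @ W)
```

Note that the first valid output row appears only after K − 1 cycles of pipeline fill; that startup bubble is exactly the cost discussed under Performance Characteristics below.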
Performance Characteristics
- Pipeline Structure:
  - Initial loading creates a pipeline "bubble"
  - Once filled, the array produces results every cycle
  - Efficiency increases with matrix size due to amortized startup cost
- Parallelism: Performs 128×128 (or 256×256) multiply-accumulate operations simultaneously
- Clock Efficiency: Completes one `bf16[8,128] @ bf16[128,128] → f32[8,128]` operation every 8 cycles
- Throughput: ~5e13 FLOPs/s per systolic array at 1.5 GHz (TPU v5e)
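Both figures follow from the array dimensions: at steady state each of the 128×128 PEs retires one multiply-accumulate (2 FLOPs) per cycle, and an `[8,128]` LHS occupies the pipeline for 8 cycles at one row per cycle. A quick sanity check (clock value approximate):

```python
# Peak throughput of one 128×128 systolic array: every PE does one
# multiply-accumulate (2 FLOPs) per cycle once the pipeline is full.
pes = 128 * 128
clock_hz = 1.5e9                            # approximate TPU v5e clock
print(f"{2 * pes * clock_hz:.1e} FLOPs/s")  # 4.9e+13, matching ~5e13 above
```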
Advantages
- Energy Efficiency:
  - Data reuse through propagation (each value used multiple times)
  - Minimal control overhead (simple, repeated operations)
  - Reduced memory access (compared to traditional architectures)
- Hardware Simplicity:
  - Regular, repeating structure
  - Simple control logic
  - Efficient VLSI implementation
- Perfect for Matrix Multiplication:
  - Exploits the O(n³) compute to O(n²) memory ratio
  - Natural fit for deep learning operations
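The ratio matters because arithmetic intensity grows linearly with matrix size: an n×n×n matmul performs 2n³ FLOPs while moving only 3n² operands, so larger matrices keep the PEs busy with less memory traffic per FLOP. A rough illustration for bf16 inputs:

```python
# An n×n×n matmul does 2n^3 FLOPs but moves only 3n^2 operands
# (two inputs plus one output), so intensity grows linearly in n.
n, bytes_per_elem = 128, 2       # bf16 inputs; output dtype ignored for simplicity
flops = 2 * n**3
bytes_moved = 3 * n**2 * bytes_per_elem
print(flops / bytes_moved)       # ≈ 43 FLOPs/byte at n = 128
```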
Optimization Considerations
- Dimension Requirements:
  - Matrices should be padded to multiples of the array dimension (128 or 256)
  - Small matrices waste systolic array capacity
- Batch Processing:
  - Larger batches better amortize the pipeline-fill cost
  - Typical LHS shape is `[B, 128]`, where larger B improves efficiency (see the toy model at the end of this section)
- Pipelining Across Operations:
  - Multiple matrix multiplications can be chained efficiently
  - Overlapping weight loading with computation improves throughput
Diagram showing how matrix multiplications can be pipelined across multiple input/weight pairs. Source: How to Scale Your Model
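The considerations above can be eyeballed with a toy cost model. Both helper functions below are illustrative assumptions (in particular the ~2×tile-cycle fill/drain estimate), not a real TPU performance model:

```python
import math

TILE = 128                      # systolic array dimension (256 on TPU v6e)

def utilization(m, k, n, tile=TILE):
    """Fraction of MACs doing useful work after padding each dimension
    up to a multiple of the array size (toy model)."""
    pad = lambda d: math.ceil(d / tile) * tile
    return (m * k * n) / (pad(m) * pad(k) * pad(n))

def batch_efficiency(b, tile=TILE):
    """Steady-state fraction of cycles producing results, assuming one
    output row per cycle and a ~2*tile-cycle fill/drain bubble (toy model)."""
    startup = 2 * tile - 2
    return b / (b + startup)

print(f"{utilization(130, 130, 130):.0%}")  # ~13%: just past a tile boundary wastes most of the array
print(f"{batch_efficiency(8):.0%}")         # ~3%: tiny batches are dominated by pipeline fill
print(f"{batch_efficiency(1024):.0%}")      # ~80%: large B amortizes the startup bubble
```

Shapes that are already multiples of the array dimension, and batches well above the fill cost, keep both numbers close to 1.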