Systolic Array

A Systolic Array is a specialized hardware structure consisting of a grid of processing elements (PEs) that rhythmically compute and pass data to their neighbors, forming the core of the Matrix Multiply Unit (MXU) in TPUs.

Systolic array matmul

Basic Structure

  • Grid Organization:

    • TPU v2-v5: 128×128 (16,384 PEs)
    • TPU v6e: 256×256 (65,536 PEs)
  • Processing Element (see the code sketch below the figure):

    • Each PE contains:
      • Multiply-accumulate (MAC) unit
      • Small register for weight storage
      • Input/output connections to neighboring PEs
  • Data Flow:

    • Weights (RHS matrix): Flow top to bottom
    • Activations (LHS matrix): Flow left to right
    • Partial sums: Flow diagonally and accumulate

Data flow through a systolic array: animation showing data flowing through the array with results streaming out. Source: How to Scale Your Model
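To make the PE behavior concrete, here is a minimal Python sketch of a single weight-stationary PE (the class and method names are illustrative, not actual TPU hardware): it keeps one weight in a register, multiplies it with the incoming activation, adds the partial sum arriving from upstream, and forwards both values to its neighbors.

```python
from dataclasses import dataclass

@dataclass
class ProcessingElement:
    """Toy model of one weight-stationary PE: a weight register plus a MAC unit."""
    weight: float = 0.0  # held stationary once loaded

    def step(self, activation_in: float, partial_sum_in: float):
        # MAC: multiply the incoming activation by the stored weight and
        # accumulate onto the partial sum arriving from the upstream PE.
        partial_sum_out = partial_sum_in + activation_in * self.weight
        # The activation is forwarded unchanged to the next PE in the row;
        # the updated partial sum is forwarded to the next PE in the column.
        return activation_in, partial_sum_out
```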

Operation Sequence

  1. Weight Loading: Weight matrix loaded diagonally into the array
  2. Activation Streaming: Inputs fed row by row from the left
  3. Computation: Each PE:
    • Multiplies incoming activation with stored weight
    • Adds result to accumulated value from upstream PE
    • Passes result to next PE in pipeline
  4. Result Collection: Accumulated products emerge from the bottom or right edge (a toy simulation of the full sequence follows below)
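The sequence above can be reproduced end to end with a small NumPy simulation. This is a sketch, not the MXU's actual microarchitecture: it assumes a weight-stationary array (weights pre-loaded into the PEs, activations skewed and streamed in from the left, partial sums flowing down and exiting the bottom), and the function and variable names are made up for illustration. It nevertheless reproduces X @ W exactly.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-by-cycle toy simulation of a weight-stationary systolic array.

    X: (M, K) activations, streamed in from the left one skewed diagonal per cycle.
    W: (K, N) weights, pre-loaded so that PE (k, n) holds W[k, n].
    Partial sums flow downward and emerge from the bottom as rows of Y = X @ W.
    """
    M, K = X.shape
    assert W.shape[0] == K
    N = W.shape[1]
    act = np.zeros((K, N))   # activation currently held at each PE
    psum = np.zeros((K, N))  # partial sum just produced by each PE
    Y = np.zeros((M, N))
    for t in range(M + K + N):            # enough cycles to drain the pipeline
        # Activations shift one PE to the right; a new skewed column enters at the
        # left (row k of the array receives column k of X, delayed by k cycles).
        act = np.roll(act, 1, axis=1)
        for k in range(K):
            m = t - k
            act[k, 0] = X[m, k] if 0 <= m < M else 0.0
        # Partial sums finished in the bottom row last cycle exit the array now.
        for n in range(N):
            m = t - K - n
            if 0 <= m < M:
                Y[m, n] = psum[K - 1, n]
        # Partial sums shift one PE down, and every PE does one multiply-accumulate.
        psum = np.roll(psum, 1, axis=0)
        psum[0, :] = 0.0
        psum += act * W
    return Y

X = np.random.randn(4, 3)  # toy sizes; a real MXU tile is 128x128 or 256x256
W = np.random.randn(3, 5)
assert np.allclose(systolic_matmul(X, W), X @ W)
```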

Performance Characteristics

  • Pipeline Structure:

    • Initial loading creates pipeline “bubble”
    • Once filled, array produces results every cycle
    • Efficiency increases with matrix size due to amortized startup cost
  • Parallelism: Performs 128×128 = 16,384 (or 256×256 = 65,536) multiply-accumulate operations simultaneously, one per PE

  • Clock Efficiency: Completes one bf16[8,128] @ bf16[128,128] → f32[8,128] operation every 8 cycles

  • Throughput: ~5e13 FLOPs/s per systolic array at 1.5 GHz (TPU v5e); the arithmetic is checked in the sketch below
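These throughput numbers follow directly from the per-cycle MAC count and the clock rate; the short script below just redoes that arithmetic (the 1.5 GHz clock and array sizes are the figures quoted above, not independently measured).

```python
# Back-of-the-envelope check of the quoted per-array throughput (TPU v5e figures above).
array_dim = 128
macs_per_cycle = array_dim * array_dim    # one MAC per PE per cycle = 16,384
flops_per_cycle = 2 * macs_per_cycle      # each MAC counts as a multiply plus an add
clock_hz = 1.5e9                          # ~1.5 GHz
print(flops_per_cycle * clock_hz)         # ~4.9e13 FLOPs/s, i.e. roughly 5e13

# Equivalently, via the bf16[8,128] @ bf16[128,128] -> f32[8,128] op every 8 cycles:
print(2 * 8 * 128 * 128 / 8 * clock_hz)   # same ~4.9e13 FLOPs/s
```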

Advantages

  • Energy Efficiency:

    • Data reuse through propagation (each value used multiple times)
    • Minimal control overhead (simple, repeated operations)
    • Reduced memory access (operands are fetched once and reused in place, rather than re-read from memory for every operation)
  • Hardware Simplicity:

    • Regular, repeating structure
    • Simple control logic
    • Efficient VLSI implementation
  • Perfect for Matrix Multiplication:

    • Exploits the O(n³) compute to O(n²) memory ratio (see the worked example after this list)
    • Natural fit for deep learning operations
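To see why that ratio matters, the snippet below computes the arithmetic intensity of a square matmul under simple assumptions (bf16 operands at 2 bytes each, every operand read from or written to memory exactly once); intensity grows linearly with n, so larger matrices keep the array busy relative to memory traffic.

```python
# Arithmetic intensity (FLOPs per byte of memory traffic) of an n x n @ n x n matmul,
# assuming bf16 operands (2 bytes) each touched once: compute is O(n^3), traffic O(n^2).
def matmul_intensity(n, bytes_per_elem=2):
    flops = 2 * n**3                       # n^3 multiply-adds, 2 FLOPs each
    traffic = 3 * n**2 * bytes_per_elem    # read A and B, write C
    return flops / traffic

for n in (128, 1024, 8192):
    print(n, round(matmul_intensity(n)))   # ~43, ~341, ~2731 FLOPs/byte
```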

Optimization Considerations

  • Dimension Requirements:

    • Matrices should be padded to multiples of the array dimension (128 or 256)
    • Small matrices waste systolic array capacity (see the padding sketch at the end of this section)
  • Batch Processing:

    • Larger batches better amortize pipeline filling cost
    • A typical LHS shape is [B, 128]; larger B improves efficiency
  • Pipelining Across Operations:

    • Multiple matrix multiplications can be chained efficiently
    • Overlapping weight loading with computation improves throughput

Matrix multiplication pipelining: diagram showing how matrix multiplications can be pipelined across multiple input/weight pairs. Source: How to Scale Your Model
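As an illustration of the dimension-padding point above, the sketch below zero-pads awkwardly shaped operands up to the next multiple of the array dimension and reports how much of the padded matrix does useful work (the helper name and shapes are made up for this example; on a real TPU the compiler handles the padding, but the wasted capacity is the same).

```python
import numpy as np

MXU_DIM = 128  # pad to multiples of the array dimension (256 on TPU v6e)

def pad_to_multiple(x, multiple=MXU_DIM):
    """Zero-pad every axis of x up to the next multiple of the array dimension."""
    pads = [(0, (-dim) % multiple) for dim in x.shape]
    return np.pad(x, pads)

activations = np.ones((200, 300), dtype=np.float32)  # awkward, non-multiple shapes
weights = np.ones((300, 50), dtype=np.float32)
a, w = pad_to_multiple(activations), pad_to_multiple(weights)
print(a.shape, w.shape)                    # (256, 384) (384, 128)
useful = activations.size / a.size
print(f"useful fraction of the padded LHS: {useful:.2f}")  # ~0.61; the rest is wasted
```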