Systolic Array

A Systolic Array is a specialized hardware structure consisting of a grid of processing elements (PEs) that rhythmically compute and pass data to their neighbors, forming the core of the Matrix Multiply Unit (MXU) in TPUs.

Systolic array matmul

Basic Structure

  • Grid Organization:

    • TPU v2-v5: 128×128 (16,384 PEs)
    • TPU v6e: 256×256 (65,536 PEs)
  • Processing Element (see the code sketch below the figure):

    • Each PE contains:
      • Multiply-accumulate (MAC) unit
      • Small register for weight storage
      • Input/output connections to neighboring PEs
  • Data Flow:

    • Weights (RHS matrix): Flow top to bottom
    • Activations (LHS matrix): Flow left to right
    • Partial sums: Flow diagonally and accumulate

Data flow through a systolic array: animation showing data flowing through the array with results streaming out. Source: How to Scale Your Model
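To make the PE behavior concrete, here is a minimal Python sketch of a single weight-stationary PE (the class and method names are illustrative, not actual TPU hardware): it keeps one weight in a register, multiplies it with the incoming activation, adds the partial sum arriving from upstream, and forwards both values to its neighbors.

```python
from dataclasses import dataclass

@dataclass
class ProcessingElement:
    """Toy model of one weight-stationary PE: a weight register plus a MAC unit."""
    weight: float = 0.0  # held stationary once loaded

    def step(self, activation_in: float, partial_sum_in: float):
        # MAC: multiply the incoming activation by the stored weight and
        # accumulate onto the partial sum arriving from the upstream PE.
        partial_sum_out = partial_sum_in + activation_in * self.weight
        # The activation is forwarded unchanged to the next PE in the row;
        # the updated partial sum is forwarded to the next PE in the column.
        return activation_in, partial_sum_out
```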

Operation Sequence

  1. Weight Loading: Weight matrix loaded diagonally into the array
  2. Activation Streaming: Inputs fed row by row from the left
  3. Computation: Each PE:
    • Multiplies incoming activation with stored weight
    • Adds result to accumulated value from upstream PE
    • Passes result to next PE in pipeline
  4. Result Collection: Accumulated products emerge from the bottom or right edge (a toy simulation of the full sequence follows below)
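The sequence above can be reproduced end to end with a small NumPy simulation. This is a sketch, not the MXU's actual microarchitecture: it assumes a weight-stationary array (weights pre-loaded into the PEs, activations skewed and streamed in from the left, partial sums flowing down and exiting the bottom), and the function and variable names are made up for illustration. It nevertheless reproduces X @ W exactly.

```python
import numpy as np

def systolic_matmul(X, W):
    """Cycle-by-cycle toy simulation of a weight-stationary systolic array.

    X: (M, K) activations, streamed in from the left one skewed diagonal per cycle.
    W: (K, N) weights, pre-loaded so that PE (k, n) holds W[k, n].
    Partial sums flow downward and emerge from the bottom as rows of Y = X @ W.
    """
    M, K = X.shape
    assert W.shape[0] == K
    N = W.shape[1]
    act = np.zeros((K, N))   # activation currently held at each PE
    psum = np.zeros((K, N))  # partial sum just produced by each PE
    Y = np.zeros((M, N))
    for t in range(M + K + N):            # enough cycles to drain the pipeline
        # Activations shift one PE to the right; a new skewed column enters at the
        # left (row k of the array receives column k of X, delayed by k cycles).
        act = np.roll(act, 1, axis=1)
        for k in range(K):
            m = t - k
            act[k, 0] = X[m, k] if 0 <= m < M else 0.0
        # Partial sums finished in the bottom row last cycle exit the array now.
        for n in range(N):
            m = t - K - n
            if 0 <= m < M:
                Y[m, n] = psum[K - 1, n]
        # Partial sums shift one PE down, and every PE does one multiply-accumulate.
        psum = np.roll(psum, 1, axis=0)
        psum[0, :] = 0.0
        psum += act * W
    return Y

X = np.random.randn(4, 3)  # toy sizes; a real MXU tile is 128x128 or 256x256
W = np.random.randn(3, 5)
assert np.allclose(systolic_matmul(X, W), X @ W)
```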

Performance Characteristics

  • Pipeline Structure:

    • Initial loading creates pipeline “bubble”
    • Once filled, array produces results every cycle
    • Efficiency increases with matrix size due to amortized startup cost
  • Parallelism: Performs 128×128 = 16,384 (or 256×256 = 65,536) multiply-accumulate operations simultaneously, one per PE

  • Clock Efficiency: Completes one bf16[8,128] @ bf16[128,128] → f32[8,128] operation every 8 cycles

  • Throughput: ~5e13 FLOPs/s per systolic array at 1.5 GHz (TPU v5e); the arithmetic is checked in the sketch below
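These throughput numbers follow directly from the per-cycle MAC count and the clock rate; the short script below just redoes that arithmetic (the 1.5 GHz clock and array sizes are the figures quoted above, not independently measured).

```python
# Back-of-the-envelope check of the quoted per-array throughput (TPU v5e figures above).
array_dim = 128
macs_per_cycle = array_dim * array_dim    # one MAC per PE per cycle = 16,384
flops_per_cycle = 2 * macs_per_cycle      # each MAC counts as a multiply plus an add
clock_hz = 1.5e9                          # ~1.5 GHz
print(flops_per_cycle * clock_hz)         # ~4.9e13 FLOPs/s, i.e. roughly 5e13

# Equivalently, via the bf16[8,128] @ bf16[128,128] -> f32[8,128] op every 8 cycles:
print(2 * 8 * 128 * 128 / 8 * clock_hz)   # same ~4.9e13 FLOPs/s
```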

Advantages

  • Energy Efficiency:

    • Data reuse through propagation (each value used multiple times)
    • Minimal control overhead (simple, repeated operations)
    • Reduced memory access (operands are fetched once and reused in place, rather than re-read from memory for every operation)
  • Hardware Simplicity:

    • Regular, repeating structure
    • Simple control logic
    • Efficient VLSI implementation
  • Perfect for Matrix Multiplication:

    • Exploits the O(n³) compute to O(n²) memory ratio (see the worked example after this list)
    • Natural fit for deep learning operations
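To see why that ratio matters, the snippet below computes the arithmetic intensity of a square matmul under simple assumptions (bf16 operands at 2 bytes each, every operand read from or written to memory exactly once); intensity grows linearly with n, so larger matrices keep the array busy relative to memory traffic.

```python
# Arithmetic intensity (FLOPs per byte of memory traffic) of an n x n @ n x n matmul,
# assuming bf16 operands (2 bytes) each touched once: compute is O(n^3), traffic O(n^2).
def matmul_intensity(n, bytes_per_elem=2):
    flops = 2 * n**3                       # n^3 multiply-adds, 2 FLOPs each
    traffic = 3 * n**2 * bytes_per_elem    # read A and B, write C
    return flops / traffic

for n in (128, 1024, 8192):
    print(n, round(matmul_intensity(n)))   # ~43, ~341, ~2731 FLOPs/byte
```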

Optimization Considerations

  • Dimension Requirements:

    • Matrices should be padded to multiples of the array dimension (128 or 256)
    • Small matrices waste systolic array capacity (see the padding sketch at the end of this section)
  • Batch Processing:

    • Larger batches better amortize pipeline filling cost
    • A typical LHS shape is [B, 128]; larger B improves efficiency
  • Pipelining Across Operations:

    • Multiple matrix multiplications can be chained efficiently
    • Overlapping weight loading with computation improves throughput

Matrix multiplication pipelining: diagram showing how matrix multiplications can be pipelined across multiple input/weight pairs. Source: How to Scale Your Model
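As an illustration of the dimension-padding point above, the sketch below zero-pads awkwardly shaped operands up to the next multiple of the array dimension and reports how much of the padded matrix does useful work (the helper name and shapes are made up for this example; on a real TPU the compiler handles the padding, but the wasted capacity is the same).

```python
import numpy as np

MXU_DIM = 128  # pad to multiples of the array dimension (256 on TPU v6e)

def pad_to_multiple(x, multiple=MXU_DIM):
    """Zero-pad every axis of x up to the next multiple of the array dimension."""
    pads = [(0, (-dim) % multiple) for dim in x.shape]
    return np.pad(x, pads)

activations = np.ones((200, 300), dtype=np.float32)  # awkward, non-multiple shapes
weights = np.ones((300, 50), dtype=np.float32)
a, w = pad_to_multiple(activations), pad_to_multiple(weights)
print(a.shape, w.shape)                    # (256, 384) (384, 128)
useful = activations.size / a.size
print(f"useful fraction of the padded LHS: {useful:.2f}")  # ~0.61; the rest is wasted
```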