Expected Learning Outcomes

  • Understand how close a model is to its theoretical optimum
  • Choose which parallelism strategy to use at which scale
  • Estimate the cost and time required to train or run large Transformer models
  • Hardware-limitation-aware algorithm design 🔄 algorithm-limitation-aware hardware design
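As a concrete instance of the cost/time estimation goal, here is a back-of-envelope sketch using the common C ≈ 6 · N · D approximation for training FLOPs (N parameters, D tokens). All hardware numbers below (chip count, peak FLOP/s, MFU) are illustrative assumptions, not figures from the text.

```python
# Back-of-envelope training-time estimate via C ≈ 6 * N * D FLOPs.
# Every hardware number here is an assumed, illustrative value.

def training_days(params, tokens, num_chips, peak_flops_per_chip, mfu):
    """Estimated wall-clock days to train at a given model FLOPs utilization (MFU)."""
    total_flops = 6 * params * tokens
    achieved_flops_per_sec = num_chips * peak_flops_per_chip * mfu
    return total_flops / achieved_flops_per_sec / 86_400  # 86,400 s per day

# Hypothetical run: 70B params, 15T tokens, 16,384 chips at 1e15 FLOP/s peak, 40% MFU.
print(f"{training_days(70e9, 15e12, 16_384, 1e15, 0.40):.1f} days")  # ~11 days
```

Sketches like this are why the later sections matter: the MFU term is exactly what hardware-aware algorithm design tries to raise.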

Why do we care?

Model sizes and scaling laws have reached the point where cutting-edge research on models and architectures must take hardware constraints into account to improve performance.

“A 20% better model which comes at a 20% cost to Roofline efficiency is irrelevant.”
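To make "Roofline efficiency" concrete: under the roofline model, an operation is compute-bound only if its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the chip's ridge point. A minimal sketch, with assumed (not chip-specific) peak FLOP/s and bandwidth numbers:

```python
# Roofline check: compute-bound iff arithmetic intensity > ridge point.
# PEAK_FLOPS and HBM_BW are illustrative assumptions, not a real chip's specs.

PEAK_FLOPS = 1e15              # FLOP/s (assumed)
HBM_BW = 1e12                  # bytes/s of memory bandwidth (assumed)
RIDGE = PEAK_FLOPS / HBM_BW    # FLOPs per byte needed to saturate compute

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """Arithmetic intensity of an (m,k) x (k,n) matmul in bf16."""
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / traffic

print(matmul_intensity(4096, 4096, 4096) > RIDGE)  # large matmul: compute-bound
print(matmul_intensity(1, 4096, 4096) > RIDGE)     # batch-1 matvec: memory-bound
```

The quote's point follows directly: a model change that lowers arithmetic intensity below the ridge point wastes the chip, even if it improves quality.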

Goal of scaling

Strong scaling is when increasing the number of chips used for training/inference leads to a proportional increase in throughput.

An algorithm is communication bound when its speed is limited by data communication between chips; as a result, it cannot scale strongly.
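A toy model shows why communication breaks strong scaling: if per-step time is compute (which shrinks as chips are added) plus a communication term (modeled here as a fixed per-step cost, as in a latency-bound collective), speedup plateaus. The numbers are illustrative assumptions:

```python
# Toy strong-scaling model: compute time divides across chips,
# communication does not. All timings are illustrative assumptions.

def step_time(chips, compute_time=1.0, comm_time=0.01):
    return compute_time / chips + comm_time

for chips in (1, 8, 64, 512):
    speedup = step_time(1) / step_time(chips)
    print(f"{chips:4d} chips -> {speedup:6.1f}x speedup")
```

With these numbers, 512 chips yield well under 100x: the fixed communication cost caps speedup at compute_time/comm_time + 1, no matter how many chips are added.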

Outline

  • Sec 1-3: Intro and basics
  • Sec 4: Transformers
  • Sec 5 & 7: Training and Inference (most important)
  • Sec 6 & 8: Practical examples applying Sec 5 & 7 to Llama 3
  • Sec 9 & 10: How to implement, profile, and debug