Expected Learning Outcomes
- Understand how close a model is to its theoretical optimum
- Choose which parallelism strategy to use at which scale
- Estimate the cost and time required to train or run large Transformer models
- Hardware-limitation-aware algorithm design 🔄 algorithm-limitation-aware hardware design
Why do we care?
Model sizes and scaling laws have reached a point where cutting-edge research on models and architectures must take hardware constraints into account in order to improve performance.
“A 20% better model which comes at a 20% cost to Roofline efficiency is irrelevant.”
Goal of scaling
Strong scaling is when increasing the number of chips used for training or inference leads to a proportional increase in throughput.
An algorithm is communication bound when its speed is limited by data communication between chips rather than by computation; such an algorithm cannot strongly scale.
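To make the distinction concrete, here is a minimal sketch (not from the source) of a toy per-step time model: compute shards perfectly across chips, but communication cost is assumed to grow linearly with chip count. The function names and constants are hypothetical, chosen only to illustrate how a communication-bound workload breaks strong scaling.

```python
def step_time(n_chips, compute_time=100.0, comm_per_chip=2.0):
    """Per-step time under a toy model: compute divides evenly across
    chips, while communication cost grows with chip count (assumed
    linear here purely for illustration)."""
    return compute_time / n_chips + comm_per_chip * n_chips

def scaling_efficiency(n_chips):
    """Ratio of ideal (perfectly strong-scaled) time to actual time.
    1.0 means perfect strong scaling; lower means communication is
    eating the gains."""
    ideal = step_time(1) / n_chips
    return ideal / step_time(n_chips)

for n in (1, 2, 8, 32):
    print(f"{n:3d} chips: efficiency {scaling_efficiency(n):.3f}")
```

Under these made-up constants, efficiency starts near 1.0 and collapses as chips are added, because the linear communication term eventually dominates the shrinking per-chip compute term. This is the regime the definition above calls communication bound.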
Outline
- Sections 1-3: Intro and basics
- Section 4: Transformers
- Sections 5 & 7: Training and Inference (most important)
- Sections 6 & 8: Practical application of Sections 5 and 7 to Llama 3
- Sections 9 & 10: How to implement, profile, and debug