Expected Learning Outcomes

  • Understand how close a model is to its theoretical optimum
  • Choose which parallelism strategy to use at which scale
  • Estimate the cost and time required to train or run large Transformer models
  • Hardware-limitation-aware algorithm design 🔄 algorithm-limitation-aware hardware design
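As a concrete instance of the cost/time estimation goal, here is a back-of-envelope sketch using the common C ≈ 6 · N · D approximation for training FLOPs (N parameters, D tokens). All hardware numbers below (chip count, peak FLOP/s, MFU) are illustrative assumptions, not figures from the text.

```python
# Back-of-envelope training-time estimate via C ≈ 6 * N * D FLOPs.
# Every hardware number here is an assumed, illustrative value.

def training_days(params, tokens, num_chips, peak_flops_per_chip, mfu):
    """Estimated wall-clock days to train at a given model FLOPs utilization (MFU)."""
    total_flops = 6 * params * tokens
    achieved_flops_per_sec = num_chips * peak_flops_per_chip * mfu
    return total_flops / achieved_flops_per_sec / 86_400  # 86,400 s per day

# Hypothetical run: 70B params, 15T tokens, 16,384 chips at 1e15 FLOP/s peak, 40% MFU.
print(f"{training_days(70e9, 15e12, 16_384, 1e15, 0.40):.1f} days")  # ~11 days
```

Sketches like this are why the later sections matter: the MFU term is exactly what hardware-aware algorithm design tries to raise.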

Why do we care?

Model sizes and scaling laws have reached the point where cutting-edge research on models and architectures must take hardware constraints into account to improve performance.

“A 20% better model which comes at a 20% cost to Roofline efficiency is irrelevant.”
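To make "Roofline efficiency" concrete: under the roofline model, an operation is compute-bound only if its arithmetic intensity (FLOPs per byte of memory traffic) exceeds the chip's ridge point. A minimal sketch, with assumed (not chip-specific) peak FLOP/s and bandwidth numbers:

```python
# Roofline check: compute-bound iff arithmetic intensity > ridge point.
# PEAK_FLOPS and HBM_BW are illustrative assumptions, not a real chip's specs.

PEAK_FLOPS = 1e15              # FLOP/s (assumed)
HBM_BW = 1e12                  # bytes/s of memory bandwidth (assumed)
RIDGE = PEAK_FLOPS / HBM_BW    # FLOPs per byte needed to saturate compute

def matmul_intensity(m, k, n, bytes_per_elem=2):
    """Arithmetic intensity of an (m,k) x (k,n) matmul in bf16."""
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)  # read A, B; write C
    return flops / traffic

print(matmul_intensity(4096, 4096, 4096) > RIDGE)  # large matmul: compute-bound
print(matmul_intensity(1, 4096, 4096) > RIDGE)     # batch-1 matvec: memory-bound
```

The quote's point follows directly: a model change that lowers arithmetic intensity below the ridge point wastes the chip, even if it improves quality.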

Goal of scaling

Strong scaling is when increasing the number of chips used for training/inference leads to a proportional increase in throughput.

An algorithm is communication bound when its speed is limited by data communication between chips; as a result, it cannot scale strongly.
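A toy model shows why communication breaks strong scaling: if per-step time is compute (which shrinks as chips are added) plus a communication term (modeled here as a fixed per-step cost, as in a latency-bound collective), speedup plateaus. The numbers are illustrative assumptions:

```python
# Toy strong-scaling model: compute time divides across chips,
# communication does not. All timings are illustrative assumptions.

def step_time(chips, compute_time=1.0, comm_time=0.01):
    return compute_time / chips + comm_time

for chips in (1, 8, 64, 512):
    speedup = step_time(1) / step_time(chips)
    print(f"{chips:4d} chips -> {speedup:6.1f}x speedup")
```

With these numbers, 512 chips yield well under 100x: the fixed communication cost caps speedup at compute_time/comm_time + 1, no matter how many chips are added.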

Outline

  • Sec 1-3: Intro and basics
  • Sec 4: Transformers
  • Sec 5 & 7: Training and Inference (most important)
  • Sec 6 & 8: Practical examples applying Sec 5 & 7 to Llama 3
  • Sec 9 & 10: How to implement, profile, and debug