Courses
- High Performance LLMs in JAX Session 2: Single-Chip Performance & Rooflines
- High Performance LLMs in JAX Session 3: Multi-Chip Performance & Rooflines
Exploratory
How CUDA Programming Works https://www.nvidia.com/en-us/on-demand/session/gtcspring22-s41487/
- discusses some of the physics behind why memory I/O works the way it does, why coalescing is fast, why 128 threads / 4 warps matter (1024-byte page size), etc.
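
A rough sketch of the arithmetic behind the 128-thread / 1024-byte remark above. The 8 bytes loaded per thread is an assumption for illustration (it is not stated in these notes), as is the page-aligned base address; `pages_touched` is a hypothetical helper, not part of any CUDA API.

```python
# Illustrative arithmetic only, not a model of real GPU hardware.
WARP_SIZE = 32          # threads per warp on NVIDIA GPUs
WARPS = 4               # warps mentioned in the note
PAGE_BYTES = 1024       # page size quoted in the note
BYTES_PER_THREAD = 8    # ASSUMPTION: each thread loads 8 bytes (e.g. two 4-byte floats)

threads = WARP_SIZE * WARPS                    # 128 threads
bytes_requested = threads * BYTES_PER_THREAD   # 1024 bytes, i.e. one page under the assumption


def pages_touched(stride_elems, n_threads=threads,
                  elem_bytes=BYTES_PER_THREAD, page_bytes=PAGE_BYTES):
    """Count distinct pages hit when thread i reads element i * stride_elems.

    Hypothetical helper; assumes the base address is page-aligned.
    """
    return len({(i * stride_elems * elem_bytes) // page_bytes
                for i in range(n_threads)})


print(threads, bytes_requested)
print(pages_touched(1))    # adjacent threads read adjacent elements (coalesced)
print(pages_touched(128))  # each thread jumps a full page ahead (strided)
```

With unit stride all 128 threads land in a single page, while a stride of 128 elements spreads them over 128 different pages, which is the intuition for why coalesced access is fast.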