Exploratory
Scale compute workloads across Apple GPUs https://developer.apple.com/videos/play/wwdc2022/10159/
- Identify bottleneck (compute, memory)
- Leverage MPS/MPSGraph
- Minimize GPU gaps
- Improve work distribution
- Eliminate GPU timeline gaps
- Kernel synchronization bottleneck
- Atomic Operations
- SIMD group instructions: let threads in a SIMD group exchange data directly
- e.g. simd_prefix_exclusive_sum, simd_min, and many more
- Threadgroup atomics
- Optimize GPU limiters
- e.g. inefficient memory access
- threadgroup shape should align with memory layout
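A rough CPU-side sketch of why the atomics advice above pays off: if each SIMD group sums its own lanes first and only then does one atomic add, the number of atomic operations drops by roughly the SIMD width. This is plain C++ emulation, not Metal Shading Language; the function names are made up for illustration, the inner loop merely stands in for something like simd_sum, and the default width of 32 lanes is an assumption.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU-side emulation for illustration only; not Metal Shading Language.
struct ReduceResult {
    std::uint32_t sum;       // reduced value
    std::size_t atomic_ops;  // atomic adds issued on the shared counter
};

// Baseline: every emulated thread issues its own atomic add.
inline ReduceResult reduce_per_thread(const std::vector<std::uint32_t>& data) {
    std::atomic<std::uint32_t> total{0};
    std::size_t ops = 0;
    for (std::uint32_t v : data) {
        total.fetch_add(v, std::memory_order_relaxed);
        ++ops;
    }
    return {total.load(), ops};
}

// Hierarchical: sum each group of `simd_width` lanes privately (a stand-in
// for a SIMD-group reduction), then issue one atomic add per group.
// Assumption: 32 lanes per SIMD group by default.
inline ReduceResult reduce_per_simd_group(const std::vector<std::uint32_t>& data,
                                          std::size_t simd_width = 32) {
    assert(simd_width > 0);
    std::atomic<std::uint32_t> total{0};
    std::size_t ops = 0;
    for (std::size_t base = 0; base < data.size(); base += simd_width) {
        const std::size_t end = std::min(base + simd_width, data.size());
        std::uint32_t partial = 0;  // lives in "registers", no atomics needed
        for (std::size_t i = base; i < end; ++i) {
            partial += data[i];
        }
        total.fetch_add(partial, std::memory_order_relaxed);
        ++ops;
    }
    return {total.load(), ops};
}
```

For 1024 inputs the baseline issues 1024 atomic adds, while the grouped version issues 1024 / 32 = 32. On a real GPU the same idea can be layered once more with threadgroup atomics before touching device memory.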
Metal Compute on MacBook Pro https://developer.apple.com/videos/play/tech-talks/10580
- Kernel optimizations
- Use signed ints instead of unsigned ints for array indices
- Unsigned indices disable vectorized loads
- Minimize atomic operations
- use thread-group atomics
- can use GPU profiling: ALU utilization, kernel occupancy (threads active relative to max)
- don’t exhaust thread memory (registers) or threadgroup memory
- can set max_total_threads_per_threadgroup
- Register pressure
- Use lower-precision types (e.g. short/half instead of int/float)
- Reduce stack data
- constant address space
- Avoid dynamic indexing for stack or constant data
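A small CPU-side sketch of the likely reasoning behind the signed-index advice above: unsigned index arithmetic is defined to wrap, so four loads a[base + 0..3] are not guaranteed to hit adjacent elements, and the compiler cannot safely merge them into one wide load. Signed overflow is undefined, so with signed indices the compiler may assume it never happens. This is plain C++, not Metal Shading Language, and both helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Element indices touched by `count` consecutive loads a[base + k],
// computed with 32-bit unsigned arithmetic (wraps modulo 2^32).
inline std::vector<std::uint32_t> unsigned_indices(std::uint32_t base,
                                                   std::uint32_t count) {
    std::vector<std::uint32_t> idx;
    idx.reserve(count);
    for (std::uint32_t k = 0; k < count; ++k) {
        idx.push_back(base + k);  // well-defined wraparound for unsigned types
    }
    return idx;
}

// True if every index is exactly one past its predecessor, i.e. the loads
// cover one contiguous run. Compared in 64 bits so the check cannot wrap.
inline bool is_contiguous(const std::vector<std::uint32_t>& idx) {
    assert(!idx.empty());
    for (std::size_t k = 1; k < idx.size(); ++k) {
        if (std::uint64_t{idx[k - 1]} + 1 != std::uint64_t{idx[k]}) {
            return false;
        }
    }
    return true;
}
```

With base = 10 the four indices are 10, 11, 12, 13 and the run is contiguous. With base = 0xFFFFFFFE they are 4294967294, 4294967295, 0, 1, which is a legal result for unsigned math, so the compiler has to allow for it.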
Optimize Metal performance on Mac https://developer.apple.com/videos/play/wwdc2020/10632/