Shashank Shekhar


28 March, 2025 Readings

Mar 28, 2025 (1 min read)

Exploratory

Scale compute workloads across Apple GPUs https://developer.apple.com/videos/play/wwdc2022/10159/

Identify bottleneck (compute, memory)
Leverage MPS/MPSGraph
Minimize GPU gaps
- Improve work distribution
- Eliminate GPU timeline gaps
  - Kernel synchronization bottleneck
- Atomic Operations
  - SIMD group instructions: exchange and combine values across the threads of a SIMD group without going through threadgroup memory
    - simd_prefix_exclusive_sum
    - simd_min
    - many more
  - Threadgroup atomics
Optimize GPU limiters
- e.g. inefficient memory access
- threadgroup shape should align with memory layout
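
To make the SIMD group instructions above concrete, here is a minimal CPU-side C++ model of what `simd_prefix_exclusive_sum` and `simd_min` compute across the lanes of one SIMD group. It is a sketch of the semantics only, not Metal code: the `model_*` function names and the `Lanes` type are hypothetical, and the width of 32 is an assumption that should be checked against the pipeline's reported execution width.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Assumed SIMD-group width for this sketch (hypothetical constant).
constexpr std::size_t kSimdWidth = 32;

// One value per lane (thread) of a single SIMD group.
using Lanes = std::array<int, kSimdWidth>;

// Models simd_prefix_exclusive_sum: lane i receives the sum of lanes 0..i-1,
// so lane 0 always receives 0.
Lanes model_prefix_exclusive_sum(const Lanes& in) {
    Lanes out{};
    int running = 0;
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        out[lane] = running;   // sum of all earlier lanes
        running += in[lane];   // include this lane for the next one
    }
    return out;
}

// Models simd_min: every lane receives the minimum over all lanes.
Lanes model_simd_min(const Lanes& in) {
    Lanes out{};
    out.fill(*std::min_element(in.begin(), in.end()));
    return out;
}
```

On the GPU these results are produced in one instruction per SIMD group, which is why they can replace a round trip through threadgroup memory plus a barrier.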

Metal Compute on MacBook Pro https://developer.apple.com/videos/play/tech-talks/10580

Kernel optimizations
- Use signed ints (int) rather than unsigned (uint) to index arrays
  - Unsigned indices have defined wrap-around behavior, which can disable vectorized loads
- Minimize atomic operations
  - use thread-group atomics
  - can use GPU profiling: ALU utilization, kernel occupancy (threads active relative to max)
    - don’t exhaust thread memory (registers) or threadgroup memory
    - can set max_total_threads_per_threadgroup
  - Register pressure
    - Use less precision (e.g. short/half instead of int/float)
    - Reduce stack data
    - constant address space
    - Avoid dynamic indexing for stack or constant data
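
The "use thread-group atomics" note above can be illustrated with a sequential CPU-side C++ model that counts how many updates hit the shared counter. This is a sketch under stated assumptions, not Metal code: `std::atomic` stands in for device and threadgroup atomics, the loop order stands in for GPU scheduling, and `ReductionResult`, `reduce_per_thread`, and `reduce_per_threadgroup` are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct ReductionResult {
    long long sum;               // final reduced value
    std::size_t device_atomics;  // updates made to the shared ("device") counter
};

// Naive scheme: every element does one atomic add on the shared counter.
ReductionResult reduce_per_thread(const std::vector<int>& data) {
    std::atomic<long long> device_sum{0};
    std::size_t ops = 0;
    for (int v : data) {
        device_sum.fetch_add(v, std::memory_order_relaxed);
        ++ops;
    }
    return {device_sum.load(), ops};
}

// Two-level scheme: each group accumulates into a group-local counter
// (standing in for a threadgroup atomic), then does a single add on the
// shared counter.
ReductionResult reduce_per_threadgroup(const std::vector<int>& data,
                                       std::size_t group_size) {
    assert(group_size > 0);
    std::atomic<long long> device_sum{0};
    std::size_t ops = 0;
    for (std::size_t start = 0; start < data.size(); start += group_size) {
        std::atomic<long long> group_sum{0};
        const std::size_t end = std::min(start + group_size, data.size());
        for (std::size_t i = start; i < end; ++i) {
            group_sum.fetch_add(data[i], std::memory_order_relaxed);
        }
        device_sum.fetch_add(group_sum.load(std::memory_order_relaxed),
                             std::memory_order_relaxed);
        ++ops;  // one shared-counter update per group
    }
    return {device_sum.load(), ops};
}
```

For 1000 elements and a group size of 256, the two-level version performs 4 shared-counter updates instead of 1000 while producing the same sum, which is the contention reduction the note is after.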

Optimize Metal performance on Mac https://developer.apple.com/videos/play/wwdc2020/10632/
