Exploratory

Scale compute workloads across Apple GPUs https://developer.apple.com/videos/play/wwdc2022/10159/

  • Identify bottleneck (compute, memory)
  • Leverage MPS/MPSGraph
  • Minimize GPU gaps
    • Improve work distribution
    • Eliminate GPU timeline gaps
      • Kernel synchronization bottleneck
    • Atomic Operations
      • SIMD group instructions: exchange and reduce data across the threads of a SIMD group without a round trip through memory
        • simd_prefix_exclusive_sum
        • simd_min
        • many more
      • Threadgroup atomics
  • Optimize GPU limiters
    • e.g. inefficient memory access
    • threadgroup shape should align with memory layout
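
To make the "SIMD group instructions" and "atomic operations" bullets above concrete, here is a minimal CPU-side sketch in C++, not Metal code. It models what simd_prefix_exclusive_sum computes and how a SIMD group can reserve output slots with one atomic add instead of one per thread. The 32-lane width and the names kSimdWidth, Reservation and reserve_slots are assumptions made for illustration; they do not come from the talk.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumption for this sketch: 32 lanes per SIMD group.
constexpr std::size_t kSimdWidth = 32;

using Lanes = std::array<std::uint32_t, kSimdWidth>;

// CPU model of simd_prefix_exclusive_sum:
// lane i receives the sum of the values held by lanes 0..i-1.
Lanes prefix_exclusive_sum(const Lanes& values) {
    Lanes out{};
    std::uint32_t running = 0;
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        out[lane] = running;
        running += values[lane];
    }
    return out;
}

// Hypothetical helper: result of reserving output slots for one SIMD group.
struct Reservation {
    Lanes first_slot;  // first output slot owned by each lane
};

// Each lane wants `wants[lane]` output slots. Instead of one atomic add per
// lane, the group sums its requests, issues a single atomic add on the shared
// counter, and each lane offsets from the returned base by its prefix sum.
Reservation reserve_slots(const Lanes& wants, std::atomic<std::uint32_t>& counter) {
    const Lanes offsets = prefix_exclusive_sum(wants);
    const std::uint32_t total = offsets[kSimdWidth - 1] + wants[kSimdWidth - 1];
    const std::uint32_t base = counter.fetch_add(total, std::memory_order_relaxed);

    Reservation result{};
    for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {
        result.first_slot[lane] = base + offsets[lane];
    }
    return result;
}
```

In a real kernel the loop over lanes does not exist: every thread calls the SIMD group function once, and only one thread per group would issue the atomic. The sketch only shows the arithmetic that makes this possible.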

Metal Compute on MacBook Pro https://developer.apple.com/videos/play/tech-talks/10580

  • Kernel optimizations
    • Use signed ints instead of unsigned ints for array indexing
      • Unsigned indices disable vectorized loads
    • Minimize atomic operations
      • use threadgroup atomics
      • can use GPU profiling: ALU utilization, kernel occupancy (threads active relative to max)
        • don’t exhaust thread memory (registers) or threadgroup memory
        • can set max_total_threads_per_threadgroup
      • Register pressure
        • Use less precision (e.g. short/half instead of int/float)
        • Reduce stack data
        • Place read-only data in the constant address space
        • Avoid dynamic indexing for stack or constant data
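
The signed-versus-unsigned indexing rule above is easier to remember with the arithmetic written out. The usual explanation, stated here as an assumption about what the talk means, is that unsigned index arithmetic is defined to wrap modulo 2^32, so the compiler cannot prove that adjacent indices address adjacent elements and therefore cannot merge them into one vector load. Signed overflow is undefined, so the compiler may assume it never happens. The following is a minimal CPU-side C++ sketch of that wrap-around, not Metal code; unsigned_index and run_is_contiguous are hypothetical names.

```cpp
#include <cstdint>

// Element index as a kernel might compute it with a 32-bit unsigned index.
// Unsigned arithmetic wraps modulo 2^32 by definition.
constexpr std::uint32_t unsigned_index(std::uint32_t base, std::uint32_t lane) {
    return base + lane;
}

// True when the `count` indices starting at `base` form one ascending,
// contiguous run, i.e. the last index did not wrap around below `base`.
// Only such a run could be serviced by a single wide load.
constexpr bool run_is_contiguous(std::uint32_t base, std::uint32_t count) {
    return count == 0 || unsigned_index(base, count - 1) >= base;
}
```

With base 16 and four elements the run is contiguous. With base 0xFFFFFFFE the third index wraps to 0, which is the case the compiler has to allow for whenever the index type is unsigned.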

Optimize Metal performance on Mac https://developer.apple.com/videos/play/wwdc2020/10632/