Apple Silicon Metal vs NVIDIA CUDA


| Metal Term | CUDA Equivalent | Description |
|---|---|---|
| GPU Core | Streaming Multiprocessor (SM) | Basic compute unit containing multiple ALUs, schedulers, and caches |
| Grid | Grid | Overall structure of work to be processed by the GPU |
| Threadgroup | Thread Block | Group of threads that can synchronize and share memory |
| Thread | Thread | Individual execution unit that processes a single element of work |
| SIMD-group | Warp | Group of 32 threads executed in lockstep |
| Threadgroup Memory | Shared Memory | Fast memory accessible by all threads in a threadgroup/block |
| Device Memory | Global Memory | Main GPU memory accessible by all threads |
| Constant Memory | Constant Memory | Read-only memory optimized for broadcast access |
| Texture Memory | Texture Memory | Specialized memory with spatial caching for image access |
| Command Buffer | CUDA Stream | Sequence of commands executed in order |
| Command Queue | CUDA Context | Container for scheduling command buffers/streams |
| Compute Pass | Kernel Launch | Execution of a function on the GPU |
| Metal Kernel | CUDA Kernel | Function executed in parallel on the GPU |
| Threadgroup Barrier | `__syncthreads()` | Synchronization point for all threads in a threadgroup/block |
| SIMD-group Barrier | `__syncwarp()` | Synchronization within a warp/SIMD-group |
| Memory Fence | Memory Fence | Ensures memory operations are visible to other threads |
| Device Function | Device Function | Helper function called by kernels, executed on GPU |
| Buffer | Memory Pointer | Reference to allocated memory |
| MTLSize | dim3 | 3D structure for specifying dimensions |
| Argument Buffer | Kernel Parameters | Way to pass structured data to kernels |
| Threadgroup Size | Block Dimensions | Number of threads in each threadgroup/block |
| Grid Size | Grid Dimensions | Number of threadgroups/blocks in the grid |
| Metal Compute Pipeline | CUDA Module/Program | Compiled GPU code ready for execution |
| `simd_shuffle()` | `__shfl_sync()` | Exchange values between threads in a SIMD-group/warp |
| Metal Heap | CUDA Memory Pool | Allocation mechanism for device memory |
| Indirect Command Buffer | CUDA Graph | Precompiled sequence of commands for repeated execution |
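To make the mapping concrete, here is the same element-wise vector-add kernel written both ways (a minimal sketch; buffer allocation, pipeline setup, and error handling are omitted). Note how Metal hands you the flat grid index directly, while CUDA makes you compute it from the block and thread indices:

```metal
#include <metal_stdlib>
using namespace metal;

// Metal kernel: one thread per element. thread_position_in_grid
// plays the role of blockIdx.x * blockDim.x + threadIdx.x.
kernel void vector_add(device const float* a [[buffer(0)]],
                       device const float* b [[buffer(1)]],
                       device float*       out [[buffer(2)]],
                       uint i [[thread_position_in_grid]])
{
    out[i] = a[i] + b[i];
}
```

```cuda
// CUDA kernel: the global index is computed by hand, and a bounds
// check guards the last partial block.
__global__ void vector_add(const float* a, const float* b,
                           float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}
```

On the host side the correspondence continues: Metal encodes the dispatch into a command buffer inside a compute pass, while CUDA launches directly with `vector_add<<<gridDim, blockDim>>>(...)` on a stream.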
  • Sync threads in a threadgroup/block: threadgroup_barrier(mem_flags::mem_threadgroup) ≈ __syncthreads(). Note that __syncthreads() also fences shared memory, so mem_threadgroup is the closer match; mem_none is an execution-only barrier with no memory ordering.

  • Declare shared memory: threadgroup ≈ __shared__
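The two bullets above come together in a per-group tree reduction, shown here in both languages (a sketch assuming a group size of exactly 256 threads and an input padded to a multiple of 256):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void reduce_sum(device const float* in    [[buffer(0)]],
                       device float* partial     [[buffer(1)]],
                       uint lid   [[thread_position_in_threadgroup]],
                       uint gid   [[thread_position_in_grid]],
                       uint group [[threadgroup_position_in_grid]])
{
    threadgroup float tile[256];          // 'threadgroup' = __shared__
    tile[lid] = in[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    for (uint s = 128; s > 0; s >>= 1) {  // halve active threads each step
        if (lid < s) tile[lid] += tile[lid + s];
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    if (lid == 0) partial[group] = tile[0];
}
```

```cuda
__global__ void reduce_sum(const float* in, float* partial)
{
    __shared__ float tile[256];           // __shared__ = 'threadgroup'
    unsigned lid = threadIdx.x;
    unsigned gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[lid] = in[gid];
    __syncthreads();

    for (unsigned s = 128; s > 0; s >>= 1) {  // halve active threads each step
        if (lid < s) tile[lid] += tile[lid + s];
        __syncthreads();
    }
    if (lid == 0) partial[blockIdx.x] = tile[0];
}
```

The structure is line-for-line identical; only the storage qualifier and the barrier call change names.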

Architecture of a CUDA Capable GPU

Architecture of a Metal Capable GPU

CUDA Device Memory Model

Apple Silicon GPU Memory Model
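One practical consequence of the two memory models, sketched from the host side: CUDA's dedicated device memory requires explicit transfers, while on Apple Silicon the CPU and GPU share one physical memory, so a buffer created with shared storage needs no copy at all (variable names here are illustrative):

```cuda
// CUDA: discrete device memory, explicit host <-> device copies.
float* d_a;
cudaMalloc(&d_a, n * sizeof(float));
cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
// ... launch kernels on d_a ...
cudaMemcpy(h_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_a);
```

```swift
// Metal on Apple Silicon: unified memory. A .storageModeShared buffer
// is visible to both CPU and GPU with no copy step.
let buf = device.makeBuffer(bytes: h_a,
                            length: n * MemoryLayout<Float>.stride,
                            options: .storageModeShared)
```

This is why the bandwidth comparison below is not apples-to-apples: CUDA's figure excludes the PCIe transfer cost that unified memory avoids entirely.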

| Feature | NVIDIA RTX 3090 (Ampere) | Apple M1 Max |
|---|---|---|
| Warp/SIMD size | 32 threads per warp | 32 threads per SIMD-group |
| Execution model | SIMT – each thread has its own context | SIMD – 32 lanes share one instruction stream |
| Compute units | 82 Streaming Multiprocessors (SMs) | 32 GPU cores |
| ALUs per compute unit | 128 FP32 ALUs per SM + 4 Tensor Cores | 128 ALUs per core |
| Total ALUs | ~10,496 FP32 ALUs | ~4,096 ALUs |
| Clock frequency | 1.7 GHz (boost up to 1.8 GHz) | 1.3 GHz |
| Theoretical FP32 performance | 35.6 TFLOPS | 10.4 TFLOPS |
| Low-precision math | FP16: 71.2 TFLOPS (via Tensor Cores) | FP16: 10.4 TFLOPS (no acceleration) |
| Shared memory per block/group | 48 KB (configurable) | 32 KB |
| L1 cache per compute unit | 128 KB per SM | 32 KB per core |
| L2 cache | 6 MB total | 24 MB System Level Cache (shared) |
| Memory | 24 GB GDDR6X (dedicated) | Up to 64 GB unified LPDDR5 |
| Memory bandwidth | 936 GB/s (384-bit bus) | 400 GB/s (512-bit bus) |
| TDP | 350 W | 60 W (entire SoC including CPU) |
| Max threads per block/group | 1024 threads | 1024 threads |
| Thread dimensions | 1024 × 1024 × 64 | 1024 × 1024 × 1024 |
| Grid/dispatch dimensions | 2³¹−1 × 65,535 × 65,535 | 2³²−1 × 65,535 × 65,535 |
| Registers per block/group | 65,536 | 65,536 |
| Max blocks per compute unit | 16 blocks per SM | 24 threadgroups per core |
| Registers per thread | Up to 255 | Up to 128 |
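The FP32 figures in the table follow directly from ALU count × clock × 2, since each ALU can retire one fused multiply-add (2 FLOPs) per cycle:

```latex
\text{RTX 3090: } 10{,}496 \times 1.7\,\text{GHz} \times 2 \approx 35.7\ \text{TFLOPS}
\text{M1 Max: } 4{,}096 \times 1.27\,\text{GHz} \times 2 \approx 10.4\ \text{TFLOPS}
```

(Apple's quoted 10.4 TFLOPS implies a sustained GPU clock closer to 1.27 GHz than the rounded 1.3 GHz headline figure.)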