Apple Silicon Metal vs NVIDIA CUDA
| Metal Term | CUDA Equivalent | Description |
|---|---|---|
| GPU Core | Streaming Multiprocessor (SM) | Basic compute unit containing multiple ALUs, schedulers, and caches |
| Grid | Grid | Overall structure of work to be processed by the GPU |
| Threadgroup | Thread Block | Group of threads that can synchronize and share memory |
| Thread | Thread | Individual execution unit that processes a single element of work |
| SIMD-group | Warp | Group of 32 threads executed in lockstep |
| Threadgroup Memory | Shared Memory | Fast memory accessible by all threads in a threadgroup/block |
| Device Memory | Global Memory | Main GPU memory accessible by all threads |
| Constant Memory | Constant Memory | Read-only memory optimized for broadcast access |
| Texture Memory | Texture Memory | Specialized memory with spatial caching for image access |
| Command Buffer | CUDA Stream | Sequence of commands executed in order |
| Command Queue | CUDA Context | Container for scheduling command buffers/streams |
| Compute Pass | Kernel Launch | Execution of a function on the GPU |
| Metal Kernel | CUDA Kernel | Function executed in parallel on the GPU |
| Threadgroup Barrier | __syncthreads() | Synchronization point for all threads in a threadgroup/block |
| SIMD-group Barrier | __syncwarp() | Synchronization within a warp/SIMD-group |
| Memory Fence | Memory Fence | Ensures memory operations are visible to other threads |
| Device Function | Device Function | Helper function called by kernels, executed on GPU |
| Buffer | Memory Pointer | Reference to allocated memory |
| MTLSize | dim3 | 3D structure for specifying dimensions |
| Argument Buffer | Kernel Parameters | Way to pass structured data to kernels |
| Threadgroup Size | Block Dimensions | Number of threads in each threadgroup/block |
| Grid Size | Grid Dimensions | Number of threadgroups/blocks in the grid |
| Metal Compute Pipeline | CUDA Module/Program | Compiled GPU code ready for execution |
| simdgroup_shuffle() | __shfl() | Exchange values between threads in a SIMD group/warp |
| Metal Heap | CUDA Memory Pool | Allocation mechanism for device memory |
| Indirect Command Buffer | CUDA Graph | Precompiled sequence of commands for repeated execution |
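To make the terminology above concrete, here is a minimal sketch (illustrative, not taken from an official sample) of the same element-wise addition kernel written in CUDA, with the Metal Shading Language version shown in comments:

```cuda
// CUDA kernel: element-wise addition of two float arrays.
// Metal Shading Language equivalent:
//   kernel void add(device const float* a [[buffer(0)]],
//                   device const float* b [[buffer(1)]],
//                   device float* out    [[buffer(2)]],
//                   uint i [[thread_position_in_grid]],
//                   constant int& n     [[buffer(3)]])
//   { if (i < n) out[i] = a[i] + b[i]; }
__global__ void add(const float* a, const float* b, float* out, int n) {
    // blockIdx / blockDim / threadIdx correspond to Metal's
    // threadgroup_position_in_grid / threads_per_threadgroup /
    // thread_position_in_threadgroup attributes.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a[i] + b[i];
}
```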
Quick syntax mappings:

- Synchronize threads in a threadgroup/block: `threadgroup_barrier(mem_flags::mem_none)` = `__syncthreads()` (for full parity use `mem_flags::mem_threadgroup`, since `__syncthreads()` also fences shared memory, while `mem_none` only synchronizes execution)
- Declare shared memory: `threadgroup` = `__shared__`
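Both mappings appear together in a typical block-wide reduction. A minimal CUDA sketch (Metal equivalents in comments; assumes the kernel is launched with 256 threads per block):

```cuda
// CUDA: shared memory + barrier in a block-wide sum.
__global__ void blockSum(const float* in, float* out) {
    // __shared__  <->  threadgroup
    __shared__ float tile[256];
    int t = threadIdx.x;  // <-> thread_position_in_threadgroup
    tile[t] = in[blockIdx.x * blockDim.x + t];
    // __syncthreads()  <->  threadgroup_barrier(mem_flags::mem_threadgroup)
    __syncthreads();
    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) tile[t] += tile[t + s];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = tile[0];
}
```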
Architecture of a CUDA Capable GPU

Architecture of a Metal Capable GPU

CUDA Device Memory Model

Apple Silicon GPU Memory Model

| Feature | NVIDIA RTX 3090 (Ampere) | Apple M1 Max |
|---|---|---|
| Warp/SIMD Size | 32 threads per warp | 32 threads per SIMD-group |
| Execution Model | SIMT – each thread has its own context | SIMD – 32 lanes share one instruction stream |
| Compute Units | 82 Streaming Multiprocessors (SMs) | 32 GPU cores |
| ALUs per Compute Unit | 128 FP32 ALUs per SM + 4 Tensor Cores | 128 ALUs per core |
| Total ALUs | ~10,496 FP32 ALUs | ~4,096 ALUs |
| Clock Frequency | 1.40 GHz base (boost up to 1.70 GHz) | ~1.3 GHz |
| Theoretical FP32 Performance | 35.6 TFLOPS | 10.4 TFLOPS |
| Low-Precision Math | FP16: 71.2 TFLOPS (via Tensor Cores) | FP16: 10.4 TFLOPS (same rate as FP32, no dedicated matrix hardware) |
| Shared Memory per Block/Group | 48 KB default (configurable up to 100 KB on Ampere) | 32 KB |
| L1 Cache per Compute Unit | 128 KB per SM | 32 KB per core |
| L2 Cache | 6 MB total | 48 MB System Level Cache (shared with CPU) |
| Memory | 24 GB GDDR6X (dedicated) | Up to 64 GB unified LPDDR5 |
| Memory Bandwidth | 936 GB/s (384-bit bus) | 400 GB/s (512-bit bus) |
| TDP | 350W | 60W (entire SoC including CPU) |
| Max Threads Per Block/Group | 1024 threads | 1024 threads |
| Thread Dimensions | 1024 × 1024 × 64 | 1024 × 1024 × 1024 |
| Grid/Dispatch Dimensions | 2³¹-1 × 65535 × 65535 | 2³²-1 × 65535 × 65535 |
| Registers Per Block/Group | 65536 | 65536 |
| Max Blocks Per Compute Unit | 16 blocks per SM | 24 threadgroups per core |
| Registers Per Thread | Up to 255 | Up to 128 |
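The grid- and block-dimension rows above map directly to the host-side launch call. A hedged CUDA host sketch (names are illustrative), with the Metal equivalent via `MTLComputeCommandEncoder` shown in comments:

```cuda
// Double every element of a device array.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Host side: choose block/grid dimensions for n elements.
void launch(float* d_data, int n) {
    // dim3 <-> MTLSize; 256 threads per block (max 1024 on both platforms).
    dim3 block(256, 1, 1);  // Metal: MTLSizeMake(256, 1, 1)
    dim3 grid((n + block.x - 1) / block.x, 1, 1);
    // Metal equivalent:
    //   [encoder dispatchThreadgroups:grid threadsPerThreadgroup:block];
    scale<<<grid, block>>>(d_data, n);  // kernel launch == compute pass
}
```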