Mostly read up on the Tensor Memory Accelerator, multicast, and tensor cores:
CUTLASS Tutorial: Mastering the NVIDIA® Tensor Memory Accelerator (TMA) https://research.colfax-intl.com/tutorial-hopper-tma/
Outperforming cuBLAS on H100: a Worklog https://cudaforfun.substack.com/p/outperforming-cublas-on-h100-a-worklog