Once you’ve read through the notes, you can check out the Chapter 5 exercise solutions, although I strongly suggest trying them on your own first.
Background
- Accessing DRAM global memory is usually slow
- While having many threads available still provides some speedup, without optimizing memory access we cannot exploit the CUDA paradigm well
- On-chip memory resources are available to mitigate this
Importance of memory access efficiency
- Naive matmul achieves only about 0.25 FLOP per byte of global memory traffic (4 B/op)
- Each inner-loop iteration loads 2 floats (8 bytes, one from each matrix) and performs 2 operations: one multiply and one add
- But the computational throughput of GPUs is far higher, so such poor 'arithmetic intensity' cannot keep the cores busy
- e.g. on an Ampere A100 (1555 GB/s memory bandwidth), 0.25 OP/B limits the kernel to roughly 389 GFLOPS, about 2% of the 19,500 GFLOPS peak single-precision throughput
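For concreteness, here is a sketch of the naive kernel's inner loop (the Chapter 3 style kernel, with an illustrative name) with the global memory traffic annotated:
__global__ void naiveMatrixMulKernel(float* M, float* N, float* P, int Width) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < Width && col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k) {
            // 2 global loads (8 bytes) for 1 multiply + 1 add (2 FLOPs)
            // => 2 FLOPs / 8 B = 0.25 FLOP/B arithmetic intensity
            Pvalue += M[row*Width + k] * N[k*Width + col];
        }
        P[row*Width + col] = Pvalue;
    }
}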
Roofline Analysis
- Computational throughput (GFLOPS) on the y-axis
- Arithmetic intensity (FLOP/B) on the x-axis
- A horizontal line marks the peak computational throughput of the hardware (set by the number of cores)
- A sloped line through the origin, with slope equal to the memory bandwidth, marks the memory-bandwidth limit; achievable throughput is the minimum of the two lines
- Kernels whose intensity falls to the left of the intersection are memory bound; those to the right are compute bound
- The closer a kernel's measured point is to the roofline, the better it uses the available resources
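As a quick illustration of the roofline model, a small host-side helper (not from the book; the defaults are the A100 figures quoted in these notes) that computes the attainable throughput for a given arithmetic intensity:
#include <algorithm>
#include <cstdio>

// Roofline bound: attainable GFLOPS = min(peak, bandwidth * intensity).
// Defaults are the A100 numbers used in these notes (19,500 GFLOPS fp32,
// 1555 GB/s); swap in your own device's figures.
double rooflineGflops(double intensityFlopPerByte,
                      double peakGflops = 19500.0,
                      double bandwidthGBps = 1555.0) {
    return std::min(peakGflops, bandwidthGBps * intensityFlopPerByte);
}

int main() {
    printf("naive (0.25 FLOP/B): %.0f GFLOPS\n", rooflineGflops(0.25)); // ~389, memory bound
    printf("tiled (4 FLOP/B):    %.0f GFLOPS\n", rooflineGflops(4.0));  // ~6220, still memory bound
    return 0;
}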
CUDA Memory Types
Registers - Private to each individual thread; the fastest memory; device read/write
Local memory - Per-thread memory that actually resides in device memory; used for register spills and per-thread arrays
Shared memory - On-chip memory shared by all threads within a block; device read/write
Constant memory - Device can only read; host can read/write; scoped per grid (cached on chip)
Global memory - Both host and device can read/write; scoped per grid
Registers also hold automatic (temporary) variables:
- very fast to read and write
- arithmetic instructions can take their operands directly from registers, so no separate load into the ALU is needed
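A minimal sketch of where each of these memory types appears in CUDA source (the kernel and variable names are illustrative only):
__constant__ float coeffs[16];           // constant memory: host writes via cudaMemcpyToSymbol

__global__ void memorySpacesDemo(float* data /* global memory */, int n) {
    __shared__ float tile[256];          // shared memory: one copy per block (assumes blockDim.x <= 256)
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic variables live in registers
    float local_scratch[8];              // per-thread array: may be placed in local memory

    if (i < n) {
        tile[threadIdx.x] = data[i];     // global -> shared
        local_scratch[0] = tile[threadIdx.x] * coeffs[0];
        data[i] = local_scratch[0];      // shared/register -> global
    }
}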

Tiling for reduced memory traffic
- Loading data from global memory into shared memory is expensive, since every element still has to come from global memory
- The loading can be divided so that each thread in a block loads a different element, sharing the work
- e.g. the computation is broken into 'tiles': in each phase, different threads read different elements of the tile into shared memory, and then all threads operate on the shared elements
- Thus, tiling is only effective when many of the values a thread uses are also used by other threads in the block
- e.g. each tile can be operated on by one block
Tiled matrix multiplication kernel
CUDA code for tiled kernel where block size corresponds to tile size
#define TILE_WIDTH 16
__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
    // Shared memory arrays scoped to the block
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    // Save thread and block indices in register variables scoped to the individual thread
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    // Identify the row and column of the P element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;

    // Loop over the M and N tiles required to compute the P element
    float Pvalue = 0;
    for (int ph = 0; ph < Width/TILE_WIDTH; ++ph) {
        // Collaborative loading of M and N tiles into shared memory
        Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        // Inner product over the current tile
        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();
    }
    P[Row*Width + Col] = Pvalue;
}
- Read-after-write dependence on the first __syncthreads() (true dependence): all threads must finish writing the tile before any thread reads it
- Write-after-read dependence on the second __syncthreads() (false dependence): all threads must finish using the tile before any thread overwrites it in the next phase
Strip mining: Breaking loop into smaller phases
Global memory traffic is reduced by a factor of TILE_WIDTH → significant
- With 16x16 tiles, each value loaded from global memory is reused 16 times, so arithmetic intensity goes from 0.25 OP/B to 4 OP/B
- e.g. A100 GPU with global memory bandwidth of 1555 GB/s: 1555 * 4 = 6220 GFLOPS
- Still only 32% of the 19,500 GFLOPS peak throughput
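A short worked version of that claim: in each phase, a thread loads one element of M and one of N (8 bytes) and performs TILE_WIDTH multiply-add pairs (2 x 16 = 32 FLOPs), so
$$\frac{2 \times TILE\_WIDTH\ \mathrm{FLOPs}}{2 \times 4\,\mathrm{B}} = \frac{32\ \mathrm{FLOPs}}{8\,\mathrm{B}} = 4\ \mathrm{FLOP/B}, \qquad 4\ \mathrm{FLOP/B} \times 1555\ \mathrm{GB/s} = 6220\ \mathrm{GFLOPS}$$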
Block (tiled) algorithms are common in parallel programming. On CPUs, data that will be reused within a time window is kept in cache implicitly; on GPUs, it is kept in shared memory explicitly. Because an SM runs many threads simultaneously, cache behaviour is harder to rely on, so keeping the data in shared memory explicitly works better.
Kernel with boundary checks
The previous kernel assumed that the matrix dimensions are a multiple of the tile width and that the matrices are square.
Without boundary checks, threads whose indices fall outside the matrix can cause:
- Incorrect element access, because the linearized array storage maps an out-of-range index onto some other row's data
- Memory accesses outside the storage allotted to the array (undefined behaviour)
We cannot simply exclude threads that compute no valid P element:
- They may still be needed to load tile data used by other threads
    float Pvalue = 0;
    for (int ph = 0; ph < ceil(Width/(float)TILE_WIDTH); ++ph) {
        // Collaborative loading of M and N tiles into shared memory;
        // out-of-bounds elements are loaded as 0.0f so they do not affect the inner product
        if ((Row < Width) && (ph*TILE_WIDTH + tx) < Width)
            Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
        else
            Mds[ty][tx] = 0.0f;
        if ((ph*TILE_WIDTH + ty) < Width && (Col < Width))
            Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
        else
            Nds[ty][tx] = 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k) {
            Pvalue += Mds[ty][k] * Nds[k][tx];
        }
        __syncthreads();
    }
    if ((Row < Width) && (Col < Width))
        P[Row*Width + Col] = Pvalue;
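For completeness, a minimal host-side launch sketch for this boundary-checked kernel (M_d, N_d, P_d are assumed device-pointer names; the grid is rounded up so partial tiles at the edges are still covered):
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);
dim3 dimGrid((Width + TILE_WIDTH - 1) / TILE_WIDTH,
             (Width + TILE_WIDTH - 1) / TILE_WIDTH);
matrixMulKernel<<<dimGrid, dimBlock>>>(M_d, N_d, P_d, Width);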
To extend this kernel to a general matrix multiplication (M is j x k, N is k x l, P is j x l):
- Replace the single Width parameter with three unsigned ints j, k, l
- Where Width refers to the height of M or the height of P, replace it with j
- Where Width refers to the width of M or the height of N, replace it with k
- Where Width refers to the width of N or the width of P, replace it with l
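A sketch of how the kernel looks after those substitutions (this is just the mechanical rewrite described above, assuming row-major storage; the kernel name is illustrative, not a figure from the book):
__global__ void generalMatrixMulKernel(float* M, float* N, float* P,
                                       unsigned j, unsigned k, unsigned l) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;   // row of P (0..j-1)
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;   // column of P (0..l-1)

    float Pvalue = 0;
    // The shared (inner) dimension k determines the number of phases
    for (int ph = 0; ph < ceil(k/(float)TILE_WIDTH); ++ph) {
        // M is j x k: row stride is k
        if (Row < j && (ph*TILE_WIDTH + threadIdx.x) < k)
            Mds[threadIdx.y][threadIdx.x] = M[Row*k + ph*TILE_WIDTH + threadIdx.x];
        else
            Mds[threadIdx.y][threadIdx.x] = 0.0f;
        // N is k x l: row stride is l
        if ((ph*TILE_WIDTH + threadIdx.y) < k && Col < l)
            Nds[threadIdx.y][threadIdx.x] = N[(ph*TILE_WIDTH + threadIdx.y)*l + Col];
        else
            Nds[threadIdx.y][threadIdx.x] = 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += Mds[threadIdx.y][i] * Nds[i][threadIdx.x];
        __syncthreads();
    }
    // P is j x l: row stride is l
    if (Row < j && Col < l)
        P[Row*l + Col] = Pvalue;
}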
Impact of memory usage on occupancy
Shared memory usage (like register usage) can limit the number of threads that can be assigned to each SM.
- In A100, maximum of 164 KB of shared memory per SM and 2048 threads per SM
- For all 2048 thread slots to be used, each thread shouldn’t use more than 164 KB/2048 = 82 B/thread
For tiled matmul, each thread block uses $(TILE\_WIDTH^2 \times 4\,\mathrm{B} + TILE\_WIDTH^2 \times 4\,\mathrm{B}) / (TILE\_WIDTH^2\ \mathrm{threads}) = 8\,\mathrm{B/thread}$ of shared memory, which means that the kernel's occupancy is not limited by shared memory
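A small host-side sketch of how this per-thread budget can be computed at runtime from the device properties (sharedMemPerMultiprocessor and maxThreadsPerMultiProcessor are the relevant cudaDeviceProp fields; everything else is illustrative):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp devProp;
    cudaGetDeviceProperties(&devProp, 0);

    // Shared-memory budget per thread if every thread slot on the SM is to be occupied
    double bytesPerThread =
        (double)devProp.sharedMemPerMultiprocessor / devProp.maxThreadsPerMultiProcessor;

    printf("Shared memory per SM: %zu B\n", devProp.sharedMemPerMultiprocessor);
    printf("Max threads per SM:   %d\n", devProp.maxThreadsPerMultiProcessor);
    printf("Budget for full occupancy: %.1f B/thread\n", bytesPerThread);
    return 0;
}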
Shared memory available varies by device generation, so we might want to assign it dynamically
To enable the kernel to adjust its shared memory usage at runtime, we declare a single unsized array (covering both tiles) with the extern keyword:
extern __shared__ char Mds_Nds[];
Since there is only one merged array, we will also need to manually define where Mds section of array starts and where Nds section of array starts
At runtime, dynamically configure amount of shared memory assigned
size_t size = calculate_appropriate_SM_usage(devProp.sharedMemPerBlock, ...);
matrixMulKernel<<<dimGrid, dimBlock, size>>>(Md, Nd, Pd, Width, size/2, size/2);
#define TILE_WIDTH 16
__global__ void matrixMulKernel(float* M, float* N, float* P, int Width,
                                unsigned Mds_sz, unsigned Nds_sz) {
    extern __shared__ char Mds_Nds[];
    // Mds occupies the first Mds_sz bytes of the dynamic allocation, Nds follows it
    float *Mds = (float *) Mds_Nds;
    float *Nds = (float *) (Mds_Nds + Mds_sz);
    // ... rest of the kernel proceeds as before, but indexes Mds and Nds as linearized 1D arrays