Once you’ve read through the notes, you can check out the Chapter 5 exercise solutions, although I strongly suggest trying them on your own first.

Background

  • Accessing DRAM global memory is usually slow

  • While having multiple threads available will still provide speedup, without optimizing memory access we cannot exploit the CUDA paradigm well

  • On-chip memory resources available to mitigate this

Importance of memory access efficiency

  • Naive matmul has an arithmetic intensity of only 0.25 OP/B (i.e. 4 bytes of global memory traffic per floating-point operation):

    • each inner-loop iteration loads one float from each matrix (8 bytes) and performs one multiply and one add (2 FLOPs)
  • But the computational throughput of GPUs is significantly higher, and such poor ‘arithmetic intensity’ will not utilize their parallelism well.

  • e.g. the Ampere A100: 1555 GB/s of global memory bandwidth vs. 19,500 GFLOPS of peak fp32 throughput (see the worked numbers below)
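A quick sanity check of the naive kernel’s ceiling, using the A100 figures quoted in these notes (my own arithmetic, not from the book’s text):

$\text{intensity} = \frac{2\ \text{FLOP}}{8\ \text{B}} = 0.25\ \text{OP/B}, \qquad 0.25\ \text{OP/B} \times 1555\ \text{GB/s} \approx 389\ \text{GFLOPS} \approx 2\%\ \text{of}\ 19{,}500\ \text{GFLOPS peak}$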

Roofline Analysis

  • Attainable computational throughput (FLOP/s) on the y-axis
  • Arithmetic intensity (FLOP/B) on the x-axis
  • A horizontal line marks the compute bound (peak throughput of the cores)
  • A sloped line through the origin marks the memory bound (peak bandwidth × intensity)
  • Kernels whose intensity falls left of the ridge point (where the two lines meet) are memory bound
  • Kernels right of the ridge point are compute bound
  • The closer a kernel’s measured point sits to the roofline itself, the better it uses the hardware
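The model can be summarized in one formula; the ridge point below is computed from the A100 numbers used elsewhere in these notes, as an illustration:

$\text{attainable FLOP/s} = \min\big(\text{peak FLOP/s},\ \text{bandwidth} \times \text{intensity}\big), \qquad \text{ridge point (A100)} \approx \frac{19{,}500\ \text{GFLOPS}}{1555\ \text{GB/s}} \approx 12.5\ \text{OP/B}$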

CUDA Memory Types

Registers - Per thread, on-chip; each thread has its own set (not shared across threads); fastest access

Local memory - Per-thread ‘local’ memory; despite the name it resides in (off-chip) global memory and is used for register spills and per-thread arrays

Shared memory - On-chip, shared by all threads within a block (not across blocks)

Constant memory - Read-only for the device; host can read/write; scoped per grid

Global memory - Both host and device can read/write; scoped per grid

Registers also store temporary (automatic scalar) variables declared inside a kernel

  • very quick to read and write
  • arithmetic instructions can take their operands directly from registers, so no extra load instructions are needed to feed the ALU
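A small illustrative sketch (my own example, not from the book) of which memory space typical declarations land in:

// Illustrative only: where common CUDA declarations live
__constant__ float coeffs[16];                       // constant memory: written by the host, read-only on the device

__global__ void memorySpacesExample(float* out) {    // out points to global memory (allocated with cudaMalloc)
	int i = blockIdx.x * blockDim.x + threadIdx.x;   // automatic scalars live in registers
	float scratch[64];                               // large per-thread array: typically placed in local memory
	__shared__ float tile[128];                      // shared memory: one copy per block, visible to all its threads

	scratch[threadIdx.x % 64] = coeffs[threadIdx.x % 16];
	tile[threadIdx.x % 128] = scratch[threadIdx.x % 64];
	__syncthreads();                                 // make the shared-memory writes visible block-wide
	out[i] = tile[threadIdx.x % 128];                // result written back to global memory
}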

Tiling for reduced memory traffic

  • Accessing global memory is expensive; the goal is to load each value from global memory into shared memory once and then reuse it many times

  • The loads can be divided so that each thread in a block loads a different element, sharing the work

    • e.g. the operation is broken into ‘tiles’: in each phase, different threads read different elements of the tile into shared memory, and then all threads perform their operations on the shared elements
    • Thus, tiling is only effective when many of the values are reused across the threads of a block
  • e.g. each tile can be operated on by one block

Tiled matrix multiplication kernel

CUDA code for tiled kernel where block size corresponds to tile size

#define TILE_WIDTH 16
__global__ void matrixMulKernel(float* M, float* N, float* P, int Width) {
 
	// Shared memory arrays scoped to block
	__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
	__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];
 
	// Save thread and block indexes in register variables scoped to individual thread
	int bx = blockIdx.x; int by = blockIdx.y;
	int tx = threadIdx.x; int ty = threadIdx.y;
	
	// Identify the row and column of the P element to work on
	int Row = by * TILE_WIDTH + ty;
	int Col = bx * TILE_WIDTH + tx;
	
	// Loop over the M and N tiles required to compute P element
	float Pvalue = 0;
	for (int ph = 0; ph < Width/TILE_WIDTH; ++ph){
	
		// Collaborative loading of M and N tiles into shared memory
		Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
		Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
		__syncthreads();
		
		for (int k = 0; k < TILE_WIDTH; ++k){
			Pvalue += Mds[ty][k] * Nds[k][tx];
		}
		__syncthreads();
		 
	}
	P[Row*Width + Col] = Pvalue;
	
}
  • The first __syncthreads() guards a read-after-write (true) dependence: every thread must wait until the whole tile has been loaded before reading from it
  • The second __syncthreads() guards a write-after-read (false) dependence: all threads must finish using the current tile before any thread overwrites it with the next tile’s data
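For completeness, a minimal host-side launch for this kernel could look like the following sketch (variable names d_M, d_N, d_P are my own; it assumes square Width × Width matrices already allocated with cudaMalloc and copied to the device, with Width a multiple of TILE_WIDTH):

// Hypothetical host-side launch sketch (not from the book)
dim3 dimBlock(TILE_WIDTH, TILE_WIDTH);                    // one thread per element of a tile
dim3 dimGrid(Width / TILE_WIDTH, Width / TILE_WIDTH);     // one block per output tile of P
matrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);
cudaDeviceSynchronize();                                  // wait for the kernel before reading d_P back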

Strip mining: breaking a long loop into phases, i.e. an outer loop over tiles and an inner loop over the elements within each tile

Global memory traffic is reduced by a factor of TILE_WIDTH, which is significant:

  • With 16x16 tiles, global memory access ratio goes from 0.25 OP/B to 4 OP/B
  • e.g. A100 GPU, global memory bandwidth of 1555 GB/second, 1555*4 = 6220 GFLOPS
  • Still only 32% of peak throughput (19,500 GFLOPS)
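In general (a one-line derivation of the bullets above, not spelled out in the book’s text): each value loaded into shared memory is reused TILE_WIDTH times, so

$\text{intensity} = TILE\_WIDTH \times 0.25\ \text{OP/B} = \frac{TILE\_WIDTH}{4}\ \text{OP/B}, \qquad \frac{16}{4} = 4\ \text{OP/B}, \qquad 4\ \text{OP/B} \times 1555\ \text{GB/s} = 6220\ \text{GFLOPS} \approx 32\%\ \text{of}\ 19{,}500\ \text{GFLOPS}$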

Blocked (tiled) algorithms are common in parallel programming. On CPUs, data that will be reused within a time window is kept in cache implicitly; on GPUs, it is kept in shared memory explicitly. Because a GPU SM runs many threads simultaneously, its cache is heavily contended and less predictable, so explicitly staging reused data in shared memory works better.

Kernel with boundary checks

The previous kernel assumed that the matrix dimensions are a multiple of the tile width and that the matrices are square.

Letting threads index past the matrix boundaries can lead to:

  • Accessing the wrong elements because of the linearized (row-major) array storage
  • Accessing memory outside the storage allotted to the array (undefined behaviour)

We cannot simply disable the threads that compute no valid P element:

  • They may still be needed to load tile elements that other threads use
float Pvalue = 0;
 
for (int ph=0; ph < ceil(Width/(float)TILE_WIDTH); ++ph){
	
		// Collaborative loading of M and N tiles into shared memory
		if ((Row < Width) && ((ph*TILE_WIDTH + tx) < Width))
			Mds[ty][tx] = M[Row*Width + ph*TILE_WIDTH + tx];
		else
			 Mds[ty][tx] = 0.0f;
		
		if (((ph*TILE_WIDTH + ty) < Width) && (Col < Width))
			Nds[ty][tx] = N[(ph*TILE_WIDTH + ty)*Width + Col];
		else
			 Nds[ty][tx] = 0.0f;
		__syncthreads();
		
		for (int k=0; k < TILE_WIDTH; ++k){
			Pvalue += Mds[ty][k] * Nds[k][tx];
		}
		__syncthreads();
	}	
	if ((Row < Width) && (Col < Width))
		P[Row*Width + Col] = Pvalue;
 

To extend this kernel to a general (rectangular) matrix multiplication kernel (a sketch follows this list):

  • Replace Width with j, k, l unsigned ints
  • Where Width refers to height of M or height of P, replace it with j
  • Where Width refers to width of M or height of N, replace it with k
  • Where Width refers to width of N or width of P, replace it with l
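A sketch of the resulting kernel after making those substitutions (my own adaptation of the boundary-checked version above, not verbatim from the book; it assumes M is j × k, N is k × l, and P is j × l, all row-major):

__global__ void matrixMulGeneralKernel(float* M, float* N, float* P,
                                       unsigned j, unsigned k, unsigned l) {
	__shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
	__shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

	int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;   // row into M and P, bounded by j
	int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;   // column into N and P, bounded by l

	float Pvalue = 0;
	for (int ph = 0; ph < ceil(k/(float)TILE_WIDTH); ++ph) {
		// The inner dimension k bounds the tile loads from M's columns and N's rows
		if ((Row < j) && (ph*TILE_WIDTH + threadIdx.x) < k)
			Mds[threadIdx.y][threadIdx.x] = M[Row*k + ph*TILE_WIDTH + threadIdx.x];
		else
			Mds[threadIdx.y][threadIdx.x] = 0.0f;
		if ((ph*TILE_WIDTH + threadIdx.y) < k && (Col < l))
			Nds[threadIdx.y][threadIdx.x] = N[(ph*TILE_WIDTH + threadIdx.y)*l + Col];
		else
			Nds[threadIdx.y][threadIdx.x] = 0.0f;
		__syncthreads();

		for (int kk = 0; kk < TILE_WIDTH; ++kk)
			Pvalue += Mds[threadIdx.y][kk] * Nds[kk][threadIdx.x];
		__syncthreads();
	}
	if ((Row < j) && (Col < l))
		P[Row*l + Col] = Pvalue;
}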

Impact of memory usage on occupancy

Shared memory usage (like register usage) can limit the number of threads that can be assigned to each SM.

  • In A100, maximum of 164 KB of shared memory per SM and 2048 threads per SM
  • For all 2048 thread slots to be used, each thread shouldn’t use more than 164 KB/2048 = 82 B/thread

For tiled matmul, each thread block uses $(TILE\_WIDTH^2 \cdot 4\,\text{B} + TILE\_WIDTH^2 \cdot 4\,\text{B}) \,/\, (TILE\_WIDTH^2\ \text{threads}) = 8\ \text{B/thread}$, which means that the kernel’s occupancy is not limited by shared memory
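Concretely, for TILE_WIDTH = 16 (a quick check of the claim above, my own arithmetic):

$2 \times 16^2 \times 4\,\text{B} = 2\,\text{KB of shared memory per block}, \qquad \frac{2048\ \text{threads/SM}}{256\ \text{threads/block}} = 8\ \text{blocks/SM} \;\Rightarrow\; 8 \times 2\,\text{KB} = 16\,\text{KB} \ll 164\,\text{KB}$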

Shared memory available varies by device generation, so we might want to assign it dynamically

In order to enable the kernel to adjust its shared memory usage at runtime, we need the extern keyword:

extern __shared__ float Mds_Nds[];

Since there is only one merged array, we also need to manually define where the Mds section of the array starts and where the Nds section starts

At runtime, dynamically configure the amount of shared memory assigned (passed as the third kernel-launch configuration parameter):

size_t size = calculate_appropriate_SM_usage(devProp.sharedMemPerBlock, ...);
 
matrixMulKernel<<<dimGrid, dimBlock, size>>>(Md, Nd, Pd, Width, size/2, size/2);
#define TILE_WIDTH 16
__global__ void matrixMulKernel(float* M, float* N, float* P, int Width, unsigned Mds_sz, unsigned Nds_sz) {
 
	extern __shared__ float Mds_Nds[];
	float *Mds = (float *) Mds_Nds;
	float *Nds = (float *) Mds_Nds + Mds_sz / sizeof(float);	// Mds_sz is in bytes, so convert it to a float offset