Question 1
Write a matrix multiplication kernel function that corresponds to the design illustrated in Fig. 6.4.
Answer 1
```cuda
#define TILE_WIDTH 32

__global__ void matmul(const float* a, const float* b_transposed, float* c,
                       size_t m, size_t n, size_t k) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    // Identify the row and column of the C element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    // Loop over the A and B tiles required to compute the C element
    float Cvalue = 0.0f;
    for (int ph = 0; ph < (k + TILE_WIDTH - 1) / TILE_WIDTH; ++ph) {
        // Collaborative loading of A and B tiles into shared memory
        if (Row < m && (ph*TILE_WIDTH + tx) < k)
            As[ty][tx] = a[Row*k + ph*TILE_WIDTH + tx];
        else
            As[ty][tx] = 0.0f;
        // Corner turning for the transposed B: consecutive tx values read
        // consecutive addresses along the k dimension of b_transposed, so
        // the load is coalesced. Note the swapped shared-memory indices
        // Bs[tx][ty]; indexing b_transposed[Col*k + ...] directly would give
        // consecutive threads a stride of k, which is uncoalesced.
        if ((ph*TILE_WIDTH + tx) < k && (bx*TILE_WIDTH + ty) < n)
            Bs[tx][ty] = b_transposed[(bx*TILE_WIDTH + ty)*k + ph*TILE_WIDTH + tx];
        else
            Bs[tx][ty] = 0.0f;
        __syncthreads();
        for (int i = 0; i < TILE_WIDTH; ++i) {
            Cvalue += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    if (Row < m && Col < n)
        c[Row*n + Col] = Cvalue;
}
```

Question 2
For tiled matrix multiplication, of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid uncoalesced accesses to global memory? (You need to consider only square blocks.)
Answer 2
BLOCK_SIZE = 32. A warp's global accesses coalesce only when its 32 consecutive threads map to 32 consecutive addresses; with square blocks this requires blockDim.x to be a multiple of the warp size (32), and since 32 × 32 = 1024 is the maximum threads per block, BLOCK_SIZE = 32 is the only qualifying value. With a smaller size such as 16, one warp straddles two tile rows and its accesses split into two separated address ranges.
Question 3
Consider the following CUDA kernel:
```cuda
01 __global__ void foo_kernel(float* a, float* b, float* c, float* d, float* e) {
02     unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
03     __shared__ float a_s[256];
04     __shared__ float bc_s[4*256];
05     a_s[threadIdx.x] = a[i];
06     for(unsigned int j = 0; j < 4; ++j) {
07         bc_s[j*256 + threadIdx.x] = b[j*blockDim.x*gridDim.x + i] + c[j*4 + j];
08     }
09     __syncthreads();
10     d[i + 8] = a_s[threadIdx.x];
11     e[i*8] = bc_s[threadIdx.x*4];
12 }
```

For each of the following memory accesses, specify whether they are coalesced or uncoalesced, or whether coalescing is not applicable:
- (A) The access to array a of line 05
- (B) The access to array a_s of line 05
- (C) The access to array b of line 07
- (D) The access to array c of line 07
- (E) The access to array bc_s of line 07
- (F) The access to array a_s of line 10
- (G) The access to array d of line 10
- (H) The access to array bc_s of line 11
- (I) The access to array e of line 11
Answer 3
- A) coalesced: consecutive threads access consecutive elements a[i]
- B) N/A: shared memory (coalescing applies only to global memory accesses)
- C) coalesced: j*blockDim.x*gridDim.x is uniform across the warp, so consecutive threads still access consecutive elements through i
- D) coalesced: the index j*4 + j does not depend on the thread at all, so every thread in the warp reads the same element (c[0], c[5], c[10], c[15] across the j iterations); a uniform broadcast served by a single transaction
- E) N/A: shared memory
- F) N/A: shared memory
- G) coalesced: i + 8 keeps consecutive threads on consecutive elements
- H) N/A: shared memory
- I) uncoalesced: stride of 8 (threadIdx.x = 0, 1, 2, 3 gives e indices 0, 8, 16, 24)
> [!error]- My original incorrect answer
> For C I thought uncoalesced, but it is actually coalesced: blockDim and gridDim are constant across the launch, so for each j the offset j*blockDim.x*gridDim.x is the same for every thread, and consecutive threads still touch consecutive addresses through i.
Question 4
What is the floating point to global memory access ratio (in OP/B) of each of the following matrix-matrix multiplication kernels?
- (A) The simple kernel described in Chapter 3, Multidimensional Grids and Data, without any optimizations applied.
- (B) The kernel described in Chapter 5, Memory Architecture and Data Locality, with shared memory tiling applied using a tile size of 32 × 32.
- (C) The kernel described in this chapter with shared memory tiling applied using a tile size of 32 × 32 and thread coarsening applied using a coarsening factor of 4.
Answer 4
- A) Per inner-loop iteration: 2 OPs / (2 * 4 Bytes) = 0.25 OP/B
- B) Per phase: 2 * 32 * 32 * 32 OPs / (2 * 32 * 32 * 4 Bytes) = 8 OP/B
- C) Per phase, the A tile is loaded once and reused for 4 B tiles: 4 * 2 * 32 * 32 * 32 OPs / (5 * 32 * 32 * 4 Bytes) = 12.8 OP/B