Question 1

Write a matrix multiplication kernel function that corresponds to the design illustrated in Fig. 6.4.

Answer 1

#define TILE_WIDTH 32  // tile edge length; assumed to match the 32x32 design of Fig. 6.4

__global__ void matmul(const float* a, const float* b_transposed, float* c, size_t m, size_t n, size_t k) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
    
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    
    // Identify the row and column of the C element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    
    // Loop over the A and B tiles required to compute C element
    float Cvalue = 0;
    for (int ph = 0; ph < (int)((k + TILE_WIDTH - 1) / TILE_WIDTH); ++ph) {
        
        // Collaborative loading of A and B tiles into shared memory
        if ((Row < m) && (ph*TILE_WIDTH+tx) < k)
            As[ty][tx] = a[Row*k + ph*TILE_WIDTH + tx];
        else 
            As[ty][tx] = 0.0f;
        
        // Corner turning: b_transposed stores B in transposed (column-major) order,
        // so consecutive addresses run down a column of B. Load with the roles of
        // tx and ty swapped so that consecutive tx values read consecutive
        // addresses (coalesced), and store the tile transposed into Bs so the
        // compute loop below is unchanged.
        if ((ph*TILE_WIDTH+tx) < k && (bx*TILE_WIDTH+ty) < n)
            Bs[tx][ty] = b_transposed[(bx*TILE_WIDTH+ty)*k + ph*TILE_WIDTH + tx];
        else 
            Bs[tx][ty] = 0.0f;
        
        __syncthreads();
        
        for (int i = 0; i < TILE_WIDTH; ++i) {
            Cvalue += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    
    if (Row < m && Col < n)
        c[Row*n + Col] = Cvalue;
}

Question 2

For tiled matrix multiplication, of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid uncoalesced accesses to global memory? (You need to consider only square blocks.)

Answer 2

Only BLOCK_SIZE = 32. Coalescing requires each warp to access one contiguous segment, so BLOCK_SIZE must be a multiple of the warp size (32); since a square block is limited to 32 × 32 = 1024 threads, 32 is the only such value.

Question 3

Consider the following CUDA kernel:

01  __global__ void foo_kernel(float* a, float* b, float* c, float* d, float* e) {
02      unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
03      __shared__ float a_s[256];
04      __shared__ float bc_s[4*256];
05      a_s[threadIdx.x] = a[i];
06      for(unsigned int j = 0; j < 4; ++j) {
07          bc_s[j*256 + threadIdx.x] = b[j*blockDim.x*gridDim.x + i] + c[j*4 + j];
08      }
09      __syncthreads();
10      d[i + 8] = a_s[threadIdx.x];
11      e[i*8] = bc_s[threadIdx.x*4];
12  }

For each of the following memory accesses, specify whether they are coalesced or uncoalesced or coalescing is not applicable:

  • (A) The access to array a of line 05
  • (B) The access to array a_s of line 05
  • (C) The access to array b of line 07
  • (D) The access to array c of line 07
  • (E) The access to array bc_s of line 07
  • (F) The access to array a_s of line 10
  • (G) The access to array d of line 10
  • (H) The access to array bc_s of line 11
  • (I) The access to array e of line 11

Answer 3

  • A) coalesced
  • B) NA (smem)
  • C) coalesced
  • D) uncoalesced (the index j*4 + j = 0, 5, 10, 15 varies only with the loop variable j, not with threadIdx.x)
  • E) NA (smem)
  • F) NA (smem)
  • G) coalesced
  • H) NA (smem)
  • I) uncoalesced (e.g. for threadIdx.x=0,1,2,3 it will be 0, 8, 16, 24)

[!error]- My original incorrect answer

For C I thought uncoalesced, but it is actually coalesced: blockDim.x and gridDim.x are the same for every thread, so for each j the offset j*blockDim.x*gridDim.x is a constant, and consecutive threads access consecutive elements of b through i.

Question 4

What is the floating point to global memory access ratio (in OP/B) of each of the following matrix-matrix multiplication kernels?

  • (A) The simple kernel described in Chapter 3, Multidimensional Grids and Data, without any optimizations applied.
  • (B) The kernel described in Chapter 5, Memory Architecture and Data Locality, with shared memory tiling applied using a tile size of 32 × 32.
  • (C) The kernel described in this chapter with shared memory tiling applied using a tile size of 32 × 32 and thread coarsening applied using a coarsening factor of 4.

Answer 4

  • A) Per inner-loop iteration: 2 OPs / (2 × 4 Bytes) = 0.25 OP/B
  • B) Per tile phase: (2 × 32 × 32 × 32) OPs / (2 × 32 × 32 × 4 Bytes) = 8 OP/B
  • C) Per tile phase, one A tile is reused across 4 B tiles: (2 × 4 × 32 × 32 × 32) OPs / ((1 + 4) × 32 × 32 × 4 Bytes) = 12.8 OP/B