Question 1

Write a matrix multiplication kernel function that corresponds to the design illustrated in Fig. 6.4.

Answer 1

#define TILE_WIDTH 32  // tile edge length; assumed to match the 32x32 design of Fig. 6.4

__global__ void matmul(const float* a, const float* b_transposed, float* c, size_t m, size_t n, size_t k) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
    
    int bx = blockIdx.x;  int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;
    
    // Identify the row and column of the C element to work on
    int Row = by * TILE_WIDTH + ty;
    int Col = bx * TILE_WIDTH + tx;
    
    // Loop over the A and B tiles required to compute C element
    float Cvalue = 0;
    for (int ph = 0; ph < (int)((k + TILE_WIDTH - 1) / TILE_WIDTH); ++ph) {
        
        // Collaborative loading of A and B tiles into shared memory
        if ((Row < m) && (ph*TILE_WIDTH+tx) < k)
            As[ty][tx] = a[Row*k + ph*TILE_WIDTH + tx];
        else 
            As[ty][tx] = 0.0f;
        
        // Corner turning: b_transposed stores B in transposed (column-major) order,
        // so consecutive addresses run down a column of B. Load with the roles of
        // tx and ty swapped so that consecutive tx values read consecutive
        // addresses (coalesced), and store the tile transposed into Bs so the
        // compute loop below is unchanged.
        if ((ph*TILE_WIDTH+tx) < k && (bx*TILE_WIDTH+ty) < n)
            Bs[tx][ty] = b_transposed[(bx*TILE_WIDTH+ty)*k + ph*TILE_WIDTH + tx];
        else 
            Bs[tx][ty] = 0.0f;
        
        __syncthreads();
        
        for (int i = 0; i < TILE_WIDTH; ++i) {
            Cvalue += As[ty][i] * Bs[i][tx];
        }
        __syncthreads();
    }
    
    if (Row < m && Col < n)
        c[Row*n + Col] = Cvalue;
}

Question 2

For tiled matrix multiplication, of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will the kernel completely avoid uncoalesced accesses to global memory? (You need to consider only square blocks.)

Answer 2

Only BLOCK_SIZE = 32. Coalescing requires each warp to access one contiguous segment, so BLOCK_SIZE must be a multiple of the warp size (32); since a square block is limited to 32 × 32 = 1024 threads, 32 is the only such value.

Question 3

Consider the following CUDA kernel:

01  __global__ void foo_kernel(float* a, float* b, float* c, float* d, float* e) {
02      unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
03      __shared__ float a_s[256];
04      __shared__ float bc_s[4*256];
05      a_s[threadIdx.x] = a[i];
06      for(unsigned int j = 0; j < 4; ++j) {
07          bc_s[j*256 + threadIdx.x] = b[j*blockDim.x*gridDim.x + i] + c[j*4 + j];
08      }
09      __syncthreads();
10      d[i + 8] = a_s[threadIdx.x];
11      e[i*8] = bc_s[threadIdx.x*4];
12  }

For each of the following memory accesses, specify whether they are coalesced or uncoalesced or coalescing is not applicable:

  • (A) The access to array a of line 05
  • (B) The access to array a_s of line 05
  • (C) The access to array b of line 07
  • (D) The access to array c of line 07
  • (E) The access to array bc_s of line 07
  • (F) The access to array a_s of line 10
  • (G) The access to array d of line 10
  • (H) The access to array bc_s of line 11
  • (I) The access to array e of line 11

Answer 3

  • A) coalesced
  • B) NA (smem)
  • C) coalesced
  • D) uncoalesced (the index j*4 + j = 0, 5, 10, 15 varies only with the loop variable j, not with threadIdx.x)
  • E) NA (smem)
  • F) NA (smem)
  • G) coalesced
  • H) NA (smem)
  • I) uncoalesced (e.g. for threadIdx.x=0,1,2,3 it will be 0, 8, 16, 24)

[!error]- My original incorrect answer

For C I thought uncoalesced, but it is actually coalesced: blockDim.x and gridDim.x are the same for every thread, so for each j the offset j*blockDim.x*gridDim.x is a constant, and consecutive threads access consecutive elements of b through i.

Question 4

What is the floating point to global memory access ratio (in OP/B) of each of the following matrix-matrix multiplication kernels?

  • (A) The simple kernel described in Chapter 3, Multidimensional Grids and Data, without any optimizations applied.
  • (B) The kernel described in Chapter 5, Memory Architecture and Data Locality, with shared memory tiling applied using a tile size of 32 × 32.
  • (C) The kernel described in this chapter with shared memory tiling applied using a tile size of 32 × 32 and thread coarsening applied using a coarsening factor of 4.

Answer 4

  • A) Per inner-loop iteration: 2 OPs / (2 × 4 Bytes) = 0.25 OP/B
  • B) Per tile phase: (2 × 32 × 32 × 32) OPs / (2 × 32 × 32 × 4 Bytes) = 8 OP/B
  • C) Per tile phase, one A tile is reused across 4 B tiles: (2 × 4 × 32 × 32 × 32) OPs / ((1 + 4) × 32 × 32 × 4 Bytes) = 12.8 OP/B