Before you go through the exercise solutions, you might want to check out the Chapter 5 Notes.
Question 1
Consider matrix addition. Can one use shared memory to reduce the global memory bandwidth consumption? Hint: Analyze the elements that are accessed by each thread and see whether there is any commonality between threads.
Answer 1
Each thread reads a single element from each input matrix, and no data needs to be shared across threads. Hence, element-wise matrix addition will not benefit from using shared memory to reduce global memory bandwidth consumption.
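As a concrete illustration, here is a minimal element-wise addition kernel (a hypothetical sketch, not from the book). Every index a thread touches is unique to that thread, so there is no reuse for shared memory to exploit:

```cuda
// Hypothetical sketch: element-wise matrix addition.
// Thread (row, col) reads A[row][col] and B[row][col] exactly once;
// no other thread ever reads those elements, so staging them in
// shared memory would save no global memory traffic.
__global__ void matrixAdd(const float* A, const float* B, float* C,
                          int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int idx = row * width + col;
        C[idx] = A[idx] + B[idx];  // one load from A, one from B, one store
    }
}
```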
Question 2
Draw the equivalent of Fig. 5.7 for an 8 × 8 matrix multiplication with 2 × 2 tiling and 4 × 4 tiling. Verify that the reduction in global memory bandwidth is indeed proportional to the dimension size of the tiles.
Answer 2
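The figures are not reproduced here, but the accounting can be verified directly. With T × T tiling, a block of T × T threads cooperatively loads one T × T tile of M and one of N per phase, and each loaded element is then read by T threads out of shared memory. Every input element is therefore fetched from global memory once per tile instead of once per thread: for the 8 × 8 multiplication, each input element is read 8 times without tiling, 4 times with 2 × 2 tiling (a factor-of-2 reduction), and 2 times with 4 × 4 tiling (a factor-of-4 reduction), i.e., the reduction is proportional to the tile dimension.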
Question 3
What type of incorrect execution behavior can happen if one forgot to use one or both __syncthreads() in the kernel of Fig. 5.9?
Answer 3
Two types of incorrect execution behavior:
- Missing the first `__syncthreads()`: threads may begin the dot-product phase before every thread in the block has finished loading its elements from global memory into shared memory, so they read stale or uninitialized values from `Mds`/`Nds` and produce wrong partial sums.
- Missing the second `__syncthreads()`: threads that finish accumulating the current tile into `Pvalue` early can proceed to the next iteration of the phase loop and overwrite `Mds`/`Nds` with the next tiles while slower threads are still reading the current ones, again corrupting the result that is eventually written back to global memory.
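For reference, here is a sketch of the kernel of Fig. 5.9 (reconstructed from the book's tiled matrix multiplication; the `TILE_WIDTH` value is an assumption), with comments marking what each barrier protects:

```cuda
#define TILE_WIDTH 16  // assumed value; Width must be a multiple of it here

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0.0f;
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        // Cooperative load: each thread fetches one element of each tile.
        Mds[ty][tx] = M[Row * Width + ph * TILE_WIDTH + tx];
        Nds[ty][tx] = N[(ph * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();  // 1st barrier: tiles fully loaded before anyone computes

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[ty][k] * Nds[k][tx];
        __syncthreads();  // 2nd barrier: all reads of Mds/Nds finished before
                          // the next phase overwrites them
    }
    P[Row * Width + Col] = Pvalue;
}
```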
Question 4
Assuming that capacity is not an issue for registers or shared memory, give one important reason why it would be valuable to use shared memory instead of registers to hold values fetched from global memory? Explain your answer.
Answer 4
For operations where threads need to share data, shared memory is preferable to registers because registers are private to a thread and cannot be shared, whereas a value placed in shared memory is visible to every thread in the block. Say register access is 10x faster than shared-memory access, but 16 threads in a block operate on the same data. With registers, the data would have to be fetched from global memory 16 times, once into each thread's registers; with shared memory, it is fetched from global memory only once and then read 16 times at shared-memory speed. The 16x reduction in global memory traffic far outweighs the slower on-chip access.
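A minimal sketch of the pattern (hypothetical kernel, not from the book): one thread fetches a value from global memory into shared memory, and the whole block reuses it, instead of every thread issuing its own global load.

```cuda
// Hypothetical sketch: broadcasting one global value to a whole block.
__global__ void scaleByShared(const float* scale, float* data) {
    __shared__ float s;          // one copy per block, visible to all threads
    if (threadIdx.x == 0) {
        s = *scale;              // a single global-memory load per block
    }
    __syncthreads();             // make the value visible to every thread

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= s;                // each thread reuses s from shared memory
}
```

Had `s` been an ordinary local variable (held in a register), every thread would have needed its own `*scale` load from global memory.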
Question 5
For our tiled matrix-matrix multiplication kernel, if we use a 32 × 32 tile, what is the reduction of memory bandwidth usage for input matrices M and N?
Answer 5
The global memory bandwidth usage for M and N is reduced by a factor of 32: each element is loaded from global memory once per 32 × 32 tile and then reused by 32 threads out of shared memory.
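As a quick check with the usual accounting (not spelled out in the text): untiled, each thread issues $2\,\mathrm{Width}$ global loads for its dot product; with $T \times T$ tiles, each loaded element is reused by $T$ threads, so each thread causes only $2\,\mathrm{Width}/T$ loads:

$$\frac{2\,\mathrm{Width}}{2\,\mathrm{Width}/T} = T = 32$$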
Question 6
Assume that a CUDA kernel is launched with 1000 thread blocks, each of which has 512 threads. If a variable is declared as a local variable in the kernel, how many versions of the variable will be created through the lifetime of the execution of the kernel?
Answer 6
1000 × 512 = 512,000 versions (each thread gets its own copy, typically held in a register)
Question 7
In the previous question, if a variable is declared as a shared memory variable, how many versions of the variable will be created through the lifetime of the execution of the kernel?
Answer 7
1000 versions (each block gets a single copy in its shared memory, shared by all 512 of its threads)
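A tiny sketch (hypothetical kernel, matching the launch described in Questions 6 and 7) showing both scopes side by side:

```cuda
// Hypothetical sketch for Questions 6 and 7.
// Launched as kernel<<<1000, 512>>>(out):
__global__ void kernel(float* out) {
    float local = 0.0f;          // automatic variable: one version per thread
                                 //   -> 1000 * 512 = 512,000 versions
    __shared__ float shared;     // shared variable: one version per block
                                 //   -> 1000 versions
    if (threadIdx.x == 0) shared = 1.0f;
    __syncthreads();
    local += shared;
    out[blockIdx.x * blockDim.x + threadIdx.x] = local;
}
```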
Question 8
Consider performing a matrix multiplication of two input matrices with dimensions N × N. How many times is each element in the input matrices requested from global memory when:
- (A) There is no tiling?
- (B) Tiles of size T × T are used?
Answer 8
- A) No tiling: each thread reads one full row of A and one full column of B from global memory. An element of A in row $i$ is read by every one of the $N$ threads computing row $i$ of the output, and likewise an element of B in column $j$ is read by the $N$ threads computing column $j$. So each input element is requested from global memory $N$ times.
- B) Tiling: each $T \times T$ tile is loaded from global memory once per block that needs it, and its elements are then reused by $T$ threads out of shared memory. A given input element is needed by $N/T$ blocks, so it is requested only $N/T$ times.
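In total, for square $N \times N$ inputs, this means:

$$\text{no tiling: } N^2 \cdot N = N^3 \text{ reads per input matrix}, \qquad T \times T \text{ tiling: } \frac{N^3}{T} \text{ reads.}$$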
Question 9
A kernel performs 36 floating-point operations and seven 32-bit global memory accesses per thread. For each of the following device properties, indicate whether this kernel is compute-bound or memory-bound.
- (A) Peak FLOPS=200 GFLOPS, peak memory bandwidth=100 GB/second
- (B) Peak FLOPS=300 GFLOPS, peak memory bandwidth=250 GB/second
Answer 9
- A) Per thread: 36 FLOPs, memory traffic = 7 × 4 bytes = 28 bytes
  - 200 GFLOPS ÷ 36 FLOPs/thread ≈ 5.56 Gthreads/s bound from compute
  - 100 GB/s ÷ 28 B/thread ≈ 3.57 Gthreads/s bound from memory bandwidth ✅ memory-bound
- B) Per thread: 36 FLOPs, memory traffic = 28 bytes
  - 300 GFLOPS ÷ 36 FLOPs/thread ≈ 8.33 Gthreads/s bound from compute ✅ compute-bound
  - 250 GB/s ÷ 28 B/thread ≈ 8.93 Gthreads/s bound from memory bandwidth
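Equivalently, compare the kernel's computational intensity to each machine's balance ratio:

$$\frac{36\ \text{FLOP}}{28\ \text{B}} \approx 1.29\ \text{OP/B}; \qquad \text{(A) } \tfrac{200}{100} = 2 > 1.29 \Rightarrow \text{memory-bound}, \qquad \text{(B) } \tfrac{300}{250} = 1.2 < 1.29 \Rightarrow \text{compute-bound.}$$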
Question 10
To manipulate tiles, a new CUDA programmer has written a device kernel that will transpose each tile in a matrix. The tiles are of size BLOCK_WIDTH by BLOCK_WIDTH, and each of the dimensions of matrix A is known to be a multiple of BLOCK_WIDTH. The kernel invocation and code are shown below. BLOCK_WIDTH is known at compile time and could be set anywhere from 1 to 20.
```cuda
dim3 blockDim(BLOCK_WIDTH, BLOCK_WIDTH);
dim3 gridDim(A_width / blockDim.x, A_height / blockDim.y);
BlockTranspose<<<gridDim, blockDim>>>(A, A_width, A_height);

__global__ void BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];

    int baseIdx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_SIZE + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];

    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}
```
- (A) Out of the possible range of values for BLOCK_SIZE, for what values of BLOCK_SIZE will this kernel function execute correctly on the device?
- (B) If the code does not execute correctly for all BLOCK_SIZE values, what is the root cause of this incorrect execution behavior? Suggest a fix to the code to make it work for all BLOCK_SIZE values.
Answer 10
- A) Only BLOCK_SIZE = 1. With a single thread per block, the thread reads back the element it just wrote, so there is no race; for any BLOCK_SIZE > 1, a thread may read `blockA[threadIdx.x][threadIdx.y]` before the thread at the transposed position has written it.
- B) The root cause is a missing `__syncthreads()` between the write into `blockA` and the read back out of it, so shared-memory values are not guaranteed to be visible to the other threads of the block when they are read. Inserting `__syncthreads()` between those two statements makes the kernel correct for all BLOCK_SIZE values.
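A sketch of the fixed kernel (assuming, as the surrounding code implies, that BLOCK_SIZE and BLOCK_WIDTH are meant to be the same constant):

```cuda
__global__ void BlockTranspose(float* A_elements, int A_width, int A_height)
{
    __shared__ float blockA[BLOCK_WIDTH][BLOCK_WIDTH];

    int baseIdx = blockIdx.x * BLOCK_WIDTH + threadIdx.x;
    baseIdx += (blockIdx.y * BLOCK_WIDTH + threadIdx.y) * A_width;

    blockA[threadIdx.y][threadIdx.x] = A_elements[baseIdx];
    __syncthreads();  // wait until the whole tile is in shared memory
                      // before any thread reads its transposed element

    A_elements[baseIdx] = blockA[threadIdx.x][threadIdx.y];
}
```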
Question 11
Consider the following CUDA kernel and the corresponding host function that calls it:
```cuda
__global__ void foo_kernel(float* a, float* b) {
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    float x[4];
    __shared__ float y_s;
    __shared__ float b_s[128];
    for (unsigned int j = 0; j < 4; ++j) {
        x[j] = a[j*blockDim.x*gridDim.x + i];
    }
    if (threadIdx.x == 0) {
        y_s = 7.4f;
    }
    b_s[threadIdx.x] = b[i];
    __syncthreads();
    b[i] = 2.5f*x[0] + 3.7f*x[1] + 6.3f*x[2] + 8.5f*x[3]
         + y_s*b_s[threadIdx.x] + b_s[(threadIdx.x + 3)%128];
}

void foo(int* a_d, int* b_d) {
    unsigned int N = 1024;
    foo_kernel<<<(N + 128 - 1)/128, 128>>>(a_d, b_d);
}
```
- (A) How many versions of the variable i are there?
- (B) How many versions of the array x[] are there?
- (C) How many versions of the variable y_s are there?
- (D) How many versions of the array b_s[] are there?
- (E) What is the amount of shared memory used per block (in bytes)?
- (F) What is the floating-point to global memory access ratio of the kernel (in OP/B)?
Answer 11
- Num blocks = (1024 + 128 − 1) / 128 = 8
- Num threads = 8 blocks × 128 threads/block = 1024
- A) One `i` per thread = 1024 versions
- B) One `x[]` per thread = 1024 versions
- C) One `y_s` per block = 8 versions
- D) One `b_s[]` per block = 8 versions
- E) 4 bytes (for `y_s`) + 128 × 4 bytes = 512 bytes (for `b_s[]`) = 516 bytes per block
- F) Per thread:
  - Memory accesses: 4 × 4 bytes = 16 bytes (four `a` loads) + 4 bytes (one `b` load) = 20 bytes
  - Operations: 5 mul + 5 add = 10 FLOPs
  - Ratio = 10 / 20 = 0.5 OP/B
Question 12
Consider a GPU with the following hardware limits: 2048 threads/SM, 32 blocks/SM, 64K (65,536) registers/SM, and 96 KB of shared memory/SM. For each of the following kernel characteristics, specify whether the kernel can achieve full occupancy. If not, specify the limiting factor.
- (A) The kernel uses 64 threads/block, 27 registers/thread, and 4 KB of shared memory/SM.
- (B) The kernel uses 256 threads/block, 31 registers/thread, and 8 KB of shared memory/SM.
Answer 12
- A)
  - Threads: 64 threads/block, so up to 2048/64 = 32 blocks per SM, which exactly meets the 32 blocks/SM limit ✅
  - Registers: 27 registers/thread × 2048 threads = 55,296 ≤ 65,536 registers/SM ✅
  - Shared memory: 4 KB/block × 32 blocks = 128 KB > 96 KB/SM ❌
  - Not full occupancy: shared memory is the limiting factor (reading the quoted shared-memory figures as per block, which is what the arithmetic requires).
- B)
  - Threads: 256 threads/block, so 2048/256 = 8 blocks per SM, within the 32 blocks/SM limit ✅
  - Registers: 31 registers/thread × 2048 threads = 63,488 ≤ 65,536 registers/SM ✅
  - Shared memory: 8 KB/block × 8 blocks = 64 KB ≤ 96 KB/SM ✅
  - Full occupancy is achievable.
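For checking such cases mechanically, here is a small host-side helper (a hypothetical sketch, not from the book) that applies the same arithmetic under this question's hardware limits:

```cuda
#include <stdio.h>

// Hypothetical helper applying the arithmetic above.
// Hardware limits match the question: 2048 threads/SM, 32 blocks/SM,
// 65,536 registers/SM, 96 KB shared memory/SM.
void checkOccupancy(int threadsPerBlock, int regsPerThread, int smemPerBlockKB) {
    const int maxThreads = 2048, maxBlocks = 32;
    const int maxRegs = 65536, maxSmemKB = 96;

    int blocks = maxThreads / threadsPerBlock;  // blocks needed for full occupancy
    if (blocks > maxBlocks) blocks = maxBlocks;

    int threads = blocks * threadsPerBlock;
    int regs    = threads * regsPerThread;
    int smemKB  = blocks * smemPerBlockKB;

    if (threads < maxThreads)    printf("limited by blocks/SM\n");
    else if (regs > maxRegs)     printf("limited by registers\n");
    else if (smemKB > maxSmemKB) printf("limited by shared memory\n");
    else                         printf("full occupancy\n");
}

int main(void) {
    checkOccupancy(64, 27, 4);   // (A) -> limited by shared memory
    checkOccupancy(256, 31, 8);  // (B) -> full occupancy
    return 0;
}
```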