c. For the statement on line 04:
i. How many warps in the grid are active?
ii. How many warps in the grid are divergent?
iii. What is the SIMD efficiency (in %) of warp 0 of block 0?
iv. What is the SIMD efficiency (in %) of warp 1 of block 0?
v. What is the SIMD efficiency (in %) of warp 3 of block 0?
d. For the statement on line 07:
i. How many warps in the grid are active?
ii. How many warps in the grid are divergent?
iii. What is the SIMD efficiency (in %) of warp 0 of block 0?
e. For the loop on line 09:
i. How many iterations have no divergence?
ii. How many iterations have divergence?
Question 2
For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
Answer 2
A minimum of ceil(2000 / 512) = 4 blocks is needed to cover all 2000 elements ⇒ 4 * 512 = 2048 threads in the grid
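For reference, a minimal sketch of how this launch configuration arises. The kernel name vecAdd and the host code are illustrative assumptions (the exercise gives no source), and data initialization is omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Representative vector-addition kernel with the boundary check on n
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 2000, blockSize = 512;
    const int gridSize = (n + blockSize - 1) / blockSize;   // ceil(2000 / 512) = 4
    printf("grid = %d blocks x %d threads = %d threads\n",
           gridSize, blockSize, gridSize * blockSize);       // 2048 threads

    float *dA, *dB, *dC;                                     // device buffers (left uninitialized)
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));
    vecAdd<<<gridSize, blockSize>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```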
Question 3
For the previous question, how many warps do you expect to have divergence due to the boundary check on vector length?
Answer 3
Only one warp diverges: the second-to-last warp of the grid (global thread indices 1984–2015), because it straddles the boundary at index 2000 (threads 1984–1999 pass the check, threads 2000–2015 fail it). The last warp (indices 2016–2047) fails the check uniformly, so it does not diverge.
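A small host-side sketch (assuming the 2048-thread grid from Question 2 and a warp size of 32) that counts which warps straddle the n = 2000 boundary:

```cuda
#include <cstdio>

int main() {
    const int n = 2000, totalThreads = 2048, warpSize = 32;
    int divergent = 0;
    for (int w = 0; w < totalThreads / warpSize; ++w) {
        int first = w * warpSize, last = first + warpSize - 1;
        bool someActive = first < n;     // at least one thread passes i < n
        bool someInactive = last >= n;   // at least one thread fails i < n
        if (someActive && someInactive) {
            printf("warp %d (threads %d-%d) diverges\n", w, first, last);
            ++divergent;
        }
    }
    printf("divergent warps: %d\n", divergent);   // prints 1 (warp 62 only)
    return 0;
}
```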
Question 4
Consider a hypothetical block with 8 threads executing a section of code before reaching a barrier. The threads require the following amount of time (in microseconds) to execute the sections: 2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, and 2.9; they spend the rest of their time waiting for the barrier. What percentage of the threads' total execution time is spent waiting for the barrier?
Answer 4
All threads must wait until the slowest thread finishes at 3.0 µs
Thread 0 wait time = 1.0 µs
Thread 1 wait time = 0.7 µs
Thread 2 wait time = 0.0 µs
Thread 3 wait time = 0.2 µs
Thread 4 wait time = 0.6 µs
Thread 5 wait time = 1.1 µs
Thread 6 wait time = 0.4 µs
Thread 7 wait time = 0.1 µs
Total wait time = 4.1 µs; total execution time = 8 * 3.0 µs = 24 µs ⇒ 4.1 / 24 ≈ 17%
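A quick sketch that reproduces this arithmetic (times taken from the question, in microseconds):

```cuda
#include <cstdio>

int main() {
    const double t[8] = {2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, 2.9};
    double slowest = 0.0, wait = 0.0;
    for (double x : t) if (x > slowest) slowest = x;   // 3.0 us
    for (double x : t) wait += slowest - x;            // 4.1 us total waiting
    double total = slowest * 8;                        // 24 us total execution time
    printf("waiting fraction = %.1f%%\n", 100.0 * wait / total);   // ~17.1%
    return 0;
}
```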
Question 5
A CUDA programmer says that if they launch a kernel with only 32 threads in each block, they can leave out the __syncthreads() instruction wherever barrier synchronization is needed. Do you think this is a good idea? Explain.
Answer 5
Not a good idea: even with a single warp per block, control divergence can leave threads at different points in the code, so omitting __syncthreads() can lead to unexpected behaviour and incorrect results. The barrier should be kept wherever synchronization is required.
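A minimal sketch of a hypothetical 32-thread kernel where the barrier still matters: without __syncthreads(), the read of s[31 - t] can race with the neighbouring thread's write (the kernel name and example are illustrative, not from the exercise):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reverseInBlock(float *data) {
    __shared__ float s[32];
    int t = threadIdx.x;
    s[t] = data[t];
    __syncthreads();          // omitting this makes the read below a data race
    data[t] = s[31 - t];
}

int main() {
    float h[32], *d;
    for (int i = 0; i < 32; ++i) h[i] = (float)i;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseInBlock<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[0] = %.0f (expect 31)\n", h[0]);
    cudaFree(d);
    return 0;
}
```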
Question 6
If a CUDA device's SM can take up to 1536 threads and up to 4 thread blocks, which of the following block configurations would result in the largest number of threads in the SM?
(A) 128 threads per block
(B) 256 threads per block
(C) 512 threads per block
(D) 1024 threads per block
Answer 6
(C) 512 threads per block: 3 blocks * 512 = 1536 threads = 100% SM thread utilization. By comparison, (A) gives 4 * 128 = 512 threads, (B) gives 4 * 256 = 1024 threads, and (D) gives 1 * 1024 = 1024 threads, each capped by the 4-block limit or the block size.
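A small sketch of the arithmetic behind this answer, assuming the limits stated in the question (1536 threads and 4 blocks per SM):

```cuda
#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreads = 1536, maxBlocks = 4;
    const int blockSizes[4] = {128, 256, 512, 1024};
    for (int bs : blockSizes) {
        // Resident blocks are capped by both the block limit and the thread limit
        int blocks = std::min(maxBlocks, maxThreads / bs);
        printf("%4d threads/block -> %d blocks, %d threads\n", bs, blocks, blocks * bs);
    }
    return 0;   // 512 threads/block gives 3 blocks = 1536 threads (the maximum)
}
```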
Question 7
Assume a device that allows up to 64 blocks per SM and 2048 threads per SM. Indicate which of the following assignments per SM are possible. In the cases in which it is possible, indicate the occupancy level.
(A) 8 blocks with 128 threads each
(B) 16 blocks with 64 threads each
(C) 32 blocks with 32 threads each
(D) 64 blocks with 32 threads each
(E) 32 blocks with 64 threads each
Answer 7
All five assignments are within the 64-block and 2048-thread limits, so all are possible:
A. Total threads = 8 * 128 = 1024 = 50% occupancy
Total warps = 8 * 4 = 32 warps ✅
B. Total threads = 16 * 64 = 1024 = 50% occupancy
Total warps = 16 * 2 = 32 warps ✅
C. Total threads = 32 * 32 = 1024 = 50% occupancy
Total warps = 32 * 1 = 32 warps ✅
D. Total threads = 64 * 32 = 2048 = 100% occupancy
Total warps = 64 * 1 = 64 warps ✅
E. Total threads = 32 * 64 = 2048 = 100% occupancy
Total warps = 32 * 2 = 64 warps ✅
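A small sketch that reproduces the occupancy figures above, assuming the stated limits of 64 blocks and 2048 threads per SM:

```cuda
#include <cstdio>

int main() {
    const int maxBlocks = 64, maxThreads = 2048;
    const int cfg[5][2] = {{8, 128}, {16, 64}, {32, 32}, {64, 32}, {32, 64}};  // {blocks, threads/block}
    for (int i = 0; i < 5; ++i) {
        int blocks = cfg[i][0], tpb = cfg[i][1];
        int threads = blocks * tpb;
        bool possible = blocks <= maxBlocks && threads <= maxThreads;
        printf("%c: %spossible, occupancy = %d%%\n", 'A' + i,
               possible ? "" : "not ", 100 * threads / maxThreads);
    }
    return 0;   // A-C: 50%, D-E: 100%
}
```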
Question 8
Consider a GPU with the following hardware limits: 2048 threads per SM, 32 blocks per SM, and 64K (65,536) registers per SM. For each of the following kernel characteristics, specify whether the kernel can achieve full occupancy. If not, specify the limiting factor.
(A) The kernel uses 128 threads per block and 30 registers per thread.
(B) The kernel uses 32 threads per block and 29 registers per thread.
(C) The kernel uses 256 threads per block and 34 registers per thread.
Answer 8
A.
Blocks per SM = 2048 / 128 = 16 ≤ 32 ✅
Threads per SM = 128 * 16 = 2048 ✅
Registers per SM = 30 * 128 * 16 = 61,440 ✅
Full occupancy can be achieved.
B.
Threads per SM = 32 * 32 = 1024 ❌ (only 50% occupancy)
Limiting factor = max 32 blocks per SM
C.
Blocks per SM = 2048 / 256 = 8 ≤ 32 ✅
Threads per SM = 256 * 8 = 2048 ✅
Registers per SM = 34 * 256 * 8 = 69,632 ❌
Limiting factor = max 64K registers per SM
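A small sketch of the full-occupancy check above, assuming the stated limits of 2048 threads, 32 blocks, and 65,536 registers per SM:

```cuda
#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreads = 2048, maxBlocks = 32, maxRegs = 65536;
    const int cfg[3][2] = {{128, 30}, {32, 29}, {256, 34}};   // {threads/block, registers/thread}
    for (int i = 0; i < 3; ++i) {
        int tpb = cfg[i][0], regs = cfg[i][1];
        int blocks = std::min(maxBlocks, maxThreads / tpb);   // blocks allowed by the thread and block limits
        int threads = blocks * tpb;
        bool regOk = threads * regs <= maxRegs;
        printf("%c: %d threads/SM, %d registers/SM -> %s\n", 'A' + i,
               threads, threads * regs,
               (threads == maxThreads && regOk) ? "full occupancy"
                                                : "not full occupancy");
    }
    return 0;   // A: full; B: blocked by the 32-block limit; C: blocked by the register limit
}
```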
Question 9
A student mentions that they were able to multiply two 1024 × 1024 matrices using a matrix multiplication kernel with 32 × 32 thread blocks. The student is using a CUDA device that allows up to 512 threads per block and up to 8 blocks per SM. The student further mentions that each thread in a thread block calculates one element of the result matrix. What would be your reaction and why?