c. For the statement on line 04:
i. How many warps in the grid are active?
ii. How many warps in the grid are divergent?
iii. What is the SIMD efficiency (in %) of warp 0 of block 0?
iv. What is the SIMD efficiency (in %) of warp 1 of block 0?
v. What is the SIMD efficiency (in %) of warp 3 of block 0?
d. For the statement on line 07:
i. How many warps in the grid are active?
ii. How many warps in the grid are divergent?
iii. What is the SIMD efficiency (in %) of warp 0 of block 0?
e. For the loop on line 09:
i. How many iterations have no divergence?
ii. How many iterations have divergence?
Question 2
For a vector addition, assume that the vector length is 2000, each thread calculates one output element, and the thread block size is 512 threads. How many threads will be in the grid?
Answer 2
A minimum of ceil(2000 / 512) = 4 blocks is needed to cover all 2000 elements ⇒ 4 * 512 = 2048 threads in the grid
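For reference, a minimal sketch of how this launch configuration arises. The kernel name vecAdd and the host code are illustrative assumptions (the exercise gives no source), and data initialization is omitted for brevity:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Representative vector-addition kernel with the boundary check on n
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

int main() {
    const int n = 2000, blockSize = 512;
    const int gridSize = (n + blockSize - 1) / blockSize;   // ceil(2000 / 512) = 4
    printf("grid = %d blocks x %d threads = %d threads\n",
           gridSize, blockSize, gridSize * blockSize);       // 2048 threads

    float *dA, *dB, *dC;                                     // device buffers (left uninitialized)
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dC, n * sizeof(float));
    vecAdd<<<gridSize, blockSize>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```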
Question 3
For the previous question, how many warps do you expect to have divergence due to the boundary check on vector length?
Answer 3
Only one warp diverges: the second-to-last warp of the grid (global thread indices 1984–2015), because it straddles the boundary at index 2000 (threads 1984–1999 pass the check, threads 2000–2015 fail it). The last warp (indices 2016–2047) fails the check uniformly, so it does not diverge.
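A small host-side sketch (assuming the 2048-thread grid from Question 2 and a warp size of 32) that counts which warps straddle the n = 2000 boundary:

```cuda
#include <cstdio>

int main() {
    const int n = 2000, totalThreads = 2048, warpSize = 32;
    int divergent = 0;
    for (int w = 0; w < totalThreads / warpSize; ++w) {
        int first = w * warpSize, last = first + warpSize - 1;
        bool someActive = first < n;     // at least one thread passes i < n
        bool someInactive = last >= n;   // at least one thread fails i < n
        if (someActive && someInactive) {
            printf("warp %d (threads %d-%d) diverges\n", w, first, last);
            ++divergent;
        }
    }
    printf("divergent warps: %d\n", divergent);   // prints 1 (warp 62 only)
    return 0;
}
```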
Question 4
Consider a hypothetical block with 8 threads executing a section of code before reaching a barrier. The threads require the following amount of time (in microseconds) to execute the sections: 2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, and 2.9; they spend the rest of their time waiting for the barrier. What percentage of the threads' total execution time is spent waiting for the barrier?
Answer 4
All threads must wait until the slowest thread finishes at 3.0 µs
Thread 0 wait time = 1.0 µs
Thread 1 wait time = 0.7 µs
Thread 2 wait time = 0.0 µs
Thread 3 wait time = 0.2 µs
Thread 4 wait time = 0.6 µs
Thread 5 wait time = 1.1 µs
Thread 6 wait time = 0.4 µs
Thread 7 wait time = 0.1 µs
Total wait time = 4.1 µs; total execution time = 8 * 3.0 µs = 24 µs ⇒ 4.1 / 24 ≈ 17%
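A quick sketch that reproduces this arithmetic (times taken from the question, in microseconds):

```cuda
#include <cstdio>

int main() {
    const double t[8] = {2.0, 2.3, 3.0, 2.8, 2.4, 1.9, 2.6, 2.9};
    double slowest = 0.0, wait = 0.0;
    for (double x : t) if (x > slowest) slowest = x;   // 3.0 us
    for (double x : t) wait += slowest - x;            // 4.1 us total waiting
    double total = slowest * 8;                        // 24 us total execution time
    printf("waiting fraction = %.1f%%\n", 100.0 * wait / total);   // ~17.1%
    return 0;
}
```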
Question 5
A CUDA programmer says that if they launch a kernel with only 32 threads in each block, they can leave out the __syncthreads() instruction wherever barrier synchronization is needed. Do you think this is a good idea? Explain.
Answer 5
Not a good idea: even with a single warp per block, control divergence can leave threads at different points in the code, so omitting __syncthreads() can lead to unexpected behaviour and incorrect results. The barrier should be kept wherever synchronization is required.
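A minimal sketch of a hypothetical 32-thread kernel where the barrier still matters: without __syncthreads(), the read of s[31 - t] can race with the neighbouring thread's write (the kernel name and example are illustrative, not from the exercise):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void reverseInBlock(float *data) {
    __shared__ float s[32];
    int t = threadIdx.x;
    s[t] = data[t];
    __syncthreads();          // omitting this makes the read below a data race
    data[t] = s[31 - t];
}

int main() {
    float h[32], *d;
    for (int i = 0; i < 32; ++i) h[i] = (float)i;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
    reverseInBlock<<<1, 32>>>(d);
    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[0] = %.0f (expect 31)\n", h[0]);
    cudaFree(d);
    return 0;
}
```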
Question 6
If a CUDA device's SM can take up to 1536 threads and up to 4 thread blocks, which of the following block configurations would result in the largest number of threads in the SM?
(A) 128 threads per block
(B) 256 threads per block
(C) 512 threads per block
(D) 1024 threads per block
Answer 6
(C) 512 threads per block: 3 blocks * 512 = 1536 threads = 100% SM thread utilization. By comparison, (A) gives 4 * 128 = 512 threads, (B) gives 4 * 256 = 1024 threads, and (D) gives 1 * 1024 = 1024 threads, each capped by the 4-block limit or the block size.
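A small sketch of the arithmetic behind this answer, assuming the limits stated in the question (1536 threads and 4 blocks per SM):

```cuda
#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreads = 1536, maxBlocks = 4;
    const int blockSizes[4] = {128, 256, 512, 1024};
    for (int bs : blockSizes) {
        // Resident blocks are capped by both the block limit and the thread limit
        int blocks = std::min(maxBlocks, maxThreads / bs);
        printf("%4d threads/block -> %d blocks, %d threads\n", bs, blocks, blocks * bs);
    }
    return 0;   // 512 threads/block gives 3 blocks = 1536 threads (the maximum)
}
```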
Question 7
Assume a device that allows up to 64 blocks per SM and 2048 threads per SM. Indicate which of the following assignments per SM are possible. In the cases in which it is possible, indicate the occupancy level.
(A) 8 blocks with 128 threads each
(B) 16 blocks with 64 threads each
(C) 32 blocks with 32 threads each
(D) 64 blocks with 32 threads each
(E) 32 blocks with 64 threads each
Answer 7
All five assignments are within the 64-block and 2048-thread limits, so all are possible:
A. Total threads = 8 * 128 = 1024 = 50% occupancy
Total warps = 8 * 4 = 32 warps ✅
B. Total threads = 16 * 64 = 1024 = 50% occupancy
Total warps = 16 * 2 = 32 warps ✅
C. Total threads = 32 * 32 = 1024 = 50% occupancy
Total warps = 32 * 1 = 32 warps ✅
D. Total threads = 64 * 32 = 2048 = 100% occupancy
Total warps = 64 * 1 = 64 warps ✅
E. Total threads = 32 * 64 = 2048 = 100% occupancy
Total warps = 32 * 2 = 64 warps ✅
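A small sketch that reproduces the occupancy figures above, assuming the stated limits of 64 blocks and 2048 threads per SM:

```cuda
#include <cstdio>

int main() {
    const int maxBlocks = 64, maxThreads = 2048;
    const int cfg[5][2] = {{8, 128}, {16, 64}, {32, 32}, {64, 32}, {32, 64}};  // {blocks, threads/block}
    for (int i = 0; i < 5; ++i) {
        int blocks = cfg[i][0], tpb = cfg[i][1];
        int threads = blocks * tpb;
        bool possible = blocks <= maxBlocks && threads <= maxThreads;
        printf("%c: %spossible, occupancy = %d%%\n", 'A' + i,
               possible ? "" : "not ", 100 * threads / maxThreads);
    }
    return 0;   // A-C: 50%, D-E: 100%
}
```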
Question 8
Consider a GPU with the following hardware limits: 2048 threads per SM, 32 blocks per SM, and 64K (65,536) registers per SM. For each of the following kernel characteristics, specify whether the kernel can achieve full occupancy. If not, specify the limiting factor.
(A) The kernel uses 128 threads per block and 30 registers per thread.
(B) The kernel uses 32 threads per block and 29 registers per thread.
(C) The kernel uses 256 threads per block and 34 registers per thread.
Answer 8
A.
Blocks per SM = 2048 / 128 = 16 ≤ 32 ✅
Threads per SM = 128 * 16 = 2048 ✅
Registers per SM = 30 * 128 * 16 = 61,440 ✅
Full occupancy can be achieved.
B.
Threads per SM = 32 * 32 = 1024 ❌ (only 50% occupancy)
Limiting factor = max 32 blocks per SM
C.
Blocks per SM = 2048 / 256 = 8 ≤ 32 ✅
Threads per SM = 256 * 8 = 2048 ✅
Registers per SM = 34 * 256 * 8 = 69,632 ❌
Limiting factor = max 64K registers per SM
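A small sketch of the full-occupancy check above, assuming the stated limits of 2048 threads, 32 blocks, and 65,536 registers per SM:

```cuda
#include <cstdio>
#include <algorithm>

int main() {
    const int maxThreads = 2048, maxBlocks = 32, maxRegs = 65536;
    const int cfg[3][2] = {{128, 30}, {32, 29}, {256, 34}};   // {threads/block, registers/thread}
    for (int i = 0; i < 3; ++i) {
        int tpb = cfg[i][0], regs = cfg[i][1];
        int blocks = std::min(maxBlocks, maxThreads / tpb);   // blocks allowed by the thread and block limits
        int threads = blocks * tpb;
        bool regOk = threads * regs <= maxRegs;
        printf("%c: %d threads/SM, %d registers/SM -> %s\n", 'A' + i,
               threads, threads * regs,
               (threads == maxThreads && regOk) ? "full occupancy"
                                                : "not full occupancy");
    }
    return 0;   // A: full; B: blocked by the 32-block limit; C: blocked by the register limit
}
```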
Question 9
A student mentions that they were able to multiply two 1024 × 1024 matrices using a matrix multiplication kernel with 32 × 32 thread blocks. The student is using a CUDA device that allows up to 512 threads per block and up to 8 blocks per SM. The student further mentions that each thread in a thread block calculates one element of the result matrix. What would be your reaction and why?