Post-PASI training: Week 2
Syllabus
August 15th:
- Holiday
Lab 3 (August 16th):
- Using shared memory as a software-managed cache
- Implement a 2D explicit heat-transfer solver with shared memory (a sketch follows this list)
- Comparison of the implementations: timings vs. programming effort
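A minimal sketch of what the shared-memory variant could look like, assuming a 5-point stencil with explicit Euler time stepping: each block stages its TILE x TILE tile of the temperature grid, plus a one-cell halo, in shared memory, so every interior value is read from global memory once instead of up to five times. The kernel name, tile size, and boundary policy are illustrative, not the lab's reference solution; for brevity, nx and ny are assumed to be multiples of TILE and the outermost ring of cells is treated as a fixed (Dirichlet) boundary.

```cuda
#define TILE 16

// One explicit step: u_new(i,j) = u(i,j) + alpha * (discrete Laplacian of u).
// Launch with grid dim3(nx / TILE, ny / TILE) and block dim3(TILE, TILE).
__global__ void heat_step(const float *u, float *u_new,
                          int nx, int ny, float alpha)
{
    __shared__ float s[TILE + 2][TILE + 2];

    int gx = blockIdx.x * TILE + threadIdx.x;   // global column
    int gy = blockIdx.y * TILE + threadIdx.y;   // global row
    int lx = threadIdx.x + 1;                   // local coords, shifted past the halo
    int ly = threadIdx.y + 1;

    // Each thread loads its own cell; threads on a tile edge also load a halo cell.
    s[ly][lx] = u[gy * nx + gx];
    if (threadIdx.x == 0        && gx > 0)      s[ly][0]        = u[gy * nx + gx - 1];
    if (threadIdx.x == TILE - 1 && gx < nx - 1) s[ly][TILE + 1] = u[gy * nx + gx + 1];
    if (threadIdx.y == 0        && gy > 0)      s[0][lx]        = u[(gy - 1) * nx + gx];
    if (threadIdx.y == TILE - 1 && gy < ny - 1) s[TILE + 1][lx] = u[(gy + 1) * nx + gx];
    __syncthreads();

    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1) {
        // All five stencil reads now come from shared memory.
        u_new[gy * nx + gx] = s[ly][lx] + alpha *
            (s[ly][lx - 1] + s[ly][lx + 1] + s[ly - 1][lx] + s[ly + 1][lx]
             - 4.0f * s[ly][lx]);
    } else {
        u_new[gy * nx + gx] = s[ly][lx];        // boundary cells keep their value
    }
}
```

For the timing comparison, the same update with every read going straight to global memory is the natural baseline; swapping the u and u_new pointers between launches avoids a copy per step.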
Class 4 (August 17th):
- Control flow
- Warp divergence
- Memory coalescing (toy kernels illustrating both follow this list)
- Latency hiding
- Occupancy
- Measuring effective performance
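Two pairs of toy kernels (not from the course materials; all names are illustrative) that make the first two topics concrete. Branching on a per-thread condition forces every warp through both sides of the branch, one side masked at a time, while branching on the warp index keeps each warp on a single path. Likewise, consecutive threads touching consecutive addresses let the hardware coalesce a warp's accesses into a few wide transactions, while a strided pattern turns the same traffic into many separate transactions.

```cuda
// Warp divergence: even and odd lanes of the SAME warp take different paths,
// so both paths execute serially for every warp.
__global__ void divergent(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) x[tid] *= 2.0f;
    else              x[tid] += 1.0f;
}

// Same work, split so that whole warps take one path each: no divergence.
__global__ void warp_uniform(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0) x[tid] *= 2.0f;
    else                           x[tid] += 1.0f;
}

// Coalesced: thread i reads and writes element i.
__global__ void coalesced(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid];
}

// Strided: neighboring threads are `stride` elements apart, defeating coalescing.
__global__ void strided(const float *in, float *out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n) out[tid * stride] = in[tid * stride];
}
```

Timing each pair with cudaEvent timers is a quick way to see the effective-bandwidth gap that the last topic asks you to measure.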
Lab 4 (August 18th):
- Implement an efficient A·A^T (matrix times its own transpose) multiplication (a tiled sketch follows the references)
- References:
- Kirk, D. and Hwu, W. Programming Massively Parallel Processors. (Ch. 4, Ch. 5)
- NVIDIA Advanced CUDA Webinar. Memory Optimizations (http://developer.nvidia.com/gpu-computing-webinars)
- CUDA C Best Practices Guide. Ch. 2-6.
- Ruetsch, G. and Micikevicius, P. Optimizing Matrix Transpose in CUDA.
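One possible starting point, tiled in the spirit of the Ruetsch and Micikevicius note (the kernel name and the assumption that n and m are multiples of TILE are mine, not the lab's). Since C = A·A^T only needs products A[i][k]·A[j][k], both operand tiles are rows of A itself: the transpose is never materialized, and both global loads stay coalesced along k. The one-column pad on the second tile avoids shared-memory bank conflicts on the transposed read in the inner loop.

```cuda
#define TILE 16

// C = A * A^T, with A stored row-major as n x m and C as n x n.
// Launch with grid dim3(n / TILE, n / TILE) and block dim3(TILE, TILE).
__global__ void aat(const float *A, float *C, int n, int m)
{
    __shared__ float Ai[TILE][TILE];        // rows i of this block: A[i][k]
    __shared__ float Aj[TILE][TILE + 1];    // rows j of this block: A[j][k], padded

    int row = blockIdx.y * TILE + threadIdx.y;   // i index of C
    int col = blockIdx.x * TILE + threadIdx.x;   // j index of C
    float acc = 0.0f;

    for (int t = 0; t < m; t += TILE) {
        // Both loads are coalesced: consecutive threads read consecutive k.
        Ai[threadIdx.y][threadIdx.x] = A[row * m + t + threadIdx.x];
        Aj[threadIdx.y][threadIdx.x] = A[(blockIdx.x * TILE + threadIdx.y) * m
                                         + t + threadIdx.x];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += Ai[threadIdx.y][k] * Aj[threadIdx.x][k];  // A[i][k] * A[j][k]
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Since C is symmetric, one further optimization is to compute only the blocks on or above the diagonal and mirror the rest.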
Class 5 (August 19th):
- Further optimization techniques:
- Data prefetching
- Instruction optimization
- Loop unrolling
- Thread and block heuristics
- Example: optimizing a parallel reduction (a sketch follows the references)
- References:
- Kirk, D. and Hwu, W. Programming Massively Parallel Processors. (Ch. 4, Ch. 5)
- Harris, M. (NVIDIA). Optimizing Parallel Reduction in CUDA.
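A sketch in the spirit of the Harris slides, combining three of the listed techniques (the names and the power-of-two block-size assumption are illustrative): the block count is halved by a "first add during load", the shared-memory tree uses sequential addressing to avoid divergence and bank conflicts, and the last warp is unrolled. The slides finish the last warp through a volatile pointer; warp shuffles are used here instead, since that trick predates the independent thread scheduling of Volta-class GPUs.

```cuda
// Block-level sum reduction; writes one partial sum per block to out[].
// Assumes blockDim.x is a power of two >= 64; out-of-range elements read as zero.
// Launch as: reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];

    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * (blockDim.x * 2) + tid;

    // First add during load: each thread sums two elements, halving the block count.
    float a = (i < n)              ? in[i]              : 0.0f;
    float b = (i + blockDim.x < n) ? in[i + blockDim.x] : 0.0f;
    sdata[tid] = a + b;
    __syncthreads();

    // Tree reduction with sequential addressing, down to a single warp.
    for (unsigned s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Unrolled final warp, done in registers with shuffles (no more barriers).
    if (tid < 32) {
        float v = sdata[tid] + sdata[tid + 32];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) out[blockIdx.x] = v;
    }
}
```

The per-block partial sums can be reduced by a second launch of the same kernel or summed on the host; choosing the thread and block counts for that second pass is exactly the heuristics item above.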