Post-PASI training: Week 2
Syllabus
August 15th:
- Holiday
Lab 3 (August 16th):
- Using shared memory as a software-managed cache
- Implement a 2D explicit heat-transfer solver with shared memory (a sketch follows this list)
- Comparison of the implementations: timings vs. programming effort
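A minimal sketch of what the shared-memory variant could look like, assuming a 5-point stencil with explicit Euler time stepping: each block stages its TILE x TILE tile of the temperature grid, plus a one-cell halo, in shared memory, so every interior value is read from global memory once instead of up to five times. The kernel name, tile size, and boundary policy are illustrative, not the lab's reference solution; for brevity, nx and ny are assumed to be multiples of TILE and the outermost ring of cells is treated as a fixed (Dirichlet) boundary.

```cuda
#define TILE 16

// One explicit step: u_new(i,j) = u(i,j) + alpha * (discrete Laplacian of u).
// Launch with grid dim3(nx / TILE, ny / TILE) and block dim3(TILE, TILE).
__global__ void heat_step(const float *u, float *u_new,
                          int nx, int ny, float alpha)
{
    __shared__ float s[TILE + 2][TILE + 2];

    int gx = blockIdx.x * TILE + threadIdx.x;   // global column
    int gy = blockIdx.y * TILE + threadIdx.y;   // global row
    int lx = threadIdx.x + 1;                   // local coords, shifted past the halo
    int ly = threadIdx.y + 1;

    // Each thread loads its own cell; threads on a tile edge also load a halo cell.
    s[ly][lx] = u[gy * nx + gx];
    if (threadIdx.x == 0        && gx > 0)      s[ly][0]        = u[gy * nx + gx - 1];
    if (threadIdx.x == TILE - 1 && gx < nx - 1) s[ly][TILE + 1] = u[gy * nx + gx + 1];
    if (threadIdx.y == 0        && gy > 0)      s[0][lx]        = u[(gy - 1) * nx + gx];
    if (threadIdx.y == TILE - 1 && gy < ny - 1) s[TILE + 1][lx] = u[(gy + 1) * nx + gx];
    __syncthreads();

    if (gx > 0 && gx < nx - 1 && gy > 0 && gy < ny - 1) {
        // All five stencil reads now come from shared memory.
        u_new[gy * nx + gx] = s[ly][lx] + alpha *
            (s[ly][lx - 1] + s[ly][lx + 1] + s[ly - 1][lx] + s[ly + 1][lx]
             - 4.0f * s[ly][lx]);
    } else {
        u_new[gy * nx + gx] = s[ly][lx];        // boundary cells keep their value
    }
}
```

For the timing comparison, the same update with every read going straight to global memory is the natural baseline; swapping the u and u_new pointers between launches avoids a copy per step.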
Class 4 (August 17th):
- Control flow
- Warp divergence
- Memory coalescing (toy kernels illustrating both follow this list)
- Latency hiding
- Occupancy
- Measuring effective performance
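Two pairs of toy kernels (not from the course materials; all names are illustrative) that make the first two topics concrete. Branching on a per-thread condition forces every warp through both sides of the branch, one side masked at a time, while branching on the warp index keeps each warp on a single path. Likewise, consecutive threads touching consecutive addresses let the hardware coalesce a warp's accesses into a few wide transactions, while a strided pattern turns the same traffic into many separate transactions.

```cuda
// Warp divergence: even and odd lanes of the SAME warp take different paths,
// so both paths execute serially for every warp.
__global__ void divergent(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0) x[tid] *= 2.0f;
    else              x[tid] += 1.0f;
}

// Same work, split so that whole warps take one path each: no divergence.
__global__ void warp_uniform(float *x)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if ((tid / warpSize) % 2 == 0) x[tid] *= 2.0f;
    else                           x[tid] += 1.0f;
}

// Coalesced: thread i reads and writes element i.
__global__ void coalesced(const float *in, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = in[tid];
}

// Strided: neighboring threads are `stride` elements apart, defeating coalescing.
__global__ void strided(const float *in, float *out, int n, int stride)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid * stride < n) out[tid * stride] = in[tid * stride];
}
```

Timing each pair with cudaEvent timers is a quick way to see the effective-bandwidth gap that the last topic asks you to measure.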
Lab 4 (August 18th):
- Implement an efficient A·A^T (matrix times its own transpose) multiplication (a tiled sketch follows the references)
- References:
- Kirk, D. and Hwu, W. Programming Massively Parallel Processors. (Ch. 4, Ch. 5)
- NVIDIA Advanced CUDA Webinar. Memory Optimizations (http://developer.nvidia.com/gpu-computing-webinars)
- CUDA C Best Practices Guide. Ch. 2-6.
- Ruetsch, G. and Micikevicius, P. Optimizing Matrix Transpose in CUDA.
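One possible starting point, tiled in the spirit of the Ruetsch and Micikevicius note (the kernel name and the assumption that n and m are multiples of TILE are mine, not the lab's). Since C = A·A^T only needs products A[i][k]·A[j][k], both operand tiles are rows of A itself: the transpose is never materialized, and both global loads stay coalesced along k. The one-column pad on the second tile avoids shared-memory bank conflicts on the transposed read in the inner loop.

```cuda
#define TILE 16

// C = A * A^T, with A stored row-major as n x m and C as n x n.
// Launch with grid dim3(n / TILE, n / TILE) and block dim3(TILE, TILE).
__global__ void aat(const float *A, float *C, int n, int m)
{
    __shared__ float Ai[TILE][TILE];        // rows i of this block: A[i][k]
    __shared__ float Aj[TILE][TILE + 1];    // rows j of this block: A[j][k], padded

    int row = blockIdx.y * TILE + threadIdx.y;   // i index of C
    int col = blockIdx.x * TILE + threadIdx.x;   // j index of C
    float acc = 0.0f;

    for (int t = 0; t < m; t += TILE) {
        // Both loads are coalesced: consecutive threads read consecutive k.
        Ai[threadIdx.y][threadIdx.x] = A[row * m + t + threadIdx.x];
        Aj[threadIdx.y][threadIdx.x] = A[(blockIdx.x * TILE + threadIdx.y) * m
                                         + t + threadIdx.x];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += Ai[threadIdx.y][k] * Aj[threadIdx.x][k];  // A[i][k] * A[j][k]
        __syncthreads();
    }
    C[row * n + col] = acc;
}
```

Since C is symmetric, one further optimization is to compute only the blocks on or above the diagonal and mirror the rest.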
Class 5 (August 19th):
- Further optimization techniques:
- Data prefetching
- Instruction optimization
- Loop unrolling
- Thread and block heuristics
- Example: optimizing a parallel reduction (a sketch follows the references)
- References:
- Kirk, D. and Hwu, W. Programming Massively Parallel Processors. (Ch. 4, Ch. 5)
- Harris, M. (NVIDIA). Optimizing Parallel Reduction in CUDA.
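A sketch in the spirit of the Harris slides, combining three of the listed techniques (the names and the power-of-two block-size assumption are illustrative): the block count is halved by a "first add during load", the shared-memory tree uses sequential addressing to avoid divergence and bank conflicts, and the last warp is unrolled. The slides finish the last warp through a volatile pointer; warp shuffles are used here instead, since that trick predates the independent thread scheduling of Volta-class GPUs.

```cuda
// Block-level sum reduction; writes one partial sum per block to out[].
// Assumes blockDim.x is a power of two >= 64; out-of-range elements read as zero.
// Launch as: reduce_sum<<<blocks, threads, threads * sizeof(float)>>>(in, out, n);
__global__ void reduce_sum(const float *in, float *out, int n)
{
    extern __shared__ float sdata[];

    unsigned tid = threadIdx.x;
    unsigned i   = blockIdx.x * (blockDim.x * 2) + tid;

    // First add during load: each thread sums two elements, halving the block count.
    float a = (i < n)              ? in[i]              : 0.0f;
    float b = (i + blockDim.x < n) ? in[i + blockDim.x] : 0.0f;
    sdata[tid] = a + b;
    __syncthreads();

    // Tree reduction with sequential addressing, down to a single warp.
    for (unsigned s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Unrolled final warp, done in registers with shuffles (no more barriers).
    if (tid < 32) {
        float v = sdata[tid] + sdata[tid + 32];
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffffu, v, offset);
        if (tid == 0) out[blockIdx.x] = v;
    }
}
```

The per-block partial sums can be reduced by a second launch of the same kernel or summed on the host; choosing the thread and block counts for that second pass is exactly the heuristics item above.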