OpenACC is a directives-based API for parallelizing code on GPUs, analogous to OpenMP, the directives-based API for shared-memory parallel processing on CPUs. Programmers insert OpenACC directives before specific code sections, typically loops, to offload them to the GPU. This approach enables existing, especially legacy, codes to be parallelized without extensive rewriting in Nvidia's CUDA GPU programming language. While parallelizing a code with OpenACC directives may take significantly less effort than writing the equivalent CUDA version, it may also yield lower computational performance. For many large existing codes, however, rewriting in CUDA is impractical if not impossible; for those cases, OpenACC offers a pragmatic alternative.
What you need to know or do on the SCC
- To use OpenACC, compile your code with a Portland Group (PGI) compiler: pgcc for C or pgfortran for Fortran. Here is an example of how to compile a Fortran code embedded with OpenACC directives:
scc1% pgfortran -o mycode -acc -Minfo mycode.f90
In the above, -acc turns on the OpenACC feature, while -Minfo returns additional information on the compilation. For details, see the PGI compiler documentation.
- To submit your code (with OpenACC directives) to a SCC node with GPUs:
scc1% qsub -l gpus=1 -b y mycode
In the above, 1 GPU is requested (and, in the absence of a multiprocessor request, 1 CPU).
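Equivalently, the resource request can be placed in a batch script instead of on the qsub command line. The sketch below assumes the SCC's Grid Engine syntax, where lines beginning with `#$` carry qsub options; the job name `mm_acc` and executable path are illustrative.

```shell
#!/bin/bash -l
#$ -l gpus=1        # request 1 GPU (and, by default, 1 CPU)
#$ -N mm_acc        # illustrative job name

# Run the OpenACC executable built earlier.
./mm_acc
```

Submit the script with `qsub` (without `-b y`, since this is a script rather than a binary).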
- The following examples demonstrate a matrix multiply (C = A * B) using either multi-threaded OpenMP or OpenACC on a single GPU. The preprocessor macros _OPENMP and _OPENACC are used to determine whether the Fortran 90 program was compiled with OpenMP or OpenACC directives.
- To enable the #ifdef C preprocessor statements in the example, either name your Fortran code with the .F90 suffix or use the compiler flag -Mpreprocess.
- For an OpenMP application:
scc1% pgfortran -o mm_omp matrix_multiply.f90 -mp -Mpreprocess
For an OpenACC application (with a single GPU device):
scc1% pgfortran -o mm_acc matrix_multiply.f90 -acc -Mpreprocess
For an OpenACC application (with multiple GPU devices):
Not yet supported on the SCC; awaiting PGI compiler version 13.3 for a bug fix.
The figure above compares timings of a matrix multiply using a single GPU (via OpenACC) against two other parallel methods: OpenMP and MPI. The figure below shows timings of the matrix multiply using 1, 2, and 3 GPU devices.