Programming for GPUs using CUDA in Fortran
CUDA is a parallel programming model and software environment developed by NVIDIA. It provides programmers with a set of instructions that enable GPU acceleration for data-parallel computations. The computing performance of many applications can be dramatically increased by using CUDA directly or by linking to GPU-accelerated libraries.
Setting up your environment
Your environment should already be set up by default to use CUDA Fortran with the Portland Group Fortran compiler, pgfortran.
Compiling and running a simple CUDA program interactively using Portland Group CUDA Fortran
Run man pgfortran for usage instructions.
There are two CUDA Fortran free-format source file suffixes: .cuf and .CUF. Files with the .CUF suffix are run through the preprocessor before compilation.
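The difference between the two suffixes can be illustrated with a small source file that depends on a preprocessor macro. This is only a sketch: the file name scale.CUF and the macro NELEM are made up for illustration, and it assumes that pgfortran accepts the usual -D option for defining macros on the command line. The compile step is skipped if pgfortran is not on your PATH.

```shell
# Write a small free-format source with the uppercase .CUF suffix, so the
# compiler runs the preprocessor on it before compiling.
# NELEM is a hypothetical macro used only for this illustration.
cat > scale.CUF <<'EOF'
#ifndef NELEM
#define NELEM 1024
#endif
program scale
  use cudafor
  implicit none
  real, device :: x_d(NELEM)   ! array that lives in GPU memory
  real :: x(NELEM)             ! host copy
  x_d = 2.0                    ! fill the device array
  x = x_d                      ! copy device -> host
  print *, 'elements:', NELEM, ' first value:', x(1)
end program scale
EOF

# Compile only where the PGI compiler is available.
# -DNELEM=4096 overrides the default size given in the source.
if command -v pgfortran >/dev/null 2>&1; then
  pgfortran -fast -DNELEM=4096 -o scale scale.CUF
fi
```

The same file saved with the lowercase .cuf suffix would not be preprocessed, so the lines beginning with # would not be handled.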
As a test, you can download the CUDA Fortran matrix multiply example matmul.cuf and transfer it to the directory where you are working on the SCC.
Compile CUDA Fortran programs on one of our GPU-equipped nodes, not on the login nodes. The command below gives you access to such a node: it requests an xterm with access to 1 GPU for 24 hours (an X server must be running on your local machine):
scc1:~ % qsh -V -l h_rt=24:00:00 -l gpus=1
After that, to compile this CUDA Fortran program, run:
scc-je2:~ % pgfortran -fast -o matmul matmul.cuf
To run a CUDA program interactively, type the name of the program at the command prompt (or ./matmul if the current directory is not on your PATH):
gpunode:~ % matmul
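If matmul.cuf is not at hand, the compile and run steps above can be tried with a self-contained source file. The following is a minimal sketch, not the matmul example itself: it writes a small SAXPY program (the names saxpy.cuf, saxpy_mod and saxpy are chosen here for illustration), compiles it with the same options as above, and runs it. It assumes you are on a GPU-equipped node; the compile and run steps are skipped if pgfortran is not found.

```shell
# Write a minimal CUDA Fortran program: y = y + a*x computed on the GPU.
cat > saxpy.cuf <<'EOF'
module saxpy_mod
contains
  ! Kernel: each GPU thread updates one element of y.
  attributes(global) subroutine saxpy(n, a, x, y)
    implicit none
    integer, value :: n
    real, value :: a
    real :: x(n), y(n)
    integer :: i
    i = (blockIdx%x - 1) * blockDim%x + threadIdx%x
    if (i <= n) y(i) = y(i) + a * x(i)
  end subroutine saxpy
end module saxpy_mod

program main
  use cudafor
  use saxpy_mod
  implicit none
  integer, parameter :: n = 4096
  real :: x(n), y(n)
  real, device :: x_d(n), y_d(n)   ! copies in GPU memory
  x = 1.0
  y = 2.0
  x_d = x                          ! host -> device
  y_d = y
  ! Launch enough blocks of 256 threads to cover all n elements.
  call saxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x_d, y_d)
  y = y_d                          ! device -> host
  print *, 'max error:', maxval(abs(y - 4.0))
end program main
EOF

# Compile and run only where the PGI compiler is available.
if command -v pgfortran >/dev/null 2>&1; then
  pgfortran -fast -o saxpy saxpy.cuf && ./saxpy
fi
```

Every element of y starts at 2.0 and has 2.0 * 1.0 added to it, so the program compares the result against 4.0.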
Submitting a CUDA program as a batch job
The following line shows how to submit the matmul executable to run in batch mode on a single CPU with access to a single GPU:
scc1:~ % qsub -l gpus=1 -b y matmul
where the -l gpus=# option indicates the number of GPUs requested for each processor (possibly a fraction). To learn about all the options available when submitting a job, please visit the running jobs page.
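The same request can also be placed in a job script, which is easier to reuse than a long command line. The sketch below is an illustration under stated assumptions: the script name matmul_job.sh, the job name and the one-hour run time limit are arbitrary choices, and -N, -j y and -cwd are standard Grid Engine options assumed to behave on the SCC as they do elsewhere. The -l gpus=1 and -l h_rt options are the ones shown earlier on this page.

```shell
# Write a job script; lines starting with "#$" are read by qsub as options.
cat > matmul_job.sh <<'EOF'
#!/bin/bash -l
#$ -N matmul
#$ -l gpus=1
#$ -l h_rt=01:00:00
#$ -j y
#$ -cwd
# -N names the job, -j y merges stderr into stdout,
# -cwd runs the job in the directory it was submitted from.
./matmul
EOF

# Submit only where the batch system is available.
if command -v qsub >/dev/null 2>&1; then
  qsub matmul_job.sh
fi
```

Because the options live inside the script, the submission command reduces to qsub followed by the script name.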
Several scientific libraries that make use of CUDA are available:
- cuBLAS – Linear Algebra Subroutines. A GPU-accelerated version of the complete standard BLAS library.
- cuFFT – Fast Fourier Transform library. Provides a simple interface for computing FFTs up to 10x faster.
- cuRAND – Random Number Generation library. Delivers high performance random number generation.
- cuSPARSE – Sparse Matrix library. Provides a collection of basic linear algebra subroutines used for sparse matrices.
- NPP – Performance Primitives library. A collection of image and signal processing primitives.
Architecture-specific options
There are currently two types of GPU cards available on the SCC: NVIDIA Tesla M2050 cards (3 per node) with 3GB of memory, on the nodes scc-e* and scc-f*, and NVIDIA Tesla M2070 cards (8 per node) with 6GB of memory, on the nodes scc-h* and scc-j*.
Architecture-specific features can be enabled using the -arch sm_## flag during compilation. The "sm" stands for "streaming multiprocessor" and the number following sm_ indicates the features supported by the architecture. For example, for a CUDA program running on the SCC you can add the -arch sm_20 flag to allow for functionality available on GPUs that have Compute Capability 2.0 (Fermi architecture). The -arch form is the one used by NVIDIA's nvcc compiler; for pgfortran, the compute capability is normally selected through the -Mcuda option (for example -Mcuda=cc20), so check man pgfortran for the exact spelling your version accepts. See the CUDA Toolkit documentation for more information on this.
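As a rough sketch of how the compute capability number maps onto a compiler flag, the snippet below builds both spellings from the same two digits. It assumes a PGI release that accepts -Mcuda=cc20 for pgfortran; the variable names are made up for illustration. The compile step runs only if pgfortran and matmul.cuf are both present.

```shell
# Compute capability to target. 2.0 corresponds to the Fermi-class
# Tesla M2050 and M2070 cards described above.
cc_major=2
cc_minor=0

# nvcc spelling, as quoted in the text (used for CUDA C sources).
nvcc_flag="-arch sm_${cc_major}${cc_minor}"
# PGI spelling (assumption: this compiler release accepts -Mcuda=cc20).
pgi_flag="-Mcuda=cc${cc_major}${cc_minor}"

echo "nvcc:      ${nvcc_flag}"
echo "pgfortran: ${pgi_flag}"

# Compile only if the compiler and the example source are both present.
if command -v pgfortran >/dev/null 2>&1 && [ -f matmul.cuf ]; then
  pgfortran -fast ${pgi_flag} -o matmul matmul.cuf
fi
```

If the compiler rejects the option, man pgfortran lists the compute capability values that your installed version supports.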
Additional CUDA training resources
NVIDIA provides a number of resources for learning CUDA programming.
SCV staff scientific programmers can help you with your CUDA code tuning. For assistance, please send email to email@example.com.