CUDA is a parallel programming model and software environment developed by NVIDIA. It provides programmers with a set of instructions that enable GPU acceleration for data-parallel computations. The computing performance of many applications can be dramatically increased by using CUDA directly or by linking to GPU-accelerated libraries.

### Setting up your environment

Your environment should be set up by default to use CUDA Fortran with the Portland Group Fortran compiler, `pgfortran`.

### Compiling and running a simple CUDA program interactively with Portland Group CUDA Fortran

Run `man pgfortran` for usage instructions.

There are two CUDA Fortran free-format source file suffixes: `.cuf` and `.CUF`. Files with the `.CUF` suffix are run through the preprocessor before compilation.

As a test, you can download the CUDA Fortran matrix multiply example `matmul.cuf` and transfer it to the directory where you are working on the SCC.

Compile CUDA Fortran programs on one of our GPU-equipped nodes, **not on the login nodes**. You can get access to a GPU-equipped node by running the command below.

`scc1% qrsh -l gpus=1`

After that, to compile this CUDA Fortran program, run:

`scc-je2% pgfortran -fast -o matmul matmul.cuf`

To run a CUDA program interactively, type the path to the executable at the command prompt (the leading `./` is needed when the current directory is not on your `PATH`):

`scc-je2% ./matmul`
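The compile and run steps above can be wrapped in a small shell function. This is a minimal sketch, not an SCC-provided tool: the function name `build_and_run` is hypothetical, and it assumes you are already on a GPU node with `pgfortran` on your `PATH`.

```shell
#!/bin/bash
# Sketch: compile a CUDA Fortran source file and run the result.
# build_and_run is a hypothetical helper name, not part of the PGI tools.

build_and_run() {
    local src="$1"   # source file, e.g. matmul.cuf
    local exe="$2"   # executable name, e.g. matmul

    # Stop early if the compiler is not available in this session.
    if ! command -v pgfortran >/dev/null 2>&1; then
        echo "pgfortran not found on PATH" >&2
        return 1
    fi

    # Stop early if the source file is missing.
    if [ ! -f "$src" ]; then
        echo "source file not found: $src" >&2
        return 1
    fi

    # Same compile command as above; run only if compilation succeeds.
    pgfortran -fast -o "$exe" "$src" && "./$exe"
}

# Example call; prints a note instead of aborting if a step fails.
build_and_run matmul.cuf matmul || echo "build or run did not complete" >&2
```

Checking for the compiler and the source file first gives a clearer message than a failed compile when the function is run in the wrong directory or on a node without the PGI tools.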

### Submitting a CUDA program as a batch job

The following line shows how to submit the `matmul` executable to run in batch mode on a single CPU with access to a single GPU:

`scc1% qsub -l gpus=1 -b y matmul`

where the `-l gpus=#` option indicates the number of GPUs requested for each processor (possibly a fraction). To learn about all the options available for submitting a job, please visit the running jobs page.
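As an alternative to passing options on the command line, the same request can be placed in a job script using `#$` directive lines. The sketch below writes such a script and prints the command to submit it. The file name `matmul_job.sh`, the job name, and the `-N`, `-cwd`, and `-j y` options are illustrative choices rather than SCC requirements; only `-l gpus=1` comes from the example above.

```shell
#!/bin/bash
# Sketch: write a job script that requests one GPU, then show how to submit it.
# The file name matmul_job.sh and the job name are illustrative choices.

cat > matmul_job.sh <<'EOF'
#!/bin/bash
# Job name (assumed for this example).
#$ -N matmul
# Run the job in the directory it was submitted from.
#$ -cwd
# Merge standard error into standard output.
#$ -j y
# Request one GPU per processor, as in the qsub line above.
#$ -l gpus=1

./matmul
EOF

echo "Submit with: qsub matmul_job.sh"
```

Because the job is now a script rather than a binary, the `-b y` option is not needed when submitting it.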

### CUDA Libraries

Several scientific libraries that make use of CUDA are available:

- cuBLAS – Linear Algebra Subroutines. A GPU accelerated version of the complete standard BLAS library.
- cuFFT – Fast Fourier Transform library. Provides a simple interface for computing FFTs up to 10x faster.
- cuRAND – Random Number Generation library. Delivers high performance random number generation.
- cuSPARSE – Sparse Matrix library. Provides a collection of basic linear algebra subroutines used for sparse matrices.
- NPP – Performance Primitives library. A collection of image and signal processing primitives.
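With the PGI compilers, these libraries are typically linked through the `-Mcudalib` option. The sketch below builds that option from a list of library names and prints a compile command. It assumes the installed PGI release supports `-Mcudalib` (check `man pgfortran`); the helper name `cudalib_flag` and the source file `myprog.cuf` are hypothetical.

```shell
#!/bin/bash
# Sketch: build a -Mcudalib option from a list of library names.
# cudalib_flag is a hypothetical helper; myprog.cuf is a placeholder file name.

cudalib_flag() {
    # Join all arguments with commas, e.g. "cublas cufft" -> "cublas,cufft".
    local IFS=,
    echo "-Mcudalib=$*"
}

# Print the compile command instead of running it.
echo "pgfortran -fast $(cudalib_flag cublas cufft) -o myprog myprog.cuf"
```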

### Architecture-specific options

There are currently three sets of nodes that incorporate GPUs and are available to SCF users. All three are part of the Shared Computing Cluster (SCC).

The first set includes 18 nodes. Each of these nodes has Intel Xeon X5675 processors with 12 cores running at 3.07 GHz and 48 GB of memory. Each node also has 8 NVIDIA Tesla M2070 GPU cards with 6 GB of memory each.

The second set includes 2 nodes. Each of these nodes has E5-2650v2 processors with 16 cores. Each node also has 2 NVIDIA Tesla K40m GPU cards with 12 GB of memory each and compute capability 3.5.

The third set includes 4 nodes. Each of these nodes has E5-2680 v4 processors with 28 cores. Each node also has 2 NVIDIA Tesla P100 GPU cards with 12 GB of memory each and compute capability 6.0.
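The compute capabilities listed above correspond to code generation targets that `pgfortran` accepts through its `-Mcuda` option. The sketch below maps a GPU model to a target and prints the resulting compile command. The helper name `cc_flag` is hypothetical. The value for the M2070 (compute capability 2.0) is not stated above and is taken from NVIDIA's published specifications. Which targets are accepted depends on the installed PGI version, so check `man pgfortran` before relying on them.

```shell
#!/bin/bash
# Sketch: map a GPU model from the list above to a pgfortran -Mcuda target.
# cc_flag is a hypothetical helper. The M2070 entry assumes compute
# capability 2.0, which is not stated in the node descriptions above.

cc_flag() {
    case "$1" in
        M2070) echo "-Mcuda=cc20" ;;
        K40m)  echo "-Mcuda=cc35" ;;
        P100)  echo "-Mcuda=cc60" ;;
        *)     echo "unknown GPU model: $1" >&2; return 1 ;;
    esac
}

# Print the compile command for a K40m node instead of running it.
echo "pgfortran -fast $(cc_flag K40m) -o matmul matmul.cuf"
```

A binary built for one target may not run on a GPU with a lower compute capability, so it is safest to compile on the same type of node that will run the job.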

For more details on nodes available on the SCC, please visit the Technical Summary page.

### Additional CUDA training resources

PGI CUDA Fortran Programming Guide and Reference

PGI Insider: Introduction to PGI CUDA Fortran

NVIDIA provides a number of resources for learning CUDA programming at https://developer.nvidia.com/cuda-training.

### CUDA Consulting

SCV staff scientific programmers can help you with your CUDA code tuning. For assistance, please send email to help@scc.bu.edu.