CUDA is a parallel programming model and software environment developed by NVIDIA. It provides programmers with a set of instructions that enable GPU acceleration for data-parallel computations. The computing performance of many applications can be dramatically increased by using CUDA directly or by linking to GPU-accelerated libraries.

Setting up your environment

To build and run applications that use CUDA, you need to add the CUDA toolkit to your path and environment. Load the appropriate version of the CUDA module:

module load cuda/8.0

To list the available CUDA versions, run the module avail cuda command.

Compiling a simple CUDA C/C++ program

Consider the following simple CUDA program that prints out information about GPUs installed on the system:
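
The listing below is a minimal sketch of such a program, not the SCC's own source file. It assumes only the standard CUDA runtime calls cudaGetDeviceCount and cudaGetDeviceProperties; the file name gpu_info.cu and the choice of fields to print are illustrative.

```cuda
// gpu_info.cu (hypothetical file name): list the CUDA devices visible to this process.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device_count = 0;
    cudaError_t err = cudaGetDeviceCount(&device_count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("Found %d CUDA device(s)\n", device_count);

    for (int dev = 0; dev < device_count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "Device %d: query failed: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        // Convert bytes to GB (2^30 bytes) for readability.
        double mem_gb = prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0);
        std::printf("Device %d: %s\n", dev, prop.name);
        std::printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        std::printf("  Global memory:      %.1f GB\n", mem_gb);
        std::printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
    }
    return 0;
}
```

If this sketch were saved under the assumed name gpu_info.cu, it could be compiled with nvcc -o gpu_info gpu_info.cu. On a node without GPUs, the device count query typically returns an error, which the program reports before exiting.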

Download the source code and transfer it to your working directory on the SCC.
Then execute the following command to compile it:

scc1% nvcc -o gpu_info

Running a CUDA program interactively on a GPU-enabled node

To execute a CUDA program, you must log in to a GPU-enabled node on the SCC through an interactive batch session. To request an interactive session with access to 1 GPU:

scc1% qrsh -l gpus=1

To run a CUDA program interactively, you then type in the name of the program at the command prompt:

gpunode% gpu_info

Submitting a CUDA program as a batch job

The following line shows how to submit the gpu_info program to run in batch mode on a single CPU with access to a single GPU:

scc1% qsub -l gpus=1 -b y gpu_info

where the -l gpus=# option indicates the number of GPUs requested for each processor (possibly a fraction). To learn about all the options available for submitting a job, please visit the running jobs page.

CUDA Libraries

Several scientific libraries that make use of CUDA are available:

  • cuBLAS – Linear Algebra Subroutines. A GPU accelerated version of the complete standard BLAS library.
  • cuFFT – Fast Fourier Transform library. Provides a simple interface for computing FFTs up to 10x faster.
  • cuRAND – Random Number Generation library. Delivers high performance random number generation.
  • cuSPARSE – Sparse Matrix library. Provides a collection of basic linear algebra subroutines used for sparse matrices.
  • NPP – NVIDIA Performance Primitives library. A collection of image and signal processing primitives.
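
As an illustration of calling one of these libraries, the following sketch computes y = alpha * x + y (SAXPY) with cuBLAS. It is a hedged example, not taken from the SCC documentation: the file name, vector length, and input values are assumptions chosen so the result is easy to check by hand.

```cuda
// saxpy_cublas.cu (hypothetical file name): y = alpha * x + y computed with cuBLAS.
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Abort with a message if a CUDA runtime call fails.
static void check_cuda(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(1);
    }
}

// Abort with a message if a cuBLAS call fails.
static void check_cublas(cublasStatus_t status, const char *what) {
    if (status != CUBLAS_STATUS_SUCCESS) {
        std::fprintf(stderr, "%s failed with cuBLAS status %d\n", what, (int)status);
        std::exit(1);
    }
}

int main() {
    const int n = 1 << 20;       // illustrative vector length
    const float alpha = 2.0f;    // illustrative scale factor
    std::vector<float> x(n, 1.0f), y(n, 3.0f);
    const size_t bytes = n * sizeof(float);

    // Allocate device buffers and copy the inputs to the GPU.
    float *d_x = NULL, *d_y = NULL;
    check_cuda(cudaMalloc(&d_x, bytes), "cudaMalloc(d_x)");
    check_cuda(cudaMalloc(&d_y, bytes), "cudaMalloc(d_y)");
    check_cuda(cudaMemcpy(d_x, &x[0], bytes, cudaMemcpyHostToDevice), "copy x to device");
    check_cuda(cudaMemcpy(d_y, &y[0], bytes, cudaMemcpyHostToDevice), "copy y to device");

    cublasHandle_t handle;
    check_cublas(cublasCreate(&handle), "cublasCreate");
    // Both vectors use stride 1; alpha is passed by host pointer (the default pointer mode).
    check_cublas(cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1), "cublasSaxpy");

    // This blocking copy on the default stream waits for the SAXPY to finish first.
    check_cuda(cudaMemcpy(&y[0], d_y, bytes, cudaMemcpyDeviceToHost), "copy result to host");
    std::printf("y[0] = %.1f, y[n-1] = %.1f (expected 2 * 1 + 3 = 5)\n", y[0], y[n - 1]);

    cublasDestroy(handle);
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```

Programs that call cuBLAS must be linked against the library, for example nvcc -o saxpy_cublas saxpy_cublas.cu -lcublas, assuming the file name above.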

Architecture-specific options

There are currently 3 sets of nodes that incorporate GPUs and are available to SCC users.

The first set includes 18 nodes. Each of these nodes has an Intel Xeon X5675 CPU with 12 cores running at 3.07 GHz and 48 GB of memory. Each node also has 8 NVIDIA Tesla M2070 GPU cards with 6 GB of memory each and compute capability 2.0.

The second set includes 2 nodes. Each of these nodes has E5-2650v2 processors with 16 cores running at 2.6 GHz and 128 GB of memory. Each node also has 2 NVIDIA Tesla K40m GPU cards with 12 GB of memory each and compute capability 3.5.

The third set includes 4 nodes. Each of these nodes has E5-2680v4 processors with 28 cores running at 2.4 GHz and 256 GB of memory. Each node also has 2 NVIDIA Tesla P100 GPU cards with 12 GB of memory each and compute capability 6.0.

For more details on all the SCC nodes, please visit the Technical Summary page.

Architecture-specific features can be enabled using the -arch sm_## flag during compilation. The “sm” stands for “streaming multiprocessor”, and the number following sm_ is the compute capability of the target architecture with the decimal point removed. For example, for a CUDA program running on the SCC you can add the -arch sm_20 flag to enable functionality available on GPUs that have compute capability 2.0 (Fermi architecture). See the CUDA Toolkit documentation for more information on this.
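
To see the effect of the -arch flag, the sketch below compares the architecture a kernel was generated for with the compute capability of the GPU it runs on. It relies on the __CUDA_ARCH__ macro, which nvcc defines during device compilation (200 for sm_20, 350 for sm_35, 600 for sm_60). The file name and kernel name are assumptions made for illustration.

```cuda
// arch_check.cu (hypothetical file name): compare the compiled target with the device.
#include <cstdio>
#include <cuda_runtime.h>

// Stores the architecture value the device code was generated for.
__global__ void report_arch(int *arch) {
#ifdef __CUDA_ARCH__
    *arch = __CUDA_ARCH__;
#endif
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Device query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int *d_arch = NULL;
    int compiled_arch = 0;
    err = cudaMalloc(&d_arch, sizeof(int));
    if (err == cudaSuccess) err = cudaMemset(d_arch, 0, sizeof(int));
    if (err == cudaSuccess) {
        report_arch<<<1, 1>>>(d_arch);
        err = cudaGetLastError();  // reports launch failures, e.g. no usable kernel image
    }
    if (err == cudaSuccess) {
        err = cudaMemcpy(&compiled_arch, d_arch, sizeof(int), cudaMemcpyDeviceToHost);
    }
    if (err != cudaSuccess) {
        std::fprintf(stderr, "Kernel could not run on this device: %s\n",
                     cudaGetErrorString(err));
        cudaFree(d_arch);
        return 1;
    }

    std::printf("Device 0: %s, compute capability %d.%d (matching flag: -arch sm_%d%d)\n",
                prop.name, prop.major, prop.minor, prop.major, prop.minor);
    std::printf("Kernel code was generated for __CUDA_ARCH__ = %d\n", compiled_arch);

    cudaFree(d_arch);
    return 0;
}
```

Compiled with, for example, nvcc -arch sm_35 -o arch_check arch_check.cu, the kernel should fail to launch on a compute capability 2.0 device, because code generated for a newer architecture cannot run on an older one. Code built for an older architecture can usually still run on a newer GPU when PTX is embedded in the executable, which the single-value -arch sm_## shorthand does.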

Additional CUDA training resources

NVIDIA provides resources for learning CUDA programming at

CUDA Consulting

RCS staff scientific programmers can help you with your CUDA code tuning. For assistance, please send email to