The image below shows timing comparisons for a double-precision matrix multiplication program run on SCC Budge nodes (Intel Xeon X5675 processors with Tesla 2070 GPU cards); these are the scc-h* and scc-j* nodes on the SCC. Three versions of the program, all written in C, were tested: a serial version (running on one core), an MPI version using 8 processes, and a CUDA version (using one GPU). With large matrices, the CUDA version is significantly faster than both the serial program (729 times as fast on a 10K by 10K matrix) and the 8-process MPI version (116 times as fast on the same matrix).
Of course, not all codes will see a speedup this dramatic, nor will all be practical to port to GPUs.
Note: For this performance analysis, the times shown include all overhead, including copying memory from the host to the GPU and executing the kernel.