SCC Nodes: scc-ea1..ea4, scc-eb1..eb4, scc-ec1..ec4, scc-fa1..fa4, scc-fb1..fb4, scc-fc1..fc4

Below is the output of pgaccelinfo on the SCC E and F nodes.
The CUDA Device Number field reports the device number of each GPU. Each E/F node has 3 GPUs, numbered 0, 1, and 2. Here is a Fortran example that associates each of 3 OpenMP threads (one per CPU core) with a specific GPU device:

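use omp_lib   ! OpenMP routines; compile code with -mp to turn on OpenMP
use openacc   ! acc_set_device_num, acc_device_nvidia (assumes the OpenACC module is available)
integer :: i
call omp_set_num_threads(3)   ! one thread per GPU
!$omp parallel private(i)
   i = omp_get_thread_num()   ! thread numbers 0, 1, 2 match the GPU device numbers
   call acc_set_device_num(i, acc_device_nvidia)
!$omp end parallel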
============================================================
CUDA Driver Version:           4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module  295.71  Thu Aug  2 19:22:08 PDT 2012

CUDA Device Number:            0
Device Name:                   Tesla M2050
Device Revision Number:        2.0
Global Memory Size:            2817982464
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1546 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           87731 microseconds
Current free memory:           2748571648
Upload time (4MB):             1782 microseconds (1417 ms pinned)
Download time:                 1523 microseconds (1307 ms pinned)
Upload bandwidth:              2353 MB/sec (2959 MB/sec pinned)
Download bandwidth:            2753 MB/sec (3209 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            1
Device Name:                   Tesla M2050
Device Revision Number:        2.0
Global Memory Size:            2817982464
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1546 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           87731 microseconds
Current free memory:           2748817408
Upload time (4MB):             1770 microseconds (1425 ms pinned)
Download time:                 1532 microseconds (1312 ms pinned)
Upload bandwidth:              2369 MB/sec (2943 MB/sec pinned)
Download bandwidth:            2737 MB/sec (3196 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            2
Device Name:                   Tesla M2050
Device Revision Number:        2.0
Global Memory Size:            2817982464
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1546 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           87731 microseconds
Current free memory:           2748817408
Upload time (4MB):             1789 microseconds (1421 ms pinned)
Download time:                 1533 microseconds (1307 ms pinned)
Upload bandwidth:              2344 MB/sec (2951 MB/sec pinned)
Download bandwidth:            2736 MB/sec (3209 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc2
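
As a cross-check on the listing above, the reported bandwidths can be reproduced from the reported transfer times. The following is a minimal sketch in Fortran, assuming the "4MB" transfer is 4 x 1024 x 1024 bytes and that pgaccelinfo truncates the result to a whole number. The module transfer_rate and the function mb_per_sec are illustrative names, not part of any PGI tool. Under those assumptions every bandwidth in the listing matches, including the pinned figures, which suggests the pinned times are in microseconds even though the output labels them "ms".

```fortran
module transfer_rate
   implicit none
contains
   ! Bandwidth in MB/sec (10**6 bytes per second), truncated to an integer.
   ! Bytes divided by microseconds is numerically equal to MB/sec.
   pure integer function mb_per_sec(nbytes, time_us)
      integer, intent(in) :: nbytes    ! transfer size in bytes
      integer, intent(in) :: time_us   ! transfer time in microseconds
      mb_per_sec = nbytes / time_us    ! integer division truncates
   end function mb_per_sec
end module transfer_rate

program check_bandwidth
   use transfer_rate
   implicit none
   ! Assumed size of the "4MB" test transfer
   integer, parameter :: nbytes = 4 * 1024 * 1024

   ! Device 0 timings copied from the listing above
   print *, 'upload           ', mb_per_sec(nbytes, 1782)   ! listing reports 2353
   print *, 'upload, pinned   ', mb_per_sec(nbytes, 1417)   ! listing reports 2959
   print *, 'download         ', mb_per_sec(nbytes, 1523)   ! listing reports 2753
   print *, 'download, pinned ', mb_per_sec(nbytes, 1307)   ! listing reports 3209
end program check_bandwidth
```

The same function applied to the device 1 and device 2 timings gives the bandwidths listed for those devices, so the three GPUs can be compared directly on transfer time alone.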