SCC H&J/Budge Nodes GPU Specs

SCC H&J/Budge nodes (scc-ha1..scc-he2, scc-ja1..scc-je2)

Below is the output of the pgaccelinfo command on an SCC H&J/Budge node.
CUDA Device Number is the GPU device number. On H&J nodes with 8 GPUs, the device numbers are 0 through 7. Here is a Fortran example that associates each of 3 OpenMP threads (i.e., CPU threads) with its own GPU device:

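   ! Set the thread count before the parallel region starts; a call made
   ! inside a region does not change the size of that region's team.
   call omp_set_num_threads(3) ! compile code with -mp to turn on OpenMP

!$omp parallel private(i)
   i = omp_get_thread_num()                      ! thread IDs are 0, 1, 2
   call acc_set_device_num(i, acc_device_nvidia) ! thread i uses GPU device i
!$omp end parallel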
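
The fragment above hard-codes 3 threads. The following is a minimal sketch of a complete program that instead sizes the thread team to the number of GPUs found on the node. It assumes the installed PGI compiler provides the OpenACC runtime module openacc with the standard routines acc_get_num_devices and acc_get_device_num (older releases may differ). The program name map_threads_to_gpus is made up for illustration.

```fortran
program map_threads_to_gpus
   use omp_lib    ! OpenMP runtime routines
   use openacc    ! OpenACC runtime routines (assumed available)
   implicit none
   integer :: ngpus, tid, dev

   ! Ask the OpenACC runtime how many NVIDIA devices it can see.
   ngpus = acc_get_num_devices(acc_device_nvidia)
   if (ngpus < 1) stop 'no NVIDIA devices found'

   ! One OpenMP thread per GPU; set this before the parallel region.
   call omp_set_num_threads(ngpus)

!$omp parallel private(tid, dev)
   tid = omp_get_thread_num()
   call acc_set_device_num(tid, acc_device_nvidia)   ! thread tid -> device tid
   dev = acc_get_device_num(acc_device_nvidia)       ! read back the binding
!$omp critical
   print '(a,i0,a,i0)', 'thread ', tid, ' uses device ', dev
!$omp end critical
!$omp end parallel
end program map_threads_to_gpus
```

A plausible compile line, combining the -mp flag mentioned above with the -ta=nvidia,cc20 option reported by pgaccelinfo, is pgfortran -mp -acc -ta=nvidia,cc20 map_gpus.f90, where the file name is hypothetical. Each thread should report its own device number.
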
============================================================
CUDA Driver Version:           5000
NVRM version: NVIDIA UNIX x86_64 Kernel Module  310.32  Mon Jan 14 14:41:13 PST 2013

CUDA Device Number:            0
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1270 microseconds ( 721 ms pinned)
Download time:                  962 microseconds ( 661 ms pinned)
Upload bandwidth:              3302 MB/sec (5817 MB/sec pinned)
Download bandwidth:            4359 MB/sec (6345 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            1
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1261 microseconds ( 725 ms pinned)
Download time:                  969 microseconds ( 656 ms pinned)
Upload bandwidth:              3326 MB/sec (5785 MB/sec pinned)
Download bandwidth:            4328 MB/sec (6393 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            2
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1264 microseconds ( 722 ms pinned)
Download time:                  963 microseconds ( 660 ms pinned)
Upload bandwidth:              3318 MB/sec (5809 MB/sec pinned)
Download bandwidth:            4355 MB/sec (6355 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            3
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1322 microseconds ( 820 ms pinned)
Download time:                 1159 microseconds ( 947 ms pinned)
Upload bandwidth:              3172 MB/sec (5115 MB/sec pinned)
Download bandwidth:            3618 MB/sec (4429 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            4
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1338 microseconds ( 816 ms pinned)
Download time:                 1148 microseconds ( 948 ms pinned)
Upload bandwidth:              3134 MB/sec (5140 MB/sec pinned)
Download bandwidth:            3653 MB/sec (4424 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            5
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1339 microseconds ( 818 ms pinned)
Download time:                 1151 microseconds ( 947 ms pinned)
Upload bandwidth:              3132 MB/sec (5127 MB/sec pinned)
Download bandwidth:            3644 MB/sec (4429 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            6
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1311 microseconds ( 820 ms pinned)
Download time:                 1148 microseconds ( 947 ms pinned)
Upload bandwidth:              3199 MB/sec (5115 MB/sec pinned)
Download bandwidth:            3653 MB/sec (4429 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20

CUDA Device Number:            7
Device Name:                   Tesla M2070
Device Revision Number:        2.0
Global Memory Size:            5636554752
Number of Multiprocessors:     14
Number of Cores:               448
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           32768
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       65535 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1147 MHz
Execution Timeout:             No
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  exclusive
Concurrent Kernels:            Yes
ECC Enabled:                   Yes
Memory Clock Rate:             1566 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 786432 bytes
Max Threads Per SMP:           1536
Async Engines:                 2
Unified Addressing:            Yes
Initialization time:           1555055 microseconds
Current free memory:           5570027520
Upload time (4MB):             1427 microseconds ( 819 ms pinned)
Download time:                 1142 microseconds ( 947 ms pinned)
Upload bandwidth:              2939 MB/sec (5121 MB/sec pinned)
Download bandwidth:            3672 MB/sec (4429 MB/sec pinned)
PGI Compiler Option:           -ta=nvidia,cc20