
Running N single-processor jobs on a compute node with N (or more) cores

Below is an example batch script that runs 4 single-processor programs concurrently:

#!/bin/bash -l
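# Run four single-processor programs concurrently in the background (note the trailing &)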
prog1 < myinput1 > myoutput1 &
prog2 < myinput2 > myoutput2 &
prog3 < myinput3 > myoutput3 &
prog4 < myinput4 > myoutput4 &
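# Wait for all background processes to finish before the job exits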
wait

When you submit your job to the queue, you should request the matching number of processors:

scc1$ qsub -pe omp 4  myscript

You can run up to N jobs this way, where N is the number of requested processors (see the accepted values of N for the omp Parallel Environment (PE) in Table 2 in the section Parallel Environment (PE) resources and time limits below).

Running shared-memory multithreaded batch jobs

Multithreaded jobs should, in general, be submitted to the shared-memory queue using the omp (or smp) PE. This category includes any job that uses multiple processors on a single node, such as MATLAB, pthreads, Stata, and OpenMP applications.

scc1$ qsub -pe omp 4 -b y a.out

The PE command-line option (i.e., -pe omp 4, or equivalently, -pe smp 4) requests resources from the batch scheduler; you are still responsible for making sure that the proper number of threads is set for the underlying parallel paradigm.
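For example, the thread count can be tied to the number of requested slots through the $NSLOTS variable, which the scheduler sets to the value given with -pe. A minimal batch-script sketch, assuming a multithreaded executable named my_threaded_app that accepts a hypothetical --threads option:

#!/bin/bash -l
#$ -pe omp 4
# $NSLOTS is set by the scheduler to the number of slots requested with -pe (4 here).
# The --threads option is hypothetical; use whatever mechanism your application provides.
./my_threaded_app --threads $NSLOTS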

Running an OpenMP program

Use the omp PE to run OpenMP applications. There are three ways to set the number of threads used by an OpenMP application (a batch-script sketch follows the list):

  1. The number of threads is set by the function omp_set_num_threads in the source code and then the executable is submitted with the qsub command requesting the matching number of threads:
    scc1$ qsub -pe omp 4 -b y a.out
    
  2. The environment variable OMP_NUM_THREADS is set prior to the job submission and then passed to the qsub command using the -V option:
    scc1$ export OMP_NUM_THREADS=4
    scc1$ qsub -pe omp 4 -V -b y a.out
    
  3. The environment variable OMP_NUM_THREADS is passed through the qsub command:
    scc1$ qsub -pe omp 4 -v OMP_NUM_THREADS=4 -b y a.out
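
Any of the three approaches above can also be written as a self-contained batch script. A minimal sketch, assuming the OpenMP executable is a.out:

#!/bin/bash -l
#$ -pe omp 4
# Match the OpenMP thread count to the number of slots granted by -pe
export OMP_NUM_THREADS=$NSLOTS
./a.out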
    

Running an MPI program

MPI jobs should be submitted with the PE option set to request the number of processors needed for the job. The following is an abbreviated example batch script for an MPI job submission:

#!/bin/bash -l
#
#$ -pe mpi_16_tasks_per_node 32
#
# Invoke mpirun.
# SGE sets $NSLOTS as the total number of processors (32 for this example) 
#
mpirun -np $NSLOTS ./mpi_program arg1 arg2 ...
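
The script can then be submitted without repeating the PE option on the command line, since qsub reads the #$ directives embedded in the script (the script name is illustrative):

scc1$ qsub my_mpi_job.sh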

See the programming page for information on how to compile MPI programs.

Parallel Environment (PE) resources and time limits

Table 2. The -pe parallel environment.
Parallel environment | Purpose | Allocation rule | Accepted values of N | Maximum runtime
omp (or smp) | Multiple processors on a single node | All N requested processors on a single node (node may be shared with other jobs) | 1, 2, 3, …, 28; 36 | 720 hrs
mpi_64_tasks_per_node | MPI | Whole 64-processor node(s) | 128, 192, …, 1024 | 120 hrs
mpi_28_tasks_per_node | MPI | Whole 28-processor node(s) | 28, 56, 84, …, 448 | 120 hrs (if N>=56), 720 hrs (if N=28)
mpi_16_tasks_per_node | MPI | Whole 16-processor node(s) | 16, 32, 48, …, 256 | 120 hrs (if N>=32), 720 hrs (if N=16)
    • The omp PE is primarily intended for jobs that use multiple processors on a single node. The value of N can be set to any number between 1 and 28, or to 36; use N=36 to request a very large-memory (1024 GB) node. To make the best use of available resources on the SCC, the optimal choices are N=1, 4, 8, 16, 28, or 36.
    • The mpi_64_tasks_per_node PE can be used when N is a multiple of 64; it allocates whole 64-processor nodes. For jobs sensitive to memory availability, this PE guarantees the maximum memory promised for each assigned node. In addition, because intra-node communication is usually more efficient than inter-node communication, this PE might provide better overall performance. The maximum N is 1024 (that is, 16 nodes), and there is a minimum of 2 nodes (N=128) for this PE. The maximum runtime is 120 hours.
    • The mpi_28_tasks_per_node PE can be used when N is a multiple of 28; it allocates whole 28-processor nodes. For jobs sensitive to memory availability, this PE guarantees the maximum memory promised for each assigned node. In addition, because intra-node communication is usually more efficient than inter-node communication, this PE might provide better overall performance. The maximum N is 448 (that is, 16 nodes). The maximum runtime is 120 hours for multiple nodes (N>=56) and 720 hours for a single node (N=28). (An example submission follows this list.)
    • The mpi_16_tasks_per_node PE can be used when N is a multiple of 16; it allocates whole 16-processor nodes. For jobs sensitive to memory availability, this PE guarantees the maximum memory promised for each assigned node. In addition, because intra-node communication is usually more efficient than inter-node communication, this PE might provide better overall performance. The maximum N is 256 (that is, 16 nodes). The maximum runtime is 120 hours for multiple nodes (N>=32) and 720 hours for a single node (N=16).
    • If your application can run on multiple nodes but doesn't use MPI, you will need a specialized PE. Send email to help@scc.bu.edu and we'll create an appropriate PE for you.
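
For example, to request two whole 28-processor nodes (56 MPI tasks) with the mpi_28_tasks_per_node PE (the script name is illustrative):

scc1$ qsub -pe mpi_28_tasks_per_node 56 my_mpi_job.sh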

Running GPU jobs

Access to GPU-enabled nodes is via the batch system (qsub/qsh/qrsh/qlogin). The GPU-enabled nodes support all of the standard batch options in addition to the GPU-specific options.
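
As a rough illustration only, a hedged sketch of a GPU batch script; the -l gpus resource request and the meaning of its value are assumptions not taken from this page and should be confirmed against the SCC GPU documentation, and my_gpu_program is a placeholder executable name:

#!/bin/bash -l
# Assumed GPU request: the exact option name and the meaning of its value are site-specific
#$ -l gpus=1
#$ -pe omp 1
./my_gpu_program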