Getting started with MPI
Note: the publicly accessible MPI cluster on the ENG-Grid was retired in 2018.
Running the job script
Now, we can run any MPI program we choose on any queue that has the “mpi” parallel environment configured, and Grid Engine will dynamically allocate hosts to the PE. There are many ways to invoke this and many options that you can pass, but here’s a simple example:
bungee:/mnt/nokrb/username/MPI$ qsub -q bungee.q mpi-example.sh
Where mpi-example.sh is this:
#$ -cwd
#$ -pe mpi 4
hostname
date
mpirun -np $NSLOTS hostname
If you haven’t gone through the first-time setup instructions for the grid, review those first, particularly the Full Instructions section.
Parallel Environment Choices
Note that we specified the parallel environment (PE) “mpi” with 4 slots. There are two useful PEs: mpi and openmpi. The mpi parallel environment uses the allocation rule ‘fill_up’. This rule fills all available slots on one host before moving on to the next host, until the requested number of slots has been allocated. If your job requires a significant amount of communication between its processes, this environment may be advantageous, because it keeps the processes on as few hosts as possible. The openmpi environment uses the ‘round_robin’ allocation rule, which hands out one slot per host in turn, and is ideal for loosely coupled MPI jobs (where inter-node communication is at a minimum).
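To check how a PE actually distributed your slots, a job can inspect the host file that Grid Engine writes for it. The sketch below assumes the usual $PE_HOSTFILE layout (one line per host: host name, slot count, queue, processor range). The function name summarize_pe_hostfile and the sample lines are made up for illustration, and the sample shows what a 4-slot round_robin allocation might look like when four hosts each have a free slot.

```shell
#!/bin/sh
# Summarize slot allocation from a Grid Engine PE host file.
# Assumption: each line looks like "<host> <slots> <queue> <processor-range>".
# Reads the file named as the first argument, or standard input if none is given.
summarize_pe_hostfile() {
    awk '{ printf "%s: %d slot(s)\n", $1, $2; total += $2 }
         END { printf "total: %d slot(s) on %d host(s)\n", total, NR }' "$@"
}

# Inside a running job, summarize the real allocation.
if [ -n "${PE_HOSTFILE:-}" ] && [ -r "$PE_HOSTFILE" ]; then
    summarize_pe_hostfile "$PE_HOSTFILE"
fi

# Illustrative sample only (hypothetical allocation, not real grid output).
summarize_pe_hostfile <<'EOF'
bungee01 1 bungee.q@bungee01 UNDEFINED
bungee02 1 bungee.q@bungee02 UNDEFINED
bungee03 1 bungee.q@bungee03 UNDEFINED
bungee04 1 bungee.q@bungee04 UNDEFINED
EOF
```

Under fill_up, the same 4-slot request would instead tend to show a single host line with 4 slots, provided one host has that many free.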
Ethernet vs. InfiniBand
On the ENG-Grid, some of the queues have InfiniBand (currently only bungee.q) but others (such as budge.q) don’t, and will default to the vmnet (virtual memory networking) interface instead of falling back to ethernet. To tell openmpi explicitly NOT to use the vmnet interfaces, add the “--mca btl_tcp_if_include eth0” switch to the mpirun command within your qsub script, as below:
mpirun --mca btl_tcp_if_include eth0 ...
GPU warning
If you are using GPUs in your MPI job, note that the “-l gpu=#” complex is allocated slotwise, not jobwise! So if you specify “-pe mpi 8 -l gpu=1” in your job, the system will allocate one GPU per CPU slot, for a total of 8 CPUs and 8 GPUs for the job. This makes things tricky if you wish to allocate more GPUs than CPU slots in an MPI job. A newer version of Grid Engine, to be installed soon, will allow jobwise GPU allocation.
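The slotwise arithmetic can be sketched in a few lines of shell. The helper name total_gpus is hypothetical. It simply multiplies the slot count from “-pe” by the per-slot value given to “-l gpu=#”, following the rule described above.

```shell
#!/bin/sh
# Total GPUs granted when "-l gpu=N" is applied once per slot.
# total_gpus <slots> <gpus_per_slot>   (hypothetical helper, for illustration)
total_gpus() {
    slots=$1
    gpus_per_slot=$2
    echo $(( slots * gpus_per_slot ))
}

total_gpus 8 1   # "-pe mpi 8 -l gpu=1" -> prints 8
total_gpus 4 2   # "-pe mpi 4 -l gpu=2" -> prints 8
```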
Monitoring the job
Use “qstat” (or qmon) to watch the job wait and then run. Once it has finished, you should have several files in your output directory, including a .o “output” file that looks something like this:
bungee:/mnt/nokrb/username/MPI$ more mpi-example.sh.o1886509
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
bungee03
Wed Apr 27 04:35:25 EDT 2011
bungee03
bungee01
bungee04
bungee02
The “no job control” warning is harmless and can be ignored. Notice that the part of our script that printed the hostname and then the date ran on the node that Grid Engine designated as the MPI master, and the four hostnames after that were printed by mpirun’s invocation of “hostname”, once in each of the 4 allocated slots.
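The other files in the output directory follow the same naming pattern as the .o file. As a rough sketch, assuming Grid Engine's default naming (no -N, -o, -e or -j options in the job script), the helper below lists the names to expect for a given script name and job ID. The function name job_output_files is made up for illustration.

```shell
#!/bin/sh
# List the default output file names for a parallel Grid Engine job.
# Assumption: default naming, i.e. <job name>.<suffix><job id>, where
#   o  = job standard output      e  = job standard error
#   po = PE start/stop output     pe = PE start/stop error
# job_output_files <job_name> <job_id>   (hypothetical helper)
job_output_files() {
    job_name=$1
    job_id=$2
    for suffix in o e po pe; do
        echo "${job_name}.${suffix}${job_id}"
    done
}

job_output_files mpi-example.sh 1886509
```

If the job produced nothing on a given stream, the corresponding file may simply be empty.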
See Also
- Using R with MPI on the Grid
- Using Lumerical with MPI on the Grid
- Using Python with MPI on the Grid
- Using the threaded parallel environment instead of MPI
References
- RHEL6 Technical Notes: mpi-selector deprecated
- Read the FAQ: Running MPI jobs page for details about how to run MPI jobs in general.