The mpirun command is used to execute a job on the Blue Gene. Because we do not allow interactive use of the mpirun command on the login nodes (except by special arrangement), you must submit it as a job to the LoadLeveler batch system. There are two ways to submit batch jobs. The standard method is to set up a job command file (jcf) that specifies the mpirun command, the executable name, and other parameters; this jcf is then submitted to the batch system. A sample jcf is shown below. The alternative method is to run a script that accepts input parameters, such as the executable file name, on the command line; a jcf is then generated from these parameters and submitted to the batch queue.
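A minimal sketch of what such a jcf might look like follows. The LoadLeveler keywords shown are typical for a Blue Gene job, but the mpirun path, file names, and wall clock limit are illustrative assumptions; use our actual sample jcf as your starting point.
#!/bin/bash
# Sketch of a LoadLeveler job command file (values are illustrative)
# @ job_type         = bluegene
# @ executable       = /usr/bin/mpirun
# @ arguments        = -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out -verbose 1
# @ output           = myjob.$(jobid).out
# @ error            = myjob.$(jobid).err
# @ wall_clock_limit = 05:00:00
# @ queue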
To run a Blue Gene executable via the mpirun command, first compile the executable and place it on a file system that the Blue Gene can access (/project, /project2, /projectnb1, /projectnb2, /projectnb3, or any of /usr1-/usr4). Then modify the sample jcf, in particular the “@ arguments” line, to reference your executable and to set appropriate command line arguments for mpirun.
The main command line arguments to mpirun are:
| Argument | Required? | Description |
| --- | --- | --- |
| -np N | Mandatory | N = number of MPI tasks (see section below) |
| -cwd start_dir | Mandatory | start_dir = full pathname of the directory in which the job runs |
| -exe path_to_executable | Mandatory | Full pathname of your executable |
| -verbose 0-4 | Optional | Controls diagnostic output; the default is 0, and 1 is recommended |
| -args “list_of_args” | Optional | “list_of_args” = list of arguments passed to the executable (enclosed in quotes) |
| -mode CO\|VN | Optional | COprocessor (CO, the default) or Virtual Node (VN) mode (see section below) |
| -connect MESH\|TORUS | Optional | Defaults to MESH; N must be a multiple of 512 to use TORUS |
MPI parallel tasks and tasks per node
In the above table, N is defined as the number of MPI parallel tasks, rather than the traditional number of processors, to more accurately reflect the hardware configuration of the Blue Gene nodes. A Blue Gene node consists of 2 processors. In the default COprocessor (CO) mode, one processor is used for computation and the other is dedicated to communication. This results in 1 MPI task per node, and the node’s entire 512 MB of memory is available to that task. In Virtual Node (VN) mode, both processors are used for computation. In this case, there are 2 tasks per node, and the two tasks share the node’s 512 MB of memory. Our Blue Gene has a total of 1024 nodes, so in CO mode (1 task per node) the maximum number of tasks you can request is 1024, and in VN mode (2 tasks per node) the maximum is 2048.
Important note on the number of MPI tasks, N: although you can choose N to be any value up to 1024 (CO) or 2048 (VN), the system will only allocate 32, 128, 512, or 1024 physical nodes to a job. The system allocates the smallest allowed number of physical nodes necessary to run one task per node (or two per node in VN mode), as the sketch below illustrates.
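To make the allocation rule concrete, here is a small shell sketch (purely illustrative, not a system utility) that computes the partition size the scheduler would pick for a given task count and mode:
#!/bin/bash
# Illustrative sketch only: compute the partition size the scheduler
# would allocate for N MPI tasks in a given mode.
N=$1            # requested number of MPI tasks
MODE=${2:-CO}   # CO (default, 1 task per node) or VN (2 tasks per node)

if [ "$MODE" = "VN" ]; then
    NEEDED=$(( (N + 1) / 2 ))   # two tasks share each node; round up
else
    NEEDED=$N                   # one task per node
fi

# The system only allocates partitions of 32, 128, 512, or 1024 nodes
# and picks the smallest one that fits.
for SIZE in 32 128 512 1024; do
    if [ "$NEEDED" -le "$SIZE" ]; then
        echo "Allocated partition: $SIZE nodes"
        exit 0
    fi
done
echo "Request exceeds the machine (maximum 1024 nodes)"
exit 1
With 1000 tasks, this prints 1024 nodes in CO mode and 512 nodes in VN mode, matching Examples 1 and 2 below.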
The following examples demonstrate some typical arguments to mpirun as they would be specified in the jcf, and show how the CO|VN mode affects the number of nodes allocated.
Example 1. How to run executable a.out.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out
In this case, the system allocates 1024 nodes for the job because the job runs under the default CO mode and 1024 is the smallest allowable number of nodes (among 32, 128, 512, 1024) necessary to accommodate the requested 1000 tasks.
Example 2. How to run a.out in virtual node mode.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out -mode VN
In this case, the system allocates 512 nodes for the job because the job runs under the VN mode (2 tasks per node) and 512 is the smallest allowable number of nodes (among 32, 128, 512, 1024) necessary to accommodate the requested 1000 tasks.
Example 3. How to run a.out when it requires a command line input file.
mpirun -np 1000 -cwd /project/xyz/abc -exe /project/xyz/exe/a.out -args "my_input_file"
In this case, mpirun’s -args switch passes the named input file to a.out on its command line. Any other command line arguments that your executable accepts may likewise be passed through the -args switch. Be sure to enclose the entire argument list in double quotes.
For additional mpirun options, enter mpirun -h at the system prompt.
The scheduler implements the following usage limits:
| Limit | During Business Hours* | During Off Hours |
| --- | --- | --- |
| Maximum runtime per job | 5 hours | 5 hours |
| Maximum nodes used per user | 512 | 1024 |
(*Business Hours: 9am – 5pm Eastern Time, Monday – Friday.)
After enforcing the above limits, the scheduler prioritizes the runnable jobs and runs the highest priority job if the necessary resources are available. If the necessary resources are not yet available, the scheduler uses a backfilling strategy to run lower priority, short duration jobs on any available resources, as long as doing so will not delay the start of the highest priority job.
The primary ordering criterion used to prioritize jobs is the amount of recent runtime accumulated by the user; this quantity is displayed by the qstat command under the SYSPRI column as a negative value. Jobs with the same SYSPRI are ordered by submission time. Finally, a user can alter the relative ordering of their own jobs with the llprio command, which modifies the “user priority” displayed in the PRI column of the llq and qstat commands. The qstat command lists the waiting jobs in the scheduling order described above.
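For example, to raise the user priority of one of your own waiting jobs by 10 (the job step ID shown is illustrative):
Lee:~ % llprio +10 lee.1234.0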
This method essentially involves a wrapper script, bglsub, which accepts command line input, such as the number of tasks and the executable name, and uses it to generate a jcf of the kind required by the standard method. The last operation of bglsub is to submit this newly generated jcf to the batch queue. (More …)
Sometimes it is convenient (e.g., during program development or debugging) to execute mpirun directly on the login node rather than through the batch system. This is normally not permitted, but it can be arranged by sending a request to firstname.lastname@example.org. We will allocate a partition of the machine for your exclusive use, and you will then be able to use it by invoking mpirun in the following way:
levi% mpirun -noallocate -partition YOUR_PARTITION ...
YOUR_PARTITION is the name of the partition assigned to you and “…” represents all the other flags you would normally pass to mpirun.
Batch Job Management Commands
(The basic commands are listed below; a combined session sketch follows at the end of this section.)
- To submit a batch job
Lee:~ % llsubmit . . .
- To query the status of batch jobs
Lee:~ % llq . . .
- LoadLeveler’s own command to query the machine status
Lee:~ % llstatus . . .
- To delete a batch job from the system
Lee:~ % llcancel . . .
- To hold or release a submitted job
Lee:~ % llhold . . .
- To change the job priority of a submitted job
Lee:~ % llprio . . .
- To charge a batch job to a project
A batch job is normally charged to the user’s default project. If the user works on a single project, or if the charge should be levied against the default project, no user action is required. On the other hand, users working on multiple projects may, at times, need to charge a batch job to a non-default project. Note that the charging procedure varies among the SCV machines (see FAQ, Project Accounting); please consult the respective machine’s runningjobs webpage for the correct procedure. Described below is the charging procedure for the Blue Gene.
- Charging to the default project
No action is required.
- Charging to a non-default project
Add the following line to your batch script:
# @ group = project_name
- To find out the projects of which you are a member
Lee:~ % groups
my_default_project my_second_project my_third_project . . .
The first project on the list is always the default project; the default can be changed.
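Putting several of these commands together, a typical session might look like the following; the jcf file name and the job step ID are illustrative:
Lee:~ % llsubmit myjob.jcf
Lee:~ % llq -u $USER
Lee:~ % llcancel lee.1234.0
Here llsubmit submits the job command file and reports the job ID, llq -u restricts the queue listing to your own jobs, and llcancel removes the job by the step ID shown in the llq output.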