{"id":137980,"date":"2021-12-03T15:43:22","date_gmt":"2021-12-03T20:43:22","guid":{"rendered":"http:\/\/www.bu.edu\/tech\/?page_id=137980"},"modified":"2025-11-05T12:12:41","modified_gmt":"2025-11-05T17:12:41","slug":"parallel-batch","status":"publish","type":"page","link":"https:\/\/www.bu.edu\/tech\/support\/research\/system-usage\/running-jobs\/parallel-batch\/","title":{"rendered":"Running Parallel Batch Jobs"},"content":{"rendered":"<h2>Content<\/h2>\n<ul>\n<li><a href=\"#single\">Running <i>N<\/i> single-processor jobs on a compute node with <i>N<\/i> (or more) cores<\/a><\/li>\n<li><a href=\"#mthread\">Running shared-memory multithreaded batch jobs<\/a><\/li>\n<li><a href=\"#openmp\">Running an OpenMP program<\/a><\/li>\n<li><a href=\"#mpi\">Running an MPI program<\/a><\/li>\n<li><a href=\"#pe\">Parallel Environment resources and time limits<\/a><\/li>\n<li><a href=\"#gpu\">Running GPU jobs<\/a><\/li>\n<\/ul>\n<h2 style=\"margin-bottom: 1.em; margin-top: 2.5em;\"><a name=\"single\" id=\"single\"><\/a>Running <i>N<\/i> single-processor jobs on a compute node with <i>N<\/i> (or more) cores<\/h2>\n<p>Below is an example batch script which runs 4 programs:<\/p>\n<pre class=\"code-block\"><code>#!\/bin\/bash -l\r\nprog1 &lt; myinput1 &gt; myoutput1 &amp;\r\nprog2 &lt; myinput2 &gt; myoutput2 &amp;\r\nprog3 &lt; myinput3 &gt; myoutput3 &amp;\r\nprog4 &lt; myinput4 &gt; myoutput4 &amp;\r\n<strong>wait<\/strong><\/code><\/pre>\n<p>When you submit your job to the queue, you should request the matching number of processors:<\/p>\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1$<\/span> <span class=\"command\">qsub -pe omp 4<span class=\"placeholder\">  myscript\r\n<\/span><\/span><\/code><\/pre>\n<p>You can run up to N jobs, where N is the number of requested processors (please see accepted values of N for <strong>omp<\/strong> Parallel Environment (PE) in the table below in the section <a href=\"#pe\">Parallel environment resources and time 
limits<\/a>).<\/p>\n<h2 style=\"margin-bottom: 1.em; margin-top: 2.5em;\"><a name=\"mthread\" id=\"mthread\"><\/a>Running shared-memory multithreaded batch jobs<\/h2>\n<p>Multithreaded jobs should generally be submitted to the shared-memory queue using the <strong>omp<\/strong> (or <strong>smp<\/strong>) PE. This category includes any job that uses multiple processors on a single node, such as MATLAB, pthreads, Stata, and OpenMP applications.<\/p>\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1$<\/span> <span class=\"command\">qsub -pe omp 4 -b y a.out\r\n<\/span><\/code><\/pre>\n<p>The PE command line option (<i>i.e.,<\/i> <code>-pe omp 4<\/code> or, equivalently, <code>-pe smp 4<\/code>) only requests resources from the batch scheduler; you are still responsible for making sure that the proper number of threads is specified for the underlying parallel paradigm.<\/p>\n<h2 style=\"margin-bottom: 1.em; margin-top: 2.5em;\"><a name=\"openmp\" id=\"openmp\"><\/a>Running an OpenMP program<\/h2>\n<p>Use the <strong>omp<\/strong> PE to run OpenMP applications. There are several ways to define the number of processors required by an OpenMP application:<\/p>\n<ol>\n<li>The number of threads is set by the function <code>omp_set_num_threads<\/code> in the source code and then the executable is submitted with the <code><span class=\"command\">qsub<\/span><\/code> command requesting the matching number of threads:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1$<\/span> <span class=\"command\">qsub -pe omp 4 -b y a.out\r\n<\/span><\/code><\/pre>\n<\/li>\n<li>The most convenient way is to set the <code>OMP_NUM_THREADS<\/code> environment variable inside a job script. 
The number of requested cores for a job is stored in the environment variable <code>NSLOTS<\/code>, so within a job script these can be used together:\n<pre class=\"code-block\"><code>#!\/bin\/bash -l\r\n#$ -pe omp 8\r\nexport OMP_NUM_THREADS=$NSLOTS\r\nyour_prog ...args...<\/code><\/pre>\n<\/li>\n<li>The environment variable <code>OMP_NUM_THREADS<\/code> is set prior to the job submission and then passed to the <code><span class=\"command\">qsub<\/span><\/code> command using the <code>-V<\/code> option:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1$<\/span> export<span class=\"command\"> OMP_NUM_THREADS=4<\/span>\r\n<span class=\"prompt\">scc1$<\/span> <span class=\"command\">qsub -pe omp 4 -V -b y a.out\r\n<\/span><\/code><\/pre>\n<\/li>\n<li>The environment variable <code>OMP_NUM_THREADS<\/code> is passed through the <code><span class=\"command\">qsub<\/span><\/code> command:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1$<\/span> <span class=\"command\">qsub -pe omp 4 -v OMP_NUM_THREADS=4 -b y a.out\r\n<\/span><\/code><\/pre>\n<\/li>\n<\/ol>\n<h2 style=\"margin-bottom: 1.em; margin-top: 2.5em;\"><a name=\"mpi\" id=\"mpi\"><\/a>Running an MPI program<\/h2>\n<p>MPI jobs should be submitted with the PE option appropriately set to request the desired number of processors needed for the job. 
The following is an example of an abbreviated batch script for an MPI job submission:<\/p>\n<pre class=\"code-block\"><code>#!\/bin\/bash -l\r\n<span class=\"comment\">#<\/span>\r\n#$ -pe mpi_28_tasks_per_node 56\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"comment\"># Invoke mpirun.<\/span>\r\n<span class=\"comment\"># SGE sets $NSLOTS as the total number of processors (56 for this example)<\/span>\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"command\">module load openmpi\/4.1.5<\/span>\r\n<span class=\"command\">mpirun -np $NSLOTS .\/mpi_program arg1 arg2 ...<\/span>\r\n<\/code><\/pre>\n<p>See the <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/#MPI\/\">programming<\/a> page for information on how to compile MPI programs.<\/p>\n<div class=\"bu_collapsible_container \" aria-live=\"polite\" data-customize-animation=\"false\"><h4 class=\"bu_collapsible\" aria-expanded=\"false\"tabindex=\"0\" role=\"button\">Full version of a sample MPI script<\/h4><div class=\"bu_collapsible_section\" style=\"display: none;\"><\/p>\n<pre class=\"code-block\"><code>#!\/bin\/bash -l\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"comment\"># Sample SGE script for running mpi jobs on Boston University's SCC<\/span>\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"comment\"># How to use this script: qsub mpi_batch_script<\/span>\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"comment\"># Note: A line of the form \"#$ qsub_option\" is interpreted<\/span>\r\n<span class=\"comment\">#       by qsub as if \"qsub_option\" was passed to qsub on<\/span>\r\n<span class=\"comment\">#       the command line.<\/span>\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"comment\"># Set the hard runtime (aka wallclock) limit for this job,<\/span>\r\n<span class=\"comment\"># default is 12 hours. 
Format: -l h_rt=HH:MM:SS<\/span>\r\n#$ -l h_rt=24:00:00\r\n<span class=\"comment\">#<\/span>\r\n<span class=\"comment\"># Invoke the mpi Parallel Environment for N processors.<\/span>\r\n<span class=\"comment\"># There is no default value for N, it must be specified.<\/span>\r\n<span class=\"comment\"># -pe parallel-environment N<\/span>\r\n#$ -pe mpi_28_tasks_per_node 56\r\n\r\n<span class=\"comment\"># Merge stderr into the stdout file, to reduce clutter.<\/span>\r\n#$ -j y\r\n\r\n<span class=\"comment\"># Have the system send you mail when your job is aborted or ends<\/span>\r\n#$ -m ae\r\n\r\n<span class=\"comment\">## end of qsub options<\/span>\r\n\r\n# openmpi is the standard MPI library\r\nmodule load openmpi\/4.1.5\r\n<span class=\"comment\"># By default, the script is executed in the directory from which<\/span>\r\n<span class=\"comment\"># it was submitted with qsub. You can change directory ...<\/span>\r\n<span class=\"comment\"># cd somewhere<\/span>\r\n\r\n<span class=\"comment\"># The NSLOTS variable is set by SGE to the number of processors requested<\/span>\r\n<span class=\"comment\"># with the \"-pe\" option. Use it with mpirun to avoid inconsistency<\/span>\r\n\r\n<span class=\"comment\"># Most common usage<\/span>\r\nmpirun -np $NSLOTS .\/mpi_program\r\n\r\n<span class=\"comment\"># Use the following if your executable requires input arguments<\/span>\r\n<span class=\"comment\">#mpirun -np $NSLOTS .\/mpi_program arg1 arg2 ...<\/span>\r\n\r\n<span class=\"comment\"># You can use fewer cores if needed, for example to run 8 \r\n# processes on 2 28-core nodes use the ppr \"process per resource\"\r\n# to run 4 tasks per node. 
Each compute node has dual CPU sockets\r\n# so run 2 tasks per socket: <\/span>\r\n<span class=\"comment\"># -pe mpi_28_tasks_per_node 56 \r\n# 2 nodes, 4 procs per node, 8 total.<\/span>\r\n<span class=\"comment\"># mpirun --map-by ppr:2:socket .\/mpi_program<\/span>\r\n<\/code><\/pre>\n<p><\/div>\n<\/div>\n\n<h2 style=\"margin-bottom: 1.em; margin-top: 2.5em;\"><a name=\"pe\" id=\"pe\"><\/a>Parallel Environment (PE) resources and time limits<\/h2>\n<table class=\"styled_table\" cellpadding=\"3\" border=\"1\">\n<caption>Table 2. The <i>-pe<\/i> parallel environment.<\/caption>\n<tbody>\n<tr>\n<th>parallel-environment<\/th>\n<th>Purpose<\/th>\n<th>Allocation Rule<\/th>\n<th>Values of <em>N<\/em><\/th>\n<th>Maximum runtime<\/th>\n<\/tr>\n<tr>\n<td>omp (or smp)<\/td>\n<td>Multiple processors on a single node<\/td>\n<td>All <em>N<\/em> requested processors on a single node<br \/>\n(node may be shared with other jobs)<\/td>\n<td>1, 2, 3, &#8230;, 28; 36<\/td>\n<td>720 hrs<\/td>\n<\/tr>\n<tr>\n<td>mpi_64_tasks_per_node<\/td>\n<td>MPI<\/td>\n<td>Whole 64-processor node(s)<\/td>\n<td><strong>128<\/strong>, &#8230;, 1024<\/td>\n<td>120 hrs<\/td>\n<\/tr>\n<tr>\n<td>mpi_28_tasks_per_node<\/td>\n<td>MPI<\/td>\n<td>Whole 28-processor node(s)<\/td>\n<td>28, 56, 84, &#8230;, 448<\/td>\n<td>120 hrs<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<ul>\n<li style=\"list-style-type: none;\">\n<ul>\n<li>The <b>omp<\/b> PE is primarily intended for jobs using multiple processors on a single node. The value of N can be set to any integer from 1 to 28, or to 36. Use N=36 to request a very large-memory (1024 GB) node. To make best use of available resources on the SCC, the optimal choices are N=1, 4, 8, 16, 28, 32, or 36.<\/li>\n<li>The <b>mpi_64_tasks_per_node<\/b> PE can be used for N as a multiple of 64. This leads to allocations of whole 64-processor nodes. 
For jobs sensitive to memory availability, this PE guarantees the maximum memory promised for each assigned node. In addition, because intra-node communication is usually more efficient than inter-node communication, this PE might provide better overall performance. The maximum N is 1024 (that is, 16 nodes). The maximum runtime is 120 hours for multiple nodes. Note that this PE requires a minimum of 2 nodes (N=128).<\/li>\n<li>The <b>mpi_28_tasks_per_node<\/b> PE can be used for N as a multiple of 28. This leads to allocations of whole 28-processor nodes. For jobs sensitive to memory availability, this PE guarantees the maximum memory promised for each assigned node. In addition, because intra-node communication is usually more efficient than inter-node communication, this PE might provide better overall performance. The maximum N is 448 (that is, 16 nodes). The maximum runtime is 120 hours for multiple nodes (N&gt;=56), while it is 720 hours for a single node (N=28).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<ul>\n<li>If your application can run on multiple nodes but doesn&#8217;t use MPI, you will need a specialized PE. Send email to <a href=\"mailto:help@scc.bu.edu\">help@scc.bu.edu<\/a> and we&#8217;ll create an appropriate PE for you.<\/li>\n<\/ul>\n<h2 style=\"margin-bottom: 1.em; margin-top: 2.5em;\"><a name=\"gpu\" id=\"gpu\"><\/a>Running GPU jobs<\/h2>\n<p>Access to GPU enabled nodes is via the batch system (qsub\/qsh\/qrsh\/qlogin). 
The GPU enabled nodes support all of the standard batch options in addition to the <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/gpu-computing\/#RUNNINGONGPUS\">GPU specific options<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Content Running N single-processor jobs on a compute node with N (or more) cores Running shared-memory multithreaded batch jobs Running an OpenMP program Running an MPI program Parallel Environment resources and time limits Running GPU jobs Running N single-processor jobs on a compute node with N (or more) cores Below is an example batch script&#8230;<\/p>\n","protected":false},"author":1692,"featured_media":0,"parent":137962,"menu_order":9,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/137980"}],"collection":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/users\/1692"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/comments?post=137980"}],"version-history":[{"count":16,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/137980\/revisions"}],"predecessor-version":[{"id":160257,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/137980\/revisions\/160257"}],"up":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/137962"}],"wp:attachment":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/media?parent=137980"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}