Content
- Runtime
- CPU architecture and processor type
- Graphics Processing Units (GPUs)
- Network
- Memory
- Scratch Space
Runtime
The amount of time your job takes to run on the compute node(s) is represented by a runtime resource. The default runtime for interactive batch and batch jobs is 12 hours. You can request a longer or shorter runtime by supplying the -l h_rt=hh:mm:ss option to your batch job. Identifying the amount of time your code takes to run and tailoring your job workflow and runtime requests accordingly plays a key role in ensuring that your jobs have access to the greatest amount of SCC resources. For example, almost all of the Buy-In nodes in the SCC will only run shared batch jobs that request 12 hours or less of runtime. By identifying jobs that fit within this runtime window and requesting the appropriate runtime resource, you can greatly increase the resources available to your batch job.
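For example, a minimal sketch of a submission that requests a 4-hour runtime (myjob.qsub is a placeholder script name):
scc% qsub -l h_rt=04:00:00 myjob.qsub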
CPU Architecture and Processor Type
The SCC contains varying types and models of Intel and AMD CPU architectures. The Technical Summary contains more details on the hardware that comprises each node in the cluster. Throughout our documentation we use the word processor to mean what computer hardware vendors call a processor core. The SGE batch system manpages also use the words job slot, or simply slot, to refer to the same concept. The number of processors your job requires depends on how well your code can utilize multiple processors for its computations. Another reason to request multiple processors is to reserve a larger share of a node's memory, which is shared with the other jobs running on that node. You can also request a specific CPU architecture and processor type for your job, although in general this is not necessary. Details on how to request a number of processors using the -pe option, and a specific type/model of processor, can be found on the Submitting your Batch Job page.
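For example, a minimal sketch of a multi-processor request at submission time (assuming the omp parallel environment used elsewhere on this page; myjob.qsub is a placeholder script name):
scc% qsub -pe omp 4 myjob.qsub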
Graphics Processing Units (GPUs)
There are several types of NVIDIA GPU resources available on the SCC shared nodes. To view a short summary of GPU resources on the SCC, execute:
scc% qgpus
gpu_type   total   in_use   available
--------   -----   ------   ---------
A100       5       5        0
A40        6       6        0
A6000      30      26       4
. . .
If you would like more detailed output, add the -v option to the command: qgpus -v.
For more details about GPU computing on the SCC refer to our page on GPU computing.
Network
The SCC contains three types of network architecture. The majority of nodes use a 1 Gigabit network interface for communication with our shared filesystems and other machines. A smaller set of nodes use a 10 Gigabit interface for jobs that require faster network access. For batch jobs that can utilize multiple nodes using MPI (Message Passing Interface), we offer an additional, faster InfiniBand-based network architecture to aid in inter-node communication. Our current InfiniBand configurations support 40Gbps (QDR) and 56Gbps (FDR) speeds.
Memory
Memory options to the batch system are only enforced when the job is dispatched to the node. Once the job has been dispatched, the batch system cannot enforce any limit on the amount of memory the job uses on the node. Therefore, each user is expected to follow “fair share” guidelines when submitting jobs to the cluster.
The memory on each node of the SCC is shared by all the jobs running on that node. Therefore a single-processor job should not use more than the amount of memory available per core (TotalMemory / NumCores, where TotalMemory is the total memory on the node and NumCores is the number of cores). For example, on nodes with 128GB of memory and 16 cores, if the node is fully utilized, a single-processor job is expected to use no more than 8GB of memory. See the Technical Summary for the list of nodes and the memory available on each of them.
If your job requires more memory, you can request an appropriate node with 64, 128, 192, 256, 512, or 1024 Gigabytes of memory. You can also request more slots to reserve a larger amount of memory. For example, if the job needs 64GB of memory, it can ask for a node with 8 slots and 128GB of shared memory using the options -pe omp 8 -l mem_per_core=8G. A job that needs more memory can combine a larger slot count with a larger per-core request, for example -pe omp 16 -l mem_per_core=16G or -pe omp 28 -l mem_per_core=9G.
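The same resource options can be placed directly in a batch script. Below is a minimal sketch of a script header that reserves 8 slots with 8GB of memory per core (64GB total) and a 12-hour runtime; the values and the executable name my_program are placeholders:
#!/bin/bash -l
# Illustrative resource requests: 8 cores, 8GB per core (64GB total), 12-hour runtime
#$ -pe omp 8
#$ -l mem_per_core=8G
#$ -l h_rt=12:00:00
./my_program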
Scratch Space
There is a local scratch directory on each of the SCC nodes. This can be used as an additional (temporary) storage area. The files stored in the scratch directories are not backed up and will be deleted by the system after 31 days. To access a node’s local scratch space, simply refer to /scratch. You can access a specific node’s scratch space from any other node on the system by specifying its full network path, /net/scc-xx#/scratch, where xx is a two-letter string such as ab and # is generally a single-digit number (see the Technical Summary for the full list of node names):
scc% cd /net/scc-fc3/scratch
While the job is running, reading and writing to Project Disk Space is slower than to local storage, so you might want to consider saving intermediate results on storage local to the compute node. When the job is dispatched, a local temporary directory is created and its path is stored in the environment variable $TMPDIR. However, any data written to the $TMPDIR directory will be automatically deleted after the job is finished. The application can also create its own sub-directory in the scratch space and use it for temporary storage. This directory can be accessed for up to 31 days, though users are encouraged to remove any files created in the scratch space when they are no longer needed, so as to free up resources for other users.
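For example, a sketch of a job body that works in $TMPDIR and copies its final output back to project space before the job ends (the program name and the /projectnb/yourproject path are placeholders):
cd $TMPDIR
cp /projectnb/yourproject/input.dat .                     # stage input onto the node-local disk
/projectnb/yourproject/my_program input.dat results.out   # intermediate I/O stays local to the node
cp results.out /projectnb/yourproject/                    # copy results back before $TMPDIR is deleted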