Below you’ll find tips we frequently suggest for optimizing “slow” code. They are organized into an Easy list and an Advanced list, though everything here is fairly approachable. These lists are not exhaustive, so if the advice below does not help, please contact us.

Easy Options

Use Compiler Optimizations

Compiler flags guide the compiler to better optimization.

If you are working with a compiled language, e.g. C, C++, or Fortran, there are several optimization flags you can pass to the compiler as you build your program. The compiler not only translates your application into a machine-readable format, it also rearranges your instructions in an attempt to automatically make your application faster. Optimization flags guide this process. Our compilers page documents these options for the various compilers we support: how to control the optimization process for each compiler, how to set these flags during the build process (e.g. which environment variables affect make), and what the trade-offs are for using certain flags.
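
For example, with the GNU compilers a moderate level of optimization is typically requested with the -O2 flag. Treat the commands below as a generic illustration (the program name is a placeholder); the recommended flags for our systems are documented on the compilers page.

scc1% gcc -O2 -o myprog myprog.c
scc1% gfortran -O2 -o myprog myprog.f90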

Write to Files Efficiently

Write locally, avoid writing thousands of small files, and avoid repeatedly opening and closing files, and your program will run faster*.

*This advice is always good to follow, but the actual performance gains depend on what you’re doing.

First, if you’re going to write to a file, organize your program so that you open your file once and close it at the end. There is overhead in opening and closing a file, and it really adds up if you are doing this each time you want to save information.

Next, there are two important things to know about our file system. First, most of the storage, including all types of Project Disk Space and the home directories, lives on a high-performance file system that each compute node accesses over the network. While this is a fast network, it is usually faster to write files to a local drive first (it is never slower, though sometimes it is essentially a tie). Each compute node has its own local storage area called /scratch. Programs can often benefit greatly from writing to /scratch while running; the user can then copy the files to Project Disk Space after the run completes.

The second point concerns how many files to write: it is more efficient to write one large file than many thousands of small files. If you have questions about this, please email us.

The environment variable $TMPDIR is set each time you run a job. It represents the name of a personal, customized scratch directory that will only last as long as your job does. This is the best option when your local files are temporary, because you know that this directory is empty and you can write to it. You can store your results locally, and copy things to your Project Disk Space at the very end of your script.
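
As a minimal sketch combining both ideas (open the output file once, and put it in the job’s local $TMPDIR rather than on networked storage), a C program might look like the following; the file name results.txt and the loop body are placeholders:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Build a path inside the job's local scratch directory ($TMPDIR). */
    const char *tmpdir = getenv("TMPDIR");
    if (tmpdir == NULL)
        tmpdir = ".";  /* fall back to the current directory if unset */

    char path[4096];
    snprintf(path, sizeof(path), "%s/results.txt", tmpdir);

    /* Open the file once, before the main loop. */
    FILE *out = fopen(path, "w");
    if (out == NULL) {
        perror("fopen");
        return 1;
    }

    for (int i = 0; i < 1000000; i++) {
        double result = i * 0.5;  /* placeholder for real work */
        fprintf(out, "%d %f\n", i, result);
    }

    /* Close the file once, at the end. */
    fclose(out);
    return 0;
}

At the end of your job script you can then copy $TMPDIR/results.txt to your Project Disk Space.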

Local Scratch Space

The scratch disks are available as temporary storage space. They are open to all SCC users and there is no preset quota on their use. If you abuse the fair-share nature of /scratch, your files may be deleted. Files stored on any scratch disk are NOT BACKED UP, and they may be kept in /scratch for at most 31 days; files not removed by the owner will be deleted by the system after that time.

All of the SCC nodes have their own /scratch disk. You can access a specific scratch disk via the pathname /net/scc-xx#/scratch, where xx is a two-letter string such as “ab” and # is a number such as “5” or “13”. See Technical Summary for the list of node names. For example,

scc1% cd /net/scc-ab5/scratch
  1. The SCC login and compute nodes each have a significant amount (427+ GB) of scratch disk space; the exact size may vary by node.
  2. On a compute node, a reference to /scratch points to that node’s own local /scratch at runtime.
  3. Similarly, if you are on the login node, type “cd /scratch” to access its local scratch disk.
  4. You can access the login node’s scratch from any compute node with /net/scc/scratch.
  5. If you’d like to use scratch space in a batch job, please use the scratch space of the compute nodes assigned to the job. (See Item 2 above.)
When does it matter to write locally?
  1. You are compressing your results (e.g. using gzip). It is faster to compress locally and then move the final compressed file to networked storage than to write the uncompressed results to the networked space and compress them there.
  2. You are writing temporary files that can be deleted once the work completes. (A sketch of a job script that follows both patterns appears after this list.)
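
Here is that sketch; the program name and the destination directory are placeholders that you would replace with your own paths:

#!/bin/bash
# Work inside the job's private local scratch directory.
cd "$TMPDIR"

# Run the program; it writes its results to the local disk.
/path/to/myprog > results.txt

# Compress locally, then copy only the final file to networked storage.
gzip results.txt
cp results.txt.gz /path/to/your/project/space/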

Access Arrays Efficiently

When accessing arrays in a loop, keep in mind how the language actually stores multidimensional arrays.

Multidimensional arrays are always stored internally as one-dimensional arrays. For some languages, e.g. Fortran, R, and Matlab, you want to loop through the first index in the innermost loop; this is called column-major or Fortran-order array access. For other languages, e.g. C and C++, you want to loop through the last index in the innermost loop; correspondingly, this is called row-major or C-order array access. Let’s look at an example.

Consider a one-dimensional array that you are looping through; this is like a conveyor belt of information getting ready to be processed. It is most efficient to process the values as they come:

(Figure: the conveyor belt analogy — values are processed in the order they are stored.)

For row-major order (used by C and C++), a 2-row by 3-column array

    [1 2 3 
     4 5 6]

is really stored as a one dimensional array:

    [1 2 3 4 5 6]

and it is faster to loop through the last index in the innermost loop, like so:

for (row = 0; row < 2; row++) {
    for (col = 0; col < 3; col++) {
        printf("%d \n", myarray[row][col]);
    }
}

This way, you access the array as if it were a one-dimensional array. If instead you loop through the rows in the innermost loop, you will jump around the underlying one-dimensional array in an inefficient manner:

(Figure: the conveyor belt analogy — out-of-order access jumps back and forth along the belt.)

This will not matter for very small arrays, but for the large arrays commonly used in research computing, accounting for the underlying storage leads to better performance.

For column-major order (used by Fortran, R and Matlab), a 2-row by 3-column array

    [1 2 3 
     4 5 6]

is really stored as a one dimensional array:

    [1 4 2 5 3 6]

and it is faster to loop through the first index in the innermost loop:

do col = 1, 3
    do row = 1, 2
        print *, myarray(row, col)
    end do
end do

These rules hold for three or more dimensions. In short, it is best to access the array in the same order it is stored internally, and that order is language-dependent.

Use External Libraries

Use an external library for access to already-optimized code.

For each language, there are many third-party libraries that save you time by providing optimized and tested code. We recommend that researchers start with these, and develop their own code only once they find they need an approach that is not already implemented. In areas such as parallelization, efficient I/O (e.g. file access), and numerical methods, it is much faster to use code that is already available. For example, FFTW is a library that performs Fast Fourier Transforms. It is used throughout research computing, so it has been well tested and there are abundant help resources. It also provides both serial (single-processor) and parallel (cluster) versions, making it relatively easy to update your code to use parallel resources. In short, using this library saves both development and execution time.
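
As a rough illustration of how little code such a library requires, a serial one-dimensional transform using FFTW’s C interface looks something like this (link with -lfftw3; the array size N is arbitrary):

#include <fftw3.h>

#define N 1024

int main(void) {
    /* FFTW provides its own aligned allocation routines. */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * N);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * N);

    /* Create a plan once, then execute it (possibly many times). */
    fftw_plan plan = fftw_plan_dft_1d(N, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    for (int i = 0; i < N; i++) {  /* fill the input signal */
        in[i][0] = i;    /* real part */
        in[i][1] = 0.0;  /* imaginary part */
    }

    fftw_execute(plan);  /* out now holds the transform of in */

    fftw_destroy_plan(plan);
    fftw_free(in);
    fftw_free(out);
    return 0;
}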

If you need help finding a library for your research, feel free to contact us, and if you can’t find a particular library on our clusters, please let us know so we can install it.

Advanced Options

Don’t Guess, Just Profile

Profile your code to find what functions and subroutines need to be optimized.

Profiling code means measuring how long the various parts of your program take to run. Profiling your code, either manually or with external tools, provides empirical evidence of which part of your code needs the most work. With this evidence, you can focus your efforts where they actually matter. Often the trade-off for faster code is more complex, harder-to-read code, and that cost should be paid only where it matters. You can email us for help with profiling your code, or you can explore your options at your leisure.
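
A manual profile can be as simple as timing the suspect region with a wall clock. Here is a sketch using the standard clock_gettime call; do_work stands in for whatever routine you suspect is slow:

#include <stdio.h>
#include <time.h>

/* Stand-in for the routine you suspect is slow. */
void do_work(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++)
        x += i * 0.5;
}

int main(void) {
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    do_work();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("do_work took %.3f s\n", seconds);
    return 0;
}

Tools such as gprof automate this: compile and link with -pg, run the program once, and then run gprof on the executable to see how much time each function consumed.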

Parallelize Your Code

Run your code simultaneously on multiple CPU cores, machines or GPU cores to get work done faster.

If your application involves independent tasks, then it may be advantageous to complete them simultaneously using parallelization. If you’re not sure this is the case, please contact us to review your code and talk about your options. There are three main technologies to parallelize your code, and we provide details about programming and running these applications.

OpenMP is a set of compiler directives and a runtime library that generates the code to run your application on many threads. On its own, it can only be used on one machine at a time. Its advantage is that it is invoked through special directives (pragmas in C/C++, special comments in Fortran) and enabled with a compiler flag, so if you compile without that flag your code still builds and runs as an ordinary serial program. With this option, you can theoretically gain a 16x speedup on our systems.
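
As a small sketch, parallelizing a loop can be a single directive. Compiled without the OpenMP flag (for example, -fopenmp for the GNU compilers), the pragma is ignored and the same code runs serially:

#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    /* Split the loop iterations across the available threads;
       the reduction clause safely combines each thread's partial sum. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        sum += a[i];
    }

    printf("sum = %f\n", sum);
    return 0;
}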

The next technology is MPI (the Message Passing Interface). It runs multiple copies of your application, possibly on multiple machines, assigns each copy a unique ID (its rank), and manages the network communication used to send information between the copies. MPI lets you use many more CPU cores because it can communicate across machines. It does require changes to your code, which must then be linked against the MPI library and launched with mpirun (a tool that starts your parallel application). This means you will need a separate, MPI-aware version of your code/application. With this option, you can theoretically gain more than a 16x speedup on our systems.
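
A minimal MPI program, sketched below, shows the pattern: every copy runs the same source code, and the rank distinguishes the copies. It is built with the MPI compiler wrapper (typically mpicc) and started with mpirun, e.g. mpirun -np 32 ./myprog:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this copy's unique ID */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of copies */

    /* Each copy would work on its own share of the problem, for example
       iterations rank, rank + size, rank + 2*size, ... of a larger loop. */
    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}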

Finally, GPUs provide many cores that speed up calculations, especially those involving array manipulations. The most direct way to use them is CUDA, an NVIDIA-specific GPU programming model with C, C++, and Fortran-like versions. Fortunately, there are many third-party libraries built on this technology, allowing you to use GPUs without learning to program CUDA yourself. Just as OpenMP automatically parallelizes applications on CPUs, a technology called OpenACC can automatically parallelize your code for GPUs. This is probably the easiest way to get started with GPU programming, and we have notes on using OpenACC with C/C++ and with Fortran applications.
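
As a taste of the OpenACC approach, offloading a loop is again a single directive. Compiled without OpenACC support, the pragma is ignored and the loop simply runs on the CPU; the array size and the computation are arbitrary placeholders:

#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];

    for (int i = 0; i < N; i++) {  /* initialize on the host (CPU) */
        x[i] = i;
        y[i] = 2.0f * i;
    }

    /* Ask the compiler to run this loop on the GPU, copying x to the
       device and copying y both to and from the device. */
    #pragma acc parallel loop copyin(x) copy(y)
    for (int i = 0; i < N; i++)
        y[i] = 3.0f * x[i] + y[i];

    printf("y[10] = %f\n", y[10]);
    return 0;
}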