{"id":78109,"date":"2014-05-08T14:55:47","date_gmt":"2014-05-08T18:55:47","guid":{"rendered":"http:\/\/www.bu.edu\/tech\/?page_id=78109"},"modified":"2022-06-07T10:21:13","modified_gmt":"2022-06-07T14:21:13","slug":"tuning-and-best-practices","status":"publish","type":"page","link":"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/tuning-and-best-practices\/","title":{"rendered":"Tuning and Best Practices"},"content":{"rendered":"<p>Below you&#8217;ll find tips we frequently suggest to optimize &#8220;slow&#8221; code. These tips are organized into an Easy list and an Advanced list, though everything here is pretty approachable. These lists are not complete, so if you find that the advice does not help, please contact us.<\/p>\n<h4>Easy Options<\/h4>\n<ul>\n<li><a href=\"#compiler\">Use Compiler Optimizations<\/a><\/li>\n<li><a href=\"#write\">Write Files Efficiently<\/a><\/li>\n<li><a href=\"#arrays\">Access Arrays Efficiently<\/a><\/li>\n<li><a href=\"#libraries\">Use External Libraries<\/a><\/li>\n<\/ul>\n<h4>Advanced Options<\/h4>\n<ul>\n<li><a href=\"#profile\">Don&#8217;t Guess, Just Profile<\/a><\/li>\n<li><a href=\"#parallel\">Parallelize Your Code<\/a><\/li>\n<\/ul>\n<h2>Easy Options<\/h2>\n<h4><a id=\"compiler\"><\/a>Use Compiler Optimizations<\/h4>\n<p><em>Compiler flags guide the compiler to better optimization.<\/em><\/p>\n<p>If you are working with a compiled language, e.g. C, C++, or Fortran, then there are several optimization flags that you can pass to the compiler as you build your program. While building, the compiler not only translates your application into a machine-readable format, it also adjusts your instructions in an attempt to automatically make your application faster. 
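<\/p>\n<p>For example, with the GNU compiler you might simply raise the optimization level when building (an illustrative invocation; flag names and their effects vary by compiler):<\/p>\n<pre class=\"code-block\"><code># -O2 enables a broad set of safe optimizations; -o names the output file\r\ngcc -O2 -o myprog myprog.c<\/code><\/pre>\n<p>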
These compiler flags provide guidance to the compiler to help it optimize this process.\u00a0Our <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/compilers\/\" title=\"Compilers\">compilers<\/a> page documents these options for the various compilers we support, including how to control the optimization process for each compiler, how to set these flags during the build process (e.g. what environment variables you can set to affect make), and what the trade-offs are of using certain flags.<\/p>\n<h4><a id=\"write\"><\/a>Write Files Efficiently<\/h4>\n<p><em>Write locally, avoid writing thousands of files, and avoid repeatedly opening and closing files; your program will run faster*.<\/em><\/p>\n<p>*This advice is always good to follow, but actual performance gains really depend on what you&#8217;re doing.<\/p>\n<p>First, if you&#8217;re going to write to a file, organize your program so that you open the file once and close it at the end. There is overhead in opening and closing a file, and it really adds up if you do this each time you want to save information.<\/p>\n<p>Next, there are two important things to know about our file system. First, most of the storage, including all types of Project Disk Space and home directories, is on a high-performance file system that each compute node accesses through the network. While this involves a fast network, it is <strong>usually<\/strong> faster to first write files to a local drive (it&#8217;s never slower, but sometimes it&#8217;s pretty much a tie). Each compute node has its own local storage area called <code>\/scratch<\/code>. Programs can sometimes benefit greatly from writing to\u00a0<code>\/scratch<\/code>\u00a0while running, and the user can then copy the files to project disk space after the run completes.\u00a0The second point is about how many files to write. 
It is more efficient to write one large file than many (thousands of) small files. If you have questions about this, please email us.<\/p>\n<p>The environment variable <code>$TMPDIR<\/code> is set each time you run a job. It holds the name of a personal, customized scratch directory that lasts only as long as your job does. This is the best option when your local files are temporary, because you <em>know<\/em> that this directory is empty and that you can write to it. You can store your results locally and copy them to your Project Disk Space at the very end of your script.<\/p>\n<h5>Local Scratch Space<\/h5>\n<p>The scratch disks are available as temporary storage space. They are open to all SCF users and there is no preset quota for use. <em>If you abuse the fair-share nature of\u00a0<code>\/scratch<\/code>, your files may be deleted.<\/em>\u00a0Files stored on any scratch disk are NOT BACKED UP, and they can be kept in <code>\/scratch<\/code> for at most 31 days. Files not removed from <code>\/scratch<\/code> by the owner will be deleted by the system after 31 days.<\/p>\n<p>All of the SCC nodes have their own <code>\/scratch<\/code> disk. You can access a specific scratch disk via the pathname <code><span class=\"placeholder\">\/net\/scc-xx#\/<\/span>scratch<\/code>, where\u00a0<em>xx<\/em>\u00a0is a two-letter string such as &#8220;ab&#8221; and\u00a0<em>#<\/em>\u00a0is a number such as &#8220;5&#8221; or &#8220;13&#8221;. See the\u00a0<a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/computing-resources\/tech-summary\/#SCC\">Technical Summary<\/a>\u00a0for the list of node names. 
For example,<\/p>\n<pre id=\"indent\" class=\"code-block\"><code>scc1% cd \/net\/<span class=\"placeholder\">scc-ab5<\/span>\/scratch<\/code><\/pre>\n<ol>\n<li>The SCC login and compute nodes each have significant amounts (427+ GB) of scratch disk space; the specific\u00a0<a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/computing-resources\/tech-summary\/#SCC\">size may vary by node<\/a>.<\/li>\n<li>On a compute node, a reference to\u00a0<code>\/scratch<\/code>\u00a0points to the local node&#8217;s\u00a0<code>\/scratch<\/code>\u00a0at runtime.<\/li>\n<li>Similarly, if you are on the login node, type &#8220;<code><span class=\"command\">cd<\/span> \/scratch<\/code>&#8221; to access that node&#8217;s own scratch.<\/li>\n<li>You can access the login node&#8217;s scratch from any compute node with <code>\/net\/scc\/scratch<\/code>.<\/li>\n<li>If you&#8217;d like to use scratch space in a batch job, please use the\u00a0<a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/system-usage\/running-jobs\/resources-jobs\/local-scratch\/\">scratch space of the compute nodes<\/a>\u00a0assigned to the job. (See Item 2 above.)<\/li>\n<\/ol>\n<h5>When does it matter to write locally?<\/h5>\n<ol>\n<li>You are compressing your results (e.g. using <code><span class=\"command\">gzip<\/span><\/code>). It is faster to compress locally and then move the final compressed file to networked storage than to write the original results to the networked space before compressing them.<\/li>\n<li>You are writing temporary files that can be deleted once the run completes.<\/li>\n<\/ol>\n<h4><a id=\"arrays\"><\/a>Access Arrays Efficiently<\/h4>\n<p><em>When accessing arrays in a loop, keep in mind how the language actually stores multidimensional arrays.<\/em><\/p>\n<p>Multidimensional arrays are always stored internally as one-dimensional arrays. For some languages, e.g. <strong>Fortran, R, Matlab<\/strong>, you want to loop through the first index in the innermost loop. 
This is called <strong>column-major<\/strong> or <strong>Fortran-order<\/strong> array access. For other languages, e.g. <strong>C, C++<\/strong>, you want to loop through the last index in the innermost loop; correspondingly, this is called <strong>row-major<\/strong> or <strong>C-order<\/strong> array access. Let&#8217;s look at an example of this.<\/p>\n<p>Consider a one-dimensional array that you are looping through; this is like a conveyor belt of information getting ready to be processed. It is most efficient to process the values as they come:<\/p>\n<p style=\"text-align: center;\"><a href=\"\/tech\/files\/2014\/05\/array_access_conveyer_belt_good1.png\"><img loading=\"lazy\" src=\"\/tech\/files\/2014\/05\/array_access_conveyer_belt_good1.png\" alt=\"array_access_conveyer_belt_good\" width=\"301\" height=\"237\" class=\"wp-image-78732 aligncenter\" \/><\/a><\/p>\n<p>For\u00a0<strong>row-major<\/strong>\u00a0order (used by <strong>C<\/strong> and <strong>C++<\/strong>), a 2 row by 3 column array<\/p>\n<pre>    [1 2 3 \r\n     4 5 6]<\/pre>\n<p>is really stored as a one-dimensional array:<\/p>\n<pre>    [1 2 3 4 5 6]<\/pre>\n<p>and it is faster to loop through the last index in the innermost loop, like so:<\/p>\n<pre class=\"code-block\"><code>for (row = 0; row &lt; 2; row++) {\r\n    for (col = 0; col &lt; 3; col++) {\r\n        printf(\"%d\\n\", myarray[row][col]);\r\n    }\r\n}<\/code><\/pre>\n<p>This way, you will access the array as if it were a one-dimensional array. 
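<\/p>\n<p>You can check this ordering yourself. The following self-contained C sketch uses the same hypothetical 2 row by 3 column array and prints the elements in the order each loop nesting visits the underlying storage:<\/p>\n<pre class=\"code-block\"><code>#include &lt;stdio.h&gt;\r\n\r\nint main(void) {\r\n    int myarray[2][3] = { {1, 2, 3}, {4, 5, 6} };\r\n    int row, col;\r\n\r\n    \/* Column index innermost: visits memory sequentially. *\/\r\n    for (row = 0; row &lt; 2; row++)\r\n        for (col = 0; col &lt; 3; col++)\r\n            printf(\"%d \", myarray[row][col]);\r\n    printf(\"\\n\");   \/* prints: 1 2 3 4 5 6 *\/\r\n\r\n    \/* Row index innermost: jumps around the underlying storage. *\/\r\n    for (col = 0; col &lt; 3; col++)\r\n        for (row = 0; row &lt; 2; row++)\r\n            printf(\"%d \", myarray[row][col]);\r\n    printf(\"\\n\");   \/* prints: 1 4 2 5 3 6 *\/\r\n    return 0;\r\n}<\/code><\/pre>\n<p>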
If you loop through the rows in the innermost loop, you will jump around the underlying one-dimensional array in an inefficient manner:<\/p>\n<p style=\"text-align: center;\"><a href=\"\/tech\/files\/2014\/05\/array_access_conveyer_belt_bad.png\"><img loading=\"lazy\" src=\"\/tech\/files\/2014\/05\/array_access_conveyer_belt_bad.png\" alt=\"array_access_conveyer_belt_bad\" width=\"301\" height=\"237\" class=\"wp-image-78731 aligncenter\" \/><\/a><\/p>\n<p>This will not matter for very small arrays, but for the large arrays commonly used in research computing, accounting for the underlying storage leads to better performance.<\/p>\n<p>For\u00a0<strong>column-major<\/strong> order (used by <strong>Fortran, R<\/strong> and\u00a0<strong>Matlab<\/strong>), a 2 row by 3 column array<\/p>\n<pre>    [1 2 3 \r\n     4 5 6]<\/pre>\n<p>is really stored as a one-dimensional array:<\/p>\n<pre>    [1 4 2 5 3 6]<\/pre>\n<p>and it is faster to loop through the first index in the innermost loop:<\/p>\n<pre class=\"code-block\"><code>do col = 1, 3\r\n   do row = 1, 2\r\n      print *, myarray(row, col)\r\n   end do\r\nend do<\/code><\/pre>\n<p>These rules hold for 3 or more dimensions.\u00a0In short, it is best to access the array in the same way it is stored internally, and this order is language dependent.<\/p>\n<h4><a id=\"libraries\"><\/a>Use External Libraries<\/h4>\n<p><em>Use an external library for access to already-optimized code.<\/em><\/p>\n<p>For each language, there are many third-party libraries that save you time by providing optimized and tested code. We recommend that researchers start with these libraries, and develop their own code only once they find they need an approach that no existing library implements. For parallelization, efficient I\/O (e.g. file access), and numerical methods, it is much faster to use code that is already available. 
For example, <a href=\"http:\/\/www.fftw.org\">FFTW<\/a> is a library that performs Fast Fourier Transforms. It is used throughout research computing, so it has been well tested and there are abundant help resources. It also provides both serial (single-processor) and parallel (cluster) versions, and it is relatively easy to update your code to use parallel resources. In short, using this library saves both development and execution time.<\/p>\n<p>If you need help finding a library to support your research, feel free to contact us, and if you can&#8217;t <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/installed\/\" title=\"Installed Software and Languages\">find a particular library on our clusters<\/a>, please let us know that <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/software-and-applications\/request-software\/\">we should install it<\/a>.<\/p>\n<h2>Advanced Options<\/h2>\n<h4><a id=\"profile\"><\/a>Don&#8217;t Guess, Just Profile<\/h4>\n<p><em>Profile your code to find which functions and subroutines need to be optimized.<\/em><\/p>\n<p>Profiling code means measuring how long various parts of your program take to run. Profiling your code, either manually or with external tools, provides empirical evidence of which parts of your code need the most work. With this evidence, you can focus your efforts on areas that actually matter. Often the trade-off for faster code is more complex, i.e. harder-to-read, code. That cost should be paid only when it matters. 
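<\/p>\n<p>A minimal way to profile manually is to bracket suspect sections with a timer and compare. This sketch uses the standard C <code>clock()<\/code> function; the two stages are hypothetical stand-ins for your own routines:<\/p>\n<pre class=\"code-block\"><code>#include &lt;stdio.h&gt;\r\n#include &lt;time.h&gt;\r\n\r\nstatic volatile double sink;   \/* keeps the compiler from removing the loops *\/\r\n\r\nstatic double setup_stage(void) {\r\n    double s = 0.0;\r\n    for (long i = 0; i &lt; 1000000L; i++) s += (double)i;\r\n    return s;\r\n}\r\n\r\nstatic double compute_stage(void) {\r\n    double s = 0.0;\r\n    for (long i = 0; i &lt; 10000000L; i++) s += (double)i * 0.5;\r\n    return s;\r\n}\r\n\r\nint main(void) {\r\n    clock_t t0 = clock();\r\n    sink = setup_stage();\r\n    clock_t t1 = clock();\r\n    sink = compute_stage();\r\n    clock_t t2 = clock();\r\n\r\n    \/* CPU seconds per stage; optimize the larger one first. *\/\r\n    printf(\"setup:   %.3f s\\n\", (double)(t1 - t0) \/ CLOCKS_PER_SEC);\r\n    printf(\"compute: %.3f s\\n\", (double)(t2 - t1) \/ CLOCKS_PER_SEC);\r\n    return 0;\r\n}<\/code><\/pre>\n<p>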
You can email us for help with profiling your code, or you can <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/profiling\/\" title=\"Profiling\">explore your options<\/a> at your leisure.<\/p>\n<h4><a id=\"parallel\"><\/a>Parallelize Your Code<\/h4>\n<p><em>Run your code simultaneously on multiple CPU cores, machines, or GPU cores to get work done faster.<\/em><\/p>\n<p>If your application involves independent tasks, then it may be advantageous to complete them simultaneously using parallelization. If you&#8217;re not sure this is the case, please contact us to review your code and talk about your options. There are three main technologies for parallelizing your code, and we <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/\" title=\"Multiprocessor Programming\">provide details<\/a> about programming and running these applications.<\/p>\n<p>OpenMP is a library that automatically generates the code to run your application on many threads. On its own, it can only be used on one machine at a time. The advantage of this library is that it is written as special comments and pragmas and invoked using special compiler flags, so you can essentially leave your code unaffected as long as you don&#8217;t use those compiler options. With this option, you can theoretically gain 16x efficiency on our systems.<\/p>\n<p>The next technology is MPI (Message Passing Interface). It automatically runs multiple copies of your application on multiple machines, assigns each copy a unique ID, and manages the network communications that send information between these copies. This library allows you to use many more CPU cores, because it can communicate across machines. 
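<\/p>\n<p>The central pattern is that every copy runs the same code and uses its MPI-assigned ID (its &#8220;rank&#8221;) to pick a slice of the work. The decomposition arithmetic can be sketched in plain C; in a real MPI program, <code>rank<\/code> and <code>size<\/code> would come from <code>MPI_Comm_rank<\/code> and <code>MPI_Comm_size<\/code> rather than a loop, and the 100-task workload here is hypothetical:<\/p>\n<pre class=\"code-block\"><code>#include &lt;stdio.h&gt;\r\n\r\nint main(void) {\r\n    int size = 4;      \/* number of copies (processes) in the job  *\/\r\n    int n    = 100;    \/* number of independent tasks to divide up *\/\r\n\r\n    for (int rank = 0; rank &lt; size; rank++) {\r\n        \/* Block decomposition: each rank gets a contiguous slice. *\/\r\n        int start = rank * n \/ size;\r\n        int end   = (rank + 1) * n \/ size;\r\n        printf(\"rank %d handles tasks %d..%d\\n\", rank, start, end - 1);\r\n    }\r\n    return 0;\r\n}<\/code><\/pre>\n<p>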
It does require changes to your code, which then must access <strong>mpirun<\/strong> (a tool to start your new parallel application) and the MPI library. This means you will need a different copy of your code\/application. With this option, you can theoretically gain &gt;16x efficiency on our systems.<\/p>\n<p>Finally, GPUs provide many cores that speed up calculations, especially those involving array manipulations. To use this option, you must use CUDA, an NVIDIA-GPU-specific language that comes in C, C++, and Fortran-like versions. Fortunately, there are many third-party libraries that use this technology, allowing you to use GPUs without knowing how to program in CUDA. Just as OpenMP is a tool to automatically parallelize applications that use CPUs, a technology called OpenACC automatically parallelizes your code for GPUs. This might be the easiest way to get started with GPU programming, and we have notes on using OpenACC with <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/gpu-computing\/openacc-c\/\">C\/C++<\/a> and <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/gpu-computing\/openacc-fortran\/\">Fortran<\/a> applications, respectively.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Below you&#8217;ll find tips we frequently suggest to optimize &#8220;slow&#8221; code. These tips are organized into an Easy list and an Advanced list, though everything here is pretty approachable. These lists are not complete, so if you find that the advice does not help, please contact us. 
Easy Options Use Compiler Optimizations Write Files Efficiently&#8230;<\/p>\n","protected":false},"author":8020,"featured_media":0,"parent":64953,"menu_order":3,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/78109"}],"collection":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/users\/8020"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/comments?post=78109"}],"version-history":[{"count":50,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/78109\/revisions"}],"predecessor-version":[{"id":140699,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/78109\/revisions\/140699"}],"up":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/64953"}],"wp:attachment":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/media?parent=78109"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}