{"id":64342,"date":"2013-03-27T13:58:03","date_gmt":"2013-03-27T17:58:03","guid":{"rendered":"http:\/\/www.bu.edu\/tech\/?page_id=64342"},"modified":"2023-07-21T13:51:06","modified_gmt":"2023-07-21T17:51:06","slug":"openacc-fortran","status":"publish","type":"page","link":"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/gpu-computing\/openacc-fortran\/","title":{"rendered":"Programming for GPUs using OpenACC in Fortran"},"content":{"rendered":"<h2>Introduction<\/h2>\n<p><a href=\"http:\/\/www.openacc-standard.org\/\">OpenACC<\/a> is a directives-based API for parallelizing code on accelerators such as NVIDIA GPUs. In contrast, <a href=\"http:\/\/openmp.org\/wp\/\">OpenMP<\/a> is the API for shared-memory parallel processing on CPUs. OpenACC is designed to provide a simple yet powerful way to use accelerators without significant programming effort. Programmers simply insert OpenACC directives before specific code sections, typically loops, to engage the GPUs; the compiler then targets and optimizes the parallelism. In many cases, the programming effort required with OpenACC is much less than with NVIDIA&#8217;s <a href=\"http:\/\/www.nvidia.com\/object\/cuda_home_new.html\">CUDA programming language<\/a>. For many large existing codes, rewriting in CUDA is impractical, if not impossible; for those cases, OpenACC offers a pragmatic alternative.<\/p>\n<h2>What you need to know or do on the SCC<\/h2>\n<ol>\n<li>To use OpenACC, compile your Fortran code with the Portland Group Inc. (PGI) compiler, <code><span class=\"command\">pgfortran<\/span><\/code> (or <code><span class=\"command\">pgf90<\/span><\/code>, <code><span class=\"command\">pgf95<\/span><\/code>). 
You will need to load a module in order to use the PGI compiler:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1%<\/span> <span class=\"command\">module load <span>nvidia-hpc\/2023-23.5<\/span><\/span><\/code><\/pre>\n<\/li>\n<li>After this, you can proceed with compilation. For example:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1%<\/span> <span class=\"command\">pgfortran -o<\/span> <span class=\"placeholder\">mycode<\/span> <span class=\"command\">-acc -Minfo<\/span> <span class=\"placeholder\">mycode.f90<\/span><\/code><\/pre>\n<p>In the above example, <code><span class=\"command\">-acc<\/span><\/code> enables OpenACC directive processing, while <code><span class=\"command\">-Minfo<\/span><\/code> prints additional information about the compilation. For details, see the man page of <code><span class=\"command\">pgfortran<\/span><\/code>:<\/p>\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1%<\/span> man <span class=\"command\">pgfortran<\/span><\/code><\/pre>\n<\/li>\n<li>To submit your code (with OpenACC directives) to an SCC node with GPUs:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1%<\/span> <span class=\"command\">qsub -l gpus=1 -b y<\/span> <span class=\"placeholder\">mycode<\/span><\/code><\/pre>\n<p>This example requests 1 GPU (and, in the absence of a multiprocessor request, 1 CPU).<\/p>\n<p><a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/gpu-computing\/\">Additional examples of GPU batch jobs are available here<\/a>.<\/p><\/li>\n<\/ol>\n<h2>Demonstration of Performance<\/h2>\n<ol>\n<li>The following examples demonstrate a matrix multiply (<b><i>C = A * B<\/i><\/b>) using either multi-threaded OpenMP or OpenACC on a single GPU. 
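<p>To give a flavor of how the directives are used, the core loop nest of such a matrix multiply can be offloaded with a single OpenACC directive. The sketch below is illustrative only; the array names and bounds are hypothetical and are not taken from the <code>matrix_multiply.f90<\/code> example:<\/p>\n<pre class=\"code-block\"><code>! Offload the loop nest to the GPU; copy a and b in, copy c back out\n!$acc parallel loop collapse(2) copyin(a,b) copyout(c)\ndo j = 1, n\n   do i = 1, n\n      c(i,j) = 0.0\n      do k = 1, n\n         c(i,j) = c(i,j) + a(i,k) * b(k,j)\n      end do\n   end do\nend do\n!$acc end parallel loop<\/code><\/pre>\n<p>When compiled without <code><span class=\"command\">-acc<\/span><\/code>, the directive lines are treated as ordinary comments and the code runs serially on the CPU.<\/p>\n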
<ul>\n<li>For the OpenMP version:\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1%<\/span> <span class=\"command\">pgfortran<\/span> <span class=\"command\">-mp <a href=\"https:\/\/www.bu.edu\/tech\/support\/research\/software-and-programming\/programming\/multiprocessor\/gpu-computing\/openacc-fortran\/matrix-multiply-fortran\/\">matrix_multiply.f90<\/a> <\/span>-o mm_omp<\/code><\/pre>\n<\/li>\n<li>\n<p style=\"margin-bottom: 0px;\">For the OpenACC version:<\/p>\n<pre class=\"code-block\"><code><span class=\"prompt\">scc1%<\/span> <span class=\"command\">pgfortran<\/span> <span class=\"command\">-acc matrix_multiply.f90 <\/span>-o mm_acc<\/code><\/pre>\n<\/li>\n<\/ul>\n<\/li>\n<li>The following demonstrates timing comparisons for OpenACC, OpenMP, and MPI:\n<div style=\"width: 660px\" class=\"wp-caption alignnone\"><img loading=\"lazy\" src=\"https:\/\/www.bu.edu\/tech\/files\/2013\/05\/gpu8ktime.jpg\" border=\"0\" width=\"650\" height=\"387\" alt=\"Bar Chart shows timings for Single GPU Matrix Multiplication using OpenACC, OpenMP, and MPI. \" \/><p class=\"wp-caption-text\">With 4, 8, and 16 CPUs respectively, OpenMP took 117, 72, and 25 seconds and MPI took 156, 66, and 27 seconds, while OpenACC on a single GPU took 19 seconds in every case.<\/p><\/div>\n<p><br clear=\"LEFT\" \/>The figure above compares the timing of a matrix multiply on a single GPU (via OpenACC) against two other parallel methods: OpenMP and MPI. 
The figure below shows the timings of the same matrix multiply using 1, 2, and 3 GPU devices.<br \/>\n<img loading=\"lazy\" src=\"https:\/\/www.bu.edu\/tech\/files\/2013\/05\/acctime.jpg\" border=\"0\" width=\"650\" height=\"399\" class=\"alignnone\" alt=\"Bar chart showing Matrix Multiply timings using 1, 2, and 3 GPU devices \" \/><\/li>\n<\/ol>\n<h2>OpenACC Tutorial<\/h2>\n<p>Please refer to <a href=\"\/tech\/files\/2017\/04\/OpenACC-2017Spring.pdf\">the RCS tutorial slides for OpenACC programming<\/a>.<\/p>\n<h2>Relevant Links<\/h2>\n<ul>\n<li><a href=\"https:\/\/www.openacc.org\/sites\/default\/files\/inline-images\/Specification\/OpenACC.3.0.pdf\">OpenACC 3.0 specification<\/a><\/li>\n<li><a href=\"http:\/\/www.pgroup.com\">PGI&#8217;s Compilers and Tools<\/a><\/li>\n<\/ul>\n<h2>OpenACC Consulting<\/h2>\n<p>RCS staff scientific programmers can help you tune your OpenACC code. For assistance, please send email to <a href=\"mailto:help@scc.bu.edu\">help@scc.bu.edu<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction OpenACC is a directives-based API for code parallelization with accelerators, for example, NVIDIA GPUs. In contrast, OpenMP is the API for shared-memory parallel processing with CPUs. OpenACC is designed to provide a simple yet powerful approach to accelerators without significant programming effort. 
Programmers simply insert OpenACC directives before specific code sections, typically with loops,&#8230;<\/p>\n","protected":false},"author":1692,"featured_media":0,"parent":62821,"menu_order":7,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/64342"}],"collection":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/users\/1692"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/comments?post=64342"}],"version-history":[{"count":24,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/64342\/revisions"}],"predecessor-version":[{"id":146757,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/64342\/revisions\/146757"}],"up":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/pages\/62821"}],"wp:attachment":[{"href":"https:\/\/www.bu.edu\/tech\/wp-json\/wp\/v2\/media?parent=64342"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}