{"id":266,"date":"2011-11-29T14:57:54","date_gmt":"2011-11-29T19:57:54","guid":{"rendered":"https:\/\/www.bu.edu\/exafmm\/?page_id=266"},"modified":"2012-06-18T18:41:59","modified_gmt":"2012-06-18T22:41:59","slug":"performance","status":"publish","type":"page","link":"https:\/\/www.bu.edu\/exafmm\/documentation\/performance\/","title":{"rendered":"Performance"},"content":{"rendered":"<h2>MPI-parallel code<\/h2>\n<h3>Strong and weak scaling on CPUs<\/h3>\n<p>On multi-core systems (<a href=\"http:\/\/nics.tennessee.edu\/debugging-and-optimization\" target=\"_blank\">Kraken supercomputer<\/a>), strong scaling with 10<sup>8<\/sup> particles on 2048 processes achieved:<\/p>\n<ul>\n<li>93% parallel efficiency for the non-SIMD code, and<\/li>\n<li>54% for the SIMD-optimized version (which is 2x faster).<\/li>\n<\/ul>\n<p>The plot in <strong>Figure 3<\/strong> shows\u00a0MPI strong scaling from 1 to 2,048 processes, and timing breakdown of the different kernels, tree construction and communications. Test problem: N=10<sup>8<\/sup> points placed at random in a cube; FMM with order p=3. Calculation time is multiplied by the number of processes, so that equal bar heights would indicate perfect scaling.<\/p>\n<figure id=\"attachment_270\" aria-describedby=\"caption-attachment-270\" style=\"width: 572px\" class=\"wp-caption alignnone\"><a href=\"\/exafmm\/files\/2011\/11\/strong_sm.png\"><img loading=\"lazy\" class=\"size-full wp-image-270 \" title=\"strong_sm\" src=\"\/exafmm\/files\/2011\/11\/strong_sm.png\" alt=\"MPI strong scaling from 1 to 2,048 processes, and timing breakdown of the different kernels, tree construction and communications. Test problem: N=10^8 points placed at random in a cube; FMM with order p=3. Calculation time is multiplied by the number of processes. 
Parallel efficiency is 93% on 2,048 processes.\" width=\"562\" height=\"300\" srcset=\"https:\/\/www.bu.edu\/exafmm\/files\/2011\/11\/strong_sm.png 703w, https:\/\/www.bu.edu\/exafmm\/files\/2011\/11\/strong_sm-636x339.png 636w\" sizes=\"(max-width: 562px) 100vw, 562px\" \/><\/a><figcaption id=\"caption-attachment-270\" class=\"wp-caption-text\">Figure 3\u2014MPI strong scaling from 1 to 2,048 processes, and timing breakdown of the different kernels, tree construction and communications. Parallel efficiency is 93% on 2,048 processes. \u00a9 2011 R Yokota, L Barba.<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>Weak scaling with 10<sup>6<\/sup> particles per process achieved 72% efficiency on 32,768 processes of the Kraken supercomputer. The plot in Figure 4 shows\u00a0MPI weak scaling (with SIMD optimizations) from 1 to 32,768 processes, and timing breakdown of the different kernels, tree construction and communications. Test problem: N=10<sup>6<\/sup> points per process placed at random in a cube; FMM with order p=3.<\/p>\n<figure id=\"attachment_274\" aria-describedby=\"caption-attachment-274\" style=\"width: 475px\" class=\"wp-caption alignnone\"><a href=\"\/exafmm\/files\/2011\/11\/weak_sm.png\"><img loading=\"lazy\" class=\"size-full wp-image-274 \" title=\"weak_sm\" src=\"\/exafmm\/files\/2011\/11\/weak_sm.png\" alt=\"MPI weak scaling (with SIMD optimizations) from 1 to 32,768 processes, and timing breakdown of the different kernels, tree construction and communications. Test problem: N=10^6 points per process placed at random in a cube; FMM with order p=3. Parallel efficiency is 72% on 32,768 processes.\" width=\"465\" height=\"287\" \/><\/a><figcaption id=\"caption-attachment-274\" class=\"wp-caption-text\">Figure 4\u2014MPI weak scaling (with SIMD optimizations) from 1 to 32,768 processes. Parallel efficiency is 72% on 32,768 processes. 
\u00a9 2011 R Yokota, L Barba.<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>The results above are detailed in the following publication:<\/p>\n<blockquote><p>&#8220;A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems&#8221;, Rio Yokota and Lorena A Barba,\u00a0<em>Int. J. High-perf. Comput.,<\/em> online 24 Jan. 2012,\u00a0<a title=\"http:\/\/hpc.sagepub.com\/content\/early\/2012\/01\/18\/1094342011429952.abstract\" href=\"http:\/\/hpc.sagepub.com\/content\/early\/2012\/01\/18\/1094342011429952.abstract\">doi:10.1177\/1094342011429952<\/a> <em> \u2014 <\/em>Preprint:\u00a0<a title=\"http:\/\/arxiv.org\/abs\/1106.2176\" href=\"http:\/\/arxiv.org\/abs\/1106.2176\">arXiv:1106.2176<\/a><\/p><\/blockquote>\n<p>&nbsp;<\/p>\n<h2>Weak scaling on GPU systems<\/h2>\n<p>The <em>ExaFMM<\/em> code scales excellently to thousands of GPUs. We studied scalability on the <a href=\"http:\/\/www.gsic.titech.ac.jp\/en\/tsubame2\" target=\"_blank\">Tsubame 2.0 supercomputer<\/a> of Tokyo Institute of Technology (thanks to guest access). A timing breakdown is shown below, on up to 2048 processes, where the communication time is seen to be minor.<\/p>\n<figure id=\"attachment_284\" aria-describedby=\"caption-attachment-284\" style=\"width: 510px\" class=\"wp-caption alignnone\"><a href=\"\/exafmm\/files\/2011\/11\/breakdownGPU.png\"><img loading=\"lazy\" class=\"size-full wp-image-284\" title=\"breakdownGPU\" src=\"\/exafmm\/files\/2011\/11\/breakdownGPU.png\" alt=\"Timing breakdown and scalability of ExaFMM on many GPUs.\" width=\"500\" height=\"289\" \/><\/a><figcaption id=\"caption-attachment-284\" class=\"wp-caption-text\">Figure 5\u2014Timing breakdown and scalability of ExaFMM on many GPUs.<\/figcaption><\/figure>\n<p>Parallel efficiency of <em>ExaFMM<\/em> on a weak scaling test in Tsubame 2.0 achieved more than 70% on 4096 processes (with GPUs). 
A similar test of a parallel FFT (without GPUs) showed a dramatic degradation of efficiency at this number of processes.<\/p>\n<figure id=\"attachment_279\" aria-describedby=\"caption-attachment-279\" style=\"width: 431px\" class=\"wp-caption alignnone\"><a href=\"\/exafmm\/files\/2011\/11\/weakGPU.png\"><img loading=\"lazy\" class=\"size-full wp-image-279 \" title=\"weakGPU\" src=\"\/exafmm\/files\/2011\/11\/weakGPU.png\" alt=\"Parallel efficiency of the FMM on a weak scaling test on Tsubame 2.0 (with GPUs), which achieves more than 70% on 4096 processes, compared with a similar test of a parallel FFT (without GPUs).\" width=\"421\" height=\"372\" \/><\/a><figcaption id=\"caption-attachment-279\" class=\"wp-caption-text\">Figure 6\u2014Parallel efficiency of the FMM on a weak scaling test on Tsubame 2.0 (with GPUs), and of a parallel FFT (without GPUs) on up to 4096 processes.<\/figcaption><\/figure>\n<p>Cite this figure:<\/p>\n<blockquote><p>Weak scaling of parallel FMM vs. FFT up to 4096 processes. Lorena Barba, Rio Yokota. Figshare.<br \/>\n<a href=\"http:\/\/dx.doi.org\/10.6084\/m9.figshare.92425\">http:\/\/dx.doi.org\/10.6084\/m9.figshare.92425<\/a><\/p><\/blockquote>\n<div style=\"margin-bottom: 10em;\"><span style=\"display: none;\">.<\/span><\/div>\n","protected":false},"excerpt":{"rendered":"<p>MPI-parallel code Strong and weak scaling on CPUs On multi-core systems (Kraken supercomputer), strong scaling with 10^8 particles on 2048 processes achieved: 93% parallel efficiency for the non-SIMD code, and 54% for the SIMD-optimized version (which is 2x faster). 
The plot in Figure 3 shows\u00a0MPI strong scaling from 1 to 2,048 processes, and timing breakdown [&hellip;]<\/p>\n","protected":false},"author":3344,"featured_media":0,"parent":20,"menu_order":3,"comment_status":"closed","ping_status":"closed","template":"","meta":[],"_links":{"self":[{"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/pages\/266"}],"collection":[{"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/users\/3344"}],"replies":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/comments?post=266"}],"version-history":[{"count":22,"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/pages\/266\/revisions"}],"predecessor-version":[{"id":269,"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/pages\/266\/revisions\/269"}],"up":[{"embeddable":true,"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/pages\/20"}],"wp:attachment":[{"href":"https:\/\/www.bu.edu\/exafmm\/wp-json\/wp\/v2\/media?parent=266"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}