Publications by the Barba group

The following publications report research related to the ExaFMM project:

  • “Petascale turbulence simulation using a highly parallel fast multipole method”, Rio Yokota, L A Barba, Tetsu Narumi, Kenji Yasuoka. Submitted 2011. Preprint: arXiv:1106.5273
    This paper presents the application of a periodic FMM algorithm to the simulation of homogeneous turbulent flow in a cube, demonstrating scalable computation on thousands of GPUs. The calculations were carried out on the TSUBAME 2.0 system at the Tokyo Institute of Technology, thanks to guest access provided by the Grand Challenge Program of TSUBAME. A total of 2048 GPUs were used, half of the complete system, to achieve a sustained performance of 0.5 Petaflop/s.
  • “Hierarchical N-body simulations with auto-tuning for heterogeneous systems”, Rio Yokota, L A Barba. Computing in Science and Engineering (CiSE), 3 January 2012, IEEE Computer Society, doi:10.1109/MCSE.2012.1.
    Preprint: arXiv:1108.5815
    This paper presents an algorithm that combines features of treecodes and the fast multipole method (FMM) through dynamic selection of cell-cell or cell-particle interactions as the tree is traversed. Selecting kernels dynamically at runtime relies on a flexible and generic tree-traversal algorithm: a stack-based dual tree traversal. (See the figure in the Features section of this website.) The paper also gives historical and current context on the importance of N-body simulation in computational science, and on the advantages of N-body algorithms on GPUs (illustrated with the Roofline model).
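    A stack-based dual tree traversal can be sketched as follows. This is a simplified, hypothetical illustration (1-D cells, a Barnes–Hut-style acceptance criterion, and made-up names such as `dualTreeTraversal`), not the ExaFMM implementation:

    ```cpp
    #include <cmath>
    #include <cstdio>
    #include <stack>
    #include <utility>
    #include <vector>

    // Minimal cell for illustration: a 1-D center, a radius, and child indices.
    struct Cell {
      double x, r;                 // center and radius
      std::vector<int> children;   // indices of child cells; empty for leaves
    };

    // Stack-based dual tree traversal: pop a pair of cells and either accept a
    // cell-cell approximation (when well separated under an opening-angle
    // criterion), do direct cell-particle work (two leaves), or split the
    // larger cell and push its children paired with the other cell.
    void dualTreeTraversal(const std::vector<Cell>& cells, int rootA, int rootB,
                           double theta, int& approx, int& direct) {
      std::stack<std::pair<int, int>> stack;
      stack.push({rootA, rootB});
      while (!stack.empty()) {
        auto [a, b] = stack.top();
        stack.pop();
        const Cell& A = cells[a];
        const Cell& B = cells[b];
        double dist = std::fabs(A.x - B.x);
        if (A.r + B.r < theta * dist) {
          ++approx;                      // well separated: cell-cell interaction
        } else if (A.children.empty() && B.children.empty()) {
          ++direct;                      // two leaves: direct cell-particle work
        } else if (!A.children.empty() && (A.r >= B.r || B.children.empty())) {
          for (int c : A.children) stack.push({c, b});  // split the larger cell
        } else {
          for (int c : B.children) stack.push({a, c});
        }
      }
    }

    int main() {
      // Tiny two-level tree on [0, 8): root, two internal cells, four leaves.
      std::vector<Cell> cells = {
          {4.0, 4.0, {1, 2}}, {2.0, 2.0, {3, 4}}, {6.0, 2.0, {5, 6}},
          {1.0, 1.0, {}},     {3.0, 1.0, {}},     {5.0, 1.0, {}},
          {7.0, 1.0, {}}};
      int approx = 0, direct = 0;
      dualTreeTraversal(cells, 0, 0, 0.7, approx, direct);
      std::printf("approx=%d direct=%d\n", approx, direct);
      return 0;
    }
    ```

    The traversal never fixes the interaction type in advance: each popped pair is classified on the fly, which is what allows kernels to be selected dynamically at runtime.
    
    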
  • “A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems”, Rio Yokota, L A Barba. Int. J. High Perform. Comput. Appl., published online 24 Jan. 2012, doi:10.1177/1094342011429952
    Preprint: arXiv:1106.2176
    Describes scaling of the FMM on a large CPU system with O(100k) cores. Several advanced techniques were used, including a hierarchical communication scheme for efficient intra-node/inter-node all-to-all exchanges, and single-node tuning via OpenMP threading and SIMD vectorization.
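    As a back-of-the-envelope illustration of why hierarchical communication helps, the sketch below counts inter-node messages for a flat all-to-all versus a two-stage scheme (aggregate within each node, exchange between nodes, scatter locally). The function names and counting model are hypothetical, not the paper's actual scheme:

    ```cpp
    #include <cstdio>

    // Inter-node message counts for an all-to-all among P = nodes * local ranks.

    // Flat all-to-all: every rank sends to every rank on a different node.
    long flatInterNodeMessages(long nodes, long local) {
      long ranks = nodes * local;
      return ranks * (ranks - local);  // each rank skips its co-resident ranks
    }

    // Hierarchical: data is first combined within each node (cheap, shared
    // memory), then one designated rank per node exchanges with every other
    // node, then results are scattered locally.
    long hierarchicalInterNodeMessages(long nodes) {
      return nodes * (nodes - 1);
    }

    int main() {
      long nodes = 64, local = 16;  // e.g. 64 nodes with 16 ranks each
      std::printf("flat:         %ld inter-node messages\n",
                  flatInterNodeMessages(nodes, local));
      std::printf("hierarchical: %ld inter-node messages\n",
                  hierarchicalInterNodeMessages(nodes));
      return 0;
    }
    ```

    The inter-node message count drops from O(P²) to O(N²) in the number of nodes N, at the cost of extra intra-node staging, which is the basic trade-off such schemes exploit.
    
    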
  • “Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns”, Rio Yokota, J P Bardhan, M G Knepley, L A Barba, T Hamada. Comput. Phys. Commun., 182(6):1271–1283 (2011), doi:10.1016/j.cpc.2011.02.013
    Preprint: arXiv:1007.4591
    This paper reports the application of the FMM algorithm on a multi-GPU system to biomolecular electrostatics using the continuum model with implicit solvent. We used the Degima cluster at the Nagasaki Advanced Computing Center, a system that as of June 2011 occupied the #3 spot on the Green500 list. The largest calculation solved a system of over a billion boundary unknowns for more than 20 million atoms, requiring one minute of run time on 512 GPUs.