Category: Publications

Paper accepted in Computers & Fluids

August 16th, 2012 in Publications

“FMM-based vortex method for simulation of isotropic turbulence on GPUs, compared with a spectral method”, Rio Yokota, L. A. Barba, Computers & Fluids, in press (available online 13 August 2012).

This paper presents the results of comparing a Lagrangian vortex method with a trusted spectral method for the simulation of isotropic fluid turbulence. The numerical engine of the vorticity-based fluid solver is a massively parallel fast multipole method (FMM) running on GPU hardware using CUDA. Thus, there are several aspects to the validation: the particle method itself, the fast summation via FMM, and the use of GPUs in this application.

There have been insufficient efforts so far to fully validate the vortex method in direct numerical simulations of turbulent flows, which has left ample room for skepticism. The benchmark of decaying, homogeneous isotropic turbulence is an ideal test for quantitative validation of fluid solvers, with the simplest possible geometry (a periodic cube) and many years of experience using turbulence statistics in its analysis. It is not the best problem for showcasing a vortex particle method, which shines in wake flows and other flows dominated by vorticity, but it allows quantitative validation. We use Taylor-microscale Reynolds numbers of 50 and 100 on a 256^3 mesh and look at various turbulence statistics, including high-order moments of the velocity derivatives. A parametric study examines the effect of the order of the series expansion in the FMM, the frequency of particle reinitialization in the vortex method, the overlap ratio of the particles, and the time step. At this resolution, the vortex method matches the spectral method quantitatively at Re=50, but some deviations are appreciable in the velocity skewness and flatness for Re=100.
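To illustrate the kind of statistics being compared, here is a minimal sketch (not the paper's code) of computing the skewness and flatness of a velocity derivative from a sampled signal; the function name and the synthetic Gaussian input are illustrative assumptions.

```python
import numpy as np

def derivative_skewness_flatness(u, dx=1.0):
    """Skewness and flatness of the velocity derivative du/dx,
    two of the high-order statistics examined in the paper (sketch only)."""
    dudx = np.gradient(u, dx)
    m2 = np.mean(dudx**2)
    skewness = np.mean(dudx**3) / m2**1.5
    flatness = np.mean(dudx**4) / m2**2
    return skewness, flatness

# A Gaussian random signal gives skewness near 0 and flatness near 3;
# real turbulence departs from these values (negative derivative skewness,
# flatness above 3), which is what makes them sensitive validation metrics.
rng = np.random.default_rng(0)
u = rng.standard_normal(1_000_000)
s, f = derivative_skewness_flatness(u)
```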

The vortex method application uses a parallel FMM code, called exaFMM, that runs on GPU hardware using CUDA, while the reference spectral code (developed and used at the Center for Turbulence Research in Stanford) runs on CPU only. Results indicate that, for this application, the spectral method is an order of magnitude faster than the vortex method when using a single GPU for the FMM and six CPU cores for the FFT.

The entire code that was used to obtain the present results is available under the MIT license from:
The revision number used for the results in the paper was 146. Documentation and links to other publications are found on the project homepage at

In another publication (coming soon), we focus on the performance of the FMM-based vortex method on massively parallel systems, compared to the spectral method. That work included simulations of isotropic turbulence on up to 4096 GPUs, with a 4096^3 problem size and exceeding 1 petaflop/s of sustained performance. Preliminary results are available via the following figure:

Weak scaling of parallel FMM vs. FFT up to 4096 processes. Lorena A. Barba, Rio Yokota. figshare



Paper accepted in CiSE

January 4th, 2012 in Publications

A new paper authored by the ExaFMM team has been accepted, this time to appear in Computing in Science and Engineering, the joint publication of the IEEE Computer Society and the American Institute of Physics.

  • Title: "Hierarchical N-body Simulations with Auto-Tuning for Heterogeneous Systems"
  • Authors: Rio Yokota and Lorena A. Barba
  • To appear: Computing in Science and Engineering (CiSE), May/June 2012 issue.
  • Preprint: arXiv:1108.5815

This paper presents the new hybrid treecode/FMM formulation of ExaFMM, which maintains the O(N) complexity, but is able to perform both cell-cell and cell-particle interactions (i.e., it is both a treecode and an FMM code). The code also offers auto-tuning capability, being able to choose dynamically which type of interaction to perform: cell-cell, cell-particle or particle-particle. This feature is enabled by means of a dual-tree traversal technique, described in more detail in the Features section of this website.
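The dual-tree traversal idea can be sketched as follows. This is an illustrative Python fragment with hypothetical names and data structures, not exaFMM's actual implementation: a pair of cells is either well separated (approximate cell-cell interaction), both leaves (direct particle-particle interaction), or the larger cell is split and the traversal recurses.

```python
import math

def dual_tree_traverse(a, b, theta=0.5):
    """Traverse a pair of tree cells and decide each interaction type.
    Cells are dicts: {"center": (x, y, z), "radius": r, "children": [...]}.
    Returns a list of ("M2L" | "P2P", cell, cell) tuples."""
    dist = math.dist(a["center"], b["center"])
    if dist * theta > a["radius"] + b["radius"]:
        return [("M2L", a, b)]   # well separated: cell-cell multipole interaction
    if not a["children"] and not b["children"]:
        return [("P2P", a, b)]   # both leaves: direct particle-particle sum
    # otherwise split the larger cell and recurse on the child pairs
    if b["children"] and (not a["children"] or b["radius"] > a["radius"]):
        a, b = b, a
    interactions = []
    for child in a["children"]:
        interactions += dual_tree_traverse(child, b, theta)
    return interactions
```

The opening-angle parameter `theta` plays the role of the acceptance criterion: a larger value accepts more cell-cell approximations, trading accuracy for speed, which is the knob an auto-tuner can adjust per machine.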

The paper also discusses the advantage of multipole algorithms on GPU hardware, with the aid of the roofline model to show how they compare with other algorithms. Some of the recent work also described includes many-GPU turbulence calculations on Tsubame 2.0 with 2048 GPUs, which achieved 0.5 petaflop/s in performance.
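The roofline model is simple to state: attainable performance is the minimum of the compute peak and the memory bandwidth times the kernel's arithmetic intensity. A back-of-the-envelope sketch (the numbers below are hypothetical, not figures from the paper):

```python
def roofline(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Attainable GFlop/s under the roofline model: performance is bounded
    by either the compute ceiling or the memory-bandwidth slope."""
    return min(peak_gflops, bandwidth_gbs * intensity_flops_per_byte)

# Hypothetical GPU: 1000 GFlop/s peak, 150 GB/s memory bandwidth.
# A memory-bound kernel (1 flop/byte) sits on the bandwidth slope,
# while a compute-heavy kernel (20 flop/byte) reaches the peak.
low = roofline(1000, 150, 1)    # bandwidth-limited
high = roofline(1000, 150, 20)  # compute-limited
```

High arithmetic intensity is what makes the particle-particle kernels of treecodes and FMMs a good fit for GPUs in this analysis.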

From the Conclusions of the paper:

The fact that the current method can automatically choose the optimal interactions, on a given heterogeneous system, alleviates the user from two major burdens. Firstly, the user does not need to decide among treecode or FMM, predicting which algorithm will be faster for a particular application given the accuracy requirements—they are now one algorithm.  Secondly, there is no need to tweak parameters, e.g., particles per cell, in order to achieve optimal performance on GPUs—the same code can run on any machine without changing anything. This feature is a requirement to developing a black-box software library for fast N-body algorithms on heterogeneous systems, which is our goal.



First ExaFMM paper accepted

October 20th, 2011 in Publications

The first publication reporting our work towards advancing fast multipole methods (FMM) to be a prime algorithm for exascale systems has been accepted by the International Journal of High-Performance Computing Applications, IJHPCA.

Our recent previous work showed scaling of an FMM on GPU clusters, with problem sizes on the order of billions of unknowns [1]. That work led to an extremely parallel FMM, scaling to thousands of GPUs or tens of thousands of CPUs.

This new paper [2] reports on a campaign of performance tuning and scalability studies using multi-core CPUs, on the Kraken supercomputer.

Intra-node performance optimization was accomplished using OpenMP and tuning of the particle-to-particle kernel using SIMD instructions. Parallel scalability was studied in both strong and weak scaling. The strong-scaling test with 100 million particles achieved 93% parallel efficiency on 2048 processes for the non-SIMD code, and 54% for the SIMD-optimized code (which was still 2x faster).
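For reference, strong-scaling parallel efficiency is the measured speedup divided by the ideal speedup relative to a baseline run. A minimal sketch, with made-up timings rather than the Kraken measurements:

```python
def strong_scaling_efficiency(t_base, p_base, t, p):
    """Parallel efficiency in a strong-scaling test: the speedup over the
    baseline run, divided by the increase in process count."""
    speedup = t_base / t
    ideal = p / p_base
    return speedup / ideal

# Made-up timings: a run that is 64x faster on 64x the processes is
# perfectly efficient; one that is only 32x faster has 50% efficiency.
perfect = strong_scaling_efficiency(64.0, 32, 1.0, 2048)
half = strong_scaling_efficiency(64.0, 32, 2.0, 2048)
```

This is also why the SIMD-optimized code shows a lower efficiency figure despite running faster: its baseline time is already smaller, so the same communication overheads eat a larger fraction of the runtime.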

The largest calculation on 32,768 processes took about 40 seconds to evaluate more than 32 billion unknowns.

This work builds up evidence for our view that FMM is poised to play a leading role in exascale computing, and we end the paper with a discussion of the features that make it a particularly favorable algorithm for the emerging heterogeneous and massively parallel architectural landscape.

IJHPCA is not only one of the top journals in computer science and interdisciplinary applications, but also has an author-friendly copyright policy and offers open-access options.


[1] “Biomolecular electrostatics using a fast multipole BEM on up to 512 GPUs and a billion unknowns”, Rio Yokota, J P Bardhan, M G Knepley, L A Barba, T Hamada.
Comput. Phys. Commun., 182(6):1271–1283 (2011) doi:10.1016/j.cpc.2011.02.013 Preprint arXiv:1007.4591

[2] “A tuned and scalable fast multipole method as a preeminent algorithm for exascale systems”, Rio Yokota, L A Barba. Int. J. High Perform. Comput. Appl., accepted (2011).
Preprint arXiv:1106.2176
