David Keyes, Columbia University and King Abdullah University of Science and Technology (KAUST)
Sustained floating-point computation rates on real applications, as tracked by the ACM Gordon Bell Prize, increased by three orders of magnitude from 1988 (1 Gigaflop/s) to 1998 (1 Teraflop/s), and by another three orders of magnitude to 2008 (1 Petaflop/s). Computer engineering provided only a couple of orders of magnitude of improvement for individual cores over that period; the remaining factor came from concurrency, which is approaching one million-fold.
Meanwhile, algorithmic improvements made each flop more valuable scientifically. As the semiconductor industry now slips relative to its own roadmap for silicon-based logic and memory, concurrency, especially on-chip many-core concurrency and GPGPU SIMD-type concurrency, will play an increasing role in the next few orders of magnitude needed to reach the ambitious target of 1 Exaflop/s, extrapolated for 2018. An important question is whether today’s best algorithms can be hosted efficiently on such hardware, and how much co-design of algorithms and architecture will be required.
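The trend quoted above (three orders of magnitude per decade) can be checked with a few lines of arithmetic. The sketch below uses only the milestone figures given in the abstract; the function and variable names are illustrative, and the 2018 value is a plain extrapolation of the fitted rate, not a prediction from the talk.

```python
import math

def annual_growth(start_flops, end_flops, years):
    """Constant yearly growth factor that takes start_flops to end_flops."""
    return (end_flops / start_flops) ** (1.0 / years)

# Gordon Bell Prize milestones as stated in the abstract (flop/s).
milestones = {1988: 1e9, 1998: 1e12, 2008: 1e15}

# Growth factor per year over the two decades 1988-2008.
rate = annual_growth(milestones[1988], milestones[2008], 2008 - 1988)

# Time needed to double the sustained rate at that pace, in years.
doubling_years = math.log(2) / math.log(rate)

# Extending the same pace for one more decade.
projected_2018 = milestones[2008] * rate ** (2018 - 2008)

print(f"yearly growth factor: {rate:.3f}")
print(f"doubling time (years): {doubling_years:.2f}")
print(f"projected 2018 rate (flop/s): {projected_2018:.3e}")
```

Under these figures the sustained rate roughly doubles every year, and holding that pace for another decade gives the 1 Exaflop/s target.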
From the applications perspective, we illustrate eight reasons why today’s computational scientists have an insatiable appetite for such performance: resolution, fidelity, dimension, artificial boundaries, parameter inversion, optimal control, uncertainty quantification, and the statistics of ensembles.
The paths to the exascale summit are debated, but all are narrow and treacherous, constrained by fundamental laws of physics, cost, power consumption, programmability and reliability. Drawing on recent reports, workshops, vendor projections, and experiences with scientific codes on contemporary platforms, we propose roles for today’s graduate researchers in one of the great global scientific quests of the next decade.
Tsunami Simulation on GPUs
Takayuki Aoki, Tokyo Institute of Technology
Thursday, January 6th 2011
Tsunamis are destructive forces of nature, so accurate forecasting and early warning are extremely important. To predict a tsunami, the shallow water equations must be solved in real time. These equations can be solved with the CIP-CSL2 scheme and the method of characteristics. GPUs can drastically accelerate the computation by running it in a highly parallel environment.
A single-GPU calculation was found to be 62 times faster than a single CPU core (Intel i7). We also applied domain decomposition to solve the problem on a multi-node GPU cluster. Two transfer models were used: synchronous and asynchronous. In the synchronous model, computation stops while the transfers take place; in the asynchronous model, computation and transfers proceed simultaneously. Overlapping transfers with computation further accelerated the process by hiding communication. Because direct GPU-to-GPU transfers are not possible, the CPU must be used as a bridge to share information between neighbors. Therefore, an asynchronous-copy model was used for the GPU transfers, and the MPI library was used to transfer the data between nodes.
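The difference between the synchronous and overlapped transfer models can be sketched on the CPU with NumPy. This is a minimal illustration under stated assumptions, not the speaker's implementation: a 1D three-point averaging stencil stands in for the shallow water solver, plain Python tuples stand in for the host staging buffers, and every function name is hypothetical. In a real code the staging copies would be asynchronous device-to-host copies and the halo delivery would be MPI messages.

```python
import numpy as np

def stencil(u):
    """Three-point average; a hypothetical stand-in for the real solver update."""
    return (u[:-2] + u[1:-1] + u[2:]) / 3.0

def reference_step(u):
    """Single-domain update with zero-gradient (edge-copied) boundaries."""
    return stencil(np.pad(u, 1, mode="edge"))

def split(u, parts):
    """Split the field into subdomains, each with one halo cell per side."""
    return [np.pad(c, 1, mode="edge") for c in np.array_split(u, parts)]

def gather(subs):
    """Reassemble the global field from the subdomain interiors."""
    return np.concatenate([s[1:-1] for s in subs])

def stage_boundaries(subs):
    """Copy each subdomain's edge cells into 'host' staging buffers."""
    return [(float(s[1]), float(s[-2])) for s in subs]

def fill_halos(subs, staged):
    """Deliver staged neighbor values into the halo cells."""
    last = len(subs) - 1
    for i, s in enumerate(subs):
        s[0] = staged[i - 1][1] if i > 0 else s[1]
        s[-1] = staged[i + 1][0] if i < last else s[-2]

def step_sync(subs):
    """Synchronous model: finish all transfers, then compute."""
    fill_halos(subs, stage_boundaries(subs))
    return [np.pad(stencil(s), 1, mode="edge") for s in subs]

def step_overlap(subs):
    """Overlapped model: inner cells are updated while halos are 'in flight'."""
    staged = stage_boundaries(subs)              # transfer starts
    outs = []
    for s in subs:
        out = np.empty_like(s)
        out[2:-2] = stencil(s[1:-1])             # needs no halo data
        outs.append(out)
    fill_halos(subs, staged)                     # transfer completes
    for s, out in zip(subs, outs):
        out[1] = (s[0] + s[1] + s[2]) / 3.0      # left edge cell uses left halo
        out[-2] = (s[-3] + s[-2] + s[-1]) / 3.0  # right edge cell uses right halo
        out[0], out[-1] = out[1], out[-2]        # placeholder halos, refreshed next step
    return outs

def run(u, parts, steps, step_fn):
    """Advance the decomposed field and return the reassembled result."""
    subs = split(u, parts)
    for _ in range(steps):
        subs = step_fn(subs)
    return gather(subs)
```

Both step functions give the same field as the single-domain update; only the ordering differs. The overlapped version works because inner cells depend on no halo data, so on a GPU their update can run while the boundary copies and messages are in progress.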
A domain representing real bathymetry was used as our dataset, with a grid size of 4096×8192 and 90 m resolution. Our tests on the TSUBAME supercomputer showed excellent scalability. On the TSUBAME GPU cluster, which consists of Tesla S1060 cards, the results were impressive: for example, 1000 CPUs were required to match the performance of 8 GPUs.