Advanced algorithmic techniques for GPUs

by Wen-Mei Hwu

University of Illinois Urbana-Champaign

Computational thinking

Professor Hwu is co-author of the first book on GPU programming in the English language.

Professor Hwu is co-author of the first book on GPU programming in the English language.

In this section, we discuss the particular limitations and constraints of many-core hardware, and which computational patterns are desirable or undesirable, from the architectural point of view. Among the main obstacles to performance, we discuss: conflicts in critical resources leading to serialization, load imbalance, and memory bandwidth bottlenecks.

With these constraints in sight, we aim to discover how computational thinking enables one to transform the structure of a problem by: identifying inherently serial parts, finding the parts which are amenable to parallel execution, and the methods and compromises for transforming one to the other. We make the observation that there are only a modest number of fundamental algorithm strategies used in successful GPU programs.

Parallelism transformations for performance

Often domain problems have inherent parallelism that needs to be recognized. The most efficient implementation that exploits the problem’s parallelism may be non-intuitive. For example, two alternative thread arrangements that appear in electrostatics calculations have, respectively, scatter and gather memory access behavior. The first is more intuitive, but the second is much more efficient on the GPU architecture.

Avoidance of conflicts in resources

The GPU architecture is characterized by memory access bandwidth that, although fast, is often limiting in comparison to compute throughput. Thus, achieving performance critically depends on finding ways to reduce and regularize global memory access. Three important algorithmic strategies for conserving bandwidth are “register/memory tiling”, “layout transformation” and “thread coarsening”. These come at a cost of increased on-chip memory usage, which is also a limited resource. We will discuss a variety of examples from PDE solvers, linear algebra, and convolution.

An efficient technique for non-uniform data is sorting by binning.

An efficient technique for non-uniform data is sorting by binning.

Dealing with data efficiently

Regular computation and data access is the ideal combination for GPUs. Domain problems present data in a variety of ways to the programmer: often data is sparse, sometimes it is non-uniform, or it can be dynamic. Specific techniques such as data binning, compaction, and queuing are effective for GPU architectures. We will present examples from medical imaging and graph problems to illustrate these techniques.