Example: Iterative Solvers (F90)
In this example, we demonstrate an application of Cartesian topology by way of a simple elliptic PDE, the Laplace equation.
- Fundamentals The Laplace equation, along with prescribed boundary conditions, is introduced. The finite difference method is then applied to discretize the PDE and form an algebraic system of equations.
- Jacobi Scheme A very simple iterative method known as the Jacobi scheme is described. A single-process computer code is also shown. This program is written in Fortran 90 for its concise yet clear array representations. (Parallelism and other improvements will be added to this code in steps as we progress.)
- Parallel Jacobi Scheme A parallel algorithm for this problem is discussed. Simple MPI routines, without the invocation of Cartesian topology, are inserted into the basic single-process code above to form the parallel code.
- SOR Scheme The Jacobi scheme, while simple and hence desirable for demonstration purposes, is impractical for “real” applications because of its slow convergence. Enhancements to the basic technique are introduced which lead to the Successive Over-Relaxation (SOR) scheme.
- Parallel SOR Scheme With the introduction of a “red-black” algorithm, we can employ the parallel algorithm used for Jacobi to parallelize the SOR scheme.
- Scalability We show the performance of the code for various numbers of processes to demonstrate its scalability.
- Download A zipped file is available containing the F90 Jacobi and SOR codes, along with makefiles and a MATLAB m-file for plotting the solution.
Fundamentals
First, some basics. The governing PDE is the two-dimensional Laplace equation,

$\dfrac{\partial^2 u}{\partial x^2} + \dfrac{\partial^2 u}{\partial y^2} = 0$   (1)

where $u = u(x,y)$ is defined on the unit square and subject to prescribed Dirichlet boundary conditions; in particular, $u(0,y) = u(1,y) = 0$ on the left and right boundaries. Applying the finite difference method with second-order central differences on a uniform grid of $m \times m$ interior points discretizes Eq. 1 into the algebraic system

$u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j} = 0, \qquad i, j = 1, \ldots, m$   (2)

The solution is shown below in a contour plot with x pointing from left to right and y going from bottom to top.
Jacobi Scheme
While numerical techniques abound to solve PDEs such as the Laplace equation, we will focus on the use of an iterative method, as it will be shown to be readily parallelizable and lends itself to the application of Cartesian topology. The simplest of iterative techniques is the Jacobi scheme. Solving Eq. 2 for $u_{i,j}$ and introducing an iteration counter $n$ gives the Jacobi update

$u^{n+1}_{i,j} = \dfrac{1}{4}\left( u^n_{i+1,j} + u^n_{i-1,j} + u^n_{i,j+1} + u^n_{i,j-1} \right)$   (3)

and the scheme can be stated as follows:
- Make an initial guess for $u_{i,j}$ at all interior points $(i,j)$, for all $i = 1, \ldots, m$ and $j = 1, \ldots, m$.
- Use Eq. 3 to compute $u^{n+1}_{i,j}$ at all interior points $(i,j)$.
- Stop if the prescribed convergence threshold is reached; otherwise continue to the next step.
- Set $u^{n}_{i,j} = u^{n+1}_{i,j}$.
- Go to Step 2.
Single-process F90 code for the Jacobi scheme
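The single-process source file itself is not included here; the following is a minimal sketch of the Jacobi scheme in Fortran 90 array syntax, illustrating the steps listed above. The program name, the arrays u and unew, the grid size m, the iteration cap, and the tolerance tol are assumptions made for this sketch, not identifiers from the tutorial's actual code.

```fortran
program jacobi_serial
  implicit none
  integer, parameter :: m = 64             ! number of interior points per direction (assumed)
  integer, parameter :: itermax = 10000    ! safety cap on the number of iterations (assumed)
  real,    parameter :: tol = 1.0e-6       ! convergence threshold (assumed)
  real, dimension(0:m+1,0:m+1) :: u, unew  ! solution cells plus boundary cells
  real    :: del
  integer :: iter

  u = 0.0              ! Step 1: initial guess at all interior points
  u(0:m+1,0) = 1.0     ! placeholder Dirichlet data on the bottom boundary (assumed, not
                       ! the tutorial's actual boundary values)

  do iter = 1, itermax
     ! Step 2: Eq. 3, the five-point Jacobi update, applied to all interior cells at once
     unew(1:m,1:m) = 0.25 * ( u(0:m-1,1:m) + u(2:m+1,1:m)  &
                            + u(1:m,0:m-1) + u(1:m,2:m+1) )
     del = maxval( abs( unew(1:m,1:m) - u(1:m,1:m) ) )   ! convergence measure
     u(1:m,1:m) = unew(1:m,1:m)                          ! Step 4: u^n <- u^(n+1)
     if (del < tol) exit                                 ! Step 3: stop at the threshold
  end do
end program jacobi_serial
```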
Parallel Algorithm for the Jacobi Scheme
To enable parallelism, we first need to divide up the work among individual processes; this is commonly known as domain decomposition. Since the governing equation is two-dimensional, we typically have a choice of a 1D or 2D decomposition. Here we will focus on a 1D decomposition and defer the discussion of a 2D decomposition until later. Assuming that p processes will be used, we split the computational domain into p horizontal strips along the north-south (y) direction, each assigned to one process. This choice is made primarily to facilitate a simpler implementation of the boundary conditions in the code.
For the obvious reason of better load balancing, we divide the amount of work, in this case proportional to the grid size, evenly among the processes (m x m / p cells each). For convenience, we define m' = m/p as the number of cells in the y-direction for each process. Next, let's rewrite Eq. 3 for a process k in terms of its local solution array $v$:

$v^{n+1}_{i,j} = \dfrac{1}{4}\left( v^n_{i+1,j} + v^n_{i-1,j} + v^n_{i,j+1} + v^n_{i,j-1} \right), \qquad i = 1, \ldots, m; \; j = 1, \ldots, m'$   (6)
The figure below depicts the grid of a typical process k, as well as parts of the adjoining grids of processes k-1 and k+1.
- The cells that hold the solution v at the new iteration step, n+1, as governed by Eq. 6, are represented in red. We will call these solution cells or interior cells. The terms on the right-hand side of Eq. 6, which are known quantities at iteration n, span the entire grid system shown.
- The blue cells on the top row represent cells belonging to the first row (j = 1) of solution cells of process k+1, while the blue cells on the bottom row represent the last row (j = m') of solution cells of process k-1. It is important to emphasize that the values of v in the blue cells of process k belong to the adjoining processes (k-1 and k+1). They are not accessible to the current process and hence must be “imported”, or “received”, via MPI message-passing routines. Similarly, process k's first and last rows of solution cells must be “exported”, or “sent”, to the adjoining processes, which need to import them for the same reason.
- For i = 1 and i = m, Eq. 6 again requires an extra cell beyond these two locations. These cells contain the prescribed boundary conditions (u(0,y) = u(1,y) = 0) and are colored green to distinguish them from the red and blue cells. Note that no message-passing operations are needed for these green cells, as they hold fixed boundary conditions and are known a priori.
- From the standpoint of process k, we consider the blue and green cells as boundary cells. Consequently, the index range of the strip becomes (0:m+1, 0:m'+1). On the green cells we impose the physical boundary conditions, while into the blue cells we import v from the two adjoining processes. With these boundary conditions in place, we can proceed to apply Eq. 6 to all of the solution cells. Concurrently, all other processes follow the same procedure. It is interesting to note that the grid layout for a typical process k is completely analogous to that of the original, undivided grid. Whereas the original problem has fixed boundary conditions, the problem for a process k is subject to variable boundary conditions.
- These boundary conditions can be stated mathematically as follows: for process k, the bottom ghost row $v^n_{i,0}$ holds the last solution row ($j = m'$) imported from process k-1, the top ghost row $v^n_{i,m'+1}$ holds the first solution row ($j = 1$) imported from process k+1, and $v_{0,j} = v_{m+1,j} = 0$ are the fixed physical boundary conditions. (The first and last processes use the physical bottom and top boundary values in place of one of the imported rows.)
- Note that the interior points of u and v are related by the following relationship: with the processes numbered k = 0, 1, ..., p-1, $v_{i,j}$ on process k corresponds to $u_{i,\,k m' + j}$ for $i = 1, \ldots, m$ and $j = 1, \ldots, m'$.
Jacobi parallel implementation. Note that Cartesian topology is not employed in this implementation, but it will be used later in the parallel SOR example in order to show alternative ways of solving this type of problem.
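The parallel source file is likewise not included here. The sketch below shows the core of such an implementation without Cartesian topology, following the description above: each process exchanges its first and last solution rows with the processes below and above via MPI_SENDRECV, then applies Eq. 6. The subroutine and variable names (v, mp for m', below, above) and the argument list are assumptions for illustration; only the MPI calls themselves are standard.

```fortran
subroutine exchange_and_update(v, m, mp, k, p, comm)
  implicit none
  include 'mpif.h'
  integer, intent(in)    :: m, mp, k, p, comm  ! grid sizes, process rank k, process count p
  real,    intent(inout) :: v(0:m+1,0:mp+1)    ! local strip: solution cells plus boundary cells
  real    :: vnew(m,mp)
  integer :: below, above, ierr, status(MPI_STATUS_SIZE)

  ! Neighbors in the 1D strip decomposition; MPI_PROC_NULL turns the transfers at the
  ! physical bottom/top boundaries into no-ops.
  below = k - 1; if (k == 0)     below = MPI_PROC_NULL
  above = k + 1; if (k == p - 1) above = MPI_PROC_NULL

  ! Export our last solution row (j = mp) upward; import the neighbor's row into ghost row j = 0.
  call MPI_SENDRECV(v(1:m,mp),   m, MPI_REAL, above, 0, &
                    v(1:m,0),    m, MPI_REAL, below, 0, comm, status, ierr)
  ! Export our first solution row (j = 1) downward; import the neighbor's row into ghost row j = mp+1.
  call MPI_SENDRECV(v(1:m,1),    m, MPI_REAL, below, 1, &
                    v(1:m,mp+1), m, MPI_REAL, above, 1, comm, status, ierr)

  ! Eq. 6: Jacobi update on all local solution cells.
  vnew = 0.25 * ( v(0:m-1,1:mp) + v(2:m+1,1:mp) + v(1:m,0:mp-1) + v(1:m,2:mp+1) )
  v(1:m,1:mp) = vnew
end subroutine exchange_and_update
```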
Successive Over Relaxation (SOR)
While the Jacobi iteration scheme is very simple and parallelizable, its slow convergence rate renders it impractical for any “real world” applications. One way to speed up the convergence rate is to “over-predict” the new solution by linear extrapolation. This leads to the Successive Over-Relaxation (SOR) scheme:
- Make an initial guess for $u_{i,j}$ at all interior points $(i,j)$.
- Define a scalar $\omega^n$ ($0 < \omega^n < 2$).
- Apply Eq. 3 to all interior points $(i,j)$ and call the result $u'_{i,j}$.
- $u^{n+1}_{i,j} = \omega^n u'_{i,j} + (1 - \omega^n)\, u^n_{i,j}$
- Stop if the prescribed convergence threshold is reached; otherwise continue to the next step.
- Set $u^{n}_{i,j} = u^{n+1}_{i,j}$.
- Go to Step 2.
Note in the above that setting $\omega^n = 1$ recovers the Jacobi scheme, while $\omega^n < 1$ under-relaxes the solution. Ideally, the choice of $\omega^n$ should provide the optimal rate of convergence and is not restricted to a fixed constant.
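As a concrete illustration of the steps above, here is a minimal Fortran 90 sketch of a single SOR step; the subroutine name, the arrays u and uprime, and the scalar omega are assumptions for this sketch rather than identifiers from the tutorial's code.

```fortran
subroutine sor_step(u, m, omega)
  implicit none
  integer, intent(in)    :: m
  real,    intent(in)    :: omega            ! relaxation factor, 0 < omega < 2
  real,    intent(inout) :: u(0:m+1,0:m+1)   ! current solution u^n plus boundary values
  real :: uprime(m,m)

  ! Step 3: apply Eq. 3 (the Jacobi predictor) at all interior points.
  uprime = 0.25 * ( u(0:m-1,1:m) + u(2:m+1,1:m) + u(1:m,0:m-1) + u(1:m,2:m+1) )
  ! Step 4: over-relax, u^(n+1) = omega * u' + (1 - omega) * u^n.
  u(1:m,1:m) = omega * uprime + (1.0 - omega) * u(1:m,1:m)
end subroutine sor_step
```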
As a matter of fact, an effective choice of $\omega^n$, known as Chebyshev acceleration, varies from one half-sweep to the next: $\omega^{(0)} = 1$, $\omega^{(1/2)} = 1/(1 - \rho^2/2)$, and $\omega^{(n+1/2)} = 1/(1 - \rho^2\,\omega^{(n)}/4)$ for $n = 1/2, 1, 3/2, \ldots$, where $\rho$ is the spectral radius of the Jacobi iteration. For details, see Chapter 19.5, Relaxation Methods for Boundary Value Problems, in Numerical Recipes.
A Parallel Red-Black SOR Scheme
To further speed up the rate of convergence, we could use u at time level n+1 for any or all terms on the right-hand side of Eq. 6 as soon as they become available. This is the essence of the Gauss-Seidel scheme, and it requires that the solution be updated cell by cell via do loops in order to use the most recent values as soon as they are computed. Alternatively, we will explore a conceptually similar red-black scheme, which works equally well for an array-based language such as F90 as for a scalar-based language such as C or F77. This scheme is best understood visually by painting the interior cells alternately red and black to yield a checkerboard-like pattern:
With this red-black group identification strategy, and upon applying the five-point finite-difference stencil to a point (i,j) located in a red cell, it is immediately apparent that the solution at the red cell depends only on its four immediate black neighbors to the north, east, west, and south, by virtue of Eq. 6.
Conversely, a point (i,j) located in a black cell depends only on its north, east, west, and south red neighbors. In other words, the finite-difference stencil in Eq. 6 effects an uncoupling of the solution at interior cells such that the solution at the red cells depends only on the solution at the black cells, and vice versa. In a typical iteration, if we first update all red cells, then when we perform the remaining update on the black cells we can use the red-cell values that have just been computed. Otherwise, everything described for the parallel Jacobi scheme applies equally well here; i.e., the green cells represent the physical boundary conditions, while the solutions from the first and last rows of each process's grid are deposited into the blue cells of the neighboring process grids to be used as the remaining boundary conditions.
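To make the red-black idea concrete in an array language, the following is a minimal Fortran 90 sketch of one red-black SOR sweep over a process's local strip. The names v, mp (for m'), and omega are assumptions for illustration, and the blue and green boundary cells are presumed to have been filled (by message passing and by the physical boundary conditions) before the sweep.

```fortran
subroutine redblack_sor_sweep(v, m, mp, omega)
  implicit none
  integer, intent(in)    :: m, mp
  real,    intent(in)    :: omega               ! relaxation factor, 0 < omega < 2
  real,    intent(inout) :: v(0:m+1,0:mp+1)     ! local solution plus green/blue boundary cells

  ! Red cells (i+j even): they depend only on black neighbors, so the two strided blocks
  ! below may be updated with array syntax without violating any data dependence.
  v(1:m:2,1:mp:2) = (1.0-omega)*v(1:m:2,1:mp:2) + omega*0.25* &
       ( v(0:m-1:2,1:mp:2) + v(2:m+1:2,1:mp:2) + v(1:m:2,0:mp-1:2) + v(1:m:2,2:mp+1:2) )
  v(2:m:2,2:mp:2) = (1.0-omega)*v(2:m:2,2:mp:2) + omega*0.25* &
       ( v(1:m-1:2,2:mp:2) + v(3:m+1:2,2:mp:2) + v(2:m:2,1:mp-1:2) + v(2:m:2,3:mp+1:2) )

  ! Black cells (i+j odd): these now see the freshly updated red values.
  v(2:m:2,1:mp:2) = (1.0-omega)*v(2:m:2,1:mp:2) + omega*0.25* &
       ( v(1:m-1:2,1:mp:2) + v(3:m+1:2,1:mp:2) + v(2:m:2,0:mp-1:2) + v(2:m:2,2:mp+1:2) )
  v(1:m:2,2:mp:2) = (1.0-omega)*v(1:m:2,2:mp:2) + omega*0.25* &
       ( v(0:m-1:2,2:mp:2) + v(2:m+1:2,2:mp:2) + v(1:m:2,1:mp-1:2) + v(1:m:2,3:mp+1:2) )
end subroutine redblack_sor_sweep
```

In the full parallel code the ghost rows would be refreshed between the red and black half-sweeps, since the black update needs the neighbors' newly computed red values.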
Parallel F90 SOR code. Cartesian topology is employed in this implementation.
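The SOR source file itself is not reproduced here; the following is a minimal sketch of how the Cartesian-topology part of such a code might look. The names comm1d, below, and above, the grid sizes, and the program layout are assumptions; MPI_CART_CREATE, MPI_CART_SHIFT, and MPI_SENDRECV are the standard MPI routines involved.

```fortran
program sor_topology_sketch
  implicit none
  include 'mpif.h'
  integer, parameter :: m = 64          ! global number of interior points per direction (assumed)
  integer :: comm1d, ierr, nprocs, myrank, mp
  integer :: dims(1), below, above, status(MPI_STATUS_SIZE)
  logical :: periods(1), reorder
  real, allocatable :: v(:,:)

  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)

  ! Create a 1D, non-periodic Cartesian topology over all processes.
  dims(1) = nprocs
  periods(1) = .false.
  reorder = .true.
  call MPI_CART_CREATE(MPI_COMM_WORLD, 1, dims, periods, reorder, comm1d, ierr)
  call MPI_COMM_RANK(comm1d, myrank, ierr)   ! rank within the Cartesian communicator

  ! MPI_CART_SHIFT returns the ranks of the strips below and above this one;
  ! at the physical boundaries it returns MPI_PROC_NULL automatically.
  call MPI_CART_SHIFT(comm1d, 0, 1, below, above, ierr)

  mp = m / nprocs                       ! m' solution rows per process (assumes p divides m)
  allocate(v(0:m+1,0:mp+1))
  v = 0.0                               ! initial guess; boundary cells would be set here

  ! Ghost-row exchange, performed before each red and each black half-sweep.
  call MPI_SENDRECV(v(1:m,mp),   m, MPI_REAL, above, 0, &
                    v(1:m,0),    m, MPI_REAL, below, 0, comm1d, status, ierr)
  call MPI_SENDRECV(v(1:m,1),    m, MPI_REAL, below, 1, &
                    v(1:m,mp+1), m, MPI_REAL, above, 1, comm1d, status, ierr)

  ! ... red-black SOR sweeps and the convergence test would go here ...

  deallocate(v)
  call MPI_FINALIZE(ierr)
end program sor_topology_sketch
```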
A few comments …
- Due to array syntax in F90, the update within an iteration must be done simultaneously for all cells; do loops, on the other hand, permit the cells to be updated one by one and hence follow Gauss-Seidel more closely. A brief illustration of the difference appears after this list.
- Had we chosen a 2D decomposition, not only would the boundary conditions be more cumbersome to implement, but the interior subdomains (i.e., those away from the physical boundaries) would have had blue cells on all of their borders, which would mean additional message passing and hence hinder overall performance.
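As a minimal illustration of the first comment, the two fragments below (assuming a real array u(0:m+1,0:m+1) and integers i, j, m, none of which are taken from the tutorial's code) apply the same five-point stencil. The array-syntax version evaluates the whole right-hand side before assigning, so every cell is updated from old values only (Jacobi-like); the do-loop version overwrites u in place, so later cells see already-updated neighbors (Gauss-Seidel-like).

```fortran
! Array syntax: the entire right-hand side is evaluated before any element of u is
! assigned, so all interior cells are updated simultaneously (Jacobi-like).
u(1:m,1:m) = 0.25 * ( u(0:m-1,1:m) + u(2:m+1,1:m) + u(1:m,0:m-1) + u(1:m,2:m+1) )

! Do loops: u is overwritten cell by cell, so cells visited later in the loop already
! see updated neighbor values (Gauss-Seidel-like).
do j = 1, m
   do i = 1, m
      u(i,j) = 0.25 * ( u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) )
   end do
end do
```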
Scalability Plot of SOR
Here is a plot showing the scalability of the MPI SOR implementation for the Laplace equation on an SGI Origin 2000 shared-memory multiprocessor.