Multigrid - Efficient Domain Partitioning for Stencil-based Parallel Operators

Multigrid methods [25, 37, 63, 72] are hierarchical algorithms that solve certain sparse linear systems of equations with N unknowns optimally, in O(N) time. They are based on the idea of using grids of decreasing mesh resolution [63, 72, 73]. Iterative schemes [25, 37, 63, 72] such as Gauss-Seidel and weighted Jacobi (ω-Jacobi) remove high frequency error components very effectively, a property known as smoothing, but reduce the low frequency part of the error spectrum very slowly, which produces an unacceptable convergence rate for large numbers of unknowns. These low frequency error components can be represented as relatively high frequency components on coarser grids [25, 59, 63]. Standard coarsening is applied in all dimensions, so in 3-D each coarser level has one-eighth of the points of the immediately finer level [37, 59, 63]. When the iterative schemes are applied on the coarser grid, they filter out these high frequency errors and speed up the overall convergence. In general, the smooth or low frequency error modes are associated with large eigenvalues of the iteration matrix and the high frequency error components with small eigenvalues. These smoothing properties of certain iterative methods, together with the equivalent systems of equations on the coarser grids, form the basis of Multigrid [59]. A vast repository about Multigrid can be found on-line [74], along with a large list of references in a file named mgnet.bib. Multigrid finds a particular use in Computational Fluid Dynamics (CFD), where it has been used to solve problems such as viscous flow around aircraft and fluid flows in industrial machines [75].
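The smoothing behaviour described above can be illustrated with a minimal sketch. It assumes a 1-D model problem (-u'' = 0 with zero Dirichlet boundaries, so the iterate is itself the error), ω = 2/3 and grids of 63 and 31 interior points. None of these choices come from the thesis, and the function names are hypothetical.

```python
import math


def weighted_jacobi_sweep(u, omega=2.0 / 3.0):
    """One weighted-Jacobi sweep for -u'' = 0 with zero Dirichlet boundaries."""
    n = len(u)
    new = u[:]
    for j in range(n):
        left = u[j - 1] if j > 0 else 0.0
        right = u[j + 1] if j < n - 1 else 0.0
        new[j] = (1.0 - omega) * u[j] + omega * 0.5 * (left + right)
    return new


def damping_after(k, n=63, sweeps=10):
    """Max-norm error after `sweeps` sweeps, relative to the initial error.

    The initial error is the k-th sine mode on a grid of n interior points.
    """
    u = [math.sin(k * math.pi * (j + 1) / (n + 1)) for j in range(n)]
    start = max(abs(v) for v in u)
    for _ in range(sweeps):
        u = weighted_jacobi_sweep(u)
    return max(abs(v) for v in u) / start


low = damping_after(k=1)    # smooth mode: barely reduced
high = damping_after(k=48)  # oscillatory mode: strongly reduced

# The same mode number is relatively more oscillatory on a coarser grid,
# so the smoother removes it much faster there.
mode16_fine = damping_after(k=16, n=63)
mode16_coarse = damping_after(k=16, n=31)
```

Comparing `low` with `high`, and `mode16_fine` with `mode16_coarse`, shows the two effects that Multigrid combines: fast damping of oscillatory modes, and the conversion of smooth modes into oscillatory ones by coarsening.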

Multigrid can be viewed as a recursive algorithm and is best expressed in the form of a two-grid algorithm, also called the coarse grid correction algorithm [25, 37, 59, 63, 72]. A two-grid algorithm applies a few iterations of the smoother (ω-Jacobi or Gauss-Seidel) on the finest grid, calculates the residual, restricts this residual to the coarse grid, solves an equivalent linear system of error equations exactly on the coarse grid to approximate the error, and interpolates this error to the fine grid to obtain a better approximation of the solution there. The procedure is repeated until the desired convergence is achieved at the fine level [25, 59, 63]. This scheme is explained in detail in Chapter 6. When this two-grid scheme is repeated recursively on the coarse levels, it gives rise to the Multigrid algorithm. Typically the numbers of pre-smoothing (ν1) and post-smoothing (ν2) iterations of the smoother vary between one and three for most practical problems [59]. Depending on the order in which the grids are traversed, two common types of cycles are distinguished: V-cycles and W-cycles [25, 63]. The shape is dictated by a parameter called the cycle index (γ), which determines the number of times the recursive Multigrid algorithm is called at a particular coarse grid level. Thus, γ = 1 produces a V-cycle and γ = 2 produces a W-cycle (where each coarse grid level is solved twice in an approximate manner) [59].
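The recursive structure and the role of the cycle index can be sketched as follows. This is a minimal illustration, not the scheme of Chapter 6: it assumes a 1-D Poisson problem (-u'' = f, zero Dirichlet boundaries), weighted Jacobi with ω = 2/3, full-weighting restriction and linear interpolation, and all function names are hypothetical.

```python
import math


def smooth(u, f, h, sweeps, omega=2.0 / 3.0):
    """Weighted-Jacobi sweeps for -u'' = f; returns a new list."""
    u = u[:]
    n = len(u)
    for _ in range(sweeps):
        old = u[:]
        for j in range(n):
            left = old[j - 1] if j > 0 else 0.0
            right = old[j + 1] if j < n - 1 else 0.0
            u[j] = (1 - omega) * old[j] + omega * 0.5 * (left + right + h * h * f[j])
    return u


def residual(u, f, h):
    """r = f - A u for the standard three-point stencil."""
    n = len(u)
    r = []
    for j in range(n):
        left = u[j - 1] if j > 0 else 0.0
        right = u[j + 1] if j < n - 1 else 0.0
        r.append(f[j] - (2 * u[j] - left - right) / (h * h))
    return r


def restrict(r):
    """Full weighting from 2m+1 fine points to m coarse points."""
    m = (len(r) - 1) // 2
    return [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2] for i in range(m)]


def prolong(e, n):
    """Linear interpolation from m coarse points to n = 2m+1 fine points."""
    fine = [0.0] * n
    for i, v in enumerate(e):
        fine[2 * i] += 0.5 * v
        fine[2 * i + 1] += v
        fine[2 * i + 2] += 0.5 * v
    return fine


def cycle(u, f, h, gamma=1, nu1=2, nu2=2):
    """One Multigrid cycle: gamma=1 gives a V-cycle, gamma=2 a W-cycle."""
    n = len(u)
    if n == 1:
        return [f[0] * h * h / 2.0]        # coarsest grid: exact solve
    u = smooth(u, f, h, nu1)               # pre-smoothing
    rc = restrict(residual(u, f, h))       # restrict the residual
    ec = [0.0] * len(rc)
    for _ in range(gamma):                 # gamma recursive calls per level
        ec = cycle(ec, rc, 2 * h, gamma, nu1, nu2)
    u = [a + b for a, b in zip(u, prolong(ec, n))]  # coarse grid correction
    return smooth(u, f, h, nu2)            # post-smoothing


def residual_reduction(gamma, cycles=10, n=63):
    """Final max-norm residual divided by the initial one for a zero start."""
    h = 1.0 / (n + 1)
    f = [math.pi ** 2 * math.sin(math.pi * (j + 1) * h) for j in range(n)]
    u = [0.0] * n
    r0 = max(abs(v) for v in residual(u, f, h))
    for _ in range(cycles):
        u = cycle(u, f, h, gamma=gamma)
    return max(abs(v) for v in residual(u, f, h)) / r0
```

Calling `residual_reduction(1)` and `residual_reduction(2)` compares V-cycles and W-cycles on the same problem. The grid size must be of the form 2^p - 1 so that every level can be coarsened down to a single unknown.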

A method called the Full Approximation Scheme (FAS) may be used when the discretization operator is non-linear. It is so called because a full approximation of the solution is computed on the coarse grid, instead of solving only for the error [63, 76]. Another method, called Newton-Multigrid, is also used in non-linear settings, and a comparison of the two methods appears in [77]. When the coarse grid is used recursively to approximate the initial guess on the fine grid, it gives rise to the concept of nested iteration [63]. Nested iteration combined with the recursive Multigrid technique gives rise to Full Multigrid (FMG) methods. FMG usually starts on the coarsest grid, solves the problem there accurately, interpolates the solution to the next finer grid and then applies a Coarse Grid Correction (CGC) or Full Approximation Scheme (FAS) cycle before interpolating further to the next finer level [59, 63].

2.5.1 Types of Multigrid methods

Multigrid methods are broadly classified as Geometric or Algebraic Multigrid methods [63, 75]. Algebraic Multigrid uses no geometric information about the grid on which the PDE or any other problem is solved, and is thus better described as an Algebraic Multilevel method than as Algebraic Multigrid [75]. Though the flexibility of Algebraic Multigrid is unparalleled, the higher throughput of Geometric Multigrid in terms of unknowns solved per second makes it extremely attractive [78]. A discussion of Algebraic Multigrid is beyond the scope of the thesis, but a gentle introduction can be found in [63]. The classical Multigrid method refers to the Geometric version, and we discuss it exhaustively in Chapter 6 of the thesis.

2.5.2 Parallelization and Coarser Grids

Parallelization introduces a bottleneck when the coarser grids in Multigrid are visited, due to the low ratio of computation to communication. This problem of inefficient solution on coarse grids does not exist in serial Multigrid codes [11]. Communication aggregation and vertical traffic avoidance do not offer substantial benefits at coarser levels [62]. Further, for very large core counts, it is the coarsest grid that accounts for the largest share of the run-time [79], as the time spent in MPI_Waitall() increases. Researchers have explored vertical and horizontal communication avoidance at coarser levels and found them to be ineffective [80].

When a large number of processors (or cores) are present, the coarsest grid can be solved in two standard ways. The first method is to agglomerate the coarse grid points from every processor onto a single processor and then solve the problem. Two constraints exist for a single processor solve: the complete coarse grid problem must fit into the memory of one processor, and the solve time should be optimal. The second method is a generalization of the first, where the coarse grid points from all processors are collected on a subset of processors and the problem is again solved in parallel. The first approach incurs zero communication cost during the solve (excluding the cost of agglomeration and of transferring the solution afterwards), whereas the second has a lower communication cost than solving the coarsest grid problem on all the processes [11, 81]. Tasks from processes can be aggregated onto a subset of processes (agglomeration), or copies of the combined task can be solved on different subsets of processes (redundant approach) [82]. The redundant approach is also inherently resilient: if a node in one subset fails, the result does not need to be re-computed.
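The relationship between the two agglomeration strategies can be sketched as a rank mapping. This is a hypothetical illustration, not the thesis implementation: the function name `agglomeration_targets` and the contiguous grouping of ranks are assumptions, and no MPI calls are made.

```python
def agglomeration_targets(num_ranks, num_active):
    """Return, for every rank, the rank that receives its coarse grid points.

    Ranks are grouped into contiguous blocks (an assumed layout) and the first
    rank of each block performs the solve. At most `num_active` ranks remain
    active after agglomeration.
    """
    if not 1 <= num_active <= num_ranks:
        raise ValueError("need 1 <= num_active <= num_ranks")
    group_size = -(-num_ranks // num_active)  # ceiling division
    return [(rank // group_size) * group_size for rank in range(num_ranks)]


single = agglomeration_targets(8, 1)  # first method: everything goes to rank 0
subset = agglomeration_targets(8, 2)  # second method: two solving ranks
```

With `num_active = 1` the mapping reduces to the single processor solve, which shows why the second method is a generalization of the first.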

Scalability of the coarsest level solvers is an extremely important issue [62]. The coarsest level solver may be a direct solver [11] such as MUMPS (Multifrontal Massively Parallel sparse direct Solver) [83] or SuperLU (Supernodal LU) [84] in both Geometric and Algebraic Multigrid. Coarsest level iterative solvers vary with the problem being solved, ranging from a constant number of relaxations at the coarsest level to a full Algebraic Multigrid solver. As an example, in a comparison-based study, the truncated V-cycle was terminated when the coarse level contained a $4^3$ domain, and twenty-four iterations of the Red-Black Gauss-Seidel method were performed at the coarsest level [62]. Researchers have preferred direct solvers to an aggregation of the coarse grid problem on Blue Gene/P systems, which have a number of cores of the order of $3 \times 10^5$ [85]. These direct solvers are very difficult to implement, as a stable pivot choice is needed [86], and have sub-optimal efficiency. Alternatively, an appreciable number of unknowns can be kept at the coarsest level and a highly parallel solver such as the Chebyshev semi-iterative solver or the unpreconditioned Conjugate Gradient (CG) method can be used [11]. Researchers have attempted to make a rough estimate of the coarsest grid solve using the Conjugate Gradient method with the heuristic $\sqrt[d]{N}/2^{l-1}$, where d is the number of dimensions, N is the number of unknowns and l is the level of the coarsest grid. The coarsest grid problem was then solved using this CG approximation [79, 85].
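As a small worked example, the heuristic can be evaluated directly. The sketch below reads it as the d-th root of N divided by 2^(l-1) and assumes that levels are numbered from l = 1 on the finest grid. Both are interpretations rather than statements from the cited works, and the function name is hypothetical.

```python
def coarsest_cg_estimate(num_unknowns, dims, coarsest_level):
    """Evaluate N**(1/d) / 2**(l-1) for the coarsest grid level l.

    Assumes level 1 is the finest grid, so the value equals the number of
    points per dimension remaining on the coarsest grid.
    """
    if dims < 1 or coarsest_level < 1:
        raise ValueError("dims and coarsest_level must be at least 1")
    return num_unknowns ** (1.0 / dims) / 2 ** (coarsest_level - 1)


# Hypothetical case: 512^3 unknowns in 3-D, coarsest grid at level 6.
estimate = coarsest_cg_estimate(512 ** 3, dims=3, coarsest_level=6)
```

For this case the estimate evaluates to approximately 16, since 512 / 2^5 = 16.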

In our experiments with parallel Geometric Multigrid, we also fix the number of Jacobi iterations at the coarsest level such that the number of V-cycles does not increase. To fix the iterations, we first solve the coarsest grid problem to a high degree of accuracy and note the number of V-cycles. We then remove the global communication calls (MPI_Allreduce()) at the coarsest grid level and fix the coarsest grid iterations to the smallest number for which the number of V-cycles does not increase. To find this least value, the coarsest grid iterations are systematically decreased until the number of V-cycles starts to increase.
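The search for the least iteration count can be sketched as a simple loop. This is an illustration under stated assumptions rather than the code used in the experiments: `count_vcycles` is a hypothetical callback that runs the solver with a given number of coarsest grid iterations and returns the number of V-cycles, and `toy_vcycles` is a made-up stand-in for a real run.

```python
def least_coarse_iterations(count_vcycles, baseline_vcycles, start_iters):
    """Smallest coarsest-grid iteration count that keeps the V-cycle count.

    count_vcycles(k): hypothetical callback returning the number of V-cycles
        needed when k iterations are performed on the coarsest grid.
    baseline_vcycles: V-cycles observed with an accurate coarsest-grid solve.
    start_iters: a count known to be large enough to match the baseline.
    """
    if count_vcycles(start_iters) > baseline_vcycles:
        raise ValueError("start_iters already increases the V-cycle count")
    best = start_iters
    for k in range(start_iters - 1, 0, -1):  # decrease systematically
        if count_vcycles(k) > baseline_vcycles:
            break                            # V-cycles started to increase
        best = k
    return best


def toy_vcycles(k):
    """Made-up model: 9 V-cycles once k >= 12, more V-cycles below that."""
    return 9 if k >= 12 else 9 + (12 - k)


least = least_coarse_iterations(toy_vcycles, baseline_vcycles=9, start_iters=40)
```

Each call to `count_vcycles` stands for a full solver run, so in practice a larger decrement or a bisection may be preferable when the starting count is far above the least value.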
