Difficulties in validating the hypothesis

In the previous chapter, we formulated a strategy for minimizing the cache-misses of a sub- domain and showed the superiority of such partitions by experimenting on single grids. Our comparison showed that our cache-minimizing topologies performed better than the communication minimizing topology for almost all combinations of grid sizes and process counts. Overlap of communication with computation formed a significant part of our analytical derivation for cache-minimizing topologies. The reason is that when communication is overlapped with com-

putation, both while packing/unpacking and communicating data, the next-to-halo layers are accessed separately after the halo data arrives. This has the advantage of MPI advancing its communication progress engine while the serial computing thread updates the Independent Compute kernel but at the same time suffers from a disadvantage that the next-to-halo layers now cannot be updated along with the Independent computational kernel, resulting in extra cache-misses. There are several reasons why we emphasize overlapping communication with computation and we list these below:

1. Non-blocking communication in MPI : The reason why these non-blocking routines exist is that we are expected to overlap communication with computation. MPI 3.1 also has versions for non-blocking collective operations.

2. Increasing distance: As the nodes grow fatter and the number of nodes in a cluster continues to increase, the distance between cores is increasing. Thus, it would/has become imperative to overlap communication with computation in future/current architectures.

The hypothesis that we formulated in the previous chapter holds only partially when eval- uating Adaptive Mesh Refinement in BoxLib. For single grids, the communication minimizing topology (MDC) never outperforms the cache-minimizing topologies for any data size and core counts in our experiments. Since the codes are in Fortran, we also took into account the reverse communication minimizing topology (Rev. MDC) but there existed topologies which outperformed both MDC and Rev. MDC for all the cases. This demonstrates that the communication minimizing topology is not the optimal choice for single grids as shown previously. For AMR codes, there existed cases where the MDC was outperformed by specific non-cubic sub-domains, thus establishing that the MDC is not always the optimal choice at all data sizes or processor counts. However, the superiority of the cache-minimizing topologies was not at all clear cut in these cases. The following plausible reasons explain why our hypothesis partially fails when considering AMR using BoxLib, and also the difficulties in analyzing BoxLib codes.

1. Communication and Computation: In BoxLib, communication of the halo zones is not overlapped with computation and, further, the packing and unpacking of data from the boxes does not use MPI derived data types. Thus, there is no overlapping while packing/unpacking or communicating data. The sub-domain is updated when the data arrives from neighbouring processes and it is treated as a contiguous sub-domain without any need for updating the planes separately. This completely eliminates the cache-misses that we calculated separately for the Dependent Planes in our abstract high level mathematical model for minimizing cache-misses.

2. Internal data structures: It is difficult to estimate the size of the metadata and the consequent effect on the application performance that BoxLib maintains for both single grids and AMR. Clearly, the metadata for the latter is more complicated and much larger in size.

3. No Control over distribution of boxes: The user does not have any control over distribution of boxes in AMR with BoxLib. This is completely controlled by BoxLib using the Knapsack or Morton ordering with a dynamic switching scheme implemented to choose the appropriate algorithm. Since boxes are distributed per-level, BoxLib does not distin- guish between inactive or active boxes. An inactive box is a box where the solution is not updated: thus there is a large probability that the active boxes may not be load-balanced.

4. Load-balancing over shape: The load-balancing, i.e. the number of boxes per core, changes when the shape of the box is changed even though the volume remains constant. Thus, the load-balancing algorithm used by BoxLib takes into account the sub-domain points in each direction.

5.11 Summary

Adaptive Mesh Refinement (AMR) is a computational technique where local regions on a mesh are refined to obtain an increased accuracy in those regions. It helps to direct the compute resources towards regions of interest (or higher error) rather than devoting them to a globally refined mesh. Though theoretically this is an ideal strategy, software packages such as BoxLib which are used for building complex multiscale multiphysics structured AMR applications, in- cur additional overheads in the form of maintaining metadata and synchronization in a parallel settings.

The parallelization in BoxLib is abstracted away from the user and thus the user is free to focus on the problem, but at the cost of losing some of the control of the execution. Using BoxLib, we have tested the applicability and extension of our previously formulated hypothesis that there exist cache-miss minimizing topologies which outperform the communication mini- mization topology in solving PDEs using point iterative methods such as Jacobi iteration on structured 3-D uniform grids. We further extended this evaluation to AMR codes with up to 3 levels (2 refined, 1 unrefined). All the codes for the uniform grid and AMR were developed using Fortran90 in BoxLib and tested with no overlap of communication and computation. In this process, we implemented an MPI Cartesian topology of MPI processes that can be used in BoxLib for single and multiple boxes per core for uniform meshes. Further, we compared the execution timings of uniform as well as AMR codes, while profiling the cache-misses for both cases to experimentally investigate the validity of our aforementioned hypothesis.

Chapter 6

Multigrid

In the previous chapter we evaluated the use of non-cubic sub-domains on single grids and with Adaptive Mesh Refinement (AMR) using a library called BoxLib. We demonstrated that our hypothesis, that there exist cache-miss minimizing domain partitions (or Boxes) that outperform cubic sub-domains, holds true for uniform meshes even with blocking communication (as in BoxLib). When using a structured, nested, AMR hierarchy, the hypothesis is only partially true due to a multitude of issues. These issues are strongly linked to the BoxLib implementation and include the load imbalance and the non-overlap of communication with computation. In this chapter, we continue to evaluate our hypothesis, but now using a multiple grid, hierarchical convergence acceleration technique called Geometric Multigrid. Our results will validate our hypothesis for this important class of iterative method, but will also uncover additional subtle factors in determining optimal sub-domain dimensions. The key focus in this chapter again remains on investigating, quantifying, measuring and improving the parallel efficiency by predicting high performing domain partitions.

6.1 Introduction

After a domain has been discretized to numerically approximate a linear PDE, iterative methods such as Jacobi, weighted Jacobi (ω-Jacobi), Gauss-Seidel (GS), Red-Black Gauss-Seidel (RBGS), Conjugate Gradient (CG) and others can be used to compute the solution of this discrete system [33, 37, 52]. Due to the slow rate of convergence of these iterative methods, and the time taken to solve large systems on uniform structured grids, multilevel algorithms have been created that accelerate the rate of convergence to the solution. The Multigrid [25,63] method is an optimal hierarchical method which can be used for solving sparse systems of linear equations that arise from a local discretization of Elliptic PDEs in O(N ) time, where N is the number of unknowns or degrees of freedom (dof) in the system. The hierarchy in Multigrid consists of several linear systems corresponding to discretizations on several levels of grids of

decreasing resolution, where the finest level grid represents the actual problem to be simulated. It accelerates the convergence of the solution by quickly and systematically eliminating low frequency error components on the series of coarse grids. To further decrease the solve time of Multigrid methods, they are parallelized on distributed, shared memory or hybrid architectures to allow simulation of extremely large scale problems [86, 143, 144], where the number of un- known variables can be of the order of billions or trillions. It is the parallelization of Multigrid that is challenging and requires a careful design and implementation to achieve near perfect Weak Scaling and thus preserve its theoretical optimality.

When Multigrid is parallelized over distributed-shared memory architectures, traditionally, the domain partitioning creates cubic partitions of the mesh to minimize overall communication. We extend and apply our high level analytical model in the scenario of multiple grid levels of Multigrid to investigate its effectiveness on this optimal algorithm. To this effect, we first extend the model to Geometric Multigrid (GMG) and again show that “close to 2-D” partitions for GMG can give higher performance than the partitions returned by the default MPI Dims create() function which minimizes the communication volume by default. Further, our model seeks to put this in the context of all the factors that might influence the choice of sub-domain shape and size. Thus, we qualitatively and quantitatively consider factors such as cache-misses, prefetching, cache-eviction policy, Vectorization etc., and explore their effect on determining optimal sub-domain dimensions. Though these factors have been separately well explored in the literature, the focus of our work is on establishing a connection between them and domain partitioning. We present the results of our investigations and discuss their limitations to open further research avenues. It may be noted that we use the term Multigrid to refer to GMG and not Algebraic Multigrid (AMG), the latter being beyond the scope of the current work.

The chapter begins by giving a general introduction to GMG and our aim in the current work. Following it is a detailed description of GMG and an explanation of the terms associated with it. We then describe the extension of our model to Multigrid along with underlying assumptions. An attempt to identify, explain and connect various serial parameters to decide optimal sub-domain dimensions evolves as the next logical step. This step can be considered as a part of the single uniform grid and is equally applicable there but its conception lies in the multiple grids scenario. Next we describe the mixed-boundary value test problem followed by our experimental results. Our results lead us into conclusions and a discussion of our work.

6.2 Motivation and Contribution

Multigrid adds a significant layer of complexity over single level uniform grid solvers due to a multiple level grid hierarchy, a decreasing computation to communication ratio at coarser grid

levels and the appearance of inter-grid transfer operators which are based upon higher order stencils themselves. In the pure sense AMR is a grid resolution technique whereas Multigrid is a convergence acceleration method. Both these computational techniques are extremely important in Scientific Computing and belong to the set of candidates for Exascale computing. Though fundamentally different, their structured versions have a common feature in the form of using uniform structured grids of varying resolution. This common factor, the optimality of Multigrid for Elliptic problems, and its vast applications in real world problems become our motivation for extending/applying our high level mathematical model to Parallel GMG. The following are the contributions of the current chapter:

– Extension of our quasi-cache-aware model for minimizing cache-misses to Parallel GMG.

– Demonstration that the fine grid execution time dominates the total solve time and hence even when a topology is sub-optimal at coarser levels, this cannot offset the effect of the optimal topology at the finest level.

– Realization that the Smoothing, Restriction and Interpolation operators have equivalent characteristic expressions for cache-misses, even when the Smoothing operator is a 7-pt stencil and the latter operators are represented by a 27-pt stencil.

– Identification and connection of other Serial Control Parameters (SCP) such as Vectoriza- tion, Prefetching, Least Recently Used (LRU) policy, and Cache Line Utilization (CLU) to optimal sub-domain dimensions.

– Experimentally verify that our model is independent of the hardware (using test platforms ARC2 and ARC3 at the University of Leeds) and software by using a combination of compilers (Intel and GNU) and MPI implementations (OpenMPI and Mvapich2) to obtain the same relative behaviour of topologies, along with the observation that the execution timing curve is a characteristic of the compiler that is used.

– Developing a lightweight, dynamic cache space tiling/loop-blocking heuristic, dependent on the shared L3 cache and the number of arrays in the Working Set Size (WSS).

– Demonstrate the effect of domain decomposition by passing an equal number of X, Y and Z planes through a hierarchy of network elements and measuring the execution time.

– Measuring the positive/negative accuracy of our model to demonstrate that the accuracy need not decrease with an increasing number of cores.

– An overall demonstration through theory and experimentation that the problem of domain decomposition for GMG is much more complex than just minimizing the communication volume.

Level : L

Level : L-1

Level : L-2

Figure 6.1: Decreasing mesh resolution with decreasing level in 2-D Geometric Multigrid

6.3 Multigrid

Local Iterative schemes [25, 37, 63, 72] such as weighted Jacobi, Gauss-Seidel, Red-Black Gauss- Seidel, can remove high frequency error components quickly (known as smoothing) but decrease the low-frequency error spectrum very slowly. Thus, the overall convergence is slow. These low- frequency components can be represented as relatively high frequency components on coarser grids [25, 59, 63] and thus effectively smoothed on that grid. These smoothing properties of certain iterative methods, and the equivalent system of equations at various levels, form the basis of Multigrid [59].

Multigrid [25, 37, 63, 72] is a multilevel convergence acceleration concept that involves using coarser forms [63, 72, 73] of the given fine grid discretization to remove the low-frequency errors and more efficiently provide an estimate of the approximated solution. Figure 6.1 shows a grid hierarchy of decreasing grid resolution where the coarser grids can be used to remove the low frequency errors. Clearly, the number of unknowns on the coarse grid are fewer and this leads to reduced computation on those grids. Further, the convergence factor of a single grid smoother is approximately 1 − O(h2_{), where h is the grid spacing (assumed as uniform in all directions)} and for each successive coarse grid, the grid resolution decreases [63]. As mentioned in Chapter 2, depending on the pattern of the traversal between grids, two common types of cycles are categorized as V-cycles and W-cycles [25, 63]. The following section introduces notations to explain the concept of Multigrid in detail, focusing on the V-cycle.

6.3.1 Notation used and Multigrid Steps

Let Ah_uh_{= f}h_{denote a linear system of equations arising from a local discretization of a linear} Elliptic PDE, where the superscript h denotes the grid spacing. Successive grid levels (finest to coarsest) are represented as: Ωh _{→ Ω}2h _{→ Ω}4h_{... → Ω}2ih_{. We use standard coarsening in} our implementations which reduces the total degrees of freedom by approximately one-eighth on the immediate coarser grid in 3-D (one quarter in 2-D). After ν1pre-smoothing iterations on Ωh_{, an approximation to u}h_{is obtained (denoted by v}h_{) and the residual is then calculated as} rh= fh− Ah_vh_{. A restriction (I}2h

h ) operator transfers this residual (r

h_{) to the next immediate} coarser grid (Ω2h_{). In the 2-grid method (detailed in the next section), the error e}2h_{is obtained} after solving A2h_e2h _{= r}2h _{(error equation) exactly on the coarser grid. This error is then} transferred back to the finer grid using the interpolation/prolongation operator (Ih

2h) to obtain a better approximation to the solution on the finer grid, followed by ν2post-smoothing iterations. For Multigrid, the error equation is not solved exactly, instead it is replaced by a recursive use of the 2-grid method to update the estimated error. Only at the coarsest level is an exact solve used. The recursive algorithm halts when the ratio of the current norm of the residual (||rh

k||) on the finest level to its initial norm ((||rh

0||)) becomes less than a specified tolerance. Typically the pre-smoothing (ν1) and post-smoothing (ν2) iterations of the smoother vary between one and three for most practical problems [59].

6.3.2 2-grid Algorithm

The basis of Multigrid is the 2-grid correction scheme which forms the heart of the Multigrid concept. The following sequence of steps explains this method in detail:

1. Choose a starting estimate for uh_.

2. Approximate uh satisfying Ahuh = fh using ν1 iterations of an iterative (smoothing) scheme, starting with the latest estimate, to obtain uh

approx = vh.

3. Calculate residual rh_{= f}h_{− A}h_vh _{for Ω}h_.

4. Transfer (Restriction) rh _{to Ω}2h_{: let it be denoted by r}2h_.

5. Solve A2h_e2h_{= r}2h _{exactly to obtain e}2h_.

6. Transfer (Interpolation/Prolongation) e2h to Ωh: let it be denoted by eh.

7. Obtain better approximate for uh _{i.e. u}h

approx= vh+ eh.

8. Improve uhapprox using ν2further iterations of the smoother.

The two vital steps of Restriction and Interpolation for transferring the residual and the error respectively, are explained in the next section. In general, we refer to them as Transfer operators or Inter-grid Transfer operators as they determine the flow of information between the fine and coarse grids.

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 176-184)