Adaptive Mesh Refinement (AMR) - Efficient Domain Partitioning for Stencil-based Parallel Opera

The accuracy of an approximate numerical solution of a PDE can be increased by increasing the resolution of the mesh, i.e. by decreasing the grid spacing. Since the error in the solution may be undesirably high only in certain regions, increasing the grid resolution locally in those regions is more efficient than increasing the resolution globally. Thus, the mesh obtained after discretization can be refined locally depending on the error, the geometric “interestingness” of the solution or any other relevant parameter. This technique is known as Adaptive Mesh Refinement (AMR) [87, 88]. The main goal of AMR is thus to obtain a desired accuracy of solution with the fewest possible mesh points, which also implies an efficient use of computational resources. AMR automatically adds mesh points to regions where a greater resolution is desired and removes points from regions where a low-resolution solution suffices [89]. Although AMR is complex to implement, it is extremely useful for applications involving large gradients, phase changes, discontinuities and shocks. Further physical examples include high Reynolds number flows interacting with solid objects, chemically reacting flows, cosmology simulations (where the resolution spans twelve orders of magnitude) and combustion problems [90].
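The idea of adding mesh points only where they are needed can be illustrated with a minimal one-dimensional sketch. It assumes a simple gradient-based indicator and a hypothetical threshold; the function names `flag_cells` and `refine` are illustrative only and do not correspond to any AMR library.

```python
import math

def flag_cells(values, dx, threshold):
    """Return indices i of intervals [x_i, x_{i+1}] whose gradient magnitude exceeds threshold."""
    flagged = []
    for i in range(len(values) - 1):
        grad = abs(values[i + 1] - values[i]) / dx
        if grad > threshold:
            flagged.append(i)
    return flagged

def refine(points, flagged):
    """Insert a midpoint into every flagged interval, leaving other intervals coarse."""
    refined = []
    for i, x in enumerate(points):
        refined.append(x)
        if i in flagged:
            refined.append(0.5 * (x + points[i + 1]))
    return refined

# Coarse uniform mesh on [0, 1] and a profile with a steep front near x = 0.5.
n = 11
points = [i / (n - 1) for i in range(n)]
values = [math.tanh(50.0 * (x - 0.5)) for x in points]

# The threshold is an assumed tuning parameter, not a recommended value.
flagged = flag_cells(values, dx=points[1] - points[0], threshold=2.0)
fine_points = refine(points, set(flagged))
```

Only the two intervals adjacent to the front are refined, so the refined mesh has 13 points rather than the 21 that a global halving of the grid spacing would require.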

2.6.1 Structured and Unstructured AMR

AMR can be used for both structured (SAMR) and unstructured (UAMR) meshes. UAMR is often based on Finite Element discretizations of unstructured meshes, but because of indirect memory references its implementations on cache-based architectures remain inefficient. SAMR uses logically rectangular grids refined spatially and temporally, and is categorized as either patch-based or tree-based. The main advantage of SAMR is the ease with which the neighbours of a mesh point can be determined, in general simply through array indices. Since identifying a neighbouring mesh point is straightforward, the efficiency of the method is expected to be high, and thus SAMR is used in applications with strict time constraints [89]. In tree-based schemes, grid elements are stored in k-way trees in which each element holds at least a pointer to its parent and an array of pointers to its children. Additionally, some metadata is stored, such as the element type (if multiple geometries are allowed), the refinement level and a boolean value that distinguishes boundary elements from non-boundary elements. The leaves of the tree are the active elements, and new elements are generated upon refinement [91]. The tree structure demands more storage space, and it is non-trivial to decide how to split this hierarchical data structure, which forces additional interprocessor communication [90] to exchange splitting information. Each node of a tree can contain a single cell or a contiguous block of elements (represented using arrays). The latter gives rise to block-structured AMR, in which the entire block is refined even if only a single cell within it needs refinement. Block-structured AMR can be implemented for both patch-based and tree-based schemes. The alternative approach of refining single cells offers much more flexible refinement, at the cost of indirect memory references [1]. Small grids are not recommended in complex applications such as AMR because of the increased metadata, the increased number of ghost cells with their associated computations, and the copying of data between different levels.
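The tree-based storage described above can be sketched as follows. This is a minimal illustration assuming a two-dimensional quadtree (k = 4) with one cell per node; the class `Node` and the helper `leaves` are hypothetical names, not taken from any of the cited packages.

```python
class Node:
    """One element of a quadtree: parent pointer, child pointers and metadata."""

    def __init__(self, level, origin, size, parent=None, is_boundary=False):
        self.level = level              # refinement level (0 = root)
        self.origin = origin            # lower-left corner (x, y)
        self.size = size                # edge length of the square element
        self.parent = parent            # pointer to the parent element
        self.is_boundary = is_boundary  # boundary / non-boundary flag
        self.children = []              # array of pointers to the children

    def is_leaf(self):
        return not self.children

    def refine(self):
        """Generate four children; this node stops being an active element."""
        if self.children:
            return
        half = self.size / 2
        x0, y0 = self.origin
        for dy in (0, 1):
            for dx in (0, 1):
                child_origin = (x0 + dx * half, y0 + dy * half)
                self.children.append(Node(self.level + 1, child_origin, half, parent=self))

def leaves(node):
    """Collect the active elements, i.e. the leaves of the tree."""
    if node.is_leaf():
        return [node]
    active = []
    for child in node.children:
        active.extend(leaves(child))
    return active

root = Node(level=0, origin=(0.0, 0.0), size=1.0, is_boundary=True)
root.refine()               # four level-1 elements
root.children[0].refine()   # refine one of them again
active = leaves(root)       # three level-1 leaves and four level-2 leaves
```

In a block-structured variant each node would hold an array of cells instead of a single cell, and `refine` would act on the whole block.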

2.6.2 Software Packages for SAMR

Some notable software packages for parallel Structured AMR (SAMR) are Chombo [2] and BoxLib [92] (both from Lawrence Berkeley National Laboratory), PARAMESH [1] (NASA) and SAMRAI [93] (Lawrence Livermore National Laboratory). A detailed survey of block-structured AMR can be found in [94]. PARAMESH uses a tree-based approach while the other three use a patch-based approach. Since load balancing is a critical issue, several algorithmic approaches such as Space-Filling Curves (SFCs), greedy algorithms, sensitivity analysis and Knapsack problems have been explored [90]. As an example, PARAMESH uses the Peano-Hilbert SFC [1] for load balancing, while BoxLib can use either a Knapsack strategy or an SFC.
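To make the load-balancing problem concrete, the following minimal sketch distributes boxes over processes with a simple greedy heuristic (largest box first, always onto the least-loaded process). It is an illustration of the greedy family mentioned above, not the Knapsack or SFC implementation of BoxLib or PARAMESH; the box costs are hypothetical cell counts.

```python
import heapq

def greedy_balance(box_costs, num_procs):
    """Assign boxes to processes: largest box first, onto the least-loaded process."""
    heap = [(0, proc) for proc in range(num_procs)]  # (current load, process id)
    heapq.heapify(heap)
    assignment = {}
    by_cost = sorted(enumerate(box_costs), key=lambda item: -item[1])
    for box_id, cost in by_cost:
        load, proc = heapq.heappop(heap)
        assignment[box_id] = proc
        heapq.heappush(heap, (load + cost, proc))
    loads = [0] * num_procs
    for box_id, proc in assignment.items():
        loads[proc] += box_costs[box_id]
    return assignment, loads

box_costs = [64, 32, 32, 16, 16, 8, 8, 8]  # assumed work per box (e.g. cell counts)
assignment, loads = greedy_balance(box_costs, num_procs=3)
```

Such a heuristic balances the work but ignores where the boxes lie in space; an SFC-based ordering additionally tends to keep neighbouring boxes on the same process, which can reduce communication.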

In a study on scaling Chombo to thousands of cores [95], researchers found that the performance bottleneck was the influence of the OS rather than the hardware or the application code. The migration from the Catamount micro-kernel to Compute Node Linux caused a 10% decrease in performance in an AMR benchmark due to complex interactions between the Linux libc heap management and the memory hierarchy. Since weak scaling is difficult to interpret in AMR, replication scaling was used instead: a hierarchy of grids and data points for a fixed number of cores is taken and replicated for higher concurrencies [95]. The affinity of data and threads (called geographical locality) has been stressed as important for good AMR performance, since data and work need to be re-partitioned dynamically.

2.6.3 BoxLib

BoxLib [4, 19] is a parallel, multiscale, multiphysics, patch-based AMR framework for structured grids written in C++ and Fortran90. We use and describe BoxLib in detail in Chapter 5. BoxLib uses a properly nested hierarchy of grids that is not based on a tree structure, i.e. there is no unique parent-child relationship between grids at two adjacent levels. The smallest unit of abstraction is a Box, and the boxes at each level are distributed independently of the boxes at the next level. BoxLib is the basis of several massive codes such as MAESTRO [96] and CASTRO [97]. Unfortunately, the BoxLib library is now deprecated, but a new framework called AMReX [98], which is similar to BoxLib and targeted at Exascale, has been released.

The major computational intensity in BoxLib lies in two types of computations: (i) point-wise evaluations, i.e. expressions of the form $\bar{\phi}_{i,j,k} = \phi_{i,j,k} + k\,(f^x_{i,j,k} + f^y_{i,j,k} + f^z_{i,j,k})$, and (ii) stencil evaluations, i.e. expressions of the form $\bar{\phi}_{i,j,k} = k\,\phi_{i,j,k} + m\,(\phi_{i \pm a,j,k} + \phi_{i,j \pm a,k} + \phi_{i,j,k \pm a})$, where $a$ is some scalar offset [19]. In a comparative study of hybrid parallelism using a combination of OpenMP and MPI, dividing the entire index range of the set of boxes owned by a process among the set of threads (Tiling) outperformed the strategy of dividing each box among the set of threads (Striping) by a factor of 5.6x [19]. Each tile can belong to only one box, and thus the tile index space is a subset of the box index space. The strategy of assigning one box to one thread has the disadvantage of leaving some threads idle if the number of boxes per MPI process is less than the number of threads. It is to be noted that tiling has a significant effect on stencil computations but little or no effect on point-wise computations [19]. Applying loop tiling, together with the improved loop vectorization that results from simplifying loops through loop fission, in Nyx (a hybrid application for cosmological simulations) improved the performance by an order of magnitude on the Intel Xeon Phi Knights Landing processor [19]. Tiling in the context of BoxLib exposes more parallelism and reduces the working set size of threads [99]. There is no language support for tiling; instead, tile loops and element loops are introduced manually to loop over tiles and over individual elements, respectively [99]. Determining the size of the tile for BoxLib kernels also remains an important research question. Shifting the burden of tiling from the application programmer to compilers has long been an aim of researchers [6, 100–102].
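The two kernel types can be sketched in a few lines. This is a minimal pure-Python illustration on nested lists, not BoxLib code: the names `make_field`, `pointwise` and `stencil` are hypothetical, the `±` in the stencil is interpreted as a sum over both neighbours in each direction, and boundary points are simply left unchanged.

```python
def make_field(n, fn):
    """Build an n x n x n field as nested lists, with fn(i, j, k) giving each value."""
    return [[[fn(i, j, k) for k in range(n)] for j in range(n)] for i in range(n)]

def pointwise(phi, fx, fy, fz, kc):
    """Point-wise evaluation: each output depends only on values at the same (i, j, k)."""
    n = len(phi)
    return make_field(
        n, lambda i, j, k: phi[i][j][k] + kc * (fx[i][j][k] + fy[i][j][k] + fz[i][j][k])
    )

def stencil(phi, kc, m, a=1):
    """Stencil evaluation on interior points: each output also reads six neighbours."""
    n = len(phi)
    out = [[row[:] for row in plane] for plane in phi]  # boundary values are copied
    for i in range(a, n - a):
        for j in range(a, n - a):
            for k in range(a, n - a):
                neighbours = (
                    phi[i - a][j][k] + phi[i + a][j][k]
                    + phi[i][j - a][k] + phi[i][j + a][k]
                    + phi[i][j][k - a] + phi[i][j][k + a]
                )
                out[i][j][k] = kc * phi[i][j][k] + m * neighbours
    return out

ones = make_field(4, lambda i, j, k: 1.0)
twos = make_field(4, lambda i, j, k: 2.0)
linear = make_field(4, lambda i, j, k: float(i + j + k))

pw = pointwise(ones, twos, twos, twos, kc=0.5)  # 1 + 0.5 * (2 + 2 + 2) everywhere
st = stencil(linear, kc=-6.0, m=1.0)            # Laplacian-like weights on a linear field
```

The point-wise kernel touches each memory location once, whereas the stencil kernel re-reads neighbouring values, which is consistent with tiling benefiting stencil evaluations far more than point-wise ones.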

Regional tiling is a hierarchical scheme in which a grid represents a rectangular index space, a contiguous division of the grid represents a region, and a logical division of the index space of a region represents a logical tile. Thus, a grid can be made up of multiple contiguous regions, and logical tiles are just index space divisions of the regions, which can be varied on a loop-by-loop basis. A special case is Logical tiling, in which each grid consists of a single contiguous region. When tiles are created, the length of the tile in the contiguous dimension is left uncut [6, 12, 99]. BoxLib uses OpenMP for threading, and tiling allows it to use coarse-grained threading instead of fine-grained loop-level threading. Specifically, the OpenMP parallel do loops are placed around tiles and not around individual loops [99].
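Logical tiling of the boxes owned by one process can be sketched as below. This is a hedged illustration only: it assumes the first index is the contiguous one (as in Fortran ordering), uses half-open index ranges, and distributes tiles over threads round-robin as a stand-in for an OpenMP loop over tiles. All names and sizes are hypothetical.

```python
from itertools import product

def split_range(lo, hi, size):
    """Split the half-open range [lo, hi) into chunks of at most `size` indices."""
    return [(start, min(start + size, hi)) for start in range(lo, hi, size)]

def make_tiles(box_lo, box_hi, tile_jk):
    """Tile one box; the contiguous i dimension is left uncut, j and k are divided."""
    i_ranges = [(box_lo[0], box_hi[0])]
    j_ranges = split_range(box_lo[1], box_hi[1], tile_jk[0])
    k_ranges = split_range(box_lo[2], box_hi[2], tile_jk[1])
    return [
        ((i0, j0, k0), (i1, j1, k1))
        for (i0, i1), (j0, j1), (k0, k1) in product(i_ranges, j_ranges, k_ranges)
    ]

def tiles_of_boxes(boxes, tile_jk):
    """Flatten the tiles of all boxes into one work list of (box id, tile) pairs."""
    return [
        (box_id, tile)
        for box_id, (lo, hi) in enumerate(boxes)
        for tile in make_tiles(lo, hi, tile_jk)
    ]

def assign_round_robin(work, num_threads):
    """Distribute the tile list over threads; each tile belongs to exactly one box."""
    return [work[t::num_threads] for t in range(num_threads)]

# Two boxes owned by one process, but four threads available.
boxes = [((0, 0, 0), (32, 16, 16)), ((32, 0, 0), (64, 16, 8))]
work = tiles_of_boxes(boxes, tile_jk=(8, 8))
per_thread = assign_round_robin(work, num_threads=4)
```

With one box per thread, two of the four threads would be idle here; with tiling, the six tiles give every thread some work.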

2.6.4 Error Estimation

There are multiple ways in which errors can be measured. In general, when solving a PDE using numerical methods, the actual solution is not known, and thus the accuracy of the solution needs to be estimated without full knowledge of the actual solution. For a system of linear equations of the form $Au = f$, we can define the residual $r$ as $r = f - Au^k$, where $u^k$ is the $k$th iterate of the approximated (computed) solution. Clearly, when the approximated solution $u^k = u^*$, where $u^*$ is the true solution, then $r = f - Au^k = f - Au^* = 0$. Further, since the residual is a vector in $\mathbb{R}^m$, we use some norm to check whether sufficient accuracy has been attained to stop the simulation. A norm is a mapping from a vector $u \in \mathbb{R}^m$ to the set of non-negative real numbers [34]. The two most common norms are described below.

1. Infinity or max-norm: The infinity norm is denoted by $\|\cdot\|_\infty$ and is defined as the maximum absolute value of the components of the vector, i.e. $\|e\|_\infty = \max_{1 \le i \le m} |e_i|$. Thus, the infinity norm implies that no component of the vector exceeds the max-norm in magnitude.

2. 2-norm: The 2-norm of a vector $e$ having $m$ components is defined as

$$\|e\|_2 = \sqrt{\sum_{i=1}^{m} e_i^2}.$$
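As a small worked example, the residual and the two norms can be computed as follows. The 3 x 3 system and the iterate `u_k` are made-up illustrative values, and the function names are hypothetical.

```python
import math

def residual(A, u, f):
    """Compute r = f - A u for a dense matrix stored as a list of rows."""
    return [fi - sum(aij * uj for aij, uj in zip(row, u)) for row, fi in zip(A, f)]

def max_norm(e):
    """Infinity norm: the largest absolute component."""
    return max(abs(x) for x in e)

def two_norm(e):
    """2-norm: the square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in e))

A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
f = [1.0, 0.0, 1.0]
u_exact = [1.0, 1.0, 1.0]  # satisfies A u = f exactly
u_k = [1.0, 0.5, 1.0]      # an assumed intermediate iterate

r = residual(A, u_k, f)                        # [-0.5, 1.0, -0.5]
r_inf, r_two = max_norm(r), two_norm(r)        # 1.0 and sqrt(1.5)
error = [a - b for a, b in zip(u_exact, u_k)]  # available only when u_exact is known
```

The residual can always be evaluated, whereas the error vector requires the true solution, which is why the residual norm is the usual stopping criterion in practice.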

In the current work we use test problems to investigate the performance of various domain partitions. Since their solutions are known to us, we may use the norm of the error vector, calculated from the actual solutions, to stop the simulation once sufficient accuracy has been obtained. We use this methodology to terminate our simulations in AMR, and we use the residual 2-norm as the stopping criterion in Multigrid. In some cases, we instead fix the number of iterations for performance comparisons. However, the accuracy measurement remains a valuable asset for verifying the correctness of our implementations.

Errors can be estimated a priori or a posteriori. In the context of AMR, a priori error estimates based on fundamental error analysis of discretization methods and geometry are insufficient in the presence of sharp changes in the solution or singularities [103]. They are insufficient in the sense that they provide information only on the asymptotic error behaviour and assume that the solution is regular. Thus, a posteriori error estimates based on the computed solution are needed to select the regions for further refinement [104]. Traditionally, AMR starts on a coarse mesh and, after an a posteriori error estimation, selects regions for further refinement. The coarsening and refinement of regions continue until sufficient accuracy is obtained. It is to be noted that we do not use AMR in this traditional way: we fix the refinements at the beginning of the simulation and keep them fixed. This treatment is sufficient to serve our performance studies. A detailed discussion of error estimation is beyond the scope of the current work.
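The three termination strategies mentioned above (error norm against a known solution, residual 2-norm, fixed iteration count) can be sketched with a single loop. The sketch uses plain Jacobi iteration on a small made-up system purely as a stand-in solver; it is not the AMR or Multigrid solver of this work, and the function name and tolerances are assumptions.

```python
import math

def jacobi_solve(A, f, tol=1e-8, max_iters=1000, u_exact=None):
    """Jacobi iteration that stops on an error norm, a residual norm or an iteration cap."""
    n = len(f)
    u = [0.0] * n
    for iteration in range(1, max_iters + 1):
        u = [
            (f[i] - sum(A[i][j] * u[j] for j in range(n) if j != i)) / A[i][i]
            for i in range(n)
        ]
        if u_exact is not None:
            # Known solution: stop on the max-norm of the error vector.
            measure = max(abs(a - b) for a, b in zip(u, u_exact))
        else:
            # Unknown solution: stop on the 2-norm of the residual r = f - A u.
            r = [f[i] - sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
            measure = math.sqrt(sum(x * x for x in r))
        if measure < tol:
            return u, iteration
    return u, max_iters

A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
f = [1.0, 0.0, 1.0]

u_res, iters_res = jacobi_solve(A, f)                             # residual criterion
u_err, iters_err = jacobi_solve(A, f, u_exact=[1.0, 1.0, 1.0])    # error criterion
u_fix, iters_fix = jacobi_solve(A, f, tol=0.0, max_iters=5)       # fixed iteration count
```

Setting `tol=0.0` disables the accuracy test, so the run performs exactly `max_iters` iterations, which mirrors fixing the iteration count for performance comparisons.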

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 62-65)