Software - Efficient Domain Partitioning for Stencil-based Parallel Operators

We group the software into two categories. The first category consists of the language compilers and MPI implementations. This is a primary category which we describe separately for ARC2 and ARC3 clusters. The second category consists of performance profiling and visualization tools which we do not describe separately for ARC2 and ARC3.

3.2.1 ARC2 Compilers and MPI Implementations

ARC2 uses the CentOS release version 6.9 ($lsb release -a) and the kernel version 2.6.32- 696.18.7.el6.x86 64 ($uname -a). Since we develop our programs in the C language, we use the Intel C/C++ compiler for compilation. There are multiple versions of the Intel icc compiler present on ARC2, managed using the module environment. Various module numbers for the Intel compiler are: intel/13.1.3.192 (default), intel/15.0.0, intel/16.0.2, intel/17.0.1 and the very recently installed intel/18.0.2. In our experiments, we use intel/16/0.2 and intel/17.0.1 but not intel/18.0.2 due to the unavailability of the latter during the course of the project. The corre- sponding icc compiler versions are ($icc --version): icc (ICC) 13.1.3, icc (ICC) 15.0.0, icc (ICC) 16.0.2, icc (ICC) 17.0.1, and icc (ICC) 18.0.2. The same output is obtained using$mpicc --version as well, as the MPI implementation internally uses the underlying C/C++ compiler.

There are multiple MPI implementations installed on the ARC2 cluster, namely, OpenMPI 1.6.5, multiple versions of Intel MPI and Mvapich2/1.9. Out of these we only use the OpenMPI 1.6.5 implementation. Our reasons for choosing this implementation are multiple. First, this seems to be the most popular choice in published literature. Second, it was designed with the goal of supporting Infiniband [66]. Third, just like Mvapich2 - a derivative of MPICH2 [117], it is publicly available. According to [118], all OpenMPI versions up till 1.8 have been either declared as retired or ancient. We use an updated version of OpenMPI (version 2.0.2) when using the latest ARC3 cluster and its details are described in the next section. As of writing this thesis, the current stable version seems to be OpenMPI v3.0 [118].

3.2.2 ARC3 Compilers and MPI Implementations

ARC3 uses the CentOS release version 7.4 ($lsb release -a) and the kernel version 3.10.0- 693.11.6.el7.x86 64 ($uname -a). There are various Intel C/C++ compilers, each activated by choosing the respective module, namely, intel/16.0.2, intel/17.0.1 and intel/18.0.2. The names of the respective icc compilers can be derived using $icc --version after loading the appropriate module and is the same as the name of the modules mentioned above. There are multiple GNU modules on ARC3, namely, gnu/6.3.0 and gnu/7.2.0. We use the C compiler gcc 6.3.0 on ARC3 to show the compiler independence of our model. Out of the multiple OpenMPI implementations on ARC3, we use OpenMPI 2.0.2 as the other implementation, namely, OpenMPI 2.1.3, was not available throughout the course of the project. In the Intel

MPI flavour we make use of Intel MPI 2017.1.132. Further, we conduct some experiments with Mvapich2/2.2 to further test the independence of our model from MPI implementations.

3.2.3 Other Tools

We use performance profiling tools to capture the cache-misses and other relevant performance metrics. TAU [119] (Tuning and Analysis Utilities) is a tool that can be used for profiling, tracing and sampling an application. It can be used with serial codes as well as parallel codes utilizing MPI, OpenMP and pthreads. The general steps consist of preparing the application for instrumentation, generating a profile and then examining the profile using a command line or a graphical tool. The graphical tool called Paraprof produces a visualization of the metrics captured and helps to identify the bottlenecks in the application. By default a single metric is collected, i.e. the time spent in different phases of execution but additional metrics such as cache-misses/hits, Floating Point operations, etc., can be configured using the TAU METRICS environment variable. One file per process is generated and if multiple metrics are specified, each one is written to a different directory starting with the identifier MULTI . The TAU MAKEFILE environment variable specifies what kind of a parallel program is to be profiled. We only make use of pure MPI programs but the options can include hybrid programs using MPI and OpenMP as well. TAU can be used without recompiling the program though it is recommended that the program should be recompiled using a script file provided by TAU, namely, tau cc.sh for C programs and tau f90.sh for Fortran programs. Without recompilation, the executable provided by TAU can be placed directly with the command used to execute the parallel program. As an example $mpirun tau exec <prog> is a completely valid instrumentation for an MPI program. TAU internally uses the PAPI (Performance API) [120] interface for recording various metrics. The commands $papi avail lists the various possible hardware counters supported by environment or architecture and$papi choose event checks whether the counters specified are compatible with each other.

Scalasca [121] is another performance analysis tool that works by instrumenting, analyzing and then examining the profile/trace. The scalasca module uses another measurement infras- tructure called Score-P [122]. Score-P can be used with other profiling tools as well, such as TAU [119]. Scalasca supports profiling of MPI, OpenMP, and Hybrid MPI+OpenMP programs as well as programs written in CUDA (Compute Unified Device Architecture). The profiling results can be visualized using the Cube tool. An advantage of Scalasca over TAU is that the former shows the load-imbalance, late-sender and late-receiver scenarios explicitly, thus, it helps identify performance bottlenecks directly. We use Scalasca with BoxLib to capture the cache-misses of various sub-domain shapes as BoxLib does not interface seamlessly with TAU.

VisIt [123] is a visualization tool that we use to plot the results from BoxLib. It is a dis- tributed, open-source, visualization tool that can run on Desktop computers to HPC clusters

having 105 _{cores. VisIt offers a GUI with a wide variety of operators and mathematical ma-} nipulations that can be applied to visualizations. As an example, levels in adaptively refined meshes can be coloured or the local refinement in 3-D meshes can be plotted with a wire-mesh. It offers features such as slicing, rotating, and creating a video, among many others.

Chapter 4

Cache-aware Domain

Partitioning

With the ubiquitous appearance of multicore processors, a natural step is to parallelize the simulations to minimize the time to produce meaningful results (or increase the accuracy of the results obtained in a given execution time). It is challenging to optimize the process of parallelization due to overheads such as data movement, mismatch in the speeds of the processor and memory, data dependency constraints, algorithmic inefficiencies, loose coupling of software with hardware etc., among many others. As mentioned in Chapter 1, our research broadly lies at the intersection of Parallel Computing and numerical methods for the solution of PDEs. We attempt to optimize their solution on multicore systems by creating a novel technique for Domain Decomposition/Domain Partitioning - the first step in parallel computing which consists of distributing data to individual cores of a multiprocessor system. We provide an insight into why the orthodox approach of domain partitioning based on minimizing the communication volume is not generally the optimal solution. We create and experimentally validate a new model for domain partitioning based on the minimization of cache-misses. Cache- misses are the major performance bottleneck in serial computing and our research focuses on connecting them to domain partitioning in parallel computing. To the best of our knowledge, such a relationship stands unexplored in the literature. With this macroscopic view of our research, we now delve into the details.

4.1 Introduction

Partial Differential Equations (PDEs) [21,124] lie at the heart of numerous scientific simulations depicting physical phenomena. It is very difficult, if not impossible, to solve them analytically and thus, they are discretized and solved numerically [22]. Discretization of the problem can be achieved by using, amongst others, the Finite Difference (FDM), Finite Element (FEM) or

the Finite Volume (FVM) methods [125]. A detailed description of FDM and a brief overview of FEM, FVM and other discretization schemes was provided in Chapter 2. To recollect, Finite Difference Methods are a numerical approximation method to estimate derivatives of any order and can be obtained using Taylor’s theorem [22]. We only use the Finite Difference method in the current and subsequent chapters but we expect the results to hold for other local forms of discretization as well. Iterative methods such as the Jacobi, weighted Jacobi (ω-Jacobi), Gauss-Seidel, Red-Black Gauss-Seidel (RBGS) etc., can be used to update the solution at various mesh points after discretization (see Chapter 2). A fixed geometrical shape called a Stencil, is used to define the approximate solution at each mesh point using a weighted average of the solution at some fixed neighbouring mesh points. As illustrated in Chapter 2, a 7-pt, 19-pt and 27-pt stencils are the most commonly used stencils in 3-D. As the number of mesh points become larger, the time to solution increases. Parallel computing is used to decompose/divide the domains (grid) into sub-domains (sub-grids) and reduce the time to solution by letting the processor cores work independently on sub-problems, exchanging data when needed. In this chapter we consider only structured 3-D domains and decompose them with divisions/cuts parallel to the Cartesian Axes.

The parallelization of such simulations introduces additional performance penalties in the form of local and global synchronization among cooperating processes. Domain decomposition, the first step in parallel computing, partitions the largest shareable data structures into sub- domains and attempts to achieve perfect load balance with minimal need for communication. This chapter aims to introduce, develop and validate an alternate strategy to achieve optimal domain decomposition/partitioning for structured 3-D stencil-based PDE discretizations. This new strategy uses the minimization of cache-misses at the sub-domain level as the basis for ob- taining optimal domain partitions/decompositions. We further, logically divide the sub-domain into three parts, namely, the Independent Compute (IC) - a part which does not require data from other processes for computation of a full iterative update, the Dependent Planes (DP) - a part which requires data from other processes for updating the solution, and the Ghost Layer (or Halo Layer ) which acts as a buffer for the incoming data from neighbouring processes. Up to now research efforts to optimize spatial and temporal cache reuse for stencil-based PDE discretizations have considered sub-domain operations after the domain decomposition has been determined [6,12–14,126]. We derive a heuristic that minimizes cache-misses at the sub-domain level through a cache-directed analysis to predict families of high performance domain decompositions of structured 3-D grids. Our approach and strategy thus connects a true single core parameter (i.e. cache-misses) to a true multicore parameter (i.e. domain decomposition) - an aspect which to the best of our knowledge has no associated literature. The analysis is followed by appropriate experiments to demonstrate the efficacy of our high level model. The chapter concludes by emphasizing the need to re-examine the orthodox approach of domain decomposition for stencil-based PDE discretizations due to the tightly-coupled, evolving software and

hardware ecosystem of multicore processors.

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 80-85)