
4.3 The Problem

4.3.1 Notation and Reference Figure

When simulating Finite Difference Methods (FDM) [22] in 3-D, we represent the size of the input problem as $N_x N_y N_z$, where $N_i + 1$ is the number of mesh points in direction $i$, for $i = x, y, z$. The number of internal points (i.e. unknowns, in FDM terminology) in direction $i$ is then $N_i - 1$. With pure Dirichlet boundary conditions, the outermost points of the 3-D domain form the boundary of our problem and have prescribed values. Hence, for Dirichlet problems, we have a system of linear equations in $(N_x - 1)(N_y - 1)(N_z - 1)$ unknowns, which may be solved using an iterative scheme such as unweighted Jacobi, weighted Jacobi or Gauss-Seidel. Stated concisely, the above discussion formulates a Boundary Value Problem (BVP) [22,37] for a linear elliptic PDE on a structured 3-D domain, discretised with FDM using (say) a 7-point stencil.

For parallel processing, these points (vertex unknowns) must be divided into sub-domains in each direction and mapped to individual processes running on independent cores (see Figure 4.1). Without loss of generality, and to simplify the discussion, we assume $N_x = N_y = N_z = N$. Let $P$ be the number of processes (cores); any regular Cartesian domain decomposition must satisfy $D_x D_y D_z = P$, where $D_i$ is the number of divisions in dimension $i$, for $i = x, y, z$. The number of mesh points (i.e. unknowns) assigned to each process is then $P_x P_y P_z$, where $P_i = (N_i - 1)/D_i$ for $i = x, y, z$. Since the domain has been partitioned, sub-domains require data from neighbouring sub-domains for stencil calculations. To store this data, extra space is allocated to each sub-domain on each core; the stored data is typically called ghost data, ghost points or halo data [61]. Thus, the actual 3-D domain size allocated to each process is $(P_x + 2)(P_y + 2)(P_z + 2)$, and we say that the ghost layer depth is one. Some processes have no neighbour in a particular direction. Such neighbours are called NULL processes, and MPI provides a constant named MPI_PROC_NULL that may be used to represent them [48].
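The sizing rules above can be sketched in C as follows. This is only an illustrative sketch, assuming that $D_i$ divides $N_i - 1$ exactly; the `Subdomain` struct and the functions `local_points`, `padded_size` and `alloc_subdomain` are hypothetical names introduced here, not taken from the original implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical container for one process's block, ghost layer included. */
typedef struct {
    int px, py, pz;   /* unknowns owned by this process in x, y, z  */
    double *data;     /* (px+2)*(py+2)*(pz+2) values, row-major     */
} Subdomain;

/* P_i = (N_i - 1) / D_i: unknowns per process in one direction.
 * Assumption: D_i divides N_i - 1 exactly. */
int local_points(int n, int d)
{
    assert(n > 1 && d > 0 && (n - 1) % d == 0);
    return (n - 1) / d;
}

/* Storage needed per process with a ghost layer of depth one. */
size_t padded_size(int px, int py, int pz)
{
    return (size_t)(px + 2) * (size_t)(py + 2) * (size_t)(pz + 2);
}

/* Allocate a zero-initialised sub-domain; data is NULL if calloc fails. */
Subdomain alloc_subdomain(int nx, int ny, int nz, int dx, int dy, int dz)
{
    Subdomain s;
    s.px = local_points(nx, dx);
    s.py = local_points(ny, dy);
    s.pz = local_points(nz, dz);
    s.data = calloc(padded_size(s.px, s.py, s.pz), sizeof *s.data);
    return s;
}
```

For instance, with $N = 129$ and a 2 × 2 × 2 decomposition, `local_points(129, 2)` gives 64 unknowns per direction, so each process stores a 66 × 66 × 66 block once the ghost layer is included.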

A process needs to pass between 0 and 6 planes of data, depending on the number of neighbour processes it has. Each sub-domain can be seen as being composed of three layers. The outermost layer stores the ghost (halo) data; it is not part of the data that the process owns, but it is needed to store the data communicated by neighbouring processes. Hence, each process uniformly has 6 ghost layers to store data received from a maximum of 6 adjoining neighbours. No ghost layer is needed in a direction in which the neighbour is a NULL process, i.e. no process. In such cases the ghost layer can act as a boundary layer and can be used to specify the boundary value (as in a Dirichlet Boundary Value Problem). The second layer is the Dependent layer, so named because it needs data from neighbouring processes to carry out its stencil calculations. The third layer is the Independent layer which, as the name suggests, needs no data from neighbouring processes in any iteration of the solution update algorithm. This layer also forms the computational kernel, as it generally contains many more mesh points than the dependent layers. Figure 4.4 shows the index ranges and the three basic layers of a 3-D sub-domain: Independent layers, which form the core computational kernel; Dependent layers, whose elements need data from other processes to be updated; and ghost layers, which hold data from neighbouring processes.

Figure 4.4: A 3-D sub-domain having an Independent Compute (IC) layer, a Dependent Planes (DP) layer and a Ghost/Halo layer. The indices of the sub-domain dimensions, including the ghost layer, run from 0 to $P_x + 1$, $P_y + 1$ and $P_z + 1$.
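The three-layer structure described above can be expressed as a classification of local indices. The following is a minimal sketch, assuming a ghost layer of depth one, local indices running from 0 to $P_i + 1$ in each direction, and a process with all six neighbours present; the `Layer` enum and the functions `classify_point` and `count_layer` are hypothetical names used only for illustration.

```c
#include <assert.h>

typedef enum { GHOST, DEPENDENT, INDEPENDENT } Layer;

/* Classify local index (i, j, k) of a padded block.
 * Assumption: all six neighbours exist, so the whole outer shell of
 * owned points depends on remote data. */
Layer classify_point(int i, int j, int k, int px, int py, int pz)
{
    assert(i >= 0 && i <= px + 1);
    assert(j >= 0 && j <= py + 1);
    assert(k >= 0 && k <= pz + 1);

    /* Outermost shell of the padded block: storage for received data. */
    if (i == 0 || i == px + 1 || j == 0 || j == py + 1 ||
        k == 0 || k == pz + 1)
        return GHOST;

    /* Owned points that touch the ghost layer through the stencil. */
    if (i == 1 || i == px || j == 1 || j == py || k == 1 || k == pz)
        return DEPENDENT;

    /* Everything else can be updated without remote data. */
    return INDEPENDENT;
}

/* Count the points of one layer by brute force (for checking only). */
long count_layer(Layer which, int px, int py, int pz)
{
    long n = 0;
    for (int i = 0; i <= px + 1; ++i)
        for (int j = 0; j <= py + 1; ++j)
            for (int k = 0; k <= pz + 1; ++k)
                if (classify_point(i, j, k, px, py, pz) == which)
                    ++n;
    return n;
}
```

For a block with $P_x = P_y = P_z = 4$, this classification yields 8 independent points, 56 dependent points and 152 ghost locations. The ghost count here includes the edges and corners of the padded block, which a 7-point stencil never reads.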

A 7-point stencil in 3-D is illustrated in Figure 4.5. The central point is updated by a weighted average of its six neighbours (two neighbours in each direction). The iterative solution algorithm then moves to the next point, where the solution is updated using the same stencil, and continues until the whole domain under consideration is covered. Figure 4.6 shows the same stencil along with the directions, assuming that the central point has index $(i, j, k)$. In Row-major order (described later in this section), the data points at indexes $(i, j, k-1)$, $(i, j, k)$ and $(i, j, k+1)$ are contiguous in memory. Similarly, in Column-major order (described later in this section), the data points at indexes $(i-1, j, k)$, $(i, j, k)$ and $(i+1, j, k)$ are contiguous in memory.

Figure 4.5: 7-pt stencil for updating the central red point.

Figure 4.6: A 7-point stencil in 3-D. The central point is updated according to prescribed weights associated with, and values of, the neighbouring points. The neighbours are labelled $(i-1, j, k)$, $(i+1, j, k)$, $(i, j-1, k)$, $(i, j+1, k)$, $(i, j, k-1)$ and $(i, j, k+1)$.
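To make the stencil update concrete, the sketch below performs one Jacobi sweep over the owned points of a padded sub-domain stored in Row-major order. It assumes equal weights of 1/6 for the six neighbours (as in unweighted Jacobi for Laplace's equation on a uniform mesh); the text does not fix the weights, so this choice, together with the names `idx` and `jacobi_sweep`, is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major offset of (i, j, k) in a padded block holding
 * (px+2)*(py+2)*(pz+2) values; k (the Z index) has unit stride. */
size_t idx(int i, int j, int k, int py, int pz)
{
    return ((size_t)i * (size_t)(py + 2) + (size_t)j) * (size_t)(pz + 2)
           + (size_t)k;
}

/* One Jacobi sweep with a 7-point stencil and equal weights (assumed).
 * Reads u_old (ghost layer included), writes the owned points of u_new. */
void jacobi_sweep(const double *u_old, double *u_new,
                  int px, int py, int pz)
{
    assert(px > 0 && py > 0 && pz > 0);
    for (int i = 1; i <= px; ++i)
        for (int j = 1; j <= py; ++j)
            for (int k = 1; k <= pz; ++k)   /* innermost loop is contiguous */
                u_new[idx(i, j, k, py, pz)] =
                    (u_old[idx(i - 1, j, k, py, pz)] +
                     u_old[idx(i + 1, j, k, py, pz)] +
                     u_old[idx(i, j - 1, k, py, pz)] +
                     u_old[idx(i, j + 1, k, py, pz)] +
                     u_old[idx(i, j, k - 1, py, pz)] +
                     u_old[idx(i, j, k + 1, py, pz)]) / 6.0;
}
```

Keeping `k` as the innermost loop means that consecutive iterations touch adjacent memory locations, which is the Row-major contiguity noted in the text.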

The number of independent calculations done by each process in each solution iteration, i.e. the number of elements that do not depend on data from other processes, is $(P_x - 2)(P_y - 2)(P_z - 2)$. The maximum amount of data contained in the planes communicated by a process is $2P_yP_z$, $2P_xP_z$ or $2P_xP_y$ for the X, Y and Z planes, respectively. Note that this is an upper bound: some decompositions send less data, depending on the number of neighbours that may be NULL. The value $2[(D_x - 1)(N_y - 1)(N_z - 1) + (D_y - 1)(N_x - 1)(N_z - 1) + (D_z - 1)(N_x - 1)(N_y - 1)]$ represents an upper bound on the total number of data elements communicated by all processes.
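The counts above are simple to evaluate for a given decomposition. The following sketch does so directly from the formulas in the text; the function names `independent_points`, `max_plane_points` and `total_comm_bound` are hypothetical and chosen only for this illustration.

```c
#include <assert.h>

/* (Px - 2)(Py - 2)(Pz - 2): points per process that need no remote data. */
long long independent_points(int px, int py, int pz)
{
    assert(px >= 2 && py >= 2 && pz >= 2);
    return (long long)(px - 2) * (py - 2) * (pz - 2);
}

/* Upper bound on the points one process sends across the two planes
 * normal to one axis: 2 * a * b, where a and b are the local extents
 * in the other two directions. */
long long max_plane_points(int a, int b)
{
    assert(a > 0 && b > 0);
    return 2LL * a * b;
}

/* Upper bound on the data elements communicated by all processes:
 * 2[(Dx-1)(Ny-1)(Nz-1) + (Dy-1)(Nx-1)(Nz-1) + (Dz-1)(Nx-1)(Ny-1)]. */
long long total_comm_bound(int nx, int ny, int nz, int dx, int dy, int dz)
{
    long long ux = nx - 1, uy = ny - 1, uz = nz - 1;  /* unknowns per axis */
    assert(dx > 0 && dy > 0 && dz > 0);
    return 2 * ((dx - 1) * uy * uz +
                (dy - 1) * ux * uz +
                (dz - 1) * ux * uy);
}
```

As a worked example, take $N = 129$ with a 2 × 2 × 2 decomposition, so that $P_x = P_y = P_z = 64$. Each process then has $62^3 = 238328$ independent points, the per-axis plane bound is $2 \cdot 64 \cdot 64 = 8192$, and the total bound is $2 \cdot 3 \cdot 128^2 = 98304$. The last value agrees with 8 processes each sending 3 planes of $64 \times 64$ points.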

Figure 4.7 shows an example domain and the reference axes with selected decompositions. The upper YZ plane is called X_UP and the lower YZ plane is called X_DOWN. The left XZ plane is called Y_LEFT and the right is called Y_RIGHT. The XY plane closer to the reader is called Z_TOWARDS_U and the plane farther away from the reader is called Z_AWAY_U. The coordinate axes shown in Figure 4.7 follow the directions of the coordinate axes assumed by the MPI function MPI_Cart_coords(). This function returns the coordinates of a process in an n-dimensional Cartesian process topology. Thus, for a topology of 2 × 2 × 2 when P = 8, the ranks have the following process coordinates: Rank 0 (0,0,0), Rank 1 (0,0,1), Rank 2 (0,1,0), Rank 3 (0,1,1), Rank 4 (1,0,0), Rank 5 (1,0,1), Rank 6 (1,1,0) and Rank 7 (1,1,1). When looping through a 3-D MPI process decomposition, the fastest changing index is Z and the slowest changing index is X; this also matches the Row-major data storage of the C language when looping through a 3-D array.

Figure 4.7: Process Grid Decomposition and Coordinate Axes. (a) X decomposition, 3 × 1 × 1: process ranks with MPI process coordinates, Rank 0 (0,0,0), Rank 1 (1,0,0), Rank 2 (2,0,0). (b) Y decomposition, 1 × 3 × 1: only the Y direction is decomposed. (c) Z decomposition, 1 × 1 × 3: only the Z direction is decomposed. (d) Decomposition 2 × 2 × 2: general decomposition in all 3 directions.

Figure 4.8: Row-major and Column-major data layout. (a) 3-D data layout where data is contiguous in the Z-direction (Row-major order). (b) Data layout where data is contiguous in the X-direction (Column-major order).
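The rank-to-coordinate mapping listed above can be reproduced without MPI. The sketch below mirrors the Row-major ordering that MPI uses for Cartesian topologies, assuming a non-periodic grid; `rank_to_coords`, `coords_to_rank` and the constant `NO_NEIGHBOUR` are hypothetical names, with `NO_NEIGHBOUR` acting as a stand-in for MPI_PROC_NULL (whose actual value is implementation-defined).

```c
#include <assert.h>

#define NO_NEIGHBOUR (-1)   /* stand-in for MPI_PROC_NULL in this sketch */

/* Row-major rank -> (x, y, z): Z changes fastest, X slowest. */
void rank_to_coords(int rank, int dx, int dy, int dz, int coords[3])
{
    assert(rank >= 0 && rank < dx * dy * dz);
    coords[2] = rank % dz;            /* z */
    coords[1] = (rank / dz) % dy;     /* y */
    coords[0] = rank / (dy * dz);     /* x */
}

/* Inverse mapping; coordinates outside the grid have no process
 * (non-periodic boundaries assumed). */
int coords_to_rank(int x, int y, int z, int dx, int dy, int dz)
{
    if (x < 0 || x >= dx || y < 0 || y >= dy || z < 0 || z >= dz)
        return NO_NEIGHBOUR;
    return (x * dy + y) * dz + z;
}
```

With a 2 × 2 × 2 grid, rank 5 maps to (1,0,1), matching the list in the text, and asking for the neighbour at x = 2 returns `NO_NEIGHBOUR`. In an MPI program the same information would normally come from MPI_Cart_coords() and MPI_Cart_shift() on a Cartesian communicator.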

Figure 4.8a shows the layout of data in a 3-D array. The data points are contiguous along the Z-axis, which constitutes a Row-major order layout. The C language supports this order, and we use C for all our implementations in this chapter. The contiguity of data points (drawn as circles) is shown by continuous black lines in Figure 4.8a. Figure 4.8b shows the Column-major order, in which the fastest changing index is the X-index; this data layout is supported by languages such as Fortran. Although we illustrate both data layouts here for completeness, we use the Row-major order in this chapter to quantify the cache misses in the sub-domain. The final inferences derived from the model are independent of the data layout, because the Z-direction in the Row-major order is analogous to the X-direction in the Column-major order, and the X-direction in the former is equivalent to the Z-direction in the latter.
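The symmetry between the two layouts can be seen from their linear offset formulas. The sketch below is illustrative only; the function names `row_major` and `col_major` are hypothetical, and the array extents `nx`, `ny`, `nz` are generic parameters rather than quantities defined in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major (C) offset: k, the Z index, has unit stride. */
size_t row_major(size_t i, size_t j, size_t k,
                 size_t nx, size_t ny, size_t nz)
{
    assert(i < nx && j < ny && k < nz);
    return (i * ny + j) * nz + k;
}

/* Column-major (Fortran) offset: i, the X index, has unit stride. */
size_t col_major(size_t i, size_t j, size_t k,
                 size_t nx, size_t ny, size_t nz)
{
    assert(i < nx && j < ny && k < nz);
    return i + nx * (j + ny * k);
}
```

Swapping the roles of X and Z turns one formula into the other: `col_major(i, j, k, nx, ny, nz)` equals `row_major(k, j, i, nz, ny, nx)`. This is the sense in which the Z-direction of the Row-major layout plays the part of the X-direction of the Column-major layout.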

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 89-94)