Domain Decomposition/Domain Partitioning

2.4 Parallel Computing

2.4.5 Domain Decomposition/Domain Partitioning

The first step in a parallel implementation of a problem is the division of computational work or data among processes/cores. Domain Decomposition [47, 52] or Domain Partitioning² is the process of dividing the largest data structures associated with the problem domain and assigning them to multiple cores of a multiprocessor [52, 58]. Another approach is functional decomposition, where the computation is decomposed first and data is then associated with it. Some authors stress that Domain Decomposition should be differentiated from Domain Partitioning, in the sense that the former is a special technique where individual sub-domains independently solve the global problem without any communication and then converge to the global solution by adapting to the local solutions of neighbouring processes [59]. Data Decomposition within the field of PDEs can refer either to the separation of domains that can be modelled with different equations or to the division of large linear systems into smaller problems during preconditioning [42]. In the current work, we parallelize the finite difference discretizations of elliptic PDEs, resulting in sub-domains on individual cores that require communication for solving the PDE.

The domain which is discretized using FDM, FEM or FVM can result in either an unstructured or a structured grid/mesh. The discussion in this thesis is limited to the Domain Partitioning of structured grids only. Naturally occurring data tends to be 3-dimensional, and thus the problem domain can be divided into 1-D, 2-D or 3-D partitions depending on the problem. A 1-D partition on structured grids permits partitions along a single Cartesian direction only. Similarly, a 2-D partition and a 3-D partition allow partitions in 2 and 3 Cartesian directions, respectively. In general, a higher dimensional partition gives us the opportunity to use a larger number of Processing Elements (PEs) for the problem [52]. Theoretically, a $d$-dimensional partition of a $d$-dimensional domain containing a total of $n^d$ elements allows us to use $n^d$ PEs. In practice, however, the cost of communication among $n^d$ PEs can be so high that parallelization may not yield any benefit in terms of application speed-up.

² We use the terms Domain Decomposition and Domain Partitioning interchangeably in this thesis to refer

MPI offers a convenient way of specifying Domain Partitioning by allowing one to specify a virtual geometrical arrangement of MPI processes known as an MPI Cartesian Topology [48]. This arrangement is virtual as it need not follow any specific process-to-core mapping. Two functions play a major role in the creation of a Cartesian Topology, namely MPI_Dims_create() and MPI_Cart_create(). The C language binding for the MPI_Dims_create() function is shown in Listing 2.1. This function takes as input the number of MPI processes (nnodes) for which a topology is to be created, the dimension (ndims) of the topology and the number of processes in each dimension as entries of an array (dims) of size ndims. If dims[i]=0 for all i = 0, ..., ndims−1, then the function returns in the dims[] array positive values, in decreasing order, that are set as close to each other as possible using an appropriate divisibility algorithm, which the standard does not describe. In this thesis, we call this the default MDC (MPI_Dims_create()). As an example, for nnodes=64, ndims=3, the default MDC returns dims[0]=4, dims[1]=4, dims[2]=4. Thus, in 3 dimensions the default MDC is the closest to a cubic topology. As another example, for nnodes=24, ndims=3, the default MDC returns dims[0]=4, dims[1]=3, dims[2]=2. If the user provides a positive value of dims[i], it is not altered by the function MPI_Dims_create(), but in all cases

\[
\prod_{i=0}^{\texttt{ndims}-1} \texttt{dims}[i] = \texttt{nnodes}
\]

must hold, else an error is returned.

int MPI_Dims_create(int nnodes, int ndims, int dims[])

Listing 2.1: MPI_Dims_create() function
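The input rules described above can be sketched as a small check that runs before the call. This is only an illustration under stated assumptions: dims_request_ok() is a hypothetical helper and not part of MPI, and it encodes the rules as read from the text (no negative entries, the user-fixed entries must divide nnodes, and with no zero entries left the product must already equal nnodes).

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of MPI): checks whether a dims[] request is
 * consistent with nnodes before it is handed to MPI_Dims_create().
 * Returns 1 if the request can be satisfied, 0 otherwise.
 */
int dims_request_ok(int nnodes, int ndims, const int dims[])
{
    assert(nnodes > 0 && ndims > 0);

    long long fixed_product = 1;  /* product of the entries fixed by the user */
    int free_entries = 0;         /* entries left at 0 for the library to fill */

    for (int i = 0; i < ndims; i++) {
        if (dims[i] < 0)
            return 0;             /* negative entries are erroneous */
        if (dims[i] == 0) {
            free_entries++;
        } else {
            fixed_product *= dims[i];
            if (fixed_product > nnodes)
                return 0;         /* already more processes than available */
        }
    }

    /* The fixed entries must divide nnodes exactly. */
    if (nnodes % fixed_product != 0)
        return 0;

    /* With no free entries left, the product must already equal nnodes. */
    if (free_entries == 0 && fixed_product != nnodes)
        return 0;

    return 1;
}
```

For nnodes=24 and ndims=3, the requests {0,0,0}, {4,0,0} and {4,3,2} pass this check, while {5,0,0} and {2,3,2} do not.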

For a structured domain, it is easy to see that for a given nnodes and ndims, the default MDC minimizes the surface area of a sub-domain. In three dimensions, the sub-domains are thus as close to a cube as possible when the topology is the default MDC. It is interesting to note that any permutation of the individual sizes in the dims[] array returned by the default MDC also minimizes this surface area. Thus, for the default MDC for nnodes=24, i.e. dims[0]=4, dims[1]=3, dims[2]=2, a combination such as dims[0]=3, dims[1]=4, dims[2]=2 also minimizes the surface area of the sub-domain. If in 3-D we denote the product of the three dimensions as Dx × Dy × Dz = dims[0] × dims[1] × dims[2], then all 6 permutations, namely 4×3×2, 4×2×3, 3×4×2, 3×2×4, 2×4×3 and 2×3×4, minimize the surface area. Regardless of the data layout supported by a language, the function MPI_Dims_create() returns the same values. Thus, the default MDC using the Fortran version of the function, mpi_dims_create(), also returns a topology of 4 × 3 × 2 with nnodes=24. It is important to note that the default MDC only minimizes the sub-domain surface area for cubic domains and may or may not minimize it for non-cubic domains.
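The effect of permuting the dims[] entries can be checked with a short calculation. The sketch below is illustrative only: subdomain_surface() is a hypothetical helper, the grid sizes are made-up examples, and for simplicity it assumes that every grid extent is divisible by the number of processes along that direction.

```c
#include <assert.h>

/*
 * Surface area (in cell faces) of one sub-domain when a grid of
 * n[0] x n[1] x n[2] cells is split into dims[0] x dims[1] x dims[2] blocks.
 * Assumes each extent is divisible by the matching number of processes.
 */
long subdomain_surface(const long n[3], const int dims[3])
{
    for (int k = 0; k < 3; k++)
        assert(dims[k] > 0 && n[k] % dims[k] == 0);

    long a = n[0] / dims[0];   /* sub-domain edge lengths */
    long b = n[1] / dims[1];
    long c = n[2] / dims[2];
    return 2 * (a * b + b * c + a * c);
}
```

On a cubic 240 × 240 × 240 grid, 4×3×2, 3×4×2 and 2×3×4 all give a sub-domain surface of 43200, since each yields edges of 60, 80 and 120 in some order. On a non-cubic 480 × 240 × 120 grid, 4×3×2 still gives 43200 but 2×3×4 gives 57600, so the order of the entries matters once the domain is not cubic.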

Different MPI implementations use different heuristic algorithms to implement the default MPI_Dims_create() function. The aim of all these heuristics is to produce a balanced partition, but the interpretation of a balanced partition is debatable, and researchers have found that implementations such as MPICH, MVAPICH2 and OpenMPI can produce weak and strong violations of the MPI specification [60].

We now outline and illustrate the working of the heuristic algorithm used by OpenMPI for implementing the default MPI_Dims_create() convenience function. The algorithm is outlined in Figure 2.4. It takes as input a given number of processes (say nnodes) and dimensions (say ndims). After determining the prime factors of nnodes, it distributes these factors into ndims bins using a greedy heuristic. As a final step, the bins are sorted in descending order to obtain the default decomposition (output in the dims[1...ndims] array).

Require: nnodes: number of processes, ndims: number of dimensions, dims[1...ndims]: output array containing processes in each dimension

sqnnodes ← ⌈√nnodes⌉
factors ← malloc(⌈lg₂(nnodes)⌉ × sizeof(int))
bins ← malloc(ndims × sizeof(int))
i ← 1
while nnodes % 2 = 0 do
    factors[i++] ← 2
    nnodes ← nnodes / 2
end while
j ← 3
while j < sqnnodes do
    while nnodes % j = 0 do
        factors[i++] ← j
        nnodes ← nnodes / j
    end while
    j ← j + 2
end while
if nnodes ≠ 1 then
    factors[i++] ← nnodes
end if
nfactors ← i − 1
Initialize bins[1...ndims] ← 1
while nfactors > 0 do
    Find the smallest index i such that bins[i] is minimum
    bins[i] ← factors[nfactors] × bins[i]
    nfactors ← nfactors − 1
end while
dims ← SORT(bins) in descending order

Figure 2.4: Heuristic algorithm used by OpenMPI for the default MPI_Dims_create()

We illustrate the algorithm with the help of two examples. The first example is for nnodes=24 and ndims=3, for which the default MPI_Dims_create() returns 4 × 3 × 2. Following Figure 2.4, sqnnodes=5, the space allocated for the array factors is 5 elements and the space allocated for the array bins is 3 elements. The prime factorization of nnodes=24 is 2 × 2 × 2 × 3, and thus the array factors contains 2, 2, 2 and 3, with an empty trailing element. Since there are 4 factors in total, nfactors=4. The bins array has 3 elements as ndims=3, and they are all initialized to one. Since every element of bins is initially one, the minimum is taken at position one, i.e. bins[1]=bins[1]*factors[nfactors]=1*3=3. We now decrement nfactors by one and again find the minimum of bins[1...ndims]. This minimum is now found at position two in the bins array and hence bins[2]=1*2=2. In the third assignment step, bins[3]=1*2=2. Now the bins array contains 3, 2 and 2 at positions 1, 2 and 3, respectively. The minimum value in the bins array now occurs at both positions 2 and 3. We could choose either, but we break the tie by choosing the lowest index. Thus, after assigning the fourth and last remaining factor in the array factors, the value at bins[2] becomes bins[2]=2*2=4. We now sort the bins in descending order and assign the result to the dims array to obtain 4 × 3 × 2 as the final decomposition. Note that this is a balanced decomposition. The second example below shows how this greedy heuristic fails to obtain a balanced decomposition when nnodes=72 and ndims=2.

For nnodes=72 and ndims=2, the factors array is allocated space for ⌈lg₂(nnodes)⌉ = 7 elements. The bins array now has two elements as ndims=2. The prime factorization of 72 yields 2 × 2 × 2 × 3 × 3. Since there are five prime factors, nfactors=5. Since both bins[1] and bins[2] initially contain a one, we choose bins[1] as the minimum element and multiply it by factors[nfactors]=factors[5], i.e. bins[1]=bins[1]*factors[5]=1*3. Thus, bins[1]=3. Now the minimum element is bins[2], which also becomes 3 after carrying out bins[2]=bins[2]*factors[4]=1*3=3. For assigning factors[3], we again choose bins[1] as the minimum and hence bins[1]=3*2=6. Carrying out the same procedure again, bins[2]=3*2=6. Now we are left with only one factor, factors[1]=2, and bins[1]=bins[2]=6. Thus, we choose bins[1] as the minimum and multiply it by factors[1] to obtain bins[1]=6*2=12. After sorting the bins array (and copying it to the dims array), the final output in the dims array is 12 × 6. This is where the optimality of the balance is violated, as another decomposition, 9 × 8, exists in which the difference between the first and the last dimension is smaller than in the decomposition 12 × 6. Thus, the algorithm employed by OpenMPI (or by any other implementation) is not guaranteed to yield the most balanced decomposition.
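The two worked examples can be reproduced with a compact C sketch of the steps in Figure 2.4. It is a sketch under stated assumptions rather than OpenMPI's source: greedy_dims() and MAX_FACTORS are hypothetical names, indexing is 0-based, the factors live in a fixed-size array instead of a malloc'd one, and the trial-division loop is bounded by j*j <= n instead of a comparison with ⌈√nnodes⌉ (a choice of this sketch to avoid floating point, which can give different factor lists from the figure for some inputs).

```c
#include <assert.h>

#define MAX_FACTORS 32  /* a positive 32-bit int has fewer than 32 prime factors */

/*
 * Sketch of the greedy heuristic of Figure 2.4 (0-based indexing).
 * Fills dims[0..ndims-1] in descending order so that their product is nnodes.
 */
void greedy_dims(int nnodes, int ndims, int dims[])
{
    assert(nnodes > 0 && ndims > 0);

    int factors[MAX_FACTORS];
    int nfactors = 0;
    int n = nnodes;

    /* Prime factorization in ascending order: first 2, then odd candidates. */
    while (n % 2 == 0) {
        factors[nfactors++] = 2;
        n /= 2;
    }
    for (int j = 3; (long long)j * j <= n; j += 2) {
        while (n % j == 0) {
            factors[nfactors++] = j;
            n /= j;
        }
    }
    if (n != 1)
        factors[nfactors++] = n;   /* one prime factor may be left over */

    /* Every bin starts at 1; dims[] doubles as the bins array. */
    for (int i = 0; i < ndims; i++)
        dims[i] = 1;

    /* Largest factor first: multiply it into the smallest bin,
       breaking ties in favour of the lowest index. */
    while (nfactors > 0) {
        int min = 0;
        for (int i = 1; i < ndims; i++)
            if (dims[i] < dims[min])
                min = i;
        dims[min] *= factors[--nfactors];
    }

    /* Insertion sort into descending order. */
    for (int i = 1; i < ndims; i++) {
        int key = dims[i];
        int k = i - 1;
        while (k >= 0 && dims[k] < key) {
            dims[k + 1] = dims[k];
            k--;
        }
        dims[k + 1] = key;
    }
}
```

Calling greedy_dims(24, 3, dims) fills dims with 4, 3, 2, and greedy_dims(72, 2, dims) fills it with 12, 6, matching the two examples traced by hand.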

The function MPI_Dims_create() only helps to specify the number of processes in each dimension but does not actually create the Cartesian Topology. The function MPI_Cart_create() is used to create the Cartesian Topology. The C syntax of this function is shown in Listing 2.2. It takes as input the old MPI communicator comm_old (for example MPI_COMM_WORLD), the dimension of the topology ndims, the number of processes in each dimension through the array dims[], the periodicity in each dimension through the array periods[] and a boolean value reorder that permits or forbids the reordering of ranks, and it outputs the new Cartesian communicator comm_cart. If the topology is periodic in a certain dimension, then the last process is followed by the first process in that dimension. If reordering of ranks is allowed in the new Cartesian Topology, then the ranks of the processes in the new communicator may be changed from the ranks of the processes in the old communicator.

int MPI_Cart_create(MPI_Comm comm_old, int ndims, const int dims[],
                    const int periods[], int reorder, MPI_Comm *comm_cart)

Listing 2.2: MPI_Cart_create() function
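To make the roles of dims[] and periods[] concrete, the sketch below mimics in plain C how ranks relate to coordinates in such a topology. It assumes the row-major numbering that the MPI standard prescribes for Cartesian topologies. The helpers cart_rank(), cart_coords() and cart_neighbour() are hypothetical stand-ins written for illustration; in an MPI program, MPI_Cart_rank(), MPI_Cart_coords() and MPI_Cart_shift() provide this information.

```c
#include <assert.h>

/* Row-major rank of the process at coords[] in a grid of size dims[]. */
int cart_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int d = 0; d < ndims; d++) {
        assert(coords[d] >= 0 && coords[d] < dims[d]);
        rank = rank * dims[d] + coords[d];
    }
    return rank;
}

/* Inverse mapping: coordinates of a given rank. */
void cart_coords(int ndims, const int dims[], int rank, int coords[])
{
    for (int d = ndims - 1; d >= 0; d--) {
        coords[d] = rank % dims[d];
        rank /= dims[d];
    }
}

/*
 * Rank of the neighbour `disp` steps away along dimension `dim`.
 * Returns -1 when the step leaves a non-periodic dimension
 * (MPI itself reports MPI_PROC_NULL in that situation).
 */
int cart_neighbour(int ndims, const int dims[], const int periods[],
                   int rank, int dim, int disp)
{
    int coords[8];                         /* sketch: supports ndims <= 8 */
    assert(ndims > 0 && ndims <= 8 && dim >= 0 && dim < ndims);
    cart_coords(ndims, dims, rank, coords);

    int c = coords[dim] + disp;
    if (periods[dim])
        c = ((c % dims[dim]) + dims[dim]) % dims[dim];   /* wrap around */
    else if (c < 0 || c >= dims[dim])
        return -1;                                       /* off the grid */

    coords[dim] = c;
    return cart_rank(ndims, dims, coords);
}
```

With dims = {4, 3, 2} and periods = {1, 0, 0}, the process at coordinates (1, 2, 0) has rank 10. Rank 18 sits at (3, 0, 0), the last position along the periodic first dimension, so its +1 neighbour in that dimension wraps around to rank 0, whereas rank 0 has no −1 neighbour along the non-periodic second dimension.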

In document Efficient Domain Partitioning for Stencil-based Parallel Operators (Page 53-58)