Application of the Parallel Dynamic Load Balancing Algorithm

Adaptivity Algorithms

5.4 Application of the Parallel Dynamic Load Balancing Algorithm

In this section we are going to modify the parallel dynamic load-balancing algorithm already introduced in Chapters 3 and 4. These modications are required to add generality to the code due to the dierent nature of solver as well as underlying meshes. This modied algorithm is similar in strategy to that of Chapter 4 with Group Balancing, Data Migration and a Divide and Conquer philosophy. However, as far as the implementation is concerned there are major dierences between this and Chapter 4. We now discuss these dierences in details.

5.4.1 Calculation of WPCG

The idea of the Weighted Partition Communication Graph (WPCG) was rst introduced inx2.4 and since then has been used extensively in the previous two chapters.

processor and an edge between two vertices if and only if they are face adjacent to each other. Although the denition of WPCG is quite standard the actual determination may vary from one context to another. In the present situation the concept of 'halo' elements together with the hierarchical nature of the mesh plays an im- portant role in the determination of the WPCG. During the bisection process of the WPCG it is necessary to calculate the weighted Laplacian of WPCG as well as the weights of all the vertices of WPCG. This is done in two phases. In phase one, which is shown in Figure 5.4, we calculate the weights of all the vertices and edges of the dual graph with the help of the c-style function FindWeights(). In phase two, which is shown in Figure 5.5, we use these edge weights to nd the weighted Lapla- cian of the WPCG. It is interesting to observe the recursive nature of the function FindWeights() shown in Figure 5.4 (the very rst call to this function is of the form FindWeights(Element,Element,WeightOfElement,WeightOfElementEdge)). It may also be observed that as the original mesh is already distributed across a number of processors of a parallel machine, the calculations in Figure 5.4 are performed by each processor on those roots elements which it currently owns. Also, the kth

row of the WPCG is assembled by the kth _{processor with the help of Figure 5.5}

(note that in this gure the weight of jth _{element face is the same as the weight of}

the corresponding edge in the dual graph). Just like in x4.2.1 each processor, after

assembling its own row, sends it to one processor (which we call a master processor). The master processor, after receiving all the contributions from all other processors, forms the Laplacian and then divides the WPCG into two subgroups denoted by Group1 and Group2 using the same procedure as in x3.2.1.

5.4.2 Use of Tokens

At each level of the recursive re-balancing algorithm, after deciding how much to shift between the dierent processors, we face the same practical diculties as encountered in the previous chapter. However to overcome these diculties we take a dierent approach to that taken in the previous chapter. The rst dierence, which is a major one, is concerned with passing the information associated with moving a coarse element. In the previous chapter, whenever a coarse element goes from one subdomain to another subdomain all of its associated data structures are sent, even if the new home is only a transitory one. In the current implementation

FindWeights(ElmC,Elm,WeightOfElm,WeightOfElmEdge[])f

if (ElmC has no child)f

WeightOfElm++;

if (ElmC has a face which is contained in jth _{face of}_Elm)

WeightOfElmEdge[j]++;

for(i = 0; i < Children of ElmC; i++) f

ElmC2 = ith _{child of} _ElmC;

FindWeights(ElmC2,Elm,WeightOfElm,WeightOfElmEdge[]);

g g

& %

Figure 5.4: Calculation of weights of vertices and edges of the weighted dual graph. the communication of the full coarse element hierarchies is left until the very end of the load-balancing process, with much smaller tokens being passed instead during the transitory stages. There are many reasons for doing this. One reason is that in the previous chapter the load-balancing algorithm was used only once, usually at the end of the generation of the mesh and before the solution process commences, but in the current context it can be used possibly after each adaptation step (in case the resulting imbalance is greater than a predened tolerance). The second reason is that in the 2-d application the size of the accompanying data structure of a moving element is much smaller than the corresponding size in this 3-d application. In fact the re-balancing times in all the problems of Chapter 4 were less than a second (except problems 10 and 12 where it was slightly higher than a second), showing that in 2-d steady state cases moving all the information associated with a migrating coarse element is indeed not that costly.

5.4.3 No Colouring

The other major dierence between the current implementation and that of the previous chapter is to drop the colouring methodology. Note that the colouring scheme in the 2-d case was implemented in order to avoid the complication involved in the simultaneous migration of two neighbouring elements(without it, complicated

for(e = 1; e total no. of root elements;e++ )f

Weight of the kth _{vertex of WPCG = Weight of the root element}_e;

if (elemente is nothalo) for(j = 1;j 4; j++)

if (jth _{face of element}_{e touches the boundary of i}th _processor) f

Lapr[i] -= weight of jth _{element face;}

Lapr[k] += weight of jth _{element face;}

g g

& %

Figure 5.5: Calculation of a row of the weighted Laplacian matrix.

forwarding messages would have been required to accomplish the same task). The decision to drop the colouring approach in the current context is based upon the fact that in the 3-d setup it is computationally more expensive to implement. In 2-d the average number of colours used was never more than 10. But in 3-d a given vertex may have 100 (or even more) common tetrahedral elements. So the colouring scheme would almost certainly adversely aect the performance of the dynamic load balancer. An alternative to the colouring approach is discussed below.

5.4.4 Use of Global Communication

Rather than use the colouring scheme introduced in Chapter 4 in order to to avoid data con icts (see x4.3) we decided to make use of global arrays, whereby each

processor knows the owners of every coarse element in the entire coarse mesh. These arrays are updated after each level of the recursive step. Note that each level of the algorithm starts with a given group. Based upon the sorted version of the Fiedler vector it is divided into two subgroups which are called Sender and Receiver groups respectively. A certain number of coarse elements are marked for migration from the Sender to the Receiver groups in an attempt to balance the computational load evenly.

Initially each processor in a group is assigned as the owner of all non-halo coarse elements which reside within the processor: the unique owner for each coarse element being the ID of the processor itself. Soon after these assignments, which

are local to each processor, the list of elements owned by each processor is broad- casted globally within the I Group (let us recall from x4.8 that each processor is a

member of two groups, the initial group (called the I Group) which consists of all the p processors involved and which remains the same throughout the discussion and the current group (known simply as the Group) which is a variable group and changes with each application of the Divide and Conquer algorithm). As a result of this broadcast every processor now knows the owner of every coarse element in the entire mesh. When a coarse element in the Sender group is marked for a migration, the owning processor of the coarse element keeps a note of this change (i.e. it records the new owner of the coarse element (which is the ID of some processor in the Receiver group)).

At the end of this symbolic migration step each candidate processor in the Sender group broadcasts globally within the I Group the new owners of those coarse elements which are marked for migration from the processor. As a result of this broadcast every processor knows not only the IDs of those processors from which there would be migration of coarse elements but they also know the new owners of these coarse elements. Soon after the broadcast each processor updates its own version of the array which keeps track of the owner of all the coarse element in the entire coarse mesh.

It may be pointed out here the this broadcasting step consists of two separate steps. In the rst step each candidate processor in the Sender group broadcasts only the number of coarse elements marked for the migration. In the second step it broadcasts the new owners of these coarse elements. This division of the broadcasting was necessary. The rst step is necessary for the second step. After the rst step each processor can create the necessary temporary arrays in order to accom- modate the new owners of the marked coarse elements (which will be broadcast in the second step).

An overview of the whole algorithm is given in Figure 4.3.

In document ParallelDynamicLoad-BalancingforAdaptiveDistributive MemoryPDESolvers. NasirTouheed by (Page 147-151)