Parallel Finite Element and Load-Balancing

For a small problem, where the number of degrees of freedom is just a few thousand, the system of equations (1.5) can be solved easily and quickly on a serial machine. But when the number of degrees of freedom exceeds a million or so, the memory and speed of a serial machine become a bottleneck. Also, for some applications where the problem is not so large, the time taken by a serial machine may still be very long (for non-linear problems, for example, where the iterative methods for solving the corresponding system (1.5) are quite expensive). In these cases a promising way forward is to use a parallel architecture. By using such a machine we can hope not only to solve larger problems (e.g. in structural mechanics) but also to solve them more quickly.

In the rest of this section we discuss a method for assembling and solving the sparse system of equations (1.5) in parallel. Let us suppose the domain has been divided into $n$ subdomains, numbered $1, 2, \ldots, n$, and the $i$th subdomain has been assigned to the $i$th processor of a parallel machine. Let us assume that the unknowns on the interface between the subdomains are labelled $a$ and the unknowns inside each subdomain are labelled $a_1, a_2, \ldots, a_n$. If we first number the unknowns in $a_1$, then in $a_2, a_3, \ldots, a_n$ and lastly in $a$, then the system of equations (1.5) can be written as

$$
\begin{bmatrix}
A_1 &     &        &     & C_1 \\
    & A_2 &        &     & C_2 \\
    &     & \ddots &     & \vdots \\
    &     &        & A_n & C_n \\
B_1 & B_2 & \cdots & B_n & A
\end{bmatrix}
\begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \\ a \end{bmatrix}
=
\begin{bmatrix} f_1 \\ f_2 \\ \vdots \\ f_n \\ f \end{bmatrix},
\tag{1.9}
$$

where $A_i$, $B_i$, $C_i$ and $A$ are themselves usually sparse. It is clear from the definition of the basis functions $P_j$ that $f_i$, $A_i$, $B_i$ and $C_i$ are totally housed by the $i$th processor and hence can be assembled independently of each other in parallel. But $f$ and $A$ are distributed across different processors. Each processor can compute and assemble its own contribution to them, independently, storing them in the blocks $f^{(i)}$ and $A^{(i)}$ say (so that $f = f^{(1)} + f^{(2)} + \cdots + f^{(n)}$ and $A = A^{(1)} + A^{(2)} + \cdots + A^{(n)}$).

In order to solve the system of equations (1.9) we first write it in component form:

$$ A_i a_i + C_i a = f_i, \qquad i = 1, 2, \ldots, n, \tag{1.10} $$

$$ \sum_i B_i a_i + A a = f. \tag{1.11} $$

If we substitute the value of $a_i$ from equation (1.10) into (1.11) we get the following equation:

$$ \sum_i B_i A_i^{-1} (f_i - C_i a) + A a = f. \tag{1.12} $$

On simplification this reduces to

$$ \Bigl( A - \sum_i B_i A_i^{-1} C_i \Bigr) a = f - \sum_i B_i A_i^{-1} f_i. \tag{1.13} $$

If we define $A_s$ by the equation

$$ A_s = A - \sum_i B_i A_i^{-1} C_i, \tag{1.14} $$

then equation (1.13) can simply be written as

$$ A_s a = f - \sum_i B_i A_i^{-1} f_i. \tag{1.15} $$
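As a small illustration of the reduction from (1.9) to (1.15), consider a hypothetical one-dimensional example that is not taken from the source: the system with matrix tridiag$(-1, 2, -1)$ and right-hand side $(1, 1, 1)^T$ for unknowns $u_1, u_2, u_3$, split into two subdomains with interior unknowns $a_1 = u_1$ and $a_2 = u_3$ and a single interface unknown $a = u_2$. All blocks are then scalars.

```latex
% Illustrative example only: every block of (1.9) is a 1 x 1 matrix.
\begin{align*}
A_1 = A_2 = 2, \quad B_1 = B_2 = C_1 = C_2 = -1, \quad A = 2, \quad f_1 = f_2 = f = 1, \\
A_s = A - \sum_{i=1}^{2} B_i A_i^{-1} C_i = 2 - \tfrac{1}{2} - \tfrac{1}{2} = 1, \\
f - \sum_{i=1}^{2} B_i A_i^{-1} f_i = 1 + \tfrac{1}{2} + \tfrac{1}{2} = 2 .
\end{align*}
```

In this example (1.15) is the scalar equation $a = 2$, which agrees with the direct solution $u = (3/2, 2, 3/2)^T$ of the full system.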

If equation (1.15) is then solved for $a$, the result can be substituted into equation (1.10), which is then solved for $a_i$ for all $i$. This approach is ideal for distributed memory machines: each subdomain system in (1.10) is held on a single processor and may therefore be solved in parallel with the others when required. Moreover, if an iterative method, such as the conjugate gradient (CG) algorithm ([40]), is used to solve equation (1.15) then it is not necessary to form the matrix $A_s$ of (1.14) explicitly. The main step involved is the matrix-vector multiplication $w = A_s p$, where $p$ is the direction vector obtained from the residual of the $k$th iterate of $a$, so we have

$$ w = A p - \sum_i B_i \bigl( A_i^{-1} (C_i p) \bigr). \tag{1.16} $$

From equation (1.16) it is clear that $w$ can be obtained using only matrix-vector multiplications and subdomain solves (some local communication is also required between processors sharing interpartition boundary vertices).
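The matrix-free product (1.16) can be sketched in a few lines of serial Python. This is a minimal illustration under stated assumptions, not the implementation used in the source: all function names are hypothetical, the blocks are small dense lists, a dense Gaussian elimination stands in for the subdomain solve, and the loop over subdomains runs sequentially where a parallel code would run one term per processor. The test data is the matrix tridiag(-1, 2, -1) on eight unknowns with right-hand side all ones, split into three subdomains of two interior unknowns each and two interface unknowns.

```python
def matvec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]

def dot(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def axpy(alpha, x, y):
    """Return alpha * x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def solve(M, b):
    """Dense Gaussian elimination with partial pivoting (stand-in for a subdomain solve)."""
    n = len(b)
    aug = [list(map(float, row)) + [float(bi)] for row, bi in zip(M, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(aug[r][k]))
        aug[k], aug[piv] = aug[piv], aug[k]
        for r in range(k + 1, n):
            factor = aug[r][k] / aug[k][k]
            for c in range(k, n + 1):
                aug[r][c] -= factor * aug[k][c]
    x = [0.0] * n
    for k in reversed(range(n)):
        s = sum(aug[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (aug[k][n] - s) / aug[k][k]
    return x

def schur_matvec(A, subs, p):
    """w = A p - sum_i B_i (A_i^{-1} (C_i p)), equation (1.16); A_s is never formed."""
    w = matvec(A, p)
    for Ai, Bi, Ci, _fi in subs:  # each term uses data of one subdomain only
        w = axpy(-1.0, matvec(Bi, solve(Ai, matvec(Ci, p))), w)
    return w

def reduced_rhs(f, subs):
    """Right-hand side of (1.15): f - sum_i B_i A_i^{-1} f_i."""
    g = [float(v) for v in f]
    for Ai, Bi, _Ci, fi in subs:
        g = axpy(-1.0, matvec(Bi, solve(Ai, fi)), g)
    return g

def cg(apply_op, b, tol=1e-12, max_iter=50):
    """Unpreconditioned conjugate gradients for a symmetric positive definite operator."""
    x = [0.0] * len(b)
    r = list(b)
    p = list(r)
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 <= tol:
            break
        w = apply_op(p)
        alpha = rr / dot(p, w)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, w, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)  # new direction: r + beta * p
        rr = rr_new
    return x

def substructure_solve(A, f, subs):
    """Solve (1.15) for the interface unknowns, then (1.10) for each interior block."""
    a = cg(lambda p: schur_matvec(A, subs, p), reduced_rhs(f, subs))
    interiors = [solve(Ai, axpy(-1.0, matvec(Ci, a), fi)) for Ai, _Bi, Ci, fi in subs]
    return a, interiors

# Each subdomain is a tuple (A_i, B_i, C_i, f_i); interface unknowns are nodes 2 and 5.
T = [[2, -1], [-1, 2]]
subs = [
    (T, [[0, -1], [0, 0]], [[0, 0], [-1, 0]], [1, 1]),    # nodes 0, 1
    (T, [[-1, 0], [0, -1]], [[-1, 0], [0, -1]], [1, 1]),  # nodes 3, 4
    (T, [[0, 0], [-1, 0]], [[0, -1], [0, 0]], [1, 1]),    # nodes 6, 7
]
a, interiors = substructure_solve([[2, 0], [0, 2]], [1, 1], subs)
```

Only `schur_matvec` and the final interior solves touch the subdomain blocks, which is why the approach maps naturally onto one subdomain per processor.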

From the above discussion it is clear that the communication overhead is proportional to the number of vertices on the interpartition boundary, hence one should try to keep this boundary as small as possible. Also, once the vector $a$ is known, each processor solves its own equation (1.10) in parallel, hence it is desirable that the number of unknowns in each $a_i$ is approximately the same (otherwise some processors will be idle while others are still busy solving their systems).

Hence the decomposition of the elements of the mesh into subdomains should have two main features:

- each processor should store approximately the same number of vertices or elements (to ensure equal load),

- the number of vertices which lie on the boundary between the processors should be kept low.
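Both criteria can be measured directly for a given assignment of elements to processors. The following sketch is illustrative only: the function name `partition_quality`, the imbalance measure (largest load divided by the average load) and the four-triangle test mesh are assumptions introduced here, not taken from the source.

```python
from collections import Counter

def partition_quality(elements, part_of, n_parts):
    """Return (elements per part, load imbalance, set of interpartition boundary vertices).

    elements -- list of vertex tuples, one per mesh element
    part_of  -- part_of[e] is the processor that owns element e
    """
    loads = Counter(part_of)
    parts_at_vertex = {}
    for verts, part in zip(elements, part_of):
        for v in verts:
            parts_at_vertex.setdefault(v, set()).add(part)
    # A vertex lies on the interpartition boundary if elements of two or more parts touch it.
    boundary = {v for v, parts in parts_at_vertex.items() if len(parts) > 1}
    imbalance = max(loads.values()) / (len(elements) / n_parts)
    return dict(loads), imbalance, boundary

# A strip of four triangles on vertices 0..5 (bottom row 0, 1, 2; top row 3, 4, 5).
strip = [(0, 1, 3), (1, 4, 3), (1, 2, 4), (2, 5, 4)]
contiguous = partition_quality(strip, [0, 0, 1, 1], 2)
interleaved = partition_quality(strip, [0, 1, 0, 1], 2)
```

Both assignments are perfectly balanced, but the contiguous one shares only vertices 1 and 4 between the two processors, whereas the interleaved one shares vertices 1, 2, 3 and 4.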

In order to achieve the above we first define the dual graph of a given mesh. The dual graph is obtained by replacing each element by a node and connecting a pair of nodes by an edge if and only if the corresponding elements are neighbours of each other. The above problem then becomes a special case of a more general problem, namely the graph partitioning problem.
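The dual graph construction can be sketched as follows. This is a minimal illustration that assumes a triangular mesh and takes two elements to be neighbours when they share an edge; the source does not state which notion of neighbour it uses, and the function name `dual_graph` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def dual_graph(triangles):
    """Return {element: set of neighbouring elements} for a triangular mesh.

    Assumption: two triangles are neighbours if they share an edge
    (that is, two of their three vertices).
    """
    elems_on_edge = defaultdict(list)
    for e, verts in enumerate(triangles):
        # For a triangle, every pair of its vertices is one of its edges.
        for edge in combinations(sorted(verts), 2):
            elems_on_edge[edge].append(e)
    adj = {e: set() for e in range(len(triangles))}
    for elems in elems_on_edge.values():
        for e1, e2 in combinations(elems, 2):
            adj[e1].add(e2)
            adj[e2].add(e1)
    return adj

# A strip of four triangles on vertices 0..5; its dual graph is the path 0 - 1 - 2 - 3.
strip = [(0, 1, 3), (1, 4, 3), (1, 2, 4), (2, 5, 4)]
strip_dual = dual_graph(strip)
```

For elements other than triangles the edge enumeration would need to follow the element's actual edges rather than all vertex pairs.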

The $n$-way graph partitioning problem is defined as follows. Let $G = G(N, E)$ be an undirected graph, where $N$ is the set of nodes with $\|N\|$ nodes and $E$ is the set of edges with $\|E\|$ edges. Partition $N$ into $n$ subsets $N_1, N_2, \ldots, N_n$ such that $N_i \cap N_j = \emptyset$ for $i \neq j$, $\|N_i\| = \|N\| / n$ and $\bigcup_i N_i = N$, and such that the number of edges of $E$ whose incident vertices belong to different subsets is minimised. The $n$-way partition problem is most frequently solved by recursive bisection. That is, we first obtain a 2-way partition of $N$, and then we further subdivide each part using 2-way partitions. After $\log_2 n$ phases, the graph $G$ is partitioned into $n$ parts. Thus, the problem of performing an $n$-way partition is reduced to that of performing a sequence of 2-way partitions or bisections.
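The recursive bisection scheme can be sketched as below. The 2-way step used here, which takes the first half of a breadth-first ordering, is only a simple stand-in chosen for illustration; it is not one of the heuristics reviewed in the source, and all function names are hypothetical. The sketch assumes that $n$ is a power of two.

```python
from collections import deque

def bfs_order(adj, nodes):
    """Breadth-first ordering of the subgraph induced by `nodes`."""
    allowed = set(nodes)
    order, seen = [], set()
    for start in sorted(allowed):  # restart so disconnected pieces are covered too
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in sorted(adj[u]):
                if v in allowed and v not in seen:
                    seen.add(v)
                    queue.append(v)
    return order

def bisect(adj, nodes):
    """Stand-in 2-way partition: split the breadth-first ordering in half."""
    order = bfs_order(adj, nodes)
    half = len(order) // 2
    return order[:half], order[half:]

def recursive_bisection(adj, nodes, n):
    """Partition `nodes` into n parts (n a power of two) by repeated bisection."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    if n == 1:
        return [sorted(nodes)]
    left, right = bisect(adj, nodes)
    return (recursive_bisection(adj, left, n // 2)
            + recursive_bisection(adj, right, n // 2))

def edge_cut(adj, parts):
    """Number of edges whose end points lie in different parts."""
    owner = {v: k for k, part in enumerate(parts) for v in part}
    return sum(1 for u in adj for v in adj[u] if u < v and owner[u] != owner[v])

# Path graph on 8 nodes, split into 4 parts in log2(4) = 2 bisection phases.
path = {i: {j for j in (i - 1, i + 1) if 0 <= j < 8} for i in range(8)}
parts = recursive_bisection(path, range(8), 4)
cut = edge_cut(path, parts)
```

Any better bisection heuristic can be substituted for `bisect` without changing the recursive driver.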

Unfortunately this problem, which is well known in the graph theory literature, is not known to be solvable in polynomial time: it is in fact an NP-hard problem ([22, 36, 68]). Nevertheless there are heuristic approaches which perform well in most cases. In the next few sections we review some of the more important of these heuristics.

Source: Parallel Dynamic Load-Balancing for Adaptive Distributive Memory PDE Solvers, Nasir Touheed (pages 31-34).