method
4.2.1
Introduction
In the Schur substructuring method, the underlying mesh is subdivided into blocks (submeshes), and the blocks are mapped to processors. Then, for each block, the internal degrees of freedom are eliminated using a direct method, leading to a reduced system of equations that involves only the interface degrees of freedom. The internal elimination is carried out independently on each processor and requires no communication. The remaining parallel problem, the reduced system,
42 Design of parallel distributed implementation
is then solved using an appropriate preconditioned Krylov subspace solver such as CG, GMRES, MINRES or BiCGSTAB [86]. An iterative solver is used because it exhibits better performance and is easier to implement on large parallel machines than sparse direct solvers. Once the iterative process has converged to the desired accuracy, the solution of the reduced system is used by the direct solver, simultaneously on all processors, to compute the solution for the interior degrees of freedom. Thus the hybrid approach can be decomposed into three main phases:
• the first phase (phase1) consists of the local factorization and the computation of the local Schur complements,
• the setup of the preconditioners (phase2),
• the iterative phase (phase3).
We describe below the main algorithmic and software tools we have used for our parallel implementation. In Section 4.2.2, we briefly present the multifrontal method and the direct solver MUMPS, which is a parallel package for distributed platforms. In Section 4.2.3, we discuss the efficient implementation of both the local ($M_{d-64}$, $M_{d-mix}$ and $M_{sp-64}$) and global components of the preconditioner. Finally, in Section 4.2.4, we describe the parallel implementation of the main kernels of the iterative solvers.
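The phases listed above can be illustrated on a toy problem. The following is a minimal sketch, not the implementation described in this chapter: it assumes a 1D Laplacian split into three blocks by two interface nodes, uses dense NumPy solves in place of a sparse direct solver, and runs an unpreconditioned CG, so the preconditioner setup (phase2) is omitted. All names (`conjugate_gradient`, `interface`, `interiors`) are illustrative.

```python
import numpy as np

def conjugate_gradient(S, g, tol=1e-12, max_iter=100):
    """Unpreconditioned CG for a symmetric positive definite matrix S."""
    x = np.zeros_like(g)
    r = g.copy()                      # residual for the zero initial guess
    p = r.copy()
    rs = r @ r
    g_norm = np.linalg.norm(g)
    for _ in range(max_iter):
        if np.sqrt(rs) <= tol * g_norm:
            break
        Sp = S @ p
        alpha = rs / (p @ Sp)
        x += alpha * p
        r -= alpha * Sp
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy problem (assumption): 1D Laplacian on 11 nodes, three blocks, two interface nodes.
n = 11
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
b = np.arange(1.0, n + 1.0)
interface = [3, 7]
interiors = [[0, 1, 2], [4, 5, 6], [8, 9, 10]]

# Phase 1: each block eliminates its interior unknowns independently.
S = A[np.ix_(interface, interface)].copy()
g = b[interface].copy()
for I in interiors:
    A_II = A[np.ix_(I, I)]
    A_GI = A[np.ix_(interface, I)]
    S -= A_GI @ np.linalg.solve(A_II, A[np.ix_(I, interface)])
    g -= A_GI @ np.linalg.solve(A_II, b[I])

# Phase 3: iterative solution of the reduced (interface) system.
x = np.empty(n)
x[interface] = conjugate_gradient(S, g)

# Back-substitution: each block recovers its interior unknowns independently.
for I in interiors:
    rhs = b[I] - A[np.ix_(I, interface)] @ x[interface]
    x[I] = np.linalg.solve(A[np.ix_(I, I)], rhs)

residual = np.linalg.norm(A @ x - b)
```

In this sketch the two loops over `interiors` are the parts that would run concurrently, one block per processor, while only the reduced system couples the blocks.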
4.2.2
Local solvers
Many parallel sparse direct algorithms have been developed, such as multifrontal approaches [37,38], supernodal approaches [32] and Fan-both algorithms [8]. Our work is based on the multifrontal approach. This method is used to compute the $LU$ or $LDL^T$ factorization of a general sparse matrix. Among the few available parallel distributed direct solvers, MUMPS offers a unique feature, which is the possibility to compute the Schur complements defined in Equation (4.1) using efficient sparse calculation techniques,

$$S_i = A_{\Gamma_i \Gamma_i} - A_{\Gamma_i I_i} A_{I_i I_i}^{-1} A_{I_i \Gamma_i}. \qquad (4.1)$$

This calculation is performed very efficiently because MUMPS implements a multifrontal approach [37] in which local Schur complements are computed at each step of the elimination tree process (during the factorization of each frontal matrix), and it is based on level 3 BLAS routines. Basically, the Schur complement feature of MUMPS can be viewed as a partial factorization, where the factorization of the root, associated with the indices of $A_{\Gamma_i \Gamma_i}$, is disabled. Consequently, this feature fully benefits from the overall efficiency of the multifrontal approach implemented by MUMPS. From a software point of view, the user must specify the list of indices associated with $A_{\Gamma_i \Gamma_i}$. The code then provides a factorization of the $A_{I_i I_i}$ matrix and the explicit Schur complement matrix $S_i$. The Schur complement matrix is returned as a dense matrix. The partial factorization that builds the Schur complement matrix can also be used to solve linear systems associated with the matrix $A_{I_i I_i}$.
The MUMPS software
The software MUMPS (MUltifrontal Massively Parallel Solver) is an implementation of the multifrontal technique for parallel platforms. It is written in Fortran 90 and uses features of this language such as modularity and dynamic memory allocation. We present here the main features of this package.
• Factorization: sparse symmetric positive definite matrices ($LDL^T$ factorization), general symmetric matrices and general unsymmetric matrices ($LU$ factorization).
• Entry format for the matrices: The matrix can be given in different formats. The three formats that can be used are:
– the centralized format where the matrix is stored in coordinate format on the root processor,
– the distributed format where each processor owns a subset of the matrix described in coordinate format, defined in the global ordering,
– the elemental format where the matrix is described as a sum of dense elementary matrices.
• Ordering and scaling: the code implements different orderings such as AMD [1], QAMD, PORD [89], METIS [62] nested dissection, AMF, and user defined orderings.
• Distributed or centralized Schur complement: the software enables us to compute the Schur complement explicitly. The Schur complement matrix is returned as a dense matrix. It can be returned as a centralized matrix on the root processor or as a distributed 2D block-cyclic matrix.
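To make the coordinate input format and the Schur complement feature listed above concrete, here is a small NumPy sketch of what such an interface returns. It does not call MUMPS and does not reproduce its API: the functions `coo_to_dense` and `partial_factorization` are hypothetical, and a dense solve stands in for the sparse multifrontal factorization. The user supplies the matrix as (row, column, value) triplets and the list of indices of the Schur block, and gets back the dense Schur complement of Equation (4.1) together with a solver for the interior block.

```python
import numpy as np

def coo_to_dense(n, rows, cols, vals):
    """Build an n x n dense matrix from coordinate-format triplets.
    In this sketch, duplicate (row, col) entries are accumulated."""
    A = np.zeros((n, n))
    np.add.at(A, (np.asarray(rows), np.asarray(cols)), np.asarray(vals, dtype=float))
    return A

def partial_factorization(A, schur_indices):
    """Return the dense Schur complement on `schur_indices` and a solver
    for systems with the interior block A_II (hypothetical helper)."""
    n = A.shape[0]
    G = np.asarray(schur_indices)
    I = np.setdiff1d(np.arange(n), G)          # interior indices
    A_II = A[np.ix_(I, I)]
    S = A[np.ix_(G, G)] - A[np.ix_(G, I)] @ np.linalg.solve(A_II, A[np.ix_(I, G)])

    def solve_interior(rhs):
        # A real solver would reuse the stored factors of A_II here.
        return np.linalg.solve(A_II, rhs)

    return S, solve_interior

# 3 x 3 symmetric example in coordinate format (0-based indices).
rows = [0, 0, 1, 1, 1, 2, 2]
cols = [0, 1, 0, 1, 2, 1, 2]
vals = [4.0, 1.0, 1.0, 3.0, 1.0, 1.0, 2.0]
A = coo_to_dense(3, rows, cols, vals)

S, solve_interior = partial_factorization(A, schur_indices=[2])
x_interior = solve_interior(np.array([5.0, 4.0]))
```

For this matrix the Schur complement is the 1 x 1 matrix with entry 18/11, which equals det(A) / det(A_II).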
4.2.3
Local preconditioner and coarse grid implementations
In this subsection we discuss both the local and the global (coarse grid correction) components of the preconditioner considered in our work.
Local preconditioner:
This phase depends on the variant of the preconditioner used. For the dense preconditioner, it consists in assembling the local Schur complements computed by the direct solver and then factorizing them concurrently using LAPACK kernels. For the mixed arithmetic preconditioner, it consists in assembling the local Schur complements in 32-bit arithmetic and then factorizing them concurrently using LAPACK kernels. For the sparse preconditioner, it consists in assembling the local Schur complements, sparsifying them concurrently, and then factorizing them using the sparse direct solver MUMPS. The assembly phase consists in exchanging part of the local Schur data between neighbouring subdomains. This step is outlined in Algorithm 4.
Algorithm 4 Assembling the local Schur complement
1: $\bar{S}_i = S_i$, or $\bar{S}_i = \mathrm{sngl}(S_i)$ for $M_{d-mix}$
2: for k = 1, nbneighbour do
3: Bufferize SEND part of $S_i$ to neighbour k;
4: end for
5: for k = 1, nbneighbour do
6: Receive RECV part of $S_i$ from neighbour k: buffertemp ← RECV()
7: Update $\bar{S}_i \leftarrow \bar{S}_i$ + buffertemp.
8: end for
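The exchange in Algorithm 4 can be simulated in a single process. The sketch below is an illustration under stated assumptions, not the MPI code of this work: subdomains are list entries, a dictionary of send buffers stands in for the SEND/RECV calls, and `gamma[i]` (a hypothetical name) holds the global numbers of the interface unknowns of subdomain i. The `single_precision` flag mimics the `sngl` conversion used for $M_{d-mix}$.

```python
import numpy as np

def assemble_local_schur(gamma, S_loc, single_precision=False):
    """Simulated neighbour-to-neighbour assembly of the local Schur complements."""
    dtype = np.float32 if single_precision else np.float64
    # Step 1: start from the local Schur complement, optionally in 32-bit.
    S_bar = [S.astype(dtype) for S in S_loc]
    nsub = len(gamma)

    # Steps 2-4: subdomain i packs, for each neighbour k, the block of S_i
    # that lives on the unknowns shared with k.
    send_buffers = {}
    for i in range(nsub):
        for k in range(nsub):
            if k == i:
                continue
            shared, pos_i, _ = np.intersect1d(gamma[i], gamma[k], return_indices=True)
            if shared.size > 0:
                block = S_loc[i][np.ix_(pos_i, pos_i)].astype(dtype)
                send_buffers[(i, k)] = (shared, block)

    # Steps 5-8: each subdomain receives the blocks and accumulates them.
    for (src, dst), (shared, buffertemp) in send_buffers.items():
        _, pos_dst, _ = np.intersect1d(gamma[dst], shared, return_indices=True)
        S_bar[dst][np.ix_(pos_dst, pos_dst)] += buffertemp
    return S_bar

# Three subdomains in a strip: the middle one touches both neighbours.
gamma = [np.array([0, 1]), np.array([0, 1, 2, 3]), np.array([2, 3])]
rng = np.random.default_rng(0)
S_loc = []
for g in gamma:
    B = rng.standard_normal((g.size, g.size))
    S_loc.append(B + B.T)                      # arbitrary symmetric test data

S_bar = assemble_local_schur(gamma, S_loc)
S_bar_mix = assemble_local_schur(gamma, S_loc, single_precision=True)
```

After the exchange, each `S_bar[i]` coincides with the restriction of the globally assembled Schur complement to the interface of subdomain i, which is what the local preconditioner then factorizes.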
Construction of the coarse part:
The coarse matrix is computed once, as described in Algorithm 5. Because the matrix associated with the coarse space is small, we decide to build and store it redundantly on all the processors. In this way, we expect that applying the coarse correction at each step of the iterative process requires only one global communication, for the construction of the right-hand side [26]. The coarse solve is then performed simultaneously by all processors. So, at the slight cost of storing the coarse matrix, we can cheaply apply this component of the preconditioner.
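A rough sketch of the redundant coarse-matrix construction follows. Several ingredients are assumptions that this section does not specify: the coarse space `R0` is taken to have one row per subdomain (the indicator of its interface unknowns), the local contributions are combined by a plain sum standing in for a global reduction, and the correction is applied in the usual form $R_0^T S_0^{-1} R_0 r$. Dense NumPy products stand in for the GEMM calls, and a Cholesky factorization is used because the test data are symmetric positive definite.

```python
import numpy as np

rng = np.random.default_rng(0)
n_gamma = 6
gamma = [np.array([0, 1, 2]), np.array([2, 3, 4]), np.array([0, 4, 5])]

# Hypothetical local Schur complements, symmetric positive definite by construction.
S_loc = []
for g in gamma:
    B = rng.standard_normal((g.size, g.size))
    S_loc.append(B @ B.T + g.size * np.eye(g.size))

# Hypothetical coarse space: row i is the indicator of the interface of subdomain i.
R0 = np.zeros((len(gamma), n_gamma))
for i, g in enumerate(gamma):
    R0[i, g] = 1.0

# Two dense products per subdomain (the GEMM steps).
contributions = []
for g, S_i in zip(gamma, S_loc):
    R0_i = R0[:, g]                  # columns of R0 seen by this subdomain
    tempS = S_i @ R0_i.T
    contributions.append(R0_i @ tempS)

# Assembly: the sum stands in for a global reduction; every process keeps S0.
S0 = sum(contributions)

# Redundant factorization, performed identically on every process.
L0 = np.linalg.cholesky(S0)

def apply_coarse_correction(r):
    """Apply R0^T S0^{-1} R0 to an interface vector r (assumed form)."""
    r0 = R0 @ r                      # in parallel: the single global communication
    z0 = np.linalg.solve(L0.T, np.linalg.solve(L0, r0))
    return R0.T @ z0

z = apply_coarse_correction(np.arange(1.0, n_gamma + 1.0))
```

Because `S0` has only as many rows as there are coarse vectors, storing and solving it on every process costs little compared with the local Schur complements.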
4.2.4
Parallelizing iterative solvers
The efficient implementation of a Krylov method strongly depends on the implementation of three computational kernels, namely the matrix-vector product, the application of the preconditioner to a vector, and the dot product calculation.
Algorithm 5 Construction of the coarse component
1: Each processor calls GEMM to compute tempS ← $S R_0^T$
2: Each processor calls GEMM to compute S0loc ← $R_0$ tempS
3: Each processor reorders $S_0$ ← S0loc in subdomains order
4: Assemble $S_0$ in all processors
5: Factorize $S_0$ simultaneously in all processors

Matrix-vector product: $y_i = S_i x_i$
It can be performed in two ways: explicitly, using a BLAS-2 routine, or implicitly, using sparse matrix-vector calculations. The explicit computation is described by Algorithm 6, whereas the implicit one is given by Algorithm 7.
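The two ways of forming the local product can be compared on a small example. This is a sketch under simplifying assumptions: the local matrix is a 1D Laplacian on five nodes whose two end nodes form the interface, dense NumPy arrays stand in for the sparse blocks, and a Cholesky factor computed once stands in for the stored factors of $A_{I_i I_i}$. The function names are illustrative.

```python
import numpy as np

# Local matrix (assumption): 1D Laplacian, interface = the two end nodes.
n = 5
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
G = [0, 4]          # interface unknowns
I = [1, 2, 3]       # interior unknowns
A_II, A_IG = A[np.ix_(I, I)], A[np.ix_(I, G)]
A_GI, A_GG = A[np.ix_(G, I)], A[np.ix_(G, G)]

# Factor the interior block once; both variants reuse this factor.
L = np.linalg.cholesky(A_II)

def solve_interior(v):
    """Forward/backward substitution with the stored factor of A_II."""
    return np.linalg.solve(L.T, np.linalg.solve(L, v))

# Explicit variant: form the dense Schur complement, then one dense product.
S = A_GG - A_GI @ solve_interior(A_IG)

def matvec_explicit(x):
    return S @ x

# Implicit variant: never form S; two sparse-like products and one solve.
def matvec_implicit(x):
    y = A_IG @ x                 # step 1
    y = solve_interior(y)        # step 2
    return A_GG @ x - A_GI @ y   # step 3

x = np.array([1.0, 2.0])
y_explicit = matvec_explicit(x)
y_implicit = matvec_implicit(x)
```

Both variants return the same vector; the explicit one pays the cost of forming and storing the dense `S` up front, whereas the implicit one pays for an interior solve at every product.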
Algorithm 6 Explicit matrix-vector product
1: Completely parallel and does not need any communication between processors. Each processor calls DGEMV to compute $y_i \leftarrow S_i x_i$
2: Update data: it needs some exchange of information between neighbouring subdomains. Each processor assembles $y \leftarrow \sum_{i=1}^{\mathrm{nbneighbour}} R_{\Gamma_i} y_i$
3: for k = 1, nbneighbour do
4: Bufferize SEND part of $y_i$ to neighbour k;
5: end for
6: for k = 1, nbneighbour do
7: Receive RECV part of $y_i$ from neighbour k: ytemp ← RECV()
8: Update $y_i \leftarrow y_i$ + ytemp.
9: end for
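The following single-process sketch mimics Algorithm 6: a local dense product followed by a neighbour update of the shared entries. It is an illustration only, assuming that `gamma[i]` (a hypothetical name) lists the global numbers of the interface unknowns of subdomain i and that direct array access stands in for the SEND/RECV calls.

```python
import numpy as np

def distributed_matvec(gamma, S_loc, x_loc):
    """Local products y_i = S_i x_i, then accumulation of neighbour contributions."""
    # Step 1: purely local dense products (the DGEMV step), no communication.
    y_loc = [S @ x for S, x in zip(S_loc, x_loc)]

    # Steps 2-9: every subdomain adds the values its neighbours computed
    # on the unknowns they share.
    y_new = [y.copy() for y in y_loc]
    for i, g_i in enumerate(gamma):
        for k, g_k in enumerate(gamma):
            if k == i:
                continue
            _, pos_i, pos_k = np.intersect1d(g_i, g_k, return_indices=True)
            y_new[i][pos_i] += y_loc[k][pos_k]   # ytemp received from neighbour k
    return y_new

gamma = [np.array([0, 1]), np.array([0, 1, 2, 3]), np.array([2, 3])]
rng = np.random.default_rng(1)
S_loc = [rng.standard_normal((g.size, g.size)) for g in gamma]

x = np.arange(1.0, 5.0)              # a global interface vector
x_loc = [x[g] for g in gamma]        # each subdomain holds its own entries
y_loc = distributed_matvec(gamma, S_loc, x_loc)
```

After the update, each `y_loc[i]` holds the entries of the global product restricted to the interface of subdomain i, so shared entries are stored consistently on all subdomains that own them.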
Algorithm 7 Implicit matrix-vector product
1: Each processor computes a sparse matrix-vector product $y_i \leftarrow A_{I_i \Gamma_i} x_i$. We use a special subroutine for the sparse matrix-vector product
2: Concurrently, each processor calls MUMPS to perform a forward/backward substitution $y_i \leftarrow A_{I_i I_i}^{-1} y_i$ using the computed factors of $A_{I_i I_i}$
3: Then, also in parallel, each processor computes the sparse matrix-vector product $y_i \leftarrow A_{\Gamma_i \Gamma_i} x_i - A_{\Gamma_i I_i} y_i$
4: Last step (update data): it needs some exchange of information between neighbouring subdomains. Each processor assembles $y \leftarrow \sum_{i=1}^{\mathrm{nbneighbour}} R_{\Gamma_i} y_i$

Applying the preconditioner: $y_i = M_i^{-1} x_i$
This step, described in Algorithm 8, can be performed using either LAPACK kernels for the dense preconditioner or a forward/backward substitution using the sparse solver MUMPS for the sparse preconditioner.
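As an illustration of the dense case, the sketch below factorizes an assembled local Schur complement once and then applies its inverse by two triangular solves, in either 64-bit or 32-bit arithmetic. It rests on assumptions not taken from this section: the matrix is symmetric positive definite, so a Cholesky factorization is used, and NumPy's general `solve` stands in for a dedicated LAPACK triangular solve. The helper names are hypothetical.

```python
import numpy as np

def factorize(S_bar, dtype=np.float64):
    """Setup phase: Cholesky factor of the assembled local Schur complement,
    computed and stored in the requested precision."""
    return np.linalg.cholesky(S_bar.astype(dtype))

def apply_preconditioner(L, x):
    """Iterative phase: y = M^{-1} x using the stored factor L (M = L L^T).
    The solves run in the precision of L; the result is returned in 64-bit."""
    z = np.linalg.solve(L, x.astype(L.dtype))      # forward substitution
    y = np.linalg.solve(L.T, z)                    # backward substitution
    return y.astype(np.float64)

S_bar = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 1.0],
                  [0.0, 1.0, 2.0]])
x = np.array([1.0, 2.0, 3.0])

L64 = factorize(S_bar)                       # dense 64-bit variant
L32 = factorize(S_bar, dtype=np.float32)     # mixed arithmetic variant

y64 = apply_preconditioner(L64, x)
y32 = apply_preconditioner(L32, x)
```

The 32-bit factor halves the storage of the preconditioner, at the price of an application that is accurate only to single precision.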
The dot product: $y_i = y_i^T x_i$
The dot product calculation is simply a local dot product computed by each processor, followed by a global reduction to assemble the complete result, as described in Algorithm 9.
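A single-process sketch of this kernel is given below. One detail is an assumption, since this section does not say how entries shared by several subdomains are handled: here each shared entry is counted only by the lowest-numbered subdomain that holds it, so that the reduction does not count it twice. A Python `sum` stands in for the global reduction, and the helper names are hypothetical.

```python
import numpy as np

def ownership_masks(gamma):
    """Mark, for each subdomain, the entries it is responsible for:
    an entry is owned by the lowest-numbered subdomain that holds it."""
    masks = []
    for i, g in enumerate(gamma):
        owned = np.ones(g.size, dtype=bool)
        for k in range(i):
            owned &= ~np.isin(g, gamma[k])
        masks.append(owned)
    return masks

def distributed_dot(x_loc, y_loc, masks):
    # Local partial dot products, restricted to owned entries.
    partial = [float(x[m] @ y[m]) for x, y, m in zip(x_loc, y_loc, masks)]
    # Global reduction (a sum over processes in a parallel run).
    return sum(partial)

gamma = [np.array([0, 1]), np.array([0, 1, 2, 3]), np.array([2, 3])]
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([4.0, 3.0, 2.0, 1.0])
x_loc = [x[g] for g in gamma]
y_loc = [y[g] for g in gamma]

masks = ownership_masks(gamma)
result = distributed_dot(x_loc, y_loc, masks)
```

The only communication is the final reduction of one scalar per process, which is why this kernel is a synchronization point rather than a bandwidth cost in a Krylov iteration.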