• No results found

A numerical experimental study of inverse preconditioning for the parallel iterative solution to 3D finite element flow equations

N/A
N/A
Protected

Academic year: 2021

Share "A numerical experimental study of inverse preconditioning for the parallel iterative solution to 3D finite element flow equations"

Copied!
7
0
0

Loading.... (view fulltext now)

Full text

(1)

Journal of Computational and Applied Mathematics 210 (2007) 64 – 70

www.elsevier.com/locate/cam

A numerical experimental study of inverse preconditioning for the

parallel iterative solution to 3D finite element flow equations

L. Bergamaschi

, G. Gambolati, G. Pini

Department of Mathematical Methods and Models for Scientific Applications, University of Padua, Via Trieste 63, 35121 Padova, Italy

Abstract

Integration of the subsurface flow equation by finite elements (FE) in space and finite differences (FD) in time requires the repeated solution to sparse symmetric positive definite systems of linear equations. Iterative techniques based on preconditioned conjugate gradients (PCG) are one of the most attractive tool to solve the problem on sequential computers. A present challenge is to make PCG attractive in a parallel computing environment as well. To this aim a key factor is the development of an efficient parallel preconditioner. FSAI (factorized sparse approximate inverse) and enlarged FSAI relying on the approximate inverse of the coefficient matrix appears to be a most promising parallel preconditioner. In the present paper PCG using FSAI, diagonal and pARMS (parallel algebraic recursive multilevel solvers) preconditioners is implemented on the IBM SP4/512 and CLX/768 supercomputers with up to 32 processors to solve underground flow problems of a large size. The results show that FSAI may allow for a parallel relative efficiencyEp∗larger than 50% on the largest problems withp = 32 processors. Moreover, FSAI turns out to be significantly less expensive and more robust than pARMS. Finally, it is shown thatEpfor p in the upper range may be much improved if PCG–FSAI is implemented on CLX.

© 2006 Elsevier B.V. All rights reserved. MSC: 65N30; 65F10; 65Y05

Keywords: Iterative method; Approximate inverse; Finite elements; Parallel preconditioning

1. Introduction

Solving the classical equation of flow through porous media by finite element (FE) in space and finite differences (FD) in time, e.g., the Crank–Nicolson scheme[7], yields the following linear set of equations:

 H +2P t  ht+t=  2P t − H  ht− qt− qt+t, (1)

where htis the vector of nodal potential heads at timet, H and P are the stiffness and capacity matrices, respectively, both symmetric positive definite (spd), qt is the source/sink term, andt the time integration step. It is well known that scheme (1) is unconditionally stable. Typically H and P are time independent witht frequently increased by a factor f (between 1 and 2) during the simulation if q is also time independent. As a major result the coefficient matrix

Corresponding author. Tel.: +39 49 827 1329; fax: +39 49 827 1333.

E-mail addresses:[email protected](L. Bergamaschi),[email protected](G. Gambolati),[email protected](G. Pini).

0377-0427/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.cam.2006.10.056

(2)

in (1) changes with the time integration level and, at least theoretically, the preconditioned conjugate gradients (PCG) preconditioner might also change.

In recent years, PCG has become a quite standard method to solve spd sets of linear equations on sequential computers (see for instance[22]). Several preconditioners have been developed[1,2,10–12,20,24]. Preconditioning based on the incomplete Cholesky decomposition of the coefficient matrix with zero (IC(0)[16]) or variable (ILUT[23]) is perhaps the technique most widely used. However, both IC(0) and ILUT prove inefficient on parallel computers because of the native unsuitability of the triangular factorization to be parallelized. On supercomputers the simple diagonal scaling appears to be superior to IC(0)[21]and may represent a quite inexpensive and nevertheless efficient preconditioner. An important improvement in a parallel context is offered by the approximate inverse[10–15,25], specifically the one labeled FSAI (factorized sparse approximate inverse) or “enlarged” FSAI(), with a user specified parameter [17,18]. The diagonal preconditioner, FSAI and FSAI() are implemented into our FE flow code developed to solve (1) in a 3D porous medium[6]with the respective performance addressed and compared on an IBM SP4 and CLX supercomputers using up to 32 parallel processors. FSAI() is also compared with the preconditioner of the package parallel algebraic recursive multilevel solvers (pARMS)[19]properly incorporated into our FE code. We observe that pARMS makes use of a block preconditioner based on the Schur complement. Comparison are performed in terms of pseudospeed-up

S

pand parallel pseudoefficiencyEpvs. the number p of processors used on the IBM SP4 and CLX supercomputers. The paper is organized as follows. FSAI is first briefly reviewed. The PCG parallelization together with the H and

P construction and parallel FSAI implementation is then addressed. A set of SP4 and CLX runs for complex 3D flow

problems of variable size N up to half a million using tetrahedral FE are carried out with the outcome discussed in terms of execution time, number of PCG iterations,Sp∗andEp. Finally, a few conclusive remarks on the efficiency of PCG–FSAI on parallel computers are provided.

2. PCG–FSAI parallelization

2.1. FSAI preconditioner

The FSAI method computes an approximate inverse of an spd matrix A in the factorized formQ = GTLGL, whereGL

is a sparse nonsingular lower triangular matrix approximatingL−1A , with LAthe Cholesky factor ofA. To obtain GL

a sparsity patternSL ⊆ {(i, j) : 1i = j N}, such that {(i, j) : i < j} ⊆ SLis first prescribed. A lower triangular

matrix ˆGLis computed by solving the equations:

( ˆGLA)ij= ij (i, j) /∈ SL, (2)

withij the Kroneker delta. The diagonal entries of ˆGLare all positive. SettingD = [diag( ˆGL)]−1/2andGL= D ˆGL,

the preconditioned matrixGLAGTLis spd with all diagonal entries equal to 1. A widespread choice forSLis to allow for

nonzeros inGLonly in positions corresponding to nonzeros in the lower triangular part ofAk, where k is a small positive

integer, e.g.,k = 1, 2, 3. Solution to (2) always exists for an spd A. While the approximate inverses corresponding to

Ak,k > 1, are often better than the one obtained with k = 1, they may be too expensive to compute and use. Kolotilina

et al.[17]describe a simple approach, called post-filtration, to improve the quality of FSAI preconditioners for an spd matrixA. This is based on a posteriori sparsification, by using a drop-tolerance parameter, with the aim to reduce the number of nonzero elements ofGL, and thus decrease the computational burden of the iteration phase. In a parallel

environment, a substantial reduction of the communication complexity of the preconditioner-by-vector multiplication can be achieved. In our problem we setA = H + 2P /t.

Our FORTRAN 90 code makes use of the MPI library interface. Matrices H and P are first built in parallel as well as preconditioners FSAI and FSAI(0.1)[4]. Next, Eq. (1) is solved in parallel by PCG–FSAI.

2.2. Parallel construction of H and P

We use tetrahedral FE generated according to the procedure described in[8]. Data partitioning among the p processors is accomplished as follows. Denoting as N the number of FE nodes, as an example takep =2, all the tetrahedra with (at least) one node number smaller than or equal toN/2 are allocated on processor 1; the elements with (at least) one node number between 1+ N/2 and N are allocated on processor 2. If p > 2, a similar distribution strategy is implemented.

(3)

Hence, the matrix assemblage takes place independently on each processor with H and P partitioned accordingly and

N/p rows of the global H and P (which by the way are never fully built) residing on each processor. 2.3. Parallel PCG implementation

The PCG algorithm can be decomposed into a number of scalar products, daxpy-like linear combinations of vectors, and matrix–vector (MV) products. Scalar products, were distributed among the p processors by uniform block mapping. We tailored the implementation of parallel MV products for application to sparse matrices, using a technique for minimizing data communication between processors[6]. A key point in the parallelization procedure is the MV product. In this respect two schemes may be implemented: a classical scheme and a scheme relying on Geus and Röllin’s algorithm[9]which attempts to enhance the cache usage by the data prefetching technique described in[5]for a matrix stored in CSR (compressed storage row) mode. In the sequel this implementation of the MV product will be referred to as MV1. Experience shows that Geus and Rollin’s procedure is superior to the traditional one, and therefore, it will be used in the experiments that follow.

We implemented the FSAI preconditioner computation with the specification of either A orA2sparsity patterns. We

used a block row distribution of matrices A,GL. Complete rows are assigned to different processors [3]. Letni be

the number of nonzeros allowed in the ith row ofGL. Any row i of theGLmatrix can be computed independently of

each other, by solving a small spd dense linear system of sizeni. To attain parallelism, the processor that computes row i must accessni rows of A. Since the number of nonlocal rows needed by each processor is relatively small, we temporarily replicate the nonlocal rows on auxiliary data structures. The dense factorizations needed are carried out using BLAS3 routines from LAPACK. OnceGLis obtained, a parallel transposition routine yields to every processor

the eligible part ofGTL. 3. Numerical results

The numerical experiments are performed on an SP4 supercomputer with up to 512 POWER 4 1.3 GHz CPUs with 1088 GB RAM. The current configuration has 48 (virtual) nodes: 32 nodes with 8 processors and 16 GB RAM each, 14 nodes with 16 processors and 32 GB RAM and 2 nodes with 16 processors and 64 GB RAM each. Each node is connected with two interfaces to a dual plane switch.

For the sake of comparison, and exclusively for problem 4 below, a simulation is also performed on CLX. Each node owns a 2 GB DRAM (32 nodes have 4 GB DRAM), and two Intel Xeon Pentium IV 3.055 GHz processors (10 I/O nodes feature 2.8 GHz processors). Each processor is equipped with a 512 Kb L2 Cache. Disk space is 5.5 TB. The internal network is a Myrinet IPC one. PCG iterations (as well as pARMS iterations) are completed when the Euclidean norm of residualrk  10−12.

3.1. Sample problems

The porous medium is a 6 m side cube subdivided into tetrahedra[8]. The vertical layering involves either 6 or 30 layers with the hydraulic conductivity (HC) uniform in problem 1 and horizontally variable in problems 2–4. The mesh properties and simulation details are summarized inTable 1. Dirichlet boundary conditions are prescribed on one face of the cube while on a node opposite to the Dirichlet face a pumping rate equal to−1 m3/s is prescribed. The remaining boundary elements are assumed to be impermeable. In all the problems we start with an initial time integration step t0(in hours), and use at step k,tk= max{f · tk−1, tmax}. The FSAI preconditioner is computed only once at the

beginning of the simulation (i.e., fort = t0) and then kept constant irrespective of the fact that the coefficient matrix

H + 2P /t changes with the t size. Actually the outcome shows that updating the preconditioner vs. t does not

improve significantly the PCG solver performance.

3.2. Sequential PCG simulations

The results are obtained with preconditioner IC(0) on one SP4 processor. For problem 1 the total CPU time T required to complete the simulation is 1492.1 s with the average numberIaof PCG iterations per time step equal to 210. For

(4)

Table 1

Main properties of the selected test problems and corresponding computational simulations

3D nodes # Layers HC t tmax tmax f # Steps

Pb1 453871 30 1 10 100 2400 1.2 33

Pb2 453871 30 var 10 100 2400 1.2 33

Pb3 453871 30 var 0.1 50 240 1.2 34

Pb4 102487 6 var 0.1 50 240 1.2 34

Table 2

CPU timesTp(s) required to complete the simulations with p SP4 processors and corresponding pseudospeed-upsSpfor various preconditioners. The average number of PCG iterations per time step is provided in brackets

Preconditioner T2 T4 S4∗ T8 S8∗ T16 S16∗ T32 S32∗ Problem 1 with MV diag (1949) 3150 1535 4.10 1040 6.06 536 11.76 299 21.07 FSAI (920) 2916 1425 4.08 958 6.08 413 14.10 284 20.54 FSAI(0.1) (538) 1698 976 3.48 604 5.62 262 12.96 161 21.09 Problem 2 with MV diag (4932) 9001 5185 3.47 3495 5.15 1460 12.33 945 19.06 FSAI (1307) 5465 3407 3.21 2126 5.14 828 13.21 505 21.65 FSAI(0.1) (784) 4793 2837 3.38 2045 4.69 875 10.96 429 22.37 Problem 2 with MV1 diag (4932) 6701 3921 3.42 2568 5.22 1309 10.24 748 17.91 FSAI (1307) 3702 1939 3.82 1359 5.44 595 12.44 365 20.27 FSAI(0.1) (784) 3028 1701 3.56 1069 5.66 529 11.46 323 18.73 Problem 3 with MV1 diag (4467) 6229 3397 3.67 2318 5.37 1112 11.21 701 17.77 FSAI (1227) 3544 1952 3.68 1292 5.48 559 12.68 342 20.74 FSAI(0.1) (717) 3009 1721 3.50 1122 5.37 522 11.53 302 19.55 Problem 4 with MV1 diag (1371) 358 198 3.61 128 5.57 83 8.57 68 10.59 FSAI (723) 332 176 3.77 113 5.88 73 9.10 62 10.77 FSAI(0.1) (503) 291 165 3.53 106 5.47 66 8.77 52 11.16

problems 2 and 3 with a variable permeability, the PCG performance reduces with T increased to 3677.1 s andIato

421. Finally for problem 4,T = 4333.5 s and Ia= 294. 3.3. Parallel PCG simulations

Since for memory reasons the FE–PCG code could not be run for the largest test cases within 1 GB memory usually provided by a single SP4 processor, we define as an indicator of the degree of parallelism a pseudospeed-up Sp∗ as Sp= 2T2/Tp.Table 2provides the timeTp needed to complete the simulation with p SP4 processors and the corresponding pseudospeed-upSp. The iterations Ia are given in brackets.Table 2shows the results employing MV

for problems 1 and 2, and with the optimized MV product MV1 for problems 2–4. The results reported inTable 2for problem 2 confirms the improvement of the optimized MV product. On this test case, using MV1 requires from 20% to 30% less CPU time than using MV.

Regarding CPU time, we observe inTable 2that FSAI(0.1) is better, and sometimes appreciably better, than the diagonal preconditioner, and slightly superior to FSAI. FSAI() proves quite robust vs.  and is not very sensitive to the value. The pseudospeed-ups are quite interesting. For the (larger) problem 2, S8∗andS16∗ are larger than for the (smaller) problem 4. This is accounted for by the fact that, being problem 2 (four times) larger than problem 4, the computational burden is also larger and prevails over the communication time among the processors. Regarding

(5)

4 8 16 32 0.3 0.5 0.7 0.9 # processors Ep SP4 4 8 16 32 0.3 0.5 0.7 0.9 # processors Ep CLX

Fig. 1. Problem 4. Parallel pseudoefficiency indexEpvs. number of processors p on SP4 (left) and CLX (right).

Table 3

Problems 2 and 4. Comparison between the CPU timesTp(s) required to complete the simulations with p SP4 processors for pARMS and FSAI(0.1) preconditioners. The average number of outer (left) and inner (right) iterations per time step is provided in bracket. The matrix–vector product is MV1 T2 T4 T8 T16 T32 Pb2-pARMS 5184.51 3541.42 2788.7 2099.41 1164.74 (152–760) (155–775) (167–835) (171–855) (179–895) FSAI(0.1) 3028.3 1823.3 1068.5 528.7 323.4 Pb4-pARMS 258.55 193.33 151.4 131.11 59.06 (38–190) (46–230) (53–265) (57–285) (60–300) FSAI(0.1) 291.2 165.0 106.3 66.4 62.2

problem 1, pseudospeed-upsS4∗slightly larger than 4 for diagonal and FSAI preconditioners may be ascribed to cache effects. Differently, the preconditioner FSAI(0.1), due to its larger memory requirements, takes less advantage of cache size.

To further investigate the parallelization degree of the PCG–FSAI solver, a few simulations were performed on CLX.

Fig. 1shows the parallel pseudoefficiency defined asEp= Sp/p for problem 4 using SP4 (left) and CLX (right). The bestEp∗is obtained in both cases forp = 4. However, on SP4 Epdeteriorates quickly as p increases while on CLX the pseudoefficiency deterioration is as expected. The performance degradation is more notable when moving from 4 to 8 processors, and it is observed in all our parallel experiences on the SP4 machine, due to hard/soft processor aggregation into virtual/physical nodes. Since 8 processors share the same node core memory, many memory conflicts occur when all the processors are engaged on unstructured matrix computations. Inspection of Fig.1(right) indicates that the CLX pseudoefficiency deterioration is consistent with a good parallel algorithm. Hence, we may conclude that FSAI allows for a better parallelization degree on CLX than SP4.

3.4. pARMS results

The parallel pARMS package[19]represents an interesting effort in devising a distributed preconditioner for iterative solvers. pARMS algorithms are intrinsically parallel. However, the number of iterations performed depends quite significantly on the number of processors engaged. Hence, the analysis of its parallel performance cannot naively rely upon classical factors such as the speed-up.

Table 3shows the CPU timesTp and the number of iterations Ia required for the solution of problems 2 and 4

(6)

complement (lsch_ilut). We used the following parameters:iov = 1 (overlap), im = 100 (Krylov subspace size),

lf ill =60 (level of fill-in of ILUT preconditioner), eps =10−3(tolerance of the outer iteration). Left and right numbers

in bracket denote outer and inner iterations, respectively. The FSAI(0.1) times are given again for a direct and easier comparison. It may be noted that, with the exception of one case only (problem 4 with 2 processors) using the pARMS preconditioner yields a worse computational performance, and particularly so for the larger and more difficult problem 2. Also note that with pARMS the iteration count varies appreciably with p while is not so with FSAI(0.1) as is shown inTable 2. The FSAI performance appears to be insensitive to p and hence more robust vs. the number of processors employed in the simulation.

4. Conclusion

The implementation in a parallel computing environment of a PCG scheme for the most efficient solution to subsurface FE equation of flow has been addressed. The following points are worth summarizing:

(1) Assembling of the stiffness and capacity matrices can be done in a very efficient way with each single processor making use of the corresponding submatrix when the spd system is finally solved.

(2) The preconditioner FSAI and enlarged FSAI can be easily parallelized for a most efficient implementation of a parallel PCG solver. FSAI turns out to be superior to both the diagonal and the pARMS preconditioners.

(3) The pseudospeed-ups obtained on SP4 and CLX are quite attractive with maximum values on the order of 20 for

p = 32 and the largest problem (N  0.5 Million).

(4) The parallel pseudoefficiency index decreases with p on both SP4 and CLX, as expected. However, the pseudoef-ficiency deterioration is less pronounced on CLX where the machine architecture enhances the algorithm parallel performance.

(5) Using a massively parallel computer (e.g.,p=16 or 32) allows for the treatment of 3D flow problems characterized by a complex heterogeneity and hence a large number of elements that would be hard (if not impossible) to address with a traditional sequential computer.

Acknowledgment

This work has been supported by the Italian PRIN project Numerical models for multiphase flow and deformation

in porous media. Free accounting units on the SP4 and CLX supercomputers were given by CINECA (Italy), inside a

frame research grant.

References

[1]O. Axelsson, Iterative Solution Methods, Cambridge University Press, Cambridge, 1994.

[2]M. Benzi, Preconditioning techniques for large linear systems: a survey, J. Comput. Phys. 182 (2) (2002) 418–477.

[3]L. Bergamaschi, A. Martínez, Parallel acceleration of Krylov solvers by factorized approximate inverse preconditioners, in: M. Daydè, et al. (Eds.), VECPAR 2004, Lecture Notes in Computer Science, vol. 3402, Springer, Heidelberg, 2005, pp. 623–636.

[4]L. Bergamaschi, G. Pini, F. Sartoretto, Computational experience with sequential and parallel preconditioned Jacobi Davidson for large sparse symmetric matrices, J. Comput. Phys. 188 (1) (2003) 318–331.

[5]L. Bergamaschi, G. Pini, F. Sartoretto, Parallel solution of sparse linear systems arising in flow and transport problems, in: Euro-par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal. Lecture Notes in Computer Science, vol. 3648, Springer, Berlin, 2005, pp. 804–814.

[6]L. Bergamaschi, M. Putti, Efficient parallelization of preconditioned conjugate gradient schemes for matrices arising from discretizations of diffusion equations, in: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, March, 1999 (CD-ROM).

[7]J. Crank, P. Nicolson, A practical method for numerical evaluation of solutions of partial differential equations of the heat conduction type, in: Proceedings of Cambridge Philosophical Society, vol. 40, 1947, pp. 50–67.

[8]G. Gambolati, G. Pini, T. Tucciarelli, A 3-D finite element conjugate gradient model of subsurface flow with automatic mesh generation, Adv. Water Resour. 3 (1986) 34–41.

[9]R. Geus, S. Röllin, Towards a fast parallel sparse matrix-vector multiplication, in: E.H. D’Hollander et al. (Eds.), Parallel Computing: Fundamentals and Applications, Imperial College Press, London, 2000, pp. 308–315.

[10]G.A. Gravvanis, Generalized approximate inverse finite element matrix techniques, Neural Parallel Sci. Comput. 7 (4) (1999) 487–500.

[11]G.A. Gravvanis, K.M. Giannoutakis, Parallel approximate finite element inverse preconditioning on distributed systems, in: ISPDC/HeteroPar, 2004, pp. 277–283.

(7)

[12]G.A. Gravvanis, K.M. Giannoutakis, N.M. Missirlis, A distributed normalized explicit preconditioned conjugate gradient method, Parallel Algorithms Appl. 19 (2–3) (2004) 163–174.

[13]M.J. Grote, T. Huckle, Parallel preconditioning with sparse approximate inverses, SIAM J. Sci. Comput. 18 (3) (1997) 838–853.

[14]T.K. Huckle, Efficient computation of sparse approximate inverses, Numer. Linear Algebra Appl. 5 (1) (1998) 57–71.

[15]T.K. Huckle, Approximate sparsity patterns for the inverse of a matrix and preconditioning, Appl. Numer. Math. 30 (2–3) (1999) 291–303 (iterative methods and preconditioners, Berlin, 1997).

[16]D.S. Kershaw, The incomplete Cholesky-conjugate gradient method for the iterative solution of systems of linear equations, J. Comput. Phys. 26 (1978) 43–65.

[17]L.Yu. Kolotilina, A.A. Nikishin, A.Yu. Yeremin, Factorized sparse approximate inverse preconditionings IV. Simple approaches to rising efficiency, Numer. Linear Algebra Appl. 6 (1999) 515–531.

[18]L.Yu. Kolotilina, A.Yu. Yeremin, Factorized sparse approximate inverse preconditionings I. Theory, SIAM J. Matrix Anal. 14 (1993) 45–58.

[19]Z. Li, Y. Saad, M. Sosonkina, pARMS: a parallel version of the algebraic recursive multilevel solver, Numer. Linear Algebra Appl. 10 (5–6) (2003) 485–509 (preconditioning, 2001, Tahoe City, CA).

[20]J.A. Meijerink, H.A. van der Vorst, An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix, Math. Comput. 31 (1977) 148–162.

[21]G. Pini, G. Gambolati, Is a simple diagonal scaling the best preconditioner for conjugate gradients on supercomputers?, Adv. Water Resour. 13 (3) (1990) 147–153.

[22]Y. Saad, SPARSKIT: a basic tool kit for sparse matrix computation, Technical Report 1029, University of Illinois, CSRD, Urbana, 1990.

[23]Y. Saad, ILUT: a dual threshold incomplete ILU factorization, Numer. Linear Algebra Appl. 1 (1994) 387–402.

[24]S. Sauter, The ILU method for finite-element discretizations, J. Comput. Appl. Math. 36 (1) (1991) 91–106.

References

Related documents

This was a randomized, open-label, parallel-group study of a single SC 40-mg in 0.8 mL dose of FKB327 in healthy subjects to compare the relative bioavailability (assessed by

First, the develop- ment processes of component-based systems are separated from development processes of the components; the components should already have been developed and

CollectionSpace is an open-source, web- based software application for the description, management, and dissemination of museum collections information – from artifacts and

The new ID document itself resembles a credit card, - a rigid polycarbonate thick enough to allow the insertion of two memory chips to store personal data,

Additionally, the data suggest that the levels of client advocacy do not differ between CPAs and non-CPAs as shown in prior literature, that the clients of National Tax

After the Medical Executive Committee has approved a New Privilege to be performed at SHC/LPCH and/or to be included on the privilege list of any New Service, and has approved the

Usable and efficient user interfaces should be developed by implementing user centered development methodologies to gather the reliable and accurate data in medical

less than Spent Last Year Adjusted for Forecast Change FYTD actual is Remaining Estimate Challenge Cash Flow Vs Forecast $    (91,015.20) 2021 July