Scalable Distributed Schur Complement Solvers
for Internal and External Flow Computations
on Many-Core Architectures
Dr.-Ing. Achim Basermann, Dr. Hans-Peter Kersken, Melven Zöllner**
German Aerospace Center (DLR)
Simulation- and Software Technology
Distributed Systems and Component Software
Porz-Wahnheide, Linder Höhe, D-51147 Cologne, Germany
[email protected]
Survey
Internal and external flow computations at DLR
Storage Schemes for sparse matrices
The Distributed Schur Complement method (DSC)
Experiments with TRACE and TAU matrices
Parallel Simulation System TRACE
TRACE: Turbo-machinery Research
Aerodynamic Computational Environment
Developed by the Institute for Propulsion Technology of the German Aerospace Center (DLR-AT)
Calculates internal turbo-machinery flows
Finite volume method with block-structured grids
The linearized TRACE modules require the parallel, iterative solution with preconditioning of large, sparse, non-symmetric real or
TAU: developed for the aerodynamic design of aircrafts by the DLR Institute of Aerodynamics and Flow Technology
Unstructured RANS solver (Reynolds-averaged Navier-Stokes), exploits finite volumes
Requires the parallel, iterative solution with preconditioning of large, sparse, real, non-symmetric systems of linear equations
Solvers used: geometric Multigrid, AMG preconditoned GMRes Here: experiments with DSC methods
Storage Schemes for Sparse Matrices
TRACE and TAU apply BCSR with 5x5 blocks. Avantage: less indirect addressing
Disadvantage: A few zeros are stored.
Compressed Row Storage (CSR) and Block Compressed Row Storage (BCSR)
Matrix:
Non-zero values, row-wise:
Column indices, row-wise:
DSC Method (1)
Distributed matrix,
2 processors
DSC Method (2)
DSC Algorithm
Schematic view on
each processor
BiCGstab or GMRes iteration for the local interface rows (unknowns)
DSC Method (3)
DSC Method: Strong Scaling (CSR, real)
(
Dual-processor nodes; Quad-Core Intel Harpertown; 2.83 GHz)
1 10 100 1000 8 16 32 64 128 Processor cores T im e i n seco n d s
DSC Method: CSR versus BCSR Format (real)
(
2x Intel XEON E5520 with 4 cores each, 2.26 GHz
)
102 19 29 20 9 121 0 20 40 60 80 100 120 Ti m e i n s e c onds dsc2011, bs=1 dsc2011, bs=5 TAU matrix: n=541,980; nz=170,610,950; threshold = 10-3; |rel. residual| < 10-7 Results on 8 cores
DSC Method: Strong Scaling (CSR, complex)
(
Dual-processor nodes; Quad-Core Intel Harpertown; 2.83 GHz)
1 10 100 1000 10000 4 8 16 32 64 128 192 Processor cores Ti m e i n s e c ond s TRACE matrix THD (n=378,400; nz=45,456,500;
threshold = 10-3; |rel. residual| < 10-5)
100 1000 10000 32 64 128 192 Processor cores T im e i n se co n d s
TRACE matrix UHBR
(n=4,497,520; nz=552,324,700;
linearTRACE matrix
(8 processes, dim = 56,240, nnz = 2.6 Mio)
0 0,5 1 1,5 2 2,5 3 3,5 4 4,5 5
total ilut construction solver iteration
T ime in s e c o n d s real real blocked (bs=5) complex complex blocked (bs=5) # 76 # 40 # 32 # 29
DSC Solver: CSR versus BCSR Format
dsc2011 solver for linearTRACE
(8 processes, test case "THD stator": dim = 0.8 Mio, nnz = 90 Mio)
0 10 20 30 40 50 60 70 T im e in s e c o n d
s trace (setup matrix etc)
solver iteration
prec. preparation (ilut) # 140
iterations
# 57 iterations
Conclusions
Complex computations distinctly faster than real ones
BCSR format application significantly outperforms CSR format
application for TRACE and TAU problems.
DSC method distinctly faster than block-local methods
DSC method very suitable for TRACE and TAU problems
Questions?
Questions?
DLR
German Aerospace Center
Research Institution Space Agency
Koeln Lampoldshausen Stuttgart Oberpfaffenhofen Braunschweig Goettingen Berlin- Bonn Trauen Hamburg Neustrelitz Weilheim Bremen-
Locations and employees
Germany: 6,900 employees across 33 research institutes and
facilities at
15 sites.
Offices in Brussels, Paris and Washington.
DSC Method: Effect of the Interface Iteration (real)
(
2x Intel XEON E5520 with 4 cores each, 2.26 GHz
)
TAU matrix: n=541,980; nz=170,610,950; threshold = 10-3; |rel. residual| < 10-7 Results on 8 cores 0 5 10 15 20 25 30 35 0 1 2 3 4 5 6 7 8 9 10 S o lv er i ter at io n t im e i n seco n d s interf-bicgstab, bs=1 interf-gmres, bs=1 interf-bicgstab, bs=5 interf-gmres, bs=5