• No results found

Scalable Two-Level Preconditioners for CFD Computations on Many-Core Systems

N/A
N/A
Protected

Academic year: 2021

Share "Scalable Two-Level Preconditioners for CFD Computations on Many-Core Systems"

Copied!
25
0
0

Loading.... (view fulltext now)

Full text

(1)

www.DLR.de • Chart 1 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

Scalable Two-Level Preconditioners for

CFD Computations on Many-Core Systems

Dr.-Ing. Achim Basermann, Melven Zöllner*

(2)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 2

Survey

-

Background DLR

-

CFD Computations at DLR

-

Storage Schemes for Sparse Matrices

-

Distributed Schur Complement (DSC) Preconditioning

-

Experiments with TRACE and TAU Matrices

(3)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Folie 3

DLR

German Aerospace Center

Research Institution

Space Agency

(4)

Köln

Oberpfaffenhofen

Braunschweig

Göttingen

Berlin

Bonn

Neustrelitz

Weilheim

Bremen

 

Trauen

Lampoldshausen

Hamburg

Stuttgart

Stade

Augsburg

Jülich

Locations and Employees

7000 employees across

32 institutes and facilities at

16 sites.

Offices in Brussels,

Paris and Washington.

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 4

(5)

Research Areas

- Aeronautics

- Space Research and Technology

- Transport

- Energy

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 5

(6)

Parallel Simulation System TRACE

- TRACE: Turbo-machinery Research

Aerodynamic Computational Environment

- Developed by the Institute for Propulsion

Technology of DLR

- Calculates internal turbo-machinery flows

- Finite volume method with block-structured grids

- The linearized TRACE modules require the

parallel, iterative solution with preconditioning of

large, sparse, non-symmetric real or complex

systems of linear equations

www.DLR.de • Chart 6 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

Mach

number

(7)

-

TAU: developed for the aerodynamic design of aircrafts by the DLR

Institute of Aerodynamics and Flow Technology

-

Unstructured RANS solver (Reynolds-averaged Navier-Stokes), exploits

finite volumes

-

Requires the parallel, iterative solution with preconditioning of large,

sparse, real, non-symmetric systems of linear equations

www.DLR.de • Chart 7 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

(8)

Storage Schemes for Sparse Matrices

- TRACE and TAU apply BCSR with 5x5 blocks.

- Avantage:

less indirect addressing

- Disadvantage:

A few zeros are stored.

Compressed Row Storage (CSR) and Block Compressed Row Storage (BCSR)

Matrix:

Non-zero values, row-wise:

Column indices, row-wise:

Row pointers:

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 8

(9)

www.DLR.de • Chart 9 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

Preconditioners: Incomplete

LU

Decomposition

Incomplete

Gauss elimination

Block methods

apply sub-matrices, e.g.

The set

M

determines which entries are

not dropped

.

LU construction and

forward and backward

substitution are hard

to parallelize!

(10)

Block-Jacobi-ILU Preconditioning for

𝑲

−𝟏

𝑨𝑨 = 𝑲

−𝟏

𝒃

- Idea: independent incomplete

LU

decompositions of the diagonal blocks

highly parallel

- Disadvantage: Entries outside the diagonal blocks are neglected.

Preconditioner quality decreases with increasing #diagonal blocks.

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 10

(11)

Schur Complement Transformation for

𝑨𝑨 = 𝒃

Solution method

1. Transformation:

2. Solve Schur complement system:

3. Back transformation:

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 11

(12)

Distributed Equation System

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 12

(13)

DSC Preconditioning (1)

- Use Schur complement transformation for distributed equation systems

(DSC,

distributed Schur complement

)

- Iteratively improve block-Jacobi-ILU preconditioning by the consideration of

the matrix entries outside the diagonal blocks.

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 13

(14)

DSC Preconditioning (2)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 14

1. ILU of the diagonal blocks:

-1

1. ILU of the diagonal blocks:

2.

Transformation

:

3. Approximate the solution of the

Schur complement systems

:

4.

Back transformation

:

Formulation saves

forward/back operations

↔ Saad/Sosonkina 1999

(15)

Whole Solver: ILU Construction

Preconditioning within the DSC algorithm

D

i

D

i

~

~

~

~

S

i

S

i

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 15

(16)

Schematic view on

each processor

BiCGstab or GMRes iteration for the local

interface rows (unknowns)

Whole Solver:

Outer and Inner Iteration

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 16

Also possible for outer iteration:

Flexible QMR

Flexible BiCG

Flexible BiCGstab

(17)

Hardware System

www.DLR.de • Chart 17 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

-

RWTH Bull HPC cluster

- Intel Westmere X5675 CPUs

- 6 cores per CPU with 3.06 GHz

- 12 cores (2 CPUs) per node

(18)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 18

Block size

E

x

ec

ut

ion

t

im

e i

n

s

ec

onds

ILU construction

Iterations

Experiments: CSR versus BCSR Format

Block-Jacobi-ILU preconditioning with 12 processes

TAU matrix:

n=541,980;

nz=170,610,950;

ILU fill-

in ratio≈0.8;

|rel. res.| < 10

-5

(19)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 19

Experiments: Strong Scaling, Iterations

TRACE mat. UHBR: n=4,497,520; nz=552,324,700; threshold=5·10

-4

; |rel. res.|<10

-5

# processes

#

it

er

at

ions

(20)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 20

Experiments: Strong Scaling, Time

TRACE mat. UHBR: n=4,497,520; nz=552,324,700; threshold=5·10

-4

; |rel. res.|<10

-5

# processes

E

x

ec

ut

ion

t

im

e i

n

s

ec

onds

(21)

0

10

20

30

40

50

60

70

internal solver ( gmres100, ssor(0.7,3) )

dsc2011 ( fgmres40, dsc gmres 5,

ilut(0.01,1) )

T

im

e

in s

e

c

onds

dsc2011 solver for linearTRACE

(8 processes, test case "THD stator": dim = 0.8 Mio, nnz = 90 Mio)

trace (setup matrix etc)

solver iteration

prec. preparation (ilut)

# 140

# 57

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 21

linearTRACE Performance: Internal versus DSC Solver

(22)

www.DLR.de • Chart 22 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

11,5

16,9

0

2

4

6

8

10

12

14

16

18

internal solver ( gmres100, ssor(0.7,3) )

dsc2011 ( fgmres40, dsc gmres 5, ilut(0.01,1) )

P

eak

me

mo

ry

[

G

ig

a

b

y

te

]

dsc2011 solver for linearTRACE

(8 processes, test case "THD stator": dim = 0.8 Mio, nnz = 90 Mio)

linearTRACE Memory Usage: Internal versus DSC Solver

(23)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Chart 23

Conclusions

-

BCSR format application significantly outperforms CSR

format application for real TRACE and TAU problems.

-

DSC method achieves higher scalability and faster iteration

than block-local methods.

-

DSC method very suitable for TRACE and TAU problems

Future Work

-

Hybrid parallelization is appropriate to further improve

(24)

> PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012 www.DLR.de • Folie 24

Questions?

(25)

www.DLR.de • Chart 25 > PMAA 2012 > Achim Basermann • 20120628-1 DSC PMAA2012 Basermann.pptx > 28.06.2012

DSC Method: Effect of the Interface Iteration

(2x Intel XEON E5520 with 4 cores each, 2.26 GHz)

TAU matrix:

n=541,980;

nz=170,610,950;

threshold = 10

-3

;

|rel. residual| < 10

-7

Results on

8 cores

0

5

10

15

20

25

30

35

0

1

2

3

4

5

6

7

8

9

10

S

o

lv

er

i

ter

at

io

n

t

im

e

in

seco

n

d

s

interface iterations

interf-bicgstab, bs=1 interf-gmres, bs=1 interf-bicgstab, bs=5 interf-gmres, bs=5

References

Related documents

Lean product development is an important issue within research area 3. The majority of the SFI Norman companies have participated in a large survey concerning lean

After the Medical Executive Committee has approved a New Privilege to be performed at SHC/LPCH and/or to be included on the privilege list of any New Service, and has approved the

Sakwa (2006), also assessed the skill of the High Resolution Regional Model (HRM) in simula- tion of airflow and rainfall over East Africa, found that the model was able to

He has over 25 years of experience in conducting environmental site assessments and managing environmental drilling projects, including workload scheduling and

in case of probabilities or priorities the Aps method is a fast exact method for ranking project scenarios and project structures, and another counting algorithm (Kosztyán-Kiss, 2011)

The ‘expensive-tissue’ hypothesis of Aiello and Wheeler is well-known in anthropology for positing that an increasingly small gut was a key factor in the