• No results found

Software and Performance Engineering for Iterative Eigensolvers

N/A
N/A
Protected

Academic year: 2021

Share "Software and Performance Engineering for Iterative Eigensolvers"

Copied!
29
0
0

Loading.... (view fulltext now)

Full text

(1)

Software and Performance Engineering for Iterative Eigensolvers

Jonas Thies

German Aerospace Center (DLR) Simulation and Software Technology High Performance Computing

(2)

www.DLR.de •Chart 2>PASC’17>J. Thies•software & performance>Lugano, 2017-06-27

Motivation 1: analyze nonlinear PDE systems

2nd order PDE after space discretization

• M∂Φ

∂t = F(Φ,t)

• with suitable boundary and

initial conditions

Steady state;Φ ast→ ∞.

Standard technique: time stepping

• may take very long

• no information about stability physical difficulty: low frequency modes affect solution on very long time scales

Example: 3D Boussinesq equations

∂u/∂t=−((uu)x+ (vu)y+ (wu)z)−px+ν∇2u ∂v/∂t=−((uv)x+ (vv)y+ (wv)z)−py+ν∇2v ∂w/∂t=−((uw)x+ (vw)y+ (ww)z)−pz+ν∇2w+gαT ∂T/∂t=−((uT)x+ (vT)y+ (wT)z) +κ∇2T ux+vy+wz= 0 • Newton-Krylov with preconditioning • ‘parameter continuation’ as globalization

linear stability analysis =solve

Ax=λBx for someλs near 0, B spd, A not.

(3)

Motivation 1: analyze nonlinear PDE systems

2nd order PDE after space discretization

• M∂Φ

∂t = F(Φ,t)

• with suitable boundary and

initial conditions

Steady state;Φ ast→ ∞.

Standard technique: time stepping

• may take very long

• no information about stability physical difficulty: low frequency modes affect solution on very long time scales

Example: 3D Boussinesq equations

∂u/∂t=−((uu)x+ (vu)y+ (wu)z)−px+ν∇2u ∂v/∂t=−((uv)x+ (vv)y+ (wv)z)−py+ν∇2v ∂w/∂t=−((uw)x+ (vw)y+ (ww)z)−pz+ν∇2w+gαT ∂T/∂t=−((uT)x+ (vT)y+ (wT)z) +κ∇2T ux+vy+wz= 0 Our approach: • Newton-Krylov with preconditioning • ‘parameter continuation’ as globalization

linear stability analysis =solve

Ax=λBx for someλs near 0, B spd, A not.

(4)

Example: Rayleigh-B´enard convection

• Cube-shaped domain • heated from below • Rayleigh-Number

Ra =αg∆Td

3

νκ

Figure:Flow patterns near the

first three primary bifurcations (a) x/y roll,

(b) diagonal roll, (c) four rolls, (d) toroidal roll

(5)

Motivation 2: provide a useful solver library

(i) Application scientistsmiss solvers that ...

can handle generalized and non-Hermitian problemscan be integrated deeply into applications

can easily be used from Fortran

• support GPU accelerators and heterogenous hardware (ii) Numericistsneed a platform for

implementing algorithms on increasingly complex hardwareperforming meaningful performance studies

(iii) Portability requirements:

(6)

Jacobi-Davidson: Newton’s as an Eigensolver

• Eigenvalue problem: solveAx−λx = 0 for (x, λ) • Applyinexact Newton

• JDQR:subspace acceleration, locking and restart (Fokkema’99)

Jacobi-Davidson correction equation

• current approximation: A˜v−˜λv˜=r,

previously converged Schur vectors q1, . . . ,qk=Q • solve approximately (A−λ˜I)∆v =−r,∆v ⊥Q˜= (Q,v˜) • use some steps of preconditioned GMRES

(7)

Block JDQR

outer loop: work on nbRitz values ˜λj at a time Inner solver: computetj ⊥Q˜

without preconditioning:

P(A−λ˜jI)tj =−rj

P= (I−Q˜Q˜T)

with (left) preconditioning,

PKK−1(A−λ˜jI)tj =−PKK−1rj

PK = (I−Q˜K( ˜QTQ˜K)−1Q˜T) whereK is a preconditioner forA−¯λI and ˜QK=K−1Q.˜

blocked solvers: separate Krylov spaces, but using block kernels.

(8)

Common operations of iterative methods

1. Memory-bounded linear operations involving

sparse matrices

A∈RN×N (sparseMat)

multi-vectors

X,Y ∈RN×m (mVecs)

small and dense matrices C∈Rm×k (sdMats) node-local/inshared

memory

Developed in ESSEX/ (e.g. Y ←αAX+βY,C←XTY,X ←Y·C) 2. Algorithms for sdMats

• e.g. eigendecomposition of

projected matrix

• LAPACK/PLASMA/MAGMA

3. Sparse matrix (I)LU factorization

• not available in

• allow using external libraries

(9)

Why do we need our own kernels?

simple(?) operation: C=VTV,V

(10)

Why do we need our own kernels?

simple(?) operation: C=VTV,V

(11)

Why do we need our own kernels?

simple(?) operation: C=VTV,V

(12)

Why do we need our own kernels?

simple(?) operation: C=VTV,V

(13)

SPMD/OK Programming Model

• SPMD (‘BSP’) vs. task

parallelism

• Heterogenous cluster:

distribute problem according to limiting resource (e.g. memory bandwidth)

• OptimizedKernels make

sure each component runs as fast as possible

• User sees a simple functional

interface (no

general-purpose looping constructs etc.)

A success story: Chebyshev methods on Piz Daint

1 4 16 64 256 1024

Number of heterogeneous nodes 0.1 1 10 100 Performance in Tflop/s 100% Parallel Efficiency Square, Weak Scaling Bar, Weak Scaling Square, Strong Scaling

Only needs sparse matrix times multiple vector (spMMV) products and an occasional vector operation

(14)

PHIST software architecture

a Pipelined Hybrid-parallel Iterative Solver Toolkit

• facilitate algorithm development using • holistic performance engineering • portability and interoperability

application vertical integration algorithms preconditioners preconditioners computational core computational core «abstraction» eigenproblem setup/apply

sparseMat mVec sdMat

solver templates FT strategies

algo core «interface»

kernel interface holistic

performance

engine

ering

C wrapper

(15)

PHIST software architecture

a Pipelined Hybrid-parallel Iterative Solver Toolkit

• facilitate algorithm development using • holistic performance engineering • portability and interoperability

application vertical integration algorithms preconditioners preconditioners computational core computational core «abstraction» eigenproblem setup/apply

sparseMat mVec sdMat

solver templates FT strategies

algo core «interface»

kernel interface holistic

performance engine ering C wrapper adapter BEAST CRAFT

(16)

Useful abstraction: kernel interface

Choose from several ‘backends’ at compile time, to

• easily usePHIST in existing applications

• perform the same run with different kernel libraries • compare numerical accuracy and performance

(17)

PHIST interface example

Inspired by MPI:objects represented by handles only

C/C++:

// compute y = alpha*A*x + beta*y

void phist_DsparseMat_times_mvec(double alpha, phist_Dconst_sparseMat_ptr A, phist_Dconst_mvec_ptr x, double beta, phist_Dmvec_ptr y, int* iflag);

Fortran 2003:

subroutine phist_DsparseMat_times_mvec(alpha, A, x, beta, y, iflag) use iso_c_binding, only: c_double, c_ptr, c_int

use phist_types

real(c_double), value :: alpha, beta type(Dconst_sparseMat_ptr), value :: A type(Dconst_mvec_ptr), value :: x type(Dmvec_ptr), value :: y integer(c_int) :: iflag

similarPythoninterface exists

(18)

Cool features of PHIST and

Task macros: out-of-order execution of code blocks

overlap comm. and comp. • asynchronous checkpointing • ...

Consistent random vectors: makePHIST runs comparable

across platforms (CPU, GPU...)across kernel libraries

independent of #procs, #threads

PerfCheck: print achieved roofline performance of kernels after complete run to reveal

deficiencies of kernel lib

implemntation issues of algorithm

(strided data access etc.)

Special-purpose operations

fused kernels, e.g. compute

Y =αAX+βY and YTX

highly accurate core functions, e.g. block

orthogonalization in simulated quad precision

(19)

Example application: Turing problem

Reaction-Diffusion problem

∂u ∂t =Dδ∇ 2u+αu(1r 1v2) +v(1−r2u) ∂v ∂t =δ∇ 2v+v+αr 1uv) +u(γ+r2v) (1)

• 2D: spot and stripe patterns • can be solved using AMG

(20)
(21)
(22)

www.DLR.de •Chart 16>PASC’17>J. Thies•software & performance>Lugano, 2017-06-27

Preconditioning may be dangerous...

(normalized) projected operatorVTPKK −1

AV after 150 Arnoldi iterations

-0.1

0

0.1

0.2

-0.1

-0.05

0

0.05

0.1

-10

0

10

0

5

10

-4

-3

-2

-1

0

0

50

100

150

-10

-5

0

5

0

500

1000

1500

-2

-1

0

1

2

10

-6

restart=160 maxit=10 RELRES=3.377e-10 realres=3.9024e-08 diffex=2.8253e-05

Singular=0

with 1 eigenvector of A inPK We used an adaptation of Trefethens Matlab code: http://www.cs.ox.ac.uk/pseudospectra/software.html

(23)

www.DLR.de •Chart 16>PASC’17>J. Thies•software & performance>Lugano, 2017-06-27

Preconditioning may be dangerous...

(normalized) projected operatorVTPKK −1

AV after 150 Arnoldi iterations

-0.1

0

0.1

-0.1

-0.05

0

0.05

0.1

-10

0

10

0

5

10

-4

-3

-2

-1

0

0

50

100

150

-10

-5

0

5

0

500

1000

1500

-5

0

5

10

-7

restart=160 maxit=10 RELRES=2.0452e-10 realres=2.3633e-08 diffex=1.0983e-05

Singular=0

with 5x eigenvectors of A inPK We used an adaptation of Trefethens Matlab code: http://www.cs.ox.ac.uk/pseudospectra/software.html

(24)

Turing with preconditioning

To avoid introducing non-normality by an ill-conditioned preconditioner, use AMG (ML) on the Laplacian:

Number of BJDQR(4) iterations 64 128 256 0 200 400 600 grid size outer iterations GMRES GMRES + AMG solve time

(weak scaling on 8, 64 and 512 cores)

64 128 256 0 500 1,000 1,500 2,000 2,500 65% ML 67% ML 79% ML grid size time [s

(25)

Performance portability with PHIST+GHOST

Find 20 left-most eigenpairs of a spin-chain matrix (N2.7M)BJDQR + MINRES

run time determined by main memory bandwidth

HW-12 HW-24 K40 KNL P100 0 2 4 6 8 10 relative sp eed-up BJDQR STREAM

(26)

Scaling on Piz Daint

3D non-symmetric PDE problemblock Jacobi-Davidson + GMRESfind 10 right-most eigenvalues

1 2 4 8 16 32 64 128 256 512 0.05 0.09 0.18 0.36 0.725 1.45 2.9 5.8 11.6 nodes TFlop/s

(27)

Summary: do we provide a useful solver library?

(i) PHIST...

can handle generalized and non-Hermitian problems(with caveats)can be integrated deeply into applicationsby exposing th kernel

interface

can easily be used from Fortranvia Fortran bindings in phist fort and

builtin Fortran kernels

supports GPU accelerators and heterogenous hardwarevia GHOST

and allows Numericists to

implement algorithms using an abstract interface to GHOST and

other libraries

compare algorithms using the same backendand backends with the same algorithm (ii) Portable and maintainable

• ∼10 000 test cases for kernels, core and algorithms (make test)perfcheck: report roofline performance of kernels after solver run

(28)

Future Work

• more memory-efficient variant for GPUs • do not storeAV

use QMR instead of GMRES) • more interoperability

e.g. apply Trilinos preconditioner to GHOST vector

(29)

Questions?

Contact

Jonas Thies

DLR Simulation and Software Technology High Performance Computing

[email protected] Phone 02203 / 601 41 45 http://www.DLR.de/sc Links • Project website http://blogs.fau.de/essex/ • Source code https://bitbucket.org/essex/

Joint work with the group of Gerhard Wellein (U. Erlangen) and Fred Wubs (U. Groningen). Funding was provided by DFG priority programme 1648 (SPPEXA) project ESSEX.

References

Related documents

Building Development/ Construction Operators Owner/ Property Services Capital Markets Real Estate Corporate Homebuilders Multi-Family Housing Senior Housing

PHE vehicles are the best and easiest choice for a sustainable private transport sector in 2050, because of the limited demand for lithium, biofuels and renewable electricity

Once completed, fax or mail this form with the Bank Account Transmittal Sheet and Authorization for Information in Connection with Business Account Application to the fax number

Správna výživa významne ovplyvňuje celkový stav organizmu: telesnú a duševnú výkonnosť, odolnosť k infekciám, lepšie zvládnutie stresu, rýchlejšie hojenie rán.. Môžeme teda

The results indicate that there are cases which are not exposed to currency risk under the conventional measure (exposure coefficient), but significantly exposed to currency

• Latest updates on Section 83 of the MITA and new MIRB’s prescribed forms relating to employer’s tax obligation... Individual Rate

Subsection (c) of section 331 continues stating that the amount of gain or loss recognized shall be determined under section 1001.15 In most instances, this means

Training was delivered by the local authority Prevent Lead and the police education Prevent lead to teachers and other staff at the school to increase awareness of the issues