Scalable block methods and preconditioning for hardware efficient sparse eigensolutions

Academic year: 2021

Block-Jacobi-Davidson algorithm Performance analysis Implementation Results Interior eigenvalues

Achim Basermann, Melven Röhrig-Zöllner, Jonas Thies

Simulation and Software Technology

Background

Aim: Calculate a set of extreme eigenpairs $(\lambda_i, v_i)$ of a large sparse matrix $A$:

$A v_i = \lambda_i v_i$

Project: Equipping Sparse Solvers for Exascale (ESSEX) of the

Outline

Block-Jacobi-Davidson algorithm
  Jacobi-Davidson QR method
  Block JDQR method
Performance analysis
Implementation
Results
Interior eigenvalues

Jacobi-Davidson QR method (Fokkema, 1998)

Sketch of the algorithm

1: while not converged do                           ▷ Outer iteration
2:   Project the problem to a small subspace
3:   Solve the small eigenvalue problem
4:   Calculate an approximation and its residual
5:   Approximately solve the correction equation    ▷ Inner iteration
6:   Orthogonalize the new direction
7:   Enlarge the subspace
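The seven steps above can be sketched in NumPy as follows. This is only a minimal illustration under strong assumptions: a small dense symmetric matrix, a single wanted eigenpair (the smallest one), no restart or locking, and an exact dense least-squares solve of the correction equation in place of the approximate inner iteration. The function name `jacobi_davidson_smallest` and the test matrix are hypothetical and not taken from the slides.

```python
import numpy as np

def jacobi_davidson_smallest(A, v0, tol=1e-8, max_iter=50):
    """Hypothetical helper: smallest eigenpair of a small symmetric dense A."""
    n = A.shape[0]
    I = np.eye(n)
    V = (v0 / np.linalg.norm(v0)).reshape(n, 1)      # orthonormal search space
    theta, u = None, None
    for _ in range(max_iter):                        # outer iteration
        H = V.T @ A @ V                              # project to the small subspace
        evals, evecs = np.linalg.eigh(H)             # solve the small eigenproblem
        theta, u = evals[0], V @ evecs[:, 0]         # Ritz approximation
        r = A @ u - theta * u                        # and its residual
        if np.linalg.norm(r) <= tol:
            return theta, u, True
        # Correction equation (I - uu^T)(A - theta I)(I - uu^T) t = -r.
        # Assumption: solved exactly by dense least squares instead of a few
        # steps of an iterative solver (the "inner iteration" of the sketch).
        P = I - np.outer(u, u)
        M = P @ (A - theta * I) @ P
        t = np.linalg.lstsq(M, -r, rcond=None)[0]
        for _pass in range(2):                       # orthogonalize (two passes)
            t = t - V @ (V.T @ t)
        norm_t = np.linalg.norm(t)
        if norm_t < 1e-14:                           # no new direction available
            break
        V = np.hstack([V, (t / norm_t)[:, None]])    # enlarge the subspace
    return theta, u, False

# Small symmetric test matrix (an assumption made for this example only).
n = 50
A = np.diag(np.arange(1.0, n + 1)) + 0.1 * (np.eye(n, k=1) + np.eye(n, k=-1))
v0 = np.zeros(n)
v0[0] = 1.0
theta, u, converged = jacobi_davidson_smallest(A, v0)
```

In a real solver the dense least-squares call would be replaced by a few iterations of a Krylov method, which is the part the block variant later in the talk accelerates.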

Block JDQR method

Idea

- Calculate corrections for $n_b$ eigenvalues at once
- Block correction equation with $\tilde{Q} = \begin{pmatrix} Q & \tilde{v}_1 & \dots & \tilde{v}_{n_b} \end{pmatrix}$:

  $(I - \tilde{Q}\tilde{Q}^T)(A - \tilde{\lambda}_i I)(I - \tilde{Q}\tilde{Q}^T)\, w_{k+i} = -r_i, \quad i = 1, \dots, n_b$

  → Approximately solve $n_b$ linear systems at once
- Provides new directions $w_{k+1}, \dots, w_{k+n_b}$ for the subspace iteration

Numerical properties

- More robust

Block JDQR method

Sketch of the complete algorithm

 1: Set up initial subspace
 2: while not converged do
 3:   Project the problem to a small subspace
 4:   Solve the small eigenvalue problem
 5:   Calculate $n_b$ approximations and their residuals
 6:   Lock converged eigenvalues
 7:   Shrink subspace if required (thick restart)
 8:   Approximately solve the $n_b$ correction equations
 9:   Block-orthogonalize the new directions (using TSQR)
10:   Enlarge the subspace
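Step 9 relies on TSQR (tall-skinny QR). A minimal single-process sketch of the idea is given below: each row chunk plays the role of one process, a local QR is computed per chunk, and only the small triangular factors are combined. The function name `tsqr`, the chunk count and the random test block are assumptions for illustration; a distributed implementation would combine the small factors in a reduction tree rather than stacking them in one place.

```python
import numpy as np

def tsqr(X, n_chunks=4):
    """Hypothetical helper: QR of a tall-skinny X via chunk-local QRs.

    Assumes every row chunk has at least as many rows as X has columns.
    """
    nb = X.shape[1]
    chunks = np.array_split(X, n_chunks, axis=0)
    local = [np.linalg.qr(c) for c in chunks]         # local reduced QRs
    R_stack = np.vstack([R_i for _, R_i in local])    # (n_chunks * nb) x nb
    Q2, R = np.linalg.qr(R_stack)                     # small combining QR
    # Each chunk multiplies its local Q by its nb x nb slice of Q2.
    Q_parts = [Q_i @ Q2[k * nb:(k + 1) * nb] for k, (Q_i, _) in enumerate(local)]
    return np.vstack(Q_parts), R

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))   # block of n_b = 4 new directions
Q, R = tsqr(X)
```

Only the small `R` factors cross chunk boundaries, which is why this kind of orthogonalization needs a single reduction for the whole block instead of one per vector.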

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
  Required linear algebra operations
  Block vector operations
  Jacobi-Davidson Operator
Implementation
Results

Required linear algebra operations

Sparse matrix-multiple-vector multiplication (spMMVM)

- Large distributed sparse matrix $A$ in SELL-C-σ format
- Distributed blocks of vectors $X, Y \in \mathbb{R}^{n \times n_b}$
- Shifted spMMVM: $y_i \leftarrow (A - \tilde{\lambda}_i I)\, x_i, \quad i = 1, \dots, n_b$

Block vector operations

- Different types of operations:

            local                   all-reduction
  BLAS 1    $Y \leftarrow X + Y$    $\|x_i\|, \; i = 1, \dots, n_b$
  BLAS 3    $Y \leftarrow XM$       $M \leftarrow X^T Y$
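The shifted spMMVM can be illustrated with a small pure-Python kernel. Note the assumptions: the sketch uses plain CSR storage instead of SELL-C-σ, runs on one process, and the helper names `dense_to_csr` and `shifted_spmmvm_csr` are hypothetical. The block vectors are stored row-major, so that the `nb` entries of one row are contiguous and each matrix entry is loaded once for all `nb` columns.

```python
import numpy as np

def dense_to_csr(A):
    """Hypothetical helper: build CSR arrays from a small dense matrix."""
    indptr, indices, data = [0], [], []
    for row in A:
        nz = np.nonzero(row)[0]
        indices.extend(nz.tolist())
        data.extend(row[nz].tolist())
        indptr.append(len(indices))
    return np.array(indptr), np.array(indices), np.array(data)

def shifted_spmmvm_csr(indptr, indices, data, X, shifts):
    """Compute y_i = (A - shifts[i] * I) x_i for all columns i of X at once."""
    n, nb = X.shape
    Y = np.empty_like(X)
    for row in range(n):
        acc = -shifts * X[row]                  # shift term for all nb columns
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * X[indices[k]]      # one matrix entry, nb updates
        Y[row] = acc
    return Y

n = 6
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
X = np.arange(2.0 * n).reshape(n, 2)            # row-major block, nb = 2
shifts = np.array([0.5, -1.0])
Y = shifted_spmmvm_csr(*dense_to_csr(A), X, shifts)
```

The inner update touches `nb` contiguous values per matrix entry, which is the data reuse that makes the block variant attractive on memory-bound hardware.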

Block vector operations

Background

- All operations are memory bound (also the GEMM, as all matrices are very tall and skinny).
- The code balance accurately predicts the node-level performance!

Results of blocking

- Faster BLAS 3 operations (e.g. $Y \leftarrow XM$)
- Message aggregation for all-reductions (e.g. $M \leftarrow X^T Y$)

→ Improved performance of some operations
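As a worked illustration of the code-balance argument, consider the two BLAS 3 examples with $X \in \mathbb{R}^{n \times m}$ and a block of $n_b$ vectors. The symbols $m$, $V$, $F$, $B_c$, $b_S$ and $P$ are introduced here for the example only. The estimate assumes double precision (8 bytes per entry), that each tall-skinny operand is transferred between memory and cache exactly once, that the small matrix $M$ stays in cache, and that write-allocate traffic is ignored.

```latex
\begin{align*}
  V   &= 8\,n\,(m + n_b) \ \text{bytes}
        && \text{(data volume of } Y \leftarrow XM \text{ or } M \leftarrow X^T Y\text{)} \\
  F   &= 2\,n\,m\,n_b \ \text{flops} \\
  B_c &= \frac{V}{F} = 4\left(\frac{1}{n_b} + \frac{1}{m}\right)
        \ \frac{\text{bytes}}{\text{flop}} \\
  P   &\le \frac{b_S}{B_c}
        && \text{(} b_S \text{: attainable memory bandwidth)}
\end{align*}
```

Under these assumptions a larger block size $n_b$ lowers the code balance $B_c$ and therefore raises the bandwidth-limited performance bound, as long as $n_b$ is small compared to $m$.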

Complete Jacobi-Davidson Operator

Setup

- spMMVM + 2× GEMM: $y_i \leftarrow (I - QQ^T)(A - \tilde{\lambda}_i I)\, x_i$ with $Q \in \mathbb{R}^{n \times 8}$, $i = 1, \dots, n_b$
- Matrix with $n \approx 10^7$, $n_{\mathrm{nzr}} \approx 15$
- SELL-C-σ format
- 10-core Intel Ivy Bridge CPU

[Figure: runtime / block size $n_b$ [s] (0 to 0.1) versus block size $n_b$ (1, 2, 4, 8) for spMMVM, GEMM 1 and GEMM 2]
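The operator in the setup above decomposes into one shifted spMMVM and two tall-skinny GEMMs. A small dense NumPy sketch of that decomposition follows. The function name `apply_jd_operator`, the problem sizes and the mapping of the two products to the labels "GEMM 1" and "GEMM 2" are assumptions made for illustration.

```python
import numpy as np

def apply_jd_operator(A, Q, shifts, X):
    """Compute y_i = (I - Q Q^T)(A - shifts[i] * I) x_i for a block X."""
    Y = A @ X - X * shifts    # shifted spMMVM on the whole block
    M = Q.T @ Y               # first GEMM: small (8 x nb) result, an
                              # all-reduction in a distributed setting
    return Y - Q @ M          # second GEMM: local update of the block

rng = np.random.default_rng(0)
n, nq, nb = 200, 8, 4
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
Q, _ = np.linalg.qr(rng.standard_normal((n, nq)))   # orthonormal columns
X = rng.standard_normal((n, nb))
shifts = np.array([0.1, 0.2, 0.3, 0.4])
Y = apply_jd_operator(A, Q, shifts, X)
```

The projector is never formed explicitly: only the tall-skinny `Q` and the small matrix `M` are used, so all three steps stream over the block vectors once.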

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
  Frameworks from the ESSEX project
  Block pipelined GMRES or MINRES preconditioning
Results

Frameworks from the ESSEX project

phist (Pipelined Hybrid-parallel Linear Solver Toolkit)

- General C interface to linear algebra libraries:
  - GHOST
  - Trilinos (C++, http://trilinos.sandia.gov)
  - builtin (for prototyping, row-major storage, Fortran + C99)
- Iterative solvers
- Large test framework

GHOST (General Hybrid Optimized Sparse Toolkit)

- Developed at the RRZE in Erlangen
- Hybrid-parallel MPI+OpenMP+CUDA
- SELL-C-σ matrix format

Block pipelined GMRES or MINRES preconditioning

Idea

- Solve $n_b$ linear systems $A x_i = b_i$ at once
- Remove converged systems and add new ones (pipelining)
- Standard GMRES or MINRES method (for each system)

[Figure: $\tilde{v}_{k+1} \leftarrow A v_k$; GMRES subspace 1, GMRES subspace 2, ...]
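The pipelining idea (keep a block of $n_b$ active systems, share one block matrix product per step, replace converged systems by waiting ones) can be sketched as below. To stay short, the sketch deliberately swaps the per-system GMRES or MINRES method for a one-step minimal residual iteration, which is only adequate for well-conditioned positive definite matrices. The function name `pipelined_block_solve` and all parameters are hypothetical.

```python
import numpy as np
from collections import deque

def pipelined_block_solve(A, B, nb=2, tol=1e-8, max_sweeps=2000):
    """Solve A x_j = b_j for all columns of B with at most nb active systems.

    Assumes nonzero right-hand sides and a positive definite A.
    """
    n, n_rhs = B.shape
    X_out = np.zeros((n, n_rhs))
    pending = deque(range(n_rhs))      # systems still waiting for a slot
    active = []                        # column indices currently in the block
    X = np.zeros((n, 0))
    R = np.zeros((n, 0))
    sweeps = 0
    while (pending or active) and sweeps < max_sweeps:
        while pending and len(active) < nb:          # fill free slots
            j = pending.popleft()
            active.append(j)
            X = np.hstack([X, np.zeros((n, 1))])
            R = np.hstack([R, B[:, [j]]])
        AR = A @ R                                   # one block product per sweep
        alpha = np.sum(R * AR, axis=0) / np.sum(AR * AR, axis=0)
        X += R * alpha                               # per-system step lengths
        R -= AR * alpha
        sweeps += 1
        rel = np.linalg.norm(R, axis=0) / np.linalg.norm(B[:, active], axis=0)
        keep = rel > tol
        for k, j in enumerate(active):               # store converged solutions
            if not keep[k]:
                X_out[:, j] = X[:, k]
        active = [j for k, j in enumerate(active) if keep[k]]
        X, R = X[:, keep], R[:, keep]                # free their slots
    return X_out, sweeps, not (pending or active)

n = 30
A = 4.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
B = np.random.default_rng(1).standard_normal((n, 6))
X_sol, sweeps, done = pipelined_block_solve(A, B, nb=2)
```

Each system keeps its own scalar recurrence, so the numerical behavior per system is unchanged; only the matrix product is shared across the block.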

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
Results
  Numerical behavior
  Performance of the complete algorithm
  Conclusion

Numerical behavior

[Figure, two panels (block size 2, block size 4): relative overhead (# spMVM, 1 to 3.5) versus # eigenvalues found (5 to 50) for the matrices Andrews, cfd1, finan512, torsion1, ck656, cry10000, dw8192, rdb3200l]

Performance of the complete algorithm (0)

Seeking 20 eigenvalues of spinSZ26 ($1 \cdot 10^7$ rows, $1.5 \cdot 10^8$ nonzeros)

[Figure, two panels (single socket, 10 cores; complete node, 2 × 10 cores): runtime [s] (0 to 600) for tpetra and ghost with single vector, block size 2 and block size 4]

Performance of the complete algorithm (1)

Seeking 20 eigenvalues of spinSZ26 ($1.0 \cdot 10^7$ rows, $1.5 \cdot 10^8$ nonzeros)

[Figure "Block speedup": speedup through blocking ($t_1 / t_{\mathrm{block}}$, 0 to 1.6) versus # nodes (with 20 cores each; 1 to 128) for single vector, block size 2 and block size 4]

[Figure "Strong scaling": runtime [s] (0 to 250) versus # nodes (with 20 cores each; 1 to 128) for single vector, block size 2 and block size 4]

Performance of the complete algorithm (2)

Seeking 20 eigenvalues of spinSZ30 ($1.6 \cdot 10^8$ rows, $2.6 \cdot 10^9$ nonzeros)

[Figure "Block speedup": speedup through blocking ($t_1 / t_{\mathrm{block}}$, 0 to 1.6) versus # nodes (with 20 cores each; 16, 32, 64) for single vector, block size 2 and block size 4]

[Figure "Strong scaling": runtime [s] (0 to 500) versus # nodes (with 20 cores each; 16, 32, 64) for single vector, block size 2 and block size 4]

Conclusion (1)

Block JDQR algorithm

- Performance analysis of block operations:
  - Impact of memory layout (row- instead of column-major)
  - Node-level: better code balance
  - Inter-node: message aggregation
- Numerical behavior:
  - Slight increase of operations
  - Overhead small for > 10 eigenvalues

→ Improved performance by factor 1.2-1.4

Conclusion (2)

Outlook

- Numerical improvements:
  - Parallel preconditioning of the linear problems
- Algorithmic improvements:
  - Hiding spMMVM communication behind other operations
  - More asynchronous communication (all-reductions)

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
Results
Interior eigenvalues
  FEAST eigensolver (Polizzi '09)
  Linear systems for FEAST/graphene
  Experiments

Linear systems for FEAST/graphene

Tough:

- very large ($N \sim 10^9$ carbon atoms)
- complex symmetric and completely indefinite
- small random numbers on and around the diagonal
- spectrum essentially continuous
- shifts get very close to the spectrum

But also nice in some ways:

- 2D mesh, very sparse (∼ 10 entries/row)
- many RHS/shift (block methods, etc.)

The CGMN linear solver

- Björck and Elfving, 1979
- CG on the semi-definite problem $(I - Q_{\mathrm{SSOR}})\, x = b$, where $Q_{\mathrm{SSOR}} = Q_1 Q_2 \dots Q_N Q_{N-1} \dots Q_1$ is the SSOR iteration on $A A^T y = b$
- $Q_i v = \left(I - \omega\, \dfrac{a_i a_i^T}{a_i^T a_i}\right) v$: relaxed projection of $v$ with respect to $a_i$ (row $i$ of $A$, taken as a column vector)
- extremely robust: $A$ may be singular, non-square etc.
- squaring $A$ remedies small diagonal entries
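A minimal dense sketch of this scheme is shown below, assuming a small real, square, nonsingular matrix and relaxation parameter $\omega = 1$. The row sweep runs over rows $1, \dots, N$ and back over $N-1, \dots, 1$, matching the product form above. The right-hand side of the CG system is taken as the result of one sweep started from zero, which is the usual way to make the fixed-point form consistent with $Ax = b$. The function name `cgmn` and the test problem are hypothetical.

```python
import numpy as np

def cgmn(A, b, omega=1.0, tol=1e-10, max_iter=200):
    """Hypothetical helper: CG on (I - Q) x = R b, with Q from row projections."""
    n_rows = A.shape[0]
    row_norms2 = np.sum(A * A, axis=1)
    order = list(range(n_rows)) + list(range(n_rows - 2, -1, -1))

    def sweep(x, rhs):
        # Forward and backward Kaczmarz sweep: returns Q x + R rhs.
        x = x.copy()
        for i in order:
            x += omega * (rhs[i] - A[i] @ x) / row_norms2[i] * A[i]
        return x

    zero_rhs = np.zeros_like(b)

    def apply(v):
        return v - sweep(v, zero_rhs)        # (I - Q) v

    x = np.zeros(A.shape[1])
    r = sweep(x, b)                          # R b, since x = 0 initially
    p = r.copy()
    rr = r @ r
    for _ in range(max_iter):                # standard CG recurrence
        if np.sqrt(rr) <= tol:
            break
        Ap = apply(p)
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

n = 20
A = 4.0 * np.eye(n) - np.eye(n, k=1) - 2.0 * np.eye(n, k=-1)   # nonsymmetric
x_true = np.ones(n)
b = A @ x_true
x = cgmn(A, b)
```

Each row update only reads and writes the entries of `x` that correspond to nonzeros of that row, which is the data dependency the parallelization strategies have to deal with.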

Two parallelization strategies

Algebraic Multi-Coloring

Distance-2 coloring resolves data dependency

- yields fine-grained parallelism (e.g. GPGPU)

CARP: Component-Averaged Row Projection (Gordon & Gordon, 2005)

- sequential sweeps on subdomains
- exchange and average halo elements
- retains convergence properties of the sequential algorithm
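The coloring requirement can be made concrete with a small greedy sketch: two rows must receive different colors whenever they share a nonzero column, because their row projections would then update the same vector entry. Rows of one color can then be processed in parallel. The greedy strategy, the function name `greedy_row_coloring` and the tridiagonal test pattern are assumptions for illustration, not the coloring algorithm used in the talk.

```python
import numpy as np

def greedy_row_coloring(pattern):
    """Color rows of a boolean sparsity pattern (rows x cols).

    Rows that share at least one nonzero column get different colors.
    """
    n_rows = pattern.shape[0]
    colors = -np.ones(n_rows, dtype=int)          # -1 means not yet colored
    for i in range(n_rows):
        # Already colored rows that touch a column used by row i.
        shares_column = pattern[:, pattern[i]].any(axis=1)
        conflicts = shares_column & (colors >= 0)
        used = set(colors[conflicts].tolist())
        c = 0
        while c in used:                          # smallest free color
            c += 1
        colors[i] = c
    return colors

n = 12
pattern = (np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)) > 0   # tridiagonal
colors = greedy_row_coloring(pattern)
n_colors = int(colors.max()) + 1
```

For the tridiagonal pattern the greedy pass needs three colors; within one color no two rows touch the same vector entry, so their updates are independent.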

Weak and strong scaling up to 5k Cores

Graphene, 20M atoms/node (up to 5B)

- Coloring costs performance;
- but is more memory efficient;

Strong scaling, 20M unknowns in total

- yields better strong scaling;
- and has better potential for GPUs and

Socket scaling and block Performance

- Performance penalty of coloring already at core (25%) and socket (50%) levels
- Blocking alleviates poor memory access patterns with coloring
- PHIST/GHOST: MC_CARP kernel only available in CRS

Conclusion

FEAST for interior eigenvalues

- Robust iterative solution of the arising nearly singular equation systems with CGMN
- Scalable parallelization of CGMN with algebraic multi-coloring and CARP
