Scalable block methods and preconditioning for hardware efficient sparse eigensolutions

(1)


Achim Basermann, Melven Röhrig-Zöllner, Jonas Thies
Simulation and Software Technology

(2)

Background

Aim: Calculate a set of extreme eigenpairs $(\lambda_i, v_i)$ of a large sparse matrix:

$A v_i = \lambda_i v_i$

Project: Equipping Sparse Solvers for Exascale (ESSEX) of the

(4)

Outline

Block-Jacobi-Davidson algorithm
  Jacobi-Davidson QR method
  Block JDQR method
Performance analysis
Implementation
Results
Interior eigenvalues

(5)

Jacobi-Davidson QR method (Fokkema, 1998)

Sketch of the algorithm

1: while not converged do                            (outer iteration)
2:   Project the problem to a small subspace
3:   Solve the small eigenvalue problem
4:   Calculate an approximation and its residual
5:   Approximately solve the correction equation     (inner iteration)
6:   Orthogonalize the new direction
7:   Enlarge the subspace
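The outer/inner structure of the sketch can be illustrated with a small dense NumPy example. This is a minimal sketch under stated assumptions, not the authors' implementation: it targets only the smallest eigenpair of a symmetric matrix, uses no restarts or locking, and the inner solver is a short Krylov least-squares solve (mathematically an unrestarted GMRES). The names `jacobi_davidson` and `krylov_least_squares` are hypothetical.

```python
import numpy as np

def krylov_least_squares(op, b, m):
    """Approximately solve op(t) = b by minimizing the residual over an
    m-dimensional Krylov subspace (equivalent to m GMRES steps)."""
    basis = [b / np.linalg.norm(b)]
    for _ in range(m - 1):
        w = op(basis[-1])
        for q in basis:                      # Gram-Schmidt against the basis
            w = w - (q @ w) * q
        norm_w = np.linalg.norm(w)
        if norm_w < 1e-12:
            break
        basis.append(w / norm_w)
    K = np.column_stack(basis)
    MK = np.column_stack([op(K[:, j]) for j in range(K.shape[1])])
    y = np.linalg.lstsq(MK, b, rcond=None)[0]
    return K @ y

def jacobi_davidson(A, v0, tol=1e-8, max_outer=60, inner_steps=8):
    """Smallest eigenpair of a dense symmetric matrix A (illustrative only)."""
    V = (v0 / np.linalg.norm(v0))[:, None]
    for _ in range(max_outer):               # outer iteration
        H = V.T @ (A @ V)                    # project to the small subspace
        theta, S = np.linalg.eigh(H)         # solve the small eigenproblem
        lam, u = theta[0], V @ S[:, 0]       # Ritz approximation
        r = A @ u - lam * u                  # and its residual
        if np.linalg.norm(r) < tol:
            break

        def op(x):
            # (I - u u^T)(A - lam I)(I - u u^T) x
            x = x - u * (u @ x)
            y = A @ x - lam * x
            return y - u * (u @ y)

        t = krylov_least_squares(op, -r, inner_steps)   # inner iteration
        for _ in range(2):                   # orthogonalize the new direction
            t = t - V @ (V.T @ t)
        norm_t = np.linalg.norm(t)
        if norm_t < 1e-14:
            break
        V = np.hstack([V, (t / norm_t)[:, None]])       # enlarge the subspace
    return lam, u
```

The correction equation is only solved approximately on purpose: the subspace projection in the outer loop compensates for an inexact inner solve.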

(6)

Block JDQR method

Idea

- Calculate corrections for $n_b$ eigenvalues at once
- Block correction equation with $\tilde Q = \begin{pmatrix} Q & \tilde v_1 & \ldots & \tilde v_{n_b} \end{pmatrix}$:

  $(I - \tilde Q \tilde Q^T)(A - \tilde\lambda_i I)(I - \tilde Q \tilde Q^T)\, w_{k+i} = -r_i, \quad i = 1, \ldots, n_b$

  → Approximately solve $n_b$ linear systems at once
- Provides new directions $w_{k+1}, \ldots, w_{k+n_b}$ for the subspace iteration

Numerical properties

- More robust
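The operator on the left-hand side of the block correction equation can be applied to all $n_b$ columns at once. The following NumPy sketch shows one way to do this, assuming a dense `A`, an orthonormal `Q_tilde`, and one shift per column; the function name `block_jd_operator` is hypothetical and not taken from phist or GHOST.

```python
import numpy as np

def block_jd_operator(A, Q_tilde, shifts, X):
    """Apply (I - Q~ Q~^T)(A - shifts[i] I)(I - Q~ Q~^T) to column i of X.

    A: (n, n), Q_tilde: (n, k) with orthonormal columns,
    shifts: (nb,), X: (n, nb).
    """
    X = X - Q_tilde @ (Q_tilde.T @ X)        # right projection
    Y = A @ X - X * shifts                   # column i is shifted by shifts[i]
    return Y - Q_tilde @ (Q_tilde.T @ Y)     # left projection
```

Each column sees its own shift, so the $n_b$ systems differ, but the matrix product and both projections are shared block operations.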

(11)

Block JDQR method

Sketch of the complete algorithm

 1: Setup initial subspace
 2: while not converged do
 5:   Calculate $n_b$ approximations and their residuals
 6:   Lock converged eigenvalues
 7:   Shrink subspace if required (thick restart)
 8:   Approximately solve the $n_b$ correction equations
 9:   Block-orthogonalize the new directions (using TSQR)
10:   Enlarge the subspace
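The block orthogonalization step can be sketched with a two-level TSQR (tall-skinny QR). This is an illustrative NumPy version under the assumption that each row chunk has at least as many rows as the block has columns; in a distributed setting each chunk would live on one process and only the small R factors would be communicated. The names `tsqr` and `block_orthogonalize` are hypothetical.

```python
import numpy as np

def tsqr(X, n_chunks=4):
    """Two-level TSQR: local QR per row chunk, then QR of the stacked R factors."""
    k = X.shape[1]
    chunks = np.array_split(X, n_chunks, axis=0)
    local = [np.linalg.qr(c) for c in chunks]        # reduced QR, R is k x k
    R_stack = np.vstack([R for _, R in local])       # (n_chunks * k) x k
    Q2, R = np.linalg.qr(R_stack)
    # Combine each local Q with its k x k slice of the second-level Q.
    Q = np.vstack([Qi @ Q2[i * k:(i + 1) * k] for i, (Qi, _) in enumerate(local)])
    return Q, R

def block_orthogonalize(V, W, n_chunks=4):
    """Orthogonalize the block W against the basis V, then within the block."""
    for _ in range(2):                               # repeat for stability
        W = W - V @ (V.T @ W)
    Q, _ = tsqr(W, n_chunks)
    return Q
```

Only the stacked R factors (a small matrix) have to be gathered, which is what makes TSQR attractive for very tall and skinny blocks.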

(13)

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
  Required linear algebra operations
  Block vector operations
  Jacobi-Davidson Operator
Implementation
Results

(14)

Required linear algebra operations

Sparse matrix-multiple-vector multiplication (spMMVM)

- Large distributed sparse matrix $A$ in SELL-C-σ format
- Distributed blocks of vectors $X, Y \in \mathbb{R}^{n \times n_b}$
- Shifted spMMVM: $y_i \leftarrow (A - \tilde\lambda_i I) x_i$, $i = 1, \ldots, n_b$

Block vector operations

- Different types of operations:

            local                   all-reduction
  BLAS 1    $Y \leftarrow X + Y$    $\|x_i\|$, $i = 1, \ldots, n_b$
  BLAS 3    $Y \leftarrow XM$       $M \leftarrow X^T Y$
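To make the shifted spMMVM concrete, here is a small single-process NumPy sketch of a SELL-C-σ-like layout and the block product. It is a simplified illustration, not the GHOST data structure: the input is a dense array, chunks are stored as Python tuples, and the helper names `to_sell_c_sigma` and `shifted_spmmvm` are hypothetical.

```python
import numpy as np

def to_sell_c_sigma(A_dense, C=4, sigma=8):
    """Build a simple SELL-C-sigma structure from a dense array.

    Rows are sorted by descending length inside windows of sigma rows,
    grouped into chunks of C rows, and each chunk is zero-padded to the
    length of its longest row.
    """
    n = A_dense.shape[0]
    nnz_per_row = (A_dense != 0).sum(axis=1)
    perm = np.concatenate([
        s + np.argsort(-nnz_per_row[s:s + sigma], kind="stable")
        for s in range(0, n, sigma)])
    chunks = []
    for c0 in range(0, n, C):
        rows = perm[c0:c0 + C]
        width = nnz_per_row[rows].max()
        vals = np.zeros((width, len(rows)))          # entry j of all chunk rows is contiguous
        cols = np.zeros((width, len(rows)), dtype=int)
        for r, row in enumerate(rows):
            idx = np.flatnonzero(A_dense[row])
            vals[:len(idx), r] = A_dense[row, idx]
            cols[:len(idx), r] = idx
        chunks.append((rows, vals, cols))
    return chunks

def shifted_spmmvm(chunks, X, shifts):
    """Y[:, i] = (A - shifts[i] I) X[:, i] for a block of vectors X (n x nb)."""
    Y = -X * shifts                                  # shift term, one shift per column
    for rows, vals, cols in chunks:
        for j in range(vals.shape[0]):
            # Padded entries have value 0 and therefore contribute nothing.
            Y[rows] += vals[j][:, None] * X[cols[j]]
    return Y
```

With `X` stored row-major (NumPy's default), the `n_b` entries belonging to one matrix column index are adjacent in memory, so each loaded matrix entry is reused for all vectors of the block.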

(17)

Block vector operations

Background

- All operations are memory bound (also the GEMM, as all matrices are very tall and skinny).
- The code balance accurately predicts the node-level performance!

Results of blocking

- Faster BLAS 3 operations (e.g. $Y \leftarrow XM$)
- Message aggregation for all-reductions (e.g. $M \leftarrow X^T Y$)

→ Improved performance of some operations
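The code-balance argument can be worked through for $Y \leftarrow XM$ with a tall-skinny $X \in \mathbb{R}^{n \times m}$ and a small $M \in \mathbb{R}^{m \times k}$. The sketch below assumes double precision, that $M$ stays in cache, and that only loading $X$ and storing $Y$ cause memory traffic (write-allocate transfers are ignored). The bandwidth and peak values in the usage note are made-up placeholders, not measurements from the talk.

```python
def code_balance_gemm(m, k, bytes_per_word=8):
    """Bytes moved per flop for Y <- X M with X (n x m), M (m x k), n >> m, k.

    Per row of X: load m words, store k words, perform 2*m*k flops.
    """
    return bytes_per_word * (m + k) / (2.0 * m * k)

def roofline(balance, bandwidth, peak):
    """Predicted performance in flop/s for a given code balance (bytes/flop)."""
    return min(peak, bandwidth / balance)
```

For `m = 8`, going from a single vector (`k = 1`) to a block of four (`k = 4`) lowers the balance from 4.5 to 1.5 bytes/flop, so a memory-bound kernel is predicted to run about three times faster per flop, e.g. `roofline(1.5, 50e9, 200e9)` versus `roofline(4.5, 50e9, 200e9)` with hypothetical hardware numbers.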

(19)

Complete Jacobi-Davidson Operator

Setup

- spMMVM + 2× GEMM: $y_i \leftarrow (I - QQ^T)(A - \tilde\lambda_i I) x_i$, $i = 1, \ldots, n_b$, with $Q \in \mathbb{R}^{n \times 8}$
- Matrix with $n \approx 10^7$, $n_{nzr} \approx 15$
- SELL-C-σ format
- 10-core Intel Ivy Bridge CPU

[Figure: runtime / block size $n_b$ [s] (axis 0 to 0.1) versus block size $n_b$ (1, 2, 4, 8); series: spMMVM, GEMM 1, GEMM 2]

(20)

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
  Frameworks from the ESSEX project
  Block pipelined GMRES or MINRES preconditioning
Results

(21)

Frameworks from the ESSEX project

phist (Pipelined Hybrid-parallel Linear Solver Toolkit)

- General C-interface to linear algebra libraries:
  - GHOST
  - Trilinos (C++, http://trilinos.sandia.gov)
  - builtin (for prototyping, row-major storage, Fortran + C99)
- Iterative solvers
- Large test framework

GHOST (General Hybrid Optimized Sparse Toolkit)

- Developed at the RRZE in Erlangen
- Hybrid-parallel MPI+OpenMP+CUDA
- SELL-C-σ matrix format

(26)

Block pipelined GMRES or MINRES preconditioning

Idea

- Solve $n_b$ linear systems $A x_i = b_i$ at once
- Remove converged systems and add new ones (pipelining)
- Standard GMRES or MINRES method (for each system)

[Figure: one block operation $\tilde v_{k+1} \leftarrow A v_k$ feeding GMRES subspace 1, GMRES subspace 2, ...]
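The pipelining idea (a fixed number of active slots, converged systems are replaced by waiting ones, and all active systems share one block matrix product per iteration) can be sketched as follows. To keep the example short it swaps in plain conjugate gradients per system instead of GMRES or MINRES, so it assumes a symmetric positive definite `A` and nonzero right-hand sides; the function name `pipelined_block_cg` is hypothetical.

```python
import numpy as np

def pipelined_block_cg(A, B, nb=2, tol=1e-10, max_iter=1000):
    """Solve A x_j = b_j for all columns of B with at most nb active systems."""
    n, n_sys = B.shape
    X_out = np.zeros((n, n_sys))
    queue = list(range(n_sys))                       # systems still waiting
    slot = [queue.pop(0) for _ in range(min(nb, n_sys))]
    X = np.zeros((n, len(slot)))
    R = B[:, slot].copy()
    P = R.copy()
    rho = np.einsum("ij,ij->j", R, R)                # squared residual norms
    bnorm = np.sqrt(rho)
    for _ in range(max_iter):
        AP = A @ P                                   # one block product for all slots
        alpha = rho / np.einsum("ij,ij->j", P, AP)
        X += P * alpha
        R -= AP * alpha
        rho_new = np.einsum("ij,ij->j", R, R)
        P = R + P * (rho_new / rho)
        rho = rho_new
        done = np.sqrt(rho) <= tol * bnorm
        for s in np.flatnonzero(done):
            X_out[:, slot[s]] = X[:, s]              # store the converged solution
            if queue:                                # refill the slot with a new system
                j = queue.pop(0)
                slot[s] = j
                X[:, s] = 0.0
                R[:, s] = B[:, j]
                P[:, s] = B[:, j]
                rho[s] = B[:, j] @ B[:, j]
                bnorm[s] = np.sqrt(rho[s])
                done[s] = False
        if done.any():                               # nothing left to refill: shrink the block
            keep = ~done
            X, R, P = X[:, keep], R[:, keep], P[:, keep]
            rho, bnorm = rho[keep], bnorm[keep]
            slot = [j for j, k in zip(slot, keep) if k]
        if not slot:
            break
    for s, j in enumerate(slot):                     # unconverged systems, if any
        X_out[:, j] = X[:, s]
    return X_out
```

Each system keeps its own scalar recurrences, so convergence per system matches the single-vector method, while the expensive matrix product always runs on a full block as long as systems are waiting.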

(30)

Outline

Block-Jacobi-Davidson algorithm
Performance analysis
Implementation
Results
  Numerical behavior
  Performance of the complete algorithm
  Conclusion

(31)

Numerical behavior

Block size 2

[Figure: relative overhead (# spMVM), axis 1 to 3.5, versus # eigenvalues found (5 to 50); matrices: Andrews, cfd1, finan512, torsion1, ck656, cry10000, dw8192, rdb3200l]

Block size 4

[Figure: relative overhead (# spMVM), axis 1 to 3.5, versus # eigenvalues found (5 to 50)]

(32)

Performance of the complete algorithm (0)

Seeking 20 eigenvalues of spinSZ26 ($1 \cdot 10^7$ rows, $1.5 \cdot 10^8$ nonzeros)

Single socket (10 cores)

[Figure: runtime [s] (axis 0 to 600) for tpetra and ghost; series: single vector, block size 2, block size 4]

Complete node (2 × 10 cores)

[Figure: runtime [s] (axis 0 to 600) for tpetra and ghost; series: single vector, block size 2, block size 4]

(33)

Performance of the complete algorithm (1)

Seeking 20 eigenvalues of spinSZ26 ($1.0 \cdot 10^7$ rows, $1.5 \cdot 10^8$ nonzeros)

Block speedup

[Figure: speedup through blocking ($t_1 / t_{block}$, axis 0 to 1.6) versus # nodes (with 20 cores each; 1 to 128); series: single vector, block size 2, block size 4]

Strong scaling

[Figure: runtime [s] (axis 0 to 250) versus # nodes (with 20 cores each; 1 to 128); series: single vector, block size 2, block size 4]

(34)

Performance of the complete algorithm (2)

Seeking 20 eigenvalues of spinSZ30 ($1.6 \cdot 10^8$ rows, $2.6 \cdot 10^9$ nonzeros)

Block speedup

[Figure: speedup through blocking ($t_1 / t_{block}$, axis 0 to 1.6) versus # nodes (with 20 cores each; 16 to 64); series: single vector, block size 2, block size 4]

Strong scaling

[Figure: runtime [s] (axis 0 to 500) versus # nodes (with 20 cores each; 16 to 64); series: single vector, block size 2, block size 4]

(35)

Conclusion (1)

Block JDQR algorithm

- Performance analysis of block operations:
  - Impact of memory layout (row- instead of col.-major)
  - Node-level: better code balance
  - Inter-node: message aggregation
- Numerical behavior
  - Slight increase of operations
  - Overhead small for > 10 eigenvalues

→ Improved performance by factor 1.2-1.4

(38)

Conclusion (2)

Outlook

- Numerical improvements:
  - Parallel preconditioning of the linear problems
- Algorithmic improvements:
  - Hiding spMMVM communication behind other operations
  - More asynchronous communication (all-reductions)

(41)

Outline

FEAST eigensolver (Polizzi '09)
Linear systems for FEAST/graphene
Experiments

(43)

Linear systems for FEAST/graphene

Tough:

- very large ($N \sim 10^9$ carbon atoms)
- complex symmetric and completely indefinite
- small random numbers on and around the diagonal
- spectrum essentially continuous
- shifts get very close to the spectrum

But also nice in some ways:

- 2D mesh, very sparse (∼ 10 entries/row)
- many RHS/shift (block methods, etc.)

(44)

The CGMN linear solver

- Björck and Elfving, 1979
- CG on the semi-definite problem $(I - Q_{SSOR}) x = b$, where $Q_{SSOR} = Q_1 Q_2 \cdots Q_N Q_{N-1} \cdots Q_1$ is the SSOR iteration on $A A^T y = b$
- $Q_i v = \left(I - \omega \frac{a_i a_i^T}{a_i^T a_i}\right) v$: project $v$ with respect to $a_i$ (row $i$ of $A$, taken as a column vector)
- extremely robust: $A$ may be singular, non-square etc.
- squaring $A$ remedies small diagonal entries
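The row projections $Q_i$ and the symmetric sweep order $1, \ldots, N, N-1, \ldots, 1$ can be sketched in a few lines. Note the assumptions: this is only the stationary Kaczmarz double sweep that CGMN accelerates with CG, not CGMN itself; it uses real dense arrays rather than the complex symmetric systems of the talk, includes the right-hand side in each projection, and assumes no zero rows. The function name `kaczmarz_double_sweep` is hypothetical.

```python
import numpy as np

def kaczmarz_double_sweep(A, b, x, omega=1.0):
    """One forward and one backward sweep of relaxed row projections.

    Each step moves x towards the hyperplane a_i^T x = b_i:
        x <- x + omega * (b_i - a_i^T x) / (a_i^T a_i) * a_i
    For b = 0 this is exactly the operator Q_i applied to x.
    """
    n_rows = A.shape[0]
    # Rows 0..N-1 forward, then N-2..0 backward (the last row is used once).
    order = list(range(n_rows)) + list(range(n_rows - 2, -1, -1))
    for i in order:
        a = A[i]
        x = x + omega * (b[i] - a @ x) / (a @ a) * a
    return x
```

Because every step only needs one row of $A$ and never divides by a diagonal entry, the sweep is well defined even when diagonal entries are tiny or the matrix is non-square.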

(45)

Two parallelization strategies

Algebraic Multi-Coloring

Distance-2 coloring resolves data dependency

- yields fine-grained parallelism (e.g. GPGPU)

CARP: Component-Averaged Row Projection (Gordon & Gordon, 2005)

- sequential sweeps on subdomains
- exchange and average halo elements
- retains convergence properties of sequential algorithm
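For the multi-coloring strategy, two row projections can run concurrently only if the rows share no column, since they would otherwise update the same vector entries. A simple greedy coloring with that property is sketched below in plain Python; it is an illustration under that assumption and not the coloring algorithm used in PHIST/GHOST, and the name `color_rows` is hypothetical.

```python
def color_rows(row_cols):
    """Greedy coloring: rows that share a column index get different colors.

    row_cols[i] is the list of column indices with nonzeros in row i.
    """
    col_rows = {}
    for i, cols in enumerate(row_cols):
        for c in cols:
            col_rows.setdefault(c, []).append(i)
    colors = [-1] * len(row_cols)                    # -1 means "not yet colored"
    for i, cols in enumerate(row_cols):
        used = {colors[j] for c in cols for j in col_rows[c] if colors[j] >= 0}
        color = 0
        while color in used:                         # smallest color not used by a conflicting row
            color += 1
        colors[i] = color
    return colors
```

All rows of one color can then be projected in parallel, and the colors are processed one after another; for a tridiagonal pattern this needs three colors.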

(46)

Weak and strong scaling up to 5k Cores

Graphene, 20M atoms/node (up to 5B)

- Coloring costs performance;
- but is more memory efficient;

Strong scaling, 20M unknowns in total

- yields better strong scaling;
- and has better potential for GPUs and

(47)

Socket scaling and block performance

- Performance penalty of coloring already at core (25%) and socket (50%) levels
- Blocking alleviates poor memory access patterns with coloring
- PHIST/GHOST: MC_CARP kernel only available in CRS

(48)

Conclusion

FEAST for interior eigenvalues

- Robust iterative solution of the arising nearly singular equation systems with CGMN
- Scalable parallelization of CGMN with algebraic multi-coloring and CARP
