
www.DLR.de • PASC'18 • J. Thies et al. • Sparse Solvers

Holistic Performance Engineering for Sparse Iterative Solvers

Jonas Thies*, Dominik Ernst†, Melven Röhrig-Zöllner*

*Institute of Simulation and Software Technology, German Aerospace Center (DLR)

†Erlangen Regional Computing Center (RRZE), University of Erlangen-Nuremberg


Sparse Eigenvalue Problems

Formulation: find some eigenpairs (λ_j, v_j) of a large and sparse matrix (pair) in a target region of the spectrum:

A v_j = λ_j B v_j

• A Hermitian or general, real or complex
• B may be the identity matrix (or not)
• 'some' may mean 'quite a few', 100-1 000 or so

Applications
• quantum and fluid mechanics: graphene, Anderson localization, Hubbard model
• DLR applications


Overview of some Sparse Eigensolvers

Computationally related algorithms

CG ⇐⇒ Lanczos
GMRES ⇐⇒ Arnoldi
Block GMRES ⇐⇒ Block Krylov-Schur
Jacobi-Davidson ⇐⇒ inexact Newton
etc.

Typical algorithmic pattern:

• obtain new search direction(s) v_j (e.g. by using a matvec or solving a system),
• orthogonalize against the previous vectors V = [v_0 ... v_{j−1}] and expand V ← [V, v_j].
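In code, the expand step above amounts to one pass of (modified) Gram-Schmidt followed by normalization. A minimal pure-Python sketch, for illustration only (not the optimized kernels discussed later):

```python
# Expand a Krylov basis: orthogonalize a new direction v against
# the previous basis V = [v_0 ... v_{j-1}], normalize, and append.
# Vectors are plain Python lists.

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def axpy(alpha, x, y):          # returns y + alpha*x
    return [yi + alpha * xi for xi, yi in zip(x, y)]

def norm(x):
    return dot(x, x) ** 0.5

def expand_basis(V, v):
    """Modified Gram-Schmidt: v <- v - (w^T v) w for each w in V."""
    for w in V:
        v = axpy(-dot(w, v), w, v)
    nrm = norm(v)
    V.append([vi / nrm for vi in v])
    return V

V = [[1.0, 0.0, 0.0]]
expand_basis(V, [1.0, 1.0, 0.0])   # appends a vector orthogonal to v_0
```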


Performance of Sparse Matrix Algorithms

Typical operations are memory-bounded:

• 'spMVM': y ← A·x
• vector operations: s ← x^T y, x ← αx + βy

... unless the data sets are small:

• CPU/KNL: OpenMP overhead ≈ 25 µs
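For a memory-bounded kernel, runtime is essentially bytes moved divided by streaming bandwidth, so the attainable flop rate is capped at (flops per byte) × bandwidth. A back-of-the-envelope sketch (the numbers are illustrative, not measurements):

```python
# Roofline-style bound for the BLAS1 update x <- a*x + b*y in
# double precision: 3 streams per element (load x, load y, store x)
# and 3 flops per element.

def axpy_roofline(n, bandwidth_gbs):
    bytes_moved = 3 * 8 * n              # 3 streams of 8-byte doubles
    flops = 3 * n
    time_s = bytes_moved / (bandwidth_gbs * 1e9)
    return flops / time_s / 1e9          # attainable GFlop/s

# At a triad-like 260 GB/s, axpy is capped at 260/8 = 32.5 GFlop/s,
# consistent with the BLAS1 rates reported on these slides.
```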

Why Performance Engineering?

simple(?) operation: C = V^T W


Test Hardware

• "Skylake": Intel Xeon Scalable, 4 × 14 cores @ 2.6 GHz, 384 GB DDR4 RAM
• "KNL": Intel Xeon Phi, 64 cores @ 1.4 GHz, 16 GB HBM (cache mode)
• "Volta": NVidia Tesla V100-SXM2 GPU, 16 GB HBM2 (+UVM)

Measured streaming memory bandwidth [GB/s]:

benchmark   Skylake   KNL   Volta
load        360       338   812
store       200       167   883
triad       260       315   843


[Figure: device memory size / bandwidth in ms for successive GPU generations; roughly K20: 24, K40: 42, P100: 26, V100: 18.]

"% of roofline" reached by C = X^T Y, X, Y ∈ R^(N×n_b), on Volta:

n_b \ N   1M   2M   4M   8M   16M   32M
1         12   23   37   58   78    83
2         31   35   53   68   81    88
4         34   53   66   83   88    95
8         51   70   85   87   99   100


Example: Conjugate Gradients (CG)

• 1 000 CG iterations
• 3D 7-point Laplacian

grid   N       memory
256³   16.7M   2.2 GB
512³   132M    18 GB

Achieved performance [GFlop/s]:

           Skylake   KNL    Volta
BLAS1      32.5      39.4   105
CG 256³    10        24     82
CG 512³    33        21     63
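The memory column can be reproduced from a simple storage model. A sketch, assuming CRS-style storage (8-byte values, 4-byte column indices) and five CG vectors; these assumptions are mine, chosen to come close to the figures above (the exact numbers depend on the solver's vector count and index sizes):

```python
# Rough working-set estimate for CG on a 7-point stencil.

def cg_working_set_gb(grid, nnz_per_row=7, n_vectors=5):
    n = grid ** 3
    matrix = n * nnz_per_row * (8 + 4)   # values + column indices
    vectors = n_vectors * n * 8          # e.g. x, r, p, Ap, b in double
    return (matrix + vectors) / 1e9

# cg_working_set_gb(256) ≈ 2.1 GB, cg_working_set_gb(512) ≈ 16.6 GB
```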


Increasing the Flop Intensity

Aim: push operations to the right in the roofline diagram (higher arithmetic intensity)

Block solvers (block size n_b)

• inner products ⇒ factor n_b² more flops
• vector updates remain BLAS1 (X ← X + αY)
• Caveat: may increase the number of iterations
• Example: block GMRES for multiple RHS

Kernel fusion

• example: compute Y ← AX and simultaneously C = X^T Y 'for free'
• requires specialized kernels
• no deterioration of numerics

Mixed precision (work in progress)

• store (block) vectors in single precision
• compute in double precision to maintain numerical stability
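As an illustration of kernel fusion, here is a sketch of a CRS sparse matvec that accumulates c = x^T y in the same pass, so y is not streamed from memory a second time. Pure Python for clarity, not the actual optimized kernel:

```python
# Fused spMVM + inner product: y <- A*x and c = x^T y in one sweep.
# A is given in CRS form as (row_ptr, col_idx, vals).

def spmv_dot_fused(row_ptr, col_idx, vals, x):
    y = [0.0] * (len(row_ptr) - 1)
    c = 0.0
    for i in range(len(y)):
        yi = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            yi += vals[k] * x[col_idx[k]]
        y[i] = yi
        c += x[i] * yi        # fused: reuse y_i while it is still "hot"
    return y, c

# diag(2, 3) applied to [1, 2]: y = [2, 6], c = 1*2 + 2*6 = 14
y, c = spmv_dot_fused([0, 1, 2], [0, 1], [2.0, 3.0], [1.0, 2.0])
```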



Example: Block Orthogonalization

Problem definition

• Given orthogonal vectors W = [w_1, ..., w_k]
• For X ∈ R^(n×n_b), find Y ∈ R^(n×ñ_b) with orthonormal columns such that

  Y R_1 = X − W R_2,  and  W^T Y = 0

Two-phase algorithms

Phase 1 — Project: X̄ ← (I − WW^T)X
Phase 2 — Orthogonalize: Y ← f(X̄)

• suitable f:
  • SVQB (Stathopoulos and Wu, SISC 2002)
  • TSQR (Demmel et al., SISC 2012)
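A toy sketch of the two-phase scheme, using Cholesky-QR as the orthogonalization step f (SVQB and TSQR, named above, are the more robust choices). Blocks are lists of column vectors; this is an illustration of the structure, not a production kernel:

```python
# Two-phase block orthogonalization: project X against an
# orthonormal block W, then orthonormalize via Cholesky-QR.

def matT_mat(A, B):   # A^T B for blocks stored as lists of columns
    return [[sum(a[i] * b[i] for i in range(len(a))) for b in B] for a in A]

def project(W, X):    # X <- (I - W W^T) X
    coeffs = matT_mat(W, X)
    return [[X[j][i] - sum(W[k][i] * coeffs[k][j] for k in range(len(W)))
             for i in range(len(X[j]))] for j in range(len(X))]

def cholesky(M):      # lower-triangular L with L L^T = M
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = M[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (s ** 0.5) if i == j else s / L[j][j]
    return L

def chol_qr(X):       # Y with Y^T Y = I and span(Y) = span(X)
    L = cholesky(matT_mat(X, X))
    Y = []
    for j in range(len(X)):           # y_j = (x_j - sum L[j][k] y_k) / L[j][j]
        y = list(X[j])
        for k in range(j):
            y = [yi - L[j][k] * yk for yi, yk in zip(y, Y[k])]
        Y.append([yi / L[j][j] for yi in y])
    return Y

W = [[1.0, 0.0, 0.0]]
X = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
Y = chol_qr(project(W, X))   # orthonormal and orthogonal to W
```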


Block Orthogonalization with Kernel Fusion

Rearrange and fuse operations to reduce memory traffic:

Phase 1: X̄ ← X − W N,  M ← X̄^T X̄
Phase 2: X̄ ← X̄ M,      N ← W^T X̄
Phase 3: X̄ ← X̄ M,      M ← X̄^T X̄

⇒ use SVQB or Cholesky-QR for the small factorizations

Increased precision

Idea: calculate the value and the error of each arithmetic operation

• store intermediate results as double-double (DD) numbers
• based on arithmetic building blocks (2Sum, 2Mult);
  Muller et al.: Handbook of Floating-Point Arithmetic, Springer 2010
• exploit FMA operations (AVX2)
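The 2Sum building block mentioned above computes s = a + b in working precision together with the exact rounding error t, so that a + b = s + t holds exactly; accumulating the t's gives a double-double sum. A minimal sketch (Knuth's branch-free 2Sum, per Muller et al.):

```python
# 2Sum: s = fl(a+b) and the exact error t, with a + b = s + t.
def two_sum(a, b):
    s = a + b
    a2 = s - b
    b2 = s - a2
    t = (a - a2) + (b - b2)
    return s, t

# Double-double accumulation: keep a (hi, lo) pair of the sum.
def dd_sum(values):
    hi, lo = 0.0, 0.0
    for v in values:
        hi, err = two_sum(hi, v)
        lo += err
    return hi, lo

# Naive summation of [1.0, 1e-20, -1.0] loses the 1e-20 entirely;
# dd_sum recovers it in the low part.
```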


Block Orthog: runtime to convergence

[Figure: runtime / block size n_b in seconds over n_proj × n_b ∈ {10, 20, 40} × {1, 2, 4}, comparing iterated project/SVQB, the same with fused operations, and with fused DD operations.]


Software I: our Kernel Library

GHOST: General, Hybrid-parallel and Optimized Sparse Toolkit

• provides memory-bounded kernels for sparse solvers
• data structures:
  • row- or column-major block vectors
  • SELL-C-σ for sparse matrices
• written (mostly) in C
• 'MPI+X' with X ∈ {OpenMP, CUDA, SIMD intrinsics}
• runs on peta-scale systems (Piz Daint, Oakforest-PACS)
• can use heterogeneous systems (e.g. including CPUs, MIC and GPUs)


Software II: Algorithms and Integration Framework

PHIST: Pipelined, Hybrid-parallel Iterative Solver Toolkit

• interfaces: C, C++, Fortran, Python
• testing and benchmarking tools
• includes performance models
• various linear and eigensolvers

Select a 'backend' at compile time: GHOST, builtin (Fortran), Tpetra, PETSc


Example: Anasazi Block Krylov-Schur on Skylake CPU

Matrix: non-symmetric 7-point stencil, N = 128³
(variable-coefficient reaction/convection/diffusion)

Runtime [s]:

                       n_b=1   n_b=2   n_b=4   n_b=8
Tpetra, their orthog    21.4    27.9    24.5    23.7
Tpetra, PHIST orthog    12.2    16.4    15.1    16.7
GHOST, their orthog     8.83    8.15    7.36    8.1
GHOST, PHIST orthog     4.84    5.07    4.63    4.7

• Anasazi's kernel interface is mostly a subset of PHIST's
  ⇒ extends PHIST by e.g. BKS and LOBPCG
• not optimized for block vectors in row-major storage


Example: Anasazi Block Krylov-Schur on Volta GPU

Matrix: non-symmetric 7-point stencil, N = 128³
(variable-coefficient reaction/convection/diffusion)

Runtime [s]:

                       n_b=1   n_b=2   n_b=4   n_b=8
Tpetra, their orthog    16.8    19.5    17.9    20.8
Tpetra, PHIST orthog    16.8    23.5    27.6    35.5
GHOST, their orthog     17.7    20.2    20.2    31.6
GHOST, PHIST orthog     6.43    6.44    6.67    6.72

• Anasazi's kernel interface is mostly a subset of PHIST's
  ⇒ extends PHIST by e.g. BKS and LOBPCG
• not optimized for block vectors in row-major storage


Can we do better?

PHIST PerfCheck: replace the timing output by a simple performance model

• Anasazi BKS with n_b = 4 gives lines like this:

  function(dim) / (formula)                             total time   %roofline   count
  phist_Dmvec_times_sdMat_inplace(nV=4,nW=4,*iflag=0)   6.156e+00    11.7        174
    STREAM_TRIAD((nV+nW)*n*sizeof(_ST_))

• 'realistic' option: report strided data accesses due to 'views' and adjust the performance model:

  function(dim) / (formula)                                    total time   %roofline   count
  phist_Dmvec_times_sdMat_inplace(nV=4,nW=4,ldV=85,*iflag=0)   6.013e+00    23.8        174
    STREAM_TRIAD((nV_+nW_)*n*sizeof(_ST_))
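The %roofline column follows from the printed traffic formula and a streaming benchmark bandwidth: compare the measured time against the minimum time needed to move that many bytes. A sketch of the idea (not PHIST's implementation; the numbers in the comment are made up for illustration):

```python
# PerfCheck-style roofline percentage for a memory-bounded kernel.

def roofline_percent(bytes_moved, measured_time_s, stream_bw_gbs):
    t_min = bytes_moved / (stream_bw_gbs * 1e9)   # bandwidth-limited time
    return 100.0 * t_min / measured_time_s

# Traffic formula from the output above, for double precision:
# (nV + nW) * n * sizeof(double) bytes per call.
def mvec_times_sdmat_bytes(n, nV=4, nW=4):
    return (nV + nW) * n * 8

# e.g. 8 GB moved in 20 ms at 800 GB/s peak -> 50% of roofline
```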



Block Jacobi-Davidson QR

• use inexact Newton rather than a Krylov sequence
• JDQR: subspace acceleration, locking and restart (Fokkema '99)

Block Jacobi-Davidson correction equation

• n_b current approximations: A ṽ_i − λ̃_i ṽ_i = r_i, i = 1, ..., n_b
• previously converged Schur vectors Q = [q_1, ..., q_k]
• solve approximately (with Q̃ = [Q, ṽ_1, ..., ṽ_{n_b}]):

  (I − Q̃ Q̃^T)(A − λ̃_i I)(I − Q̃ Q̃^T) x_i = −r_i,  i = 1, ..., n_b

• use some steps of a block(ed) iterative solver
• orthogonalize the new directions x_1, ..., x_{n_b}
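An inner Krylov solver for the correction equation only needs the action of the projected operator. A minimal dense sketch (Q̃ passed as a list of orthonormal columns, apply_A as a user-supplied matvec; both are assumptions of this toy version):

```python
# Apply (I - Q Q^T)(A - lam*I)(I - Q Q^T) to a vector x.

def project_out(Q, x):                 # x <- (I - Q Q^T) x
    for q in Q:
        c = sum(qi * xi for qi, xi in zip(q, x))
        x = [xi - c * qi for qi, xi in zip(q, x)]
    return x

def jd_operator(apply_A, Q, lam, x):
    x = project_out(Q, x)              # right projection
    y = apply_A(x)
    y = [yi - lam * xi for yi, xi in zip(y, x)]   # (A - lam*I) x
    return project_out(Q, y)           # left projection

# A = diag(2, 5), Q spans e_1, lam = 1, x = (3, 4):
# the e_1 component is projected away, leaving (5-1)*4 = 16 in e_2.
```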


BJDQR: ‘Numerical Overhead’

[Figure: convergence histories for block sizes 2, 4 and 8.]

With larger block size...

• the number of (outer) iterations decreases
• the total number of operations increases



BJDQR on Different Hardware (here: Laplace problem)

Achieved performance [GFlop/s]:

          Skylake (512³)   KNL (256³)   Volta (128³)
BLAS1      32.5             39.4         105
CG         32               24           76
BJDQR-1    51               35           132
BJDQR-2    73               47           147
BJDQR-4    103              55           221


Preconditioning with ML

• non-symmetric PDE as before
• use the AMG preconditioner ML from Trilinos ("non-symmetric smoothed aggregation")

problem size   preconditioner   iterations   spMVMs   t_tot     t_gmres
128³           GMRES            447          9 875    52.6 s    25.0 s
128³           GMRES+ML         26           612      21.2 s    11.4 s
256³           GMRES            781          17 223   1 300 s   571 s
256³           GMRES+ML         40           922      346 s     183 s
512³           GMRES            >1k          >22k     >1 h      >1 h
512³           GMRES+ML         32           746      624 s     320 s


Summary

• Two libraries for high-performance sparse solvers: GHOST & PHIST
• support the algorithm developer with
  • a kernel interface inspired by MPI
  • kernels & algorithmic core operations
  • built-in performance models
• Holistic performance engineering is needed for sparse block solvers
  • tight interplay of data structures, kernels, core operations and algorithms
  • example: Block Krylov-Schur with different kernels and orthogonalization routines

Next steps:

• a model that takes fast and slow memory segments into account
• demonstrate the strong-scaling advantage of block methods
