Graphic Processing Units: a possible answer to High Performance Computing?

HPC & GPU

Outline: BigDFT operations, GPU hardware, Nvidia and CUDA, CUBLAS, GPU computation, BigDFT convolutions, Full code, Discussion

4th ABINIT Developer Workshop

RESIDENCE L’ESCANDILLE – AUTRANS

Luigi Genovese

ESRF - Grenoble

The BigDFT code

Different numerical operations performed:
- For each wavefunction (Hamiltonian application)
- Between wavefunctions (linear algebra)

Numerical operations:
- Convolutions with short filters
- BLAS routines
- FFT (Poisson Solver)

[Figure labels: Interpolating, Daubechies]

Separable convolutions

We must calculate

$$
F(I_1, I_2, I_3) = \sum_{j_1, j_2, j_3 = 0}^{L} h_{j_1} h_{j_2} h_{j_3} \, G(I_1 - j_1, I_2 - j_2, I_3 - j_3)
= \sum_{j_1 = 0}^{L} h_{j_1} \sum_{j_2 = 0}^{L} h_{j_2} \sum_{j_3 = 0}^{L} h_{j_3} \, G(I_1 - j_1, I_2 - j_2, I_3 - j_3)
$$

Application of three successive operations

1. $A_3(I_3, i_1, i_2) = \sum_j h_j \, G(i_1, i_2, I_3 - j) \quad \forall i_1, i_2$;
2. $A_2(I_2, I_3, i_1) = \sum_j h_j \, A_3(I_3, i_1, I_2 - j) \quad \forall I_3, i_1$;
3. $F(I_1, I_2, I_3) = \sum_j h_j \, A_2(I_2, I_3, I_1 - j) \quad \forall I_2, I_3$.

Main routine: Convolution + transposition

$$
F(I, a) = \sum_j h_j \, G(a, I - j) \quad \forall a
$$
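The decomposition into three convolution + transposition passes can be checked numerically. The following is a minimal sketch in plain Python, not the BigDFT implementation: the function names, the flat array layout and the periodic wrap-around of the index `I - j` are assumptions made for illustration, since the slides do not specify the boundary treatment here.

```python
def conv_transpose(h, G, n, ndat):
    """Return F(I, a) = sum_j h[j] * G(a, I - j) for all a.

    G is a flat list of ndat rows of length n (element a * n + i).
    F is a flat list of n rows of length ndat (element I * ndat + a),
    so the convolved index becomes the slowest one (the transposition).
    Periodic wrap-around of I - j is an assumption of this sketch.
    """
    F = [0.0] * (n * ndat)
    for a in range(ndat):
        for I in range(n):
            acc = 0.0
            for j, hj in enumerate(h):
                acc += hj * G[a * n + (I - j) % n]
            F[I * ndat + a] = acc
    return F


def separable_conv3d(h, G, n1, n2, n3):
    """Three successive 1D passes; G is stored with i3 as the fastest index."""
    A3 = conv_transpose(h, G, n3, n1 * n2)     # A3(I3, i1, i2)
    A2 = conv_transpose(h, A3, n2, n3 * n1)    # A2(I2, I3, i1)
    return conv_transpose(h, A2, n1, n2 * n3)  # F(I1, I2, I3)


def direct_conv3d(h, G, n1, n2, n3):
    """Reference: the full triple sum, evaluated element by element."""
    F = [0.0] * (n1 * n2 * n3)
    for I1 in range(n1):
        for I2 in range(n2):
            for I3 in range(n3):
                acc = 0.0
                for j1, h1 in enumerate(h):
                    for j2, h2 in enumerate(h):
                        for j3, h3 in enumerate(h):
                            src = ((((I1 - j1) % n1) * n2 + (I2 - j2) % n2) * n3
                                   + (I3 - j3) % n3)
                            acc += h1 * h2 * h3 * G[src]
                F[(I1 * n2 + I2) * n3 + I3] = acc
    return F


# Small check on made-up data (grid sizes and filter values are arbitrary).
n1, n2, n3 = 3, 4, 5
h = [0.5, 0.3, 0.2]
G = [((7 * k) % 11) / 11.0 for k in range(n1 * n2 * n3)]
fast = separable_conv3d(h, G, n1, n2, n3)
slow = direct_conv3d(h, G, n1, n2, n3)
max_err = max(abs(x - y) for x, y in zip(fast, slow))
print("largest difference between the two results:", max_err)
```

After the third pass the data is back in its original index order, which is why the same routine can be reused for each direction. For a filter of length L + 1 the three passes cost about 3(L + 1) multiply-adds per grid point, against (L + 1)^3 for the direct triple sum.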

Evolution of GPU accelerators

Increasing power in recent years

The price (and the power consumption) per GFlops keeps decreasing!

Why not use them?

How to code scientific operations on a GPU?

GPU hardware is designed for . . . graphics calculations:
- Texture shaders → rendering
- Single precision calculations

The CUDA programming language

NVidia GPU: the CUDA programming language

- The API is an extension to ANSI C(++): low learning curve
- The hardware is designed for a lightweight runtime: high performance with moderate optimisation costs
- A number of SDKs that help the beginner

CUBLAS and CUFFT

Nvidia provides the CUDA user with pre-built libraries for BLAS and FFT

✔ Can really be used as black boxes

Since July 2008

Example: BLAS operations used in BigDFT

Double precision calculations for N_orb × m matrices (m = 300 kB)

[Figure: GPU speedup (double precision) versus number of orbitals, for DGEMM and DSYRK]
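For readers less familiar with the two BLAS routines compared here, the sketch below shows in plain Python what each one computes in its simplest form (scaling factors alpha = 1 and beta = 0). The function names and the overlap-matrix use case are assumptions chosen for illustration; the slides do not state which products BigDFT forms with each routine.

```python
def gemm_nt(A, B):
    """General product C = A * B^T, the kind of product DGEMM computes.

    A and B are lists of rows; every entry of C is computed.
    """
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in B] for ra in A]


def syrk_lower(A):
    """Symmetric rank-k update C = A * A^T, the kind of product DSYRK computes.

    Only the lower triangle is filled, as in BLAS; the entries above the
    diagonal are left at zero here. This roughly halves the arithmetic.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            C[i][j] = sum(x * y for x, y in zip(A[i], A[j]))
    return C


# Hypothetical example: 2 "orbitals" with 3 coefficients each.
psi = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0]]
overlap_full = gemm_nt(psi, psi)   # all N_orb x N_orb entries
overlap_lower = syrk_lower(psi)    # lower triangle only
```

When both factors are the same matrix, the result is symmetric, so the triangular routine carries the same information at about half the floating-point cost.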

BigDFT performances

[Figure, free BC: percentage of run time per section (LinAlg, sumrho, PSolver, HamApp, Precond, Other, Comm) and total time in seconds (log scale) versus number of atoms]

BigDFT performances

[Figure, periodic BC (8 atoms/core): percentage of run time per section (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds versus number of cores]

The convolutions on GPU

Main intensive routine: Convolution + transposition

$$
F(I, a) = \sum_j h_j \, G(a, I - j) \quad \forall a
$$

Combined set of 1D convolutions:

✔ Easy to parallelize (in the GPU sense)

✔ Short filters: loop unrolling, fewer registers

Optimal for hiding memory latency with arithmetic

The BigDFT code makes it possible to access hybrid CPU-GPU supercomputers

CEA/GENCI “Titane” machine, hybrid section
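Why this routine is easy to parallelize in the GPU sense can be illustrated by emulating one thread per output element. The sketch below is plain Python, not CUDA and not the BigDFT kernel: the 4-tap filter length, the function names and the periodic wrap-around are assumptions for illustration only.

```python
def thread_body(h, G, F, n, ndat, tid):
    """Work of one emulated thread: a single output element F(I, a).

    The flat thread index tid equals the flat index of the output,
    F[I * ndat + a], so neighbouring threads write neighbouring entries.
    """
    I, a = divmod(tid, ndat)
    row = a * n  # start of row a in the input G (ndat rows of length n)
    # Filter loop unrolled by hand for an assumed 4-tap filter;
    # periodic wrap-around of the index is an assumption of this sketch.
    F[tid] = (h[0] * G[row + I]
              + h[1] * G[row + (I - 1) % n]
              + h[2] * G[row + (I - 2) % n]
              + h[3] * G[row + (I - 3) % n])


def launch(h, G, n, ndat, order=None):
    """Run every emulated thread once, in the given order."""
    F = [0.0] * (n * ndat)
    tids = range(n * ndat) if order is None else order
    for tid in tids:
        thread_body(h, G, F, n, ndat, tid)
    return F


# Made-up data: the result does not depend on the execution order.
n, ndat = 5, 3
h = [0.5, 0.25, 0.125, 0.125]
G = [float(k) for k in range(n * ndat)]
in_order = launch(h, G, n, ndat)
reversed_order = launch(h, G, n, ndat, order=list(reversed(range(n * ndat))))
```

Each output element reads only the input and writes only its own slot, so no synchronisation between threads is needed. With a short, fixed filter length the inner loop can be written out explicitly, as done above.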

One-dimensional convolutions

June 2008

Preliminary results (internship of M. Ospici, LIG - Bull)

[Figure: speedup versus data size (MB) for G80, GT200 (single precision) and GT200 (double precision)]

All the three-dimensional convolution operators

Now

Double precision calculations, full 3D operators

[Figure: GPU speedup (double precision) versus wavefunction size (MB) for locden, locham and precond]

Full Hybrid code

We can insert it in the full code, in parallel

[Figure: percentage breakdown, hybrid code (rel.) versus CPU code]

Around 7 times faster

A lot of operations can still be improved

[Figure: percentage plot]

Summary and outlook

Considerations:

Our BigDFT experience
- GPUs may represent a real alternative for speeding up the computations
- Production computations are accessible, not only prototypes

. . . but . . .

Can one draw general conclusions? Probably not . . .

How can we estimate the benefit/cost ratio?
- Nature of the numerical operations
- Hot-spot operations (>80% of the overall time)
- Multi-GPU?
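One simple way to reason about the hot-spot criterion is Amdahl's law: if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below is an illustration of that standard formula, not the estimate used in the talk; the function name and the sample numbers are hypothetical.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's law: speedup of the whole code when only the hot spots
    (a fraction hot_fraction of the original run time) run
    kernel_speedup times faster."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must lie between 0 and 1")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


# Hypothetical kernel speedups for a code with 80% of its time in hot spots.
for s in (5.0, 20.0, 100.0):
    print(f"kernel speedup {s:5.0f} -> overall {overall_speedup(0.8, s):.2f}")
```

With exactly 80% of the time in the accelerated part, the overall gain stays below 1 / (1 - 0.8) = 5 however fast the kernels become; a larger hot-spot fraction raises that bound.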