HPC & GPU
BigDFT operations GPU Hardware
Nvidia and CUDA CUBLAS GPU computation BigDFT convolutions Full code Discussion
4th ABINIT Developer Workshop
RESIDENCE L’ESCANDILLE– AUTRANS
Graphic Processing Units: a possible answer to
High Performance Computing?
Luigi Genovese
ESRF - Grenoble
The BigDFT code
Different numerical operations performed:
For each wavefunction (Hamiltonian application)
Between wavefunctions (Linear algebra)
Numerical operations
Convolutions with short filters
BLAS routines
FFT (Poisson Solver)
[Figure labels: Interpolating, Daubechies]
Separable convolutions
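We must calculate
\[
F(I_1, I_2, I_3) = \sum_{j_1, j_2, j_3 = 0}^{L} h_{j_1} h_{j_2} h_{j_3} \, G(I_1 - j_1, I_2 - j_2, I_3 - j_3)
= \sum_{j_1 = 0}^{L} h_{j_1} \sum_{j_2 = 0}^{L} h_{j_2} \sum_{j_3 = 0}^{L} h_{j_3} \, G(I_1 - j_1, I_2 - j_2, I_3 - j_3)
\]
Application of three successive operations
1. \( A_3(I_3, i_1, i_2) = \sum_j h_j \, G(i_1, i_2, I_3 - j) \quad \forall\, i_1, i_2 \);
2. \( A_2(I_2, I_3, i_1) = \sum_j h_j \, A_3(I_3, i_1, I_2 - j) \quad \forall\, I_3, i_1 \);
3. \( F(I_1, I_2, I_3) = \sum_j h_j \, A_2(I_2, I_3, I_1 - j) \quad \forall\, I_2, I_3 \).
Main routine: Convolution + transposition
\[
F(I, a) = \sum_j h_j \, G(a, I - j) \quad \forall\, a
\]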
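The three successive operations above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the BigDFT implementation: the function names, the flat row-major storage, and the periodic wrap-around of the index I - j are choices made for this sketch. It applies the same "convolution + transposition" routine three times and checks the result against the direct triple sum, which costs (L+1)^3 multiply-adds per output point instead of 3(L+1).

```python
def conv_transpose(g, h, n_a, n_I):
    """F(I, a) = sum_j h[j] * G(a, I - j).

    G is stored flat as g[a * n_I + i]; F is returned flat as f[I * n_a + a].
    Periodic wrap-around of I - j is an assumption of this sketch.
    """
    f = [0.0] * (n_I * n_a)
    for a in range(n_a):
        for I in range(n_I):
            acc = 0.0
            for j, hj in enumerate(h):
                acc += hj * g[a * n_I + (I - j) % n_I]
            f[I * n_a + a] = acc  # transposed write: convolved index comes first
    return f


def separable_conv3d(g, h, n1, n2, n3):
    """Three 1D passes; g is stored as g[(i1 * n2 + i2) * n3 + i3]."""
    a3 = conv_transpose(g, h, n1 * n2, n3)   # A3(I3, i1, i2)
    a2 = conv_transpose(a3, h, n3 * n1, n2)  # A2(I2, I3, i1)
    return conv_transpose(a2, h, n2 * n3, n1)  # F(I1, I2, I3)


def direct_conv3d(g, h, n1, n2, n3):
    """Reference: the full triple sum over j1, j2, j3 (same periodic assumption)."""
    f = [0.0] * (n1 * n2 * n3)
    for I1 in range(n1):
        for I2 in range(n2):
            for I3 in range(n3):
                acc = 0.0
                for j1, h1 in enumerate(h):
                    for j2, h2 in enumerate(h):
                        for j3, h3 in enumerate(h):
                            src = (((I1 - j1) % n1) * n2 + (I2 - j2) % n2) * n3 + (I3 - j3) % n3
                            acc += h1 * h2 * h3 * g[src]
                f[(I1 * n2 + I2) * n3 + I3] = acc
    return f
```

Because each pass writes its output transposed, the index to be convolved next is always the fastest-running one, and after three passes the array is back in its original (1, 2, 3) ordering.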
Evolution of GPU accelerators
Increasing power in recent years
The price (and the power consumption) per GFlops keeps falling!
Why not use them?
How to code scientific operations on a GPU?
GPU hardware is designed for . . . graphic calculation
Texture shaders → rendering
Single precision calculations
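To illustrate why hardware limited to single precision is a concern for scientific codes, here is a small sketch in plain Python. It is not taken from the talk: the helper names are hypothetical, and single precision is emulated by rounding through the standard `struct` module after every addition.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value, count, single_precision):
    """Add `value` to a running total `count` times, optionally rounding to float32."""
    step = to_float32(value) if single_precision else value
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)  # emulate a single-precision accumulator
    return total


def relative_error(approx, exact):
    return abs(approx - exact) / abs(exact)
```

Summing 0.1 one million times, for instance, lets the rounding error of the single-precision accumulator grow far beyond that of the double-precision one, which is the kind of drift a long reduction over wavefunction coefficients would be exposed to.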
The CUDA programming language
NVidia GPU: the CUDA programming language
The API is an extension to ANSI C(++)
Low learning curve
The hardware is designed for a lightweight runtime
High performance with moderate optimisation costs
A number of SDKs that help the beginner
CUBLAS and CUFFT
Nvidia provides the CUDA user with pre-built libraries for BLAS and FFT
✔ Can really be used as black boxes
Since July 2008
Example: BLAS operations used in BigDFT
Double precision calculations for Norb × m matrices (m = 300 kB)
[Figure: GPU speedup (double precision) versus number of orbitals, for DGEMM and DSYRK.]
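As a reminder of what the two BLAS routines in the plot compute, the sketch below writes them out in plain Python for a set of Norb vectors of length m. It is an illustration only: using them to build an overlap matrix between orbitals is an assumption of this example, the function names are hypothetical, and a real code would call DGEMM and DSYRK from a BLAS library (CUBLAS on the GPU) rather than loop in Python.

```python
def overlap_gemm(psi):
    """General product S = Psi * Psi^T: all Norb x Norb entries are computed."""
    n = len(psi)
    return [[sum(x * y for x, y in zip(psi[i], psi[j])) for j in range(n)]
            for i in range(n)]


def overlap_syrk(psi):
    """Symmetric rank-k update: only the lower triangle (j <= i) is computed.

    A real DSYRK leaves the other triangle untouched; it is mirrored here
    only so that the result can be compared with the general product.
    """
    n = len(psi)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s[i][j] = sum(x * y for x, y in zip(psi[i], psi[j]))
            s[j][i] = s[i][j]
    return s


def dot_products(n_orb, symmetric):
    """Number of length-m dot products needed by each variant."""
    return n_orb * (n_orb + 1) // 2 if symmetric else n_orb * n_orb
```

Exploiting symmetry roughly halves the arithmetic, which is why the two routines are benchmarked separately.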
BigDFT performances
Free BC:
[Figure: share of run time per operation in percent (LinAlg, sumrho, PSolver, HamApp, Precond, Other, Comm) and total time in seconds (log scale), versus number of atoms.]
BigDFT performances
Periodic BC (8 atoms/core):
[Figure: share of run time per operation in percent (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds (log scale), versus number of cores.]
The convolutions on GPU
Main intensive routine: Convolution + transposition
\[
F(I, a) = \sum_j h_j \, G(a, I - j) \quad \forall\, a
\]
Combined set of 1D convolutions:
✔ Easy to parallelize (in the GPU sense)
✔ Short filters: loop unrolling, fewer registers
Optimal for hiding memory latency with arithmetic
The BigDFT code makes it possible to use hybrid CPU-GPU supercomputers
CEA/GENCI “Titane” machine, hybrid section
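"Easy to parallelize in the GPU sense" can be read as follows: every output element F(I, a) depends only on the input and the filter, so each one can be assigned to its own thread and the threads can run in any order. The plain-Python sketch below mimics that mapping. It is an assumption-laden illustration, not the BigDFT CUDA kernel: the one-thread-per-output layout, the function names, and the periodic wrap-around of I - j are all choices made for this example.

```python
import random


def conv_transpose_element(tid, g, h, n_a, n_I):
    """Work of one hypothetical GPU thread: compute the single output F(I, a).

    The flat thread index tid is decoded as tid = I * n_a + a, which is also
    the position of F(I, a) in the transposed output array.
    """
    I, a = divmod(tid, n_a)
    acc = 0.0
    for j, hj in enumerate(h):  # short filter: a loop a compiler could unroll
        acc += hj * g[a * n_I + (I - j) % n_I]
    return acc


def launch(g, h, n_a, n_I, order=None):
    """Run all 'threads'; `order` lets the caller choose an arbitrary schedule."""
    n_threads = n_a * n_I
    out = [0.0] * n_threads
    for tid in (range(n_threads) if order is None else order):
        out[tid] = conv_transpose_element(tid, g, h, n_a, n_I)
    return out


def shuffled_order(n_threads, seed=0):
    """A reproducible random schedule, standing in for the GPU scheduler."""
    ids = list(range(n_threads))
    random.Random(seed).shuffle(ids)
    return ids
```

Since no thread reads another thread's output, the result does not depend on the schedule; on real hardware this independence is what lets many threads overlap memory accesses with arithmetic.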
One dimensional convolutions
June 2008
Preliminary results (internship of M. Ospici, LIG - Bull)
[Figure: speedup versus data size (Mb), for G80, GT200 (single precision) and GT200 (double precision).]
All the three dimensional convolution operators
Now
Double precision calculations, full 3D operators
[Figure: GPU speedup (double precision) versus wavefunction size (MB), for locden, locham and precond.]
Full Hybrid code
The GPU routines can be inserted into the full code, running in parallel
[Figure: two panels, "Hybrid code (rel.)" and "CPU code": share of run time per operation in percent (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds (log scale), versus number of cores.]
Around 7 times faster
A lot of operations can still be improved
[Figure: share of run time per operation in percent (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds (log scale), versus number of cores.]
Summary and outlook
Considerations:
Our BigDFT experience
GPUs may represent a real alternative for speeding up computations
Production computations are accessible, not only prototypes
. . . but . . .
Can one draw general conclusions? Probably not. . .
How can we estimate the benefit/cost ratio?
Nature of numerical operations
Hot-spot operations (>80% of the overall time)
Multi-GPU?
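One common way to put numbers on the benefit side of this question is Amdahl's law, which bounds the overall gain when only the hot-spot operations are accelerated. The sketch below is an illustration added here, not a result from the talk; the function names are hypothetical, and the model ignores data transfers between CPU and GPU.

```python
def overall_speedup(hot_fraction, kernel_speedup):
    """Amdahl's law: a fraction `hot_fraction` of the run time is accelerated
    by `kernel_speedup`; the remainder runs at its original speed."""
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / kernel_speedup)


def speedup_limit(hot_fraction):
    """Upper bound reached when the accelerated part takes no time at all."""
    return 1.0 / (1.0 - hot_fraction)


def min_hot_fraction(target_speedup):
    """Smallest hot-spot share for which `target_speedup` is reachable at all."""
    return 1.0 - 1.0 / target_speedup
```

With hot spots at exactly 80% of the run time, the overall gain can never exceed a factor of 5 in this model, however fast the GPU kernels are. An overall factor of 7 requires the accelerated part to cover more than about 86% of the original time, which is why the nature of the numerical operations matters so much.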