Graphic Processing Units: a possible answer to High Performance Computing?


HPC & GPU

BigDFT operations GPU Hardware

Nvidia and CUDA CUBLAS GPU computation BigDFT convolutions Full code Discussion

4th ABINIT Developer Workshop

RESIDENCE L’ESCANDILLE– AUTRANS

Graphic Processing Units: a possible answer to High Performance Computing?

Luigi Genovese

ESRF - Grenoble


The BigDFT code

Different numerical operations are performed:

For each wavefunction (Hamiltonian application)

Between wavefunctions (linear algebra)

Numerical operations

Convolutions with short filters

BLAS routines

FFT (Poisson solver)

[Figure labels: Interpolating, Daubechies]


Separable convolutions

We must calculate

$$F(I_1, I_2, I_3) = \sum_{j_1, j_2, j_3 = 0}^{L} h_{j_1} h_{j_2} h_{j_3} \, G(I_1 - j_1, I_2 - j_2, I_3 - j_3)$$

$$= \sum_{j_1 = 0}^{L} h_{j_1} \sum_{j_2 = 0}^{L} h_{j_2} \sum_{j_3 = 0}^{L} h_{j_3} \, G(I_1 - j_1, I_2 - j_2, I_3 - j_3)$$

Application of three successive operations

1. $A_3(I_3, i_1, i_2) = \sum_j h_j \, G(i_1, i_2, I_3 - j) \quad \forall i_1, i_2$;

2. $A_2(I_2, I_3, i_1) = \sum_j h_j \, A_3(I_3, i_1, I_2 - j) \quad \forall I_3, i_1$;

3. $F(I_1, I_2, I_3) = \sum_j h_j \, A_2(I_2, I_3, I_1 - j) \quad \forall I_2, I_3$.

Main routine: Convolution + transposition

$F(I, a) = \sum_j h_j \, G(a, I - j) \quad \forall a$
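The three-step scheme above can be sketched in plain Python. This is only an illustration, not the BigDFT implementation: the function names, the flat row-major storage, and the periodic wrap of the index `I - j` are all assumptions made here. The sketch applies the same "convolution + transposition" routine three times and compares the result with the direct triple sum. For a filter with L+1 coefficients, the direct sum costs (L+1)^3 multiply-adds per output point, while the three 1D passes cost 3(L+1).

```python
from itertools import product


def conv_transpose(g, h, n_last, n_rest):
    """1D convolution along the fastest index, followed by a transposition.

    g is a flat list read as G(a, i), with a in [0, n_rest) and i in [0, n_last).
    The result is a flat list read as F(I, a) = sum_j h[j] * G(a, I - j).
    """
    out = [0.0] * (n_last * n_rest)
    for a in range(n_rest):
        for I in range(n_last):
            acc = 0.0
            for j, hj in enumerate(h):
                # Periodic wrap of I - j: an assumption of this sketch.
                acc += hj * g[a * n_last + (I - j) % n_last]
            out[I * n_rest + a] = acc  # transposed write: F(I, a)
    return out


def conv3d_separable(g, h, n1, n2, n3):
    """Three successive 1D passes; g is G(i1, i2, i3) with i3 fastest."""
    a3 = conv_transpose(g, h, n3, n1 * n2)     # A3(I3, i1, i2)
    a2 = conv_transpose(a3, h, n2, n3 * n1)    # A2(I2, I3, i1)
    return conv_transpose(a2, h, n1, n2 * n3)  # F(I1, I2, I3)


def conv3d_direct(g, h, n1, n2, n3):
    """Reference: the full triple sum over j1, j2, j3."""
    out = [0.0] * (n1 * n2 * n3)
    taps = range(len(h))
    for I1, I2, I3 in product(range(n1), range(n2), range(n3)):
        acc = 0.0
        for j1, j2, j3 in product(taps, repeat=3):
            src = (((I1 - j1) % n1) * n2 + (I2 - j2) % n2) * n3 + (I3 - j3) % n3
            acc += h[j1] * h[j2] * h[j3] * g[src]
        out[(I1 * n2 + I2) * n3 + I3] = acc
    return out
```

After the third pass the array is back in its original (I1, I2, I3) ordering, which is why a single routine combining convolution and transposition is sufficient.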


Evolution of GPU accelerators

Computing power has increased steadily in recent years

The price (and the power consumption) per GFlops keeps going down!


Why not use them?

How to code scientific operations on a GPU?

GPU hardware is designed for... graphics calculations:

Texture shaders

Rendering

Single precision calculations


The CUDA programming language

NVidia GPU: the CUDA programming language

The API is an extension to ANSI C(++): low learning curve

The hardware is designed for a lightweight runtime: high performance with moderate optimisation costs

A number of SDKs help the beginner

CUBLAS and CUFFT

Nvidia provides the CUDA user with pre-built libraries for BLAS and FFT

Can really be used as black boxes

Since July 2008


Example: BLAS operations used in BigDFT

Double precision calculations for $N_{orb} \times m$ matrices (m = 300 kB)

[Figure: GPU speedup (double precision) versus number of orbitals, for DGEMM and DSYRK]
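To make the difference between the two BLAS routines concrete, here is a minimal pure-Python sketch. The function names are hypothetical, and the overlap-matrix use case is an assumption for illustration; the slide only states that DGEMM and DSYRK act on $N_{orb} \times m$ matrices. DGEMM computes a general matrix product, while DSYRK computes a symmetric product of the form A A^T and fills only one triangle of the result, which roughly halves the arithmetic.

```python
def gemm_overlap(psi):
    """Full product S = psi * psi^T, as a general matrix multiply (DGEMM-like)."""
    n = len(psi)
    return [[sum(x * y for x, y in zip(psi[i], psi[j])) for j in range(n)]
            for i in range(n)]


def syrk_overlap(psi):
    """Same product, but only the lower triangle is computed (DSYRK-like)."""
    n = len(psi)
    s = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # j <= i: lower triangle only
            s[i][j] = sum(x * y for x, y in zip(psi[i], psi[j]))
    return s
```

Here `psi` is a list of rows, one per orbital, each of length m. The upper triangle left at zero by `syrk_overlap` can be recovered by symmetry when needed.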


BigDFT performances

Free BC:

[Figure: percentage of time per operation (LinAlg, sumrho, PSolver, HamApp, Precond, Other, Comm) and total time in seconds (log scale) versus number of atoms]


BigDFT performances

Periodic BC (8 atoms/core):

[Figure: percentage of time per operation (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds (log scale) versus number of cores]


The convolutions on GPU

Main intensive routine: Convolution + transposition

$F(I, a) = \sum_j h_j \, G(a, I - j) \quad \forall a$;

Combined set of 1D convolutions:

Easy to parallelize (in the GPU sense)

Short filters: loop unrolling, fewer registers

Optimal for hiding memory latency with arithmetic

The BigDFT code makes it possible to access a hybrid CPU-GPU supercomputer:

CEA/GENCI “Titane” machine, hybrid section
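What "easy to parallelize in the GPU sense" means can be sketched as follows: every output element F(I, a) depends only on read-only input, so each one can be assigned to its own thread. The Python below emulates this with a hypothetical `kernel` function called once per flat thread index. It is not CUDA code and not the BigDFT kernel; the four-tap filter length, the periodic wrap, and all names are assumptions of this sketch.

```python
def kernel(tid, g, h, n, ndat, out):
    """Work of one hypothetical GPU thread: a single output element F(I, a)."""
    I, a = divmod(tid, ndat)
    if I >= n:
        return  # guard for surplus threads beyond the data size
    base = a * n  # start of line a in G(a, i)
    # Short filter (4 taps assumed), unrolled by hand; periodic wrap assumed.
    out[tid] = (h[0] * g[base + I % n]
                + h[1] * g[base + (I - 1) % n]
                + h[2] * g[base + (I - 2) % n]
                + h[3] * g[base + (I - 3) % n])


def launch(g, h, n, ndat, order=None):
    """Run every thread once; the order is irrelevant, as on a GPU."""
    out = [0.0] * (n * ndat)
    tids = range(n * ndat) if order is None else order
    for tid in tids:
        kernel(tid, g, h, n, ndat, out)
    return out
```

Because no thread writes a location that another thread reads, running the threads in any order (or simultaneously) gives the same result. The output index `tid = I * ndat + a` already stores the transposed layout F(I, a).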


One dimensional convolutions

June 2008

Preliminary results (internship of M. Ospici, LIG - Bull)

[Figure: speedup versus data size (Mb) for G80, GT200 (single precision) and GT200 (double precision)]


All the three dimensional convolution operators

Now

Double precision calculations, full 3D operators

[Figure: GPU speedup (double precision) versus wavefunction size (MB), for locden, locham and precond]


Full Hybrid code

We can insert it in the full code, in parallel

[Figure: percentage of time per operation (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds (log scale) versus number of cores; panels: Hybrid code (rel.), CPU code]


Around 7 times faster

A lot of operations can still be improved

[Figure: percentage of time per operation (LinAlg, locden, locham, precond, PureCPU, Other, Comm) and total time in seconds (log scale) versus number of cores]


Summary and outlook

Considerations:

Our BigDFT experience

GPUs may represent a real alternative for speeding up computations

Production computations are accessible, not only prototypes

... but ...

Can one draw general conclusions? Probably not...

How can we estimate the benefit/cost ratio?

Nature of numerical operations

Hot-spot operations (>80% of the overall time)

Multi-GPU?
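One simple way to reason about the hot-spot criterion is Amdahl's law: if a fraction p of the run time is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The sketch below is a generic estimate with hypothetical function names, not a model fitted to the BigDFT timings.

```python
def overall_speedup(hot_fraction, hot_speedup):
    """Amdahl's law: overall speedup when only the hot spots are accelerated."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must lie in [0, 1]")
    if hot_speedup <= 0.0:
        raise ValueError("hot_speedup must be positive")
    return 1.0 / ((1.0 - hot_fraction) + hot_fraction / hot_speedup)


def max_speedup(hot_fraction):
    """Upper bound reached when the hot spots become infinitely fast."""
    if hot_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - hot_fraction)
```

For example, with hot spots covering 80% of the time and a 20-fold GPU speedup on them, `overall_speedup(0.8, 20.0)` gives roughly 4.2, and no GPU could push the total beyond `max_speedup(0.8)`, a factor of 5. By the same formula, an overall gain of 7 requires that more than 1 - 1/7, about 86%, of the run time be accelerated.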
