(1)

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

UNCLASSIFIED

Accelerating CICE on the GPU

Rob T. Aulwes, CCS-7

March 19, 2015

LA-UR-15-21044

(2)

Acknowledgements

• Mat Colgrove
• Jeff Larkin
• Jiri Kraus
• Carl Ponder
• Justin Luitjens
• Tony Scuderio

(3)

Outline

• CICE model
• Strategy for acceleration
  – Profiling
  – OpenACC
  – GPUDirect
• Results
• Path forward

(4)

Los Alamos Sea Ice Model - CICE

• Global model of sea ice for climate and forecast
  – Used by many climate, forecast groups
• Coupled to atmospheric-ice-ocean-land global climate models

(5)

Los Alamos Sea Ice Model - CICE

• Ice thickness distribution (ITD)
  – Multiple discrete thickness bins
  – All tracers, etc. exist in each thickness class
• Transport
  – Tracer conservation, horizontal transport, ITD
  – Incremental remap: efficient for many tracers
• Dynamics
  – Momentum: includes forcing, grav
  – Stress: Elastic-Viscous-Plastic, EAP (anisotropic) rheology, stress tensor
• Thermo/salinity/column physics
  – Melt/freeze, radiation, fluxes at top/bottom, melt ponds

(6)

OpenACC gotchas

• Must delete device memory before host memory
• Passing a non-contiguous array slice creates a temporary array
  – The temporary is not in device memory, even if the original array was
• Use assumed shape for array declarations
  – If the lower bound is not 1, state it explicitly
    • real, dimension(:,0:), intent(in) :: foo
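The first and third gotchas can be illustrated with a small self-contained sketch. The names used here (`gotcha_demo`, `sum_halo_col`, `field`) are made up for illustration and are not taken from CICE. When the code is compiled without OpenACC support, the directives are ordinary comments and the program runs on the host.

```fortran
module gotcha_demo
  implicit none
contains
  ! The dummy argument states the lower bound 0 on its second dimension,
  ! so column 0 (treated as a halo column here) keeps its index.
  function sum_halo_col(foo) result(s)
    real, dimension(:,0:), intent(in) :: foo
    real :: s
    integer :: i
    s = 0.0
    !$acc parallel loop reduction(+:s) present(foo)
    do i = 1, size(foo, 1)
      s = s + foo(i, 0)
    enddo
  end function sum_halo_col
end module gotcha_demo

program gotchas
  use gotcha_demo
  implicit none
  real, allocatable :: field(:,:)
  real :: s

  allocate(field(4, 0:3))
  field = 1.0
  field(:, 0) = 2.0

  ! Create the device copy and fill it from the host.
  !$acc enter data copyin(field)

  s = sum_halo_col(field)
  print *, 'halo column sum =', s

  ! Release the device copy first, then the host array.
  !$acc exit data delete(field)
  deallocate(field)

  if (abs(s - 8.0) > 1.0e-6) error stop 'unexpected halo sum'
end program gotchas
```

Had `foo` been declared `dimension(:,:)`, the same column would be addressed as `foo(i, 1)` inside the routine, which is easy to get wrong when the caller indexes halos from 0.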

(7)

Strategy

• Profile
• Minimize data movement
• Exploit CUDA streams
• Use GPUDirect for MPI between devices

(8)

gprof + gprof2dot.py

[Figure: profile call graph from gprof, rendered with gprof2dot.py]

(10)

Challenges for Accelerating Dynamics

• Halo updates between computations
• Many arrays to move
  – Move all computations to GPU

(12)

Strategy – Minimize Data Movement

• Use Fortran pointers into large memory chunks
  – Reduces the amount of data movement

    allocate( mem_chunk(nx,ny,2) )
    v => mem_chunk(:,:,1)
    w => mem_chunk(:,:,2)
    !$acc enter data create(mem_chunk)

    !$acc update device(mem_chunk)

    ! v and w point into mem_chunk, which is already on the device
    !$acc data present(v,w)
    !$acc parallel loop collapse(2)
    do j = 1,ny
      do i = 1,nx
        v(i,j) = alpha * w(i,j)
      enddo
    enddo
    !$acc end data

(14)

Strategy – Use CUDA Streams

• Use CUDA streams
  – If loops are data independent, launch them on separate streams, along with data updates to host/device

    !$acc parallel loop collapse(2) async(1)
    do i = 1,n
      do j = 1,m
        a(i,j) = a(i,j) * w(i,j)
      enddo
    enddo

    !$acc parallel loop collapse(2) async(2)
    do i = 1,n
      do j = 1,m
        b(i,j) = b(i,j) + alpha * t(i,j)
      enddo
    enddo

    ! wait for both queues before using a or b
    !$acc wait

(15)

Strategy – Use CUDA Streams

• Invoke subroutine calls using streams

    ! Before
    do cat = 1,ncat
      call construct_fields(mx, my)
    enddo

    ! After
    do cat = 1,ncat
      ! In construct_fields, use 'cat' as the
      ! async stream value
      call construct_fields(cat, mx(:,:,cat), &
                            my(:,:,cat))
    enddo
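A minimal sketch of how a routine could use `cat` as its async queue is given below. Only the argument list follows the call above. The loop body is a placeholder and the module and program names are invented, so this should not be read as the actual CICE `construct_fields`. It assumes the fields are already resident on the device, as in the rest of the talk.

```fortran
module stream_demo
  implicit none
contains
  ! Hypothetical stand-in for construct_fields: the arithmetic is a
  ! placeholder; the point is the async(cat) clause.
  subroutine construct_fields(cat, mx, my)
    integer, intent(in) :: cat
    real, dimension(:,:), intent(inout) :: mx, my
    integer :: i, j

    ! Each category is queued on its own async queue (stream), so
    ! kernels for different categories may overlap on the device.
    !$acc parallel loop collapse(2) async(cat) present(mx, my)
    do j = 1, size(mx, 2)
      do i = 1, size(mx, 1)
        mx(i, j) = 2.0 * mx(i, j)
        my(i, j) = my(i, j) + real(cat)
      enddo
    enddo
  end subroutine construct_fields
end module stream_demo

program streams
  use stream_demo
  implicit none
  integer, parameter :: nx = 4, ny = 3, ncat = 5
  real :: mx(nx, ny, ncat), my(nx, ny, ncat)
  integer :: cat

  mx = 1.0
  my = 0.0
  !$acc enter data copyin(mx, my)

  ! mx(:,:,cat) is a contiguous slice, so no temporary is created.
  do cat = 1, ncat
    call construct_fields(cat, mx(:,:,cat), my(:,:,cat))
  enddo

  ! Block until every queue has drained before reading the results.
  !$acc wait
  !$acc exit data copyout(mx, my)

  print *, 'sum(mx) =', sum(mx), ' sum(my) =', sum(my)
end program streams
```

Slicing on the last dimension matters here: a slice such as `mx(1:n:2,:,cat)` would be non-contiguous and would run into the temporary-array gotcha listed earlier.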

(17)

GPUDirect

• Built CUDA-enabled OpenMPI 1.8.5 on Moonlight
  – Compiled CICE with PGI 14.7
• Titan has Cray's CUDA-enabled version of MPICH
  – Also used PGI 14.7
  – However, the XK7 hardware doesn't support RDMA, so MPI still goes through the CPU
  – But coding to GPUDirect now prepares for the upcoming Summit cluster

(18)

GPUDirect

    ! Before: one halo update per field
    call ice_haloUpdate(dpx, halo_info)
    call ice_haloUpdate(dpy, halo_info)
    call ice_haloUpdate(mx, halo_info)

    ! After: device halo updates grouped between begin/end calls
    call ice_haloBegin(halo_info, 3, updateInfo)
    call ice_devHaloUpdate(halo_info, updateInfo, dpx)
    call ice_devHaloUpdate(halo_info, updateInfo, dpy)
    call ice_devHaloUpdate(halo_info, updateInfo, mx)
    call ice_haloEnd(halo_info, updateInfo)
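One plausible reading of the grouped interface is that the three fields share a single device buffer, so one message can carry all of their halos. The sketch below shows only such a packing step, with invented names (`halo_pack_demo`, `pack_east_edge`). It is an assumption about the general technique, not the CICE implementation, and it contains no MPI. In a CUDA-aware MPI build, the packed buffer would typically be passed to MPI from inside an `!$acc host_data use_device(buf)` region.

```fortran
module halo_pack_demo
  implicit none
contains
  ! Copy the last (east-most) column of one field into column `slot`
  ! of a shared buffer.
  subroutine pack_east_edge(field, slot, buf)
    real, dimension(:,:), intent(in) :: field
    integer, intent(in) :: slot
    real, dimension(:,:), intent(inout) :: buf
    integer :: i, last

    last = size(field, 2)
    !$acc parallel loop present(field, buf)
    do i = 1, size(field, 1)
      buf(i, slot) = field(i, last)
    enddo
  end subroutine pack_east_edge
end module halo_pack_demo

program halo_pack
  use halo_pack_demo
  implicit none
  integer, parameter :: nx = 4, ny = 3, nfields = 3
  real :: dpx(nx, ny), dpy(nx, ny), mx(nx, ny)
  real :: buf(nx, nfields)

  dpx = 1.0
  dpy = 2.0
  mx  = 3.0
  buf = 0.0

  !$acc data copyin(dpx, dpy, mx) copy(buf)
  call pack_east_edge(dpx, 1, buf)
  call pack_east_edge(dpy, 2, buf)
  call pack_east_edge(mx,  3, buf)
  ! A single CUDA-aware MPI send of buf would go here.
  !$acc end data

  print *, 'packed buffer:', buf
end program halo_pack
```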

(19)

Results

(20)

• Used two test problems: gx1 and tp4
  – gx1: 16 procs, grid size 320x384
  – tp4: 60 procs, grid size 900x600
• Ran with longitudinal blocks
  – Reduces load imbalance

(21)

Test Platforms

• LANL Moonlight
  – PGI 14.7, OpenMPI 1.8.5, CUDA 6.0
  – Intel Xeon (Sandy Bridge), 16 cores/node
  – 2 Nvidia M2090/node
• ORNL Titan
  – PGI 14.7, Cray's MPICH, CUDA 5.5
  – AMD Interlagos, 16 cores/node
  – 1 Nvidia K20x/node

(22)

• Nvidia's PSG cluster
  – Ivy Bridge E5-2690, dual socket, 10 cores/socket, 6 x K40
  – OpenMPI 1.8.5

(23)

Test Cases – tp4

4 nodes, 15 procs per node

Runtime (secs)         Moonlight   Titan   PSG
Baseline               96          173     78
OpenACC + GPUDirect    99          194     91
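The relative cost of the accelerated build in the table above is the ratio of the two rows. The short program below does that arithmetic on the reported runtimes; the module, function, and array names are illustrative only.

```fortran
module ratio_demo
  implicit none
contains
  ! Ratio of accelerated to baseline runtime; a value above 1 means
  ! the accelerated run was slower.
  elemental function slowdown(baseline, accelerated) result(r)
    real, intent(in) :: baseline, accelerated
    real :: r
    r = accelerated / baseline
  end function slowdown
end module ratio_demo

program tp4_ratios
  use ratio_demo
  implicit none
  character(len=9), parameter :: names(3) = &
       [character(len=9) :: 'Moonlight', 'Titan', 'PSG']
  ! Runtimes in seconds, copied from the 4 nodes / 15 procs per node table.
  real, parameter :: baseline(3) = [96.0, 173.0, 78.0]
  real, parameter :: gpu(3)      = [99.0, 194.0, 91.0]
  integer :: k

  do k = 1, 3
    print '(a9, a, f5.2)', names(k), ': GPU/baseline = ', &
         slowdown(baseline(k), gpu(k))
  enddo
end program tp4_ratios
```

On these numbers the OpenACC + GPUDirect runs are roughly 3% (Moonlight) to 17% (PSG) slower than the baseline.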

(24)

tp4 Rank Distribution

Runtime (secs)        4 nodes/15 ppn   6 nodes/10 ppn   10 nodes/6 ppn
Baseline Moonlight    73               69               67
GPU Moonlight         96               83               96
Baseline Titan        173              169              160
GPU Titan             194              182              163
Baseline PSG          78               74               69
GPU PSG               91               81               80

(25)

tp4 Scaling

Runtime (secs)      10 procs, 5 nodes   20 procs, 10 nodes   40 procs, 20 nodes
Baseline Titan      337                 187                  201
GPU + GPUDirect     325                 180                  193

(26)

tp4 Scaling

Runtime (secs)      10 procs, 2 nodes   20 procs, 4 nodes   40 procs, 5 nodes
Baseline PSG        161                 87                  91
GPU + GPUDirect     197                 107                 109

(27)

Topology on PSG

         GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0   CPU Affinity
GPU0     X      PIX    PHB    PHB    SOC    SOC    PHB      0-9
GPU1     PIX    X      PHB    PHB    SOC    SOC    PHB      0-9
GPU2     PHB    PHB    X      PIX    SOC    SOC    PHB      0-9
GPU3     PHB    PHB    PIX    X      SOC    SOC    PHB      0-9
GPU4     SOC    SOC    SOC    SOC    X      PHB    SOC      10-19
GPU5     SOC    SOC    SOC    SOC    PHB    X      SOC      10-19
mlx5_0   PHB    PHB    PHB    PHB    SOC    SOC    X

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

(28)

Conclusion - Path Forward

• Focus on dynamics/transport
• Improve use of GPUDirect
  – Get rid of aggregation device buffer
  – Restructure code in order to get better communication/computation overlap
• Can we find task parallelism?
  – Fuse kernels (not enough work)
  – Spawn computation in stream while performing halo updates
