Accelerating CICE on the GPU Rob T. Aulwes, CCS-7

(1)

Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA

UNCLASSIFIED

Accelerating CICE on the GPU

Rob T. Aulwes, CCS-7

March 19, 2015

LA-UR-15-21044

(2)

Acknowledgements

• Mat Colgrove
• Jeff Larkin
• Jiri Kraus
• Carl Ponder
• Justin Luitjens
• Tony Scuderio

(3)

Outline

• CICE model
• Acceleration strategy
  – Profiling
  – OpenACC
  – GPUDirect
• Results
• Path forward

(4)

Los Alamos Sea Ice Model - CICE

• Global model of sea ice for climate and forecasting
  – Used by many climate and forecast groups
• Coupled to atmosphere-ice-ocean-land global climate models

(5)

Los Alamos Sea Ice Model - CICE

• Ice thickness distribution (ITD)
  – Multiple discrete thickness bins
  – All tracers, etc. exist in each thickness class
• Transport
  – Tracer conservation, horizontal transport, ITD
  – Incremental remap: efficient for many tracers
• Dynamics
  – Momentum: includes forcing, grav
  – Stress: elastic-viscous-plastic (EVP), elastic-anisotropic-plastic (EAP) rheology, stress tensor
• Thermo/salinity/column physics
  – Melt/freeze, radiation, fluxes at top/bottom, melt ponds

(6)

OpenACC gotchas

• Must delete device memory before freeing the corresponding host memory
• Passing a non-contiguous array slice creates a temporary array
  – The temporary array is not in device memory, even if the original array was
• Use assumed shape for dummy array declarations
  – Unless a lower bound is not 1; then declare the lower bound explicitly
    • real, dimension(:,0:), intent(in) :: foo
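The first and last gotchas above can be sketched in one small program. This is only an illustration under stated assumptions: the names field, scale_field, nx, ny and alpha are hypothetical and do not come from CICE. Built without OpenACC support, the directives are treated as comments and the program runs on the host.

```fortran
program acc_gotchas_sketch
  implicit none
  integer, parameter :: nx = 8, ny = 4      ! hypothetical sizes
  real, allocatable :: field(:,:)

  ! The second dimension starts at 0 (for example, to hold a ghost row).
  allocate(field(nx, 0:ny))
  field = 1.0

  ! Create the device copy and initialise it from the host.
  !$acc enter data copyin(field)

  call scale_field(field, 2.0)

  ! Bring the result back, then release the device copy FIRST ...
  !$acc update host(field)
  !$acc exit data delete(field)

  print *, 'field(1,0) =', field(1, 0)

  ! ... and only afterwards free the host allocation.
  deallocate(field)

contains

  subroutine scale_field(foo, alpha)
    ! Assumed shape with an explicit lower bound of 0 in dimension 2,
    ! so the indices used here match the caller's indices.
    real, dimension(:,0:), intent(inout) :: foo
    real, intent(in) :: alpha
    integer :: i, j

    !$acc parallel loop collapse(2) present(foo)
    do j = lbound(foo, 2), ubound(foo, 2)
      do i = 1, size(foo, 1)
        foo(i, j) = alpha * foo(i, j)
      enddo
    enddo
  end subroutine scale_field

end program acc_gotchas_sketch
```

Reversing the last two steps (deallocate before exit data delete) would leave the runtime holding a device copy whose host address is no longer valid, which is the situation the first bullet warns about.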

(7)

Strategy

• Profile
• Minimize data movement
• Exploit CUDA streams
• Use GPUDirect for MPI between devices

(8)

gprof + gprof2dot.py

[Figure: call graph generated from gprof output with gprof2dot.py]

(10)

Challenges for Accelerating Dynamics

• Halo updates between computations
• Many arrays to move
  – Move all computations to GPU

(12)

Strategy – Minimize Data Movement

• Use Fortran pointers into large memory chunks
  – Reduces the amount of data movement

  allocate( mem_chunk(nx, ny, 2) )
  v => mem_chunk(:,:,1)
  w => mem_chunk(:,:,2)
  !$acc enter data create(mem_chunk)

  ! A single update moves every field stored in the chunk
  !$acc update device(mem_chunk)
  !$acc data present(v, w)
  !$acc parallel loop collapse(2)
  do j = 1, ny
     do i = 1, nx
        v(i,j) = alpha * w(i,j)
     enddo
  enddo
  !$acc end data

(14)

Strategy – Use CUDA Streams

• Use CUDA streams
  – If loops are data independent, launch them on separate streams, along with data updates to host/device

  !$acc parallel loop collapse(2) async(1)
  do i = 1, n
     do j = 1, m
        a(i,j) = a(i,j) * w(i,j)
     enddo
  enddo

  !$acc parallel loop collapse(2) async(2)
  do i = 1, n
     do j = 1, m
        b(i,j) = b(i,j) + alpha * t(i,j)
     enddo
  enddo

  ! Synchronize both streams before a and b are used again
  !$acc wait

(15)

Strategy – Use CUDA Streams

• Invoke subroutine calls using streams

Before:

  do cat = 1, ncat
     call construct_fields(mx, my)
  enddo

After:

  do cat = 1, ncat
     ! In construct_fields, use 'cat' as the
     ! async stream value
     call construct_fields(cat, mx(:,:,cat), &
                           my(:,:,cat))
  enddo
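The slide does not show the inside of construct_fields, so the following is only a sketch of how a routine might use its first argument as the async queue. Everything in the body is an assumption for illustration: the dummy names stream, fx and fy, the array sizes, and the arithmetic are hypothetical and are not the CICE implementation.

```fortran
program stream_per_category_sketch
  implicit none
  integer, parameter :: nx = 16, ny = 8, ncat = 5   ! hypothetical sizes
  real, allocatable :: mx(:,:,:), my(:,:,:)
  integer :: cat

  allocate(mx(nx, ny, ncat), my(nx, ny, ncat))
  mx = 1.0
  my = 2.0

  !$acc data copy(mx, my)
  do cat = 1, ncat
    ! mx(:,:,cat) is contiguous in memory: only the last index is fixed.
    call construct_fields(cat, mx(:,:,cat), my(:,:,cat))
  enddo
  ! Block until the kernels queued on every stream have finished,
  ! before the data region copies the arrays back to the host.
  !$acc wait
  !$acc end data

  print *, 'max mx =', maxval(mx), ' max my =', maxval(my)

contains

  subroutine construct_fields(stream, fx, fy)
    integer, intent(in) :: stream            ! used as the async queue id
    real, dimension(:,:), intent(inout) :: fx, fy
    integer :: i, j

    ! Placeholder computation; each category's work is queued on its
    ! own stream, so independent categories may overlap on the device.
    !$acc parallel loop collapse(2) present(fx, fy) async(stream)
    do j = 1, size(fx, 2)
      do i = 1, size(fx, 1)
        fx(i, j) = fx(i, j) + fy(i, j)
        fy(i, j) = 0.5 * fy(i, j)
      enddo
    enddo
  end subroutine construct_fields

end program stream_per_category_sketch
```

Because the kernel launch is asynchronous, the subroutine returns before the work completes; the caller is responsible for the wait, which is why it sits after the loop over categories rather than inside the routine.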

(17)

GPUDirect

• Built CUDA-enabled OpenMPI 1.8.5 on Moonlight
  – Used PGI 14.7 as the CICE compiler
• Titan has Cray’s CUDA-enabled version of MPICH
  – Also used PGI 14.7
  – However, the XK7 hardware doesn’t support RDMA, so MPI still goes through the CPU
  – But coding to GPUDirect now prepares for the upcoming Summit cluster

(18)

GPUDirect

Before:

  call ice_haloUpdate(dpx, halo_info)
  call ice_haloUpdate(dpy, halo_info)
  call ice_haloUpdate(mx, halo_info)

After:

  call ice_haloBegin(halo_info, 3, updateInfo)
  call ice_devHaloUpdate(halo_info, updateInfo, dpx)
  call ice_devHaloUpdate(halo_info, updateInfo, dpy)
  call ice_devHaloUpdate(halo_info, updateInfo, mx)
  call ice_haloEnd(halo_info, updateInfo)

(19)

Results

(20)

• Used two test problems: gx1 and tp4
  – gx1: 16 procs, grid size 320x384
  – tp4: 60 procs, grid size 900x600
• Ran with longitudinal blocks
  – Reduces load imbalance

(21)

Test Platforms

• LANL Moonlight
  – PGI 14.7, OpenMPI 1.8.5, CUDA 6.0
  – Intel Xeon (Sandy Bridge), 16 cores/node
  – 2 Nvidia M2090/node
• ORNL Titan
  – PGI 14.7, Cray’s MPICH, CUDA 5.5
  – AMD Interlagos, 16 cores/node
  – 1 Nvidia K20x/node

(22)

• Nvidia’s PSG cluster
  – Ivy Bridge E5-2690, dual socket, 10 cores/socket, 6 x K40
  – OpenMPI 1.8.5

(23)

Test Cases – tp4

4 nodes, 15 procs per node

Runtime (secs)         Moonlight   Titan   PSG
Baseline               96          173     78
OpenACC + GPUDirect    99          194     91

(24)

tp4 Rank Distribution

Runtime (secs)        4 nodes/15 ppn   6 nodes/10 ppn   10 nodes/6 ppn
Baseline Moonlight    73               69               67
GPU Moonlight         96               83               96
Baseline Titan        173              169              160
GPU Titan             194              182              163
Baseline PSG          78               74               69
GPU PSG               91               81               80

(25)

tp4 Scaling

Runtime (secs)      10 procs, 5 nodes   20 procs, 10 nodes   40 procs, 20 nodes
Baseline Titan      337                 187                  201
GPU + GPUDirect     325                 180                  193

(26)

tp4 Scaling

Runtime (secs)      10 procs, 2 nodes   20 procs, 4 nodes   40 procs, 5 nodes
Baseline PSG        161                 87                  91
GPU + GPUDirect     197                 107                 109

(27)

Topology on PSG

         GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0   CPU Affinity
GPU0     X      PIX    PHB    PHB    SOC    SOC    PHB      0-9
GPU1     PIX    X      PHB    PHB    SOC    SOC    PHB      0-9
GPU2     PHB    PHB    X      PIX    SOC    SOC    PHB      0-9
GPU3     PHB    PHB    PIX    X      SOC    SOC    PHB      0-9
GPU4     SOC    SOC    SOC    SOC    X      PHB    SOC      10-19
GPU5     SOC    SOC    SOC    SOC    PHB    X      SOC      10-19
mlx5_0   PHB    PHB    PHB    PHB    SOC    SOC    X

X   = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

(28)

Conclusion - Path Forward

• Focus on dynamics/transport
• Improve use of GPUDirect
  – Get rid of aggregation device buffer
  – Restructure code in order to get better communication/computation overlap
• Can we find task parallelism?
  – Fuse kernels (not enough work)
  – Spawn computation in stream while performing halo