Operated by Los Alamos National Security, LLC for the U.S. Department of Energy's NNSA
UNCLASSIFIED
Accelerating CICE on the GPU
Rob T. Aulwes, CCS-7
March 19, 2015
LA-UR-15-21044
Acknowledgements
§ Mat Colgrove
§ Jeff Larkin
§ Jiri Kraus
§ Carl Ponder
§ Justin Luitjens
§ Tony Scuderio
Outline
§ CICE model
§ Strategy for acceleration
  – Profiling
  – OpenACC
  – GPUDirect
§ Results
§ Path forward
Los Alamos Sea Ice Model - CICE
§ Global model of sea ice for climate and forecast
  – Used by many climate and forecast groups
§ Coupled to atmosphere-ice-ocean-land global climate models
Los Alamos Sea Ice Model - CICE
§ Ice thickness distribution (ITD)
  – Multiple discrete thickness bins
  – All tracers, etc. exist in each thickness class
§ Transport
  – Tracer conservation, horizontal transport, ITD
  – Incremental remap: efficient for many tracers
§ Dynamics
  – Momentum: includes forcing, grav
  – Stress: Elastic-Viscous-Plastic, EAP (anisotropic) rheology, stress tensor
§ Thermo/salinity/column physics
  – Melt/freeze, radiation, fluxes at top/bottom, melt ponds
OpenACC gotchas
§ Must delete device memory before host memory
§ Passing a non-contiguous array slice creates temporary arrays
  – The temporary array is not in device memory, even if the original array was
§ Use assumed shape for array declarations
  – When a lower bound is not 1, state it explicitly
    • real, dimension(:,0:), intent(in) :: foo
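A minimal sketch of two of the gotchas above, assuming a small stand-alone program rather than CICE code: the names `acc_gotchas`, `field`, and `scale_field` are made up for illustration. It shows an assumed-shape dummy argument that keeps a lower bound of 0, and the device copy being released before the host array is deallocated. Built without OpenACC, the directives are plain comments and the program runs on the host.

```fortran
program acc_gotchas
  implicit none
  real, allocatable :: field(:,:)
  integer, parameter :: nx = 4, ny = 3

  ! Hypothetical array whose second dimension starts at 0
  allocate(field(nx, 0:ny))
  field = 1.0

  !$acc enter data copyin(field)
  call scale_field(field, 2.0)
  !$acc update host(field)
  print *, 'all values doubled: ', all(field == 2.0)

  ! Order matters: release the device copy first, then the host array
  !$acc exit data delete(field)
  deallocate(field)

contains

  subroutine scale_field(foo, alpha)
    ! Assumed shape with the lower bound of 0 stated explicitly;
    ! with a plain (:,:) the second index of foo would start at 1
    real, dimension(:,0:), intent(inout) :: foo
    real, intent(in) :: alpha
    integer :: i, j

    !$acc parallel loop collapse(2) present(foo)
    do j = 0, ubound(foo, 2)
      do i = 1, size(foo, 1)
        foo(i, j) = alpha * foo(i, j)
      end do
    end do
  end subroutine scale_field

end program acc_gotchas
```

The whole array is passed to `scale_field`, so no temporary copy is created and the `present` clause can find the device data.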
Strategy
§ Profile
§ Minimize data movement
§ Exploit CUDA streams
§ Use GPUDirect for MPI between devices
gprof + gprof2dot.py
Challenges for Accelerating Dynamics
§ Halo updates between computations
§ Many arrays to move
  – Move all computations to GPU
Strategy – Minimize Data Movement
§ Use Fortran pointers into large memory chunks
  – Reduces data movement: one chunk is transferred instead of many separate arrays
allocate( mem_chunk(nx,ny,2) )
v => mem_chunk(:,:,1)
w => mem_chunk(:,:,2)
!$acc enter data create(mem_chunk)

!$acc update device(mem_chunk)
!$acc data present(v,w)
!$acc parallel loop collapse(2)
do j = 1,ny
   do i = 1,nx
      v(i,j) = alpha * w(i,j)
   enddo
enddo
!$acc end data
Strategy – Use CUDA Streams
§ Use CUDA streams
  – If loops are data independent, launch them on separate streams, along with the data updates to host/device
!$acc parallel loop collapse(2) async(1)
do i = 1,n
   do j = 1,m
      a(i,j) = a(i,j) * w(i,j)
   enddo
enddo

!$acc parallel loop collapse(2) async(2)
do i = 1,n
   do j = 1,m
      b(i,j) = b(i,j) + alpha * t(i,j)
   enddo
enddo
! Wait for both streams before using a or b
!$acc wait
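The bullet also mentions issuing data updates on the same streams. A minimal sketch of that, assuming a stand-alone program with made-up array sizes and values (the name `async_streams` is hypothetical): each `update host` is queued on the same async queue as the kernel that produced the data, so the copy is ordered after its kernel while the two queues remain independent of each other.

```fortran
program async_streams
  implicit none
  integer, parameter :: n = 64, m = 32
  real, parameter :: alpha = 2.0
  real :: a(n,m), b(n,m), w(n,m), t(n,m)
  integer :: i, j

  a = 3.0; w = 2.0
  b = 1.0; t = 0.5

  !$acc data copyin(a, b, w, t)

  ! Queue 1: kernel on a, then copy a back on the same queue
  !$acc parallel loop collapse(2) async(1)
  do j = 1, m
    do i = 1, n
      a(i,j) = a(i,j) * w(i,j)
    end do
  end do
  !$acc update host(a) async(1)

  ! Queue 2: independent kernel on b, then copy b back
  !$acc parallel loop collapse(2) async(2)
  do j = 1, m
    do i = 1, n
      b(i,j) = b(i,j) + alpha * t(i,j)
    end do
  end do
  !$acc update host(b) async(2)

  ! Block until both queues have drained before reading a or b on the host
  !$acc wait
  !$acc end data

  print *, 'a ok: ', all(a == 6.0), '  b ok: ', all(b == 2.0)
end program async_streams
```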
Strategy – Use CUDA Streams
§ Invoke subroutine calls using streams
! Before: no stream argument
do cat = 1,ncat
   call construct_fields(mx, my)
enddo
! After: one stream per category
do cat = 1,ncat
   ! In construct_fields, use 'cat' as the
   ! async stream value
   call construct_fields(cat, mx(:,:,cat), &
                         my(:,:,cat))
enddo
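What the callee side might look like can be sketched as follows. This is an assumption, not the CICE routine: the argument list of `construct_fields` is inferred from the call above, and its body is placeholder arithmetic. The point is only that the stream value arrives as an ordinary integer and is passed to the `async` clause, so the kernels for different categories can be queued without waiting on each other.

```fortran
module fields_mod
  implicit none
contains
  ! Hypothetical stand-in for construct_fields; the real routine differs
  subroutine construct_fields(stream, mx, my)
    integer, intent(in) :: stream
    real, dimension(:,:), intent(inout) :: mx
    real, dimension(:,:), intent(in) :: my
    integer :: i, j

    ! The category index selects the async queue for this kernel
    !$acc parallel loop collapse(2) present(mx, my) async(stream)
    do j = 1, size(mx, 2)
      do i = 1, size(mx, 1)
        mx(i,j) = mx(i,j) + my(i,j)   ! placeholder arithmetic
      end do
    end do
  end subroutine construct_fields
end module fields_mod

program stream_per_category
  use fields_mod
  implicit none
  integer, parameter :: nx = 16, ny = 8, ncat = 5
  real :: mx(nx,ny,ncat), my(nx,ny,ncat)
  integer :: cat

  mx = 1.0
  my = 2.0

  !$acc data copy(mx) copyin(my)
  do cat = 1, ncat
    ! Each slice is contiguous, so no temporary array is created
    call construct_fields(cat, mx(:,:,cat), my(:,:,cat))
  end do
  ! Wait for every category's queue before leaving the data region
  !$acc wait
  !$acc end data

  print *, 'all categories updated: ', all(mx == 3.0)
end program stream_per_category
```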
GPUDirect
§ Built CUDA-enabled OpenMPI 1.8.5 on Moonlight
  – Used PGI 14.7 as the compiler for CICE
§ Titan has Cray's CUDA-enabled version of MPICH
  – Also used PGI 14.7
  – However, the XK7 hardware doesn't support RDMA, so MPI still goes through the CPU
  – But coding to GPUDirect now prepares for the upcoming Summit cluster
GPUDirect
! Host version: one halo update per field
call ice_haloUpdate(dpx, halo_info)
call ice_haloUpdate(dpy, halo_info)
call ice_haloUpdate(mx, halo_info)
! Device version: three fields updated between ice_haloBegin and ice_haloEnd
call ice_haloBegin(halo_info, 3, updateInfo)
call ice_devHaloUpdate(halo_info, updateInfo, dpx)
call ice_devHaloUpdate(halo_info, updateInfo, dpy)
call ice_devHaloUpdate(halo_info, updateInfo, mx)
call ice_haloEnd(halo_info, updateInfo)
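The slides do not show what happens inside `ice_devHaloUpdate`. A common way to hand device buffers to a CUDA-aware MPI from OpenACC is the `host_data use_device` construct, sketched below under that assumption. The program name, buffer names, and ring-exchange pattern are illustrative only and are not the CICE halo code. It needs an MPI library to build, and a CUDA-aware one when compiled with OpenACC.

```fortran
program gpudirect_sketch
  use mpi
  implicit none
  integer, parameter :: n = 1024
  real(kind=8) :: sendbuf(n), recvbuf(n)
  integer :: ierr, rank, nprocs, left, right
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

  ! Ring neighbours (hypothetical communication pattern)
  right = mod(rank + 1, nprocs)
  left  = mod(rank - 1 + nprocs, nprocs)

  sendbuf = real(rank, kind=8)
  recvbuf = -1.0d0

  !$acc data copyin(sendbuf) copy(recvbuf)
  ! Inside host_data, sendbuf and recvbuf refer to their device addresses,
  ! so a CUDA-aware MPI can read and write GPU memory directly
  !$acc host_data use_device(sendbuf, recvbuf)
  call MPI_Sendrecv(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                    recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                    MPI_COMM_WORLD, status, ierr)
  !$acc end host_data
  !$acc end data

  ! Each rank should now hold its left neighbour's rank in recvbuf
  if (any(recvbuf /= real(left, kind=8))) then
    print *, 'rank', rank, ': exchange failed'
  else if (rank == 0) then
    print *, 'exchange ok'
  end if

  call MPI_Finalize(ierr)
end program gpudirect_sketch
```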
Results
§ Used two test problems: gx1 and tp4
  – gx1: 16 procs, grid size 320x384
  – tp4: 60 procs, grid size 900x600
§ Ran with longitudinal blocks
  – Reduces load imbalance
Test Platforms
§ LANL Moonlight
  – PGI 14.7, OpenMPI 1.8.5, CUDA 6.0
  – Intel Xeon (Sandy Bridge), 16 cores/node
  – 2 Nvidia M2090/node
§ ORNL Titan
  – PGI 14.7, Cray's MPICH, CUDA 5.5
  – AMD Interlagos, 16 cores/node
  – 1 Nvidia K20x/node
§ Nvidia's PSG cluster
  – Ivy Bridge E5-2690, dual socket, 10 cores/socket, 6 x K40
  – OpenMPI 1.8.5
Test Cases – tp4
4 nodes, 15 procs per node

Runtime (secs)         Moonlight   Titan   PSG
Baseline                      96     173    78
OpenACC + GPUDirect           99     194    91
tp4 Rank Distribution

Runtime (secs)       4 nodes/15 ppn   6 nodes/10 ppn   10 nodes/6 ppn
Baseline Moonlight               73               69               67
GPU Moonlight                    96               83               96
Baseline Titan                  173              169              160
GPU Titan                       194              182              163
Baseline PSG                     78               74               69
GPU PSG                          91               81               80
tp4 Scaling (Titan)

Runtime (secs)     10 procs, 5 nodes   20 procs, 10 nodes   40 procs, 20 nodes
Baseline Titan                   337                  187                  201
GPU + GPUDirect                  325                  180                  193
tp4 Scaling (PSG)

Runtime (secs)     10 procs, 2 nodes   20 procs, 4 nodes   40 procs, 5 nodes
Baseline PSG                     161                  87                  91
GPU + GPUDirect                  197                 107                 109
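To compare the two scaling tables on the same footing, the GPU runtime can be divided by the baseline runtime for each process count. A small sketch of that arithmetic, using only the Titan and PSG scaling numbers above (the program and variable names are made up): a ratio below 1 means the GPU run was faster than the baseline, above 1 means it was slower.

```fortran
program runtime_ratios
  implicit none
  integer, parameter :: ncfg = 3
  integer, parameter :: procs(ncfg) = [10, 20, 40]
  ! Runtimes in seconds, copied from the two scaling tables
  real, parameter :: base_titan(ncfg) = [337.0, 187.0, 201.0]
  real, parameter :: gpu_titan(ncfg)  = [325.0, 180.0, 193.0]
  real, parameter :: base_psg(ncfg)   = [161.0,  87.0,  91.0]
  real, parameter :: gpu_psg(ncfg)    = [197.0, 107.0, 109.0]
  integer :: k

  print '(a)', ' procs   Titan GPU/base   PSG GPU/base'
  do k = 1, ncfg
    print '(i6,2f15.3)', procs(k), &
          gpu_titan(k) / base_titan(k), &
          gpu_psg(k) / base_psg(k)
  end do
end program runtime_ratios
```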
Topology on PSG
         GPU0   GPU1   GPU2   GPU3   GPU4   GPU5   mlx5_0   CPU Affinity
GPU0     X      PIX    PHB    PHB    SOC    SOC    PHB      0-9
GPU1     PIX    X      PHB    PHB    SOC    SOC    PHB      0-9
GPU2     PHB    PHB    X      PIX    SOC    SOC    PHB      0-9
GPU3     PHB    PHB    PIX    X      SOC    SOC    PHB      0-9
GPU4     SOC    SOC    SOC    SOC    X      PHB    SOC      10-19
GPU5     SOC    SOC    SOC    SOC    PHB    X      SOC      10-19
mlx5_0   PHB    PHB    PHB    PHB    SOC    SOC    X

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch
Conclusion - Path Forward
§ Focus on dynamics/transport
§ Improve use of GPUDirect
  – Get rid of aggregation device buffer
  – Restructure code in order to get better communication/computation overlap
§ Can we find task parallelism?
  – Fuse kernels (not enough work)
  – Spawn computation in stream while performing halo