Center for Information Services and High Performance Computing (ZIH)
IUmd
Performance Analysis of a Molecular Dynamics Code
Overview
• IUmd – Introduction • First Look with Vampir
• Comparison of Code-versions • Hardware Counters
• Source Code Analysis
• Source Code Optimization • Porting IUmd to GPGPU
Initial Motivation
• Code had been in development for many years
• Lots of different code versions doing “similar science” • Developers wanted to compare these versions
• Search for Performance optimization possibilities • They (IU) provide domain specific expertise
Introduction: IUmd Code
• Classical molecular dynamics simulation of dense
nuclear matter consisting of either fully ionized atoms or free neutrons and protons
• Main targets are studies of the dense matter in white
dwarf and neutron stars.
• Interaction potentials between particles treated as
classical two-particle central potentials
• No complicated particle geometry or orientation • Electrons can be treated as a uniform background
DIGGING INTO THE CODE
IUmd Code Details
• Particle-particle-interactions have been implemented
in multiple ways (nucleon, pure-ion, ion-mixture) • Located in different files PP01, PP02 and PP03
• The code blocks are selectable using preprocessor
macros
• PP01 is the original implementation with no division into
the Ax, Bx or NBS blocks
PP02 implements the versions in use by the physicists today
• 3 different implementations for the Ax block
• 3 implementations for the Bx block
IUmd Code Details
• IUmd can be compiled as a serial, OpenMP, MPI, or
MPI+OpenMP (Hybrid)
MD Code Details
Two sections of code each have two or more variations
One section is labelled
A
and the other
B
Variations are numbered
MD can be compiled as a serial, OpenMP, MPI or
MPI+OpenMP
⌥
make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS= "NBSX=0 "
IUmd Workflow
The structure of the program is simple:
• First reads in a parameter file “runmd.in”
• And an initial particle configuration file “md.in”
• This allows checkpointing
• Program then calculates the initial set of accelerations • Enters a time-stepping loop, a triply nested ”do” loop
Runtime Parameter
Runtime Parameters
⌥
! Parameters :
sim_type = ’ ion mixture ’ , ! s i m u l a t i o n type
t s t a r t = 0.00 , ! s t a r t time
dt = 25.00 , ! time step ( fm / c ) ! Warmup :
nwgroup = 2 , ! groups
nwsteps = 50 , ! steps per group ! Measurement :
ngroup = 2 , ! groups
n t o t = 2 , ! per group
nind = 25 , ! steps between
t n o r m a l i z e = 50 , ! temp normal .
ncom = 50 , ! center of mass
! motion cancel .
The Main Loop
Figure: Simplified version of the main loop
The Main Loop
⌥ do 100 i g =1 , ngroup i n i t i a l i z e group i g s t a t i s t i c s do 40 j =1 , n t o t do i =1 , nind ! computes f o r c e s ! updates x and v c a l l newton enddo c a l l v t o t enddo compute group i g s t a t i s t i c s 40 continue 100 continue ⌦⌃ ⇧
MD Implementation Details - newton module
• Forces are calculated in the newton module in a pair
of nested do-loops
• Outer loop goes over target particles
• Inner loop over source particles
• Targets are assigned to MPI processes in a round-robin
fashion
• Within each MPI process, the work is shared among
FIRST LOOK WITH VAMPIR
COMPARISON OF
CODE-VERSIONS
Cray XT5mTM
• XT5m is a 2D mesh of nodes
• Each node has two sockets each having four cores • AMD Opteron 23 ”Shanghai” (45mm)
• Running at 2.4 GHz • 84 compute nodes
• a total of 672 cores • pgi/9.0.4
Time Constraints
First runs showed that:
• 5k particles measurement takes 5 minutes • 27k particles measurement takes 1 hour
• 55k particles measurement takes 10 hours
Overview PP01, PP02, and PP03
Figure: Overview of the runtimes for all (143)
HARDWARE COUNTERS
SOURCE CODE ANALYSIS
Block B
cut-off sphere reduces
Why is -O2 as fast as -O3
SOURCE CODE
OPTIMIZATION
Code Changes
• Omp parallel do for the j-loop is removed
• Omp parallel region around the i-loop • J-loop is parallelized explicitly
Entire ij-loop section in a 3rd loop over threads
• Each iteration of this outer loop is assigned to an
OpenMP thread
Code Changes
omp parallel do for the j-loop is removed
omp parallel region around the i-loop
j-loop is parallelized explicitly
Entire ij-loop section in a 3rd loop over threads
Each iteration of this outer loop is assigned to an OpenMP thread ⌥ allocate( f j ( 3 , 0 : n 1)) do 100 i =myrank , n 2,nprocs f i ( : ) = 0 . 0 d0 ! $omp p a r a l l e l do p r i v a t e ( r2 , k , xx , r , f c ) , \ r e d u c t i o n ( + : f i ) , schedule ( runtime ) do 90 j = i +1 ,n 1 ! A Block . . . . . . ⌥ ! $omp p a r a l l e l p r i v a t e ( nthrd , nj , nr , j0 , j1 , xx , r2 , r , fc , f i ) nthrd = omp_get_num_threads ( ) allocate( f i ( 3 , 0 : n 1)) ! $omp do schedule ( s t a t i c , 1 ) do 110 i t h r d =0 , nthrd 1 do 100 i =myrank , n 2,nprocs f i ( : , i )=0.0 d0 n j =(n i 1)/ nthrd nr=mod( n i 1,nthrd ) j 0 =( i +1)+ i t h r d⇤n j +min ( i t h r d , nr ) j 1 =( i +1)+( i t h r d +1)⇤n j +min ( i t h r d +1 , nr) 1 do 90 j =j0 , j 1
AMD Interlagos OpenMP Efficiency
PORTING IUMD TO GPGPU
3 Different Approaches
• HMPP
• Only a proof of concept so far • OpenACC
• CUDA
• PGI CUDA Fortran
Code Changes
Goal: Combine MPI Version with CUDA Needs new parameters to define
• The dimensions of the MPI process grid • Number of source chunks
• Thread block dimensions for the CUDA kernel
• [1] must exactly divide the number of particles N
• [1] must be a multiple of the warp size (32)
Changes to MPI
Split COM_WORLD into Cartesian Grid
Example: Calculate kinetic energy (ttot)
• Master column adds up kinetic energies (Allreduce) • Broadcast new values along each row
CUDA
Step A:
• Declare arrays in shared memory (xss, ffs) • Define grid, block and thread parameters
• Copy target data from global memory to registers • Initialize force
CUDA
Step B:
• Thread (i,j) applies its sources (xss) to its target
Step C:
• Sum the forces (ffs) from all threads of my block
Step D:
• Copy shared memory array (ffs) to global memory • Each block copies to its own array section
CUDA
Step E:
• Last block in row that copies its results then sums the
contribution from each block
• Threads share the work, accumulating results into
registers
• Threads take turns adding their contributions to
shared memory
Fermi vs Kepler (1 GPU, pgfortran CUDA)
Baseline:
• 8 OpenMP Threads (icc) • Pure_ion simulation
• Weak scaling (increasing the number of ions)
2 4 6 8 Spe edup FERMI KEPLER
Comparison OpenACC vs CUDA
• OpenMP version as baseline:
• 2 Intel(R) Xeon(R) CPU (E5620, 2.4GHz)
• Intel 12.1 Compiler
• 8 OpenMP threads • 1 Keppler K20
• Cuda-version: 5.0
• CAPS 3.3.1 for ACC
• PGI 13.1 for ACC/CUDA • IUmd parameter:
Comparison OpenACC vs CUDA
• Simulation type: ion-mixture • Input data set changed!
2 4 6 8 10 12 OpenACC CAPS OpenACC PGI CUDA
Combined Benchmark – first results 74.05 38.54 19.27 10.14 5.07 2.54 2.60 1.28 5.12 2.56 1.28 0.64 0.22 0.11 0.06 0.05 0.01 0.1 1 10 100 1 2 4 8 16 32 2x16 2x32 4x4 4x8 4x16 4x32 1 2x1 4x1 6x1 Ru n2 m e in s ec on ds Benchmark (IUmd 6.3.0)
Note of Thanks to Contributors
(Listing Names in Alphabetical Order)
Donald Berry Robert Henschel Joseph Schuchart Robert Dietrich Jonas Stolle Frank Winkler