IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13

(1)

Center for Information Services and High Performance Computing (ZIH)

IUmd

Performance Analysis of a Molecular Dynamics Code

(2)

Overview

•  IUmd – Introduction •  First Look with Vampir

•  Comparison of Code-versions •  Hardware Counters

•  Source Code Analysis

•  Source Code Optimization •  Porting IUmd to GPGPU

(3)

Initial Motivation

•  Code had been in development for many years

•  Lots of different code versions doing “similar science” •  Developers wanted to compare these versions

•  Search for Performance optimization possibilities •  They (IU) provide domain specific expertise

(4)

Introduction: IUmd Code

•  Classical molecular dynamics simulation of dense

nuclear matter consisting of either fully ionized atoms or free neutrons and protons

•  Main targets are studies of the dense matter in white

dwarf and neutron stars.

•  Interaction potentials between particles treated as

classical two-particle central potentials

•  No complicated particle geometry or orientation •  Electrons can be treated as a uniform background

(5)

(6)

DIGGING INTO THE CODE

(7)

IUmd Code Details

•  Particle-particle-interactions have been implemented

in multiple ways (nucleon, pure-ion, ion-mixture) •  Located in different files PP01, PP02 and PP03

•  The code blocks are selectable using preprocessor

macros

•  PP01 is the original implementation with no division into

the Ax, Bx or NBS blocks

PP02 implements the versions in use by the physicists today

•  3 different implementations for the Ax block

•  3 implementations for the Bx block

(8)

IUmd Code Details

•  IUmd can be compiled as a serial, OpenMP, MPI, or

MPI+OpenMP (Hybrid)

MD Code Details

Two sections of code each have two or more variations

One section is labelled

A

and the other

B

Variations are numbered

MD can be compiled as a serial, OpenMP, MPI or

MPI+OpenMP

⌥

make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS= "NBSX=0 "

(9)

IUmd Workflow

The structure of the program is simple:

•  First reads in a parameter file “runmd.in”

•  And an initial particle configuration file “md.in”

•  This allows checkpointing

•  Program then calculates the initial set of accelerations •  Enters a time-stepping loop, a triply nested ”do” loop

(10)

Runtime Parameter

Runtime Parameters

⌥

! Parameters :

sim_type = ’ ion mixture ’ , ! s i m u l a t i o n type

t s t a r t = 0.00 , ! s t a r t time

dt = 25.00 , ! time step ( fm / c ) ! Warmup :

nwgroup = 2 , ! groups

nwsteps = 50 , ! steps per group ! Measurement :

ngroup = 2 , ! groups

n t o t = 2 , ! per group

nind = 25 , ! steps between

t n o r m a l i z e = 50 , ! temp normal .

ncom = 50 , ! center of mass

! motion cancel .

(11)

The Main Loop

Figure: Simplified version of the main loop

The Main Loop

⌥ do 100 i g =1 , ngroup i n i t i a l i z e group i g s t a t i s t i c s do 40 j =1 , n t o t do i =1 , nind ! computes f o r c e s ! updates x and v c a l l newton enddo c a l l v t o t enddo compute group i g s t a t i s t i c s 40 continue 100 continue ⌦⌃ ⇧

(12)

MD Implementation Details - newton module

•  Forces are calculated in the newton module in a pair

of nested do-loops

•  Outer loop goes over target particles

•  Inner loop over source particles

•  Targets are assigned to MPI processes in a round-robin

fashion

•  Within each MPI process, the work is shared among

(13)

FIRST LOOK WITH VAMPIR

(14)

(15)

(16)

COMPARISON OF

CODE-VERSIONS

(17)

Cray XT5mTM

•  XT5m is a 2D mesh of nodes

•  Each node has two sockets each having four cores •  AMD Opteron 23 ”Shanghai” (45mm)

•  Running at 2.4 GHz •  84 compute nodes

•  a total of 672 cores •  pgi/9.0.4

(18)

Time Constraints

First runs showed that:

•  5k particles measurement takes 5 minutes •  27k particles measurement takes 1 hour

•  55k particles measurement takes 10 hours

(19)

Overview PP01, PP02, and PP03

Figure: Overview of the runtimes for all (143)

(20)

(21)

HARDWARE COUNTERS

(22)

(23)

SOURCE CODE ANALYSIS

(24)

(25)

Block B

cut-off sphere reduces

(26)

(27)

Why is -O2 as fast as -O3

(28)

SOURCE CODE

OPTIMIZATION

(29)

(30)

Code Changes

•  Omp parallel do for the j-loop is removed

•  Omp parallel region around the i-loop •  J-loop is parallelized explicitly

Entire ij-loop section in a 3rd loop over threads

•  Each iteration of this outer loop is assigned to an

OpenMP thread

Code Changes

omp parallel do for the j-loop is removed

omp parallel region around the i-loop

j-loop is parallelized explicitly

Entire ij-loop section in a 3rd loop over threads

Each iteration of this outer loop is assigned to an OpenMP thread ⌥ allocate( f j ( 3 , 0 : n 1)) do 100 i =myrank , n 2,nprocs f i ( : ) = 0 . 0 d0 ! $omp p a r a l l e l do p r i v a t e ( r2 , k , xx , r , f c ) , \ r e d u c t i o n ( + : f i ) , schedule ( runtime ) do 90 j = i +1 ,n 1 ! A Block . . . . . . ⌥ ! $omp p a r a l l e l p r i v a t e ( nthrd , nj , nr , j0 , j1 , xx , r2 , r , fc , f i ) nthrd = omp_get_num_threads ( ) allocate( f i ( 3 , 0 : n 1)) ! $omp do schedule ( s t a t i c , 1 ) do 110 i t h r d =0 , nthrd 1 do 100 i =myrank , n 2,nprocs f i ( : , i )=0.0 d0 n j =(n i 1)/ nthrd nr=mod( n i 1,nthrd ) j 0 =( i +1)+ i t h r d⇤n j +min ( i t h r d , nr ) j 1 =( i +1)+( i t h r d +1)⇤n j +min ( i t h r d +1 , nr) 1 do 90 j =j0 , j 1

(31)

(32)

(33)

AMD Interlagos OpenMP Efficiency

(34)

PORTING IUMD TO GPGPU

(35)

3 Different Approaches

•  HMPP

•  Only a proof of concept so far •  OpenACC

•  CUDA

•  PGI CUDA Fortran

(36)

Code Changes

Goal: Combine MPI Version with CUDA Needs new parameters to define

•  The dimensions of the MPI process grid •  Number of source chunks

•  Thread block dimensions for the CUDA kernel

•  [1] must exactly divide the number of particles N

•  [1] must be a multiple of the warp size (32)

(37)

Changes to MPI

Split COM_WORLD into Cartesian Grid

Example: Calculate kinetic energy (ttot)

•  Master column adds up kinetic energies (Allreduce) •  Broadcast new values along each row

(38)

(39)

(40)

CUDA

Step A:

•  Declare arrays in shared memory (xss, ffs) •  Define grid, block and thread parameters

•  Copy target data from global memory to registers •  Initialize force

(41)

CUDA

Step B:

•  Thread (i,j) applies its sources (xss) to its target

Step C:

•  Sum the forces (ffs) from all threads of my block

Step D:

•  Copy shared memory array (ffs) to global memory •  Each block copies to its own array section

(42)

CUDA

Step E:

•  Last block in row that copies its results then sums the

contribution from each block

•  Threads share the work, accumulating results into

registers

•  Threads take turns adding their contributions to

shared memory

(43)

Fermi vs Kepler (1 GPU, pgfortran CUDA)

Baseline:

•  8 OpenMP Threads (icc) •  Pure_ion simulation

•  Weak scaling (increasing the number of ions)

2 4 6 8 Spe edup FERMI KEPLER

(44)

Comparison OpenACC vs CUDA

•  OpenMP version as baseline:

•  2 Intel(R) Xeon(R) CPU (E5620, 2.4GHz)

•  Intel 12.1 Compiler

•  8 OpenMP threads •  1 Keppler K20

•  Cuda-version: 5.0

•  CAPS 3.3.1 for ACC

•  PGI 13.1 for ACC/CUDA •  IUmd parameter:

(45)

Comparison OpenACC vs CUDA

•  Simulation type: ion-mixture •  Input data set changed!

2 4 6 8 10 12 OpenACC CAPS OpenACC PGI CUDA

(46)

Combined Benchmark – first results 74.05 38.54 19.27 10.14 5.07 2.54 2.60 1.28 5.12 2.56 1.28 0.64 0.22 0.11 0.06 0.05 0.01 0.1 1 10 100 1 2 4 8 16 32 2x16 2x32 4x4 4x8 4x16 4x32 1 2x1 4x1 6x1 Ru n2 m e in s ec on ds Benchmark (IUmd 6.3.0)

(47)

Note of Thanks to Contributors

(Listing Names in Alphabetical Order)

Donald Berry Robert Henschel Joseph Schuchart Robert Dietrich Jonas Stolle Frank Winkler

(48)

(49)

(50)