• No results found

IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13

N/A
N/A
Protected

Academic year: 2021

Share "IUmd. Performance Analysis of a Molecular Dynamics Code. Thomas William. Dresden, 5/27/13"

Copied!
50
0
0

Loading.... (view fulltext now)

Full text

(1)

Center for Information Services and High Performance Computing (ZIH)

IUmd

Performance Analysis of a Molecular Dynamics Code

(2)

Overview

•  IUmd – Introduction •  First Look with Vampir

•  Comparison of Code-versions •  Hardware Counters

•  Source Code Analysis

•  Source Code Optimization •  Porting IUmd to GPGPU

(3)

Initial Motivation

•  Code had been in development for many years

•  Lots of different code versions doing “similar science” •  Developers wanted to compare these versions

•  Search for Performance optimization possibilities •  They (IU) provide domain specific expertise

(4)

Introduction: IUmd Code

•  Classical molecular dynamics simulation of dense

nuclear matter consisting of either fully ionized atoms or free neutrons and protons

•  Main targets are studies of the dense matter in white

dwarf and neutron stars.

•  Interaction potentials between particles treated as

classical two-particle central potentials

•  No complicated particle geometry or orientation •  Electrons can be treated as a uniform background

(5)
(6)

DIGGING INTO THE CODE

(7)

IUmd Code Details

•  Particle-particle-interactions have been implemented

in multiple ways (nucleon, pure-ion, ion-mixture) •  Located in different files PP01, PP02 and PP03

•  The code blocks are selectable using preprocessor

macros

•  PP01 is the original implementation with no division into

the Ax, Bx or NBS blocks

PP02 implements the versions in use by the physicists today

•  3 different implementations for the Ax block

•  3 implementations for the Bx block

(8)

IUmd Code Details

•  IUmd can be compiled as a serial, OpenMP, MPI, or

MPI+OpenMP (Hybrid)

MD Code Details

Two sections of code each have two or more variations

One section is labelled

A

and the other

B

Variations are numbered

MD can be compiled as a serial, OpenMP, MPI or

MPI+OpenMP

make MDEF=XRay md_mpi ALG=PP02 BLKA=A0 BLKB=B2 NBS= "NBSX=0 "

(9)

IUmd Workflow

The structure of the program is simple:

•  First reads in a parameter file “runmd.in”

•  And an initial particle configuration file “md.in”

•  This allows checkpointing

•  Program then calculates the initial set of accelerations •  Enters a time-stepping loop, a triply nested ”do” loop

(10)

Runtime Parameter

Runtime Parameters

! Parameters :

sim_type = ’ ion mixture ’ , ! s i m u l a t i o n type

t s t a r t = 0.00 , ! s t a r t time

dt = 25.00 , ! time step ( fm / c ) ! Warmup :

nwgroup = 2 , ! groups

nwsteps = 50 , ! steps per group ! Measurement :

ngroup = 2 , ! groups

n t o t = 2 , ! per group

nind = 25 , ! steps between

t n o r m a l i z e = 50 , ! temp normal .

ncom = 50 , ! center of mass

! motion cancel .

(11)

The Main Loop

Figure: Simplified version of the main loop

The Main Loop

do 100 i g =1 , ngroup i n i t i a l i z e group i g s t a t i s t i c s do 40 j =1 , n t o t do i =1 , nind ! computes f o r c e s ! updates x and v c a l l newton enddo c a l l v t o t enddo compute group i g s t a t i s t i c s 40 continue 100 continue ⌦⌃ ⇧

(12)

MD Implementation Details - newton module

•  Forces are calculated in the newton module in a pair

of nested do-loops

•  Outer loop goes over target particles

•  Inner loop over source particles

•  Targets are assigned to MPI processes in a round-robin

fashion

•  Within each MPI process, the work is shared among

(13)

FIRST LOOK WITH VAMPIR

(14)
(15)
(16)

COMPARISON OF

CODE-VERSIONS

(17)

Cray XT5mTM

•  XT5m is a 2D mesh of nodes

•  Each node has two sockets each having four cores •  AMD Opteron 23 ”Shanghai” (45mm)

•  Running at 2.4 GHz •  84 compute nodes

•  a total of 672 cores •  pgi/9.0.4

(18)

Time Constraints

First runs showed that:

•  5k particles measurement takes 5 minutes •  27k particles measurement takes 1 hour

•  55k particles measurement takes 10 hours

(19)

Overview PP01, PP02, and PP03

Figure: Overview of the runtimes for all (143)

(20)
(21)

HARDWARE COUNTERS

(22)
(23)

SOURCE CODE ANALYSIS

(24)
(25)

Block B

cut-off sphere reduces

(26)
(27)

Why is -O2 as fast as -O3

(28)

SOURCE CODE

OPTIMIZATION

(29)
(30)

Code Changes

•  Omp parallel do for the j-loop is removed

•  Omp parallel region around the i-loop •  J-loop is parallelized explicitly

Entire ij-loop section in a 3rd loop over threads

•  Each iteration of this outer loop is assigned to an

OpenMP thread

Code Changes

omp parallel do for the j-loop is removed

omp parallel region around the i-loop

j-loop is parallelized explicitly

Entire ij-loop section in a 3rd loop over threads

Each iteration of this outer loop is assigned to an OpenMP thread ⌥ allocate( f j ( 3 , 0 : n 1)) do 100 i =myrank , n 2,nprocs f i ( : ) = 0 . 0 d0 ! $omp p a r a l l e l do p r i v a t e ( r2 , k , xx , r , f c ) , \ r e d u c t i o n ( + : f i ) , schedule ( runtime ) do 90 j = i +1 ,n 1 ! A Block . . . . . . ⌥ ! $omp p a r a l l e l p r i v a t e ( nthrd , nj , nr , j0 , j1 , xx , r2 , r , fc , f i ) nthrd = omp_get_num_threads ( ) allocate( f i ( 3 , 0 : n 1)) ! $omp do schedule ( s t a t i c , 1 ) do 110 i t h r d =0 , nthrd 1 do 100 i =myrank , n 2,nprocs f i ( : , i )=0.0 d0 n j =(n i 1)/ nthrd nr=mod( n i 1,nthrd ) j 0 =( i +1)+ i t h r d⇤n j +min ( i t h r d , nr ) j 1 =( i +1)+( i t h r d +1)⇤n j +min ( i t h r d +1 , nr) 1 do 90 j =j0 , j 1

(31)
(32)
(33)

AMD Interlagos OpenMP Efficiency

(34)

PORTING IUMD TO GPGPU

(35)

3 Different Approaches

•  HMPP

•  Only a proof of concept so far •  OpenACC

•  CUDA

•  PGI CUDA Fortran

(36)

Code Changes

Goal: Combine MPI Version with CUDA Needs new parameters to define

•  The dimensions of the MPI process grid •  Number of source chunks

•  Thread block dimensions for the CUDA kernel

•  [1] must exactly divide the number of particles N

•  [1] must be a multiple of the warp size (32)

(37)

Changes to MPI

Split COM_WORLD into Cartesian Grid

Example: Calculate kinetic energy (ttot)

•  Master column adds up kinetic energies (Allreduce) •  Broadcast new values along each row

(38)
(39)
(40)

CUDA

Step A:

•  Declare arrays in shared memory (xss, ffs) •  Define grid, block and thread parameters

•  Copy target data from global memory to registers •  Initialize force

(41)

CUDA

Step B:

•  Thread (i,j) applies its sources (xss) to its target

Step C:

•  Sum the forces (ffs) from all threads of my block

Step D:

•  Copy shared memory array (ffs) to global memory •  Each block copies to its own array section

(42)

CUDA

Step E:

•  Last block in row that copies its results then sums the

contribution from each block

•  Threads share the work, accumulating results into

registers

•  Threads take turns adding their contributions to

shared memory

(43)

Fermi vs Kepler (1 GPU, pgfortran CUDA)

Baseline:

•  8 OpenMP Threads (icc) •  Pure_ion simulation

•  Weak scaling (increasing the number of ions)

2   4   6   8   Spe edup   FERMI   KEPLER  

(44)

Comparison OpenACC vs CUDA

•  OpenMP version as baseline:

•  2 Intel(R) Xeon(R) CPU (E5620, 2.4GHz)

•  Intel 12.1 Compiler

•  8 OpenMP threads •  1 Keppler K20

•  Cuda-version: 5.0

•  CAPS 3.3.1 for ACC

•  PGI 13.1 for ACC/CUDA •  IUmd parameter:

(45)

Comparison OpenACC vs CUDA

•  Simulation type: ion-mixture •  Input data set changed!

2   4   6   8   10   12   OpenACC  CAPS   OpenACC  PGI   CUDA  

(46)

Combined Benchmark – first results 74.05   38.54   19.27   10.14   5.07   2.54   2.60   1.28   5.12   2.56   1.28   0.64   0.22   0.11   0.06   0.05   0.01   0.1   1   10   100   1   2   4   8   16   32   2x16   2x32   4x4   4x8   4x16   4x32   1   2x1   4x1   6x1   Ru n2 m e   in  s ec on ds   Benchmark  (IUmd  6.3.0)  

(47)

Note of Thanks to Contributors

(Listing Names in Alphabetical Order)

Donald Berry Robert Henschel Joseph Schuchart Robert Dietrich Jonas Stolle Frank Winkler

(48)
(49)
(50)

Figure

Figure : Simplified version of the main loop

References

Related documents

If the customer already has a Trial installation of Capture Pro when they purchase a Production license, they can upgrade their installation by running the License Manager

Cytokines such as IL-5 [63], or chemokines such as eotaxin are produced in elevated levels and attract ele- vated numbers of eosinophils to the lumen and bronchi of the lungs in

Table-1 highlighted the growth rate and variation in area, production and productivity of sugarcane cultivation in Haryana and sampled district Karnal during 2004-05

The protocol is distributed, and has the favorable feature that it guarantees local optimality in that every node ends up with a wakeup point that cannot be further improved in

Since the degree of all the vertices in rhomtree is less than or equal to four the graphical structure of rhomtree is isomorphic to some molecular

A comprehensive method to personalize BPT for families across multiple ethnic groups that are at high risk for externalizing behavior disorders as well as treatment fail- ure

Observing the performance trade-off between immediate and eventual forward secrecy in onion routing circuit construction, we also develop a λ - pass circuit construction, which

Pattern Diversity MIMO 4G and 5G Wideband Circularly Polarized Antenna with Integrated LTE Band for Mobile Handset.. Prashant Chaudhary 1 , Ashwani Kumar 2, * , and Avanish