HPC Applications Scalability.

(1)

HPC Applications Scalability

(2)

2

Applications Best Practices

• LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator

• D. E. Shaw Research Desmond

• NWChem

• MPQC - Massively Parallel Quantum Chemistry Program

• ANSYS FLUENT and CFX

• CD-adapco STAR-CCM+

• CD-adapco STAR-CD

• MM5 - The Fifth-Generation Mesoscale Model

• NAMD

• LSTC LS-DYNA

• Schlumberger ECLIPSE

• CPMD - Car-Parrinello Molecular Dynamics

• WRF - Weather Research and Forecast Model

• Dacapo - total energy program based on density functional theory

• Lattice QCD • Open Foam • GAMESS • HOMME • OpenAtom • PopPerf

(3)

Applications Best Practices

•

Performance evaluation

•

Profiling – network, I/O

•

Recipes (installation, debug, command lines etc)

•

MPI libraries

•

Power management

•

Optimizations at small scale

(4)

HPC Applications

Small Scale

(5)

ECLIPSE Performance Results - Interconnect

•

InfiniBand enables highest scalability

– Performance accelerates with cluster size

•

Performance over GigE and 10GigE is not scaling

– Slowdown occurs beyond 8 nodes

Lower is better Single job per cluster size

Schlumberger ECLIPSE (FOURMILL) 0 1000 2000 3000 4000 5000 6000 4 8 12 16 20 22 24 Number of Nodes E lap sed T im e ( S eco n d s)

(6)

MM5 Performance Results - Interconnect

Platform MPI Higher is better

•

Input Data: T3A

–

Resolution 36KM, grid size 112x136, 33 vertical levels

–

81 second time-step, 3 hour forecast

•

InfiniBand DDR delivers higher performance in any cluster size

–

Up to 46% versus GigE, 30% versus 10GigE

MM5 Benchmark Results - T3A

0 10000 20000 30000 40000 50000 60000 4 6 8 10 12 14 16 18 20 22 Number of Nodes M flo p /s

(7)

MPQC Benchmark Result (aug-cc-pVTZ) 0 4000 8000 12000 16000 8 16 24 Number of Nodes E la ps e d Tim e (S e c onds )

GigE 10GigE InfiniBand DDR

MPQC Performance Results - Interconnect

•

Input Dataset

– MP2 calculations of the uracil dimer binding energy using the aug-cc-pVTZ basis set

•

InfiniBand enables higher scalability

– Performance accelerates with cluster size

– Outperforms GigE by up to 124% and 10GigE by up to 38%

(8)

NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes ns /D ay

GigE 10GigE InfiniBand

NAMD Performance Results – Interconnect

Higher is better Open MPI

• ApoA1 case - benchmark comprises 92K atoms of lipid, protein, and water

– Models a bloodstream lipoprotein particle

– One of the most used data sets for benchmarking NAMD

• InfiniBand 20Gb/s outperforms GigE and 10GigE in every cluster size

(9)

CPMD Performance - MPI

Lower is better

•

Platform MPI shows better scalability over Open MPI

These results are based on InfiniBand CPM D (Wat32 inp-1) 0 10 20 30 40 50 60 70 80 90 100 4 8 16

Num ber of Nodes

E lap sed t im e ( S eco n d s)

Open MPI Platform MPI

CPM D (Wat32 inp-2) 0 20 40 60 80 100 120 140 160 180 4 8 16

Num ber of Nodes

E lap sed t im e ( S eco n d s)

(10)

MM5 Performance - MPI

•

Platform MPI demonstrates higher performance versus MVAPICH

– Up to 25% higher performance

– Platform MPI advantage increases with increased cluster size

Single job on each node Higher is better

MM5 Benchmark Results - T3A

0 10000 20000 30000 40000 50000 60000 4 6 8 10 12 14 16 18 20 22 Number of Nodes M flop/s

(11)

NAMD Performance - MPI

Higher is better

•

Platform MPI and Open MPI provides same level of performance

– Platform MPI has better performance for cluster size lower than 20 nodes

– Open MPI becomes better with 24 nodes

• Higher configurations than 24 nodes were not tested

These results are based on InfiniBand NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes ns /Da y

(12)

STAR-CD Performance - MPI

•

Test case

– Single job over the entire systems

– Input Dataset (A-Class)

•

HP-MPI has slightly better performance with CPU affinity enabled

Lower is better

STAR-CD Benchmark Results (A-Class) 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 4 8 16 20 24 Number of Nodes T o tal E lap sed T im e ( s)

(13)

CPMD – MPI Profiling

•

MPI_AlltoAll is the key collective function in CPMD

– Number of AlltoAll messages increases dramatically with cluster size

MPI Function Distribution (C120 inp-1) 0 0.5 1 1.5 2 2.5 3 3.5 4 MP I_A llgat her MP I_A llred uce MP I_A lltoa ll MP I_B cas t MP I_Ise nd MP I_R ecv MP I_S end N u m b er o f M essag es ( M il li o n s)

(14)

Eclipse MPI Profiliing 30% 40% 50% 60% 4 8 12 16 20 24 Number of Nodes P er cen tag e o f M essag es

MPI_Isend < 128 Bytes MPI_Isend < 256 KB MPI_Recv < 128 Bytes MPI_Recv < 256 KB

ECLIPSE - MPI Profiling

•

Majority of MPI messages are large size

(15)

Abaqus/Standard - MPI Profiling

•

MPI_Test, MPI_Recv, and MPI_Barrier/Bcast show the highest

communication overhead

Abaqus/Standard MPI Profiliing (S2A) 1 10 100 1000 10000 100000 1000000 MP I_A llred uce MP I_A lltoa ll MP I_B arrier MP I_B cast MP I_Irecv MP I_Isen d MP I_R ecv MP I_R educe MP I_S end MP I_S sen d MP I_T est MP I_W aital l MPI Function M P I R u n ti m e ( s)

(16)

OpenFOAM MPI Profiling – MPI Functions

•

Mostly used MPI functions

– MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_recv

(17)

ECLIPSE - Interconnect Usage

•

Total server throughput increases rapidly with cluster size

Data Sent 0 200 400 600 800 1000 1200 1400 1 45 89 133 177 221 265 309 353 397 441 485 529 573 617 661 705 749 793 Timing (s) D at a T ran sf er red ( M B /s) 4 Nodes 8 Nodes 16 Nodes 24 Nodes

This data is per node based

Data Sent 0 200 400 600 800 1000 1200 1400 1 146 291 436 581 726 871 1016 1161 1306 1451 1596 1741 1886 2031 Timing (s) D at a T ran sf er red ( M B /s) Data Sent 0 200 400 600 800 1000 1200 1400 1 86 171 256 341 426 511 596 681 766 851 936 1021 1106 1191 Timing (s) D at a T ran sf er red ( M B /s) Data Sent 0 200 400 600 800 1000 1200 1400 1 48 95 142 189 236 283 330 377 424 471 518 565 612 659 706 753 800 847 Timing (s) D at a T ran sf er red ( M B /s)

(18)

NWChem MPI Profiling – Message Transferred

• Most data related MPI messages are within 8KB-256KB in size

– Number of messages increases with cluster size

• Shows the need for highest throughput to ensure highest system utilization

NWChem Benchmark Profiliing (Siosi7) 0 10000 20000 30000 40000 50000 60000 <128B <1KB <8KB <256KB <1MB >1MB Message Size N u m b er o f M essag es ( K )

(19)

STAR-CD MPI Profiling (MPI_Sendrecv) 0 1 2 3 4 5 6 7 [0-128 B> [128B -1K> [1K-8K> [8K-25 6K> [256K -1M> Message Size N u m b er o f M essag es (M illio n s )

4 Nodes 8 Nodes 12 Nodes 16 Nodes 20 Nodes 22 Nodes

STAR-CD MPI Profiling – Data Transferred

•

Most point-to-point MPI messages are within 1KB to 8KB in size

(20)

FLUENT 12 MPI Profiling – Message Transferred

• Most data related MPI messages are within 256B-1KB in size

• Typical MPI synchronization messages are lower than 64B in size

• Number of messages increases with cluster size

FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) 0 4000 8000 12000 16000 20000 24000 28000 [0..64B ] [64 ..256B ] [256 B..1 KB] [1..4 KB] [4..16K B] [16 ..64K B] [64 ..256K B] [25 6KB ..1M ] [1..4M ] [4M ..2GB ] Message Size N u m b er o f M essag es (Thous a nds )

(21)

ECLIPSE Performance - Productivity

• InfiniBand increases productivity by allowing multiple jobs to run simultaneously

– Providing required productivity for reservoir simulations

• Three cases are presented

– Four jobs, each on two cores per CPU per server

– Eight jobs, each on one CPU core per server

• Eight jobs per node increases productivity by up to 142%

Higher is better InfiniBand

Schlumberger ECLIPSE (FOURMILL) 0 50 100 150 200 250 300 8 12 16 20 22 24 Number of Nodes N um be r of J obs

(22)

MPQC Benchmark Result (aug-cc-pVDZ) 0 50 100 150 200 250 300 8 12 16 20 24 Number of Nodes Jo b s/ D ay 1 Job 2 Jobs

MPQC Performance - Productivity

• InfiniBand increases productivity by allowing multiple jobs to run simultaneously

– Providing required productivity for MPQC computation

• Two cases are presented

– Two jobs, each on four cores per server

• Two jobs per node increases productivity by up to 35%

Higher is better

35%

(23)

LS-DYNA Performance - Productivity

LS-DYNA - 3 Vehicle Collision

0 20 40 60 80 100 120 4 (32 Cor es) 8 (64 Cor es) 12 (96 Cor es) 16 (1 28 C ores) 20 (1 60 C ores) 24 (1 92 C ores) Number of Nodes Jo b s p er D ay

1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs

LS-DYNA - Neon Refined Revised

0 200 400 600 800 1000 1200 1400 4 (32 Cor es) 8 (64 Cor es) 12 (96 Cor es) 16 (1 28 C ores) 20 (1 60 C ores) 24 (1 92 C ores) Number of Nodes Jo b s p er D ay

1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs

(24)

FLUENT Performance - Productivity

• Test cases

– 2 jobs, each runs on four cores per server

• Running multiple jobs simultaneously improves FLUENT productivity

– Up to 90% more jobs per day for Eddy_417K

– Up to 30% more jobs per day for Aircraft_2M

– Up to 3% more jobs per day for Truck_14M

• As bigger the # of elements, higher node count is required for increased productivity

– The CPU is the bottleneck for larger number of servers

Higher is better InfiniBand DDR

FLUENT 12.0 Productivity Result

(2 jobs in parallel vs 1 job)

0% 20% 40% 60% 80% 100% 4 8 12 16 20 24 Number of Nodes P roduc tiv it y G a in

(25)

MM5 Performance with Power Management

InfiniBand DDR Higher is better

• Test Scenario

– 24 servers, 4 processes per node, 2 processes per CPU (socket)

• Similar performance with power management enabled or disabled

– Only 1.4% performance degradation

MM5 Benchmark Results - T3A

10 15 20 25 30 35 40

Power Management Enabled Power Management Disabled

G

(26)

MM5 Benchmark – Power Cost Savings

•

Power management saves 673$/year for the 24-node cluster

•

As cluster size increases, bigger saving are expected

MM5 Benchmark - T3A Power Cost Comparison

9000 9400 9800 10200 10600 11000

Power Management Enabled Power Management Disabled

$/

year

InfiniBand DDR 24 Node Cluster

7%

$/year = Total power consumption/year (KWh) * $0.20

(27)

STAR-CD Benchmark – Power Consumption

•

Power management reduces 2% of total system power consumption

InfiniBand DDR

STAR-CD Benchmark Results

(AClass) 5000 5500 6000 6500 7000 7500 Power Management Enabled Power Management Disabled P o w e r C o n s u m p ti o n ( Wa tt ) Lower is better

(28)

STAR-CD Performance with Power Management

InfiniBand DDR Lower is better

• Test Scenario

– 24 servers, 4-Cores/Proc

• Nearly identical performance with power management enabled or disabled

STAR-CD Benchmark Results

(AClass) 1000 1200 1400 1600 1800 Power Management Enabled Power Management Disabled T o tal E lap sed T im e ( s)

(29)

MQPC Performance with CPU Frequency Scaling

InfiniBand DDR 24 Node Cluster MPQC Performance 0 500 1000 1500 2000 2500 3000 ug-cc-pVDZ ug-cc-pVTZ Benchmark Dataset E lap sed T im e ( s)

Userspace (1.9GHz) Performance OnDemand

• Enabling CPU Frequency Scaling

– Userspace – reducing CPU frequency to 1.9GHz

– Performance – setting for maximum performance (CPU frequency of 2.6GHz)

– OnDemand – Maximum performance per application activity

• Userspace increases job run time since CPU frequency is reduced

• Performance and OnDemand enable similar performance

(30)

NWChem Benchmark Results

InfiniBand QDR

NWChem Benchmark Result (Siosi7) 0 50 100 150 200 250 300 8 12 16 Number of Nodes P er fo rm an ce ( Jo b s/ D ay) HP-M PI + GCC HP-M PI + Intel M KL Open M PI + GCC Open M PI + Intel M KL

•

Input Dataset - Siosi7

•

Open MPI and HP-MPI provides similar performance and scalability

–

Intel MKL library enables better performance versus GCC

(31)

NWChem Benchmark Results

•

Input Dataset

-

H2O7

•

AMD ACML provides higher performance and scalability versus

the default BLAS library

InfiniBand DDR Lower is better

NWChem Benchmark Result

(H2O7 MP2) 0 50 100 150 200 250 300 4 8 16 24 Number of Nodes W al l t im e ( S eco n d s)

MVAPICH HP-MPI Open MPI

(32)

Tuning MPI For Performance Improvements

• MPI Performance tuning

– Enabling CPU affinity

• mca mpi_paffinity_alone 1

– Increasing eager limit over infiniBand to 32K

• mca btl_openib_eager_limit 32767 • Performance increase of up to 10% NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes D ays/ n s

Open MPI - Default Open MPI - Tuned

(33)

HPC Applications

Large Scale

(34)

Acknowledgments

•

The following research was performed under the HPC

Advisory Council activities

–

Participating members: AMD, Dell, Jülich, Mellanox NCAR,

ParTec, Sun

–

Compute resource: HPC Advisory Council Cluster Center,

Jülich Supercomputing Centre

•

For more info please refer to

(35)

HOMME

• High-Order Methods Modeling Environment (HOMME)

– Framework for creating a high-performance scalable global atmospheric model

– Configurable for shallow water or the dry/moist primitive equations

– Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate System Model (CCSM)

– HOMME supports execution on parallel computers using either MPI, OpenMP or a combination of MPI/OpenMP

(36)

POPperf

• POP (Parallel Ocean Program) is an ocean circulation model

– Simulations of the global ocean

– Ocean-ice coupled simulations

– Developed at Los Alamos National Lab

• POPperf is a modified version of POP 2.0 (Parallel Ocean Program)

• POPperf improves POP scalability on large processor counts

– Re-writing of the conjugate gradient solver to use a 1D data structure

– The addition of a space-filling curve partitioning technique

– Low memory binary parallel I/O functionality

(37)

Objectives and Environments

•

The presented research was done to provide best practices

– HOMME and POPperf scalability and optimizations

– Understanding communication patterns

•

HPC Advisory Council system

– Dell™ PowerEdge™ SC 1435 24-node cluster

– Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs

– Mellanox® InfiniBand ConnectX HCAs and switches

•

Jülich Supercomputing Centre - JuRoPA

– Sun Blade servers with Intel Nehalem processors

– Mellanox 40Gb/s InfiniBand HCAs and switches

(38)

ParTec MPI Parameters

•

PSP_ONDEMAND

– Each MPI connection needs a certain amount of memory by default (0.5MB)

– PSP_ONDEMAND disabled

• MPI connections establish when application starts

• Reduce overhead to start connection dynamically

– PSP_ONDEMAND enabled

• MPI connections establish per need

• Reduce unnecessary message checking from large number of connections

• Reduce memory footprint

•

PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE

– Define default send/receive queue size (Default is16)

(39)

Lessons Learn for Large Scale

•

Each application needs customized MPI tuning

–

Dynamic MPI connection establishment should be enabled if

• Simultaneous connections setup is not a must

• Some MPI function needs to check incoming messages from any source

–

Dynamic MPI connection establishment should be disabled if

• MPI_Alltoall is used in the application

• Number of calls to check MPI_ANY_SROUCE is small

• Enough memory in the system

•

At different scale, different parameters may be used

–

PSP_ONDEMAND shouldn’t be enabled at very small scale cluster

•

Different MPI libraries has different characteristics

(40)

(41)

(42)

(43)

(44)

(45)

Summary

•

HOMME and POPperf performance

depends on low latency network

•

InfiniBand enables both HOMME and

POPperf to scale

–

Tested over 13824 cores

–

Same scalability is expected at even higher

core count

•

Optimized MPI settings can dramatically

improve application performance

–

Understanding application communication

pattern

(46)

Acknowledgements

•

Thanks to all the participated HPC Advisory members

–

Ben Mayer and John Dennis from NCAR who provided the code and

benchmark instructions

–

Bastian Tweddell, Norbert Eicker, and Thomas Lippert who made the

JuRoPA system available for benchmarking

–

Axel Koehler from Sun, Jens Hauke and Hugo R. Falter from ParTec

who helped review the report

(47)

Thank You

HPC Advisory Council