• No results found

HPC Applications Scalability.

N/A
N/A
Protected

Academic year: 2021

Share "HPC Applications Scalability."

Copied!
47
0
0

Loading.... (view fulltext now)

Full text

(1)

HPC Applications Scalability

(2)

2

Applications Best Practices

LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator

D. E. Shaw Research Desmond

NWChem

MPQC - Massively Parallel Quantum Chemistry Program

ANSYS FLUENT and CFX

CD-adapco STAR-CCM+

CD-adapco STAR-CD

MM5 - The Fifth-Generation Mesoscale Model

NAMD

• LSTC LS-DYNA

Schlumberger ECLIPSE

CPMD - Car-Parrinello Molecular Dynamics

WRF - Weather Research and Forecast Model

Dacapo - total energy program based on density functional theory

Lattice QCDOpen FoamGAMESSHOMMEOpenAtomPopPerf

(3)

Applications Best Practices

Performance evaluation

Profiling – network, I/O

Recipes (installation, debug, command lines etc)

MPI libraries

Power management

Optimizations at small scale

(4)

HPC Applications

Small Scale

(5)

ECLIPSE Performance Results - Interconnect

InfiniBand enables highest scalability

– Performance accelerates with cluster size

Performance over GigE and 10GigE is not scaling

– Slowdown occurs beyond 8 nodes

Lower is better Single job per cluster size

Schlumberger ECLIPSE (FOURMILL) 0 1000 2000 3000 4000 5000 6000 4 8 12 16 20 22 24 Number of Nodes E lap sed T im e ( S eco n d s)

(6)

MM5 Performance Results - Interconnect

Platform MPI Higher is better

Input Data: T3A

Resolution 36KM, grid size 112x136, 33 vertical levels

81 second time-step, 3 hour forecast

InfiniBand DDR delivers higher performance in any cluster size

Up to 46% versus GigE, 30% versus 10GigE

MM5 Benchmark Results - T3A

0 10000 20000 30000 40000 50000 60000 4 6 8 10 12 14 16 18 20 22 Number of Nodes M flo p /s

(7)

MPQC Benchmark Result (aug-cc-pVTZ) 0 4000 8000 12000 16000 8 16 24 Number of Nodes E la ps e d Tim e (S e c onds )

GigE 10GigE InfiniBand DDR

MPQC Performance Results - Interconnect

Input Dataset

– MP2 calculations of the uracil dimer binding energy using the aug-cc-pVTZ basis set

InfiniBand enables higher scalability

– Performance accelerates with cluster size

– Outperforms GigE by up to 124% and 10GigE by up to 38%

(8)

NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes ns /D ay

GigE 10GigE InfiniBand

NAMD Performance Results – Interconnect

Higher is better Open MPI

ApoA1 case - benchmark comprises 92K atoms of lipid, protein, and water

– Models a bloodstream lipoprotein particle

– One of the most used data sets for benchmarking NAMD

InfiniBand 20Gb/s outperforms GigE and 10GigE in every cluster size

(9)

CPMD Performance - MPI

Lower is better

Platform MPI shows better scalability over Open MPI

These results are based on InfiniBand CPM D (Wat32 inp-1) 0 10 20 30 40 50 60 70 80 90 100 4 8 16

Num ber of Nodes

E lap sed t im e ( S eco n d s)

Open MPI Platform MPI

CPM D (Wat32 inp-2) 0 20 40 60 80 100 120 140 160 180 4 8 16

Num ber of Nodes

E lap sed t im e ( S eco n d s)

(10)

MM5 Performance - MPI

Platform MPI demonstrates higher performance versus MVAPICH

– Up to 25% higher performance

– Platform MPI advantage increases with increased cluster size

Single job on each node Higher is better

MM5 Benchmark Results - T3A

0 10000 20000 30000 40000 50000 60000 4 6 8 10 12 14 16 18 20 22 Number of Nodes M flop/s

(11)

NAMD Performance - MPI

Higher is better

Platform MPI and Open MPI provides same level of performance

– Platform MPI has better performance for cluster size lower than 20 nodes

– Open MPI becomes better with 24 nodes

• Higher configurations than 24 nodes were not tested

These results are based on InfiniBand NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes ns /Da y

(12)

STAR-CD Performance - MPI

Test case

– Single job over the entire systems

– Input Dataset (A-Class)

HP-MPI has slightly better performance with CPU affinity enabled

Lower is better

STAR-CD Benchmark Results (A-Class) 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 4 8 16 20 24 Number of Nodes T o tal E lap sed T im e ( s)

(13)

CPMD – MPI Profiling

MPI_AlltoAll is the key collective function in CPMD

– Number of AlltoAll messages increases dramatically with cluster size

MPI Function Distribution (C120 inp-1) 0 0.5 1 1.5 2 2.5 3 3.5 4 MP I_A llgat her MP I_A llred uce MP I_A lltoa ll MP I_B cas t MP I_Ise nd MP I_R ecv MP I_S end N u m b er o f M essag es ( M il li o n s)

(14)

Eclipse MPI Profiliing 30% 40% 50% 60% 4 8 12 16 20 24 Number of Nodes P er cen tag e o f M essag es

MPI_Isend < 128 Bytes MPI_Isend < 256 KB MPI_Recv < 128 Bytes MPI_Recv < 256 KB

ECLIPSE - MPI Profiling

Majority of MPI messages are large size

(15)

Abaqus/Standard - MPI Profiling

MPI_Test, MPI_Recv, and MPI_Barrier/Bcast show the highest

communication overhead

Abaqus/Standard MPI Profiliing (S2A) 1 10 100 1000 10000 100000 1000000 MP I_A llred uce MP I_A lltoa ll MP I_B arrier MP I_B cast MP I_Irecv MP I_Isen d MP I_R ecv MP I_R educe MP I_S end MP I_S sen d MP I_T est MP I_W aital l MPI Function M P I R u n ti m e ( s)

(16)

OpenFOAM MPI Profiling – MPI Functions

Mostly used MPI functions

– MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_recv

(17)

ECLIPSE - Interconnect Usage

Total server throughput increases rapidly with cluster size

Data Sent 0 200 400 600 800 1000 1200 1400 1 45 89 133 177 221 265 309 353 397 441 485 529 573 617 661 705 749 793 Timing (s) D at a T ran sf er red ( M B /s) 4 Nodes 8 Nodes 16 Nodes 24 Nodes

This data is per node based

Data Sent 0 200 400 600 800 1000 1200 1400 1 146 291 436 581 726 871 1016 1161 1306 1451 1596 1741 1886 2031 Timing (s) D at a T ran sf er red ( M B /s) Data Sent 0 200 400 600 800 1000 1200 1400 1 86 171 256 341 426 511 596 681 766 851 936 1021 1106 1191 Timing (s) D at a T ran sf er red ( M B /s) Data Sent 0 200 400 600 800 1000 1200 1400 1 48 95 142 189 236 283 330 377 424 471 518 565 612 659 706 753 800 847 Timing (s) D at a T ran sf er red ( M B /s)

(18)

NWChem MPI Profiling – Message Transferred

Most data related MPI messages are within 8KB-256KB in size

– Number of messages increases with cluster size

Shows the need for highest throughput to ensure highest system utilization

NWChem Benchmark Profiliing (Siosi7) 0 10000 20000 30000 40000 50000 60000 <128B <1KB <8KB <256KB <1MB >1MB Message Size N u m b er o f M essag es ( K )

(19)

STAR-CD MPI Profiling (MPI_Sendrecv) 0 1 2 3 4 5 6 7 [0-128 B> [128B -1K> [1K-8K> [8K-25 6K> [256K -1M> Message Size N u m b er o f M essag es (M illio n s )

4 Nodes 8 Nodes 12 Nodes 16 Nodes 20 Nodes 22 Nodes

STAR-CD MPI Profiling – Data Transferred

Most point-to-point MPI messages are within 1KB to 8KB in size

(20)

FLUENT 12 MPI Profiling – Message Transferred

Most data related MPI messages are within 256B-1KB in size

Typical MPI synchronization messages are lower than 64B in size

Number of messages increases with cluster size

FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) 0 4000 8000 12000 16000 20000 24000 28000 [0..64B ] [64 ..256B ] [256 B..1 KB] [1..4 KB] [4..16K B] [16 ..64K B] [64 ..256K B] [25 6KB ..1M ] [1..4M ] [4M ..2GB ] Message Size N u m b er o f M essag es (Thous a nds )

(21)

ECLIPSE Performance - Productivity

InfiniBand increases productivity by allowing multiple jobs to run simultaneously

– Providing required productivity for reservoir simulations

Three cases are presented

– Single job over the entire systems

– Four jobs, each on two cores per CPU per server

– Eight jobs, each on one CPU core per server

Eight jobs per node increases productivity by up to 142%

Higher is better InfiniBand

Schlumberger ECLIPSE (FOURMILL) 0 50 100 150 200 250 300 8 12 16 20 22 24 Number of Nodes N um be r of J obs

(22)

MPQC Benchmark Result (aug-cc-pVDZ) 0 50 100 150 200 250 300 8 12 16 20 24 Number of Nodes Jo b s/ D ay 1 Job 2 Jobs

MPQC Performance - Productivity

InfiniBand increases productivity by allowing multiple jobs to run simultaneously

– Providing required productivity for MPQC computation

Two cases are presented

– Single job over the entire systems

– Two jobs, each on four cores per server

Two jobs per node increases productivity by up to 35%

Higher is better

35%

(23)

LS-DYNA Performance - Productivity

LS-DYNA - 3 Vehicle Collision

0 20 40 60 80 100 120 4 (32 Cor es) 8 (64 Cor es) 12 (96 Cor es) 16 (1 28 C ores) 20 (1 60 C ores) 24 (1 92 C ores) Number of Nodes Jo b s p er D ay

1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs

LS-DYNA - Neon Refined Revised

0 200 400 600 800 1000 1200 1400 4 (32 Cor es) 8 (64 Cor es) 12 (96 Cor es) 16 (1 28 C ores) 20 (1 60 C ores) 24 (1 92 C ores) Number of Nodes Jo b s p er D ay

1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs

(24)

FLUENT Performance - Productivity

Test cases

– Single job over the entire systems

– 2 jobs, each runs on four cores per server

Running multiple jobs simultaneously improves FLUENT productivity

– Up to 90% more jobs per day for Eddy_417K

– Up to 30% more jobs per day for Aircraft_2M

– Up to 3% more jobs per day for Truck_14M

As bigger the # of elements, higher node count is required for increased productivity

– The CPU is the bottleneck for larger number of servers

Higher is better InfiniBand DDR

FLUENT 12.0 Productivity Result

(2 jobs in parallel vs 1 job)

0% 20% 40% 60% 80% 100% 4 8 12 16 20 24 Number of Nodes P roduc tiv it y G a in

(25)

MM5 Performance with Power Management

InfiniBand DDR Higher is better

Test Scenario

– 24 servers, 4 processes per node, 2 processes per CPU (socket)

Similar performance with power management enabled or disabled

– Only 1.4% performance degradation

MM5 Benchmark Results - T3A

10 15 20 25 30 35 40

Power Management Enabled Power Management Disabled

G

(26)

MM5 Benchmark – Power Cost Savings

Power management saves 673$/year for the 24-node cluster

As cluster size increases, bigger saving are expected

MM5 Benchmark - T3A Power Cost Comparison

9000 9400 9800 10200 10600 11000

Power Management Enabled Power Management Disabled

$/

year

InfiniBand DDR 24 Node Cluster

7%

$/year = Total power consumption/year (KWh) * $0.20

(27)

STAR-CD Benchmark – Power Consumption

Power management reduces 2% of total system power consumption

InfiniBand DDR

STAR-CD Benchmark Results

(AClass) 5000 5500 6000 6500 7000 7500 Power Management Enabled Power Management Disabled P o w e r C o n s u m p ti o n ( Wa tt ) Lower is better

(28)

STAR-CD Performance with Power Management

InfiniBand DDR Lower is better

Test Scenario

– 24 servers, 4-Cores/Proc

Nearly identical performance with power management enabled or disabled

STAR-CD Benchmark Results

(AClass) 1000 1200 1400 1600 1800 Power Management Enabled Power Management Disabled T o tal E lap sed T im e ( s)

(29)

MQPC Performance with CPU Frequency Scaling

InfiniBand DDR 24 Node Cluster MPQC Performance 0 500 1000 1500 2000 2500 3000 ug-cc-pVDZ ug-cc-pVTZ Benchmark Dataset E lap sed T im e ( s)

Userspace (1.9GHz) Performance OnDemand

Enabling CPU Frequency Scaling

– Userspace – reducing CPU frequency to 1.9GHz

– Performance – setting for maximum performance (CPU frequency of 2.6GHz)

– OnDemand – Maximum performance per application activity

Userspace increases job run time since CPU frequency is reduced

Performance and OnDemand enable similar performance

(30)

NWChem Benchmark Results

InfiniBand QDR

NWChem Benchmark Result (Siosi7) 0 50 100 150 200 250 300 8 12 16 Number of Nodes P er fo rm an ce ( Jo b s/ D ay) HP-M PI + GCC HP-M PI + Intel M KL Open M PI + GCC Open M PI + Intel M KL

Input Dataset - Siosi7

Open MPI and HP-MPI provides similar performance and scalability

Intel MKL library enables better performance versus GCC

(31)

NWChem Benchmark Results

Input Dataset

-

H2O7

AMD ACML provides higher performance and scalability versus

the default BLAS library

InfiniBand DDR Lower is better

NWChem Benchmark Result

(H2O7 MP2) 0 50 100 150 200 250 300 4 8 16 24 Number of Nodes W al l t im e ( S eco n d s)

MVAPICH HP-MPI Open MPI

(32)

Tuning MPI For Performance Improvements

MPI Performance tuning

– Enabling CPU affinity

• mca mpi_paffinity_alone 1

– Increasing eager limit over infiniBand to 32K

• mca btl_openib_eager_limit 32767 • Performance increase of up to 10% NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes D ays/ n s

Open MPI - Default Open MPI - Tuned

(33)

HPC Applications

Large Scale

(34)

Acknowledgments

The following research was performed under the HPC

Advisory Council activities

Participating members: AMD, Dell, Jülich, Mellanox NCAR,

ParTec, Sun

Compute resource: HPC Advisory Council Cluster Center,

Jülich Supercomputing Centre

For more info please refer to

(35)

HOMME

High-Order Methods Modeling Environment (HOMME)

– Framework for creating a high-performance scalable global atmospheric model

– Configurable for shallow water or the dry/moist primitive equations

– Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate System Model (CCSM)

– HOMME supports execution on parallel computers using either MPI, OpenMP or a combination of MPI/OpenMP

(36)

POPperf

POP (Parallel Ocean Program) is an ocean circulation model

– Simulations of the global ocean

– Ocean-ice coupled simulations

– Developed at Los Alamos National Lab

POPperf is a modified version of POP 2.0 (Parallel Ocean Program)

POPperf improves POP scalability on large processor counts

– Re-writing of the conjugate gradient solver to use a 1D data structure

– The addition of a space-filling curve partitioning technique

– Low memory binary parallel I/O functionality

(37)

Objectives and Environments

The presented research was done to provide best practices

– HOMME and POPperf scalability and optimizations

– Understanding communication patterns

HPC Advisory Council system

– Dell™ PowerEdge™ SC 1435 24-node cluster

– Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs

– Mellanox® InfiniBand ConnectX HCAs and switches

Jülich Supercomputing Centre - JuRoPA

– Sun Blade servers with Intel Nehalem processors

– Mellanox 40Gb/s InfiniBand HCAs and switches

(38)

ParTec MPI Parameters

PSP_ONDEMAND

– Each MPI connection needs a certain amount of memory by default (0.5MB)

– PSP_ONDEMAND disabled

• MPI connections establish when application starts

• Reduce overhead to start connection dynamically

– PSP_ONDEMAND enabled

• MPI connections establish per need

• Reduce unnecessary message checking from large number of connections

• Reduce memory footprint

PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE

– Define default send/receive queue size (Default is16)

(39)

Lessons Learn for Large Scale

Each application needs customized MPI tuning

Dynamic MPI connection establishment should be enabled if

• Simultaneous connections setup is not a must

• Some MPI function needs to check incoming messages from any source

Dynamic MPI connection establishment should be disabled if

• MPI_Alltoall is used in the application

• Number of calls to check MPI_ANY_SROUCE is small

• Enough memory in the system

At different scale, different parameters may be used

PSP_ONDEMAND shouldn’t be enabled at very small scale cluster

Different MPI libraries has different characteristics

(40)
(41)
(42)
(43)
(44)
(45)

Summary

HOMME and POPperf performance

depends on low latency network

InfiniBand enables both HOMME and

POPperf to scale

Tested over 13824 cores

Same scalability is expected at even higher

core count

Optimized MPI settings can dramatically

improve application performance

Understanding application communication

pattern

(46)

Acknowledgements

Thanks to all the participated HPC Advisory members

Ben Mayer and John Dennis from NCAR who provided the code and

benchmark instructions

Bastian Tweddell, Norbert Eicker, and Thomas Lippert who made the

JuRoPA system available for benchmarking

Axel Koehler from Sun, Jens Hauke and Hugo R. Falter from ParTec

who helped review the report

(47)

Thank You

HPC Advisory Council

References

Related documents