HPC Applications Scalability
2
Applications Best Practices
• LAMMPS - Large-scale Atomic/Molecular Massively Parallel Simulator
• D. E. Shaw Research Desmond
• NWChem
• MPQC - Massively Parallel Quantum Chemistry Program
• ANSYS FLUENT and CFX
• CD-adapco STAR-CCM+
• CD-adapco STAR-CD
• MM5 - The Fifth-Generation Mesoscale Model
• NAMD
• LSTC LS-DYNA
• Schlumberger ECLIPSE
• CPMD - Car-Parrinello Molecular Dynamics
• WRF - Weather Research and Forecast Model
• Dacapo - total energy program based on density functional theory
• Lattice QCD • Open Foam • GAMESS • HOMME • OpenAtom • PopPerf
Applications Best Practices
•
Performance evaluation
•
Profiling – network, I/O
•
Recipes (installation, debug, command lines etc)
•
MPI libraries
•
Power management
•
Optimizations at small scale
HPC Applications
Small Scale
ECLIPSE Performance Results - Interconnect
•
InfiniBand enables highest scalability
– Performance accelerates with cluster size
•
Performance over GigE and 10GigE is not scaling
– Slowdown occurs beyond 8 nodes
Lower is better Single job per cluster size
Schlumberger ECLIPSE (FOURMILL) 0 1000 2000 3000 4000 5000 6000 4 8 12 16 20 22 24 Number of Nodes E lap sed T im e ( S eco n d s)
MM5 Performance Results - Interconnect
Platform MPI Higher is better
•
Input Data: T3A
–
Resolution 36KM, grid size 112x136, 33 vertical levels
–
81 second time-step, 3 hour forecast
•
InfiniBand DDR delivers higher performance in any cluster size
–
Up to 46% versus GigE, 30% versus 10GigE
MM5 Benchmark Results - T3A
0 10000 20000 30000 40000 50000 60000 4 6 8 10 12 14 16 18 20 22 Number of Nodes M flo p /s
MPQC Benchmark Result (aug-cc-pVTZ) 0 4000 8000 12000 16000 8 16 24 Number of Nodes E la ps e d Tim e (S e c onds )
GigE 10GigE InfiniBand DDR
MPQC Performance Results - Interconnect
•
Input Dataset
– MP2 calculations of the uracil dimer binding energy using the aug-cc-pVTZ basis set
•
InfiniBand enables higher scalability
– Performance accelerates with cluster size
– Outperforms GigE by up to 124% and 10GigE by up to 38%
NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes ns /D ay
GigE 10GigE InfiniBand
NAMD Performance Results – Interconnect
Higher is better Open MPI
• ApoA1 case - benchmark comprises 92K atoms of lipid, protein, and water
– Models a bloodstream lipoprotein particle
– One of the most used data sets for benchmarking NAMD
• InfiniBand 20Gb/s outperforms GigE and 10GigE in every cluster size
CPMD Performance - MPI
Lower is better
•
Platform MPI shows better scalability over Open MPI
These results are based on InfiniBand CPM D (Wat32 inp-1) 0 10 20 30 40 50 60 70 80 90 100 4 8 16
Num ber of Nodes
E lap sed t im e ( S eco n d s)
Open MPI Platform MPI
CPM D (Wat32 inp-2) 0 20 40 60 80 100 120 140 160 180 4 8 16
Num ber of Nodes
E lap sed t im e ( S eco n d s)
MM5 Performance - MPI
•
Platform MPI demonstrates higher performance versus MVAPICH
– Up to 25% higher performance
– Platform MPI advantage increases with increased cluster size
Single job on each node Higher is better
MM5 Benchmark Results - T3A
0 10000 20000 30000 40000 50000 60000 4 6 8 10 12 14 16 18 20 22 Number of Nodes M flop/s
NAMD Performance - MPI
Higher is better
•
Platform MPI and Open MPI provides same level of performance
– Platform MPI has better performance for cluster size lower than 20 nodes
– Open MPI becomes better with 24 nodes
• Higher configurations than 24 nodes were not tested
These results are based on InfiniBand NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes ns /Da y
STAR-CD Performance - MPI
•
Test case
– Single job over the entire systems
– Input Dataset (A-Class)
•
HP-MPI has slightly better performance with CPU affinity enabled
Lower is better
STAR-CD Benchmark Results (A-Class) 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 4 8 16 20 24 Number of Nodes T o tal E lap sed T im e ( s)
CPMD – MPI Profiling
•
MPI_AlltoAll is the key collective function in CPMD
– Number of AlltoAll messages increases dramatically with cluster size
MPI Function Distribution (C120 inp-1) 0 0.5 1 1.5 2 2.5 3 3.5 4 MP I_A llgat her MP I_A llred uce MP I_A lltoa ll MP I_B cas t MP I_Ise nd MP I_R ecv MP I_S end N u m b er o f M essag es ( M il li o n s)
Eclipse MPI Profiliing 30% 40% 50% 60% 4 8 12 16 20 24 Number of Nodes P er cen tag e o f M essag es
MPI_Isend < 128 Bytes MPI_Isend < 256 KB MPI_Recv < 128 Bytes MPI_Recv < 256 KB
ECLIPSE - MPI Profiling
•
Majority of MPI messages are large size
Abaqus/Standard - MPI Profiling
•
MPI_Test, MPI_Recv, and MPI_Barrier/Bcast show the highest
communication overhead
Abaqus/Standard MPI Profiliing (S2A) 1 10 100 1000 10000 100000 1000000 MP I_A llred uce MP I_A lltoa ll MP I_B arrier MP I_B cast MP I_Irecv MP I_Isen d MP I_R ecv MP I_R educe MP I_S end MP I_S sen d MP I_T est MP I_W aital l MPI Function M P I R u n ti m e ( s)
OpenFOAM MPI Profiling – MPI Functions
•
Mostly used MPI functions
– MPI_Allreduce, MPI_Waitall, MPI_Isend, and MPI_recv
ECLIPSE - Interconnect Usage
•
Total server throughput increases rapidly with cluster size
Data Sent 0 200 400 600 800 1000 1200 1400 1 45 89 133 177 221 265 309 353 397 441 485 529 573 617 661 705 749 793 Timing (s) D at a T ran sf er red ( M B /s) 4 Nodes 8 Nodes 16 Nodes 24 Nodes
This data is per node based
Data Sent 0 200 400 600 800 1000 1200 1400 1 146 291 436 581 726 871 1016 1161 1306 1451 1596 1741 1886 2031 Timing (s) D at a T ran sf er red ( M B /s) Data Sent 0 200 400 600 800 1000 1200 1400 1 86 171 256 341 426 511 596 681 766 851 936 1021 1106 1191 Timing (s) D at a T ran sf er red ( M B /s) Data Sent 0 200 400 600 800 1000 1200 1400 1 48 95 142 189 236 283 330 377 424 471 518 565 612 659 706 753 800 847 Timing (s) D at a T ran sf er red ( M B /s)
NWChem MPI Profiling – Message Transferred
• Most data related MPI messages are within 8KB-256KB in size
– Number of messages increases with cluster size
• Shows the need for highest throughput to ensure highest system utilization
NWChem Benchmark Profiliing (Siosi7) 0 10000 20000 30000 40000 50000 60000 <128B <1KB <8KB <256KB <1MB >1MB Message Size N u m b er o f M essag es ( K )
STAR-CD MPI Profiling (MPI_Sendrecv) 0 1 2 3 4 5 6 7 [0-128 B> [128B -1K> [1K-8K> [8K-25 6K> [256K -1M> Message Size N u m b er o f M essag es (M illio n s )
4 Nodes 8 Nodes 12 Nodes 16 Nodes 20 Nodes 22 Nodes
STAR-CD MPI Profiling – Data Transferred
•
Most point-to-point MPI messages are within 1KB to 8KB in size
FLUENT 12 MPI Profiling – Message Transferred
• Most data related MPI messages are within 256B-1KB in size
• Typical MPI synchronization messages are lower than 64B in size
• Number of messages increases with cluster size
FLUENT 12 Benchmark Profiling Result (Truck_poly_14M) 0 4000 8000 12000 16000 20000 24000 28000 [0..64B ] [64 ..256B ] [256 B..1 KB] [1..4 KB] [4..16K B] [16 ..64K B] [64 ..256K B] [25 6KB ..1M ] [1..4M ] [4M ..2GB ] Message Size N u m b er o f M essag es (Thous a nds )
ECLIPSE Performance - Productivity
• InfiniBand increases productivity by allowing multiple jobs to run simultaneously
– Providing required productivity for reservoir simulations
• Three cases are presented
– Single job over the entire systems
– Four jobs, each on two cores per CPU per server
– Eight jobs, each on one CPU core per server
• Eight jobs per node increases productivity by up to 142%
Higher is better InfiniBand
Schlumberger ECLIPSE (FOURMILL) 0 50 100 150 200 250 300 8 12 16 20 22 24 Number of Nodes N um be r of J obs
MPQC Benchmark Result (aug-cc-pVDZ) 0 50 100 150 200 250 300 8 12 16 20 24 Number of Nodes Jo b s/ D ay 1 Job 2 Jobs
MPQC Performance - Productivity
• InfiniBand increases productivity by allowing multiple jobs to run simultaneously
– Providing required productivity for MPQC computation
• Two cases are presented
– Single job over the entire systems
– Two jobs, each on four cores per server
• Two jobs per node increases productivity by up to 35%
Higher is better
35%
LS-DYNA Performance - Productivity
LS-DYNA - 3 Vehicle Collision
0 20 40 60 80 100 120 4 (32 Cor es) 8 (64 Cor es) 12 (96 Cor es) 16 (1 28 C ores) 20 (1 60 C ores) 24 (1 92 C ores) Number of Nodes Jo b s p er D ay
1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs
LS-DYNA - Neon Refined Revised
0 200 400 600 800 1000 1200 1400 4 (32 Cor es) 8 (64 Cor es) 12 (96 Cor es) 16 (1 28 C ores) 20 (1 60 C ores) 24 (1 92 C ores) Number of Nodes Jo b s p er D ay
1 Job 2 Parallel Jobs 4 Parallel Jobs 8 Parallel Jobs
FLUENT Performance - Productivity
• Test cases
– Single job over the entire systems
– 2 jobs, each runs on four cores per server
• Running multiple jobs simultaneously improves FLUENT productivity
– Up to 90% more jobs per day for Eddy_417K
– Up to 30% more jobs per day for Aircraft_2M
– Up to 3% more jobs per day for Truck_14M
• As bigger the # of elements, higher node count is required for increased productivity
– The CPU is the bottleneck for larger number of servers
Higher is better InfiniBand DDR
FLUENT 12.0 Productivity Result
(2 jobs in parallel vs 1 job)
0% 20% 40% 60% 80% 100% 4 8 12 16 20 24 Number of Nodes P roduc tiv it y G a in
MM5 Performance with Power Management
InfiniBand DDR Higher is better
• Test Scenario
– 24 servers, 4 processes per node, 2 processes per CPU (socket)
• Similar performance with power management enabled or disabled
– Only 1.4% performance degradation
MM5 Benchmark Results - T3A
10 15 20 25 30 35 40
Power Management Enabled Power Management Disabled
G
MM5 Benchmark – Power Cost Savings
•
Power management saves 673$/year for the 24-node cluster
•
As cluster size increases, bigger saving are expected
MM5 Benchmark - T3A Power Cost Comparison
9000 9400 9800 10200 10600 11000
Power Management Enabled Power Management Disabled
$/
year
InfiniBand DDR 24 Node Cluster
7%
$/year = Total power consumption/year (KWh) * $0.20
STAR-CD Benchmark – Power Consumption
•
Power management reduces 2% of total system power consumption
InfiniBand DDR
STAR-CD Benchmark Results
(AClass) 5000 5500 6000 6500 7000 7500 Power Management Enabled Power Management Disabled P o w e r C o n s u m p ti o n ( Wa tt ) Lower is better
STAR-CD Performance with Power Management
InfiniBand DDR Lower is better
• Test Scenario
– 24 servers, 4-Cores/Proc
• Nearly identical performance with power management enabled or disabled
STAR-CD Benchmark Results
(AClass) 1000 1200 1400 1600 1800 Power Management Enabled Power Management Disabled T o tal E lap sed T im e ( s)
MQPC Performance with CPU Frequency Scaling
InfiniBand DDR 24 Node Cluster MPQC Performance 0 500 1000 1500 2000 2500 3000 ug-cc-pVDZ ug-cc-pVTZ Benchmark Dataset E lap sed T im e ( s)Userspace (1.9GHz) Performance OnDemand
• Enabling CPU Frequency Scaling
– Userspace – reducing CPU frequency to 1.9GHz
– Performance – setting for maximum performance (CPU frequency of 2.6GHz)
– OnDemand – Maximum performance per application activity
• Userspace increases job run time since CPU frequency is reduced
• Performance and OnDemand enable similar performance
NWChem Benchmark Results
InfiniBand QDR
NWChem Benchmark Result (Siosi7) 0 50 100 150 200 250 300 8 12 16 Number of Nodes P er fo rm an ce ( Jo b s/ D ay) HP-M PI + GCC HP-M PI + Intel M KL Open M PI + GCC Open M PI + Intel M KL
•
Input Dataset - Siosi7
•
Open MPI and HP-MPI provides similar performance and scalability
–
Intel MKL library enables better performance versus GCC
NWChem Benchmark Results
•
Input Dataset
-
H2O7•
AMD ACML provides higher performance and scalability versus
the default BLAS library
InfiniBand DDR Lower is better
NWChem Benchmark Result
(H2O7 MP2) 0 50 100 150 200 250 300 4 8 16 24 Number of Nodes W al l t im e ( S eco n d s)
MVAPICH HP-MPI Open MPI
Tuning MPI For Performance Improvements
• MPI Performance tuning
– Enabling CPU affinity
• mca mpi_paffinity_alone 1
– Increasing eager limit over infiniBand to 32K
• mca btl_openib_eager_limit 32767 • Performance increase of up to 10% NAMD (ApoA1) 0 1 2 3 4 5 6 7 4 8 12 16 20 24 Number of Nodes D ays/ n s
Open MPI - Default Open MPI - Tuned
HPC Applications
Large Scale
Acknowledgments
•
The following research was performed under the HPC
Advisory Council activities
–
Participating members: AMD, Dell, Jülich, Mellanox NCAR,
ParTec, Sun
–
Compute resource: HPC Advisory Council Cluster Center,
Jülich Supercomputing Centre
•
For more info please refer to
HOMME
• High-Order Methods Modeling Environment (HOMME)
– Framework for creating a high-performance scalable global atmospheric model
– Configurable for shallow water or the dry/moist primitive equations
– Serves as a prototype for the Community Atmospheric Model (CAM) component of the Community Climate System Model (CCSM)
– HOMME supports execution on parallel computers using either MPI, OpenMP or a combination of MPI/OpenMP
POPperf
• POP (Parallel Ocean Program) is an ocean circulation model
– Simulations of the global ocean
– Ocean-ice coupled simulations
– Developed at Los Alamos National Lab
• POPperf is a modified version of POP 2.0 (Parallel Ocean Program)
• POPperf improves POP scalability on large processor counts
– Re-writing of the conjugate gradient solver to use a 1D data structure
– The addition of a space-filling curve partitioning technique
– Low memory binary parallel I/O functionality
Objectives and Environments
•
The presented research was done to provide best practices
– HOMME and POPperf scalability and optimizations
– Understanding communication patterns
•
HPC Advisory Council system
– Dell™ PowerEdge™ SC 1435 24-node cluster
– Quad-Core AMD Opteron™ 2382 (“Shanghai”) CPUs
– Mellanox® InfiniBand ConnectX HCAs and switches
•
Jülich Supercomputing Centre - JuRoPA
– Sun Blade servers with Intel Nehalem processors
– Mellanox 40Gb/s InfiniBand HCAs and switches
ParTec MPI Parameters
•
PSP_ONDEMAND
– Each MPI connection needs a certain amount of memory by default (0.5MB)
– PSP_ONDEMAND disabled
• MPI connections establish when application starts
• Reduce overhead to start connection dynamically
– PSP_ONDEMAND enabled
• MPI connections establish per need
• Reduce unnecessary message checking from large number of connections
• Reduce memory footprint
•
PSP_OPENIB_SENDQ_SIZE and PSP_OPENIB_RECVQ_SIZE
– Define default send/receive queue size (Default is16)
Lessons Learn for Large Scale
•
Each application needs customized MPI tuning
–
Dynamic MPI connection establishment should be enabled if
• Simultaneous connections setup is not a must
• Some MPI function needs to check incoming messages from any source
–
Dynamic MPI connection establishment should be disabled if
• MPI_Alltoall is used in the application
• Number of calls to check MPI_ANY_SROUCE is small
• Enough memory in the system
•
At different scale, different parameters may be used
–
PSP_ONDEMAND shouldn’t be enabled at very small scale cluster
•
Different MPI libraries has different characteristics
Summary
•
HOMME and POPperf performance
depends on low latency network
•
InfiniBand enables both HOMME and
POPperf to scale
–
Tested over 13824 cores
–
Same scalability is expected at even higher
core count
•
Optimized MPI settings can dramatically
improve application performance
–
Understanding application communication
pattern
Acknowledgements
•
Thanks to all the participated HPC Advisory members
–
Ben Mayer and John Dennis from NCAR who provided the code and
benchmark instructions
–
Bastian Tweddell, Norbert Eicker, and Thomas Lippert who made the
JuRoPA system available for benchmarking
–
Axel Koehler from Sun, Jens Hauke and Hugo R. Falter from ParTec
who helped review the report
Thank You
HPC Advisory Council