Analysis of MPI Inter-Chip Communication Patterns
on Multi-Core Distributed Shared-Memory Computers
Manfred Mücke, Wilfried Gansterer
Research Lab Computational Technologies and Applications University of Vienna
http://rlcta.univie.ac.at
WHTRA February 9th, 2011
1 Motivation
2 Sun Fire X4600 M2
3 MPI_ALLREDUCE
4 Measurements
MPI applications on clusters are the workhorses of scientific computing.
Some applications (or simulated systems) do not scale beyond a certain number of cores.
→ execute MPI applications on a single ccNUMA node (currently up to 48 cores):
→ lower latency, → higher bandwidth, → improved performance.
CHARMM (Chemistry at HARvard Macromolecular Mechanics)
CHARMM Molecular Dynamics (MD) Simulation: 20+ years of code development
Fortran (77/95) + MPI
Viscosity studies → long timelines (µs)
Benchmark system: JAC1000 (protein in water, 23,558 atoms, 1 ps)
t_JAC1000 = 60 s → 1.44 ns/day
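The throughput figure follows from simple arithmetic: 1 ps of simulated time per 60 s of wall-clock time, scaled to one day. The sketch below reproduces that conversion and estimates how long a 1 µs timeline would take at this rate. The function names are illustrative and not part of CHARMM.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def ns_per_day(simulated_ps: float, wall_seconds: float) -> float:
    """Simulated nanoseconds achieved per day of wall-clock time."""
    runs_per_day = SECONDS_PER_DAY / wall_seconds
    return runs_per_day * simulated_ps / 1000.0  # 1000 ps = 1 ns


def days_for(target_ns: float, simulated_ps: float, wall_seconds: float) -> float:
    """Wall-clock days needed to simulate target_ns at the given rate."""
    return target_ns / ns_per_day(simulated_ps, wall_seconds)


if __name__ == "__main__":
    # JAC1000 figures from the slide: 1 ps simulated in 60 s.
    print(f"{ns_per_day(1.0, 60.0):.2f} ns/day")
    # 1 µs = 1000 ns: the timescale targeted by the viscosity studies.
    print(f"{days_for(1000.0, 1.0, 60.0):.0f} days for 1 µs")
```

At 1.44 ns/day, a microsecond-scale timeline takes on the order of two years of wall-clock time, which is why per-call MPI overhead matters here.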
CHARMM JAC1000 Scaling (Cluster with IB)
Sun Fire X4600 M2
Migrate CHARMM from cluster to a single distributed shared-memory (DSM) / ccNUMA node (16/32 cores).
→ eliminate IB interconnect
Sun Fire X4600 M2 - System Overview
CHARMM JAC1000 on Sun Fire X4600 M2
Moving CHARMM from cluster to DSM yields no significant effect!
Significant variations in MPI function call times:
→ Investigate MPI collectives.
→ Observe MPI_ALLREDUCE microbenchmark.
MPI_ALLREDUCE on Sun Fire X4600 M2
Countermeasures
→ reduce OS jitter (sporadic OS activity)!
BUT: Is core activity or cHT activity affecting performance?
If cHT: Imbalanced applications could generate similar patterns.
→ Investigate correlation t_MPI_ALLREDUCE ↔ cHT link load.
t_MPI_ALLREDUCE ↔ cHT
t_MPI_ALLREDUCE and cHT bandwidth of selected cHT links are correlated!
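One simple way to quantify such a relationship is the Pearson correlation coefficient between per-call MPI_ALLREDUCE times and the load on a given cHT link over the same intervals. The sketch below shows that computation only. The slides do not state which correlation measure was used, and the sample values are made up for illustration; they are not measurements from the Sun Fire X4600 M2.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical samples: call time in µs and link utilisation (0..1).
    t_allreduce_us = [10, 12, 11, 30, 28, 11]
    link_load = [0.10, 0.12, 0.10, 0.80, 0.75, 0.11]
    print(f"r = {pearson(t_allreduce_us, link_load):.3f}")
```

Computing the coefficient per link makes it possible to single out the links whose load tracks the call time, as opposed to links that stay unrelated.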
How to Measure cHT Link Load?
Data available in Opteron's Link Event registers (0F6h, 0F7h, 0F8h, 1F9h: HyperTransport Link x Transmit Bandwidth)
1 Manual instrumentation of code
  PAPI, cpclib, ...
  acceptable for benchmarks, infeasible for (our) applications
2 Dynamic instrumentation
  DTrace
  Monitor whole system (kernel, user, HW)
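Whichever way the link event counters are read, two successive readings have to be converted into a link load. The sketch below assumes that the transmit bandwidth events count 32-bit DWORDs and that the counters are 48 bits wide; both assumptions should be checked against AMD's documentation for the processor family in use. Reading the counters themselves (via PAPI, DTrace, or otherwise) is not shown, and the function name is illustrative.

```python
BYTES_PER_DWORD = 4  # assumption: the events count 32-bit DWORDs


def link_bandwidth(prev_count, cur_count, dt_seconds, counter_bits=48):
    """Bytes per second transmitted between two counter readings.

    counter_bits is an assumed counter width; the modulo handles a
    single wraparound between the two readings.
    """
    if dt_seconds <= 0:
        raise ValueError("readings must be taken at increasing times")
    delta = (cur_count - prev_count) % (1 << counter_bits)
    return delta * BYTES_PER_DWORD / dt_seconds


if __name__ == "__main__":
    # Hypothetical readings taken one second apart.
    print(link_bandwidth(1_000_000, 251_000_000, 1.0) / 1e9, "GB/s")
```

Sampling every link at a fixed interval yields one bandwidth series per link, which can then be compared against MPI call times.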
Summary
CHARMM JAC1000 (OpenMPI) does not benefit from Cluster/IB → ccNUMA/cHT,
one issue being varying execution times of MPI function calls,
which are largely due to varying activity on the cHT inter-chip network.
Immediate solution: Reduce OS activity → reduce cHT traffic → improve MPI performance.
Concern: Application imbalance or overlapping comm./comp. could also result in increased cHT traffic.
→ How robust against these cases is shared-memory OpenMPI via ccNUMA/cHT?
Future Work
Extract communication patterns from multiple link load observations.
Looking for a cHT collective comm. microbenchmark to differentiate between network load effects and additional traffic generated by the MPI implementation.
Alternative cHT observables?
Thanks for your attention!
Questions? Suggestions?
Research Lab Computational Technologies and Applications University of Vienna
http://rlcta.univie.ac.at