Analysis of MPI Inter-Chip Communication Patterns on Multi-Core Distributed Shared-Memory Computers

(1)

Manfred Mücke, Wilfried Gansterer

Research Lab Computational Technologies and Applications, University of Vienna

http://rlcta.univie.ac.at

WHTRA, February 9th, 2011

(2)

1 Motivation

2 Sun Fire X4600 M2

3 MPI_ALLREDUCE

4 Measurements

(3)

MPI applications on clusters are the workhorses of scientific computing.

Some applications (or systems simulated) do not scale beyond a certain number of cores.

→ Execute MPI applications on a single ccNUMA node (currently up to 48 cores):

→ lower latency, → higher bandwidth, → improved performance.

(4)

CHARMM (Chemistry at HARvard Macromolecular Mechanics)

CHARMM Molecular Dynamics (MD) Simulation: 20+ years of code development

Fortran (77/95) + MPI

Viscosity studies → long timelines (µs)

Benchmark system: JAC1000 (protein in water, 23,558 atoms, 1 ps)

t_{JAC1000} = 60 s → 1.44 ns/day
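
The throughput figure above follows from simple unit conversion. The sketch below reproduces it; the function names `ns_per_day` and `days_for` are illustrative and do not come from the slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 s of wall-clock time per day


def ns_per_day(wall_seconds: float, simulated_ps: float) -> float:
    """Simulated nanoseconds obtained per day of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    runs_per_day = SECONDS_PER_DAY / wall_seconds
    return runs_per_day * simulated_ps / 1000.0  # 1000 ps = 1 ns


def days_for(target_ns: float, wall_seconds: float, simulated_ps: float) -> float:
    """Wall-clock days needed to simulate target_ns at the given rate."""
    return target_ns / ns_per_day(wall_seconds, simulated_ps)


if __name__ == "__main__":
    # JAC1000: 1 ps of simulated time takes 60 s of wall-clock time.
    print(ns_per_day(60.0, 1.0))
    # A 1 µs timeline is 1000 ns of simulated time.
    print(days_for(1000.0, 60.0, 1.0))
```

At this rate a 1 µs timeline would take roughly 694 days, which illustrates why scaling matters for viscosity studies.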

(5)

CHARMM JAC1000 Scaling (Cluster with IB)

(6)

Sun Fire X4600 M2

Migrate CHARMM from cluster to a single distributed shared-memory (DSM) / ccNUMA node (16/32 cores).

→ Eliminate IB interconnect

(7)

Sun Fire X4600 M2 - System Overview

(8)

(9)

CHARMM JAC1000 on Sun Fire X4600 M2

Moving CHARMM from cluster to DSM yields no significant effect!

Significant variations in MPI function call times:

→ Investigate MPI collectives.

→ Observe MPI_ALLREDUCE microbenchmark.
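
A minimal sketch of how per-call timing variation could be captured and summarized. The slides do not say how the microbenchmark was implemented, so everything here is an assumption: `time_calls` and `summarize` are hypothetical helpers, and `op` stands in for the collective call (in an MPI run it could wrap, for example, mpi4py's `comm.Allreduce(sendbuf, recvbuf)`).

```python
import statistics
import time
from typing import Callable, Dict, List


def time_calls(op: Callable[[], object], repetitions: int) -> List[float]:
    """Time each of `repetitions` calls to `op` individually (seconds)."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples: List[float]) -> Dict[str, float]:
    """Report spread of per-call times; outliers show up in max_over_median."""
    median = statistics.median(samples)
    return {
        "min": min(samples),
        "median": median,
        "max": max(samples),
        # Guard against a zero median from very coarse timers.
        "max_over_median": max(samples) / median if median > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in workload (a local sum), NOT an MPI collective.
    data = list(range(10_000))
    print(summarize(time_calls(lambda: sum(data), 1000)))
```

Keeping every sample, rather than only a mean, is what makes sporadic slow calls visible.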

(10)

(11)

CHARMM JAC1000 on Sun Fire X4600 M2

(12)

MPI_ALLREDUCE on Sun Fire X4600 M2

(13)

Countermeasures

→ Reduce OS jitter (sporadic OS activity)!

BUT: Is core or cHT activity affecting performance?

If cHT: Imbalanced applications could generate similar patterns.

→ Investigate correlation t_{MPI_ALLREDUCE} ↔ cHT link load.

(14)

(15)

t_{MPI_ALLREDUCE} ↔ cHT

t_{MPI_ALLREDUCE} and cHT bandwidth of selected cHT links are correlated!
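
One way to quantify such a relationship is the Pearson correlation coefficient between per-call times and the link bandwidth observed over the same intervals. The slides do not state which measure was used, so this is a sketch under that assumption; the sample values are made up for illustration and are not measurements.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical data: call times (µs) and link bandwidth (MB/s).
    t_allreduce = [12.0, 12.5, 30.0, 12.2, 28.0]
    link_load = [100.0, 110.0, 400.0, 105.0, 380.0]
    print(pearson(t_allreduce, link_load))
```

A coefficient near 1 for one link and near 0 for another would point at the specific links involved, which matches the slide's wording "selected cHT links".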

(16)

How to Measure cHT Link Load?

Data available in Opteron's Link Event registers (0F6h, 0F7h, 0F8h, 1F9h: HyperTransport Link x Transmit Bandwidth)

1 Manual instrumentation of code

PAPI, cpclib, ...

acceptable for benchmarks, infeasible for (our) applications

2 Dynamic instrumentation

DTrace

Monitor whole system (kernel, user, HW)
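
Whichever tool reads the counters, two successive readings have to be turned into a bandwidth figure. The sketch below shows that step only. It assumes the Transmit Bandwidth events count 4-byte dwords and that the counters are 48 bits wide; both should be checked against the processor documentation for the CPU in use. The function name `link_bandwidth` is illustrative.

```python
DWORD_BYTES = 4  # assumption: the event counts 32-bit dwords


def link_bandwidth(count_start: int, count_end: int, interval_s: float,
                   counter_bits: int = 48) -> float:
    """Bytes per second sent on a link between two counter readings.

    counter_bits is an assumed counter width; the modulo handles a single
    wrap-around of the counter between the two readings.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = (count_end - count_start) % (1 << counter_bits)
    return delta * DWORD_BYTES / interval_s


if __name__ == "__main__":
    # Hypothetical readings taken one second apart.
    print(link_bandwidth(1_000_000, 251_000_000, 1.0))
```

Sampling all links at the same instants keeps the resulting series comparable with each other and with the MPI call timings.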

(17)

Summary

CHARMM JAC1000 (OpenMPI) does not benefit from Cluster/IB → ccNUMA/cHT,

one issue being varying execution times of MPI function calls,

which are largely due to varying activity on the cHT inter-chip network.

Immediate solution: Reduce OS activity → reduce cHT traffic → improve MPI performance.

Concern: Application imbalance or overlapping comm./comp. could also result in increased cHT traffic.

→ How robust against these cases is shared-memory OpenMPI via ccNUMA/cHT?

(18)

Future Work

Extract communication patterns from multiple link load observations.

Looking for cHT collective comm. microbenchmark to differentiate between network load effects and additional traffic generated by MPI implementation.

Alternative cHT observables?

(19)

Thanks for your attention!

Questions? Suggestions?
