
Analysis of MPI Inter-Chip Communication Patterns on Multi-Core Distributed Shared-Memory Computers

Manfred Mücke, Wilfried Gansterer

Research Lab Computational Technologies and Applications, University of Vienna

http://rlcta.univie.ac.at

WHTRA February 9th, 2011


(2)

1 Motivation

2 Sun Fire X4600 M2

3 MPI_ALLREDUCE

4 Measurements

(3)

MPI applications on clusters are the workhorses of scientific computing.

Some applications (or systems simulated) do not scale beyond a certain number of cores.

Execute MPI applications on a single ccNUMA node (currently up to 48 cores):

→ lower latency, → higher bandwidth, → improved performance.


(4)

CHARMM (Chemistry at HARvard Macromolecular Mechanics)

CHARMM Molecular Dynamics (MD) Simulation: 20+ years of code development

Fortran (77/95) + MPI

Viscosity studies → long timelines (µs)

Benchmark system: JAC1000 (protein in water, 23,558 atoms, 1 ps)

t_JAC1000 = 60 s → 1.44 ns/day
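The throughput figure above follows from simple unit conversion: 1 ps of simulated time per 60 s of wall-clock time. A minimal Python sketch of that arithmetic is given here; the function name `ns_per_day` is a hypothetical helper introduced for illustration and is not part of CHARMM.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 s of wall-clock time per day


def ns_per_day(simulated_ps: float, wall_seconds: float) -> float:
    """Simulated nanoseconds achieved per wall-clock day.

    simulated_ps: simulated time covered by one benchmark run, in picoseconds
    wall_seconds: wall-clock duration of that run, in seconds
    """
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    ps_per_day = simulated_ps * SECONDS_PER_DAY / wall_seconds
    return ps_per_day / 1000.0  # 1 ns = 1000 ps


if __name__ == "__main__":
    # JAC1000 figures from the slide: 1 ps simulated in 60 s.
    print(ns_per_day(1.0, 60.0))
```

At this rate, a microsecond-scale timeline (1000 ns) would take on the order of 1000 / 1.44 ≈ 694 days, which is why scaling matters for viscosity studies.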

(5)

CHARMM JAC1000 Scaling (Cluster with IB)


(6)

Sun Fire X4600 M2

Migrate CHARMM from cluster to single distributed shared-memory (DSM) / ccNUMA node (16/32 cores).

eliminate IB interconnect

(7)

Sun Fire X4600 M2 - System Overview


(8)
(9)

CHARMM JAC1000 on Sun Fire X4600 M2

Moving CHARMM from cluster to DSM yields no significant effect!

Significant variations in MPI function call times:

Investigate MPI collectives.

→ Observe MPI_ALLREDUCE microbenchmark.
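To make "variation in call times" concrete, the sketch below times a callable repeatedly and reports the spread of the samples. It is a stand-alone illustration, not the microbenchmark used in the talk: `time_calls`, `summarize`, and `stand_in_collective` are hypothetical names, and the stand-in does local work only. In a real measurement the callable would wrap the MPI_ALLREDUCE call of whichever MPI binding is in use.

```python
import statistics
import time


def time_calls(fn, repetitions=1000):
    """Call fn() repeatedly and return the per-call durations in seconds."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report min, median, max and the max/median ratio as a jitter indicator."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    median = statistics.median(ordered)
    ratio = ordered[-1] / median if median > 0 else float("inf")
    return {
        "min": ordered[0],
        "median": median,
        "max": ordered[-1],
        "max_over_median": ratio,
    }


def stand_in_collective():
    """Placeholder for an MPI_ALLREDUCE call (assumption: local work only)."""
    sum(range(1000))


if __name__ == "__main__":
    print(summarize(time_calls(stand_in_collective, repetitions=200)))
```

A large max/median ratio with a stable minimum points to sporadic disturbances rather than a uniformly slow collective.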


(10)
(11)

CHARMM JAC1000 on Sun Fire X4600 M2


(12)

MPI_ALLREDUCE on Sun Fire X4600 M2

(13)

Counter Measures

Reduce OS jitter (sporadic OS activity)!

BUT: Is core activity or cHT activity affecting performance?

If cHT: Imbalanced applications could generate similar patterns.

→ Investigate the correlation between t_MPI_ALLREDUCE and cHT link load.


(14)
(15)

t_MPI_ALLREDUCE vs. cHT

t_MPI_ALLREDUCE and the cHT bandwidth of selected cHT links are correlated!
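One common way to quantify such a relationship is the Pearson correlation coefficient between paired samples of call time and link bandwidth. The slides do not state which measure was used, so the following is only a sketch under that assumption; `pearson` is a hypothetical helper and the sample values in the usage lines are synthetic, not measurements.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sample sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Synthetic placeholder values, for illustration only.
    call_times = [1.0, 1.1, 3.0, 1.0, 2.9]   # per-interval t_MPI_ALLREDUCE
    link_load = [10.0, 12.0, 40.0, 9.0, 38.0]  # per-interval cHT link load
    print(pearson(call_times, link_load))
```

Values near +1 indicate that slow calls coincide with high link load; the coefficient alone does not show which one causes the other.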


(16)

How to Measure cHT Link Load?

Data available in Opteron's Link Event registers (0F6h, 0F7h, 0F8h, 1F9h: HyperTransport Link x Transmit Bandwidth)

1 Manual instrumentation of code

PAPI, cpclib, ..

acceptable for benchmarks, infeasible for (our) applications

2 Dynamic instrumentation

DTrace

Monitor whole system (kernel, user, HW)
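Whichever instrumentation route is taken, the raw result is a pair of counter readings per link and sampling interval, which then has to be turned into a bandwidth. The sketch below shows that conversion. It rests on two assumptions that should be checked against the processor documentation: that each counted event corresponds to one 4-byte dword on the link, and that the counter is 48 bits wide and may wrap between readings. The function name is hypothetical.

```python
BYTES_PER_DWORD = 4  # assumption: one counted event = one 4-byte dword


def link_bandwidth_bytes_per_s(count_start, count_end, interval_s,
                               counter_bits=48):
    """Average transmit bandwidth of one link over a sampling interval.

    count_start, count_end: raw counter readings at the interval boundaries
    interval_s: length of the interval in seconds
    counter_bits: assumed counter width, used to handle a single wrap-around
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Modular difference tolerates one counter wrap between the readings.
    delta = (count_end - count_start) % (1 << counter_bits)
    return delta * BYTES_PER_DWORD / interval_s


if __name__ == "__main__":
    # Illustrative readings only: 1000 events in one second.
    print(link_bandwidth_bytes_per_s(0, 1000, 1.0))
```

Sampling all links at the same interval boundaries keeps the per-link bandwidth series aligned with the MPI call timings they are compared against.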

(17)

Summary

CHARMM JAC1000 (OpenMPI) does not benefit from Cluster/IB → ccNUMA/cHT,

one issue being varying execution times of MPI function calls,

which are largely due to varying activity on the cHT inter-chip network.

Immediate solution: Reduce OS activity → reduce cHT traffic → improve MPI performance.

Concern: Application imbalance or overlapping comm./comp. could also result in increased cHT traffic.

→ How robust is shared-memory OpenMPI via ccNUMA/cHT against these cases?


(18)

Future Work

Extract communication patterns from multiple link load observations.

Looking for a cHT collective comm. microbenchmark to differentiate between network load effects and additional traffic generated by the MPI implementation.

Alternative cHT observables?

(19)

Thanks for your attention!

Questions? Suggestions?

Research Lab Computational Technologies and Applications, University of Vienna

http://rlcta.univie.ac.at
