
(1)

Performance of a Multi-Paradigm Messaging Runtime on Multicore Systems

Poster at Grid 2007

Omni Austin Downtown Hotel Austin Texas

September 19 2007

Xiaohong Qiu

Research Computing UITS, Indiana University Bloomington IN

Geoffrey Fox, H. Yuan, Seung-Hee Bae

Community Grids Laboratory, Indiana University Bloomington IN 47404

George Chrysanthakopoulos, Henrik Frystyk Nielsen

Microsoft Research, Redmond WA

(2)

Motivation

Exploring possible applications for tomorrow’s multicore chips (especially clients) with 64 or more cores (about 5 years away)

One plausible set of applications is data-mining of Internet and local sensors

Developing a library of efficient data-mining algorithms

Clustering (GIS, Cheminformatics) and Hidden Markov Methods (Speech Recognition)

(3)

Approach

Need 3 forms of parallelism

MPI Style

Dynamic threads as in pruned search

Coarse Grain functional parallelism

Do not use an integrated language approach as in DARPA HPCS

Rather use “mash-ups” or “workflow” to link together modules in optimized parallel libraries

Use Microsoft CCR/DSS where DSS is mash-up

(4)

Microsoft CCR

Supports exchange of messages between threads using named ports

FromHandler: Spawn threads without reading ports

Receive: Each handler reads one item from a single port

MultipleItemReceive: Each handler reads a prescribed number of items of a given type from a given port. Note items in a port can be general structures but all must have the same type.

MultiplePortReceive: Each handler reads one item of a given type from multiple ports.

JoinedReceive: Each handler reads one item from each of two ports. The items can be of different types.

Choice: Execute a choice of two or more port-handler pairings

Interleave: Consists of a set of arbiters (port-handler pairs) of 3 types that are Concurrent, Exclusive or Teardown (called at end for clean up). Concurrent arbiters are run concurrently but exclusive handlers are
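
As a rough illustration of the port-and-handler idea, the following Python sketch mimics three of the primitives listed above (Receive, MultipleItemReceive and JoinedReceive) with standard-library queues. It is not the CCR API: the `Port` class and the function names are hypothetical stand-ins, and the handlers here simply block on a queue instead of being scheduled by a runtime.

```python
import queue
import threading


class Port:
    """Hypothetical stand-in for a message port (not the CCR API)."""

    def __init__(self):
        self._items = queue.Queue()

    def post(self, item):
        self._items.put(item)

    def take(self):
        # Blocks until an item has been posted to this port.
        return self._items.get()


def receive(port, handler):
    """Receive: the handler reads one item from a single port."""
    return handler(port.take())


def multiple_item_receive(port, count, handler):
    """MultipleItemReceive: the handler reads `count` items from one port."""
    return handler([port.take() for _ in range(count)])


def joined_receive(port_a, port_b, handler):
    """JoinedReceive: the handler reads one item from each of two ports."""
    return handler(port_a.take(), port_b.take())


numbers, labels = Port(), Port()
results = []


def worker():
    results.append(receive(numbers, lambda x: x * 2))
    results.append(multiple_item_receive(numbers, 3, sum))
    results.append(joined_receive(numbers, labels, lambda n, s: f"{s}={n}"))


thread = threading.Thread(target=worker)
thread.start()
for value in (5, 1, 2, 3, 7):
    numbers.post(value)
labels.post("x")
thread.join()
print(results)
```

The worker thread consumes items in the order they were posted, so the three handlers see 5, then the batch 1, 2, 3, then the pair 7 and "x".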

(5)

Preliminary Results

Parallel Deterministic Annealing Clustering in C# with speed-up of 7 on Intel dual quad-core systems

Analysis of performance of Java, C, C# in MPI and dynamic threading with XP, Vista, Windows Server, Fedora, Redhat on Intel/AMD systems

Study of cache effects coming with MPI thread-based parallelism

Study of execution time fluctuations in

(6)

Machines Used

Intel8b: Dell Precision PWS690, 2 Intel Xeon CPUs E5355 at 2.66GHz, 8 cores, L2 Cache 4x4MB, Memory 4GB, Vista Ultimate 64bit, Fedora 7

C# Benchmark Computational unit: 1.188 µs

Intel8c: Dell Precision PWS690, 2 Intel Xeon CPUs E5345 at 2.33GHz, 8 cores, L2 Cache 4x4MB, Memory 8GB, Red Hat 5.0, Fedora 7

Intel8a: Dell Precision PWS690, 2 Intel Xeon CPUs E5320 at 1.86GHz, 8 cores, L2 Cache 4x4MB, Memory 8GB, XP Pro 64bit

C# Benchmark Computational unit: 1.696 µs

Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80GHz, 4 cores, L2 Cache 4x2MB, Memory 4GB, XP Pro 64bit

C# Benchmark Computational unit: 1.475 µs

AMD4: HPxw9300 workstation, 2 AMD Opteron CPUs Processor 275 at 2.19GHz, 4 cores, L2 Cache 4x1MB (summing both chips), Memory 4GB,

(7)

AMD4: 4 Core

CCR Overhead (µs) for a computation of 27.76 µs between messaging

Columns give the Number of Parallel Computations (- = no entry)

Pattern                                       1      2      3      4      7      8
Spawned: Pipeline                          1.76   4.52    4.4   4.84   1.42   8.54
Spawned: Shift                                -   4.48   4.62    4.8   0.84   8.94
Spawned: Two Shifts                           -   7.44    8.9  10.18  12.74  23.92
Rendezvous MPI: Pipeline                    3.7   5.88   6.52   6.74   8.54  14.98
Rendezvous MPI: Shift                         -    6.8   8.42   9.36   2.74  11.16
Rendezvous MPI: Exchange As Two Shifts        -   14.1   15.9  19.14  11.78   22.6
Rendezvous MPI: Exchange                      -  10.32   15.5   16.3   11.3  21.38

(8)

CCR Overhead (µs) for a computation of 29.5 µs between messaging

Columns give the Number of Parallel Computations (- = no entry)

Pattern                                       1      2      3      4      7      8
Spawned: Pipeline                          3.32    8.3   9.38  10.18   3.02  12.12
Spawned: Shift                                -    8.3   9.34  10.08   4.38  13.52
Spawned: Two Shifts                           -  17.64  19.32     21  28.74  44.02
Rendezvous MPI: Pipeline                   9.36  12.08  13.02  13.58  16.68  25.68
Rendezvous MPI: Shift                         -  12.56   13.7   14.4   4.72  15.94
Rendezvous MPI: Exchange As Two Shifts        -  23.76  27.48  30.64  22.14  36.16
Rendezvous MPI: Exchange                      -  18.48  24.02  25.76     20  34.56

(9)

CCR Overhead (µs) for a computation of 23.76 µs between messaging

Columns give the Number of Parallel Computations (- = no entry)

Pattern                                       1      2      3      4      7      8
Spawned: Pipeline                          1.58   2.44      3   2.94    4.5   5.06
Spawned: Shift                                -   2.42    3.2   3.38   5.26   5.14
Spawned: Two Shifts                           -   4.94    5.9   6.84  14.32  19.44
Rendezvous MPI: Pipeline                   2.48   3.96   4.52   5.78   6.82   7.18
Rendezvous MPI: Shift                         -   4.46   6.42   5.86  10.86  11.74
Rendezvous MPI: Exchange As Two Shifts        -    7.4  11.64  14.16  31.86  35.62
Rendezvous MPI: Exchange                      -   6.94  11.22   13.3  18.78  20.16

(10)

Machine        OS      Runtime       Grains   Parallelism  MPI Exchange Latency
Intel8c:gf12   Redhat  MPJE          Process  8            181
Intel8c:gf12   Redhat  MPICH2        Process  8            40.0
Intel8c:gf12   Redhat  MPICH2: Fast  Process  8            39.3
Intel8c:gf12   Redhat  Nemesis       Process  8            4.21
Intel8c:gf20   Fedora  MPJE          Process  8            157
Intel8c:gf20   Fedora  mpiJava       Process  8            111
Intel8c:gf20   Fedora  MPICH2        Process  8            64.2
Intel8b        Vista   MPJE          Process  8            170
Intel8b        Fedora  MPJE          Process  8            142
Intel8b        Fedora  mpiJava       Process  8            100
Intel8b        Vista   CCR           Thread   8            20.2
AMD4           XP      MPJE          Process  4            185
AMD4           Redhat  MPJE          Process  4            152
AMD4           Redhat  mpiJava       Process  4            99.4
AMD4           Redhat  MPICH2        Process  4            39.3
AMD4           XP      CCR           Thread   4            16.3
Intel4         XP      CCR           Thread   4            25.8

(11)

Overhead (latency) of AMD4 PC with 4 execution threads on MPI style Rendezvous Messaging for Shift and Exchange implemented either as two shifts or as custom CCR pattern

Axes: Stages (millions), Time
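
The Shift and Exchange patterns named in the caption above can be pictured on a ring of threads. The sketch below assumes that Shift means each thread sends one message to its neighbour on the ring, and that Exchange means each thread receives from both neighbours; these definitions and the function names are assumptions made for illustration and are not spelled out on the poster. The code simulates the data movement only and performs no real messaging or timing.

```python
def shift(values, step=1):
    """Assumed Shift: thread i sends its value to thread (i + step) mod n."""
    n = len(values)
    received = [None] * n
    for i, value in enumerate(values):
        received[(i + step) % n] = value
    return received


def exchange_as_two_shifts(values):
    """Assumed Exchange built from two shifts in opposite directions.

    Returns, for each thread, the pair (value from the left neighbour,
    value from the right neighbour).
    """
    from_left = shift(values, 1)
    from_right = shift(values, -1)
    return list(zip(from_left, from_right))


ring = [0, 1, 2, 3]  # one value per thread on a hypothetical 4-thread ring
print(shift(ring))
print(exchange_as_two_shifts(ring))
```

Building Exchange from two shifts needs two rounds of messaging, which is one plausible reason a custom single-step pattern could have a different overhead.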

(12)

Overhead (latency) of Intel8b PC with 8 execution threads on MPI style Rendezvous Messaging for Shift and Exchange implemented either as two shifts or as custom CCR pattern

Axes: Stages (millions), Time

(13)

MPI Exchange Latency on AMD4 for MPICH, mpiJava and MPJE

(14)

One thread on each core

Thread i stores its sum in A(i): this is separation 1, with no variable access interference but with cache line interference

Thread i stores its sum in A(X*i): this is separation X

Serious degradation if the separation is less than 64 bytes (X < 8 words) on Vista or XP

A is an array of doubles (8 bytes each)
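
The 8-word threshold follows from simple arithmetic, which the sketch below works through. It assumes a 64-byte cache line and an array that starts on a cache line boundary; both are assumptions for illustration, as are the function names. It only computes which line each thread's element falls on and does not measure any performance effect.

```python
def cache_lines_touched(n_threads, separation, element_bytes=8, line_bytes=64):
    """Cache line index of A(separation * i) for each thread i.

    Assumes the array starts on a cache line boundary.
    """
    return [(separation * i * element_bytes) // line_bytes
            for i in range(n_threads)]


def shares_a_cache_line(n_threads, separation):
    """True if at least two threads write to the same cache line."""
    lines = cache_lines_touched(n_threads, separation)
    return len(set(lines)) < len(lines)


for separation in (1, 4, 8):
    print(separation, cache_lines_touched(8, separation),
          shares_a_cache_line(8, separation))
```

With 8-byte doubles, separation 1 puts all 8 threads on one line, separation 4 pairs them up, and separation 8 gives every thread its own line.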

(15)

Deterministic Annealing

See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998

Parallelization is similar to ordinary K-Means as we are calculating global sums which are decomposed into local averages and then summed over components calculated in each processor
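
A minimal sketch of this decomposition, using ordinary K-Means with hard assignments on one-dimensional points rather than deterministic annealing itself (which weights each point softly across clusters). Each worker computes partial sums over its own block of points, and the partial results are then added to form the global sums. The function names and the thread pool are illustrative assumptions, not the authors' implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def local_sums(points, centers):
    """Partial sums for one worker: per-cluster sum of points and count."""
    k = len(centers)
    sums = [0.0] * k
    counts = [0] * k
    for x in points:
        nearest = min(range(k), key=lambda c: abs(x - centers[c]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def parallel_center_update(points, centers, n_threads=4):
    """One K-Means center update built from per-worker partial sums."""
    chunks = [points[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        partials = list(pool.map(lambda chunk: local_sums(chunk, centers), chunks))
    k = len(centers)
    total_sums = [sum(p[0][j] for p in partials) for j in range(k)]
    total_counts = [sum(p[1][j] for p in partials) for j in range(k)]
    # Keep the old center if no point was assigned to a cluster.
    return [total_sums[j] / total_counts[j] if total_counts[j] else centers[j]
            for j in range(k)]


new_centers = parallel_center_update([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 9.0])
print(new_centers)
```

Only the small per-cluster sums cross thread boundaries, so the communication cost per step does not grow with the number of points each worker holds.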

(16)

Clustering by Deterministic Annealing

(17)

Deterministically find cluster centers y_j using “mean field

(21)

Parallel Multicore Deterministic Annealing Clustering

Parallel Overhead on 8 Threads Intel 8b

Speedup = 8/(1+Overhead)

Overhead = Constant1 + Constant2/n

Constant1 = 0.05 to 0.1 (Client Windows)

10 Clusters

Plotted against 10000/(Grain Size n), where n = points per core
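
The overhead model can be evaluated directly. The sketch below codes the two formulas from this slide and inverts the speedup formula to recover the overhead implied by the speed-up of 7 reported in Preliminary Results. The poster gives no value for Constant2, so the value used here is a hypothetical placeholder, as are the function names.

```python
def overhead(n, constant1, constant2):
    """Overhead = Constant1 + Constant2 / n, with n = points per core."""
    return constant1 + constant2 / n


def speedup(n_threads, overhead_value):
    """Speedup = n_threads / (1 + Overhead)."""
    return n_threads / (1.0 + overhead_value)


def overhead_from_speedup(n_threads, measured_speedup):
    """Invert the speedup formula to get the implied overhead."""
    return n_threads / measured_speedup - 1.0


# A speed-up of 7 on 8 threads implies an overhead of 1/7, about 0.14.
print(overhead_from_speedup(8, 7.0))

# Large-grain limit (Constant2 / n -> 0) for the quoted Constant1 range.
print(speedup(8, 0.05), speedup(8, 0.1))

# Hypothetical Constant2 = 100.0, chosen only to show the 1/n dependence.
for n in (1000, 10000, 100000):
    print(n, speedup(8, overhead(n, 0.05, 100.0)))
```

In the large-grain limit the quoted Constant1 range of 0.05 to 0.1 caps the speedup on 8 threads at roughly 7.6 down to 7.3.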

(22)

Parallel Multicore Deterministic Annealing Clustering

“Constant1”

Increasing number of clusters decreases communication/memory bandwidth overheads

(23)

Intel 8b C# with 1 Cluster: Vista

Scaled Run Time for Clustering Kernel

Run time for same workload per thread normalized by number of data points

Expect Run Time independent of Number of threads if not for parallel and memory bandwidth overheads

Work per data point proportional to number of clusters

(24)

Intel 8b C# with 80 Clusters: Vista

Scaled Run Time for Clustering Kernel

Work per data point proportional to number of clusters so memory bandwidth and parallel overheads decrease as # clusters increase

(25)

Intel 8c C with 80 Clusters: Redhat

Run Time Fluctuations for Clustering Kernel

This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points
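
One way to compute such a fluctuation measure is sketched below: take the standard deviation across threads for each interval between synchronization points, then average over intervals. The timing data is made up for illustration, the function name is hypothetical, and the choice of the population standard deviation is an assumption; the poster does not say which estimator or normalization was used.

```python
import statistics


def mean_fluctuation(stage_times):
    """Average over stages of the standard deviation across threads.

    stage_times[s][t] is the run time of thread t in the interval
    before synchronization point s.
    """
    per_stage = [statistics.pstdev(times) for times in stage_times]
    return statistics.mean(per_stage)


# Made-up timings for 2 stages and 4 threads (arbitrary units).
example = [
    [1.0, 1.0, 1.0, 1.0],  # all threads finish together: deviation 0
    [1.0, 3.0, 1.0, 3.0],  # threads differ: deviation 1
]
print(mean_fluctuation(example))
```

Because every thread waits at each synchronization point for the slowest one, a larger spread within a stage translates directly into idle time.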

(26)

Intel 8c C with 80 Clusters: Redhat

Scaled Run Time for Clustering Kernel

Work per data point proportional to number of clusters so memory bandwidth and parallel overheads decrease as # clusters increase

(27)

Intel 8b C# with 1 Cluster: Vista

Run Time Fluctuations for Clustering Kernel

This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

(28)

Intel 8b C# with 80 Clusters: Vista

Run Time Fluctuations for Clustering Kernel

This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

(29)

DSS Section

We view the system as a collection of services, in this case:

One to supply data

One to run parallel clustering

One to visualize results, in this case by spawning a Google maps browser

Note we are clustering Indiana census data

(30)

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)

CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better

(31)

Clustering algorithm anneals by decreasing the distance scale and gradually finds more clusters as resolution improves
