Performance of a Multi-Paradigm Messaging Runtime on Multicore Systems

(1)


Poster at Grid 2007
Omni Austin Downtown Hotel, Austin, Texas
September 19, 2007

Xiaohong Qiu
Research Computing UITS, Indiana University, Bloomington IN

Geoffrey Fox, H. Yuan, Seung-Hee Bae
Community Grids Laboratory, Indiana University, Bloomington IN 47404

George Chrysanthakopoulos, Henrik Frystyk Nielsen
Microsoft Research, Redmond WA

(2)

Motivation

• Exploring possible applications for tomorrow's multicore chips (especially clients) with 64 or more cores (in about 5 years)
• One plausible set of applications is data mining of Internet and local sensors
• Developing a library of efficient data-mining algorithms
  – Clustering (GIS, Cheminformatics) and Hidden Markov Methods (Speech Recognition)

(3)

Approach

• Need 3 forms of parallelism
  – MPI style
  – Dynamic threads, as in pruned search
  – Coarse-grain functional parallelism
• Do not use an integrated language approach as in DARPA HPCS
• Rather use "mash-ups" or "workflow" to link together modules in optimized parallel libraries
• Use Microsoft CCR/DSS where DSS is mash-up

(4)

Microsoft CCR

• Supports exchange of messages between threads using named ports
• FromHandler: Spawn threads without reading ports
• Receive: Each handler reads one item from a single port
• MultipleItemReceive: Each handler reads a prescribed number of items of a given type from a given port. Note that items in a port can be general structures but all must have the same type.
• MultiplePortReceive: Each handler reads one item of a given type from multiple ports.
• JoinedReceive: Each handler reads one item from each of two ports. The items can be of different types.
• Choice: Execute a choice of two or more port-handler pairings
• Interleave: Consists of a set of arbiters (port-handler pairs) of 3 types: Concurrent, Exclusive or Teardown (called at end for clean up). Concurrent arbiters are run concurrently but exclusive handlers are not.
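
The port and handler pairings listed above can be illustrated with a small analogy. The sketch below is not the CCR API (CCR is a .NET library); it is a Python stand-in built on `queue.Queue` and `threading`, and the names `Port`, `multiple_item_receive` and `joined_receive` are hypothetical, chosen only to mirror the slide's vocabulary.

```python
import queue
import threading


class Port:
    """Hypothetical stand-in for a typed message port (not the CCR class)."""

    def __init__(self, item_type):
        self.item_type = item_type
        self._items = queue.Queue()

    def post(self, item):
        # All items in one port must have the same type.
        if not isinstance(item, self.item_type):
            raise TypeError("port only accepts " + self.item_type.__name__)
        self._items.put(item)

    def take(self):
        # Blocks until an item has been posted.
        return self._items.get()


def multiple_item_receive(port, count, handler):
    """Run handler once with a prescribed number of items from one port."""
    worker = threading.Thread(
        target=lambda: handler([port.take() for _ in range(count)]))
    worker.start()
    return worker


def joined_receive(port_a, port_b, handler):
    """Run handler once with one item from each of two ports."""
    worker = threading.Thread(
        target=lambda: handler(port_a.take(), port_b.take()))
    worker.start()
    return worker


results = {}
numbers, counts, labels = Port(int), Port(int), Port(str)

workers = [
    multiple_item_receive(
        numbers, 3, lambda items: results.update(total=sum(items))),
    joined_receive(
        counts, labels, lambda n, s: results.update(joined=f"{s}:{n}")),
]

# Handlers fire only once their ports hold the items they wait for.
for value in (1, 2, 3):
    numbers.post(value)
counts.post(7)
labels.post("x")

for worker in workers:
    worker.join()
print(results)
```

Each handler here owns a dedicated thread for simplicity; a real runtime would schedule handlers onto a shared pool of threads.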

(5)

Preliminary Results

• Parallel Deterministic Annealing Clustering in C# with speed-up of 7 on Intel systems with 2 quad-core chips
• Analysis of performance of Java, C, C# in MPI and dynamic threading with XP, Vista, Windows Server, Fedora, Redhat on Intel/AMD systems
• Study of cache effects coming with MPI thread-based parallelism
• Study of execution time fluctuations in

(6)

Machines Used

• Intel8b: Dell Precision PWS690, 2 Intel Xeon CPUs E5355 at 2.66GHz, 8 cores, L2 Cache 4x4MB, Memory 4GB, Vista Ultimate 64bit, Fedora 7. C# Benchmark Computational unit: 1.188 µs
• Intel8c: Dell Precision PWS690, 2 Intel Xeon CPUs E5345 at 2.33GHz, 8 cores, L2 Cache 4x4MB, Memory 8GB, Red Hat 5.0, Fedora 7
• Intel8a: Dell Precision PWS690, 2 Intel Xeon CPUs E5320 at 1.86GHz, 8 cores, L2 Cache 4x4MB, Memory 8GB, XP Pro 64bit
• Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80GHz, 4 cores, L2 Cache 4x2MB, Memory 4GB, XP Pro 64bit
• AMD4: HP xw9300 workstation, 2 AMD Opteron CPUs Processor 275 at 2.19GHz, 4 cores, L2 Cache 4x1MB (summing both chips), Memory 4GB

(7)

CCR Overhead for a computation of 27.76 µs between messaging
AMD4: 4 Core

Overhead (µs) by Number of Parallel Computations

Messaging       Pattern                      1      2      3      4      7      8
Spawned         Pipeline                  1.76   4.52    4.4   4.84   1.42   8.54
Spawned         Shift                            4.48   4.62    4.8   0.84   8.94
Spawned         Two Shifts                       7.44    8.9  10.18  12.74  23.92
Rendezvous MPI  Pipeline                   3.7   5.88   6.52   6.74   8.54  14.98
Rendezvous MPI  Shift                             6.8   8.42   9.36   2.74  11.16
Rendezvous MPI  Exchange As Two Shifts           14.1   15.9  19.14  11.78   22.6
Rendezvous MPI  Exchange                        10.32   15.5   16.3   11.3  21.38

(8)

CCR Overhead for a computation of 29.5 µs between messaging

Overhead (µs) by number of parallel computations

Messaging       Pattern                      1      2      3      4      7      8
Spawned         Pipeline                  3.32    8.3   9.38  10.18   3.02  12.12
Spawned         Shift                             8.3   9.34  10.08   4.38  13.52
Spawned         Two Shifts                      17.64  19.32     21  28.74  44.02
Rendezvous MPI  Pipeline                  9.36  12.08  13.02  13.58  16.68  25.68
Rendezvous MPI  Shift                           12.56   13.7   14.4   4.72  15.94
Rendezvous MPI  Exchange As Two Shifts          23.76  27.48  30.64  22.14  36.16
Rendezvous MPI  Exchange                        18.48  24.02  25.76     20  34.56

(9)

CCR Overhead for a computation of 23.76 µs between messaging

Overhead (µs) by number of parallel computations

Messaging       Pattern                      1      2      3      4      7      8
Spawned         Pipeline                  1.58   2.44      3   2.94    4.5   5.06
Spawned         Shift                            2.42    3.2   3.38   5.26   5.14
Spawned         Two Shifts                       4.94    5.9   6.84  14.32  19.44
Rendezvous MPI  Pipeline                  2.48   3.96   4.52   5.78   6.82   7.18
Rendezvous MPI  Shift                            4.46   6.42   5.86  10.86  11.74
Rendezvous MPI  Exchange As Two Shifts            7.4  11.64  14.16  31.86  35.62
Rendezvous MPI  Exchange                         6.94  11.22   13.3  18.78  20.16

(10)

Machine        OS      Runtime       Grains   Parallelism  MPI Exchange Latency
Intel8c:gf12   Redhat  MPJE          Process  8            181
                       MPICH2        Process  8            40.0
                       MPICH2: Fast  Process  8            39.3
                       Nemesis       Process  8            4.21
Intel8c:gf20   Fedora  MPJE          Process  8            157
                       mpiJava       Process  8            111
                       MPICH2        Process  8            64.2
Intel8b        Vista   MPJE          Process  8            170
               Fedora  MPJE          Process  8            142
               Fedora  mpiJava       Process  8            100
               Vista   CCR           Thread   8            20.2
AMD4           XP      MPJE          Process  4            185
               Redhat  MPJE          Process  4            152
               Redhat  mpiJava       Process  4            99.4
               Redhat  MPICH2        Process  4            39.3
               XP      CCR           Thread   4            16.3
Intel4         XP      CCR           Thread   4            25.8

(11)

Figure: Overhead (latency) of AMD4 PC with 4 execution threads on MPI-style Rendezvous Messaging for Shift and Exchange, implemented either as two shifts or as custom CCR pattern. Axis labels: Stages (millions), Time.

(12)

Figure: Overhead (latency) of Intel8b PC with 8 execution threads on MPI-style Rendezvous Messaging for Shift and Exchange, implemented either as two shifts or as custom CCR pattern. Axis labels: Stages (millions), Time.

(13)

Figure: MPI Exchange Latency on AMD4 for MPICH, mpiJava and MPJE.

(14)

• One thread on each core
• Thread i stores its sum in A(i): this is separation 1, with no variable access interference but with cache line interference
• Thread i stores its sum in A(X*i): this is separation X
• Serious degradation if the separation is less than 64 bytes (X < 8 words) on Vista or XP
• Each element of A is a double (8 bytes)
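
The arithmetic behind the 8-word threshold can be checked with a few lines. This is a minimal sketch under two assumptions that the slide does not state: the cache line is 64 bytes (inferred from the threshold quoted above) and A begins on a cache line boundary. The function names are illustrative.

```python
CACHE_LINE_BYTES = 64  # assumed line size, inferred from the 64-byte threshold
DOUBLE_BYTES = 8       # each element of A is an 8-byte double


def cache_line_of(thread, separation):
    """Index of the cache line holding A(separation * thread).

    Assumes A starts on a cache line boundary.
    """
    return (separation * thread * DOUBLE_BYTES) // CACHE_LINE_BYTES


def threads_sharing_a_line(num_threads, separation):
    """Count threads whose element lies on a line already used by another."""
    lines = [cache_line_of(i, separation) for i in range(num_threads)]
    return len(lines) - len(set(lines))


for separation in (1, 4, 8):
    print(separation, threads_sharing_a_line(8, separation))
```

With 8 threads, separation 1 puts every element on one line, separation 4 pairs the threads up, and separation 8 gives each thread its own line.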

(15)

Deterministic Annealing

• See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998
• Parallelization is similar to that of ordinary K-Means: we calculate global sums, which are decomposed into local averages calculated in each processor and then summed over processors
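
The decomposition of global sums into per-processor pieces can be sketched as follows. This example uses ordinary K-Means with hard nearest-center assignment on one-dimensional points, not the annealing update itself, and the function names and data are illustrative assumptions. The thread pool only shows the structure of the reduction; it is not meant as a performance model.

```python
from concurrent.futures import ThreadPoolExecutor


def local_sums(chunk, centers):
    """Per-cluster sum and count for the points owned by one worker."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in chunk:
        # Hard assignment to the nearest center (ordinary K-Means).
        j = min(range(len(centers)), key=lambda k: (x - centers[k]) ** 2)
        sums[j] += x
        counts[j] += 1
    return sums, counts


def update_centers(chunks, centers):
    """One iteration: local sums per worker, then a global reduction."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        partials = list(
            pool.map(lambda chunk: local_sums(chunk, centers), chunks))
    new_centers = []
    for j, old in enumerate(centers):
        total = sum(sums[j] for sums, _ in partials)
        count = sum(counts[j] for _, counts in partials)
        # Keep the old center if no point was assigned to it.
        new_centers.append(total / count if count else old)
    return new_centers


# Two workers, each owning three points.
chunks = [[0.0, 1.0, 10.0], [2.0, 11.0, 12.0]]
new_centers = update_centers(chunks, [0.0, 10.0])
print(new_centers)
```

Only the per-cluster sums and counts cross worker boundaries, so the communication per iteration does not grow with the number of points each worker holds.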

(16)

Clustering by Deterministic Annealing

(17)

Deterministically find cluster centers y_j using “mean field

(18)

(19)

(20)

(21)

Parallel Multicore Deterministic Annealing Clustering

Parallel Overhead on 8 Threads Intel 8b
Speedup = 8/(1+Overhead)
Overhead = Constant1 + Constant2/n, where the grain size n = points per core
Constant1 = 0.05 to 0.1 (Client Windows)
Axis label: 10000/(Grain Size n)
10 Clusters
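
As a worked instance of the two formulas above, the sketch below evaluates the speedup on 8 cores in the large-grain limit, where Constant2/n is negligible and the overhead reduces to Constant1. The slide gives no value for Constant2, so the value used in the last line is purely hypothetical.

```python
def overhead(n, constant1, constant2):
    """Overhead = Constant1 + Constant2/n, with n = points per core."""
    return constant1 + constant2 / n


def speedup(cores, overhead_value):
    """Speedup = cores / (1 + Overhead)."""
    return cores / (1.0 + overhead_value)


# Large-grain limit: the Constant2/n term vanishes.
best = speedup(8, 0.05)   # Constant1 at the low end of the quoted range
worst = speedup(8, 0.10)  # Constant1 at the high end of the quoted range
print(round(best, 2), round(worst, 2))

# Hypothetical Constant2, only to show how a small grain raises the overhead.
small_grain = speedup(8, overhead(n=1000, constant1=0.05, constant2=100.0))
print(round(small_grain, 2))
```

The large-grain values fall between roughly 7.3 and 7.6, in line with the speed-up of about 7 quoted for the 8-core systems.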

(22)

Parallel Multicore Deterministic Annealing Clustering

“Constant1”
Increasing number of clusters decreases communication/memory bandwidth overheads

(23)

Intel 8b C# with 1 Cluster: Vista
Scaled Run Time for Clustering Kernel

• Run time for same workload per thread, normalized by number of data points
• Expect run time to be independent of the number of threads if not for parallel and memory bandwidth overheads
• Work per data point is proportional to the number of clusters

(24)

Intel 8b C# with 80 Clusters: Vista
Scaled Run Time for Clustering Kernel

• Work per data point is proportional to the number of clusters, so memory bandwidth and parallel overheads decrease as the number of clusters increases

(25)

Intel 8c C with 80 Clusters: Redhat
Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

(26)

Intel 8c C with 80 Clusters: Redhat
Scaled Run Time for Clustering Kernel

• Work per data point is proportional to the number of clusters, so memory bandwidth and parallel overheads decrease as the number of clusters increases

(27)

Intel 8b C# with 1 Cluster: Vista
Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

(28)

Intel 8b C# with 80 Clusters: Vista
Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of the run time of the 8 threads between messaging synchronization points

(29)

DSS Section

• We view the system as a collection of services, in this case:
  – One to supply data
  – One to run parallel clustering
  – One to visualize results, in this case by spawning a Google maps browser
  – Note we are clustering Indiana census data

(30)

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)

• CGL measurements of Axis 2 show about 500 microseconds; DSS is 10 times better

(31)

The clustering algorithm anneals by decreasing the distance scale and gradually finds more clusters as the resolution improves
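
The behaviour described above can be reproduced in a toy setting. The sketch below is a simplified one-dimensional illustration, not the implementation measured in these slides: it uses soft assignments proportional to exp(-(x - y_j)^2 / T), two centers that start at the data mean, and a small nudge at each temperature to break symmetry. The cooling schedule, the nudge and every parameter value are assumptions made for the example.

```python
import math


def soft_assign(x, centers, temperature):
    """Probabilities p(j|x) proportional to exp(-(x - y_j)^2 / T)."""
    dists = [(x - y) ** 2 for y in centers]
    d_min = min(dists)  # shift exponents so the largest weight is exactly 1
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def anneal(points, num_centers=2, t_start=100.0, t_stop=0.01,
           cooling=0.5, sweeps=50, nudge=1e-3):
    """Lower the temperature step by step; record distinct centers per step."""
    mean = sum(points) / len(points)
    centers = [mean] * num_centers  # all centers start at the data mean
    history = []
    temperature = t_start
    while temperature > t_stop:
        # Assumed symmetry-breaking step: nudge the centers slightly apart.
        centers = [y + nudge * (j - (num_centers - 1) / 2)
                   for j, y in enumerate(centers)]
        for _ in range(sweeps):
            probs = [soft_assign(x, centers, temperature) for x in points]
            updated = []
            for j in range(num_centers):
                mass = sum(p[j] for p in probs)
                weighted = sum(p[j] * x for p, x in zip(probs, points))
                updated.append(weighted / mass if mass > 0 else centers[j])
            centers = updated
        distinct = len({round(y, 1) for y in centers})
        history.append((temperature, distinct))
        temperature *= cooling
    return centers, history


points = [0.0, 0.2, -0.2, 10.0, 10.2, 9.8]
centers, history = anneal(points)
print(history[0], history[-1], [round(y, 3) for y in centers])
```

At the highest temperature both centers sit at the data mean, so only one distinct cluster is visible; once the temperature falls far enough they separate and settle on the two groups.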

(32)

(33)

(34)

(35)

(36)

(37)