Performance Measurements of CCR and MPI on Multicore Systems

Expanded from a Poster at Grid 2007 Austin Texas

September 21 2007

Xiaohong Qiu

Research Computing UITS, Indiana University Bloomington IN

Geoffrey Fox, H. Yuan, Seung-Hee Bae

Community Grids Laboratory, Indiana University Bloomington IN 47404

George Chrysanthakopoulos, Henrik Frystyk Nielsen

Microsoft Research, Redmond WA

(2)

Motivation

• Exploring possible applications for tomorrow’s multicore chips (especially clients) with 64 or more cores (about 5 years)

• One plausible set of applications is data-mining of Internet and local sensors

• Developing Library of efficient data-mining algorithms

– Clustering (GIS, Cheminformatics) and Hidden Markov Methods (Speech Recognition)

(3)

Approach

• Need 3 forms of parallelism

– MPI Style

– Dynamic threads as in pruned search

– Coarse Grain functional parallelism

• Do not use an integrated language approach as in DARPA HPCS

• Rather use “mash-ups” or “workflow” to link together modules in optimized parallel libraries

• Use Microsoft CCR/DSS where DSS is

(4)

Microsoft CCR

• Supports exchange of messages between threads using named ports

• FromHandler: Spawn threads without reading ports

• Receive: Each handler reads one item from a single port

• MultipleItemReceive: Each handler reads a prescribed number of items of a given type from a given port. Note items in a port can be general structures but all must have the same type.

• MultiplePortReceive: Each handler reads one item of a given type from multiple ports.

• JoinedReceive: Each handler reads one item from each of two ports. The items can be of different type.

• Choice: Execute a choice of two or more port-handler pairings

• Interleave: Consists of a set of arbiters (port-handler pairs) of 3 types that are Concurrent, Exclusive or Teardown (called at end for clean up). Concurrent arbiters are run concurrently but exclusive handlers are
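
As a rough illustration of the receive patterns listed above, the following Python sketch mimics three of them (Receive, MultipleItemReceive, JoinedReceive) with ordinary FIFO queues. It is not the CCR API: the Port class and the three helper functions are hypothetical names invented for this sketch, the handlers run synchronously in the calling thread, and items are posted in advance so that nothing blocks.

```python
import queue


class Port:
    """Hypothetical stand-in for a typed port: a FIFO of items of one type."""

    def __init__(self, item_type):
        self.item_type = item_type
        self._items = queue.Queue()

    def post(self, item):
        # All items in a port must have the same type.
        if not isinstance(item, self.item_type):
            raise TypeError("all items in a port must have the same type")
        self._items.put(item)

    def take(self):
        return self._items.get()


def receive(port, handler):
    """Receive: the handler reads one item from a single port."""
    return handler(port.take())


def multiple_item_receive(port, count, handler):
    """MultipleItemReceive: the handler reads `count` items from one port."""
    return handler([port.take() for _ in range(count)])


def joined_receive(port_a, port_b, handler):
    """JoinedReceive: the handler reads one item from each of two ports."""
    return handler(port_a.take(), port_b.take())


ints = Port(int)
names = Port(str)
for value in (1, 2, 3, 4, 5):
    ints.post(value)
names.post("sum")

first = receive(ints, lambda item: item * 10)           # consumes 1
total = multiple_item_receive(ints, 3, sum)             # consumes 2, 3, 4
joined = joined_receive(names, ints,                    # consumes "sum" and 5
                        lambda name, item: f"{name}={item}")
```

The two ports in the joined case deliberately hold different types, matching the remark that joined items can be of different type.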

(5)

Preliminary Results

• Parallel Deterministic Annealing Clustering in C# with speed-up of 7 on Intel systems with 2 quad-core chips

• Analysis of performance of Java, C, C# in MPI and dynamic threading with XP, Vista, Windows Server, Fedora, Redhat on Intel/AMD systems

• Study of cache effects coming with MPI thread-based parallelism

(6)

Machines Used

Intel8b: Dell Precision PWS690, 2 Intel Xeon CPUs E5355 at 2.66GHz, 8 cores, L2 Cache 4x4MB, Memory 4GB, Vista Ultimate 64bit, Fedora 7

C# Benchmark Computational unit: 1.188 µs

Intel8c: Dell Precision PWS690, 2 Intel Xeon CPUs E5345 at 2.33GHz, 8 cores, L2 Cache 4x4MB, Memory 8GB, Red Hat 5.0, Fedora 7

Intel8a: Dell Precision PWS690, 2 Intel Xeon CPUs E5320 at 1.86GHz, 8 cores, L2 Cache 4x4MB, Memory 8GB, XP Pro 64bit

Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80GHz, 4 cores, L2 Cache 4x2MB, Memory 4GB, XP Pro 64bit

AMD4: HP xw9300 workstation, 2 AMD Opteron CPUs Processor 275 at 2.19GHz, 4 cores, L2 Cache 4x1MB (summing both chips), Memory 4GB,

(7)

(8)

AMD4: 4 Core

CCR Overhead for a computation of 27.76 µs between messaging

Overhead (µs) by Number of Parallel Computations

Pattern                                       1      2      3      4      7      8
Spawned: Pipeline                          1.76   4.52    4.4   4.84   1.42   8.54
Spawned: Shift                                -   4.48   4.62    4.8   0.84   8.94
Spawned: Two Shifts                           -   7.44    8.9  10.18  12.74  23.92
Rendezvous MPI: Pipeline                    3.7   5.88   6.52   6.74   8.54  14.98
Rendezvous MPI: Shift                         -    6.8   8.42   9.36   2.74  11.16
Rendezvous MPI: Exchange As Two Shifts        -   14.1   15.9  19.14  11.78   22.6
Rendezvous MPI: Exchange                      -  10.32   15.5   16.3   11.3  21.38

(9)

CCR Overhead for a computation of 29.5 µs between messaging

Overhead (µs) by Number of Parallel Computations

Pattern                                       1      2      3      4      7      8
Spawned: Pipeline                          3.32    8.3   9.38  10.18   3.02  12.12
Spawned: Shift                                -    8.3   9.34  10.08   4.38  13.52
Spawned: Two Shifts                           -  17.64  19.32     21  28.74  44.02
Rendezvous MPI: Pipeline                   9.36  12.08  13.02  13.58  16.68  25.68
Rendezvous MPI: Shift                         -  12.56   13.7   14.4   4.72  15.94
Rendezvous MPI: Exchange As Two Shifts        -  23.76  27.48  30.64  22.14  36.16
Rendezvous MPI: Exchange                      -  18.48  24.02  25.76     20  34.56

(10)

CCR Overhead for a computation of 23.76 µs between messaging

Overhead (µs) by Number of Parallel Computations

Pattern                                       1      2      3      4      7      8
Spawned: Pipeline                          1.58   2.44      3   2.94    4.5   5.06
Spawned: Shift                                -   2.42    3.2   3.38   5.26   5.14
Spawned: Two Shifts                           -   4.94    5.9   6.84  14.32  19.44
Rendezvous MPI: Pipeline                   2.48   3.96   4.52   5.78   6.82   7.18
Rendezvous MPI: Shift                         -   4.46   6.42   5.86  10.86  11.74
Rendezvous MPI: Exchange As Two Shifts        -    7.4  11.64  14.16  31.86  35.62
Rendezvous MPI: Exchange                      -   6.94  11.22   13.3  18.78  20.16

(11)

Overhead (latency) of AMD4 PC with 4 execution threads on MPI style Rendezvous Messaging for Shift and Exchange implemented either as two shifts or as custom CCR pattern

[Figure: Time versus Stages (millions)]

(12)

Overhead (latency) of Intel8b PC with 8 execution threads on MPI style Rendezvous Messaging for Shift and Exchange implemented either as two shifts or as custom CCR pattern

[Figure: Time versus Stages (millions)]

(13)

(14)

Machine        OS       Runtime        Grains    Parallelism   MPI Exchange Latency (µs)
Intel8c:gf12   Redhat   MPJE           Process   8             181
                        MPICH2         Process   8             40.0
                        MPICH2: Fast   Process   8             39.3
                        Nemesis        Process   8             4.21
Intel8c:gf20   Fedora   MPJE           Process   8             157
                        mpiJava        Process   8             111
                        MPICH2         Process   8             64.2
Intel8b        Vista    MPJE           Process   8             170
               Fedora   MPJE           Process   8             142
               Fedora   mpiJava        Process   8             100
               Vista    CCR            Thread    8             20.2
AMD4           XP       MPJE           Process   4             185
               Redhat   MPJE           Process   4             152
               Redhat   mpiJava        Process   4             99.4
               Redhat   MPICH2         Process   4             39.3
               XP       CCR            Thread    4             16.3
Intel4         XP       CCR            Thread    4             25.8

(15)

[Figure: MPICH, mpiJava and MPJE compared; x-axis: Stages (millions), 0 to 10]

(16)

[Figure: MPICH, mpiJava and MPJE compared; x-axis: Stages (millions), 0 to 10]

(17)

[Figure: MPICH, Nemesis and MPJE compared; x-axis: Stages (millions), 0 to 10]

(18)

(19)

Cache Line Interference

• Early implementations of our clustering algorithm showed large fluctuations due to the cache line interference effect discussed here and on the next slide in a simple case

• We have one thread on each core, each calculating a sum of the same complexity and storing the result in a common array A, with different cores using different array locations

• Thread i storing its sum in A(i) is separation 1: no variable access interference but cache line interference

• Thread i storing its sum in A(X*i) is separation X

• Serious degradation if X < 8 (64 bytes) with Windows

– Note each element of A is a double (8 bytes)
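
The arithmetic behind the X < 8 threshold can be checked with a small sketch. It assumes the 8-byte doubles and 64-byte cache lines quoted above, and additionally assumes that A starts exactly on a cache line boundary; the function names are invented for illustration.

```python
DOUBLE_BYTES = 8        # each element of A is a double
CACHE_LINE_BYTES = 64   # cache line size quoted on the slide


def cache_line_of(index):
    """Cache line number holding A[index], assuming A starts on a line boundary."""
    return (index * DOUBLE_BYTES) // CACHE_LINE_BYTES


def lines_used(separation, threads=8):
    """Cache line written by each thread i when it stores its sum in A[separation * i]."""
    return [cache_line_of(separation * i) for i in range(threads)]


def shares_a_line(separation, threads=8):
    """True if at least two threads write into the same cache line."""
    lines = lines_used(separation, threads)
    return len(set(lines)) < len(lines)


sep1 = lines_used(1)   # all eight threads write into line 0
sep4 = lines_used(4)   # threads share lines in pairs
sep8 = lines_used(8)   # every thread has a line to itself
```

Under these assumptions any separation below 8 leaves at least two threads on one line (for example, separation 7 puts A(0) and A(7) together), while separation 8 gives each thread its own line.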

(20)

Cache Line Interference

• Note measurements at a separation of 8 (and values between 8 and 1024 not shown) are essentially identical

• Measurements at 7 (not shown) are higher than those at 8 (except for Red Hat which shows essentially no enhancement at X<8)

• If the effects are due to co-location of thread variables in a 64 byte cache line, the array must be aligned with cache line boundaries

(21)

(22)

Deterministic Annealing

• See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998

• Parallelization is similar to ordinary K-Means as we are calculating global sums which are decomposed into local averages and then summed over components calculated in each processor

• Many similar data mining algorithms (such as annealing for E-M expectation maximization) have high parallel efficiency and avoid local minima

• For more details see

– http://grids.ucs.indiana.edu/ptliupages/presentations/Grid2007PosterSept19-07.ppt and
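
The local-sum-then-global-sum structure described above can be sketched as follows. This is a minimal illustration using ordinary K-Means hard assignment in one dimension as a stand-in; it is not the deterministic annealing code measured in these slides, and the function names and data values are made up.

```python
def local_sums(points, centers):
    """One worker's contribution: per-cluster coordinate sums and point counts."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in points:
        # Assign the point to its nearest center (K-Means style hard assignment).
        nearest = min(range(len(centers)), key=lambda k: abs(x - centers[k]))
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def combine(partials):
    """Global reduction: add the local sums and counts, then form new centers.

    Assumes every cluster received at least one point overall.
    """
    n_clusters = len(partials[0][0])
    total_sums = [sum(p[0][k] for p in partials) for k in range(n_clusters)]
    total_counts = [sum(p[1][k] for p in partials) for k in range(n_clusters)]
    return [s / c for s, c in zip(total_sums, total_counts)]


centers = [0.0, 10.0]
blocks = [[1.0, 2.0, 9.0], [3.0, 11.0, 10.0]]   # one block of points per worker

partials = [local_sums(block, centers) for block in blocks]
new_centers = combine(partials)

# The same update computed by a single worker over all points.
serial_centers = combine([local_sums(blocks[0] + blocks[1], centers)])
```

Each worker touches only its own block of points, and the only data exchanged is one sum and one count per cluster, which is why the communication cost stays small relative to the computation.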

(23)

Parallel Multicore Deterministic Annealing Clustering

Parallel Overhead on 8 Threads Intel 8b

[Figure: 10 Clusters; x-axis: 10000/(Grain Size n), where n = points per core]

Speedup = 8/(1+Overhead)

Overhead = Constant1 + Constant2/n

Constant1 = 0.05 to 0.1 (Client Windows) due to thread runtime fluctuations
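
A short worked example of the two formulas on this slide. The Constant1 range (0.05 to 0.1) and the 8 threads come from the slide; the value used for Constant2 is purely hypothetical, chosen only to show the 1/n trend, since no value survives in this text.

```python
def overhead(n, constant1, constant2):
    """Overhead = Constant1 + Constant2 / n, with n = points per core."""
    return constant1 + constant2 / n


def speedup(threads, overhead_value):
    """Speedup = threads / (1 + Overhead); the slide uses threads = 8."""
    return threads / (1.0 + overhead_value)


# Large-grain limit (Constant2 / n negligible) for the quoted Constant1 range.
best = speedup(8, 0.05)    # roughly 7.62
worst = speedup(8, 0.1)    # roughly 7.27

# Constant2 = 500 is a made-up value used only to illustrate the trend.
small_grain = speedup(8, overhead(10_000, 0.05, 500.0))
large_grain = speedup(8, overhead(500_000, 0.05, 500.0))
```

With Constant1 in the quoted range, the speed-up on 8 threads cannot exceed about 7.3 to 7.6 however large the grain size becomes.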

(24)

Parallel Multicore Deterministic Annealing Clustering

Parallel Overhead for large (2M points) Indiana Census clustering on 8 Threads Intel 8

“Constant1”

Increasing number of clusters decreases communication/memory bandwidth overheads

(25)

Scaled Speed up Tests

• The full clustering algorithm involves different values of the number of clusters N_C as computation progresses

• The amount of computation per data point is proportional to N_C and so overhead due to memory bandwidth (cache misses) declines as N_C increases

• We did a set of tests on the clustering kernel with fixed N_C

• Further we adopted the scaled speed-up approach, looking at the performance as a function of number of parallel threads with a constant number of data points assigned to each thread

– This contrasts with the fixed problem size scenario where the number of data points per thread is inversely proportional to number of threads

• We plot run time for same workload per thread divided by number of data points multiplied by number of clusters multiplied by time at smallest data set (10,000 data points per thread)

• Expect this normalized run time to be independent of number of threads if not for parallel and memory bandwidth overheads
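
The difference between the two test designs can be made concrete with a few numbers. In this sketch the 10,000 points per thread is the smallest data set mentioned above, while the 80,000 point total for the fixed-size case and the thread counts are hypothetical values chosen for illustration.

```python
def fixed_size_points_per_thread(total_points, threads):
    """Fixed problem size: points per thread shrink as threads are added."""
    return total_points // threads


def scaled_total_points(points_per_thread, threads):
    """Scaled speed-up: each thread keeps the same number of points."""
    return points_per_thread * threads


threads_tested = [1, 2, 4, 8]

# Fixed problem size: 80,000 points in total (hypothetical figure).
fixed = [fixed_size_points_per_thread(80_000, t) for t in threads_tested]

# Scaled speed-up: 10,000 points per thread, so the total grows with threads.
scaled = [scaled_total_points(10_000, t) for t in threads_tested]
```

In the scaled design the work per thread is unchanged as threads are added, so any growth in run time reflects overhead rather than a change in problem size.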

(26)

Intel 8b C with 1 Cluster: Vista Scaled Run Time for Clustering Kernel

• Note the smallest dataset has highest overheads as we increase the number of threads

– Not clear why this is

(27)

Intel 8b C with 80 Clusters: Vista Scaled Run Time for Clustering Kernel

• As we increase number of clusters, the effects at 10,000 data points decrease

[Figure x-axis: Number of Threads]

(28)

Intel 8b C# with 1 Cluster: Vista Scaled Run Time for Clustering Kernel

• C# is similar to C with larger effects

(29)

Intel 8b C# with 1 Cluster: Vista Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of run time of the 8 threads between messaging synchronization points
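
One plausible reading of this fluctuation measure is sketched below: for each interval between synchronization points, take the standard deviation of the run times across threads, then average over intervals. The use of the population standard deviation, the absence of any normalization, and the timing values are all assumptions made for the example, not details given in the slides.

```python
from statistics import mean, pstdev


def average_fluctuation(run_times):
    """Mean over intervals of the standard deviation across threads.

    run_times[s][t] is the run time of thread t in the interval ending at
    synchronization point s.
    """
    return mean(pstdev(interval) for interval in run_times)


# Made-up timings (arbitrary units) for 2 intervals and 4 threads.
timings = [
    [10.0, 10.0, 10.0, 10.0],   # perfectly balanced: deviation 0
    [9.0, 11.0, 9.0, 11.0],     # deviation 1
]
fluctuation = average_fluctuation(timings)
```

Because every thread waits at the synchronization point for the slowest one, a larger spread within an interval translates directly into idle time and parallel overhead.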

(30)

Intel 8b C# with 80 Clusters: Vista Scaled Run Time for Clustering Kernel

• C# is similar to C with larger effects

(31)

AMD4 C with 1 Cluster: XP Scaled Run Time for Clustering Kernel

• This is significantly more stable than Intel runs and shows little or no memory bandwidth effect

(32)

AMD4 C# with 1 Cluster: XP Scaled Run Time for Clustering Kernel

• This is significantly more stable than Intel C# 1 Cluster runs

(33)

AMD4 C# with 80 Clusters: XP Scaled Run Time for Clustering Kernel

• This is broadly similar to the 80 Cluster Intel C# runs, unlike the one cluster case, which was very different

(34)

AMD4 C# with 1 Cluster: Windows Server Scaled Run Time for Clustering Kernel

• This is significantly more stable than Intel C# runs

(35)

AMD4 C# with 80 Clusters: Windows Server Scaled Run Time for Clustering Kernel

• Curiously run time decreases a bit as number of threads increases in some AMD4 scenarios

(36)

Intel 8c C with 1 Cluster: Red Hat Scaled Run Time for Clustering Kernel

• Deviations from “perfect” scaled speed-up are much less for Red Hat than for Windows

(37)

Intel 8c C with 80 Clusters: Red Hat Scaled Run Time for Clustering Kernel

• Deviations from “perfect” scaled speed-up are much less for Red Hat

(38)

Intel 8b C# with 80 Clusters: Vista Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of run time of the 8 threads between messaging synchronization points

(39)

AMD4 with 1 Cluster: Windows Server Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of run time of the 8 threads between messaging synchronization points

• XP (not shown) is similar

(40)

Intel 8c with 80 Clusters: Redhat Run Time Fluctuations for Clustering Kernel

• This is the average of the standard deviation of run time of the 8 threads between messaging synchronization points

(41)

DSS Section

• We view the system as a collection of services, in this case:

– One to supply data

– One to run parallel clustering

– One to visualize results, in this case by spawning a Google maps browser

– Note we are clustering Indiana census data

(42)

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)

Measurements of Axis 2 show about 500 microseconds; DSS is 10 times better

(43)

Clustering algorithm anneals by decreasing the distance scale and gradually finds more clusters as resolution improves
