Performance Measurements of CCR and MPI on Multicore Systems


Expanded from a Poster at Grid 2007 Austin Texas

September 21 2007

Xiaohong Qiu

Research Computing UITS, Indiana University Bloomington IN

Geoffrey Fox, H. Yuan, Seung-Hee Bae

Community Grids Laboratory, Indiana University Bloomington IN 47404

George Chrysanthakopoulos, Henrik Frystyk Nielsen

Microsoft Research, Redmond WA

(2)

Motivation

Exploring possible applications for tomorrow’s multicore chips (especially clients) with 64 or more cores (about 5 years)

One plausible set of applications is data-mining of Internet and local sensors

Developing Library of efficient data-mining algorithms

Clustering (GIS, Cheminformatics) and Hidden Markov Methods (Speech Recognition)

(3)

Approach

Need 3 forms of parallelism

MPI Style

Dynamic threads as in pruned search

Coarse Grain functional parallelism

Do not use an integrated language approach as in DARPA HPCS

Rather use “mash-ups” or “workflow” to link together modules in optimized parallel libraries

Use Microsoft CCR/DSS where DSS is

(4)

Microsoft CCR

Supports exchange of messages between threads using named ports

FromHandler: Spawn threads without reading ports

Receive: Each handler reads one item from a single port

MultipleItemReceive: Each handler reads a prescribed number of items of a given type from a given port. Note items in a port can be general structures but all must have the same type.

MultiplePortReceive: Each handler reads one item of a given type from multiple ports.

JoinedReceive: Each handler reads one item from each of two ports. The items can be of different type.

Choice: Execute a choice of two or more port-handler pairings

Interleave: Consists of a set of arbiters (port-handler pairs) of 3 types that are Concurrent, Exclusive or Teardown (called at end for clean up). Concurrent arbiters are run concurrently but exclusive handlers are
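
The receive primitives listed on this slide can be pictured with a small sketch. This is not the CCR API (CCR is a .NET library): the `Port` class and the three helper functions are hypothetical Python stand-ins built on a thread-safe queue, and they assume each handler simply blocks until its items arrive.

```python
import queue
import threading


class Port:
    """Hypothetical stand-in for a typed port: a thread-safe FIFO of items."""

    def __init__(self):
        self._items = queue.Queue()

    def post(self, item):
        self._items.put(item)

    def take(self):
        # Blocks until an item is available.
        return self._items.get()


def receive(port, handler):
    """Receive: the handler reads one item from a single port."""
    return handler(port.take())


def multiple_item_receive(port, count, handler):
    """MultipleItemReceive: the handler reads `count` items from one port."""
    return handler([port.take() for _ in range(count)])


def joined_receive(port_a, port_b, handler):
    """JoinedReceive: the handler reads one item from each of two ports."""
    return handler(port_a.take(), port_b.take())


def demo():
    numbers, labels = Port(), Port()
    # Four worker threads each post one partial result to the same port.
    workers = [threading.Thread(target=numbers.post, args=(i * i,))
               for i in range(4)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    total = multiple_item_receive(numbers, 4, sum)
    # Items on the two joined ports may have different types.
    numbers.post(total)
    labels.post("sum of squares")
    return joined_receive(numbers, labels, lambda n, text: f"{text} = {n}")
```

Calling `demo()` collects the four partial results with one multiple-item receive and then pairs the total with a label through a joined receive.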

(5)

Preliminary Results

Parallel Deterministic Annealing Clustering in C# with speed-up of 7 on Intel 2 quadcore systems

Analysis of performance of Java, C, C# in MPI and dynamic threading with XP, Vista, Windows Server, Fedora, Redhat on Intel/AMD systems

Study of cache effects coming with MPI thread-based parallelism

(6)

Machines Used

Intel8b: Dell Precision PWS690, 2 Intel Xeon CPUs E5355 at 2.66GHz, 8 cores, L2 Cache 4x4M, Memory 4GB, Vista Ultimate 64bit, Fedora 7

C# Benchmark Computational unit: 1.188 µs

Intel8c: Dell Precision PWS690, 2 Intel Xeon CPUs E5345 at 2.33GHz, 8 cores, L2 Cache 4x4M, Memory 8GB, Red Hat 5.0, Fedora 7

Intel8a: Dell Precision PWS690, 2 Intel Xeon CPUs E5320 at 1.86GHz, 8 cores, L2 Cache 4x4M, Memory 8GB, XP Pro 64bit

C# Benchmark Computational unit: 1.696 µs

Intel4: Dell Precision PWS670, 2 Intel Xeon Paxville CPUs at 2.80GHz, 4 cores, L2 Cache 4x2MB, Memory 4GB, XP Pro 64bit

C# Benchmark Computational unit: 1.475 µs

AMD4: HP xw9300 workstation, 2 AMD Opteron CPUs Processor 275 at 2.19GHz, 4 cores, L2 Cache 4x1MB (summing both chips), Memory 4GB,

(7)
(8)

CCR Overhead for a computation of 27.76 µs between messaging

AMD4: 4 Core

Overhead (µs) by Number of Parallel Computations

Pattern                       1      2      3      4      7      8
Spawned
  Pipeline                 1.76   4.52    4.4   4.84   1.42   8.54
  Shift                       -   4.48   4.62    4.8   0.84   8.94
  Two Shifts                  -   7.44    8.9  10.18  12.74  23.92
MPI
  Pipeline                  3.7   5.88   6.52   6.74   8.54  14.98
  Shift                       -    6.8   8.42   9.36   2.74  11.16
  Exchange As Two Shifts      -   14.1   15.9  19.14  11.78   22.6
  Exchange                    -  10.32   15.5   16.3   11.3  21.38

(9)

CCR Overhead for a computation of 29.5 µs between messaging

Overhead (µs) by Number of Parallel Computations

Pattern                       1      2      3      4      7      8
Spawned
  Pipeline                 3.32    8.3   9.38  10.18   3.02  12.12
  Shift                       -    8.3   9.34  10.08   4.38  13.52
  Two Shifts                  -  17.64  19.32     21  28.74  44.02
Rendezvous MPI
  Pipeline                 9.36  12.08  13.02  13.58  16.68  25.68
  Shift                       -  12.56   13.7   14.4   4.72  15.94
  Exchange As Two Shifts      -  23.76  27.48  30.64  22.14  36.16
  Exchange                    -  18.48  24.02  25.76     20  34.56

(10)

CCR Overhead for a computation of 23.76 µs between messaging

Overhead (µs) by Number of Parallel Computations

Pattern                       1      2      3      4      7      8
Spawned
  Pipeline                 1.58   2.44      3   2.94    4.5   5.06
  Shift                       -   2.42    3.2   3.38   5.26   5.14
  Two Shifts                  -   4.94    5.9   6.84  14.32  19.44
Rendezvous MPI
  Pipeline                 2.48   3.96   4.52   5.78   6.82   7.18
  Shift                       -   4.46   6.42   5.86  10.86  11.74
  Exchange As Two Shifts      -    7.4  11.64  14.16  31.86  35.62
  Exchange                    -   6.94  11.22   13.3  18.78  20.16

(11)

Overhead (latency) of AMD4 PC with 4 execution threads on MPI style Rendezvous Messaging for Shift and Exchange implemented either as two shifts or as custom CCR pattern

Plot axes: Time, Stages (millions)

(12)

Overhead (latency) of Intel8b PC with 8 execution threads on MPI style Rendezvous Messaging for Shift and Exchange implemented either as two shifts or as custom CCR pattern

Plot axes: Time, Stages (millions)

(13)
(14)

Machine       OS      Runtime       Grains   Parallelism  MPI Exchange Latency
Intel8c:gf12  Redhat  MPJE          Process  8            181
                      MPICH2        Process  8            40.0
                      MPICH2: Fast  Process  8            39.3
                      Nemesis       Process  8            4.21
Intel8c:gf20  Fedora  MPJE          Process  8            157
                      mpiJava       Process  8            111
                      MPICH2        Process  8            64.2
Intel8b       Vista   MPJE          Process  8            170
              Fedora  MPJE          Process  8            142
              Fedora  mpiJava       Process  8            100
              Vista   CCR           Thread   8            20.2
AMD4          XP      MPJE          Process  4            185
              Redhat  MPJE          Process  4            152
              Redhat  mpiJava       Process  4            99.4
              Redhat  MPICH2        Process  4            39.3
              XP      CCR           Thread   4            16.3
Intel4        XP      CCR           Thread   4            25.8

(15)

Plot legend: MPICH, mpiJava, MPJE; axis: Stages (millions)

(16)

Plot legend: MPICH, mpiJava, MPJE; axis: Stages (millions)

(17)

Plot legend: MPICH, Nemesis, MPJE; axis: Stages (millions)

(18)
(19)

Cache Line Interference

Early implementations of our clustering algorithm showed large fluctuations due to the cache line interference effect discussed here and on the next slide in a simple case

We have one thread on each core, each calculating a sum of the same complexity and storing the result in a common array A, with different cores using different array locations

Thread i storing its sum in A(i) is separation 1: no variable access interference, but cache line interference

Thread i storing its sum in A(X*i) is separation X

Serious degradation if X < 8 (64 bytes) with Windows

Note each element of A is a double (8 bytes)
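
The separation arithmetic can be checked with a short sketch. It only maps array slots to cache lines and does not measure timing. It assumes 8-byte doubles, 64-byte cache lines and an array that starts on a cache line boundary; the function names are illustrative, not taken from the benchmark code.

```python
CACHE_LINE_BYTES = 64  # assumed cache line size
DOUBLE_BYTES = 8       # size of one element of A


def cache_line_of(thread, separation, base_offset=0):
    """Index of the cache line holding A(separation * thread)."""
    byte_offset = base_offset + thread * separation * DOUBLE_BYTES
    return byte_offset // CACHE_LINE_BYTES


def threads_sharing_lines(num_threads, separation, base_offset=0):
    """Groups of threads whose result slots fall in the same cache line."""
    lines = {}
    for thread in range(num_threads):
        line = cache_line_of(thread, separation, base_offset)
        lines.setdefault(line, []).append(thread)
    return [group for group in lines.values() if len(group) > 1]
```

With 8 threads, separation 1 puts all eight slots in one line, separation 4 pairs the threads up, and separation 8 gives every thread its own line, which is consistent with the X < 8 threshold on the slide.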

(20)

Cache Line Interference

Note measurements at a separation of 8 (and at values between 8 and 1024, not shown) are essentially identical

Measurements at 7 (not shown) are higher than those at 8 (except for Red Hat, which shows essentially no enhancement at X<8)

If the effects are due to co-location of thread variables in a 64 byte cache line, the array must be aligned with cache line boundaries

(21)
(22)

Deterministic Annealing

See K. Rose, "Deterministic Annealing for Clustering, Compression, Classification, Regression, and Related Optimization Problems," Proceedings of the IEEE, vol. 86, pp. 2210-2239, November 1998

Parallelization is similar to ordinary K-Means as we are calculating global sums which are decomposed into local averages and then summed over components calculated in each processor

There are many similar data mining algorithms (such as annealing for E-M expectation maximization) which have high parallel efficiency and avoid local minima

For more details see http://grids.ucs.indiana.edu/ptliupages/presentations/Grid2007PosterSept19-07.ppt and
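
The decomposition of global sums into per-processor pieces can be sketched as follows. This shows an ordinary K-Means center update with hard assignments, not the deterministic annealing algorithm itself, on one-dimensional points; the function names and the use of a Python thread pool are assumptions made for illustration only.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sums(points, centers):
    """Per-worker sums and counts of the points nearest to each center."""
    k = len(centers)
    sums = [0.0] * k
    counts = [0] * k
    for x in points:
        nearest = min(range(k), key=lambda c: (x - centers[c]) ** 2)
        sums[nearest] += x
        counts[nearest] += 1
    return sums, counts


def update_centers(points, centers, num_workers=4):
    """One center update: local sums per worker, then a global sum."""
    chunks = [points[w::num_workers] for w in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(lambda chunk: partial_sums(chunk, centers),
                                chunks))
    k = len(centers)
    new_centers = []
    for j in range(k):
        total = sum(local_sums[j] for local_sums, _ in results)
        count = sum(local_counts[j] for _, local_counts in results)
        # Keep the old center if no point was assigned to it.
        new_centers.append(total / count if count else centers[j])
    return new_centers
```

For example, `update_centers([0.0, 1.0, 2.0, 10.0, 11.0, 12.0], [0.0, 10.0])` combines four local results into the two global averages.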

(23)

Parallel Multicore Deterministic Annealing Clustering

Parallel Overhead on 8 Threads Intel 8b

Speedup = 8/(1+Overhead)

Overhead = Constant1 + Constant2/n, where n = points per core (Grain Size)

Constant1 = 0.05 to 0.1 (Client Windows) due to thread runtime fluctuations

Plot axis: 10000/(Grain Size n = points per core)

10 Clusters
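
As a worked check of the two formulas on this slide, the sketch below evaluates them for the quoted range of Constant1. The value used for Constant2 in the usage note is purely hypothetical, since the slide text does not give one.

```python
def overhead(n, constant1, constant2):
    """Overhead = Constant1 + Constant2/n, with n = points per core."""
    return constant1 + constant2 / n


def speedup(overhead_value, threads=8):
    """Speedup = threads/(1 + Overhead); the slide uses 8 threads."""
    return threads / (1.0 + overhead_value)
```

With the overhead at Constant1 alone (large n), `speedup(0.05)` is about 7.6 and `speedup(0.1)` is about 7.3. A hypothetical Constant2 of 500 at n = 10000 would raise the overhead from 0.05 to 0.1.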

(24)

Parallel Multicore Deterministic Annealing Clustering

Parallel Overhead for large (2M points) Indiana Census clustering on 8 Threads Intel 8

“Constant1”

Increasing number of clusters decreases communication/memory bandwidth overheads

(25)

Scaled Speed up Tests

The full clustering algorithm involves different values of the number of clusters N_C as computation progresses

The amount of computation per data point is proportional to N_C and so overhead due to memory bandwidth (cache misses) declines as N_C increases

We did a set of tests on the clustering kernel with fixed N_C

Further we adopted the scaled speed-up approach, looking at the performance as a function of number of parallel threads with constant number of data points assigned to each thread

This contrasts with the fixed problem size scenario, where the number of data points per thread is inversely proportional to number of threads

We plot Run time for same workload per thread divided by number of data points multiplied by number of clusters multiplied by time at smallest data set (10,000 data points per thread)

Expect this normalized run time to be independent of number of threads if not for parallel and memory bandwidth overheads
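
One reading of the plotted quantity is sketched below. It assumes the normalization is the run time per (data point x cluster), divided by the same quantity measured at the smallest data set of 10,000 points per thread; the slide text does not spell the formula out, so both the function and its parameter names are hypothetical.

```python
def scaled_run_time(run_time, points_per_thread, clusters,
                    ref_time, ref_points_per_thread=10000):
    """Run time per point per cluster, relative to the smallest data set.

    Assumed reading of the slide: a value of 1.0 means no extra parallel
    or memory bandwidth overhead compared with the reference run.
    """
    per_unit = run_time / (points_per_thread * clusters)
    ref_per_unit = ref_time / (ref_points_per_thread * clusters)
    return per_unit / ref_per_unit
```

Under this reading, a run with twice the points per thread that takes exactly twice as long as the reference gives 1.0, and anything above 1.0 is overhead.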

(26)

Intel 8b C with 1 Cluster: Vista Scaled Run Time for Clustering Kernel

Note the smallest dataset has highest overheads as we increase the number of threads

Not clear why this is

(27)

Intel 8b C with 80 Clusters: Vista Scaled Run Time for Clustering Kernel

As we increase number of clusters, the effects at 10,000 data points decrease

Plot axis: Number of Threads

(28)

Intel 8b C# with 1 Cluster: Vista Scaled Run Time for Clustering Kernel

C# is similar to C with larger effects

(29)

Intel 8b C# with 1 Cluster: Vista Run Time Fluctuations for Clustering Kernel

This is average of standard deviation of run time of the 8 threads between messaging synchronization points
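
The fluctuation measure described on this slide can be sketched as below. It assumes a population standard deviation over the threads within each interval between synchronization points, averaged over all intervals; whether the plots use a population or sample deviation, or normalize by the mean run time, is not stated in the slide text.

```python
from statistics import mean, pstdev


def average_fluctuation(stage_times):
    """Average over stages of the standard deviation across threads.

    stage_times[s][t] is the run time of thread t during stage s, the
    interval between two messaging synchronization points.
    """
    return mean(pstdev(times) for times in stage_times)
```

For two stages on two threads, `average_fluctuation([[1.0, 1.0], [1.0, 3.0]])` averages a deviation of 0 and a deviation of 1.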

(30)

Intel 8b C# with 80 Clusters: Vista Scaled Run Time for Clustering Kernel

C# is similar to C with larger effects

(31)

AMD4 C with 1 Cluster: XP Scaled Run Time for Clustering Kernel

This is significantly more stable than Intel runs and shows little or no memory bandwidth effect

(32)

AMD4 C# with 1 Cluster: XP Scaled Run Time for Clustering Kernel

This is significantly more stable than Intel C# 1 Cluster runs

(33)

AMD4 C# with 80 Clusters: XP Scaled Run Time for Clustering Kernel

This is broadly similar to the 80 Cluster Intel C# runs, unlike the one cluster case, which was very different

(34)

AMD4 C# with 1 Cluster: Windows Server Scaled Run Time for Clustering Kernel

This is significantly more stable than Intel C# runs

(35)

AMD4 C# with 80 Clusters: Windows Server Scaled Run Time for Clustering Kernel

Curiously run time decreases a bit as number of threads increases in some AMD4 scenarios

(36)

Intel 8c C with 1 Cluster: Red Hat Scaled Run Time for Clustering Kernel

Deviations from “perfect” scaled speed-up are much less for Red Hat than for Windows

(37)

Intel 8c C with 80 Clusters: Red Hat Scaled Run Time for Clustering Kernel

Deviations from “perfect” scaled speed-up are much less for Red Hat

(38)

Intel 8b C# with 80 Clusters: Vista Run Time Fluctuations for Clustering Kernel

This is average of standard deviation of run time of the 8 threads between messaging synchronization points

(39)

AMD4 with 1 Cluster: Windows Server Run Time Fluctuations for Clustering Kernel

This is average of standard deviation of run time of the 8 threads between messaging synchronization points

XP (not shown) is similar

(40)

Intel 8c with 80 Clusters: Redhat Run Time Fluctuations for Clustering Kernel

This is average of standard deviation of run time of the 8 threads between messaging synchronization points

(41)

DSS Section

We view the system as a collection of services, in this case:

One to supply data

One to run parallel clustering

One to visualize results, in this case by spawning a Google maps browser

Note we are clustering Indiana census data

(42)

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)

Measurements of Axis 2 show about 500 microseconds, so DSS is 10 times better

(43)

Clustering algorithm anneals by decreasing the distance scale and gradually finds more clusters as resolution improves

(44)
(45)
(46)
(47)
(48)
(49)
