Applications and Runtime for multicore/manycore

(1)

Applications and Runtime for multicore/manycore

March 21 2007

Geoffrey Fox

Community Grids Laboratory Indiana University

505 N Morton Suite 224 Bloomington IN

[email protected]

(2)

RMS: Recognition Mining Synthesis

Pradeep K. Dubey, [email protected]

Recognition (What is …?): Model

Mining (Is it …?): Find a model instance

Synthesis (What if …?): Create a model instance

Today: Model-less; Real-time streaming and transactions on static, structured datasets

Tomorrow: Model-based multimodal recognition; Real-time analytics on dynamic, unstructured, multimodal datasets; Photo-realism and physics-based animation

(3)

Discussed in Seminars

http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/

Rest mainly classic parallel computing

(4)

Some Bioinformatics Datamining

1. Multiple Sequence Alignment (MSA)

Kernel Algorithms

HMM (Hidden Markov Model)

pairwise alignments (dynamic programming) with heuristics (e.g. progressive, iterative method)

2. Motif Discovery

Kernel Algorithms:

MEME (Multiple Expectation Maximization for Motif Elicitation)

Gibbs sampler

3. Gene Finding (Prediction)

Hidden Markov Methods

4. Sequence Database Search

Kernel Algorithms

BLAST (Basic Local Alignment Search Tool)

PatternHunter
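The pairwise alignment kernel listed under Multiple Sequence Alignment can be illustrated with a minimal sketch of global alignment scoring by dynamic programming (the Needleman-Wunsch recurrence). The function name `global_alignment_score` and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration and do not come from the slides; the progressive and iterative heuristics mentioned above are not included.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score of the best global alignment of strings a and b.

    Illustrative scoring scheme (an assumption, not from the slides):
    match, mismatch and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
```

Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, which is one way this kernel exposes parallelism.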

(5)

Berkeley Dwarfs

Dense Linear Algebra

Sparse Linear Algebra

Spectral Methods

N-Body Methods

Structured Grids

Unstructured Grids

Pleasingly Parallel

Combinatorial Logic

Graph Traversal

Dynamic Programming

Branch & Bound

Graphical Models (HMM)

Finite State Machine

Consistent in Spirit with Intel Analysis

(6)

Client side Multicore applications

“Lots of not very parallel applications”

Gaming; Graphics; Codec conversion for multiple user conferencing ……

Complex Data querying and data manipulation/optimization/regression; database and datamining (including computer vision) (Recognition and Mining for Intel Analysis)

Statistical packages as in Excel and R

Scenario and Model simulations (Synthesis for Intel)

Multiple users give several Server side multicore applications

(7)

Approach I

Integrate Intel, Berkeley and other sources including database (successful on current parallel machines like scientific applications) and define parallel approaches in “white paper”

Develop some key examples testing 3 parallel programming paradigms

Coarse Grain functional Parallelism (as in workflow) including pleasingly parallel instances with different data

Fine Grain functional Parallelism (as in Integer Programming)

Data parallel (Loosely Synchronous as in Science)

Construct so can use different run time including perhaps CCR/DSS, MPI, Data Parallel .NET

Maybe these will become libraries used as in

(8)

Approach II

Have looked at CCR in MPI style applications

Seems to work quite well and support more general messaging models

NAS Benchmark using CCR to confirm its utility

Developing 4 exemplar multi-core parallel applications

Support Vector Machines (linear algebra): Data Parallel

Deterministic Annealing (statistical physics): Data Parallel

Computer Chess or Mixed Integer Programming: Fine Grain Parallelism

Hidden Markov Method (Genetic Algorithms): Loosely Coupled functional Parallelism

Test high level coordination to such parallel applications

(9)

CCR for Data Parallel (Loosely Synchronous) Applications

CCR supports general coordination of messages queued in ports in Handler or Rendezvous mode

DSS builds service model on CCR and supports coarse grain functional parallelism

Basic CCR supports fine grain parallelism as in computer chess (and use STM enabled primitives?)

MPI has well known collective communications which supply scalable global synchronization etc.

Look at performance of MPI_Sendrecv

What is the model that encompasses the best shared and distributed memory approaches for “data parallel” problems?

This could be put on top of CCR?

(10)

Four Communication Patterns used in CCR Tests: (a) Pipeline, (b) Shift, (c) Two Shifts, (d) Exchange. (a) and (b) use CCR Receive while (c) and (d) use CCR Multiple Item Receive

[Figure: each pattern drawn with Thread0 to Thread3 and Port0 to Port3]

Use on AMD 4-core, Xeon 4-core, Xeon 8-core

Latter do up to 8 way
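The Pipeline and Shift patterns can be sketched with ordinary Python threads and queues. This is not the CCR API: `queue.Queue` stands in for a CCR port, and `run_pattern`, `ports` and `wrap` are hypothetical names. The sketch also assumes that Pipeline means each thread writes to the port of the next thread with no wraparound, while Shift closes the ring so the last thread writes to Port0.

```python
import queue
import threading


def run_pattern(n_threads, n_stages, wrap):
    """Run a Pipeline (wrap=False) or Shift (wrap=True) pattern.

    Returns, for each thread, the list of (stage, sender) messages it read.
    Queues stand in for ports; this is an illustration, not CCR itself.
    """
    ports = [queue.Queue() for _ in range(n_threads)]
    received = [[] for _ in range(n_threads)]

    def worker(rank):
        for stage in range(n_stages):
            # Write phase: send one small message to the right neighbour.
            dest = rank + 1
            if dest == n_threads:
                dest = 0 if wrap else None   # Pipeline: last thread sends nothing
            if dest is not None:
                ports[dest].put((stage, rank))
            # Read phase: block until the left neighbour's message arrives.
            if wrap or rank > 0:             # Pipeline: Thread0 has no sender
                received[rank].append(ports[rank].get())

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


if __name__ == "__main__":
    print(run_pattern(4, 3, wrap=True)[0])
```

Because every thread writes before it reads and the queues are unbounded, no thread waits on a message that has not yet been sent, so the stages proceed in a loosely synchronous fashion without deadlock.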

(11)

Exchanging Messages with 1D Torus Exchange topology for loosely synchronous execution in CCR

[Figure: in each stage Thread0 to Thread3 write exchange messages to Port 0 to Port 3 and then read messages, repeated stage after stage]

Break a single computation into different numbers of stages, with the computation per stage varying from 1.4 microseconds to 14 seconds for AMD

(12)

[Plot: Time (Seconds) against Stages (millions); 4-way Pipeline Pattern, 4 Dispatcher Threads, HP Opteron]

Fixed amount of computation (4×10^7 units) divided into 4 cores and from 1 to 10^7 stages on HP Opteron Multicore. Each stage separated by reading and writing CCR ports in Pipeline mode

8.04 microseconds overhead per stage averaged from 1 to 10 million stages

Plot annotations: Overhead = Computation; Computation Component if no Overhead; 1.4 microseconds computation per stage; 14 microseconds
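The figures quoted for the HP Opteron run can be combined in a short worked calculation. The sketch assumes a simple linear model in which the run time is the fixed computation (14 seconds, that is 10^7 stages of 1.4 microseconds each) plus a constant 8.04 microseconds of overhead per stage. The linear model and the function names are assumptions for illustration; the measured overhead is an average and need not be constant at every stage count.

```python
def total_time_seconds(compute_seconds, stages, overhead_per_stage_us):
    """Run time under the assumed linear model: computation + stages * overhead."""
    return compute_seconds + stages * overhead_per_stage_us * 1e-6


def break_even_stages(compute_seconds, overhead_per_stage_us):
    """Stage count at which total overhead equals the computation time."""
    return compute_seconds / (overhead_per_stage_us * 1e-6)


if __name__ == "__main__":
    compute = 14.0      # seconds, fixed amount of computation (from the slides)
    overhead = 8.04     # microseconds per stage (from the slides)
    for stages in (1, 10**6, 10**7):
        print(stages, round(total_time_seconds(compute, stages, overhead), 2))
    print(round(break_even_stages(compute, overhead)))
```

Under this model the overhead matches the computation at roughly 1.7 million stages, which corresponds to about 8 microseconds of computation per stage.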

(13)

Stage Overhead versus Thread Computation time

• Overhead per stage constant up to about a million stages and then increases

[Plot: surviving axis labels are "14 Seconds", "Stage Computation" and "14"]

(14)

[Plot: Time (Seconds) against Stages (millions); 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon]

Fixed amount of computation (4×10^7 units) divided into 4 cores and from 1 to 10^7 stages on Dell 2 processor 2-core each Xeon Multicore. Each stage separated by reading and writing CCR ports in Pipeline mode

12.40 microseconds per stage averaged from 1 to 10 million stages

Plot annotation: Overhead = Computation

(15)

Summary of Stage Overheads for AMD 2-core 2-processor Machine

(16)

Summary of Stage Overheads for Intel 2-core 2-processor Machine

These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns – each stage takes about 30 microseconds. AMD overheads in parentheses

(17)

Summary of Stage Overheads for Intel 4-core 2-processor Machine

These are stage switching overheads for a set of runs with different levels of parallelism and different message patterns – each stage takes about 30 microseconds. 2-core 2-processor Xeon overheads in parentheses

(18)

8-way Parallel Pipeline on two 4-core Xeon (XP-Pro)

• Histogram of 100 runs -- each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds

– So overhead of 6.1 microseconds modest

• Message size is just one integer

• Choose computation unit that is appropriate for a few microsecond stage overhead

[Inset histogram: AMD 4-way, 27.94 microsecond Computation Unit]

(19)

8-way Parallel Shift on two 4-core Xeon

• Histogram of 100 runs -- each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds

– So overhead of 8.2 microseconds modest

• Shift versus pipeline adds a microsecond to cost

• Unclear what causes second peak

[Histogram labels: XP-Pro, VISTA. Inset histogram: AMD 4-way, 27.94 microsecond Computation Unit]

(20)

8-way Parallel Double Shift on two 4-core Xeon (XP-Pro)

• Histogram of 100 runs -- each run has 500,000 synchronizations following a thread execution that takes 33.92 microseconds

– So overhead of 22.3 microseconds significant

– Unclear why double shift slow compared to shift

• Exchange performance partly reflects number of messages

• Opteron overheads significantly lower than Intel

[Inset histogram: AMD 4-way, 27.94 microsecond Computation Unit]
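To put the three 8-way Xeon overheads side by side, the sketch below computes the fraction of each stage spent on useful computation. It assumes that the quoted overhead is simply added to the 33.92 microsecond thread execution in every stage; the `efficiency` function and that additive model are illustrative assumptions, not something stated on the slides.

```python
def efficiency(compute_us, overhead_us):
    """Fraction of a stage spent computing, assuming overhead adds to compute time."""
    return compute_us / (compute_us + overhead_us)


# Overheads in microseconds quoted for the 8-way runs on two 4-core Xeon.
OVERHEADS_US = {"pipeline": 6.1, "shift": 8.2, "double shift": 22.3}
COMPUTE_US = 33.92  # thread execution time per synchronization

if __name__ == "__main__":
    for pattern, overhead in OVERHEADS_US.items():
        print(f"{pattern}: {efficiency(COMPUTE_US, overhead):.3f}")
```

On these numbers the pipeline and shift patterns keep roughly 85% and 81% of each stage for computation, while the double shift drops to about 60%, which is why a larger computation unit would be needed to hide its overhead.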

(21)

AMD 2-core 2-processor Bandwidth Measurements

• Previously we measured latency as measurements corresponded to small messages. We did a further set of measurements of bandwidth by exchanging larger messages of different size between threads

• We used three types of data structures for receiving data

– Array in thread equal to message size

– Array outside thread equal to message size

– Data stored sequentially in a large array (“stepped” array)
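One possible reading of the three receiving data structures is sketched below using arrays of doubles from the Python standard library. The function names and the exact layout are assumptions made for illustration: in particular, the "stepped" array is interpreted as one large array in which the message for stage k is written at offset k times the message length. The original measurements were made with CCR threads, not with this code.

```python
from array import array


def receive_in_thread(message):
    """Copy into a fresh buffer of message size allocated by the receiving thread."""
    local = array("d", [0.0]) * len(message)
    local[:] = message
    return local


def receive_outside_thread(message, shared):
    """Copy into a pre-allocated buffer of message size that lives outside the thread."""
    shared[:] = message
    return shared


def receive_stepped(message, big, stage):
    """Copy into the slot for this stage inside one large array."""
    start = stage * len(message)
    big[start:start + len(message)] = message
    return big


if __name__ == "__main__":
    msg = array("d", [1.0, 2.0, 3.0])
    big = array("d", [0.0]) * 9
    print(receive_stepped(msg, big, 2).tolist())
```

The first two variants reuse a small destination that can stay in cache, whereas the stepped variant touches new memory in every stage, which is the kind of difference a bandwidth measurement would expose.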

(22)

Intel 2-core 2-processor Bandwidth Measurements

• For bandwidth, the Intel did better than AMD especially when one exploited cache on chip with small transfers

• For both AMD and Intel, each stage executed a computational task after copying data arrays of size 10^5 (labeled small), 10^6 (labeled large) or 10^7 double words. The last column is an approximate value in microseconds of the compute time for each stage. Note that copying 100,000 double precision words per core at a

(23)

Typical Bandwidth measurements showing effect of cache with slope change

5,000 stages with run time plotted against size of double array copied in each stage from thread to stepped locations in a large array on Dell Xeon Multicore

[Plot: Time (Seconds) against Array Size (Millions of Double Words); 4-way Pipeline Pattern, 4 Dispatcher Threads, Dell Xeon]

Total Bandwidth 1.0 Gigabytes/Sec up to one million double words and 1.75 Gigabytes/Sec up to 100,000 double words

Slope Change (Cache

(24)

Timing of HP Opteron Multicore as a function of number of simultaneous two-way service messages processed (November 2006 DSS Release)

• CGL measurements of Axis 2 show about 500 microseconds – DSS is 10 times better
