Linking Programming Models between Grids, Web 2.0 and Multicore

Distributed Programming Abstractions Workshop, NeSC, Edinburgh, UK
May 31 2007

Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
Points in Talk I
■ All parallel programming projects more or less fail, while all distributed programming projects report success
  • There are several hundred in the Grid workflow area alone
■ There are few constraints on distributed programming
■ Composition (in distributed computing) versus decomposition (in parallel computing)
■ There is not much difference between distributed programming and a key paradigm of parallel computing (functional parallelism)
■ Pervasive use of 64-core chips in the future will often require one to build a "Grid on a chip", i.e. to execute a traditional distributed application on a chip
■ XML is a pretty dubious syntax for expressing programs
■ Web 2.0 is pretty scruffy, but there are some large companies and many users behind it
■ Web 2.0 and Grids will converge, and features of both will survive or disappear in the merged environment
■ Web 2.0 has a more plausible approach to distributed programming than Web Services/Grids
Some More Points
■ Services could be the universal abstraction in parallel and distributed computing
  • Objects could not be universal, so perhaps we should move away from their use
■ Gateways/Portals (Portlets, Widgets, Gadgets) are the natural user (application usage) interface to a collection of services
■ Important data abstractions: SQL, WFS, RSS feeds
■ Divide parallel programming run-times (matching application structure) into 3 or 4 broad classes
■ Inter-entity communication time is characteristic of the programming model
  • 1-5 µs for MPI/thread switching, 25 µs for services inside a chip, and 1-1000 milliseconds for services on the Grid
■ Multicore commodity programming model
  • A "Marine Corps" writes libraries in "HLA++", MPI or dynamic threads (internally about one microsecond latency), exposed as services
  • Services are composed/mashed up by "millions"
■ Many composition (coordination) or mashup approaches
  • Functional (cf. Google MapReduce for data transformations)
  • Dataflow
  • Workflow
  • Visual
  • Script
■ The difficulties of making effective use of multicore chips will be so great that they will be the main driver of new programming environments
■ Microsoft CCR/DSS is a good example of unification of parallel and distributed computing
Some Details
■ See http://www.slideshare.net/Foxsden or, more conventionally:
■ Web 2.0 and Grid Tutorial
  • http://grids.ucs.indiana.edu/ptliupages/presentations/CTSpartIMay21-07.ppt
  • http://grids.ucs.indiana.edu/ptliupages/presentations/Web20Tutorial_CTS.ppt
■ Multicore and Parallel Computing Tutorial
  • http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/index.html
■ "Web 2.0" citation site
Web 2.0 and Web Services I
■ Web Services have clearly defined protocols (SOAP) and a well defined mechanism (WSDL) to define service interfaces
  • There is good .NET and Java support
  • The so-called WS-* (WS-Nightmare) specifications provide a rich, sophisticated but complicated standard set of capabilities for security, fault tolerance, metadata, discovery, notification etc.
■ "Narrow Grids" build on Web Services and provide a robust managed environment, with growing adoption in enterprise systems and distributed science (e-Science)
■ We can use the term Grids strictly for Narrow Grids that are collections of Web Services (or even more strictly OGSA Grids), or call any collection of services a "Broad Grid", which is actually quite often done
■ Web 2.0 supports a similar architecture to Web Services but has developed in a more chaotic yet remarkably successful fashion, with a service architecture using a variety of protocols including those of Web and Grid services
  • Over 400 interfaces defined at http://www.programmableweb.com/apis
■ One can easily combine SOAP (Web Service) based services/systems with Web 2.0 (REST) services
Web 2.0 and Web Services II
■ Web 2.0 also has many well known capabilities, with Maps and Amazon Compute/Storage services of clear general relevance
■ There are also Web 2.0 services supporting novel collaboration modes and user interaction with the web, as seen in social networking sites and portals such as MySpace, YouTube, Connotea, Slideshare ….
■ I once thought Web Services were inevitable, but this is no longer clear to me
■ Web Services are complicated, slow and non-functional
  • WS-Security is unnecessarily slow and pedantic (canonicalization of XML)
  • WS-RM (Reliable Messaging) seems to have poor adoption and doesn't work well in collaboration
  • WSDM (distributed management) specifies a lot …
Attack of the Killer Multicores
■ Today commodity Intel systems are sold with 8 cores spread over two processors
■ Specialized chips such as GPUs and the IBM Cell processor have substantially more cores
■ Moore's Law will be satisfied by, and will imply, an exponentially increasing number of cores, doubling every 1.5-3 years
  • Modest increase in clock speed
  • Intel has already prototyped an 80-core server chip, ready in 2011?
■ Huge activity in parallel computing programming (recycled from the past?)
  • Some programming models and application styles are similar to Grids
■ We will have a Grid on a chip ……….
Grids meet Multicore Systems
■ The expected rapid growth in the number of cores per chip has important implications for Grids
■ With 16-128 cores on a single commodity system 5 years from now, one will both be able to build a Grid-like application on a chip and indeed must build such an application to get the Moore's law performance increase
  • Otherwise you will "waste" cores …..
■ One will not want to reprogram as one moves an application from a 64-node cluster or a transcontinental implementation to a single-chip Grid
■ However, multicore chips have a very different architecture from Grids
  • Shared, not distributed, memory
  • Latencies measured in microseconds, not milliseconds
■ Thus Grid and multicore technologies will need to "converge", and the converged technology model will have different requirements from current Grid assumptions
Grid versus Multicore Applications
■ It seems likely that future multicore applications will involve a loosely coupled mix of multiple modules that fall into three classes:
  • Data access/query/store
  • Analysis and/or simulation
  • User visualization and interaction
■ This is precisely the mix that Grids support, but Grids of course involve distributed modules
■ Grids and Web 2.0 use service oriented architectures to describe systems at the module level – is this an appropriate model for multicore programming?
■ Where do multicore systems get their data from?
RMS: Recognition Mining Synthesis
(Pradeep K. Dubey, pradeep.dubey@intel.com)
Intel has probably the most sophisticated analysis of future "killer" multicore applications; it is all about dealing efficiently with complex multimodal datasets.
■ Recognition ("What is …?"): model-based multimodal recognition
■ Mining ("Is it …?"): find a model instance; real-time analytics on dynamic, unstructured, multimodal datasets
■ Synthesis ("What if …?"): create a model instance; photo-realism and physics-based animation
■ Today, by contrast: model-less, real-time streaming and transactions on static, structured datasets; very limited realism
■ Example: What is a tumor? Is there a tumor here? What if the tumor progresses?
Role of Data in Grid/Multicore I
■ One typically is told to place compute (analysis) at the data, but most of the computing power is in multicore clients on the edge
■ These multicore clients can get data from the internet, i.e. distributed sources
  • This could be personal interests of the client, used by the client to help the user interact with the world
  • It could be cached or copied
  • It could be a standalone calculation or part of a distributed coordinated computation (SETI@Home)
■ Or they could get data from a set of local sensors (video-cams and environmental sensors) naturally stored on the client or locally to the client
Role of Data in Grid/Multicore II
■ Note that as you increase the sophistication of data analysis, you increase the ratio of compute to I/O
  • A typical modern datamining approach like the Support Vector Machine is sophisticated (dense) matrix algebra and not just text matching
  • http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/PC07BYOPA.ppt
■ The time complexity of sophisticated data analysis will make it more attractive to fetch data from the Internet and cache/store it on the client
  • It will also help with memory bandwidth problems in multicore chips
■ In this vision, the Grid "just" acts as a source of data and the Grid application runs locally
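A rough flop-count sketch of the compute-to-I/O point above, with made-up sizes: fetching n points of dimension d moves O(n·d) values, while the dense kernel (Gram) matrix at the heart of methods like the Support Vector Machine costs O(n²·d) flops, so the ratio grows with n.

```python
# Sketch only: compute-to-I/O ratio for a dense Gram matrix, the core
# of kernel methods such as Support Vector Machines. Sizes are
# illustrative assumptions, not from the talk.
import numpy as np

n, d = 2_000, 100             # n points, d features
X = np.random.rand(n, d)      # stands in for data fetched over the network

gram = X @ X.T                # dense matrix algebra, not just text matching

io_values = n * d             # values moved from the data source
flops = 2 * n * n * d         # multiply-add per inner-product element
print(f"compute/I-O ratio ~ {flops / io_values:,.0f} flops per value fetched")
```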
Multicore Programming Paradigms
• At a very high level, there are three or four broad classes of parallelism:
• Coarse grain functional parallelism, typified by workflow and often used to build composite "metaproblems" whose parts are also parallel
  – "Compute-File", Database/Sensor, Community, Service, and Pleasingly Parallel (master-worker) are sub-classes
• Large scale loosely synchronous data parallelism, where dynamic irregular work has clear synchronization points, as in most large scale scientific and engineering problems
• Fine grain (asynchronous) thread parallelism, as used in search algorithms which are often data parallel (over choices) but don't have universal synchronization points
Data Parallel Time Dependence
• A simple form of data parallel application is synchronous, with all elements of the application space being evolved with essentially the same instructions
• Such applications are suitable for SIMD computers and run well on vector supercomputers (and GPUs, but these are more general than just synchronous)
• However, synchronous applications also run fine on MIMD machines
  – The SIMD CM-2 evolved to the MIMD CM-5 with the same data parallel language, CMFortran
• The iterative solutions to Laplace's equation are synchronous, as are many full matrix algorithms
• Synchronization on MIMD machines is accomplished by messaging
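A minimal sketch of the synchronous style named above, using the Jacobi iteration for Laplace's equation (grid size and step count are arbitrary choices):

```python
# Sketch: synchronous data-parallel update via Jacobi iteration for
# Laplace's equation on a square grid. Every interior point is updated
# with the same instruction each step, which is why such applications
# suit SIMD machines and also run fine on MIMD ones.
import numpy as np

u = np.zeros((64, 64))
u[0, :] = 1.0                          # fixed boundary condition

for step in range(500):
    # the same 4-point stencil is applied to all interior points "at once"
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
```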
Local Messaging for Synchronization
• MPI_SENDRECV is the typical primitive
• Processors do a send followed by a receive, or a receive followed by a send
• In two stages (needed to avoid race conditions), one has a complete left shift
• Often followed by the equivalent right shift, to get a complete exchange
• This logic guarantees correctly updated data is sent to processors that have their data at the same simulation time
(Figure: shift pattern across 8 processors)
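A small sketch of the shift/exchange pattern above using mpi4py (run with e.g. `mpiexec -n 8 python shift.py`); the Sendrecv call pairs the send and receive so the two-stage ordering needed to avoid races is handled by the library:

```python
# Sketch: left shift then right shift gives a complete exchange,
# the MPI_SENDRECV idiom described on this slide.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

mine = np.array([float(rank)])     # this processor's boundary data
from_right = np.empty(1)
from_left = np.empty(1)

# complete left shift, then the equivalent right shift -> exchange
comm.Sendrecv(mine, dest=left, recvbuf=from_right, source=right)
comm.Sendrecv(mine, dest=right, recvbuf=from_left, source=left)
```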
Loosely Synchronous Applications
• This is the most common class of large scale science and engineering: one has the traditional data parallelism, but now each data point has in general a different update
  – This comes from heterogeneity in problems that would be synchronous if homogeneous
• Time steps are typically uniform, but sometimes one needs to support variable time steps across the application space – however, ensure small time steps satisfy δt = (t1 − t0)/Integer so that subspaces with finer time steps do synchronize with the full domain (see the worked form after this list)
• The time synchronization via messaging is still valid
• However, one no longer load balances (ensuring each processor does equal work in each time step) by putting an equal number of points in each processor
• Load balancing, although NP-complete, is in practice surprisingly easy
(Figure: application space evolving through times t0, t1, t2, t3, t4)
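A worked form of the time-step constraint above (a sketch in the bullet's own notation): a subdomain needing a finer step takes N equal substeps per global step, so its local times land exactly on the global synchronization times.

```latex
\delta t = \frac{t_1 - t_0}{N}, \quad N \in \mathbb{Z}^{+},
\qquad t_k = t_0 + k\,\delta t \quad (k = 0, 1, \dots, N),
\qquad t_N = t_1 .
```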
MPI Futures?
• MPI is likely to become more important as multicore systems become more common
• One should use MPI when MPI is needed, and use other messaging for other cases (such as linking services) where different features/performance are appropriate
• MPI has too many primitives, which will handicap broad implementation/adoption
• Perhaps only have one collective primitive like …
Fine Grain Dynamic Applications
• Here there is no natural universal 'time' as there is in science algorithms, where an iteration number or Mother Nature's time gives global synchronization
• Loose (zero) coupling or special features of the application are needed for successful parallelization
• In computer chess, the minimax scores at parent nodes provide multiple dynamic synchronization points
(Figure: application space versus application time)
Computer Chess
• Thread level parallelism, unlike the position evaluation parallelism used in other systems
• Competed, with poor reliability and results, in the 1987 and 1988 ACM Computer Chess Championships
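A toy sketch of this fine grain dynamic style: child positions are evaluated by a thread pool, with no universal synchronization point. Here `evaluate` is a hypothetical stand-in for a static evaluator; a real chess program would search recursively and use minimax scores at parent nodes as dynamic synchronization points.

```python
# Sketch: thread parallelism over "choices" in a game tree.
from concurrent.futures import ThreadPoolExecutor

def evaluate(position):
    return hash(position) % 100        # placeholder score, not real chess

def best_move(position, moves):
    children = [position + m for m in moves]
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(evaluate, children))
    best = max(range(len(moves)), key=lambda i: scores[i])
    return moves[best], scores[best]

print(best_move("start:", ["e4", "d4", "Nf3"]))
```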
Discrete Event Simulations
• These are familiar in military and circuit (system) simulations when one uses macroscopic approximations
  – Also probably the paradigm of most multiplayer Internet games/worlds
• Note Nature is perhaps synchronous when viewed quantum mechanically in terms of uniform fundamental elements (quarks and gluons etc.)
• It is loosely synchronous when considered in terms of particles and mesh points
• It is asynchronous when viewed in terms of tanks, people, arrows etc.
• Circuit simulations can be done loosely synchronously, but inefficiently, as many elements are inactive
Programming Models
• The three major models are supported by the HPCS languages, which are very interesting but too monolithic
• The fine grain thread parallelism and large scale loosely synchronous data parallelism styles are distinctive to parallel computing, while
• Coarse grain functional parallelism of multicore overlaps with workflows from Grids and mashups from Web 2.0
• It seems plausible that a more uniform approach will evolve for the coarse grain case, although this is the least constrained of the programming styles, as typically latency issues are not critical
  – Multicore would have the strongest performance constraints
  – Web 2.0 and Multicore have the most important usability constraints
• A possible model for broad use of multicores is that the difficult parallel algorithms are coded as libraries (fine grain thread parallelism and large scale loosely synchronous data parallelism) which are then composed by a broader community
Google MapReduce: Simplified Data Processing on Large Clusters
• http://labs.google.com/papers/mapreduce.html
• This is a dataflow model between services, where services can do useful document oriented data parallel applications including reductions
• The decomposition of services onto cluster engines is automated
• The large I/O requirements of datasets change the efficiency analysis in favor of dataflow
• Services (counting words in the example) can obviously be extended to general parallel applications
• There are many alternatives to the language for expressing either dataflow and/or coordination
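The paper's word-count example in miniature, as a sketch (a real deployment also automates the decomposition onto cluster engines, which this toy omits):

```python
# Sketch: MapReduce word count. map emits (word, 1) pairs, the
# framework groups by key, and reduce sums the counts.
from collections import defaultdict

def map_phase(document):
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:              # "shuffle" and reduce in one pass
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))             # {'the': 2, 'quick': 1, ...}
```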
Programming Models
• The services and objects in distributed computing are usually "natural" (they come from the application), whereas parts connected by MPI (or created by a parallelizing compiler) come from "artificial" decompositions and are not naturally considered services
• Services in multicore (parallel computing) are the original modules before decomposition, and it is these modules that coarse grain functional parallelism addresses
• Most of the "difficult" issues in parallel computing lie inside the decomposed parts, not in the coarse grain composition
Parallel Software Paradigms: Top Level
• In the conventional two-level Grid/Web Service programming model, one programs each individual service and then separately programs their interaction
  – This is the Grid-aware Services programming model
  – SAGA supports Grid-aware programs?
• This is generalized to multicore with "Marine Corps" programming of services for the "difficult" cases:
  – Loosely synchronous
  – Fine grain threading
  – Discrete event simulation
• The "average" programmer produces mashups or workflows linking these services
The Marine Corps "Lack of Programming Paradigm" Library Model
• One could assume that parallel computing is "just too hard for real people" and that we should use a Marine Corps of programmers to build, as libraries, excellent parallel implementations of "all" core capabilities
  – e.g. the primitives identified in the Intel application analysis
  – e.g. the primitives supported in Google MapReduce, HPF, PeakStream, Microsoft Data Parallel .NET etc.
• These primitives are orchestrated (linked together) by overall frameworks such as workflows or mashups
• The Marine Corps probably is content with efficient rather than easy-to-use tools
Component Parallel and Program Parallel
• The component parallel paradigm is where one explicitly programs the different parts of a parallel application, with the linkage either specified externally, as in workflow, or in the components themselves, as in most other component parallel approaches
  – In Grids, components are natural
  – In parallel computing, components are produced by decomposition
• In the program parallel paradigm, one writes a single program to describe the whole application, and some combination of compiler and runtime breaks up the program into the multiple parts that execute in parallel (a sketch of the contrast follows below)
• Note that a program parallel approach will often call a built-in runtime library written in component parallel fashion
  – A parallelizing compiler could call an MPI library routine
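A small sketch of the contrast, under the obvious simplifications: in the "program parallel" style the runtime splits one logical operation, while in the "component parallel" style the parts and their message linkage are written out explicitly.

```python
# Sketch: program parallel vs component parallel in miniature.
from multiprocessing import Pool
import queue, threading

def f(x):
    return x * x

if __name__ == "__main__":
    # Program parallel: one program; the runtime decomposes the map
    with Pool(4) as pool:
        print(pool.map(f, range(8)))

    # Component parallel: explicitly programmed parts linked by messages
    q = queue.Queue()
    def worker(xs):
        for x in xs:
            q.put(f(x))                # explicit message to the consumer
    t = threading.Thread(target=worker, args=(range(8),))
    t.start(); t.join()
    print([q.get() for _ in range(8)])
```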
Component Parallel and Program Parallel
• Program parallel approaches include:
  – Data structure parallel, as in Google MapReduce, HPF (High Performance Fortran), the HPCS (High-Productivity Computing Systems) languages, or "SIMD" co-processor languages (PeakStream, ClearSpeed and Microsoft Data Parallel .NET)
  – Parallelizing compilers, including OpenMP annotation
  – Note OpenMP and HPF have failed in some sense for large scale parallel computing (writing an algorithm in standard sequential languages throws away information needed for parallelization)
• Component parallel approaches include:
  – MPI (and related systems like PVM) parallel message passing
  – PGAS (Partitioned Global Address Space: CAF, UPC, Titanium, HPJava)
  – C++ futures and active objects
  – CSP … Microsoft CCR and DSS
  – Workflow and mashups
Why people like MPI!
• Jason J. Beech-Brandt and Andrew A. Johnson, at AHPCRC Minneapolis
• BenchC is an unstructured finite element CFD solver
• Looked at OpenMP on a shared memory Altix, with some effort to optimize
• Optimized UPC on several machines
• MPI always good, but the other approaches erratic
• Other studies reach similar conclusions?
(Figure: BenchC performance per cluster, after optimization of UPC)
Web 2.0 Systems are Portals, Services, Resources
■ Captures the incredible development of interactive Web sites enabling people to create and collaborate
Mashups v Workflow?
■ Mashup tools are reviewed at http://blogs.zdnet.com/Hinchcliffe/?p=63
■ Workflow tools are reviewed by Gannon and Fox: http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf
■ Both include scripting in PHP, Python, sh etc., as both implement distributed programming at the level of services
■ Mashups use all types of service interfaces and do not have the potential robustness (security) of the Grid service approach
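A sketch of a mashup in the sense above: script-level composition of two service interfaces. The URLs and JSON keys here are hypothetical placeholders; real mashups would hit interfaces like those listed on http://www.programmableweb.com/apis.

```python
# Sketch only: join data from two REST-style JSON services in a script.
import json
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as resp:          # plain HTTP GET returning JSON
        return json.load(resp)

def mashup(events_url, places_url):
    events = fetch(events_url)          # e.g. a hypothetical event feed
    places = fetch(places_url)          # e.g. a hypothetical geocoder
    # join the two services' data on a shared key ("place" is assumed)
    return [(e["title"], places.get(e["place"])) for e in events]
```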
Web 2.0 APIs
■ http://www.programmableweb.com/apis has (May 14 2007) 431 Web 2.0 APIs, with Google Maps the most often used in mashups
■ This site acts as a "UDDI": the list of Web 2.0 APIs
■ Each site has an API and its features; divided into broad categories
■ Only a few are used a lot (42 APIs are used in more than 10 mashups)
■ RSS feed of new APIs; Amazon S3 growing
APIs/Mashups per Protocol Distribution
(Chart: distribution over REST; SOAP; XML-RPC; REST + XML-RPC; XML-RPC + REST + SOAP; REST + SOAP; JS; other)
■ 4 more mashups each day, for a total of 1906 on April 17 2007 (4.0 a day over the last month)
■ Note ClearForest runs Semantic Web Services mashup competitions (not workflow competitions)
■ Some mashup types: aggregators, search aggregators, visualizers, mobile, maps, games
Implication for Grid Technology of Multicore and Web 2.0 I
■ Web 2.0 and Grids are addressing a similar application class, although Web 2.0 has focused on user interactions
  • So the technology has similar requirements
■ Multicore differs significantly from Grids in component location, and this seems particularly significant for data
  • It is therefore not clear how similar the applications will be
  • The Intel RMS multicore application class is pretty similar to Grids
■ Multicore has more stringent software requirements than Grids, as the latter has intrinsic network overhead
Implication for Grid Technology of Multicore and Web 2.0 II
■ Multicore chips require low overhead protocols to exploit low latency, which suggests simplicity
  • We need to simplify MPI AND Grids!
■ Web 2.0 chooses simplicity (REST rather than SOAP) to lower the barrier to everyone participating
■ Web 2.0 and Multicore tend to use traditional (possibly visual) scripting languages for the equivalent of workflow, whereas Grids use a visual interface with the back end recorded in BPEL
  • Google MapReduce illustrates a popular Web 2.0 and Multicore approach to dataflow
Implication for Grid Technology of Multicore and Web 2.0 III
■ Web 2.0 and Grids both use SOA, Service Oriented Architectures
  • It seems likely that Multicore will also adopt this, although a more conventional object oriented approach is also possible
  • Services should help multicore applications integrate modules from different sources
  • Multicore will use fine grain objects but coarse grain services
■ "System of Systems": Grids, Web 2.0 and Multicore are likely to build systems hierarchically out of smaller systems
  • We need to support Grids of Grids, Webs of Grids, Grids of Multicores etc., i.e. systems of systems of all sorts
The Ten Areas Covered by the 60 Core WS-* Specifications
1: Core Service Model – XML, WSDL, SOAP
2: Service Internet – WS-Addressing, WS-MessageDelivery; Reliable Messaging WSRM; Efficient Messaging MOTM
3: Notification – WS-Notification, WS-Eventing (Publish-Subscribe)
4: Workflow and Transactions – BPEL, WS-Choreography, WS-Coordination
5: Security – WS-Security, WS-Trust, WS-Federation, SAML, WS-SecureConversation
6: Service Discovery – UDDI, WS-Discovery
7: System Metadata and State – WSRF, WS-MetadataExchange, WS-Context
8: Management – WSDM, WS-Management, WS-Transfer
9: Policy and Agreements – WS-Policy, WS-Agreement
10: Portals and User Interfaces – WSRP (Remote Portlets)
WS-* Areas and Web 2.0
1: Core Service Model – XML becomes optional but still useful; SOAP becomes JSON, RSS, ATOM; WSDL becomes REST with the API as GET, PUT etc.; Axis becomes XmlHttpRequest
2: Service Internet – No special QoS; use JMS or equivalent?
3: Notification – Hard with HTTP without polling; JMS perhaps?
4: Workflow and Transactions (no transactions in Web 2.0) – Mashups, Google MapReduce; scripting with PHP, JavaScript ….
5: Security – SSL, HTTP authentication/authorization; OpenID is Web 2.0 single sign-on
6: Service Discovery – http://www.programmableweb.com
7: System Metadata and State – Processed by application, no system state; microformats are a universal metadata approach
8: Management == Interaction – WS-Transfer style protocols: GET, PUT etc.
9: Policy and Agreements – Service dependent; processed by application
10: Portals and User Interfaces – Start pages, AJAX and widgets (Netvibes), gadgets
WS-* Areas and Multicore
1: Core Service Model – Fine grain Java/C#/C++ objects and coarse grain services as in DSS; information passed explicitly or by handles; MPI needs to be updated to handle non-scientific applications, as in CCR
2: Service Internet – Not so important intra-chip
3: Notification – Publish-subscribe for events and interrupts
4: Workflow and Transactions – Many approaches; scripting languages popular
5: Security – Not so important intra-chip
6: Service Discovery – Use libraries
7: System Metadata and State – Environment variables
8: Management == Interaction – Interaction between objects is a key issue in parallel programming, trading off efficiency versus performance
9: Policy and Agreements – Handled by application
10: Portals and User Interfaces – Web 2.0 technology popular
CCR as an Example of a Cross-Paradigm Run Time
• Naturally supports fine grain thread switching with message passing, with around 4 microsecond latency for 4 threads switching to 4 others on an AMD PC with C#; threads are spawned with no rendezvous
• Has around 50 microsecond latency for coarse grain service interactions with the DSS extension, which supports Web 2.0 style messaging
• MPI collectives (shift and exchange) vary from 10 to 20 microsecond latency in rendezvous mode
• Not as good as the best MPIs, but it is managed code and supports Grids, Web 2.0 and parallel computing ……
Microsoft CCR
• Supports exchange of messages between threads using named ports
• FromHandler: spawn threads without reading ports
• Receive: each handler reads one item from a single port
• MultipleItemReceive: each handler reads a prescribed number of items of a given type from a given port; note items in a port can be general structures, but all must have the same type
• MultiplePortReceive: each handler reads one item of a given type from multiple ports
• JoinedReceive: each handler reads one item from each of two ports; the items can be of different types
• Choice: execute a choice of two or more port-handler pairings
• Interleave: consists of a set of arbiters (port-handler pairs) of 3 types that are Concurrent, Exclusive or Teardown (called at the end for clean up); concurrent arbiters are run concurrently, but exclusive handlers run one at a time
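A rough flavor of the port/handler model, approximated here with Python threads and queues; this is a sketch only (CCR itself is a C# library and its real scheduler is far more sophisticated than spawning a thread per handler).

```python
# Sketch: CCR-style ports and one-shot handlers, approximated in Python.
import queue, threading

class Port(queue.Queue):
    """A named channel that handlers read items from."""

def receive(port, handler):
    # like Receive: a handler reads one item from a single port
    threading.Thread(target=lambda: handler(port.get())).start()

def joined_receive(p1, p2, handler):
    # like JoinedReceive: fires only with one item from each port
    threading.Thread(target=lambda: handler(p1.get(), p2.get())).start()

p = Port()
receive(p, lambda item: print("got", item))
p.put(1)

a, b = Port(), Port()
joined_receive(a, b, lambda x, y: print("pair", x, y))
a.put(2); b.put(3)
```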
(Figure: Overhead (latency, in microseconds) of an AMD 4-core PC with 4 execution threads on MPI-style rendezvous messaging for shift and exchange, implemented either as two shifts or as a custom CCR pattern, plotted against the number of stages in millions; compute time is 10 seconds divided by the number of stages.)
(Figure: The same measurement on an Intel 8-core PC with 8 execution threads; compute time is 15 seconds divided by the number of stages.)
(Figure: Timing of an HP Opteron multicore as a function of the number of simultaneous two-way service messages processed, November 2006 DSS release.)
■ CGL measurements of Axis 2 show about 500 microseconds – DSS is 10 times better