Linking Programming Models between Grids, Web 2.0 and Multicore

Distributed Programming Abstractions Workshop, NeSC, Edinburgh, UK
May 31 2007

Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
Points in Talk I
■ All parallel programming projects more or less fail, while all distributed programming projects report success
  • There are several hundred in the Grid workflow area alone
■ There are few constraints on distributed programming
■ Composition (in distributed computing) versus decomposition (in parallel computing)
■ There is not much difference between distributed programming and a key paradigm of parallel computing (functional parallelism)
■ Pervasive use of 64-core chips in the future will often require one to build a "Grid on a chip", i.e. to execute a traditional distributed application on a chip
■ XML is a pretty dubious syntax for expressing programs
■ Web 2.0 is pretty scruffy, but there are some large companies and many users behind it
■ Web 2.0 and Grids will converge, and features of both will survive or disappear in the merged environment
■ Web 2.0 has a more plausible approach to distributed programming than Web Services/Grids
Some More Points
■ Services could be the universal abstraction in parallel and distributed computing
  • Objects could not be universal, so perhaps we should move away from their use
■ Gateways/Portals (Portlets, Widgets, Gadgets) are the natural user (application usage) interface to a collection of services
■ Important data abstractions: SQL, WFS, RSS feeds
■ Divide parallel programming run-times (matching application structure) into 3 or 4 broad classes
■ Inter-entity communication time is characteristic of the programming model
  • 1-5 µs for MPI/thread switching, 25 µs for services inside a chip, and 1-1000 milliseconds for services on the Grid
■ Multicore commodity programming model
  • A "Marine Corps" writes libraries in "HLA++", MPI or dynamic threads (internally about one microsecond latency), exposed as services
  • Services are composed/mashed up by "millions"
■ Many composition (coordination) or mashup approaches
  • Functional (cf. Google MapReduce for data transformations)
  • Dataflow
  • Workflow
  • Visual
  • Script
■ The difficulties of making effective use of multicore chips will be so great that they will be the main driver of new programming environments
■ Microsoft CCR/DSS is a good example of unification of parallel and distributed computing
Some Details
■ See http://www.slideshare.net/Foxsden or, more conventionally:
■ Web 2.0 and Grid Tutorial
  • http://grids.ucs.indiana.edu/ptliupages/presentations/CTSpartIMay21-07.ppt
  • http://grids.ucs.indiana.edu/ptliupages/presentations/Web20Tutorial_CTS.ppt
■ Multicore and Parallel Computing Tutorial
  • http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/index.html
■ "Web 2.0" citation site
Web 2.0 and Web Services I
■ Web Services have clearly defined protocols (SOAP) and a well defined mechanism (WSDL) to define service interfaces
  • There is good .NET and Java support
  • The so-called WS-* (WS-Nightmare) specifications provide a rich, sophisticated but complicated standard set of capabilities for security, fault tolerance, metadata, discovery, notification etc.
■ "Narrow Grids" build on Web Services and provide a robust managed environment, with growing adoption in enterprise systems and distributed science (e-Science)
■ We can use the term Grids strictly for Narrow Grids that are collections of Web Services (or even more strictly OGSA Grids), or call any collection of services a "Broad Grid", which is actually quite often done
■ Web 2.0 supports a similar architecture to Web Services but has developed in a more chaotic yet remarkably successful fashion, with a service architecture using a variety of protocols including those of Web and Grid services
  • Over 400 interfaces defined at http://www.programmableweb.com/apis
■ One can easily combine SOAP (Web Service) based services/systems with Web 2.0 (REST) services
Web 2.0 and Web Services II
■ Web 2.0 also has many well known capabilities, with Maps and Amazon Compute/Storage services of clear general relevance
■ There are also Web 2.0 services supporting novel collaboration modes and user interaction with the web, as seen in social networking sites and portals such as MySpace, YouTube, Connotea, Slideshare ….
■ I once thought Web Services were inevitable, but this is no longer clear to me
■ Web Services are complicated, slow and non-functional
  • WS-Security is unnecessarily slow and pedantic (canonicalization of XML)
  • WS-RM (Reliable Messaging) seems to have poor adoption and doesn't work well in collaboration
  • WSDM (distributed management) specifies a lot …
Attack of the Killer Multicores
■ Today commodity Intel systems are sold with 8 cores spread over two processors
■ Specialized chips such as GPUs and the IBM Cell processor have substantially more cores
■ Moore's Law will be satisfied by, and will imply, an exponentially increasing number of cores, doubling every 1.5-3 years
  • Modest increase in clock speed
  • Intel has already prototyped an 80-core server chip, ready in 2011?
■ Huge activity in parallel computing programming (recycled from the past?)
  • Some programming models and application styles are similar to Grids
■ We will have a Grid on a chip ……….
Grids meet Multicore Systems
■ The expected rapid growth in the number of cores per chip has important implications for Grids
■ With 16-128 cores on a single commodity system 5 years from now, one will both be able to build a Grid-like application on a chip and indeed must build such an application to get the Moore's law performance increase
  • Otherwise you will "waste" cores …..
■ One will not want to reprogram as one moves an application from a 64-node cluster or a transcontinental implementation to a single-chip Grid
■ However, multicore chips have a very different architecture from Grids
  • Shared, not distributed, memory
  • Latencies measured in microseconds, not milliseconds
■ Thus Grid and multicore technologies will need to "converge", and the converged technology model will have different requirements from current Grid assumptions
Grid versus Multicore Applications
■ It seems likely that future multicore applications will involve a loosely coupled mix of multiple modules that fall into three classes:
  • Data access/query/store
  • Analysis and/or simulation
  • User visualization and interaction
■ This is precisely the mix that Grids support, but Grids of course involve distributed modules
■ Grids and Web 2.0 use service oriented architectures to describe systems at the module level – is this an appropriate model for multicore programming?
■ Where do multicore systems get their data from?
RMS: Recognition Mining Synthesis
(Pradeep K. Dubey, pradeep.dubey@intel.com)
Intel has probably the most sophisticated analysis of future "killer" multicore applications; it is all about dealing efficiently with complex multimodal datasets.
■ Recognition ("What is …?"): model-based multimodal recognition
■ Mining ("Is it …?"): find a model instance; real-time analytics on dynamic, unstructured, multimodal datasets
■ Synthesis ("What if …?"): create a model instance; photo-realism and physics-based animation
■ Today, by contrast: model-less, real-time streaming and transactions on static, structured datasets; very limited realism
■ Example: What is a tumor? Is there a tumor here? What if the tumor progresses?
Role of Data in Grid/Multicore I
■ One typically is told to place compute (analysis) at the data, but most of the computing power is in multicore clients on the edge
■ These multicore clients can get data from the internet, i.e. distributed sources
  • This could be personal interests of the client, used by the client to help the user interact with the world
  • It could be cached or copied
  • It could be a standalone calculation or part of a distributed coordinated computation (SETI@Home)
■ Or they could get data from a set of local sensors (video-cams and environmental sensors) naturally stored on the client or locally to the client
Role of Data in Grid/Multicore II
■ Note that as you increase the sophistication of data analysis, you increase the ratio of compute to I/O
  • A typical modern datamining approach like the Support Vector Machine is sophisticated (dense) matrix algebra and not just text matching
  • http://grids.ucs.indiana.edu/ptliupages/presentations/PC2007/PC07BYOPA.ppt
■ The time complexity of sophisticated data analysis will make it more attractive to fetch data from the Internet and cache/store it on the client
  • It will also help with memory bandwidth problems in multicore chips
■ In this vision, the Grid "just" acts as a source of data and the Grid application runs locally
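A rough flop-count sketch of the compute-to-I/O point above, with made-up sizes: fetching n points of dimension d moves O(n·d) values, while the dense kernel (Gram) matrix at the heart of methods like the Support Vector Machine costs O(n²·d) flops, so the ratio grows with n.

```python
# Sketch only: compute-to-I/O ratio for a dense Gram matrix, the core
# of kernel methods such as Support Vector Machines. Sizes are
# illustrative assumptions, not from the talk.
import numpy as np

n, d = 2_000, 100             # n points, d features
X = np.random.rand(n, d)      # stands in for data fetched over the network

gram = X @ X.T                # dense matrix algebra, not just text matching

io_values = n * d             # values moved from the data source
flops = 2 * n * n * d         # multiply-add per inner-product element
print(f"compute/I-O ratio ~ {flops / io_values:,.0f} flops per value fetched")
```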
Multicore Programming Paradigms
• At a very high level, there are three or four broad classes of parallelism:
• Coarse grain functional parallelism, typified by workflow and often used to build composite "metaproblems" whose parts are also parallel
  – "Compute-File", Database/Sensor, Community, Service, and Pleasingly Parallel (master-worker) are sub-classes
• Large scale loosely synchronous data parallelism, where dynamic irregular work has clear synchronization points, as in most large scale scientific and engineering problems
• Fine grain (asynchronous) thread parallelism, as used in search algorithms which are often data parallel (over choices) but don't have universal synchronization points
Data Parallel Time Dependence
• A simple form of data parallel application is synchronous, with all elements of the application space being evolved with essentially the same instructions
• Such applications are suitable for SIMD computers and run well on vector supercomputers (and GPUs, but these are more general than just synchronous)
• However, synchronous applications also run fine on MIMD machines
  – The SIMD CM-2 evolved to the MIMD CM-5 with the same data parallel language, CMFortran
• The iterative solutions to Laplace's equation are synchronous, as are many full matrix algorithms
• Synchronization on MIMD machines is accomplished by messaging
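A minimal sketch of the synchronous style named above, using the Jacobi iteration for Laplace's equation (grid size and step count are arbitrary choices):

```python
# Sketch: synchronous data-parallel update via Jacobi iteration for
# Laplace's equation on a square grid. Every interior point is updated
# with the same instruction each step, which is why such applications
# suit SIMD machines and also run fine on MIMD ones.
import numpy as np

u = np.zeros((64, 64))
u[0, :] = 1.0                          # fixed boundary condition

for step in range(500):
    # the same 4-point stencil is applied to all interior points "at once"
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
```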
Local Messaging for Synchronization
• MPI_SENDRECV is the typical primitive
• Processors do a send followed by a receive, or a receive followed by a send
• In two stages (needed to avoid race conditions), one has a complete left shift
• Often followed by the equivalent right shift, to get a complete exchange
• This logic guarantees correctly updated data is sent to processors that have their data at the same simulation time
(Figure: shift pattern across 8 processors)
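A small sketch of the shift/exchange pattern above using mpi4py (run with e.g. `mpiexec -n 8 python shift.py`); the Sendrecv call pairs the send and receive so the two-stage ordering needed to avoid races is handled by the library:

```python
# Sketch: left shift then right shift gives a complete exchange,
# the MPI_SENDRECV idiom described on this slide.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

mine = np.array([float(rank)])     # this processor's boundary data
from_right = np.empty(1)
from_left = np.empty(1)

# complete left shift, then the equivalent right shift -> exchange
comm.Sendrecv(mine, dest=left, recvbuf=from_right, source=right)
comm.Sendrecv(mine, dest=right, recvbuf=from_left, source=left)
```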
Loosely Synchronous Applications
• This is the most common class of large scale science and engineering: one has the traditional data parallelism, but now each data point has in general a different update
  – This comes from heterogeneity in problems that would be synchronous if homogeneous
• Time steps are typically uniform, but sometimes one needs to support variable time steps across the application space – however, ensure small time steps satisfy δt = (t1 − t0)/Integer so that subspaces with finer time steps do synchronize with the full domain (see the worked form after this list)
• The time synchronization via messaging is still valid
• However, one no longer load balances (ensuring each processor does equal work in each time step) by putting an equal number of points in each processor
• Load balancing, although NP-complete, is in practice surprisingly easy
(Figure: application space evolving through times t0, t1, t2, t3, t4)
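A worked form of the time-step constraint above (a sketch in the bullet's own notation): a subdomain needing a finer step takes N equal substeps per global step, so its local times land exactly on the global synchronization times.

```latex
\delta t = \frac{t_1 - t_0}{N}, \quad N \in \mathbb{Z}^{+},
\qquad t_k = t_0 + k\,\delta t \quad (k = 0, 1, \dots, N),
\qquad t_N = t_1 .
```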
MPI Futures?
• MPI is likely to become more important as multicore systems become more common
• One should use MPI when MPI is needed, and use other messaging for other cases (such as linking services) where different features/performance are appropriate
• MPI has too many primitives, which will handicap broad implementation/adoption
• Perhaps only have one collective primitive like …
Fine Grain Dynamic Applications
• Here there is no natural universal 'time' as there is in science algorithms, where an iteration number or Mother Nature's time gives global synchronization
• Loose (zero) coupling or special features of the application are needed for successful parallelization
• In computer chess, the minimax scores at parent nodes provide multiple dynamic synchronization points
(Figure: application space versus application time)
Computer Chess
• Thread level parallelism, unlike the position evaluation parallelism used in other systems
• Competed, with poor reliability and results, in the 1987 and 1988 ACM Computer Chess Championships
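A toy sketch of this fine grain dynamic style: child positions are evaluated by a thread pool, with no universal synchronization point. Here `evaluate` is a hypothetical stand-in for a static evaluator; a real chess program would search recursively and use minimax scores at parent nodes as dynamic synchronization points.

```python
# Sketch: thread parallelism over "choices" in a game tree.
from concurrent.futures import ThreadPoolExecutor

def evaluate(position):
    return hash(position) % 100        # placeholder score, not real chess

def best_move(position, moves):
    children = [position + m for m in moves]
    with ThreadPoolExecutor(max_workers=8) as pool:
        scores = list(pool.map(evaluate, children))
    best = max(range(len(moves)), key=lambda i: scores[i])
    return moves[best], scores[best]

print(best_move("start:", ["e4", "d4", "Nf3"]))
```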
Discrete Event Simulations
• These are familiar in military and circuit (system) simulations when one uses macroscopic approximations
  – Also probably the paradigm of most multiplayer Internet games/worlds
• Note Nature is perhaps synchronous when viewed quantum mechanically in terms of uniform fundamental elements (quarks and gluons etc.)
• It is loosely synchronous when considered in terms of particles and mesh points
• It is asynchronous when viewed in terms of tanks, people, arrows etc.
• Circuit simulations can be done loosely synchronously, but inefficiently, as many elements are inactive
Programming Models
• The three major models are supported by the HPCS languages, which are very interesting but too monolithic
• The fine grain thread parallelism and large scale loosely synchronous data parallelism styles are distinctive to parallel computing, while
• Coarse grain functional parallelism of multicore overlaps with workflows from Grids and mashups from Web 2.0
• It seems plausible that a more uniform approach will evolve for the coarse grain case, although this is the least constrained of the programming styles, as typically latency issues are not critical
  – Multicore would have the strongest performance constraints
  – Web 2.0 and Multicore have the most important usability constraints
• A possible model for broad use of multicores is that the difficult parallel algorithms are coded as libraries (fine grain thread parallelism and large scale loosely synchronous data parallelism) which are then composed by a broader community
Google MapReduce: Simplified Data Processing on Large Clusters
• http://labs.google.com/papers/mapreduce.html
• This is a dataflow model between services, where services can do useful document oriented data parallel applications including reductions
• The decomposition of services onto cluster engines is automated
• The large I/O requirements of datasets change the efficiency analysis in favor of dataflow
• Services (counting words in the example) can obviously be extended to general parallel applications
• There are many alternatives to the language for expressing either dataflow and/or coordination
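The paper's word-count example in miniature, as a sketch (a real deployment also automates the decomposition onto cluster engines, which this toy omits):

```python
# Sketch: MapReduce word count. map emits (word, 1) pairs, the
# framework groups by key, and reduce sums the counts.
from collections import defaultdict

def map_phase(document):
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:              # "shuffle" and reduce in one pass
        counts[word] += n
    return dict(counts)

docs = ["the quick brown fox", "the lazy dog"]
pairs = [p for d in docs for p in map_phase(d)]
print(reduce_phase(pairs))             # {'the': 2, 'quick': 1, ...}
```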
Programming Models
• The services and objects in distributed computing are usually "natural" (they come from the application), whereas parts connected by MPI (or created by a parallelizing compiler) come from "artificial" decompositions and are not naturally considered services
• Services in multicore (parallel computing) are the original modules before decomposition, and it is these modules that coarse grain functional parallelism addresses
• Most of the "difficult" issues in parallel computing lie inside the decomposed parts, not in the coarse grain composition
Parallel Software Paradigms: Top Level
• In the conventional two-level Grid/Web Service programming model, one programs each individual service and then separately programs their interaction
  – This is the Grid-aware Services programming model
  – SAGA supports Grid-aware programs?
• This is generalized to multicore with "Marine Corps" programming of services for the "difficult" cases:
  – Loosely synchronous
  – Fine grain threading
  – Discrete event simulation
• The "average" programmer produces mashups or workflows linking these services
The Marine Corps "Lack of Programming Paradigm" Library Model
• One could assume that parallel computing is "just too hard for real people" and that we should use a Marine Corps of programmers to build, as libraries, excellent parallel implementations of "all" core capabilities
  – e.g. the primitives identified in the Intel application analysis
  – e.g. the primitives supported in Google MapReduce, HPF, PeakStream, Microsoft Data Parallel .NET etc.
• These primitives are orchestrated (linked together) by overall frameworks such as workflows or mashups
• The Marine Corps probably is content with efficient rather than easy-to-use tools
Component Parallel and Program Parallel
• The component parallel paradigm is where one explicitly programs the different parts of a parallel application, with the linkage either specified externally, as in workflow, or in the components themselves, as in most other component parallel approaches
  – In Grids, components are natural
  – In parallel computing, components are produced by decomposition
• In the program parallel paradigm, one writes a single program to describe the whole application, and some combination of compiler and runtime breaks up the program into the multiple parts that execute in parallel (a sketch of the contrast follows below)
• Note that a program parallel approach will often call a built-in runtime library written in component parallel fashion
  – A parallelizing compiler could call an MPI library routine
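A small sketch of the contrast, under the obvious simplifications: in the "program parallel" style the runtime splits one logical operation, while in the "component parallel" style the parts and their message linkage are written out explicitly.

```python
# Sketch: program parallel vs component parallel in miniature.
from multiprocessing import Pool
import queue, threading

def f(x):
    return x * x

if __name__ == "__main__":
    # Program parallel: one program; the runtime decomposes the map
    with Pool(4) as pool:
        print(pool.map(f, range(8)))

    # Component parallel: explicitly programmed parts linked by messages
    q = queue.Queue()
    def worker(xs):
        for x in xs:
            q.put(f(x))                # explicit message to the consumer
    t = threading.Thread(target=worker, args=(range(8),))
    t.start(); t.join()
    print([q.get() for _ in range(8)])
```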
Component Parallel and Program Parallel
• Program parallel approaches include:
  – Data structure parallel, as in Google MapReduce, HPF (High Performance Fortran), the HPCS (High-Productivity Computing Systems) languages, or "SIMD" co-processor languages (PeakStream, ClearSpeed and Microsoft Data Parallel .NET)
  – Parallelizing compilers, including OpenMP annotation
  – Note OpenMP and HPF have failed in some sense for large scale parallel computing (writing an algorithm in standard sequential languages throws away information needed for parallelization)
• Component parallel approaches include:
  – MPI (and related systems like PVM) parallel message passing
  – PGAS (Partitioned Global Address Space: CAF, UPC, Titanium, HPJava)
  – C++ futures and active objects
  – CSP … Microsoft CCR and DSS
  – Workflow and mashups
Why people like MPI!
• Jason J. Beech-Brandt and Andrew A. Johnson, at AHPCRC Minneapolis
• BenchC is an unstructured finite element CFD solver
• Looked at OpenMP on a shared memory Altix, with some effort to optimize
• Optimized UPC on several machines
• MPI always good, but the other approaches erratic
• Other studies reach similar conclusions?
(Figure: BenchC performance per cluster, after optimization of UPC)
Web 2.0 Systems are Portals, Services, Resources
■ Captures the incredible development of interactive Web sites enabling people to create and collaborate
Mashups v Workflow?
■ Mashup tools are reviewed at http://blogs.zdnet.com/Hinchcliffe/?p=63
■ Workflow tools are reviewed by Gannon and Fox: http://grids.ucs.indiana.edu/ptliupages/publications/Workflow-overview.pdf
■ Both include scripting in PHP, Python, sh etc., as both implement distributed programming at the level of services
■ Mashups use all types of service interfaces and do not have the potential robustness (security) of the Grid service approach
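A sketch of a mashup in the sense above: script-level composition of two service interfaces. The URLs and JSON keys here are hypothetical placeholders; real mashups would hit interfaces like those listed on http://www.programmableweb.com/apis.

```python
# Sketch only: join data from two REST-style JSON services in a script.
import json
from urllib.request import urlopen

def fetch(url):
    with urlopen(url) as resp:          # plain HTTP GET returning JSON
        return json.load(resp)

def mashup(events_url, places_url):
    events = fetch(events_url)          # e.g. a hypothetical event feed
    places = fetch(places_url)          # e.g. a hypothetical geocoder
    # join the two services' data on a shared key ("place" is assumed)
    return [(e["title"], places.get(e["place"])) for e in events]
```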
Web 2.0 APIs
■ http://www.programmableweb.com/apis has (May 14 2007) 431 Web 2.0 APIs, with Google Maps the most often used in mashups
■ This site acts as a "UDDI": the list of Web 2.0 APIs
■ Each site has an API and its features; divided into broad categories
■ Only a few are used a lot (42 APIs are used in more than 10 mashups)
■ RSS feed of new APIs; Amazon S3 growing
APIs/Mashups per Protocol Distribution
(Chart: distribution over REST; SOAP; XML-RPC; REST + XML-RPC; XML-RPC + REST + SOAP; REST + SOAP; JS; other)
■ 4 more mashups each day, for a total of 1906 on April 17 2007 (4.0 a day over the last month)
■ Note ClearForest runs Semantic Web Services mashup competitions (not workflow competitions)
■ Some mashup types: aggregators, search aggregators, visualizers, mobile, maps, games
Implication for Grid Technology of Multicore and Web 2.0 I
■ Web 2.0 and Grids are addressing a similar application class, although Web 2.0 has focused on user interactions
  • So the technology has similar requirements
■ Multicore differs significantly from Grids in component location, and this seems particularly significant for data
  • It is therefore not clear how similar the applications will be
  • The Intel RMS multicore application class is pretty similar to Grids
■ Multicore has more stringent software requirements than Grids, as the latter has intrinsic network overhead
Implication for Grid Technology of Multicore and Web 2.0 II
■ Multicore chips require low overhead protocols to exploit low latency, which suggests simplicity
  • We need to simplify MPI AND Grids!
■ Web 2.0 chooses simplicity (REST rather than SOAP) to lower the barrier to everyone participating
■ Web 2.0 and Multicore tend to use traditional (possibly visual) scripting languages for the equivalent of workflow, whereas Grids use a visual interface with the back end recorded in BPEL
  • Google MapReduce illustrates a popular Web 2.0 and Multicore approach to dataflow
Implication for Grid Technology of Multicore and Web 2.0 III
■ Web 2.0 and Grids both use SOA, Service Oriented Architectures
  • It seems likely that Multicore will also adopt this, although a more conventional object oriented approach is also possible
  • Services should help multicore applications integrate modules from different sources
  • Multicore will use fine grain objects but coarse grain services
■ "System of Systems": Grids, Web 2.0 and Multicore are likely to build systems hierarchically out of smaller systems
  • We need to support Grids of Grids, Webs of Grids, Grids of Multicores etc., i.e. systems of systems of all sorts
The Ten Areas Covered by the 60 Core WS-* Specifications
1: Core Service Model – XML, WSDL, SOAP
2: Service Internet – WS-Addressing, WS-MessageDelivery; Reliable Messaging WSRM; Efficient Messaging MOTM
3: Notification – WS-Notification, WS-Eventing (Publish-Subscribe)
4: Workflow and Transactions – BPEL, WS-Choreography, WS-Coordination
5: Security – WS-Security, WS-Trust, WS-Federation, SAML, WS-SecureConversation
6: Service Discovery – UDDI, WS-Discovery
7: System Metadata and State – WSRF, WS-MetadataExchange, WS-Context
8: Management – WSDM, WS-Management, WS-Transfer
9: Policy and Agreements – WS-Policy, WS-Agreement
10: Portals and User Interfaces – WSRP (Remote Portlets)
WS-* Areas and Web 2.0
1: Core Service Model – XML becomes optional but still useful; SOAP becomes JSON, RSS, ATOM; WSDL becomes REST with the API as GET, PUT etc.; Axis becomes XmlHttpRequest
2: Service Internet – No special QoS; use JMS or equivalent?
3: Notification – Hard with HTTP without polling; JMS perhaps?
4: Workflow and Transactions (no transactions in Web 2.0) – Mashups, Google MapReduce; scripting with PHP, JavaScript ….
5: Security – SSL, HTTP authentication/authorization; OpenID is Web 2.0 single sign-on
6: Service Discovery – http://www.programmableweb.com
7: System Metadata and State – Processed by application, no system state; microformats are a universal metadata approach
8: Management == Interaction – WS-Transfer style protocols: GET, PUT etc.
9: Policy and Agreements – Service dependent; processed by application
10: Portals and User Interfaces – Start pages, AJAX and widgets (Netvibes), gadgets
WS-* Areas and Multicore
1: Core Service Model – Fine grain Java/C#/C++ objects and coarse grain services as in DSS; information passed explicitly or by handles; MPI needs to be updated to handle non-scientific applications, as in CCR
2: Service Internet – Not so important intra-chip
3: Notification – Publish-subscribe for events and interrupts
4: Workflow and Transactions – Many approaches; scripting languages popular
5: Security – Not so important intra-chip
6: Service Discovery – Use libraries
7: System Metadata and State – Environment variables
8: Management == Interaction – Interaction between objects is a key issue in parallel programming, trading off efficiency versus performance
9: Policy and Agreements – Handled by application
10: Portals and User Interfaces – Web 2.0 technology popular
CCR as an Example of a Cross-Paradigm Run Time
• Naturally supports fine grain thread switching with message passing, with around 4 microsecond latency for 4 threads switching to 4 others on an AMD PC with C#; threads are spawned with no rendezvous
• Has around 50 microsecond latency for coarse grain service interactions with the DSS extension, which supports Web 2.0 style messaging
• MPI collectives (shift and exchange) vary from 10 to 20 microsecond latency in rendezvous mode
• Not as good as the best MPIs, but it is managed code and supports Grids, Web 2.0 and parallel computing ……
Microsoft CCR
• Supports exchange of messages between threads using named ports
• FromHandler: spawn threads without reading ports
• Receive: each handler reads one item from a single port
• MultipleItemReceive: each handler reads a prescribed number of items of a given type from a given port; note items in a port can be general structures, but all must have the same type
• MultiplePortReceive: each handler reads one item of a given type from multiple ports
• JoinedReceive: each handler reads one item from each of two ports; the items can be of different types
• Choice: execute a choice of two or more port-handler pairings
• Interleave: consists of a set of arbiters (port-handler pairs) of 3 types that are Concurrent, Exclusive or Teardown (called at the end for clean up); concurrent arbiters are run concurrently, but exclusive handlers run one at a time
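A rough flavor of the port/handler model, approximated here with Python threads and queues; this is a sketch only (CCR itself is a C# library and its real scheduler is far more sophisticated than spawning a thread per handler).

```python
# Sketch: CCR-style ports and one-shot handlers, approximated in Python.
import queue, threading

class Port(queue.Queue):
    """A named channel that handlers read items from."""

def receive(port, handler):
    # like Receive: a handler reads one item from a single port
    threading.Thread(target=lambda: handler(port.get())).start()

def joined_receive(p1, p2, handler):
    # like JoinedReceive: fires only with one item from each port
    threading.Thread(target=lambda: handler(p1.get(), p2.get())).start()

p = Port()
receive(p, lambda item: print("got", item))
p.put(1)

a, b = Port(), Port()
joined_receive(a, b, lambda x, y: print("pair", x, y))
a.put(2); b.put(3)
```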
(Figure: Overhead (latency, in microseconds) of an AMD 4-core PC with 4 execution threads on MPI-style rendezvous messaging for shift and exchange, implemented either as two shifts or as a custom CCR pattern, plotted against the number of stages in millions; compute time is 10 seconds divided by the number of stages.)
(Figure: The same measurement on an Intel 8-core PC with 8 execution threads; compute time is 15 seconds divided by the number of stages.)
(Figure: Timing of an HP Opteron multicore as a function of the number of simultaneous two-way service messages processed, November 2006 DSS release.)
■ CGL measurements of Axis 2 show about 500 microseconds – DSS is 10 times better