High Performance Data
Streaming in a Service
Architecture
Jackson State University Internet Seminar
November 18 2004 Geoffrey Fo
Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401
Abstract
n We discuss a class of HPC applications characterized
by large scale simulations linked to large data streams coming from sensors, data repositories and other
simulations.
n Such applications will increase in importance to
support "data-deluged science”.
n We show how Web service and Grid technologies offer
significant advantages over traditional approaches from the HPC community.
n We cover Grid workflow (contrasting it with dataflow)
and how Web Service (SOAP) protocols can achieve high performance
Parallel Computing
n Parallel processing is built on breaking problems up
into parts and simulating each part on a separate computer node
n There are several ways of expressing this breakup into
parts with Software:
• Message Passing as in MPI or
• OpenMP model for annotating traditional languages
• Explicitly parallel languages like High Performance Fortran
n And several computer architectures designed to
support this breakup
• Distributed Memory with or without custom interconnect • Shared Memory with or without good cache
• Vectors with usually good memory bandwidth
The Six Fundamental MPI routines
u MPI_Init (argc, argv) -- initialize
u MPI_Comm_rank (comm, rank) -- find process
label (rank) in group
u MPI_Comm_size(comm, size) -- find total number
of processes
u MPI_Send (sndbuf,count,datatype,dest,tag,comm)
-- send a message
u MPI_Recv
(recvbuf,count,datatype,source,tag,comm,status) -- receive a message
u MPI_Finalize( ) -- End Up
Whatever the Software/Parallel
Architecture …..
n The software is a set of linked parts
• Threads, Processes sharing the same memory or independent programs
on different computers
n And the parts must pass information between them in to
synchronize themselves and ensure they really are working the same problem
n The same of course is true in any system
• Neurons pass electrical signals in the brain
• Humans use a variety of information passing schemes to build
communities: voice, book, phone
• Ants and Bees use chemical messages
n Systems are built of parts and in interesting systems the parts communicate with each other and this communication expresses “why it is a system” and not a bunch of independent bits
A Picture from 20 years ago
Passing Information
n Information passing between parts covers a wide range
in size (number of bits electronically) and “urgency”
n Communication Time = Latency + (Information
Size)/Bandwidth
n From Society we know that we choose multiple
mechanisms with different tradeoffs • Planes and high latency and bandwidth
• Walking is low latency but low bandwidths • Cars are somewhat in between theses cases
n We can always think of information being transferred
as a message
• If airplane passenger, sound waves or a posted letter
• Whether if an MPI message or UNIX Pipe between processes
or a method call between threads
Parallel Computing and Message Passing
n We worked very hard to get a better programming
model for parallel computing that removed need for user to
• Explicitly decompose problem and derive parallel
algorithm for decomposed parts
• Write MPI programs expressing explicit
decomposition
n This effort wasn’t so successful and on distributed
memory machines (including BlueGene/L) at least message passing of MPI style is the execution model even if one uses a higher level language
n So for parallelism, we are forced to use message passing
and this is efficient but intellectually hard
The Latest Top 5 in Top500
What about Web Services?
• Web Services are distributed computer programs
that can be in any language (Fortran .. Java .. Perl ..
Python)
• The simplest implementations involve
XML
messages (SOAP)
and programs written in net
friendly languages like Java and Python
• Here is a typical e-commerce use?
Securit
y Catalog
Internet Programming Model
n Web Services are designed as the latest distributed computing
programming paradigm motivated by the Internet and the expectation that enterprise software will be built on the same software base
n Parallel Computing is centered on DECOMPOSITION n Internet Programming is centered on COMPOSITION n The components of e-commerce (catalog, shipping, search,
payment) are NATURALLY separated (although they are often mistakenly integrated in older implementations)
n These same components are naturally linked by Messages n MPI is replaced by SOAP and the COMPOSITION model is
called Workflow
n Parallel Computing and the Internet have the same execution
model (processes exchanging messages) but very different REQUIREMENTS
Requirements for MPI Messaging
n MPI and SOAP Messaging both send data from a source to a
destination
• MPI supports multicast (broadcast) communication;
• MPI specifies destination and a context (in comm parameter) • MPI specifies data to send
• MPI has a tag to allow flexibility in processing in source processor • MPI has calls to understand context (number of processors etc.)
n MPI requires very low latency and high bandwidth so that
tcomm/tcalc is at most 10
• BlueGene/L has bandwidth between 0.25 and 3
Gigabytes/sec/node and latency of about 5 microseconds
• Latency determined so Message Size/Bandwidth > Latency
tcomm
tcalc tcalc
BlueGene/L MPI I
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pd f
BlueGene/L MPI II
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pd f
BlueGene/L MPI III
http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf
500
Megabytes/sec
Requirements for SOAP Messaging
n Web Services has much of the same requirements as MPI with
two differences where MPI more stringent than SOAP
• Latencies are inevitably 1 (local) to 100 milliseconds which is
200 to 20,000 times that of BlueGene/L
n 1) 0.000001 ms – CPU does a calculation n 2) 0.001 to 0.01 ms – MPI latency
n 3) 1 to 10 ms – wake-up a thread or process n 4) 10 to 1000 ms – Internet delay
• Bandwidths for many business applications are low as one
just needs to send enough information for ATM and Bank to define transactions
n SOAP has MUCH greater flexibility in areas like security,
fault-tolerance, “virtualizing addressing” because one can run a lot of software in 100 milliseconds
• Typically takes 1-3 milliseconds to gobble up a modest
message in Java and “add value”
Ways of Linking Software Modules
Module A Module
B
.001 to 1 millisecond METHOD CALL BASED
Service A Service
B Messages
0.1 to 1000 millisecond latency
MESSAGE BASED
Coarse Grain Service Model
Closely coupled Java/Python …
Service B Service A
Publisher Post Events “Listener
Subscribe to Events
Message Queue in the Sky
EVENT BASED with brokered messages
MPI and SOAP Integration
n Note SOAP Specifies format and through WSDL
interfaces
n MPI only specifies interface and so interoperability
between different MPIs requires additional work • IMPI http://impi.nist.gov/IMPI/
n Pervasive networks can support high bandwidth
(Terabits/sec soon) but latency issue is not resolvable in general way
n Can combine MPI interfaces with SOAP messaging but
I don’t think this has been done
n Just as walking, cars, planes, phones coexist with
different properties; so SOAP and MPI are both good and should be used where appropriate
NaradaBrokering
n http://www.naradabrokering.org
n We have built a messaging system that is designed to
support traditional Web Services but has an
architecture that allows it to support high performance data transport as required for Scientific applications • We suggest using this system whenever your application can
tolerate 1-10 millisecond latency in linking components
• Use MPI when you need much lower latency
n Use SOAP approach when MPI interfaces required but
latency high
• As in linking two parallel applications at remote sites
n Technically it forms an overlay network supporting in
software features often done at IP Level
Pentium-3, 1GHz, 256 MB RAM
100 Mbps LAN
JRE 1.3 Linux
hop-3 0 1 2 3 4 5 6 7 8 9 100 1000 Transit Delay (Milliseconds)
Message Payload Size (Bytes)
Mean transit delay for message samples in NaradaBrokering: Different communication hops
Average Video Delays for one broker –
divide by N for N load balanced brokers
Latency ms
# Receivers One session Multipl
sessions
30 frames/sec
NB-enhanced GridFTP
Adds Reliability and Web Service Interfaces to GridFTP
Preserves parallel TCP performance and offers choice of transport and
Firewall penetration
Role of Workflow
n Programming SOAP and Web Services (the Grid):
Workflow describes linkage between services
n As distributed, linkage must be by messages
n Linkage is two-way and has both control and data
n Apply to disciplinary, scale linkage,
multi-program linkage, link visualization to simulation, GIS to
simulations and visualization filters to each other
n Microsoft-IBM specification BPEL is current preferred
Web Service XML specification of workflow
Service-1 Service-3
Service-2
Example workflow
Here a sensor feeds a data-mining application
(We are extending
data-mining in DoD applications with Grossman from UIC) The data-mining
application drives a visualization
Example Flood Simulation workflow
SERVOGrid Codes, Relationships
Elastic Dislocation
Pattern Recognizers
Fault Model BEM
Viscoelastic Layered BEM
Viscoelastic FEM Elastic Dislocation Inversion
Two-level Programming I
• The Web Service (Grid) paradigm implicitly assumes a
two-level Programming Model
• We make a Service (same as a “distributed object” or
“computer program” running on a remote computer) using conventional technologies
– C++ Java or Fortran Monte Carlo module – Data streaming from a sensor or Satellite – Specialized (JDBC) database access
• Such services accept and produce data from users files and database
• The Grid is built by coordinating such services assuming we have solved problem of programming the service
Servic
e Data
Two-level Programming II
n The Grid is discussing the composition of distributed
services with the runtime interfaces to Grid as
opposed to UNIX pipes/data streams
n Familiar from use of UNIX Shell, PERL or Python
scripts to produce real applications from core programs
n Such interpretative environments are the single
processor analog of Grid Programming
n Some projects like GrADS from Rice University are
looking at integration between service and composition levels but dominant effort looks at each level separately
Service
1 Service2
Service
3 Service4
3 Layer Programming Model
Application (level 1 Programming)
Application Semantics (Metadata, Ontology) Level 2 “Programming”
Basic Web Service Infrastructure Web Service 1
Workflow (level 3) Programming BPEL
WS 2 WS 3 WS 4
MPI Fortran C++ etc.
Semantic Web
Structure of SOAP
• SOAP defines a very obvious message structure with a header
and a body just like email
• The header contains information used by the “Internet operating system”
– Destination, Source, Routing, Context, Sequence Number …
• The message body is partly further information used by the
operating system and partly information for application when it is not looked at by “operating system” except to encrypt,
compress it etc.
– Note WS-Security supports separate encryption for different parts of a document
• Much discussion in field revolves around what is referenced in header
• This structure makes it possible to define VERY Sophisticated messaging
Deployment Issues for “System Services”
n “System Services” (handlers/filters) are ones that act
before the real application logic of a service
n They gobble up part of the SOAP header identified by
the namespace they care about and possibly part or all of the SOAP body
• e.g. the XML elements in header from the WS-RM
namespace
n They return a modified SOAP header and body to next
handler in chain
WS-R Handl
er
WS-……. Handler Header
Body
e.g. ……. Could be Eventing
Fast Web Service Communication I
• Internet Messaging systems allow one to optimize message streams at the cost of “startup time”,
• Web Services can deliver the fastest possible
interconnections with or without reliable messaging • Typical results from Grossman (UIC) comparing Slow
SOAP over TCP with binary and UDP transport (latter gains a factor of 1000)
Pure SOAP SOAP over UDP Binary over UDP
Fast Web Service Communication II
• Mechanism only works for
streams
– sets of related
messages
• SOAP header in streams is constant
except for
sequence number (Message ID), time-stamp ..
• One needs two types of new Web Service
Specification
• “
WS-StreamNegotiation
” to define how one can use
WS-Policy to send messages at start of a stream to
define the methodology for treating remaining
messages in stream
• “
WS-FlexibleRepresentation
” to define new
encodings of messages
Fast Web Service Communication III
• Then use “WS-StreamNegotiation” to negotiate stream in Tortoise SOAP – ASCII XML over HTTP and TCP –
– Deposit basic SOAP header through connection – it is part of context for stream (linking of 2 services)
– Agree on firewall penetration, reliability mechanism, binary representation and fast transport protocol
– Naturally transport UDP plus WS-RM
• Use “WS-FlexibleRepresentation” to define encoding of a Fast transport (On a different port) with messages just having
“FlexibleRepresentationContextToken”, Sequence Number, Time stamp if needed
– RTP packets have essentially this structure – Could add stream termination status
• Can monitor and control with original negotiation stream
• Can generate different streams optimized for different end-points
Data Deluged Science
n In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn’t consider it as an enabler of new
algorithms and new ways of computing
n Data assimilation was not central to HPCC
n DoE ASC set up because didn’t want test data!
n Now particle physics will get 100 petabytes from CERN
• Nuclear physics (Jefferson Lab) in same situation • Use around 30,000 CPU’s simultaneously 24X7
n Weather, climate, solid earth (EarthScope)
n Bioinformatics curated databases (Biocomplexity only 1000’s of
data points at present)
n Virtual Observatory and SkyServer in Astronomy n Environmental Sensor nets
Weather Requirements
Data
Information
Ideas Simulation
Model
Assimilation
Reasoning
Datamining
Computationa Science
Informatics
Data Deluge Scienc
Virtual Observatory Astronomy Gri
Integrate Experiments
Radio Far-Infrared Visible
Visible + X-ray
Dust Map
Galaxy Density
In flight data Airline Maintenance Centre Ground Station Global Network Such as SITA
Internet, e-mail, pager
Engine Health (Data) Center
DAME Data Deluged Engineering
Rolls Royce and UK e-Science Progra Distributed Aircraft Maintenance
Environment
~ Gigabyte per aircraft per Engine per transatlantic
flight
~5000 engines
USArray
Seismic
Sensors
a
Topography 1 km
Stress Change
Earthquakes
PBO
Site-specific Irregular
Scalar Measurements Constellations for Plate
Boundary-Scale Vector Measurements
a
a
Ice Sheets Volcanoes
Long Valley, CA
Northridge, CA
Hector Mine, CA Greenland
HPC Simulation Data Filter Data Filter Data Filter Data Filt er Data Filter Distributed Filters massage data For simulation Other Gri
and W
Data Assimilation
n Data assimilation implies one is solving some optimization
problem which might have Kalman Filter like structur
n Due to data deluge, one will become more and more dominated
by the data (Nobs much larger than number of simulation
points).
n Natural approach is to form for each local (position, time)
patch the “important” data combinations so that optimization doesn’t waste time on large error or insensitive data.
n Data reduction done in natural distributed fashion NOT on
HPC machine as distributed computing most cost effective if calculations essentially independent
• Filter functions must be transmitted from HPC machine
Distributed Filtering
HPC Machine Distribute
Machine
Data Filter
Nobslocal patch 1
Nfilteredlocal patch 1
Data Filter
Nobslocal patch 2
Nfilteredlocal patch 2
Geographicall y
Distribute
Sensor patches
Nobslocal patch >> Nfilteredlocal patch ≈ Number_of_Unknownslocal patch
Send needed Filter Receive filtered data In simplest approach, filtered data gotten by linear transformations on original data based on Singular Value Decomposition of Least
squares matrix
Factorize Matri to product of