Cyberinfrastructure to integrate simulation, data and sensors for collaborative eScience in CReSIS
CERSER and CReSIS
http://nia.ecsu.edu/
Elizabeth City State University
October 19 2006
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University Bloomington IN 47401
Abstract
• Cyberinfrastructure supports eScience, or collaborative science with distributed scientists, computers, data repositories and sensors.
• We describe the emerging Grid software for eScience and the underlying Cyberinfrastructure such as the TeraGrid.
• We give one example in detail: iSERVO, the International Solid Earth Research Virtual Organization supporting Earthquake Science.
• This illustrates Computing Grids, Geographical Information System Grids and Sensor Grids.
• We suggest implications for CReSIS, the Center for Remote Sensing of Ice Sheets.
Why Cyberinfrastructure Is Useful
• Supports distributed science: data, people, computers
• Exploits Internet technology (Web 2.0), adding management, security, supercomputers etc.
• It has two aspects: parallel, with low latency (microseconds) between nodes, and distributed, with higher latency (milliseconds) between nodes
• The parallel aspect is needed to get high performance on individual 3D simulations, data analysis etc.; one must decompose the problem
• The distributed aspect integrates already distinct components
• Cyberinfrastructure is in general a distributed collection of parallel systems
• Grids are made of services that are “just” programs or data sources packaged for distributed access
e-moreorlessanything and the Grid
• ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ from its inventor John Taylor, Director General of Research Councils UK, Office of Science and Technology
• e-Science is about developing tools and technologies that allow scientists to do ‘faster, better or different’ research
• Similarly e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world.
    • The growing use of outsourcing is one example
• The Grid provides the information technology e-infrastructure for e-moreorlessanything.
• A deluge of data of unprecedented and inevitable size must be managed and understood.
• People, computers, data and instruments must be linked.
• On demand assignment of experts, computers, networks and
TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh Supercomputing Center, and the National Center for Atmospheric Research.
Today 100 teraflops; tomorrow a petaflop; Indiana 20 teraflops today.
[Figure: map of TeraGrid sites. Labels: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo]
Virtual Observatory Astronomy Grid
Integrate Experiments
[Figure: panels labeled Radio, Far-Infrared, Visible, Visible + X-ray, Dust Map]
Grid Capabilities for Science
• Open technologies for any large scale distributed system that are adopted by industry, many sciences and many countries (including UK, EU, USA, Asia)
    • Security, Reliability, Management and state standards
• Service and messaging specifications
• User interfaces via portals and portlets virtualizing to desktops, email, PDAs etc.
    • ~20 TeraGrid Science Gateways (their name for portals)
    • OGCE Portal technology effort led by Indiana
• Uniform approach to access distributed (super)computers supporting single (large) jobs and spawning lots of related jobs
• Data and meta-data architecture supporting real-time and archives as well as federation
    • Links to Semantic web and annotation
• Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead)
• Many Earth science grids including ESG (DoE), GEON, LEAD, SCEC, SERVO; LTER and NEON for Environment
    • http://www.nsf.gov/od/oci/ci-v7.pdf
APEC Cooperation for Earthquake Simulation
• ACES is a seven-year-long collaboration among scientists interested in earthquake and tsunami prediction
    • iSERVO is infrastructure to support the work of ACES
    • SERVOGrid is the (completed) US Grid that is a prototype of iSERVO
    • http://www.quakes.uq.edu.au/ACES/
• Chartered under APEC –
Grid of Grids: Research Grid and Education Grid
[Figure: SERVOGrid research Grid and Education Grid. Labels: Repositories; Federated Databases; Database; Sensors; Streaming Data; Field Trip Data; Data Filter Services; Discovery Services; Research Simulations; Analysis and Visualization Portal; Computer Farm; Customization Services; From Research to Education; Education Grid]
SERVOGrid and Cyberinfrastructure
• Grids are the technology based on Web services that implement Cyberinfrastructure, i.e. support eScience or science as a team sport
    • Internet scale managed services that link computers, data repositories, sensors, instruments and people
• There is a portal and services in SERVOGrid for
    • Applications such as GeoFEST, RDAHMM, Pattern Informatics, Virtual California (VC), Simplex, mesh generating programs ...
    • Job management and monitoring web services for running the above codes.
    • File management web services for moving files between various machines.
    • Geographical Information System services
    • Quaketables earthquake specific database
    • Sensors as well as databases
    • Context (dynamic metadata) and UDDI (system long term metadata) services
[Figure labels: Topography 1 km; Stress Change; Earthquakes; PBO; Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector Measurements; Ice Sheets; Volcanoes; Long Valley, CA; Northridge, CA; Hector Mine, CA; Greenland]
Some Grid Concepts I
• Services are “just” (distributed) programs sending and receiving messages with well defined syntax
• Interfaces (input-output) must be open; innards can be open source (allowing you to modify) or proprietary
    • Services can be in any language from Fortran, Shell scripts, C, C#, C++, Java, Python, Perl: your choice!
    • Web Services supported by all vendors (IBM, Microsoft ...)
• Service overhead will be just a few milliseconds (more now), which is less than typical network transit time
    • Any program that is distributed can be a Web service
    • Any program taking execution time ≥ 20 ms can be a Web service
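To make the 20 ms rule of thumb concrete, the sketch below estimates what fraction of total time goes to service overhead. The 2 ms default is an assumed stand-in for "a few milliseconds", not a measured value, and the function name is illustrative.

```python
def overhead_fraction(exec_ms: float, overhead_ms: float = 2.0) -> float:
    """Fraction of wall-clock time spent on service overhead.

    overhead_ms is an assumed value standing in for "a few milliseconds".
    """
    if exec_ms < 0 or overhead_ms < 0:
        raise ValueError("times must be non-negative")
    total = exec_ms + overhead_ms
    return overhead_ms / total if total > 0 else 0.0


if __name__ == "__main__":
    for exec_ms in (1.0, 20.0, 1000.0):
        share = overhead_fraction(exec_ms)
        print(f"{exec_ms:7.1f} ms of work -> {share:.1%} overhead")
```

Under this assumption a 1 ms routine is dominated by overhead, while anything at or above 20 ms loses under a tenth of its time to the service layer.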
A typical Web Service
• Web Services build loosely-coupled, distributed applications (wrapping existing codes and databases) based on the SOA (service oriented architecture) principles.
• Web Services interact by exchanging messages in SOAP format
• The contracts for the message exchanges that implement those interactions are described via WSDL
• In principle, services can be in any language (Fortran .. Java .. Perl .. Python) and the interfaces can be method calls, Java RMI messages, CGI Web invocations, or totally compiled away (inlining)
• The simplest implementations involve XML messages (SOAP) and programs written in net friendly languages like Java and Python
[Figure labels: Portal Service; WSDL interfaces; Security; Catalog; Payment Credit Card; Warehouse Shipping]
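As an illustration of services exchanging XML messages, here is a minimal sketch that builds and parses a SOAP 1.1 envelope with the Python standard library. The application namespace, the operation name `findFaults` and its parameters are hypothetical; a real service would take them from its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:quake"  # hypothetical application namespace


def build_request(operation: str, params: dict) -> bytes:
    """Wrap an operation name and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def parse_request(message: bytes):
    """Recover the operation name and parameters from a SOAP message."""
    envelope = ET.fromstring(message)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    call = body[0]  # first child of Body is the operation element
    operation = call.tag.split("}", 1)[1]
    params = {child.tag.split("}", 1)[1]: child.text for child in call}
    return operation, params


if __name__ == "__main__":
    msg = build_request("findFaults", {"minLat": 32.0, "maxLat": 36.0})
    print(msg.decode("utf-8"))
    print(parse_request(msg))
```

Parameter values travel as text, so the receiving side is responsible for converting them back to numbers.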
Some Grid Concepts II
• Systems are built from contributions from many different groups: you do not need one “vendor” for all components as Web services allow interoperability between components
    • One reason DoD likes Grids (called Net-Centric computing)
• Grids are distributed in services and data, allowing anybody to store their data and to produce “their” view
    • Some think that the University Library of the future will curate/store the data of its faculty
• “2 level programming model”: services are programmed classically, and services are composed using workflow consistent with industry standards (BPEL)
• Grid of Grids (System of Systems): realistically, Grid-like systems will be built using multiple technologies and “standards”. Integrate separate Grids for Sensors, GIS, Visualization, computing etc. with the OGSA (Open Grid Service Architecture from OGF) system Grid (Security, registry) into a single Grid
• Existing codes UNCHANGED; wrap as a service with metadata
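The idea of leaving a code unchanged and wrapping it with metadata can be sketched as follows. The metadata fields and the `invoke` helper are assumptions made for illustration, and a Python one-liner stands in for the legacy Fortran or C++ executable so that the example runs anywhere.

```python
import subprocess
import sys

# Hypothetical metadata record: where the program lives and how to invoke it.
# A real legacy code (Fortran, C++) would replace the Python one-liner here.
ECHO_SERVICE = {
    "name": "echo-demo",
    "executable": sys.executable,
    "fixed_args": ["-c", "import sys; print(sys.stdin.read().upper(), end='')"],
    "timeout_s": 30,
}


def invoke(metadata: dict, input_text: str) -> dict:
    """Run the unchanged program described by metadata and return a message."""
    result = subprocess.run(
        [metadata["executable"], *metadata["fixed_args"]],
        input=input_text,
        capture_output=True,
        text=True,
        timeout=metadata["timeout_s"],
    )
    return {
        "service": metadata["name"],
        "exit_code": result.returncode,
        "output": result.stdout,
        "errors": result.stderr,
    }


if __name__ == "__main__":
    print(invoke(ECHO_SERVICE, "mesh input"))
```

The wrapped program is never edited; only the metadata record changes when a different code is exposed.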
LEAD Gateway Portal
NSF Large ITR and TeraGrid Gateway
- Adaptive response to mesoscale weather events
- Supports data exploration, Grid workflow
Grid Workflow Data Assimilation in Earth Science
• Grid services triggered by abnormal events and controlled by workflow process real time data from radar and high resolution simulations for tornado forecasts
SERVOGrid has a portal
The portal is built from portlets, which provide user interface fragments for each service that are composed into the full interface. It uses OGCE technology, as does the planetary science VLAB portal with the University of Minnesota.
GIS and Sensor Grids
• OGC has defined a suite of data structures and services to support Geographical Information Systems and Sensors
• GML (Geography Markup Language) defines the specification of geo-referenced data
• SensorML and O&M (Observation and Measurements) define meta-data and data structure for sensors
• Services like Web Map Service, Web Feature Service, Sensor Collection Service define service interfaces to access GIS and sensor information
• Grid workflow links services that are designed to support streaming input and output messages
• We built Grid (Web) service implementations of these specifications for NASA’s SERVOGrid
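For readers unfamiliar with the OGC service interfaces, the sketch below assembles a Web Map Service GetMap request in the key-value form defined by WMS 1.1.1. The server URL and the layer name `faults` are hypothetical, and this shows the plain HTTP form of the request rather than the Web Service version built for SERVOGrid.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def getmap_url(base_url, layers, bbox, width=512, height=512):
    """Build a WMS 1.1.1 GetMap request URL.

    bbox is (min_lon, min_lat, max_lon, max_lat) in EPSG:4326.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the default style
        "SRS": "EPSG:4326",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical server and layer covering Southern California
    print(getmap_url("http://example.org/wms", ["faults"], (-120.0, 32.0, -114.0, 36.0)))
```

The server answers such a request with a rendered image, which is what lets map layers from different services be overlaid in a portal.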
Grid Workflow Datamining in Earth Science
• Work with Scripps Institute
• Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California
[Figure: workflow pipeline. Labels: NASA GPS; Earthquake; Streaming Data Support; Transformations; Data Checking; Hidden Markov Datamining (JPL); Display (GIS)]
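The chain of streaming stages (transformations, data checking, datamining) can be sketched with Python generators, where each stage consumes and produces a stream. This is only an illustration: the readings are made up, and a simple jump detector stands in for the Hidden Markov datamining stage, which is not reproduced here.

```python
from typing import Iterable, Iterator, Optional


def transform(raw: Iterable[str]) -> Iterator[Optional[float]]:
    """Parse text readings into floats; unparseable readings become None."""
    for line in raw:
        try:
            yield float(line)
        except ValueError:
            yield None


def check(values: Iterable[Optional[float]]) -> Iterator[float]:
    """Data checking stage: drop missing readings."""
    for v in values:
        if v is not None:
            yield v


def detect_jumps(values: Iterable[float], threshold: float) -> Iterator[tuple]:
    """Flag consecutive readings that differ by more than threshold.

    A simple stand-in for the Hidden Markov datamining stage.
    """
    previous = None
    for index, v in enumerate(values):
        if previous is not None and abs(v - previous) > threshold:
            yield index, previous, v
        previous = v


if __name__ == "__main__":
    stream = ["10.0", "10.1", "bad", "10.2", "14.9", "15.0"]  # made-up readings
    for event in detect_jumps(check(transform(stream)), threshold=1.0):
        print("feature of interest:", event)
```

Because every stage is lazy, a reading flows through the whole chain as soon as it arrives, which mirrors how filters are linked on a real-time stream.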
Earth/Atmosphere Grids built as Grids of (library) Grids
[Figure: layered Grid of Grids architecture. Application Grids: Earthquake SERVOGrid (Earthquake Data, Filters & Simulation Services); Tornado Grid; Ice Sheet PolarGrid (Ice Sheet Sensors, SAR, Filters, EM, Glacier Simulations). Library Grids: Sensor Grid; GIS Grid; Compute Grid; Visualization Grid; Collaboration Grid. Core services: Registry; Metadata; Data Access/Storage; Security; Notification; Workflow; Messaging; Portals. Base: Physical Network]
CReSIS PolarGrid
• Important CReSIS-specific Cyberinfrastructure components include
    • Managed data from sensors and satellites
    • Data analysis such as SAR processing, possibly with parallel algorithms
    • Electromagnetic simulations (currently commercial codes) to design instrument antennas
    • 3D simulations of ice-sheets (glaciers) with non-uniform meshes
    • GIS (Geographical Information Systems)
• Also need capabilities present in many Grids
    • Portal, i.e. Science Gateway
    • Submitting multiple sequential or parallel jobs
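Submitting many related jobs at once can be illustrated with the standard library's thread pool. The `run_job` body is a placeholder assumption; a gateway would launch a simulation or SAR processing run on a remote machine instead.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(job_id: int) -> dict:
    """Placeholder job: a real gateway would launch a simulation or SAR run here."""
    return {"job": job_id, "result": job_id ** 2}


def submit_all(job_ids, max_workers: int = 4) -> list:
    """Run the jobs concurrently and return results in submission order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_job, job_ids))


if __name__ == "__main__":
    for outcome in submit_all(range(1, 6)):
        print(outcome)
```

Setting `max_workers=1` gives the sequential case with the same interface.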
What should we do?
• Identify existing programs that should be wrapped as Grid services
    • One can do this even for commercial codes, as one keeps existing codes (Fortran, C++) unchanged and constructs a “metadata” wrapper defining where the program and its data are located and how to invoke it.
• Identify where parallel versions are needed and whether help is needed in creating these
    • Parallel codes can be Grid services
    • Electromagnetic codes are commercial; in principle parallel
    • Ice sheet models can be parallelized for high resolution simulations
• Scope out the system: computational needs (identify value of TeraGrid); data storage needs; network requirements
• Examine data model and produce a data Grid architecture
    • Use databases? Distributed? Metadata? Files? What are key performance issues?
• Examine integration of GIS with Grid Services
• Design and implement Science Gateway
• Are there important visualization requirements outside GIS?
• Are there key issues from security?
• Bring up core services such as registries
Benefits of CReSIS PolarGrid
• Shared resources support collaboration among CReSIS scientists
• Integration of Polar related data with appropriate compute resources, enabling research on specific topics and studies across topics
• Polar Science Gateway accessing common services (programs), data and their integration as workflow
• Access to TeraGrid with same interface for large scale simulations
• Can share common capabilities (SAR analysis, GIS) with related Grids such as SERVOGrid, GEON, LEAD etc.
• Modular Grid services allow exchange of new capabilities, preserving systems
    • e.g. Change EM Simulation service
• Management of dynamic heterogeneous data
Service: Description
• Web Map Service: We built a Web Service version of this Open Geospatial Consortium specification. The WMS constructs images out of abstract feature descriptions.
• Information Service: We have built data model extensions to UDDI to support XPath queries over Geographical Information System capability.xml files. This is designed to replace OGC (Open Geospatial Consortium) Web registry service.
• Authentication and Authorization: This uses capabilities built into portal. Note that simulations are typically performed on machines where user has accounts while data services are shared for read access.
• Portal: We use an OGCE based portal based on portlet architecture.
• File Services: We built a file web service that could do uploads, downloads, and crossloads between different services. Clearly this supports specific operations such as file browsing, creation, deletion and copying.
• Application and Host Metadata Service: We have an Application and a Host Descriptor service based on XML schema descriptors. Portlet interfaces allow code administrators to make applications available through the browser.
• Context Data Service: We store information gathered from users’ interactions with the portal interface in a generic, recursively defined XML data structure. Typically we store input parameters and choices made by the user so that we can recover and reload these later. We also use this for monitoring remote workflows. We have devoted considerable effort into developing WS-Context to support the generalization of this initial simple service.
• Specific Applications (Virtual California, Geofest, Park, RDAHMM ..): These can be all launched by a single Job Management service or by custom instances of this with metadata preset to a particular application.
• Job Management: SERVO wraps Apache Ant as a web service and uses it to launch jobs. For a particular application, we design a build.xml template. The interface is simply a string array of build properties called for by the template. We’ve also built a simple generic “template engine” version of this.
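The Job Management interface, a string array of properties filling an application-specific template, can be sketched with Python's `string.Template`. This is an analogy under stated assumptions: SERVO's service uses an Apache Ant build.xml template, whereas the command-line template, property names and helper functions below are hypothetical.

```python
from string import Template

# Hypothetical job template; the SERVO service uses an Ant build.xml template instead.
JOB_TEMPLATE = Template("$executable --input $input_file --output $output_file")


def parse_properties(properties: list) -> dict:
    """Turn ["name=value", ...] into a dict, splitting on the first "="."""
    parsed = {}
    for item in properties:
        name, sep, value = item.partition("=")
        if not sep or not name:
            raise ValueError(f"malformed property: {item!r}")
        parsed[name] = value
    return parsed


def render_job(template: Template, properties: list) -> str:
    """Fill the template; a property the template needs but lacks raises KeyError."""
    return template.substitute(parse_properties(properties))


if __name__ == "__main__":
    props = ["executable=geofest", "input_file=mesh.dat", "output_file=out.dat"]
    print(render_job(JOB_TEMPLATE, props))
```

Keeping the interface to a flat list of strings means one generic service can launch any application whose template it has been given.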
Key interfaces/standards/software NOT used (often just for historical reasons as project predated standard)
• WS-Security, JSDL, WSRF, BPEL, OGSA-DAI
Key interfaces/standards/software used
• GML, WFS, WMS
• WSDL, XML Schema with pull parser XPP, SOAP with Axis 1.
• UDDI, WS-Context, JSR-168, JDBC, Servlets
• WS-Management, VOTables in Research
• Data Tables Web Service: We are developing a Web Service based on the National Virtual Observatory’s VOTables XML format for tabular data. We see this as a useful general format for ASCII data produced by various application codes in SERVO and other projects.
• Scientific Plotting Services: We are developing Dislin-based scientific plotting services as a variation of our Web Map Service: for a given input service, we can generate a raster image (like a contour plot) which can be integrated with other scientific and GIS map plot images.
• QuakeTables Database Services: The USC QuakeTables fault database project includes a web service that allows you to search for Earthquake faults.
• Notification Service: This supplies alerts to users when filters (data-mining) detect features of interest.
• Messaging Service: This is used to stream data in workflow fed by real-time sources. It is based on NaradaBrokering, which can also be used in cases just involving archival data.
• Sensor Grid Services: We are developing infrastructure to support streaming GPS signals and their successive filtering into different formats. This is built over NaradaBrokering (see messaging service). This does not use Web Services as such at present but the filters can be controlled by HPSearch services.
• Workflow/Monitoring/Management Services: The HPSearch project uses HPSearch Web Services to execute JavaScript workflow descriptions. It has more recently been revised to support WS-Management and to support both workflow (where there are many alternatives) and system management (where there is less work). Management functions include life cycle of services and QoS for inter-service links.