Cyberinfrastructure
across the Globe
Indiana University
Computer Science Undergraduate Honors Seminar
January 8 2007
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
Abstract
n We discuss the role of Cyberinfrastructure (also called
e-infrastructure and implemented by Grid technology)
in a variety of global activities. These include the
linking of researchers and data worldwide in many
fields; new generations of digital libraries and tools like
Google Scholar; study of ice-sheets at the poles and the
dramatic impact of global warming; the study of
earthquakes across the Pacific Ocean; the linking of
apparel manufacturers in Asia to designers in different
continents; and the command and control system for the
Department of Defense. We discuss these applications
and their associated technology.
Why Cyberinfrastructure Is Useful
n Supports distributed science – data, people, computers
n Exploits Internet technology (Web2.0) adding management,
security, supercomputers etc.
n It has two aspects: parallel – low latency (microseconds)
between nodes and distributed – highish latency (milliseconds) between nodes
n Parallel needed to get high performance on individual 3D
simulations, data analysis etc.; must decompose problem
n Distributed aspect integrates already distinct components
n Cyberinfrastructure is in general a distributed collection of
parallel systems
n Grids are made of services that are “just” programs or data
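A minimal sketch of what "must decompose problem" means in the parallel case: a range of grid points is split into near-equal contiguous blocks, one per node. The function name block_ranges and the sizes are hypothetical; real 3D simulations decompose meshes in several dimensions and also exchange boundary data between neighbouring nodes.

```python
def block_ranges(n_points, n_nodes):
    """Split range(n_points) into n_nodes contiguous, near-equal blocks."""
    base, extra = divmod(n_points, n_nodes)
    ranges = []
    start = 0
    for node in range(n_nodes):
        # The first `extra` nodes take one additional point each.
        size = base + (1 if node < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# Ten grid points shared among three nodes.
print(block_ranges(10, 3))
```

Each tuple is a half-open (start, stop) range owned by one node, so every point belongs to exactly one node.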
e-moreorlessanything and the Grid
n ‘e-Science is about global collaboration in key areas of science,
and the next generation of infrastructure that will enable it.’ from its inventor John Taylor Director General of Research Councils UK, Office of Science and Technology
n e-Science is about developing tools and technologies that allow
scientists to do ‘faster, better or different’ research
n Similarly e-Business captures an emerging view of corporations as
dynamic virtual organizations linking employees, customers and stakeholders across the world.
• The growing use of outsourcing is one example
n The Grid provides the information technology e-infrastructure
for e-moreorlessanything.
n A deluge of data of unprecedented and inevitable size must be
managed and understood.
n People, computers, data and instruments must be linked.
n On demand assignment of experts, computers, networks and
storage resources must be supported
TeraGrid: Integrating NSF Cyberinfrastructure
TeraGrid is a facility that integrates computational, information, and analysis resources at the San Diego Supercomputer Center, the Texas Advanced Computing Center, the University of Chicago / Argonne National Laboratory, the National Center for Supercomputing Applications, Purdue University, Indiana University, Oak Ridge National Laboratory, the Pittsburgh
Supercomputing Center, and the National Center for Atmospheric Research.
[Figure labels: SDSC, TACC, UC/ANL, NCSA, ORNL, PU, IU, PSC, NCAR, Caltech, USC-ISI, Utah, Iowa, Cornell, Buffalo]
Virtual Observatory Astronomy Grid
Integrate Experiments
[Figure panels: Radio, Far-Infrared, Visible, Visible + X-ray, Dust Map, Galaxy Density Map]
Grid Capabilities for Science
n Open technologies for any large scale distributed system that is adopted by
industry, many sciences and many countries (including UK, EU, USA, Asia)
• Security, Reliability, Management and state standards
n Service and messaging specifications
n User interfaces via portals and portlets virtualizing to desktops, email,
PDA’s etc.
• ~20 TeraGrid Science Gateways (their name for portals)
• OGCE Portal technology effort led by Indiana
n Uniform approach to access distributed (super)computers supporting single
(large) jobs and spawning lots of related jobs
n Data and meta-data architecture supporting real-time and archives as well
as federation
• Links to Semantic web and annotation
n Grid (Web service) workflow with standards and several successful
instantiations (such as Taverna and MyLead)
n Many Earth science grids including ESG (DoE), GEON, LEAD, SCEC,
SERVO; LTER and NEON for Environment
eApparel
n Much of the world’s manufacturing industry is
globalized and the apparel/textile industry is typical
n We are working with the Hong Kong textile industry to
link the Asian manufacturers with
design/marketing/purchase functions elsewhere (USA,
Europe)
n Need to exchange designs, available fabrics and
discussions
n Good example of e-infrastructure enabling
specialization in one geographical area to thrive
n Software and digital animation outsourcing are good
examples
APEC Cooperation for Earthquake Simulation
n ACES is a seven-year-long collaboration among scientists
interested in earthquake and tsunami prediction
• iSERVO is Infrastructure to support
work of ACES
• SERVOGrid is (completed) US Grid that is
a prototype of iSERVO
• http://www.quakes.uq.edu.au/ACES/
n Chartered under APEC –
[Figure: Grid of Grids: Research Grid and Education Grid. Labels: Database, Database, Analysis and Visualization Portal, Repositories, Federated Databases, Data Filter Services, Field Trip Data, Streaming Data, Sensors, Discovery Services, SERVOGrid, Research Simulations, Research, Education, Customization Services, From Research to Education, Education Grid, Computer Farm]
SERVOGrid and Cyberinfrastructure
n Grids are the technology based on Web services that implement
Cyberinfrastructure i.e. support eScience or science as a team sport
• Internet scale managed services that link computers, data
repositories, sensors, instruments and people
n There is a portal and services in SERVOGrid for
• Applications such as GeoFEST, RDAHMM, Pattern
Informatics, Virtual California (VC), Simplex, mesh generating programs …..
• Job management and monitoring web services for running
the above codes.
• File management web services for moving files between
various machines.
• Geographical Information System services
• Quaketables earthquake specific database
• Sensors as well as databases
• Context (dynamic metadata) and UDDI system long term
metadata services
[Figure labels: Topography 1 km; Stress Change; Earthquakes; PBO; Site-specific Irregular Scalar Measurements; Constellations for Plate Boundary-Scale Vector Measurements; Ice Sheets; Volcanoes; Long Valley, CA; Northridge, CA; Hector Mine, CA; Greenland]
Some Grid Concepts I
n Services are “just” (distributed) programs sending and
receiving messages with well defined syntax
n Interfaces (input-output) must be open; innards can be
open source (allowing you to modify) or proprietary
• Services can be any language from Fortran, Shell scripts, C,
C#, C++, Java, Python, Perl – your choice!!
• Web Services supported by all vendors (IBM, Microsoft …)
n Service overhead will be just a few milliseconds (more
now) which is < typical network transit time
• Any program that is distributed can be a Web service
• Any program taking execution time ≥ 20ms can be a
Web service
n Web Services build loosely-coupled, distributed
applications (wrapping existing codes and databases)
based on the SOA (service oriented architecture) principles.
n Web Services interact by exchanging messages in SOAP format
n The contracts for the message exchanges that implement those
interactions are described via WSDL
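To make "programs sending and receiving messages with well defined syntax" concrete, here is a minimal sketch using only the Python standard library. It builds a SOAP 1.1 envelope and parses it again on the "service" side. The application namespace urn:example:quake, the operation GetFaultData and its parameters are all hypothetical; a real deployment would take them from a WSDL contract and send the message over a network.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
APP_NS = "urn:example:quake"  # hypothetical application namespace


def build_request(operation, params):
    """Client side: wrap an operation and its parameters in a SOAP envelope."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{APP_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, f"{{{APP_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def handle_request(xml_text):
    """Service side: parse the message and return (operation, params)."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    op = list(body)[0]  # first child of Body is the operation element
    operation = op.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    params = {child.tag.split("}", 1)[1]: child.text for child in op}
    return operation, params


message = build_request("GetFaultData", {"region": "SoCal", "minMagnitude": 5})
print(handle_request(message))
```

The two functions share nothing except the message text, which is why the client and the service can be written in different languages.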
Some Grid Concepts II
n Systems are built from contributions from many different
groups – you do not need one “vendor” for all components as Web services allow interoperability between components
• One reason DoD likes Grids (called Net-Centric computing)
n Grids are distributed in services and data allowing anybody to
store their data and to produce “their” view
• Some think that University Library of future will curate/store data of
their faculty
n “2 level programming model”: Classic programming of services
and services are composed using workflow consistent with industry standards (BPEL)
n Grid of Grids: (System of Systems) Realistically Grid-like
systems will be built using multiple technologies and “standards” – integrate separate Grids for Sensors, GIS, Visualization,
computing etc. with OGSA (Open Grid Service Architecture from OGF) system Grid (Security, registry) into a single Grid
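The “2 level programming model” can be sketched in a few lines: level one is ordinary code inside each service, level two is a workflow that only names the services and their order. This is an illustration, not BPEL; the three service functions and the sample readings are hypothetical, and plain function calls stand in for remote service invocations.

```python
# Level 1: each "service" is classically programmed.
def check_data(readings):
    """Drop missing readings."""
    return [r for r in readings if r is not None]


def to_millimetres(readings):
    """Convert whole metres to millimetres."""
    return [r * 1000 for r in readings]


def summarize(readings):
    """Reduce the readings to a small summary record."""
    return {"count": len(readings), "max_mm": max(readings)}


# Level 2: the workflow composes services without knowing their innards.
def run_workflow(steps, data):
    for step in steps:
        data = step(data)
    return data


workflow = [check_data, to_millimetres, summarize]
result = run_workflow(workflow, [2, None, 5])
print(result)
```

Swapping one service for another implementation changes only the workflow list, not the other services.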
TeraGrid User Portal
LEAD Gateway Portal
NSF Large ITR and TeraGrid Gateway - Adaptive Response to Mesoscale weather events
Grid Workflow Data Assimilation in Earth Science
n Grid services triggered by abnormal events and controlled by workflow process real
time data from radar and high resolution simulations for tornado forecasts
Use a Portlet-based user portal to access and control services and workflow
SERVOGrid has a portal
Portlets v. Google Gadgets
n Portals for Grid Systems are built using portlets with
software like GridSphere integrating these on the
server-side into a single web-page
n Google (at least) offers the Google sidebar and Google
home page which support Web 2.0 services and do not
use a server side aggregator
n Google is more user friendly!
n The many Web 2.0 competitions are an interesting model
for promoting development in the world-wide
distributed collection of Web 2.0 developers
n I guess Web 2.0 model will win!
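A rough sketch of server-side aggregation, the portlet approach described above: each portlet returns an HTML fragment and the server stitches the fragments into one page before sending it. This is not the GridSphere or portlet API; the portlet functions, titles and markup are hypothetical. In the gadget approach the browser would fetch and place each fragment itself.

```python
from html import escape


def weather_portlet():
    """Hypothetical portlet: returns an HTML fragment."""
    return "<p>Forecast: clear</p>"


def jobs_portlet():
    """Hypothetical portlet: returns an HTML fragment."""
    return "<p>Jobs running: 3</p>"


def aggregate(portlets):
    """Server-side aggregation: wrap each fragment and join them into one page."""
    sections = []
    for title, render in portlets:
        sections.append(
            f'<div class="portlet"><h2>{escape(title)}</h2>{render()}</div>'
        )
    return "<html><body>" + "".join(sections) + "</body></html>"


page = aggregate([("Weather", weather_portlet), ("Jobs", jobs_portlet)])
print(page)
```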
GIS and Sensor Grids
n OGC has defined a suite of data structures and services to
support Geographical Information Systems and Sensors
n GML Geography Markup language defines specification of
geo-referenced data
n SensorML and O&M (Observation and Measurements) define
meta-data and data structure for sensors
n Services like Web Map Service, Web Feature Service, Sensor
Collection Service define services interfaces to access GIS and sensor information
n Grid workflow links services that are designed to support
streaming input and output messages
n We built Grid (Web) service implementations of these
specifications for NASA’s SERVOGrid
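As an illustration of what a Web Map Service interface looks like to a client, the sketch below assembles a WMS 1.1.1 GetMap request as a URL query string. The endpoint http://example.org/wms and the layer name faults are hypothetical; the parameter names are the standard GetMap keys.

```python
from urllib.parse import urlencode, urlparse, parse_qs


def getmap_url(base_url, layers, bbox, width, height):
    """Build a WMS 1.1.1 GetMap URL for a lon/lat bounding box."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": ",".join(layers),
        "STYLES": "",  # empty value requests the default style
        "SRS": "EPSG:4326",  # plain longitude/latitude coordinates
        "BBOX": ",".join(str(v) for v in bbox),  # minx,miny,maxx,maxy
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


# Hypothetical server and layer; the box roughly covers California.
url = getmap_url("http://example.org/wms", ["faults"], (-125, 32, -114, 42), 800, 600)
print(url)
print(parse_qs(urlparse(url).query)["BBOX"])
```

Because the interface is just a documented set of request parameters, map servers from different vendors can sit behind the same client code.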
Grid Workflow Datamining in Earth Science
n Work with Scripps Institute
n Grid services controlled by workflow process real time data from ~70 GPS Sensors in Southern California
[Figure: workflow stages: Streaming Data Support, Transformations, Data Checking, Hidden Markov Datamining (JPL), Display (GIS); labels: NASA GPS, Earthquake]
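A streaming workflow of this kind can be sketched with Python generators, where each stage consumes records as they arrive and passes results downstream. The station names, readings, baseline and threshold are made up for illustration, and the last stage is a simple threshold test standing in for the datamining step; it is not a hidden Markov model.

```python
def check(stream):
    """Data checking: drop records with a missing displacement."""
    for station, value in stream:
        if value is not None:
            yield station, value


def transform(stream, baseline):
    """Transformation: express each reading relative to a baseline."""
    for station, value in stream:
        yield station, value - baseline


def flag_events(stream, threshold):
    """Stand-in for datamining: flag large displacements (NOT an HMM)."""
    for station, value in stream:
        if abs(value) > threshold:
            yield station, value


# Hypothetical (station, displacement) records arriving one at a time.
readings = [("A", 10), ("B", None), ("A", 25), ("C", 11)]
pipeline = flag_events(transform(check(iter(readings)), baseline=10), threshold=5)
events = list(pipeline)
print(events)
```

No stage waits for the whole data set, which is the property streaming services need.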
Earth/Atmosphere Grids built as Grids of (library) Grids
[Figure labels: Ice Sheet Sensors, SAR, Filters, EM, Glacier Simulations; Physical Network; Registry; Metadata; Earthquake Data, Filters & Simulation Services; Earthquake SERVOGrid; TornadoGrid; Ice Sheet PolarGrid; Data Access/Storage; Security; Notification; Workflow; Messaging; Portals; Visualization Grid; Collaboration Grid; Sensor Grid; Compute Grid; GIS Grid]
Community Tools
e-mail and list-serves are oldest and best used
Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration – text, audio-video conferencing, files
del.icio.us, Connotea, Citeulike, Bibsonomy, Biolicious manage shared bookmarks
MySpace, YouTube, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn
create networks
• http://en.wikipedia.org/wiki/List_of_social_networking_websites
Writely, Wikis and Blogs are powerful specialized shared document systems
ConferenceXP and WebEx share general applications
Google Scholar tells you who has cited your papers while publisher sites tell you about co-authors
• Windows Live Academic Search has similar goals
Note sharing resources creates (implicit) communities
• Social network tools study graphs to both define communities and extract their properties
Mashups and Grids
http://www.programmableweb.com There are 281 “commodity”
service Web 2.0 API’s on October 1 06 (356 Jan 9 07)
Mashups are composed from
JavaScript, AJAX and REST
and not usually BPEL WSDL
and SOAP; Google Gadgets not
portlets
Architecture of Mashups and
Grids “identical”
See Amazon S3 Storage and
EC2 Elastic Computing services
Mashups enable everybody to
Mashup Matrix
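A mashup in miniature: two independent JSON sources are combined into map markers. This is a sketch under stated assumptions. The two JSON strings stand in for responses from REST services, and the coordinates are approximate illustrative values. A real mashup would typically do this in JavaScript in the browser.

```python
import json

# Stand-ins for the bodies of two REST responses (illustrative data).
quakes_json = (
    '[{"id": "q1", "place": "Northridge", "mag": 6.7},'
    ' {"id": "q2", "place": "Hector Mine", "mag": 7.1}]'
)
coords_json = '{"Northridge": [34.2, -118.5], "Hector Mine": [34.6, -116.3]}'


def mashup(quakes_text, coords_text):
    """Join an event feed with a place-to-coordinates lookup."""
    quakes = json.loads(quakes_text)
    coords = json.loads(coords_text)
    markers = []
    for q in quakes:
        lat, lon = coords[q["place"]]
        label = "{} M{}".format(q["place"], q["mag"])
        markers.append({"label": label, "lat": lat, "lon": lon})
    return markers


markers = mashup(quakes_json, coords_json)
print(markers)
```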
GIS Grid of “Indiana Map” and ~10 Indiana counties with accessible Map (Feature) Servers from different vendors. Grids federate different data repositories (cf Astronomy VO federating different observatory collections)
Indiana Map Mash-up
eSports?
n YouTube illustrates asynchronous
video sharing and video conferencing illustrates synchronous video sharing
n One can link trainers (or spectators)
and athletes globally with real time video supporting video and text
annotation
n Technically hard due to network
issues and allowing real-time playing of annotated video
n Exploring with China
n Note IU could export coaching in
Soccer, Basketball etc
n Example of Cyberinfrastructure
supporting geographically distributed specialization
Minority Serving Institutions and the Grid
• Historically the R1 Research University powerhouses dominated research due to their concentration of expertise
• Cyberinfrastructure allows others to participate in same way it
supports distributed open source software and distributed Web 2.0
• Navajo Nation (Colorado Plateau covering over 25,000 square
miles in northeast Arizona, northwest New Mexico, and southeast Utah) with 110 communities and over 40% unemployment.
Building a wireless grid for education, healthcare
• http://www.win-hec.org/ World Indigenous Nations Higher Education Consortium
• Cyberinfrastructure allows Nations to preserve their geographical identity but participate fully with world class jobs and research
• Some 335 MSI’s in Alliance have similar hopes for
Cyberinfrastructure to jump start their advancement!
Example: Setting up a Polar CI-Grid
• The North and South poles are melting with potential huge environmental impact
• As a result of MSI meetings, I am working with MSI ECSU in North Carolina and Kansas University to design and set up a Polar Grid (Cyberinfrastructure)
• This is a network of computers, sensors (on robots and
satellites), data and people aimed at understanding science of ice-sheets and impact of global warming
• We have changed the 100,000 year Glacier cycle into a ~50 year cycle; the field has increased dramatically in importance and interest
• Good area to get involved in as not so much established work
Typical Illustration of effect of Climate Change on Greenland:
PolarGrid
n
Important Polar Grid Cyberinfrastructure components
include
• Managed data from sensors and satellites
• Data analysis such as SAR processing – possibly with parallel
algorithms
• Electromagnetic simulations (currently commercial codes) to
design instrument antennas
• 3D simulations of ice-sheets (glaciers) with non-uniform
meshes
• GIS Geographical Information Systems
n
Also need capabilities present in many Grids
• Portal i.e. Science Gateway
• Submitting multiple sequential or parallel jobs
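Submitting many related jobs can be sketched with the standard library's thread pool. The function run_job is a hypothetical stand-in for one simulation run, and the parameter list is made up. A Grid job service would dispatch each job to a remote machine rather than a local thread.

```python
from concurrent.futures import ThreadPoolExecutor


def run_job(params):
    """Hypothetical stand-in for one simulation job."""
    resolution, steps = params
    return {"resolution": resolution, "steps": steps, "cells": resolution ** 3}


# A small parameter sweep: same code, different mesh resolutions.
job_list = [(8, 100), (16, 100), (32, 100)]

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() runs the jobs concurrently and returns results in input order.
    results = list(pool.map(run_job, job_list))

for r in results:
    print(r)
```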
[Figure: Prototype Base/Field Grid. Labels: Real Time Monitor; Archival – High Latency, Low Bandwidth; Other Polar Sensors and Sensor Aggregators (Non-polar and Polar Sites); Polar Expeditions; IU; Field Base Camps]
Document-enhanced Cyberinfrastructure
[Figure labels: Existing User Interface; Google Scholar; Manuscript Central; Science.gov; Windows Live Academic Search; Citeseer; CMT Conference Management; Existing Document Web service; New Document-enhanced Integration; Enhancement; User Interface; Community Tools; Generic Document Tools; MyResearch Database; Bibliographic Database; Export RSS, Bibtex, Endnote etc.; CiteULike]
Delicious Semantic Web/Grid
n http://del.icio.us purchased by Yahoo for ~$30M
n http://www.CiteULike.org
n http://www.connotea.org (Nature)
n Associate metadata with Bookmarks specified by
URL’s, DOI’s (Digital Object Identifiers)
n Users add comments and keywords (called tags)
n Users are linked together into groups (communities)
n Information such as title and authors extracted
automatically from some sites (PubMed, ACM, IEEE,
Wiley etc.)
n Bibtex like additional information in CiteULike
n This is perhaps de facto Semantic Web – remarkable
for its simplicity
Document-enhanced Cyberinfrastructure
aka Semantic Scholar Grid I
n Citeseer and Google Scholar scour the Internet and analyze
documents for incidental metadata
• Title, author and institution of documents
• Citations with their own metadata allowing one to match
to other documents
n Science.gov extracts metadata from lots of US Government
databases
n These capabilities are sure to become more powerful and to
be extended
• Give “Citation Index” in real time
• Tell you all authors of all papers that cite a paper that
cites you etc. (Note it’s a small world so don’t go too far
in link analysis)
• Tell you all citations of all papers in a workshop
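The "authors of all papers that cite a paper that cites you" query is a two-hop walk over a citation graph. Here is a minimal sketch with a made-up graph; the paper identifiers and author names are hypothetical, and a real service would build the graph from harvested citation metadata.

```python
# Hypothetical data: paper -> set of papers it cites, and paper -> authors.
cites = {
    "P1": {"YOU"},
    "P2": {"P1"},
    "P3": {"P1", "YOU"},
    "P4": {"P9"},
}
authors = {
    "P1": {"Ann"},
    "P2": {"Bob", "Cy"},
    "P3": {"Ann", "Dee"},
    "P4": {"Eve"},
}


def citing(target):
    """Papers that cite `target` directly."""
    return {paper for paper, refs in cites.items() if target in refs}


def authors_two_hops(target):
    """Authors of papers that cite a paper that cites `target`."""
    second_hop = set()
    for paper in citing(target):
        second_hop |= citing(paper)
    found = set()
    for paper in second_hop:
        found |= authors[paper]
    return found


print(sorted(authors_two_hops("YOU")))
```

Each extra hop multiplies the set of papers reached, which is the "small world" caution on the slide.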
Document-enhanced Cyberinfrastructure
aka Semantic Scholar Grid II
n It is natural to develop core document Services such as those
used in Citeseer/Google Scholar but applied to “your”
documents of interest that may not have been processed yet
• As just submitted to a conference perhaps
n These tools can help form useful lists such as authors of all cited
or submitted papers to a journal
n OSCAR2/3 (from Peter Murray-Rust’s group at Cambridge)
augment the application independent “core” metadata (Title, authors, institutions, Citations) with a list of all chemical terms
• This tool is a Service that can be applied to “your” document or to a set of
documents harvested in some fashion
• Other fields have natural application specific metadata and OSCAR like
tools can be developed for them
n Such high value tools could appear on “publisher” sites of future
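To show the shape of such a document service, the sketch below adds field-specific terms to a document's core metadata. It is not OSCAR and does not reflect how OSCAR works; the vocabulary, the function names and the sample document are hypothetical, and the matching is a plain whole-word search.

```python
import re

# Hypothetical vocabulary; a real tool would use far richer chemistry knowledge.
CHEMICAL_TERMS = {"benzene", "ethanol", "sodium chloride"}


def extract_terms(text, vocabulary=CHEMICAL_TERMS):
    """Return the vocabulary terms that occur as whole words in the text."""
    lowered = text.lower()
    found = []
    for term in sorted(vocabulary):
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append(term)
    return found


def annotate(document):
    """Augment core metadata with application-specific terms."""
    enriched = dict(document)
    enriched["chemical_terms"] = extract_terms(document["abstract"])
    return enriched


doc = {
    "title": "Solvent effects",
    "authors": ["A. Author"],
    "abstract": "Benzene and ethanol were mixed.",
}
annotated = annotate(doc)
print(annotated["chemical_terms"])
```

The same function can be applied to one submitted manuscript or mapped over a harvested collection.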