Cloud Computing
Lecture 5
Grid Case Studies
2014-2015
Up until now…
• Introduction.
• Definition of Cloud Computing.
• Grid Computing:
• Schedulers
Summary
• Grid Case Studies:
• Monitoring: TeraGrid IS
• Data Transfer: LIGO
TeraGrid (National Science Foundation)
• TeraGrid DEEP: integrates NSF’s 60 largest computers (more than 60 TF):
• More than 2 PB of online storage.
• National data visualization infrastructure.
• World’s most powerful computing network.
• TeraGrid WIDE Science Portals:
• Integration of scientific communities.
• More than 90 community data collections.
• Cooperation with other grid projects in Europe and
• Provide a mechanism that allows all participants to publish and discover information about available capabilities:
• What are TeraGrid’s computing resources?
• Which features are provided by each resource?
• Where are the login services?
• Where can I access a particular data collection?
• Who has a weather forecast service?
• Provide a mechanism adapted to the TeraGrid open community:
• Editors record information (instead of submitting to a centralized database).
• A central index allows for aggregation and discovery.
• Multiple access interfaces (WS/SOAP, WS/ReST, browser).
Technical Issues
• Information is stored in legacy systems:
• Databases (different types, restricted access).
• Static and dynamic web interfaces.
• Multiple and varied database schemas:
• It’s very complex to design an integrated database that supports all data types and relations.
• Many types of clients (browsers, SOAP, ReST).
• Service availability is critical:
• Both TeraGrid management (testing, documentation and planning) and its users and partners depend on this service.
TeraGrid Information Service Architecture
[Figure: TeraGrid Information Service architecture. Labels: Clients, Cache, WS/REST, HTTP GET, WS/SOAP, Apache 2.0, Tomcat WebMDS, WS MDS4, TeraGrid Central Services, Resources, Registry.]
• Editors record available content:
• The local service maintains a registration in the central registry.
• Entries expire automatically, so they are refreshed periodically.
• Editors maintain ownership of their information systems (they can even participate in other grids).
• Indexing services pull content:
• Registry entries have access control.
• Registry entries have a link to the source.
• The cache masks service faults, etc.
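The expiring-entry idea above can be sketched in a few lines. This is only an illustration, not the MDS4 implementation: the `Registry` and `Entry` classes, the `ttl_seconds` parameter and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class Entry:
    source_url: str    # link back to the editor's own information service
    expires_at: float  # time after which the entry is dropped


class Registry:
    """Toy central registry whose entries expire unless they are refreshed."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self.entries = {}

    def register(self, name, source_url):
        # Registering again is how an editor refreshes its entry.
        self.entries[name] = Entry(source_url, self.clock() + self.ttl)

    def live_entries(self):
        now = self.clock()
        # Drop every entry whose lifetime has run out.
        self.entries = {n: e for n, e in self.entries.items() if e.expires_at > now}
        return {n: e.source_url for n, e in self.entries.items()}
```

An editor that stops refreshing simply disappears from `live_entries()` once its lifetime runs out, so the central index never needs an explicit "unregister" call.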
High Availability Architecture
…
[Figure: high-availability architecture. Labels: info.dyn.teragrid.org, info.teragrid.org, TeraGrid Dynamic DNS, Clients, Resources and Partners.]
TGUP Batch Load & Queue Data
IIS provides queue and batch load information from all RP sites for TGUP to use in its system monitor.
<LoadRP xmlns="">
  <ComputeResourceLoad xmlns="">
    <ResourceID>pople.psc.teragrid.org</ResourceID>
    <SiteID>psc.teragrid.org</SiteID>
    <LoadInfo hostname="tg-login1.pople.psc.teragrid.org" timestamp="2009-11-11T13:46:19Z">
      <Load>
        <Type>queue</Type>
        <Value>98</Value>
      </Load>
    </LoadInfo>
  </ComputeResourceLoad>
</LoadRP>
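A client such as a system monitor could read a fragment like the one above with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree`; the `read_loads` helper is a hypothetical name, and the sample closes the elements left open on the slide so that it parses.

```python
import xml.etree.ElementTree as ET

# Sample taken from the slide, with the open elements closed so it parses.
SAMPLE = """
<LoadRP xmlns="">
  <ComputeResourceLoad xmlns="">
    <ResourceID>pople.psc.teragrid.org</ResourceID>
    <SiteID>psc.teragrid.org</SiteID>
    <LoadInfo hostname="tg-login1.pople.psc.teragrid.org" timestamp="2009-11-11T13:46:19Z">
      <Load>
        <Type>queue</Type>
        <Value>98</Value>
      </Load>
    </LoadInfo>
  </ComputeResourceLoad>
</LoadRP>
"""


def read_loads(xml_text):
    """Return a list of (resource_id, load_type, value) tuples."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for resource in root.iter("ComputeResourceLoad"):
        resource_id = resource.findtext("ResourceID")
        for load in resource.iter("Load"):
            rows.append((resource_id, load.findtext("Type"), int(load.findtext("Value"))))
    return rows


print(read_loads(SAMPLE))
```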
TeraGrid Results
• Does not require deep modification or loss of ownership of legacy systems.
• Simple and consistent access mechanism.
• Integrates:
• Description of computing services and queue state.
• Registry of service and software availability.
• Centralized documentation.
• Test, validation and verification service: INCA, testing and execution portal.
GT4: Base
[Figure: Globus Toolkit 4 components. Areas: Security, Execution, Monitoring, Base, Data. Components: GSI-OpenSSH, MyProxy, CAS, Delegation, GRAM, GridWay, MDS4, Java Runtime, C Runtime, Python Runtime, GridFTP, Reliable File Transfer, Replica Location, Data Rep.]
LIGO: Laser Interferometer Gravitational-Wave Observatory
• Goal: observe gravitational waves.
• Three physical detectors in two locations (plus the GEO detector in Germany).
• More than 10 centres for data analysis.
• Collaborators in more than 40 institutions.
LIGO: Laser Interferometer Gravitational-Wave Observatory
• LIGO records thousands of data channels, generating 1 TB/day of data during test periods:
• The resulting data are published, and data centres subscribe to the parts that local users want for analysis or storage.
• The data analysis produces further derived data:
• About 30% of LIGO’s total data.
• These are also published and replicated.
• More than 35 million files on the grid:
• More than 6 million unique files.
The Challenge
Replicate more than 1 TB/day of data to more than 10 locations.
Solution:
• A publish/subscribe model.
• Let scientists specify and discover data using application-level criteria (metadata).
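The publish/subscribe idea can be illustrated with a toy broker in which each site subscribes with a predicate over file metadata. This is a sketch only: the `Broker` class, the site names and the metadata keys (`detector`, `kind`) are hypothetical and do not come from the LIGO software.

```python
from collections import defaultdict


class Broker:
    """Toy publish/subscribe broker that matches files by metadata."""

    def __init__(self):
        self.subscriptions = {}         # site -> predicate over a metadata dict
        self.inbox = defaultdict(list)  # site -> file names it should fetch

    def subscribe(self, site, predicate):
        self.subscriptions[site] = predicate

    def publish(self, filename, metadata):
        # Every site whose predicate accepts the metadata gets the file queued.
        for site, wants in self.subscriptions.items():
            if wants(metadata):
                self.inbox[site].append(filename)


broker = Broker()
broker.subscribe("site-1", lambda m: m["detector"] == "H1")
broker.subscribe("site-2", lambda m: m["kind"] == "derived")

broker.publish("file-a.dat", {"detector": "H1", "kind": "raw"})
broker.publish("file-b.dat", {"detector": "L1", "kind": "derived"})
broker.publish("file-c.dat", {"detector": "H1", "kind": "derived"})
```

Scientists describe what they want in terms of the data itself, and the sites never need to know file names in advance.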
Technical Constraints
• Efficiency
• Avoid under-using bandwidth during transfers, especially on high-bandwidth (10 Gbps) links.
Lightweight Data Replicator
Combines three basic Globus services:
1. Metadata service (MDS):
“What files are available?”
Information about files such as size, md5, date, …
Metadata propagation.
2. Globus Replica Location Service (RLS):
“Where are the files?”
Catalog service translates filenames into URLs.
Maps files to locations.
3. GridFTP Service:
“How can we copy files?”
Server + adapted client.
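The three questions chain naturally: metadata selects file names, the replica catalog turns names into URLs, and the URLs are handed to the copy tool. The sketch below uses in-memory dictionaries as stand-ins for the first two services. All file names, hosts and helper functions are hypothetical, and the actual copy step is left out.

```python
# Stand-in for the metadata service: what files are available?
METADATA = {
    "file-a.dat": {"size": 1024, "date": "2005-01-01"},
    "file-b.dat": {"size": 2048, "date": "2005-01-02"},
}

# Stand-in for the replica location service: where are the files?
REPLICAS = {
    "file-a.dat": ["gsiftp://site1.example.org/data/file-a.dat"],
    "file-b.dat": [
        "gsiftp://site1.example.org/data/file-b.dat",
        "gsiftp://site2.example.org/data/file-b.dat",
    ],
}


def find_files(min_size):
    """Select file names by a metadata criterion (here: minimum size)."""
    return sorted(name for name, meta in METADATA.items() if meta["size"] >= min_size)


def locate(filename):
    """Translate a file name into the URLs of its known replicas."""
    return REPLICAS.get(filename, [])


def plan_transfers(min_size):
    """Pair each selected file with the first replica URL to copy it from."""
    plan = []
    for name in find_files(min_size):
        urls = locate(name)
        if urls:
            plan.append((name, urls[0]))
    return plan
```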
LIGO Data Replicator Architecture
• Each participant has a machine dedicated to transferring files requested by local clients.
• The scheduler queries the metadata and replica catalogs to identify missing files, which are added to a priority list.
• The transfer daemon checks the list, transfers the files and updates the LRC.
• If a transfer fails it remains on the
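The scheduler and transfer daemon described above can be sketched with a priority queue. This is a minimal illustration with hypothetical function names; the `transfer` callback stands in for the real copy step, and putting failed files back on the list for a later pass is an assumption made for the example.

```python
import heapq


def schedule(wanted, local_catalog, priority_of):
    """Build a priority list (min-heap) of files that are wanted but missing locally."""
    queue = [(priority_of(name), name) for name in wanted if name not in local_catalog]
    heapq.heapify(queue)
    return queue


def run_transfers(queue, local_catalog, transfer):
    """Pop files in priority order and record successes in the local catalog."""
    failed = []
    while queue:
        priority, name = heapq.heappop(queue)
        if transfer(name):
            local_catalog.add(name)  # update the local catalog (the LRC in the slides)
        else:
            failed.append((priority, name))
    # Assumption: failed files go back on the list for a later pass.
    for item in failed:
        heapq.heappush(queue, item)


catalog = {"a"}
priorities = {"a": 0, "b": 2, "c": 1}
todo = schedule(["a", "b", "c"], catalog, priorities.get)
run_transfers(todo, catalog, transfer=lambda name: name != "b")  # "b" fails
```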
LIGO Results
• Complete LIGO/GEO experiment:
• Replicated 30 TB in 30 days (~700 MB/minute).
• MTBF: 1 month.
• More than 35 million files on the LDR network.
• Performance limited by the chosen programming language (Python).
• Partnership with the Globus Alliance to include a version of the LDR in the Globus Toolkit.
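The quoted rate can be checked with a short calculation, which also puts it next to the 10 Gbps links mentioned under Technical Constraints. Decimal units (1 TB = 10^12 bytes) are an assumption here, since the slides do not say which convention they use.

```python
TB = 10**12  # bytes in a decimal terabyte (assumption)
MB = 10**6   # bytes in a decimal megabyte (assumption)

total_bytes = 30 * TB
minutes = 30 * 24 * 60  # 30 days

mb_per_minute = total_bytes / MB / minutes
mbit_per_second = total_bytes * 8 / (minutes * 60) / 10**6
link_share = mbit_per_second / 10_000  # fraction of a 10 Gbps link

print(f"{mb_per_minute:.0f} MB/minute")
print(f"{mbit_per_second:.1f} Mbit/s, {link_share:.1%} of a 10 Gbps link")
```

Under these assumptions the average works out to about 694 MB/minute, which matches the ~700 MB/minute figure, and to under 1% of a 10 Gbps link. This is a 30-day average, so it says nothing about peak transfer rates.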
Data Replication Service
• Data Replication Service (DRS)
• Reimplementation of LDR publish/subscribe capabilities.
• Uses Java-based WS-RF services.
• Uses the RLS and RFT services from GT4.
GEO600 Observatory
• Goal: observe gravitational waves.
• 600 m laser interferometer close to Hannover.
• Members of LIGO as well.
The Challenge
• Scan the data packets produced by GEO600 for gravitational waves:
• Very complex signal processing.
• Very large amounts of data to process.
• D-GRID (state-owned German Grid) and Open Science Grid available.
Two-Pronged Approach
Einstein@Home
• Shared with the LIGO community.
• Runs on volunteers’ PCs.
• Uses the BOINC network.
• Running since mid-2006.
• >70,000 computers/week
• ~19,000 units/day
AstroGrid-D
• Same application as Einstein@Home.
• Uses D-Grid and OSG.
• Running since Oct/2007.
• Distributes tasks using GRAM.
• Averages 4,000 units/day.
Traditional Approach to Resource Management
• Accessing multiple sites:
• Accounts, permissions, etc.
• Use a metascheduler to decide on resource selection:
• GridWay
• Metascheduler uses GRAM to contact different job submission sites.
GEO600 Approach
• The submission node has a list of all the GEO600 resources, with minimum and maximum job capacities for each.
• The list is reviewed hourly and jobs are dispatched.
• Whenever a resource has fewer than its minimum number of jobs, more jobs are sent until it reaches its maximum, and the corresponding input files are transferred asynchronously.
• The status of each job is maintained on the submission machine.
• Job output files are sent back to the submission system asynchronously.
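The min/max top-up rule described above can be sketched as a single review pass. The `Resource` class and `review` function are hypothetical names for illustration; the actual dispatch (via GRAM) and the asynchronous file transfers are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    min_jobs: int
    max_jobs: int
    running: list = field(default_factory=list)


def review(resources, pending):
    """One review pass: top up every resource that fell below its minimum."""
    dispatched = []
    for res in resources:
        if len(res.running) < res.min_jobs:
            # Refill up to the maximum, not just back to the minimum.
            while pending and len(res.running) < res.max_jobs:
                job = pending.pop(0)
                res.running.append(job)
                dispatched.append((job, res.name))
    return dispatched


site_a = Resource("A", min_jobs=2, max_jobs=4, running=["j0"])
site_b = Resource("B", min_jobs=1, max_jobs=2, running=["j1"])
queue = ["j2", "j3", "j4", "j5", "j6"]
sent = review([site_a, site_b], queue)
```

Refilling to the maximum rather than the minimum means a resource is only revisited after it has drained a whole batch, which suits an hourly review cycle.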
D-GRID Approach
[Figure: Resource A and Resource B each expose a GRAM4 service in front of a local scheduler (e.g. Condor, SGE) and its computing nodes. Each scheduler receives local jobs, GEO600 jobs and other GRAM4 tasks.]