Cloud Computing
Lecture 6
Grid Case Studies
2010-2011
Up until now…
• Introduction.
• Definition of Cloud Computing.
• Grid Computing:
• Schedulers
• Globus Toolkit
Summary
• Grid Case Studies:
• Monitoring: TeraGrid IS
• Data Transfer: LIGO
• Task Distribution: GEO600
TeraGrid (National Science Foundation)
•TeraGrid DEEP: Integrates NSF’s 60 largest computers (more than 60 TF):
•More than 2 PB of online storage.
•National data visualization infrastructure.
•The world's most powerful computing network.
•TeraGrid WIDE Science Portals:
•Integration of scientific communities.
•More than 90 community data collections.
•Cooperation with other grid projects in Europe and Asia/Pacific.
• Provide a mechanism that allows all participants to publish and discover information about available capabilities:
• What are TeraGrid’s computing resources?
• Which features are provided by each resource?
• Where are the login services?
• Where can I access a particular data collection?
• Who has a weather forecast service?
• Provide a mechanism adapted to the TeraGrid open community:
Technical Issues
• Information is stored in legacy systems:
• Databases (different types, restricted access).
• Static and dynamic web interfaces.
• Multiple and varied database schemas:
• It is very complex to design an integrated database that supports all data types and relations.
• Many types of clients (browsers, SOAP, REST).
• Service availability is critical:
• Both TeraGrid management (testing, documentation and planning) and its users and partners depend on this service.
• Therefore it has a 99.5% availability goal.
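To make the 99.5% goal concrete, the following minimal sketch converts an availability target into a downtime budget. It is plain arithmetic; the function name and the choice of a 365-day year are assumptions made for illustration, not part of the TeraGrid design.

```python
# Downtime allowed by an availability target (pure arithmetic, no external data).
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability: float, period_hours: float) -> float:
    """Return the hours of downtime permitted in a period at a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_hours


per_year = allowed_downtime_hours(0.995, HOURS_PER_YEAR)
per_30_days = allowed_downtime_hours(0.995, 30 * 24)
print(f"{per_year:.1f} h/year, {per_30_days:.1f} h per 30 days")
```

Under these assumptions the budget is roughly 44 hours of downtime per year, or under 4 hours in a 30-day month.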
TeraGrid Information Service Architecture
[Figure: clients, some behind caches, reach the TeraGrid Central Services (Apache 2.0, Tomcat WebMDS, WS MDS4) through WS/REST HTTP GET or WS/SOAP; the central services connect over WS/SOAP to the WS MDS4 services at the resources.]
Registry
• Publishers register the content they make available:
• Each local service maintains a registration in the central registry.
• Entries expire automatically, so they must be refreshed periodically.
• Publishers keep ownership of their information systems (they can even participate in other grids).
• Indexing services pull content:
• Registry entries have access control.
• Registry entries have a link to the source.
• The cache covers for service failures, etc.
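The expiring-registration idea can be sketched as follows. This is a minimal illustration, not the MDS4 implementation: the class and method names, the 600-second lifetime, and the example.org URLs are all hypothetical, and access control is left out.

```python
import time
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Entry:
    source_url: str    # link back to the publisher's own information service
    expires_at: float  # absolute time after which the entry is dropped


class Registry:
    """Central registry whose entries expire unless the publisher refreshes them."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl = ttl_seconds
        self._entries: Dict[str, Entry] = {}

    def register(self, name: str, source_url: str, now: Optional[float] = None) -> None:
        """Create or refresh an entry; publishers call this periodically."""
        now = time.time() if now is None else now
        self._entries[name] = Entry(source_url, now + self.ttl)

    def live_entries(self, now: Optional[float] = None) -> List[str]:
        """Drop expired entries and return the names still registered."""
        now = time.time() if now is None else now
        self._entries = {n: e for n, e in self._entries.items() if e.expires_at > now}
        return sorted(self._entries)


registry = Registry(ttl_seconds=600)
registry.register("site-a", "https://site-a.example.org/info", now=0)
registry.register("site-b", "https://site-b.example.org/info", now=0)
registry.register("site-a", "https://site-a.example.org/info", now=500)  # refresh
print(registry.live_entries(now=700))  # site-b expired at t=600, so only site-a is left
```

Because stale entries disappear on their own, a publisher that goes offline needs no explicit deregistration step.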
High Availability Architecture
[Figure: high availability architecture. Labels: info.teragrid.org, info.dyn.teragrid.org, TeraGrid Dynamic DNS, Clients, Resources and Partners.]
TGUP Batch Load & Queue Data
IIS provides queue and batch load information from all RP sites for TGUP to use in its system monitor.
    <LoadRP xmlns="">
      <ComputeResourceLoad xmlns="">
        <ResourceID>pople.psc.teragrid.org</ResourceID>
        <SiteID>psc.teragrid.org</SiteID>
        <LoadInfo hostname="tg-login1.pople.psc.teragrid.org" timestamp="2009-11-11T13:46:19Z">
          <Load>
            <Type>queue</Type>
            <Value>98</Value>
          </Load>
        </LoadInfo>
      </ComputeResourceLoad>
    </LoadRP>

http://portal.teragrid.org/
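A client such as a portal could read this kind of load document with a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` on a closed-up copy of the excerpt; the closing tags and the exact nesting are assumptions (the excerpt on the slide is truncated), the empty `xmlns` attributes are dropped for brevity, and `parse_loads` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the slide excerpt; the nesting is assumed.
SAMPLE = """\
<LoadRP>
  <ComputeResourceLoad>
    <ResourceID>pople.psc.teragrid.org</ResourceID>
    <SiteID>psc.teragrid.org</SiteID>
    <LoadInfo hostname="tg-login1.pople.psc.teragrid.org" timestamp="2009-11-11T13:46:19Z">
      <Load>
        <Type>queue</Type>
        <Value>98</Value>
      </Load>
    </LoadInfo>
  </ComputeResourceLoad>
</LoadRP>
"""


def parse_loads(xml_text):
    """Return one dict per <Load> element, tagged with its resource and host."""
    root = ET.fromstring(xml_text)
    rows = []
    for resource in root.iter("ComputeResourceLoad"):
        resource_id = resource.findtext("ResourceID")
        for info in resource.iter("LoadInfo"):
            for load in info.iter("Load"):
                rows.append({
                    "resource": resource_id,
                    "host": info.get("hostname"),
                    "timestamp": info.get("timestamp"),
                    "type": load.findtext("Type"),
                    "value": int(load.findtext("Value")),
                })
    return rows


rows = parse_loads(SAMPLE)
for row in rows:
    print(row["resource"], row["type"], row["value"])
```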
TeraGrid Results
• Does not require deep modification or loss of ownership of legacy systems.
• Simple and consistent access mechanism.
• Integrates:
• Description of computing services and queue state.
• Registry of service and software availability.
• Centralized documentation.
• Test, validation and verification service: INCA, testing and execution portal.
[Figure: Globus Toolkit (GT4) components grouped by area (Security, Execution, Monitoring, Base, Data): GridWay, GridFTP, Reliable File Transfer, GRAM, MDS4, CAS, Data Rep, Delegation, Replica Location, Java Runtime, C Runtime, Python Runtime, GSI-OpenSSH, MyProxy.]
LIGO
LIGO: Laser Interferometer Gravitational-Wave Observatory
• Goal: observe gravitational waves.
• Three physical detectors in two locations (plus the GEO detector in Germany).
• More than 10 centres for data analysis.
• Collaborators in more than 40 institutions.
LIGO
• LIGO records thousands of data channels, generating 1 TB/day of data during test periods:
• The resulting data are published, and data centres subscribe to the parts that local users want for analysis or storage.
• The data analysis produces further derived data:
• About 30% of LIGO's total data.
• These are also published and replicated.
• More than 35 million files on the grid:
• More than 6 million unique files.
The Challenge
Replicate more than 1 TB/day of data to more than 10 locations.
Solution:
• A publish/subscribe model.
• Let scientists specify and discover data using application-level criteria (metadata).
• Let scientists locate copies of the data.
Technical Constraints
• Efficiency
• Avoid under-using bandwidth during transfers, especially on high-bandwidth (10 Gbps) links.
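A quick calculation shows why link utilisation matters for the replication target stated above. The sketch assumes decimal terabytes (10^12 bytes), which the slides do not specify, and ignores protocol overhead; the function names are illustrative only.

```python
TB = 10**12  # bytes; decimal terabyte (assumption, the slides do not say TB vs TiB)
SECONDS_PER_DAY = 86_400


def sustained_mbps(bytes_per_day: float) -> float:
    """Average rate in Mbit/s needed to move a daily volume within one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY / 1e6


def transfer_seconds(num_bytes: float, link_gbps: float) -> float:
    """Time to move num_bytes over a fully used link of the given capacity."""
    return num_bytes * 8 / (link_gbps * 1e9)


rate = sustained_mbps(1 * TB)       # one copy of the daily output
fan_out = rate * 10                 # if a single source fed all 10 locations
burst = transfer_seconds(1 * TB, 10)

print(f"{rate:.1f} Mbit/s per copy, {fan_out:.0f} Mbit/s for 10 copies")
print(f"{burst:.0f} s to move 1 TB at a full 10 Gbit/s")
```

One copy needs only about 93 Mbit/s on average, yet a day's data could in principle cross a fully used 10 Gbps link in under 15 minutes, so a transfer tool that cannot fill the link wastes most of the capacity.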
Lightweight Data Replicator
Joins three basic Globus services:
1. Metadata service (MDS):
“What files are available?”
Information about files such as size, md5, date, …
Metadata propagation.
2. Globus Replica Location Service (RLS):
“Where are the files?”
Catalog service translates filenames into URLs. Maps files to locations.
3. GridFTP Service:
“How can we copy files?”
Server + adapted client.
Used to replicate data between locations.
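The way the three services answer their three questions in sequence can be sketched with in-memory stand-ins. Nothing below is a Globus API: the dictionaries, function names, file names, attributes, and example.org hosts are all hypothetical; only the `gsiftp://` URL scheme is the one GridFTP uses.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the metadata service: logical file name -> attributes.
METADATA: Dict[str, Dict[str, object]] = {
    "run1-0001.dat": {"size": 1_000, "detector": "H1"},
    "run1-0002.dat": {"size": 2_000, "detector": "L1"},
}

# Hypothetical stand-in for the replica location service: logical name -> URLs.
REPLICAS: Dict[str, List[str]] = {
    "run1-0001.dat": ["gsiftp://site-a.example.org/data/run1-0001.dat"],
    "run1-0002.dat": ["gsiftp://site-b.example.org/data/run1-0002.dat"],
}


def find_files(predicate: Callable[[Dict[str, object]], bool]) -> List[str]:
    """'What files are available?' Select logical names by metadata."""
    return sorted(name for name, meta in METADATA.items() if predicate(meta))


def locate(name: str) -> List[str]:
    """'Where are the files?' Map a logical name to its replica URLs."""
    return REPLICAS.get(name, [])


def plan_copies(predicate, dest_prefix: str) -> List[Tuple[str, str]]:
    """'How can we copy files?' Build (source, destination) pairs for a copy client."""
    plan = []
    for name in find_files(predicate):
        sources = locate(name)
        if sources:  # skip files with no known replica
            plan.append((sources[0], dest_prefix + name))
    return plan


plan = plan_copies(lambda m: m["detector"] == "H1", "gsiftp://local.example.org/data/")
print(plan)
```

Each resulting pair is what a GridFTP client would be handed as source and destination.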
LIGO Data Replicator Architecture
• Each participant has a machine dedicated to transferring files requested by local clients.
• The scheduler queries the metadata and replica catalogs to identify missing files, which are added to a priority list.
• The transfer daemon checks the list, transfers files and updates the LRC (Local Replica Catalog).
• If a transfer fails it remains on the
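The scheduler and transfer daemon roles can be sketched as two small functions around a priority queue. This is an illustration under assumptions, not LDR's code: the function names are hypothetical, the local catalog is a plain set, the transfer is a callback that returns success or failure, and re-queuing failed transfers for a later pass is an assumed policy.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple

Queue = List[Tuple[int, str]]  # (priority, file name); lower number = more urgent


def schedule_missing(wanted: Iterable[str], local_catalog: Set[str],
                     priority_of: Callable[[str], int]) -> Queue:
    """Scheduler role: queue every wanted file that is not yet held locally."""
    queue = [(priority_of(name), name) for name in wanted if name not in local_catalog]
    heapq.heapify(queue)
    return queue


def run_transfer_pass(queue: Queue, local_catalog: Set[str],
                      transfer: Callable[[str], bool]) -> None:
    """Transfer daemon role: work through the queue once, in priority order."""
    retry = []
    while queue:
        priority, name = heapq.heappop(queue)
        if transfer(name):
            local_catalog.add(name)          # record the new local replica
        else:
            retry.append((priority, name))   # assumed policy: try again next pass
    for item in retry:
        heapq.heappush(queue, item)


catalog = {"a.dat"}
queue = schedule_missing(["a.dat", "b.dat", "c.dat"], catalog, priority_of=lambda n: 0)
run_transfer_pass(queue, catalog, transfer=lambda name: name != "c.dat")
print(sorted(catalog), queue)
```

In this run `b.dat` is copied and recorded, while the simulated failure leaves `c.dat` queued for the next pass.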
LIGO Results
• Complete LIGO/GEO experiment:
• Replicated 30 TB in 30 days.
• Mean time between failures (MTBF): 1 month.
• More than 35 million files on the LDR network.
• Performance limited by the chosen programming language (Python).
• Partnership with the Globus Alliance to include a version of LDR in the Globus Toolkit.
Data Replication Service
• Data Replication Service (DRS)
• Reimplementation of LDR's publish/subscribe capabilities.
• Uses Java-based WS-RF services.
• Uses the RLS and RFT services from GT4.
GEO600
GEO600 Observatory
• Goal (same as LIGO): observe gravitational waves.
• 600 m laser interferometer close to Hannover.
• Members of LIGO as well.
The Challenge
• Scan the data packets produced by GEO600 for gravitational-wave signals:
• Very complex signal processing.
• Very large amounts of data to process.
• D-Grid (the state-owned German grid) and the Open Science Grid are available.
Two Pronged Approach
Einstein@Home
• Shared with the LIGO community.
• Runs on volunteers' PCs.
• Uses the BOINC network.
• Running since mid-2006.
• >70,000 computers/week.
• ~19,000 units/day.
AstroGrid-D
• Same application as Einstein@Home.
• Uses D-Grid and OSG.
• Running since October 2007.
• Distributes tasks using GRAM.
• Averages 4,000 units/day.
Traditional Approach to Resource Management
• Accessing multiple sites:
• Accounts, permissions, etc.
• Use a metascheduler to decide on resource selection:
• GridWay
• The metascheduler uses GRAM to contact the different job submission sites.
GEO600 Approach
• The submission node keeps a list of all GEO600 resources, with minimum and maximum job capacities for each.
• Every hour the list is reviewed and jobs are dispatched.
• Whenever a node has fewer than its minimum number of jobs, more jobs are submitted up to its maximum, and the corresponding input files are transferred asynchronously.
• The status of each job is maintained on the submission machine.
• Job output files are sent back to the submission system asynchronously.
• GEO600 processes 4,000 jobs/day this way.
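The min/max rule used by the submission node can be sketched as a single review pass. This is a minimal illustration with hypothetical names and numbers; the actual submission through GRAM and the asynchronous file staging are omitted.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, List


@dataclass
class Resource:
    name: str
    min_jobs: int     # refill threshold
    max_jobs: int     # capacity to fill up to
    active_jobs: int  # jobs currently queued or running there


def top_up(resources: List[Resource], pending: Deque[str]) -> Dict[str, List[str]]:
    """One review pass: refill every resource that has dropped below its minimum."""
    dispatched: Dict[str, List[str]] = {}
    for res in resources:
        if res.active_jobs >= res.min_jobs:
            continue  # still busy enough, leave it alone until the next review
        while res.active_jobs < res.max_jobs and pending:
            job = pending.popleft()
            dispatched.setdefault(res.name, []).append(job)
            res.active_jobs += 1
    return dispatched


resources = [
    Resource("site-a", min_jobs=2, max_jobs=4, active_jobs=1),
    Resource("site-b", min_jobs=2, max_jobs=3, active_jobs=2),
]
pending = deque(f"job-{i}" for i in range(5))
dispatched = top_up(resources, pending)
print(dispatched, list(pending))
```

Here only `site-a` is below its minimum, so it is filled to its maximum with three jobs and the remaining two jobs wait for the next review.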
D-Grid Approach
[Figure: at each site a GRAM4 service sits in front of the local scheduler (e.g. Condor at one site, SGE at another); each scheduler handles local jobs, GEO600 jobs and other GRAM4 tasks.]