Cloud Computing. Lecture 5 Grid Case Studies

(1)

Cloud Computing

Lecture 5

Grid Case Studies

2014-2015

(2)

Up until now…

• Introduction.

• Definition of Cloud Computing.

• Grid Computing:

• Schedulers

(3)

Summary

• Grid Case Studies:

• Monitoring: TeraGrid IS

• Data Transfer: LIGO

(4)

(5)

TeraGrid (National Science Foundation)

• TeraGrid DEEP: integrates NSF’s 60 largest computers (more than 60 TF):

• More than 2 PB of online storage.

• National data visualization infrastructure.

• The world’s most powerful computing network.

• TeraGrid WIDE Science Portals:

• Integration of scientific communities.

• More than 90 community data collections.

• Cooperation with other grid projects in Europe and

(6)

• Provide a mechanism that allows all participants to publish and discover information about available capabilities:

• What are TeraGrid’s computing resources?

• Which features are provided by each resource?

• Where are the login services?

• Where can I access a particular data collection?

• Who has a weather forecast service?

• Provide a mechanism adapted to the TeraGrid open community:

• Publishers register their information (instead of submitting it to a centralized database).

• A central index allows for aggregation and discovery.

• Multiple access interfaces (WS/SOAP, WS/ReST, browser).

(7)

Technical Issues

• Information is stored in legacy systems:

• Databases (different types, restricted access).

• Static and dynamic web interfaces.

• Multiple and varied database schemas:

• It is very complex to design an integrated database that supports all data types and relations.

• Many types of clients (browsers, SOAP, ReST).

• Service availability is critical:

• Both TeraGrid management (testing, documentation and planning) and its users and partners depend on this service.

(8)

TeraGrid Information Service Architecture

[Diagram. Components shown: clients (some behind caches); WS/REST access over HTTP GET through WebMDS on Apache 2.0 and Tomcat; WS/SOAP access to WS MDS4 at the TeraGrid Central Services; WS/SOAP access to WS MDS4 at the resources.]

(9)

Registry

• Publishers register available content:

• The local service maintains a registration in the central registry.

• Entries expire automatically, so they are refreshed periodically.

• Publishers keep ownership of their information systems (they can even participate in other grids).

• Indexing services pull content:

• Registry entries have access control.

• Registry entries have a link to the source.

• The cache covers for service faults, etc.
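
The registration behaviour on this slide (entries that expire unless refreshed, an index that pulls content from each source, and a cache that covers for a failed source) can be sketched as below. This is a minimal illustration under stated assumptions, not TeraGrid's implementation: the class name `Registry`, the `ttl` parameter and the `fetch` callables are all hypothetical.

```python
import time


class Registry:
    """Toy central registry: entries expire unless the publisher refreshes them."""

    def __init__(self, ttl=60.0, clock=time.monotonic):
        self.ttl = ttl        # hypothetical lifetime of a registration, in seconds
        self.clock = clock    # injectable clock, so the sketch is easy to test
        self.entries = {}     # name -> (fetch callable, expiry time)
        self.cache = {}       # name -> last content pulled successfully

    def register(self, name, fetch):
        """Called by a publisher, first to register and later to refresh."""
        self.entries[name] = (fetch, self.clock() + self.ttl)

    def pull(self):
        """Indexing step: drop expired entries, then pull content from each source."""
        now = self.clock()
        self.entries = {n: e for n, e in self.entries.items() if e[1] > now}
        index = {}
        for name, (fetch, _) in self.entries.items():
            try:
                self.cache[name] = fetch()
            except Exception:
                pass  # source unavailable: fall back to the cached copy, if any
            if name in self.cache:
                index[name] = self.cache[name]
        return index
```

The publisher keeps its own data and only hands the registry a way to reach it, which mirrors the point that publishers keep ownership of their information systems.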

(10)

High Availability Architecture

…

info.dyn.teragrid.org

info.teragrid.org

TeraGrid Dynamic DNS

Clients

Resources and Partners

(11)

TGUP Batch Load & Queue Data

IIS provides queue and batch load information from all RP sites for TGUP to use in its system monitor.

<Load>
  <ResourceID>pople.psc.teragrid.org</ResourceID>
  <SiteID>psc.teragrid.org</SiteID>
  <Type>queue</Type>
  <Value>98</Value>
</Load>
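
A client such as TGUP could read a record like this with Python's standard `xml.etree.ElementTree` module. The sketch below assumes that the record is wrapped in a single `<Load>` element with exactly the four child elements shown on the slide. The real schema may contain more, and the function name `parse_load` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Record rebuilt from the slide fragment; the opening <Load> tag is an assumption.
RECORD = """
<Load>
  <ResourceID>pople.psc.teragrid.org</ResourceID>
  <SiteID>psc.teragrid.org</SiteID>
  <Type>queue</Type>
  <Value>98</Value>
</Load>
"""


def parse_load(xml_text):
    """Return the fields of one load record as a dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        "resource": root.findtext("ResourceID"),
        "site": root.findtext("SiteID"),
        "type": root.findtext("Type"),
        "value": int(root.findtext("Value")),
    }


print(parse_load(RECORD))
```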

(12)

TeraGrid Results

• Does not require deep modification or loss of ownership of legacy systems.

• Simple and consistent access mechanism.

• Integrates:

• Description of computing services and queue state.

• Registry of service and software availability.

• Centralized documentation.

• Test, validation and verification service: INCA, testing and execution portal.

(13)

GT4: Base

[Diagram: Globus Toolkit (GT4) components grouped under Security, Execution, Monitoring, Base and Data: GridWay, GridFTP, Reliable File Transfer, GRAM, MDS4, CAS, Data Rep, Delegation, Replica Location, Java Runtime, C Runtime, Python Runtime, GSI-OpenSSH, MyProxy.]

(14)

(15)

LIGO: Laser Interferometer Gravitational-Wave Observatory

• Goal: observe gravitational waves.

• Three physical detectors in two locations (plus the GEO detector in Germany).

• More than 10 centres for data analysis.

• Collaborators in more than 40 institutions.

(16)

LIGO: Laser Interferometer Gravitational-Wave Observatory

• LIGO records thousands of data channels, generating 1 TB/day of data during test periods:

• The resulting data are published, and data centres subscribe to the parts that local users want for analysis or storage.

• The data analysis results in more derived data:

• About 30% of LIGO’s total data.

• They are also published and replicated.

• More than 35 million files on the grid:

• More than 6 million unique files

(17)

The Challenge

Replicate more than 1 TB/day of data to more than 10 locations.

Solution:

• A publish/subscribe model.

• Let scientists specify and discover data using application-level criteria (metadata).
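
A publish/subscribe model driven by metadata can be sketched as below. This only illustrates the idea and is not LIGO's code: the `Broker` class, the site names and the metadata keys (`channel`, `derived`) are hypothetical.

```python
from collections import defaultdict


class Broker:
    """Toy publish/subscribe broker keyed on file metadata."""

    def __init__(self):
        self.subscriptions = []          # (site, predicate) pairs
        self.inbox = defaultdict(list)   # site -> file names to replicate

    def subscribe(self, site, predicate):
        """A site declares interest with a predicate over metadata dictionaries."""
        self.subscriptions.append((site, predicate))

    def publish(self, filename, metadata):
        """Announce a new file; every matching subscriber queues it for transfer."""
        for site, predicate in self.subscriptions:
            if predicate(metadata):
                self.inbox[site].append(filename)


broker = Broker()
# Hypothetical criteria: one site wants a single channel, another wants derived data.
broker.subscribe("site-a", lambda m: m["channel"] == "strain")
broker.subscribe("site-b", lambda m: m["derived"])
broker.publish("f1.dat", {"channel": "strain", "derived": False})
broker.publish("f2.dat", {"channel": "seismic", "derived": True})
print(dict(broker.inbox))
```

Each site receives only the files that match its own criteria, so nobody has to replicate the full data set.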

(18)

Technical Constraints

• Efficiency

• Avoid under-using bandwidth during transfers, especially on high-bandwidth (10 Gbps) links.
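
To put the figures in perspective, the sustained rate implied by 1 TB/day can be computed directly. The sketch assumes decimal terabytes and, for the ten-site case, that a single source sends every copy itself; neither assumption is stated on the slides.

```python
TB = 10**12                      # bytes, decimal terabyte (an assumption)
SECONDS_PER_DAY = 24 * 60 * 60


def required_mbps(bytes_per_day, destinations=1):
    """Sustained rate in megabits per second needed to ship the daily volume."""
    return bytes_per_day * 8 * destinations / SECONDS_PER_DAY / 10**6


per_site = required_mbps(1 * TB)
all_sites = required_mbps(1 * TB, destinations=10)
link_mbps = 10_000               # the 10 Gbps link mentioned on the slide

print(f"per site:  {per_site:.1f} Mbit/s")
print(f"ten sites: {all_sites:.1f} Mbit/s ({all_sites / link_mbps:.1%} of the link)")
```

The average rate is a small fraction of a 10 Gbps link, so the concern is less the raw capacity than keeping each transfer efficient enough to actually use it.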

(19)

Lightweight Data Replicator

Combines three basic Globus services:

1. Metadata service (MDS):

• “What files are available?”

• Information about files such as size, md5, date, …

• Metadata propagation.

2. Globus Replica Location Service (RLS):

• “Where are the files?”

• Catalog service translates filenames into URLs.

• Maps files to locations.

3. GridFTP Service:

• “How can we copy files?”

• Server plus customized client.
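
The way the three questions chain together can be sketched with in-memory stand-ins for the services. Nothing here calls a real Globus API: the dictionaries, host names and function names are hypothetical, and the actual copy is left to a GridFTP client.

```python
# Hypothetical in-memory stand-ins for the first two services.
METADATA = {   # metadata service: "What files are available?"
    "f1.dat": {"size": 1024, "date": "2005-11-04"},
    "f2.dat": {"size": 2048, "date": "2005-11-05"},
}
REPLICAS = {   # replica location service: "Where are the files?"
    "f1.dat": ["gsiftp://site-a.example.org/data/f1.dat"],
    "f2.dat": ["gsiftp://site-b.example.org/data/f2.dat",
               "gsiftp://site-a.example.org/data/f2.dat"],
}


def find_files(predicate):
    """Metadata lookup: names of files whose metadata satisfies the predicate."""
    return sorted(name for name, meta in METADATA.items() if predicate(meta))


def locate(name):
    """Replica lookup: translate a logical file name into candidate URLs."""
    return REPLICAS.get(name, [])


def plan_copies(predicate, local_dir):
    """Pair each wanted file with a source URL and a local destination.

    The copy itself ("How can we copy files?") would be handed to a GridFTP
    client; it is left out so that the sketch runs without a grid.
    """
    plan = []
    for name in find_files(predicate):
        urls = locate(name)
        if urls:
            plan.append((urls[0], f"{local_dir}/{name}"))
    return plan


print(plan_copies(lambda m: m["size"] > 1500, "/tmp/ldr"))
```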

(20)

LIGO Data Replicator Architecture

• Each participant has a machine dedicated to transferring files requested by local clients.

• The scheduler queries the metadata and replica catalogs to identify missing files, which are added to a priority list.

• The transfer daemon checks the list, transfers the files and updates the LRC (Local Replica Catalog).

• If a transfer fails it remains on the
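
The scheduler and transfer daemon can be sketched as two small functions around a priority queue. This is an illustration under stated assumptions, not the LDR code: the numeric priorities, the `copy` callable and the policy of keeping a failed transfer queued for a later pass are all assumptions.

```python
import heapq


def schedule(wanted, local_catalog):
    """Scheduler: queue every wanted file that is missing locally.

    `wanted` maps file name -> priority (lower runs first); both the mapping
    and the numeric priorities are assumptions made for this sketch.
    """
    queue = [(prio, name) for name, prio in wanted.items()
             if name not in local_catalog]
    heapq.heapify(queue)
    return queue


def run_transfers(queue, local_catalog, copy):
    """Transfer daemon: one pass over the list in priority order.

    A successful copy is recorded in the local replica catalog. A failed
    copy is kept for a later pass (an assumption about the retry policy).
    """
    retry = []
    while queue:
        prio, name = heapq.heappop(queue)
        try:
            copy(name)
            local_catalog.add(name)
        except OSError:
            retry.append((prio, name))
    heapq.heapify(retry)
    return retry
```

Calling `run_transfers` again with the returned list retries only the files that failed, so a broken link delays those files without blocking the rest.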

(21)

LIGO Results

• Complete LIGO/GEO experiment:

• Replicated 30 TB in 30 days (~700 MB/minute).

• MTBF: 1 month.

• More than 35 million files on the LDR network.

• Performance limited by the chosen programming language (Python).

• Partnership with the Globus Alliance to include a version of the LDR in the Globus Toolkit.

(22)

Data Replication Service

• Data Replication Service (DRS)

• Reimplementation of LDR publish/subscribe capabilities.

• Uses Java-based WS-RF services.

• Uses the RLS and RFT services from GT4.

(23)

GT4: Base

[Diagram: Globus Toolkit (GT4) components grouped under Security, Execution, Monitoring, Base and Data: GridWay, GridFTP, Reliable File Transfer, GRAM, MDS4, CAS, Data Rep, Delegation, Replica Location, Java Runtime, C Runtime, Python Runtime, GSI-OpenSSH, MyProxy.]

(24)

(25)

GEO600 Observatory

• Goal: observe gravitational waves.

• 600 m laser interferometer close to Hannover.

• Also a member of LIGO.

(26)

The Challenge

• Search the data packets produced by GEO600 for gravitational waves:

• Very complex signal processing.

• Very large amounts of data to process.

• D-Grid (the state-owned German grid) and the Open Science Grid are available.

(27)

Two Pronged Approach

Einstein@Home

• Shared with the LIGO community.

• Runs on volunteers’ PCs.

• Uses the BOINC network.

• Running since mid-2006.

• >70,000 computers/week

• ~19,000 units/day

AstroGrid-D

• Same application as Einstein@Home.

• Uses D-Grid and OSG.

• Running since October 2007.

• Distributes tasks using GRAM.

• Averages 4,000 units/day.

(28)

Traditional Approach to Resource Management

• Accessing multiple sites:

• Accounts, permissions, etc.

• Use a metascheduler to decide on resource selection:

• GridWay

• Metascheduler uses GRAM to contact different job submission sites.

(29)

GEO600 Approach

• The submission node has a list of all the GEO600 resources, with minimum and maximum job capacities for each.

• Every hour the list is reviewed and jobs are dispatched.

• Whenever a resource has fewer than its minimum number of jobs, more jobs are submitted up to its maximum, and the corresponding input files are transferred asynchronously.

• The status of each job is maintained in the submission machine.

• Job output files are sent back to the submission system asynchronously.
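
The refill rule on this slide can be sketched as a single review pass over the resource list. This is a minimal illustration, not the GEO600 scripts: the `Resource` fields and the `review` function are hypothetical names, and real job submission and asynchronous file staging are left out.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    min_jobs: int      # refill threshold
    max_jobs: int      # capacity after a refill
    running: int = 0   # jobs currently held by the resource


def review(resources, pending, status):
    """One periodic review: top up every resource that fell below its minimum.

    `pending` is a list of job ids waiting at the submission node and `status`
    is the job table kept there.
    """
    for res in resources:
        if res.running >= res.min_jobs:
            continue                 # still busy enough, leave it alone
        while res.running < res.max_jobs and pending:
            job = pending.pop(0)
            status[job] = res.name   # remember where the job was sent
            res.running += 1
    return status
```

Running `review` once per hour from a timer would match the schedule on the slide. The gap between the minimum and the maximum means each resource is refilled in batches rather than one job at a time.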

(30)

D-Grid Approach

[Diagram: Resource A and Resource B each run a GRAM4 service in front of a local scheduler (e.g. Condor, SGE) and its computing nodes. The schedulers receive local jobs, GEO600 jobs and other GRAM4 tasks.]

(31)