• No results found

A Service for Data-Intensive Computations on Virtual Clusters

N/A
N/A
Protected

Academic year: 2021

Share "A Service for Data-Intensive Computations on Virtual Clusters"

Copied!
17
0
0

Loading.... (view fulltext now)

Full text

(1)

A Service for

Data-Intensive Computations

on Virtual Clusters

Rainer Schmidt, Christian Sadilek, and Ross King

[email protected]

(2)

Planets Project

ƒ “Permanent Long-term Access through NETworked Services”

ƒ Addresses the problem of digital preservation

ƒ driven by National Libraries and Archives

ƒ Project instrument: FP6 Integrated Project

ƒ 5. IST Call

ƒ Consortium: 16 organisations from 7 countries

ƒ Duration: 48 months, June 2006 – May 2010

ƒ Budget: 14 Million Euro

(3)

Outline

ƒ

Digital Preservation

ƒ

Need for eScience Repositories

ƒ

The Planets Preservation Environment

ƒ

Distributed Data and Metadata Integration

ƒ

Data-Intensive Computations

ƒ

Grid Service for Job Execution on AWS

ƒ

Experimental Results

(4)

Drivers for Digital Preservation

ƒ

Exponential growth of digital data across all sectors of society

ƒ

Large quantities of often irreplaceable information

ƒ

Research and scientific data

ƒ

Data becomes significantly more complex and diverse

ƒ

Digital information is fragile

ƒ

Ensure long term access and interpretability

ƒ

Requires data curation and preservation

ƒ

Development and assessment of preservation strategies

(5)

The Planets Environment

ƒ

A collaborative research infrastructure for the systematic

development and evaluation and of preservation strategies.

ƒ

A decision-support system for the development of

preservation plans

ƒ

An integrated environment for dynamically providing,

discovering and accessing a wide range of preservation tools,

services, and technical registries.

ƒ

A workflow environment for the data integration, process

execution, and the maintenance of preservation information.

(6)

Service Gateway Architecture

Preservation Planning (Plato) Experimentation Testbed Application Notification and Logging System Workflow Execution UI Workflow Execution and Monitoring Experiment Data and Metadata Repository Service and Tool Registry Application Services Execution Services Data Storage Services Administration UI Authentication and Authorization User Applications Portal Services

Application Execution and Data Services

(7)

Data and Metadata Integration

ƒ

Support distributed and remote storage facilities

ƒ

data registry and metadata repository

ƒ

Integrate different data sources and encoding standards

ƒ

Support range of protocols and description schemas

ƒ

Connect distributed data repositories and archives.

ƒ

Development of common digital object model

ƒ

Dynamic mapping of objects from different institutions

(memory inst., research collections, content networks)

ƒ

Recording of provenance information, preservation,

integrity, and descriptive metadata

(8)

Data Intensive Applications

ƒ Development Planets Job Submission Services

ƒ Allow Job Submission to a PC cluster (e.g. Hadoop, Condor)

ƒ Standard grid protocols/interfaces (SOAP, HPC-BP, JSDL)

ƒ Demand for massive compute and storage resources

ƒ Instantiate cluster on top of (leased) cloud resources based on AWS EC2 + S3, (or alternatively in-house in a PC lab.)

ƒ Allows computation to be moved to data or vice-versa

ƒ On-demand cluster based on virtual machine images

ƒ Many Preservation Tools are/rely on 3rd party applications

(9)

Experimental Architecture

JSDL

Virtual Node (Xen)

JSS

Data Service

Cloud Infrastructure (EC2)

Storage Infrastructure

(S3) Virtual Cluster and File System (Apache Hadoop)

Digital Objects

Job Description File

Job HPCBP Workflo w Envir onment

(10)

A Light-Weight Job Execution Service

TLS/WS Security WS - Container BES/HPCBP API Account Manager JSDL Parser Session Handler Exec. Manager Data Resolver Input Generator Job Manager

(11)

A Light-Weight Job Execution Service

TLS/WS Security WS - Container BES/HPCBP API Account Manager JSDL Parser Session Handler Exec. Manager Data Resolver Input Generator Job Manager HDFS S3 Hadoop

(12)

The MapReduce Application

ƒ Map-Reduce implements a framework and prog. model for processing large documents (Sorting, Searching, Indexing) on multiple nodes.

ƒ Automated decomposition (split)

ƒ Mapping to intermediary pairs (map), optionally (combine)

ƒ Merge output (reduce)

ƒ Provides implementation for data parallel + i/o intensive applications

ƒ Migrating a digital object (e.g web page, folder, archive)

ƒ Decompose into atomic pieces (e.g. file, image, movie)

ƒ On each node, process input splits

(13)

Experimental Setup

ƒ Amazon Elastic Compute Cloud (EC2)

ƒ 1 – 150 cluster nodes

ƒ Custom image based on Fedora 8 i386

ƒ Amazon Simple Storage Service (S3)

ƒ max. 1TB I/O per experiment

ƒ All measurements based on a single core per virtual machine

ƒ Apache Hadoop (v.0.18)

ƒ MapReduce Implementation

ƒ Preinstalled command line tools

ƒ Quantitative evaluation of VMs (AWS) based on execution time, number of tasks, number of nodes, physical size

(14)

Experimental Results 1 – Scaling Job Size

0,00 1,00 2,00 3,00 4,00 5,00 6,00 7,00 8,00 9,00 10,00 1 10 100 1000 tasks ti m e [m in ] EC2 0,07 MB EC2 7,5 MB EC2 250 MB SLE 0,07 MB SLE 7,5 MB SLE 250 MB number of nodes = 5 x(1k) = 3,5 x(1k) = 4,4 x(1k) = 3,6 x(1k) = t_seq / t_parallel and tasks = 1000

(15)

Experimental Results 2 – Scaling #nodes

0,00 5,00 10,00 15,00 20,00 25,00 30,00 35,00 40,00 0 50 100 150 nodes ti m e [m in ] EC2 1000 x 0,07 MB SLE 1000 x 0,07 MB X X X X X X n=1, t=36, s1 = 0.72, e=72% n=5, t=4.5, s=3.3, e=66% n=10, t=4.5, s=5.5, e=55% n=50, t=1.68, s=16, e=31% n=100, t=1.03, s=26, e=26% n=1 (local), t=26

(16)

Results and Observations

ƒ AWS + Hadoop + JSS

ƒ Robust and fault tolerant, scalable up to large numbers of nodes

ƒ ~32,5MB/s download / ~13,8MB/s upload (cloud internally)

ƒ Also small clusters of virtual machines reasonable

ƒ S = 4,4 for p = 5, with n=1000 and s=7,5MB

ƒ Ideally, size of cluster grows/shrinks on demand

ƒ Overheads:

ƒ SLE vs. Cloud (p=1, n=1000) 30% due to file system master

ƒ In average, less then 10% due to S3

(17)

Conclusion

ƒ Challenges in digital preservation are diversity and scale

ƒ Focus: scientific data, arts and humanities

ƒ Repositories Preservation systems need to be embedded with large-scale distributed environments

ƒ grids, clouds, eScience environments

ƒ Cloud Computing provides a powerful solution for getting on-demand access to an IT infrastructure.

ƒ Currently integration issues: Policies, Legal Aspects, SLAs

ƒ We have presented current developments in the context of the EU project Planets on building an integrated research environment for the development and evaluation of preservation strategies

References

Related documents

All from Sangrattanaprasert 423/15B (PSU).. Lobule sac obovate to slightly elliptical, strongly inflated, 0.44–0.6 mm long, 0.3–0.4 mm wide, sac surface mamillose, apex

• Manage public spaces, pavements and gullies to minimise weeds and keep Nottingham tidy • Protect from cuts: Teams that clean up graffiti and dog fouling within 48 hours of reporting

The study also raises the efficiency problem of financial support at the local level through the analysis of the structure of financial transfers from the EU budget to the

All isolates recovered in this study appeared as causative agents of early-onset septicemia (within the first 7 days), with Klebsiella species predom- inating.. Some authors did

This paper is similar to Verdelhan in that we attribute the anomaly to risk premia, which vary due to habit formation. In our model, however, we allow for trade and interna-

Carbon trading, as a market-based climate policy that allows polluters to comply with emissions reductions commitments with tradable pollution rights, is presented by its proponents

Another justification for permitting the determination of causation is to up- hold the purpose of appraisal. 109 The court in CIGNA explained that if the apprais- ers were

Disclaimer: The information contained in this report is a summary only - the accuracy of the Maori Land Court record, is itself, not accompanied by a state guarantee and to