• No results found

StratoSphere Above the Clouds

N/A
N/A
Protected

Academic year: 2021

Share "StratoSphere Above the Clouds"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Strato

Sphere

Stratosphere

Parallel Analytics in the Cloud beyond Map/Reduce

14th International Workshop

on High Performance Transaction Systems (HPTS) Poster Sessions, Mon Oct 24 2011

Thomas Bodner T.U. Berlin

(2)

Infrastructure  as  a  Service   Use-­‐Cases  

...  

The Stratosphere Project

*

■  Explore the power of Cloud

computing for complex information management applications

■  Database-inspired approach

■  Analyze, aggregate, and

query

■  Textual and (semi-)

structured data

■  Research and prototype a

web-scale data analytics infrastructure

Scien7fic  Data   Life  Sciences   Linked  Data  

Strato

Sphere

(3)

Architecture of the Stratosphere System

The PACT Programming Model

The Nephele Execution Engine

Demonstration of an Example Job

(4)

Architecture Overview

Hive,  JAQL,  

Pig  

 

 

 

 

Hadoop  

Map/Reduce  

Programming  

Model  

Hadoop  Stack  

(5)

Architecture Overview

Hive,  JAQL,  

Pig  

 

 

 

 

Hadoop  

Map/Reduce  

Programming  

Model  

Hadoop  Stack  

Dryad  

SCOPE,  

DryadLINQ  

Dryad  Stack  

(6)

Architecture Overview

Hive,  JAQL,  

Pig  

 

 

 

 

Hadoop  

Hyracks  

Map/Reduce  

Programming  

Model  

AQL  

Hadoop  Stack  

ASTERIX  Stack  

Dryad  

SCOPE,  

DryadLINQ  

(7)

Architecture Overview

Nephele  

PACT  

Programmin

g  Model  

Hive,  JAQL,  

Pig  

 

 

 

 

Hadoop  

Hyracks  

Map/Reduce  

Programming  

Model  

AQL  

Hive,  JAQL,  

Pig,  ?  

Hadoop  Stack  

ASTERIX  Stack   Stratosphere  

Stack  

Dryad  

SCOPE,  

DryadLINQ  

(8)

Architecture Overview

Nephele  

PACT  

Programmin

g  Model  

Hive,  JAQL,  

Pig  

 

 

 

 

Hadoop  

Hyracks  

Map/Reduce  

Programming  

Model  

AQL  

Hive,  JAQL,  

Pig,  ?  

Hadoop  Stack  

ASTERIX  Stack   Stratosphere  

Stack  

Dryad  

SCOPE,  

DryadLINQ  

(9)

Schema  Free  

Many  seman7cs  hidden  inside  the  

user  code  (tricks  required  to  push  

opera7ons  into  map/reduce)  

Single  default  way  of  paralleliza7on  

Data-Centric Parallel Programming

σ

π

σ

γ

Map / Reduce

Relational Databases

Schema  bound  (rela7onal  model)  

Well  defined  proper7es  and  

requirements  for  paralleliza7on  

Flexible  and  op7mizable  

GOAL: Advance the M/R Programming Model

Map

Reduce

Map

Reduce

Map

Reduce

(10)

Stratosphere in a Nutshell

■  PACT Programming Model

□  Parallelization Contract (PACT)

□  Declarative definition of data parallelism

□  Centered around second-order functions

□  Generalization of map/reduce

■  Nephele

□  Dryad-style execution engine

□  Evaluates dataflow graphs in parallel

□  Data is read from distributed filesystem

□  Flexible engine for complex jobs

■  Stratosphere = Nephele + PACT

□  Compiles PACT programs to Nephele dataflow graphs

□  Combines parallelization abstraction and flexible execution

□  Choice of execution strategies gives optimization potential

Nephele  

PACT  

Compiler  

(11)

Intuition for Parallelization Contracts

■  Map and reduce are second-order functions

□  Call first-order functions (user code)

□  Provide first-order functions with subsets of the input data

■  Define dependencies between the

records that must be obeyed when splitting them into subsets

□  Required partition properties

■  Map

□  All records are independently

processable

■  Reduce

□  Records with identical key must

be processed together

Input  set  

Independent   subsets   Key   Value  

(12)

Contracts beyond Map and Reduce

■  Cross

□  Two inputs

□  Each combination of records from the two inputs

is built and is independently processable

■  Match

□  Two inputs, each combination of records with

equal key from the two inputs is built

□  Each pair is independently processable

■  CoGroup

□  Multiple inputs

□  Pairs with identical key are grouped for each input

□  Groups of all inputs with identical key

(13)

Parallelization Contracts (PACTs)

■  Second-order function that defines properties on the input and

output data of its associated first-order function

■  Input Contract

□  Specifies dependencies between records

(a.k.a. "What must be processed together?")

□  Generalization of map/reduce

□  Logically: Abstracts a (set of) communication pattern(s)

-  For "reduce": repartition-by-key

-  For "match" : broadcast-one or repartition-by-key

■  Output Contract

□  Generic properties preserved or produced by the user code

-  key property, sort order, partitioning, etc.

□  Relevant to parallelization of succeeding functions

Input  

Contract  

First-­‐order  

func7on  

(user  code)  

Output  

Contract  

Data  

Data  

(14)

For certain PACTs, several distribution patterns exist that

fulfill the contract

□  Choice of best one is up to the system

Created properties (like a partitioning) may be reused for

later operators

□  Need a way to find out whether they still hold after the user code

□  Output contracts are a simple way to specify that

□  Example output contracts: Same-Key, Super-Key, Unique-Key

Using these properties, optimization across multiple

PACTs is possible

□  Simple System-R style optimizer approach possible

(15)

From PACT Programs to Data Flows

function match(Key k, Tuple val1, Tuple val2)

-> (Key, Tuple) {

Tuple res = val1.concat(val2); res.project(...); Key k = res.getColumn(1); Return (k, res); } invoke(): while (!input2.eof) KVPair p = input2.next(); hash-table.put(p.key, p.value); while (!input1.eof) KVPair p = input1.next(); KVPait t = hash-table.get(p.key); if (t != null) KVPair[] result =

UF.match(p.key, p.value, t.value); output.write(result); end UF1   (map)   UF2   (map)   UF3  

(match)   (reduce)  UF4  

V1   V2   V3   V4   In-­‐Memory   Channel   Network   Channel  

PACT  Program  

Nephele  DAG  

Spanned  Data  Flow  

V1   V2   V3   V4   V3   V1   V2   V3   V4   V3  

User  

Func7on  

PACT  code  

(grouping)  

Nephele  code  

(communica7on)  

compile   span  
(16)

www.stratosphere.eu

provides publications,

open-source release and examples

„Nephele: Efficient Parallel Data Processing in the Cloud“,

D. Warneke et al., MTAGS 2009

„Nephele/PACTs: a programming model and execution

framework for web-scale analytical processing“, D. Battré

et al., SoCC 2010

„Massively Parallel Data Analysis with PACTs on Nephele“,

A. Alexandrov et al. PVLDB 2010

„MapReduce and PACT - Comparing Data Parallel

Programming Models“, A. Alexandrov et al., BTW 2011

References

Related documents

(Thokar with many rotations holding the breath, whose action is directed not only to the fourth Chakra but also to the third, second and first. This procedure is commonly

Challenges of Rural Water Supply and Sanitation a Decade Later (Cusco+10) was aimed at identifying the new trends that will mark the sector in the future: i) achieving

In the classic plenter forest, the point of no return for recruitment, as far as shelterwood is concerned and therefore standing volume conservation, can be illustrated by the graph

Mobility (A = Antwerp-Central Station, B = Campus Antwerpen Sint Andries) By train → You arrive at Antwerp-Central Station (address: Koningin Astridplein 27). (option 1) By foot →

We show in this section that nurse-midwives ’ experience of control within their working environment, and the extent to which they felt able to do a good job, was linked to three

The estimated count of RGCs obtained by combining the RNFL thickness of Spectralis OCT with SAP sensitivity showed excellent diagnostic accuracy for discriminating between the

Ultra-thin high-k oxides are widely used in advanced CMOS technology in order to continue scalability and to increase performance. Nevertheless, as the devices reach the

Brazil, Russia, India, and China or ‘BRIC’, as the main growing markets, which have  seen  the  largest  growth  levels  on  a  world  scale.  It  was