• No results found

The Stratosphere Big Data Analytics Platform

N/A
N/A
Protected

Academic year: 2021

Share "The Stratosphere Big Data Analytics Platform"

Copied!
55
0
0

Loading.... (view fulltext now)

Full text

(1)

The Stratosphere Big Data Analytics Platform

Amir H. Payberah

Swedish Institute of Computer Science

[email protected]

June 4, 2014

Amir H. Payberah (SICS) Stratosphere June 4, 2014 1 / 44

(2)

Big Data

small data big data

(3)

I Big Data refers to datasets and flows large

enough that has outpaced our capability to

store, process, analyze, and understand.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 3 / 44

(4)

Where Does

Big Data Come From?

(5)

Big Data Market Driving Factors

The number of web pages indexed by Google, which were around

one million in 1998, have exceeded one trillion in 2008, and its

expansion is accelerated by appearance of the social networks.∗

∗“Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]

Amir H. Payberah (SICS) Stratosphere June 4, 2014 5 / 44

(6)

Big Data Market Driving Factors

The amount of mobile data traffic is expected to grow to 10.8

Exabyte per month by 2016.∗

∗“Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]

(7)

Big Data Market Driving Factors

More than 65 billion devices were connected to the Internet by

2010, and this number will go up to 230 billion by 2020.∗

∗“The Internet of Things Is Coming” [John Mahoney et al., 2013]

Amir H. Payberah (SICS) Stratosphere June 4, 2014 7 / 44

(8)

Big Data Market Driving Factors

Many companies are moving towards using Cloud services to

access Big Data analytical tools.

(9)

Big Data Market Driving Factors

Open source communities

Amir H. Payberah (SICS) Stratosphere June 4, 2014 9 / 44

(10)
(11)

Big Data Analytics Stack

Amir H. Payberah (SICS) Stratosphere June 4, 2014 11 / 44

(12)

Hadoop Big Data Analytics Stack

(13)

Spark Big Data Analytics Stack

Amir H. Payberah (SICS) Stratosphere June 4, 2014 13 / 44

(14)

Stratosphere Big Data Analytics Stack

* Under development

** Stratosphere community is thinking to integrate with Mahout

(15)

Amir H. Payberah (SICS) Stratosphere June 4, 2014 15 / 44

(16)

What is Stratosphere?

I An efficient distributed general-purpose data analysis platform.

I Built on top of HDFS and YARN.

I Focusing on ease of programming.

(17)

Project Status

I Research project started in 2009 by TU Berlin, HU Berlin, and HPI

I An Apache Incubator project

I 25 contributors

I v0.5 is released

Amir H. Payberah (SICS) Stratosphere June 4, 2014 17 / 44

(18)

Motivation

I MapReduce programming model has not been designed for complex

operations, e.g., data mining.

I Very expensive, i.e., always goes to disk and HDFS.

(19)

Solution

I Extends MapReduce with more operators.

I Support for advanced data flow graphs.

I In-memory and out-of-core processing.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 19 / 44

(20)

Stratosphere

Programming Model

(21)

Stratosphere Programming Model

I A program is expressed as an arbitrary data flow consisting of

transformations, sources and sinks.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 21 / 44

(22)

Transformations

I Higher-order functions that execute user-defined functions in parallel

on the input data.

(23)

Transformations

I Higher-order functions that execute user-defined functions in parallel

on the input data.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 22 / 44

(24)

Transformations: Map

I All pairs are independently processed.

val input: DataSet[(Int, String)] = ...

val mapped = input.flatMap { case (value, words) => words.split(" ") }

(25)

Transformations: Map

I All pairs are independently processed.

val input: DataSet[(Int, String)] = ...

val mapped = input.flatMap { case (value, words) => words.split(" ") }

Amir H. Payberah (SICS) Stratosphere June 4, 2014 23 / 44

(26)

Transformations: Reduce

I Pairs with identical key are grouped.

I Groups are independently processed.

val input: DataSet[(String, Int)] = ...

val reduced = input.groupBy(_._1)

.reduceGroup(words => words.minBy(_._2))

(27)

Transformations: Reduce

I Pairs with identical key are grouped.

I Groups are independently processed.

val input: DataSet[(String, Int)] = ...

val reduced = input.groupBy(_._1)

.reduceGroup(words => words.minBy(_._2))

Amir H. Payberah (SICS) Stratosphere June 4, 2014 24 / 44

(28)

Transformations: Join

I Performs an equi-join on the key.

I Join candidates are independently processed.

val counts: DataSet[(String, Int)] = ...

val names: DataSet[(Int, String)] = ...

val join = counts.join(names)

.where(_._2)

.isEqualTo(_._1)

.map { case (l, r) => l._1 + "and" + r._2 }

(29)

Transformations: Join

I Performs an equi-join on the key.

I Join candidates are independently processed.

val counts: DataSet[(String, Int)] = ...

val names: DataSet[(Int, String)] = ...

val join = counts.join(names)

.where(_._2)

.isEqualTo(_._1)

.map { case (l, r) => l._1 + "and" + r._2 }

Amir H. Payberah (SICS) Stratosphere June 4, 2014 25 / 44

(30)

Transformations: and More ...

Groups each input on key

Builds a Cartesian Product

Merges two or more input data sets, keeping duplicates

(31)

Transformations: and More ...

Groups each input on key Builds a Cartesian Product

Merges two or more input data sets, keeping duplicates

Amir H. Payberah (SICS) Stratosphere June 4, 2014 26 / 44

(32)

Transformations: and More ...

Groups each input on key Builds a Cartesian Product

Merges two or more input data sets, keeping duplicates

(33)

Example: Word Count

val input = TextFile(textInput)

val words = input.flatMap { line => line.split(" ")

map(word => (word, 1)) }

val counts = words.groupBy(_._1)

.reduce((w1, w2) => (w1._1, w1._2 + w2._2))

val output = counts.write(wordsOutput, CsvOutputFormat())

Amir H. Payberah (SICS) Stratosphere June 4, 2014 27 / 44

(34)

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(35)

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(1) Partitioned hash-join

- Hash/Sort tables into N buckets on join key

- Send buckets to parallel machines

- Join every pair of buckets

Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

(36)

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(2) Broadcast hash-join

- Broadcast the smaller table

- Avoids the network overhead

(37)

Join Optimization

val large = env.readCsv(...)

val medium = env.readCsv(...)

val small = env.readCsv(...)

joined1 = large.join(medium)

.where(_._3)

.isEqualTo(_._1)

.map { (left, right) => ... }

joined2 = small.join(joined1)

.where(_._1)

.isEqualTo(_._2)

.map { (left, right) => ...}

result = joined2.groupBy(_._3)

.reduceGroup(_.maxBy(_._2))

(3) Grouping/Aggregation reuses the partitioning from step (1) - no shuffle

Amir H. Payberah (SICS) Stratosphere June 4, 2014 28 / 44

(38)

What about Iteration?

(39)

Iterative Algorithms

I Algorithms that need iterations:

Clustering, e.g., K-Means

Gradient descent, e.g., Logistic Regression

Graph algorithms, e.g., PageRank

...

Amir H. Payberah (SICS) Stratosphere June 4, 2014 30 / 44

(40)

Iteration

I Loop over the working data multiple times.

I Iterations with hadoop

Slow: using HDFS

Everything has to be read over and over again

(41)

Iteration

I Loop over the working data multiple times.

I Iterations with hadoop

Slow: using HDFS

Everything has to be read over and over again

Amir H. Payberah (SICS) Stratosphere June 4, 2014 31 / 44

(42)

Types of Iterations

I Two types of iteration at stratosphere:

Bulk iteration

Delta iteration

I Both operators repeatedly invoke the step function on the current

iteration state until a certain termination condition is reached.

(43)

Bulk Iteration

I In each iteration, the step function consumes the entire input, and

computes the next version of the partial solution.

I A new version of the entire model in each iteration.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 33 / 44

(44)

Bulk Iteration - Example

// 1st 2nd 10th

map(1) -> 2 map(2) -> 3 ... map(10) -> 11

map(2) -> 3 map(3) -> 4 ... map(11) -> 12

map(3) -> 4 map(4) -> 5 ... map(12) -> 13

map(4) -> 5 map(5) -> 6 ... map(13) -> 14

map(5) -> 6 map(6) -> 7 ... map(14) -> 15

val input: DataSet[Int] = ...

def step(partial: DataSet[Int]) = partial.map(a => a + 1)

val numIter = 10;

val iter = input.iterate(numIter, step)

(45)

Bulk Iteration - Example

// 1st 2nd 10th

map(1) -> 2 map(2) -> 3 ... map(10) -> 11

map(2) -> 3 map(3) -> 4 ... map(11) -> 12

map(3) -> 4 map(4) -> 5 ... map(12) -> 13

map(4) -> 5 map(5) -> 6 ... map(13) -> 14

map(5) -> 6 map(6) -> 7 ... map(14) -> 15

val input: DataSet[Int] = ...

def step(partial: DataSet[Int]) = partial.map(a => a + 1)

val numIter = 10;

val iter = input.iterate(numIter, step)

Amir H. Payberah (SICS) Stratosphere June 4, 2014 34 / 44

(46)

Delta Iteration

I Only parts of the model change in each iteration.

(47)

Delta Iteration - Example

Amir H. Payberah (SICS) Stratosphere June 4, 2014 36 / 44

(48)

Delta Iteration vs. Bulk Iteration

I Computations performed in each iteration for connected communi-

ties of a social graph.

(49)

Stratosphere

Execution Engine

Amir H. Payberah (SICS) Stratosphere June 4, 2014 38 / 44

(50)

Stratosphere Architecture

I Master-worker model

I Job Manager: handles job submission

and scheduling.

I Task Manager: executes tasks received

from JM.

I All operators start in-memory and

gradually go out-of-core.

(51)

Job Graph and Execution Graph

I Jobs are expressed as data flows.

I Job graphs are transformed into the execution graph.

I Execution graphs consist information to schedule and execute a job.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 40 / 44

(52)

Channels

I Channels transfer serialized records in buffers.

I Pipelined

Online transfer to receiver.

I Materialized

Sender writes result to disk, and

afterwards it transferred to receiver.

Used in recovery.

(53)

Fault Tolerance

I Task failure compensated by backup task deployment.

I Track the execution graph back to the latest available result, and

recompute the failed tasks.

Amir H. Payberah (SICS) Stratosphere June 4, 2014 42 / 44

(54)
(55)

http://stratosphere.eu

Amir H. Payberah (SICS) Stratosphere June 4, 2014 44 / 44

References

Related documents

The main wall of the living room has been designated as a "Model Wall" of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

Abstract In this paper the well-known minimax theorems of Wald, Ville and Von Neumann are generalized under weaker topological conditions on the payoff function ƒ and/or extended

While fully integrating human rights considerations into operational mine water interactions is an ambitious target, tools such as the SWAP provide a vital step to establishing

Measure-valued processes, Continuous-state branching processes, Fleming-Viot pro- cesses, Immigration, Beta-Coalescent, Generators, Random time change.. Mathematics

Stratosphere: A platform for efficient data processing at scale | V.

= = 5,00 m S F 13 21 14 22 0 5 5 13-14 21-22 3 3 11 21 12 22 0 5 5 11-12 21-22 3 3 Actuating deflection/ force Contacts/ Switch travel.. Push button reset

The palladium catalysts; Pd/C- CeO 2 (1:1) and Pd/C-CeO 2 (1:0.5) showed enhanced oxidation kinetics towards ethylene glycol and better and enhanced stability in alkaline

We also find that these differential judicial opportunities cannot adequately account for the tactical choices made by activists with respect to the staging of covert or overt