The Berkeley Data Analytics Stack: Present and Future

(1)

The

Berkeley Data Analytics

Stack:

Present and Future

UC BERKELEY

Michael Franklin

27 March 2014

Technion Big Day on Big Data

(2)

BDAS in the Big Data Context

2

(3)

Sources Driving Big Data

It’s All Happening On-line

Every:

Click

Ad impression Billing event

Fast Forward, pause,…

Friend Request Transaction

Network message Fault

…

User Generated (Web &

Mobile)

…

..

Internet of Things / M2M Scientific Computing

(4)

Time

Answer

Quality

Money

Beyond Volume, Velocity and Variety

4

Step 2:

Tradeoffs

Massive

Diverse

and

Growing

Data

Massive

Diverse

and

Growing

Data

Step 1:

Run Faster

(5)

AMPLab: Integrating

Diverse Resources

Algorithms

• Machine Learning, Statistical Methods

• Prediction, Business Intelligence

Machines

• Clusters and Clouds

• Warehouse Scale Computing

People

• Crowdsourcing, Human Computation

• Data Scientists, Analysts

(6)

AMPLab Data

UC BERKELEY

Launched January 2011, 6 year duration

60+ Students, Postdocs, Faculty and Staff

In-house Applications

Cancer Genomics, Mobile Sensing, IoT (smartphones)

Industry Sponsors, Foundations + White House Big Data Program

Franklin DB

Jorda n ML

Stoic a Sys

Patterso n Sys

Shenk er Net

Rech t ML

Katz Sys

Josep h Sec

Goldber g HCI

(7)

Carat: Big Data at Work

7

715,000+

downloads

(8)

Open Source Engagement

SF Spark MeetUP: 1700+ members

Also: Boston, Hyderbad, …

Bootcamps: Berkeley, Strata, On-line

MeetUp talk on MLBase at Twitter HQ (Aug 6, 2013)

(9)

Big Data Systems Today

MapReduce

Pregel

Dremel

GraphLab

Storm

Giraph

Drill

Tez

Impala

S4 …

Specialized systems

(iterative, interactive and

streaming apps)

General batch

processing

(10)

BDAS Philosophy

Don’t specialize MapReduce – Generalize it!

Two additions to Hadoop MR can enable all the

models on previous slide!

1. General Task DAGs

2. Data Sharing

For Users:

Fewer Systems

Less Copying ^Spark

Stream ing Gra phX

…

Shark M Lbase

(11)

For developers: Code Size

0 20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

(12)

For Developers: Code Size

0 20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

Streaming

(13)

For Developers: Code Size

0 20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

Streaming

Shark*

* also calls into Hive

(14)

For Developers Code Size

0 20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

Streaming

GraphX

Shark*

* also calls into Hive

(15)

Berkeley Data Analytics

Stack

AMP

Alpha or

Soon

AMP Released BSD/Apach

e 3^rd Party

Open Source

Apache Mesos YARN Resource Manager ^Resource

HDFS / Hadoop Storage

Tachyon

Storage

Apache Spark

Spark

Streaming ML-lib Processing

and Data

Management

Applications: Traffic, Carat, Genomics, 3

^rd

Party

Tools: IPython, Visualization, Data Cleaning, …

Shark

(SQL) BlinkDB GraphX MLBase ^Analytics

Frameworks

Spark-R

PySpark

(16)

Fast, MapReduce-like engine

»In-memory storage abstraction

for iterative/interactive queries

»General execution graphs

»Up to 100x faster than Hadoop MR (2-10x even for on-

disk)

Compatible with Hadoop’s storage APIs

»Can access HDFS, HBase, S3, SequenceFiles, etc

Great example of ML/Systems/DB collaboration

(17)

Example: Logistic Regression

Goal: find best line separating two sets of points

target

random initial line

(18)

ML and Queries in Hadoop

. . .

iter. 1

Input

HDFS

read

HDFS

write

iter. 2

HDFS

read

HDFS

write

query 2 result 2

query 3 result 3

. . .

Input

query 1 result 1

HDFS

read

(19)

In-Memory Data Sharing

iter. 1 iter. 2 . . .

Input

query 1

query 2

query 3

. . .

Distributed

memory

Input

one-time

processing

(20)

0

10

20

30

40

50

60 1 10 20 30

R unn in g Time (min )

Number of Iterations

Hadoop

Spark

Logistic Regression

Performance

110 s / iteration

first iteration 80 s

further iterations 1 s

29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)

(21)

Challenge

A distributed memory abstraction that is

both:

efficient

and

fault-tolerant

(22)

Resilient Distributed Datasets

(RDDs)

API: coarse-grained transformations (map,

group-by, join, sort, filter, sample,…) on

immutable collections

Efficient fault recovery using lineage

» Log one operation to apply to many elements

» Recompute lost partitions of RDD on failure

» No cost if nothing fails

Rich enough to capture many models:

» Data flow models: MapReduce, Dryad, SQL, …

» Specialized models: Pregel, Hama, …

M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory

cluster computing, NSDI 2012. Best Paper Award

(23)

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

cachedMsgs = messages.cache()

Block 1

Block 2

Block 3

Worker

Driver

cachedMsgs.filter(_.contains(“foo”)).count

cachedMsgs.filter(_.contains(“bar”)).count

tasks

results

Cache 1

Cache 2

Cache 3

Base RDD

Transformed RDD

Action

(24)

Fault Tolerance with RDDs

RDDs track the series of transformations

used to build them (their lineage)

Enables per-node recomputation of lost data

messages = textFile(...).filter(_.contains(“error”))

.map(_.split(‘\t’)(2))

HadoopRDD

path = hdfs://…

FilteredRDD

func = _.contains(...)

MappedRDD

func = _.split(…)

(25)

Spark Status

Current release (0.9) includes Java and Python APIs

» Apache Top Level Project (as of Feb 2014)

» Includes: streaming, MLlib, YARN integration, EC2, GraphX

» 100+ Developers, 20 Organizations contributed code

Supported by: Cloudera, Wandisco & others TBA

Spark Apps to be certified by DataBricks (AMPLab spin-

out)

Sample Use cases:

» In-memory analytics on Hive data (Conviva)

» Interactive queries on data streams (Quantifind)

» Business intelligence (Yahoo!)

» DNA sequence alignment (SNAP)

(26)

Berkeley Data Analytics

Stack

AMP

Alpha or

Soon

AMP Released BSD/Apach

e 3^rd Party

Open Source

Apache Mesos YARN Resource Manager ^Resource

HDFS / Hadoop Storage

Tachyon

Storage

Apache Spark

Spark

Streaming ML-lib Processing

and Data

Management

Applications: Traffic, Carat, Genomics, 3

^rd

Party

IPython, Visualization, Data Cleaning, …

Shark

(SQL) BlinkDB GraphX MLBase ^Analytics

Frameworks

Spark-R

PySpark

(27)

Shark

27

(28)

Shark = Spark + Hive

Uses Spark’s in-memory RDD caching and

language

»Result reuse and low latency

»Scalable, fault-tolerant, fast

Query Compatible with Hive

» Run HiveQL queries (w/ UDFs, UDAs…) without modifications

» Convert logical query plan generated from Hive into Spark execution

graph

Data Compatible with Hive

» Use existing HDFS data and Hive metadata, without

modifications

C. Engle, et al, Shark: Fast Data Analysis Using Coarse-grained Distributed Memory, SIGMOD 2012 (system demonstration). Best Demo Award

R. Xin et al., Shark: SQL and Rich Analytics at Scale, SIGMOD 2013

.

(29)

Shark Optimizations

• Fast task start-up

• Optimized column-oriented storage

• Dynamic (mid-query) join algorithm

selection based on statistical properties of

data

• Runtime selection of # of reducers

• Partition pruning using range statistics

• Controllable table partitioning across

nodes

(30)

Running Time: Join + Order

By

https://amplab.cs.berkeley.edu/benchmark/

30

SELECT srcIP, AVG(pageRank), SUM(adRevenue) as totalRev

FROM Rankings AS R, UserVisits AS UV

WHERE R.pageURL = UV.destURL

AND UV.visitDate BETWEEN Date(`1980-01-01') AND Date(`X')

GROUP BY UV.sourceIP

ORDER BY totalRev DESC LIMIT 1

485K

UserVisits

Time (sec)

533M

UserVisits

Time (sec)

(31)

A Unified System for SQL & ML

def logRegress(points: RDD[Point]): Vector { var w = Vector(D, _ => 2 * rand.nextDouble - 1) for (i <- 1 to ITERATIONS) { val gradient = points.map { p =>

val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid") val features = users.mapRows { row =>

new Vector(extractFeature1(row.getInt("age")), extractFeature2(row.getStr("country")), ...)}

val trainedVector = logRegress(features.cache())

Deep integration of

Shark and Spark

Both share the same

set of workers and

caches

Can move

seamlessly between

SQL and Machine

Learning worlds

(32)

GraphX – Adding Graphs to the Mix

GraphX Unified

Representation

Graph View

Table View

Tables and Graphs are composable views of the

same physical data

Each view has its own operators that exploit the

semantics of the view to achieve efficient

execution

R. Xin, J. Gonzalez, M. Franklin, I. Stoica, “GraphX: In-situ Graph Computation Made Easy

“ GRADES Workshop at SIGMOD, June 2013.

(33)

Speed/Accuracy Trade-off

Execution Time

Err or

30 mins

Time to

Execute on

Entire Dataset

Interactive

Queries

5 sec

(34)

Agarwal et al., BlinkDB: Queries with Bounded Errors and Bounded Response Times

on Very Large Data. ACM EuroSys 2013, Best Paper Award

Fast, approximate answers with error bars by executing queries

on small, pre-collected samples of data

Compatible with Apache Hive (storage, serdes, UDFs, types,

metadata) and HiveQL (with minor modifications)

(35)

Sampling Vs. No Sampling

0

100

200

300

400

500

600

700

800

900 1000

1 10 ^-1 10 ^-2 10 ^-3 10 ^-4 10 ^-5

Fraction of full data

Qu er y R es po ns e Time (Se con ds )

103 1020

18 13 10 8

10x as response time

is dominated by I/O

(36)

Sampling Vs. No Sampling

0

100

200

300

400

500

600

700

800

900 1000

1 10 ^-1 10 ^-2 10 ^-3 10 ^-4 10 ^-5

Fraction of full data

Qu er y R es po ns e Time (Se con ds )

103 1020

18 13 10 8

(0.02%)

(0.07%) (1.1%) (3.4%) (11%)

Error Bars

(37)

Execution Time

Err or

30 mins

Time to

Execute on

Entire Dataset

Interactive

Queries

5 sec

Speed/Accuracy Trade-off

Pre-Existing

Noise

(38)

People Resources

38

Hybrid Human-Machine

Computation

• Data Cleaning

• Active Learning

• Handling the last 5%

Disk 2 Disk 1

Parser

Optimizer

Statistics

CrowdSQL Results

Executor

Files Access Methods

UI Template Manager Form Editor UI

Creation

HIT Manager

MetaData

Turker Relationship Manager

Franklin, Kossmann et al., CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011

Wang et al., CrowdER: Crowdsourcing Entity Resolution, VLDB 2012

Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013 Best Paper Award