• No results found

The Berkeley Data Analytics Stack: Present and Future

N/A
N/A
Protected

Academic year: 2021

Share "The Berkeley Data Analytics Stack: Present and Future"

Copied!
43
0
0

Loading.... (view fulltext now)

Full text

(1)

The

Berkeley Data Analytics

Stack:

Present and Future

UC BERKELEY

Michael Franklin

27 March 2014

Technion Big Day on Big Data

(2)

BDAS in the Big Data Context

2

(3)

Sources Driving Big Data

It’s All Happening On-line

Every:

Click

Ad impression Billing event

Fast Forward, pause,…

Friend Request Transaction

Network message Fault

User Generated (Web &

Mobile)

..

Internet of Things / M2M Scientific Computing

(4)

Time

Answer

Quality

Money

Beyond Volume, Velocity and Variety

4

Step 2:

Tradeoffs

Massive

Diverse

and

Growing

Data

Massive

Diverse

and

Growing

Data

Step 1:

Run Faster

(5)

AMPLab: Integrating

Diverse Resources

Algorithms

• Machine Learning, Statistical Methods

• Prediction, Business Intelligence

Machines

• Clusters and Clouds

• Warehouse Scale Computing

People

• Crowdsourcing, Human Computation

• Data Scientists, Analysts

(6)

AMPLab Data

UC BERKELEY

Launched January 2011, 6 year duration

60+ Students, Postdocs, Faculty and Staff

In-house Applications

Cancer Genomics, Mobile Sensing, IoT (smartphones)

Industry Sponsors, Foundations + White House Big Data Program

Franklin DB

Jorda n ML

Stoic a Sys

Patterso n Sys

Shenk er Net

Rech t ML

Katz Sys

Josep h Sec

Goldber g HCI

(7)

Carat: Big Data at Work

7

715,000+

downloads

(8)

Open Source Engagement

SF Spark MeetUP: 1700+ members

Also: Boston, Hyderbad, …

Bootcamps: Berkeley, Strata, On-line

MeetUp talk on MLBase at Twitter HQ (Aug 6, 2013)

(9)

Big Data Systems Today

MapReduce

Pregel

Dremel

GraphLab

Storm

Giraph

Drill

Tez

Impala

S4 …

Specialized systems

(iterative, interactive and

streaming apps)

General batch

processing

(10)

BDAS Philosophy

Don’t specialize MapReduce – Generalize it!

Two additions to Hadoop MR can enable all the

models on previous slide!

1. General Task DAGs

2. Data Sharing

For Users:

Fewer Systems

Less Copying Spark

Stream ing Gra phX

Shark M Lbase

(11)

For developers: Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

(12)

For Developers: Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

Streaming

(13)

For Developers: Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

Streaming

Shark*

* also calls into Hive

(14)

For Developers Code Size

0

20000

40000

60000

80000

100000

120000

140000

Hadoop

MapReduce

Storm

(Streaming)

Impala

(SQL)

Giraph

(Graph)

Spark

non-test, non-example source lines

Streaming

GraphX

Shark*

* also calls into Hive

(15)

Berkeley Data Analytics

Stack

AMP

Alpha or

Soon

AMP Released BSD/Apach

e 3rd Party

Open Source

Apache Mesos YARN Resource Manager Resource

HDFS / Hadoop Storage

Tachyon

Storage

Apache Spark

Spark

Streaming ML-lib Processing

and Data

Management

Applications: Traffic, Carat, Genomics, 3

rd

Party

Tools: IPython, Visualization, Data Cleaning, …

Shark

(SQL) BlinkDB GraphX MLBase Analytics

Frameworks

Spark-R

PySpark

(16)

Fast, MapReduce-like engine

»In-memory storage abstraction

for iterative/interactive queries

»General execution graphs

»Up to 100x faster than Hadoop MR (2-10x even for on-

disk)

Compatible with Hadoop’s storage APIs

»Can access HDFS, HBase, S3, SequenceFiles, etc

Great example of ML/Systems/DB collaboration

(17)

Example: Logistic Regression

Goal: find best line separating two sets of points

target

random initial line

(18)

ML and Queries in Hadoop

. . .

iter. 1

Input

HDFS

read

HDFS

write

iter. 2

HDFS

read

HDFS

write

query 2 result 2

query 3 result 3

. . .

Input

query 1 result 1

HDFS

read

(19)

In-Memory Data Sharing

iter. 1 iter. 2 . . .

Input

query 1

query 2

query 3

. . .

Distributed

memory

Input

one-time

processing

(20)

0

10

20

30

40

50

60

1 10 20 30

R unn in g Time (min )

Number of Iterations

Hadoop

Spark

Logistic Regression

Performance

110 s / iteration

first iteration 80 s

further iterations 1 s

29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)

(21)

Challenge

A distributed memory abstraction that is

both:

efficient

and

fault-tolerant

(22)

Resilient Distributed Datasets

(RDDs)

API: coarse-grained transformations (map,

group-by, join, sort, filter, sample,…) on

immutable collections

Efficient fault recovery using lineage

» Log one operation to apply to many elements

» Recompute lost partitions of RDD on failure

» No cost if nothing fails

Rich enough to capture many models:

» Data flow models: MapReduce, Dryad, SQL, …

» Specialized models: Pregel, Hama, …

M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory

cluster computing, NSDI 2012. Best Paper Award

(23)

Example: Log Mining

Load error messages from a log into memory, then

interactively search for various patterns

lines = spark.textFile(“hdfs://...”)

errors = lines.filter(_.startsWith(“ERROR”))

messages = errors.map(_.split(‘\t’)(2))

cachedMsgs = messages.cache()

Block 1

Block 2

Block 3

Worker

Worker

Worker

Driver

cachedMsgs.filter(_.contains(“foo”)).count

cachedMsgs.filter(_.contains(“bar”)).count

tasks

results

Cache 1

Cache 2

Cache 3

Base RDD

Transformed RDD

Action

(24)

Fault Tolerance with RDDs

RDDs track the series of transformations

used to build them (their lineage)

Enables per-node recomputation of lost data

messages = textFile(...).filter(_.contains(“error”))

.map(_.split(‘\t’)(2))

HadoopRDD

path = hdfs://…

FilteredRDD

func = _.contains(...)

MappedRDD

func = _.split(…)

(25)

Spark Status

Current release (0.9) includes Java and Python APIs

» Apache Top Level Project (as of Feb 2014)

» Includes: streaming, MLlib, YARN integration, EC2, GraphX

» 100+ Developers, 20 Organizations contributed code

Supported by: Cloudera, Wandisco & others TBA

Spark Apps to be certified by DataBricks (AMPLab spin-

out)

Sample Use cases:

» In-memory analytics on Hive data (Conviva)

» Interactive queries on data streams (Quantifind)

» Business intelligence (Yahoo!)

» DNA sequence alignment (SNAP)

(26)

Berkeley Data Analytics

Stack

AMP

Alpha or

Soon

AMP Released BSD/Apach

e 3rd Party

Open Source

Apache Mesos YARN Resource Manager Resource

HDFS / Hadoop Storage

Tachyon

Storage

Apache Spark

Spark

Streaming ML-lib Processing

and Data

Management

Applications: Traffic, Carat, Genomics, 3

rd

Party

IPython, Visualization, Data Cleaning, …

Shark

(SQL) BlinkDB GraphX MLBase Analytics

Frameworks

Spark-R

PySpark

(27)

Shark

27

(28)

Shark = Spark + Hive

Uses Spark’s in-memory RDD caching and

language

»Result reuse and low latency

»Scalable, fault-tolerant, fast

Query Compatible with Hive

» Run HiveQL queries (w/ UDFs, UDAs…) without modifications

» Convert logical query plan generated from Hive into Spark execution

graph

Data Compatible with Hive

» Use existing HDFS data and Hive metadata, without

modifications

C. Engle, et al, Shark: Fast Data Analysis Using Coarse-grained Distributed Memory, SIGMOD 2012 (system demonstration). Best Demo Award

R. Xin et al., Shark: SQL and Rich Analytics at Scale, SIGMOD 2013

.

(29)

Shark Optimizations

• Fast task start-up

• Optimized column-oriented storage

• Dynamic (mid-query) join algorithm

selection based on statistical properties of

data

• Runtime selection of # of reducers

• Partition pruning using range statistics

• Controllable table partitioning across

nodes

(30)

Running Time: Join + Order

By

https://amplab.cs.berkeley.edu/benchmark/

30

SELECT srcIP, AVG(pageRank), SUM(adRevenue) as totalRev

FROM Rankings AS R, UserVisits AS UV

WHERE R.pageURL = UV.destURL

AND UV.visitDate BETWEEN Date(`1980-01-01') AND Date(`X')

GROUP BY UV.sourceIP

ORDER BY totalRev DESC LIMIT 1

485K

UserVisits

Time (sec)

533M

UserVisits

Time (sec)

(31)

A Unified System for SQL & ML

def logRegress(points: RDD[Point]): Vector { var w = Vector(D, _ => 2 * rand.nextDouble - 1) for (i <- 1 to ITERATIONS) { val gradient = points.map { p =>

val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w } val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid") val features = users.mapRows { row =>

new Vector(extractFeature1(row.getInt("age")), extractFeature2(row.getStr("country")), ...)}

val trainedVector = logRegress(features.cache())

Deep integration of

Shark and Spark

Both share the same

set of workers and

caches

Can move

seamlessly between

SQL and Machine

Learning worlds

(32)

GraphX – Adding Graphs to the Mix

GraphX Unified

Representation

Graph View

Table View

Tables and Graphs are composable views of the

same physical data

Each view has its own operators that exploit the

semantics of the view to achieve efficient

execution

R. Xin, J. Gonzalez, M. Franklin, I. Stoica, “GraphX: In-situ Graph Computation Made Easy

“ GRADES Workshop at SIGMOD, June 2013.

(33)

Speed/Accuracy Trade-off

Execution Time

Err or

30 mins

Time to

Execute on

Entire Dataset

Interactive

Queries

5 sec

(34)

Agarwal et al., BlinkDB: Queries with Bounded Errors and Bounded Response Times

on Very Large Data. ACM EuroSys 2013, Best Paper Award

Fast, approximate answers with error bars by executing queries

on small, pre-collected samples of data

Compatible with Apache Hive (storage, serdes, UDFs, types,

metadata) and HiveQL (with minor modifications)

(35)

Sampling Vs. No Sampling

0

100

200

300

400

500

600

700

800

900

1000

1 10 -1 10 -2 10 -3 10 -4 10 -5

Fraction of full data

Qu er y R es po ns e Time (Se con ds )

103

1020

18 13 10 8

10x as response time

is dominated by I/O

(36)

Sampling Vs. No Sampling

0

100

200

300

400

500

600

700

800

900

1000

1 10 -1 10 -2 10 -3 10 -4 10 -5

Fraction of full data

Qu er y R es po ns e Time (Se con ds )

103

1020

18 13 10 8

(0.02%)

(0.07%) (1.1%) (3.4%) (11%)

Error Bars

(37)

Execution Time

Err or

30 mins

Time to

Execute on

Entire Dataset

Interactive

Queries

5 sec

Speed/Accuracy Trade-off

Pre-Existing

Noise

(38)

People Resources

38

Hybrid Human-Machine

Computation

• Data Cleaning

• Active Learning

• Handling the last 5%

Disk 2 Disk 1

Parser

Optimizer

Statistics

CrowdSQL Results

Executor

Files Access Methods

UI Template Manager Form Editor UI

Creation

HIT Manager

MetaData

Turker Relationship Manager

Franklin, Kossmann et al., CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011

Wang et al., CrowdER: Crowdsourcing Entity Resolution, VLDB 2012

Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013 Best Paper Award

Supporting Data Scientists

• Interactive Analytics

• Visual Analytics

• Collaboration

(39)

Working with the Crowd

39

Incentives

Fatigue, Fraud, & other Failure

Modes

Latency & Prediction

Work Conditions

Interface Impacts Answer Quality

Task Structuring

Task Routing

(40)

Less is More?

Data Cleaning + Sampling

J. Wang et al., Work in Progress

(41)

Other Things We’re Working

• MLBase: Declarative Scalable Machine On

Learning

• OLTP and Serving Workloads

• MDCC: Mutli Data Center Consistency

• HAT: Highly-Available Transactions

• PBS: Probabilistically Bounded Staleness

• PLANET: Predictive Latency-Aware Networked Transactions

• Fast Matrix Manipulation Libraries

• Cold Storage, Partitioning, Distributed Caching

• Machine Learning Pipelines, GPUs,

• …

(42)

Summary

• The Berkeley AMPLab is integrating Algorithms,

Machines and People to make sense of data at

scale.

• BDAS is our main delivery vehicle.

• Open Source has been a great way to have real

industry impact with academic research.

• Our direction: advanced analytics, extreme elasticity,

and support for people in all phases of Big Data

analytics.

(43)

For More Information

amplab.cs.berkeley.

edu

[email protected]

du

UC BERKELEY

Thanks to NSF CISE Expeditions in Computing, DARPA XData,

Founding Sponsors: Amazon Web Services, Google, and SAP,

the Thomas and Stacy Siebel Foundation,

and all our industrial sponsors and partners.

References

Related documents

All services provided under this rate are subject to ENMAX Energy Corporation Terms and Conditions for the Regulate Rate Option Tariff.. 2015 INTERIM REGULATED RATE

In Central America, two conflicting narratives are used to describe the relationship between forest cover and water availability, with implications for management of water

Once you run the above script hive read data from the Cassandra storage and summarize it, then the summarized data will persist into RDBMS storage to visualize via

4,000 nodes.. All Rights Reserved. Looking Ahead HDFS MRV1 1) Processing 2) Resource management HDFS YARN (resource management) mapreduce other Hadoop v1 Hadoop v2..

Among many TCM medical and philosophical concepts, I specifically focus on the healing, the silence and the miracle cure and how they are embodied and co-constructed by

Cash flow for the full-year period from operating activities was strengthened and totaled SEK 21.9 (- 0.5) million, SEK 15.8 (-12.0) million of which is attributable to reduced

The model, while extremely simple, provides an explicit statement of the assumptions underlying our methodology: the recorded voltage traces arise as a linear superposition of

(2002) entitled A New Wave of Evidence, The Impact of School, Family, and Community Connections on Student Achievement, the authors state that “most students at