The Berkeley Data Analytics Stack: Present and Future
UC BERKELEY
Michael Franklin
27 March 2014
Technion Big Day on Big Data
BDAS in the Big Data Context
Sources Driving Big Data
It’s All Happening On-line
Every:
Click
Ad impression
Billing event
Fast forward, pause, …
Friend request
Transaction
Network message
Fault
…
User Generated (Web & Mobile)
Internet of Things / M2M
Scientific Computing
Beyond Volume, Velocity and Variety
Massive, Diverse and Growing Data
Step 1: Run Faster
Step 2: Tradeoffs (Time, Answer Quality, Money)
AMPLab: Integrating Diverse Resources
Algorithms
• Machine Learning, Statistical Methods
• Prediction, Business Intelligence
Machines
• Clusters and Clouds
• Warehouse Scale Computing
People
• Crowdsourcing, Human Computation
• Data Scientists, Analysts
AMPLab Data
Launched January 2011, 6 year duration
60+ Students, Postdocs, Faculty and Staff
In-house Applications
Cancer Genomics, Mobile Sensing, IoT (smartphones)
Industry Sponsors, Foundations + White House Big Data Program
Franklin (DB)
Jordan (ML)
Stoica (Sys)
Patterson (Sys)
Shenker (Net)
Recht (ML)
Katz (Sys)
Joseph (Sec)
Goldberg (HCI)
Carat: Big Data at Work
715,000+
downloads
Open Source Engagement
SF Spark Meetup: 1700+ members
Also: Boston, Hyderabad, …
Bootcamps: Berkeley, Strata, On-line
Meetup talk on MLBase at Twitter HQ (Aug 6, 2013)
Big Data Systems Today
MapReduce
Pregel
Dremel
GraphLab
Storm
Giraph
Drill
Tez
Impala
S4 …
Specialized systems
(iterative, interactive and
streaming apps)
General batch
processing
BDAS Philosophy
Don’t specialize MapReduce – Generalize it!
Two additions to Hadoop MR can enable all the
models on previous slide!
1. General Task DAGs
2. Data Sharing
For Users:
Fewer Systems
Less Copying
Spark, Streaming, GraphX, Shark, MLbase, …
For Developers: Code Size
[Figure: bar chart of non-test, non-example source lines (axis 0 to 140,000) for Hadoop MapReduce, Storm (Streaming), Impala (SQL), Giraph (Graph), and Spark; successive builds add Streaming, Shark*, and GraphX to the Spark bar]
* also calls into Hive
Berkeley Data Analytics Stack
Legend: AMP Alpha or Soon; AMP Released (BSD/Apache); 3rd Party Open Source
Resource: Apache Mesos, YARN Resource Manager
Storage: HDFS / Hadoop Storage, Tachyon
Processing and Data Management: Apache Spark, Spark Streaming, ML-lib
Analytics Frameworks: Shark (SQL), BlinkDB, GraphX, MLBase
Spark-R, PySpark
Applications: Traffic, Carat, Genomics, 3rd Party
Tools: IPython, Visualization, Data Cleaning, …
Fast, MapReduce-like engine
»In-memory storage abstraction
for iterative/interactive queries
»General execution graphs
»Up to 100x faster than Hadoop MR (2-10x even for on-disk)
Compatible with Hadoop’s storage APIs
»Can access HDFS, HBase, S3, SequenceFiles, etc
Great example of ML/Systems/DB collaboration
Example: Logistic Regression
Goal: find best line separating two sets of points
target
random initial line
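A minimal sketch of the iterative computation behind this example, written in plain Python rather than Spark. The update follows the same gradient as the deck's Scala logRegress code; the toy data, the bias feature, the step size, and the iteration count are assumptions made for illustration.

```python
import math
import random

def log_regress(points, dims, iterations, step=0.1, seed=0):
    """Gradient descent for logistic regression on (x, y) pairs, y in {-1, +1}."""
    rng = random.Random(seed)
    # Start from a random line, as on the slide.
    w = [2 * rng.random() - 1 for _ in range(dims)]
    for _ in range(iterations):
        gradient = [0.0] * dims
        # Every iteration scans the same points again. This repeated scan
        # is the access pattern that in-memory data sharing speeds up.
        for x, y in points:
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            scale = (1.0 / (1.0 + math.exp(-margin)) - 1.0) * y
            for j in range(dims):
                gradient[j] += scale * x[j]
        # The step size is an assumption; the deck's code uses w -= gradient.
        w = [wi - step * gi for wi, gi in zip(w, gradient)]
    return w

def predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

# Hypothetical, linearly separable points: label +1 when x0 > x1.
# The trailing 1.0 is a bias feature.
points = [
    ((2.0, 0.5, 1.0), 1), ((3.0, 1.0, 1.0), 1), ((1.5, 0.0, 1.0), 1),
    ((0.5, 2.0, 1.0), -1), ((1.0, 3.0, 1.0), -1), ((0.0, 1.5, 1.0), -1),
]
w = log_regress(points, dims=3, iterations=1000)
accuracy = sum(predict(w, x) == y for x, y in points) / len(points)
print("weights:", w, "training accuracy:", accuracy)
```

The point of the sketch is the loop structure: the data set is fixed and only the small vector w changes between passes.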
ML and Queries in Hadoop
[Figure: iterative jobs (iter. 1, iter. 2, …) each read their input from HDFS and write their output back to HDFS; interactive queries (query 1, query 2, query 3, …) each re-read the input from HDFS to produce their results]
In-Memory Data Sharing
[Figure: after one-time processing of the input, iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3, …) share data through distributed memory]
Logistic Regression Performance
[Figure: running time (min) vs. number of iterations (1, 10, 20, 30) for Hadoop and Spark]
Hadoop: 110 s / iteration
Spark: first iteration 80 s, further iterations 1 s
29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each)
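As a worked check of these numbers, the sketch below turns the per-iteration costs into total running times. It assumes the annotations read as 110 s per iteration for Hadoop, and 80 s for the first Spark iteration followed by 1 s for each later one; the function names are made up for this example.

```python
HADOOP_SEC_PER_ITER = 110    # assumed reading of the slide annotation
SPARK_FIRST_ITER_SEC = 80    # first pass loads the data into memory
SPARK_LATER_ITER_SEC = 1     # later passes reuse the in-memory data

def hadoop_minutes(iterations):
    """Every iteration pays the full read-from-disk cost."""
    return iterations * HADOOP_SEC_PER_ITER / 60

def spark_minutes(iterations):
    """Only the first iteration pays the load cost."""
    if iterations <= 0:
        return 0.0
    later = iterations - 1
    return (SPARK_FIRST_ITER_SEC + later * SPARK_LATER_ITER_SEC) / 60

running_times = {n: (hadoop_minutes(n), spark_minutes(n)) for n in (1, 10, 20, 30)}
for n, (hadoop, spark) in running_times.items():
    print(f"{n:>2} iterations: Hadoop {hadoop:5.1f} min, Spark {spark:4.1f} min")
```

Under these assumptions, 30 iterations cost 55 minutes on Hadoop against just under 2 minutes on Spark, so the gap widens with every extra iteration.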
Challenge
A distributed memory abstraction that is
both:
efficient
and
fault-tolerant
Resilient Distributed Datasets
(RDDs)
API: coarse-grained transformations (map,
group-by, join, sort, filter, sample,…) on
immutable collections
Efficient fault recovery using lineage
» Log one operation to apply to many elements
» Recompute lost partitions of RDD on failure
» No cost if nothing fails
Rich enough to capture many models:
» Data flow models: MapReduce, Dryad, SQL, …
» Specialized models: Pregel, Hama, …
M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory
cluster computing, NSDI 2012. Best Paper Award
Example: Log Mining
Load error messages from a log into memory, then
interactively search for various patterns
val lines = spark.textFile("hdfs://...")           // Base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // Transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count         // Action
cachedMsgs.filter(_.contains("bar")).count

[Figure: the Driver sends tasks to three Workers and collects results; each Worker reads one block of the file (Block 1, Block 2, Block 3) and keeps its share of cachedMsgs in memory (Cache 1, Cache 2, Cache 3)]
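The same pattern can be tried without a cluster. The sketch below is a plain-Python analogue, not Spark code: the log lines are invented, read_log is a hypothetical stand-in for spark.textFile, and materializing a list stands in for cache() followed by the first action (Spark itself only fills the cache when an action runs).

```python
# Invented sample log: level, date, message, separated by tabs.
log_lines = [
    "ERROR\t2014-03-27\tfoo failed to start",
    "INFO\t2014-03-27\tfoo started",
    "ERROR\t2014-03-27\tbar timed out",
    "ERROR\t2014-03-27\tfoo and bar both down",
]
reads = {"count": 0}

def read_log():
    """Hypothetical stand-in for spark.textFile; counts full scans of the input."""
    reads["count"] += 1
    yield from log_lines

def messages():
    # Lazy pipeline: nothing is read until the result is consumed.
    errors = (line for line in read_log() if line.startswith("ERROR"))
    return (line.split("\t")[2] for line in errors)

# Materialize once; later queries run against memory only.
cached_msgs = list(messages())

foo_count = sum(1 for m in cached_msgs if "foo" in m)
bar_count = sum(1 for m in cached_msgs if "bar" in m)
print(foo_count, bar_count, "input scans:", reads["count"])
```

Both searches are answered from the in-memory list, so the input is scanned once no matter how many patterns are tried.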
Fault Tolerance with RDDs
RDDs track the series of transformations
used to build them (their lineage)
Enables per-node recomputation of lost data
val messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))
Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))
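To make lineage-based recovery concrete, here is a toy model in plain Python. It is a sketch under simplifying assumptions, not Spark's implementation: ToyRDD and all of its methods are hypothetical, partitions are Python lists, and a node failure is simulated by deleting one cached partition.

```python
class ToyRDD:
    """A dataset that records how it was derived instead of replicating data."""

    def __init__(self, name, blocks=None, parent=None, op=None):
        self.name = name
        self.blocks = blocks      # input blocks; only set on the base dataset
        self.parent = parent      # lineage pointer
        self.op = op              # function applied to one parent partition
        self.persisted = False
        self.cache = {}
        self.computed = []        # partition indices computed so far

    def filter(self, pred):
        return ToyRDD("FilteredRDD", parent=self,
                      op=lambda part: [r for r in part if pred(r)])

    def map(self, fn):
        return ToyRDD("MappedRDD", parent=self,
                      op=lambda part: [fn(r) for r in part])

    def persist(self):
        self.persisted = True
        return self

    def num_partitions(self):
        if self.parent is None:
            return len(self.blocks)
        return self.parent.num_partitions()

    def lineage(self):
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def partition(self, i):
        if i in self.cache:
            return self.cache[i]
        # Cache miss: rebuild this one partition by replaying the lineage.
        if self.parent is None:
            data = list(self.blocks[i])
        else:
            data = self.op(self.parent.partition(i))
        self.computed.append(i)
        if self.persisted:
            self.cache[i] = data
        return data

    def collect(self):
        return [r for i in range(self.num_partitions()) for r in self.partition(i)]

# Invented input blocks, one per simulated node.
blocks = [
    ["error\ta\tdisk full", "info\tb\tok"],
    ["error\tc\ttimeout"],
    ["info\td\tok", "error\te\tbad block"],
]
base = ToyRDD("HadoopRDD", blocks=blocks)
messages = base.filter(lambda l: "error" in l).map(lambda l: l.split("\t")[2]).persist()
collected = messages.collect()

del messages.cache[1]          # simulate losing the node holding partition 1
messages.computed.clear()
recovered = messages.partition(1)
print(messages.lineage(), recovered, messages.computed)
```

Only the lost partition is recomputed; the other two stay in the cache, and nothing extra was logged or copied while the system was healthy.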
Spark Status
Current release (0.9) includes Java and Python APIs
» Apache Top Level Project (as of Feb 2014)
» Includes: streaming, MLlib, YARN integration, EC2, GraphX
» 100+ Developers, 20 Organizations contributed code
Supported by: Cloudera, WANdisco & others TBA
Spark Apps to be certified by Databricks (AMPLab spin-out)
Sample Use cases:
» In-memory analytics on Hive data (Conviva)
» Interactive queries on data streams (Quantifind)
» Business intelligence (Yahoo!)
» DNA sequence alignment (SNAP)
Shark
Shark = Spark + Hive
Uses Spark’s in-memory RDD caching and
language
»Result reuse and low latency
»Scalable, fault-tolerant, fast
Query Compatible with Hive
» Run HiveQL queries (w/ UDFs, UDAs…) without modifications
» Convert logical query plan generated from Hive into Spark execution
graph
Data Compatible with Hive
» Use existing HDFS data and Hive metadata, without
modifications
C. Engle, et al, Shark: Fast Data Analysis Using Coarse-grained Distributed Memory, SIGMOD 2012 (system demonstration). Best Demo Award
R. Xin et al., Shark: SQL and Rich Analytics at Scale, SIGMOD 2013
Shark Optimizations
• Fast task start-up
• Optimized column-oriented storage
• Dynamic (mid-query) join algorithm
selection based on statistical properties of
data
• Runtime selection of # of reducers
• Partition pruning using range statistics
• Controllable table partitioning across
nodes
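One of these optimizations, partition pruning using range statistics, can be sketched in a few lines of plain Python. This is an illustration of the general idea and not Shark's code: the Partition class, the helper names, and the sample rows are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Partition:
    rows: list       # (visit_date, ad_revenue) tuples
    min_key: str     # smallest visit_date in the partition
    max_key: str     # largest visit_date in the partition

def make_partition(rows):
    """Record min/max statistics once, when the partition is loaded."""
    keys = [r[0] for r in rows]
    return Partition(rows, min(keys), max(keys))

def scan_between(partitions, lo, hi):
    """Return matching rows and the number of partitions actually scanned."""
    scanned = 0
    out = []
    for p in partitions:
        # Skip partitions whose key range cannot overlap [lo, hi].
        if p.max_key < lo or p.min_key > hi:
            continue
        scanned += 1
        out.extend(r for r in p.rows if lo <= r[0] <= hi)
    return out, scanned

# Invented data; ISO dates compare correctly as strings.
partitions = [
    make_partition([("1979-05-01", 1.0), ("1979-11-30", 2.0)]),
    make_partition([("1980-01-15", 3.0), ("1980-06-01", 4.0)]),
    make_partition([("1981-02-01", 5.0), ("1981-09-09", 6.0)]),
]
rows, scanned = scan_between(partitions, "1980-01-01", "1980-12-31")
print(rows, "partitions scanned:", scanned)
```

With the statistics in hand, a range predicate touches one partition out of three here; without them, every partition would have to be read.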
Running Time: Join + Order By
https://amplab.cs.berkeley.edu/benchmark/
SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS totalRev
FROM Rankings AS R, UserVisits AS UV
WHERE R.pageURL = UV.destURL
AND UV.visitDate BETWEEN Date('1980-01-01') AND Date('X')
GROUP BY UV.sourceIP
ORDER BY totalRev DESC LIMIT 1
[Figure: two panels of query time (sec), one for 485K UserVisits and one for 533M UserVisits]
A Unified System for SQL & ML
def logRegress(points: RDD[Point]): Vector = {
  // Start from a random weight vector with components in [-1, 1)
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

// Build the training set with SQL, then hand the rows to the ML code
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}
val trainedVector = logRegress(features.cache())
Deep integration of Shark and Spark
Both share the same set of workers and caches
Can move seamlessly between SQL and Machine Learning worlds
GraphX – Adding Graphs to the Mix
GraphX Unified
Representation
Graph View
Table View
Tables and Graphs are composable views of the
same physical data
Each view has its own operators that exploit the
semantics of the view to achieve efficient
execution
R. Xin, J. Gonzalez, M. Franklin, I. Stoica, “GraphX: In-situ Graph Computation Made Easy,”
GRADES Workshop at SIGMOD, June 2013.
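The idea of two views over the same data can be illustrated with a small plain-Python sketch. It does not use the GraphX API: the vertex and edge tables, the function names, and the sample data are assumptions chosen to show a table-style operator and a graph-style operator working on one edge list.

```python
from collections import defaultdict

# One physical representation: a vertex table and an edge table (invented data).
vertices = {1: "alice", 2: "bob", 3: "carol"}
edges = [(1, 2, "follows"), (2, 3, "follows"), (1, 3, "follows")]

def in_degrees(edge_table):
    """Table view: a group-by count over the destination column."""
    counts = defaultdict(int)
    for _src, dst, _attr in edge_table:
        counts[dst] += 1
    return dict(counts)

def adjacency(edge_table):
    """Graph view: the same rows indexed by source vertex."""
    adj = defaultdict(list)
    for src, dst, _attr in edge_table:
        adj[src].append(dst)
    return dict(adj)

def reachable(adj, start):
    """Graph view operator: vertices reachable from start."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in adj.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Compose the views: a table-style join of vertex names with in-degree.
degrees = in_degrees(edges)
named_degrees = {name: degrees.get(vid, 0) for vid, name in vertices.items()}
from_bob = reachable(adjacency(edges), 2)
print(named_degrees, from_bob)
```

Neither view copies the data into a separate system; each simply reads the same edge rows with operators suited to its own semantics.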
Speed/Accuracy Trade-off
[Figure: error vs. execution time; executing on the entire dataset takes 30 mins, while interactive queries need answers in about 5 sec]
Agarwal et al., BlinkDB: Queries with Bounded Errors and Bounded Response Times
on Very Large Data. ACM EuroSys 2013, Best Paper Award
Fast, approximate answers with error bars by executing queries
on small, pre-collected samples of data
Compatible with Apache Hive (storage, serdes, UDFs, types,
metadata) and HiveQL (with minor modifications)
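A rough sketch of how a sample can yield an answer with an error bar, in plain Python. It uses a uniform random sample drawn at query time and a normal-approximation 95% interval; this is a simplification chosen for illustration and not BlinkDB's pre-collected sampling or its error estimation. The data, seeds, and function name are assumptions.

```python
import math
import random

def approx_mean(values, sample_fraction, seed=0, z=1.96):
    """Estimate the mean from a uniform sample; return (estimate, half_width)."""
    rng = random.Random(seed)
    n = max(2, int(len(values) * sample_fraction))
    sample = rng.sample(values, n)
    mean = sum(sample) / n
    variance = sum((v - mean) ** 2 for v in sample) / (n - 1)
    # Half-width of an approximate 95% confidence interval for the mean.
    half_width = z * math.sqrt(variance / n)
    return mean, half_width

# Invented data: 100,000 values drawn around 100.
data_rng = random.Random(42)
values = [data_rng.gauss(100, 15) for _ in range(100_000)]

exact = sum(values) / len(values)
estimate, err = approx_mean(values, sample_fraction=0.01)
print(f"exact {exact:.2f}, estimate {estimate:.2f} +/- {err:.2f} from 1% of the data")
```

The estimate reads one row in a hundred, and the reported interval tells the user how far it is likely to be from the exact answer.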
Sampling vs. No Sampling
[Figure: query response time (seconds) vs. fraction of full data, with error bars]
Fraction of full data   Response time (s)   Error
1                       1020                -
10^-1                   103                 0.02%
10^-2                   18                  0.07%
10^-3                   13                  1.1%
10^-4                   10                  3.4%
10^-5                   8                   11%
10x as response time is dominated by I/O
Speed/Accuracy Trade-off
[Figure: error vs. execution time, from interactive queries (5 sec) to execution on the entire dataset (30 mins), with the level of pre-existing noise marked]
People Resources
Hybrid Human-Machine
Computation
• Data Cleaning
• Active Learning
• Handling the last 5%
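[Figure: CrowdDB architecture. CrowdSQL queries pass through the Parser, Optimizer, and Executor, supported by Statistics, MetaData, and Files / Access Methods over Disk 1 and Disk 2, and return Results; the crowd-facing components are UI Creation (UI Template Manager, Form Editor), the HIT Manager, and the Turker Relationship Manager]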
Franklin, Kossmann et al., CrowdDB: Answering Queries with Crowdsourcing, SIGMOD 2011
Wang et al., CrowdER: Crowdsourcing Entity Resolution, VLDB 2012
Trushkowsky et al., Crowdsourcing Enumeration Queries, ICDE 2013 Best Paper Award
Supporting Data Scientists
• Interactive Analytics
• Visual Analytics
• Collaboration
Working with the Crowd
Incentives
Fatigue, Fraud, & other Failure Modes
Latency & Prediction
Work Conditions
Interface Impacts Answer Quality
Task Structuring
Task Routing
Less is More?
Data Cleaning + Sampling
J. Wang et al., Work in Progress
Other Things We’re Working On
• MLBase: Declarative Scalable Machine Learning
• OLTP and Serving Workloads
• MDCC: Multi-Data Center Consistency
• HAT: Highly-Available Transactions
• PBS: Probabilistically Bounded Staleness
• PLANET: Predictive Latency-Aware Networked Transactions
• Fast Matrix Manipulation Libraries
• Cold Storage, Partitioning, Distributed Caching
• Machine Learning Pipelines, GPUs, …
Summary
• The Berkeley AMPLab is integrating Algorithms,
Machines and People to make sense of data at
scale.
• BDAS is our main delivery vehicle.
• Open Source has been a great way to have real
industry impact with academic research.
• Our direction: advanced analytics, extreme elasticity,
and support for people in all phases of Big Data
analytics.
For More Information
amplab.cs.berkeley.edu
[email protected]
du