• No results found

Making Sense at Scale with the Berkeley Data Analytics Stack

N/A
N/A
Protected

Academic year: 2021

Share "Making Sense at Scale with the Berkeley Data Analytics Stack"

Copied!
58
0
0

Loading.... (view fulltext now)

Full text

(1)

Making Sense at Scale with the

Berkeley Data Analytics Stack

UC  BERKELEY  

Michael Franklin February 3, 2015 WSDM 2015 Shanghai

(2)

Agenda

•  A Little Bit on Big Data

•  AMPLab Overview

•  BDAS Philosophy: Unification not Specialization

•  Spark, GraphX, and other BDAS components

(3)

Sources Driving Big Data

It’s  All  Happening  On-­‐line   Every:

Click

Ad impression Billing event

Fast Forward, pause,… Friend Request

Transaction

Network message Fault

User  Generated  (Web  &   Mobile)  

… ..

(4)

Big Data – A Bad Definition

Data sets, typically consisting of billions or trillions

of records, that are so vast and complex that they require new and powerful computational

resources to process.

- Dictionary.com

(5)

Big Data as a Resource

“For a long time, we thought that Tamoxifen was roughly 80% effective for breast cancer patients. But now we know much more:

we know that it’s 100% effective in 70% to 80% of the patients, and ineffective in the rest.”

- Tim O’Reilly et al. “How Data Science is Transforming Health Care”

With enough of the right data you can determine precisely who the treatment will work for.

(6)

A Big Data Pattern

6

750,000+ downloads

(7)

AMPLab: Integrating 3 Resources

Algorithms  

•  Machine  Learning,  Statistical  Methods  

•  Prediction,  Business  Intelligence  

Machines  

•  Clusters  and  Clouds  

•  Warehouse  Scale  Computing  

People  

•  Crowdsourcing,  Human  Computation  

(8)

AMPLab Overview

UC  BERKELEY  

•  70+ Students, Postdocs, Faculty and Staff from:

Databases, Machine Learning, Systems, Security, and Networking

•  50/50 Split: Industry Sponsors and Govt

White House Big Data Program:

NSF CISE Expeditions in Computing and Darpa XData •  Fixed Timeline (ends Dec 2016); Collaborative Working Space

See Dave Patterson “How to Build a Bad Research Center”, CACM March, 2014 “… Berkeley’s AMPLab has already left an indelible mark on world of

information technology, and even the web. But we haven’t yet experienced the full impact of the group … Not even close.”

– Derrick Harris, GigaOM, Aug 2, 2014

(9)

A Nexus of Industrial Engagement

•  Open Source Software – Industrial-Strength

(10)

Open Source Community Building

MeetUp on MLBase @Twitter (Aug 6, 2013)

(11)

Big Data Ecosystem Evolution

MapReduce Pregel Dremel GraphLab Storm Giraph Drill Tez Impala S4 … Specialized systems

(iterative, interactive and"

streaming apps)

General batch"

(12)

AMPLab Unification Philosophy

Don’t specialize MapReduce – Generalize it!

Two additions to Hadoop MR can enable all the models shown earlier!

1. General Task DAGs 2. Data Sharing

For Users:

Fewer Systems to Use

Less Data Movement Spark

Streaming GraphX … SparkSQL MLbase

(13)

B

erkeley

D

ata

A

nalytics

S

tack

(Apache and BSD open source)

Resource Virtualization Storage Processing Engine Access and Interfaces In-house Apps

(14)

In-Memory

Dataflow

System

M. Zaharia, M. Choudhury, M. Franklin, I. Stoica, S. Shenker, “Spark: Cluster Computing with Working Sets, USENIX HotCloud, 2010.

“It’s only September but it’s already clear that 2014 will be the year of Apache Spark”

-- Datanami, 9/15/14

•  Developed in AMPLab and its predecessor the RADLab

•  Alternative to Hadoop MapReduce

•  10-100x speedup for ML and interactive queries

•  Central component of the BDAS Stack

(15)

Apache Spark Contributors:

0 25 50 75 100 2011 2012 2013 2014

(16)

Apache Spark:

Compared to Other Projects

M ap R ed uce YAR N HDFS St or m Spar k 0 200 400 600 800 1000 1200 1400 1600 1800 2000 M ap R ed uce YAR N HD FS St or m Spar k 0 50000 100000 150000 200000 250000 300000 350000

Commits Lines of Code Changed

Activity in past 6 months

2-3x more activity than: Hadoop, Storm, MongoDB, NumPy, D3, Julia, …

(17)

Iteration in Map-Reduce

Training

Data

Map Reduce Learned Model w(1) w(2) w(3) w(0) Initial Model

(18)

Cost of Iteration in Map-Reduce

Map Reduce Learned

Model w(1) w(2) w(3) w(0) Initial Model Training Data Read 2

Repeatedly

(19)

Cost of Iteration in Map-Reduce

Map Reduce Learned

Model w(1) w(2) w(3) w(0) Initial Model Training

Data

Redundantly

save

output between

(20)

Dataflow View

Training Data (HDFS) Map R ed uce Map R ed uce Map R ed uce

(21)

Memory Opt. Dataflow

Training Data (HDFS) Map R ed uce Map R ed uce Map R ed uce Cached Load

(22)

Memory Opt. Dataflow View

Training Data (HDFS) Map R ed uce Map R ed uce Map R ed uce Efficiently move data between stages

(23)

Resilient Distributed Datasets (RDDs)

API: coarse-grained transformations (map, group-by, join, sort, filter, sample,…) on immutable collections Resilient Distributed Datasets (RDDs)

» Collections of objects that can be stored in memory or

disk across a cluster

» Built via parallel transformations (map, filter, …)

» Automatically rebuilt on failure

Rich enough to capture many models:

» Data flow models: MapReduce, Dryad, SQL, …

» Specialized models: Pregel, Hama, …

M. Zaharia, et al, Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing, NSDI 2012.

(24)

Abstraction:

Dataflow Operators

map filter groupBy sort union join leftOuterJoin rightOuterJoin reduce count fold reduceByKey groupByKey cogroup cross zip sample take first partitionBy mapWith pipe save ...

(25)

Fault Tolerance with RDDs

RDDs track the series of transformations used to build them (their lineage)

» Log one operation to apply to many elements

» No cost if nothing fails

Enables per-node recomputation of lost data

messages = textFile(...).filter(_.contains(“error”)) .map(_.split(‘\t’)(2))

HadoopRDD  

(26)

A Unified System for SQL & ML

def logRegress(points: RDD[Point]): Vector { var w = Vector(D, _ => 2 * rand.nextDouble - 1) for (i <- 1 to ITERATIONS) { val gradient = points.map { p => val denom = 1 + exp(-p.y * (w dot p.x)) (1 / denom - 1) * p.y * p.x }.reduce(_ + _) w -= gradient } w }

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")

val features = users.mapRows { row => new Vector(extractFeature1(row.getInt("age")), extractFeature2(row.getStr("country")), ...)}

val trainedVector = logRegress(features.cache())

Deep integration of SQL and Spark

Both share the same set of workers and caches

Can move seamlessly between SQL and Machine Learning worlds

R. Xin, J. Rosen, M. Zaharia, M. Franklin,S. Shenker, I. Stoica, “Shark: SQL and Rich Analytics at Scale, SIGMOD 2013.

(27)

Spark Streaming

Microbatch approach provides lower latency

Additional operators provide windowed operations

M. Zaharia, et al, Discretized Streams: Fault-Tollerant Streaming Computation at Scale, SOSP 2013.

(28)

Batch/Streaming Unification

Batch and streaming codes virtually the same

» Easy to develop and maintain consistency

// count words from a file (batch)

val file = sc.textFile("hdfs://.../pagecounts-*.gz") val words = file.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print()

// count words from a network stream, every 10s (streaming)

val ssc = new StreamingContext(args(0), "NetCount", Seconds(10), ..) val lines = ssc.socketTextStream("localhost”, 3456)

val words = lines.flatMap(_.split(" "))

val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _) wordCounts.print()

(29)

Apache Spark v1.2

(12/14)

Includes

» Spark (core)

» Spark Streaming

» GraphX (alpha release)

» MLlib

» SparkSQL – Query

Processing

Wide range of interfaces:

» Python / interactive ipython

» Scala / interactive scala shell

» R / interactive R-shell

» Java

(30)

Graph Analytics

Raw Wikipedia < / >< / > < / > XML

Hyperlinks PageRank Top 20 Pages

Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Discussion Table User Disc. Top Communities Com. PR..

(31)

Separate

Systems

(32)

Separate

Systems

Graphs

Dataflow Systems

Table Result Row Row Row Row

(33)

Separate

Systems

Dataflow Systems

Graph Systems

Dependency Graph 6. Before 8. After 7. After Table Result Row Row Row Row

(34)

Difficult to Use

Users must

Learn

,

Deploy

, and

Manage

multiple systems

Leads to brittle and often

complex interfaces

(35)

22 354 1340 0 200 400 600 800 1000 1200 1400 1600 GraphLab Spark Hadoop

Runtime (in seconds, PageRank for 10 iterations) Live-Journal Graph

Efficiencies are Possible

Specialized Graph System can be faster than

general MR computation

(36)

But

36  

Extensive data movement and duplication across

the network and file system

< / >< / > < / > XML

HDFS   HDFS   HDFS   HDFS  

Limited reuse internal data-structures

(37)

GraphX – Adding Graphs to the Mix

GraphX Unified

Representation Graph View

Table View

Tables and Graphs are composable views of the same

physical data

Each view has its own operators that exploit the

semantics of the view to achieve efficient execution

J. Gonzalez, R. Xin, A. Dave, D. Crankshaw, M. Franklin, I. Stoica, “GraphX: Graph Processing in a Distributed Dataflow Framework“ OSDI Conf., Oct 2014

(38)

Representation

Optimizations

Distributed Graphs Horizontally Partitioned Tables Join Vertex Programs Dataflow Operators

Advances in Graph Processing Systems

Distributed Join Optimization Materialized View

(39)

Property Graph Data Model

B C A D F E A DD Property Graph B C D E A A F

Vertex Property:

•  User Profile

•  Current PageRank Value

Edge Property:

•  Weights

(40)

Part. 2 Part. 1 Vertex Table (RDD) B C A D F E A D

Encoding Property Graphs as Tables

D Property Graph B C D E A A F M ach in e 1 M ach in e 2 Edge Table (RDD) A B A C C D B C A E A F E F E D B C D E A F Routing Table (RDD) B C D E A F 1   2   1   2   1   2   1   2   Vertex Cut

(41)

class Graph [ V, E ] {

def Graph(vertices: Table[ (Id, V) ],

edges: Table[ (Id, Id, E) ])

// Table Views

def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ]

def triplets: Table [ ((Id, V), (Id, V), E) ]

// Transformations

def reverse: Graph[V, E]

def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]

def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]

// Joins

def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]

// Computation ---

def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E] }

Graph Operators (Scala)

(42)

class Graph [ V, E ] {

def Graph(vertices: Table[ (Id, V) ],

edges: Table[ (Id, Id, E) ])

// Table Views

def vertices: Table[ (Id, V) ] def edges: Table[ (Id, Id, E) ]

def triplets: Table [ ((Id, V), (Id, V), E) ]

// Transformations

def reverse: Graph[V, E]

def subgraph(pV: (Id, V) => Boolean,

pE: Edge[V,E] => Boolean): Graph[V,E]

def mapV(m: (Id, V) => T ): Graph[T,E] def mapE(m: Edge[V,E] => T ): Graph[V,T]

// Joins

def joinV(tbl: Table [(Id, T)]): Graph[(V, T), E ] def joinE(tbl: Table [(Id, Id, T)]): Graph[V, (E, T)]

// Computation ---

def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E] }

Graph Operators (Scala)

42

def mrTriplets(mapF: (Edge[V,E]) => List[(Id, T)],

reduceF: (T, T) => T): Graph[T, E]

capture the

Gather-Scatter pattern

from

GraphLab

(43)

Graph System Optimizations

43   Specialized Data-Structures Vertex-Cuts Partitioning Remote Caching / Mirroring

(44)

0 500 1000 1500 2000 2500 3000 3500 0 1000 2000 3000 4000 5000 6000 7000 8000 9000

Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges)

PageRank Benchmark

GraphX performs comparably to state-of-the-art graph processing systems.

R un time (S econ ds)

EC2 Cluster of 16 x m2.4xLarge Nodes + 1GigE

(45)

The GraphX Stack

(Lines of Code)

GraphX (2,500) Spark (30,000) GraphLab/Pregel API (34) PageRank (20) Connected Comp. (20) K-core (60) TriangleCount (50) LDA (220) SVD++ (110)

Some algorithms are more naturally expressed using the GraphX primitive operators

(46)

Graphs are just one stage….

(47)

HDFS HDFS

Compute

Spark Preprocess Spark Post.

A Small Pipeline in GraphX

Timed end-to-end GraphX is the fastest

Raw Wikipedia

< / >< / > < / > XML

Hyperlinks PageRank Top 20 Pages

0 200 400 600 800 1000 1200 1400 1600

GraphX GraphLab + Spark Giraph + Spark Spark

Total Runtime (in Seconds) 605

375

1492

(48)

MLBase: Distributed ML Made Easy

DB Query Language Analogy:

Specify What not How MLBase chooses:

•  Algorithms/Operators

•  Ordering and Physical

Placement

•  Parameter and

Hyperparameter Settings

•  Featurization

Leverages Spark for Speed and Scale

T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. Franklin, M. Jordan, “MLBase: A Distributed Machine Learning System”, CIDR 2013.

(49)

ML Pipeline Generation & Optimization

MLBase Optimizer Focus:

•  Better Resource Utilization

•  Algorithmic Speedups

•  Reduced Model Search Time Ex: Image Classification Pipeline

E. Sparks, S. Venkataraman, M. Franklin , B. Recht, “ML Pipelines”, In Process. (see AMPLab Blog for an overview)

(50)

50

Data

Model

Where do models go?

Conference Papers Sales Reports Drive Actions Training

(51)

Driving Actions

51 Suggesting Items at Checkout Fraud Detection Cognitive Assistance Internet of Things

(52)

Problem:

Separate Systems

52

Offline Analytics

Systems

6. Before 8. After 7. After Sophisticated ML on static data. Low-Latency data serving

How do we serve low-latency predictions and

train on live data?

Online Serving

Systems

(53)

Velox Model Serving System

Decompose personalized predictive models:

53 [CIDR’15]

(54)

Velox Model Serving System

Decompose personalized predictive models:

54 [CIDR’15] Split Personalization Model Feature Model Online Batch Feature Caching Approx. Features Online Updates Active Learning

(55)

B

erkeley

D

ata

A

nalytics

S

tack

(Apache and BSD open source)

Resource Virtualization Storage Processing Engine Access and Interfaces In-house Apps

(56)

Big Data Architecture Open

Questions & Research Issues

•  What is the role of a unified stack in a

fast-changing software landscape?

•  Single-node vs. Cluster; Elastic Cloud vs HPC

•  GPUs, FPGAs, XYZs, ???

•  New memory hierarchies: SSDs, RDMA, …

•  Serving vs. Analytics Workloads

(57)

Summary

•  Big Data – yes, there’s hype but it is a Big Deal

•  AmpLab project

•  6 Yrs, Cross-disciplinary team, Industry engagement

•  Open Source development and community building

•  BDAS philosophy: Unification

•  Spark + SQL + Graphs + ML + …

•  The Big Data landscape continues to evolve

•  Open Source enables Academic research to play a

huge role

(58)

To find out more or

get involved:

amplab.berkeley.edu

[email protected]

UC  BERKELEY  

Thanks to NSF CISE Expeditions in Computing, DARPA XData, Founding Sponsors: Amazon Web Services, Google, and SAP,

the Thomas and Stacy Siebel Foundation,

References

Related documents

Conclusion: Women in this study had inadequate knowledge and inappropriate practice related to mammography as a procedure for breast cancer investigation.. Pan African

Once you run the above script hive read data from the Cassandra storage and summarize it, then the summarized data will persist into RDBMS storage to visualize via

Then, each Facebook post and Twitter tweet were interpreted and classified into 13 general categories of planning topics, including transportation, infrastructure,

character in the Game of Thrones series, Jon Snow’s entire role in the show serves as the source of his description, which begins in the first season when he is the bastard of

BCP with Cycle Rulers methodology says the first 12-year cycle is ruled by Moon, second 12-year cycle is ruled by Mercury, third by Venus, fourth by Sun, fifth by Mars, sixth

Termasuk dalam hal ini adalah mengetahui ciri-ciri, sifat-sifat, bagaimana cara menentukan suku ke-n barisan aritmetika dan geometri serta jumlah n suku pertama deret aritmetika

Among many TCM medical and philosophical concepts, I specifically focus on the healing, the silence and the miracle cure and how they are embodied and co-constructed by

Data Intensive Processing Systems: Architecture of large scale data processing systems, Hadoop, Apache Spark, Storm, parallel data processing concepts such as map-reduce, directed