• No results found

Big Data Processing. Patrick Wendell Databricks

N/A
N/A
Protected

Academic year: 2021

Share "Big Data Processing. Patrick Wendell Databricks"

Copied!
33
0
0

Loading.... (view fulltext now)

Full text

(1)

Patrick Wendell

Databricks

Big Data Processing

(2)

About me

Committer and PMC member of Apache Spark

“Former” PhD student at Berkeley

Left Berkeley to help found Databricks

Now managing open source work at Databricks Focus is on networking and operating systems

(3)

Outline

Overview of today

The Big Data problem

The Spark Computing Framework

(4)

Straw poll. Have you:

-  Written code in a numerical programming environment (R/Matlab/Weka/etc)?

-  Written code in a general programming language (Python/Java/etc)?

-  Written multi-threaded or distributed computer programs?

(5)

Today’s workshop

Overview of trends in large-scale data analysis

Introduction to the Spark cluster-computing engine with a focus on numerical computing Hands-on workshop with TA’s covering

Scala basics and using Spark for machine learning

(6)

Hands-on Exercises

You’ll be given a cluster of 5 machines*

(a) Text mining of full text of Wikipedia (b) Movie recommendations with ALS

5 machines * 200 people = 1,000 machines

(7)

Outline

Overview of today

The Big Data problem

The Spark Computing Framework

(8)

The Big Data Problem

Data is growing faster than computation speeds

Growing data sources

» Web, mobile, scientific, …

Cheap storage

» Doubling every 18 months

Stalling CPU speeds Storage bottlenecks

(9)

Examples

Facebook’s daily logs: 60 TB 1000 genomes project: 200 TB Google web index: 10+ PB

Cost of 1 TB of disk: $50

Time to read 1 TB from disk: 6 hours (50 MB/s)

(10)

The Big Data Problem

Single machine can no longer process or even store all the data!

Only solution is to distribute over large clusters

(11)

Google Datacenter

How  do  we  program  this  thing?  

(12)

What’s hard about cluster

computing?

How to divide work across nodes?

•  Must consider network, data locality

•  Moving data may be very expensive

How to deal with failures?

•  1 server fails every 3 years => 10K nodes see 10 faults/day

•  Even worse: stragglers (node is not failed, but slow)

(13)

A comparison:

How to divide work across nodes?

What if certain elements of a matrix/array became 1000 times slower to access than others?

How to deal with failures?

What if with 10% probability certain variables in your program became undefined?

(14)

Outline

Overview of today

The Big Data problem

The Spark Computing Framework

(15)

The Spark Computing

Framework

Provides a programming abstraction and parallel runtime to hide this complexity.

“Here’s an operation, run it on all of the data”

» I don’t care where it runs (you schedule that)

» In fact, feel free to run it twice on different nodes

(16)

Resilient Distributed Datasets

Key programming abstraction in Spark Think “parallel data frame”

Supports collections/set API’s

#  get  variance  of  5th  field  of  a  tab-­‐delimited  dataset   rdd  =  spark.hadoopFile(“big-­‐file.txt”)  

 

rdd.map(line  =>  line.split(“\t”)(4))        .map(cell  =>  cell.toDouble())  

     .variance()  

(17)

RDD Execution

Automatically split work into many small, idempotent tasks

Send tasks to nodes based on data locality Load-balance dynamically as tasks finish

(18)

Fault Recovery

1. If a task crashes:

» Retry on another node

•  OK for a map because it had no dependencies

•  OK for reduce because map outputs are on disk

Requires  user  code  to  be  deterministic  

(19)

Fault  Recovery  

2. If a node crashes:

» Relaunch its current tasks on other nodes

» Relaunch any maps the node previously ran

•  Necessary because their output files were lost

(20)

Fault  Recovery  

3. If a task is going slowly (straggler):

» Launch second copy of task on another node

» Take the output of whichever copy finishes first, and kill the other one

(21)

Spark Compared with Earlier

Approaches

Higher level, declarative API.

Built from the “ground up” for performance – including support for leveraging distributed cluster memory.

Optimized for iterative computations (e.g.

machine learning)

(22)

Spark Platform

Spark  RDD  API  

(23)

Spark Platform: GraphX

Spark  RDD  API   GraphX  

Graph   (alpha)  

RDD-­‐Based     Graphs  

graph  =  Graph(vertexRDD,  edgeRDD)  

graph.connectedComponents()  #  returns  a  new  RDD  

(24)

Spark Platform: MLLib

Spark  RDD  API   GraphX  

Graph   (alpha)  

MLLib  

machine   learning  

RDD-­‐Based    Matrices   RDD-­‐Based    

Graphs  

model  =  LogisticRegressionWithSGD.train(trainRDD)   dataRDD.map(point  =>  model.predict(point))  

(25)

Spark Platform: Streaming

Spark  RDD  API   GraphX  

Graph   (alpha)  

MLLib  

machine   learning  

RDD-­‐Based    Matrices   RDD-­‐Based    

Graphs  

dstream  =  spark.networkInputStream()   dstream.countByWindow(Seconds(30))  

Streaming  

RDD-­‐Based    Streams  

(26)

Spark Platform: SQL

Spark  RDD  API   GraphX  

Graph   (alpha)  

MLLib  

machine   learning  

RDD-­‐Based    Matrices   RDD-­‐Based    

Graphs  

rdd  =  sql“select  *  from  rdd1  where  age  >  10”  

Streaming  

RDD-­‐Based    Streams  

SQL  

Schema  RDD’s  

(27)

Performance

Impala  (disk)   Impala  (mem)   Redshift   Shark  (disk)   Shark  (mem)  

0   5   10   15   20   25  

Response  Time  (s)  

SQL[1]  

Storm   Spark  

0   5   10   15   20   25   30   35  

Throughput  (MB/s/node)  

Streaming[2]  

Hadoop   Giraph   GraphX  

0   5   10   15   20   25   30  

Response  Time  (min)  

Graph[3]  

[1] https://amplab.cs.berkeley.edu/benchmark/

[2] Discretized Streams: Fault-Tolerant Streaming Computation at Scale. At SOSP 2013.

[3] https://amplab.cs.berkeley.edu/publication/graphx-grades/

(28)

Composition

Most large-scale data programs involve parsing, filtering, and cleaning data.

Spark allows users to compose patterns elegantly. E.g.:

Select input dataset with SQL, then run machine learning on the result.

(29)

One of the largest open source projects in big data 150+ developers contributing

30+ companies contributing

Spark Community

0 50 100 150

Contributors in past year

(30)

Community Growth

Spark 0.6:

17 contributors

Sept ‘13 Feb ‘13

Oct ‘12

Spark 0.7:"

31 contributors

Spark 0.8:

67 contributors

Feb ‘14

Spark 0.9:

83 contributors

(31)

Databricks

Primary developers of Spark today Founded by Spark project creators A nexus of several research areas:

à OS and computer architecture

à Networking and distributed systems à Statistics and machine learning

(32)

Today’s Speakers

Holden Karau – Worked on large-scale storage infrastructure @ Google.

Intro to Spark and Scala

Hossein Falaki – Data scientist at Apple.

Numerical Programming with Spark

Xiangrui Meng – PhD from Stanford ICME.

Lead contributor on MLLib.

Deep dive on Spark’s MLLib

(33)

Questions?

References

Related documents

Attack types are using apache spark version of structured data types are adding a scala. Closest in spark example, containing the

Fog computing requires extremely low latency, parallel processing of machine learning and complex graph analytical algorithms that is provided by Apache Spark. Spark

By 2011, mobiles accounted for 83 percent of total global phone lines, 43 percent of originated international call traffic, and 58 percent of terminated international traffic

For example, if you want a page which displays two separate frames, you will require 3 HTML files to build that page: one frameset, and the two 'regular' HTML documents to

Unlike any other Wi-Fi based location solution, Trapeze delivers reliable real-time positioning data across any Mobility Domain™ service without causing additional load on the

tion and secondary prevention of coronary heart dis- ease: an American Heart Association scientific statement from the Council on Clinical Cardiology (Subcommittee on

After discussing the frequency and feature of the relatively high-frequency valency sentence patterns of the verb APPOINT in the passive sentences, this

The pathology and immunohistochemical stains were consistent with histologic impression of extra nodal mar- ginal B-cell lymphoma (mucosa-associated lymphoid tissue [MALT] lymphoma)