• No results found

Unified Big Data Analytics Pipeline. 连 城

N/A
N/A
Protected

Academic year: 2021

Share "Unified Big Data Analytics Pipeline. 连 城"

Copied!
37
0
0

Loading.... (view fulltext now)

Full text

(1)

Unified Big Data Analytics

Pipeline

连城 [email protected]

(2)

What is

● A fast and general engine for large-scale

data processing

● An open source implementation of Resilient

Distributed Datasets (RDD)

● Has an advanced DAG execution engine

that supports cyclic data flow and in-memory

computing

(3)

Why

● Fast

○ Run machine learning like iterative programs up to 100x faster than Hadoop MapReduce in memory or 10x faster on disk

○ Run HiveQL compatible queries 100x faster than Hive (with Shark / Spark SQL)

(4)

Why

● Compatibility

○ Compatible with most popular storage systems on top of HDFS

○ New users needn’t suffer ETL to deploy Spark

(5)

Why

● Easy to use

○ Fluent Scala / Java / Python API

○ Interactive shell

○ 2-5x less code (than Hadoop MapReduce)

(6)

Why

● Easy to use

○ Fluent Scala / Java / Python API

○ Interactive shell

○ 2-5x less code (than Hadoop MapReduce)

sc.textFile("hdfs://...") .flatMap(_ split " ") .map(_ -> 1)

.reduceByKey(_ + _) .collectAsMap()

(7)

Why

● Easy to use

○ Fluent Scala / Java / Python API

○ Interactive shell

○ 2-5x less code (than Hadoop MapReduce)

sc.textFile("hdfs://...") .flatMap(_ split " ") .map(_ -> 1)

.reduceByKey(_ + _) .collectAsMap()

Try to write down WordCount in 30 seconds with Hadoop

MapReduce :-)

(8)

Why

● Unified big data analytics pipeline for

○ Batch / interactive (Spark Core vs MR / Tez)

○ SQL (Shark / Spark SQL vs Hive)

○ Streaming (Spark Streaming vs Storm)

○ Machine learning (MLlib vs Mahout)

○ Graph (GraphX vs Giraph)

Shark (SQL)

Spark Streaming

MLlib (machine learning)

GraphX (graph)

Spark Core

(9)

The Hadoop Family

(10)

The Hadoop Family

General Batching

Specialized systems

Streaming Iterative Ad-hoc / SQL Graph

MapReduce Storm Mahout Pig Giraph

S4 Hive

Samza Drill

Impala

(11)

Using separate frameworks

Big Data Analytics Pipeline Today

HDFS HDFS HDFS

(12)

Using separate frameworks

Big Data Analytics Pipeline Today

HDFS read

HDFS write

HDFS read

HDFS write

HDFS

read Query

HDFS write

Train

ETL

(13)

Using separate frameworks

Big Data Analytics Pipeline Today

HDFS read

HDFS write

HDFS read

HDFS write

HDFS

read Query

HDFS write

Train

ETL

HDFS read

HDFS write

HDFS read

HDFS write

HDFS

read Query

HDFS write

Train

ETL

(14)

Using separate frameworks

Big Data Analytics Pipeline Today

HDFS

read Query

HDFS write

Train

ETL

HDFS read

HDFS write

HDFS read

HDFS write

HDFS

read Query

HDFS write

Train

ETL

(15)

Using separate frameworks

Big Data Analytics Pipeline Today

HDFS read

HDFS write

HDFS read

HDFS write

HDFS

read Query

HDFS write

Train

ETL

How?

HDFS

read Query

HDFS write

Train

ETL

(16)

D A T A H A S

Ahhh!!!

(17)

Resilient Distributed Datasets

● RDDs

○ are read-only, partitioned collections of records

○ can be cached in-memory and are locality aware

○ can only be created through deterministic operations on either (1) data in stable storage, or (2) other

RDDs

(18)

Resilient Distributed Datasets

● Core data sharing abstraction in Spark

● Computation can be represented by lazily

evaluated RDD lineage DAG(s)

(19)

Resilient Distributed Datasets

● An RDD

○ is a read-only, partitioned, locality aware distributed collection of records

○ either points to a stable data source, or

○ is created through deterministic transformations on some other RDD(s)

(20)

Resilient Distributed Datasets

● Coarse grained

Applies the same operation on many records

● Fault tolerant

○ Failed tasks can be recovered in parallel with the help of the lineage DAG

(21)

Resilient Distributed Datasets

(22)

Data Sharing in Spark

● Frequently accessed RDDs can be

materialized and cached in memory

○ Cached RDD can also be replicated for fault tolerance

● Spark scheduler takes cached data locality

into account

(23)

Data Sharing in Spark

Worker node

Executor

TasksTasksTasks

Cluster Manager

Cache

Worker node

Executor

TasksTasksTasks Cache

Driver

SparkContext CacheManager

Scheduler tries to schedule tasks to nodes where stable data source or cached input are available.

Executors report caching information back to driver to improve data locality.

(24)

Data Sharing in Spark

● As a consequence, intermediate results are

○ either cached in memory, or

○ written to local file system as shuffle map output and waiting to be fetched by reducer tasks (similar to

MapReduce)

● Enables in-memory data sharing, avoids

unnecessary HDFS I/O

(25)

Generality of RDDs

● From an expressiveness point of view

○ The coarse grained nature is not a big limitation since many parallel programs naturally apply the

same operation to many records, making them easy to express

○ In fact, RDDs can emulate any distributed system, and will do so efficiently in many cases unless the system is sensitive to network latency

(26)

Generality of RDDs

● From system perspective

○ RDDs gave applications control over the most

common bottleneck resources in cluster applications (i.e., network and storage I/O)

○ Makes it possible to express the same optimizations that specialized systems have and achieve similar performance

(27)

Limitations of RDDs

● Not suitable for

○ Latency (millisecond scale)

○ Communication patterns beyond all-to-all

○ Asynchrony

○ Fine-grained updates

(28)

Analytics Models Built over RDDs

MR / Dataflow

spark

.textFile("hdfs://...") .flatMap(_ split " ") .map(_ -> 1)

.reduceByKey(_ + _) .collectAsMap()

Streaming

val streaming = new StreamingContext(

spark, Seconds(1))

streaming.socketTextStream(host, port) .flatMap(_ split " ")

.map(_ -> 1)

.reduceByKey(_ + _) .print()

streaming.start()

(29)

Analytics Models Built over RDDs

SQL

val hiveContext = new HiveContext(spark) import hiveContext._

val children = hiveql(

"""SELECT name, age FROM people |WHERE age < 13

""".striMargin).map {

case Row(name: String, age: Int) =>

name -> age }.foreach(println)

Machine Learning

val points = spark.textFile("hdfs://...") .map(_.split.map(_.toDouble)).splitAt(1) .map { case (Array(label), features) =>

LabeledPoint(label, features) }

val model = KMeans.train(points)

(30)

Mixing Different Analytics Models

MapReduce & SQL

// ETL with Spark Core

case class Person(name: String, age: Int)

val people = spark.textFile("hdfs://people.txt").map { line =>

val Array(name, age) = line.split(",") Person(name, age.trim.toInt)

}.registerAsTable("people") // Query with Spark SQL

val teenagers = sql("SELECT name FROM people WHERE age >= 13 AND age <= 19") teenagers.collect().foreach(println)

(31)

Mixing Different Analytics Models

SQL & iterative computation (machine learning)

// ETL with Spark SQL & Spark Core

val trainingData = sql("""SELECT e.action, u.age, u.latitude, u.longitude |FROM Users u JOIN Events e ON u.userId = e.userId """.stripMargin).map {

case Row(_, age: Double, latitude: Double, longitude: Double) =>

val features = Array(age, latitude, longitude) LabeledPoint(age, features)

}

// Training with MLlib

val model = new LogisticRegressionWithSGD().run(trainingData)

(32)

Unified Big Data Analytics Pipeline

HDFS HDFS

Core MLlib SQL

RDD RDD

(33)

Unified Big Data Analytics Pipeline

HDFS

read Query

HDFS write

Train

ETL

Core MLlib SQL

RDD RDD

(34)

Unified Big Data Analytics Pipeline

HDFS

read Query

HDFS write

Train

ETL

Core MLlib SQL

(35)

Unified Big Data Analytics Pipeline

HDFS

read Query

HDFS write

Train

ETL

Interactive analysis

(36)

● Unified in-memory data sharing abstraction

● Efficient runtime brings significant

performance boost and enables interactive

analysis

● “One stack to rule them all”

Unified Big Data Analytics Pipeline

HDFS

read Query

HDFS write

Train

ETL

(37)

Thanks!

Q&A

References

Related documents

Information Maintained: Business and/or organization and/or farm name, primary name, secondary name, mailing address, phone number, email, website, legal land description

The main wall of the living room has been designated as a &#34;Model Wall&#34; of Delta Gamma girls -- ELLE smiles at us from a Hawaiian Tropic ad and a Miss June USC

Negative-pressure houses with built-up litter presented higher emission rates during the first rearing week due to the high NH 3 concentration during the brooding period, when

• Display the dependent member • Click the Edit Member Record button • Click Admin to enable the contract fields • Click the Set Responsible Member button.. • Enter

Como veremos más adelante (§ 4.3.2), este enfoque, que con- figura la violencia de género como doblemente unidireccional, respecto a los autores (solo hombres) y a las víctimas

El grado de satisfacción tanto de los pacientes, padres como de los profesionales, indica un efecto positivo y beneficioso de la música en el tratamiento de los niños y adolescentes

10.1 Handle Telecom Server Platform based nodes in OSS-RC 10.2 Explain functions located on Telecom Server Platform 10.3 Explicate the Configuration Management support 10.4

“Borland’s software technology solutions for Linux*, Windows* and Java* combined with Intel’s robust architecture provide an ideal, cost-effec- tive development and deployment