Introduc8on to Apache Spark

(1)

Introduc8on to Apache Spark

(2)

Analyzing Data on Large Data Sets

•

Python, R, etc. are popular tools among data scien8sts/analysts, sta8s8cians, etc.

•

Why are these tools popular?

•

Easy to learn

and

maximizes produc8vity

for data engineers, data scien8sts, sta8s8cians

•

Build

robust soOware

and do

interac8ve data analysis

•

Large, diverse

open source development communi8es

•

Comprehensive libraries

: data wrangling, ML, visualiza8on, etc.

•

Limita8ons do exist:

•

Largely

conﬁned to single-‐node analysis

and

smaller data sets

•

Requires

sampling or aggrega8ons

for larger data

•

Distributed tools compromise in various ways – adds

complexity and 8me

(3)

Key Advances by MapReduce:

•

Data Locality:

Automa8c split computa8on and launch of mappers appropriately

•

Fault-‐Tolerance:

Write out of intermediate results and restartable mappers meant ability to run on

commodity hardware

•

Linear Scalability:

Combina8on of locality + programming model that forces developers to write

generally scalable solu8ons to problems

MapReduce – Analysis on Large Data Sets (Hadoop)

Map

(4)

Map Reduce is Not Perfect

Map

Reduce

Map

Reduce

Map

Limited to map-‐reduce paradigm

Lots of I/O

à

slower jobs

Map

Reduce

Map

Reduce

Map

Reduce

Itera8ve jobs (ML)

à

even slower

(5)

(6)

(7)

Apache Spark

Flexible, in-‐memory data processing for Hadoop

Easy

Development

Flexible Extensible

API

Fast Batch & Stream

Processing

•

Rich APIs for Scala,

Java, and Python

•

Interac8ve shell

•

APIs for diﬀerent

types of workloads:

•

Batch (MR)

•

Streaming

•

Machine Learning

•

Graph

•

In-‐Memory

processing and

caching

(8)

Spark Basics

•

Distributed

cluster framework (like MR), running tasks in

parallel

across a

cluster

•

Tasks operate

in-‐memory

, spill to disk when memory exceeded.

•

Resilient Distributed Datasets (RDD):

Read-‐only

par88oned collec8on

of

records

•

RDDs ac8onable through parallel

transforma8ons

and

ac8ons

•

Lazy materializa8on

op8mizes resources

•

RDD

lineage

from storage to compute and

caching

layer provides fault-‐

tolerance

•

Users control

persistence and par88oning

(9)

Fast Processing

Using RAM, Operator Graphs

In-‐Memory Caching

•

Data Par88ons read from RAM

instead of disk

Operator Graphs

•

Scheduling Op8miza8ons

•

Fault Tolerance

join

ﬁlter

groupBy

B:

C:

D:

E:

F:

Ç√

Ω

map

A:

map

take

= cached par88on

= RDD

(10)

Logis8c Regression Performance

(Data Fits in Memory)

0

500

1000

1500

2000

2500

3000

3500

4000

1

5

10

20

30

Ru

nn

in

g

Ti

m

e(

s)

# of IteraMons

MapReduce

Spark

110 s/itera8on

First itera8on = 80s

Further itera8ons 1s

due to caching

(11)

(12)

Spark will replace MapReduce

To become the standard execu8on engine for Hadoop

Spark

public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text();

public void map(LongWritable key, Text value,

OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString();

StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } }

public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0;

while (values.hasNext()) { sum += values.next().get(); }

output.collect(key, new IntWritable(sum)); }

Hadoop MapReduce

val spark = new SparkContext(master, appName, [sparkHome], [jars])

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))

.

map

(word => (word, 1))

.

reduceByKey

(_ + _)

counts.

saveAsTextFile

("hdfs://...")

(13)

The Future of Data Processing on Hadoop

Spark complemented by specialized ﬁt-‐for-‐purpose engines

General Data Processing w/

Spark

Fast Batch Processing, Machine Learning,

and Stream Processing

AnalyMc

Database w/

Impala

Low-‐Latency

Massively Concurrent

Queries

Full-‐Text Search w/Solr

Querying textual data

On-‐Disk Processing

w/MapReduce

Jobs at extreme scale and

extremely disk IO intensive

Shared:

•

Data Storage

•

Metadata

•

Resource

Management

•

Administra8on

•

Security

•

Governance

(14)

Easy Development

High Produc8vity Language Support

•

Na8ve support for mul8ple

languages with iden8cal APIs

•

Scala, Java, Python

•

Use of closures, itera8ons, and

other common language

constructs to minimize code

•

2-‐5x less code

Python

lines = sc.textFile(...)

lines.

filter

(

lambda s: “ERROR” in s

).

count

()

Scala

val lines = sc.textFile(...)

lines.

filter

(

s => s.contains(“ERROR”)

).

count

()

Java

JavaRDD<String> lines = sc.textFile(...);

lines.

filter

(new Function<String, Boolean>() {

Boolean call(String s) {

return s.contains(

“error”

);

}

(15)

Easy Development

Use Interac8vely

•

Interac8ve explora8on of data

for data scien8sts

•

No need to develop

“applica8ons”

•

Developers can prototype

applica8on

on live system

percolateur:spark srowen$ ./bin/spark-shell --master local[*]

...

Welcome to

____ __

/ __/__ ___ _____/ /__

_\ \/ _ \/ _ `/ __/ '_/

/___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT

/_/

Using Scala version 2.10.4

(Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)

Type in expressions to have them evaluated.

Type :help for more information.

...

scala> val words = sc.textFile("file:/usr/share/dict/words")

...

words: org.apache.spark.rdd.RDD[String] =

MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count

...

res0: Long = 235886

scala>

(16)

Easy Development

Expressive API

•

map

•

filter

•

groupBy

•

sort

•

union

•

join

•

leftOuterJoin

•

rightOuterJoin

•

sample

•

take

•

first

•

partitionBy

•

mapWith

•

pipe

•

save

•

…

•

reduce

•

count

•

fold

•

reduceByKey

•

groupByKey

•

cogroup

•

cross

•

zip

(17)

Example

Logis8c Regression

data = spark.textFile(...).map(

readPoint

).cache()

w = numpy.random.rand(D)

for i in range(iterations):

gradient = data

.

map

(

lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))))

* p.y * p.x

)

.reduce(

lambda x, y: x + y

)

w -= gradient

(18)

The Spark Ecosystem & Hadoop

Spark

Streaming

MLlib

SparkSQL

GraphX

frames

Data-‐

SparkR

STORAGE

HDFS, HBase

RESOURCE MANAGEMENT

YARN

(19)

One Plauorm, Many Workloads

Batch, Interac8ve,

and Real-‐Time.

Leading performance and

usability in one plauorm.

• 

End-‐to-‐end analy8c workﬂows

• 

Access more data

• 

Work with data in new ways

• 

Enable new users

Security and Administra8on

Process

Ingest

Sqoop, Flume,

Kaxa,

Spark

Streaming

Transform

MapReduce,

Hive, Pig, Spark

Discover

Analy8c Database

Impala

Search

Solr

Model

Machine Learning

SAS, R,

Spark

,

Mahout

Serve

NoSQL Database

HBase

Streaming

Spark Streaming

Unlimited Storage

HDFS, HBase

YARN, Cloudera Manager,

Cloudera Navigator

(20)

Cloudera Customer Use Cases

Core Spark

Spark Streaming

•

Poruolio Risk Analysis

•

ETL Pipeline Speed-‐Up

•

20+ years of stock data

Financial

Services

Health

•

Iden8fy disease-‐causing genes in

the full human genome

•

Calculate Jaccard scores on health

care data sets

ERP

•

Op8cal Character Recogni8on and Bill

Classiﬁca8on

•

Trend analysis

•

Document classiﬁca8on (LDA)

•

Fraud analy8cs

Data

Services

1010

•

Online Fraud Detec8on

Financial

Services

Ad Tech

•

Real-‐Time Ad Performance Analysis

Over

150

customers using Spark

Spark clusters as large as

800

nodes

(21)

Uni8ng Spark and Hadoop

The One Plauorm Ini8a8ve Investment Areas

Management

Leverage Hadoop-‐na8ve

resource management.

Security

Full support for Hadoop security

and beyond.

Scale

Enable 10k-‐node clusters.

Streaming

Support for 80% of common stream

processing workloads.

(22)

Spark Resources

•

Learn Spark

•

O’Reilly

Advanced Analy8cs with Spark

eBook (wri{en by Clouderans)

•

Cloudera Developer Blog

•

cloudera.com/spark

•

Get Trained

•

Cloudera Spark Training

•

Try it Out

(23)

Thank You

[email protected]