Introduction to Apache Spark
Analyzing Data on Large Data Sets
• Python, R, etc. are popular tools among data scientists/analysts, statisticians, etc.
• Why are these tools popular?
  • Easy to learn and maximize productivity for data engineers, data scientists, statisticians
  • Build robust software and do interactive data analysis
  • Large, diverse open source development communities
  • Comprehensive libraries: data wrangling, ML, visualization, etc.
• Limitations do exist:
  • Largely confined to single-node analysis and smaller data sets
  • Require sampling or aggregations for larger data
  • Distributed tools compromise in various ways, adding complexity and time
Key Advances by MapReduce:
• Data Locality: Input splits are computed automatically and mappers are launched close to the data they read
• Fault-Tolerance: Writing out intermediate results and making mappers restartable made it possible to run on commodity hardware
• Linear Scalability: Combination of locality + a programming model that forces developers to write generally scalable solutions to problems
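The three phases behind these points can be sketched in a few lines of plain Python. This is only a single-process illustration, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the input splits are hypothetical, and a real cluster would run one mapper per split on the node that stores it.

```python
from collections import defaultdict


def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for line in split for word in line.split()]


def shuffle(mapped_outputs):
    """Group intermediate values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reducer: sum the counts collected for each word."""
    return {key: sum(values) for key, values in groups.items()}


# Two hypothetical input splits. Each mapper depends only on its own split,
# which is what makes a failed mapper safe to restart on another node.
splits = [["spark is fast", "hadoop is robust"], ["spark is flexible"]]
mapped = [map_phase(split) for split in splits]
word_counts = reduce_phase(shuffle(mapped))
```

Because the developer only supplies the mapper and the reducer, adding more splits and more nodes does not change the program, which is the sense in which the model pushes toward scalable solutions.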
MapReduce – Analysis on Large Data Sets (Hadoop)
[Diagram: a grid of parallel Map tasks]
MapReduce is Not Perfect
Limited to map-reduce paradigm
Lots of I/O → slower jobs
Iterative jobs (ML) → even slower
[Diagram: chained Map and Reduce stages, with longer chains for iterative jobs]
Apache Spark
Flexible, in-memory data processing for Hadoop
Easy Development
• Rich APIs for Scala, Java, and Python
• Interactive shell
Flexible Extensible API
• APIs for different types of workloads:
  • Batch (MR)
  • Streaming
  • Machine Learning
  • Graph
Fast Batch & Stream Processing
• In-memory processing and caching
Spark Basics
• Distributed cluster framework (like MR), running tasks in parallel across a cluster
• Tasks operate in-memory and spill to disk when memory is exceeded
• Resilient Distributed Datasets (RDD): Read-only partitioned collection of records
• RDDs are operated on through parallel transformations and actions
• Lazy materialization optimizes resources
• RDD lineage from storage to compute and caching layer provides fault-tolerance
• Users control persistence and partitioning
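A toy, single-process model can make lazy transformations, actions, lineage, and caching concrete. The `ToyRDD` class below is hypothetical and is not the Spark API: it has no partitions and no cluster, and only imitates how transformations are recorded and then evaluated when an action runs.

```python
class ToyRDD:
    """Single-process stand-in for an RDD: read-only, lazy, lineage-tracking."""

    def __init__(self, source, lineage=()):
        self._source = source    # zero-argument callable that yields the records
        self.lineage = lineage   # names of the transformations applied so far
        self._keep = False       # set by cache()
        self._cached = None      # materialized records, once computed

    def map(self, fn):
        # Transformation: returns a new ToyRDD and does no work yet.
        return ToyRDD(lambda: (fn(x) for x in self._records()),
                      self.lineage + ("map",))

    def filter(self, predicate):
        return ToyRDD(lambda: (x for x in self._records() if predicate(x)),
                      self.lineage + ("filter",))

    def cache(self):
        # Ask for the records to be kept in memory after the first action.
        self._keep = True
        return self

    def _records(self):
        if self._cached is not None:
            return iter(self._cached)
        records = self._source()
        if self._keep:
            self._cached = list(records)
            return iter(self._cached)
        return records

    def collect(self):
        # Action: forces evaluation of the whole lineage.
        return list(self._records())

    def count(self):
        return sum(1 for _ in self._records())


calls = []  # records every invocation of square() to make laziness visible


def square(x):
    calls.append(x)
    return x * x


numbers = ToyRDD(lambda: iter(range(10)))
evens = numbers.map(square).filter(lambda x: x % 2 == 0).cache()
calls_before_action = len(calls)      # nothing has been computed yet
result = evens.collect()              # first action runs map and filter
calls_after_first_action = len(calls)
total = evens.count()                 # second action is served from the cache
calls_after_second_action = len(calls)
```

In real Spark the recorded lineage is also what allows a lost partition to be recomputed from its source instead of being replicated.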
Fast Processing
Using RAM, Operator Graphs
In-Memory Caching
• Data partitions read from RAM instead of disk
Operator Graphs
• Scheduling optimizations
• Fault tolerance
[Diagram: operator graph over RDDs labeled A through F, connected by map, join, filter, groupBy, and take; the legend marks RDDs and cached partitions]
Logistic Regression Performance
(Data Fits in Memory)
[Chart: running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30), for MapReduce and Spark]
MapReduce: 110 s/iteration
Spark: first iteration = 80 s, further iterations 1 s due to caching
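The per-iteration figures translate into total running times with simple arithmetic. The sketch below assumes that the 110 s/iteration figure belongs to MapReduce and the 80 s and 1 s figures to Spark, and that every later iteration costs the same; the function names are illustrative.

```python
MAPREDUCE_SECONDS_PER_ITERATION = 110
SPARK_FIRST_ITERATION_SECONDS = 80
SPARK_CACHED_ITERATION_SECONDS = 1


def mapreduce_seconds(iterations):
    """Every iteration re-reads its input, so the cost is the same each time."""
    return MAPREDUCE_SECONDS_PER_ITERATION * iterations


def spark_seconds(iterations):
    """The first iteration loads and caches the data; later ones reuse the cache."""
    if iterations <= 0:
        return 0
    return (SPARK_FIRST_ITERATION_SECONDS
            + SPARK_CACHED_ITERATION_SECONDS * (iterations - 1))


# Iteration counts taken from the chart's horizontal axis.
estimates = {n: (mapreduce_seconds(n), spark_seconds(n)) for n in (1, 5, 10, 20, 30)}
```

Under these assumptions, 30 iterations cost 3300 s on MapReduce against 109 s on Spark, which fits the 0 to 4000 s range of the chart's vertical axis.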
Spark will replace MapReduce
To become the standard execution engine for Hadoop
Hadoop MapReduce
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Both classes are nested inside the job's driver class (not shown).
// Mapper: emits (word, 1) for every token in a line of input.
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

// Reducer: sums the counts emitted for each word.
public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}
Spark
// Bracketed arguments are optional.
val spark = new SparkContext(master, appName, [sparkHome], [jars])
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
The Future of Data Processing on Hadoop
Spark complemented by specialized fit-for-purpose engines
General Data Processing w/Spark
Fast Batch Processing, Machine Learning, and Stream Processing
Analytic Database w/Impala
Low-Latency Massively Concurrent Queries
Full-Text Search w/Solr
Querying textual data
On-Disk Processing w/MapReduce
Jobs at extreme scale and extremely disk IO intensive
Shared:
• Data Storage
• Metadata
• Resource Management
• Administration
• Security
• Governance
Easy Development
High Productivity Language Support
• Native support for multiple languages with identical APIs
  • Scala, Java, Python
• Use of closures, iterations, and other common language constructs to minimize code
• 2-5x less code
Python
lines = sc.textFile(...)
lines.filter(lambda s: "ERROR" in s).count()
Scala
val lines = sc.textFile(...)
lines.filter(s => s.contains("ERROR")).count()
Java
JavaRDD<String> lines = sc.textFile(...);
lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
}).count();
Easy Development
Use Interactively
• Interactive exploration of data for data scientists
  • No need to develop "applications"
• Developers can prototype application on live system
percolateur:spark srowen$ ./bin/spark-shell --master local[*]
...
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.0-SNAPSHOT
      /_/
Using Scala version 2.10.4
(Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)
Type in expressions to have them evaluated.
Type :help for more information.
...
scala> val words = sc.textFile("file:/usr/share/dict/words")
...
words: org.apache.spark.rdd.RDD[String] =
MapPartitionsRDD[1] at textFile at <console>:21
scala> words.count
...
res0: Long = 235886
scala>
Easy Development
Expressive API
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
• sample
• take
• first
• partitionBy
• mapWith
• pipe
• save
• reduce
• count
• fold
• reduceByKey
• groupByKey
• cogroup
• cross
• zip
• …
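Example
Logistic Regression
import numpy
from math import exp

# readPoint, D, and iterations are defined elsewhere; each parsed point has
# a feature vector p.x and a label p.y in {-1, +1}.
data = spark.textFile(...).map(readPoint).cache()
w = numpy.random.rand(D)
for i in range(iterations):
    # Gradient of the logistic loss, summed over all points.
    gradient = data.map(
        lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))) - 1) * p.y * p.x
    ).reduce(lambda x, y: x + y)
    w -= gradient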
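The same map-then-reduce loop can be tried without a cluster. The sketch below swaps Spark and NumPy for plain Python lists and `functools.reduce`, uses a small made-up data set, and applies the standard logistic-loss gradient, (1 / (1 + exp(-y * w·x)) - 1) * y * x; the helper names are illustrative.

```python
from functools import reduce
from math import exp

# Hypothetical, linearly separable toy data: (features, label), label in {-1, +1}.
points = [((1.0, 2.0), 1), ((2.0, 1.5), 1), ((-1.0, -1.5), -1), ((-2.0, -1.0), -1)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def point_gradient(w, point):
    """Gradient of the logistic loss for a single point."""
    x, y = point
    scale = (1 / (1 + exp(-y * dot(w, x))) - 1) * y
    return [scale * xi for xi in x]


def add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]


w = [0.0, 0.0]
for _ in range(20):
    # "map" computes one gradient per point; "reduce" sums them element-wise.
    gradient = reduce(add, map(lambda p: point_gradient(w, p), points))
    w = [wi - gi for wi, gi in zip(w, gradient)]

predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
```

Each pass reads every point again, which is why caching the parsed data set in memory pays off for iterative algorithms like this one.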
The Spark Ecosystem & Hadoop
Spark Streaming, MLlib, SparkSQL, GraphX, DataFrames, SparkR
STORAGE: HDFS, HBase
RESOURCE MANAGEMENT: YARN
One Platform, Many Workloads
Batch, Interactive, and Real-Time.
Leading performance and usability in one platform.
• End-to-end analytic workflows
• Access more data
• Work with data in new ways
• Enable new users
Process
  Ingest: Sqoop, Flume, Kafka, Spark Streaming
  Transform: MapReduce, Hive, Pig, Spark
Discover
  Analytic Database: Impala
  Search: Solr
Model
  Machine Learning: SAS, R, Spark, Mahout
Serve
  NoSQL Database: HBase
  Streaming: Spark Streaming
Unlimited Storage: HDFS, HBase
Security and Administration: YARN, Cloudera Manager, Cloudera Navigator
Cloudera Customer Use Cases
Core Spark
Financial Services
• Portfolio Risk Analysis
• ETL Pipeline Speed-Up
• 20+ years of stock data
Health
• Identify disease-causing genes in the full human genome
• Calculate Jaccard scores on health care data sets
ERP
• Optical Character Recognition and Bill Classification
Data Services
• Trend analysis
• Document classification (LDA)
• Fraud analytics
Spark Streaming
Financial Services
• Online Fraud Detection
Ad Tech
• Real-Time Ad Performance Analysis
Over 150 customers using Spark
Spark clusters as large as 800 nodes
Uniting Spark and Hadoop
The One Platform Initiative Investment Areas
Management
Leverage Hadoop-native resource management.
Security
Full support for Hadoop security and beyond.
Scale
Enable 10k-node clusters.
Streaming
Support for 80% of common stream processing workloads.
Spark Resources
• Learn Spark
  • O’Reilly Advanced Analytics with Spark eBook (written by Clouderans)
  • Cloudera Developer Blog
  • cloudera.com/spark
• Get Trained
  • Cloudera Spark Training
• Try it Out