• No results found

Introduc8on to Apache Spark

N/A
N/A
Protected

Academic year: 2021

Share "Introduc8on to Apache Spark"

Copied!
23
0
0

Loading.... (view fulltext now)

Full text

(1)

Introduc8on  to  Apache  Spark  

(2)

Analyzing  Data  on  Large  Data  Sets  

Python,  R,  etc.  are  popular  tools  among  data  scien8sts/analysts,  sta8s8cians,  etc.  

Why  are  these  tools  popular?  

Easy  to  learn  

and  

maximizes  produc8vity

 for  data  engineers,  data  scien8sts,  sta8s8cians  

Build  

robust  soOware  

and  do  

interac8ve  data  analysis

   

Large,  diverse  

open  source  development  communi8es  

Comprehensive  libraries

:  data  wrangling,  ML,  visualiza8on,  etc.  

Limita8ons  do  exist:  

Largely  

confined  to  single-­‐node  analysis  

and  

smaller  data  sets

 

Requires  

sampling  or  aggrega8ons  

for  larger  data  

Distributed  tools  compromise  in  various  ways  –  adds  

complexity  and  8me  

(3)

Key  Advances  by  MapReduce:  

Data  Locality:

 Automa8c  split  computa8on  and  launch  of  mappers  appropriately  

Fault-­‐Tolerance:

 Write  out  of  intermediate  results  and  restartable  mappers  meant  ability  to  run  on  

commodity  hardware  

Linear  Scalability:

 Combina8on  of  locality  +  programming  model  that  forces  developers  to  write  

generally  scalable  solu8ons  to  problems  

MapReduce  –  Analysis  on  Large  Data  Sets  (Hadoop)  

Map  

Map  

Map  

Map  

Map  

Map  

Map  

Map  

Map  

Map  

Map  

Map  

(4)

Map  Reduce  is  Not  Perfect  

Map  

Reduce  

Map  

Map  

Reduce  

Map  

Limited  to  map-­‐reduce  paradigm  

Lots  of  I/O  

à

 slower  jobs  

Map  

Reduce  

Map  

Reduce  

Map  

Reduce  

Itera8ve  jobs  (ML)  

à

 even  slower  

(5)
(6)
(7)

Apache  Spark  

Flexible,  in-­‐memory  data  processing  for  Hadoop

 

Easy    

Development  

Flexible  Extensible    

API  

Fast  Batch  &  Stream  

Processing  

Rich  APIs  for  Scala,  

Java,  and  Python  

 

Interac8ve  shell  

APIs  for  different  

types  of  workloads:  

Batch  (MR)  

Streaming  

Machine  Learning  

Graph  

In-­‐Memory  

processing  and  

caching  

(8)

Spark  Basics  

Distributed

 cluster  framework  (like  MR),  running  tasks  in  

parallel

 across  a  

cluster  

Tasks  operate  

in-­‐memory

,  spill  to  disk  when  memory  exceeded.    

Resilient  Distributed  Datasets  (RDD):  

Read-­‐only  

par88oned  collec8on  

of  

records  

RDDs  ac8onable  through  parallel  

transforma8ons

 and  

ac8ons  

Lazy  materializa8on  

op8mizes  resources

 

RDD  

lineage

 from  storage  to  compute  and  

caching

 layer  provides  fault-­‐

tolerance  

Users  control  

persistence  and  par88oning

 

(9)

Fast  Processing  

Using  RAM,  Operator  Graphs  

In-­‐Memory  Caching  

Data  Par88ons  read  from  RAM  

instead  of  disk  

 

Operator  Graphs  

Scheduling  Op8miza8ons

 

Fault  Tolerance  

join  

filter  

groupBy  

B:  

B:  

C:  

D:  

E:  

F:  

Ç√

Ω  

map  

A:  

map  

take  

=  cached  par88on  

=  RDD  

(10)

Logis8c  Regression  Performance    

(Data  Fits  in  Memory)  

0  

500  

1000  

1500  

2000  

2500  

3000  

3500  

4000  

1  

5  

10  

20  

30  

Ru

nn

in

g  

Ti

m

e(

s)

 

#  of  IteraMons  

MapReduce  

Spark  

110  s/itera8on  

First  itera8on  =  80s  

Further  itera8ons  1s  

due  to  caching  

(11)
(12)

Spark  will  replace  MapReduce  

To  become  the  standard  execu8on  engine  for  Hadoop  

 

Spark  

public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text();

public void map(LongWritable key, Text value,

OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString();

StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } }

public static class WorkdCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0;

while (values.hasNext()) { sum += values.next().get(); }

output.collect(key, new IntWritable(sum)); }

Hadoop  MapReduce  

val spark = new SparkContext(master, appName, [sparkHome], [jars])

val file = spark.textFile("hdfs://...")

val counts = file.flatMap(line => line.split(" "))

.

map

(word => (word, 1))

.

reduceByKey

(_ + _)

counts.

saveAsTextFile

("hdfs://...")

(13)

The  Future  of  Data  Processing  on  Hadoop  

Spark  complemented  by  specialized  fit-­‐for-­‐purpose  engines

 

General  Data  Processing  w/

Spark  

Fast  Batch  Processing,  Machine  Learning,    

and  Stream  Processing  

 

AnalyMc  

Database  w/

Impala

 

Low-­‐Latency  

Massively  Concurrent  

Queries  

 

 

 

Full-­‐Text  Search  w/Solr    

Querying  textual  data  

On-­‐Disk  Processing  

w/MapReduce  

Jobs  at  extreme  scale  and  

extremely  disk  IO  intensive    

 

Shared:  

Data  Storage  

Metadata  

Resource  

Management  

Administra8on  

Security  

Governance  

(14)

Easy  Development  

High  Produc8vity  Language  Support  

Na8ve  support  for  mul8ple  

languages  with  iden8cal  APIs  

Scala,  Java,  Python  

Use  of  closures,  itera8ons,  and  

other  common  language  

constructs  to  minimize  code  

2-­‐5x  less  code  

Python

 

lines = sc.textFile(...)

lines.

filter

(

lambda s: “ERROR” in s

).

count

()

Scala  

val lines = sc.textFile(...)

lines.

filter

(

s => s.contains(“ERROR”)

).

count

()

Java  

JavaRDD<String> lines = sc.textFile(...);

lines.

filter

(new Function<String, Boolean>() {

Boolean call(String s) {

return s.contains(

“error”

);

}

(15)

Easy  Development  

Use  Interac8vely  

Interac8ve  explora8on  of  data  

for  data  scien8sts  

No  need  to  develop  

“applica8ons”  

Developers  can  prototype  

applica8on

 on  live  system  

percolateur:spark srowen$ ./bin/spark-shell --master local[*]

...

Welcome to

____ __

/ __/__ ___ _____/ /__

_\ \/ _ \/ _ `/ __/ '_/

/___/ .__/\_,_/_/ /_/\_\ version 1.5.0-SNAPSHOT

/_/

Using Scala version 2.10.4

(Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_51)

Type in expressions to have them evaluated.

Type :help for more information.

...

scala> val words = sc.textFile("file:/usr/share/dict/words")

...

words: org.apache.spark.rdd.RDD[String] =

MapPartitionsRDD[1] at textFile at <console>:21

scala> words.count

...

res0: Long = 235886

scala>

(16)

Easy  Development  

Expressive  API  

map

filter

groupBy

sort

union

join

leftOuterJoin

rightOuterJoin

sample

take

first

partitionBy

mapWith

pipe

save

reduce

count

fold

reduceByKey

groupByKey

cogroup

cross

zip

(17)

Example  

Logis8c  Regression  

data = spark.textFile(...).map(

readPoint

).cache()

w = numpy.random.rand(D)

for i in range(iterations):

gradient = data

.

map

(

lambda p: (1 / (1 + exp(-p.y * w.dot(p.x))))

* p.y * p.x

)

.reduce(

lambda x, y: x + y

)

w -= gradient

(18)

The  Spark  Ecosystem  &  Hadoop

 

Spark  

Streaming  

MLlib  

SparkSQL  

GraphX  

frames  

Data-­‐

SparkR  

STORAGE

 

HDFS,  HBase  

RESOURCE  MANAGEMENT  

YARN  

(19)

One  Plauorm,  Many  Workloads  

Batch,  Interac8ve,  

and  Real-­‐Time.  

Leading  performance  and  

usability  in  one  plauorm.  

• 

End-­‐to-­‐end  analy8c  workflows  

• 

Access  more  data  

• 

Work  with  data  in  new  ways  

• 

Enable  new  users

 

Security  and  Administra8on

 

Process  

Ingest  

Sqoop,  Flume,  

Kaxa,  

Spark  

Streaming  

Transform  

MapReduce,  

Hive,  Pig,  Spark  

Discover  

Analy8c  Database  

Impala  

Search  

Solr  

Model  

Machine  Learning  

SAS,  R,  

Spark

,  

Mahout  

Serve  

NoSQL  Database  

HBase  

Streaming  

Spark  Streaming  

Unlimited  Storage  

HDFS,  HBase  

YARN,  Cloudera  Manager,  

Cloudera  Navigator  

(20)

Cloudera  Customer  Use  Cases  

Core  Spark  

Spark  Streaming  

Poruolio  Risk  Analysis  

ETL  Pipeline  Speed-­‐Up  

20+  years  of  stock  data  

Financial  

Services  

Health  

Iden8fy  disease-­‐causing  genes  in  

the  full  human  genome  

Calculate  Jaccard  scores  on  health  

care  data  sets  

ERP  

Op8cal  Character  Recogni8on  and  Bill  

Classifica8on  

Trend  analysis    

Document  classifica8on  (LDA)  

Fraud  analy8cs  

Data  

Services  

1010  

Online  Fraud  Detec8on  

Financial  

Services  

Ad  Tech  

Real-­‐Time  Ad  Performance  Analysis  

Over  

150  

customers  using  Spark                                          

Spark  clusters  as  large  as  

800

 

nodes  

(21)

Uni8ng  Spark  and  Hadoop  

The  One  Plauorm  Ini8a8ve  Investment  Areas  

Management  

Leverage  Hadoop-­‐na8ve  

resource  management.  

Security  

Full  support  for  Hadoop  security  

and  beyond.  

Scale  

Enable  10k-­‐node  clusters.  

Streaming  

Support  for  80%  of  common  stream  

processing  workloads.  

(22)

Spark  Resources  

Learn  Spark  

O’Reilly  

Advanced  Analy8cs  with  Spark

 eBook  (wri{en  by  Clouderans)  

Cloudera  Developer  Blog

 

cloudera.com/spark  

 

 

Get  Trained  

Cloudera  Spark  Training

 

 

Try  it  Out  

(23)

Thank  You  

[email protected]

 

References

Related documents

It re-constructs actual trips from a Free- Floating Car Sharing System (FFCS), where customers use available cars freely within the city’s limits. The study analyses the

Load error messages from a log into memory, then interactively search for various patterns.. Example:

In “A Hybrid Distributed Collaborative Filtering Recommender Engine Using Apache Spark” paper collaborative filtering using apache spark is explained...

Pat McDonough - Databricks.. Apache Spark spark.incubator.apache.org github.com/apache/incubator-spark [email protected].. INTRODUCTION TO APACHE SPARK..

© 2014 IBM Corporation Cost-based optimization Intent Select Join Product Regex Intent Regex Select Product Join … Intent Join Select Product Regex.. § Declarative

Apache Spark SQL is a distributed framework for structured data analysis and processing. which helps in computation and information on the structure of the data, with use

1) To understand the cluster computing system of Apache Spark and implement on large datasets. 2) To understand the MLlib platform and measure performance metrics such as accuracy,

The library also provides a number of low-level primitives and basic utilities for convex optimization, distributed linear algebra, statistical analysis, and feature extraction,