The Spark Big Data Analytics Platform

(1)

The Spark Big Data Analytics Platform

Amir H. Payberah

[email protected]

Amirkabir University of Technology (Tehran Polytechnic)

1393/10/10

(2)

(3)

I Big Data refers to datasets and flows large enough that has outpaced our capability to

store,process,analyze, andunderstand.

(4)

Where Does

Big Data Come From?

(5)

Big Data Market Driving Factors

Thenumber of web pages indexed by Google, which were around

one millionin 1998, have exceeded one trillionin 2008, and its

expansion is accelerated by appearance of the social networks.∗

∗“Mining big data: current status, and forecast to the future” [Wei Fan et al., 2013]

(6)

Big Data Market Driving Factors

The amount ofmobile data traffic is expected to grow to10.8

Exabyte per month by2016.∗

∗“Worldwide Big Data Technology and Services 2012-2015 Forecast” [Dan Vesset et al., 2013]

(7)

More than 65 billiondeviceswere connected to the Internet by

2010, and this number will go up to230 billion by2020.∗

∗“The Internet of Things Is Coming” [John Mahoney et al., 2013]

(8)

Big Data Market Driving Factors

Many companies are moving towards using Cloud servicesto

access Big Data analytical tools.

(9)

Open source communities

(10)

How To Store and Process

Big Data?

(11)

Scale Up vs. Scale Out (1/2)

I Scale up or scale vertically: adding resources to a single node in a system.

I Scale outor scale horizontally: addingmore nodes to a system.

(12)

Scale Up vs. Scale Out (2/2)

I Scale up: more expensive than scaling out.

I Scale out: more challenging for fault toleranceand software devel-opment.

(13)

Taxonomy of Parallel Architectures

DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. ACM Communications, 35(6), 85-98, 1992.

(14)

Taxonomy of Parallel Architectures

DeWitt, D. and Gray, J. “Parallel database systems: the future of high performance database systems”. ACM Communications, 35(6), 85-98, 1992.

(15)

(16)

Big Data Analytics Stack

(17)

Hadoop Big Data Analytics Stack

(18)

Spark Big Data Analytics Stack

(19)

Big Data - File systems

I Traditional file-systems are not well-designed for large-scale data processing systems.

I Efficiency has a higher priority than other features, e.g., directory service.

I Massive size of data tends to store it across multiple machinesin a distributed way.

I HDFS/GFS, Amazon S3, ...

(20)

Big Data - File systems

(21)

Big Data - File systems

(22)

Big Data - File systems

(23)

Big Data - Database

I Relational Databases Management Systems (RDMS) were not

de-signed to be distributed.

I NoSQL databasesrelax one or more of theACIDproperties: BASE

I Different data models: key/value,column-family,graph,document.

I Hbase/BigTable, Dynamo, Scalaris, Cassandra, MongoDB,

Volde-mort, Riak, Neo4J, ...

(24)

Big Data - Database

I NoSQLdatabases relax one or more of theACIDproperties: BASE

(25)

Big Data - Database

(26)

Big Data - Database

(27)

Big Data - Resource Management

I Different frameworks require different computing resources.

I Large organizations need the ability to share data and resources

between multiple frameworks.

I Resource managementshare resources in a cluster betweenmultiple

frameworks while providing resourceisolation.

I Mesos, YARN, Quincy, ...

(28)

Big Data - Resource Management

(29)

Big Data - Resource Management

(30)

Big Data - Resource Management

(31)

Big Data - Execution Engine

I Scalable and fault tolerance parallel data processing on clusters of unreliable machines.

I Data-parallel programming model for clusters of commodity

ma-chines.

I MapReduce, Spark, Stratosphere, Dryad, Hyracks, ...

(32)

Big Data - Execution Engine

ma-chines.

(33)

Big Data - Execution Engine

ma-chines.

(34)

Big Data - Query/Scripting Language

I Low-level programming of execution engines, e.g., MapReduce, is

noteasy for end users.

I Need high-level language to improve the query capabilities of exe-cution engines.

I It translatesuser-definedfunctions tolow-levelAPI of the execution engines.

I Pig, Hive, Shark, Meteor, DryadLINQ, SCOPE, ...

(35)

Big Data - Query/Scripting Language

(36)

Big Data - Query/Scripting Language

(37)

Big Data - Query/Scripting Language

(38)

Big Data - Stream Processing

I Providing users with freshandlow latencyresults.

I Database Management Systems (DBMS) vs. Data Stream

Man-agement Systems (DSMS)

I Storm, S4, SEEP, D-Stream, Naiad, ...

(39)

Big Data - Stream Processing

(40)

Big Data - Stream Processing

(41)

Big Data - Graph Processing

I Many problems are expressed using graphs: sparse computational

dependencies, andmultiple iterationsto converge.

I Data-parallel frameworks, such as MapReduce, are not ideal for

these problems: slow

I Graph processing frameworks are optimized for graph-based

prob-lems.

I Pregel, Giraph, GraphX, GraphLab, PowerGraph, GraphChi, ...

(42)

Big Data - Graph Processing

prob-lems.

(43)

Big Data - Graph Processing

prob-lems.

(44)

Big Data - Graph Processing

prob-lems.

(45)

Big Data - Machine Learning

I Implementing and consuming machine learning techniques at scale

are difficult tasks for developers and end users.

I There exist platforms that address it by providing scalable machine-learning and data mining libraries.

I Mahout, MLBase, SystemML, Ricardo, Presto, ...

(46)

Big Data - Machine Learning

(47)

Big Data - Machine Learning

(48)

Big Data - Configuration and Synchronization Service

I A means to synchronize distributed applications accesses to shared

resources.

I Allows distributed processes to coordinate with each other.

I Zookeeper, Chubby, ...

(49)

Big Data - Configuration and Synchronization Service

resources.

(50)

Big Data - Configuration and Synchronization Service

resources.

(51)

Outline

I Introduction to HDFS

I Data processing with MapReduce

I Introduction to Scala

I Data exploration using Spark

I Stream processing with Spark Streaming

I Graph analytics withGraphX

(52)

(53)

What is Filesystem?

I Controls how data is storedin and retrievedfromdisk.

(54)

What is Filesystem?

I Controls how data is storedin and retrievedfromdisk.

(55)

Distributed Filesystems

I When data outgrowsthe storage capacity of a singlemachine:

par-tition it across a number of separatemachines.

I Distributed filesystems: manage the storage across a network of

machines.

(56)

HDFS

I HadoopDistributedFileSystem

I Appears as a singledisk

I Runs on top of a nativefilesystem, e.g., ext3

I Fault tolerant: can handle disk crashes, machine crashes, ...

I Based on Google’s filesystem GFS

(57)

HDFS is Good for ...

I Storing large files

• _{Terabytes, Petabytes, etc...} • _{100MB or more per file.}

I Streaming data access

• _{Data is}_{written once}_and_{read many times}_.

• _{Optimized for batch reads rather than}_random_reads.

I Cheapcommodity hardware

• _{No need for super-computers, use less reliable commodity hardware.}

(58)

HDFS is Not Good for ...

I Low-latency reads

• _{High-throughput}_{rather than}_{low latency}_for_{small chunks}_{of data.} • _HBase_{addresses this issue.}

I Large amount of small files

• _{Better for millions of}_large_{files instead of billions of}_small_files.

I Multiple writers

• _{Single writer}_{per file.}

• _{Writes only at the}_{end of file}_{, no-support for arbitrary offset.}

(59)

HDFS Daemons (1/2)

I HDFS cluster is manager by three types of processes.

I Namenode

• _{Manages the}_filesystem_{, e.g., namespace, meta-data, and file blocks} • _{Metadata is stored in}_memory_.

I Datanode

• _Stores_and_retrieves_{data blocks} • _Reports_{to Namenode}

• _{Runs on}_many_machines

I Secondary Namenode

• _{Only for}_{checkpointing}_. • _{Not a backup}_{for Namenode}

(60)

HDFS Daemons (1/2)

I Namenode

I Datanode

(61)

HDFS Daemons (1/2)

I Namenode

I Datanode

(62)

HDFS Daemons (1/2)

I Namenode

I Datanode

(63)

HDFS Daemons (2/2)

(64)

Files and Blocks (1/2)

I Files are split intoblocks.

I Blocks

• _Single_unit _{of storage: a contiguous piece of information on a disk.} • _Transparent_{to user.}

• _{Managed by}_Namenode_{, stored by}_Datanode_.

• _{Blocks are traditionally either}_64MB_or_128MB_{: default is}_64MB_.

(65)

Files and Blocks (2/2)

I Same block isreplicated on multiple machines: default is3

• _{Replica placements are}_{rack aware}_. • _1st_{replica on the}_{local rack}_.

• _2nd_{replica on the}_{local rack but different machine}_. • _3rd_{replica on the} _{different rack}_.

I Namenodedetermines replica placement.

(66)

HDFS Client

I Client interacts with Namenode

• _To _update_{the Namenode namespace.}

• _To _{retrieve block locations}_{for writing and reading.}

I Client interacts directly withDatanode

• _To _read_and_write_data_.

I Namenode doesnotdirectly write or read data.

(67)

HDFS Write

I 1. Create a new filein the Namenode’s Namespace; calculate block topology.

I 2, 3, 4. Stream datato the first, second and third node.

I 5, 6, 7. Success/failureacknowledgment.

(68)

HDFS Read

I 1. Retrieveblock locations.

I 2, 3. Read blocksto re-assemble the file.

(69)

Namenode Memory Concerns

I For fast accessNamenode keeps all block metadata in-memory.

• _{Will work well for clusters of}_{100 machines}_.

I Changingblock sizewill affect how muchspacea cluster can host.

• _64MB_to _128MB_{will reduce the number of blocks and increase the} space that Namenode can support.

(70)

HDFS Federation

I Hadoop 2+

I Each Namenode will host part of the blocks.

I A Block Poolis a set of blocks that belong to a single namespace.

I Support for 1000+ machineclusters.

(71)

Namenode Fault-Tolerance (1/2)

I Namenode is asingle point of failure.

I If Namenode crashes then cluster is down.

I Secondary Namenode periodically merges the namespaceimageand

log and a persistentrecord of it written to disk (checkpointing).

I But, the state of thesecondary Namenodelagsthat of theprimary: does notprovide high-availability of the filesystem

(72)

Namenode Fault-Tolerance (1/2)

I Namenode is asingle point of failure.

I If Namenode crashes then cluster is down.

I Secondary Namenode periodically merges the namespaceimageand

log and a persistentrecord of it written to disk (checkpointing).

I But, the state of thesecondary Namenodelagsthat of theprimary: doesnotprovide high-availability of the filesystem

(73)

Namenode Fault-Tolerance (2/2)

I High availability Namenode.

• _{Hadoop 2+}

• _{Active standby}_{is always running and takes over in case main}

Namen-ode fails.

(74)

HDFS Installation and Shell

(75)

HDFS Installation

I Three options

• _{Local (Standalone) Mode} • _{Pseudo-Distributed Mode} • _{Fully-Distributed Mode}

(76)

Installation - Local

I Default configuration after the download.

I Executes as a single Java process.

I Works directly with localfilesystem.

I Useful for debugging.

(77)

Installation - Pseudo-Distributed (1/6)

I Still runs on a single node.

I Each daemon runs in itsown Java process.

• _Namenode • _{Secondary Namenode} • _Datanode I Configuration files: • _{hadoop-env.sh} • _{core-site.xml} • _{hdfs-site.xml}

(78)

Installation - Pseudo-Distributed (2/6)

I Specify environment variables in hadoop-env.sh

export JAVA_HOME=/opt/jdk1.7.0_51

(79)

I Specify location of Namenodein core-site.sh

<name>fs.defaultFS</name>

<value>hdfs://localhost:8020</value> <description>NameNode URI</description> </property>

(80)

I Configurations of Namenode in hdfs-site.sh

I Pathon the local filesystem where theNamenode storesthe

names-pace and transaction logs persistently.

<name>dfs.namenode.name.dir</name>

<value>/opt/hadoop-2.2.0/hdfs/namenode</value> <description>description...</description> </property>

(81)

Installation - Pseudo-Distributed (5/6)

I Configurations of Secondary Namenode in hdfs-site.sh

I Path on the local filesystem where theSecondary Namenode stores

the temporary images to merge.

<name>dfs.namenode.checkpoint.dir</name> <value>/opt/hadoop-2.2.0/hdfs/secondary</value> <description>description...</description> </property>

(82)

I Configurations of Datanode in hdfs-site.sh

I Comma separated list ofpathson the local filesystem of aDatanode

where it should store its blocks.

<name>dfs.datanode.data.dir</name>

<value>/opt/hadoop-2.2.0/hdfs/datanode</value> <description>description...</description> </property>

(83)

Start HDFS and Test

I Format the Namenode directory (do this only once, thefirst time).

hdfs namenode -format

I Startthe Namenode, Secondary namenode and Datanodedaemons.

hadoop-daemon.sh start namenode

hadoop-daemon.sh start secondarynamenode hadoop-daemon.sh start datanode

jps

I Verify the deamons are running:

• _Namenode: _{http://localhost:50070}

• _{Secondary Namenode:} _{http://localhost:50090} • _Datanode: _{http://localhost:50075}

(84)

Start HDFS and Test

jps

(85)

Start HDFS and Test

jps

(86)

HDFS Shell

hdfs dfs -<command> -<option> <path>

hdfs dfs -ls / hdfs dfs -ls file:///home/big hdfs dfs -ls hdfs://localhost/ hdfs dfs -cat /dir/file.txt hdfs dfs -cp /dir/file1 /otherDir/file2 hdfs dfs -mv /dir/file1 /dir2/file2 hdfs dfs -mkdir /newDir

hdfs dfs -put file.txt /dir/file.txt # can also use copyFromLocal

hdfs dfs -get /dir/file.txt file.txt # can also use copyToLocal

hdfs dfs -rm /dir/fileToDelete hdfs dfs -help

(87)

HDFS Shell

hdfs dfs -<command> -<option> <path> hdfs dfs -ls / hdfs dfs -ls file:///home/big hdfs dfs -ls hdfs://localhost/ hdfs dfs -cat /dir/file.txt hdfs dfs -cp /dir/file1 /otherDir/file2 hdfs dfs -mv /dir/file1 /dir2/file2 hdfs dfs -mkdir /newDir

hdfs dfs -put file.txt /dir/file.txt # can also use copyFromLocal

hdfs dfs -get /dir/file.txt file.txt # can also use copyToLocal

hdfs dfs -rm /dir/fileToDelete hdfs dfs -help

(88)

(89)

MapReduce

I A shared nothing architecture for processing large data sets with a

parallel/distributed algorithm onclusters.

(90)

MapReduce Definition

I A programming model: to batch process large data sets (inspired

byfunctional programming).

I An execution framework: to run parallel algorithms on clusters of

commodity hardware.

(91)

MapReduce Definition

I A programming model: to batch process large data sets (inspired

byfunctional programming).

I An execution framework: to run parallel algorithms on clusters of

commodity hardware.

(92)

Simplicity

I Don’t worry about parallelization,fault tolerance,data distribution, and load balancing(MapReducetakes care of these).

I Hide system-level details from programmers.

Simplicity!

(93)

Programming Model

(94)

MapReduce Dataflow

I map function: processes data and generates a set of intermediate

key/value pairs.

I reduce function: merges all intermediate values associated with the same intermediate key.

(95)

Example: Word Count

I Consider doing a word count of the following file using MapReduce:

Hello World Bye World

Hello Hadoop Goodbye Hadoop

(96)

Example: Word Count -

map

I The mapfunction reads in words one a time and outputs(word, 1)

for each parsed input word.

I The mapfunction outputis:

(Hello, 1) (World, 1) (Bye, 1) (World, 1) (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

(97)

Example: Word Count -

shuffle

I The shuffle phase between map and reduce phase creates a list of

values associated with each key.

I The reducefunction inputis:

(Bye, (1)) (Goodbye, (1)) (Hadoop, (1, 1) (Hello, (1, 1)) (World, (1, 1))

(98)

Example: Word Count -

reduce

I The reduce function sums the numbers in the list for each key and

outputs(word, count)pairs.

I The output of the reduce function is the output of the MapReduce

job: (Bye, 1) (Goodbye, 1) (Hadoop, 2) (Hello, 2) (World, 2)

(99)

Combiner Function (1/2)

I In some cases, there is significantrepetitionin theintermediate keys

produced by eachmaptask, and thereducefunction iscommutative

and associative. Machine 1: (Hello, 1) (World, 1) (Bye, 1) (World, 1) Machine 2: (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

(100)

Combiner Function (2/2)

I Users can specify an optionalcombiner function to merge partially

data before it is sent over the network to the reduce function.

I Typically the same code is used to implement both the combiner

and the reducefunction.

Machine 1: (Hello, 1) (World, 2) (Bye, 1) Machine 2: (Hello, 1) (Hadoop, 2) (Goodbye, 1)

(101)

map

public static class MyMap extends Mapper<...> {

private final static IntWritable one = new IntWritable(1);

private Text word = new Text();

public void map(LongWritable key, Text value, Context context)

throws IOException, InterruptedException {

String line = value.toString();

StringTokenizer tokenizer = new StringTokenizer(line);

while (tokenizer.hasMoreTokens()) {

word.set(tokenizer.nextToken());

context.write(word, one); }

} }

(102)

Example: Word Count -

reduce

public static class MyReduce extends Reducer<...> {

public void reduce(Text key, Iterator<...> values, Context context)

throws IOException, InterruptedException { int sum = 0;

while (values.hasNext())

sum += values.next().get();

context.write(key, new IntWritable(sum)); }

}

(103)

driver

public staticvoid main(String[] args) throws Exception {

Configuration conf = new Configuration();

Job job = new Job(conf, "wordcount");

job.setOutputKeyClass(Text.class);

job.setOutputValueClass(IntWritable.class);

job.setMapperClass(MyMap.class);

job.setCombinerClass(MyReduce.class);

job.setReducerClass(MyReduce.class);

job.setInputFormatClass(TextInputFormat.class);

job.setOutputFormatClass(TextOutputFormat.class);

FileInputFormat.addInputPath(job, new Path(args[0]));

FileOutputFormat.setOutputPath(job, new Path(args[1]));

job.waitForCompletion(true); }

(104)

Example: Word Count -

Compile and Run

(1/2)

# start hdfs

> hadoop-daemon.sh start namenode > hadoop-daemon.sh start datanode

# make the input folder in hdfs

> hdfs dfs -mkdir -p input

# copy input files from local filesystem into hdfs

> hdfs dfs -put file0 input/file0 > hdfs dfs -put file1 input/file1 > hdfs dfs -ls input/

input/file0 input/file1

> hdfs dfs -cat input/file0 Hello World Bye World > hdfs dfs -cat input/file1 Hello Hadoop Goodbye Hadoop

(105)

Example: Word Count -

Compile and Run

(2/2)

> mkdir wordcount_classes > javac -classpath $HADOOP_HOME/share/hadoop/common/hadoop-common-2.2.0.jar: $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.2.0.jar: $HADOOP_HOME/share/hadoop/common/lib/commons-cli-1.2.jar -d wordcount_classes sics/WordCount.java

> jar -cvf wordcount.jar -C wordcount_classes/ . > hadoop jar wordcount.jar sics.WordCount input output > hdfs dfs -ls output output/part-00000 > hdfs dfs -cat output/part-00000 Bye 1 Goodbye 1 Hadoop 2 Hello 2 World 2

(106)

Execution Engine

(107)

MapReduce Execution (1/7)

I The user program divides the input files intoM splits.

• _{A typical size of a split is the size of a}_HDFS_{block (64 MB).} • _{Converts them to}_key/value_pairs.

I It starts up many copies of the program on a cluster of machines.

J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, ACM Communications 51(1), 2008. Amir H. Payberah (Tehran Polytechnic) Spark 1393/10/10 74 / 171

(108)

MapReduce Execution (2/7)

I One of the copies of the program ismaster, and the rest areworkers.

I The masterassigns works to theworkers.

• _{It picks} _idle_{workers and assigns each one a}_map_{task or a}_reduce task.

(109)

MapReduce Execution (3/7)

I A map workerreads the contents of the corresponding input splits.

I It parses key/value pairs out of the input data and passes each pair to the user defined map function.

I Theintermediate key/valuepairs produced by the mapfunction are

buffered in memory.

J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters”, ACM Communications 51(1), 2008.

(110)

MapReduce Execution (4/7)

I The buffered pairs areperiodically written tolocal disk.

• _{They are partitioned into}_{R regions} ₍_{hash(key) mod R}_).

I Thelocationsof the buffered pairs on the local disk are passed back to the master.

I The masterforwards these locations to thereduce workers.

(111)

MapReduce Execution (5/7)

I A reduce workerreads the buffered data from the local disks of the map workers.

I When a reduce worker has read all intermediate data, it sorts it by theintermediate keys.

(112)

MapReduce Execution (6/7)

I The reduce worker iterates over the intermediate data.

I For each unique intermediate key, it passes the key and the

cor-responding set of intermediate values to the user defined reduce

function.

I The output of the reduce function is appended to afinal output file

for this reduce partition.

(113)

MapReduce Execution (7/7)

I When all map tasks and reduce tasks have been completed, the

master wakes up theuser program.

(114)

Hadoop MapReduce and HDFS

(115)

Fault Tolerance

I Onworker failure:

• _{Detect failure via} _{periodic heartbeats}_. • _Re-execute_in-progress_map_and_reduce _tasks.

• _Re-execute_completed_map_{tasks: their output is stored on the local} disk of the failed machine and is therefore inaccessible.

• _Completed _reduce _{tasks do not need to be re-executed since their} output is stored in a global filesystem.

I Onmaster failure:

• _{State is periodically}_checkpointed_{: a new copy of master starts from} the last checkpoint state.

(116)

(117)

Scala

I Scala: scalable language

I A blend ofobject-oriented and functional programming

I Runs on the Java Virtual Machine

I Designed by Martin Odersky atEPFL

(118)

Functional Programming Languages

I In a restricted sense: a language that does not have mutable vari-ables,assignments, or imperative controlstructures.

I In awidersense: it enables the construction of programs thatfocus on functions.

I Functions arefirst-classcitizens:

• _{Defined anywhere}_{(including inside other functions).} • _{Passed as parameters}_{to functions and}_{returned as results}_. • _Operators_{to compose functions.}

(119)

Functional Programming Languages

I In a restricted sense: a language that does not have mutable vari-ables,assignments, or imperative controlstructures.

I In awidersense: it enables the construction of programs thatfocus on functions.

I Functions arefirst-classcitizens:

• _{Defined anywhere}_{(including inside other functions).} • _{Passed as parameters}_{to functions and}_{returned as results}_. • _Operators_{to compose functions.}

(120)

Scala Variables

I Values: immutable

I Variables: mutable

var myVar: Int = 0

val myVal: Int = 1

I Scala data types:

• _{Boolean, Byte, Short, Char, Int, Long, Float, Double, String}

(121)

If ... Else

var x = 30; if (x == 10) { println("Value of X is 10"); } else if (x == 20) { println("Value of X is 20"); } else {

println("This is else statement"); }

(122)

Loop

var a = 0 var b = 0 for (a <- 1 to 3; b <- 1 until 3) { println("Value of a: " + a + ", b: " + b ) }

// loop with collections

val numList = List(1, 2, 3, 4, 5, 6)

for (a <- numList) {

println("Value of a: " + a) }

(123)

Functions

def functionName([list of parameters]): [return type] = {

function body

return [expr] }

def addInt(a: Int, b: Int): Int = {

var sum: Int = 0

sum = a + b sum

}

println("Returned Value: " + addInt(5, 7))

(124)

Anonymous Functions

I Lightweight syntax for defining functions.

var mul = (x: Int, y: Int) => x * y println(mul(3, 4))

(125)

Higher-Order Functions

def apply(f: Int => String, v: Int) = f(v)

def layout(x: Int) = "[" + x.toString() + "]"

println(apply(layout, 10))

(126)

Collections (1/2)

I Array: fixed-sizesequential collection of elements of thesame type

val t = Array("zero", "one", "two")

val b = t(0) // b = zero

I List: sequential collection of elements of the same type

val t = List("zero", "one", "two")

I Set: sequential collection of elements of the same type without

duplicates

val t = Set("zero", "one", "two")

val t.contains("zero")

(127)

Collections (1/2)

duplicates

(128)

Collections (1/2)

duplicates

(129)

Collections (2/2)

I Map: collection ofkey/value pairs

val m = Map(1 -> "sics", 2 -> "kth")

val b = m(1) // b = sics

I Tuple: Afixed number of items ofdifferent types together

val t = (1, "hello")

val b = t._1 // b = 1

val c = t._2 // c = hello

(130)

Collections (2/2)

I Map: collection ofkey/value pairs

val m = Map(1 -> "sics", 2 -> "kth")

val b = m(1) // b = sics

I Tuple: Afixed number of items ofdifferent types together

val t = (1, "hello")

val b = t._1 // b = 1

val c = t._2 // c = hello

(131)

Functional Combinators

I map: applies a function over each element in the list

val numbers = List(1, 2, 3, 4)

numbers.map(i => i * 2) // List(2, 4, 6, 8)

I flatten: it collapses one level of nested structure

List(List(1, 2), List(3, 4)).flatten // List(1, 2, 3, 4)

I flatMap: map + flatten

I foreach: it is like map but returns nothing

(132)

Classes and Objects

class Calculator {

val brand: String = "HP"

def add(m: Int, n: Int): Int = m + n

}

val calc = new Calculator

calc.add(1, 2)

println(calc.brand)

I A singleton is a class that can have only one instance.

object Test {

def main(args: Array[String]) { ... } }

Test.main(null)

(133)

Classes and Objects

class Calculator {

val brand: String = "HP"

def add(m: Int, n: Int): Int = m + n

}

val calc = new Calculator

calc.add(1, 2)

println(calc.brand)

I A singleton is a class that can have only one instance.

object Test {

def main(args: Array[String]) { ... } }

Test.main(null)

(134)

Case Classes and Pattern Matching

I Case classesare used to store and match on the contents of a class.

I They are designed to be used with pattern matching.

I You can construct them without using new.

case class Calc(brand: String, model: String)

def calcType(calc: Calc) = calc match {

case Calc("hp", "20B") => "financial" case Calc("hp", "48G") => "scientific" case Calc("hp", "30B") => "business" case _ => "Calculator of unknown type"

}

calcType(Calc("hp", "20B"))

(135)

Simple Build Tool (SBT)

I An open source build tool for Scala and Java projects.

I Similar to Java’sMaven or Ant.

I It is written inScala.

(136)

SBT - Hello World!

// make dir hello and edit Hello.scala

object Hello {

def main(args: Array[String]) {

println("Hello world.") }

}

$ cd hello

$ sbt compile run

(137)

Common Commands

I compile: compiles the main sources.

I run <argument>*: run the main class.

I package: creates a jar file.

I console: starts the Scala interpreter.

I clean: deletes all generated files.

I help <command>: displays detailed help for the specified command.

(138)

Create a Simple Project

I Create project directory.

I Create src/main/scala directory.

I Create build.sbtin the project root.

(139)

build.sbt

I A list of Scala expressions, separated by blank lines.

I Located in the project’s base directory.

$ cat build.sbt name := "hello"

version := "1.0"

scalaVersion := "2.10.4"

(140)

Add Dependencies

I Add inbuild.sbt.

I Module ID format:

"groupID" %% "artifact" % "version" % "configuration"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

// multiple dependencies libraryDependencies ++= Seq(

"org.apache.spark" %% "spark-core" % "1.0.0",

"org.apache.spark" %% "spark-streaming" % "1.0.0"

)

I sbt uses the standard Maven2 repository by default, but you can add

moreresolvers.

resolvers += "Akka Repository" at "http://repo.akka.io/releases/"

(141)

Scala Hands-on Exercises (1/3)

I Declare a list of integers as a variable called myNumbers

val myNumbers = List(1, 2, 5, 4, 7, 3)

I Declare a function,pow, that computes the second power of an Int

def pow(a: Int): Int = a * a

(142)

Scala Hands-on Exercises (1/3)

(143)

Scala Hands-on Exercises (1/3)

(144)

(145)

Scala Hands-on Exercises (2/3)

I Apply the function to myNumbers using themap function

myNumbers.map(x => pow(x))

// or

myNumbers.map(pow(_))

// or

myNumbers.map(pow)

I Write the powfunction inline in amap call, using closure notation

myNumbers.map(x => x * x)

I Iterate through myNumbers and print out its items

for (i <- myNumbers)

println(i)

// or

myNumbers.foreach(println)

(146)

Scala Hands-on Exercises (2/3)

// or

myNumbers.map(pow)

println(i)

// or

(147)

Scala Hands-on Exercises (2/3)

// or

myNumbers.map(pow)

println(i)

// or

(148)

Scala Hands-on Exercises (2/3)

// or

myNumbers.map(pow)

println(i)

// or

(149)

Scala Hands-on Exercises (2/3)

// or

myNumbers.map(pow)

println(i)

// or

(150)

Scala Hands-on Exercises (2/3)

// or

myNumbers.map(pow)

println(i)

// or

(151)

Scala Hands-on Exercises (3/3)

I Declare a list of pair of string and integers as a variable calledmyList

val myList = List[(String, Int)](("a", 1), ("b", 2), ("c", 3))

I Write an inline function to increment the integer values of the list

myList

val x = v.map { case (name, age) => age + 1 }

// or

val x = v.map(i => i._2 + 1)

// or

val x = v.map(_._2 + 1)

(152)

myList

// or

val x = v.map(i => i._2 + 1)

// or

val x = v.map(_._2 + 1)

(153)

Scala Hands-on Exercises (3/3)

myList

// or

val x = v.map(i => i._2 + 1)

// or

val x = v.map(_._2 + 1)

(154)

myList

// or

val x = v.map(i => i._2 + 1)

// or

val x = v.map(_._2 + 1)

(155)

(156)

What is Spark?

I An efficient distributedgeneral-purpose data analysis platform.

I Focusing on easeof programming andhigh performance.

(157)

Motivation

I MapReduce programming model has not been designed forcomplex

operations, e.g., data mining.

I Very expensive, i.e., always goes to disk and HDFS.

(158)

Solution

I Extends MapReduce with moreoperators.

I Support for advanceddata flow graphs.

I In-memoryandout-of-core processing.

(159)

Spark vs. Hadoop

(160)

Spark vs. Hadoop

(161)

Spark vs. Hadoop

(162)

Spark vs. Hadoop

(163)

Resilient Distributed Datasets (RDD) (1/2)

I A distributed memoryabstraction.

I Immutable collectionsof objectsspread across a cluster.

(164)

I A distributed memoryabstraction.

I Immutable collectionsof objectsspread across a cluster.

(165)

I An RDD is divided into a number of partitions, which are atomic

pieces of information.

I Partitions of an RDD can be stored on different nodesof a cluster.

(166)

RDD Operators

I Higher-order functions: transformationsandactions.

I Transformations: lazy operators that createnewRDDs.

I Actions: launch a computation and return a value to the program

or write data to the external storage.

(167)

Transformations vs. Actions

(168)

RDD Transformations -

Map

I All pairs areindependently processed.

// passing each element through a function.

val nums = sc.parallelize(Array(1, 2, 3))

val squares = nums.map(x => x * x) // {1, 4, 9}

(169)

RDD Transformations -

Map

I All pairs areindependently processed.

// passing each element through a function.

val squares = nums.map(x => x * x) // {1, 4, 9}

(170)

GroupBy

I Pairs withidentical keyare grouped.

I Groups are independently processed.

val schools = sc.parallelize(Seq(("sics", 1), ("kth", 1), ("sics", 2)))

schools.groupByKey()

// {("sics", (1, 2)), ("kth", (1))}

schools.reduceByKey((x, y) => x + y)

// {("sics", 3), ("kth", 1)}

(171)

RDD Transformations -

GroupBy

I Pairs withidentical keyare grouped.

I Groups are independently processed.

val schools = sc.parallelize(Seq(("sics", 1), ("kth", 1), ("sics", 2)))

schools.groupByKey()

// {("sics", (1, 2)), ("kth", (1))}

schools.reduceByKey((x, y) => x + y)

// {("sics", 3), ("kth", 1)}

(172)

Join

I Performs anequi-join on the key.

I Join candidates are independently

pro-cessed.

val list1 = sc.parallelize(Seq(("sics", "10"), ("kth", "50"), ("sics", "20")))

val list2 = sc.parallelize(Seq(("sics", "upsala"), ("kth", "stockholm")))

list1.join(list2)

// ("sics", ("10", "upsala")) // ("sics", ("20", "upsala")) // ("kth", ("50", "stockholm"))

(173)

RDD Transformations -

Join

I Performs anequi-join on the key.

I Join candidates are independently

pro-cessed.

val list1 = sc.parallelize(Seq(("sics", "10"), ("kth", "50"), ("sics", "20")))

val list2 = sc.parallelize(Seq(("sics", "upsala"), ("kth", "stockholm")))

list1.join(list2)

// ("sics", ("10", "upsala")) // ("sics", ("20", "upsala")) // ("kth", ("50", "stockholm"))

(174)

Basic RDD Actions

I Return all the elements of the RDD as an array.

nums.collect() // Array(1, 2, 3)

I Return an array with the first n elements of the RDD.

nums.take(2) // Array(1, 2)

I Return the number of elements in the RDD.

nums.count() // 3

I Aggregate the elements of the RDD using the given function.

nums.reduce((x, y) => x + y) // 6

(175)

Basic RDD Actions

nums.count() // 3

(176)

Basic RDD Actions

nums.count() // 3

(177)

Basic RDD Actions

nums.count() // 3

(178)

Creating RDDs

I Turn a collection into an RDD.

val a = sc.parallelize(Array(1, 2, 3))

I Load text file from local FS, HDFS, or S3.

val a = sc.textFile("file.txt")

val b = sc.textFile("directory/*.txt")

val c = sc.textFile("hdfs://namenode:9000/path/file")

(179)

SparkContext

I Main entrypoint to Spark functionality.

I Available in shellas variablesc.

I Instandaloneprograms, you should make yourown.

import org.apache.spark.SparkContext import org.apache.spark.SparkContext._

val sc = new SparkContext(master, appName, [sparkHome], [jars])

(180)

Spark Hands-on Exercises (1/3)

I Read data from a text file and create an RDD namedpagecounts.

val pagecounts= sc.textFile("hamlet")

I Get the first 10 lines of the text file.

pagecounts.take(10).foreach(println)

I Count the total records in the data set pagecounts.

pagecounts.count

(181)

Spark Hands-on Exercises (1/3)

pagecounts.count

(182)

Spark Hands-on Exercises (1/3)

pagecounts.count

(183)

pagecounts.count

(184)

Spark Hands-on Exercises (1/3)

pagecounts.count

(185)

pagecounts.count

(186)

Spark Hands-on Exercises (2/3)

I Filter the data setpagecounts and return the items that have the

word this, and cache in the memory.

val linesWithThis = pagecounts.filter(line => line.contains("this")).cache

\\ or

val linesWithThis = pagecounts.filter(_.contains("this")).cache

I Find the lines with the most number of words.

linesWithThis.map(line => line.split(" ").size) .reduce((a, b) => if (a > b) a else b)

I Count the total number of words.

val wordCounts= linesWithThis.flatMap(line => line.split(" ")).count

\\ or

val wordCounts= linesWithThis.flatMap(_.split(" ")).count

(187)

Spark Hands-on Exercises (2/3)

val linesWithThis= pagecounts.filter(line => line.contains("this")).cache

\\ or

val linesWithThis= pagecounts.filter(_.contains("this")).cache

I Find the lines with the most number of words.

linesWithThis.map(line => line.split(" ").size) .reduce((a, b) => if (a > b) a else b)

I Count the total number of words.

val wordCounts= linesWithThis.flatMap(line => line.split(" ")).count

\\ or

val wordCounts= linesWithThis.flatMap(_.split(" ")).count

(188)

Spark Hands-on Exercises (2/3)

val linesWithThis= pagecounts.filter(line => line.contains("this")).cache

\\ or

val lin