© CY Lin, 2015 Columbia University E6895 Advanced Big Data Analytics — Lecture 3

E6895 Advanced Big Data Analytics Lecture 3: Spark and Data Analytics

Ching-Yung Lin, Ph.D.

Adjunct Professor, Dept. of Electrical Engineering and Computer Science

Mgr., Dept. of Network Science and Big Data Analytics, IBM Watson Research Center

Updated Information on TAs

▪ TAs:

Ghazal Fazelnia <gf2293>, EE; Tuesdays 9:30 - 11:00 am, EE Student Lounge, Mudd 13th floor

Lev Givon <leg22>, EE; Wednesdays 6:00 - 7:30 pm, 8LE5 CEPSR Building

Reference

Spark Stack

Spark Core

Basic functionality of Spark, including components for:

• Task Scheduling

• Memory Management

• Fault Recovery

• Interacting with Storage Systems

• and more

Home to the API that defines resilient distributed datasets (RDDs) - Spark’s main programming abstraction.

An RDD represents a collection of items distributed across many compute nodes that can be manipulated in parallel.

Download Spark

http://spark.apache.org/downloads.html

First language to use — Python

Spark’s Python Shell (PySpark Shell)

bin/pyspark

Test installation

Disable logging

Core Spark Concepts

• At a high level, every Spark application consists of a driver program that launches various parallel operations on a cluster.

• The driver program contains your application’s main function and defines distributed datasets on the cluster, then applies operations to them.

• In the preceding example, the driver program was the Spark shell itself.

• Driver programs access Spark through a SparkContext object, which represents a connection to a computing cluster.

• In the shell, a SparkContext is automatically created as the variable called sc.
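A minimal sketch of how a driver program uses the SparkContext, assuming it is run inside the PySpark shell where sc already exists. The function name count_and_first and the file name README.md are illustrative assumptions; textFile(), count(), and first() are standard RDD calls.

```python
def count_and_first(sc, path):
    """Return (number of lines, first line) of a text file.

    sc   -- an existing SparkContext, e.g. the `sc` variable that the
            PySpark shell creates automatically
    path -- a hypothetical path to a text file
    """
    lines = sc.textFile(path)  # define a distributed dataset (RDD) of lines
    # Both calls below are operations that Spark runs on the cluster.
    return lines.count(), lines.first()
```

In the shell this could be called as count_and_first(sc, "README.md"), provided such a file exists in the working directory.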

Driver Programs

Driver programs typically manage a number of nodes called executors.

If we run the count() operation on a cluster, different machines might count lines in different ranges of the file.

Example filtering

The lambda keyword defines functions inline in Python.
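The slide's original code was not preserved, so the following is only a sketch of the idea: the same predicate written as a named function and as an inline lambda, then passed to an RDD's filter(). The names contains_python and python_lines and the search word are assumptions made for illustration.

```python
def contains_python(line):
    """Named function: True if the line mentions Python."""
    return "Python" in line

# The same predicate written inline with lambda.
contains_python_inline = lambda line: "Python" in line

def python_lines(sc, path):
    """Return an RDD holding only the lines that mention Python.

    sc is an existing SparkContext (the PySpark shell provides one).
    """
    return sc.textFile(path).filter(lambda line: "Python" in line)
```

Either form can be passed to filter(); the lambda is convenient for short, single-expression functions.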

Running as a Standalone Application

Example — word count
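The code shown on this slide was not preserved. As a hedged sketch of a typical word count in PySpark, assuming an existing SparkContext sc and a hypothetical input path: split each line into words, pair each word with 1, and sum the pairs by key. The helper names split_words and word_count are illustrative.

```python
from operator import add

def split_words(line):
    """Split one line of text into whitespace-separated words."""
    return line.split()

def word_count(sc, path):
    """Return a list of (word, count) pairs for the text file at `path`."""
    lines = sc.textFile(path)
    counts = (lines.flatMap(split_words)          # one record per word
                   .map(lambda word: (word, 1))   # pair each word with 1
                   .reduceByKey(add))             # sum the 1s per word
    return counts.collect()                       # bring results to the driver
```

collect() is fine for small outputs; for large results, writing to storage with saveAsTextFile() avoids pulling everything into the driver.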

Resilient Distributed Dataset (RDD) Basics

• An RDD in Spark is an immutable distributed collection of objects.

• Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.

• Users create RDDs in two ways: by loading an external dataset, or by distributing a collection of objects in their driver program.

• Once created, RDDs offer two types of operations: transformations and actions.

<== create RDD

<== transformation

<== action

Transformations and actions are different because of the way Spark computes RDDs.

==> Spark computes RDDs lazily: nothing is computed until the first time an RDD is used in an action.
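The three steps can be sketched as follows, assuming an existing SparkContext sc and a hypothetical text file. The function name longest_line_length is an illustrative choice, not taken from the slides.

```python
def longest_line_length(sc, path):
    """Return the length of the longest line in a text file."""
    lines = sc.textFile(path)    # create RDD (nothing is read yet)
    lengths = lines.map(len)     # transformation: only recorded, not executed
    return lengths.reduce(max)   # action: triggers the actual computation
```

Because the transformation is lazy, Spark sees the whole chain before running it and can read the file and compute the lengths in a single pass.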

Persistence in Spark

• By default, RDDs are computed each time you run an action on them.

• If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using RDD.persist().

• RDD.persist() will then store the RDD contents in memory and reuse them in future actions.

• Persisting RDDs on disk instead of memory is also possible.

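• Not persisting by default may seem unusual, but it makes sense for big data: if an RDD will not be reused, there is no reason to spend storage on it when Spark can stream through the data once and compute the result.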
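A small sketch of reusing one RDD in two actions, assuming an existing SparkContext sc. The function name and the squared-numbers example are illustrative; persist() with no arguments keeps the RDD in memory, and unpersist() releases it.

```python
def count_and_sum_of_squares(sc, numbers):
    """Run two actions on the same RDD without recomputing it."""
    squares = sc.parallelize(numbers).map(lambda x: x * x)
    squares.persist()                            # ask Spark to keep the result
    n = squares.count()                          # first action: computes and stores
    total = squares.reduce(lambda a, b: a + b)   # second action: reuses stored data
    squares.unpersist()                          # free the storage when done
    return n, total
```

Without the persist() call the function returns the same values, but the map would be recomputed for each of the two actions.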

Parallelizing in Spark

The simplest way to create RDDs is to take an existing collection in your program and pass it to SparkContext’s parallelize() method. However, this requires the entire dataset to fit in memory on one machine.
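A minimal sketch of parallelize(), assuming an existing SparkContext sc. The function name and the default of two partitions are illustrative assumptions; the optional second argument of parallelize() sets the number of partitions.

```python
def parallelize_words(sc, words, num_partitions=2):
    """Distribute an in-memory list and report how it was split.

    Returns (number of partitions, the elements collected back).
    """
    rdd = sc.parallelize(words, num_partitions)  # local list -> distributed RDD
    return rdd.getNumPartitions(), rdd.collect()
```

This is handy for prototyping and testing in the shell; for real datasets, loading from external storage avoids holding everything in the driver's memory.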

Spark file loading

Loading and Saving

Loading and Saving (II)

Loading and Saving (III)

File compression options

Example of Spark Programming

Numeric RDD Operations

Removing outliers
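The slide's code was not preserved. One way to sketch the idea, assuming an RDD of floats: call stats() once to get the mean and standard deviation, then filter on distance from the mean. The names is_outlier and split_outliers and the default threshold k=2.0 are assumptions made for illustration.

```python
def is_outlier(value, mean, stdev, k=2.0):
    """True if value lies more than k standard deviations from the mean."""
    return abs(value - mean) > k * stdev

def split_outliers(rdd, k=2.0):
    """Split a numeric RDD into (kept values, outliers) as Python lists."""
    stats = rdd.stats()                    # one pass: count, mean, stdev, ...
    mean, stdev = stats.mean(), stats.stdev()
    outliers = rdd.filter(lambda x: is_outlier(x, mean, stdev, k))
    kept = rdd.filter(lambda x: not is_outlier(x, mean, stdev, k))
    return kept.collect(), outliers.collect()
```

Computing mean and stdev on the driver first means the two filters only capture plain numbers, not the RDD itself.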

Machine Learning Library in Spark — MLlib

An example of using MLlib for a text classification task, e.g., identifying spam emails.

Example: Spam Detection

Feature Extraction Example — TF-IDF
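The slide's example was not preserved. To make the computation concrete, here is a plain-Python sketch of TF-IDF rather than MLlib code: term frequency is the raw count of a term in a document, and the inverse document frequency uses the smoothed form log((N + 1) / (df + 1)), which is an assumption chosen for this sketch. The function name tf_idf is illustrative. In Spark itself, MLlib's HashingTF and IDF classes serve this purpose.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs -- list of token lists, one per document
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))   # count each term once per document
    idf = {term: math.log((n_docs + 1) / (df + 1))
           for term, df in doc_freq.items()}
    return [{term: count * idf[term]
             for term, count in Counter(tokens).items()}
            for tokens in docs]
```

A term that appears in every document gets weight 0 under this formula, while terms concentrated in few documents are weighted up.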

Linear Regression

Classifiers

Decision trees and Random Forests

Clustering

Collaborative Filtering

Dimensionality Reduction

Homework #1

1. Download and install Spark. Learn how to use it.

2. Download the Wikipedia dataset. Extract about 1,000 pages (items) based on your own interest. Create the TF-IDF of each page.

3. Use the Twitter Streaming API to receive real-time Twitter data. Collect 30 minutes of Twitter data on 5 companies using keyword=xxx (e.g., ibm). Treat all Twitter data about one company as one document. Create the TF-IDF of each company’s tweets in those 30 minutes.

4. Use Yahoo Finance to receive stock price data. Collect 30 minutes of finance data on 5 companies, one value per minute. Use the outlier function to display outliers that are more than two standard deviations from the mean.
