B490 Mining the Big Data. 0 Introduction


Qin Zhang


Data Mining

What is Data Mining?

A “definition”: Discovery of useful, possibly unexpected, patterns in data.

I don’t think this is practical until the day machines have intelligence. (You may have a different opinion.)

I think, most of the time, people just mean to:

• Compute some functions defined on the data (efficient algorithms).
• Fit data into some concrete models.

In this course we will focus on efficient algorithms. In particular, we will discuss:

• Finding similar items
• Mining frequent items
• Clustering (aggregate similar items)
• Link analysis

Big Data

• (company logo): over 2.5 petabytes of sales transactions
• (company logo): an index of over 19 billion web pages
• (company logo): over 40 billion pictures
• . . .


Big data is everywhere

Nature ’06, Nature ’08, CACM ’08, Economist ’10

Source and Challenge

• Retailer databases: Amazon, Walmart
• Logistics, financial & health data: stock prices
• Social networks: Facebook, Twitter
• Pictures taken by mobile devices: iPhone
• Internet traffic: IP addresses
• New forms of scientific data: Large Synoptic Survey Telescope

Source

• Volume
• Velocity
• Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data, . . . )

Challenge


What does Big Data Really Mean?

We don’t define Big Data in terms of TB, PB, EB, . . .

The data is too big to fit in memory. What can we do?

• Process the items one by one as they arrive, and throw some of them away on the fly.
• Store the data on multiple machines, which collaborate via communication.

The RAM model does not fit:

• A processor and an infinite-size memory.
• Probing each cell of the memory has unit cost.

Data Streams

The data stream model (Alon, Matias & Szegedy 1996)

[Figure: packets stream through a router whose CPU has only limited space (RAM).]

Applications: Internet routers; stock data, ad auctions, flight logs on tapes, etc.

The router wants to maintain some statistics on the data, e.g., to detect anomalies for security.

Widely used: Stanford Stream, Aurora, Telegraph, NiagaraCQ, . . .
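To make "maintaining statistics in limited space" concrete, here is a minimal Python sketch. It keeps a running mean and variance over a stream in O(1) memory using Welford's one-pass method, which is a technique swapped in for illustration and not something the slides name. The class `RunningStats` and the packet sizes are hypothetical.

```python
class RunningStats:
    """One-pass mean and (population) variance in O(1) memory (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of items seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        """Process one stream item; the item itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n else 0.0


# Hypothetical packet sizes (bytes) arriving one at a time.
packet_sizes = [1500, 40, 40, 1500, 576, 40]

stats = RunningStats()
for size in packet_sizes:
    stats.update(size)

print(stats.n, stats.mean, stats.variance())
```

Only three numbers are kept regardless of how long the stream is, which is the kind of space budget the data stream model has in mind.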

Difficulty: See and forget!

Game 1: A sequence of numbers

52, 45, 18, 23, 17, 41, 33, 29, 49, 12, 35

Q: What’s the median?

A: 33
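For Game 1, the following Python sketch computes the exact median of the eleven numbers shown on the slides. It is only an illustration of why the game is hard: the straightforward exact approach has to remember every item it has seen, which is precisely what a small-memory streaming algorithm cannot afford.

```python
import statistics

# The numbers of Game 1, in the order they were shown.
stream = [52, 45, 18, 23, 17, 41, 33, 29, 49, 12, 35]

seen = []
for x in stream:
    seen.append(x)  # nothing can be forgotten if we want the exact median

median = statistics.median(seen)
print(sorted(seen))
print(median)
```

The memory used here grows linearly with the stream length, so for a truly large stream one would have to settle for an approximate answer.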

Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul

[Figure: pairs of friends are shown one at a time.]

Q: Are Eva and Bob connected by friends?

A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
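Game 2 can be answered with memory proportional to the number of people instead of the number of friendships. The Python sketch below uses a union-find structure (a technique chosen here for illustration; the slides do not name one). The four friendships are the ones that appear in the answer path; the order in which they arrive is an assumption, and any other pairs shown on the slides are not known.

```python
def find(parent, x):
    """Return the representative of x's group, shortening paths on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the groups containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[ra] = rb


people = ["Alice", "Bob", "Carol", "Dave", "Eva", "Paul"]
parent = {p: p for p in people}  # one pointer per person

# Friendships taken from the answer path; the arrival order is hypothetical.
edge_stream = [("Dave", "Alice"), ("Eva", "Carol"), ("Alice", "Bob"), ("Carol", "Dave")]

for a, b in edge_stream:
    union(parent, a, b)  # each pair is seen once and then forgotten

connected = find(parent, "Eva") == find(parent, "Bob")
print(connected)
```

Each friendship is processed once and discarded; only the `parent` table survives between items.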

MapReduce

The MapReduce model (Dean & Ghemawat 2004)

[Figure: Input → Map → Shuffle → Reduce → Output]

Standard model in industry for massive data computation, e.g., Hadoop.

Map: for each value x_i,

x_i → {(key₁, v₁), (key₂, v₂), . . .}

Reduce: for each key,

{(key₁, v₁), (key₁, v₂), . . .} → {y₁, y₂, . . .}

Goal
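As an illustration of the Map, Shuffle and Reduce steps, here is a single-machine Python sketch of word counting. It only mimics the data flow and is not how Hadoop is invoked; the function names `map_fn`, `shuffle`, `reduce_fn` and `map_reduce` are hypothetical.

```python
from collections import defaultdict


def map_fn(line):
    """Map: one input value -> a list of (key, value) pairs."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    """Reduce: all values of one key -> one output."""
    return key, sum(values)


def map_reduce(inputs):
    pairs = [kv for x in inputs for kv in map_fn(x)]
    groups = shuffle(pairs)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


counts = map_reduce(["big data", "big models", "data streams"])
print(counts)
```

In a real deployment the map and reduce calls run on different machines and the shuffle is the communication step between them.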

ActiveDHT

The ActiveDHT model (Bahmani, Chowdhury & Goel 2010)

[Figure: a ring of hash values 0 to 15; one machine is responsible for keys with hash = 4, 5 and another for keys with hash = 6, 7.]

• Update (key, at)
• Query (key)

Used in Yahoo! S4 & Twitter Storm.
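The partitioning idea can be sketched in a few lines of Python. This is a toy, single-process stand-in under several assumptions: the class `ToyActiveDHT` is hypothetical, keys are hashed into 16 values with CRC32, each of 8 simulated machines owns two consecutive hash values (so hashes 4, 5 and 6, 7 land on different machines, as in the figure), and an update is interpreted as adding an amount to a per-key counter.

```python
import zlib


class ToyActiveDHT:
    """Toy sketch: each simulated machine owns a range of hash values."""

    def __init__(self, num_machines=8, hash_space=16):
        self.num_machines = num_machines
        self.hash_space = hash_space
        self.stores = [dict() for _ in range(num_machines)]

    def machine_for(self, key):
        h = zlib.crc32(key.encode()) % self.hash_space
        # Consecutive hash values map to the same machine.
        return h * self.num_machines // self.hash_space

    def update(self, key, amount):
        """Assumed semantics: add `amount` to the counter stored for `key`."""
        store = self.stores[self.machine_for(key)]
        store[key] = store.get(key, 0) + amount

    def query(self, key):
        return self.stores[self.machine_for(key)].get(key, 0)


dht = ToyActiveDHT()
dht.update("alice", 2)
dht.update("alice", 3)
print(dht.query("alice"), dht.query("bob"))
```

Both operations touch only the machine that owns the key, which is what lets updates and queries proceed in parallel across machines.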

Tentative course plan

Part 0 : Introductions

Part 1 : Finding Similar Items

– Jaccard Similarity and Min-Hashing
– Locality Sensitive Hashing (LSH) and Distances
– Implementing LSH in ActiveDHT

Part 2 : Clustering

– Hierarchical Clustering
– Assignment-based Clustering (k-center, k-means, k-median)
– Spectral Clustering

Part 3 : Mining Frequent Items

– Finding Frequent Itemsets
– Finding Frequent Items in Data Streams

Part 4 : Link Analysis

– Markov Chain Basics

Resources

There is no official textbook for the class.

Background on Randomized Algorithms:

• Probability and Computing, by Mitzenmacher and Upfal

Main reference book:

• Mining Massive Data Sets

Instructors

Instructor: Qin Zhang

Email: [email protected]

Office hours: By email appointment

Assistant Instructor: Prasanth Velamala

Email: [email protected]

Grading

Assignments 50%: There will be several homework assignments. Solutions should be typeset in LaTeX (highly recommended) or Word.

Project 50%: The project consists of three components:

1. Write a proposal.
2. Write a report.
3. Make a presentation.

(Details will be posted online.)

Use A, B, . . . for each item (assignments or projects). Final


Most important thing:

Learn something about models / algorithmic techniques


LaTeX

LaTeX: a highly recommended tool for assignments/reports

1. Read the wiki article: http://en.wikipedia.org/wiki/LaTeX
2. Find a good LaTeX editor.
3. Learn how to use it, e.g., read “The Not So Short Introduction to LaTeX 2ε” (Google it).

Prerequisites

One is expected to know the basics of algorithm design and analysis, probability, and programming, e.g., to have taken

• (Math) M365 “Introduction to Probability and Statistics”,
• (Math) M301 “Linear Algebra and Applications”,
• (CS) C241 “Discrete Structures for Computer Science”,
• (CS) B403 “Introduction to Algorithm Design and Analysis”,

or equivalent courses.

I will NOT start with things like big-O notation or the definitions of random variables and expectation. But please ask at any time if you don’t understand something.

Possible project topics

Part 1 : Finding Similar Items

– Locality Sensitive Hashing: Given a dictionary of a large number of documents (or other objects) and a set of query docs, find, for each query doc, all docs in the dictionary that are similar to it. Compare LSH with other methods that you can think of (e.g., the trivial one: compare the query with each of the docs in the dictionary) in terms of running time.

Part 2 : Clustering

– Assignment-based Clustering (k-center, k-means, k-median): Select clustering algorithms taught in class and run them on large data sets. One can also compare them with hierarchical clustering.

Part 3 : Mining Frequent Items

– Finding Frequent Itemsets: Run the A-priori algorithm on large data sets to find frequent itemsets.

– Finding Frequent Items in Data Stream: Implement streaming algorithms taught in class, and run them on large data sets to find frequent items.


Approximation and Randomization

Approximation

Return $\hat{f}(A)$ instead of $f(A)$, where

$$|f(A) - \hat{f}(A)| \le \epsilon f(A).$$

Then $\hat{f}(A)$ is a $(1+\epsilon)$-approximation of $f(A)$.

Randomization

Return $\hat{f}(A)$ instead of $f(A)$, where

$$\Pr\big[\,|f(A) - \hat{f}(A)| \le \epsilon f(A)\,\big] \ge 1 - \delta.$$

Then $\hat{f}(A)$ is a $(1+\epsilon, \delta)$-approximation of $f(A)$.
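The randomized guarantee can be checked empirically. In the Python sketch below, f(A) is the number of ones in a 0/1 array and the estimator looks at only a random sample; the estimator `estimate_count`, the data, and the parameter values (ε = 0.1, δ = 0.05, sample size 2000) are all assumptions picked for illustration.

```python
import random


def estimate_count(data, predicate, sample_size, rng):
    """Estimate how many items satisfy `predicate` from a uniform sample."""
    hits = sum(1 for _ in range(sample_size) if predicate(rng.choice(data)))
    return len(data) * hits / sample_size


rng = random.Random(0)
data = [i % 2 for i in range(10_000)]   # hypothetical input A
true_count = sum(data)                  # f(A)

eps, delta, trials = 0.1, 0.05, 200
failures = 0
for _ in range(trials):
    est = estimate_count(data, lambda v: v == 1, 2000, rng)  # \hat f(A)
    if abs(true_count - est) > eps * true_count:
        failures += 1

failure_rate = failures / trials
print(true_count, failure_rate)
```

If the observed failure rate stays below δ, the estimator behaves like a (1 + ε, δ)-approximation on this input, even though each run reads only a fifth of the data.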

Markov and Chebyshev inequalities

Markov Inequality

Let X ≥ 0 be a random variable. Then for all a > 0,

$$\Pr[X \ge a] \le \frac{\mathbf{E}[X]}{a}.$$

Chebyshev’s Inequality

Let X be a random variable. Then for all a > 0,

$$\Pr\big[\,|X - \mathbf{E}[X]| \ge a\,\big] \le \frac{\mathbf{Var}[X]}{a^2}.$$
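Chebyshev’s inequality follows from Markov’s inequality in one step. The derivation below is the standard argument, written out here because it is not on the slides: apply Markov to the non-negative random variable $(X - \mathbf{E}[X])^2$ with threshold $a^2$.

```latex
\begin{align*}
\Pr\big[\,|X - \mathbf{E}[X]| \ge a\,\big]
  &= \Pr\big[\,(X - \mathbf{E}[X])^2 \ge a^2\,\big] \\
  &\le \frac{\mathbf{E}\big[(X - \mathbf{E}[X])^2\big]}{a^2}
   = \frac{\mathbf{Var}[X]}{a^2}.
\end{align*}
```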

Application: Birthday Paradox

Birthday Paradox

In a set of k randomly chosen people, what is the probability that at least one pair of them have the same birthday?

Assume each person’s birthday is chosen uniformly at random from the n = 365 days between Jan. 1 and Dec. 31.

Take 1: For any pair of people, the probability that they have the same birthday is $1/n$. For $k$ people, we have $\binom{k}{2}$ pairs of people. The probability that none of them have the same birthday is $(1 - 1/n)^{\binom{k}{2}}$. Thus the answer is $1 - (1 - 1/n)^{\binom{k}{2}}$. (This treats the pairs as independent, which they are not, so Take 1 is only an approximation.)

Take 2: $1 - \frac{n-0}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdot \ldots \cdot \frac{n-(k-1)}{n}$

$\Pr[\text{exists collision}] \approx k^2/(2n)$
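The two takes and the rough estimate k²/(2n) are easy to compare numerically. The Python sketch below evaluates all three for n = 365; the function names `take1`, `take2` and `approx` are hypothetical labels for the formulas on the slides.

```python
from math import comb


def take1(k, n=365):
    """Take 1: treat the C(k, 2) pairs as if they were independent."""
    return 1 - (1 - 1 / n) ** comb(k, 2)


def take2(k, n=365):
    """Take 2: exact probability that some two of k people collide."""
    p_all_distinct = 1.0
    for i in range(k):
        p_all_distinct *= (n - i) / n
    return 1 - p_all_distinct


def approx(k, n=365):
    """Rough estimate k^2 / (2n); meaningful only while it is well below 1."""
    return k * k / (2 * n)


for k in (5, 10, 23):
    print(k, round(take1(k), 4), round(take2(k), 4), round(approx(k), 4))
```

For k = 23 the exact value from Take 2 is just above one half, which is the usual statement of the paradox.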

Application: Coupon Collector

Coupon Collector

Suppose that each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize.

Assuming that the coupon in each box is chosen independently and uniformly at random from the n possibilities, how many boxes of cereal must you buy before you obtain at least one of every type of coupon?
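One way to get a feel for the coupon collector question is to simulate it. The Python sketch below compares the simulated average with n·(1 + 1/2 + . . . + 1/n), the standard closed form for the expected number of boxes; that formula is a known result and is not stated on the slides. The choice n = 10, the number of trials and the seed are assumptions for illustration.

```python
import random


def boxes_needed(n, rng):
    """Buy boxes until all n coupon types have been seen; return the count."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # coupon type of the next box
        boxes += 1
    return boxes


def expected_boxes(n):
    """n * H_n, where H_n is the n-th harmonic number."""
    return n * sum(1 / i for i in range(1, n + 1))


rng = random.Random(1)
n, trials = 10, 5000
average = sum(boxes_needed(n, rng) for _ in range(trials)) / trials
print(average, expected_boxes(n))
```

Since H_n grows like ln n, the expected number of boxes is roughly n ln n.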


The Union Bound

Consider t possibly dependent random events X₁, . . . , X_t. The probability that all events occur is at least

$$1 - \sum_{i=1}^{t} \Pr[X_i \text{ does not occur}].$$
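The bound comes from looking at the complementary events. The short derivation below is the standard argument, added here as a sketch; $\overline{X_i}$ denotes the event that $X_i$ does not occur.

```latex
\Pr\Big[\bigcap_{i=1}^{t} X_i\Big]
  = 1 - \Pr\Big[\bigcup_{i=1}^{t} \overline{X_i}\Big]
  \ge 1 - \sum_{i=1}^{t} \Pr\big[\overline{X_i}\big].
```

For example, if each of the t events fails with probability at most δ/t, then all of them occur with probability at least 1 − δ, with no independence assumption needed.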

Summary for the introduction

We have discussed Big Data and Data Mining

We have introduced three popular models for modern computation.

We have talked about the course plan and assessment.
