
B490 Mining the Big Data. 0 Introduction


Qin Zhang

B490 Mining the Big Data

(4)

Data Mining

What is Data Mining?

A “definition”: Discovery of useful, possibly unexpected, patterns in data.

I don’t think this is practical until the day machines have intelligence. (You may have a different opinion.)

I think, most of the time, people just mean to

• Compute some functions defined on the data (efficient algorithms).

• Fit data into some concrete models

(8)

In this course, we will talk about

. . .

In this course we will focus on efficient algorithms.

In particular, we will discuss

• Finding similar items

• Mining frequent items

• Clustering (aggregate similar items)

• Link analysis

(11)

Big Data

• [company logo]: over 2.5 petabytes of sales transactions

• [company logo]: an index of over 19 billion web pages

• [company logo]: over 40 billion pictures

• . . .

Big data is everywhere: Nature ’06, Nature ’08, CACM ’08, Economist ’10

(14)

Source and Challenge

Source

• Retailer databases: Amazon, Walmart

• Logistics, financial & health data: stock prices

• Social networks: Facebook, Twitter

• Pictures by mobile devices: iPhone

• Internet traffic: IP addresses

• New forms of scientific data: Large Synoptic Survey Telescope

Challenge

• Volume

• Velocity

• Variety (documents, stock records, personal profiles, photographs, audio & video, 3D models, location data, . . . )

(18)

What does Big Data really mean?

We don’t define Big Data in terms of TB, PB, EB, . . .

The data is too big to fit in memory. What can we do?

• Process the items one by one as they arrive, and throw some of them away on the fly.

• Store the data on multiple machines, which collaborate via communication.

The RAM model does not fit:

• A processor and an infinite-size memory

• Probing each cell of the memory has a unit cost
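As one possible illustration of "process items one by one and throw some of them away on the fly", here is a minimal sketch of reservoir sampling. The slide does not name this technique; it is used here only as a standard example. The function name `reservoir_sample` and its parameters are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are stored at any time; every other item is seen once
    and then forgotten. (Illustrative sketch, not taken from the slides.)
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # The first k items fill the reservoir.
            reservoir.append(item)
        else:
            # Item i replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # A generator stands in for data that never sits in memory all at once.
    print(reservoir_sample((x * x for x in range(1_000_000)), k=5, seed=42))
```

The memory used depends only on k, not on the length of the stream, which is the point of the streaming setting.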

(21)

Data Streams

The data stream model (Alon, Matias & Szegedy 1996)

[Figure: packets stream through a router whose CPU has only limited space (RAM).]

Applications

• Internet routers: the router wants to maintain some statistics on the data, e.g., to detect anomalies for security.

• Stock data, ad auctions, flight logs on tapes, etc.

Widely used: Stanford Stream, Aurora, Telegraph, NiagaraCQ, . . .

(22)

Difficulty: See and forget!

Game 1: A sequence of numbers, shown one at a time

52, 45, 18, 23, 17, 41, 33, 29, 49, 12, 35

Q: What’s the median?

A: 33

Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul, shown one at a time

[Figure: the individual friendships shown on these slides were not recoverable.]

Q: Are Eva and Bob connected by friends?

A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
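Game 2 can be answered without remembering every friendship: it is enough to remember which people are in the same group. The sketch below uses a union-find structure, which is one standard way to do this and is not taken from the slides. The edge list is an assumption: it contains only the four friendships implied by the answer path, in an arbitrary order, since the pairs actually shown were lost.

```python
class UnionFind:
    """Tracks connected groups while edges arrive one at a time."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Unknown names start as their own group.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b

    def connected(self, a, b):
        return self.find(a) == self.find(b)


# Hypothetical stream of friendships (assumed, see the note above).
friendships = [("Eva", "Carol"), ("Dave", "Alice"),
               ("Carol", "Dave"), ("Alice", "Bob")]

uf = UnionFind()
for a, b in friendships:
    uf.union(a, b)  # each edge is seen once and then forgotten

print(uf.connected("Eva", "Bob"))
```

The structure stores one entry per person rather than one per friendship, so its size grows with the number of people, not with the length of the stream.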

(52)

MapReduce

The MapReduce model (Dean & Ghemawat 2004)

Standard model in industry for massive data computation, e.g., Hadoop.

Input → Map → Shuffle → Reduce → Output

• Map: for each value $x_i$, $x_i \to \{(key_1, v_1), (key_2, v_2), \ldots\}$

• Shuffle: pairs with the same key are brought together

• Reduce: $\{(key_1, v_1), (key_1, v_2), \ldots\} \to \{y_1, y_2, \ldots\}$

Goal
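The three phases can be imitated on a single machine to see what each one does. The following is a minimal sketch, assuming word counting as the task; it does not use Hadoop or any real MapReduce API, and all function names are made up for illustration.

```python
from collections import defaultdict


def map_phase(inputs, mapper):
    """Map: each input x_i produces a list of (key, value) pairs."""
    pairs = []
    for x in inputs:
        pairs.extend(mapper(x))
    return pairs


def shuffle_phase(pairs):
    """Shuffle: bring together all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Reduce: turn each key's list of values into an output."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_count_mapper(line):
    return [(word, 1) for word in line.lower().split()]


def word_count_reducer(word, counts):
    return sum(counts)


def run_word_count(lines):
    pairs = map_phase(lines, word_count_mapper)
    groups = shuffle_phase(pairs)
    return reduce_phase(groups, word_count_reducer)


if __name__ == "__main__":
    print(run_word_count(["big data", "Big models"]))
```

In a real deployment the map and reduce calls run on different machines and the shuffle is the communication step between them; here everything runs in one process only to show the data flow.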

(53)

ActiveDHT

The ActiveDHT model (Bahmani, Chowdhury & Goel 2010)

[Figure: a ring of hash values 0 to 15 divided among machines; one machine is responsible for keys with hash = 4, 5, another for keys with hash = 6, 7.]

• Update (key, at)

• Query (key)

Used in Yahoo! S4 & Twitter Storm
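To make the picture concrete, here is a toy single-process sketch of a hash ring with 16 slots shared by 8 machines, so that one machine owns slots 4 and 5 and the next owns slots 6 and 7, as in the figure. Everything else is an assumption: the class name `ToyActiveDHT`, the choice of SHA-256 as the hash, and the reading of Update as "add a number to the value stored under the key" are illustrative and not taken from the slides or from S4 or Storm.

```python
import hashlib


class ToyActiveDHT:
    """Toy stand-in for a ring of machines, all simulated in one process."""

    def __init__(self, num_nodes=8, ring_size=16):
        if ring_size % num_nodes != 0:
            raise ValueError("ring_size must be a multiple of num_nodes")
        self.num_nodes = num_nodes
        self.ring_size = ring_size
        # One local key-value store per simulated machine.
        self.stores = [dict() for _ in range(num_nodes)]

    def slot(self, key):
        """Hash a string key to a position on the ring."""
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.ring_size

    def owner_of_slot(self, slot):
        """Each machine owns a contiguous block of ring positions."""
        return slot * self.num_nodes // self.ring_size

    def update(self, key, delta):
        # Assumed semantics: the owning machine adds delta to its local value.
        store = self.stores[self.owner_of_slot(self.slot(key))]
        store[key] = store.get(key, 0) + delta

    def query(self, key):
        store = self.stores[self.owner_of_slot(self.slot(key))]
        return store.get(key, 0)


if __name__ == "__main__":
    dht = ToyActiveDHT()
    dht.update("page:home", 1)
    dht.update("page:home", 1)
    print(dht.query("page:home"))
```

The point of the sketch is that both operations touch only the machine that owns the key, so no machine needs to hold all of the data.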

(54)

Tentative course plan

Part 0 : Introductions

Part 1 : Finding Similar Items

– Jaccard Similarity and Min-Hashing

– Locality Sensitive Hashing (LSH) and Distances

– Implementing LSH in ActiveDHT

Part 2 : Clustering

– Hierarchical Clustering

– Assignment-based Clustering (k-center, k-means, k-median)

– Spectral Clustering

Part 3 : Mining Frequent Items

– Finding Frequent Itemsets

– Finding Frequent Items in Data Streams

Part 4 : Link Analysis

– Markov Chain Basics

(55)

Resources

There is no official textbook for the class.

Background on Randomized Algorithms:

• Probability and Computing, by Mitzenmacher and Upfal

Main reference book:

• Mining Massive Data Sets

(56)

Instructors

Instructor: Qin Zhang

Email: [email protected]

Office hours: By email appointment

Assistant Instructor: Prasanth Velamala

Email: [email protected]

(58)

Grading

Assignments 50% : There will be several homework assignments. Solutions should be typeset in LaTeX (highly recommended) or Word.

Project 50% : The project consists of three components:

1. Write a proposal.

2. Write a report.

3. Make a presentation.

(Details will be posted online)

Most important thing: Learn something about models / algorithmic techniques

Use A, B, . . . for each item (assignments or projects). Final

(59)

LaTeX

LaTeX: a highly recommended tool for assignments/reports

1. Read wiki articles:

http://en.wikipedia.org/wiki/LaTeX

2. Find a good LaTeX editor.

3. Learn how to use it, e.g., read “The Not So Short Introduction to LaTeX2e” (Google it)

(60)

Prerequisites

One is expected to know the basics of algorithm design and analysis, probability, and programming, e.g., to have taken

• (Math) M365 “Introduction to Probability and Statistics”,

• (Math) M301 “Linear Algebra and Applications”,

• (CS) C241 “Discrete Structures for Computer Science”,

• (CS) B403 “Introduction to Algorithm Design and Analysis”,

or equivalent courses.

I will NOT start with things like big-O notation or the definitions of random variables and expectation. But please ask at any time if you don’t understand something.

(61)

Possible project topics

Part 1 : Finding Similar Items

– Locality Sensitive Hashing: Given a dictionary of a large number of documents (or other objects) and a set of query documents, find, for each query document, all documents in the dictionary that are similar to it. Compare the running time of LSH with that of other methods you can think of (e.g., the trivial one: compare the query with each document in the dictionary).

Part 2 : Clustering

– Assignment-based Clustering (k-center, k-means, k-median): Select clustering algorithms taught in class and run them on large data sets. One can also compare them with hierarchical clustering.

Part 3 : Mining Frequent Items

– Finding Frequent Itemsets: Run the A-Priori algorithm on large data sets to find frequent itemsets.

– Finding Frequent Items in Data Streams: Implement streaming algorithms taught in class and run them on large data sets to find frequent items.

(64)

Approximation and Randomization

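Approximation

Return $\hat{f}(A)$ instead of $f(A)$, where

$$|f(A) - \hat{f}(A)| \le \epsilon f(A).$$

Such an $\hat{f}(A)$ is a $(1+\epsilon)$-approximation of $f(A)$.

Randomization

Return $\hat{f}(A)$ instead of $f(A)$, where

$$\Pr\left[\, |f(A) - \hat{f}(A)| \le \epsilon f(A) \,\right] \ge 1 - \delta.$$

Such an $\hat{f}(A)$ is a $(1+\epsilon, \delta)$-approximation of $f(A)$.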
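The two definitions can be tried out on a small example. The sketch below takes $f(A)$ to be the sum of a list and $\hat{f}(A)$ to be an estimate from a random sample, then measures how often the estimate misses the $\epsilon$ target, which is an empirical stand-in for $\delta$. The estimator, the data, and all names are assumptions chosen for illustration; they are not from the slides.

```python
import random


def is_eps_approx(true_value, estimate, eps):
    """Check |f(A) - f_hat(A)| <= eps * f(A)."""
    return abs(true_value - estimate) <= eps * true_value


def sampled_sum(data, sample_size, rng):
    """Estimate sum(data) from a uniform sample drawn with replacement."""
    sample = [rng.choice(data) for _ in range(sample_size)]
    return len(data) * sum(sample) / sample_size


def empirical_failure_rate(data, sample_size, eps, trials, seed=0):
    """Fraction of trials in which the estimate is NOT within eps."""
    rng = random.Random(seed)
    truth = sum(data)
    failures = 0
    for _ in range(trials):
        estimate = sampled_sum(data, sample_size, rng)
        if not is_eps_approx(truth, estimate, eps):
            failures += 1
    return failures / trials


if __name__ == "__main__":
    data = list(range(1, 1001))
    print(empirical_failure_rate(data, sample_size=400, eps=0.1, trials=200))
```

A larger sample lowers the failure rate for a fixed $\epsilon$; trading sample size against $\epsilon$ and $\delta$ is the kind of analysis the inequalities on the next slides support.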

(66)

Markov and Chebyshev inequalities

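Markov Inequality

Let $X \ge 0$ be a random variable. Then for all $a > 0$,

$$\Pr[X \ge a] \le \frac{\mathbf{E}[X]}{a}.$$

Chebyshev’s Inequality

Let $X$ be a random variable (not necessarily nonnegative). Then for all $a > 0$,

$$\Pr[\,|X - \mathbf{E}[X]| \ge a\,] \le \frac{\mathbf{Var}[X]}{a^2}.$$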
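Both inequalities can be checked exactly on a small distribution. The sketch below uses a fair six-sided die as $X$, which is an example chosen here and not one from the slides, and works with exact fractions so that no rounding is involved.

```python
from fractions import Fraction

# X is the outcome of a fair six-sided die (assumed example).
OUTCOMES = range(1, 7)
P = Fraction(1, 6)


def expectation(f=lambda x: x):
    """E[f(X)] computed by summing over all outcomes."""
    return sum(P * f(x) for x in OUTCOMES)


def prob(event):
    """Pr[event(X)] computed by summing over all outcomes."""
    return sum(P for x in OUTCOMES if event(x))


MEAN = expectation()
VARIANCE = expectation(lambda x: (x - MEAN) ** 2)


def markov_holds(a):
    """Check Pr[X >= a] <= E[X] / a."""
    return prob(lambda x: x >= a) <= MEAN / a


def chebyshev_holds(a):
    """Check Pr[|X - E[X]| >= a] <= Var[X] / a^2."""
    return prob(lambda x: abs(x - MEAN) >= a) <= VARIANCE / a ** 2


if __name__ == "__main__":
    print(MEAN, VARIANCE)
    print(all(markov_holds(a) for a in range(1, 8)))
    print(all(chebyshev_holds(Fraction(a, 2)) for a in range(1, 8)))
```

On this example the bounds hold with room to spare, which is typical: both inequalities use very little information about the distribution.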

(70)

Application: Birthday Paradox

Birthday Paradox

In a set of k randomly chosen people, what is the probability that at least one pair of them has the same birthday?

Assume each person’s birthday is chosen uniformly at random from the n = 365 days between Jan. 1 and Dec. 31.

Take 1: For any pair of people, the probability that they have the same birthday is $1/n$. For k people, we have $\binom{k}{2}$ pairs of people. The probability that none of them have the same birthday is $(1 - 1/n)^{\binom{k}{2}}$. Thus the answer is $1 - (1 - 1/n)^{\binom{k}{2}}$. (This treats the $\binom{k}{2}$ pair events as mutually independent, which they are not, so it is only an approximation.)

Take 2: $1 - \frac{n-0}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-(k-1)}{n}$

$\Pr[\text{exists collision}] \approx k^2/(2n)$
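The three expressions can be compared numerically. The sketch below evaluates Take 1, Take 2, and the rough estimate $k^2/(2n)$ for a few values of k with n = 365; the function names are made up for illustration.

```python
from math import comb


def take2_exact(k, n=365):
    """Take 2: one minus the probability that all k birthdays differ."""
    p_distinct = 1.0
    for i in range(k):
        p_distinct *= (n - i) / n
    return 1.0 - p_distinct


def take1_pairs(k, n=365):
    """Take 1: treats the C(k, 2) pairs as if they were independent."""
    return 1.0 - (1.0 - 1.0 / n) ** comb(k, 2)


def rough_estimate(k, n=365):
    """k^2 / (2n); only meaningful while k is small compared with sqrt(n)."""
    return k * k / (2 * n)


if __name__ == "__main__":
    for k in (5, 10, 23, 40):
        print(k, round(take2_exact(k), 4), round(take1_pairs(k), 4),
              round(rough_estimate(k), 4))
```

Take 1 and Take 2 stay close to each other over this range, while $k^2/(2n)$ is a useful estimate only for small k and exceeds 1 once k is larger than $\sqrt{2n}$.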

(72)

Application: Coupon Collector

Coupon Collector

Suppose that each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize.

Assuming that the coupon in each box is chosen independently and uniformly at random from the n possibilities, how many boxes of cereal must you buy before you obtain at least one of every type of coupon?
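The extracted slides do not include the answer. The standard result is that the expected number of boxes is $n \cdot H_n = n \left(1 + \tfrac{1}{2} + \cdots + \tfrac{1}{n}\right) \approx n \ln n$, because once $i-1$ types are collected, each box is a new type with probability $(n-i+1)/n$. The sketch below computes this value exactly and compares it with a simulation; the function names and the choice n = 10 are assumptions for illustration.

```python
import random
from fractions import Fraction


def expected_boxes(n):
    """Exact expectation n * H_n, as a fraction."""
    return n * sum(Fraction(1, i) for i in range(1, n + 1))


def simulate_boxes(n, rng):
    """Buy boxes until all n coupon types have been seen; return the count."""
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))  # coupon in this box, uniform over n types
        boxes += 1
    return boxes


def average_boxes(n, trials, seed=0):
    """Average number of boxes over repeated simulated collections."""
    rng = random.Random(seed)
    return sum(simulate_boxes(n, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 10
    print(float(expected_boxes(n)), average_boxes(n, trials=5000))
```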

(73)

The Union Bound

The Union Bound

Consider t possibly dependent random events $X_1, \ldots, X_t$. The probability that all events occur is at least

$$1 - \sum_{i=1}^{t} \left(1 - \Pr[X_i \text{ occurs}]\right).$$
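A small exact check shows the bound at work on events that are clearly dependent. The sketch below uses one roll of a fair die and three events defined on that same roll; the events are an example chosen here, not one from the slides.

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (assumed example).
OMEGA = range(1, 7)

# Three events on the same roll, so they are dependent.
events = {
    "at most 5": lambda w: w <= 5,
    "at least 2": lambda w: w >= 2,
    "even": lambda w: w % 2 == 0,
}


def prob(predicate):
    """Exact probability of an event under the uniform distribution."""
    return Fraction(sum(1 for w in OMEGA if predicate(w)), len(OMEGA))


# Probability that all three events occur together.
p_all = prob(lambda w: all(event(w) for event in events.values()))

# Union bound: 1 minus the sum of the individual failure probabilities.
lower_bound = 1 - sum(1 - prob(event) for event in events.values())

print(p_all, lower_bound, p_all >= lower_bound)
```

The bound needs no independence assumption; it only uses each event's own probability, which is why it is so widely applicable.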

(74)

Summary for the introduction

We have discussed Big Data and Data Mining

We have introduced three popular models for modern computation.

We have talked about the course plan and assessment.
