Qin Zhang
B490 Mining the Big Data
Data Mining
What is Data Mining?
A “definition”: Discovery of useful, possibly unexpected,
patterns in data.
I don’t think this is practical until the day machines have intelligence. (You may have a different opinion.)
I think, most of the time, people just mean to
• Compute some functions defined on the data (efficient algorithms).
• Fit data into some concrete models.
In this course, we will talk about
. . .
In this course we will focus on efficient algorithms.
In particular, we will discuss
• Finding similar items
• Mining frequent items
• Clustering (aggregate similar items)
• Link analysis
Big Data
• [logo]: over 2.5 petabytes of sales transactions
• [logo]: an index of over 19 billion web pages
• [logo]: over 40 billion pictures
• . . .
Big data is everywhere
[Figure: Nature ’06, Nature ’08, CACM ’08, Economist ’10]
Source and Challenge
Source
• Retailer databases: Amazon, Walmart
• Logistics, financial & health data: Stock prices
• Social network: Facebook, Twitter
• Pictures by mobile devices: iPhone
• Internet traffic: IP addresses
• New forms of scientific data: Large Synoptic Survey Telescope
Challenge
• Volume
• Velocity
• Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data, . . . )
What does Big Data Really Mean?
We don’t define Big Data in terms of TB, PB, EB, . . .
The data is too big to fit in memory. What can we do?
Process the items one by one as they arrive, and throw some of them away on the fly.
Store the data on multiple machines, which collaborate via communication.
The RAM model does not fit: it assumes a processor and an infinite-size memory, where probing each cell of the memory has unit cost.
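As a minimal sketch of processing items one by one and then forgetting them, assume the goal is only to keep the count, mean, and maximum of a numeric stream. The class name `RunningStats` and the sample stream are hypothetical; the point is that the state stays constant in size however long the stream is.

```python
class RunningStats:
    """Constant-memory summary of a numeric stream (illustrative only)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.maximum = None

    def update(self, x):
        # Each item is seen once, folded into the summary, and then dropped.
        self.count += 1
        self.mean += (x - self.mean) / self.count
        if self.maximum is None or x > self.maximum:
            self.maximum = x


stats = RunningStats()
for value in [3, 1, 4, 1, 5, 9, 2, 6]:  # assumed sample stream
    stats.update(value)
print(stats.count, stats.maximum, stats.mean)
```

Not every function can be maintained this way: an exact median, for example, generally needs far more than constant memory.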
Data Streams
The data stream model (Alon, Matias & Szegedy 1996)
[Figure: packets pass through a router whose CPU has only a limited amount of RAM.]
Applications: Internet routers, stock data, ad auctions, flight logs on tapes, etc.
The router wants to maintain some statistics on the data, e.g., to detect anomalies for security.
Widely used: Stanford Stream, Aurora, Telegraph, NiagaraCQ . . .
Difficulty: See and forget!
Game 1: A sequence of numbers
52, 45, 18, 23, 17, 41, 33, 29, 49, 12, 35
Q: What’s the median?
A: 33
Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul
[Figure: pairs of friends, shown one at a time]
Q: Are Eva and Bob connected by friends?
A: YES. Eva ⇔ Carol ⇔ Dave ⇔ Alice ⇔ Bob
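One way to play Game 2 without remembering every friendship is a union-find structure, which stores one entry per person rather than one per edge. The sketch below is an illustration, not a method named on the slides. The edge stream is an assumption: it contains only the four friendships on the path given in the answer above, since the full list of pairs is not in the text.

```python
class UnionFind:
    """Tracks which people belong to the same connected group."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving: point x at its grandparent to keep trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_a] = root_b


# Assumed stream of friendships (only the path from the answer above).
edges = [("Eva", "Carol"), ("Carol", "Dave"), ("Dave", "Alice"), ("Alice", "Bob")]

uf = UnionFind()
for a, b in edges:
    uf.union(a, b)  # the edge is merged into the summary and then forgotten

eva_bob_connected = uf.find("Eva") == uf.find("Bob")
paul_bob_connected = uf.find("Paul") == uf.find("Bob")
print(eva_bob_connected, paul_bob_connected)
```

This answers the yes/no question but does not reproduce the connecting path, which would require keeping more of the edges.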
MapReduce
The MapReduce model (Dean & Ghemawat 2004)
Input → Map → Shuffle → Reduce → Output
Standard model in industry for massive data computation, e.g., Hadoop.
Map: for each value $x_i$, $x_i \to \{(key_1, v_1), (key_2, v_2), \ldots\}$
Reduce: $\{(key_1, v_1), (key_1, v_2), \ldots\} \to \{y_1, y_2, \ldots\}$
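The map, shuffle, and reduce steps can be imitated in a single process. The sketch below uses word counting as an assumed example task; the function names `map_fn`, `shuffle`, `reduce_fn`, and `map_reduce` are hypothetical and do not correspond to the Hadoop API.

```python
from collections import defaultdict


def map_fn(line):
    # One input value -> a list of (key, value) pairs: (word, 1) per word.
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    # Group together all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key, values):
    # All pairs with one key -> one output: the total count for that word.
    return key, sum(values)


def map_reduce(lines):
    mapped = [pair for line in lines for pair in map_fn(line)]
    return dict(reduce_fn(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(["big data", "big models", "data streams"])
print(counts)
```

In a real deployment the map and reduce calls run on different machines and the shuffle is the communication step between them; here everything runs in one process purely to show the data flow.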
ActiveDHT
The ActiveDHT model (Bahmani, Chowdhury & Goel 2010)
[Figure: a ring of hash values 0 to 15; one machine is responsible for keys with hash = 4, 5 and another for keys with hash = 6, 7.]
• Update (key, at)
• Query (key)
Used in Yahoo! S4 & Twitter Storm
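A toy, single-process sketch of the idea follows: each key is hashed onto a ring, and the machine that owns that part of the ring handles its updates and queries. Everything here is an assumption for illustration: the class name `ToyActiveDHT`, the choice of 8 machines over 16 hash values (two values per machine, as in the figure), and the additive meaning of `update`. Real systems such as S4 or Storm have their own APIs, which this does not imitate.

```python
import hashlib


class ToyActiveDHT:
    """Simulates machines that each own a contiguous range of hash values."""

    def __init__(self, num_machines=8, ring_size=16):
        self.num_machines = num_machines
        self.ring_size = ring_size
        # One dictionary per simulated machine.
        self.machines = [{} for _ in range(num_machines)]

    def owner(self, key):
        # Deterministically hash the key onto the ring 0 .. ring_size - 1,
        # then map that slot to the machine responsible for it.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % self.ring_size
        return slot * self.num_machines // self.ring_size

    def update(self, key, delta):
        # Assumed semantics: add delta to the value stored for this key.
        store = self.machines[self.owner(key)]
        store[key] = store.get(key, 0) + delta

    def query(self, key):
        return self.machines[self.owner(key)].get(key, 0)


dht = ToyActiveDHT()
dht.update("alice", 1)
dht.update("alice", 2)
dht.update("bob", 5)
print(dht.query("alice"), dht.query("bob"), dht.query("carol"))
```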
Tentative course plan
Part 0 : Introductions
Part 1 : Finding Similar Items
– Jaccard Similarity and Min-Hashing
– Locality Sensitive Hashing (LSH) and Distances
– Implementing LSH in ActiveDHT
Part 2 : Clustering
– Hierarchical Clustering
– Assignment-based Clustering (k-center, k-means, k-median)
– Spectral Clustering
Part 3 : Mining Frequent Items
– Finding Frequent Itemsets
– Finding Frequent Items in Data Stream
Part 4 : Link Analysis
– Markov Chain Basics
Resources
There is no official textbook for the class.
Background on Randomized Algorithms:
• Probability and Computing
by Mitzenmacher and Upfal
Main reference book:
• Mining Massive Data Sets
Instructors
Instructor: Qin Zhang
Email: [email protected]
Office hours: By email appointment
Assistant Instructor: Prasanth Velamala
Email: [email protected]
Grading
Assignments 50% : There will be several homework assignments. Solutions should be typeset in LaTeX (highly recommended) or Word.
Project 50% : The project consists of three components:
1. Write a proposal.
2. Write a report.
3. Make a presentation.
(Details will be posted online)
Most important thing:
Learn something about models / algorithmic techniques
Use A, B, . . . for each item (assignments or projects). Final
LaTeX
LaTeX: A highly recommended tool for assignments/reports
1. Read wiki articles:
http://en.wikipedia.org/wiki/LaTeX
2. Find a good LaTeX editor.
3. Learn how to use it, e.g., read “A Not So Short Introduction to LaTeX 2e” (Google it)
Prerequisites
One is expected to know the basics of algorithm design and analysis, probability, and programming, e.g., to have taken
• (Math) M365 “Introduction to Probability and Statistics”,
• (Math) M301 “Linear Algebra and Applications”,
• (CS) C241 “Discrete Structures for Computer Science”,
• (CS) B403 “Introduction to Algorithm Design and Analysis”,
or equivalent courses.
I will NOT start with things like big-O notation or the definitions of random variables and expectation. But please ask at any time if you don’t understand something.
Possible project topics
Part 1 : Finding Similar Items
– Locality Sensitive Hashing: Given a dictionary of a large number of documents (or other objects) and a set of query docs, find, for each query doc, all docs in the dictionary that are similar to it. Compare the running time of LSH with that of other methods you can think of (e.g., the trivial one: compare the query with each of the docs in the dictionary).
Part 2 : Clustering
– Assignment-based Clustering (k-center, k-means, k-median): Select clustering algorithms taught in class, and run them on large data sets. One can also compare them with hierarchical clustering.
Part 3 : Mining Frequent Items
– Finding Frequent Itemsets: Run the A-priori algorithm on large data sets to find frequent itemsets.
– Finding Frequent Items in Data Stream: Implement streaming algorithms taught in class, and run them on large data sets to find frequent items.
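For the similar-items project, the trivial baseline mentioned above (compare the query with every document) might look like the following sketch. It assumes documents are compared as sets of whitespace-separated words under Jaccard similarity; the function names, the sample documents, and the threshold of 0.5 are all hypothetical choices for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def brute_force_similar(query, dictionary, threshold):
    # Compares the query with every document, so the cost per query
    # grows linearly with the size of the dictionary.
    query_words = set(query.split())
    return [name for name, text in dictionary.items()
            if jaccard(query_words, set(text.split())) >= threshold]


docs = {
    "d1": "mining massive data sets",
    "d2": "mining big data sets",
    "d3": "markov chain basics",
}
matches = brute_force_similar("mining massive data", docs, threshold=0.5)
print(matches)
```

Timing this loop against an LSH-based index on the same dictionary gives the running-time comparison the project asks for.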
Approximation and Randomization
Approximation
Return $\hat{f}(A)$ instead of $f(A)$, where $|f(A) - \hat{f}(A)| \le \epsilon f(A)$; such an $\hat{f}(A)$ is a $(1+\epsilon)$-approximation of $f(A)$.
Randomization
Return $\hat{f}(A)$ instead of $f(A)$, where $\Pr\left[\,|f(A) - \hat{f}(A)| \le \epsilon f(A)\,\right] \ge 1 - \delta$; such an $\hat{f}(A)$ is a $(1+\epsilon, \delta)$-approximation of $f(A)$.
Markov and Chebyshev inequalities
Markov Inequality
Let X ≥ 0 be a random variable. Then for all a > 0,
$\Pr[X \ge a] \le \frac{E[X]}{a}$.
Chebyshev’s Inequality
Let X be a random variable. Then for all a > 0,
$\Pr[|X - E[X]| \ge a] \le \frac{\mathrm{Var}[X]}{a^2}$.
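Both inequalities can be checked exactly on a small distribution. The sketch below assumes X is the outcome of a fair six-sided die (an example chosen here, not taken from the slides) and uses exact fractions so that no rounding is involved. The Chebyshev bound is taken in its standard form, Var[X]/a².

```python
from fractions import Fraction

# Assumed example: X is the outcome of a fair six-sided die.
outcomes = range(1, 7)
p = Fraction(1, 6)

mean = sum(p * x for x in outcomes)               # E[X]
var = sum(p * (x - mean) ** 2 for x in outcomes)  # Var[X]


def markov_holds(a):
    tail = sum(p for x in outcomes if x >= a)     # Pr[X >= a]
    return tail <= mean / a


def chebyshev_holds(a):
    tail = sum(p for x in outcomes if abs(x - mean) >= a)  # Pr[|X - E[X]| >= a]
    return tail <= var / a ** 2


markov_ok = all(markov_holds(a) for a in range(1, 7))
chebyshev_ok = all(chebyshev_holds(Fraction(a, 2)) for a in range(1, 7))
print(mean, var, markov_ok, chebyshev_ok)
```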
Application: Birthday Paradox
Birthday Paradox
In a set of k randomly chosen people, what is the probability that at least one pair of them has the same birthday?
Assume each person’s birthday is chosen uniformly at random from the n days between Jan. 1 and Dec. 31.
Take 1: For any pair of people, the probability that they have the same birthday is 1/n. For k people, we have $\binom{k}{2}$ pairs of people. Treating the pairs as if they were independent, the probability that no pair shares a birthday is $(1 - 1/n)^{\binom{k}{2}}$. Thus the answer is approximately $1 - (1 - 1/n)^{\binom{k}{2}}$.
Take 2: $1 - \frac{n-0}{n} \cdot \frac{n-1}{n} \cdot \frac{n-2}{n} \cdots \frac{n-(k-1)}{n}$
$\Pr[\text{exists collision}] \approx k^2/(2n)$
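The three expressions above can be compared numerically. This sketch assumes n = 365 as the default number of days; the function names are hypothetical.

```python
from math import comb


def exact_collision(k, n=365):
    # Take 2: 1 - prod_{i=0}^{k-1} (n - i) / n
    no_collision = 1.0
    for i in range(k):
        no_collision *= (n - i) / n
    return 1.0 - no_collision


def pairwise_estimate(k, n=365):
    # Take 1: 1 - (1 - 1/n)^C(k, 2), treating the pairs as independent.
    return 1.0 - (1.0 - 1.0 / n) ** comb(k, 2)


def quadratic_estimate(k, n=365):
    # k^2 / (2n); only a sensible estimate while this value is small.
    return k * k / (2 * n)


for k in (5, 10, 23):
    print(k, round(exact_collision(k), 4), round(pairwise_estimate(k), 4),
          round(quadratic_estimate(k), 4))
```

Take 1 stays close to the exact value over this range, while k²/(2n) drifts upward as k grows, since it is only a first-order estimate.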
Application: Coupon Collector
Coupon Collector
Suppose that each box of cereal contains one of n different coupons. Once you obtain one of every type of coupon, you can send in for a prize.
Assuming that the coupon in each box is chosen independently and uniformly at random from the n possibilities, how many boxes of cereal must you buy before you obtain at least one of every type of coupon?
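The slide poses the question without stating the answer. A standard fact is that the expected number of boxes is $n H_n$, where $H_n = 1 + 1/2 + \cdots + 1/n$: once i distinct coupons have been seen, the wait for a new one is geometric with success probability (n − i)/n. The sketch below compares that formula with a simulation; the function names, the choice n = 10, the seed, and the number of trials are arbitrary assumptions.

```python
import random


def expected_boxes(n):
    # n * H_n, where H_n = 1 + 1/2 + ... + 1/n is the n-th harmonic number.
    return n * sum(1.0 / i for i in range(1, n + 1))


def simulate_boxes(n, rng):
    # Buy boxes until all n coupon types have been seen at least once.
    seen = set()
    boxes = 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        boxes += 1
    return boxes


rng = random.Random(0)  # fixed seed so that repeated runs agree
trials = 2000
average = sum(simulate_boxes(10, rng) for _ in range(trials)) / trials
print(expected_boxes(10), average)
```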
The Union Bound
The Union Bound
Consider t possibly dependent random events $X_1, \ldots, X_t$. The probability that all events occur is at least
$1 - \sum_{i=1}^{t} \left(1 - \Pr[X_i \text{ occurs}]\right)$.
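A small exact check of the bound, reading it as 1 minus the sum of the probabilities that each individual event fails. The two events below are an assumed example: both are defined on the same roll of a fair die, so they are dependent.

```python
from fractions import Fraction

# Assumed example: one roll of a fair six-sided die.
rolls = range(1, 7)
p = Fraction(1, 6)

events = [
    lambda r: r <= 5,  # X1: the roll is at most 5
    lambda r: r >= 2,  # X2: the roll is at least 2
]

# Exact probability that every event occurs.
prob_all = sum(p for r in rolls if all(event(r) for event in events))

# Union bound: 1 - sum over events of Pr[event does not occur].
fail_probs = [sum(p for r in rolls if not event(r)) for event in events]
lower_bound = 1 - sum(fail_probs)

print(prob_all, lower_bound)
```

Here the bound happens to be tight because the two failure cases (rolling a 6, rolling a 1) cannot occur together; in general it is only a lower bound.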
Summary for the introduction
We have discussed Big Data and Data Mining
We have introduced three popular models for modern computation.
We have talked about the course plan and assessment.