Big Data & Scripting Part II
Streaming Algorithms
a note on sampling and filtering
• sampling:
(randomly) choose a representative subset
• filtering:
given some criterion (e.g. membership in a set), retain only elements matching that criterion
• example scenario: stream of requests (user,request)
• sampling requests is straightforward (e.g. which pages are accessed most frequently)
• analyzing the distribution of frequencies is more complicated that is, we want to know, how many queries are repeated x times (for all x )
sampling and filtering example
• n = 200, 000 events, m = 40, 000 different requests, uniform distribution
all queries
s frequency 051015
10% sample
id
frequency
0 10000 20000 30000 40000
0123456
sampling and filtering example
• same dataset, but frequency vs. # queries with this frequency
all queries − by frequency
frequency
number of queries with frequency
0 5 10 15
0200040006000
10% sample − by frequency
frequency
number of queries with frequency
0 1 2 3 4 5 6
050001500025000
sampling and filtering example
• same dataset, but frequency vs. # queries with this frequency
• this time sample is selected by a fixed subset of ids
all queries − by frequency
frequency
number of queries with frequency
0 5 10 15
0200040006000
corrected 10% sample − by frequency
frequency
number of queries with frequency
0 5 10 15
0100300500700
Histograms and Frequency Skews
stream and histogram
consider the following input:
0 20 40 60 80 100
1 2 3 4 5 6
time
objects/buckets
• as time/stream progresses, data points come in – e.g. users issue requests
• distinguished by some id or bucket (from hashing)
• some are seen more often (e.g. 4) some less often (e.g. 1) – e.g. user 4 sending requests with high frequency,
user 1 only one request
this is highly valuable information for an analysis
stream and histogram
0 20 40 60 80 100
1 2 3 4 5 6
time
objects/buckets
to analyze these frequency distributions, histograms are helpful:
1 2 3 4 5 6
0 5 10 15 20 25 30
object
frequency
comparing histograms - different distributions
an example of two different streams of observations:
0 100 200 300 400 500 600 700
objects
frequency
0 100 200 300 400 500 600 700
objects
frequency
both have equal number of data points (10.000) and distinct objects (60) but objects have different probabilities to be observed
sorting objects by frequencies makes the difference more obvious:
0 100 200 300 400 500 600 700
objects
frequency
0 100 200 300 400 500 600 700
objects
frequency
the plan
• information about the distribution of observation is crucial for many applications
• knowing the complete, exact histogram – would be helpful
– is often not possible, due to the large number of distinct objects
workaround:
• characterize histogram without knowing the complete picture
• characteristic properties easier to determine
• analogous to descriptions of distributions on R
characterizing frequency distributions
1 2 3 4 5 6
0 5 10 15 20 25 30
object
frequency
• mi: frequency of object i
• number of distinct objects seen so far: Pi(mi)0
• total number of objects seen so far: Pi(mi)1 =Pimi
generalization: Mk =Pi(mi)k kth moment
M
2– the second frequency moment
what we have so far
• M0 – Flajolet-Martin algorithm from last lecture
• M1 – counting
• combination: average frequency M1/M0
next: estimate M2 =Pimi2
M
2– the second frequency moment
0 100 200 300 400 500 600 700
objects
frequency
M2 = 1.678.672
0 100 200 300 400 500 600 700
objects
frequency
M2 = 3.320.852
Motivation
• M2 describes the “skewness” of a distribution
• smaller M2 less skewed distribution
• related to the Gini-Index (surprise index)
• used to limit approximation errors, query optimization in database systems
M
2and Var(X )
• variance describes the distribution of values
• M2 describes the distribution of their frequencies
• M2 comparable to variance of frequencies:
Var({mi}) = 1/NPi(mi − µ({mi}))2
M
2– the second frequency moment: approximation
• storing and counting distinct objects impossible
• approximation by Alon-Matias-Szegedy algorithm1:
algorithm
• N observations in stream
• choose k random positions pj ∈ {1, . . . , N}
• when reaching position pj: – store object at position
– start counting occurrences of this object in mj
• estimate: M2 ≈ n/k(Pki =1(2mi − 1))
1Alon, N.; Matias Y.; Szegedy, M.:
“The space complexity of approximating the frequency moments”, 1999
M
2– the second frequency moment: example
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 c e c f a e g f f b b c g b a a f d a e
• N=20
• random positions 3, 7, 14, 5
• position 3: encounter c, counting results in 2
• position 7: encounter g, 2
• position 14: b 1
• position 5 a 4
• estimate:
M2 ≈ 204[2 · (2 · 2 − 1) + (2 · 1 − 1) + (2 · 4 − 1] = 204 · 14 = 70
• true value: M2 = 42+ 32+ 32+ 12+ 32 + 42+ 22 = 64
M
2– the second frequency moment: summary
• the algorithm is simple to implement
• needs to store only the k counters
• gets more precise with larger k, proof idea:
– expected value of each counter is fraction of M2 – average of k counters approaches M2
problem: N may not be known in the beginning
approximating M
2with unknown stream length
• stream may be of unknown length or unlimited
• still each position must be chosen random and uniform from {1, . . . , N}
solution
• keep count of k objects beginning with the first k
• when object at position p > k is processed:
– choose with probability k/(p + 1)
– drop existing element (chosen with equal probability) each position chosen with equal probability
clustering data streams
clustering data streams – the problem
• many formulations of “the” clustering problem possible
• wide application ranges, strong variance in – preconditions
– objective function
• common ground:
– objects connected by relation
– identify groups of “similar objects” with respect to relation – problem is intractable (N P-hard)
some basic questions
• what kind of relation (e.g. binary, distance, similarity)
• can objects have a mean value (continuous space)
• what is a “good” cluster (objective function)
• possibility of overlapping clusters
clustering data streams – STREAM
in the following: a single example problem and a single algorithm
• k-median
• on a data stream
• in one pass
• with guaranteed approximation quality
• algorithm: STREAM
• Guha, Mishra,Motwani, O’Callaghan:
“Clustering Data Streams”,2000
clustering data streams – the k-median problem
• input:
– objects X = {xi : i = 1, . . . , N}
– distance d : X × X → R
– every xi is seen once in arbitrary order (i = 1, . . . , N) – k - number of clusters to find
• objective:
– identify k elements m1, . . . , mk ∈ X (cluster centers) – let N(mj) = {xi ∈ X : j = arg minl ∈1,...,kd (xi, ml)}
all xi for which mi is the nearest center
– minimize C ({m1, . . . , mk}) = Pk
j=1
P
xi∈N(mj)
d (xi, mj)
clustering data streams – approximating k-median
• for small problem instances k-median can be fixed parameter approximated
• fixed parameter approximation: Capprox ≤ a · Qopt
(approximation is maximal by factor a worse than optimal solution for fixed a)
• this approximation is useful to approximate larger instances
approximation (idea)
• k-medians can be stated as integer program PI
• this program can be relaxed to a linear program PL
• solution of PL can be rounded to solution of PI
linear problems can be solved efficiently
clustering data streams – weighted k-medians extending k-medians with weights:
• k-medians with weighted samples w : X → R>0:
• distance of objects to their centers multiplied by weight:
C ({m1, . . . , mk}) =PjPi ∈1,...,Nw (xi) · d (xi, mj)
• k-medians is special case with unit weights
• weighted k-means can be approximated similar to k-means:
• algorithm can only be applied to “small” instances
• use it to solve small sub-problems in the following, use procedure: wkm() input: objects, weights, k
output: k weighted centers
2
first step - clustering with low memory
approach: divide and conquer Small-Space(X)
1. divide X into l disjoint subsets X1, . . . , Xl 2. cluster each Xi individually into l · k clusters 3. result: X0 set of lk cluster centers
4. cluster X0, using for each c ∈ X0 |N(c)| as weight
• 2. can be solved with a constant factor approximation:
– solution ≤ b times worse than optimum
• 4. can be solved with constant factor approximation not worse than ≤ c times optimum
result: constant factor approximation partial solutions and their combination
extending to a solution
Small-Space(X)
1. divide X into l disjoint subsets X1, . . . , Xl 2. cluster each Xi individually into O(k) clusters 3. result: X0 set of O(lk) cluster centers
4. cluster X0, using for each c ∈ X0 |N(c)| as weight
• constant factor approximation
• needs to cluster Xi
memory problem 1: size of subsets versus l
• needs to cluster X0
memory problem 2: clustering O(lk) elements