Big Data & Scripting Part II Streaming Algorithms

(1)

(2)

a note on sampling and filtering

• sampling:

(randomly) choose a representative subset

• filtering:

given some criterion (e.g. membership in a set), retain only elements matching that criterion

• example scenario: stream of requests (user,request)

• sampling requests is straightforward (e.g. which pages are accessed most frequently)

• analyzing the distribution of frequencies is more complicated: we want to know how many queries are repeated x times (for all x)

(3)

sampling and filtering example

• n = 200,000 events, m = 40,000 different requests, uniform distribution

[figure: frequency per request id; panels "all queries" and "10% sample"; x-axis: id, y-axis: frequency]

(4)

sampling and filtering example

• same dataset, but frequency vs. # queries with this frequency

[figure: number of queries per frequency; panels "all queries − by frequency" and "10% sample − by frequency"; x-axis: frequency, y-axis: number of queries with frequency]

(5)

sampling and filtering example

• same dataset, but frequency vs. # queries with this frequency

• this time the sample is selected by a fixed subset of ids

[figure: number of queries per frequency; panels "all queries − by frequency" and "corrected 10% sample − by frequency"; x-axis: frequency, y-axis: number of queries with frequency]
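The difference between the two sampling strategies can be tried out with a small sketch. It assumes integer request ids and uses `id % buckets` as a stand-in for "a fixed subset of ids"; the function names and the `buckets` parameter are illustrative and not part of the lecture.

```python
import random
from collections import Counter

def frequency_histogram(requests):
    """Map each frequency x to the number of distinct requests seen exactly x times."""
    per_request = Counter(requests)
    return Counter(per_request.values())

def sample_events(requests, rate, rng):
    """Event-based sample: keep every event independently with probability `rate`."""
    return [r for r in requests if rng.random() < rate]

def sample_by_id(requests, rate, buckets=100):
    """Id-based sample: keep ALL events of a fixed subset of ids.

    Assumption: ids are integers, and `id % buckets` spreads them evenly.
    """
    limit = round(rate * buckets)
    return [r for r in requests if r % buckets < limit]

if __name__ == "__main__":
    rng = random.Random(42)
    # Same sizes as in the example: 200,000 events over 40,000 distinct requests.
    requests = [rng.randrange(40_000) for _ in range(200_000)]
    samples = {
        "all queries": requests,
        "10% of events": sample_events(requests, 0.1, rng),
        "10% of ids": sample_by_id(requests, 0.1),
    }
    for name, sample in samples.items():
        print(name, sorted(frequency_histogram(sample).items()))
```

Sampling events thins out every request, so the frequencies in the sample shrink. Sampling by id keeps every occurrence of the selected requests, so their frequencies are unchanged and only the number of requests per frequency is scaled down.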

(6)

Histograms and Frequency Skews

(7)

stream and histogram

consider the following input:

[figure: stream of observations; x-axis: time (0 to 100), y-axis: objects/buckets (1 to 6)]

• as time/stream progresses, data points come in – e.g. users issue requests

• distinguished by some id or bucket (from hashing)

• some are seen more often (e.g. 4), some less often (e.g. 1) – e.g. user 4 sends requests with high frequency, user 1 sends only one request

this is highly valuable information for an analysis

(8)

stream and histogram

[figure: stream of observations; x-axis: time (0 to 100), y-axis: objects/buckets (1 to 6)]

to analyze these frequency distributions, histograms are helpful:

[figure: histogram; x-axis: object (1 to 6), y-axis: frequency (0 to 30)]

(9)

comparing histograms - different distributions

an example of two different streams of observations:

[figure: two histograms, one per stream; axes: objects, frequency (0 to 700)]

both have an equal number of data points (10,000) and distinct objects (60), but the objects have different probabilities of being observed

sorting objects by frequencies makes the difference more obvious:

[figure: the same two histograms with objects sorted by frequency; axes: objects, frequency (0 to 700)]

(10)

the plan

• information about the distribution of observations is crucial for many applications

• knowing the complete, exact histogram

– would be helpful

– is often not possible, due to the large number of distinct objects

workaround:

• characterize the histogram without knowing the complete picture

• characteristic properties are easier to determine

• analogous to descriptions of distributions on ℝ

(11)

characterizing frequency distributions

[figure: histogram; x-axis: object (1 to 6), y-axis: frequency (0 to 30)]

• $m_i$: frequency of object $i$

• number of distinct objects seen so far: $\sum_i (m_i)^0$

• total number of objects seen so far: $\sum_i (m_i)^1 = \sum_i m_i$

generalization: $M_k = \sum_i (m_i)^k$, the $k$th moment
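For a stream small enough to hold the full histogram in memory, the moments can be computed exactly. The following is a minimal sketch of that exact baseline (the function name `frequency_moment` is illustrative); it is what streaming approximations aim to estimate when the histogram does not fit into memory.

```python
from collections import Counter

def frequency_moment(stream, k):
    """Exact k-th frequency moment M_k = sum_i (m_i)^k.

    Builds the complete histogram, so memory grows with the number
    of distinct objects.
    """
    counts = Counter(stream)  # m_i for every object i that was seen
    return sum(m ** k for m in counts.values())

stream = "abcab"  # a: 2, b: 2, c: 1
print(frequency_moment(stream, 0))  # number of distinct objects
print(frequency_moment(stream, 1))  # total number of objects
print(frequency_moment(stream, 2))  # second moment
```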

(12)

M₂ – the second frequency moment

what we have so far

• M₀ – Flajolet-Martin algorithm from last lecture

• M₁ – counting

• combination: average frequency M₁/M₀

next: estimate $M_2 = \sum_i m_i^2$

(13)

M₂ – the second frequency moment

[figure: two histograms; axes: objects, frequency (0 to 700); first panel: M₂ = 1,678,672; second panel: M₂ = 3,320,852]

Motivation

Motivation

• M₂ describes the “skewness” of a distribution

• a smaller M₂ indicates a less skewed distribution

• related to the Gini-Index (surprise index)

• used to limit approximation errors and for query optimization in database systems

(14)

M₂ and Var(X)

• variance describes the distribution of values

• M₂ describes the distribution of their frequencies

• M₂ comparable to variance of frequencies:

$\mathrm{Var}(\{m_i\}) = \frac{1}{N} \sum_i \left(m_i - \mu(\{m_i\})\right)^2$
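The link between M₂ and the variance can be written out explicitly. The short derivation below assumes that the normalizer N in the variance formula is the number of distinct objects, i.e. N = M₀, so that the mean frequency is µ = M₁/M₀; the slide does not state this explicitly.

```latex
\begin{align*}
\mathrm{Var}(\{m_i\})
  &= \frac{1}{M_0} \sum_i (m_i - \mu)^2
     \qquad \text{with } \mu = \frac{M_1}{M_0} \\
  &= \frac{1}{M_0} \sum_i m_i^2 \;-\; 2\mu \cdot \frac{1}{M_0} \sum_i m_i \;+\; \mu^2 \\
  &= \frac{M_2}{M_0} - \left(\frac{M_1}{M_0}\right)^2
\end{align*}
```

Under this assumption, for a fixed number of distinct objects M₀ and a fixed stream length M₁, a larger M₂ corresponds directly to a larger variance of the frequencies.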

(15)

M₂ – the second frequency moment: approximation

• storing and counting all distinct objects is often impossible

• approximation by the Alon-Matias-Szegedy algorithm¹:

algorithm

• N observations in stream

• choose k random positions $p_j \in \{1, \dots, N\}$

• when reaching position $p_j$:

– store the object at this position

– start counting occurrences of this object (from $p_j$ onward) in $m_j$

• estimate: $M_2 \approx \frac{N}{k} \sum_{j=1}^{k} (2 m_j - 1)$

¹ Alon, N.; Matias, Y.; Szegedy, M.: “The space complexity of approximating the frequency moments”, 1999
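The estimator can be sketched in a few lines for a stream of known length. This is an illustrative implementation under stated assumptions, not the authors' code: the stream is given as a sequence, positions are 1-based as on the slide, and the names `ams_estimate` and `ams_second_moment` are hypothetical.

```python
import random

def ams_estimate(stream, positions):
    """AMS estimate of M2 for given 1-based start positions.

    For each position p_j the object found there is stored and its
    occurrences from p_j onward are counted in m_j.
    """
    n, k = len(stream), len(positions)
    starts = {}
    for j, p in enumerate(positions):
        starts.setdefault(p, []).append(j)
    unset = object()           # sentinel: counter j has not started yet
    objects = [unset] * k
    counts = [0] * k
    for t, x in enumerate(stream, start=1):
        for j in starts.get(t, []):
            objects[j] = x     # start counting at position p_j
        for j in range(k):
            if objects[j] is not unset and objects[j] == x:
                counts[j] += 1
    return n / k * sum(2 * m - 1 for m in counts)

def ams_second_moment(stream, k, rng=None):
    """Choose k positions uniformly at random and return the estimate."""
    rng = rng or random.Random()
    positions = [rng.randint(1, len(stream)) for _ in range(k)]
    return ams_estimate(stream, positions)
```

Using every position exactly once as a start position makes the estimate equal to the true M₂, which illustrates why a single random position gives the right value in expectation.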

(16)

M₂ – the second frequency moment: example

position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
object:    c  e  c  f  a  e  g  f  f  b  b  c  g  b  a  a  f  d  a  e

• N = 20

• random positions 3, 7, 14, 5

• position 3: encounter c, counting results in 2

• position 7: encounter g, counting results in 2

• position 14: encounter b, counting results in 1

• position 5: encounter a, counting results in 4

• estimate:

$M_2 \approx \frac{20}{4} \left[ 2 \cdot (2 \cdot 2 - 1) + (2 \cdot 1 - 1) + (2 \cdot 4 - 1) \right] = \frac{20}{4} \cdot 14 = 70$

• true value: $M_2 = 4^2 + 3^2 + 3^2 + 1^2 + 3^2 + 4^2 + 2^2 = 64$

(17)

M₂ – the second frequency moment: summary

• the algorithm is simple to implement

• needs to store only the k counters

• gets more precise with larger k, proof idea:

– each single-counter estimate $N \cdot (2 m_j - 1)$ has expected value $M_2$

– the average of k such estimates approaches $M_2$

problem: N may not be known in the beginning

(18)

approximating M₂ with unknown stream length

• stream may be of unknown length or unlimited

• still, each position must be chosen uniformly at random from {1, . . . , N}

solution

• keep counts for k objects, beginning with the first k

• when the object at position p > k is processed:

– choose it with probability k/p

– if chosen, drop one of the existing elements (each with equal probability) and start counting the new object

• result: at any time, each position seen so far is chosen with equal probability
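The replacement scheme is the standard reservoir sampling technique. The sketch below combines it with the M₂ estimate; it assumes 1-based positions, so the object at position p is chosen with probability k/p, and the function name `ams_unbounded` is illustrative.

```python
import random

def ams_unbounded(stream, k, rng=None):
    """Estimate M2 in one pass without knowing the stream length.

    Keeps k (object, counter) pairs. The first k positions are always
    tracked; afterwards position p replaces a random tracked position
    with probability k/p (reservoir sampling).
    """
    rng = rng or random.Random()
    objects, counts = [], []
    seen = 0
    for seen, x in enumerate(stream, start=1):
        # every running counter that tracks x sees one more occurrence
        for j in range(len(objects)):
            if objects[j] == x:
                counts[j] += 1
        if seen <= k:
            objects.append(x)
            counts.append(1)
        elif rng.random() < k / seen:
            j = rng.randrange(k)   # drop one existing element uniformly
            objects[j] = x
            counts[j] = 1          # counting restarts at this position
    if not counts:
        return 0.0
    return seen / len(counts) * sum(2 * m - 1 for m in counts)
```

The estimate can be read off at any time, using the number of objects seen so far in place of N.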

(19)

clustering data streams

(20)

clustering data streams – the problem

• many formulations of “the” clustering problem possible

• wide application ranges, strong variance in

– preconditions

– objective function

• common ground:

– objects connected by relation

– identify groups of “similar objects” with respect to relation

– problem is intractable (NP-hard)

some basic questions

• what kind of relation (e.g. binary, distance, similarity)

• can objects have a mean value (continuous space)

• what is a “good” cluster (objective function)

• possibility of overlapping clusters

(21)

clustering data streams – STREAM

in the following: a single example problem and a single algorithm

• k-median

• on a data stream

• in one pass

• with guaranteed approximation quality

• algorithm: STREAM

• Guha, Mishra, Motwani, O’Callaghan: “Clustering Data Streams”, 2000

(22)

clustering data streams – the k-median problem

• input:

– objects $X = \{x_i : i = 1, \dots, N\}$

– distance $d : X \times X \to \mathbb{R}$

– every $x_i$ is seen once in arbitrary order ($i = 1, \dots, N$)

– $k$: number of clusters to find

• objective:

– identify k elements $m_1, \dots, m_k \in X$ (cluster centers)

– let $N(m_j) = \{x_i \in X : j = \arg\min_{l \in \{1, \dots, k\}} d(x_i, m_l)\}$, i.e. all $x_i$ for which $m_j$ is the nearest center

– minimize $C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} d(x_i, m_j)$
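The objective can be evaluated directly for a given set of centers. The following sketch shows the assignment N(m_j) and the cost C; the function names and the one-dimensional example data are assumptions for illustration, and ties between equally near centers are broken in favor of the lower index.

```python
def assign(points, centers, dist):
    """Return a dict j -> N(m_j): the points whose nearest center is centers[j]."""
    clusters = {j: [] for j in range(len(centers))}
    for x in points:
        nearest = min(range(len(centers)), key=lambda l: dist(x, centers[l]))
        clusters[nearest].append(x)
    return clusters

def kmedian_cost(points, centers, dist):
    """C({m_1..m_k}): sum of distances of all points to their nearest center."""
    clusters = assign(points, centers, dist)
    return sum(dist(x, centers[j]) for j, members in clusters.items() for x in members)

def absolute_difference(a, b):
    return abs(a - b)

points = [0, 1, 2, 10, 11]
print(assign(points, [1, 10], absolute_difference))
print(kmedian_cost(points, [1, 10], absolute_difference))
```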

(23)

clustering data streams – approximating k-median

• for small problem instances, k-median can be approximated within a fixed factor

• fixed parameter approximation: $C_{approx} \le a \cdot C_{opt}$

(for a fixed a, the approximation is at most a factor a worse than the optimal solution)

• this approximation is useful to approximate larger instances

approximation (idea)

• k-medians can be stated as an integer program P_I

• this program can be relaxed to a linear program P_L

• a solution of P_L can be rounded to a solution of P_I

linear programs can be solved efficiently

(24)

clustering data streams – weighted k-medians

extending k-medians with weights:

• k-medians with weighted samples $w : X \to \mathbb{R}_{>0}$

• distance of objects to their centers multiplied by weight:

$C(\{m_1, \dots, m_k\}) = \sum_{j=1}^{k} \sum_{x_i \in N(m_j)} w(x_i) \cdot d(x_i, m_j)$

• k-medians is the special case with unit weights

• weighted k-medians can be approximated similarly to k-medians

• the algorithm can only be applied to “small” instances

• use it to solve small sub-problems

in the following, use procedure wkm():

input: objects, weights, k

output: k weighted centers

(25)

first step - clustering with low memory

approach: divide and conquer

Small-Space(X)

1. divide X into l disjoint subsets X₁, . . . , X_l

2. cluster each X_i individually into k clusters

3. result: X′, a set of lk cluster centers

4. cluster X′, using |N(c)| as weight for each c ∈ X′

• step 2 can be solved with a constant factor approximation:

– solution at most b times worse than optimum

• step 4 can be solved with a constant factor approximation:

– solution at most c times worse than optimum

result: constant factor approximation from the partial solutions and their combination
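The four steps can be sketched as follows. The `wkm` used here is an exhaustive search over all center sets; it is a stand-in that is only feasible for tiny inputs, not the linear-programming approximation mentioned in the lecture. Returning each center together with the total weight of the objects assigned to it is an assumption about what "weighted centers" means, and cutting X into consecutive chunks is just one possible way to form the subsets.

```python
from itertools import combinations

def wkm(objects, weights, k, dist):
    """Weighted k-median by brute force; returns [(center, assigned_weight), ...]."""
    k = min(k, len(objects))

    def cost(center_ids):
        return sum(w * min(dist(x, objects[c]) for c in center_ids)
                   for x, w in zip(objects, weights))

    best = min(combinations(range(len(objects)), k), key=cost)
    totals = {c: 0 for c in best}
    for x, w in zip(objects, weights):
        nearest = min(best, key=lambda c: dist(x, objects[c]))
        totals[nearest] += w          # accumulate |N(c)| (or total weight)
    return [(objects[c], totals[c]) for c in best]

def small_space(X, l, k, dist):
    """Divide-and-conquer clustering; returns k cluster centers."""
    if not X:
        return []
    # 1. divide X into (at most) l disjoint subsets of consecutive elements
    size = -(-len(X) // l)            # ceiling division
    subsets = [X[i:i + size] for i in range(0, len(X), size)]
    # 2. + 3. cluster every subset into k clusters, collect the centers X'
    centers, weights = [], []
    for subset in subsets:
        for center, weight in wkm(subset, [1] * len(subset), k, dist):
            centers.append(center)
            weights.append(weight)
    # 4. cluster X', weighting every center by the size of its cluster
    return [center for center, _ in wkm(centers, weights, k, dist)]

def absolute_difference(a, b):
    return abs(a - b)

print(small_space([0, 1, 2, 10, 11, 12, 1, 11], l=2, k=2, dist=absolute_difference))
```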

(26)

extending to a solution

Small-Space(X)

1. divide X into l disjoint subsets X₁, . . . , X_l

2. cluster each X_i individually into O(k) clusters

3. result: X′, a set of O(lk) cluster centers

4. cluster X′, using |N(c)| as weight for each c ∈ X′

• constant factor approximation

• needs to cluster X_i

– memory problem 1: size of subsets versus l

• needs to cluster X′

– memory problem 2: clustering O(lk) elements