• No results found

Big Data & Scripting Part II Streaming Algorithms

N/A
N/A
Protected

Academic year: 2022

Share "Big Data & Scripting Part II Streaming Algorithms"

Copied!
26
0
0

Loading.... (view fulltext now)

Full text

(1)

Big Data & Scripting Part II

Streaming Algorithms

(2)

a note on sampling and filtering

• sampling:

(randomly) choose a representative subset

• filtering:

given some criterion (e.g. membership in a set), retain only elements matching that criterion

• example scenario: stream of requests (user,request)

• sampling requests is straightforward (e.g. which pages are accessed most frequently)

• analyzing the distribution of frequencies is more complicated that is, we want to know, how many queries are repeated x times (for all x )

(3)

sampling and filtering example

• n = 200, 000 events, m = 40, 000 different requests, uniform distribution

all queries

s frequency 051015

10% sample

id

frequency

0 10000 20000 30000 40000

0123456

(4)

sampling and filtering example

• same dataset, but frequency vs. # queries with this frequency

all queries − by frequency

frequency

number of queries with frequency

0 5 10 15

0200040006000

10% sample − by frequency

frequency

number of queries with frequency

0 1 2 3 4 5 6

050001500025000

(5)

sampling and filtering example

• same dataset, but frequency vs. # queries with this frequency

• this time sample is selected by a fixed subset of ids

all queries − by frequency

frequency

number of queries with frequency

0 5 10 15

0200040006000

corrected 10% sample − by frequency

frequency

number of queries with frequency

0 5 10 15

0100300500700

(6)

Histograms and Frequency Skews

(7)

stream and histogram

consider the following input:

0 20 40 60 80 100

1 2 3 4 5 6

time

objects/buckets

• as time/stream progresses, data points come in – e.g. users issue requests

• distinguished by some id or bucket (from hashing)

• some are seen more often (e.g. 4) some less often (e.g. 1) – e.g. user 4 sending requests with high frequency,

user 1 only one request

this is highly valuable information for an analysis

(8)

stream and histogram

0 20 40 60 80 100

1 2 3 4 5 6

time

objects/buckets

to analyze these frequency distributions, histograms are helpful:

1 2 3 4 5 6

0 5 10 15 20 25 30

object

frequency

(9)

comparing histograms - different distributions

an example of two different streams of observations:

0 100 200 300 400 500 600 700

objects

frequency

0 100 200 300 400 500 600 700

objects

frequency

both have equal number of data points (10.000) and distinct objects (60) but objects have different probabilities to be observed

sorting objects by frequencies makes the difference more obvious:

0 100 200 300 400 500 600 700

objects

frequency

0 100 200 300 400 500 600 700

objects

frequency

(10)

the plan

• information about the distribution of observation is crucial for many applications

• knowing the complete, exact histogram – would be helpful

– is often not possible, due to the large number of distinct objects

workaround:

• characterize histogram without knowing the complete picture

• characteristic properties easier to determine

• analogous to descriptions of distributions on R

(11)

characterizing frequency distributions

1 2 3 4 5 6

0 5 10 15 20 25 30

object

frequency

• mi: frequency of object i

• number of distinct objects seen so far: Pi(mi)0

• total number of objects seen so far: Pi(mi)1 =Pimi

generalization: Mk =Pi(mi)k kth moment

(12)

M

2

– the second frequency moment

what we have so far

• M0 – Flajolet-Martin algorithm from last lecture

• M1 – counting

• combination: average frequency M1/M0

next: estimate M2 =Pimi2

(13)

M

2

– the second frequency moment

0 100 200 300 400 500 600 700

objects

frequency

M2 = 1.678.672

0 100 200 300 400 500 600 700

objects

frequency

M2 = 3.320.852

Motivation

• M2 describes the “skewness” of a distribution

• smaller M2 less skewed distribution

• related to the Gini-Index (surprise index)

• used to limit approximation errors, query optimization in database systems

(14)

M

2

and Var(X )

• variance describes the distribution of values

• M2 describes the distribution of their frequencies

• M2 comparable to variance of frequencies:

Var({mi}) = 1/NPi(mi − µ({mi}))2

(15)

M

2

– the second frequency moment: approximation

• storing and counting distinct objects impossible

• approximation by Alon-Matias-Szegedy algorithm1:

algorithm

• N observations in stream

• choose k random positions pj ∈ {1, . . . , N}

• when reaching position pj: – store object at position

– start counting occurrences of this object in mj

• estimate: M2 ≈ n/k(Pki =1(2mi − 1))

1Alon, N.; Matias Y.; Szegedy, M.:

“The space complexity of approximating the frequency moments”, 1999

(16)

M

2

– the second frequency moment: example

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 c e c f a e g f f b b c g b a a f d a e

• N=20

• random positions 3, 7, 14, 5

• position 3: encounter c, counting results in 2

• position 7: encounter g, 2

• position 14: b 1

• position 5 a 4

• estimate:

M2204[2 · (2 · 2 − 1) + (2 · 1 − 1) + (2 · 4 − 1] = 204 · 14 = 70

• true value: M2 = 42+ 32+ 32+ 12+ 32 + 42+ 22 = 64

(17)

M

2

– the second frequency moment: summary

• the algorithm is simple to implement

• needs to store only the k counters

• gets more precise with larger k, proof idea:

– expected value of each counter is fraction of M2 – average of k counters approaches M2

problem: N may not be known in the beginning

(18)

approximating M

2

with unknown stream length

• stream may be of unknown length or unlimited

• still each position must be chosen random and uniform from {1, . . . , N}

solution

• keep count of k objects beginning with the first k

• when object at position p > k is processed:

– choose with probability k/(p + 1)

– drop existing element (chosen with equal probability) each position chosen with equal probability

(19)

clustering data streams

(20)

clustering data streams – the problem

• many formulations of “the” clustering problem possible

• wide application ranges, strong variance in – preconditions

– objective function

• common ground:

– objects connected by relation

– identify groups of “similar objects” with respect to relation – problem is intractable (N P-hard)

some basic questions

• what kind of relation (e.g. binary, distance, similarity)

• can objects have a mean value (continuous space)

• what is a “good” cluster (objective function)

• possibility of overlapping clusters

(21)

clustering data streams – STREAM

in the following: a single example problem and a single algorithm

• k-median

• on a data stream

• in one pass

• with guaranteed approximation quality

• algorithm: STREAM

• Guha, Mishra,Motwani, O’Callaghan:

“Clustering Data Streams”,2000

(22)

clustering data streams – the k-median problem

• input:

– objects X = {xi : i = 1, . . . , N}

– distance d : X × X → R

– every xi is seen once in arbitrary order (i = 1, . . . , N) – k - number of clusters to find

• objective:

– identify k elements m1, . . . , mk ∈ X (cluster centers) – let N(mj) = {xi ∈ X : j = arg minl ∈1,...,kd (xi, ml)}

all xi for which mi is the nearest center

– minimize C ({m1, . . . , mk}) = Pk

j=1

P

xi∈N(mj)

d (xi, mj)

(23)

clustering data streams – approximating k-median

• for small problem instances k-median can be fixed parameter approximated

• fixed parameter approximation: Capprox ≤ a · Qopt

(approximation is maximal by factor a worse than optimal solution for fixed a)

• this approximation is useful to approximate larger instances

approximation (idea)

• k-medians can be stated as integer program PI

• this program can be relaxed to a linear program PL

• solution of PL can be rounded to solution of PI

linear problems can be solved efficiently

(24)

clustering data streams – weighted k-medians extending k-medians with weights:

• k-medians with weighted samples w : X → R>0:

• distance of objects to their centers multiplied by weight:

C ({m1, . . . , mk}) =PjPi ∈1,...,Nw (xi) · d (xi, mj)

• k-medians is special case with unit weights

• weighted k-means can be approximated similar to k-means:

• algorithm can only be applied to “small” instances

• use it to solve small sub-problems in the following, use procedure: wkm() input: objects, weights, k

output: k weighted centers

2

(25)

first step - clustering with low memory

approach: divide and conquer Small-Space(X)

1. divide X into l disjoint subsets X1, . . . , Xl 2. cluster each Xi individually into l · k clusters 3. result: X0 set of lk cluster centers

4. cluster X0, using for each c ∈ X0 |N(c)| as weight

• 2. can be solved with a constant factor approximation:

– solution ≤ b times worse than optimum

• 4. can be solved with constant factor approximation not worse than ≤ c times optimum

result: constant factor approximation partial solutions and their combination

(26)

extending to a solution

Small-Space(X)

1. divide X into l disjoint subsets X1, . . . , Xl 2. cluster each Xi individually into O(k) clusters 3. result: X0 set of O(lk) cluster centers

4. cluster X0, using for each c ∈ X0 |N(c)| as weight

• constant factor approximation

• needs to cluster Xi

memory problem 1: size of subsets versus l

• needs to cluster X0

memory problem 2: clustering O(lk) elements

References

Related documents

c) Is the firm moving towards or already accepting EMV technology? Explain fully and provide the time frame.. purchase basis? Do you offer an equipment maintenance plan? If

We demonstrate the effectiveness of the proposed approach on two data sets of handwritten digits: Western Ara- bic (MNIST) and Farsi for which we achieve high recognition rates of

2_o5 SCHEDULE scuevuum; CONSTRAINTS / SEARCH ENGINE 2_o§ , POTENTIAL SCHEDULES ANALYSIS OF scneouuss ESTIMATED SERVICE LEVELS AGENT REQUIREMENTS SCORING FUNCTION 210

Henry Mayhew spent whole chapters in Criminal Prisons of London categorizing the nature of crime and the criminal, almost always assuming a male subject, but when he

As per Reserve Bank of India guidelines, all current and savings bank accounts in which if there is no customer induced debit / credit transactions for over a period of 2 years

The risk assessment process requires examiners to determine: (1) products, services, and activities that are considered material to the organization; (2) the level of inherent

• Requests for image files associated with requests for particular pages; an user’s request to view a particular page often results in several log entries

1) This icon allows the user access to the review requests. 2) Select the list of requests or tasks for approval. 3) Filter the review request. 4) List of assigned requests. 5)