• No results found

Outline. 1 Introduction. 2 Algorithm. 3 Sum Problem. 4 Trends. 5 References. Stream Statistics Over Sliding Window. Anil Maheshwari.

N/A
N/A
Protected

Academic year: 2021

Share "Outline. 1 Introduction. 2 Algorithm. 3 Sum Problem. 4 Trends. 5 References. Stream Statistics Over Sliding Window. Anil Maheshwari."

Copied!
19
0
0

Loading.... (view fulltext now)

Full text

(1)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Stream Statistics Over Sliding Window

Anil Maheshwari

School of Computer Science Carleton University

(2)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Outline

1 Introduction 2 Algorithm 3 Sum Problem 4 Trends 5 References

(3)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Problem Setting

Main Problem

The input is an endless stream of binary bits. At any time,

among the lastN bits received, we are interested in

queries that seek an approximate count of the number of

1’s in the stream among the lastkbits, wherek≤N.

Result: A data structure of sizeO(1 log2N)that can

approximate the count of the number of1s within a factor

of1±

Reference: Maintaining stream statistics over sliding windows by Datar, Gionis, Indyk, and Motwani, SIAM Jl. Computing 2002

(4)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Variants

1 A stream of positive numbers. The query consists of

a valuek∈ {1, . . . , N}, and we want to know the

(approximate) sum of the lastknumbers in the

stream. (Uses sublinear space.)

2 A stream consisting of numbers from the set

{−1,0,+1}. We want to maintain the sum of lastN

numbers of the stream. (RequiresΩ(N)bits of

storage to approximate the sum that is within a constant factor of the exact sum.)

3 What are the most popular movies in the last week?

(5)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Main Problem

Main Problem

Report an approximate count of the number of1’s in the

stream of binary bits among the lastkbits, wherek≤N.

(6)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Algorithm for Approximate Count

Algorithm uses two structures:

Time Stamps: To track the most recentN bits.

Buckets: With the following features:

O(logN)buckets maintain the1’s among the latest

N bits

Number of1’s in a bucket is a power of2

Each1-bit is assigned to exactly one bucket (0-bit may or may not be assigned to any bucket)

At most two buckets of a givensize(size= #1s)

Each bucket stores time stamp of its most recent bit Most recent bit of any bucket is1-bit

(7)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Algorithm contd.

On receiving a new bit in the data stream:

0-bit: Increment the time stamp of each of the buckets by

1, and if any of the buckets time stamp exceedsN, we

discard that bucket.

1-bit: Following updates are done:

1 Create a bucketB

0 consisting of the newest1-bit

with a time stamp of1.

2 Scan the list of buckets in order of increasing size.

Case 1: Two buckets of size1.

Increment time stamp of each bucket (and possibly

discard buckets whose time stamps exceedN)

(8)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Illustration

1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 N A B 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 C 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 B0 B0 B1 B1 B2 B0 B0 B1 B1 B2 B0 B1 B2 B2 B0 B0 B1 B2 B2 B0 B0 B1 B2 D E Time Stamp 1 Time Stamp N

(9)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Space Complexity

We have:

O(logN)buckets as the size of window isN

BucketBi stores2i 1-bits

For each bucket we store its time stamp and its size

Time stamps requiresO(logN)bits

Storingiwith bucketBiis sufficient for its size

As0≤i≤logN,ican represented using

O(log logN)bits

Total space required

(10)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Time Complexity

On receiving a0-bit:

- We update time stamps of each of theO(logN)buckets

- RequiresO(logN)time

On receiving a1-bit:

- We update the time stamps of each bucket - Potentially merge & cascade buckets

- Time (merge & cascade)≈# of buckets

- Can be performed inO(logN)time

(11)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Answering Query

Query Problem

For any query valuek∈ {1, . . . , N}, report an

approximate count of the number of1’s among the latest

kbits of the stream.

1 Initializecount:= 0

2 Traverse buckets from right to left. For each bucket of

typeBithat is encountered in the traversal:

1 Biis completely contained in the window:

Increment count by2i

2 Biis completely outside the window:

count remains unchanged

3 Partially overlaps the window:

Increment count by2i

2

(12)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Analysis of Approximation Factor

Observation: Except of one bucket, sayBj, that is

partially in the window of sizek, we know that all buckets

of typeB0, B1, . . . , Bj−1 are completely within the window.

- For those buckets, the count of the number of1-bits is

j−1

P

i=0

2i ≥2j −1

- The true count (and the approximate count) value is at least2j (as the last bit ofBj is in the window of interest)

- For the bucketBj that overlaps partially with the

window, the number of bits that can be in the true count

can be anywhere from0upto2j−1. But we only took a

contribution of2j−1 in the reported count value

- Ratio of the true count to the reported count is within a factor of(12,2).

(13)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Refining the Analysis

- Letr≥2be an integer parameter.

- Maintainr−1orr copies ofBi for eachi≥1(buckets

of typeB0 may be less thanr−1)

- At any time we exceedrcopies of any type of buckets,

we take the oldest two buckets and merge them to form a new bucket of the next size.

- For the query, assume that the bucket labelledBj is only

partially overlapping the query window. - At least1 +

j−1

P

i=1

(r−1)2i1-bits are in the query window. - True count and the reported value are within a factor of

1± 1

r−1

=⇒ By settingr= 1 +1, we obtain a data structure of

sizeO(1 log2N)that approximates the count of the number of1s within a factor of1±.

(14)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Computation of Sum

The Sum Problem

A stream of positive numbers. The query consists of a valuek∈ {1, . . . , N}, and we want to know the

(approximate) sum of the lastknumbers in the stream.

(15)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Approach I: Computation of Sum

Assumingd-bit numbers. For each bit position, maintain a

stream. Approximate number of10sin each stream.

Report approximate sum value as

d−1

P

i=0

count(i)2i

(16)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Approach II: Computation of Sum

If the next number in the stream isx, insertx10s in the stream

(17)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

What is Trending?

Among the last1012movie tickets sold, list allpopular

movies?

Letc:= 10−3. Maintain (decaying) scores for movies

whose threshold is at leastτ ∈(0,1). For each new sale

of ticket (say for MovieM):

1 For each movie whose score is being maintained, its

new score is reduced by a factor of(1−c)

2 If we have the score ofM, add1to that score.

Otherwise, create a new score forM and initialize it

to1

(18)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Questions

1 How many scores are maintained at any given time?

2 What is sum of all scores at any point of time?

3 Answer above questions forτ = 1

2 and 1 3.

(19)

Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References

Conclusions

Main References:

1 Maintaining stream statistics over sliding windows, by

Datar, Gionis, Indyk, and Motwani, SIAM Jl. Computing 2002.

2 Chapter in MMDS book (mmds.org)

3 Chapter on Data Streams in My Notes on Topics in

References

Related documents

Solutions in this category may also offer enhanced remote control features, either by integrating with native Windows features (Remote Desktop, Remote Assistance) or by providing

I have talked about general ambitions, I have talked about the historical and geographical scope and what I want to do now is talk more about –something which is of much interest to

The studies that explicitly incorporate income and/or power distribution as a mediating factor of the income-environment relationship seem to agree on at least two conclusions:

The unavailability of real data for system available capacity, reliability indices of various components of a power grid are also a driving factors in choosing the IEEE-

* students who did not graduate or who graduated with another degree or certificate (but no bachelor's degree) Number of Enrolled Students Awarded Non-need-based Scholarships and

It especially gives arguments to disentangle the two major types of relations at work in food supply chains and especially SSFCs: namely distance relations on

(43) Datum waarop de aanvraag ter inzage is gelegd (43) Date of publication by printing or similar process of anexamined document on which no grant has taken place on or before

Controlled clinical trials (i.e. treatment given by experi- mental protocols compared with standard treatment) will be crucial for evaluating the efficacy and potential side