Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Stream Statistics Over Sliding Window
Anil Maheshwari
School of Computer Science Carleton University
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Outline
1 Introduction 2 Algorithm 3 Sum Problem 4 Trends 5 ReferencesStream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Problem Setting
Main ProblemThe input is an endless stream of binary bits. At any time,
among the lastN bits received, we are interested in
queries that seek an approximate count of the number of
1’s in the stream among the lastkbits, wherek≤N.
Result: A data structure of sizeO(1 log2N)that can
approximate the count of the number of1s within a factor
of1±
Reference: Maintaining stream statistics over sliding windows by Datar, Gionis, Indyk, and Motwani, SIAM Jl. Computing 2002
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Variants
1 A stream of positive numbers. The query consists of
a valuek∈ {1, . . . , N}, and we want to know the
(approximate) sum of the lastknumbers in the
stream. (Uses sublinear space.)
2 A stream consisting of numbers from the set
{−1,0,+1}. We want to maintain the sum of lastN
numbers of the stream. (RequiresΩ(N)bits of
storage to approximate the sum that is within a constant factor of the exact sum.)
3 What are the most popular movies in the last week?
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Main Problem
Main ProblemReport an approximate count of the number of1’s in the
stream of binary bits among the lastkbits, wherek≤N.
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Algorithm for Approximate Count
Algorithm uses two structures:
Time Stamps: To track the most recentN bits.
Buckets: With the following features:
O(logN)buckets maintain the1’s among the latest
N bits
Number of1’s in a bucket is a power of2
Each1-bit is assigned to exactly one bucket (0-bit may or may not be assigned to any bucket)
At most two buckets of a givensize(size= #1s)
Each bucket stores time stamp of its most recent bit Most recent bit of any bucket is1-bit
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Algorithm contd.
On receiving a new bit in the data stream:
0-bit: Increment the time stamp of each of the buckets by
1, and if any of the buckets time stamp exceedsN, we
discard that bucket.
1-bit: Following updates are done:
1 Create a bucketB
0 consisting of the newest1-bit
with a time stamp of1.
2 Scan the list of buckets in order of increasing size.
Case 1: Two buckets of size1.
Increment time stamp of each bucket (and possibly
discard buckets whose time stamps exceedN)
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Illustration
1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 N A B 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 C 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 1 0 0 1 1 0 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 0 B0 B0 B1 B1 B2 B0 B0 B1 B1 B2 B0 B1 B2 B2 B0 B0 B1 B2 B2 B0 B0 B1 B2 D E Time Stamp 1 Time Stamp NStream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Space Complexity
We have:O(logN)buckets as the size of window isN
BucketBi stores2i 1-bits
For each bucket we store its time stamp and its size
Time stamps requiresO(logN)bits
Storingiwith bucketBiis sufficient for its size
As0≤i≤logN,ican represented using
O(log logN)bits
Total space required
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Time Complexity
On receiving a0-bit:- We update time stamps of each of theO(logN)buckets
- RequiresO(logN)time
On receiving a1-bit:
- We update the time stamps of each bucket - Potentially merge & cascade buckets
- Time (merge & cascade)≈# of buckets
- Can be performed inO(logN)time
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Answering Query
Query ProblemFor any query valuek∈ {1, . . . , N}, report an
approximate count of the number of1’s among the latest
kbits of the stream.
1 Initializecount:= 0
2 Traverse buckets from right to left. For each bucket of
typeBithat is encountered in the traversal:
1 Biis completely contained in the window:
Increment count by2i
2 Biis completely outside the window:
count remains unchanged
3 Partially overlaps the window:
Increment count by2i
2
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Analysis of Approximation Factor
Observation: Except of one bucket, sayBj, that is
partially in the window of sizek, we know that all buckets
of typeB0, B1, . . . , Bj−1 are completely within the window.
- For those buckets, the count of the number of1-bits is
j−1
P
i=0
2i ≥2j −1
- The true count (and the approximate count) value is at least2j (as the last bit ofBj is in the window of interest)
- For the bucketBj that overlaps partially with the
window, the number of bits that can be in the true count
can be anywhere from0upto2j−1. But we only took a
contribution of2j−1 in the reported count value
- Ratio of the true count to the reported count is within a factor of(12,2).
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Refining the Analysis
- Letr≥2be an integer parameter.
- Maintainr−1orr copies ofBi for eachi≥1(buckets
of typeB0 may be less thanr−1)
- At any time we exceedrcopies of any type of buckets,
we take the oldest two buckets and merge them to form a new bucket of the next size.
- For the query, assume that the bucket labelledBj is only
partially overlapping the query window. - At least1 +
j−1
P
i=1
(r−1)2i1-bits are in the query window. - True count and the reported value are within a factor of
1± 1
r−1
=⇒ By settingr= 1 +1, we obtain a data structure of
sizeO(1 log2N)that approximates the count of the number of1s within a factor of1±.
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Computation of Sum
The Sum ProblemA stream of positive numbers. The query consists of a valuek∈ {1, . . . , N}, and we want to know the
(approximate) sum of the lastknumbers in the stream.
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Approach I: Computation of Sum
Assumingd-bit numbers. For each bit position, maintain a
stream. Approximate number of10sin each stream.
Report approximate sum value as
d−1
P
i=0
count(i)2i
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Approach II: Computation of Sum
If the next number in the stream isx, insertx10s in the stream
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
What is Trending?
Among the last1012movie tickets sold, list allpopular
movies?
Letc:= 10−3. Maintain (decaying) scores for movies
whose threshold is at leastτ ∈(0,1). For each new sale
of ticket (say for MovieM):
1 For each movie whose score is being maintained, its
new score is reduced by a factor of(1−c)
2 If we have the score ofM, add1to that score.
Otherwise, create a new score forM and initialize it
to1
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Questions
1 How many scores are maintained at any given time?
2 What is sum of all scores at any point of time?
3 Answer above questions forτ = 1
2 and 1 3.
Stream Statistics Over Sliding Window Anil Maheshwari Introduction Algorithm Sum Problem Trends References
Conclusions
Main References:1 Maintaining stream statistics over sliding windows, by
Datar, Gionis, Indyk, and Motwani, SIAM Jl. Computing 2002.
2 Chapter in MMDS book (mmds.org)
3 Chapter on Data Streams in My Notes on Topics in