• No results found

Data Streams: Where to Go? PODS 11, Tutorial S.

N/A
N/A
Protected

Academic year: 2021

Share "Data Streams: Where to Go? PODS 11, Tutorial S."

Copied!
60
0
0

Loading.... (view fulltext now)

Full text

(1)

Data Streams: Where to Go?

PODS 11, Tutorial

S. Muthu Muthukrishnan

(2)

Familiar Puzzle: Missing Number

I A shows B numbers 1 ; : : : ; n but in a permuted order and leaves out one of the numbers.

I B has to determine the missing number.

I Key: B has only 2 log n bits.

I Solution:

B maintains the running sum s of numbers seen.

Missing number is n (n+1) 2 s .

(3)

Familiar Puzzle: Missing Number

I A shows B numbers 1 ; : : : ; n but in a permuted order and leaves out one of the numbers.

I B has to determine the missing number.

I Key: B has only 2 log n bits.

I Solution:

B maintains the running sum s of numbers seen.

Missing number is n (n+1) 2 s .

(4)

Familiar Puzzle: Missing Number

I A shows B numbers 1 ; : : : ; n but in a permuted order and leaves out one of the numbers.

I B has to determine the missing number.

I Key: B has only 2 log n bits.

I Solution:

B maintains the running sum s of numbers seen.

Missing number is n (n+1) 2 s .

(5)

Familiar Puzzle: Missing Number

I A shows B numbers 1 ; : : : ; n but in a permuted order and leaves out one of the numbers.

I B has to determine the missing number.

I Key: B has only 2 log n bits.

I Solution:

B maintains the running sum s of numbers seen.

Missing number is n (n+1) 2 s .

(6)

A New Puzzle: One Word Median

I A sees items i 1 ; i 2 ; : : : arrive in a stream.

I A has to maintain the median m j of the items i 1 ; : : : ; i j .

I Each i j generated independently and randomly from some unknown distribution D over integers [1; n] .

I Key: A is allowed to store only one word of memory (of log n bits).

I Solution. Maintain  j .

If i j +1 >  j ,  j +1  j + 1 .

If i j +1 <  j ,  j +1  j 1.

(7)

A New Puzzle: One Word Median

I A sees items i 1 ; i 2 ; : : : arrive in a stream.

I A has to maintain the median m j of the items i 1 ; : : : ; i j .

I Each i j generated independently and randomly from some unknown distribution D over integers [1; n] .

I Key: A is allowed to store only one word of memory (of log n bits).

I Solution. Maintain  j .

If i j +1 >  j ,  j +1  j + 1 .

If i j +1 <  j ,  j +1  j 1.

(8)

A New Puzzle: One Word Median

I A sees items i 1 ; i 2 ; : : : arrive in a stream.

I A has to maintain the median m j of the items i 1 ; : : : ; i j .

I Each i j generated independently and randomly from some unknown distribution D over integers [1; n].

I Key: A is allowed to store only one word of memory (of log n bits).

I Solution. Maintain  j .

If i j +1 >  j ,  j +1  j + 1 .

If i j +1 <  j ,  j +1  j 1.

(9)

A New Puzzle: One Word Median

I A sees items i 1 ; i 2 ; : : : arrive in a stream.

I A has to maintain the median m j of the items i 1 ; : : : ; i j .

I Each i j generated independently and randomly from some unknown distribution D over integers [1; n].

I Key: A is allowed to store only one word of memory (of log n bits).

I Solution. Maintain  j .

If i j +1 >  j ,  j +1  j + 1 .

If i j +1 <  j ,  j +1  j 1.

(10)

A New Puzzle: One Word Median

I A sees items i 1 ; i 2 ; : : : arrive in a stream.

I A has to maintain the median m j of the items i 1 ; : : : ; i j .

I Each i j generated independently and randomly from some unknown distribution D over integers [1; n].

I Key: A is allowed to store only one word of memory (of log n bits).

I Solution. Maintain  j .

If i j +1 >  j ,  j +1  j + 1 .

If i j +1 <  j ,  j +1  j 1.

(11)

A Basic Problem: Indexing

I Imagine a virtual array F [1    n] ,

I Updates: F [i] + + , F [i] .

I Query: F [i] =?.

(12)

A Basic Problem: Indexing

I Imagine a virtual array F [1    n] ,

I Updates: F [i] + + , F [i] .

I Query: F [i] =?.

(13)

A Basic Problem: Indexing

I Imagine a virtual array F [1    n] ,

I Updates: F [i] + + , F [i] .

I Query: F [i] =?.

(14)

Count-Min Sketch

I For each update F [i] + + ,

I

for each j = 1; : : : ; log(1=) , update cm [h

j

(i)] + + .

F [i] + + h1(i)

h2(i) 1

2

log(1/δ)

1 2 e/

+1

+1 +1

+1

cm array

I Estimate F ~ (i) = min j =1;:::;log(1=) cm [h j (i)] .

(15)

Count-Min Sketch

I For each update F [i] + + ,

I

for each j = 1; : : : ; log(1=) , update cm [h

j

(i)] + + .

F [i] + + h1(i)

h2(i) 1

2

log(1/δ)

1 2 e/

+1

+1 +1

+1

cm array

I Estimate F ~ (i) = min j =1;:::;log(1=) cm [h j (i)] .

(16)

Count-Min Sketch

I Claim: F [i]  ~ F [i] .

I Claim: With probability at least 1 , F ~ [i]  F[i] + " P j 6=i F [j ] .

I Space used is O ( 1 " log ( 1  ) .

I Time per update is O (log( 1  )) . Indep of n.

G. Cormode and S. Muthukrishnan: An improved data stream sum-

mary: count-min sketch and its applications. Journal of Algorithms.

(17)

Count-Min Sketch: The Proof

I With probability at least 1 , F ~ [i]  F[i] + " X

j 6=i

F [j ]:

I X i ;j is the expected contribution of F [j ] to the bucket containing i , for any h .

E (X i ;j ) = " e

X

j 6=i

F [j ]:

I Consider Pr( ~ F [i] > F[i] + " P j 6=i F [j ]):

Pr () = Pr(8j ; F[i] + X i ;j > F[i] + " X

j 6=i

F [j ])

= Pr(8j ; X i ;j  eE(X i ;j ))

< e log (1=) = 

(18)

Count-Min Sketch: The Proof

I With probability at least 1 , F ~ [i]  F[i] + " X

j 6=i

F [j ]:

I X i ;j is the expected contribution of F [j ] to the bucket containing i , for any h .

E (X i ;j ) = "

e X

j 6=i

F [j ]:

I Consider Pr( ~ F [i] > F[i] + " P j 6=i F [j ]):

Pr () = Pr(8j ; F[i] + X i ;j > F[i] + " X

j 6=i

F [j ])

= Pr(8j ; X i ;j  eE(X i ;j ))

< e log (1=) = 

(19)

Count-Min Sketch: The Proof

I With probability at least 1 , F ~ [i]  F[i] + " X

j 6=i

F [j ]:

I X i ;j is the expected contribution of F [j ] to the bucket containing i , for any h .

E (X i ;j ) = "

e X

j 6=i

F [j ]:

I Consider Pr ( ~ F [i] > F[i] + " P j 6=i F [j ]) :

Pr () = Pr(8j ; F[i] + X i ;j > F[i] + " X

j 6=i

F [j ])

= Pr(8j ; X i ;j  eE(X i ;j ))

< e log (1=) = 

(20)

Improve Count-Min Sketch?

I Index Problem:

I

A has n long bitstring and sends messages to B who wishes to compute the ith bit.

I

Needs (n) bits of communication.

I Reduction of estimating F[i] in data stream model.

I

I [1    1=(2")]

I

I [i] = 1 ! F[i] = 2

I

I [i] = 0 ! F[i] = 0; F[0] F[0] + 2:

I

Estimating F [i] to "jjFjj = 1 accuracy reveals I [i] .

(21)

Count-Min Sketch, The Challenge

1000000 items inserted 999996 items removed

Sketch of 1000 bytes

4 items lefts

I Recovering F [i] to 0:1jFj accuracy will retrieve each item

precisely.

(22)

Applications of Count-Min Sketch

I Solves many problems in the best possible/known bounds:

I

Data Mining: heavy hitters, heavy hitter differences.

I

Signal processing: Compressed sensing.

I

Statistics: Histograms, Wavelets, Clustering, Least squares.

I Applications to other CS/EE areas:

I

NLP, ML, Password checking.

I Systems, code, hardware.

Wiki: http://sites.google.com/site/countminsketch/

(23)

Applications of Count-Min Sketch

I Solves many problems in the best possible/known bounds:

I

Data Mining: heavy hitters, heavy hitter differences.

I

Signal processing: Compressed sensing.

I

Statistics: Histograms, Wavelets, Clustering, Least squares.

I Applications to other CS/EE areas:

I

NLP, ML, Password checking.

I Systems, code, hardware.

Wiki: http://sites.google.com/site/countminsketch/

(24)

Applications of Count-Min Sketch

I Solves many problems in the best possible/known bounds:

I

Data Mining: heavy hitters, heavy hitter differences.

I

Signal processing: Compressed sensing.

I

Statistics: Histograms, Wavelets, Clustering, Least squares.

I Applications to other CS/EE areas:

I

NLP, ML, Password checking.

I Systems, code, hardware.

Wiki: http://sites.google.com/site/countminsketch/

(25)

Applications of Count-Min Sketch

I Solves many problems in the best possible/known bounds:

I

Data Mining: heavy hitters, heavy hitter differences.

I

Signal processing: Compressed sensing.

I

Statistics: Histograms, Wavelets, Clustering, Least squares.

I Applications to other CS/EE areas:

I

NLP, ML, Password checking.

I Systems, code, hardware.

Wiki: http://sites.google.com/site/countminsketch/

(26)

Summary

I Broken the premise that data has to be

I

captured,

I

stored,

I

communicated, analyzed in entirety.

Polynomial time/space theory -> sublinear theory

Nyquist sampling -> SubNyquist sampling

(27)

Summary

I Broken the premise that data has to be

I

captured,

I

stored,

I

communicated, analyzed in entirety.

Polynomial time/space theory -> sublinear theory

Nyquist sampling -> SubNyquist sampling

(28)

What does this got to do with data streams?

Some My-story

I Raghu asked: what can you do with one pass?

I

Dynamic data structures, with fast update times.

I Gibbons and Matias abstract synopsis data structures

I

Can’t simulate a stack!

I Alon, Matias and Szegedy used limited independence.

I

What does frequency moment got to do with databases?

I George Varghese argues high speed memory is a constraint in IP packet analyses.

I

Who needs to analyze IP packet data?

I Observation: 1=" 2 space to give " accuracy. Prohibitive.

(29)

What does this got to do with data streams?

Some My-story

I Raghu asked: what can you do with one pass?

I

Dynamic data structures, with fast update times.

I Gibbons and Matias abstract synopsis data structures

I

Can’t simulate a stack!

I Alon, Matias and Szegedy used limited independence.

I

What does frequency moment got to do with databases?

I George Varghese argues high speed memory is a constraint in IP packet analyses.

I

Who needs to analyze IP packet data?

I Observation: 1=" 2 space to give " accuracy. Prohibitive.

(30)

What does this got to do with data streams?

Some My-story

I Raghu asked: what can you do with one pass?

I

Dynamic data structures, with fast update times.

I Gibbons and Matias abstract synopsis data structures

I

Can’t simulate a stack!

I Alon, Matias and Szegedy used limited independence.

I

What does frequency moment got to do with databases?

I George Varghese argues high speed memory is a constraint in IP packet analyses.

I

Who needs to analyze IP packet data?

I Observation: 1=" 2 space to give " accuracy. Prohibitive.

(31)

What does this got to do with data streams?

Some My-story

I Raghu asked: what can you do with one pass?

I

Dynamic data structures, with fast update times.

I Gibbons and Matias abstract synopsis data structures

I

Can’t simulate a stack!

I Alon, Matias and Szegedy used limited independence.

I

What does frequency moment got to do with databases?

I George Varghese argues high speed memory is a constraint in IP packet analyses.

I

Who needs to analyze IP packet data?

I Observation: 1=" 2 space to give " accuracy. Prohibitive.

(32)

What does this got to do with data streams?

Some My-story

I Raghu asked: what can you do with one pass?

I

Dynamic data structures, with fast update times.

I Gibbons and Matias abstract synopsis data structures

I

Can’t simulate a stack!

I Alon, Matias and Szegedy used limited independence.

I

What does frequency moment got to do with databases?

I George Varghese argues high speed memory is a constraint in IP packet analyses.

I

Who needs to analyze IP packet data?

I Observation: 1=" 2 space to give " accuracy. Prohibitive.

(33)

Some Successful Streaming Systems

Specialized streaming systems:

I Gigascope at AT&T for IP traffic analysis.

I

Two level archietecture.

Uses count-min (A) + count-min(B) = count-min(A + B) .

I CMON at Sprint for IP traffic analysis.

I

Hash and parallelize architecture.

Uses count-min sketch to skip over parts of the stream.

I Sawzall at Google for log data analysis

I

Mapreduce-based.

Uses count-min sketch to decrease communication.

Q: General purpose streaming systems?

(34)

Some Successful Streaming Systems

Specialized streaming systems:

I Gigascope at AT&T for IP traffic analysis.

I

Two level archietecture.

Uses count-min (A) + count-min(B) = count-min(A + B) .

I CMON at Sprint for IP traffic analysis.

I

Hash and parallelize architecture.

Uses count-min sketch to skip over parts of the stream.

I Sawzall at Google for log data analysis

I

Mapreduce-based.

Uses count-min sketch to decrease communication.

Q: General purpose streaming systems?

(35)

Some Successful Streaming Systems

Specialized streaming systems:

I Gigascope at AT&T for IP traffic analysis.

I

Two level archietecture.

Uses count-min (A) + count-min(B) = count-min(A + B) .

I CMON at Sprint for IP traffic analysis.

I

Hash and parallelize architecture.

Uses count-min sketch to skip over parts of the stream.

I Sawzall at Google for log data analysis

I

Mapreduce-based.

Uses count-min sketch to decrease communication.

Q: General purpose streaming systems?

(36)

Some Successful Streaming Systems

Specialized streaming systems:

I Gigascope at AT&T for IP traffic analysis.

I

Two level archietecture.

Uses count-min (A) + count-min(B) = count-min(A + B) .

I CMON at Sprint for IP traffic analysis.

I

Hash and parallelize architecture.

Uses count-min sketch to skip over parts of the stream.

I Sawzall at Google for log data analysis

I

Mapreduce-based.

Uses count-min sketch to decrease communication.

Q: General purpose streaming systems?

(37)

Some Research Directions

(38)

1: Distributed, continual monitoring

S1t S2t Sti Snt

Center C bti

I S i t is the set of items seen by sensor i upto time t.

I S t is the multiset union of S i t ’s.

I Problem:

I

If jS

t

j >  , output 1.

I

if jS

t

j <  " , output 0.

I Say b i t is total number of bits sent b/w i and C

I Minimize P i b i t :

(39)

1: Distributed, continual monitoring

S1t S2t Sti Snt

Center C bti

I S i t is the set of items seen by sensor i upto time t.

I S t is the multiset union of S i t ’s.

I Problem:

I

If jS

t

j >  , output 1.

I

if jS

t

j <  " , output 0.

I Say b i t is total number of bits sent b/w i and C

I Minimize P i b i t :

(40)

1: Distributed, continual monitoring

S1t S2t Sti Snt

Center C bti

I S i t is the set of items seen by sensor i upto time t.

I S t is the multiset union of S i t ’s.

I Problem:

I

If jS

t

j >  , output 1.

I

if jS

t

j <  " , output 0.

I Say b i t is total number of bits sent b/w i and C

I Minimize P i b i t :

(41)

1: Distributed, continual monitoring

I When sensor sees O ( "

2

k  ) elements, sends a bit w.p k 1 to center.

I Center outputs 1 when O(1=" 2 ) bits received.

I O ( " 1

2

log ( 1  )) bits suffice with prob of success 1  .

I Independent of k .

Algorithms for distributed functional monitoring. Cormode,

Muthukrishnan, Yi. SODA 08.

(42)

1: Distributed, continual monitoring

I When sensor sees O ( "

2

k  ) elements, sends a bit w.p k 1 to center.

I Center outputs 1 when O(1=" 2 ) bits received.

I O ( " 1

2

log ( 1  )) bits suffice with prob of success 1  .

I Independent of k .

Algorithms for distributed functional monitoring. Cormode,

Muthukrishnan, Yi. SODA 08.

(43)

1: Distributed, continual monitoring

I When sensor sees O ( "

2

k  ) elements, sends a bit w.p k 1 to center.

I Center outputs 1 when O(1=" 2 ) bits received.

I O ( " 1

2

log ( 1  )) bits suffice with prob of success 1  .

I Independent of k .

Algorithms for distributed functional monitoring. Cormode,

Muthukrishnan, Yi. SODA 08.

(44)

1. Distributed, Continual Monitoring: Summary

I Statistics: Frequency moments, Distinct counts.

I Optimization: Clustering.

I Signal processing:

Compressed sensing.

Need a fuller theory.

Connections to Slepian-Wolf, network coding.

(45)

2. Probabilistic Streams

I Each stream update is a random variable X i , 1  i  n , X i 2 f0; 1g , identically distributed.

I The query is to estimate

Pr[ P i X i  c]:

(46)

2. Probabilistic Streams

I Each stream update is a random variable X i , 1  i  n , X i 2 f0; 1g , identically distributed.

I The query is to estimate

Pr[ P i X i  c]:

(47)

Probabilistic Streams Contd

Berry-Esseen Theorem

Let X 1 ; : : : ; X n be i.i.d. random variables with

I E (X i ) = 0; E(X 2 ) =  2 , and E (jX j 3 ) = :

Let Y n = P i X i =n with

I F n the cdf of Y n

p n=

 the cdf of the std normal dist.

Then there exists a positive C such that for all x and n , jF n (x) (x)j  C 

 3 p

n :

(48)

Probabilistic Streams Contd

I We have P i X i  c implies Y n = P i X i =n  c=n .

I Then Pr ( P i X i  c) = F n (c= p

n ) .

I This can be approximated by (c= p

n ):

I To finish up. Estimate 

and its impact on overall

error. Extend to more

general X i ’s.

(49)

Probabilistic Streams Contd

I We have P i X i  c implies Y n = P i X i =n  c=n .

I Then Pr ( P i X i  c) = F n (c= p

n ) .

I This can be approximated by (c= p

n ):

I To finish up. Estimate 

and its impact on overall

error. Extend to more

general X i ’s.

(50)

Probabilistic Streams Contd

I We have P i X i  c implies Y n = P i X i =n  c=n .

I Then Pr ( P i X i  c) = F n (c= p

n ) .

I This can be approximated by (c= p

n ):

I To finish up. Estimate 

and its impact on overall

error. Extend to more

general X i ’s.

(51)

Probabilistic Streams Contd

I We have P i X i  c implies Y n = P i X i =n  c=n .

I Then Pr ( P i X i  c) = F n (c= p

n ) .

I This can be approximated by (c= p

n ):

I To finish up. Estimate 

and its impact on overall

error. Extend to more

general X i ’s.

(52)

3. Stochastic Streams

I Input is a stochastic stream X 1 ; : : : ; X n , each X i is drawn from known distribution D . n is known.

I Problem: Stop at input t and output X t .

I Goal: maximize X t . Formally,

max E (X t )

E (OPT) = E(max i X i )

I Observe:

I

Can a priori look at the dist of max

i

X

i

.

I

Not the same as finding max

i

X

i

:

(53)

3. Stochastic Streams

I Input is a stochastic stream X 1 ; : : : ; X n , each X i is drawn from known distribution D . n is known.

I Problem: Stop at input t and output X t .

I Goal: maximize X t . Formally,

max E (X t )

E (OPT) = E(max i X i )

I Observe:

I

Can a priori look at the dist of max

i

X

i

.

I

Not the same as finding max

i

X

i

:

(54)

3. Stochastic Streams

I Input is a stochastic stream X 1 ; : : : ; X n , each X i is drawn from known distribution D . n is known.

I Problem: Stop at input t and output X t .

I Goal: maximize X t . Formally,

max E (X t )

E (OPT) = E(max i X i )

I Observe:

I

Can a priori look at the dist of max

i

X

i

.

I

Not the same as finding max

i

X

i

:

(55)

3. Stochastic Streams

I Input is a stochastic stream X 1 ; : : : ; X n , each X i is drawn from known distribution D . n is known.

I Problem: Stop at input t and output X t .

I Goal: maximize X t . Formally,

max E (X t )

E (OPT) = E(max i X i )

I Observe:

I

Can a priori look at the dist of max

i

X

i

.

I

Not the same as finding max

i

X

i

:

(56)

3. Stochastic Streams Contd

I Algorithm:

I

X



= max

i

X

i

.

I

m : median of X



, ie., Pr (X



< m)  1=2 .

I

 is the smallest t such that X

t

> m .

 is the answer.

I Algorithm finds t such that E (X t )=E(OPT)  1=2. Prophet inequality.

Many basic problems on stochastic streams still open.

(57)

3. Stochastic Streams Contd

I Algorithm:

I

X



= max

i

X

i

.

I

m : median of X



, ie., Pr (X



< m)  1=2 .

I

 is the smallest t such that X

t

> m .

 is the answer.

I Algorithm finds t such that E (X t )=E(OPT)  1=2.

Prophet inequality.

Many basic problems on stochastic streams still open.

(58)

Conclusions

I Talk summary:

I

Indexing problem.

I

count-min sketch and applications.

I

classical streaming.

I New directions:

I

Distributed, continual.

I

Probabilistic.

I

Stochastic.

I Comments:

I

Need convincing systems and applications to motivate new directions.

I

Left out: window streams, rich queries and data,

MapReduce

(59)

Conclusions

I Talk summary:

I

Indexing problem.

I

count-min sketch and applications.

I

classical streaming.

I New directions:

I

Distributed, continual.

I

Probabilistic.

I

Stochastic.

I Comments:

I

Need convincing systems and applications to motivate new directions.

I

Left out: window streams, rich queries and data,

MapReduce

(60)

Conclusions

I Talk summary:

I

Indexing problem.

I

count-min sketch and applications.

I

classical streaming.

I New directions:

I

Distributed, continual.

I

Probabilistic.

I

Stochastic.

I Comments:

I

Need convincing systems and applications to motivate new directions.

I

Left out: window streams, rich queries and data,

MapReduce

References

Related documents

The $19.9 million grant went to a group of 13 colleges including all three of the schools in Kern Community College District: Bakersfield College, Cerro Coso College and Porterville

Facilitated by modern instruments and by close working relationships with faculty, students receive expert theoretical and practical instruction in all fundamental areas of

is not surprising that White quickly exploited his advantage.. Meanwhile White's pieces are ideally placed, whereas the black knight at QR1 is a very bad piece.

The CAA 507 Programs consist of three parts: a Small Business Ombudsman (SBO) to act as an advocate for small business, a Small Business Environmental Assistance Program (SBEAP)

A The resu Inpatien the Bene Regulati viewed a St D 605 W ebruary 14 acey, CFO Hospital of toria Street Mesa, CA 9 GE HOSPIT NAL PROVI PERIOD E e examined Our examin titutions Co

Sau ®ã thùc hiÖn lÖnh hiÓn thÞ néi dung local macros nµy, nhng macros nµy kh«ng tån t¹i ë ®o¹n ch¬ng trinh kh¸c hay ë bé nhí cña Stata. end

Internal Client Departments Internal Supply Management External Community Based Organizations &amp; Chambers External DBE Suppliers.. 