gvaliantTCS.pptx

(1)

Estimating The Unseen

Optimal estimators for entropy, support size, and other properties

Gregory Valiant

(joint work with Paul Valiant)

(2)

Estimating the Unseen

Given independent samples from a distribution (of discrete support):

Empirical distribution optimally approximates the seen portion of the distribution

• What can we infer about the unseen portion?

• How can inferences about the unseen portion yield better estimates of distribution properties?

(3)


Some concrete problems

Q1: Given a length n vector, how many indices must we look at to estimate # distinct elements, to +/- εn (w.h.p.)? [distinct elements problem]

Q2: Given samples drawn from D supported on {1,…,n}, how many samples are required to estimate entropy(D) to within +/- ε (w.h.p.)?

Q3: Given samples from D1 and D2 supported on {1,2,…,n}, how many samples are required to estimate Dist(D1,D2) to within +/- ε (w.h.p.)?

…

a c a b c

Trivial: O(n log n)

Distinct Elements: O(n) [Bar Yossef et al. ’01] [P. Valiant ’08] [Raskhodnikova et al. ’09]

Entropy: O(n) [Batu et al. ’01, ’02] [Paninski ’03, ’04] [Dasgupta et al. ’05]

Distance: O(n) [Goldreich et al. ’96] [Batu et al. ’00, ’01]

(4)

Fisher’s Butterflies: How many new species if I observe for another period?

f₁ − f₂ + f₃ − f₄ + f₅ − …

Turing’s Enigma Codewords: Probability mass of unseen codewords

f₁ / (number of samples)
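Both classical estimators on this slide are simple functions of the counts f_i (the number of elements seen exactly i times). A minimal sketch, assuming the counts are passed as a dict mapping i to f_i; the function names are illustrative and not from the talk:

```python
def fisher_new_species(f):
    """Alternating sum f1 - f2 + f3 - ... (f maps i -> f_i)."""
    return sum(f_i if i % 2 == 1 else -f_i for i, f_i in f.items())


def good_turing_unseen_mass(f):
    """f1 / (number of samples); assumes at least one sample was drawn."""
    k = sum(i * f_i for i, f_i in f.items())  # total number of samples
    return f.get(1, 0) / k


# Sample a,b,a,c,a,b,d: a seen 3 times, b twice, c and d once each.
f = {1: 2, 2: 1, 3: 1}
print(fisher_new_species(f))       # 2 - 1 + 1
print(good_turing_unseen_mass(f))  # 2 / 7
```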

(5)

Symmetric Properties

Let D be the set of distributions over {1,2,…,n}

Property π: D → R

π is symmetric if it is invariant to relabeling the support:

for every permutation σ, π(D) = π(D_σ)

e.g. entropy, support size, distance to uniformity, etc.

Analogous for properties of pairs of distributions:

π is symmetric if it is invariant to relabeling the support:

for every permutation σ, π(D1,D2) = π(D1_σ, D2_σ)

(6)

Symmetric Properties

`Histogram’ of a distribution:

Given distribution D, h_D: (0,1] → {0,1,2,…}

h_D(x) := # domain elements of D that occur with probability x

e.g. Unif[n] has h(1/n)=n, and h(x)=0 for all x≠1/n

Fact: any “symmetric” property is a function of only h

e.g. H(D) = −Σ_{x: h(x)≠0} h(x) x log x

Support(D) = Σ_{x: h(x)≠0} h(x)
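To make the "function of only h" fact concrete, here is a small sketch that builds the histogram of a fully known distribution and evaluates entropy and support size from it alone. The helper names are illustrative, and natural logarithms are an arbitrary choice here:

```python
import math
from collections import Counter


def histogram(probs):
    """h(x) = number of domain elements whose probability is exactly x (x > 0)."""
    return Counter(p for p in probs if p > 0)


def entropy_from_histogram(h):
    """H(D) = -sum over x of h(x) * x * log(x)."""
    return -sum(count * x * math.log(x) for x, count in h.items())


def support_from_histogram(h):
    """Support(D) = sum over x of h(x)."""
    return sum(h.values())


h = histogram([0.25, 0.25, 0.25, 0.25])  # Unif[4]: h(1/4) = 4
print(entropy_from_histogram(h), support_from_histogram(h))
```

Relabeling the domain permutes the probability vector but leaves the histogram unchanged, so any quantity computed this way is symmetric by construction.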

‘Fingerprint’ of set of samples [aka profile, collision stats]

f = f₁, f₂, …, f_k

f_i := # elements seen exactly i times in the sample
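The fingerprint is a "count of counts". A minimal sketch, with a hypothetical helper name, that computes it from a raw sample:

```python
from collections import Counter


def fingerprint(sample):
    """Return a dict mapping i -> f_i, the number of distinct elements seen exactly i times."""
    occurrences = Counter(sample)               # element -> times seen
    return dict(Counter(occurrences.values()))  # times seen -> number of such elements


# In the sample a, c, a, b, c: a and c are seen twice, b once.
print(fingerprint(["a", "c", "a", "b", "c"]))
```

The element labels are discarded, which is exactly why the fingerprint carries all the information in the sample that is relevant to a symmetric property.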

(7)

Reasoning Beyond the Empirical Distribution

Fingerprint based on k samples

Fingerprint based on 10000

(8)

Linear Programming

“Find distributions whose expected fingerprint is close to the observed fingerprint of the samples”
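One way to write the quoted program down, as a rough sketch and not necessarily the exact program from the talk: assume the sample size is Poisson(k) distributed, fix a finite grid X ⊂ (0,1] of candidate probabilities (the grid is an assumption made here), and let variables h'(x) ≥ 0 play the role of the histogram. The expected number of elements seen exactly i times is then linear in h':

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{i \ge 1} \Bigl|\, f_i - \sum_{x \in X} h'(x)\,\mathrm{poi}(kx, i) \Bigr| \\
\text{subject to}\quad & \sum_{x \in X} x\, h'(x) = 1, \qquad h'(x) \ge 0 \ \text{ for all } x \in X, \\
\text{where}\quad & \mathrm{poi}(\lambda, i) = e^{-\lambda} \lambda^{i} / i!.
\end{aligned}
```

Each absolute value becomes linear after introducing one slack variable per i, and allowing h'(x) to take non-negative real values is what keeps this a linear program.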

Feasible Region

Must show diameter of feasible region is small!!

Entropy

Distinct Elements

Other Property

Θ(n / log n) samples, and OPTIMAL

(9)

Linear Programming

“Find distributions whose expected fingerprint is close to the observed fingerprint of the sample”

histogram

Thm: For sufficiently large n, and any constant c > 1, given c n / log n independent draws from D, of support at most n, with prob > 1 − exp(−Ω(n)), our algorithm returns a histogram h’ s.t.

R(h_D, h’) < O(1/c^(1/2))

Additionally, our algorithm runs in time linear in the number of samples.

R(h,h’): Relative Wass. Metric: [sup over functions f, |f’(x)| < 1/x, …]

Corollary: For any ε > 0, given O(n/(ε² log n)) draws from a distribution D of support at most n, with prob > 1 − exp(−Ω(n)) our algorithm returns v s.t.

(10)

Lower Bounds

Discrete Distributions

Distr. of Fingerprints

(11)

Lower Bounds

Thm: There exists a constant c s.t. for any sufficiently large n there exist D1, D2 s.t.

• D1, D2 are distributions over {1,…,n}

• H(D1) − H(D2) > .1

• No algorithm can distinguish a fingerprint derived from a set of c n / log n samples drawn from D1 versus that drawn from D2 with any probability greater than 2/3.

(12)

Prior Lower Bounds

(for distinct elements, entropy)

Main obstacle: characterizing distribution of fingerprint

Thm: CLT for “generalized multinomial” dist., in L1 metric.

Thm: CLT for sums of multivariate ind. r.v., in Wasserstein [Earthmover] norm.

Via convolution/deconvolution tricks

(13)

A Multivariate Earthmover CLT

(14)

Stein’s Method

Goal: Σ_i Z_i ≈ G (multivariate Gaussian)

What notion of “≈”?

(Bonus: simulate doing this, changing only h)

Generally:

(15)

Making Anything Gaussian

Two steps: add random noise; rescale towards the origin. (Ornstein-Uhlenbeck process)

Crucial observation: Ornstein-Uhlenbeck will not change the mean of a distribution of mean 0, nor the covariance of a distribution with covariance I
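The invariance can be checked directly. The slide gives no formula, so the following uses one standard parameterization of the Ornstein-Uhlenbeck step as an assumption: start from X_0 with mean 0 and covariance I, and let G be an independent standard Gaussian.

```latex
\begin{aligned}
X_t &= e^{-t} X_0 + \sqrt{1 - e^{-2t}}\; G, \qquad G \sim N(0, I) \text{ independent of } X_0, \\
\mathbb{E}[X_t] &= e^{-t}\, \mathbb{E}[X_0] + \sqrt{1 - e^{-2t}}\; \mathbb{E}[G] = 0, \\
\operatorname{Cov}(X_t) &= e^{-2t} \operatorname{Cov}(X_0) + (1 - e^{-2t})\, I = e^{-2t} I + (1 - e^{-2t})\, I = I.
\end{aligned}
```

The rescaling factor e^{-t} is the "rescale towards the origin" step and the Gaussian term is the "add random noise" step; together they leave the first two moments untouched.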

To first order: Ø. To second order: Ø. To third order???

Thus 3rd-order/3rd-moment dependence of Berry-Esseen

(16)

Goal: CLT for “generalized multinomial” dist., in L1 metric.

Idea:

Dist_Earth( , ) = small

Dist_L1( , ) = large

Dist_L1( , ) = small

Thm: CLT for sums of multivariate ind. r.v., in Wasserstein [Earthmover] norm.

(17)

Generalized Multinomial Distributions

Generalizes binomial, multinomial distributions, and any sums of such distributions.

Parameterized by matrix:

  0   1
 .3  .7

(18)


Generalized Multinomial Distributions

Generalizes binomial, multinomial distributions, and any sums of such distributions.

Parameterized by matrix:

  0   1   2
 .5  .3  .2
 .8  .1  .1
 .3  .3  .4
 .2  .8   0
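A minimal sampling sketch for such a matrix, under the assumption (suggested by the slide's examples) that every row sums to 1 and each row independently picks one column with the listed probabilities; the output is the vector of column counts. The function name and the `rng` parameter are illustrative:

```python
import random


def sample_generalized_multinomial(matrix, rng=random):
    """Each row independently picks one column; return how often each column was picked."""
    totals = [0] * len(matrix[0])
    for row in matrix:
        # random.choices draws one index with probability proportional to the row weights.
        (col,) = rng.choices(range(len(row)), weights=row)
        totals[col] += 1
    return totals


matrix = [
    [0.5, 0.3, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.8, 0.0],
]
print(sample_generalized_multinomial(matrix))  # random; the entries always sum to 4
```

With a single row this is one multinomial trial, and with identical rows it is an ordinary multinomial, which matches the "generalizes binomial, multinomial" claim.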

(19)

CLT for Gen. Multinomial Distributions

Earthmover CLT

Convolution/Deconvolution

(20)

Back to Property Estimation…

Given a set of Poisson(k) samples drawn from D, the distribution of the fingerprint is a “generalized multinomial” distribution!!!!

Rows: Domain elt #1, Domain elt #2, Domain elt #3, …

Columns: Not Seen, Seen once, Seen twice, …

 .8  .1  .1
 .5  .3  .2
 .1  .2  .4
 .2  .4  .3
 …

Column sums: f₁ f₂ f₃ … (fingerprint!)
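The matrix entries can be computed explicitly when D is known: with a Poisson(k) number of samples, the number of times a domain element of probability p appears is Poisson(k·p), independently across elements. A sketch under that Poissonization assumption, with illustrative function names and a hypothetical truncation parameter `max_count`:

```python
import math


def poisson_pmf(lam, i):
    """P[Poisson(lam) = i]."""
    return math.exp(-lam) * lam ** i / math.factorial(i)


def fingerprint_matrix(probs, k, max_count):
    """Row per domain element; column i is P(element is seen exactly i times), i = 0..max_count."""
    return [[poisson_pmf(k * p, i) for i in range(max_count + 1)] for p in probs]


def expected_fingerprint(probs, k, max_count):
    """Column sums of the matrix, skipping column 0 ("Not Seen"): E[f_1], ..., E[f_max_count]."""
    matrix = fingerprint_matrix(probs, k, max_count)
    return [sum(row[i] for row in matrix) for i in range(1, max_count + 1)]


# Uniform distribution on 4 elements, Poisson(4) samples: each count is Poisson(1).
print(expected_fingerprint([0.25] * 4, 4, 3))
```

Truncating at `max_count` drops the tail columns, so each row sums to slightly less than 1.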

(21)

Characterizing Fingerprints

CLT: Fingerprint expectations + covariance similar

Dist. of fingerprints close

Fingerprint expectations similar

Dist. of fingerprints close

(22)

(23)

So…do the estimators work in practice?

(24)

Performance in Practice (entropy)

(25)

(26)

Performance in Practice (support size)

(27)

The Big Picture [Estimating Symmetric Properties]

100+ years of statistics

“Linear estimators”

c₁·f₁ + c₂·f₂ + c₃·f₃ + …



Our Algorithm

Linear Programming

Substantial improvements!

"what richness of algorithmic

(28)

The Empirical Estimate

1/k  2/k  3/k  4/k  5/k  6/k  7/k  8/k  9/k  10/k  11/k  12/k  13/k  14/k  15/k

log(1/k)  log(2/k)  log(3/k)  log(4/k)  log(5/k)  log(6/k)  log(7/k)  log(8/k)  log(9/k)  log(10/k)  log(11/k)  log(12/k)  log(13/k)  log(14/k)  log(15/k)

Better estimates? Apply something other than log to the empirical distribution

z(1/k)  z(2/k)  z(3/k)  z(4/k)  z(5/k)  z(6/k)  z(7/k)  z(8/k)  z(9/k)
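A small sketch of what "applying z to the empirical distribution" can look like in code. The exact way z enters the estimate is an assumption here: each element seen i times contributes (i/k)·z(i/k), so the empirical (plug-in) entropy corresponds to z(x) = −log x. The fingerprint is passed as a dict mapping i to f_i, and the function names are illustrative:

```python
import math


def linear_estimate(f, k, z):
    """Sum over i of f_i * (i/k) * z(i/k); linear in the fingerprint entries f_i."""
    return sum(f_i * (i / k) * z(i / k) for i, f_i in f.items())


def empirical_entropy(f, k):
    """Plug-in entropy estimate (natural log), i.e. z(x) = -log(x)."""
    return linear_estimate(f, k, lambda x: -math.log(x))


# Sample a, a, b, b: k = 4 and two elements seen twice, so f_2 = 2.
print(empirical_entropy({2: 2}, 4))  # equals log 2
```

Swapping in a different z changes only the per-count coefficients, so the search for a better estimator is a search over a short vector of numbers.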

(29)

Searching for Better Estimators

Finding the Best Estimator

“Expectation of estimator z applied to k samples from p is within ε of H(p)”

s.t. for all distributions p over [n]

Bias

Variance
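The bias condition can be written out under a Poissonization assumption (sample size Poisson(k)), using generic coefficients c_i for a linear estimator Σ_i c_i f_i. This is a sketch of the shape of the constraint, not necessarily the exact program used in the talk:

```latex
\begin{aligned}
\mathbb{E}\Bigl[\sum_{i \ge 1} c_i f_i\Bigr] &= \sum_{j=1}^{n} \sum_{i \ge 1} c_i\, \mathrm{poi}(k p_j, i),
\qquad \mathrm{poi}(\lambda, i) = e^{-\lambda} \lambda^{i} / i!, \\
\text{minimize } \varepsilon \quad \text{s.t.} \quad &
\Bigl|\, \sum_{j=1}^{n} \sum_{i \ge 1} c_i\, \mathrm{poi}(k p_j, i) - H(p) \Bigr| \le \varepsilon
\quad \text{for all distributions } p \text{ over } [n].
\end{aligned}
```

For each fixed p the constraint is linear in the coefficients c_i, which is what makes a linear programming view natural. It controls only the bias; the variance, listed separately on the slide, has to be handled separately.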

(30)

Surprising Theorem

Thm: Given parameters n, k, ε, and a `linear’ property π

Either

OR

(31)

Proof Idea: Duality!!

“Find estimator z: Minimize ε, s.t. expectation of estimator z applied to k samples from p is within ε of H(p)”

s.t. for all dists. p over [n]

“Find lower bound instance y⁺, y⁻: Maximize H(y⁺) − H(y⁻) s.t. expected fingerprint entries given k samples from y⁺, y⁻ match to within k^(1−c)”

for y⁺, y⁻ dists. over [n]

s.t. for all i, E[f+

(32)

Further Directions

Resolve dependency on ε

Beyond additive estimates – “case-by-case optimal”?

We suspect linear programming is better than linear estimators

Non-symmetric properties?

Leveraging structural information to yield improved sample complexity

Practical applications!

Applications of Central Limit Theorems?

for entropy, Θ(n/(ε log n))