Estimating The Unseen
Optimal estimators for entropy, support size, and other properties
Gregory Valiant
(joint work with Paul Valiant)
Estimating the Unseen
Given independent samples from a distribution (of discrete support):
Empirical distribution optimally approximates seen portion of distribution
• What can we infer about the unseen portion?
• How can inferences about the unseen portion yield better estimates of distribution properties?
Some concrete problems
Q1: Given a length n vector, how many indices must we look at to estimate # distinct elements, to +/- εn (w.h.p)? [distinct elements problem]
Q2: Given samples drawn from D supported on {1,…,n}, how many samples are required to estimate entropy(D) to within +/- ε (w.h.p)?
Q3: Given samples from D1 and D2 supported on {1,2,…,n}, how many samples are required to estimate Dist(D1,D2) to within +/- ε (w.h.p)?
…
a c a b c
Trivial: O(n log n)
Distinct Elements: O(n) [Bar Yossef et al. ’01] [P. Valiant ’08] [Raskhodnikova et al. ’09]
Entropy: O(n) [Batu et al. ’01, ’02] [Paninski ’03, ’04] [Dasgupta et al. ’05]
Distance: O(n) [Goldreich et al. ’96] [Batu et al. ’00, ’01]
Fisher’s Butterflies
How many new species if I observe for another period?
f1 - f2 + f3 - f4 + f5 - …
Turing’s Enigma Codewords
Probability mass of unseen codewords
f1 / (number of samples)
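Both classical estimates depend on the sample only through the counts fi (the number of elements seen exactly i times). The following is a minimal sketch of the two formulas; the function names and the toy sample are hypothetical, chosen only for illustration.

```python
from collections import Counter

def fingerprint(samples):
    """Return {i: fi}, where fi = number of distinct elements seen exactly i times."""
    counts = Counter(samples)                # element -> number of occurrences
    return dict(Counter(counts.values()))    # occurrence count -> number of elements

def fisher_new_species(f):
    """Alternating sum f1 - f2 + f3 - ...: estimated number of new species
    if we observe for another period of the same length."""
    return sum((-1) ** (i + 1) * fi for i, fi in f.items())

def unseen_mass(f, num_samples):
    """f1 / (number of samples): estimated probability mass of unseen elements."""
    return f.get(1, 0) / num_samples

samples = list("abcdeab")    # a, b seen twice; c, d, e seen once
f = fingerprint(samples)
print(f, fisher_new_species(f), unseen_mass(f, len(samples)))
```

Here f1 = 3 and f2 = 2, so the alternating sum is 3 - 2 and the unseen-mass estimate is 3 divided by the 7 samples.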
Symmetric Properties
Let D be the set of distributions over {1,2,…,n}. Property π: D → ℝ
π is symmetric if invariant to relabeling support: for all permutations σ, π(D) = π(D∘σ)
e.g. entropy, support size, distance to uniformity, etc.
Analogous for properties of pairs of distributions:
π is symmetric if invariant to relabeling support: for all permutations σ, π(D1, D2) = π(D1∘σ, D2∘σ)
Symmetric Properties
‘Histogram’ of a distribution:
Given distribution D, h_D: (0,1] → {0,1,2,…}, h(x) := # domain elements of D that occur w. prob x
e.g. Unif[n] has h(1/n)=n, and h(x)=0 for all x≠1/n
Fact: any “symmetric” property is a function of only h
e.g. H(D) = -Σ_{x: h(x)≠0} h(x) x log x
Support(D) = Σ_{x: h(x)≠0} h(x)
‘Fingerprint’ of set of samples [aka profile, collision stats]
f = (f1, f2, …, fk), fi := # elements seen exactly i times in the sample
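The definitions above translate almost line for line into code. This is a small sketch, assuming a distribution is given as a dictionary from elements to probabilities and using natural logarithms; all names are hypothetical.

```python
import math
from collections import Counter

def histogram(dist):
    """dist: {element: probability}. Returns {x: h(x)}, the number of
    domain elements that occur with probability exactly x (x > 0)."""
    return dict(Counter(p for p in dist.values() if p > 0))

def entropy_from_histogram(h):
    """H(D) = -sum over x with h(x) != 0 of h(x) * x * log x."""
    return -sum(hx * x * math.log(x) for x, hx in h.items())

def support_from_histogram(h):
    """Support(D) = sum over x with h(x) != 0 of h(x)."""
    return sum(h.values())

def fingerprint(samples):
    """{i: fi}, where fi = number of elements seen exactly i times."""
    return dict(Counter(Counter(samples).values()))

uniform4 = {e: 0.25 for e in "abcd"}     # Unif[4]: h(1/4) = 4
h = histogram(uniform4)
print(h, entropy_from_histogram(h), support_from_histogram(h))
print(fingerprint(list("acabc")))        # a, c seen twice; b seen once
```

Neither function looks at the element labels, which is the sense in which symmetric properties depend only on h.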
Reasoning Beyond the Empirical Distribution
Fingerprint based on k samples
Fingerprint based on 10000
Linear Programming
“Find distributions whose expected fingerprint is close to the observed fingerprint of the samples”
Feasible Region
Must show diameter of feasible region is small!!
[Figure labels: Entropy, Distinct Elements, Other Property]
Θ(n / log n) samples, and OPTIMAL
Linear Programming
“Find distributions whose expected fingerprint is close to the observed fingerprint of the sample”
histogram
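To make "expected fingerprint is close to the observed fingerprint" concrete, here is a rough sketch of the quantity such a program would compare. It assumes Poissonized sampling, so that an element of probability x is seen exactly i times with probability poi(kx, i), and it measures closeness by a plain L1 distance; both choices, and all names, are assumptions for illustration rather than the talk's exact formulation.

```python
import math

def poi(lam, i):
    """Poisson probability of seeing the value i when the mean is lam."""
    return math.exp(-lam) * lam ** i / math.factorial(i)

def expected_fingerprint(h, k, max_i):
    """h: {probability x: number of elements with probability x}.
    Returns [E f1, ..., E f_max_i]; each entry is linear in the values h(x)."""
    return [sum(hx * poi(k * x, i) for x, hx in h.items())
            for i in range(1, max_i + 1)]

def fingerprint_discrepancy(h, k, observed):
    """L1 distance between the expected fingerprint and observed = [f1, f2, ...]."""
    expected = expected_fingerprint(h, k, len(observed))
    return sum(abs(e - o) for e, o in zip(expected, observed))

k = 50
observed = [30, 8, 1]            # hypothetical observed fingerprint
uniform_100 = {0.01: 100}        # candidate: uniform over 100 elements
uniform_1000 = {0.001: 1000}     # candidate: uniform over 1000 elements
print(fingerprint_discrepancy(uniform_100, k, observed))
print(fingerprint_discrepancy(uniform_1000, k, observed))
```

For this made-up fingerprint the first candidate fits far better than the second. Because each expected entry is linear in h, searching over histograms with this kind of objective can be posed as a linear program.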
Thm: For sufficiently large n, and any constant c>1, given c n / log n ind. draws from D, of support at most n, with prob > 1-exp(-Ω(n)), our alg returns histogram h’ s.t. R(h_D, h’) < O(1/c^(1/2))
Additionally, our algorithm runs in time linear in the number of samples.
R(h,h’): Relative Wass. Metric: [sup over functions f, |f’(x)|<1/x, …]
Corollary: For any ε > 0, given O(n/(ε^2 log n)) draws from a distribution D of support at most n, with prob > 1-exp(-Ω(n)) our algorithm returns v s.t.
Lower Bounds
Discrete Distributions
Distr. of Fingerprints
Lower Bounds
Thm: There exists a constant c s.t. for any sufficiently large n there exist D1, D2 s.t.
• D1, D2 are distributions over {1,…,n}
• H(D1)-H(D2) > .1
• No algorithm can distinguish a fingerprint derived from a set of c n / log n samples drawn from D1 versus one drawn from D2 with any probability greater than 2/3.
Prior Lower Bounds
(for distinct elements, entropy)
Main obstacle: characterizing distribution of fingerprint
Thm: CLT for “generalized multinomial” dist., in L1 metric.
Thm: CLT for sums of multivariate ind. r.v., in Wasserstein [Earthmover] norm.
Via convolution/deconvolution tricks
A Multivariate Earthmover CLT
Stein’s Method
Goal: Σ_i Z_i ≈ G (multivariate Gaussian)
What notion of “≈”?
(Bonus: simulate doing this, changing only h)
Generally:
Making Anything Gaussian
Two steps: add random noise; rescale towards the origin. (Ornstein-Uhlenbeck process)
Crucial observation: Ornstein-Uhlenbeck will not change the mean of a distribution of mean 0, nor the covariance of a distribution with covariance I
To first order: Ø. To second order: Ø. To third order???
Thus 3rd-order/3rd-moment dependence of Berry-Esseen
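A one-dimensional moment calculation gives one way to see the crucial observation. The sketch below assumes a common parametrization of a single Ornstein-Uhlenbeck step, Y = a X + b G with a = exp(-t), b = sqrt(1 - a^2), and G a standard Gaussian independent of X. Reading "order" as "moment" is an interpretation for illustration, not the talk's proof.

```python
import math

def ou_step_moments(mean, var, third, t):
    """Moments of Y = a*X + b*G, where a = exp(-t), b = sqrt(1 - a^2) and
    G ~ N(0, 1) is independent of X. `third` is the third central moment of X.
    The Gaussian contributes variance b^2 and no third central moment."""
    a = math.exp(-t)
    b_squared = 1.0 - a * a
    new_mean = a * mean                   # rescaling shrinks the mean
    new_var = a * a * var + b_squared     # rescale, then add noise
    new_third = a ** 3 * third            # only the rescaling acts here
    return new_mean, new_var, new_third

# Mean 0 and variance 1 are left unchanged; the third moment is not.
print(ou_step_moments(0.0, 1.0, 2.0, 0.5))
```

Under these assumptions the first two moments are fixed points of the step, so the first place the step changes anything is the third moment, which is consistent with the third-moment dependence noted on the slide.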
Goal: CLT for “generalized multinomial” dist., in L1 metric.
Idea:
Dist_Earth( , ) = small
Dist_L1( , ) = large
Dist_L1( , ) = small
Thm: CLT for sums of multivariate ind. r.v., in Wasserstein [Earthmover] norm.
Generalized Multinomial Distributions
Generalizes binomial, multinomial distributions, and any sums of such distributions.
Parameterized by matrix:
 0   1
.3  .7
.3  .7
.3  .7
.3  .7

 0   1   2
.5  .3  .2
.5  .3  .2
.5  .3  .2
.5  .3  .2
Generalized Multinomial Distributions
Generalizes binomial, multinomial distributions, and any sums of such distributions.
Parameterized by matrix:
1 2
1
0 1 2
.5 .3 .2
.8 .1 .1
.3 .3 .4
.2 .8 0
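One reading of these matrices, consistent with the binomial and multinomial examples, is that each row is an independent trial that picks one column with the row's probabilities, and the outcome is the vector of column counts. The sketch below samples under that assumed reading and computes the exact mean vector; the function names are hypothetical.

```python
import random

def sample_generalized_multinomial(matrix, rng):
    """Each row independently picks one column with that row's probabilities;
    the outcome is the vector counting how many rows picked each column."""
    counts = [0] * len(matrix[0])
    for row in matrix:
        col = rng.choices(range(len(row)), weights=row, k=1)[0]
        counts[col] += 1
    return counts

def mean_vector(matrix):
    """Expected count in each column: the column sums of the matrix."""
    return [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]

binomial_like = [[.3, .7]] * 4            # identical rows: column 1 is Binomial(4, .7)
general = [[.5, .3, .2],
           [.8, .1, .1],
           [.3, .3, .4],
           [.2, .8, 0]]                   # rows may differ

rng = random.Random(0)                    # fixed seed, for reproducibility only
print(sample_generalized_multinomial(general, rng))
print(mean_vector(binomial_like), mean_vector(general))
```

With identical rows the counts follow a binomial or multinomial distribution; letting the rows differ gives the more general family.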
CLT for Gen. Multinomial Distributions
Earthmover CLT
Convolution/Deconvolution
Back to Property Estimation…
Given a set of Poisson(k) samples drawn from D, the distribution of the fingerprint is a “generalized multinomial” distribution!!!!
                 Not Seen   Seen once   Seen twice   …
Domain elt #1       .8         .1          .1
Domain elt #2       .5         .3          .2
Domain elt #3       .1         .2          .4
…                   .2         .4          .3
Column sums:                   f1          f2          f3 …   (fingerprint!)
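The table can be simulated directly: with Poisson(k) samples, each domain element appears a Poisson(k p) number of times independently of the others, and the fingerprint counts how many elements landed in each "seen i times" column. A minimal sketch follows, using Knuth's multiplication method for Poisson draws (adequate only for small means); the names and the toy distribution are hypothetical.

```python
import math
import random
from collections import Counter

def poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's multiplication method (small lam only)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def poissonized_fingerprint(probs, k, rng):
    """Element j is seen Poisson(k * probs[j]) times, independently of the rest.
    Returns ({i: fi}, occurrences), with fi = # elements seen exactly i times."""
    occurrences = [poisson(k * p, rng) for p in probs]
    f = dict(Counter(c for c in occurrences if c > 0))
    return f, occurrences

rng = random.Random(0)              # fixed seed, for reproducibility only
probs = [0.5, 0.3, 0.2]             # toy distribution D
f, occurrences = poissonized_fingerprint(probs, 10, rng)
print(occurrences, f)
```

Each element plays the role of one row of the matrix, choosing its column independently, which is what makes the fingerprint a sum of independent row outcomes.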
Characterizing Fingerprints
CLT: Fingerprint expectations + covariance similar ⇒ Dist. of fingerprints close
Fingerprint expectations similar ⇒ Dist. of fingerprints close
So…do the estimators work in practice?
Performance in Practice (entropy)
Performance in Practice (support size)
The Big Picture [Estimating Symmetric Properties]
100+ years of statistics: “Linear estimators” c1 f1 + c2 f2 + c3 f3 + …
Our Algorithm: Linear Programming
Substantial improvements!
"what richness of algorithmic
The Empirical Estimate
[Figure: log applied to the empirical probabilities 1/k, 2/k, …, 15/k, giving log(1/k), log(2/k), …, log(15/k)]
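The empirical (plug-in) entropy estimate is itself a linear estimator of the fingerprint: an element seen i times out of k contributes -(i/k) log(i/k), so the coefficient on fi is that same value. The sketch below checks this identity on a toy sample; the names are hypothetical and natural logarithms are assumed.

```python
import math
from collections import Counter

def fingerprint(samples):
    """{i: fi}, where fi = number of elements seen exactly i times."""
    return dict(Counter(Counter(samples).values()))

def empirical_entropy(samples):
    """Entropy of the empirical distribution of the sample."""
    k = len(samples)
    return -sum((c / k) * math.log(c / k) for c in Counter(samples).values())

def linear_estimate(coeff, f):
    """Linear estimator: sum over i of coeff(i) * fi."""
    return sum(coeff(i) * fi for i, fi in f.items())

samples = list("acabc")
k = len(samples)
plug_in_coeff = lambda i: -(i / k) * math.log(i / k)

print(empirical_entropy(samples))
print(linear_estimate(plug_in_coeff, fingerprint(samples)))
```

The two printed values agree up to floating-point rounding; other coefficient choices give other linear estimators.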
Better estimates?
Apply something other than log to the empirical distribution
z(1/k), z(2/k), z(3/k), …, z(9/k)
s.t. for all distributions p over [n]: “Expectation of estimator z applied to k samples from p is within ε of H(p)”
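For a fixed distribution p, the expectation in this constraint can be computed exactly, since each element's count in k samples is Binomial(k, p_j) and expectation is linear. The sketch below does this for the plug-in coefficients; here z is indexed by the count i (standing in for z(i/k)), unseen elements are assumed to contribute nothing, and all names are hypothetical.

```python
import math

def expected_estimate(z, p, k):
    """Exact E[ sum_j z(N_j) ], where N_j ~ Binomial(k, p_j) is the number of
    times element j appears in k samples, and z(0) is taken to be 0."""
    total = 0.0
    for pj in p:
        for i in range(1, k + 1):
            total += math.comb(k, i) * pj ** i * (1 - pj) ** (k - i) * z(i)
    return total

def entropy(p):
    """H(p) with natural logarithms."""
    return -sum(x * math.log(x) for x in p if x > 0)

k = 20
p = [0.1] * 10                                  # uniform over 10 elements
plug_in = lambda i: -(i / k) * math.log(i / k)  # the empirical estimate's z
bias = expected_estimate(plug_in, p, k) - entropy(p)
print(bias)
```

The plug-in bias here is negative (the empirical estimate underestimates entropy on average), which is the kind of error a better choice of z aims to reduce across all p.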
Searching for Better Estimators
Finding the Best Estimator
Bias
Variance
Surprising Theorem
Thm: Given parameters n, k, ε, and a ‘linear’ property π, Either
OR
“Find lower bound instance y+, y-: Maximize H(y+) - H(y-) s.t. expected fingerprint entries given k samples from y+, y- match to within k^(1-c)”, for y+, y- dists. over [n]
Proof Idea: Duality!!
“Find estimator z: Minimize ε, s.t. expectation of estimator z applied to k samples from p is within ε of H(p)”, for all dists. p over [n]
s.t. for all i, E[f+
Further Directions
Resolve dependency on ε
Beyond additive estimates – “case-by-case optimal”?
We suspect linear programming is better than linear estimators
Non-symmetric properties?
Leveraging structural information to yield improved sample complexity
Practical applications!
Applications of Central Limit Theorems?
for entropy, Θ(n/(ε log n))