The square root rule for adaptive
importance sampling
Art B. Owen Stanford University
Based on joint work with Yi Zhou, PhD (1998) arXiv:1901.02976
To appear in ACM Transactions on Modeling and Simulation (TOMACS)
Adaptive importance sampling
1) We use importance sampling
2) From data
· · ·
see that we could have done it better3) So we iterate
This talk
How to combine results from multiple iterations. Weight
k
’th iteration proportionally to√
k
. Simple, safe, effective.Genesis
This is from the Appendix to
“Adaptive importance sampling by mixtures of products of beta distributions”
Importance sampling notation
µ =
Z
f (x)p(x) dx
ˆ
µ =
1
n
nX
i=1f (x
i)p(x
i)
q(x
i)
,
x
i iid∼ q
whereq(x) > 0
wheneverf (x)p(x) 6= 0
.Variance
var(ˆ
µ) =
σ
2 qn
,
whereσ
q2=
Z
f
2p
2q
− µ
2=
Z
(f p − µq)
2q
f > 0 =⇒ σ
q= 0
can be approached Avoid smallq
University of FloridaSelf normalized I.S.
˜
µ =
1
n
nX
i=1f (x
i)p(x
i)
q(x
i)
.
1
n
nX
i=1p(x
i)
q(x
i)
,
x
i iid∼ q
Less restrictive:p
andq
don’t have to be normalizedMore restrictive: we need
q > 0
wheneverp > 0
Nota Bene
SNIS cannot approach zero variance unless
f
is constant.lim
n→∞n × var(˜
µ) >
"
Z
|f (x) − µ|p(x) dx
#
2We focus here on adaptive plain IS. Some findings apply to SNIS too.
Parametric AIS
ˆ
µ =
1
n
nX
i=1f (x
i)p(x
i)
q(x
i; θ)
,
x
i iid∼ q(·; θ)
Core iteration
1) chooseθ
, 2) getx
1, . . . , x
n,→ ˆ
µ
, 3) updateθ
University of FloridaBasic TODO list
1) pick a family
q(·; θ)
,θ ∈ Θ
2) choose starting point
θ
13) choose sample size
n
and numberK > 2
of steps4) design a rule to pick
θ
k using data from steps1 · · · k − 1
5) sample
x
ik iid∼ q(·; θ
k)
and computeˆ
µ
k=
1
n
nX
i=1f (x
ik)p(x
ik)
q(x
ik; θ
k)
,
6) combineµ
ˆ
1, ˆ
µ
2, . . . , ˆ
µ
K intoµ
ˆ
There are
N = nK
data values.This talk
Example AIS
Ryu and Boyd (2014)
Adapt after every data point.
n = 1
,K = N
, using convex optimizationZhang (1996)
K = 2
. First sample is a pilot sample. Second sample from a kernel density estimate.Kollman, Baggerly, Cox, Picard (1999)
Get
var(ˆ
µ) ≈ exp(−A × K)
. Possible becausef (x)p(x) ∝ q(x; θ)
someθ ∈ Θ ⊂ R
r.Kong and Spanier (2011)
Geometric convergence in radiative transport problems.
De Boer, Kroese, Mannor, Rubinstein (2005)
Martingales
History prior to step k:
H
k≡ (x
i`, i = 1, . . . , n, ` < k)
A martingale argument underlies the analysis of mean, variances, covariances.
Unbiasedness
E(ˆ
µ
k| H
k) = µ
Variance
var(ˆ
µ
k| H
k) = σ
k2≡
1
n
Z
(f (x)p(x) − µq(x; θ
k))
2q(x; θ
k)
dx
NBσ
k2= σ
k2(H
k)
is randomvar(ˆ
µ
k) = E(σ
k2) ≡
τ
k2Variance estimates
ˆ
σ
k2=
1
n
1
n − 1
nX
i=1f (x
i)p(x
i)
q(x
i; θ
k)
− ˆ
µ
k 2(
ifn > 2)
E(ˆ
σ
k2| H
k) = σ
2 kE(ˆ
σ
k2) = E(σ
2 k) = τ
2 kˆ
σ
k2 is unbiased for bothσ
k2 andτ
k2 Takeτ
ˆ
k2≡ ˆ
σ
k2Covariance
For` > k
cov(ˆ
µ
k, ˆ
µ
`) = E(E((ˆ
µ
k− µ)(ˆ
µ
`− µ) | H
`))
= E((ˆ
µ
k− µ)E(ˆ
µ
`− µ | H
`))
= 0
Upshot
ˆ
Fixed linear weights
ˆ
µ =
KX
k=1ω
kµ
ˆ
kω
k> 0
andX
kω
k= 1
Variance
var(ˆ
µ) =
KX
k=1ω
k2τ
k2Unknown optimal weights
ω
k∝ τ
k−2Why simple?
Consider AMIS Cornuet, Marin, Mira, Robert (2012)
Weight on
µ
ˆ
k can depend on future iterations. Very hard to analyze. “· · ·
the convergence properties of the algorithm cannot be investigated· · ·
”What not to do
Do not take
ω
k∝ ˆ
τ
k−2= ˆ
σ
k−2Positive skew is common
E((ˆ
µ
k− µ)
3| H
k) > 0
=⇒ cov(ˆ
µ
k, ˆ
τ
k2) > 0
=⇒
Get smallµ
ˆ
k with smallτ
ˆ
k2 (largeω
k) and largeµ
ˆ
with smallω
kResult
We would downweight large
µ
ˆ
k (largeω
k) and upweight small onesBad for failure probabilities
Also
Model for steady gain
τ
k2= τ
2×
k
−y,
0 6 y 6 1, 0 < τ < ∞
Invoking G.E.P. Box: This model might never hold exactly but it captures
qualitative behavior and variance is a continuous function of the weights used.
Too pessimistic case
y = 0 =⇒
no learningToo optimistic case
y = 1 =⇒
getvar(ˆ
µ) = O(N
−2)
Not reasonable unless
f (x)p(x) = q(x; θ)
someθ
We guess
τ
k2∝ k
xˆ
µ = ˆ
µ(x) =
KX
k=1k
xµ
ˆ
k.
X
K k=1k
x0 < x < 1
University of FloridaVariances
τ
k2= τ
2k
−yˆ
µ(x) =
KX
k=1k
xµ
ˆ
k.
X
K k=1k
xvar(ˆ
µ(x)) = τ
2 KX
k=1k
2x−y.
X
K k=1k
x!
2At
x =
optimal unknown
y
var(ˆ
µ(y)) = τ
2 KX
k=1k
y!
−1Rate
Inefficiency
We should have used
y
but we did usex
ρ
K(x | y) ≡
var(ˆ
µ(x))
var(ˆ
µ(y))
=
P
K k=1k
2x−yP
K k=1k
yP
K k=1k
x 2Just use
x = 1/2
sup
16K<∞sup
06y61ρ
K1
2
| y
6
9
8
O & Zhou (2019)Unknown optimal rate; mildly suboptimal constant.
Steps in the proof
Lemma 1sup
06y61ρ
K(x | y) =
ρ
K(x | 1), x 6 1/2
ρ
K(x | 0), x > 1/2.
Trivial forK = 1
. ForK > 2
,ρ
K(x | y)
is strictly convex iny
Also:ρ
K1
2
| 0
= ρ
K1
2
| 1
. Lemma 2ρ
K+11
2
| 1
> ρ
K1
2
| 1
,
K > 1
Long argument using very tight inequalities for sums of powers of integers.
Theorem
L’H ˆopital’s rule:
lim
K→∞ρ
K1
2
| 1
=
9
8
Robustness
O & Zhou (2019) looks at other models Diminishing returns model:
τ
k2∝
k
−1,
1 6 k 6 k
1(1 + k
1)
−1, k
1+ 1 6 k 6 k
1+ k
2 Square root rule hasmax
16k16100
max
16k26100
ρ 6 1.121
Bad case for sqrt
First iterations make no progress. Then variance drops sharply.
Self normalized
Above argument applies to variance Have to contend with bias.
2 4 6 8 10 12 14 0.0 0.2 0.4 0.6 0.8 Iteration V ar iance ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 V ar iance ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Asymptote => OK, Step => inefficient
Realistic patterns
1)
τ
k2> η > 0
fork = 1, . . . , K
2)
τ
k+126 τ
k23) And maybe diminishing returns
(a)
τ
k+22/τ
k+12> τ
k+12/τ
k2, or(b)
τ
k+12− τ
k+226 τ
k+12− τ
k2O & Zhou (2019) have some more examples.
Convex minimax
Pickω
k in simplex tomin
ωmax
τ ∈TX
kω
k2τ
k2Choosing
T = {(τ
12, . . . , τ
K2)}
for future workThanks
•
Yi Zhou, co-author•
Christian Robert, Richard Everitt, invitation•
Victor Elvira, Felipe Aguayo, co-speakers•
NSF DMS-1407397, DMS-1521145, IIS-1837931•
Hobert, Betancourt, Khare, Michailidis, Patra, organizers