The square root rule for adaptive importance sampling

(1)

The square root rule for adaptive

importance sampling

Art B. Owen Stanford University

Based on joint work with Yi Zhou, PhD (1998) arXiv:1901.02976

To appear in ACM Transactions on Modeling and Simulation (TOMACS)

(2)

Adaptive importance sampling

1) We use importance sampling

2) From data

· · ·

see that we could have done it better

3) So we iterate

This talk

How to combine results from multiple iterations. Weight

k

’th iteration proportionally to

√

k

. Simple, safe, effective.

(3)

Genesis

This is from the Appendix to

“Adaptive importance sampling by mixtures of products of beta distributions”

(4)

Importance sampling notation

µ =

Z

f (x)p(x) dx

ˆ

µ =

1

n

X

i=1

f (x

_i

)p(x

_i

)

q(x

i

)

,

x

i iid

∼ q

where

q(x) > 0

whenever

f (x)p(x) 6= 0

.

Variance

var(ˆ

µ) =

σ

2 q

n

,

where

σ

_q2

=

Z

f

2

p

2

q

− µ

2

=

Z

(f p − µq)

2

q

f > 0 =⇒ σ

q

= 0

can be approached Avoid small

q

University of Florida

(5)

Self normalized I.S.

˜

µ =

1

n

X

i=1

f (x

_i

)p(x

_i

)

q(x

i

)

.

1

n

X

i=1

p(x

_i

)

q(x

i

)

,

x

i iid

∼ q

Less restrictive:

p

and

q

don’t have to be normalized

More restrictive: we need

q > 0

whenever

p > 0

Nota Bene

SNIS cannot approach zero variance unless

f

is constant.

lim

n→∞

n × var(˜

µ) >

"

Z

|f (x) − µ|p(x) dx

#

2

We focus here on adaptive plain IS. Some findings apply to SNIS too.

(6)

Parametric AIS

ˆ

µ =

1

n

X

i=1

f (x

i

)p(x

i

)

q(x

i

; θ)

,

x

i iid

∼ q(·; θ)

Core iteration

1) choose

θ

, 2) get

x

1

, . . . , x

n,

→ ˆ

µ

, 3) update

θ

(7)

Basic TODO list

1) pick a family

q(·; θ)

,

θ ∈ Θ

2) choose starting point

θ

1

3) choose sample size

n

and number

K > 2

of steps

4) design a rule to pick

θ

k using data from steps

1 · · · k − 1

5) sample

x

_ik iid

∼ q(·; θ

_k

)

and compute

ˆ

µ

k

=

1

n

X

i=1

f (x

ik

)p(x

ik

)

q(x

ik

; θ

k

)

,

6) combine

µ

ˆ

₁

, ˆ

µ

₂

, . . . , ˆ

µ

_K into

µ

ˆ

There are

N = nK

data values.

This talk

(8)

Example AIS

Ryu and Boyd (2014)

Adapt after every data point.

n = 1

,

K = N

, using convex optimization

Zhang (1996)

K = 2

. First sample is a pilot sample. Second sample from a kernel density estimate.

Kollman, Baggerly, Cox, Picard (1999)

Get

var(ˆ

µ) ≈ exp(−A × K)

. Possible because

f (x)p(x) ∝ q(x; θ)

some

θ ∈ Θ ⊂ R

r.

Kong and Spanier (2011)

Geometric convergence in radiative transport problems.

De Boer, Kroese, Mannor, Rubinstein (2005)

(9)

Martingales

History prior to step k:

H

_k

≡ (x

_i`

, i = 1, . . . , n, ` < k)

A martingale argument underlies the analysis of mean, variances, covariances.

Unbiasedness

E(ˆ

µ

k

| H

k

) = µ

(10)

Variance

var(ˆ

µ

_k

| H

_k

) = σ

_k2

≡

1

n

Z

(f (x)p(x) − µq(x; θ

k

))

2

q(x; θ

k

)

dx

NB

σ

_k2

= σ

_k2

(H

k

)

is random

var(ˆ

µ

k

) = E(σ

_k2

) ≡

τ

_k2

Variance estimates

ˆ

σ

_k2

=

1

n

1

n − 1

n

X

i=1

f (x

_i

)p(x

_i

)

q(x

i

; θ

k

)

− ˆ

µ

k

2

(

if

n > 2)

E(ˆ

σ

k2

| H

k

) = σ

2 k

E(ˆ

σ

k2

) = E(σ

2 k

) = τ

2 k

ˆ

σ

_k2 is unbiased for both

σ

_k2 and

τ

_k2 Take

τ

ˆ

_k2

≡ ˆ

σ

_k2

(11)

Covariance

For

` > k

cov(ˆ

µ

k

, ˆ

µ

`

) = E(E((ˆ

µ

k

− µ)(ˆ

µ

`

− µ) | H

`

))

= E((ˆ

µ

k

− µ)E(ˆ

µ

`

− µ | H

`

))

= 0

Upshot

ˆ

(12)

Fixed linear weights

ˆ

µ =

K

X

k=1

ω

k

µ

ˆ

k

ω

k

> 0

and

X

k

ω

k

= 1

Variance

var(ˆ

µ) =

K

X

k=1

ω

_k2

τ

_k2

Unknown optimal weights

ω

k

∝ τ

_k−2

Why simple?

Consider AMIS Cornuet, Marin, Mira, Robert (2012)

Weight on

µ

ˆ

_k can depend on future iterations. Very hard to analyze. “

· · ·

the convergence properties of the algorithm cannot be investigated

· · ·

”

(13)

What not to do

Do not take

ω

_k

∝ ˆ

τ

_k−2

= ˆ

σ

_k−2

Positive skew is common

E((ˆ

µ

k

− µ)

3

| H

k

) > 0

=⇒ cov(ˆ

µ

_k

, ˆ

τ

_k2

) > 0

=⇒

Get small

µ

ˆ

k with small

τ

ˆ

_k2 (large

ω

k) and large

µ

ˆ

with small

ω

_k

Result

We would downweight large

µ

ˆ

_k (large

ω

_k) and upweight small ones

Bad for failure probabilities

Also

(14)

Model for steady gain

τ

_k2

= τ

2

×

k

−y

,

0 6 y 6 1, 0 < τ < ∞

Invoking G.E.P. Box: This model might never hold exactly but it captures

qualitative behavior and variance is a continuous function of the weights used.

Too pessimistic case

y = 0 =⇒

no learning

Too optimistic case

y = 1 =⇒

get

var(ˆ

µ) = O(N

−2

)

Not reasonable unless

f (x)p(x) = q(x; θ)

some

θ

We guess

τ

_k2

∝ k

x

ˆ

µ = ˆ

µ(x) =

K

X

k=1

k

x

µ

ˆ

k

.

X

K k=1

k

x

0 < x < 1

(15)

Variances

τ

_k2

= τ

2

k

−y

ˆ

µ(x) =

K

X

k=1

k

x

µ

ˆ

k

.

X

K k=1

k

x

var(ˆ

µ(x)) = τ

2 K

X

k=1

k

2x−y

.

X

K k=1

k

x

!

2

At

x =

optimal unknown

y

var(ˆ

µ(y)) = τ

2 K

X

k=1

k

y

!

−1

Rate

(16)

Inefficiency

We should have used

y

but we did use

x

ρ

K

(x | y) ≡

var(ˆ

µ(x))

var(ˆ

µ(y))

=

P

K k=1

k

2x−y

P

K k=1

k

y

P

K k=1

k

x

2

Just use

x = 1/2

sup

16K<∞

sup

06y61

ρ

K

1

2

| y

6

9

8

O & Zhou (2019)

Unknown optimal rate; mildly suboptimal constant.

(17)

Steps in the proof

Lemma 1

sup

06y61

ρ

K

(x | y) =







ρ

K

(x | 1), x 6 1/2

ρ

K

(x | 0), x > 1/2.

Trivial for

K = 1

. For

K > 2

,

ρ

K

(x | y)

is strictly convex in

y

Also:

ρ

K

1

2

| 0

= ρ

K

1

2

| 1

. Lemma 2

ρ

K+1

1

2

| 1

> ρ

K

1

2

| 1

,

K > 1

Long argument using very tight inequalities for sums of powers of integers.

Theorem

L’H ˆopital’s rule:

lim

_K→∞

ρ

_K

1

2

| 1

=

9

8

(18)

Robustness

O & Zhou (2019) looks at other models Diminishing returns model:

τ

_k2

∝







k

−1

,

1 6 k 6 k

1

(1 + k

1

)

−1

, k

1

+ 1 6 k 6 k

1

+ k

2 Square root rule has

max

16k16100

max

16k26100

ρ 6 1.121

Bad case for sqrt

First iterations make no progress. Then variance drops sharply.

Self normalized

Above argument applies to variance Have to contend with bias.

(19)

2 4 6 8 10 12 14 0.0 0.2 0.4 0.6 0.8 Iteration V ar iance ● ● ● ● ● ● _● ● _● ● ● ● ● ● ● ● ● ● ● ● ● _● ● _● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 V ar iance ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● _● ● ● ● ● ● ● Asymptote => OK, Step => inefficient

(20)

Realistic patterns

1)

τ

_k2

> η > 0

for

k = 1, . . . , K

2)

τ

_k+12

6 τ

_k2

3) And maybe diminishing returns

(a)

τ

_k+22

/τ

_k+12

> τ

_k+12

/τ

_k2, or

(b)

τ

_k+12

− τ

_k+22

6 τ

_k+12

− τ

_k2

O & Zhou (2019) have some more examples.

Convex minimax

Pick

ω

_k in simplex to

min

ω

max

τ ∈T

X

k

ω

_k2

τ

_k2

Choosing

T = {(τ

₁2

, . . . , τ

_K2

)}

for future work

(21)

Thanks

•

Yi Zhou, co-author

•

Christian Robert, Richard Everitt, invitation

•

Victor Elvira, Felipe Aguayo, co-speakers

•

NSF DMS-1407397, DMS-1521145, IIS-1837931

•

Hobert, Betancourt, Khare, Michailidis, Patra, organizers