• No results found

The square root rule for adaptive importance sampling

N/A
N/A
Protected

Academic year: 2021

Share "The square root rule for adaptive importance sampling"

Copied!
21
0
0

Loading.... (view fulltext now)

Full text

(1)

The square root rule for adaptive

importance sampling

Art B. Owen Stanford University

Based on joint work with Yi Zhou, PhD (1998) arXiv:1901.02976

To appear in ACM Transactions on Modeling and Simulation (TOMACS)

(2)

Adaptive importance sampling

1) We use importance sampling

2) From data

· · ·

see that we could have done it better

3) So we iterate

This talk

How to combine results from multiple iterations. Weight

k

’th iteration proportionally to

k

. Simple, safe, effective.
(3)

Genesis

This is from the Appendix to

“Adaptive importance sampling by mixtures of products of beta distributions”

(4)

Importance sampling notation

µ =

Z

f (x)p(x) dx

ˆ

µ =

1

n

n

X

i=1

f (x

i

)p(x

i

)

q(x

i

)

,

x

i iid

∼ q

where

q(x) > 0

whenever

f (x)p(x) 6= 0

.

Variance

var(ˆ

µ) =

σ

2 q

n

,

where

σ

q2

=

Z

f

2

p

2

q

− µ

2

=

Z

(f p − µq)

2

q

f > 0 =⇒ σ

q

= 0

can be approached Avoid small

q

University of Florida
(5)

Self normalized I.S.

˜

µ =

1

n

n

X

i=1

f (x

i

)p(x

i

)

q(x

i

)

.

1

n

n

X

i=1

p(x

i

)

q(x

i

)

,

x

i iid

∼ q

Less restrictive:

p

and

q

don’t have to be normalized

More restrictive: we need

q > 0

whenever

p > 0

Nota Bene

SNIS cannot approach zero variance unless

f

is constant.

lim

n→∞

n × var(˜

µ) >

"

Z

|f (x) − µ|p(x) dx

#

2

We focus here on adaptive plain IS. Some findings apply to SNIS too.

(6)

Parametric AIS

ˆ

µ =

1

n

n

X

i=1

f (x

i

)p(x

i

)

q(x

i

; θ)

,

x

i iid

∼ q(·; θ)

Core iteration

1) choose

θ

, 2) get

x

1

, . . . , x

n,

→ ˆ

µ

, 3) update

θ

University of Florida
(7)

Basic TODO list

1) pick a family

q(·; θ)

,

θ ∈ Θ

2) choose starting point

θ

1

3) choose sample size

n

and number

K > 2

of steps

4) design a rule to pick

θ

k using data from steps

1 · · · k − 1

5) sample

x

ik iid

∼ q(·; θ

k

)

and compute

ˆ

µ

k

=

1

n

n

X

i=1

f (x

ik

)p(x

ik

)

q(x

ik

; θ

k

)

,

6) combine

µ

ˆ

1

, ˆ

µ

2

, . . . , ˆ

µ

K into

µ

ˆ

There are

N = nK

data values.

This talk

(8)

Example AIS

Ryu and Boyd (2014)

Adapt after every data point.

n = 1

,

K = N

, using convex optimization

Zhang (1996)

K = 2

. First sample is a pilot sample. Second sample from a kernel density estimate.

Kollman, Baggerly, Cox, Picard (1999)

Get

var(ˆ

µ) ≈ exp(−A × K)

. Possible because

f (x)p(x) ∝ q(x; θ)

some

θ ∈ Θ ⊂ R

r.

Kong and Spanier (2011)

Geometric convergence in radiative transport problems.

De Boer, Kroese, Mannor, Rubinstein (2005)

(9)

Martingales

History prior to step k:

H

k

≡ (x

i`

, i = 1, . . . , n, ` < k)

A martingale argument underlies the analysis of mean, variances, covariances.

Unbiasedness

E(ˆ

µ

k

| H

k

) = µ

(10)

Variance

var(ˆ

µ

k

| H

k

) = σ

k2

1

n

Z

(f (x)p(x) − µq(x; θ

k

))

2

q(x; θ

k

)

dx

NB

σ

k2

= σ

k2

(H

k

)

is random

var(ˆ

µ

k

) = E(σ

k2

) ≡

τ

k2

Variance estimates

ˆ

σ

k2

=

1

n

1

n − 1

n

X

i=1

f (x

i

)p(x

i

)

q(x

i

; θ

k

)

− ˆ

µ

k

2

(

if

n > 2)

E(ˆ

σ

k2

| H

k

) = σ

2 k

E(ˆ

σ

k2

) = E(σ

2 k

) = τ

2 k

ˆ

σ

k2 is unbiased for both

σ

k2 and

τ

k2 Take

τ

ˆ

k2

≡ ˆ

σ

k2
(11)

Covariance

For

` > k

cov(ˆ

µ

k

, ˆ

µ

`

) = E(E((ˆ

µ

k

− µ)(ˆ

µ

`

− µ) | H

`

))

= E((ˆ

µ

k

− µ)E(ˆ

µ

`

− µ | H

`

))

= 0

Upshot

ˆ

(12)

Fixed linear weights

ˆ

µ =

K

X

k=1

ω

k

µ

ˆ

k

ω

k

> 0

and

X

k

ω

k

= 1

Variance

var(ˆ

µ) =

K

X

k=1

ω

k2

τ

k2

Unknown optimal weights

ω

k

∝ τ

k−2

Why simple?

Consider AMIS Cornuet, Marin, Mira, Robert (2012)

Weight on

µ

ˆ

k can depend on future iterations. Very hard to analyze. “

· · ·

the convergence properties of the algorithm cannot be investigated

· · ·

(13)

What not to do

Do not take

ω

k

∝ ˆ

τ

k−2

= ˆ

σ

k−2

Positive skew is common

E((ˆ

µ

k

− µ)

3

| H

k

) > 0

=⇒ cov(ˆ

µ

k

, ˆ

τ

k2

) > 0

=⇒

Get small

µ

ˆ

k with small

τ

ˆ

k2 (large

ω

k) and large

µ

ˆ

with small

ω

k

Result

We would downweight large

µ

ˆ

k (large

ω

k) and upweight small ones

Bad for failure probabilities

Also

(14)

Model for steady gain

τ

k2

= τ

2

×

k

−y

,

0 6 y 6 1, 0 < τ < ∞

Invoking G.E.P. Box: This model might never hold exactly but it captures

qualitative behavior and variance is a continuous function of the weights used.

Too pessimistic case

y = 0 =⇒

no learning

Too optimistic case

y = 1 =⇒

get

var(ˆ

µ) = O(N

−2

)

Not reasonable unless

f (x)p(x) = q(x; θ)

some

θ

We guess

τ

k2

∝ k

x

ˆ

µ = ˆ

µ(x) =

K

X

k=1

k

x

µ

ˆ

k

.

X

K k=1

k

x

0 < x < 1

University of Florida
(15)

Variances

τ

k2

= τ

2

k

−y

ˆ

µ(x) =

K

X

k=1

k

x

µ

ˆ

k

.

X

K k=1

k

x

var(ˆ

µ(x)) = τ

2 K

X

k=1

k

2x−y

.

X

K k=1

k

x

!

2

At

x =

optimal unknown

y

var(ˆ

µ(y)) = τ

2 K

X

k=1

k

y

!

−1

Rate

(16)

Inefficiency

We should have used

y

but we did use

x

ρ

K

(x | y) ≡

var(ˆ

µ(x))

var(ˆ

µ(y))

=

P

K k=1

k

2x−y

P

K k=1

k

y

P

K k=1

k

x

2

Just use

x = 1/2

sup

16K<∞

sup

06y61

ρ

K

1

2

| y

6

9

8

O & Zhou (2019)

Unknown optimal rate; mildly suboptimal constant.

(17)

Steps in the proof

Lemma 1

sup

06y61

ρ

K

(x | y) =

ρ

K

(x | 1), x 6 1/2

ρ

K

(x | 0), x > 1/2.

Trivial for

K = 1

. For

K > 2

,

ρ

K

(x | y)

is strictly convex in

y

Also:

ρ

K

1

2

| 0

= ρ

K

1

2

| 1

. Lemma 2

ρ

K+1

1

2

| 1

> ρ

K

1

2

| 1

,

K > 1

Long argument using very tight inequalities for sums of powers of integers.

Theorem

L’H ˆopital’s rule:

lim

K→∞

ρ

K

1

2

| 1

=

9

8

(18)

Robustness

O & Zhou (2019) looks at other models Diminishing returns model:

τ

k2

k

−1

,

1 6 k 6 k

1

(1 + k

1

)

−1

, k

1

+ 1 6 k 6 k

1

+ k

2 Square root rule has

max

16k16100

max

16k26100

ρ 6 1.121

Bad case for sqrt

First iterations make no progress. Then variance drops sharply.

Self normalized

Above argument applies to variance Have to contend with bias.

(19)

2 4 6 8 10 12 14 0.0 0.2 0.4 0.6 0.8 Iteration V ar iance ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0.0 0.2 0.4 0.6 0.8 1.0 V ar iance ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Asymptote => OK, Step => inefficient

(20)

Realistic patterns

1)

τ

k2

> η > 0

for

k = 1, . . . , K

2)

τ

k+12

6 τ

k2

3) And maybe diminishing returns

(a)

τ

k+22

k+12

> τ

k+12

k2, or

(b)

τ

k+12

− τ

k+22

6 τ

k+12

− τ

k2

O & Zhou (2019) have some more examples.

Convex minimax

Pick

ω

k in simplex to

min

ω

max

τ ∈T

X

k

ω

k2

τ

k2

Choosing

T = {(τ

12

, . . . , τ

K2

)}

for future work
(21)

Thanks

Yi Zhou, co-author

Christian Robert, Richard Everitt, invitation

Victor Elvira, Felipe Aguayo, co-speakers

NSF DMS-1407397, DMS-1521145, IIS-1837931

Hobert, Betancourt, Khare, Michailidis, Patra, organizers

References

Related documents