• No results found

Kernel techniques for adaptive Monte Carlo methods

N/A
N/A
Protected

Academic year: 2021

Share "Kernel techniques for adaptive Monte Carlo methods"

Copied!
28
0
0

Loading.... (view fulltext now)

Full text

(1)

Kernel methods for adaptive Monte Carlo

Heiko Strathmann

Gatsby Unit, UCL London

(2)
(3)

Metropolis Hastings transition kernel

Target π(θ)p(θ|D)

I At iteration j+1, state θ(j) I Proposeθ0 ∼q θ|θ(j)

I Acceptθ(j+1) ←θ0 with probability min π(θ0) π(θ(j)) × q(θ(j)|θ0) q(θ0|θ(j)),1 I Rejectθ(j+1) ←θ(j) otherwise.

(4)

Intractable target – running example

Gaussian process classification model on{(xi,yi)}ni=1

I latent process responsef ∈Rn wherefi =f(xi)

I labelsD=y ∈ {−1,1}n

I hyper-parametersθ

Joint distribution

p(f,y, θ) =p(θ)p(f|θ)p(y|f)

I f|θ ∼ N(0,Kθ) with covariance matrix Kθ

(5)

Intractable target – running example

I Interested in posterior parameters

p(θ|y)p(θ)p(y|θ) = p(θ)

Z

p(y|f)p(f|θ)df

c.f.Filippone & Girolami (2014), Murray & Adams (2011)

I Unbiased estimate via importance sampling:

p(y|θ) 1 nimp nimp X i=1 p(y|f(i))p(f(i)|θ) Q(f(i))

with f(i)Q(f), which is obtained via e.g. EP

I Instance of pseudo-marginal MCMC

[Beaumont, 2003], [Andrieu & Roberts, 2009], ... [Lyne et. al 2015]

(6)

Intractable target – running example

−6 −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 θ7

(7)

Learning covariance

I [Haario et al., 1999]learn covariance on the fly

I Given Markov chain at stateθ(t), then for λt ∈(0,1), set

Σt = (1λt)Σt1+λt θ(t)θ>(t)

and use proposal

q(·|θ(t)) =N(·|θ(t),Σt)

I Careful whenq depends on {θ(i)}it

I Can chooseλt s.t.Σt Cov(π)as t → ∞under some assumptions on π [Andrieu, 2008]

(8)

Adaptive Metropolis

[Haario et al., 1999]

Improves mixing but is

(9)

Learning kernel covariance

[Sejdinovic et al., 2012]

φ

Cz

Feature space sample f

xt φ(xt)

Feature space H Input spaceX

(10)

The Kameleon

[Sejdinovic et al., 2012]

Learned kernel covariance allows to I propose locally aligned moves

I improved mixing on nonlinear targets I without the need for gradients

(11)

This talk: learning gradients

Gradients allow to

I propose distant moves with I high acceptance probability I in high dimensions

(12)

Hamiltonian dynamics 101

I Potential energyU(q)=−logπ(q) I Momentump ∼exp(K(p)), K(p) = 1 2p>p I Hamiltonian H(p,q) :=K(p) +U(q) I H-Flow is map φHt : (p,q)7→(p∗,q∗) s.t.H(p∗,q∗) = H(p,q)t

I Acceptance probability along flow is 1. I Generated by operator: ∂K ∂p ∂ ∂q − ∂U ∂q ∂ ∂p

(13)

Exponential families in kernel spaces

I Need a surrogate density model to model gradient I Kameleon used Gaussian in RKHSH

I Here: exponential family [Sriperumbudur at al., 2014]

π(θ)exp   hf,k(θ,·)iH | {z } =f(θ) −A(f)   

I For certain k, dense in probability densities (KL, TV, ...) I Crux: fitting – normalising constant A(f) is intractable

A(f) =log

Z

exp(f(θ))dθ

(14)

Score matching

[Hyvärien, 2005]

I Instead of ML, minimise Fisher divergence

arg min f∈H 1 2 Z π(θ)k∇θf(θ)−∇θlogπ(θ)k22dθ

I Intuition: match gradients in high density regions I Remarkable: can rewrite and estimate from {θi}ni=1 ∼π

arg min f∈H 1 n n X i=1 d X `=1 " ∂2f(θ) ∂θ2 ` +1 2 ∂f(θ) ∂θ` 2#

I Can be minimised in closed form. Reduces to regression. I In practice much more robust than KDE.

(15)

Hamiltonian moves without gradients

−6 −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 θ7 −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 1 θ7

Kernel induced Hamiltonian flow:

∂K ∂p ∂ ∂q − ∂f ∂q ∂ ∂p

(16)

Kernel HMC

[Strathmann et al., 2015]

Start as random walk, transition to HMC.

Every iteration:

I Learn/update gradient model using past trajectory

I Use surrogate gradient to simulate Hamiltonian dynamics I Correction for simulation error and gradient error:

MH accept/reject step using estimator for π

I Stop adapting eventually

(17)

Computational considerations

HMC KMC 0 100 200 300 400 500 Leap-frog steps 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Acceptance prob.

I Bad fit⇒ low acceptance rate⇒ inefficient. But... I Gradient model expensive to fit to Markov chain {θi}ti=1:

I O(t3d3) time

I O(t2d2) memory

I Markov chain trajectory length t grows I Aim is ’high’ dimension d

(18)

One approximation: KMC Lite

f(θ) = n X i=1 αik(zi, θ) I {zi}ni=1 ⊆ {θi}ti=1 sub-sample I α∈Rn from ˆ αλ =− σ 2(C +λI) −1b whereC Rn×n,b ∈Rn

depend on kernel matrix I Cost O(n3+n2d)(modulo low-rank, CG). Geometrically ergodic on log-concave targets. Gradient norm: Gaussian KMC Lite

(19)

Geometric ergodicity intuition

MCMC chain visits ’interesting’ parts I geometrically fast

I in particular when initialised in tails I means: same guarantees as RWM

Proof idea

I In KMC lite, we have forkqk → ∞

f(q) = n

X

i=1

αik(zi,q) = exp(−kzi −qk)→0

I Recall kernel H-flow is generated as

∂K ∂p ∂ ∂q − ∂f ∂q ∂ ∂p

(20)

Why do we care?

Early adaptation stopping is potentially harmful... But we need to for asymptotic correctness!

(21)

Why do we care?

Imagine we stopped adaptation early... with a bad fit.

(22)

Why do we care?

(23)

Acceptance rate in high dimensions

Challenging Gaussian (top): I Eigenvalues: λi ∼Exp(1). I Covariance: diag(λ1, . . . , λd),

randomly rotate.

I ‘Non-singular’ length-scales I KMC scales up tod ≈30. Isotropic Gaussian (bottom):

I More smooth I KMC scales up tod100. 102 103 104 n 100 101 d 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 102 103 104 n 100 101 102 d 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

(24)

KMC asymptotically behaves as HMC

8-dimensional strongly nonlinear snythetic banana

0 500 1000 1500 2000 n 0 5 10 15 20 25 30 35 Minimum ESS HMC KMC RW KAMH 0 500 1000 1500 2000 n 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Acc. rate 0 500 1000 1500 2000 n 0 2 4 6 8 10 k ˆ E[X]k

(25)

KMC improves mixing

0 1000 2000 3000 4000 5000 Iterations 102 103 104 105 106 107 M M D fr om gr ou nd tr ut h KMC KAMH RW −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 1 θ7

(26)

Kernel sequential Monte Carlo

[Schuster & Strathmann et al., 2016]

Nonlinear versions of

I Adaptive Sequential Monte Carlo[Fearnhead et al., 2010] I Feature space covariance

Gradient free versions of

I Gradient Importance Sampling[Schuster et al., 2015] I Hamiltonian Importance Sampling [Naesseth et al., 2016]

Context:

I Intractable likelihoods, nested importance sampling

(27)

Discussion

Kernel models as density emulators for Monte Carlo I Covariance [Sejdinovic et al., 2012]

I Gradients [Strathmann et al., 2015] I Leads to mixing improvements in practice I Useful for intractable targets

The crucial trade-offs: I Parameter selection I Adaptation

I Computational costs I Growing dimensions

(28)

Thank you

References

Related documents

For both arithmetic average options and basket options, we show some nu- merical exemplifications in 4 and 12 dimensions using the crude Monte Carlo and the Sobol Quasi-Monte

To speed up the process there exist different variance reduction techniques and also Quasi Monte Carlo (QMC) simulation, where the integral is evaluated by using

In this thesis, I present two new quantum Monte Carlo techniques, the Monte Carlo Power Method and Bose-Fermi Auxiliary-Field Quantum Monte Carlo, and apply previ- ously developed

Abdessalem AB, Dervilis N, Wagg DJ and Worden K (2017) Automatic Kernel Selection for Gaussian Processes Regression with Approximate Bayesian Computation and Sequential Monte

Keywords: Monte Carlo Method; Importance Sampling; Variance Reduction; Option

For a model elliptic problem, we provide a full convergence and complexity analysis of the ratio estimator in the case where Monte Carlo, quasi-Monte Carlo or multilevel Monte

Two Markov chain Monte Carlo methods were introduced, the manifold Metropolis-adjusted Langevin algorithm and Riemannian manifold Hamiltonian Monte Carlo.. Here, we focus on the

103 4.3 The Monte Carlo method for stochastic processes 107 4.3.1 Monte Carlo and stochastic processes 107 4.3.2 Simulating paths of stochastic processes: Basics... 135 4.6