Kernel techniques for adaptive Monte Carlo methods

(1)

Kernel methods for adaptive Monte Carlo

Heiko Strathmann

Gatsby Unit, UCL London

(2)

(3)

Metropolis Hastings transition kernel

Target π(θ)_∝p(θ_|D)

I At iteration j+1, state θ_(j) I Proposeθ0 ∼q θ|θ_(j)

I Acceptθ_(j₊₁₎ ←θ0 with probability min π(θ0₎ π(θ(j)) × q(θ(j)|θ0) q(θ0_|_θ_(j₎₎,1 I Rejectθ_(j₊₁₎ ←θ_(j) otherwise.

(4)

Intractable target – running example

Gaussian process classification model on_{(xi,yi)_}n_i=1

I latent process responsef ∈_Rn wheref_i =f(xi)

I labelsD=y _{∈ {−}1,1_}n

I hyper-parametersθ

Joint distribution

p(f,y, θ) =p(θ)p(f_|θ)p(y_|f)

I f|θ ∼ N(0,_Kθ) with covariance matrix Kθ

(5)

Intractable target – running example

I Interested in posterior parameters

p(θ|y)_∝p(θ)p(y_|θ) = p(θ)

Z

p(y_|f)p(f_|θ)df

c.f.Filippone & Girolami (2014), Murray & Adams (2011)

I Unbiased estimate via importance sampling:

p(y_|θ)_≈ 1 nimp nimp X i=1 p(y_|f(i)_)p(_f(i)_|_θ₎ Q(f(i)₎

with f(i)_∼_Q₍_f₎_{, which is obtained via e.g. EP}

I Instance of pseudo-marginal MCMC

[Beaumont, 2003], [Andrieu & Roberts, 2009], ... [Lyne et. al 2015]

(6)

Intractable target – running example

−6 −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 θ7

(7)

Learning covariance

I [Haario et al., 1999]learn covariance on the fly

I _{Given Markov chain at state}θ_(t)_{, then for} λ_t ∈(0,1), set

Σt = (1₋λt)Σt₋1+λt θ(t)θ>(t)

and use proposal

q(_·|θ(t)) =N(·|θ(t),Σt)

I Careful whenq depends on {θ_(i)}_i_≤_t

I Can chooseλ_t s.t.Σt _→Cov(π)as t _{→ ∞}under some assumptions on π [Andrieu, 2008]

(8)

Adaptive Metropolis

[Haario et al., 1999]

Improves mixing but is

(9)

Learning kernel covariance

[Sejdinovic et al., 2012]

φ

Cz

Feature space sample f

xt φ(xt)

Feature space _H Input space_X

(10)

The Kameleon

[Sejdinovic et al., 2012]

Learned kernel covariance allows to I propose locally aligned moves

I _{improved mixing on nonlinear targets} I without the need for gradients

(11)

This talk: learning gradients

Gradients allow to

I propose distant moves with I high acceptance probability I in high dimensions

(12)

Hamiltonian dynamics 101

I Potential energyU(q)=−logπ(q) I Momentump ∼exp(₋K(p)), K(p) = ₋1 2p>p I Hamiltonian H(p,q) :=K(p) +U(q) I H-Flow is map φHt : (p,q)7→(p∗,q∗) s.t.H(p∗,q∗) = H(p,q)_∀t

I Acceptance probability along flow is 1. I Generated by operator: ∂K ∂p ∂ ∂q − ∂U ∂q ∂ ∂p

(13)

Exponential families in kernel spaces

I Need a surrogate density model to model gradient I Kameleon used Gaussian in RKHSH

I Here: exponential family [Sriperumbudur at al., 2014]

π(θ)_≈exp   hf,k(θ,·)iH | {z } =f(θ) −A(f)   

I For certain k, dense in probability densities (KL, TV, ...) I Crux: fitting – normalising constant A(f) is intractable

A(f) =log

Z

exp(f(θ))dθ

(14)

Score matching

[Hyvärien, 2005]

I Instead of ML, minimise Fisher divergence

arg min f∈H 1 2 Z π(θ)_k∇θf(θ)−∇θlogπ(θ)k22dθ

I Intuition: match gradients in high density regions I Remarkable: can rewrite and estimate from {θ_i}n_i=1 ∼π

arg min f∈H 1 n n X i=1 d X `=1 " ∂2_f₍_θ₎ ∂θ2 ` +1 2 ∂f(θ) ∂θ` 2#

I Can be minimised in closed form. Reduces to regression. I In practice much more robust than KDE.

(15)

Hamiltonian moves without gradients

−6 −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 θ7 −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 1 θ7

Kernel induced Hamiltonian flow:

∂K ∂p ∂ ∂q − ∂f ∂q ∂ ∂p

(16)

Kernel HMC

[Strathmann et al., 2015]

Start as random walk, transition to HMC.

Every iteration:

I Learn/update gradient model using past trajectory

I Use surrogate gradient to simulate Hamiltonian dynamics I Correction for simulation error and gradient error:

MH accept/reject step using estimator for π

I Stop adapting eventually

(17)

Computational considerations

HMC KMC 0 100 200 300 400 500 Leap-frog steps 0.70 0.75 0.80 0.85 0.90 0.95 1.00 Acceptance prob.

I Bad fit⇒ low acceptance rate⇒ inefficient. But... I Gradient model expensive to fit to Markov chain {θ_i}t_i=1:

I O(t3d3) time

I O(t2d2) memory

I Markov chain trajectory length t grows I Aim is ’high’ dimension d

(18)

One approximation: KMC Lite

f(θ) = n X i=1 αik(zi, θ) I {z_i}n_i=1 ⊆ {θ_i}t_i₌₁ sub-sample I α∈_Rn from ˆ αλ =− σ 2(C +λI) −1_b whereC _∈Rn×n,b ∈Rn

depend on kernel matrix I Cost O(n3+n2_d₎_(modulo low-rank, CG). Geometrically ergodic on log-concave targets. Gradient norm: Gaussian KMC Lite

(19)

Geometric ergodicity intuition

MCMC chain visits ’interesting’ parts I geometrically fast

I in particular when initialised in tails I means: same guarantees as RWM

Proof idea

I In KMC lite, we have forkqk → ∞

f(q) = n

X

i=1

αik(zi,q) = exp(−kzi −qk)→0

I Recall kernel H-flow is generated as

∂K ∂p ∂ ∂q − ∂f ∂q ∂ ∂p

(20)

Why do we care?

Early adaptation stopping is potentially harmful... But we need to for asymptotic correctness!

(21)

Why do we care?

Imagine we stopped adaptation early... with a bad fit.

(22)

Why do we care?

(23)

Acceptance rate in high dimensions

Challenging Gaussian (top): I Eigenvalues: λ_i ∼Exp(1). I Covariance: diag(λ₁, . . . , λ_d),

randomly rotate.

I ‘Non-singular’ length-scales I KMC scales up tod ≈30. Isotropic Gaussian (bottom):

I More smooth I _{KMC scales up to}_d ≈_100. 102 ₁₀3 ₁₀4 n 100 101 d 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 102 ₁₀3 ₁₀4 n 100 101 102 d 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

(24)

KMC asymptotically behaves as HMC

8-dimensional strongly nonlinear snythetic banana

0 500 1000 1500 2000 n 0 5 10 15 20 25 30 35 Minimum ESS HMC KMC RW KAMH 0 500 1000 1500 2000 n 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Acc. rate 0 500 1000 1500 2000 n 0 2 4 6 8 10 k ˆ E[X]k

(25)

KMC improves mixing

0 1000 2000 3000 4000 5000 Iterations 102 103 104 105 106 107 M M D fr om gr ou nd tr ut h KMC KAMH RW −5 −4 −3 −2 −1 0 θ2 −5 −4 −3 −2 −1 0 1 θ7

(26)

Kernel sequential Monte Carlo

[Schuster & Strathmann et al., 2016]

Nonlinear versions of

I Adaptive Sequential Monte Carlo[Fearnhead et al., 2010] I Feature space covariance

Gradient free versions of

I Gradient Importance Sampling[Schuster et al., 2015] I Hamiltonian Importance Sampling [Naesseth et al., 2016]

Context:

I _{Intractable likelihoods, nested importance sampling}

(27)

Discussion

Kernel models as density emulators for Monte Carlo I Covariance [Sejdinovic et al., 2012]

I Gradients [Strathmann et al., 2015] I Leads to mixing improvements in practice I Useful for intractable targets

The crucial trade-offs: I Parameter selection I _Adaptation

I Computational costs I Growing dimensions

(28)

Kernel techniques for adaptive Monte Carlo methods

Kernel methods for adaptive Monte Carlo

Metropolis Hastings transition kernel

Intractable target – running example

Intractable target – running example

Intractable target – running example

Learning covariance

Adaptive Metropolis

[Haario et al., 1999]

Learning kernel covariance

[Sejdinovic et al., 2012]

The Kameleon

[Sejdinovic et al., 2012]

This talk: learning gradients

Hamiltonian dynamics 101

Exponential families in kernel spaces

Score matching

[Hyvärien, 2005]

Hamiltonian moves without gradients

Kernel HMC

[Strathmann et al., 2015]

Computational considerations

One approximation: KMC Lite

Geometric ergodicity intuition

Why do we care?

Why do we care?

Why do we care?

Acceptance rate in high dimensions

KMC asymptotically behaves as HMC

KMC improves mixing

Kernel sequential Monte Carlo

Discussion

Thank you