
Chapter 5 Gradient flow without gradient

6.2 Main Results

6.2.3 Diffusion limit

As described in Section 6.2.2, for dimension $n_y \geq 3$ it is optimal to choose a scaling factor $h(\varepsilon)$ of order $O(\varepsilon)$. To go further in this direction, we study in this section the behaviour of the RWM algorithm, as $\varepsilon \to 0$, for a scaling factor of the form $h(\varepsilon) = \ell\,\varepsilon$, where $\ell > 0$ is a tuning parameter. In order to state our main result, it is useful to introduce the quantity

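$$a_0(x, \ell) = \int_{\mathbb{R}^{n_y}} \mathbb{E}\Big[ F\big( B(x, u + \ell Z_y) - B(x, u) \big) \Big]\, e^{B(x,u)}\, du \tag{6.2.5}$$

as well as the time scale $T(\varepsilon) = \varepsilon^{-2}$. The quantity $a_0(x, \ell)$ is the limiting acceptance probability, as $\varepsilon \to 0$, of the RWM algorithm when conditioned on the $x$-coordinate. Indeed, one can verify that $a_0(x, \ell)$ can also be expressed as

$$a_0(x, \ell) = \lim_{\varepsilon \to 0} \mathbb{E}_{\pi_\varepsilon}\big[ a(X_\varepsilon, U_\varepsilon, Z_x, Z_y) \mid X_\varepsilon = x \big].$$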
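As a rough numerical illustration of (6.2.5), the sketch below estimates $a_0(x, \ell)$ by plain Monte Carlo. It relies on assumptions that are not stated in this section: $F$ is taken to be the Metropolis acceptance function $F(r) = \min(1, e^r)$, $Z_y$ is taken to be standard Gaussian, and $u \mapsto e^{B(x,u)}$ is taken to be a normalised density, here a hypothetical standard Gaussian that does not depend on $x$. The name `estimate_a0` and the sample size are illustrative only.

```python
import numpy as np


def estimate_a0(ell, n_y=3, n_samples=100_000, seed=0):
    """Monte Carlo estimate of a0(x, ell) in (6.2.5) for a toy choice of B.

    Assumptions (not from the source): F(r) = min(1, exp(r)), Z_y ~ N(0, I),
    and exp(B(x, u)) is the standard Gaussian density in u, so the integral
    against exp(B(x, u)) du becomes an expectation over U ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, n_y))  # U ~ exp(B(x, .))
    z = rng.standard_normal((n_samples, n_y))  # proposal noise Z_y
    # B(x, u + ell z) - B(x, u); the normalising constant of B cancels.
    delta_b = -0.5 * (np.sum((u + ell * z) ** 2, axis=1) - np.sum(u ** 2, axis=1))
    # F(r) = min(1, exp(r)), written so that exp never overflows.
    accept = np.exp(np.minimum(delta_b, 0.0))
    return float(accept.mean())


if __name__ == "__main__":
    for ell in (0.1, 1.0, 3.0):
        print(f"ell = {ell}: estimated a0 = {estimate_a0(ell):.3f}")
```

In this toy setting the estimate is close to one for small $\ell$ and decays as $\ell$ grows, which is the qualitative behaviour one expects from an acceptance probability.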

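As will become clear from our diffusion limit analysis, the time scale $T(\varepsilon)$ is the natural time scale on which the $x$-coordinate process $\{X_{\varepsilon,k}\}_{k \geq 0}$ evolves. Our main result states that the accelerated processes $\widetilde{X}_\varepsilon$ of (6.2.6), that is, the $x$-coordinate process run on the time scale $T(\varepsilon)$, converge weakly, as $\varepsilon \to 0$, to a non-trivial diffusion process. For this reason, $T(\varepsilon)$ is called the 'diffusive time scale' in the sequel.

For a density $\pi$ on $\mathbb{R}^n$ and a volatility function $\sigma : \mathbb{R}^n \to (0, \infty)$ we introduce the function $\mathrm{drift}(\pi, \sigma^2) : \mathbb{R}^n \to \mathbb{R}^n$ given by

$$\mathrm{drift}(\pi, \sigma^2) : x \mapsto \frac{1}{2} \Big[ \sigma^2(x)\, \nabla \log \pi(x) + \nabla \sigma^2(x) \Big].$$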

Under mild assumptions on the density $\pi$ and the volatility function $\sigma$, the function $\mathrm{drift}(\pi, \sigma^2)$ is such that the diffusion process

$$dD_t = \mathrm{drift}(\pi, \sigma^2)(D_t)\, dt + \sigma(D_t)\, dW_t$$

is reversible with respect to the probability distribution $\pi$. The case $\sigma \equiv \mathrm{Cst}$ corresponds to the Langevin diffusion $dD = \frac{\sigma^2}{2} \nabla \log \pi \, dt + \sigma \, dW$. For our main scaling limit to hold, we assume regularity and growth assumptions on the functions $A : \mathbb{R}^{n_x} \to \mathbb{R}$ and $B : \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}$. These conditions are mainly technical.

Assumptions 6.2.2. (Growth and Regularity Assumptions on $\pi$)

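The first two derivatives of the functions $A : \mathbb{R}^{n_x} \to \mathbb{R}$ and $B : \mathbb{R}^{n_x} \times \mathbb{R}^{n_y} \to \mathbb{R}$ are bounded by a polynomial of degree $p \geq 1$. Moreover, there exists an exponent $\eta > 0$ such that the following moment condition holds,

$$\mathbb{E}_{\pi_1}\Big[ \big(1 + \|X\| + \|U\|\big)^{2p + \eta} \Big] < \infty, \tag{6.2.7}$$

where $(X, U) \overset{\mathcal{D}}{\sim} \pi_1$ has density $e^{A(x)} e^{B(x,u)}$.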

Assumptions 6.2.2 imply the existence of an integer $p \geq 1$ such that the norms of the quantities $A(x)$ and $B(x, u)$ and of their first two derivatives are less than a constant multiple of $\big(1 + \|x\| + \|u\|\big)^p$. This estimate is used at several places in the proof of our main result. The main theorem of this section is the following; its proof is described in Section 6.3.

Theorem 6.2.3. Let $T > 0$ be a fixed finite time horizon. Assume that Assumptions 6.2.2 hold and that the RWM algorithm is started in stationarity, $(X_{\varepsilon,0}, Y_{\varepsilon,0}) \overset{\mathcal{D}}{\sim} \pi_\varepsilon$. As $\varepsilon \to 0$, the sequence of accelerated processes $\{\widetilde{X}_{\varepsilon,t}\}_{t \in [0,T]}$ converges weakly in the Skorohod space $D([0, T], \mathbb{R}^{n_x})$ to the diffusion process $\{D_t\}_{t \in [0,T]}$ specified as the solution of the stochastic differential equation

$$dD_t = \mathrm{drift}(\pi, \sigma^2)(D_t)\, dt + \sigma(D_t)\, dW_t \tag{6.2.8}$$

where $W$ is a standard Brownian motion in $\mathbb{R}^{n_x}$. The local volatility function is given by $\sigma^2(x, \ell) = \ell^2 a_0(x, \ell)$. The initial distribution is $D_0 \overset{\mathcal{D}}{\sim} \pi$.
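A minimal sketch of how a diffusion of the form (6.2.8) can be simulated and its invariance checked numerically. It uses an Euler-Maruyama discretisation, which is a standard scheme and not one prescribed by the source, and a hypothetical one-dimensional example with $\pi = N(0, 1)$ and $\sigma^2(x) = 1 + 0.5 \sin(x)$ standing in for $\ell^2 a_0(x, \ell)$. All function names and numerical parameters are illustrative assumptions.

```python
import numpy as np


def drift(grad_log_pi, sigma2, grad_sigma2):
    """Return the map x -> 0.5 * (sigma2(x) * grad_log_pi(x) + grad_sigma2(x))."""
    def b(x):
        return 0.5 * (sigma2(x) * grad_log_pi(x) + grad_sigma2(x))
    return b


def euler_maruyama(b, sigma, x0, dt, n_steps, rng):
    """Euler-Maruyama scheme for dD = b(D) dt + sigma(D) dW (1-d, many paths)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x + b(x) * dt + sigma(x) * np.sqrt(dt) * noise
    return x


# Hypothetical toy example: pi = N(0, 1) and sigma^2(x) = 1 + 0.5 sin(x).
def grad_log_pi(x):
    return -x


def sigma2(x):
    return 1.0 + 0.5 * np.sin(x)


def grad_sigma2(x):
    return 0.5 * np.cos(x)


def sigma(x):
    return np.sqrt(sigma2(x))


def simulate(n_paths=20_000, dt=0.01, n_steps=500, seed=0):
    """Start n_paths independent paths from pi and run them up to time dt * n_steps."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n_paths)  # D_0 ~ pi, as in the theorem
    b = drift(grad_log_pi, sigma2, grad_sigma2)
    return euler_maruyama(b, sigma, x0, dt, n_steps, rng)


if __name__ == "__main__":
    x_T = simulate()
    print("mean:", x_T.mean(), "variance:", x_T.var())
```

If the drift is chosen as above, the empirical mean and variance at the final time should stay close to those of $\pi$, up to Monte Carlo and discretisation error. With a constant $\sigma^2$ the same `drift` helper reduces to the Langevin drift $\frac{\sigma^2}{2} \nabla \log \pi$.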

The diffusion (6.2.8) is ergodic and reversible with respect to $\pi$. The diffusive time scale $T(\varepsilon) = \varepsilon^{-2}$ shows that the algorithmic complexity of the RWM grows as $O(\varepsilon^{-2})$ as the thickness $\varepsilon$ goes to zero. The limiting rescaled ESJD can directly be read from the volatility coefficient of the limiting diffusion (6.2.8),

$$\lim_{\varepsilon \to 0} \frac{\mathrm{ESJD}(\varepsilon, \ell)}{\varepsilon^2} = \mathbb{E}\big[ \sigma^2(X, \ell) \big]$$

where $X \overset{\mathcal{D}}{\sim} \pi$ and $\sigma^2(x, \ell) = \ell^2 a_0(x, \ell)$. In the case where the function $(x, y) \mapsto B(x, y)$ does not depend on the $x$-coordinate, the limiting acceptance probability $a_0$ does not depend on the local position $x$ anymore, $a_0(x, \ell) = a_0(\ell)$. In this case the optimal value for the parameter $\ell$ is given by $\ell^* = \operatorname{argmax}_{\ell}\, \ell^2 a_0(\ell)$, which leads to a 0.234-type optimality result as described in [RGG97]. In general, the optimisation of the limiting ESJD is difficult. The Dirichlet form [Fuk80] associated to the diffusion (6.2.8) reads

$$\mathcal{D}(\varphi) := \frac{1}{2} \int_{\mathbb{R}^{n_x}} \|\nabla \varphi(x)\|^2\, \sigma^2(x, \ell)\, \pi(dx).$$

The spectral gap of the diffusion (6.2.8) equals $\lambda = \inf_{\varphi} \mathcal{D}(\varphi)$, where the infimum runs over the class of smooth test functions satisfying $\pi(\varphi) = 0$ and $\pi(\varphi^2) = 1$. The maximisation of the ESJD is equivalent to maximising, over the tuning parameter, the Dirichlet form evaluated on affine functions only. In general, the maximisation of the spectral gap and the maximisation of the ESJD thus lead to different answers.
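When $a_0$ does not depend on $x$, the optimisation $\ell^* = \operatorname{argmax}_\ell\, \ell^2 a_0(\ell)$ mentioned above is a one-dimensional problem that can be approximated by a grid search. The sketch below does this under assumptions that are not part of the source: $F(r) = \min(1, e^r)$, $Z_y$ standard Gaussian, and $e^{B(u)}$ equal to the standard Gaussian density in dimension $n_y = 3$. The grid, the sample size and the function names are hypothetical.

```python
import numpy as np


def a0_curve(ells, n_y=3, n_samples=100_000, seed=0):
    """Estimate a0(ell) on a grid, reusing the same random draws for every ell.

    Toy assumptions: F(r) = min(1, exp(r)), and exp(B(u)) is the N(0, I) density.
    """
    rng = np.random.default_rng(seed)
    u = rng.standard_normal((n_samples, n_y))
    z = rng.standard_normal((n_samples, n_y))
    uz = np.sum(u * z, axis=1)  # inner products <u, z>
    zz = np.sum(z * z, axis=1)  # squared norms |z|^2
    values = []
    for ell in ells:
        # B(u + ell z) - B(u) = -ell <u, z> - 0.5 ell^2 |z|^2 for Gaussian B.
        delta_b = -ell * uz - 0.5 * ell ** 2 * zz
        values.append(np.exp(np.minimum(delta_b, 0.0)).mean())
    return np.array(values)


def optimal_ell(ells, **kwargs):
    """Grid maximiser of ell^2 * a0(ell), with the acceptance rate at the maximiser."""
    a0 = a0_curve(ells, **kwargs)
    objective = ells ** 2 * a0  # limiting volatility sigma^2(ell)
    k = int(np.argmax(objective))
    return float(ells[k]), float(a0[k]), float(objective[k])


if __name__ == "__main__":
    grid = np.linspace(0.1, 5.0, 50)
    ell_star, a_star, sigma2_star = optimal_ell(grid)
    print("ell* ~", ell_star, " a0(ell*) ~", a_star, " sigma^2 ~", sigma2_star)
```

Reusing the same random draws across the grid keeps the estimated curve smooth in $\ell$, so the location of the maximum is less sensitive to Monte Carlo noise than it would be with independent draws per grid point.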

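In an attempt to reconcile the different notions of optimality, we adopt slightly more general proposals. The variance of the proposals is allowed to depend on the current position; the tuning parameter $\ell = \ell(x) > 0$ is now allowed to depend on the $x$-coordinate,

$$\begin{pmatrix} X_\varepsilon' \\ Y_\varepsilon' \end{pmatrix} = \begin{pmatrix} X_\varepsilon \\ Y_\varepsilon \end{pmatrix} + \ell(x)\, \varepsilon \begin{pmatrix} Z_x \\ Z_y \end{pmatrix}. \tag{6.2.9}$$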

In other words, when the RWM Markov chain stands at $(x, y) \in \mathbb{R}^n$, a Gaussian jump of size $\ell(x)\, \varepsilon$ is proposed. We now state assumptions on the function $x \mapsto \ell(x)$ that allow diffusion limit results to hold.

Assumptions 6.2.4. The function $\ell$ is positive and bounded away from zero and infinity. The first two derivatives of $\ell$ are also bounded.
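The proposal mechanism (6.2.9) can be sketched as a single Metropolis-Hastings step. Because the jump size depends on the current $x$, the proposal is no longer symmetric, so the sketch includes the usual Hastings correction to make the toy chain target `log_target`; how the source's acceptance rule handles this is not restated in this section, so this is an assumption. The target (a thin Gaussian in $y$), the function `ell` and all names are hypothetical choices made only for illustration.

```python
import numpy as np


def log_hastings_correction(sq_jump, scale, scale_new, n):
    """log q(new -> old) - log q(old -> new) for isotropic Gaussian proposals."""
    return (n * (np.log(scale) - np.log(scale_new))
            - 0.5 * sq_jump * (scale_new ** -2 - scale ** -2))


def rwm_step(x, y, log_target, ell, eps, rng):
    """One step with proposal (x, y) + ell(x) * eps * (Z_x, Z_y), as in (6.2.9)."""
    n = x.size + y.size
    scale = ell(x) * eps
    x_new = x + scale * rng.standard_normal(x.shape)
    y_new = y + scale * rng.standard_normal(y.shape)
    sq_jump = np.sum((x_new - x) ** 2) + np.sum((y_new - y) ** 2)
    scale_new = ell(x_new) * eps  # scale that the reverse move would use
    log_alpha = (log_target(x_new, y_new) - log_target(x, y)
                 + log_hastings_correction(sq_jump, scale, scale_new, n))
    if np.log(rng.uniform()) < log_alpha:
        return x_new, y_new, True
    return x, y, False


def run_chain(n_steps=2000, eps=0.1, seed=0):
    """Run the toy chain and return the final state and the acceptance rate."""
    rng = np.random.default_rng(seed)

    def log_target(x, y):
        # Hypothetical target: x ~ N(0, 1), y ~ N(0, eps^2 I), unnormalised.
        return -0.5 * np.sum(x ** 2) - 0.5 * np.sum((y / eps) ** 2)

    def ell(x):
        # Positive, bounded away from zero and infinity, with bounded derivatives.
        return 1.0 + 0.5 / (1.0 + np.sum(x ** 2))

    x, y = np.zeros(1), np.zeros(3)
    n_accepted = 0
    for _ in range(n_steps):
        x, y, accepted = rwm_step(x, y, log_target, ell, eps, rng)
        n_accepted += accepted
    return x, y, n_accepted / n_steps


if __name__ == "__main__":
    x, y, rate = run_chain()
    print("final x:", x, "final y:", y, "acceptance rate:", rate)
```

When `ell` is constant the correction term vanishes and the step reduces to a standard symmetric random walk Metropolis step.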

Under the regularity Assumptions 6.2.4 on the function $x \mapsto \ell(x)$, the analogue of Theorem 6.2.3 holds. We choose to work in this limited setup so that the proof of the next theorem is a straightforward adaptation of that of Theorem 6.2.3. The accelerated version (6.2.6) of the $x$-coordinate process converges to a diffusion process.

Theorem 6.2.5. Let $T > 0$ be a fixed finite time horizon. Assume that Assumptions 6.2.2 and 6.2.4 hold and that the RWM algorithm is started in stationarity. As $\varepsilon \to 0$, the sequence of processes $\{\widetilde{X}_{\varepsilon,t}\}_{t \in [0,T]}$ converges weakly in the Skorohod space $D([0, T], \mathbb{R}^{n_x})$ to the diffusion process $\{D_t\}_{t \in [0,T]}$ specified as the solution of the stochastic differential equation

$$dD_t = \mathrm{drift}(\pi, \sigma^2)(D_t)\, dt + \sigma(D_t)\, dW_t \tag{6.2.10}$$

where $W$ is a standard Brownian motion in $\mathbb{R}^{n_x}$. The local volatility function is given by $\sigma^2(x) = \ell^2(x)\, a_0\big(x, \ell(x)\big)$. The initial distribution is $D_0 \overset{\mathcal{D}}{\sim} \pi$.

The only difference with Theorem 6.2.3 is the form of the volatility function $\sigma$. As before, the limiting diffusion (6.2.10) is reversible with respect to $\pi$ and the Dirichlet form reads

$$\mathcal{D}(\varphi) := \frac{1}{2} \int_{\mathbb{R}^{n_x}} \|\nabla \varphi(x)\|^2\, \ell^2(x)\, a_0\big(x, \ell(x)\big)\, \pi(dx).$$

Since the parameter $\ell = \ell(x)$ is a function of the $x$-coordinate, the optimal choice $\ell^*(x)$ for the tuning parameter $\ell$ is

$$\ell^*(x) := \operatorname{argmax}_{\ell > 0}\, \ell^2 a_0(x, \ell). \tag{6.2.11}$$

As described in [RR12], the choice (6.2.11) maximises the ESJD and the spectral gap of the limiting diffusion (6.2.10), and minimises the asymptotic variance of MCMC estimators.
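The pointwise rule (6.2.11) can be illustrated with a grid search carried out separately at each $x$. The sketch below assumes, purely for illustration, that $F(r) = \min(1, e^r)$, that $Z_y$ is standard Gaussian, and that $e^{B(x, \cdot)}$ is a centred Gaussian density whose standard deviation is given by a hypothetical function `width(x)`. None of these choices come from the source.

```python
import numpy as np


def width(x):
    """Hypothetical local width of the target in the u-direction at position x."""
    return 1.0 + x ** 2


def ell_star(x, ells, n_y=3, n_samples=50_000, seed=0):
    """Grid approximation of argmax over ell of ell^2 * a0(x, ell), as in (6.2.11)."""
    s = width(x)
    rng = np.random.default_rng(seed)
    u = s * rng.standard_normal((n_samples, n_y))  # U ~ exp(B(x, .)) = N(0, s^2 I)
    z = rng.standard_normal((n_samples, n_y))      # proposal noise Z_y
    uz = np.sum(u * z, axis=1)
    zz = np.sum(z * z, axis=1)
    best_ell, best_value = None, -np.inf
    for ell in ells:
        # B(x, u + ell z) - B(x, u) for B(x, u) = -|u|^2 / (2 s^2) + const(x).
        delta_b = -(ell * uz + 0.5 * ell ** 2 * zz) / s ** 2
        a0 = np.exp(np.minimum(delta_b, 0.0)).mean()
        value = ell ** 2 * a0
        if value > best_value:
            best_ell, best_value = float(ell), float(value)
    return best_ell


if __name__ == "__main__":
    grid = np.geomspace(0.1, 20.0, 80)
    for x in (0.0, 1.0):
        print("x =", x, " width =", width(x), " ell*(x) ~", ell_star(x, grid))
```

In this toy model $a_0(x, \ell)$ depends on $\ell$ only through the ratio $\ell / \mathrm{width}(x)$, so the computed $\ell^*(x)$ should scale roughly in proportion to the local width, up to the resolution of the grid.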

In document Scaling analysis of MCMC algorithms (Page 117-120)