
4.7 Sampling from Probability Distributions

4.7.2 Markov Chain Monte Carlo

4.7.2.3 Auxiliary Variable Methods

The idea of introducing auxiliary variables in Markov chain Monte Carlo (MCMC) sampling arose in statistical physics (Swendsen & Wang, 1987), was generalized by Edwards & Sokal (1988), and was brought into the mainstream statistical literature by Besag & Green (1993). Auxiliary variable techniques exploit the general principle that an apparently complicated n-dimensional problem often becomes more tractable when it is embedded in a higher-dimensional space. Once the higher-dimensional problem is solved, the solution is projected back onto the original state space; in sampling terms, this projection amounts to discarding the auxiliary variable(s) and keeping a sample from the target distribution. Formally, to sample from $P(x)$, one specifies a conditional distribution $P(u \mid x)$ and forms the joint $P(x, u) = P(x)\,P(u \mid x)$, whose marginal over $u$ is $P(x)$. A Markov chain is then constructed on $X \times U$ by alternately updating $u$ and $x$ via Gibbs sampling or some other method that leaves $P(x, u)$, and hence $P(x)$, invariant. Given samples $(x^{(i)}, u^{(i)})$ from $P(x, u)$, one simply discards the $u^{(i)}$ and keeps the $x^{(i)}$. Introducing auxiliary variables can allow us to construct Markov chains that mix faster and are easier to simulate than standard single-site algorithms. Here we discuss two well-known auxiliary variable methods, Hamiltonian Monte Carlo (HMC) and Annealed Importance Sampling (AIS), which are used for sampling from the RBM in our work.
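The alternating update described above can be sketched on a toy problem. The example below is an illustrative assumption, not a method used in this work: the target $P(x)$ is taken to be an equal-weight mixture of two unit-variance Gaussians, the auxiliary variable $u$ is the (hypothetical) component label, and the names `sample_u_given_x` and `sample_x_given_u` are invented for the sketch. Gibbs sampling alternates between the two conditionals, and the $u$ samples are simply discarded.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([-1.0, 1.0])  # assumed component means (equal weights, unit variance)

def sample_u_given_x(x):
    """Draw the auxiliary label u from P(u | x), proportional to exp(-(x - mean_u)^2 / 2)."""
    log_w = -0.5 * (x - means) ** 2
    w = np.exp(log_w - log_w.max())  # subtract the max for numerical stability
    return rng.choice(2, p=w / w.sum())

def sample_x_given_u(u):
    """Draw x from P(x | u), a unit-variance Gaussian centred on component u."""
    return rng.normal(means[u], 1.0)

x = 0.0
xs = []
for _ in range(20000):
    u = sample_u_given_x(x)  # update the auxiliary variable
    x = sample_x_given_u(u)  # update the variable of interest
    xs.append(x)             # keep x, discard u
xs = np.array(xs)

print("sample mean:", xs.mean())     # the mixture has mean 0
print("sample variance:", xs.var())  # the mixture has variance 1 + 1 = 2
```

Both conditionals here are easy to sample even though the marginal is a mixture, which is the sense in which the enlarged state space makes the problem simpler.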

Hamiltonian Monte Carlo

HMC is an MCMC algorithm that avoids random-walk behavior by simulating a physical system governed by Hamiltonian dynamics, potentially avoiding tricky conditional distributions in the process. In this physical picture, a particle moves through a high-dimensional landscape subject to potential and kinetic energy. The particle is characterised by a position vector (state) $x \in \mathbb{R}^D$ and a velocity vector $v \in \mathbb{R}^D$. In non-physical MCMC applications of Hamiltonian dynamics, the position corresponds to the variables of interest, whereas $v$ is an auxiliary variable that is introduced artificially. The combined state of the particle is denoted $\chi = (x, v)$.

The Hamiltonian is defined as the sum of the potential energy $E(x)$ (the same energy function defined by the energy-based models, i.e. $E(x) = -\log P(x) - \log Z$) and the kinetic energy $K(v)$:

$$H(x, v) = E(x) + K(v) = E(x) + \frac{1}{2}\sum_{i} v_i^2. \tag{4.48}$$

Instead of sampling P (x) directly, HMC operates by sampling from the canonical distribution:

$$
\begin{aligned}
P(x, v) &= \frac{1}{Z}\exp(-H(x, v)) \\
&\propto \exp(-H(x, v)) \\
&= \exp(-E(x) - K(v)) \\
&= \exp(-E(x))\,\exp(-K(v)) \\
&\propto P(x)\,P(v).
\end{aligned}
$$

Because $x$ and $v$ are independent under the canonical distribution, marginalizing over $v$ is trivial and recovers the original distribution of interest, $P(x)$. The state $x$ and velocity $v$ evolve such that $H(x, v)$ remains constant throughout the simulation. Their evolution is governed by Hamilton's equations:

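$$\frac{dx_i}{dt} = \frac{\partial H}{\partial v_i} = v_i, \qquad \frac{dv_i}{dt} = -\frac{\partial H}{\partial x_i} = -\frac{\partial E}{\partial x_i}.$$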

As shown in (Neal, 1996), the above transformation preserves volume and is reversible; these dynamics can therefore be used as the transition operator of a Markov chain that leaves $P(x, v)$ invariant.
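As a worked illustration (an assumed toy case, not one of the models used in this work), consider a one-dimensional standard Gaussian target, for which $E(x) = x^2/2$ up to an additive constant. Hamilton's equations then describe a harmonic oscillator with a closed-form solution:

$$
\begin{aligned}
H(x, v) &= \tfrac{1}{2}x^2 + \tfrac{1}{2}v^2, \qquad \frac{dx}{dt} = v, \quad \frac{dv}{dt} = -x, \\
x(t) &= x(0)\cos t + v(0)\sin t, \\
v(t) &= -x(0)\sin t + v(0)\cos t.
\end{aligned}
$$

The map $(x(0), v(0)) \mapsto (x(t), v(t))$ is a rotation of the phase plane, so it has unit Jacobian determinant (volume preservation) and keeps $x^2 + v^2$, and hence $H$, constant. Negating $v(t)$ and evolving for a further time $t$ returns the particle to $x(0)$ with velocity $-v(0)$, which is the reversibility property in this simple case.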

Discretizing Hamilton's Equations: The Leapfrog Method

For computer implementation, Hamilton's equations must be approximated by discretizing time with some small step size $\varepsilon$. Starting from the state at time zero, we iteratively compute (approximately) the state at times $\varepsilon, 2\varepsilon, 3\varepsilon$, etc. There are several ways to do this (for example, Euler's method); however, for the Markov chain to leave the target distribution invariant, care must be taken to preserve volume conservation and time reversibility. The leapfrog algorithm maintains these properties and operates in three steps: it first performs a half-step update of the velocity to time $t + \varepsilon/2$, which is then used to compute $x(t + \varepsilon)$ and finally $v(t + \varepsilon)$:

$$v_i(t + \varepsilon/2) = v_i(t) - \frac{\varepsilon}{2}\,\frac{\partial E}{\partial x_i}(x(t)) \tag{4.49}$$

$$x_i(t + \varepsilon) = x_i(t) + \varepsilon\, v_i(t + \varepsilon/2) \tag{4.50}$$

$$v_i(t + \varepsilon) = v_i(t + \varepsilon/2) - \frac{\varepsilon}{2}\,\frac{\partial E}{\partial x_i}(x(t + \varepsilon)) \tag{4.51}$$

The leapfrog method can be run for $L$ steps to simulate the dynamics over $L \times \varepsilon$ units of time. This discretization has a number of properties that make it preferable to other approximation methods such as Euler's method; however, a discussion of these is beyond the scope of this thesis.
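The leapfrog updates (4.49) to (4.51) can be sketched in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `leapfrog`, the helper `hamiltonian`, and the standard Gaussian energy $E(x) = \frac{1}{2}x^\top x$ are illustrative choices, not the implementation used in this work. The final lines check the two properties discussed above: the energy drifts only slightly, and negating the velocity retraces the trajectory.

```python
import numpy as np

def leapfrog(x, v, grad_E, eps, L):
    """Run L leapfrog steps of size eps from (x, v) and return the new (x, v)."""
    x = np.array(x, dtype=float)
    v = np.array(v, dtype=float)
    for _ in range(L):
        v = v - 0.5 * eps * grad_E(x)  # half step in velocity, Eq. (4.49)
        x = x + eps * v                # full step in position, Eq. (4.50)
        v = v - 0.5 * eps * grad_E(x)  # half step in velocity, Eq. (4.51)
    return x, v

# Assumed toy energy: standard Gaussian, E(x) = 0.5 * x.x, so grad E(x) = x.
def grad_E(x):
    return x

def hamiltonian(x, v):
    return 0.5 * np.dot(x, x) + 0.5 * np.dot(v, v)

x0 = np.array([1.0, -0.5])
v0 = np.array([0.3, 0.8])
x1, v1 = leapfrog(x0, v0, grad_E, eps=0.1, L=20)

# H is only approximately conserved for a finite step size.
energy_drift = abs(hamiltonian(x1, v1) - hamiltonian(x0, v0))

# Time reversibility: flip the velocity, integrate again, and recover the start.
x_back, v_back = leapfrog(x1, -v1, grad_E, eps=0.1, L=20)
print(energy_drift, np.allclose(x_back, x0), np.allclose(v_back, -v0))
```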

Accept / Reject Phase

In practice, using a finite step size $\varepsilon$ does not preserve $H(x, v)$ exactly and introduces bias into the simulation. HMC cancels this bias exactly by adding a Metropolis accept/reject stage after $L$ leapfrog steps. The new state $\chi' = (x', v')$ is accepted with probability $P_{\mathrm{acc}}(\chi, \chi')$, defined as:

$$P_{\mathrm{acc}}(\chi, \chi') = \min\Bigl(1,\ \exp\bigl(-\bigl(H(x', v') - H(x, v)\bigr)\bigr)\Bigr) \tag{4.52}$$
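The acceptance rule in (4.52) is a one-line computation. The sketch below assumes the two Hamiltonian values have already been evaluated; the function name `accept_prob` is hypothetical. Writing the rule as $\exp(\min(0, H_{\text{old}} - H_{\text{new}}))$ is algebraically the same as $\min(1, \exp(-\delta H))$ and avoids overflow when the proposal has much lower energy.

```python
import math

def accept_prob(H_old, H_new):
    """Metropolis acceptance probability min(1, exp(-(H_new - H_old)))."""
    return math.exp(min(0.0, H_old - H_new))

# A proposal with lower total energy is always accepted.
p_down = accept_prob(1.0, 0.5)                  # 1.0

# A proposal whose energy is higher by log(2) is accepted about half the time.
p_up = accept_prob(1.0, 1.0 + math.log(2.0))    # approximately 0.5
```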

In order to draw a new sample according to $P(x, v)$, we start with a random value of $x$ and draw a Gaussian random variable $v$. We then take $L$ leapfrog steps in $v$ and $x$. The values of $v$ and $x$ after the last step are the proposal candidates in the Metropolis-Hastings (MH) algorithm with target density $P(x, v)$. Marginal samples from $P(x)$ are obtained by simply ignoring $v$. Given $(x^{(i-1)}, v^{(i-1)})$, the algorithm proceeds as illustrated in Algorithm 6.

Algorithm 6 Hamiltonian Monte Carlo Algorithm

Initialize position $x_0$ and velocity $v_0$
Set step size $\varepsilon$
for $i = 1$ to $n_{\mathrm{samples}}$ do
    Draw $v \sim \mathcal{N}(0, I)$
    $(x^{(0)}, v^{(0)}) = (x_{i-1}, v)$
    % Perform $L$ leapfrog steps to obtain the new state $\chi' = (x', v')$
    for $j = 1$ to $L$ do
        $v^{(j-1/2)} = v^{(j-1)} - \frac{\varepsilon}{2}\nabla E(x^{(j-1)})$ % Half step in v (Equation 4.49)
        $x^{(j)} = x^{(j-1)} + \varepsilon\, v^{(j-1/2)}$ % Full step in x (Equation 4.50)
        $v^{(j)} = v^{(j-1/2)} - \frac{\varepsilon}{2}\nabla E(x^{(j)})$ % Second half step in v (Equation 4.51)
    end for
    $(x', v') = (x^{(L)}, v^{(L)})$
    Draw $\alpha \sim U[0, 1]$
    $\delta H = H(x', v') - H(x^{(0)}, v^{(0)})$ % Equation 4.48
    % Acceptance/rejection criterion in Equation 4.52
    if $\alpha < \min\{1, \exp(-\delta H)\}$ then
        $(x_i, v_i) = (x', v')$
    else
        $(x_i, v_i) = (x_{i-1}, v_{i-1})$
    end if
end for
Return $\{x_i, v_i\}_{i=0}^{n_{\mathrm{samples}}}$

The choice of the parameters $L$ and $\varepsilon$ poses simulation trade-offs. Large values of $\varepsilon$ result in low acceptance rates, while small values require many leapfrog steps (each involving an expensive computation of the gradient) to move between two nearby states. Choosing $L$ is equally problematic: we want it to be large so as to generate candidates far from the initial state, but this can result in many expensive computations. HMC therefore requires careful tuning of the proposal distribution. It can be more efficient in practice to allow a different step size $\varepsilon$ for each of the coordinates of $x$.
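Putting the pieces together, the procedure of Algorithm 6 can be sketched as a short NumPy routine. This is a minimal sketch under stated assumptions rather than the implementation used in this work: the function name `hmc_sample`, its arguments, the standard Gaussian toy target, and the settings `eps=0.2` and `L=10` are all illustrative. Only the positions are stored, since the velocities are discarded anyway.

```python
import numpy as np

def hmc_sample(E, grad_E, x_init, n_samples, eps, L, rng):
    """Draw n_samples positions with HMC; returns (samples, acceptance rate)."""
    x = np.array(x_init, dtype=float)
    samples = np.empty((n_samples, x.size))
    n_accepted = 0
    for i in range(n_samples):
        v = rng.standard_normal(x.size)        # fresh velocity, v ~ N(0, I)
        H_old = E(x) + 0.5 * v @ v             # H(x, v), Eq. (4.48)
        x_new, v_new = x.copy(), v.copy()
        for _ in range(L):                     # L leapfrog steps
            v_new = v_new - 0.5 * eps * grad_E(x_new)  # Eq. (4.49)
            x_new = x_new + eps * v_new                # Eq. (4.50)
            v_new = v_new - 0.5 * eps * grad_E(x_new)  # Eq. (4.51)
        delta_H = E(x_new) + 0.5 * v_new @ v_new - H_old
        # Metropolis test, Eq. (4.52): accept with probability min(1, exp(-delta_H)).
        if rng.uniform() < np.exp(min(0.0, -delta_H)):
            x = x_new
            n_accepted += 1
        samples[i] = x                         # on rejection the old x is repeated
    return samples, n_accepted / n_samples

# Assumed toy target: 2-D standard Gaussian, E(x) = 0.5 * x.x.
rng = np.random.default_rng(0)
samples, acc_rate = hmc_sample(
    E=lambda x: 0.5 * x @ x,
    grad_E=lambda x: x,
    x_init=np.zeros(2),
    n_samples=2000,
    eps=0.2,
    L=10,
    rng=rng,
)
print("acceptance rate:", acc_rate)
print("sample mean:", samples.mean(axis=0))
print("sample variance:", samples.var(axis=0))
```

Rerunning the sketch with a much larger `eps` should lower the acceptance rate, while a much smaller `eps` with the same `L` should produce strongly correlated samples, which is the trade-off described above.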