Markov Chain Monte Carlo - Probabilistic multiple kernel learning

The first approximate inference scheme reviewed is the sampling approaches of Markov chain Monte Carlo (MCMC) which becomes exact in the limit of infinite samples. For an excellent practical introduction the reader is referred to Gelman et al. (2004). The intuition behind MCMC is to address our inability of obtaining closed form posterior distributions by instead drawing samples from them.

In most cases we are actually interested in calculating expectations with re- spect to the posterior distribution, such as the class predictions in classification.

Hence, assuming we can draw samples from the (joint) posterior of parameters12 θ, these expectations can be approximated via the Monte Carlo estimate:

E{f |t, X} = Z f (θ)p(θ|t, X)dθ ≈ ef = 1 L L X l=1 f (θl) (2.34)

One important observation is that the variance of the Monte Carlo estimate is given by:

var{ ef } = 1

LE(f − E{f})

(2.35) and hence the accuracy of the estimate is independent of the model’s dimension- ality.

The major hurdle of sampling from the posterior distribution has not been addressed so far and in the next section the four sampling approaches that will be employed in this thesis are reviewed.

2.5.1 Importance Sampling

One of the most classical, and straightforward to implement, Monte Carlo esti- mators for a function f (θ) with θ distributed as p(θ) is given by the importance sampling approach that utilises an easy-to-sample from distribution q(θ) in the following way: Ep(θ){f (θ)} = Z f (θ)p(θ)dθ = Z f (θ)p(θ) q(θ)q(θ)dθ = Eq(θ){w (θ)f (θ)} (2.36) where w (θ) = p(θ)

q(θ) is the importance weight.

The intuition behind this approach is to sample from the importance distribution q(θ) which can be conveniently chosen as long as it offers good support for the distribution of interest, and then to weight each sample with the associated importance weight resulting in the following estimator:

12_{In order to express any model parameter and not only regression coefficients w, we denote}

parameters with the generic notation θ. We assume the classification setting as an example scenario with the parameter posterior of interest p(θ|t, X).

e f = 1 L L X l=1 w (θl)f (θl) (2.37)

In the (common) case (Andrieu 2003) where the distribution of interest p(θ) is only available in its unnormalized form p∗(θ) then the importance weights are

modified to: w (θl) = p∗ θl q (θl₎ N X j=1 p∗(θj) q (θj₎ (2.38)

Finally, the main dangers of importance sampling reside in the choice of the importance distribution. Ideally it should offer support so that the target distribution is efficiently explored and it should satisfy the condition that p(θ) > 0 ⇒ q(θ) > 0. The advantages of importance sampling are that it’s easy to implement, it is parallelisable, and that it can be extended to sequential inference. Due to the latter property it is the cornerstone of sequential Monte Carlo (particle filters) techniques (Doucet et al. 2000) and the main approach in dealing with covariate shift (Qui˜nonero-Candela et al. 2009) where the i.i.d assumption is no longer valid.

In the next sections we introduce the Markov chain Monte Carlo techniques that construct an ergodic Markov chain that converges to a stationary distribution that is the target distribution. The following definitions introduce the basic concepts:

Definition 2.3: [First order Markov chain]

A first order Markov chain is a sequence of random variables θ1_{, θ}2_{, ..., θ}n _{that for}

any t ∈ {1, . . . , n} the distribution of θt _{given all previous values of θ is dependent}

only on the previous value θt−1_:

p(θt|θt−1_{, θ}t−2_{, . . . , θ}1_{) = p(θ}t_|θt−1₎

Definition 2.4: [Ergodicity]

A Markov chain converges to a unique stationary distribution if it is irreducible, aperiodic and not transient. Such a Markov chain is called ergodic.

Where aperiodicity and non-transiency hold for a random walk on any proper distribution (Gelman et al. 2004) and irreducibility dictates that there is a pos- itive probability of reaching any state from any other state, in other words the Markov chain can reach all states from any state within a finite sequence of steps.

2.5.2 Metropolis Sampling

The first MCMC method considered is the Metropolis algorithm (Metropolis et al. 1953) which employs a symmetric proposal distribution (also known as transition or jump distribution) q(θt_|θt−1_{) = q(θ}t−1_|θt_{) and accepts or rejects}

generated samples from that distribution based on the following criterion known as the acceptance ratio:

R = min 1, p∗(θ t_{|t, X)} p∗(θt−1|t, X) (2.39) where p∗(θ|t, X) denotes the unnormalized parameter posterior which is equal

to the product of the likelihood with the prior over the parameters.

The procedure is to draw a random number from a uniform distribution on the unit interval and if the number is smaller or equal to mathcalR the proposed sample is accepted, else rejected. Hence the new state is given by:

θt= (

θt _{with probability R}

θt−1 _otherwise (2.40)

2.5.3 Metropolis-Hastings Sampling

The straightforward generalisation of the Metropolis scheme to handle asymmetric proposal distributions is known as the Metropolis-Hastings (MH) method. The only significant difference is that the distribution q(θ) is asymmetric: q(θt_|θt−1_{) 6=}

q(θt−1|θt_{) and hence it is included in the acceptance ratio as:}

R = min 1, p∗(θ t_{|t, X)q(θ}t−1_|θt₎ p∗(θt−1|t, X)q(θt|θt−1) (2.41) Both the Metropolis and the Metropolis-Hastings MCMC methods have been proved (Hastings 1970) to converge to a stationary distribution that is the target distribution (p(θ|t, X) here).

The main drawback of Metropolis based MCMC methods is the need to tune the proposal distribution in order to retain an acceptance ratio between 15 − 40% (depending on the nature of the parameter sampling scheme) which is the recommended level to efficiently reach convergence (Gelman et al. 2004). Adaptive proposal distributions are usually employed that take into account the covariance structure of the posterior through initial exploratory samples that are later discarded (Burn-in period). In general, the engineering requirements of such methods make them less practical although more efficient sampling schemes based on gradient information through the Fisher information matrix are a major research topic (Girolami et al. 2009) and a promising direction.

2.5.4 Gibbs Sampling

The final MCMC approach reviewed is the Gibbs sampler (Geman and Geman 1984, Tanner and Wong 1987) also known as alternating conditional sampling (Andrieu 2003, Gelman et al. 2004) which can be seen as a special case of the Metropolis-Hastings method. The intuition behind it is to decompose the (un- obtainable in closed form) posterior distribution of interest into conditional posterior distributions that are easy to sample from. Consider a parameter vector θ and the sought after posterior distribution p(θ|t, X). If we decompose the joint posterior to conditional posterior distributions of the form:

p(θt_i|θt−1_−i , t, X) (2.42)

then iteratively sampling from these conditional distributions leads to sampling from the target joint posterior distribution. The notation θt−1_−i denotes all the elements of θ except the ith _{one, from the current (t − 1) sample.}

This principle can be extended to sampling block variables where the joint posterior can be decomposed to blocks of parameter sets (i.e. regression coefficients w and scales α in GLMs) whose conditional posterior distributions are easy to sample from. Such a block-wise Gibbs sampling approach will be employed in this thesis.

The advantage of Gibbs sampling is that no proposal distribution is necessary. It can be seen as a sub-case of MH sampling with an acceptance ratio equal to one, see e.g. Gelman et al. (2004) for proof, and this alleviates the need for tuning acceptance ratios and adapting proposal distributions for efficient exploration of

the stationary distribution. Further advantages of Gibbs sampling are discussed in Chapter 3. The main drawback is that it can lead to correlated posterior samples due to the coupling between the conditional posterior distributions.

Finally all of the MCMC methodology is computationally demanding due to the inherent sampling nature of the methods which might require anything from a few thousands to hundreds of thousands samples for convergence to the stationary distribution. A very important aspect for these methods is assessing and monitoring convergence which is described in detail in Chapter 3 together with the appropriate measures that are typically (Gelman et al. 2004) employed.

In document Probabilistic multiple kernel learning (Page 42-47)