Variational Inference and Gibbs Sampling - Probabilistic multiple kernel learning

This section compares the variational Bayes (VB) approximation with the full MCMC Gibbs sampling solution introduced in Chapter 3. The comparison covers three aspects: the agreement between the variational approximate posterior and the Gibbs sampling posterior, classification accuracy, and computational time. It is carried out on two artificial low-dimensional datasets, one linearly separable and one non-linearly separable, the latter introduced by Neal (1998).

The convergence of the VBpMKL approximation was determined by monitoring the lower bound: the approximation was considered converged when the bound increased by less than 0.1% or when the maximum number of variational iterations was reached. The burn-in period for the Gibbs sampler was set to 10% of the total of 100,000 samples. Finally, all CPU times reported in this study are for a 1.6 GHz Intel-based PC with 2 GB of RAM running unoptimised Matlab® code.
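The stopping rule and the burn-in described above can be sketched as follows. This is a minimal illustration, not the Matlab code used in the study: `update_step` is a hypothetical callable that performs one round of variational updates and returns the new lower-bound value, and measuring the 0.1% increase relative to the magnitude of the previous bound is an assumption.

```python
def run_until_converged(update_step, max_iters=100, rel_tol=1e-3):
    """Iterate until the lower bound gains less than rel_tol (0.1%) or max_iters is hit.

    update_step is a hypothetical callable: one variational update round
    that returns the current value of the lower bound.
    """
    previous = update_step()
    history = [previous]
    for _ in range(max_iters - 1):
        current = update_step()
        history.append(current)
        # Assumption: the increase is measured relative to |previous bound|.
        if current - previous < rel_tol * abs(previous):
            break
        previous = current
    return history


def discard_burn_in(samples, burn_in_fraction=0.1):
    """Drop the first burn_in_fraction of a Markov chain (10% in this section)."""
    n_burn = int(len(samples) * burn_in_fraction)
    return samples[n_burn:]
```

With a chain of 100,000 samples, `discard_burn_in` keeps the last 90,000 draws for posterior estimates.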

4.5.1 Synthetic Datasets

In order to illustrate the performance of the variational approximation against the full Gibbs sampling solution, we employ two low-dimensional datasets which enable us to visualise the decision boundaries and posterior distributions produced by either method. First we consider a linearly separable case in which we construct the dataset by fixing our regression coefficients $W \in \mathbb{R}^{D \times C}$, with $C = 3$ and $D = 3$, to known values and sampling two-dimensional covariates X plus a constant term. Because the true values of the regression coefficients are known, we can examine the accuracy of both the Gibbs posterior distribution and the approximate posterior distribution of the variational method. Figure 4.2 shows the dataset together with the optimal decision boundaries defined by the known regression coefficient values.

Figure 4.2: Linearly separable dataset with known regression coefficients defining the decision boundaries. $C_n$ denotes the members of class $n$ and $\mathrm{Dec}_{ij}$ is the decision boundary between classes $i$ and $j$.
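The construction of the linearly separable dataset can be sketched as below. Several details are assumptions made only for illustration: the coefficient values in `W` are chosen by the caller (the values used in the study are not given here), the constant term is placed first in each covariate vector, covariates are drawn uniformly from [-3, 3], and each point is labelled with the class whose linear response is largest.

```python
import random


def sample_linear_dataset(W, n_points, noise_std=0.0, seed=0):
    """Sample 2-D covariates plus a constant term and label them using known W.

    W is a D x C nested list (D = 3 rows: constant, x1, x2; C = 3 classes).
    The row order, covariate range and argmax labelling are assumptions.
    """
    rng = random.Random(seed)
    n_features, n_classes = len(W), len(W[0])
    X, t = [], []
    for _ in range(n_points):
        x = [1.0, rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)]
        # One noisy linear response per class; the label is the largest one.
        responses = [
            sum(x[d] * W[d][c] for d in range(n_features)) + rng.gauss(0.0, noise_std)
            for c in range(n_classes)
        ]
        X.append(x)
        t.append(max(range(n_classes), key=lambda c: responses[c]))
    return X, t
```

Because the generating coefficients are known, any posterior over W estimated from `(X, t)` can be checked directly against them.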

Figures 4.3 and 4.4 plot the posterior distributions of the slope and intercept of one decision boundary ($\mathrm{Dec}_{12}$), based on the Gibbs samples and on the approximate posterior of the regression coefficients W, respectively. As can be seen, the variational approximation agrees with the mass of the Gibbs posterior and successfully captures the predetermined regression coefficient values.
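One way to obtain slope and intercept samples from samples of W is sketched below. It assumes that the boundary between classes i and j is the line on which their two linear responses are equal, and that the rows of W are ordered (constant, x1, x2); both are assumptions of this sketch, and the function names are hypothetical.

```python
def boundary_slope_intercept(W, i, j):
    """Slope and intercept of the line where classes i and j have equal response.

    W is a 3 x C nested list with rows ordered (constant, x1, x2); this row
    order is an assumption of the sketch.
    """
    d0, d1, d2 = (W[row][i] - W[row][j] for row in range(3))
    if d2 == 0.0:
        raise ValueError("boundary is vertical; slope is undefined")
    # d0 + d1*x1 + d2*x2 = 0  =>  x2 = -(d1/d2)*x1 - d0/d2
    return -d1 / d2, -d0 / d2


def slope_intercept_samples(W_samples, i=0, j=1):
    """Map each posterior draw of W to a (slope, intercept) pair."""
    return [boundary_slope_intercept(W, i, j) for W in W_samples]
```

Applying `slope_intercept_samples` to Gibbs draws, or to draws from the variational posterior of W, gives the point clouds whose spread the two figures compare.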

However, the approximation is over-confident in its prediction and, as expected, produces a smaller covariance for the posterior distribution (de Freitas et al. 2001). Furthermore, the probability mass is concentrated in a very small area, owing to the nature of variational approximations and similar mean-field methods, which make extreme “judgements” because they do not explore the posterior space with Markov chains.

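\[
C_{\mathrm{Gibbs}} = \begin{bmatrix} 0.16 & 0.18 \\ 0.18 & 0.22 \end{bmatrix}, \qquad
C_{\mathrm{VB}} = \begin{bmatrix} 0.015 & 0.015 \\ 0.015 & 0.018 \end{bmatrix} \tag{4.38}
\]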
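The gap between the two covariance matrices in Equation (4.38) can be quantified with a short calculation. The sketch below uses only the numbers printed in the equation; reading the square root of the determinant ratio as a ratio of confidence-ellipse areas assumes both posteriors are approximately Gaussian.

```python
import math

# Values copied from Equation (4.38).
C_GIBBS = [[0.16, 0.18], [0.18, 0.22]]
C_VB = [[0.015, 0.015], [0.015, 0.018]]


def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def correlation(m):
    """Correlation coefficient implied by a 2 x 2 covariance matrix."""
    return m[0][1] / math.sqrt(m[0][0] * m[1][1])


# Ratio of the marginal variances (Gibbs over VB) for each parameter.
variance_ratios = [C_GIBBS[k][k] / C_VB[k][k] for k in range(2)]
# Ratio of confidence-ellipse areas, proportional to sqrt(det) under Gaussianity.
area_ratio = math.sqrt(det2(C_GIBBS) / det2(C_VB))
```

The marginal variances of the Gibbs posterior come out roughly 10.7 and 12.2 times larger than those of the variational posterior, while both matrices imply a strong positive correlation between the two parameters.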

Figure 4.3: Gibbs posterior distribution of the slope and intercept of a decision boundary ($\mathrm{Dec}_{12}$) for a Markov chain of 100,000 samples (axes: slope and intercept). The cross marks the original decision boundary employed to sample the dataset.

Figure 4.4: The variational approximate posterior distribution for the same case as above (axes: slope and intercept). Here 100,000 samples are drawn from the approximate posterior of the regression coefficients W in order to estimate the approximate posterior of the slope and intercept.

The second synthetic dataset we employ is a 4-dimensional, 3-class dataset $\{t, X\}$ with $N = 400$ samples, first described by Neal (1998). It defines the first class as points in an ellipse $\alpha > x_1^2 + x_2^2 > \beta$, the second class as points below a line $\alpha x_1 + \beta x_2 < \gamma$, and the third class as points surrounding these areas; see Figure 4.5.

The problem is tackled by introducing a second-order polynomial expansion of the original dataset, $F(x_n) = [1 \;\; x_{n1} \;\; x_{n2} \;\; x_{n1}^2 \;\; x_{n1}x_{n2} \;\; x_{n2}^2]$, while disregarding the uninformative dimensions $x_3, x_4$. Because this expansion avoids the need to embed the features into a high-dimensional Hilbert space induced by a kernel, there is now a 2-dimensional decision plane that can be visualised, with 6-dimensional regression coefficients $w_c$ per class. In Fig. 4.5 we plot the decision boundaries produced by the full Gibbs solution, averaging over the posterior parameters after 100,000 samples, and in Fig. 4.6 the corresponding decision boundaries from the variational approximation after a maximum of 100 iterations.

Figure 4.5: Decision boundaries from the Gibbs sampling solution on Neal’s dataset.
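The second-order expansion $F(x_n)$ used for this dataset can be sketched as follows. The expansion itself follows the definition in the text; `predict_class`, which assigns the class with the largest linear response on the expanded features, is a hypothetical helper added only to show how the 6-dimensional coefficients $w_c$ would be used.

```python
def second_order_expansion(x):
    """Map a 4-D point to [1, x1, x2, x1^2, x1*x2, x2^2].

    The uninformative dimensions x3 and x4 are ignored.
    """
    x1, x2 = x[0], x[1]
    return [1.0, x1, x2, x1 * x1, x1 * x2, x2 * x2]


def predict_class(W, x):
    """Return the index of the class with the largest response w_c . F(x).

    W is a list of C coefficient vectors, each of length 6 (hypothetical
    layout; point estimates such as posterior means are assumed).
    """
    features = second_order_expansion(x)
    scores = [sum(w_k * f_k for w_k, f_k in zip(w_c, features)) for w_c in W]
    return max(range(len(scores)), key=scores.__getitem__)
```

Since the expanded features depend only on $x_1$ and $x_2$, the resulting quadratic decision boundaries can be drawn directly in the two-dimensional plane.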

Figure 4.6: Decision boundaries from the variational approximation on Neal’s dataset.

As can be seen, the variational approximation and the MCMC solution produce similar decision boundaries, leading to good classification performance with 2% error for both methods. However, the Gibbs sampler typically produces tighter boundaries, because the Markov chain explores the parameter posterior space more efficiently than the VB approximation.

The corresponding CPU times are given in Table 4.1.

Method    CPU time (s)
Gibbs     41,720
VB        120.3

Table 4.1: CPU time (sec) comparison for 100,000 Gibbs samples versus a maximum of 100 variational iterations. Note that the number of variational iterations needed for the lower bound to converge is typically less than 100.
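The figures in Table 4.1 can be turned into a speed-up factor with a short calculation. The sketch below uses only the two reported times and the chain length; the variable names are illustrative.

```python
GIBBS_SECONDS = 41720.0   # 100,000 Gibbs samples (Table 4.1)
VB_SECONDS = 120.3        # at most 100 variational iterations (Table 4.1)
N_GIBBS_SAMPLES = 100_000

# How many times faster the variational approximation ran overall.
speedup = GIBBS_SECONDS / VB_SECONDS
# Average cost of a single Gibbs sample, in seconds.
gibbs_seconds_per_sample = GIBBS_SECONDS / N_GIBBS_SAMPLES

print(f"VB speed-up over Gibbs: {speedup:.1f}x")
print(f"Gibbs cost per sample: {gibbs_seconds_per_sample:.4f} s")
```

On these numbers the variational approximation is roughly 347 times faster than the Gibbs sampler, at about 0.42 s per Gibbs sample.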
