3.2 Approximations
3.2.2 Variational Approximation
There are a number of methods that have used variational inference in the context of Gaussian process inference for non-Gaussian likelihoods. This section will provide a recap of the method proposed by Opper and Archambeau (2009). Chapter 4 will build upon this method alongside the work of Lazaro-Gredilla and Titsias (2011) and Hensman et al. (2015b).
Readers unfamiliar with variational inference in general will find it useful to recap its key components; see Blei et al. (2016) for an excellent modern review.
Variational inference attempts to find an approximate distribution, q(f|θV), that closely matches a posterior distribution, p(f|y). The approximate distribution is conditioned on a set of variational parameters, θV, that only affect the quality of the approximation, not the distribution being approximated. It must belong to a family of tractable densities, whereas in general the true posterior will not. The measure of closeness is the Kullback-Leibler (KL) divergence (Kullback, 1959; Kullback and Leibler, 1951). For continuous distributions, the KL divergence between two distributions, q(x) and p(x), is written
\[
\mathrm{KL}(q(x) \,\|\, p(x)) = \int q(x) \ln \frac{q(x)}{p(x)} \, dx.
\]
The KL divergence is an asymmetric measure: in general, KL(q(x) ∥ p(x)) ≠ KL(p(x) ∥ q(x)). It is also always non-negative, KL(q(x) ∥ p(x)) ≥ 0. To make an approximation to the posterior for Gaussian process regression we can minimise a KL divergence between the approximate posterior, q(f|θV), and the true posterior, p(f|y).
EP (Minka, 2001) uses the reverse KL divergence at a local level to each random variable f. Taking the expectation under the approximating distribution, q(f|θV), as in KL(q(f|θV) ∥ p(f|y)), has the effect of encouraging the approximate distribution to have low density where the true distribution has low density. The reversed KL divergence, KL(p(f|y) ∥ q(f|θV)), whereby the expectation is taken under the true distribution, p(f|y), has the effect of encouraging the approximating distribution to have high density where the true distribution has high density (Lawrence, 2000). Typically there will be a trade-off between capturing density where the true density is high and assigning low density where the true density is low. This effect is clearly visible when comparing the approximations given by variational inference (Figure 3.6d) and EP (Figure 3.6b).
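The asymmetry described above can be illustrated numerically. The following is a minimal sketch, not taken from the cited works: the bimodal target `p`, the two candidate Gaussians `q_mode` and `q_broad`, and the grid settings are all hypothetical choices made for illustration, and the integrals are approximated by a simple Riemann sum.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def kl_on_grid(a, b, dx):
    """Riemann-sum approximation of KL(a || b) = integral of a ln(a / b)."""
    return float(np.sum(a * (np.log(a) - np.log(b))) * dx)


# Integration grid (assumed wide enough to hold almost all of the mass).
x = np.linspace(-20.0, 20.0, 40001)
dx = x[1] - x[0]

# Hypothetical "true" density: an equal mixture of two unit-variance modes.
p = 0.5 * gaussian_pdf(x, -3.0, 1.0) + 0.5 * gaussian_pdf(x, 3.0, 1.0)

# Two candidate approximations.
q_mode = gaussian_pdf(x, 3.0, 1.0)             # sits on a single mode
q_broad = gaussian_pdf(x, 0.0, np.sqrt(10.0))  # matches the mean and variance of p

# Expectation under q, as minimised by variational inference.
kl_q_p = {"mode": kl_on_grid(q_mode, p, dx), "broad": kl_on_grid(q_broad, p, dx)}
# Expectation under p, the reversed divergence.
kl_p_q = {"mode": kl_on_grid(p, q_mode, dx), "broad": kl_on_grid(p, q_broad, dx)}

print("KL(q || p):", kl_q_p)
print("KL(p || q):", kl_p_q)
```

Under these assumptions KL(q ∥ p) is smaller for the single-mode candidate, which places little mass where p is low, whereas KL(p ∥ q) is smaller for the broad candidate, which covers both regions where p is high.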
Opper and Archambeau (2009) introduced an approach to applying variational inference to Gaussian process regression. The basis of this variational approach has been expanded on in recent years (Hensman et al., 2015b; Khan et al., 2012; Lazaro-Gredilla and Titsias, 2011; Nguyen and Bonilla, 2014) and is used for the novel inference method put forward in Saul et al. (2016), which is covered in detail in Chapter 4. The method minimises KL(q(f|θV) ∥ p(f|y)) in order to approximate the posterior distribution p(f|y) = p(y, f)/p(y) = p(y|f)p(f)/p(y), by exploiting the following equality,
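\[
\begin{aligned}
\mathrm{KL}(q(f|\theta_V) \,\|\, p(f|y)) &= \int q(f|\theta_V) \log \frac{q(f|\theta_V)}{p(f|y)} \, df \\
&= \log p(y) + \int q(f|\theta_V) \log \frac{q(f|\theta_V)}{p(y, f)} \, df \\
\log p(y) &= -\int q(f|\theta_V) \log \frac{q(f|\theta_V)}{p(y, f)} \, df + \mathrm{KL}(q(f|\theta_V) \,\|\, p(f|y)) \\
&= \int q(f|\theta_V) \log p(y|f) \, df - \mathrm{KL}(q(f|\theta_V) \,\|\, p(f)) + \mathrm{KL}(q(f|\theta_V) \,\|\, p(f|y)).
\end{aligned}
\tag{3.12}
\]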
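Equation (3.12) can be checked numerically in a setting where every term has a closed form. The sketch below assumes a one-dimensional conjugate model that is not part of the original text: a prior f ~ N(0, k), a Gaussian likelihood y|f ~ N(f, s2), and an arbitrary Gaussian q(f) = N(m, v). The numerical values and all variable names are hypothetical.

```python
import numpy as np


def kl_gauss(m0, v0, m1, v1):
    """KL(N(m0, v0) || N(m1, v1)) for univariate Gaussians (v denotes variance)."""
    return 0.5 * (v0 / v1 + (m0 - m1) ** 2 / v1 - 1.0 + np.log(v1 / v0))


# Hypothetical toy model: prior variance k, noise variance s2, one observation y.
k, s2, y = 2.0, 0.5, 1.3
# An arbitrary (deliberately suboptimal) variational distribution q(f) = N(m, v).
m, v = 0.4, 0.7

# Exact log evidence: y ~ N(0, k + s2).
log_evidence = -0.5 * np.log(2.0 * np.pi * (k + s2)) - 0.5 * y ** 2 / (k + s2)

# Exact posterior p(f|y) = N(m_post, v_post).
v_post = 1.0 / (1.0 / k + 1.0 / s2)
m_post = v_post * y / s2

# Expected log-likelihood under q, available in closed form for a Gaussian likelihood.
expected_loglik = -0.5 * np.log(2.0 * np.pi * s2) - ((y - m) ** 2 + v) / (2.0 * s2)

# The two computable terms of Equation (3.12), and the remaining KL to the posterior.
elbo = expected_loglik - kl_gauss(m, v, 0.0, k)
gap = kl_gauss(m, v, m_post, v_post)

print(log_evidence, elbo + gap)
```

In this toy model the two printed values should agree to numerical precision, and the gap vanishes when q is set to the exact posterior.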
In practice, since the true posterior, p(f|y), will not usually belong to a family of tractable distributions whilst the approximating distribution, q(f|θV), must, the KL divergence KL(q(f|θV) ∥ p(f|y)) will not be computable. Using the fact that a KL divergence is non-negative, dropping this term from Equation (3.12) gives an inequality,
\[
\log p(y) \geq \int q(f|\theta_V) \log p(y|f) \, df - \mathrm{KL}(q(f|\theta_V) \,\|\, p(f)).
\]
More generally this can be written as,
\[
\log p(y) \geq \int q(f|\theta_V) \left[ \log p(f, y) - \log q(f|\theta_V) \right] df.
\]
In variational inference the right-hand side of this inequality is known as the evidence lower bound (ELBO), and is widely used (Hoffman et al., 2013; Nguyen and Bonilla, 2014; Ranganath et al., 2014) as it provides a lower bound on the true log evidence, log p(y). Maximising the ELBO is equivalent to minimising KL(q(f|θV) ∥ p(f|y)) (Bishop, 2006; Jordan et al., 1999; Ranganath et al., 2014), which makes the approximate posterior as similar as possible to the true posterior. The first term inside the expectation can be seen to encourage parameters of the variational distribution that give high density to configurations of the latent variables that also explain the observations, y. The second term encourages parameters that give rise to entropic variational distributions, such that the distribution spreads its mass across many configurations (Ranganath et al., 2014).
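The roles of the two terms can be made explicit by splitting the integrand of the bound. This is a standard rearrangement, and the symbol H[q] for the differential entropy of q is notation introduced here purely for illustration:

\[
\int q(f|\theta_V) \left[ \log p(f, y) - \log q(f|\theta_V) \right] df
= \mathbb{E}_{q(f|\theta_V)}\left[ \log p(f, y) \right] + H[q],
\qquad
H[q] = -\int q(f|\theta_V) \log q(f|\theta_V) \, df.
\]

As an example, if q were an n-dimensional Gaussian with covariance Σ, the entropy would be H[q] = ½ ln|Σ| + (n/2) ln 2πe, so the second term grows as the distribution spreads over a larger volume, while the first term rewards placing that mass where the joint density p(f, y) is high.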
Suitable variational parameters, θV, can be found by maximising Equation (3.13).
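Opper and Archambeau (2009) assume the approximating distribution, q(f|θV), is a Gaussian distribution, q(f|θV) = N(f|µV, ΣV), where the variational parameters are θV = {µV, ΣV}. For a factorising likelihood, the negative of the bound, which is to be minimised, is
\[
L(q) = \frac{1}{2} \mathrm{Tr}\left( K_{ff}^{-1} \Sigma_V \right) + \frac{1}{2} \mu_V^{\top} K_{ff}^{-1} \mu_V - \frac{1}{2} \ln |\Sigma_V| + \ln Z - \frac{n}{2} \ln 2\pi e - \sum_{i=1}^{n} \int q(f|\theta_V) \log p(y_i|f_i) \, df,
\]
where Z = (2π)^{n/2} |K_ff|^{1/2} is the normalising constant of the zero-mean Gaussian process prior, so that the first five terms together equal KL(q(f|θV) ∥ p(f)).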
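As a minimal sketch of how the two parts of this objective might be evaluated, the code below computes the Gaussian KL term and the sum of one-dimensional expected log-likelihoods, and combines them into the bound of Equation (3.12). It is not the authors' implementation. The Bernoulli likelihood with a logistic link, the labels in {-1, +1}, the Gauss-Hermite quadrature, the toy kernel matrix, and all function names are assumptions made for illustration. The quadrature relies on each expectation involving only the marginal mean and variance of the corresponding f_i.

```python
import numpy as np


def gaussian_kl_to_prior(mu, Sigma, K):
    """KL(N(mu, Sigma) || N(0, K)) for n-dimensional Gaussians."""
    n = mu.shape[0]
    trace_term = np.trace(np.linalg.solve(K, Sigma))
    quad_term = mu @ np.linalg.solve(K, mu)
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    return 0.5 * (trace_term + quad_term - n + logdet_K - logdet_Sigma)


def expected_loglik(y, mu, var, n_points=20):
    """Sum over i of E_{N(f_i | mu_i, var_i)}[log p(y_i | f_i)].

    Assumes (for illustration) p(y_i | f_i) = sigmoid(y_i * f_i), y_i in {-1, +1}.
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables f = mu + sqrt(2 var) * node for the weight exp(-x^2).
    f = mu[:, None] + np.sqrt(2.0 * var)[:, None] * nodes[None, :]
    log_lik = -np.logaddexp(0.0, -y[:, None] * f)  # numerically stable log sigmoid
    return float(np.sum(log_lik @ weights) / np.sqrt(np.pi))


def elbo(y, mu, Sigma, K):
    """Expected log-likelihood minus KL(q || prior), as in Equation (3.12)."""
    return expected_loglik(y, mu, np.diag(Sigma)) - gaussian_kl_to_prior(mu, Sigma, K)


# Hypothetical toy problem: five inputs, squared-exponential kernel plus jitter.
x = np.linspace(-2.0, 2.0, 5)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2) + 1e-6 * np.eye(5)
y = np.array([-1.0, -1.0, 1.0, 1.0, 1.0])

# Evaluate the bound with q set equal to the prior, where the KL term vanishes.
bound_at_prior = elbo(y, np.zeros(5), K.copy(), K)
print(bound_at_prior)
```

Working with `np.linalg.solve` and `slogdet` rather than explicit inverses and determinants is a design choice for numerical stability; a Cholesky-based version would be a natural alternative.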
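Furthermore, note that each ∫ q(f|θV) log p(yi|fi) df depends only on the corresponding mean, µV,i, and the diagonal covariance element, ΣV,i,i, of the full covariance matrix, ΣV. Setting the gradient of the objective with respect to ΣV to zero gives the equality
\[
\Sigma_V^{-1} = K_{ff}^{-1} - 2 \nabla_{\Sigma_V} \mathbb{E}_{q(f|\theta_V)}\left[ \log p(y|f) \right],
\]
in which the gradient term is diagonal. This allows the authors to express the optimal covariance as,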
\[
\Sigma_V = \left( K_{ff}^{-1} + \Omega_V \right)^{-1},
\]
where ΩV is a diagonal matrix with variational parameters ωV ∈ ℝ^{n×1} along its diagonal. Without this trick, the naïve representation would require n(n + 3)/2 variational parameters to be learnt in total; by applying it, only 2n variational parameters are required. The variational parameters are optimised using a gradient descent method. By significantly
3.3 Sparse Gaussian Process 49