Hybrid EP-VB Inference - Efficient Bayesian active learning and matrix modelling

Approximate inference in our model is implemented using a combination of expectation propagation (EP) [Minka, 2001b] and variational Bayes (VB) [Attias, 1999; Ghahramani & Beal, 2000]. We choose EP as the core inference routine since empirical studies show that EP obtains state-of-the-art performance in the related problem of GP binary classification [Nickisch & Rasmussen, 2008]. We first give a brief primer on EP.

6.3.1 Primer on Expectation Propagation

Expectation propagation is a deterministic algorithm for approximate Bayesian inference, originally developed in Minka [2001a]. Similar to VB, introduced in Section 5.4.1, the algorithm approximates an intractable distribution over variables $\theta$, $p(\theta)$, with a simpler, factorized distribution $q(\theta)$. EP matches the approximation $q$ to the posterior $p$ by attempting to minimize the KL divergence between the two, $\mathrm{KL}[p(\theta) \,\|\, q(\theta)]$, with respect to the parameters of $q$. Note that the direction of the KL is the reverse of that used in VB, Equation (5.6). For most models with many parameters, minimizing $\mathrm{KL}[p(\theta) \,\|\, q(\theta)]$ over the entire distribution $q$ is intractable. Therefore, EP uses an iterative procedure.

For most models, the posterior distribution can be decomposed into a product of factors: $p(\theta) = \prod_a f_a(\theta)$. In EP, the posterior approximation $q$ is decomposed into approximate factors $\hat{f}_a(\theta)$ that approximate the true factors $f_a(\theta)$. The approximate posterior is the re-normalized product of approximate factors, $q(\theta) \propto \prod_a \hat{f}_a(\theta)$. EP iteratively refines each approximate factor $\hat{f}_a(\theta)$ by minimizing the following KL divergence,

$$\mathrm{KL}\left[ f_a q^{\backslash a} \,\middle\|\, \hat{f}_a q^{\backslash a} \right] = \int \left[ f_a q^{\backslash a} \log \frac{f_a q^{\backslash a}}{\hat{f}_a q^{\backslash a}} + \hat{f}_a q^{\backslash a} - f_a q^{\backslash a} \right] d\theta , \tag{6.8}$$

where $q^{\backslash a}(\theta)$ is the current approximation with the $a$-th term removed, $q^{\backslash a}(\theta) = q(\theta) / \hat{f}_a(\theta) \propto \prod_{b \neq a} \hat{f}_b(\theta)$. The form of the KL in Equation (6.8) accounts for the distributions being unnormalized: the last two terms cancel when both arguments integrate to one. For exponential family distributions, optimizing Equation (6.8) corresponds to matching the expected sufficient statistics of the distributions on either side of the KL. With a Gaussian approximate posterior this computation is equivalent to moment matching.

EP iterates over the approximate factors, minimizing (6.8) until convergence. This procedure is not guaranteed to minimize the global KL divergence between $p(\theta)$ and $q(\theta)$, or even to converge. However, in practice it has demonstrated strong empirical performance with many models and has become a popular inference algorithm. For a thorough overview, see Minka [2001a].
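To make the moment matching step concrete, the following Python sketch refines a single probit factor $\Phi(y\theta)$ against a Gaussian cavity $q^{\backslash a}(\theta) = \mathcal{N}(\theta; m, v)$. It is a one-dimensional illustration only: the closed-form moments are the standard ones used for GP classification, and the function names (`match_probit_moments`, `site_from_moments`) are chosen here for illustration rather than taken from the text.

```python
import math


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    """Standard normal cumulative distribution (the probit function)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def match_probit_moments(y, m_cav, v_cav):
    """Moments of the tilted distribution Phi(y * theta) * N(theta; m_cav, v_cav).

    Returns (Z, mean, var): the 0th moment (normalizer), mean and variance.
    A Gaussian with these moments minimizes the KL in (6.8) for this factor.
    """
    s = math.sqrt(1.0 + v_cav)
    z = y * m_cav / s
    Z = norm_cdf(z)
    ratio = norm_pdf(z) / Z
    mean = m_cav + y * v_cav * ratio / s
    var = v_cav - (v_cav ** 2) * ratio * (z + ratio) / (1.0 + v_cav)
    return Z, mean, var


def site_from_moments(mean, var, m_cav, v_cav):
    """Recover the approximate factor as the Gaussian ratio q_new / cavity.

    Returned in natural parameters: (precision * mean, precision).
    """
    prec = 1.0 / var - 1.0 / v_cav
    prec_mean = mean / var - m_cav / v_cav
    return prec_mean, prec


if __name__ == "__main__":
    Z, mean, var = match_probit_moments(y=1.0, m_cav=0.0, v_cav=1.0)
    print(Z, mean, var, site_from_moments(mean, var, 0.0, 1.0))
```

The normalizer `Z` is the 0th moment of the tilted distribution, which is the quantity EP accumulates when approximating the model evidence.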

6.3.2 Inference for Collaborative Preference Learning

We describe our EP routine for our multi-user preference learning model. We approximate the posterior in (6.6) with fully factorized Gaussian distributions over all of the elements in $W$, $H$ and $G_{\mathcal{D}}$,

$$q(G_{\mathcal{D}}, W, H) = \left[ \prod_{u=1}^{U} \prod_{d=1}^{D} \mathcal{N}(w_{u,d}; m^{w}_{u,d}, v^{w}_{u,d}) \right] \left[ \prod_{d=1}^{D} \prod_{i=1}^{P} \mathcal{N}(h_{d,i}; m^{h}_{d,i}, v^{h}_{d,i}) \right] \left[ \prod_{(u,i) \in \mathcal{D}} \mathcal{N}(g_{u,i}; m^{g}_{u,i}, v^{g}_{u,i}) \right] , \tag{6.9}$$

where $m^{w}_{u,d}$, $v^{w}_{u,d}$, $m^{h}_{d,i}$, $v^{h}_{d,i}$, $m^{g}_{u,i}$ and $v^{g}_{u,i}$ are free parameters to be determined by EP. The superscripts $w$, $h$ and $g$ indicate the random variables described by these parameters.

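The joint distribution $p(G_{\mathcal{D}}, W, H, Y_{\mathcal{D}} \mid X, U)$ consists of four factors $f_1, \ldots, f_4$,

$$p(G_{\mathcal{D}}, W, H, Y_{\mathcal{D}} \mid X, U) = \prod_{a=1}^{4} f_a(G_{\mathcal{D}}, W, H) ,$$

with correspondences $f_1(G_{\mathcal{D}}, W, H) = p(Y_{\mathcal{D}} \mid G_{\mathcal{D}})$, $f_2(G_{\mathcal{D}}, W, H) = p(G_{\mathcal{D}} \mid W, H)$, $f_3(G_{\mathcal{D}}, W, H) = p(W \mid U)$ and $f_4(G_{\mathcal{D}}, W, H) = p(H \mid X)$. EP approximates these exact factors by approximate factors $\hat{f}_1(G_{\mathcal{D}}, W, H), \ldots, \hat{f}_4(G_{\mathcal{D}}, W, H)$ that have the same functional form as $q$,

$$\hat{f}_a(G_{\mathcal{D}}, W, H) = \left[ \prod_{u=1}^{U} \prod_{d=1}^{D} \mathcal{N}(w_{u,d} \mid \hat{m}^{a,w}_{u,d}, \hat{v}^{a,w}_{u,d}) \right] \left[ \prod_{d=1}^{D} \prod_{i=1}^{P} \mathcal{N}(h_{d,i} \mid \hat{m}^{a,h}_{d,i}, \hat{v}^{a,h}_{d,i}) \right] \left[ \prod_{(u,i) \in \mathcal{D}} \mathcal{N}(g_{u,i} \mid \hat{m}^{a,g}_{u,i}, \hat{v}^{a,g}_{u,i}) \right] \hat{s}_a , \tag{6.10}$$

where $\hat{m}^{a,w}_{u,d}$, $\hat{v}^{a,w}_{u,d}$, $\hat{m}^{a,h}_{d,i}$, $\hat{v}^{a,h}_{d,i}$, $\hat{m}^{a,g}_{u,i}$, $\hat{v}^{a,g}_{u,i}$ and $\hat{s}_a$ are free parameters of the approximate factors. As described in Section 6.3.1, $q$ is obtained from the normalized product $\prod_a \hat{f}_a$.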

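The first step is to initialize $\hat{f}_1, \ldots, \hat{f}_4$ and $q$ to be uniform. Then EP iteratively refines each $\hat{f}_a$ by minimizing the KL divergence between $f_a q^{\backslash a}$ and $\hat{f}_a q^{\backslash a}$, $\mathrm{KL}[f_a q^{\backslash a} \,\|\, \hat{f}_a q^{\backslash a}]$, with respect to the parameters of $\hat{f}_a$.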

However, this KL minimization does not perform well for refining $\hat{f}_2$, the factor that approximates the matrix factorization term $p(G_{\mathcal{D}} \mid W, H)$. This is because the matrix factorization has many invariances; the solution is invariant to rotations, reflections or re-scalings of $W$ and $H$. This means that the posterior is multimodal. EP will average across the modes of the posterior (6.9), which will result in a poor overall solution [Bishop, 2006; Stern et al., 2009]. Therefore we use VB to refine $\hat{f}_2$. As discussed in Chapter 5, VB is a popular inference routine for probabilistic matrix factorization because it will model just one of the modes of the posterior, which is sufficient for making good predictions. Concretely, instead of minimizing $\mathrm{KL}[q^{\backslash 2} f_2 \,\|\, q^{\backslash 2} \hat{f}_2]$ as is required by EP, the direction of the KL divergence is reversed; that is, we minimize $\mathrm{KL}[q^{\backslash 2} \hat{f}_2 \,\|\, q^{\backslash 2} f_2]$.
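The difference between the two KL directions can be seen on a toy one-dimensional example. The sketch below is not the matrix factorization model: it assumes a hypothetical symmetric bimodal target with modes at $\pm 3$, fits a Gaussian by moment matching (the EP direction, $\mathrm{KL}[p \,\|\, q]$), and fits another by a coarse grid search over $\mathrm{KL}[q \,\|\, p]$ (the VB direction). All names, grid ranges and the target itself are illustrative assumptions.

```python
import math


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def target_pdf(x, mode=3.0):
    """Toy symmetric bimodal 'posterior' with modes at -mode and +mode."""
    return 0.5 * normal_pdf(x, -mode, 1.0) + 0.5 * normal_pdf(x, mode, 1.0)


STEP = 0.02
GRID = [-15.0 + STEP * k for k in range(1501)]  # integration grid on [-15, 15]


def moment_match():
    """Gaussian minimizing KL[p || q]: match the mean and variance of p."""
    mean = sum(x * target_pdf(x) for x in GRID) * STEP
    var = sum((x - mean) ** 2 * target_pdf(x) for x in GRID) * STEP
    return mean, math.sqrt(var)


def reverse_kl(mean, sd):
    """KL[q || p] for q = N(mean, sd^2), by numerical integration."""
    total = 0.0
    for x in GRID:
        q = normal_pdf(x, mean, sd)
        if q > 1e-300:  # skip points where q underflows
            total += q * (math.log(q) - math.log(target_pdf(x)))
    return total * STEP


def fit_reverse_kl():
    """Coarse grid search for the Gaussian minimizing KL[q || p]."""
    candidates = [(0.5 * m, 0.25 * s) for m in range(-10, 11) for s in range(2, 13)]
    return min(candidates, key=lambda c: reverse_kl(*c))


if __name__ == "__main__":
    print("moment matching (mean, sd):", moment_match())
    print("reverse KL      (mean, sd):", fit_reverse_kl())
```

In this toy setting, moment matching places the Gaussian between the two modes with a large variance, whereas the reverse-KL fit concentrates on a single mode. By symmetry, which of the two modes the grid search returns is arbitrary.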

EP iteratively refines all the approximate factors until convergence. After running inference we approximate the predictive distribution (6.7) by replacing the exact posterior with $q$. The approximate predictive distribution is

$$p(y_{u,P+1} \mid Y_{\mathcal{D}}, X, U) \approx \Phi\left( \frac{y_{u,P+1} \, m^{g}_{u,P+1}}{\sqrt{v^{g}_{u,P+1} + 1}} \right) ,$$

where

$$m^{g}_{u,P+1} = \sum_{d=1}^{D} m^{w}_{u,d} \, m^{h}_{d,P+1} , \qquad v^{g}_{u,P+1} = \sum_{d=1}^{D} [m^{w}_{u,d}]^2 \, v^{h}_{d,P+1} + \sum_{d=1}^{D} v^{w}_{u,d} \, [m^{h}_{d,P+1}]^2 + \sum_{d=1}^{D} v^{w}_{u,d} \, v^{h}_{d,P+1} ,$$

and $m^{h}_{d,P+1}$ and $v^{h}_{d,P+1}$ are given by

$$m^{h}_{d,P+1} = \mathbf{k}_\star^{\top} \left[ K_{\text{items}} + \mathrm{diag}[\hat{v}^{h,2}_{d}] \right]^{-1} \hat{m}^{h,2}_{d} , \qquad v^{h}_{d,P+1} = k_\star - \mathbf{k}_\star^{\top} \left[ K_{\text{items}} + \mathrm{diag}[\hat{v}^{h,2}_{d}] \right]^{-1} \mathbf{k}_\star ,$$

where $\mathrm{diag}[\cdot]$ converts a vector into a diagonal matrix and $\hat{m}^{h,2}_{d}$, $\hat{v}^{h,2}_{d}$ are the vectors $\hat{m}^{h,2}_{d} = (\hat{m}^{h,2}_{1,d}, \ldots, \hat{m}^{h,2}_{P,d})^{\top}$ and $\hat{v}^{h,2}_{d} = (\hat{v}^{h,2}_{1,d}, \ldots, \hat{v}^{h,2}_{P,d})^{\top}$.

Note that EP approximates the posterior with fully factorized Gaussians for each factor. However, to interpolate to the new item pairs we use the full GP prior over the latent functions $h_d$. Therefore, when computing the EP approximation to the predictive distributions, we replace the approximate factor $\hat{f}_4$, which corresponds to an uncorrelated prior over the item pairs, with the full GP prior covariance matrix.
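The final step of the predictive computation can be sketched in a few lines of Python. The sketch assumes that the labels take values in $\{-1, +1\}$ and that the per-dimension means and variances of $w_{u,d}$ and $h_{d,P+1}$ have already been computed (the GP interpolation step is not shown). The function name and argument names are illustrative.

```python
import math


def predictive_probability(y, m_w, v_w, m_h, v_h):
    """Approximate p(y | data) for one user and one new item pair.

    y        : label, assumed here to be -1 or +1
    m_w, v_w : length-D lists with the posterior means / variances of w_{u,d}
    m_h, v_h : length-D lists with the predictive means / variances of h_{d,P+1}
    Returns (probability, m_g, v_g).
    """
    # Mean of g = sum_d w_d * h_d under independent Gaussians.
    m_g = sum(mw * mh for mw, mh in zip(m_w, m_h))
    # Variance of a sum of products of independent Gaussian variables.
    v_g = sum(
        mw ** 2 * vh + vw * mh ** 2 + vw * vh
        for mw, vw, mh, vh in zip(m_w, v_w, m_h, v_h)
    )
    z = y * m_g / math.sqrt(v_g + 1.0)
    prob = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Phi(z)
    return prob, m_g, v_g


if __name__ == "__main__":
    print(predictive_probability(+1, [1.0, -0.5], [0.1, 0.2], [0.5, 2.0], [0.3, 0.4]))
```

Because the probit is symmetric, the probabilities returned for $y = +1$ and $y = -1$ sum to one.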

EP also approximates the model evidence with the integral of the product of all the approximate factors $\hat{f}_1, \ldots, \hat{f}_4$. Computing the model evidence with EP requires matching the 0th-order moments of the distributions in Equation (6.8). With the VB routine for the second factor $\hat{f}_2$, we use the variational lower bound to the evidence (the ELBO).

6.3.3 Algorithmic Details

Damping

Unlike VB, EP does not optimize a bound on the likelihood and is not guaranteed to converge, so it may oscillate. This undesirable behaviour can be prevented by damping the EP updates [Minka & Lafferty, 2002]. Let $\hat{f}_a^{\text{new}}$ denote the value of the approximate factor that minimizes the KL in (6.8). Damping consists of using

$$\hat{f}_a^{\text{damp}} = \left[ \hat{f}_a^{\text{new}} \right]^{\epsilon} \left[ \hat{f}_a \right]^{(1-\epsilon)} , \tag{6.11}$$

instead of $\hat{f}_a^{\text{new}}$ to update the approximate factor, where $\hat{f}_a$ is the factor before the update. The parameter $\epsilon \in [0, 1]$ controls the degree of damping: $\epsilon = 1$ yields no damping, and with $\epsilon = 0$ the factor $\hat{f}_a$ remains unchanged. To improve the convergence of EP, we use a damping schedule recommended in Hernández-Lobato [2010] that uses an initial $\epsilon = 1$ and then progressively reduces $\epsilon$ by a constant multiplicative factor.

Refinement of $\hat{f}_2$

The specific computations required to refine the probit likelihood function $\hat{f}_1$ follow from those for GP classification [Rasmussen & Williams, 2005]. Refining $\hat{f}_3$ and $\hat{f}_4$ requires standard moment matching of a multivariate Gaussian to independent Gaussians. For the second factor $\hat{f}_2$, we use VB in a similar manner to Stern et al. [2009].

To do this, we first marginalize $q^{\backslash 2} f_2$ with respect to $G_{\mathcal{D}}$. This yields an auxiliary unnormalized distribution $s(W, H)$ which can be computed analytically using

$$s(W, H) = \int \prod_{(u,i) \in \mathcal{D}} \delta[g_{u,i} - w_u h_{\cdot, z_{u,i}}] \, q^{\backslash 2}(G_{\mathcal{D}}, W, H) \, dG_{\mathcal{D}} . \tag{6.12}$$

Let $q_{W,H}$ be the posterior approximation (6.9) after marginalizing out $G_{\mathcal{D}}$. The parameters of $q_{W,H}$ are then optimized by minimizing $\mathrm{KL}[q_{W,H} \,\|\, s]$. This corresponds to performing a VB matrix factorization with a Gaussian likelihood, for which we use the gradient descent algorithm described in Raiko et al. [2007]. After this, $\hat{f}_2$ is updated using the ratio of Gaussians, $\hat{f}_2 = q_{W,H} / q^{\backslash 2}$.
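Both the ratio of Gaussians used to update an approximate factor and the damped update (6.11) become simple arithmetic when each univariate Gaussian factor is stored in natural parameters. The sketch below illustrates this under that assumed representation: products add natural parameters, ratios subtract them, and raising a factor to a power scales them. The class name, the helper functions and the decay value are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GaussianFactor:
    """Unnormalized 1-D Gaussian factor in natural parameters."""

    prec_mean: float  # mean / variance
    prec: float       # 1 / variance (zero for a uniform factor)

    @staticmethod
    def from_moments(mean, var):
        return GaussianFactor(mean / var, 1.0 / var)

    def moments(self):
        return self.prec_mean / self.prec, 1.0 / self.prec

    def __mul__(self, other):      # product of factors
        return GaussianFactor(self.prec_mean + other.prec_mean, self.prec + other.prec)

    def __truediv__(self, other):  # ratio of factors, e.g. q / cavity
        return GaussianFactor(self.prec_mean - other.prec_mean, self.prec - other.prec)

    def __pow__(self, exponent):   # factor raised to a power
        return GaussianFactor(exponent * self.prec_mean, exponent * self.prec)


def damped_update(f_new, f_old, eps):
    """Equation (6.11): [f_new]^eps * [f_old]^(1 - eps)."""
    return (f_new ** eps) * (f_old ** (1.0 - eps))


def damping_schedule(eps0=1.0, decay=0.99, n_iters=5):
    """Start at eps0 and shrink by a constant factor (decay value is hypothetical)."""
    return [eps0 * decay ** t for t in range(n_iters)]


if __name__ == "__main__":
    q_new = GaussianFactor.from_moments(1.0, 0.5)
    cavity = GaussianFactor.from_moments(0.0, 2.0)
    site_new = q_new / cavity                # refined approximate factor
    site_old = GaussianFactor(0.0, 0.0)      # uniform initialization
    print(damped_update(site_new, site_old, eps=0.5))
```

In this representation a uniform factor has both natural parameters equal to zero, which matches the uniform initialization of the approximate factors.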

6.3.4 Sparse GPs for Linear Computational Time

The cost of performing inference with GPs is cubic in the number of observations. In our case, refining the third factor $\hat{f}_3$ costs $O(DU^3)$, where $U$ is the number of users and $D$ is the number of shared latent functions. The cost to refine the fourth factor $\hat{f}_4$ is $O(DP^3)$, where $P$ is the number of observed item pairs. These cubic costs can be prohibitive. However, GP inference may be reduced to a cost linear in the number of observations with sparse approximations. We use the Fully Independent Training Conditional (FITC) approximation [Snelson & Ghahramani, 2006]. In essence, FITC channels covariance information from the full dataset through a small number of pseudo-inputs that can be located arbitrarily.

We reduce the costs of refining $\hat{f}_3$ and $\hat{f}_4$ by approximating $K_{\text{users}}$ and $K_{\text{items}}$ in Equations (6.4) and (6.5) using FITC. Under this approximation, an $N \times N$ covariance matrix $K$ resulting from the evaluation of a covariance function at $N$ locations is approximated by the matrix $K' = Q + \mathrm{diag}(K - Q)$, where $Q = K_{N N_0} K_{N_0 N_0}^{-1} K_{N N_0}^{\top}$ is low rank. The $N_0 \times N_0$ matrix $K_{N_0 N_0}$ contains evaluations of the covariance function at only $N_0 < N$ pseudo-inputs, and the $N \times N_0$ matrix $K_{N N_0}$ contains the covariances between the original data and the pseudo-inputs.
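The structure of the FITC matrix $K' = Q + \mathrm{diag}(K - Q)$ can be checked numerically with a small standard-library Python sketch. It assumes scalar inputs and a squared-exponential covariance function, neither of which is specified in the text, and it builds the full $N \times N$ matrix purely for illustration; an implementation that attains the linear cost would work with $K_{N N_0}$ and the diagonal directly. All function names and the jitter value are hypothetical.

```python
import math


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential covariance between two scalar inputs (an assumption)."""
    return math.exp(-0.5 * ((a - b) / lengthscale) ** 2)


def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        pivot_row = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot_row] = aug[pivot_row], aug[col]
        pivot = aug[col][col]
        aug[col] = [value / pivot for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fitc_covariance(x, pseudo, jitter=1e-8):
    """Return (K_prime, Q) with K_prime = Q + diag(K - Q)."""
    n, n0 = len(x), len(pseudo)
    K_nn0 = [[rbf(xi, zj) for zj in pseudo] for xi in x]            # N x N0
    K_n0n0 = [[rbf(zi, zj) + (jitter if i == j else 0.0)
               for j, zj in enumerate(pseudo)]
              for i, zi in enumerate(pseudo)]                       # N0 x N0
    K_n0n = [list(column) for column in zip(*K_nn0)]                # N0 x N
    A = solve(K_n0n0, K_n0n)                                        # K_{N0N0}^{-1} K_{NN0}^T
    Q = [[sum(K_nn0[i][m] * A[m][j] for m in range(n0)) for j in range(n)]
         for i in range(n)]
    # Keep the exact prior variances on the diagonal, Q elsewhere.
    K_prime = [[rbf(x[i], x[i]) if i == j else Q[i][j] for j in range(n)]
               for i in range(n)]
    return K_prime, Q


if __name__ == "__main__":
    K_prime, Q = fitc_covariance([0.0, 0.5, 1.0, 1.5, 2.0, 2.5], [0.0, 1.25, 2.5])
    for row in K_prime:
        print(["%.3f" % value for value in row])
```

The diagonal of $K'$ equals the exact prior variances, while the off-diagonal entries come only from the low-rank term $Q$. When the pseudo-inputs coincide with the data, $K'$ recovers $K$ up to the jitter.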

This approximation allows us to refine $\hat{f}_3$ and $\hat{f}_4$ in $O(D U_0^2 U)$ and $O(D P_0^2 P)$ operations, where $U_0$ and $P_0$ are the number of pseudo-inputs for the users and the item pairs respectively. We choose $U_0$ and $P_0$ to balance cost and accuracy. The calculations required to implement EP and to approximate the predictive distribution and model evidence with FITC follow from those in Lázaro Gredilla [2010] and Naish-Guzman & Holden [2007]. Both of the other factors, $\hat{f}_1$ and $\hat{f}_2$, have linear cost in the total number of datapoints, $O(|\mathcal{D}|)$. Without FITC, this cost is dominated by the cubic cost of the GPs, and the total cost of our inference routine is $O(|\mathcal{D}| + DU^3 + DP^3)$. With FITC, the total cost is linear in the number of users, item pairs and datapoints, $O(|\mathcal{D}| + D U_0^2 U + D P_0^2 P)$.
