5.4 Stochastic Variational Inference for Binary Matrices
5.4.8 Related Work
Specific Challenges for SVI in Matrix Factorization
SVI has been applied to other probabilistic models such as Latent Dirichlet Allocation [Hoffman et al., 2010], the Hierarchical Dirichlet Process [Hoffman et al., 2013], and Bayesian nonparametric models [Bryant & Sudderth, 2012; Wang et al., 2011]. In these cases there is a clear distinction between local and global parameters or variables. The distinction is governed by the conditional dependencies in the model. A local variable is associated with each observation, and the conditional distribution of each observation and its local variable is independent of all other local variables and observations given the global variables [Hoffman et al., 2013].
Therefore, local parameters are updated only when a particular data point is subsampled, and in the aforementioned models the global variational parameters are updated when any data point is subsampled. In MF, the definition of a data point is more ambiguous: does a data point correspond to a row, a column, an entry or the entire matrix? We subsample individual matrix entries. In this case the row and column parameters U and V are partially global: they do not satisfy the conditional independence assumptions required to be local, and they are only updated when elements in the corresponding row or column are subsampled. With MF, the partially global nature of the row and column parameters makes the data subsampling strategy more important because it
Algorithm 1 Stochastic Inference for Binary Matrices
 1: Input: matrix X, initial parameters Φ, # samples T
 2: for t = 1 to T do
 3:   select minibatch size S using (5.20)
 4:   for s = 1 to S do
 5:     save ˚ui,1, . . . , ˚ui,D, ˚vj,1, . . . , ˚vj,D and ˚z
 6:     sample row and column indices (i, j) ∼ p(i, j)
 7:     compute stepsize ρz using Robbins-Monro
 8:     update ξi,j using (5.12)
 9:     compute ˚z? and update ˚z using (5.16)
10:     for d = 1 to D do
11:       update ξi,j using (5.12)
12:       compute ˚v?j,d and update ˚vj,d using (5.15)
13:       update ˚v?,avgj,d
14:     end for
15:     for d = 1 to D do
16:       update ξi,j using (5.12)
17:       compute ˚u?i,d and update ˚ui,d using (5.14)
18:       update ˚u?,avgi,d
19:     end for
20:     restore ˚ui,1, . . . , ˚ui,D, ˚vj,1, . . . , ˚vj,D and ˚z
21:   end for
22:   for any row i sampled in the last minibatch do
23:     compute stepsize ρui using Robbins-Monro
24:     update ˚ui,1, . . . , ˚ui,D using (5.17)
25:   end for
26:   for any column j sampled in the last minibatch do
27:     compute stepsize ρvj using Robbins-Monro
28:     update ˚vj,1, . . . , ˚vj,D
29:   end for
30:   compute stepsize ρz using Robbins-Monro
31:   update ˚z
32: end for
determines which parameters are updated and how often. A more closely related application of SVI is to the Mixed-Membership Stochastic Blockmodel for L-node binary networks [Gopalan et al., 2012], but in this case only one L × D matrix of parameters, the community memberships, is partially global.
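The effect of entry-wise subsampling on which parameters get updated can be illustrated with a minimal sketch. It assumes a uniform sampling distribution over entries in place of p(i, j), and the helper names `sample_minibatch` and `touched_parameters` are hypothetical; it is not the SIBM implementation.

```python
import numpy as np

def sample_minibatch(num_rows, num_cols, batch_size, rng):
    """Draw (i, j) entry indices uniformly at random (an assumed stand-in for p(i, j))."""
    rows = rng.integers(0, num_rows, size=batch_size)
    cols = rng.integers(0, num_cols, size=batch_size)
    return rows, cols

def touched_parameters(rows, cols):
    """Indices of the rows of U and columns of V that appear in the minibatch."""
    return np.unique(rows), np.unique(cols)

rng = np.random.default_rng(0)
L, M, S = 1000, 500, 20  # matrix size and minibatch size (illustrative values)
rows, cols = sample_minibatch(L, M, S, rng)
active_rows, active_cols = touched_parameters(rows, cols)

# Only the sampled rows and columns are updated (partially global parameters),
# whereas a fully global parameter such as the bias z is updated every minibatch.
print(len(active_rows), "of", L, "row parameter vectors are updated")
print(len(active_cols), "of", M, "column parameter vectors are updated")
```

With S much smaller than L and M, most row and column parameters are left untouched in any given minibatch, which is why the sampling distribution determines how often each parameter is updated.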
A second difficulty for SVI posed by MF models arises from the direct coupling of the parameter updates. For example, the update for the row variational parameters in (5.14) is a direct function of the column parameters ˚vj,d. As noted in Section 5.4.6, the parameters in MF models are often heavy tailed. The combination of update coupling and heavy-tailed parameters results in heavy-tailed noisy gradients. This makes the minibatch size selection particularly important with MF models. Algorithms that adaptively change the stepsize online have been proposed [Ranganath et al., 2013; Schaul et al., 2012]. However, these algorithms assume a Gaussian distribution for the noisy estimates of the natural gradients. Since the noisy estimates have heavy tails, these methods can result in unstable behaviour in MF models, and we found that the sequences of stepsizes ended up diverging.
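For contrast with the adaptive schemes, a fixed Robbins-Monro schedule can be sketched as follows. The form ρ_t = (τ + t)^(−κ) with κ in (0.5, 1] is the standard choice in the SVI literature; the values of τ and κ, the per-row update counter, and the function names are assumptions made for illustration only.

```python
import numpy as np

def robbins_monro_step(t, tau=1.0, kappa=0.7):
    """Stepsize rho_t = (tau + t)^(-kappa).

    kappa in (0.5, 1] gives sum(rho_t) = inf and sum(rho_t^2) < inf,
    which are the Robbins-Monro conditions.
    """
    return (tau + t) ** (-kappa)

def blend(old_value, noisy_target, rho):
    """Move a variational parameter towards a noisy target with stepsize rho."""
    return (1.0 - rho) * old_value + rho * noisy_target

# Assumed bookkeeping: each row keeps its own update counter, so that rows
# sampled rarely still take large steps when they are eventually updated.
num_rows = 4
row_update_counts = np.zeros(num_rows, dtype=int)
for i in [2, 0, 2, 2]:  # rows touched in successive minibatches
    rho_u_i = robbins_monro_step(row_update_counts[i])
    row_update_counts[i] += 1
    print(f"row {i}: stepsize {rho_u_i:.3f}")
```

Because the schedule depends only on the update count, it is unaffected by heavy-tailed gradient noise, at the cost of not adapting to the data.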
Algorithms for Probabilistic Binary MF
An alternative stochastic algorithm that subsamples only the zeros is proposed in Paquet & Koenigstein [2013]. However, unlike SIBM, this method does not correct for the bias introduced by the subsampling process and hence yields poorer solutions, as we observe in our experiments.
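The bias introduced by non-uniform subsampling can be seen in a small worked example. This is a generic importance-weighting sketch, not necessarily the correction used by SIBM: the sampling weights, the matrix, and the per-entry terms `f` (standing in for per-entry contributions to an objective) are all made up for illustration.

```python
import numpy as np

# Toy 4 x 5 binary matrix with three ones.
X = np.zeros((4, 5))
X[0, 0] = X[1, 2] = X[3, 4] = 1.0

# Stand-in per-entry terms whose sum over the whole matrix we want to estimate.
f = np.where(X == 1, 2.0, -1.0)
true_total = f.sum()

# Assumed sampling scheme: each one is five times more likely to be drawn than each zero.
weights = np.where(X == 1, 5.0, 1.0)
p = weights / weights.sum()

# Exact expectations of two single-sample estimators of the total.
expected_corrected = np.sum(p * (f / p))    # importance-weighted: f(i, j) / p(i, j)
expected_naive = np.sum(p * (X.size * f))   # ignores p and scales as if sampling were uniform

print(true_total, expected_corrected, expected_naive)
```

Dividing by the sampling probability recovers the full-matrix total in expectation, whereas the uncorrected estimator over-represents the ones.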
With sparse matrices, batch variational inference schemes can be efficient since the time required to update the parameters scales linearly in the number of observations [Lim & Teh, 2007]. However, with fully observed matrices this is usually impractical since each update costs O(LM). We note that with sparse binary matrices the computations required with a Gaussian likelihood in Raiko et al. [2007] can be rearranged so that the cost per iteration is linear only in the number of ones. This is achieved essentially by decomposing the likelihood into a sum of a term corresponding to a full matrix of zeros and correction factors for the observed ones. Any O(LM) terms may then be pre-computed. However, this is not possible with the logistic likelihood, which is more appropriate for binary data.
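The rearrangement can be illustrated on the point-estimate analogue of the Gaussian likelihood, the squared error over a fully observed binary matrix. The sketch below is an assumption-laden illustration rather than the variational computation of Raiko et al. [2007]: it uses the identity Σ (x − u·v)² = Σ (u·v)² − 2 Σ_{ones} u·v + (number of ones), where the first term is obtained from two D × D Gram matrices.

```python
import numpy as np

def sq_error_dense(X, U, V):
    """Squared error over all L x M entries; costs O(LMD)."""
    return np.sum((X - U @ V.T) ** 2)

def sq_error_sparse(ones_rows, ones_cols, U, V):
    """Same quantity using only the positions of the ones.

    The 'all zeros' term is sum_ij (u_i . v_j)^2 = sum((U^T U) * (V^T V)),
    which costs O((L + M) D^2); the correction for the ones is linear in
    the number of ones.
    """
    all_zeros_term = np.sum((U.T @ U) * (V.T @ V))
    preds_at_ones = np.einsum("nd,nd->n", U[ones_rows], V[ones_cols])
    return all_zeros_term - 2.0 * preds_at_ones.sum() + len(ones_rows)

rng = np.random.default_rng(0)
X = (rng.random((50, 40)) < 0.05).astype(float)
U = rng.normal(size=(50, 3))
V = rng.normal(size=(40, 3))
ones_rows, ones_cols = np.nonzero(X)
print(sq_error_dense(X, U, V), sq_error_sparse(ones_rows, ones_cols, U, V))
```

The trick relies on the loss being quadratic in u·v. The logistic likelihood has no such expansion, which is consistent with the remark above that the pre-computation is not possible in that case.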
With a Gaussian likelihood, one can avoid optimization altogether and use the analytic solution for the global maximum of the ELBO derived in Nakajima et al. [2010]. The requirements of this solution include i) the likelihood must be Gaussian with equal variance across matrix entries,1 ii) U and V must have zero-mean isotropic priors, and iii) no bias parameters can be included. These constraints have a large negative effect on predictive performance, as we show in our experiments. An iterative scheme has been proposed to extend this approach to binary likelihoods at the cost of making very crude approximations to the logistic likelihood function [Seeger & Bouchard, 2012]. In practice, with binary matrices this method tends to produce only small gains in performance with respect to the solution in Nakajima et al. [2010].
A large number of non-probabilistic algorithms have been proposed for MF. With binary matrices, one of the best performing is Bayesian Personalized Ranking (BPR), which directly optimizes a ranking loss function. BPR has shown state-of-the-art results on item recommendation against a wide range of systems [Rendle et al., 2009]. It was also a key component in many of the best solutions in Track 2 of the KDD-Cup’11 music recommendation competition [Dror et al., 2012]. We show comparisons to all of the above methods, including BPR, in our experiments.
5.5 Experiments
SIBM is evaluated in experiments with synthetic and real-world binary matrices. We consider six datasets. The first is i) a synthetic dataset generated by sampling X from the generative model assumed by SIBM. We fix D = 5 and generate U and V by sampling all the ui,d and vj,d independently from N(0, 100). The global bias is fixed to z = −500, yielding binary matrices with about 98% sparsity. We consider two real-world datasets from the FIMI repository: ii) purchase data from a retail store (retail) [Brijs et al., 1999] and iii) click data from an online news portal (Kosarak). We include two datasets from the 2000 KDD Cup [Kohavi et al., 2000; Zheng et al., 2001]: iv) point-of-sale data from a retailer (POS, originally BMS-POS) and v) click data from an e-commerce website (WebView, originally BMS-WebView-2). Finally, we include vi) the Netflix data, treating 4-5 star ratings as ones. We pre-process the original datasets to be able to compare to the computationally expensive batch approach. We keep the 1000 columns with the highest number of ones and discard rows with fewer than 10 ones. We consider small and large versions of each dataset. We subsample 2000 rows for the small and 40,000 rows for the large datasets, except in retail and WebView, where we use approximately the maximum number of rows for the large datasets, 10,000 and 5000, respectively.

1 The restriction to equal likelihood variances across the matrix means that the analytic solution cannot be used directly with the Gaussian approximation to the sigmoid function in (5.10) to handle
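The synthetic data generation and the pre-processing described above can be sketched as follows. Several details are assumptions: N(0, 100) is read as variance 100 (standard deviation 10), the generative model is taken to be a Bernoulli with probability sigmoid(u·v + z), and columns are filtered before rows. The function names are hypothetical.

```python
import numpy as np

def sample_synthetic(num_rows, num_cols, D=5, prior_var=100.0, z=-500.0, seed=0):
    """Sample a binary matrix with P(x_ij = 1) = sigmoid(u_i . v_j + z) (assumed model)."""
    rng = np.random.default_rng(seed)
    std = np.sqrt(prior_var)
    U = rng.normal(0.0, std, size=(num_rows, D))
    V = rng.normal(0.0, std, size=(num_cols, D))
    logits = U @ V.T + z
    probs = 0.5 * (1.0 + np.tanh(0.5 * logits))  # numerically stable sigmoid
    X = (rng.random(probs.shape) < probs).astype(np.int8)
    return X, U, V

def preprocess(X, num_cols_kept=1000, min_ones_per_row=10):
    """Keep the columns with the most ones, then drop rows with too few ones."""
    col_counts = X.sum(axis=0)
    top_cols = np.argsort(-col_counts, kind="stable")[:num_cols_kept]
    X = X[:, top_cols]
    keep_rows = X.sum(axis=1) >= min_ones_per_row
    return X[keep_rows]

X, U, V = sample_synthetic(num_rows=2000, num_cols=1000)
print("fraction of ones:", X.mean())
print("shape after pre-processing:", preprocess(X).shape)
```

Under these assumptions the logits u·v have a standard deviation of roughly 224, so a bias of −500 leaves only a small fraction of entries with positive logit, in line with the high sparsity reported above.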
Each matrix is randomly split into a training matrix and a set of test entries with value one. The training matrix is generated by randomly removing a single one from each row in the original matrix and adding it to the test set. Predictive performance is evaluated using recall at N, which is proportional to precision at N when a single one is held out. Recall is a popular metric for recommendation tasks [Gunawardana & Shani, 2009] because it directly measures the ability to find the items a user may like. We iterate over the rows, using (5.4) to compute the probability of each zero entry actually taking value one. We select the top N zero entries with the highest probability in that row. Recall is computed as the fraction of rows for which the test entry appears in this list. We use N = 10 and repeat the experiment 25 times on each small dataset and 10 times on each large one.
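The evaluation protocol can be sketched as below. The `scores` array stands in for the predictive probabilities from (5.4), and the function names are hypothetical; the sketch assumes every row contains at least one one.

```python
import numpy as np

def hold_out_one_per_row(X, rng):
    """Remove one randomly chosen one from each row; return the training matrix and held-out columns."""
    X_train = X.copy()
    test_cols = np.empty(X.shape[0], dtype=int)
    for i in range(X.shape[0]):
        ones = np.flatnonzero(X[i])
        j = rng.choice(ones)
        X_train[i, j] = 0
        test_cols[i] = j
    return X_train, test_cols

def recall_at_n(scores, X_train, test_cols, n=10):
    """Fraction of rows whose held-out entry is among the n highest-scoring training zeros."""
    masked = np.where(X_train == 1, -np.inf, scores)  # rank only the zero entries
    top_n = np.argsort(-masked, axis=1)[:, :n]
    hits = (top_n == test_cols[:, None]).any(axis=1)
    return hits.mean()

# Toy example: 5 rows, 8 columns, three ones per row.
X = np.zeros((5, 8), dtype=int)
for i in range(5):
    X[i, [i, (i + 1) % 8, (i + 2) % 8]] = 1

rng = np.random.default_rng(0)
X_train, test_cols = hold_out_one_per_row(X, rng)
oracle_scores = X.astype(float)  # scores that rank the held-out one first
print(recall_at_n(oracle_scores, X_train, test_cols, n=1))
```

With a single held-out entry per row, each row contributes either zero or one hit, so the metric is the fraction of rows in which the held-out item is recovered.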