Probabilistic models for matrix factorization (MF) assume that a partially observed data matrix X is well approximated by a low-rank matrix UV^T. Normally X is very sparse,
with most elements being unobserved. The objective is then to find the two matrices U and V given X. Probabilistic methods treat each element in U and V as model parameters to be inferred. Fast approximate inference is usually implemented using variational Bayes [Lim & Teh, 2007; Nakajima et al., 2010; Raiko et al., 2007]. The resulting techniques are computationally efficient because their cost depends only on the number of entries observed in X, which is usually low, and not on the size of X, which can be large.
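The cost argument above can be written out explicitly. The following is a sketch in assumed notation that does not appear in the text: n and d for the dimensions of X, k for the rank, u_i and v_j for the rows of U and V, \mathcal{O} for the set of observed entries, and \ell for a generic per-entry loss or negative log-likelihood.

```latex
\begin{align}
  X &\approx U V^{\top},
    \qquad U \in \mathbb{R}^{n \times k},\;
           V \in \mathbb{R}^{d \times k},\;
           k \ll \min(n, d), \\
  x_{ij} &\approx u_i^{\top} v_j
    \qquad \text{for } (i, j) \in \mathcal{O}, \\
  \mathcal{L}(U, V) &= \sum_{(i,j) \in \mathcal{O}} \ell\!\left(x_{ij},\, u_i^{\top} v_j\right)
    \qquad \text{with evaluation cost } O\!\left(|\mathcal{O}|\, k\right).
\end{align}
```

Under these assumptions, one pass over the data costs O(|\mathcal{O}| k) rather than O(n d k), which is the saving referred to above when |\mathcal{O}| is much smaller than n d.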
Many real-world datasets are binary, that is, the entries of X take values in {0, 1}. Common sources of binary data include market basket data [Mild & Reutterer, 2003], click-stream data [Joachims, 2002], network data [Airoldi et al., 2008] and file dependencies in complex software systems [Hu et al., 2010]. However, a binary matrix X is usually fully observed: every entry is either zero or one and there is no 'unobserved' value. For example, in a news portal, we know which articles a user has visited, and which they have not.¹ With fully observed matrices, the aforementioned probabilistic approaches to the MF problem are infeasible in practice because they require looking at the entire matrix before making any adjustments to the parameters.
More specifically, current popular inference methods are based on batch variational algorithms that require processing all the entries in X before producing even a single update to the variational parameters. An alternative is to use a likelihood function for continuous data instead of one for binary data [Nakajima et al., 2010]. In this case, an analytic solution exists which scales with the number of ones in X. However, this solution is restricted to zero-mean spherical priors on U and V, and homoscedastic Gaussian likelihood functions for X. In our experiments, we find that these restrictions lead to poor predictions when X is binary.
We address scalable learning with probabilistic MF models that are flexible enough to produce state-of-the-art predictions on large binary matrices. To meet this challenge we propose an algorithm based upon stochastic inference. Stochastic methods have the advantage that, with large datasets, they can make reasonably accurate predictions before batch algorithms generate a single parameter update. The algorithm is based on a recent technique called stochastic variational inference (SVI) [Hoffman et al.,
2013]. Existing implementations of SVI do not extend directly to MF models, which present specific challenges not encountered in the models this inference algorithm currently addresses, such as topic models. This is because in MF we subsample individual matrix entries instead of complete data instances, such as an entire document in a topic model. In standard SVI, all the variational parameters are updated each time a data instance is subsampled. With matrices, we have different parameters for each row and column in X, and each time we subsample a matrix entry we update only the variational parameters associated with the row and column of that entry. This makes the data subsampling strategy more important, because it determines which parameters are updated and how often. For this reason, we develop a data subsampling strategy with different sampling probabilities across the rows and columns of X. This method significantly outperforms standard uniform subsampling.
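The entry-wise update pattern described above can be illustrated with a deliberately simplified sketch. It is not the chapter's algorithm: it uses point estimates trained by stochastic gradient ascent on a Bernoulli log-likelihood with a logistic link and L2 regularisation (all assumptions), in place of variational updates, and it samples entries uniformly, which is the baseline the proposed non-uniform strategy is compared against. The names `entrywise_sgd`, `mean_log_lik`, `step` and `reg` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def entrywise_sgd(X, rank=2, n_steps=20000, step=0.05, reg=0.01, seed=0):
    """Fit X ~ sigmoid(U V^T) by sampling one matrix entry per step.

    Simplified stand-in for SVI: point estimates, uniform entry sampling.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((d, rank))
    for _ in range(n_steps):
        # Subsample a single entry (i, j) of the fully observed matrix.
        i = rng.integers(n)
        j = rng.integers(d)
        # Gradient of the Bernoulli log-likelihood with respect to the logit.
        err = X[i, j] - sigmoid(U[i] @ V[j])
        u_old = U[i].copy()
        # Only the parameters of row i and column j are touched.
        U[i] += step * (err * V[j] - reg * U[i])
        V[j] += step * (err * u_old - reg * V[j])
    return U, V


def mean_log_lik(X, U, V):
    """Average Bernoulli log-likelihood over all entries of X."""
    P = np.clip(sigmoid(U @ V.T), 1e-12, 1.0 - 1e-12)
    return float(np.mean(X * np.log(P) + (1.0 - X) * np.log(1.0 - P)))
```

Because each step modifies only one row of `U` and one row of `V`, the choice of which entries are sampled determines how often each row and column is updated, which is why the sampling distribution matters more here than in models where every subsampled instance updates all parameters.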
¹ In some domains it may be ambiguous whether a 'zero' corresponds to a negative observation or lack of observation. In these ambiguous cases it is advantageous to treat the zeros as observed, since if they were unobserved the maximum likelihood solution would predict ones everywhere. We return to this point in Section 5.6.
A second challenge for SVI presented by MF is that parameter estimates in MF models often exhibit heavy-tailed empirical distributions [Lakshminarayanan et al., 2011]. These heavy tails can significantly reduce the convergence speed of stochastic algorithms. A solution is to use minibatches to reduce the effect of outliers in the noisy estimates of the gradients. However, the best minibatch size S can be dataset-dependent. To avoid having to hand-tune S to each dataset, which is common practice [Orr & Müller, 1998], we propose a method that adaptively selects the value of S online. With this approach we scale probabilistic MF methods to large binary matrices whilst maintaining strong empirical performance. Experimentally, our algorithm demonstrates faster convergence than batch alternatives [Raiko et al., 2007] and yields more accurate solutions than existing scalable variational methods [Nakajima et al.,
2010; Paquet & Koenigstein, 2013; Seeger & Bouchard, 2012]. The focus of this chapter is on improving the state of the art in probabilistic MF methods, but we also compare to one of the best alternative non-probabilistic techniques for MF [Rendle et al.,
2009]. Encouragingly, our method performs favourably. We can improve upon the state-of-the-art because:
1. We handle fully observed matrices and learn by subsampling individual matrix entries.
2. We use a likelihood function for binary data and not for continuous data.
3. Flexible priors and additional bias parameters may be incorporated easily with our method.
4. We use improved subsampling strategies and automatically select the appropriate minibatch size for the data.
The chapter is organized as follows. In the next section we introduce a model for binary matrices and present our core stochastic variational inference algorithm. We then describe the extensions, including our sampling strategy in Section 5.4.5 and our automatic minibatch size selection strategy in Section 5.4.6. Related literature is discussed in Section 5.4.8, and experiments with a number of real-world binary matrices are presented in Section 5.5. Section 5.6 finishes with a summary, discussion and extensions.