Understanding the Matrix Factorization Family

3.6 Latent Factor Models

3.6.7 Understanding the Matrix Factorization Family

The various forms of matrix factorization in the previous sections clearly share a lot in common. All of the aforementioned optimization formulations minimize the Frobenius norm of the residual matrix $(R - UV^T)$ subject to various constraints on the factor matrices U and V. Note that the goal of the objective function is to make $UV^T$ approximate the ratings matrix R as closely as possible. The constraints on the factor matrices achieve different interpretability properties. In fact, the broader family of matrix factorization models can use any other objective function or constraint to force a good approximation. This broader family can be written as follows:

Optimize J = [Objective function quantifying matching between R and $UV^T$]
subject to:
Constraints on U and V
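As one concrete instance of this template, the regularized unconstrained model can be written out term by term. This is a sketch under notational assumptions that the text above does not fix: S denotes the set of observed entries of R, k is the number of latent factors, λ is the regularization weight, and the factor of 1/2 is only a convenience for differentiation.

```latex
\text{Minimize } J = \frac{1}{2} \sum_{(i,j) \in S} \Big( r_{ij} - \sum_{s=1}^{k} u_{is} v_{js} \Big)^2
  + \frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right)
\quad \text{subject to: no constraints on } U \text{ and } V
```

Replacing the empty constraint set with non-negativity of U and V, while keeping the same objective, gives a different member of the family.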

The objective function of a matrix factorization method is sometimes referred to as the loss function when it is in minimization form. Note that the optimization formulation may be either a minimization or a maximization problem, but the goal of the objective function is always to force R to match $UV^T$ as closely as possible. The Frobenius norm is an example of a minimization objective, and some probabilistic matrix factorization methods use a maximization formulation such as the maximum-likelihood objective function. In most cases, regularizers are added to the objective function to prevent overfitting. The various constraints often impose different types of interpretability on the factors. Two examples of such interpretability are orthogonality (which provides geometric interpretability) and non-negativity (which provides sum-of-parts interpretability). Furthermore, even though constraints increase the error on the observed entries, they can sometimes improve the errors on the unobserved entries when they have a meaningful semantic interpretation. This is because constraints reduce the variance on the unobserved entries while increasing bias. As a result, the model has better generalizability. For example, fixing the entries in a column in each of U and V to ones almost always results in better performance (cf. Section 3.6.4.5). Selecting the right constraints to use is often data-dependent and requires insights into the application domain at hand.
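The interplay between objective, regularizer, and constraint can be sketched in a few lines of NumPy. The following is a minimal illustration, not the book's algorithm: it assumes a squared-error objective over observed entries, plain gradient descent, and a simple projection step (clipping at zero) to enforce non-negativity. The function names, the toy matrix, and all hyperparameter values are hypothetical.

```python
import numpy as np

def factorize(R, mask, k=2, lam=0.1, lr=0.01, steps=2000, nonneg=False, seed=0):
    """Gradient descent on regularized squared error over observed entries.

    R: (m, n) ratings; mask: (m, n) boolean, True where a rating is observed.
    nonneg=True clips U and V at zero after each step (projected gradient).
    """
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = rng.uniform(0.0, 0.5, size=(m, k))
    V = rng.uniform(0.0, 0.5, size=(n, k))
    for _ in range(steps):
        E = np.where(mask, R - U @ V.T, 0.0)  # residuals on observed entries only
        grad_U = -E @ V + lam * U             # both gradients use the current U, V
        grad_V = -E.T @ U + lam * V
        U = U - lr * grad_U
        V = V - lr * grad_V
        if nonneg:
            U = np.maximum(U, 0.0)
            V = np.maximum(V, 0.0)
    return U, V

def observed_loss(R, mask, U, V, lam=0.1):
    """Regularized squared error, counting observed entries only."""
    E = np.where(mask, R - U @ V.T, 0.0)
    return 0.5 * np.sum(E ** 2) + 0.5 * lam * (np.sum(U ** 2) + np.sum(V ** 2))

# Hypothetical toy ratings matrix; two entries are treated as unobserved.
R = np.array([[5.0, 4.0, 1.0, 1.0],
              [4.0, 5.0, 1.0, 2.0],
              [1.0, 1.0, 5.0, 4.0],
              [2.0, 1.0, 4.0, 5.0]])
mask = np.ones_like(R, dtype=bool)
mask[0, 3] = False
mask[2, 1] = False

U_free, V_free = factorize(R, mask)             # unconstrained
U_nn, V_nn = factorize(R, mask, nonneg=True)    # non-negativity constraint
print(observed_loss(R, mask, U_free, V_free), observed_loss(R, mask, U_nn, V_nn))
```

Comparing the two printed losses on the observed entries, and separately the predictions at the two held-out positions, is one way to see the bias-variance trade-off described above on a given data set.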

Other forms of factorization exist in which one can assign probabilistic interpretability to the factors. For example, consider a scenario in which a non-negative unary ratings matrix R is treated as a relative frequency distribution, whose entries sum to 1.

$$\sum_{i=1}^{m} \sum_{j=1}^{n} r_{ij} = 1 \qquad (3.31)$$

Note that it is easy to scale R to sum to 1 by dividing it by the sum of its entries. Such a matrix can be factorized in a similar way to SVD:

$$R \approx (Q_k \Sigma_k) P_k^T = UV^T$$

As in SVD, the diagonal matrix $\Sigma_k$ is absorbed in the user factor matrix $U = Q_k \Sigma_k$, and the item factor matrix V is set to $P_k$. The main difference from SVD is that the columns of $Q_k$ and $P_k$ are not orthogonal; instead, their entries are non-negative and each column sums to 1. Furthermore, the entries of the diagonal matrix $\Sigma_k$ are non-negative and they also sum to 1.
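These constraints guarantee that the product $(Q_k \Sigma_k) P_k^T$ is itself a valid relative frequency distribution. The following sketch checks this numerically on randomly generated factors; the sizes, the random factors, and the helper name `random_column_stochastic` are illustrative assumptions rather than anything learned from data.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 4, 5, 2  # hypothetical numbers of users, items, and latent factors

def random_column_stochastic(rows, cols, rng):
    """Random non-negative matrix in which each column sums to 1."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0, keepdims=True)

Q = random_column_stochastic(m, k, rng)  # plays the role of Q_k (m x k)
P = random_column_stochastic(n, k, rng)  # plays the role of P_k (n x k)
sigma = rng.random(k)
sigma = sigma / sigma.sum()              # diagonal of Sigma_k, sums to 1

U = Q * sigma    # equals Q @ np.diag(sigma): Sigma_k absorbed into U
V = P
R_hat = U @ V.T

# Sum over all entries:
#   sum_s sigma_s * (sum_i Q[i, s]) * (sum_j P[j, s]) = sum_s sigma_s = 1
print(R_hat.sum())
```

Because every factor is non-negative and the three normalizations hold, every entry of `R_hat` is non-negative and the entries sum to 1 up to floating-point error.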


Table 3.3: The family of matrix factorization methods

| Method | Constraints | Objective | Advantages/Disadvantages |
|---|---|---|---|
| Unconstrained | No constraints | Frobenius + regularizer | Highest quality solution; good for most matrices; regularization prevents overfitting; poor interpretability |
| SVD | Orthogonal basis | Frobenius + regularizer | Good visual interpretability; out-of-sample recommendations; good for dense matrices; poor semantic interpretability; suboptimal in sparse matrices |
| Max. Margin | No constraints | Hinge loss + margin regularizer | Highest quality solution; resists overfitting; similar to unconstrained; poor interpretability; good for discrete ratings |
| NMF | Non-negativity | Frobenius + regularizer | Good quality solution; high semantic interpretability; loses interpretability with both like/dislike ratings; less overfitting in some cases; best for implicit feedback |
| PLSA | Non-negativity | Maximum likelihood + regularizer | Good quality solution; high semantic interpretability; probabilistic interpretation; loses interpretability with both like/dislike ratings; less overfitting in some cases; best for implicit feedback |

Such a factorization has a probabilistic interpretation; the matrices $Q_k$, $P_k$, and $\Sigma_k$ contain the probabilistic parameters of a generative process that creates the ratings matrix. The objective function learns the parameters of this generative process so that the likelihood of the generative process creating the ratings matrix is as large as possible. Therefore, the objective function is in maximization form. This method is referred to as Probabilistic Latent Semantic Analysis (PLSA), and it can be viewed as a probabilistic variant of non-negative matrix factorization. Clearly, the probabilistic nature of this factorization provides it with a different type of interpretability. A detailed discussion of PLSA may be found in [22]. In many of these formulations, optimization techniques such as gradient descent (or ascent) are helpful. Therefore, most of these methods use very similar ideas in terms of formulating the optimization problem and the underlying solution methodology.
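A common way to maximize this likelihood is expectation-maximization (EM). The sketch below is one possible illustration and is not taken from the text, which mentions gradient ascent and refers to [22] for details. It assumes the generative model in which entry (i, j) is produced by picking a latent factor s with probability given by the s-th diagonal entry of $\Sigma_k$, then a user from column s of $Q_k$ and an item from column s of $P_k$. The function names and the toy matrix are hypothetical.

```python
import numpy as np

def plsa(R, k, iters=100, seed=0, eps=1e-12):
    """EM sketch for R / R.sum() ~ (Q * sigma) @ P.T with stochastic factors."""
    rng = np.random.default_rng(seed)
    R = R / R.sum()                      # treat R as a relative frequency distribution
    m, n = R.shape
    Q = rng.random((m, k))
    Q /= Q.sum(axis=0)                   # each column of Q sums to 1
    P = rng.random((n, k))
    P /= P.sum(axis=0)                   # each column of P sums to 1
    sigma = np.full(k, 1.0 / k)          # uniform prior over latent factors
    for _ in range(iters):
        # E-step: posterior over factor s for every cell (i, j)
        joint = Q[:, None, :] * P[None, :, :] * sigma[None, None, :]  # (m, n, k)
        post = joint / (joint.sum(axis=2, keepdims=True) + eps)
        # M-step: re-estimate parameters from frequency-weighted posteriors
        W = R[:, :, None] * post
        Q = W.sum(axis=1)
        P = W.sum(axis=0)
        sigma = W.sum(axis=(0, 1))
        Q = Q / (Q.sum(axis=0, keepdims=True) + eps)
        P = P / (P.sum(axis=0, keepdims=True) + eps)
        sigma = sigma / sigma.sum()
    return Q, sigma, P

def log_likelihood(R, Q, sigma, P, eps=1e-12):
    """Log-likelihood of the observed frequencies under the fitted model."""
    R_hat = (Q * sigma) @ P.T
    return np.sum(R * np.log(R_hat + eps))

# Hypothetical unary (frequency) matrix with two visible blocks of users and items.
R = np.array([[3.0, 2.0, 0.0, 0.0],
              [2.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 4.0, 1.0],
              [0.0, 0.0, 1.0, 4.0]])
Q, sigma, P = plsa(R, k=2)
print(log_likelihood(R, Q, sigma, P))
```

Each EM iteration is designed not to decrease the likelihood, which is the maximization-form counterpart of a descent step on a loss function.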

Similarly, maximum margin factorization [180, 500, 569, 624] borrows ideas from support vector machines to add a maximum margin regularizer to the objective function, and some of its variants [500] are particularly effective for discrete ratings. This approach shares a number of conceptual similarities with the regularized matrix factorization method discussed in Section 3.6.4. In fact, the maximum margin regularizer is not very different from that used in unconstrained matrix factorization. However, hinge loss is used to quantify the errors in the approximation, rather than the Frobenius norm. While it is beyond the scope of this book to discuss these variants in detail, a discussion may be found in [500, 569]. The focus on maximizing the margin often provides higher quality factorization than some of the other models in the presence of overfitting-prone data.

Table 3.3 lists various factorization models and their characteristics. In most cases, the addition of constraints such as non-negativity can reduce the quality of the underlying solution on the observed entries, because it reduces the space of feasible solutions. This is the reason that unconstrained and maximum margin factorization are expected to have the highest quality of global optima. Nevertheless, since the global optimum cannot be easily found in most cases by the available (iterative) methods, a constrained method can sometimes perform better than an unconstrained method. Furthermore, the accuracy over observed entries may be different from that over unobserved entries because of the effects of overfitting. In fact, non-negativity constraints can sometimes improve the accuracy over unobserved entries in some domains. Some forms of factorization, such as NMF, cannot be applied to matrices with negative entries. Clearly, the choice of the model depends on the problem setting, the noise in the data, and the desired level of interpretability. There is no single solution that can achieve all these goals. A careful understanding of the problem domain is important for choosing the correct model.
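The difference between hinge loss and squared error, mentioned above for maximum margin factorization, is easiest to see on individual predictions. The sketch below assumes the simplest possible setting of binary ratings coded as +1 (like) and -1 (dislike); the variants for multi-level discrete ratings in [500] are more involved and are not reproduced here. The function names and the sample values are hypothetical.

```python
import numpy as np

def squared_loss(r, r_hat):
    """Per-entry squared error, as in the Frobenius-norm objective."""
    return (r - r_hat) ** 2

def hinge_loss(r, r_hat):
    """Per-entry hinge loss for ratings r in {-1, +1}."""
    return np.maximum(0.0, 1.0 - r * r_hat)

r = np.array([1.0, 1.0, -1.0, -1.0])       # observed binary ratings
r_hat = np.array([3.0, 0.5, -2.0, 0.5])    # hypothetical predicted values

print(squared_loss(r, r_hat))
print(hinge_loss(r, r_hat))
```

A prediction of 3.0 for a rating of +1 is penalized by the squared error for overshooting, whereas hinge loss assigns it zero loss because it lies on the correct side of the margin. Only predictions inside the margin or on the wrong side contribute to the hinge objective.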

In document Recommender Systems the Textbook (Page 148-150)