HOW DO THEY WORK?

Recommendation Systems

We will describe the basic mathematical setting underlying most recommendation systems. Let the ratings be arranged as a matrix $X$ with $N_u$ rows and $N_i$ columns, where $N_u$ is the number of users and $N_i$ is the number of items to be rated. The element $x_{ui}$ of $X$ is the rating given by user $u$ to item $i$. Typically, we know the values of only a few elements of $X$. Let $D$ denote the known ratings, stored in the form of tuples $(u_1, i_1, x_{u_1 i_1}), \ldots, (u_n, i_n, x_{u_n i_n})$, as illustrated in Table 9.1. We assume there are $n$ known ratings.
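As a concrete illustration of this setting, the sketch below stores a handful of made-up ratings as tuples and fills in the corresponding entries of $X$. The toy values, the 0-based indices, and the use of NaN for unknown entries are assumptions chosen for the example, not part of the text.

```python
import numpy as np

# Hypothetical toy data set D: (user, item, rating) tuples, 0-based indices.
D = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
N_u, N_i = 3, 3  # number of users and items

# Dense view of X; NaN marks the entries whose rating is unknown.
X = np.full((N_u, N_i), np.nan)
for u, i, x in D:
    X[u, i] = x

n = len(D)                          # number of known ratings
known_fraction = n / (N_u * N_i)    # share of X that is observed
```

In realistic problems the observed share is tiny, which is why the tuple form is usually the one kept in storage.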

Baseline Model

A basic, “baseline” model is given by

$$x_{ui} = b_0 + b_u + b_i$$

where $b_0$, $b_u$, and $b_i$ are the global, user, and item bias terms, respectively.

The $b_0$ term corresponds to the overall average rating, while $b_u$ is the amount by which user $u$ deviates, on average, from $b_0$. The same applies to $b_i$ for item $i$. The goal is to estimate $b_0$, $b_u$, and $b_i$ for all $u$ and $i$, given $D$.

The bias terms can be estimated by solving the least squares optimization problem

$$\min_{b_u,\, b_i} \sum_{(u,i) \in D} (x_{ui} - b_0 - b_u - b_i)^2 + \lambda \left( \sum_u b_u^2 + \sum_i b_i^2 \right)$$

with $b_0$ set to the average of all known $x_{ui}$. The first term is the sum of squared errors between the observed ratings and the ratings predicted by the model. The second term is a regularization penalty designed to discourage overly large biases, which are associated with overfitting. The parameter $\lambda$ controls the amount of regularization.

Figure 9.1 Factor Matrix

Table 9.1 Data Set D of Known Ratings

u_1    i_1    x_{u_1 i_1}
...    ...    ...
u_n    i_n    x_{u_n i_n}

166 ▸ BIG DATA, DATA MINING, AND MACHINE LEARNING
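The baseline objective can be minimized in several ways; the text does not say which solver is intended. The sketch below is one possible approach, assuming alternating closed-form updates: with the item biases held fixed, setting the derivative to zero gives $b_u = \sum_i (x_{ui} - b_0 - b_i) / (\lambda + n_u)$, where $n_u$ is the number of ratings by user $u$, and symmetrically for $b_i$. The function names, the toy data, and the default values are hypothetical.

```python
import numpy as np

def fit_baseline(D, n_users, n_items, lam=1.0, n_sweeps=20):
    """Estimate b0, b_u, b_i by alternating closed-form updates (lam > 0)."""
    users = np.array([u for u, _, _ in D])
    items = np.array([i for _, i, _ in D])
    x = np.array([r for _, _, r in D], dtype=float)
    b0 = x.mean()                                   # global average rating
    bu, bi = np.zeros(n_users), np.zeros(n_items)
    n_u = np.bincount(users, minlength=n_users)     # ratings per user
    n_i = np.bincount(items, minlength=n_items)     # ratings per item
    for _ in range(n_sweeps):
        # Exact minimizer over user biases with item biases fixed.
        bu = np.bincount(users, weights=x - b0 - bi[items],
                         minlength=n_users) / (lam + n_u)
        # Exact minimizer over item biases with user biases fixed.
        bi = np.bincount(items, weights=x - b0 - bu[users],
                         minlength=n_items) / (lam + n_i)
    return b0, bu, bi

def baseline_objective(D, b0, bu, bi, lam):
    """Sum of squared errors plus the regularization penalty."""
    sse = sum((x - b0 - bu[u] - bi[i]) ** 2 for u, i, x in D)
    return sse + lam * (np.sum(bu ** 2) + np.sum(bi ** 2))

# Hypothetical toy ratings: (user, item, rating), 0-based indices.
D = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
b0, bu, bi = fit_baseline(D, n_users=3, n_items=3, lam=1.0)
prediction = b0 + bu[0] + bi[1]   # predicted rating of user 0 for item 1
```

Because each half-step solves its subproblem exactly, the objective cannot increase from one sweep to the next.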

Low‐Rank Matrix Factorization

A more flexible model is given by low‐rank matrix factorization. Consider the following inner product:

$$x_{ui} = l_u \cdot r_i$$

where $l_u$ and $r_i$ are vectors of dimension $K$. The aforementioned model can also be expressed by the matrix product

$$X = LR$$

where

rows of $L$ (the “left” factor matrix) = vectors $l_u$ for all users
columns of $R$ (the “right” factor matrix) = vectors $r_i$ for all items

This is depicted in Figure 9.1. In this model, $X$ has rank at most $K$, which is typically much lower than either $N_u$ or $N_i$, hence the name low‐rank matrix factorization.
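To make the factor model concrete, the following sketch builds random factor matrices and checks that a single predicted rating is the inner product of a row of $L$ with a column of $R$. The sizes and the random values are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_u, N_i, K = 4, 5, 2             # hypothetical sizes, K << N_u, N_i in practice

L = rng.normal(size=(N_u, K))     # row u is the user vector l_u
R = rng.normal(size=(K, N_i))     # column i is the item vector r_i

X_hat = L @ R                     # all predicted ratings at once
u, i = 1, 3
single = L[u] @ R[:, i]           # x_ui = l_u . r_i for one pair

rank = np.linalg.matrix_rank(X_hat)
n_params = K * (N_u + N_i)        # parameters in the factors vs. N_u * N_i entries
```

The factors hold $K(N_u + N_i)$ numbers rather than $N_u N_i$, which is where the model's economy comes from when $K$ is small.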

$L$ and $R$ can be estimated by solving the optimization problem:

$$\min_{l_u,\, r_i} \sum_{(u,i) \in D} (x_{ui} - l_u \cdot r_i)^2 + \lambda \left( \sum_u \|l_u\|^2 + \sum_i \|r_i\|^2 \right) \qquad (1)$$

The second term guards against overfitting by penalizing the norms of the factors. Two popular methods for solving this problem are stochastic gradient descent and alternating least squares.

Stochastic Gradient Descent

Stochastic gradient descent starts from an initial guess for each $l_u$ and $r_i$ and proceeds by updating in the direction of the negative gradient of the objective, according to

$$l_u \leftarrow l_u + \eta\,(e_{ui}\, r_i - \lambda\, l_u)$$

$$r_i \leftarrow r_i + \eta\,(e_{ui}\, l_u - \lambda\, r_i)$$


where

$e_{ui} \triangleq x_{ui} - l_u \cdot r_i$ = prediction error for the rating associated with the pair $(u, i)$

$\eta$ = user‐defined learning step size

The updates are made one rating at a time, with each pair $(u, i)$ selected uniformly at random, without replacement, from the data set $D$. This is the reason behind the name “stochastic.” Once a full pass through $D$ (i.e., an epoch) is completed, the algorithm begins a new pass through the same examples, in a different random order. This is repeated until convergence, which typically requires multiple epochs.
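The procedure described above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names, the initialization scale, the default step size, and the fixed epoch count (used in place of a convergence test) are all assumptions. The error is taken as $e_{ui} = x_{ui} - l_u \cdot r_i$, and the constant factor 2 from the gradient is absorbed into $\eta$.

```python
import numpy as np

def sgd_mf(D, n_users, n_items, K=2, eta=0.05, lam=0.01, n_epochs=200, seed=0):
    """Stochastic gradient descent for low-rank matrix factorization."""
    rng = np.random.default_rng(seed)
    L = 0.1 * rng.normal(size=(n_users, K))   # row u holds l_u
    R = 0.1 * rng.normal(size=(n_items, K))   # row i holds r_i (R stored transposed)
    for _ in range(n_epochs):
        # One epoch: visit every known rating once, in a fresh random order.
        for idx in rng.permutation(len(D)):
            u, i, x = D[idx]
            e = x - L[u] @ R[i]               # prediction error e_ui
            l_old = L[u].copy()               # use the pre-update l_u for r_i
            L[u] += eta * (e * R[i] - lam * L[u])
            R[i] += eta * (e * l_old - lam * R[i])
    return L, R

def rmse(D, L, R):
    """Root mean squared error over the known ratings."""
    return float(np.sqrt(np.mean([(x - L[u] @ R[i]) ** 2 for u, i, x in D])))

# Hypothetical toy ratings: (user, item, rating), 0-based indices.
D = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
L0, R0 = sgd_mf(D, 3, 3, n_epochs=0)          # initial guess only
L, R = sgd_mf(D, 3, 3)                        # after training
```

Comparing `rmse(D, L0, R0)` with `rmse(D, L, R)` shows how far the fit on the known ratings has moved from the initial guess.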

Stochastic gradient descent does not require the full data set to be stored in memory, which is an important advantage when $D$ is very large.

Alternating Least Squares

Another well‐known method is alternating least squares, which proceeds in two steps. First, it solves Equation 1 with respect to the left factors $l_u$ by fixing the right factors $r_i$; then it solves for the $r_i$ with the $l_u$ fixed. Each of these steps can be tackled by a standard least squares solver. Define $n_u$ as the number of ratings from user $u$ and $R[u]$ as the …

Alternating least squares has the advantage of being simpler to parallelize than stochastic gradient descent. It has the disadvantage of larger memory requirements: the entire data set needs to be stored in memory, which can be an issue for large $D$.
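A minimal sketch of the two alternating steps is given below. It assumes the plain penalty of Equation 1, so each user's subproblem is an ordinary ridge regression solved with `numpy.linalg.solve`; other variants scale the penalty per user, and this sketch does not attempt to reproduce any such formula. All function names and default values are hypothetical.

```python
import numpy as np

def solve_side(ratings_by_row, other, lam, K):
    """Ridge solve for each row factor while the other side is held fixed."""
    out = np.zeros((len(ratings_by_row), K))
    for a, pairs in enumerate(ratings_by_row):
        if not pairs:
            continue                          # no ratings: zero vector is optimal
        idx = [b for b, _ in pairs]
        x = np.array([r for _, r in pairs])
        A = other[idx]                        # factors of the rated counterparts
        out[a] = np.linalg.solve(A.T @ A + lam * np.eye(K), A.T @ x)
    return out

def als_mf(D, n_users, n_items, K=2, lam=0.1, n_iters=20, seed=0):
    """Alternating least squares for low-rank matrix factorization (lam > 0)."""
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for u, i, x in D:
        by_user[u].append((i, x))
        by_item[i].append((u, x))
    R = np.random.default_rng(seed).normal(size=(n_items, K))  # row i holds r_i
    L = np.zeros((n_users, K))
    for _ in range(n_iters):
        L = solve_side(by_user, R, lam, K)    # step 1: fix r_i, solve for l_u
        R = solve_side(by_item, L, lam, K)    # step 2: fix l_u, solve for r_i
    return L, R

def mf_objective(D, L, R, lam):
    """Regularized squared error of Equation 1."""
    sse = sum((x - L[u] @ R[i]) ** 2 for u, i, x in D)
    return sse + lam * (np.sum(L ** 2) + np.sum(R ** 2))

# Hypothetical toy ratings: (user, item, rating), 0-based indices.
D = [(0, 0, 5.0), (0, 2, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 2.0)]
L1, R1 = als_mf(D, 3, 3, n_iters=1)
L, R = als_mf(D, 3, 3, n_iters=20)
```

The per-user solves inside `solve_side` are independent of one another, which is the property that makes the method straightforward to parallelize.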

Restricted Boltzmann Machines

There exist many other recommendation approaches besides low‐rank matrix factorization. In this section we address restricted Boltzmann machines (RBMs), due to their growing popularity and the fact that they represent a distinctive and highly competitive approach. An RBM is a two‐layer neural network with stochastic units. The name “restricted” comes from the fact that the units must be arranged in a bipartite graph, as shown in Figure 9.2. Units in the visible layer are connected only to units in the hidden layer, and vice versa. The connections are undirected, meaning that the network can operate in both directions, with the hidden units exciting the visible units and vice versa.

In the recommendation setting, there exists a separate RBM for each user u. Assuming that user u has rated m items, there are m visible units in the corresponding network.

The output of the $k$‐th hidden unit is typically binary and is denoted $h_k$ with $k = 1, \ldots, K$. The output of the $i$‐th visible unit is $v_i$ with $i = 1, \ldots, m$. For recommendation, $v_i$ is usually ordinal with discrete values between 1 and $Q$. The $q$‐th value of $v_i$, denoted $v_i^q$, has a certain probability of being activated (i.e., the network estimates a probability distribution over all possible values of $v_i$). The connection between the $k$‐th hidden unit and the $q$‐th value of the $i$‐th visible unit is associated with a weight $w_{ki}^q$. To avoid clutter, bias terms are not shown in Figure 9.2. The dependence on user $u$ is also omitted.

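The units are stochastic in the sense that the output obeys a probability distribution conditioned on the input rather than being a deterministic function of the input. The hidden units are binary with

$$p(h_k = 1 \mid V) = \sigma\!\left( b_k + \sum_{i=1}^{m} \sum_{q=1}^{Q} v_i^q\, w_{ki}^q \right)$$

where $\sigma(a) = 1 / (1 + e^{-a})$ is the logistic function and $b_k$ is the bias term of the $k$‐th hidden unit.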

Figure 9.2 Restricted Boltzmann Machine for One User


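The ordinal visible units follow a softmax rule:

$$p(v_i^q = 1 \mid h) = \frac{\exp\!\left( b_i^q + \sum_{k=1}^{K} h_k\, w_{ki}^q \right)}{\sum_{l=1}^{Q} \exp\!\left( b_i^l + \sum_{k=1}^{K} h_k\, w_{ki}^l \right)}$$

with bias term $b_i^q$. All user‐specific networks share connection weights and bias terms when multiple users rate the same item, but both hidden and visible units have distinct states for each user.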
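The two conditional distributions can be sketched in a few lines. The array layout is an assumption made for this example: the visible layer of one user is a one-hot matrix `V` of shape (m, Q), the weights `W` have shape (K, m, Q), and the biases are `b_hid` of shape (K,) and `b_vis` of shape (m, Q). The function names and random values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def p_hidden_given_visible(V, W, b_hid):
    """p(h_k = 1 | V) for every hidden unit k."""
    return sigmoid(b_hid + np.einsum("kiq,iq->k", W, V))

def p_visible_given_hidden(h, W, b_vis):
    """Softmax over the Q rating values of every visible unit i."""
    a = b_vis + np.einsum("k,kiq->iq", h, W)
    a = a - a.max(axis=1, keepdims=True)      # stabilize the exponentials
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
K, m, Q = 3, 4, 5                             # hypothetical sizes
W = 0.1 * rng.normal(size=(K, m, Q))
b_hid, b_vis = np.zeros(K), np.zeros((m, Q))

ratings = np.array([5, 3, 1, 4])              # one user's ratings of m items
V = np.eye(Q)[ratings - 1]                    # one-hot encoding, shape (m, Q)

p_h = p_hidden_given_visible(V, W, b_hid)
h = (rng.random(K) < p_h).astype(float)       # sample binary hidden states
p_v = p_visible_given_hidden(h, W, b_vis)
```

Each row of `p_v` is a distribution over the $Q$ possible ratings of one item, so a point prediction can be read off as the most probable value or the expected value.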

Contrastive Divergence

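The parameters of an RBM are the weights and biases. These are learned by maximizing the marginal log‐likelihood of the visible units, where the marginal likelihood is given by

$$p(V) = \frac{\sum_{h} \exp\!\left( -E(V, h) \right)}{\sum_{V', h'} \exp\!\left( -E(V', h') \right)}$$

with the term $E(V, h)$ representing the “energy” for the network configuration. This has the expression

$$E(V, h) = -\sum_{i=1}^{m} \sum_{k=1}^{K} \sum_{q=1}^{Q} w_{ki}^q\, h_k\, v_i^q + \sum_{i=1}^{m} \log Z_i - \sum_{i=1}^{m} \sum_{q=1}^{Q} v_i^q\, b_i^q - \sum_{k=1}^{K} h_k\, b_k$$

where $Z_i$ is a normalizing constant. Learning is performed by gradient ascent of $\log p(V)$. The updates for the weights are

$$\Delta w_{ki}^q \propto \langle v_i^q h_k \rangle_{\text{data}} - \langle v_i^q h_k \rangle_{\text{model}}$$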

where $\langle \cdot \rangle$ denotes the expectation operator. The biases follow similar updates. The term $\langle v_i^q h_k \rangle_{\text{data}}$ is equal to the frequency with which the binary quantities $h_k$ and $v_i^q$ are simultaneously on when the network is being driven by the data set $D$, meaning that the visible units $v_i^q$ are clamped to the values in the training set. The term $\langle v_i^q h_k \rangle_{\text{model}}$ is the expectation with respect to the distribution of $v_i^q$ defined by the learned model, and it is far harder to calculate. For this reason, an approximation is used. In the Monte Carlo–based contrastive divergence method, the approximation is

$$\langle v_i^q h_k \rangle_{\text{model}} \approx \langle v_i^q h_k \rangle_T$$

where $\langle v_i^q h_k \rangle_T$ denotes the expectation over $T$ steps of a Gibbs sampler.
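A rough sketch of one contrastive divergence estimate for a single user follows. Several choices here are assumptions rather than anything stated in the text: the array shapes (one-hot `V` of shape (m, Q), weights `W` of shape (K, m, Q)), the use of hidden probabilities instead of sampled states when forming the statistics, a single sample in place of an averaged expectation, and the requirement that `T` be at least 1. All names are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sample_hidden(V, W, b_hid, rng):
    """Return p(h_k = 1 | V) and a binary sample of the hidden layer."""
    p = sigmoid(b_hid + np.einsum("kiq,iq->k", W, V))
    return p, (rng.random(p.shape) < p).astype(float)

def sample_visible(h, W, b_vis, rng):
    """Sample a one-hot rating for every visible unit from its softmax."""
    a = b_vis + np.einsum("k,kiq->iq", h, W)
    p = np.exp(a - a.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    V = np.zeros_like(p)
    for i in range(p.shape[0]):
        V[i, rng.choice(p.shape[1], p=p[i])] = 1.0
    return V

def cd_weight_gradient(V_data, W, b_hid, b_vis, T=1, seed=0):
    """Estimate <v h>_data - <v h>_T with T >= 1 Gibbs steps."""
    rng = np.random.default_rng(seed)
    p_h_data, h = sample_hidden(V_data, W, b_hid, rng)   # visibles clamped to data
    positive = np.einsum("k,iq->kiq", p_h_data, V_data)  # <v h>_data
    for _ in range(T):
        V = sample_visible(h, W, b_vis, rng)             # hidden -> visible
        p_h, h = sample_hidden(V, W, b_hid, rng)         # visible -> hidden
    negative = np.einsum("k,iq->kiq", p_h, V)            # <v h>_T
    return positive - negative

rng = np.random.default_rng(1)
K, m, Q = 3, 4, 5                                        # hypothetical sizes
W = 0.1 * rng.normal(size=(K, m, Q))
b_hid, b_vis = np.zeros(K), np.zeros((m, Q))
V_data = np.eye(Q)[np.array([5, 3, 1, 4]) - 1]           # one user's ratings

grad = cd_weight_gradient(V_data, W, b_hid, b_vis, T=1)
W_new = W + 0.01 * grad                                  # 0.01: hypothetical step size
```

In practice the estimate would be averaged over many users before each weight update, and analogous differences would be used for the biases.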

As seen, for example, in the Netflix competition, RBMs tend to perform well for cases where matrix factorization has difficulties, and vice versa. For this reason, a successful approach consists of utilizing both a matrix factorization recommender and an RBM in tandem, thus providing a combined prediction.

In document Additional praise for Big Data, Data Mining, and Machine Learning: Value Creation for Business Leaders and Practitioners (Page 189-194)