Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments


University of Pennsylvania

ScholarlyCommons

Departmental Papers (CIS) Department of Computer & Information Science

August 2001


Alexandrin Popescul

University of Pennsylvania

Lyle H. Ungar

University of Pennsylvania, [email protected]

David M. Pennock

NEC Research Institute

Steve Lawrence

NEC Research Institute

Follow this and additional works at: http://repository.upenn.edu/cis_papers

Postprint version. Published in Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence 2001 (UAI 2001), pages 437-444.

This paper is posted at ScholarlyCommons. http://repository.upenn.edu/cis_papers/137 For more information, please contact [email protected].

Recommended Citation

Alexandrin Popescul, Lyle H. Ungar, David M. Pennock, and Steve Lawrence, "Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments", August 2001.


Abstract

Recommender systems leverage product and community information to target products to consumers. Researchers have developed collaborative recommenders, content-based recommenders, and a few hybrid systems. We propose a unified probabilistic framework for merging collaborative and content-based recommendations. We extend Hofmann’s (1999) aspect model to incorporate three-way co-occurrence data among users, items, and item content. The relative influence of collaboration data versus content data is not imposed as an exogenous parameter, but rather emerges naturally from the given data sources. However, global probabilistic models coupled with standard EM learning algorithms tend to drastically overfit in the sparse-data situations typical of recommendation applications. We show that secondary content information can often be used to overcome sparsity. Experiments on data from the ResearchIndex library of Computer Science publications show that appropriate mixture models incorporating secondary data produce significantly better quality recommenders than k-nearest neighbors (k-NN). Global probabilistic models also allow more general inferences than local methods like k-NN.


In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence (UAI-2001), to appear, Morgan Kaufmann, San Francisco, 2001.

Probabilistic Models for Unified Collaborative and Content-Based Recommendation in Sparse-Data Environments

Alexandrin Popescul and Lyle H. Ungar Department of Computer and Information Science

University of Pennsylvania Philadelphia, PA 19104 [email protected]

[email protected]

David M. Pennock and Steve Lawrence NEC Research Institute

4 Independence Way Princeton, NJ 08540

[email protected] [email protected]


1 INTRODUCTION

The Internet offers tremendous opportunities for mass personalization of commercial transactions. Web businesses ideally strive for global reach, while maintaining the feel of a neighborhood shop where the customers know the owners, and the owners are familiar with the customers and their specific needs. To show a personal face on a massive scale, businesses must turn to automated techniques like so-called recommender systems (Resnick & Varian, 1997). These systems suggest products of interest to consumers based on their explicit and implicit preferences, the preferences of other consumers, and consumer and product attributes. For example, a movie recommender might combine explicit ratings data (e.g., Bob rates X-men a 7 out of 10), implicit data (e.g., Mary purchased Hannibal), user demographic information (e.g., Mary is female), and movie content information (e.g., Mystery Men is a comedy) to make recommendations to specific users.

Traditionally, recommender systems have fallen into two main categories. Collaborative filtering methods utilize explicit or implicit ratings from many users to recommend items to a given user (Breese et al., 1998; Resnick et al., 1994; Shardanand & Maes, 1995). Content-based or information filtering methods make recommendations by matching a user’s query, or other user information, to descriptive product information (Mooney & Roy, 2000; Salton & McGill, 1983). Pure collaborative systems tend to fail when little is known about a user, or when he or she has uncommon interests. On the other hand, content-based systems cannot account for community endorsements; for example, an information filter might recommend The Mexican to a user who likes Brad Pitt and Julia Roberts, even though many like-minded users strongly dislike the film. Several researchers are exploring hybrid collaborative and content-based recommenders to smooth out the disadvantages of each (Basu et al., 1998; Claypool et al., 1999; Good et al., 1999).

In this paper, we propose a generative probabilistic model for combining collaborative and content-based recommendations in a normative manner. The model builds on previous two-way co-occurrence models for information filtering (Hofmann, 1999) and collaborative filtering (Hofmann & Puzicha, 1999). Our model incorporates three-way co-occurrence data by presuming that users are interested in a set of latent topics which in turn “generate” both items and item content information. Model parameters are learned using expectation maximization (EM), so the relative contributions of collaborative and content-based data are determined in a sound statistical manner. When data is extremely sparse, as is typically the case for collaboration data, EM can suffer from overfitting. In Sections 4 and 5, we present two techniques to effectively increase the density of the data by exploiting secondary data. The first uses a similarity measure to fill in the user-item co-occurrence matrix by inferring which items users are likely to have accessed without the system’s knowledge. The second creates an implicit user-content co-occurrence matrix by treating each user’s access to an item as if it were many accesses to all of the pieces of content in the item’s descriptive information. We evaluate these models in the context of a document recommendation system. Specifically, we train and test the models on data from ResearchIndex,¹ an online digital library of Computer Science papers (Lawrence et al., 1999; Bollacker et al., 2000). Section 6 presents empirical results and evaluations. In Section 6.2, we demonstrate the potential ineffectiveness of EM in sparse-data situations, using both ResearchIndex data and synthetic data. In Section 6.3, we show that both of our density-augmenting methods are effective at reducing overfitting and improving predictive accuracy. Our models yield more accurate recommendations than the commonly employed k-nearest neighbors (k-NN) algorithm. Moreover, our global models can produce predictions for any user-item pair, whereas local methods like k-NN are simply incapable of producing meaningful recommendations for many user-item combinations.

2 BACKGROUND AND RELATED WORK

A variety of collaborative filtering algorithms have been designed and deployed. The Tapestry system relied on each user to identify like-minded users manually (Goldberg et al., 1992). GroupLens (Resnick et al., 1994) and Ringo (Shardanand & Maes, 1995), developed independently, were the first to automate prediction. Typical algorithms compute similarity scores between all pairs of users; predictions for a given user are generated by weighting other users’ ratings proportionally to their similarity to the given user. A variety of similarity metrics are possible, including correlation (Resnick et al., 1994), mean-squared difference (Shardanand & Maes, 1995), vector similarity (Breese et al., 1998), or probability that users are of the same type (Pennock et al., 2000b). Other algorithms construct a model of underlying user preferences, from which predictions are inferred. Examples include Bayesian network models (Breese et al., 1998), dependency network models (Heckerman et al., 2000), clustering models (Ungar & Foster, 1998), and models of how people rate items (Pennock et al., 2000b). Collaborative filtering has also been cast as a machine learning problem (Basu et al., 1998; Billsus & Pazzani, 1998; Nakamura & Abe, 1998) and as a list-ranking problem (Cohen et al., 1999; Freund et al., 1998; Pennock et al., 2000a). Singular Value Decomposition (SVD) was used to improve scalability of collaborative filtering systems by dimensionality reduction (Sarwar et al., 2000).

¹ http://researchindex.org/

Pure information filtering systems use only content to make recommendations. For example, search engines recommend web pages with content similar to (e.g., containing) user queries (Salton & McGill, 1983). In contrast to collaborative methods, content-based systems can even recommend new (previously unaccessed) items to users without any history in the system. Mooney & Roy (2000) develop a content-based book recommender using information extraction and machine learning techniques for text categorization.

Several authors suggest methods for combining collaborative filtering with information filtering. Basu et al. (1998) present a hybrid collaborative and content-based movie recommender. Collaborative features (e.g., Bob and Mary like Titanic) are encoded as set-valued attributes. These features are combined with more typical content features (e.g., Traffic is rated R) to inductively learn a binary classifier that separates liked and disliked movies. Also in a movie recommender domain, Good et al. (1999) suggest using content-based software agents to automatically generate ratings to reduce data sparsity. Claypool et al. (1999) employ separate collaborative and content-based recommenders in an online newspaper domain, combining the two predictions using an adaptive weighted average: as the number of users accessing an item increases, the weight of the collaborative component tends to increase. Web hyperlinks and document citations can be thought of as implicit endorsements or ratings. Cohn and Hofmann (2001) combine document content information with this type of connectivity information to identify principal topics and authoritative documents in a collection.

Recommender systems technology is in current use in many Internet commerce applications. For example, the University of Minnesota’s GroupLens and MovieLens² research projects spawned Net Perceptions,³ a successful Internet startup offering personalization and recommendation services. Alexa⁴ is a web browser plug-in that recommends related links based in part on other people’s web surfing habits. A growing number of companies,⁵ including Amazon.com, CDNow.com, and Levis.com, employ or provide recommender system solutions (Schafer et al., 1999). Recommendation tools originally developed at Microsoft Research are now included with the Commerce Edition of Microsoft’s SiteServer,⁶ and are currently in use at multiple sites.

² http://movielens.umn.edu/

³ http://www.netperceptions.com/

⁴ http://www.alexa.com/

⁵ http://www.cis.upenn.edu/~ungar/CF/

⁶ http://www.microsoft.com/siteserver

3 THREE-WAY ASPECT MODEL

Hofmann (1999) proposes an aspect model, a latent class statistical mixture model, for associating word-document co-occurrence data with a set of latent variables. Hofmann and Puzicha (1999) apply the aspect model to user-item co-occurrence data for collaborative filtering. In the context of a document recommender system, users $u \in U$, together with the documents they access $d \in D$, form observations $(u, d)$, which are associated with one of the latent variables $z \in Z$. Conceptually, the latent variables are topics. Users choose among topics according to their interests; topic variables in turn “generate” documents. Users are assumed independent of documents, given the topics. The joint probability distribution over users, topics, and documents is $P(u, z, d) = P(u) P(z \mid u) P(d \mid z)$. An equivalent specification of the joint distribution that treats users and documents symmetrically is $P(u, z, d) = P(z) P(u \mid z) P(d \mid z)$. The joint distribution over just users and documents is

$$P(u, d) = \sum_{z} P(z) P(u \mid z) P(d \mid z).$$

Model parameters are learned using EM (or variants) to find a local maximum of the log-likelihood of the training data. After the model is learned, documents can be ranked for a given user according to $P(d \mid u) \propto P(u, d)$; that is, according to how likely it is that the user will access the corresponding document. Documents with high $P(d \mid u)$ that the user has not yet seen are good candidates for recommendation. Note that the aspect model allows multiple topics per user, unlike most clustering algorithms that assign each user to a single class.

This model is a pure collaborative filtering model; document content is not taken into account. We propose an extension of the aspect model to include three-way co-occurrence data among users, documents, and document content. An observation is a triple $(u, d, w)$ corresponding to an event of a user $u$ accessing document $d$ containing word $w$. Conceptually, users choose (latent) topics $z$, which in turn generate both documents and their content words. Users, documents, and words are assumed independent, given the topics. An asymmetric specification of the joint distribution corresponding to this conceptual viewpoint is $P(u, z, d, w) = P(u) P(z \mid u) P(d \mid z) P(w \mid z)$. Figure 1 depicts this model as a Bayesian network. An equivalent symmetric specification (obtained by reversing the arc from users to topics) is $P(u, z, d, w) = P(z) P(u \mid z) P(d \mid z) P(w \mid z)$. Marginalizing out $z$, we obtain

$$P(u, d, w) = \sum_{z} P(z) P(u \mid z) P(d \mid z) P(w \mid z).$$

Figure 1: Graphical representation of the three-way aspect model. Arcs run from $u$ to $z$ and from $z$ to both $d$ and $w$, labeled with $P(u)$, $P(z \mid u)$, $P(d \mid z)$, and $P(w \mid z)$.

Let $n(u, d, w)$ be the number of times user $u$ “saw” word $w$ in document $d$. That is, $n(u, d, w) = n(u, d) \times n(d, w)$, where $n(u, d)$ is the number of times user $u$ accessed document $d$, and $n(d, w)$ is the number of times word $w$ occurs in document $d$. Given training data of this form, the log-likelihood $L$ of the data is

$$L = \sum_{u, d, w} n(u, d, w) \log P(u, d, w).$$

The corresponding EM algorithm is:

E step:

$$P(z \mid u, d, w) = \frac{P(z) P(u \mid z) P(d \mid z) P(w \mid z)}{\sum_{z'} P(z') P(u \mid z') P(d \mid z') P(w \mid z')}$$

M step:

$$P(u \mid z) \propto \sum_{d, w} n(u, d, w) P(z \mid u, d, w)$$

$$P(d \mid z) \propto \sum_{u, w} n(u, d, w) P(z \mid u, d, w)$$

$$P(w \mid z) \propto \sum_{u, d} n(u, d, w) P(z \mid u, d, w)$$

$$P(z) \propto \sum_{u, d, w} n(u, d, w) P(z \mid u, d, w)$$

The E and M steps are repeated alternately until a local maximum of the log-likelihood is reached.
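The E and M steps above can be sketched in a few lines of plain Python. This is a minimal illustration on toy data, not the implementation used in the paper: the function names (`build_counts`, `fit`, `log_likelihood`), the sparse dictionary layout, the random initialization, and the fixed iteration count are all assumptions made here for readability.

```python
import math
import random


def build_counts(user_doc, doc_word):
    """Sparse n(u, d, w) = n(u, d) * n(d, w), keyed by (u, d, w)."""
    counts = {}
    for (u, d), n_ud in user_doc.items():
        for (d2, w), n_dw in doc_word.items():
            if d2 == d:
                counts[(u, d, w)] = n_ud * n_dw
    return counts


def _normalize(values):
    total = sum(values)
    if total == 0.0:  # guard (assumption): fall back to uniform if all mass vanished
        return [1.0 / len(values)] * len(values)
    return [v / total for v in values]


def _random_dist(rng, size):
    return _normalize([rng.random() + 0.1 for _ in range(size)])


def joint_prob(params, u, d, w):
    """P(u, d, w) = sum_z P(z) P(u|z) P(d|z) P(w|z)."""
    p_z, p_u_z, p_d_z, p_w_z = params
    return sum(p_z[z] * p_u_z[z][u] * p_d_z[z][d] * p_w_z[z][w]
               for z in range(len(p_z)))


def log_likelihood(params, counts):
    """L = sum over observed triples of n(u, d, w) * log P(u, d, w)."""
    return sum(n * math.log(joint_prob(params, u, d, w))
               for (u, d, w), n in counts.items())


def fit(counts, sizes, n_topics, n_iters=30, seed=0):
    """Plain EM; returns (P(z), P(u|z), P(d|z), P(w|z)) as nested lists."""
    n_users, n_docs, n_words = sizes
    rng = random.Random(seed)
    p_z = _random_dist(rng, n_topics)
    p_u_z = [_random_dist(rng, n_users) for _ in range(n_topics)]
    p_d_z = [_random_dist(rng, n_docs) for _ in range(n_topics)]
    p_w_z = [_random_dist(rng, n_words) for _ in range(n_topics)]
    for _ in range(n_iters):
        acc_z = [0.0] * n_topics
        acc_u = [[0.0] * n_users for _ in range(n_topics)]
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_w = [[0.0] * n_words for _ in range(n_topics)]
        for (u, d, w), n in counts.items():
            # E step: posterior P(z | u, d, w) for this observed triple.
            post = _normalize([p_z[z] * p_u_z[z][u] * p_d_z[z][d] * p_w_z[z][w]
                               for z in range(n_topics)])
            # Accumulate expected counts n(u, d, w) * P(z | u, d, w).
            for z in range(n_topics):
                r = n * post[z]
                acc_z[z] += r
                acc_u[z][u] += r
                acc_d[z][d] += r
                acc_w[z][w] += r
        # M step: renormalize the expected counts into distributions.
        p_z = _normalize(acc_z)
        p_u_z = [_normalize(row) for row in acc_u]
        p_d_z = [_normalize(row) for row in acc_d]
        p_w_z = [_normalize(row) for row in acc_w]
    return p_z, p_u_z, p_d_z, p_w_z


# Toy data (hypothetical): 3 users, 3 documents, 4 words.
user_doc = {(0, 0): 1, (0, 1): 2, (1, 2): 1, (2, 2): 3}
doc_word = {(0, 0): 2, (0, 1): 1, (1, 1): 1, (2, 2): 4, (2, 3): 1}
counts = build_counts(user_doc, doc_word)
params = fit(counts, sizes=(3, 3, 4), n_topics=2)
print(round(log_likelihood(params, counts), 3))
```

Only observed triples are visited, since triples with $n(u, d, w) = 0$ contribute nothing to the M-step sums. Tracking `log_likelihood` on held-out triples at each iteration is one way to watch for the overfitting discussed later; held-out triples may need a small floor on the probabilities, which this sketch does not add.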

As in the two-way model, $P(d \mid u) \propto \sum_{w} P(u, d, w)$ is used to recommend documents to users. Both content and collaboration data can influence recommendations. The relative weight of each type of data depends on the nature of the given data; EM automatically exploits whatever data source is most informative.
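As a small illustration of this ranking rule, the sketch below scores documents for one user from already-learned parameters. Because $\sum_{w} P(w \mid z) = 1$, summing $P(u, d, w)$ over words reduces to $\sum_{z} P(z) P(u \mid z) P(d \mid z)$, so the word distributions affect the ranking only through how they shaped the other parameters during training. The function name `rank_documents` and the hand-set parameter values are hypothetical.

```python
def rank_documents(p_z, p_u_z, p_d_z, user, seen=frozenset()):
    """Rank unseen documents by sum_z P(z) P(u|z) P(d|z), highest first."""
    n_topics, n_docs = len(p_z), len(p_d_z[0])
    scores = {
        d: sum(p_z[z] * p_u_z[z][user] * p_d_z[z][d] for z in range(n_topics))
        for d in range(n_docs) if d not in seen
    }
    return sorted(scores, key=scores.get, reverse=True)


# Hand-set parameters for 2 topics, 2 users, 3 documents (illustrative only).
p_z = [0.5, 0.5]
p_u_z = [[0.9, 0.1], [0.1, 0.9]]            # p_u_z[z][u]
p_d_z = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # p_d_z[z][d]

print(rank_documents(p_z, p_u_z, p_d_z, user=0))
print(rank_documents(p_z, p_u_z, p_d_z, user=0, seen={0}))
```

Passing the set of documents a user has already accessed as `seen` keeps them out of the recommendation list.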

Hofmann (1999) proposes a variant of EM called tempered EM (TEM) to help avoid overfitting and improve generalization. TEM makes use of an inverse computational temperature $\beta$. EM is modified by raising the conditionals in the right-hand side of the E step equation to the power $\beta$. TEM starts with $\beta = 1$, and decreases $\beta$ with the rate $\eta < 1$ using $\beta = \beta \times \eta$, when the performance on a held-out portion of the training set deteriorates.
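The tempered E step can be sketched as follows, under the reading that only the conditionals are raised to the inverse temperature (called `beta` here) while the prior $P(z)$ is left untouched. The function names, the cooling factor name `eta`, and the exact form of the held-out test are assumptions for illustration.

```python
def tempered_posterior(p_z, conditionals, beta):
    """P(z | u, d, w) proportional to P(z) * [P(u|z) P(d|z) P(w|z)] ** beta.

    conditionals[z] holds the product P(u|z) P(d|z) P(w|z) for one observation.
    """
    weights = [p * (c ** beta) for p, c in zip(p_z, conditionals)]
    total = sum(weights)
    return [w / total for w in weights]


def next_beta(beta, eta, heldout_ll, previous_heldout_ll):
    """Cool beta by the factor eta (< 1) only when held-out likelihood drops."""
    return beta * eta if heldout_ll < previous_heldout_ll else beta


prior = [0.5, 0.5]
conds = [0.8, 0.2]
print(tempered_posterior(prior, conds, beta=1.0))  # ordinary EM posterior
print(tempered_posterior(prior, conds, beta=0.5))  # flatter posterior
```

With `beta = 1` this is the ordinary E step; as `beta` shrinks toward 0 the posterior is pulled toward the prior, which is what damps overfitting.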

In Section 6.2, we see that even TEM fails to generalize when data is extremely sparse. In the next two sections, we propose two methods that effectively increase data density, thereby improving learning performance.

4 SIMILARITY-BASED DATA SMOOTHING

One approach to overcoming the overfitting problem with sparse data is to use the similarity between items to smooth the co-occurrence data matrix. The co-occurrence matrix contains integer entries that are the number of times the corresponding row and column items co-occur in the observed data set. Similarity between items in the database can be used to fill some zeros in the co-occurrence data matrix, thus reducing sparsity and helping to address overfitting.

Consider a user $u$ who has accessed document $d_1$ once, and assume there exists a document $d_2$ that has not been accessed by $u$, and that documents $d_1$ and $d_2$ are very similar in content (e.g., they share many words in common). Consider a similarity metric which yields $\mathrm{sim}(d_1, d_2) = 0.7$. Informally, we may believe that there is a 70% chance that user $u$ actually has seen document $d_2$, even though the system does not know it. Using this reasoning, we propose to preprocess the initial co-occurrence data matrix by filling in some of the zeros with the aggregate similarity between the corresponding document and the documents definitely seen by user $u$. The co-occurrence matrix will no longer be integer-valued, but may also contain similarity values which range between 0 and 1. The EM algorithm used in the original aspect model also converges in this situation.

The most frequently used similarity measure in information retrieval is vector-space cosine similarity (Salton & McGill, 1983). Each document is viewed as a vector whose dimensions correspond to words in the vocabulary; the component magnitudes are the tf-idf weights of the words. Tf-idf is the product of term frequency $\mathrm{tf}(w, d)$, the number of times word $w$ occurs in the corresponding document $d$, and inverse document frequency

$$\mathrm{idf}(w) = \log \frac{|D|}{\mathrm{df}(w)},$$

where $|D|$ is the number of documents in a collection and $\mathrm{df}(w)$ is the number of documents in which word $w$ occurs at least once. The similarity between two documents is then

$$\mathrm{sim}(\mathbf{v}_i, \mathbf{v}_j) = \frac{\mathbf{v}_i \cdot \mathbf{v}_j}{\|\mathbf{v}_i\| \, \|\mathbf{v}_j\|},$$

where $\mathbf{v}_i$ and $\mathbf{v}_j$ are vectors with tf-idf coordinates as described above.
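A minimal sketch of tf-idf cosine similarity over tokenized documents is shown below. It assumes documents arrive as lists of tokens and uses the natural logarithm for idf; the base is not stated in the text, and it does not affect the cosine because it rescales every coordinate by the same factor. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {word: tf * idf} dict per document."""
    n_docs = len(docs)
    # df(w): number of documents in which w occurs at least once.
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(n_docs / df[w]) for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a document made only of words found everywhere has no direction
    return dot / (norm_a * norm_b)


docs = [
    ["aspect", "model", "em"],
    ["aspect", "model", "em"],
    ["boosting", "ranking"],
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

Words that occur in every document receive an idf of zero and therefore drop out of the comparison.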

In our setting, the user-document co-occurrence data matrix is smoothed by replacing zero entries with average similarities above a certain threshold between the corresponding document and all documents that the user has accessed. This effectively increases the density (i.e., the fraction of non-zero entries) in the matrix. Figure 2 shows how the density of the ResearchIndex data (described in detail in Section 6.1) changes depending on the similarity threshold used in smoothing.
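The smoothing step can be sketched as below. One reading of the description is assumed here: for each zero entry, the similarities between that document and every document the user has accessed are averaged, and the entry is filled with the average only if it reaches the threshold. The function names and the dense list-of-lists layout are illustrative choices.

```python
def smooth_counts(counts, sim, threshold):
    """Fill zero entries of a user-by-document matrix with average similarity.

    counts[u][d] is the access count; sim[d][e] is a document-document similarity.
    """
    smoothed = [row[:] for row in counts]
    for u, row in enumerate(counts):
        accessed = [d for d, n in enumerate(row) if n > 0]
        if not accessed:
            continue  # nothing to compare against for this user
        for d, n in enumerate(row):
            if n == 0:
                avg = sum(sim[d][a] for a in accessed) / len(accessed)
                if avg >= threshold:
                    smoothed[u][d] = avg
    return smoothed


def density(matrix):
    """Fraction of non-zero entries."""
    cells = sum(len(row) for row in matrix)
    return sum(1 for row in matrix for v in row if v != 0) / cells


counts = [[1, 0, 0],
          [0, 0, 2]]
sim = [[1.0, 0.7, 0.1],
       [0.7, 1.0, 0.2],
       [0.1, 0.2, 1.0]]
smoothed = smooth_counts(counts, sim, threshold=0.5)
print(smoothed, density(counts), density(smoothed))
```

Lowering the threshold fills more zeros and raises the density, at the cost of admitting weaker evidence.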

Figure 2: Density of the data against the similarity threshold used in smoothing.

5 IMPLICIT USER-WORDS ASPECT MODEL

As another method to overcome overfitting due to sparsity, we propose a model where the co-occurrence data points represent events corresponding to users looking at words in a particular document. The concept of a document is removed to create observations $(u, w)$. Sparsity is drastically reduced because documents contain many words, and many words are contained in multiple documents.

In this case, the aspect model produces estimates of conditional probabilities $P(u \mid z)$ and $P(w \mid z)$, as well as the latent class variable priors $P(z)$, allowing us to compute

$$P(u, w) = \sum_{z} P(z) P(u \mid z) P(w \mid z).$$

But we are still interested in estimating probabilities $P(d \mid u)$ to produce recommendations of the papers that have the highest scores on the $P(d \mid u)$ scale for a given user $u$. By assuming conditional independence of words in a document, we can overcome this problem by treating a document as a bag of words: the probability of a document is the product of the probabilities of the words it contains, adjusted for different document lengths with the geometric mean:

$$P(d \mid u) \propto \left( \prod_{i} P(w_i \mid u) \right)^{1/|d|},$$

where $w_i$ are words in $d$ and $|d|$ is the length of $d$. Conditional probabilities $P(w_i \mid u)$ follow directly from the model:

$$P(w_i \mid u) = \frac{P(u, w_i)}{\sum_{w} P(u, w)}.$$

Inclusion of words through documents, and eliminating documents from direct participation in modeling, increased the density of our dataset (described below) from 0.38% to almost 9%.
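The document scoring used with the user-words model can be sketched as follows: $P(w \mid u)$ is obtained by normalizing $P(u, w)$ over the vocabulary, and a document's score is the geometric mean of $P(w_i \mid u)$ over its word occurrences, computed in log space. The function names and the hand-set parameters are hypothetical, and the sketch assumes every word has non-zero probability for the user (no extra smoothing is applied).

```python
import math


def word_given_user(p_z, p_u_z, p_w_z, user):
    """P(w | u) = P(u, w) / sum_w' P(u, w'), with P(u, w) = sum_z P(z) P(u|z) P(w|z)."""
    n_topics, n_words = len(p_z), len(p_w_z[0])
    joint = [sum(p_z[z] * p_u_z[z][user] * p_w_z[z][w] for z in range(n_topics))
             for w in range(n_words)]
    total = sum(joint)
    return [p / total for p in joint]


def document_score(doc_words, p_w_u):
    """Geometric mean of P(w_i | u) over the word occurrences in a document."""
    log_sum = sum(math.log(p_w_u[w]) for w in doc_words)
    return math.exp(log_sum / len(doc_words))


# Hand-set parameters: 2 topics, 2 users, 4 words (illustrative only).
p_z = [0.5, 0.5]
p_u_z = [[0.9, 0.1], [0.1, 0.9]]
p_w_z = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]

p_w_u0 = word_given_user(p_z, p_u_z, p_w_z, user=0)
docs = {"A": [0, 1, 0], "B": [2, 3], "C": [0, 2]}  # documents as lists of word ids
ranking = sorted(docs, key=lambda name: document_score(docs[name], p_w_u0), reverse=True)
print(ranking)
```

Taking the geometric mean rather than the raw product keeps long documents from being penalized simply for containing more words.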

6 RESULTS AND EVALUATION

Section 6.1 describes the ResearchIndex data. In Section 6.2, we examine under what conditions learning occurs at all, by measuring the increase in the log-likelihood of test data as EM proceeds. We find that if data is too sparse, neither EM nor TEM succeeds in significantly increasing the test data log-likelihood over a random initial guess. In Section 6.3, we evaluate the recommendations of our density-augmented models, according to Breese et al.’s (1998) rank scoring metric.

6.1 RESEARCHINDEX DATA

The data for our experiments was taken from ResearchIndex (formerly CiteSeer), the largest freely available database of scientific literature (Lawrence et al., 1999; Bollacker et al., 2000). ResearchIndex catalogs scientific publications available on the web in PostScript and PDF formats. The full text of documents as well as the citations made in them are indexed. ResearchIndex supports keyword-based retrieval and browsing of the database, for example by following the links between papers formed by citations. Document detail page access information was obtained for July to November, 2000 (multiple accesses by the same user were included). Heuristics were used to filter out robots. Words from the first 5 kbytes of the text of each document were extracted.

We used the data from July to October as the training set, and the data from November as the testing set. Due to the rapid growth in usage of ResearchIndex, November accounted for 31% of the total five-month activity. The data included 33,050 unique users accessing the details of 177,232 documents. Density of this dataset was only 0.01%.

We extracted a relatively dense (0.38%) subset of the 1,000 most active users and the 5,000 documents they accessed the most. We believe these very low density levels are typical of many real-world recommendation applications. Experiments reported in this paper were conducted using the relatively dense subset of 1,000 users and 5,000 papers.
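To make these density figures concrete, the short calculation below converts a reported density back into an approximate number of non-zero user-document cells. The reported percentages are rounded, so the results are order-of-magnitude estimates only; the function name is hypothetical.

```python
def implied_nonzero_cells(n_users, n_docs, density_percent):
    """Approximate count of non-zero cells implied by a rounded density figure."""
    return n_users * n_docs * density_percent / 100.0


# Dense subset: 1,000 users x 5,000 documents at 0.38% density.
print(round(implied_nonzero_cells(1000, 5000, 0.38)))
# Full data: 33,050 users x 177,232 documents at 0.01% density.
print(round(implied_nonzero_cells(33050, 177232, 0.01)))
```

In the dense subset this amounts to roughly 19 distinct accessed documents per user on average, out of 5,000 candidates.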

6.2 OVERFITTING

6.2.1 User-Document And User-Document-Word Aspect Models

Training the two-way user-document aspect model on the relatively dense set of 1,000 users and 5,000 documents resulted in immediate overfitting of EM, meaning that the test data log-likelihood began to fall after only the first or second iteration. This immediate overfitting occurred for numbers of latent classes ranging from 3 to 50. Using tempered EM (under several reasonable temperature change schedules) only kept the test data log-likelihood approximately at the same level as the initial random seed, without significant improvements.

Including the words contained in the 5,000 documents, and fitting the three-way aspect model also resulted in immediate overfitting. Again, TEM failed to yield significant improvements in the test data log-likelihood.

6.2.2 Standard Aspect Model, Synthetic Data

To examine whether this extreme overfitting was specific to the ResearchIndex data, we tested the aspect model on a simple synthetic data set. Users are divided into three disjoint groups according to the following scheme:

1. users 0–49 read papers 0–299,
2. users 50–99 read papers 300–599, and
3. users 100–149 read papers 600–899,

where the probabilities that users read papers in their interest set are uniform.
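A generator for data of this shape might look like the sketch below. The text does not say how a target density was reached, so one simple scheme is assumed here: every in-group (user, paper) pair is included independently with the same probability, and accesses are binary. Since only one third of all cells are in-group, that probability is three times the overall density. All names are hypothetical.

```python
import random

N_GROUPS = 3
USERS_PER_GROUP = 50
PAPERS_PER_GROUP = 300


def generate_accesses(density, seed=0):
    """Return a set of (user, paper) pairs with roughly the requested density."""
    rng = random.Random(seed)
    # Only one third of all user-paper cells are in-group, so the in-group
    # inclusion probability must be N_GROUPS times the overall density.
    p_in_group = density * N_GROUPS
    pairs = set()
    for g in range(N_GROUPS):
        users = range(g * USERS_PER_GROUP, (g + 1) * USERS_PER_GROUP)
        papers = range(g * PAPERS_PER_GROUP, (g + 1) * PAPERS_PER_GROUP)
        for u in users:
            for p in papers:
                if rng.random() < p_in_group:
                    pairs.add((u, p))
    return pairs


def observed_density(pairs):
    total_cells = (N_GROUPS * USERS_PER_GROUP) * (N_GROUPS * PAPERS_PER_GROUP)
    return len(pairs) / total_cells


train = generate_accesses(0.04, seed=1)
test = generate_accesses(0.04, seed=2)  # a test set of the same density
print(round(observed_density(train), 4))
```

Using a different seed for the test set gives an independent sample from the same three-group structure.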

We designed the data so that the “correct” model with three latent states is obvious. We generated several datasets of differing densities and trained a three-latent-variable aspect model on each to see whether EM converges to the correct model. We performed validation tests at each iteration with test sets of the same density as the corresponding training set. Figure 3 plots the iteration (averaged over fifty random restarts of EM) where overfitting⁷ first occurs versus the dataset density. In datasets of density less than 1.5% the process consistently overfits from the first iteration. For datasets of density 2.5%, test performance begins to deteriorate after about five iterations on average. For datasets of density 4%, overfitting begins after ten iterations.

⁷ Defined as the point where test data log-likelihood starts deteriorating.

Figure 3: Iteration (averaged over fifty random restarts) where overfitting occurs versus density of the synthetic data.

6.3 RECOMMENDATION ACCURACY

We find that both EM and TEM fail on very sparse data, including ResearchIndex data and synthetic data. In contrast, EM is effective on both of our density-augmented models (Sections 4 and 5). Here we compare these two models to the k-NN algorithm, commonly employed in commercial recommender systems. We use the rank scoring metric (Breese et al., 1998) to evaluate recommendations.

6.3.1 Evaluation Criteria

Breese et al. (1998) define the expected utility of a ranked list of items as

$$R_u = \sum_{j} \frac{\delta(u, j)}{2^{(j-1)/(\alpha-1)}},$$

where $j$ is the rank of an item in the full list of suggestions proposed by a recommender, $\delta(u, j)$ is 1 if user $u$ accessed item $j$ in the test set and 0 otherwise, and $\alpha$ is the viewing half-life, which is the place of an item in the list such that it has a 50% chance of being viewed.⁸ As in their paper, we use $\alpha = 5$, and found that our resulting conclusions were not sensitive to the precise value of this parameter. The final score reflecting the utilities of all users in the test set is

$$R = 100 \, \frac{\sum_{u} R_u}{\sum_{u} R_u^{\max}},$$

where $R_u^{\max}$ is the maximum possible utility obtained when all items that user $u$ has accessed appear at the top of the ranked list.
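The rank scoring metric can be sketched as below for observed accesses: an accessed item at rank $j$ contributes $1 / 2^{(j-1)/(\alpha-1)}$, so with a half-life of 5 an item at rank 5 counts half as much as an item at rank 1. The function names and the dictionary layout are assumptions, and the sketch presumes every user has at least one test-set access so the denominator is non-zero.

```python
def user_utility(ranked_items, accessed, half_life=5):
    """R_u: sum of 1 / 2 ** ((j - 1) / (half_life - 1)) over accessed items at rank j."""
    return sum(1.0 / 2 ** ((j - 1) / (half_life - 1))
               for j, item in enumerate(ranked_items, start=1)
               if item in accessed)


def max_utility(n_accessed, half_life=5):
    """R_u^max: utility when all accessed items occupy the top ranks."""
    return sum(1.0 / 2 ** ((j - 1) / (half_life - 1))
               for j in range(1, n_accessed + 1))


def rank_score(rankings, test_accesses, half_life=5):
    """R = 100 * sum_u R_u / sum_u R_u^max over all users in the test set."""
    total = sum(user_utility(rankings[u], test_accesses[u], half_life)
                for u in rankings)
    best = sum(max_utility(len(test_accesses[u]), half_life) for u in rankings)
    return 100.0 * total / best


rankings = {"u1": ["a", "b", "c", "d", "e"], "u2": ["c", "a", "b", "e", "d"]}
test_accesses = {"u1": {"a"}, "u2": {"d"}}
print(rank_score(rankings, test_accesses))
```

A recommender that places every accessed item at the top of every user's list scores 100; pushing accessed items further down the list lowers the score.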

6.3.2 k-Nearest Neighbors

Figure 4 gives $R$ scores for the experiments with k-NN in its standard formulation on the user-document data for different values of $k$, ranging from 10 to 60 with an interval of 5. The maximum $R$ value achieved in these experiments was 1.87. The $R$ scores have local maxima, suggesting their sensitivity to the sparsity of the user-document data.

⁸ We modify Breese et al.’s formula slightly for the case of observed accesses rather than ratings.

Figure 4: Total utility of the ranked lists over all users produced by k-NN.

6.3.3 Smoothed Aspect Model

Figure 5 shows the total utility of the ranked lists ($R$) for all users against the similarity threshold used for smoothing for the example of 25 latent variables. Although the values of $R$ fluctuate, the pattern is clear through the significant linear least squares fit (the p-value of the slope coefficient is 0.02): $R$ is larger when more content is included (smaller similarity threshold). As the similarity threshold grows, the initial data matrix becomes sparser, until it becomes impossible to learn (immediate overfitting). Local fluctuations are due to the stochastic nature of EM; in particular, its sensitivity to the randomly initialized parameter values and the number of restarts attempted (five in these experiments) when the data matrix becomes sparser as the similarity threshold grows.

Figure 5: Total utility of the ranked lists over all users produced by the similarity-based User-Document model against the similarity threshold used in smoothing (25 latent class variables).

The maximum value $R$ has reached is 2.10, which is greater than the best k-NN result (1.87), but not as good as the User-Words model (2.92), discussed below.

6.3.4 User-Words Aspect Model

Figure 6 shows the $R$ scores for the User-Words aspect model recommender. Experiments include models with the number of hidden class variables ranging from 10 to 60 with an interval of 10 (two restarts were performed for each experiment). The maximum $R$ value achieved in these experiments is 2.92 for the model with 50 hidden class variables, which is significantly higher than 1.87, the best $R$ value achieved with the k-NN algorithm.

Figure 6: Total utility of the ranked lists over all users produced by the User-Words aspect model, plotted against the number of latent classes.

7 CONCLUSIONS AND FUTURE WORK

We presented three probabilistic mixture models for recommending items based on collaborative and content-based evidence merged in a unified manner. Incorporating content into a collaborative filtering system can increase the flexibility and quality of the recommender. Moreover, when data is extremely sparse, as is typical in many real-world applications, additional content information seems almost necessary to fit global probabilistic models at all.

The density of ResearchIndex data is only 0.01%. Even the most active users reading the most popular articles induce a subset of density only 0.38%, still too sparse for the straightforward EM and TEM approaches to work. We find that a particularly good way to include content information in the context of a document recommendation system is to treat users as reading words of the document, rather than the document itself. In our case, this increased the density from 0.38% to almost 9%, resulting in recommendations superior to k-NN.

There are many areas for future research. Similar methods to those presented here might be used to recommend items such as movies which have attributes other than text. A movie can be viewed as consisting of the director and the actors in it, just as a document contains words. Both of our sparsity reduction techniques, similarity-based smoothing and an equivalent of a user-words aspect model, can be used.

EM is guaranteed to reach only a local maximum of the training data log-likelihood. Multiple restarts need to be performed if one desires a higher quality model. We are planning to investigate ways to intelligently seed EM to reduce the need for multiple restarts, which can be costly when fitting datasets of non-trivial size.

The user-words model does not explicitly use the popularity of items. Including such information may further improve the quality of the recommendations made by the model, but requires additional work on combining and calibrating model predictions with document popularity.

Finally, predictive accuracy was used to validate our models in this paper. We are planning to deploy our recommenders in ResearchIndex and perform a user study collecting information on which recommendations are actually followed by users.

References

Basu, C., Hirsh, H., & Cohen, W. (1998). Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pp. 714–720.

Billsus, D., & Pazzani, M. J. (1998). Learning collaborative information filters. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 46–54.

Bollacker, K., Lawrence, S., & Giles, C. L. (2000). Discovering relevant scientific literature on the web. IEEE Intelligent Systems, 15(2), 42–47.

Breese, J. S., Heckerman, D., & Kadie, C. (1998). Empirical analysis of predictive algorithms for collaborative filtering. In Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43–52.

Claypool, M., Gokhale, A., & Miranda, T. (1999). Combining content-based and collaborative filters in an online newspaper. In Proceedings of the ACM SIGIR Workshop on Recommender Systems: Implementation and Evaluation.

Cohen, W. W., Schapire, R. E., & Singer, Y. (1999). Learning to order things. Journal of Artificial Intelligence Research, 10, 243–270.

Cohn, D., & Hofmann, T. (2001). The missing link - a probabilistic model of document content and hypertext connectivity. In Advances in Neural Information Processing Systems, Vol. 13. The MIT Press.


Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (1998). An efficient boosting algorithm for combining preferences. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 170–178.

Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–70.

Good, N., Schafer, J. B., Konstan, J. A., Borchers, A., Sarwar, B. M., Herlocker, J. L., & Riedl, J. (1999). Combining collaborative filtering with personal agents for better recommendations. In Proceedings of the Sixteenth National Conference on Artificial Intelligence, pp. 439–446.

Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R., & Kadie, C. (2000). Dependency networks for collaborative filtering and data visualization. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 264–273.

Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 289–296.

Hofmann, T., & Puzicha, J. (1999). Latent class models for collaborative filtering. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pp. 688–693.

Lawrence, S., Giles, C. L., & Bollacker, K. (1999). Digital libraries and autonomous citation indexing. IEEE Computer, 32(6), 67–71.

Mooney, R. J., & Roy, L. (2000). Content-based book recommending using learning for text categorization. In Proceedings of the Fifth ACM Conference on Digital Libraries, pp. 195–204.

Nakamura, A., & Abe, N. (1998). Collaborative filtering using weighted majority prediction algorithms. In Proceedings of the Fifteenth International Conference on Machine Learning, pp. 395–403.

Pennock, D. M., Horvitz, E., & Giles, C. L. (2000a). Social choice theory and recommender systems: Analysis of the axiomatic foundations of collaborative filtering. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, pp. 729–734.

Pennock, D. M., Horvitz, E., Lawrence, S., & Giles, C. L. (2000b). Collaborative filtering by personality diagnosis: A hybrid memory- and model-based approach. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, pp. 473–480.

Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 175–186.

Resnick, P., & Varian, H. R. (1997). Recommender systems. Communications of the ACM, 40(3), 56–58.

Salton, G., & McGill, M. (1983). Introduction to Modern Information Retrieval. McGraw Hill.

Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. T. (2000). Application of dimensionality reduction in recommender system – a case study. In ACM WebKDD Web Mining for E-Commerce Workshop.

Schafer, J. B., Konstan, J., & Riedl, J. (1999). Recommender systems in e-commerce. In Proceedings of the ACM Conference on Electronic Commerce, pp. 158–166.

Shardanand, U., & Maes, P. (1995). Social information filtering: Algorithms for automating ‘word of mouth’. In Proceedings of Computer Human Interaction, pp. 210–217.

Ungar, L. H., & Foster, D. P. (1998). Clustering methods for collaborative filtering. In Workshop on Recommendation Systems at the Fifteenth National Conference on Artificial Intelligence.