
Content-based Recommender Systems

2.2 Recommender Systems

2.2.1 Content-based Recommender Systems

A content-based recommender system (CBRS) relies on the content of items to recommend items similar to those the user has liked before. The similarity between items is computed from the features associated with them. CBRS approaches analyze the features of the items previously rated by a user to build a profile of that user’s interests. The recommendation process then consists of matching the attributes of the user profile against the features of a candidate item. The result is a relevance score that represents the user’s level of interest in that item. The more accurate the profile is, the more effective the recommendations will be. Therefore, an important step in content-based recommender systems is the technique used for item representation.

The items are represented by a set of features, also called attributes or properties [98]. In movie applications, for example, the year of the movie, its actors, directors, and description can be used as item features. A simple way to represent the items is then to use keyword-based profiles. This approach is especially suitable when each item is described by the same set of attributes and the possible values of each feature are known. For textual descriptions, however, keyword-based profiles are not an effective item representation, since simple string matching cannot deal with polysemy (one word having multiple meanings) or synonymy (multiple words having the same meaning) [98].
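The limitation of exact string matching can be illustrated with a minimal sketch. The function `keyword_overlap` and the keyword sets below are hypothetical and chosen only for illustration; they are not taken from [98].

```python
def keyword_overlap(profile_keywords, item_keywords):
    """Count the keywords shared by a profile and an item (exact string match)."""
    return len(set(profile_keywords) & set(item_keywords))


# Synonymy: "movie"/"film" and "funny"/"humorous" mean the same thing,
# but exact matching only sees the shared string "space".
user_keywords = {"movie", "space", "funny"}
item_keywords = {"film", "space", "humorous"}
synonymy_score = keyword_overlap(user_keywords, item_keywords)

# Polysemy: "bank" matches even though the user means a financial bank
# and the item is about a river bank.
polysemy_score = keyword_overlap({"bank", "loan"}, {"bank", "river"})
```

In both cases the overlap count is misleading: the first pair is scored as weakly related although the meanings agree, and the second pair is scored as related although the meanings differ.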

A simple and widely used model for representing items is the Vector Space Model (VSM), which is broadly used in Information Retrieval (IR). In particular, the VSM represents text documents spatially: each document is seen as a vector in an m-dimensional space. Each dimension of the document vector corresponds to a term from the overall vocabulary of the document collection, and is weighted to indicate the degree of relevance between the document and the term. In a content-based recommender system, this model can be used in such a way that the items and users correspond to the documents, while the items’ features are the terms of the overall vocabulary.

Let $T = \{t_1, \dots, t_m\}$ be the set of terms in our vocabulary and $D = \{d_1, \dots, d_n\}$ be the set of documents. Each document $d_i$ is then represented by its vector in the $m$-dimensional space, $\vec{d_i} = (w_{1,i}, \dots, w_{m,i})$, such that $w_{k,i}$ is the relevance of the term $t_k$ for the document $d_i$.

To evaluate how relevant a term $t_k$ is to a document $d_i$, that is, to set the weight $w_{k,i}$, we first need to point out some observations that help us design an adequate weight function. As discussed in [98, 123]: (i) frequent terms are not necessarily more relevant than rare terms; (ii) single occurrences of a term in a document are not more important than multiple occurrences; and (iii) documents with many terms are not more suitable than documents with fewer terms. TF-IDF (Term Frequency-Inverse Document Frequency), the most commonly used term weighting framework, was developed based on these observations about text. The intuition behind TF-IDF is that terms that occur frequently in one document but rarely in the other documents are more likely to be relevant to that document, while terms that occur in many documents are not representative of any specific document. To compute TF-IDF, we first need to compute the term frequency of a term $t_k$ in a document $d_i$, given by

$$\mathrm{TF}(t_k, d_i) = \frac{f_{k,i}}{\max_z f_{z,i}},$$

where $f_{k,i}$ is the number of occurrences of the term $t_k$ in the document $d_i$, and $\max_z f_{z,i}$ stands for the maximum number of occurrences of any term $t_z$ in the document $d_i$. With the term frequency computed, we can calculate TF-IDF as

$$w_{k,i} = \text{TF-IDF}(t_k, d_i) = \mathrm{TF}(t_k, d_i) \cdot \log \frac{n}{n_k},$$

where $n$ is the number of documents in the collection and $n_k$ is the number of documents that contain the term $t_k$. Analyzing the inverse document frequency component $\log \frac{n}{n_k}$, we can see that the final TF-IDF score is higher when $n_k$ is small and lower when $n_k$ is large. This means that the term frequency of a term $t_k$ in the document $d_i$ is penalized if $t_k$ occurs in many other documents.
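The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `tf_idf_vectors`, the toy documents, and the use of the natural logarithm are assumptions (the text does not fix the base of the logarithm).

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of token lists. Returns (vocab, vectors), where vectors[i][k]
    is the weight w_{k,i} of term vocab[k] in document i.
    """
    n = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # n_k: number of documents that contain term t_k
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # max_z f_{z,i}: count of the most frequent term in this document
        max_count = max(counts.values()) if counts else 1
        vectors.append([
            (counts[term] / max_count) * math.log(n / doc_freq[term])
            for term in vocab
        ])
    return vocab, vectors


# Hypothetical toy collection: each document is a list of item keywords.
docs = [
    ["space", "alien", "space", "war"],
    ["romance", "war"],
    ["space", "romance", "comedy"],
]
vocab, vectors = tf_idf_vectors(docs)
```

In the first document, "alien" occurs once (TF = 1/2) but appears in only one of the three documents, so its weight is 0.5 · log(3/1); "space" occurs twice (TF = 1) but appears in two documents, so its weight is only log(3/2). A term that occurs in every document gets weight log(1) = 0.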

Once we have computed the document vectors $\vec{d_i}$, we can rely on a similarity function to find the documents $d_i$ that are similar to a given vector (e.g., a user profile vector) in the same $m$-dimensional space. Cosine similarity is broadly used for this purpose: it measures the similarity between two vectors of an inner product space as the cosine of the angle between them, given by:

$$\mathrm{cosine}(\vec{d_i}, \vec{d_j}) = \frac{\vec{d_i} \cdot \vec{d_j}}{\|\vec{d_i}\| \cdot \|\vec{d_j}\|}.$$

Therefore, for a user profile vector $\vec{u}$ in the same $m$-dimensional space, we can compute the cosine similarity to find documents that are relevant or similar to the user profile. The vector space model, together with a vector-based similarity function, is a simple and efficient way of recommending items, mostly due to its flexibility to be applied in different domains, such as music, movies, books, and venues.
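As a minimal sketch of this matching step, the code below computes the cosine similarity between a user profile vector and each item vector and ranks the items by score. The names `cosine` and `rank_items`, the toy vectors, and the convention of returning 0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumption: an all-zero vector is treated as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def rank_items(user_profile, item_vectors):
    """Return (item_id, score) pairs sorted by decreasing cosine similarity."""
    scores = {
        item_id: cosine(user_profile, vec)
        for item_id, vec in item_vectors.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical weighted feature vectors in the same 3-dimensional space.
item_vectors = {
    "A": [1.0, 0.0, 0.5],
    "B": [0.0, 1.0, 0.0],
    "C": [0.8, 0.1, 0.4],
}
user_profile = [1.0, 0.0, 0.5]
ranking = rank_items(user_profile, item_vectors)
```

Because cosine similarity depends only on the angle between the vectors, an item whose vector points in the same direction as the profile scores close to 1 regardless of vector length, while an item sharing no weighted feature with the profile scores 0.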

Lops et al. [98] highlight that keyword-based representations of items and user profiles can give accurate results when sufficient evidence of the user’s interests is available. However, this approach is not suitable for all applications. As previously discussed, keyword-based methods suffer from polysemy and synonymy, which can lead to inaccurate recommendations. To deal with this problem, an ontology-based representation might be used to integrate the recommender system with external knowledge bases and provide more semantics in the user profiles.

The Vector Space Model can then be used as a framework for the content-based recommender system. In the case of a CBRS, the documents are the items and users, while the terms are the features associated with the items. In this way, we represent items and users as feature vectors, so that the items most similar to the user profile can be recommended to the user.
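The text does not prescribe how a user's feature vector is derived from the items she has rated. One simple possibility, shown here only as an assumption, is a rating-weighted average of the vectors of the rated items; the function `build_user_profile` and the toy data are hypothetical.

```python
def build_user_profile(ratings, item_vectors):
    """Build a user vector as the rating-weighted mean of rated item vectors.

    ratings: dict item_id -> non-negative rating.
    item_vectors: dict item_id -> list of feature weights (same length m).
    Returns a list of m floats, or None when no profile can be built.
    """
    rated = [
        (item_vectors[item_id], rating)
        for item_id, rating in ratings.items()
        if item_id in item_vectors
    ]
    total = sum(rating for _, rating in rated)
    if not rated or total == 0:
        # A user without usable ratings has no profile in this sketch.
        return None
    m = len(rated[0][0])
    return [
        sum(vec[k] * rating for vec, rating in rated) / total
        for k in range(m)
    ]


# Hypothetical item vectors over two features and one user's ratings.
item_vectors = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [0.5, 0.5]}
profile = build_user_profile({"A": 4, "B": 1}, item_vectors)
no_profile = build_user_profile({}, item_vectors)
```

With these ratings the profile leans toward the feature of the highly rated item A. The resulting vector lives in the same space as the item vectors, so unrated items such as C can be scored against it with a vector similarity function.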

Content-based recommender systems have the following advantages: (i) user independence, since the RS exploits only the ratings provided by the user to build her own profile, and does not need the ratings of other users, as collaborative approaches do; (ii) transparency, since how the recommender system works can be explained by explicitly listing the item features that caused an item to be recommended; and (iii) overcoming the new item problem (an item not yet rated by any user), since the item features can be matched against the user profile even when no user has rated that item.
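The transparency property can be sketched as follows, assuming that items and the user profile are weighted feature vectors over a shared vocabulary. Ranking features by the product of the profile weight and the item weight is one possible choice made for illustration; the function `explain` and the example vectors are hypothetical.

```python
def explain(user_profile, item_vector, vocab, top_k=3):
    """List the features that contribute most to a user-item match.

    The contribution of a feature is the product of its weight in the
    user profile and its weight in the item vector (its share of the
    dot product). Features with zero contribution are omitted.
    """
    contributions = [
        (term, u * w)
        for term, u, w in zip(vocab, user_profile, item_vector)
    ]
    positive = [(term, c) for term, c in contributions if c > 0]
    return sorted(positive, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical vocabulary, user profile, and candidate item.
vocab = ["alien", "comedy", "romance", "space", "war"]
user_profile = [0.5, 0.0, 0.1, 0.4, 0.2]
item_vector = [0.0, 0.3, 0.0, 0.4, 0.2]
reasons = explain(user_profile, item_vector, vocab)
```

Here the explanation lists "space" first and "war" second, since these are the only features weighted in both the profile and the item; "comedy" is not listed because the user profile gives it no weight.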

Content-based recommender systems, on the other hand, also have several drawbacks: (i) domain knowledge is often needed, which might be problematic if the content of the items (their features) is not enough to discriminate items the user likes from items the user does not like. Automatic discovery or manual assignment of features to items may therefore not be sufficient to define the distinguishing aspects of items needed to capture and model the user’s interests; (ii) these RSs suffer from overspecialization, where only items similar to those previously rated by the user are recommended, so they do not favor serendipitous (unexpected) recommendations; and (iii) new users do not have enough ratings or feedback to create their profile, so the system is not able to provide them with reliable recommendations.

Item content is not always available, which forces us to design recommendation techniques other than the content-based ones. In particular, when user ratings are present in the system, they can be used to discover patterns that support the recommendation of items. These patterns may indicate users with similar preferences or behaviors, so that the items of one could be recommended to the other; or they may be used to learn models that are capable of assisting the recommendations. In the next section we discuss recommender systems based on collaborative filtering, which exploit such rating patterns to recommend items.