Performance Prediction Approach

Figure 6.1 outlines a general overview of the approach undertaken in this work where, per user, derived measures of the user’s rating information, along with the rating information of other users, are used in a rule which returns a prediction (a predicted MAE score) on how well the system can produce recommendations for the user. The four different datasets outlined in Chapter 3 are tested with this scenario: MovieLens, last.fm, bookcrossing and Epinions.

Figure 6.1: Performance prediction scenario in a collaborative filtering domain.

6.3.1 Learning the Performance Prediction Rules

Figure 6.2 gives an overview of the steps involved in learning, per dataset, rules which can be used to predict, per user, the performance of the system.

Initially a holdout set of test users (up to 10% depending on the dataset) is removed to be used to evaluate the rules learned (as will be described in Section 6.4). The remainder of the dataset comprises the training data.

Figure 6.2: Steps to learn the performance prediction rules.

6.3.1.1 Extract Rating Information

A number of aspects of the user rating information, called features, are extracted from the collaborative filtering datasets. The motivation is to choose aspects of the user rating history that would seem likely to affect a user’s prediction accuracy. The data extracted range from simple calculations (such as the average user rating) to values derived using some item and neighbour information. A list of the eight features follows, with details and formulae where required, and with a shorthand name for each feature which will make the subsequent rules more readable:

1. numRatings: the number of ratings a user has given.

2. avgRating: the average rating a user has given to all the items they have rated.

3. stdev: the standard deviation of the ratings a user has given.

4. numNeighs: the number of neighbours a user has. This is calculated by first using a Pearson correlation similarity measure to find the similarity between users. Any user whose similarity to the current user is above a set threshold (in this case 0.1) is counted as a neighbour.

5. sim30neighs: the average similarity of each user to their top closest 30 neighbours (using the Pearson correlation similarity values and, having ordered by similarity, picking the top 30 users).

6. popItems: the popularity of the items each user has rated. This popularity measure is based on the number of ratings each item has received (and not considering the actual rating value). The formula per user a is:

\[ \frac{\sum_{i=1}^{M} \mathit{numRatings}_i}{M} \tag{6.1} \]

for M items rated by the user a and numRatings_i being the number of ratings item i has received from all users in the dataset.

7. likedItems: how well-liked by all users are the items rated by the current user. This measure is calculated using the actual rating value given to items. The formula used per user a is:

\[ \frac{\sum_{i=1}^{M} \mathit{avgVal}_i}{M} \tag{6.2} \]

for M items rated by the user a and avgVal_i being the average rating value item i has received from all users in the dataset who have rated item i.

8. tfidf: the importance, or influence, of a user in a dataset. This is based

on the idea of term frequency and inverse document frequency (from the domain of IR) and is the proportion of items a user has rated multiplied by a measure of how frequently rated those items are in the dataset. Frequently rated items get low values (similar to the IDF component in Information Retrieval, where frequently occurring terms across all documents receive lower scores). The formula used is:

\[ \frac{\mathit{numRatings}_a}{\mathit{numItems}} \times \sum_{i=1}^{M} \log\left(\frac{\mathit{numUsers}}{\mathit{numRatings}_i}\right) \tag{6.3} \]

for M items rated by a user a, where numItems is the number of items in the dataset and numRatings_a is the number of ratings user a has given, i.e., this is the ratio of the number of ratings the user gave over the number of ratings the user could have given (all items); numUsers is the number of users in the dataset and numRatings_i is the number of ratings item i received, i.e., this is the ratio of the number of ratings an item could have received (a rating from all users in the dataset) over the number of ratings it did receive.
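As an illustration only, the eight features listed above could be computed along the following lines. This is a minimal sketch, not the implementation used in this work: the data layout (a dictionary mapping each user to an item-to-rating dictionary), the function names, the use of the population standard deviation, and the choice to compute Pearson correlation over co-rated items only are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def pearson(ra, rb):
    """Pearson correlation over co-rated items; 0.0 when undefined (assumption)."""
    common = [i for i in ra if i in rb]
    if len(common) < 2:
        return 0.0
    ma = mean(ra[i] for i in common)
    mb = mean(rb[i] for i in common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0


def extract_features(ratings, user, threshold=0.1, top_k=30):
    """ratings: {user: {item: rating}} (hypothetical layout). Returns raw features."""
    mine = ratings[user]
    m = len(mine)
    # Collect, per item, every rating it received from any user in the dataset.
    item_vals = {}
    for user_ratings in ratings.values():
        for item, value in user_ratings.items():
            item_vals.setdefault(item, []).append(value)
    num_users, num_items = len(ratings), len(item_vals)
    # Similarities to every other user, most similar first.
    sims = sorted((pearson(mine, r) for u, r in ratings.items() if u != user),
                  reverse=True)
    top = sims[:top_k]
    return {
        "numRatings": m,
        "avgRating": mean(mine.values()),
        "stdev": pstdev(mine.values()),  # population form is an assumption
        "numNeighs": sum(1 for s in sims if s > threshold),
        "sim30neighs": mean(top) if top else 0.0,
        "popItems": sum(len(item_vals[i]) for i in mine) / m,    # Equation 6.1
        "likedItems": sum(mean(item_vals[i]) for i in mine) / m,  # Equation 6.2
        "tfidf": (m / num_items)
                 * sum(math.log(num_users / len(item_vals[i])) for i in mine),  # Equation 6.3
    }
```

The values returned here are raw; any rescaling is a separate step. Recomputing the per-item statistics on every call keeps the sketch short, but a real run over a full dataset would compute them once and reuse them.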

The feature values are all normalised by min/max normalisation to be in the range [0.0, 1.0].
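Min/max normalisation rescales each feature independently using (x - min) / (max - min). A minimal sketch follows, assuming the features are held as one dictionary per user; the function name and the handling of a constant feature (mapped to 0.0) are assumptions, since the text does not specify them.

```python
def normalise_features(rows):
    """rows: list of {feature: value} dicts, one per user. Returns rescaled copies."""
    names = list(rows[0])
    lo = {n: min(r[n] for r in rows) for n in names}
    hi = {n: max(r[n] for r in rows) for n in names}
    return [
        {
            # A feature with no spread cannot be rescaled; 0.0 is an assumed fallback.
            n: (r[n] - lo[n]) / (hi[n] - lo[n]) if hi[n] > lo[n] else 0.0
            for n in names
        }
        for r in rows
    ]
```

For example, raw numRatings values of 10, 20 and 30 would map to 0.0, 0.5 and 1.0 respectively.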

6.3.1.2 Collaborative Filtering Technique

For each training and test user, in each of the four datasets, the feature values outlined in Section 6.3.1.1 are extracted. In addition, some score which represents how well a collaborative filtering system can predict items for these users is required for learning and testing. The machine learning approach will learn relative to this score, ideally associating some feature values with low scores and other feature values with high scores, and thus finding the predictive power of the features in terms of the accuracy score.

This experiment requires a measurement which is comparable across the four datasets, and which can be suitably averaged so that it can be used as a score over which the machine learning approach will learn. Initial experiments performed using the MAE (see Equation 2.1 in Chapter 2) found it was suitable, as it is reasonably strict while being widely used and well understood. In order to obtain a MAE score per training and test user, a collaborative filtering system is required to produce recommendations for a portion of ratings removed from the user’s ratings. Any standard collaborative filtering technique can be used. For this experiment, a nearest neighbour collaborative filtering technique was employed, using Pearson correlation to find similar neighbours and using a weighted average of the neighbours’ ratings of test items to produce recommendations.
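The following is a minimal sketch of a nearest neighbour predictor of this kind, together with the MAE calculation. It is illustrative rather than the system used in the experiments: the function names, the neighbourhood size k, the restriction to positively correlated neighbours, and the use of a plain (not mean-centred) similarity-weighted average are assumptions.

```python
import math


def pearson(ra, rb):
    """Pearson correlation over co-rated items; 0.0 when undefined (assumption)."""
    common = [i for i in ra if i in rb]
    if len(common) < 2:
        return 0.0
    ma = sum(ra[i] for i in common) / len(common)
    mb = sum(rb[i] for i in common) / len(common)
    num = sum((ra[i] - ma) * (rb[i] - mb) for i in common)
    den = math.sqrt(sum((ra[i] - ma) ** 2 for i in common)
                    * sum((rb[i] - mb) ** 2 for i in common))
    return num / den if den else 0.0


def predict_rating(ratings, user, item, k=30):
    """Similarity-weighted average of the neighbours' ratings for `item`.

    ratings: {user: {item: rating}}, where the target user's profile must
    already exclude the held-out test item. Returns None if no positively
    correlated neighbour has rated the item.
    """
    candidates = [(pearson(ratings[user], r), r[item])
                  for u, r in ratings.items() if u != user and item in r]
    top = sorted((c for c in candidates if c[0] > 0), reverse=True)[:k]
    if not top:
        return None
    return sum(s * v for s, v in top) / sum(s for s, _ in top)


def mae(actual, predicted):
    """Mean absolute error between two equal-length rating lists."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)
```

A lower MAE means the predicted ratings were closer to the ratings the user actually gave, so users with low MAE are those the system serves well.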

6.3.1.3 Create Training and Test Tuples

As previously mentioned, 10% of the dataset is withheld for testing purposes (the holdout set). To allow for the comparison between the actual and predicted MAE scores, a collaborative filtering system is used to produce predictions for a set of items for the users. A MAE score is calculated based on the ratings the user has given the items versus the ratings the collaborative filtering approach produced. The collaborative filtering run is repeated 10 times per user where, for each run, up to 10% of the user’s items are randomly chosen as the test items. Finally, accuracy scores, per user, are averaged over the 10 runs.
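The repeated evaluation described above can be sketched as follows. This is an assumption-laden illustration: `predict_fn` is a hypothetical callable standing in for whatever collaborative filtering system is used, and the treatment of "up to 10%" as a fixed 10% sample (with at least one item), the fixed random seed, and the skipping of items for which no prediction can be made are choices made only for the example.

```python
import random


def average_user_mae(user_ratings, predict_fn, runs=10, fraction=0.1, seed=0):
    """Average MAE for one user over repeated random hold-outs of their items.

    user_ratings: {item: rating} for the user.
    predict_fn(profile, item): hypothetical predictor that sees only the
    user's remaining ratings and returns a rating, or None if it cannot.
    """
    rng = random.Random(seed)
    items = sorted(user_ratings)
    n_test = max(1, int(len(items) * fraction))
    run_scores = []
    for _ in range(runs):
        test_items = set(rng.sample(items, n_test))
        # The profile used for prediction excludes the held-out items.
        profile = {i: r for i, r in user_ratings.items() if i not in test_items}
        errors = []
        for item in test_items:
            predicted = predict_fn(profile, item)
            if predicted is not None:
                errors.append(abs(user_ratings[item] - predicted))
        if errors:
            run_scores.append(sum(errors) / len(errors))
    return sum(run_scores) / len(run_scores) if run_scores else None
```

Averaging over several random hold-outs reduces the dependence of a user's score on which particular items happened to be removed.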

For the remaining 90% of users (the training users), a collaborative filtering approach was repeatedly re-run using 10% of these users and 10% of the rated items for each user, so that an average MAE over the recommendations for the removed items can be calculated for each user (comparing actual with predicted scores). For any given user in the training set, their user ID, along with their average MAE value and the eight aforementioned features (from Section 6.3.1.1), comprises that user's tuple in the training dataset.
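To make the tuple structure concrete, the sketch below assembles one tuple per user and serialises a set of tuples in ARFF, the plain-text input format read by WEKA. The function names, the relation name, and the decision to leave the user ID out of the ARFF attributes (treating it as an identifier rather than a predictor) are assumptions for illustration; the text does not describe how the files were actually prepared.

```python
FEATURES = ["numRatings", "avgRating", "stdev", "numNeighs",
            "sim30neighs", "popItems", "likedItems", "tfidf"]


def make_tuple(user_id, avg_mae, features):
    """Build (user ID, average MAE, eight feature values) for one user."""
    return (user_id, avg_mae) + tuple(features[name] for name in FEATURES)


def to_arff(tuples, relation="performance_prediction"):
    """Serialise tuples as ARFF text with MAE as the last (target) attribute."""
    lines = ["@relation " + relation, ""]
    lines += ["@attribute %s numeric" % name for name in FEATURES]
    lines += ["@attribute MAE numeric", "", "@data"]
    for user_id, avg_mae, *feats in tuples:
        # The user ID is kept in the tuple but not written as an attribute.
        lines.append(",".join("%.6g" % v for v in feats + [avg_mae]))
    return "\n".join(lines) + "\n"
```

Placing the MAE last follows the common convention of putting the target attribute at the end of each data row.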

6.3.1.4 Machine Learning Technique

All the data in this experiment is numeric. The target variable (the MAE) is known for each training tuple and therefore a supervised machine learning approach is suited to the problem. Often, a neural network approach would be used in the classification scenario where labelled numeric data exists. However, we wish to understand the underlying patterns and correlations between the feature values and accuracy scores. We therefore require a technique which will produce one or more rules. The technique used is regression trees, which are similar to ordinary decision trees except they can be used with numeric data [234]. The regression tree used is the model tree inducer M5′ [189]. The machine learning package WEKA is used, which has an implementation of M5′ [234].
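To illustrate how a tree over numeric features yields readable rules, the sketch below builds a plain regression tree with constant-valued leaves by minimising the squared error of each split. It is explicitly not M5′ and not WEKA's implementation: M5′ fits linear models at the leaves and applies pruning and smoothing, none of which is reproduced here. All names and parameters (`min_leaf`, `max_depth`) are assumptions for the example.

```python
def sse(ys):
    """Sum of squared errors of ys around their mean."""
    if not ys:
        return 0.0
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)


def build_tree(rows, targets, min_leaf=2, max_depth=3, depth=0):
    """rows: list of {feature: value}; targets: MAE per row.

    Returns a float (leaf prediction) or a tuple (feature, threshold, left, right).
    """
    best = None
    if depth < max_depth and len(rows) >= 2 * min_leaf:
        for name in rows[0]:
            for thr in sorted({r[name] for r in rows})[:-1]:
                left = [y for r, y in zip(rows, targets) if r[name] <= thr]
                right = [y for r, y in zip(rows, targets) if r[name] > thr]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                cost = sse(left) + sse(right)
                if best is None or cost < best[0]:
                    best = (cost, name, thr)
    if best is None or best[0] >= sse(targets):
        return sum(targets) / len(targets)  # leaf: mean MAE of these users
    _, name, thr = best
    lo = [(r, y) for r, y in zip(rows, targets) if r[name] <= thr]
    hi = [(r, y) for r, y in zip(rows, targets) if r[name] > thr]
    return (name, thr,
            build_tree([r for r, _ in lo], [y for _, y in lo], min_leaf, max_depth, depth + 1),
            build_tree([r for r, _ in hi], [y for _, y in hi], min_leaf, max_depth, depth + 1))


def predict(tree, row):
    """Follow the splits down to a leaf and return its value."""
    while isinstance(tree, tuple):
        name, thr, left, right = tree
        tree = left if row[name] <= thr else right
    return tree


def rules(tree, conds=()):
    """Flatten the tree into (condition string, predicted MAE) pairs."""
    if not isinstance(tree, tuple):
        return [(" and ".join(conds) or "always", tree)]
    name, thr, left, right = tree
    return (rules(left, conds + ("%s <= %s" % (name, thr),))
            + rules(right, conds + ("%s > %s" % (name, thr),)))
```

Each path from the root to a leaf reads as one rule, which is the property that motivates a tree-based learner over a less interpretable model here.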

The effect of performing feature selection, where some subset of features is selected prior to running the M5′ approach, is also tested. A feature selection stage typically reduces the complexity of the rules produced, that is, the number of features used in the rules. With feature selection, the most predictive features with respect to the class (MAE) are chosen, which usually results in simpler rules. As the rules are to be used either prior to, or in conjunction with, producing recommendations, the quicker a performance prediction measure can be generated the better. Simpler rules are therefore generally preferable, provided the accuracy of the rules with feature selection is comparable to that of the rules without feature selection.
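The text does not state which feature selection method is applied, so the sketch below uses a simple stand-in: a filter that ranks features by the absolute Pearson correlation between each feature and the MAE and keeps the top few. The function names and the `keep` parameter are hypothetical, and this ranking should not be read as the method used in the experiments.

```python
import math


def correlation(xs, ys):
    """Pearson correlation between two equal-length lists; 0.0 if undefined."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


def select_features(rows, targets, keep=3):
    """Return the `keep` feature names most correlated (in magnitude) with MAE.

    rows: list of {feature: value} dicts; targets: MAE per row.
    """
    names = list(rows[0])
    scores = {n: abs(correlation([r[n] for r in rows], targets)) for n in names}
    return sorted(names, key=lambda n: scores[n], reverse=True)[:keep]
```

A univariate ranking such as this ignores redundancy between features, so two highly correlated features may both be kept; subset-based selection methods address that at a higher computational cost.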