The first comparison uses the same collaborative filtering approach, testing methodology and evaluation metrics across the four datasets. The collaborative filtering approach is a Pearson correlation nearest-neighbour approach that finds the top-60 most similar neighbours. A weighted average of these neighbours' ratings of the test items is used to produce a predicted value for each removed test item.
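The neighbourhood approach described above can be illustrated with a minimal sketch. The text does not spell out several details, so the following are assumptions rather than the exact method used in the experiments: correlations are computed over co-rated items only, only positively correlated neighbours are kept, and the prediction is a plain similarity-weighted average (not mean-centred). The function names `pearson` and `predict` and the nested-dictionary data layout are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation between two users over their co-rated items.

    a and b are dicts mapping item -> rating. Returns 0.0 when fewer than
    two items are shared or when either user has zero variance on them.
    """
    common = [i for i in a if i in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in common)) * \
          math.sqrt(sum((b[i] - mean_b) ** 2 for i in common))
    return num / den if den else 0.0

def predict(ratings, user, item, k=60):
    """Similarity-weighted average of the top-k neighbours' ratings of item.

    ratings maps user -> {item: rating}. Returns None when no selected
    neighbour has rated the item (the item is then not covered).
    """
    sims = [(pearson(ratings[user], r), other)
            for other, r in ratings.items() if other != user]
    # Assumption: keep only positively correlated users, then take the top k.
    top = sorted((s for s in sims if s[0] > 0),
                 key=lambda t: t[0], reverse=True)[:k]
    rated = [(s, ratings[o][item]) for s, o in top if item in ratings[o]]
    if not rated:
        return None
    return sum(s * r for s, r in rated) / sum(s for s, _ in rated)
```

Returning `None` when no neighbour has rated the item is what makes a coverage figure meaningful: such test items count as not covered rather than being given a fallback prediction.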
A standard testing methodology is used where 10% of users and, if possible, 10% of the items of these users are removed (called a 10/90 split). These user-item pairs become the test set on which recommendations are sought. The remainder of the dataset is used to generate recommendations. The collaborative filtering run is repeated 10 times per user where, for each run, up to 10% of the user’s items are randomly chosen as the test items.
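The hold-out procedure can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: "up to 10%" is interpreted as rounding the item count down (so users with fewer than ten ratings contribute no test items), and the helper names `choose_test_users` and `hold_out_items` are hypothetical.

```python
import random

def choose_test_users(ratings, user_frac=0.10, rng=None):
    """Randomly pick a fraction of users (at least one) as test users."""
    rng = rng or random.Random(0)
    users = sorted(ratings)
    return rng.sample(users, max(1, round(user_frac * len(users))))

def hold_out_items(ratings, test_users, item_frac=0.10, rng=None):
    """Remove up to item_frac of each test user's rated items.

    Returns (train, test_pairs): train is a copy of ratings without the
    held-out (user, item) pairs, and test_pairs lists those pairs.
    """
    rng = rng or random.Random(0)
    test_pairs = []
    for u in test_users:
        items = sorted(ratings[u])
        n = int(item_frac * len(items))  # assumption: round down
        test_pairs.extend((u, i) for i in rng.sample(items, n))
    held = set(test_pairs)
    train = {u: {i: r for i, r in rs.items() if (u, i) not in held}
             for u, rs in ratings.items()}
    return train, test_pairs
```

Calling `hold_out_items` ten times with the same test users and a shared random generator would give the ten runs per user described above, each with a fresh random choice of test items.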
The evaluation metrics used are MAE and coverage. An MAE score is calculated from the actual ratings the user has given the test items versus the ratings predicted by the collaborative filtering approach. Coverage is the proportion of test items for which a prediction can be made. The MAE and coverage results, per user, are averaged over the 10 runs.
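As a small illustration of the two metrics, the sketch below computes MAE over the test items that received a prediction and coverage as the percentage of test items with a prediction. The function name `mae_and_coverage` and the convention of marking an unavailable prediction with `None` are assumptions made for this example.

```python
def mae_and_coverage(actual, predicted):
    """Compute (MAE, % coverage) for one run.

    actual maps (user, item) -> true rating for every test pair.
    predicted maps (user, item) -> predicted rating, or None (or missing)
    when no prediction could be made. MAE is None if nothing is covered.
    """
    covered = [p for p in actual if predicted.get(p) is not None]
    coverage = 100.0 * len(covered) / len(actual) if actual else 0.0
    if not covered:
        return None, coverage
    mae = sum(abs(actual[p] - predicted[p]) for p in covered) / len(covered)
    return mae, coverage
```

Because MAE is taken over covered items only, a low MAE paired with low coverage describes accuracy on a small subset of the test items, which is why the two figures are reported together.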
Table 3.2 lists the average MAEs and percentage coverage for each dataset. The MovieLens dataset has the best combination of low MAE and high coverage. The last.fm dataset has a good MAE value but has a lower coverage value than seen for the MovieLens dataset. Although the MAE values for the Epinions dataset seem reasonable, the coverage is not very good. It can be seen that the bookcrossing dataset has a high MAE value and a very low coverage value in comparison to the MAEs and coverage for the other three datasets. This has been corroborated by other studies [182].
Table 3.3: Comparison 2: Average MAEs and % Coverage for the MovieLens dataset with different test and train Splits: Pearson Correlation Approach.
% Test users   Avg MAE   % Coverage
20             0.740     97.817
30             0.741     97.897
40             0.740     98.088
50             0.747     98.079
60             0.750     98.188
70             0.762     98.035
80             0.785     96.968
90             0.840     90.613
Table 3.4: Comparison 2: Average MAEs and % Coverage for the last.fm dataset with different test/train Splits: Pearson Correlation Approach.
% Test users   Avg MAE   % Coverage
20             0.698     76.480
30             0.710     76.055
40             0.717     75.039
50             0.731     73.472
60             0.750     70.864
70             0.781     66.295
80             0.833     54.779
90             0.946     38.188
The second comparison uses the same collaborative filtering approach (Pearson correlation approach) and the same evaluation metrics (MAE and coverage) but, for the testing methodology, different percentages of test users are chosen, varying from 20% to 90% of test users (10% already having been tested in the previous comparison). As the number of test users increases, fewer users remain from which to form neighbourhoods and generate predictions.
The MovieLens results (Table 3.3) show very good performance for all test/train splits, degrading mainly, as expected, at the 90/10 split, where the MAE rises to 0.840 and coverage falls to 90.613%.
The last.fm dataset results (Table 3.4) show that MAE and coverage are maintained at good values up to the 60% split of test users. After this point, coverage values decrease and MAE values increase until, at the split of 90% test users, the MAE value is 0.946 and the coverage value is only 38.188%.
The bookcrossing dataset results (Table 3.5) are poor in general and very poor for coverage in particular, even at the 20% split of test users. Coverage degrades sharply to only 5.548% at the 90% split of test users. For the few predictions that are made at each split, the accuracy remains fairly stable (though poor), with MAE values between about 1.49 and 1.60.
Table 3.5: Comparison 2: Average MAEs and % Coverage for the bookcrossing dataset with different test and train Splits: Pearson Correlation Approach.
% Test users   Avg MAE   % Coverage
20             1.491     21.670
30             1.489     20.508
40             1.538     18.916
50             1.494     18.390
60             1.512     14.827
70             1.553     12.471
80             1.535     11.616
90             1.596      5.548
Table 3.6: Comparison 2: Average MAEs and % Coverage for the Epinions dataset with different test and train Splits: Pearson Correlation Approach.
% Test users   Avg MAE   % Coverage
20             0.913     39.637
30             0.935     32.993
40             0.878     40.714
50             0.991     35.582
60             1.017     35.713
70             0.966     26.276
80             0.894     29.936
90             0.892     22.194
The results for the Epinions dataset (Table 3.6), while not as bad, have a similar pattern to the bookcrossing results. Coverage is poor for all splits and generally decreases, though not monotonically, as the percentage of test users increases. For the few predictions that can be made, accuracy is generally reasonable, with MAE values in the range 0.878 to 1.017.

The third comparison uses a number of different collaborative filtering approaches. The testing methodology is fixed at a 20/80 split (20% of users are used as test users) and results are shown for one run only. An average over ten runs would give a more accurate value, but the relative comparisons between techniques should be mostly similar. One evaluation metric is used (MAE). The different collaborative filtering approaches used are a subset of those available in the PREA toolkit [145] and are:
• Baseline: random: predicting uniformly randomly from the score range.
• Baseline: test user average: predicting using the average of the current test user's ratings.
• Baseline: test item average: predicting using the average of the current test user's item ratings.
• UserDft: User-based Pearson Correlation (memory-based technique).
• ItemDft: Item-based Pearson Correlation (memory-based technique).
• NMF: Nonnegative Matrix Factorization [144].
• PMF: Probabilistic Matrix Factorization [201].
• Bayesian PMF: Bayesian Probabilistic Matrix Factorization [200].

Table 3.7: Average MAEs for each dataset with a number of techniques.
Technique                 MovieLens   last.fm   bookcrossing   Epinions
Baseline: random          1.3827      2.5479    2.8719         1.6170
Baseline: test user avg   0.8354      0.5899    1.2842         0.9370
Baseline: test item avg   0.8141      1.188     1.6333         0.8979
UserDft (Pear. Corr.)     0.7380      0.6553    1.7926         0.9667
ItemDft (Pear. Corr.)     0.7175      0.6420    1.7823         0.9612
NMF                       0.7791      0.6494    2.5713         1.0709
PMF                       0.8126      0.6708    1.2800         0.9053
Bayesian PMF              0.7481      0.61011   1.6143         0.9497
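The three baselines that appear as the first rows of Table 3.7 are simple enough to sketch directly. The code below is an illustrative approximation and not the PREA toolkit's implementation: the function names are hypothetical, the random baseline is assumed to draw a continuous value from the score range (the text does not say whether scores are discrete), and the item average is assumed to be taken over the training ratings of the test item.

```python
import random

def baseline_random(lo, hi, rng=random):
    """Predict a value drawn uniformly from the score range [lo, hi]."""
    return rng.uniform(lo, hi)

def baseline_user_average(train, user):
    """Predict the mean of the test user's remaining (training) ratings."""
    rs = train.get(user, {})
    return sum(rs.values()) / len(rs) if rs else None

def baseline_item_average(train, item):
    """Predict the mean of all training ratings given to the test item."""
    vals = [rs[item] for rs in train.values() if item in rs]
    return sum(vals) / len(vals) if vals else None
```

These baselines use no similarity information, so they serve as a floor against which the memory-based and matrix factorization techniques can be judged.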
Considering the results in Table 3.7, and firstly comparing the collaborative filtering approaches against the baseline approaches, it can be seen that, for the last.fm and Epinions datasets, the baseline approach gives better results, using user average and item average respectively.
Secondly, similar results can be seen when comparing the MAE results in Table 3.2 and the MAE results for the UserDft technique (user-based Pearson correlation as used in the Comparison 1 and Comparison 2 experiments) in Table 3.7.

Thirdly, comparing the MAE results for the five collaborative filtering techniques in Table 3.7, we can see that there is no clear "winner" across all datasets in terms of a best overall technique, although PMF performs best for two of the four datasets (bookcrossing and Epinions).