
GENERAL EXPERIMENTS: CD-CCA VS BASELINE ALGORITHMS

The goal of this chapter is to answer research question Q1.1. More specifically, we would like to see whether the additional data available to cross-domain recommenders helps us provide better recommendations to users; whether cross-domain recommenders can harm recommendation performance; whether one cross-domain recommender system performs better than the others; and whether the improvement we get from cross-domain recommendation is due to the additional data or to the properties of the cross-domain algorithm.

To answer these questions, we use CD-CCA (and CD-LCCA) as one of the cross-domain algorithms, in addition to the other state-of-the-art cross-domain and single-domain algorithms mentioned in Section 3.3. We compare the performance of these algorithms using the Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) of the predicted ratings. To understand the effect of additional data on recommendation performance, we apply the single-domain algorithm only to the target domain data and the cross-domain algorithms to both the source (auxiliary) and target datasets, and compare their results. Additionally, to understand the effect of the approach on the recommendation results, we apply the single-domain algorithm to a combination of source and target data, which gives a fair comparison with the cross-domain algorithms. This setting is shown in Figure 10. In the next step, we examine the correlation between these algorithms' performances on the available domain pairs in the data, to understand whether an increase in the performance of one algorithm is accompanied by an increase in the performance of the others.
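As a minimal sketch of the quantities used in this comparison, the following Python functions compute RMSE and MAE for a list of predictions, and a correlation coefficient between two algorithms' per-domain-pair errors. The function names are hypothetical, and the use of the Pearson coefficient is an assumption made for illustration; this section does not state which correlation measure is used.

```python
import math

def rmse(predicted, actual):
    """Root Mean Squared Error between two equal-length rating lists."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

def mae(predicted, actual):
    """Mean Absolute Error between two equal-length rating lists."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def pearson(xs, ys):
    """Pearson correlation (an assumed choice) between two error lists.

    xs[k] and ys[k] are the errors of two algorithms on the same
    domain pair k.
    """
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A coefficient close to 1 would indicate that domain pairs that are hard for one algorithm tend to be hard for the other as well.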

Our hypotheses in this part of the analysis are:

Figure 10: Experiment setup to answer research question Q1.1. The diagram relates CD-CCA, CMF, and RMGM to SD-SVD and CD-SVD through two comparisons labeled "Data Matters" and "Algorithm Matters".

• However, using auxiliary material, if selected and applied correctly, should either improve or preserve the performance of recommender systems;

• The performances of single-domain and cross-domain recommenders are correlated with each other due to the data characteristics;

• However, the improvement achieved by using auxiliary data depends also on the applied algorithm.

In the following sections, we present the results of our proposed and baseline algorithms on each of the datasets.

5.1 EXPERIMENT SETUP

To run the experiments on each of the datasets, we implement a user-stratified 5-fold cross-validation setting. The user-stratified setting represents a common situation in recommender systems: we would like to predict the ratings of some (probably new) users, given the ratings of other (probably similar) users. Accordingly, some of the users (20%) are selected as test users and the rest (80%) as training users. For the test users, 80% of their ratings on target-domain items are removed at random from the training dataset. The algorithms estimate these removed test-user ratings based on the training set. Finally, the estimated test ratings are compared to the real ones to calculate the error of each algorithm.

Figure 11: Separating test, train, and evaluation data from the target domain

The reason for removing 80% of the test users' ratings, rather than all of them, is to avoid the extreme cold-start case and to be able to perform a cold-start analysis on user profile sizes. Thus, we keep a random 20% of each test user's ratings and estimate the rest (the removed 80%), conditioned on observing this 20% and the ratings of the users in the training set. In this setting, if a test user has a large profile in the target domain, we have more information on this user than on a test user with a small target-domain profile. Consequently, the distribution of profile sizes among the test users is proportional to the gold-standard distribution of profile sizes. The amount of information that we have about the test users is thus kept in accordance with the amount of information we have about them in the gold standard. This allows us to perform a cold-start analysis that resembles the real-world setting: some new users are active and provide more ratings when they begin using a system, while others provide fewer.
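The split described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name, the nested-dictionary data layout, and the rounding of the 20% and 80% shares are all assumptions.

```python
import random

def split_test_users(ratings, test_user_frac=0.2, hidden_frac=0.8, seed=0):
    """Split target-domain ratings by user, then hide part of each test profile.

    ratings: dict mapping user -> {item: rating} (assumed layout).
    Returns (train, hidden):
      train  keeps every rating of the training users plus the observed
             share (here 20%) of each test user's ratings;
      hidden holds the removed share (here 80%) used for evaluation.
    """
    rng = random.Random(seed)
    users = sorted(ratings)
    rng.shuffle(users)
    n_test = round(test_user_frac * len(users))
    test_users = set(users[:n_test])

    train, hidden = {}, {}
    for user, items in ratings.items():
        if user not in test_users:
            train[user] = dict(items)
            continue
        item_ids = sorted(items)
        rng.shuffle(item_ids)
        n_hidden = round(hidden_frac * len(item_ids))
        hidden[user] = {i: items[i] for i in item_ids[:n_hidden]}
        train[user] = {i: items[i] for i in item_ids[n_hidden:]}
    return train, hidden
```

Because the hidden share is a fixed fraction of each profile, a test user with a large gold-standard profile also keeps a proportionally larger observed profile, which is the property the cold-start analysis relies on.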

Some of the algorithms have parameters that should be selected by cross-validation. For example, the number of components should be provided as an input to the SVD++ algorithms. To find the best set of parameters for each algorithm, we remove a "validation" set of ratings from the training data. This validation set is selected in the same way as the test set: we select 15% of the users as validation users and remove 80% of their ratings from the training set. Then, we train the algorithms with different parameter values on the remaining training ratings and test them on the validation set to select the parameters that result in the best performance.
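The selection loop can be sketched as below. The damped-mean predictor is a toy stand-in chosen only to keep the example self-contained; it is not one of the algorithms compared in this chapter, and all function names, as well as the use of RMSE as the selection criterion, are assumptions for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (predicted, actual) pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))

def fit_damped_means(train, damping):
    """Toy stand-in model: per-user mean shrunk toward the global mean.

    train: list of (user, item, rating) triples (assumed layout).
    """
    global_mean = sum(r for _, _, r in train) / len(train)
    sums, counts = {}, {}
    for user, _, rating in train:
        sums[user] = sums.get(user, 0.0) + rating
        counts[user] = counts.get(user, 0) + 1
    return global_mean, sums, counts, damping

def predict_damped_means(model, user, item):
    """Predict a rating; the toy model ignores the item."""
    global_mean, sums, counts, damping = model
    n = counts.get(user, 0)
    if n + damping == 0:
        return global_mean  # unseen user and no damping
    return (sums.get(user, 0.0) + damping * global_mean) / (n + damping)

def select_parameter(candidates, train, validation, fit, predict):
    """Return the candidate value with the lowest validation RMSE."""
    best_value, best_error = None, math.inf
    for value in candidates:
        model = fit(train, value)
        error = rmse([(predict(model, u, i), r) for u, i, r in validation])
        if error < best_error:
            best_value, best_error = value, error
    return best_value, best_error
```

For an algorithm such as SVD++, `candidates` would hold the numbers of components to try, and `fit` and `predict` would wrap the corresponding training and prediction routines.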

After selecting the best parameters, we add the validation set back to the training set, train the algorithms on this new training dataset, and test them on the removed ratings of the 20% test users. Figure 11 shows a toy example of separating the test, train, and evaluation data in a target domain.

We repeat these experiments 5 times, each time selecting a different set of test users, for the 5-fold cross-validation. Finally, we average each algorithm's performance over these 5 folds and report the result.
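A minimal sketch of this repetition is given below. It assumes that the five test-user sets are disjoint and together cover all users, which is the usual reading of 5-fold cross-validation but is not stated explicitly above; `run_fold` is a hypothetical callback that trains and evaluates one algorithm for a given set of test users and returns its error.

```python
import random

def user_folds(users, n_folds=5, seed=0):
    """Partition users into n_folds disjoint test-user sets (assumed design)."""
    rng = random.Random(seed)
    shuffled = sorted(users)
    rng.shuffle(shuffled)
    # Take every n_folds-th user so fold sizes differ by at most one.
    return [shuffled[k::n_folds] for k in range(n_folds)]

def cross_validated_error(users, run_fold, n_folds=5, seed=0):
    """Average the error returned by run_fold over all folds."""
    folds = user_folds(users, n_folds, seed)
    errors = [run_fold(test_users=set(fold)) for fold in folds]
    return sum(errors) / len(errors)
```

With five folds, each fold contains roughly 20% of the users, which matches the share of test users described earlier.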

For the single-domain algorithm, we use only the target domain dataset. For the cross-domain algorithms, however, we have both the source and target datasets. To be able to compare single-domain and cross-domain algorithms, we remove the same set of ratings for all of the algorithms. Thus, for each test user in the cross-domain algorithms, the training data contain all of the user's ratings from the source domain plus 20% of the user's ratings in the target domain. The remaining 80% of the test user's target-domain ratings are what we test the algorithms on.
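The training data seen by a cross-domain algorithm can be sketched as the union of the full source-domain ratings and the target-domain training ratings that remain after the test ratings are removed. The function name and the dictionary layout are assumptions for illustration; items are tagged with their domain here only so that identical item identifiers in the two domains cannot collide.

```python
def build_cross_domain_training(source, target_train):
    """Combine source ratings with the observed target ratings.

    source, target_train: dicts mapping user -> {item: rating}
    (assumed layout). target_train must already have the hidden test
    ratings removed, so that single-domain and cross-domain algorithms
    are evaluated on exactly the same held-out ratings.
    """
    training = {}
    for domain, ratings in (("source", source), ("target", target_train)):
        for user, items in ratings.items():
            for item, rating in items.items():
                training.setdefault(user, {})[(domain, item)] = rating
    return training
```

The single-domain algorithm would be trained on `target_train` alone, while the held-out target ratings stay identical across both settings.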

To measure the performance of the algorithms, we use Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). Although other performance measures can be used in the recommender systems field, such as the rank-based measures nDCG, precision, and recall, we choose RMSE and MAE because of the way we formalize our problem. The proposed algorithms are formulated as estimating user ratings over the items. Consequently, the closeness of the estimated rating to the real rating is the measure that matters to us. If R is the set of test ratings, r_{u,i} is the rating of user u on item i, and r̂ is the estimated