7.2 Correlation Analysis

7.2.1 Correlation Analysis for the Supermarket Purchase Dataset

7.2.1.1 Correlation Analysis of Errors

Table 17 shows these correlations with the RMSE of the algorithms, and Table 16 shows them with the MAE of the algorithms.

As the tables show, the total KL-divergence of ratings between the source and target domains, the mode of the source-domain ratings, and the average CCA correlation between the source and target domains have no significant correlation with the RMSE or MAE of any algorithm. The mean and median of the user-based KL-divergences between the source and target domains likewise have no significant correlation with the RMSE of any cross-domain algorithm, although both are negatively correlated with the MAE of CD-CCA. The significant correlations of these two factors with the SD-SVD error are not meaningful, because SD-SVD uses only the target-domain data.
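To make the KL-divergence features concrete, the following is a minimal sketch of how a total KL-divergence and the mean, median, and variance of per-user KL-divergences could be computed. The direction of the divergence (source relative to target), the additive smoothing, the population variance, and all names and data are assumptions made for illustration; the text does not specify them.

```python
import math
import statistics
from collections import Counter


def rating_distribution(ratings, levels, smoothing=1e-6):
    """Empirical distribution over rating levels.

    The additive smoothing is an assumption; it keeps every probability
    positive so the logarithm below is always defined.
    """
    counts = Counter(ratings)
    total = len(ratings) + smoothing * len(levels)
    return [(counts[level] + smoothing) / total for level in levels]


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def user_kl_summary(source_by_user, target_by_user, levels):
    """Mean, median, and variance of per-user KL over users present in both domains."""
    shared_users = source_by_user.keys() & target_by_user.keys()
    kls = [
        kl_divergence(
            rating_distribution(source_by_user[u], levels),
            rating_distribution(target_by_user[u], levels),
        )
        for u in shared_users
    ]
    return {
        "mean": statistics.mean(kls),
        "median": statistics.median(kls),
        "variance": statistics.pvariance(kls),  # population variance (assumed)
    }


# Made-up ratings on a 1-5 scale, for illustration only.
levels = [1, 2, 3, 4, 5]
source_by_user = {"u1": [5, 4, 4], "u2": [3, 3, 2], "u3": [5, 5]}
target_by_user = {"u1": [4, 4, 5], "u2": [1, 2, 1], "u3": [2, 5]}

all_source = [r for rs in source_by_user.values() for r in rs]
all_target = [r for rs in target_by_user.values() for r in rs]
total_kl = kl_divergence(
    rating_distribution(all_source, levels),
    rating_distribution(all_target, levels),
)
summary = user_kl_summary(source_by_user, target_by_user, levels)
```

The total divergence compares the pooled rating histograms of the two domains, while the per-user summary compares each shared user's own histograms before aggregating.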

Except for the average CCA correlation between the source and target domains, every CCA-related feature has at least one significant correlation with the error of the algorithms. For example, the number of significant CCA correlations is significantly correlated with the RMSE of CD-CCA, RMGM, CD-SVD, and SD-SVD; the average of the first five components' CCA correlations is significantly correlated with the RMSE of CMF, RMGM, and SD-SVD; and the first component's CCA correlation is significantly correlated with the RMSE of CMF, RMGM, and SD-SVD.
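As an illustration of how a single table entry (a correlation coefficient followed by significance stars) might be produced, here is a minimal sketch. It assumes a Pearson correlation computed across domain pairs and a two-sided permutation test for the p-value; the text does not state which correlation coefficient or significance test was used, and the function names and data are hypothetical.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient (both inputs must have non-zero variance)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the correlation (an assumed test)."""
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def stars(p_value):
    """Significance markers following the thresholds in the table captions."""
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""


def table_cell(feature_values, errors):
    """Format one entry: correlation rounded to four decimals plus stars."""
    r = pearson_r(feature_values, errors)
    return f"{r:.4f}{stars(permutation_p_value(feature_values, errors))}"


# Made-up values for eight domain pairs, for illustration only.
target_density = [0.010, 0.015, 0.020, 0.030, 0.035, 0.050, 0.060, 0.080]
rmse = [1.20, 1.15, 1.18, 1.05, 1.02, 0.98, 0.90, 0.85]
cell = table_cell(target_density, rmse)
```

Each row of the tables would correspond to one feature vector and each column to one algorithm's error vector, both indexed by domain pair.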

Table 16: Correlations of data characteristics with the MAE of algorithms on the Supermarket dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

variables                       CD-CCA MAE   CMF MAE      RMGM MAE     CD-SVD MAE   SD-SVD MAE
user size                       0.3262*      -0.0441      -0.5925***   0.4777***    0.7593***
source item size                0.4702***    0.0043       -0.2012      0.3167*      0.2956*
target item size                -0.1585      0.0074       -0.5761***   0.2363       0.3918**
source density                  -0.0326      0.3506*      0.3546*      0.5849***    -0.1233
target density                  0.039        -0.8624***   -0.6611***   -0.1777      0.2763
total KL-divergence             -0.1717      -0.1384      0.0191       -0.1786      -0.1868
mean user KL-divergence         -0.3032*     -0.0836      0.2179       0.0352       -0.3761**
median user KL-divergence       -0.2994*     -0.0645      0.2231       0.1402       -0.3543*
variance user KL-divergence     -0.1676      0.3838**     0.8286***    -0.0679      -0.6455***
source mean rating              0.0049       0.3421*      0.247        0.6268***    0.0283
target mean rating              0.2885*      -0.7415***   -0.7859***   -0.016       0.5452***
source median rating            -0.0215      0.3475*      0.2726       0.6211***    0.0025
target median rating            0.3113*      -0.7554***   -0.8048***   -0.0283      0.5476***
source mode rating              0.09         0.189        -0.0884      0.2697       0.2319
target mode rating              -0.0446      -0.6463***   -0.3259*     0.0486       0.2391
source var rating               0.005        0.3111*      0.1997       0.5877***    0.0417
target var rating               0.2404       -0.6977***   -0.6131***   -0.0371      0.4268**
source kurtosis rating          -0.1073      -0.1338      -0.0356      -0.2982*     -0.1064
target kurtosis rating          -0.1827      0.2162       0.6107***    -0.1987      -0.4793***
source skewness rating          -0.1281      -0.1418      -0.0408      -0.3318*     -0.1144
target skewness rating          -0.1801      0.2392       0.6676***    -0.204       -0.5167***
user to source item ratio       -0.3496*     -0.0536      -0.1673      -0.0662      0.1397
user to target item ratio       0.431**      0.0189       0.3157*      0.0514       0.0189
source to target item ratio     0.5161***    0.024        0.3378*      0.0893       -0.0923
source to target density ratio  -0.1478      0.4109**     0.644***     0.4444**     -0.4935***
CCA correlation ≥ 0.80          0.0129       -0.1914      -0.4423**    0.1974       0.2572
CCA correlation ≥ 0.90          0.0492       -0.2208      -0.5352***   0.2183       0.3403*
CCA correlation ≥ 0.95          0.0256       -0.2492      -0.5717***   0.204        0.3873**
average correlation             -0.0914      -0.1572      -0.247       0.1455       0.114
first component correlation     0.1104       -0.1576      -0.5075***   0.317*       0.3793**
first 5 components correlation  0.1145       -0.2227      -0.5938***   0.2587       0.4096**
# components                    0.1424       0.0196       -0.4671***   0.3329*      0.4781***

Table 17: Correlations of data characteristics with the RMSE of algorithms on the Supermarket dataset; *: significant with p-value < 0.05; **: significant with p-value < 0.01; ***: significant with p-value < 0.001

variables                       CD-CCA RMSE  CMF RMSE     RMGM RMSE    CD-SVD RMSE  SD-SVD RMSE
user size                       -0.0598      -0.1703      -0.5693***   0.4108**     0.7439***
source item size                0.3265*      -0.2235      -0.1831      0.2758       0.2827*
target item size                -0.3395*     -0.2111      -0.5343***   0.1757       0.363**
source density                  0.1617       0.1172       0.3587*      0.6299***    -0.0981
target density                  -0.3796**    -0.4313**    -0.7199***   -0.2782      0.2448
total KL-divergence             -0.1587      0.0509       -0.0003      -0.2103      -0.2006
mean user KL-divergence         -0.1694      0.1084       0.1890       0.0332       -0.3874**
median user KL-divergence       -0.1741      0.1612       0.1862       0.1332       -0.359*
variance user KL-divergence     0.3158*      0.4961***    0.831***     0.0814       -0.6189***
source mean rating              0.1384       0.1345       0.2583       0.6772***    0.0516
target mean rating              -0.2404      -0.4377**    -0.8317***   -0.1303      0.5145***
source median rating            0.1270       0.1470       0.2847*      0.6779***    0.0267
target median rating            -0.2247      -0.4583***   -0.8498***   -0.1496      0.5135***
source mode rating              -0.0173      -0.0247      -0.0738      0.2701       0.2444
target mode rating              -0.2688      -0.3138*     -0.3695**    0.0373       0.2477
source var rating               0.1122       0.1261       0.2051       0.6202***    0.0639
target var rating               -0.2073      -0.3184*     -0.669***    -0.1167      0.4091**
source kurtosis rating          -0.1199      0.0726       -0.0480      -0.3341*     -0.1106
target kurtosis rating          0.0959       0.5192***    0.5725***    -0.1194      -0.4642***
source skewness rating          -0.1420      0.0780       -0.0549      -0.3681**    -0.1178
target skewness rating          0.1299       0.529***     0.6307***    -0.1139      -0.4961***
user to source item ratio       -0.4389**    0.1225       -0.1738      -0.0686      0.1429
user to target item ratio       0.437**      0.1710       0.2876*      0.0810       0.0421
source to target item ratio     0.5717***    0.0210       0.3191*      0.1046       -0.0818
source to target density ratio  0.2573       0.2166       0.647***     0.5296***    -0.4641***
CCA correlation ≥ 0.80          -0.2278      -0.3547*     -0.4285**    0.1043       0.2127
CCA correlation ≥ 0.90          -0.2289      -0.3867**    -0.5187***   0.1224       0.2982*
CCA correlation ≥ 0.95          -0.2576      -0.342*      -0.5585***   0.1185       0.3557*
average correlation             -0.2428      -0.2628      -0.2417      0.0763       0.0798
first component correlation     -0.0936      -0.3517*     -0.4814***   0.2479       0.361**
first 5 components correlation  -0.1678      -0.3901**    -0.5733***   0.1678       0.3747**
# components                    -0.0752      -0.2096      -0.4355**    0.2862*      0.4628***
# significant correlations      -0.3511*     -0.2179      -0.5587***   0.2846*      0.5005***

canonical correlation between the domain pairs, these CCA-related features have negative correlations with the RMSE of CD-CCA, RMGM, and CMF. That is, the RMSE of these algorithms is lower when the canonical correlation between the source and target domains is high. However, as the table shows, although CD-SVD is also a cross-domain recommender, these correlations are always positive for its error: the error of CD-SVD grows as the CCA correlation between the source and target domains increases. Moreover, although SD-SVD is a single-domain algorithm (so there should be no meaningful correlation between its error and the CCA-based features), its error has a significant positive correlation with most of the CCA-related features. As we saw in Section 5.2, the errors of the algorithms, especially those of CD-SVD and SD-SVD, are highly correlated in the Supermarket dataset. Consequently, we hypothesize that the positive correlation between the error of CD-SVD and the CCA-related features arises from the same factors that create the positive correlation between the error of SD-SVD and these features, especially because the magnitude and significance of this positive correlation are higher for the RMSE and MAE of SD-SVD than for those of CD-SVD.
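The CCA-related features discussed here can be sketched as simple summaries of the canonical correlations of one domain pair. The snippet below is an assumption-laden illustration: it treats the threshold features as counts of components at or above each threshold and "# components" as the number of canonical correlations supplied, neither of which is confirmed by the text. The number of significant correlations is left out because the significance test is not described. All names and values are hypothetical.

```python
def cca_features(canonical_correlations):
    """Summarise the canonical correlations of one source-target domain pair."""
    # Sort in descending order so the first entry is the first component.
    rho = sorted(canonical_correlations, reverse=True)
    first_five = rho[:5]
    return {
        "count_ge_0.80": sum(r >= 0.80 for r in rho),  # interpretation assumed
        "count_ge_0.90": sum(r >= 0.90 for r in rho),
        "count_ge_0.95": sum(r >= 0.95 for r in rho),
        "average_correlation": sum(rho) / len(rho),
        "first_component": rho[0],
        "first_5_components": sum(first_five) / len(first_five),
        "n_components": len(rho),  # interpretation assumed
    }


# Made-up canonical correlations, for illustration only.
features = cca_features([0.97, 0.93, 0.85, 0.60, 0.40, 0.10])
```

One such feature dictionary per domain pair would then be correlated with the error each algorithm achieves on that pair.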

Among the general dataset characteristics, the density of the target domain has a significant negative correlation with the RMSE of CD-CCA, RMGM, and CMF and with the MAE of RMGM and CMF. Thus, the error is lower when more user rating information is available in the target domain. Conversely, the denser the source domain, the higher the error of CD-SVD and RMGM, which suggests that more information in the source domain can harm more than help these two cross-domain recommenders. One interesting observation is the positive correlation between the number of users and the RMSE of CD-SVD and SD-SVD (and the MAE of CD-CCA, CD-SVD, and SD-SVD). As the number of users grows, we expect a better understanding of the variety of user tastes, and thus better recommendations. For these algorithms on the Supermarket dataset, however, the relationship is reversed. We also see that as the number of users grows relative to the number of source-domain items (that is, as the source-domain user-item matrix gets taller), the error of CD-CCA decreases significantly. However, as the target domain's user-item matrix gets taller, the errors of CD-CCA and RMGM increase. Another general factor with a large correlation with the errors is the source-to-target density ratio. A source domain that is denser than the target domain results in worse recommendations from RMGM, CD-SVD, and CMF, and in (meaninglessly) better recommendations from SD-SVD.
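The general dataset characteristics in this paragraph follow directly from the sizes of the two rating matrices. The sketch below assumes that both domains share one user set (so a single user count applies to both) and that density is the fraction of observed entries in the user-item matrix; the function name, parameter names, and numbers are hypothetical.

```python
def domain_pair_features(n_users, n_source_items, n_target_items,
                         n_source_ratings, n_target_ratings):
    """Size, density, and shape-ratio features for one source-target pair."""
    # Density: observed ratings divided by the number of matrix cells (assumed definition).
    source_density = n_source_ratings / (n_users * n_source_items)
    target_density = n_target_ratings / (n_users * n_target_items)
    return {
        "user_size": n_users,
        "source_item_size": n_source_items,
        "target_item_size": n_target_items,
        "source_density": source_density,
        "target_density": target_density,
        # A larger user-to-item ratio means a "taller" user-item matrix.
        "user_to_source_item_ratio": n_users / n_source_items,
        "user_to_target_item_ratio": n_users / n_target_items,
        "source_to_target_item_ratio": n_source_items / n_target_items,
        "source_to_target_density_ratio": source_density / target_density,
    }


# Made-up sizes, for illustration only.
pair = domain_pair_features(
    n_users=100, n_source_items=50, n_target_items=20,
    n_source_ratings=500, n_target_ratings=400,
)
```

In this made-up pair the source domain is half as dense as the target domain, so the source-to-target density ratio is 0.5.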

Most of the descriptive-statistics features have a significant relationship with the error of CMF. While the source domain's central tendency measures are positively correlated with the MAE of CMF, the same measures from the target domain are negatively correlated with CMF's error. The target domain's central tendency features are also negatively correlated with RMGM's errors and positively correlated with SD-SVD's. For the dispersion statistics, SD-SVD performs worse when the target-domain ratings have more variance, whereas the cross-domain recommenders perform better in this case. This relationship is reversed for the kurtosis and skewness of the target ratings. More specifically, the RMSE of RMGM and CMF and the MAE of RMGM increase significantly when the target ratings are more skewed and have higher kurtosis.
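The descriptive statistics named in this paragraph can be computed per domain from the list of observed ratings. The following is a minimal sketch using population moments; whether the reported kurtosis is excess kurtosis and whether the moments are bias-corrected are not stated in the text, so both choices here are assumptions, as are the names and data.

```python
import statistics


def rating_statistics(ratings):
    """Central tendency, dispersion, and shape statistics of a rating list."""
    n = len(ratings)
    mean = sum(ratings) / n
    # Central moments of order 2, 3, and 4 (population versions, assumed).
    m2 = sum((r - mean) ** 2 for r in ratings) / n
    m3 = sum((r - mean) ** 3 for r in ratings) / n
    m4 = sum((r - mean) ** 4 for r in ratings) / n
    return {
        "mean": mean,
        "median": statistics.median(ratings),
        "mode": statistics.mode(ratings),
        "var": m2,
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (assumed convention)
    }


# Made-up ratings, for illustration only.
symmetric = rating_statistics([1, 2, 3, 4, 5])
right_skewed = rating_statistics([1, 1, 1, 1, 5])
```

The second example has most of its mass at the lowest rating and a long right tail, which yields a positive skewness, whereas the symmetric example has zero skewness.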

In general, RMGM and SD-SVD have the largest number of significant correlations with the data features. For SD-SVD, many of these correlations do not imply any meaningful relationship, because it uses only the target domain's data while many of the features are computed on domain pairs. We can understand these correlations better by looking at the scatter plots of these features against the errors of the algorithms, which can be found in Appendix A.5.

7.2.1.2 Correlation Analysis of Improvement Ratio

In this section, we look at the

In document Canonical Correlation Analysis in Cross-Domain Recommendation (Page 130-134)