Item-Item CF - Towards Recommender Engineering: tools and experiments for identifying recommend

[Figure 5.5 plots: MAE and RMSE against neighborhood size (25 to 100), with one series per item similarity function: Cosine, Cosine+Item, Cosine+User, Cosine+UserItem, Pearson.]

Figure 5.5: Prediction accuracy for item-item CF on ML-1M.

|𝐼_𝑢 ∩ 𝐼_𝑣|/(√|𝐼_𝑢| √|𝐼_𝑣|) (the relationship is not linear due to its dependence on the actual values of the ratings, not just their presence). We can view this natural discounting as a parameter-free version of significance weighting, and our results here show that it seems to be just as effective, if not superior.

Another configuration not considered in previously published user-user literature is averaging over deviations from the full user-item personalized mean. Most work has focused on mean-centering or 𝑧-normalizing user vectors prior to computing predictions. We observe here that subtracting the user-item mean, so that the user-user collaborative filter only attempts to model the deviation of each rating from the user and item biases, outperforms both approaches that consider only the user's ratings.


5.4 Item-Item CF

Figure 5.5 summarizes the performance we achieved with item-item CF (section 3.7.4), revisiting two of the configuration dimensions explored by Sarwar et al. [Sar+01]. The neighborhood size is the number of neighbors actually considered for each prediction; in all cases, the computed similarity matrix was truncated to 250 neighbors per item. No significance weighting or damping was applied to the similarity functions. Each of the cosine variants reflects a different mean-subtracting normalization applied prior to building the similarity matrix; user-mean cosine corresponds to the adjusted cosine used by Sarwar et al. Consistent with that work, normalized cosine performs best, and this result still holds on the larger data set. We also find that normalizing by item mean performs better than normalizing by user mean. This suggests that measuring similarity by users whose opinion of an item is above or below average provides more value than measuring it by whether they prefer the item more or less than the average item they have rated. This result is surprising, which may be why it has not been tried in prior work. LensKit's flexibility made it easy to try all the baseline normalizers, and that is how we found it.

Similarity function and neighborhood size are just a few of the configuration points LensKit's item-item implementation exposes, however. Other parameters that affect item-item's performance and behavior include:

• The number of similar items to retain in the model

• Item similarity damping

• Normalization strategy

[Figure 5.6 plot: RMSE against neighborhood size (25 to 100) in three panels (ML-1M, ML-10M, Y!M Subset), with one series per model size: 250, 500, 1000, Full.]

Figure 5.6: Item-item accuracy by neighborhood size.

The similarity damping term 𝛽 is a parameter to bias the similarity of items with few users in common towards 0, reflecting the lack of information about their true similarity. For cosine similarity, it is incorporated in the denominator:

$$\mathrm{sim}(i, j) = \frac{\vec{r}_i \cdot \vec{r}_j}{\|\vec{r}_i\|_2 \, \|\vec{r}_j\|_2 + \beta}$$

It achieves the same goal as significance weighting [HKR02] in an arguably more elegant manner.
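As a minimal sketch of the damped cosine similarity described above, the following Python function adds 𝛽 to the product of the vector norms. The function name, the dense-list representation of rating vectors (0 meaning "not rated"), and the toy vectors are assumptions made for illustration; they are not LensKit's API.

```python
import math

def damped_cosine(r_i, r_j, beta=0.0):
    """Cosine similarity of two item rating vectors with damping term beta.

    r_i, r_j: equal-length sequences of (normalized) ratings indexed by user,
    with 0 for users who have not rated the item (illustrative representation).
    """
    dot = sum(a * b for a, b in zip(r_i, r_j))
    norm_i = math.sqrt(sum(a * a for a in r_i))
    norm_j = math.sqrt(sum(b * b for b in r_j))
    denom = norm_i * norm_j + beta
    return dot / denom if denom > 0 else 0.0

# Two identical items: one pair is supported by a single user, the other by four.
few = [1.0, 0.0, 0.0, 0.0]
many = [1.0, 1.0, 1.0, 1.0]

sim_few = damped_cosine(few, few, beta=1.0)    # 1 / (1 + 1) = 0.5
sim_many = damped_cosine(many, many, beta=1.0)  # 4 / (4 + 1) = 0.8
```

Without damping both pairs would have similarity 1; with 𝛽 = 1 the pair supported by fewer users is pulled further towards 0.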

Parameter Relationships

To develop a systematic method of efficiently tuning item-item CF, we want to identify the relationships between parameters. In particular, we want to identify interaction effects between parameters with respect to error metrics in order to see whether some parameters can be trained independently. If the optimal choice for one parameter does not affect the optimal choice for another, then those parameters can be disentangled and trained independently instead of relying on grid search. This decreases the parameter search space for those parameters from 𝑂(𝑚𝑛) to 𝑂(𝑚 + 𝑛), where 𝑚 and 𝑛 are the numbers of candidate values for the two parameters.
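The difference in search cost can be sketched as follows. The two search functions and the error surface `toy_error` are hypothetical: the surface is deliberately built without an interaction term and is not measured data, so the example only illustrates the evaluation counts, not any result from our experiments.

```python
from itertools import product

def grid_search(score, xs, ys):
    """Evaluate every (x, y) pair: len(xs) * len(ys) evaluations."""
    best = min(product(xs, ys), key=lambda pair: score(*pair))
    return best, len(xs) * len(ys)

def independent_search(score, xs, ys, y0):
    """Tune x with y held at y0, then tune y with x fixed at its best value:
    len(xs) + len(ys) evaluations."""
    best_x = min(xs, key=lambda x: score(x, y0))
    best_y = min(ys, key=lambda y: score(best_x, y))
    return (best_x, best_y), len(xs) + len(ys)

def toy_error(nnbrs, model_size):
    """Made-up error surface with no interaction between its two arguments."""
    return (nnbrs - 20) ** 2 / 1000 + 100 / model_size

nnbrs_values = [10, 20, 30, 50, 75, 100]
model_sizes = [250, 500, 1000, 2000]

grid_best, grid_evals = grid_search(toy_error, nnbrs_values, model_sizes)
ind_best, ind_evals = independent_search(toy_error, nnbrs_values, model_sizes, y0=250)
```

On this surface both searches select the same configuration, but the grid search needs 24 evaluations and the independent search needs 10. When the parameters do interact, the independent search is only a heuristic and may miss the joint optimum.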

Figure 5.6 shows the accuracy of item-item CF for different model sizes as the neighborhood size is varied. For this evaluation, we normalized ratings by subtracting the item mean, used no similarity or baseline damping, and used item mean for fallback predictions. We observe two key things from this chart:


• Relatively few neighbors are needed (10–20 is a reasonable value across the board).

• There is no significant interaction between model size and the optimal value of the neighborhood size. The curve shifts slightly for different model sizes, but this does not affect the optimal neighborhood size.

Since they do not interact within a reasonable range of neighborhood sizes, neighborhood size and model size can be picked independently to achieve an optimal combination. This is expected theoretically: since models and neighborhoods are chosen by the same criterion (similarity), the only difference that the model size makes is restricting the available neighbors. For any prediction where there are enough neighbors in the model for a full neighborhood, having additional neighbors in the model provides no additional benefit.
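The argument above can be illustrated with a small sketch. The function `neighborhood`, the dictionary representation of a similarity row, and the toy similarity values are assumptions for illustration only; they do not mirror LensKit's data structures.

```python
def neighborhood(similarities, rated, model_size, k):
    """Select up to k neighbors for one target item.

    similarities: dict mapping neighbor item -> similarity to the target item
    rated: set of items the user has rated
    model_size: number of most-similar items kept in the model (None = full model)
    k: neighborhood size
    """
    row = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    if model_size is not None:
        row = row[:model_size]  # model truncation keeps the most similar items
    candidates = [item for item, _ in row if item in rated]
    return candidates[:k]       # neighborhood is chosen by the same criterion

sims = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5, "f": 0.4}
rated = {"a", "c", "d", "f"}

full = neighborhood(sims, rated, model_size=None, k=2)    # ['a', 'c']
truncated = neighborhood(sims, rated, model_size=3, k=2)  # ['a', 'c']
starved = neighborhood(sims, rated, model_size=1, k=2)    # ['a']
```

As long as the truncated model still contains at least k items the user has rated, it yields the same neighborhood as the full model; truncation only changes the prediction once the model can no longer fill the neighborhood.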

Figure 5.7 shows accuracy as the similarity damping value is adjusted. The optimal damping value is small and depends strongly on model size. Damping hurts full models but improves accuracy on truncated models. We found no interaction between damping and neighborhood size for reasonable neighborhood sizes; ML-1M had a small interaction at 𝑛 = 10, but fitting the neighborhood size before the damping term removes this interaction. These results are consistent with our user-user results in section 5.3 that significance weighting, while necessary for Pearson correlation, does not help cosine similarity. They also suggest that the benefit of damping or significance weighting is in neighborhood selection, not final score computation: by preferring to keep high-confidence neighbors (since low-confidence similarities are damped out), the model is able to achieve higher accuracy; if enough neighbors are available, however, damping does not improve the ability to select neighbors for doing the actual scoring. Model truncation may be able to benefit from incorporating confidence into the neighbor selection strategy and forgoing explicit damping of similarities.

[Figure 5.7 plot: RMSE against similarity damping (0 to 1000) in three panels (ML-1M, ML-10M, Y!M Subset), with one series per model size: 250, 500, 1000, Full.]

Figure 5.7: Item-item accuracy by similarity damping.

Figure 5.8 shows the impact of data normalization on recommender accuracy. All recommenders use full models and use item mean to supply predictions for unscoreable items, even when another baseline is used for normalization. Each baseline scorer is used as a normalizer, normalizing rating data by subtracting that baseline's scores prior to computing similarities and rating predictions. We observe two key things here. First, consistent with Figure 5.5, normalizing ratings by the item mean outperforms the user mean that has historically been used. Second, the item-item recommender's performance is not rank-consistent with independent baseline performance. That is, the best-performing baseline, when used as a normalizer, does not necessarily produce the best-performing collaborative filter.
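Normalizing by a baseline amounts to subtracting that baseline's score from each rating before similarities are computed. The sketch below shows this for the four baselines in Figure 5.8. The function names, the dictionary representation of ratings, and in particular the definition of the user-item mean as item mean plus the user's average offset from the item means are assumptions for illustration, and no damping is applied.

```python
from collections import defaultdict

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def fit_baselines(ratings):
    """ratings: dict (user, item) -> rating.
    Returns a dict mapping baseline name -> function (user, item) -> score."""
    mu = mean(ratings.values())
    by_item, by_user = defaultdict(list), defaultdict(list)
    for (u, i), r in ratings.items():
        by_item[i].append(r)
        by_user[u].append(r)
    item_mean = {i: mean(rs) for i, rs in by_item.items()}
    user_mean = {u: mean(rs) for u, rs in by_user.items()}
    # Assumed definition: user offset = the user's average deviation from item means.
    offsets = defaultdict(list)
    for (u, i), r in ratings.items():
        offsets[u].append(r - item_mean[i])
    user_offset = {u: mean(ds) for u, ds in offsets.items()}
    return {
        "Global": lambda u, i: mu,
        "Item": lambda u, i: item_mean[i],
        "User": lambda u, i: user_mean[u],
        "UserItem": lambda u, i: item_mean[i] + user_offset[u],
    }

def normalize(ratings, baseline):
    """Subtract the baseline score from every rating."""
    return {(u, i): r - baseline(u, i) for (u, i), r in ratings.items()}

# Toy ratings (illustrative only).
ratings = {("u1", "i1"): 5.0, ("u1", "i2"): 3.0,
           ("u2", "i1"): 3.0, ("u2", "i2"): 1.0}
baselines = fit_baselines(ratings)
item_centered = normalize(ratings, baselines["Item"])
fully_centered = normalize(ratings, baselines["UserItem"])
```

In this toy data the item means are 4 and 2, so item-mean normalization leaves residuals of +1 for `u1` and -1 for `u2`, while the user-item mean explains the ratings completely and leaves residuals of 0.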

Training Strategy

We propose the following strategy for tuning the configuration of an item-item collaborative filter:

[Figure 5.8 plot: RMSE against neighborhood size (25 to 100) in three panels (ML-1M, ML-10M, Y!M Subset), with one series per normalizing baseline: Global, Item, User, UserItem.]

Figure 5.8: Normalizer impact on item-item performance.

1. Use item-user mean as the fallback for unpredictable items.²

2. Start with item mean normalization (or item-user mean; in LensKit, item mean is less computationally expensive, so it is the cheaper starting point).

3. With a full model, start with a small neighborhood size (e.g. 10) and increase it until a local minimum is found.

4. Decrease model size for the desired size/quality tradeoff.

5. Try the other of item mean and item-user mean normalization to see if there is improvement.

6. If desired, add a small amount of similarity damping to recover quality lost due to model truncation.
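The steps above can be sketched as a simple sequential search. Everything in this sketch is hypothetical: `tune_item_item`, the configuration keys, the `tolerance` used to pick a model size, and the error surface `toy_evaluate` (a made-up function for demonstration, not experimental results). In practice `evaluate` would train and evaluate a recommender with the given configuration and return its error.

```python
def tune_item_item(evaluate, nnbr_grid, model_sizes, dampings, tolerance):
    """Sequentially tune an item-item configuration.

    evaluate(config) -> error (lower is better); config is a plain dict.
    """
    # Steps 1-2: fixed fallback, item mean normalization, full model, no damping.
    config = {"fallback": "item-user", "normalizer": "item",
              "model_size": None, "nnbrs": nnbr_grid[0], "damping": 0}
    best = evaluate(config)
    # Step 3: grow the neighborhood until the error stops improving.
    for n in nnbr_grid[1:]:
        err = evaluate({**config, "nnbrs": n})
        if err >= best:
            break
        config["nnbrs"], best = n, err
    # Step 4: smallest model whose error is within tolerance of the full model.
    full_err = best
    for size in sorted(model_sizes):
        err = evaluate({**config, "model_size": size})
        if err <= full_err + tolerance:
            config["model_size"], best = size, err
            break
    # Step 5: try the other normalizer.
    err = evaluate({**config, "normalizer": "item-user"})
    if err < best:
        config["normalizer"], best = "item-user", err
    # Step 6: optionally add similarity damping if it recovers quality.
    for beta in dampings:
        err = evaluate({**config, "damping": beta})
        if err < best:
            config["damping"], best = beta, err
    return config, best

def toy_evaluate(c):
    """Made-up error surface, for demonstration only."""
    err = 0.80 + abs(c["nnbrs"] - 20) * 0.001
    if c["model_size"] is None:
        err += c["damping"] * 0.0001            # damping hurts the full model
    else:
        err += 5.0 / c["model_size"]            # truncation penalty
        err -= min(c["damping"], 50) * 0.0001   # damping recovers a little
    if c["normalizer"] == "item-user":
        err += 0.002
    return err

config, error = tune_item_item(toy_evaluate, [10, 20, 30, 50],
                               [250, 500, 1000], [25, 50, 100], tolerance=0.012)
```

Because each step fixes the parameters tuned before it, the number of evaluations grows with the sum of the candidate list lengths rather than their product, which relies on the weak interactions observed in the preceding figures.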

² Our experiments did not do this due to an experimentation error, but item-item has high enough coverage that any impact on our results should be negligible.

In document Towards Recommender Engineering: tools and experiments for identifying recommender differences (Page 156-162)