Since model size does not affect the best neighborhood size, steps 2 and 3 can be reversed; in that case, a reasonable neighborhood size (e.g. 20) can be used to pick the desired model size, and the neighborhood size can then be refined.
The performance difference between using the item mean and item-user mean baselines for normalization seems to vary by data set, with sparsity being a possible reason. More study is needed on a wider array of data sets to understand this relationship more exactly, but using item mean seems to work well.
5.5 Regularized SVD
Figure 5.9 shows the performance of LensKit’s gradient regularized SVD (section 3.7.6) implementation on both the 100K and 1M data sets for varying latent feature counts 𝑘. 𝜆 is the learning rate; 𝜆 = 0.001 was documented by Simon Funk as providing good performance on the Netflix data set [Fun06], but we found it necessary to increase it for the much smaller ML-100K set.
Each feature was trained for 100 iterations, and the item-user mean baseline with a smoothing factor of 25 was used as the baseline predictor and normalization. We also used Funk’s range-clamping optimization, where the prediction is clamped to be in the interval [1, 5] ([0.5, 5] for ML-10M) after each feature’s contribution is added.
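The range-clamping step can be illustrated with a minimal Python sketch. The function name `clamped_predict` and its arguments are illustrative assumptions, not LensKit's API; the sketch only shows how clamping after each feature's contribution differs from clamping the final sum once.

```python
def clamped_predict(baseline, user_vec, item_vec, lo=1.0, hi=5.0):
    """Add feature contributions one at a time, clamping after each.

    baseline  -- baseline prediction for this user-item pair
    user_vec  -- the user's feature values (hypothetical input format)
    item_vec  -- the item's feature values, same length as user_vec
    lo, hi    -- rating range; use lo=0.5 for a half-star scale
    """
    score = baseline
    for user_f, item_f in zip(user_vec, item_vec):
        # Clamp the running score after every feature, not just at the end.
        score = min(max(score + user_f * item_f, lo), hi)
    return score
```

With a baseline of 3.0 and two features contributing +4 and -1, clamping after each feature yields 4.0, whereas clamping only the final sum (6.0) would yield 5.0.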
The performance of matrix factorization recommenders is governed by many hyperparameters. These include:
• feature count 𝑘
• learning rate 𝜆
• per-feature stopping condition (threshold, iteration count, or other criteria)
• baseline predictor

Figure 5.9: Prediction accuracy for regularized SVD.
[Figure: MAE and RMSE versus feature count (25 to 100) on ML-100K, ML-1M, and ML-10M, for learning rates 0.001 and 0.002.]
More sophisticated variants have even more parameters. Most of these parameters will affect the final factorized matrix, requiring the model to be retrained for each variant when attempting to optimize them. Optimizing all these parameters by grid search is therefore prohibitively expensive. In practice, a few of the parameters are tuned, such as feature
count, using default values for many of the rest. The optimal values for some of these hyperparameters are also heavily dependent on the data set: the learning control parameters (learning rate and stopping condition) depend greatly, in our experience, on the number of ratings in the data set.
Further, many of the parameters interact. Learning rate and stopping condition naturally interact: a higher learning rate will accelerate convergence, though likely at the expense of accuracy. Our experiments have also found the regularization term and feature count to interact with the stopping condition in minimizing the recommender’s error.
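To make the cost of a joint search concrete, the following sketch counts the models a full grid search would have to train. The feature counts and learning rates are the values plotted in Figure 5.9; the regularization and epoch values are hypothetical placeholders chosen only for illustration.

```python
from itertools import product

# Candidate values per hyperparameter. Only feature_count and
# learning_rate come from Figure 5.9; the rest are hypothetical.
grid = {
    "feature_count": [25, 50, 75, 100],
    "learning_rate": [0.001, 0.002],
    "regularization": [0.01, 0.02, 0.05],   # hypothetical values
    "epochs_per_feature": [50, 100, 150],   # hypothetical values
    "baseline": ["item mean", "item-user mean"],
}

def grid_size(grid):
    """Number of models a full grid search over `grid` must train."""
    return sum(1 for _ in product(*grid.values()))

# Dropping one dimension (here the per-feature stopping condition)
# shrinks the search by that dimension's number of candidate values.
reduced = {k: v for k, v in grid.items() if k != "epochs_per_feature"}
```

Under these assumptions the full grid requires 144 trained models, and removing the stopping-condition dimension reduces that to 48.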
To decrease the search space, we have attempted to find more automatic strategies for determining when to stop training. The process of learning an SVD needs two stopping conditions: it needs to know when to stop training each feature, and when to stop training new features.
If we can determine when to stop either (or both) of these two processes in a parameter-free fashion (or based on parameters whose values are unlikely to be dataset-dependent), then we can decrease the dimensionality of the hyperparameter search space and make tuning significantly more efficient. A similar approach may also be applicable to other parameters, but we focus here on the stopping condition.
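The two stopping conditions can be made explicit in a minimal sketch of feature-at-a-time gradient descent. This is not LensKit's implementation: the function name, the callback signatures, and the regularization and initial-value defaults are assumptions made for illustration. For simplicity, untrained features contribute nothing to the prediction and range clamping is omitted.

```python
import numpy as np

def train_funksvd(users, items, values, baseline, stop_feature, stop_model,
                  learning_rate=0.001, reg=0.02, init=0.1):
    """Train one latent feature at a time.

    users, items -- integer index arrays, one entry per rating
    values       -- the ratings
    baseline     -- baseline prediction for each rating
    stop_feature -- callable(list of per-epoch RMSE) -> bool; must
                    return False for an empty list
    stop_model   -- callable(list of per-feature final RMSE) -> bool
    reg, init    -- placeholder defaults, not taken from the text
    """
    users, items = np.asarray(users), np.asarray(items)
    values = np.asarray(values, dtype=float)
    n_users, n_items = users.max() + 1, items.max() + 1
    # Prediction from the baseline plus all finished features.
    cached = np.array(baseline, dtype=float)
    user_feats, item_feats, feature_errors = [], [], []
    while not stop_model(feature_errors):        # outer stopping condition
        p = np.full(n_users, init)
        q = np.full(n_items, init)
        epoch_errors = []
        while not stop_feature(epoch_errors):    # inner stopping condition
            sse = 0.0
            for u, i, r, c in zip(users, items, values, cached):
                err = r - (c + p[u] * q[i])
                sse += err * err
                pu, qi = p[u], q[i]
                p[u] += learning_rate * (err * qi - reg * pu)
                q[i] += learning_rate * (err * pu - reg * qi)
            epoch_errors.append(float(np.sqrt(sse / len(values))))
        cached += p[users] * q[items]
        user_feats.append(p)
        item_feats.append(q)
        feature_errors.append(epoch_errors[-1])
    return np.array(user_feats).T, np.array(item_feats).T, feature_errors
```

A fixed configuration corresponds to callbacks such as `lambda errs: len(errs) >= 100` for the inner loop and `lambda feats: len(feats) >= 25` for the outer loop; an automatic strategy would replace these with rules that inspect the error history.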
5.5.1 Training a Feature
Any method for determining when to stop training a feature can depend only on information available during the training process. The information available while training a feature includes:
• The number of epochs computed so far
• The training error for each epoch
• If the training algorithm reserves a set of ratings for tuning/validation, the error on these ratings after each epoch
• The average estimated gradient in an epoch (and its magnitude)
• Derivatives of any of these values
Directly thresholding training error is impractical, because the achievable error will differ between data sets, rating ranges, etc. Applying a threshold to the change in training error between two epochs (thresholding the derivative of training error) is a feasible solution, however: if the change is small, especially over multiple iterations, then the feature values have likely converged. Similarly, the change in validation RMSE can be thresholded. Thresholding the magnitude (𝐿₂ norm) of the average estimated gradient (change in user and item feature weight vectors in an epoch) is also practical, with a low magnitude indicating convergence. We have not yet tested any second derivatives of these quantities.
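A minimal sketch of two such rules follows. The function names, the default threshold, and the `patience` parameter (how many consecutive small changes are required) are illustrative assumptions rather than values or names from our experiments.

```python
import math

def converged(errors, threshold=1e-4, patience=3):
    """True if the last `patience` epoch-to-epoch changes in error
    are all smaller than `threshold` in absolute value.

    errors -- per-epoch training (or validation) error, oldest first
    """
    if len(errors) < patience + 1:
        return False
    recent = errors[-(patience + 1):]
    return all(abs(b - a) < threshold for a, b in zip(recent, recent[1:]))

def update_magnitude(old, new):
    """L2 norm of the change in a feature vector over one epoch."""
    return math.sqrt(sum((n - o) ** 2 for o, n in zip(old, new)))
```

Requiring several consecutive small changes guards against stopping on a single flat epoch; `update_magnitude` can be compared against its own threshold in the same way.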
The learning rate is also key in the process of training a feature. So far, we have only tested fixed learning rates. Dynamic learning rate schedules may improve performance generally, in either training time or output quality, and may also make thresholding approaches more useful.
5.5.2 Training New Features
Typically, the number of features is fixed in advance, and the initial value (rather than 0) is assumed for the user/item values for features not yet trained; this has the unfortunate side effect of making the training for each feature dependent on the number of features not yet trained. Nonetheless, we have tested approaches that relax this, training each feature independent of the number of remaining untrained features and attempting to automatically detect whether to continue.
The data available to decide whether to train another feature include:
• The number of features
• The training error of the last pass for each feature
• The error on a tuning/validation set of ratings after each feature
• The weight of the feature (product of the 𝐿₂ norms of its user and item vectors; this is the singular value in a true SVD)
• Derivatives of any of these values
These are subject to similar considerations as the training stopping criteria. Thresholding the difference in feature weights is similar to using scree plots to pick the number of latent factors in factor analysis.
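The feature-weight criterion can be sketched as follows. The function names and the `min_drop` threshold are hypothetical; the rule shown (stop once consecutive weights differ by less than a threshold) is one simple reading of the scree-plot analogy, not a tested recommendation.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a sequence of numbers."""
    return math.sqrt(sum(x * x for x in vec))

def feature_weight(user_vec, item_vec):
    """Weight of one feature: the product of the L2 norms of its
    user and item vectors."""
    return l2_norm(user_vec) * l2_norm(item_vec)

def keep_training(weights, min_drop):
    """True while the newest feature's weight still differs from the
    previous feature's weight by at least `min_drop`.

    weights -- weights of the features trained so far, in order
    """
    if len(weights) < 2:
        return True
    return abs(weights[-2] - weights[-1]) >= min_drop
```

For example, with weights 10.0, 6.0, and 5.5 and `min_drop` set to 1.0, training would continue after the second feature and stop after the third.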
5.5.3 Tuning Results
Unfortunately, none of these strategies can reliably match or beat well-selected parameter values on ML-1M: 25–30 features for 125–50 epochs per feature. If they cannot reliably find known good values on a well-understood data set, we are hesitant to trust them for tuning on previously unseen data.
We have yet to find a good way to disentangle stopping training on either an individual feature or the entire model. FunkSVD accuracy seems to be fairly stable in the face of reasonable values; differing slightly from our to-beat values does not produce large differences in RMSE or nDCG. However, being unable to reliably match or beat the performance of these values using more automated techniques hurts our ability to develop a tuning strategy. A viable strategy will need to have more sophistication than the first-order approaches we have listed here.