Weighted LambdaFM Factorization Machines

4.4 Boosted Factorization Machines

4.4.2 Component Recommender

4.4.2.2 Weighted LambdaFM Factorization Machines

As mentioned in LambdaFM, the static sampler shows better performance than the uniform sampler. Hence, we also employ LambdaFM as a component recommender, an approach we refer to as Weighted Lambda Factorization Machines (WLFM).

Table 4.1: Basic statistics of datasets. Each tuple represents an observed context-item interaction. Note that tags on the MLHt dataset are regarded as recommended items (i.e., i in x_i), while a user-item (i.e., user-movie) pair is regarded as context (i.e., c in x_c).

Datasets   Users   Items    Tags   Artists   Albums   Tuples
MLHt       2113    5908     9079   -         -        47958
Lastfm     983     60000    -      25147     -        246853
Yahoo      2450    124346   -      9040      19851    911466

In Chapter 3, we proposed three lambda-based negative samplers. In BoostFM, we only investigate the static sampler, since it does not incur additional computational complexity. However, a follow-up work by Li et al. (2018), inspired by LambdaFM and BoostFM, has verified all three samplers and reported consistent performance. Here, we use the same static negative sampler as in LambdaFM, i.e., popular items are oversampled, approximately in proportion to the empirical popularity distribution. The sampling probability p_j of item j is given below:

$$p_j \propto \exp\left(-\frac{r(j)}{|I| \times \rho}\right), \quad \rho \in (0, 1] \tag{4.15}$$

where r(j) represents the rank of item j among all items I according to overall popularity, and ρ is a parameter that controls the sampling distribution of negative items. Therefore, Line 6 in Algorithm 5 can be replaced by the above sampler.
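The static sampler of Eq. (4.15) can be illustrated with a minimal Python sketch. The function names, the 0-based rank convention, the tie-breaking rule, and the rejection of already observed items are assumptions made for illustration; they are not taken from Algorithm 5.

```python
import math
import random


def static_sampler_weights(popularity, rho):
    """Return {item: p_j} with p_j proportional to exp(-r(j) / (|I| * rho)).

    r(j) is the popularity rank of item j (0 = most popular). Whether the
    original counts ranks from 0 or from 1 is an assumption of this sketch.
    """
    if not 0.0 < rho <= 1.0:
        raise ValueError("rho must lie in (0, 1]")
    # Rank items by decreasing popularity; ties are broken by item id.
    ranked = sorted(popularity, key=lambda j: (-popularity[j], j))
    n_items = len(ranked)
    raw = {j: math.exp(-rank / (n_items * rho)) for rank, j in enumerate(ranked)}
    total = sum(raw.values())
    return {j: w / total for j, w in raw.items()}


def sample_negative(weights, observed, rng=random):
    """Draw one item from the static distribution, rejecting observed items.

    Rejecting observed items is an assumption about how negatives are defined.
    """
    items = list(weights)
    if all(j in observed for j in items):
        raise ValueError("no unobserved item left to sample")
    probs = [weights[j] for j in items]
    while True:
        j = rng.choices(items, weights=probs, k=1)[0]
        if j not in observed:
            return j


# Hypothetical interaction counts per item, for illustration only.
popularity = {"a": 50, "b": 20, "c": 5, "d": 1}
weights = static_sampler_weights(popularity, rho=0.3)
negative = sample_negative(weights, observed={"a"}, rng=random.Random(0))
```

Because the weights depend only on the global popularity ranking, they can be computed once before training, which is consistent with the statement that the static sampler adds no computational complexity. A smaller ρ concentrates more probability mass on the most popular items.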

4.5 Experiments

In this section, we conduct experiments on three real-world datasets to verify the effectiveness of BoostFM in various settings.

4.5.1 Experimental Setup

4.5.1.1 Datasets

We use three publicly accessible recommendation datasets for our experiments: MovieLens Hetrec (MLHt, http://grouplens.org/datasets/hetrec-2011/), which contains user-movie-tag triples, where the context is a user-movie pair and the item is the tag; Lastfm, which contains user-music-artist triples, where the context is the user and the item is a music track with an artist; and Yahoo music, which contains user-music-artist-album tuples, where the context is the user and the item is a music track with an artist and an album. On the MLHt dataset, the task is to recommend the top-N relevant tags for each user-movie pair, while on the Lastfm and Yahoo datasets, it is to recommend the top-N preferred music tracks (with item side information) to each user. To speed up the experiments, we follow the common practice of Christakopoulou and Banerjee (2015) and randomly sample a subset of users from the user pool of the Yahoo dataset and a subset of items from the item pool of the Lastfm dataset. The MLHt dataset is kept in its original form. The statistics of the datasets after preprocessing are summarized in Table 4.1.

4.5.1.2 Evaluation Metrics

To evaluate the performance of BoostFM, we report results using two widely used ranking metrics, namely, Precision@N and Recall@N (denoted by Pre@N and Rec@N respectively), where N is the number of recommended items (again, tags are considered as items on the MLHt dataset). Please note that the results on other ranking metrics, such as NDCG and MRR, are highly consistent. The definitions of Pre@N and Rec@N have been given in Chapter 3.
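For reference, the two metrics can be sketched in Python using their standard definitions, which are assumed here to match those given in Chapter 3. The function name and the toy data are hypothetical.

```python
def precision_recall_at_n(ranked_items, relevant_items, n):
    """Compute (Pre@N, Rec@N) for a single context.

    ranked_items: recommended items, best first, without duplicates.
    relevant_items: set of held-out items the context actually interacted with.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    top_n = ranked_items[:n]
    hits = sum(1 for item in top_n if item in relevant_items)
    precision = hits / n
    # Recall is defined as 0 when there is nothing relevant to retrieve.
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Toy example: five ranked tags, three held-out relevant tags.
ranked = ["t1", "t2", "t3", "t4", "t5"]
relevant = {"t2", "t5", "t9"}
pre_at_3, rec_at_3 = precision_recall_at_n(ranked, relevant, n=3)
```

In practice the per-context values would be averaged over all test contexts (user-movie pairs on MLHt, users on Lastfm and Yahoo).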

4.5.1.3 Baseline Methods

In our experiments, we compare our algorithm with several strong baseline methods, namely, Most Popular (MP), User-based Collaborative Filtering (UCF), Bayesian Personalized Ranking (BPR) and Factorization Machines (FM). Specifically, for tag recommendation on the MLHt dataset, we use MP, FM and PITF as baselines. For music recommendation based on additional content information, we use MP, UCF, FM, BPR and PRFM as baselines. For clarity, we refer to BoostFM with WPFM and WLFM as B.WPFM and B.WLFM respectively, and to PRFM with the CE loss and the Hinge loss as PRFM.CE and PRFM.H respectively. The descriptions of the CE and Hinge losses, and of the baselines MP, BPR and FM, have been given in Chapter 3.

Note: the Yahoo music dataset is available at http://webscope.sandbox.yahoo.com/catalog.php?datatype=r&did=2


• User-based Collaborative Filtering (UCF) (Gao et al., 2013): A typical memory-based CF algorithm applicable to both rating prediction and item ranking tasks. In this work, Pearson correlation is used to compute user similarity, and the top-20 most similar users are selected as the nearest neighbors.

• Pairwise Interaction Tensor Factorization (PITF) (Rendle and Schmidt-Thieme, 2010): PITF is a state-of-the-art tensor factorization model optimized by the BPR loss. It won the tag recommendation task of the ECML PKDD Discovery Challenge.

4.5.1.4 Hyper-parameter Settings

Several critical hyper-parameters need to be set for BoostFM.

• The number of component recommenders T: For the purpose of comparison, T of BoostFM is set to 10 on all three datasets unless stated otherwise. The contribution of T is discussed later in Section 4.5.2.2.

• Learning rate η and regularization γθ: We first employ 5-fold cross validation to find the best η by running BoostFM with η ∈ {0.005, 0.01, 0.02, 0.05, 0.08, 0.1, 0.2, 0.4}, and then tune γθ in the same way with η fixed. Specifically, η is set to 0.08 on the Lastfm and Yahoo datasets and to 0.4 on the MLHt dataset; γθ is set to 0.05, 0.02 and 0.005 on the Lastfm, Yahoo and MLHt datasets respectively. In our experiments, we find that all FM-based models perform well enough with only the polynomial term (refer to Eq. (2.7)), and thus we omit the configuration of the linear term. Baseline algorithms are tuned in the same way.

• Latent dimension k: As in Chapter 3, for comparison purposes, we assign a fixed value of k (k = 30 in our experiments) to all methods based on factorization models. Results for k = 10, 50, 100 show similar behaviors.

• Distribution coefficient ρ: ρ ∈ (0,1] is tuned according to the data distribution. Details are given in a later section.
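The sequential tuning of η and γθ described in the list above (η first, then γθ with η fixed) can be sketched as follows. The `evaluate` callback, the function name, the candidate grid for γθ, and the initial value of γθ are hypothetical; only the η grid is taken from the text.

```python
# Candidate learning rates listed in the text.
ETA_GRID = (0.005, 0.01, 0.02, 0.05, 0.08, 0.1, 0.2, 0.4)

# Hypothetical grid for the regularization term; the text does not list one.
GAMMA_GRID = (0.005, 0.02, 0.05)


def sequential_search(evaluate, eta_grid, gamma_grid, initial_gamma):
    """Tune eta first, then gamma with eta fixed.

    evaluate(eta, gamma) is assumed to return a cross-validated score
    (higher is better), e.g. the average Pre@10 over 5 folds.
    """
    best_eta = max(eta_grid, key=lambda eta: evaluate(eta, initial_gamma))
    best_gamma = max(gamma_grid, key=lambda gamma: evaluate(best_eta, gamma))
    return best_eta, best_gamma


def toy_score(eta, gamma):
    """Stand-in for a real cross-validation run, peaking at (0.08, 0.05)."""
    return -((eta - 0.08) ** 2) - (gamma - 0.05) ** 2


best_eta, best_gamma = sequential_search(toy_score, ETA_GRID, GAMMA_GRID, 0.01)
```

Tuning one parameter at a time needs |η grid| + |γθ grid| evaluations rather than their product, at the cost of possibly missing interactions between the two parameters.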

4.5.2 Performance Evaluation

All experiments are conducted with the standard 5-fold cross validation. The average results over 5 folds are reported as the final performance.
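The evaluation protocol can be sketched in Python as below. How the folds are formed is not stated here, so splitting the observed tuples uniformly at random is an assumption, as are the function names and the `train_and_score` callback.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, train_and_score, k=5, seed=0):
    """Average a score over k folds.

    train_and_score(train_idx, test_idx) is a hypothetical callback that
    trains a model on the training tuples and returns a metric, such as
    Pre@10, measured on the held-out tuples.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```

Each tuple is held out exactly once, and the reported number is the unweighted mean of the five per-fold scores.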

4.5.2.1 Accuracy Summary

Figure 4.1(a-f) shows the prediction quality of all algorithms on the three datasets. As in Chapter 3, several similar observations can be made. In this chapter, we mainly focus on investigating the performance impact of the boosting strategy.

BoostFM vs. PRFM and PITF: In Figure 4.1, we observe that BoostFM (i.e., B.WPFM and B.WLFM) consistently outperforms the state-of-the-art methods PITF and PRFM. For example, on the MLHt dataset, B.WPFM outperforms PITF by 6.1% and 5.4% in terms of Pre@10 and Rec@10 respectively. In particular, the improvements of B.WPFM over PRFM.CE and PRFM.H exceed 18% on Pre@10 and 35% on Rec@10 on both the Lastfm and Yahoo datasets. These results show that the accuracy of top-N recommendation can be substantially improved by the boosting technique. Note that B.WPFM reduces to PITF and PRFM when the number of component recommenders is T = 1.

B.WPFM vs. B.WLFM: Compared with B.WPFM, B.WLFM achieves much better results on all datasets in Figure 4.1. The difference is that the component recommender WLFM is trained with the more advanced static sampler, while WPFM is trained with a uniform sampler. The impact of different negative samplers has been thoroughly studied in Chapter 3, and the results here are consistent with the previous findings for LambdaFM.

In document Learning implicit recommenders from massive unobserved feedback (Page 93-97)