4.2 Learning to Rank Based on Feature Vector Model (FVM)
Then, the loss function is the cross entropy between the target probability \bar{P}_{d_i,d_j} and the modelled probability P_{d_i,d_j}, which is calculated by:
Loss_{d_i,d_j} = -\bar{P}_{d_i,d_j} \log(P_{d_i,d_j}) - (1 - \bar{P}_{d_i,d_j}) \log(1 - P_{d_i,d_j})    (4.2.8)
The backpropagation neural network uses this loss function to learn the ranking model. Burges et al. did not make the dataset used in their experiments available to other researchers. In addition, their comparison was based on only one evaluation metric, NDCG, together with the computational runtime. The comparison was made against two other gradient descent methods, PRank and RankProp. In this comparison, the one-layer and two-layer versions of RankNET outperformed PRank and RankProp in terms of NDCG values, but they were slower than linear PRank and RankProp. There is no extensive comparison of this technique with recent LTR techniques in the literature. Like other pairwise techniques, RankNET checks the quality of the ranking model on each pair of query-document pairs separately in each learning iteration, rather than on the whole retrieved list. Thus, this technique has a drawback in terms of evaluation metric values on datasets that have multiple relevance labels rather than binary relevance labels. Further details about this approach can be found in (Burges et al., 2005; Burges, 2010). The method has been implemented in the RankLib package (Dang, 2016).
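The pairwise cross-entropy loss of Equation (4.2.8) can be illustrated with a minimal Python sketch. It assumes that the modelled probability is the logistic function of the score difference between the two documents, as in Burges et al. (2005); the function name `ranknet_pair_loss` and the example scores are hypothetical and chosen only for illustration.

```python
import math

def ranknet_pair_loss(s_i, s_j, target_p):
    """Cross entropy between the target and the modelled pairwise probability."""
    # Assumption: modelled probability that d_i ranks above d_j is the
    # logistic function of the score difference s_i - s_j.
    p = 1.0 / (1.0 + math.exp(-(s_i - s_j)))
    eps = 1e-12  # guard against log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -target_p * math.log(p) - (1.0 - target_p) * math.log(1.0 - p)

# Target says d_i should rank above d_j (target probability 1).
loss_agree = ranknet_pair_loss(2.0, 0.5, 1.0)     # model agrees with the target
loss_disagree = ranknet_pair_loss(0.5, 2.0, 1.0)  # model disagrees with the target
loss_tie = ranknet_pair_loss(1.0, 1.0, 0.5)       # equal scores, equal relevance
```

The loss is small when the score difference agrees with the target ordering and grows when it contradicts it, which is the signal that backpropagation uses to adjust the network weights.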
4.2.4 ListNET: Listwise Learning to Rank Based on Neural Nets
ListNET is a listwise, probabilistic technique for LTR proposed by Cao et al. (Cao et al., 2007). It differs from RankNET in the way it calculates the loss function: the loss function in ListNET is a listwise loss function, based on the probability distribution over the ranking list of the query-document pairs. Suppose that each query i has the instance (X_i, y_i), where X_i is the feature vector of the query-document pair and y_i is the ground truth label. Then, the training data that contains N queries is given as S = {(X_i, y_i)}_{i=1}^{N}. The ListNET technique is used to create a ranking model which has a weight vector W = (w_1, ..., w_M), where M is the number of features in the training data. The ranking model function can be represented by F(X, W). ListNET takes the KL divergence between the ranking probability distribution given by the ground truth labels and the one given by the model, accumulated over all training query-document pairs, as the total loss function value. Then, it attempts to minimise the total loss by updating the weights of the ranking model. The total loss function is given by:
L(W) = \sum_{i=1}^{N} L(y_i, F(X_i, W))    (4.2.9)
Here L(y_i, F(X_i, W)) is the cross-entropy loss function for each query. This loss function for the top-K query-document pairs of query i is given by:
L(y_i, F(X_i, W)) = -\sum_{y_{i,j} \in GroundTruth_K} \prod_{j=1}^{K} \frac{\exp(y_{i,j})}{\sum_{L=j}^{n_i} \exp(y_{i,L})} \cdot \log \prod_{j=1}^{K} \frac{\exp(F(X_{i,j}, W))}{\sum_{L=j}^{n_i} \exp(F(X_{i,L}, W))}    (4.2.10)
where n_i is the number of query-document pairs for query i. ListNET updates the ranking model weight vector in each learning iteration for better accuracy by W = W − η∇L(W), where η is a learning rate parameter that can be chosen at training time and ∇L(W) is the gradient of the total loss. Cao et al. (Cao et al., 2007) compared their technique with RankBoost, RankNET and RankSVM using the NDCG and MAP evaluation metrics. They argued that ListNET outperformed RankBoost, RankNET and RankSVM on the LETOR2 (Ohsumed, TD2003 and TD2004) benchmarks (Qin et al., 2010). However, they did not mention the parameter settings or the training runtime of each technique. This method has been implemented in the RankLib package (Dang, 2016).
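The listwise loss and the gradient-descent update described above can be sketched in Python for the simplest case. The sketch assumes the top-1 special case (K = 1) of the loss, in which the probability of a document being ranked first is a softmax over the list, and it assumes a linear ranking function F(X, W); Cao et al. use a neural network instead. The helper names, the toy data and the learning rate are hypothetical.

```python
import math

def softmax(values):
    m = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def linear_scores(W, X):
    # Assumption: F(X, W) is a linear model, one score per query-document pair.
    return [sum(w * x for w, x in zip(W, row)) for row in X]

def listnet_top1_loss(labels, scores):
    # Cross entropy between the top-1 distributions of labels and scores.
    p_true, p_model = softmax(labels), softmax(scores)
    return -sum(pt * math.log(pm) for pt, pm in zip(p_true, p_model))

def listnet_step(W, X, labels, eta):
    """One update W = W - eta * grad L(W) for a single query."""
    p_true, p_model = softmax(labels), softmax(linear_scores(W, X))
    # For a linear scorer: dL/dw_k = sum_j (p_model_j - p_true_j) * X[j][k].
    grad = [sum((pm - pt) * row[k] for pm, pt, row in zip(p_model, p_true, X))
            for k in range(len(W))]
    return [w - eta * g for w, g in zip(W, grad)]

# Toy query with three documents and two features (hypothetical values).
X = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
labels = [2.0, 0.0, 1.0]
W = [0.0, 0.0]
loss_before = listnet_top1_loss(labels, linear_scores(W, X))
for _ in range(50):
    W = listnet_step(W, X, labels, eta=0.5)
loss_after = listnet_top1_loss(labels, linear_scores(W, X))
```

In this toy run the loss decreases and the weight of the first feature, which is largest for the most relevant document, ends up above the weight of the second feature.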
4.2.5 AdaRank: A Boosting Algorithm for Information Retrieval
The AdaRank technique is a listwise approach based on the Adaptive Boosting (AdaBoost) technique used in text classification (Xu and Li, 2007). The main difference between RankBoost and AdaRank is their loss functions. In the RankBoost technique the loss function is an exponential pairwise loss function, while the loss function in AdaRank is an exponential listwise loss function. Similar to RankBoost, AdaRank combines the linear weak rankers h_t(x) to produce an effective ranking model H(x) = Σ_t α_t h_t(x). The symbol x represents the set of training query-document pairs, while t indexes the weak rankers h_t(x) used to produce H(x).
The parameter α_t represents the importance weight of the weak ranker h_t(x) in H(x). In the learning procedure, AdaRank repeatedly re-weights the training samples to create each weak ranker. Then, it calculates the weight (the importance) of that weak ranker in the ranking model. Furthermore, the AdaRank technique optimises an exponential loss function based on IR evaluation metrics such as MAP, NDCG, Error Rate, Reciprocal Rank and Precision. The exponential loss function is an upper bound of the normal loss function based on the evaluation metrics. In each learning iteration, AdaRank maintains a weight distribution over the training set. This distribution is used to identify the importance of each weak ranker in the ranking model. Xu and Li (Xu and Li, 2007) compared their approach with the Okapi-BM25, SVMRank and RankBoost approaches on four benchmarks: the Ohsumed, WSJ, AP and .GOV datasets. AdaRank outperformed these approaches on these datasets in terms of the MAP and NDCG evaluation metrics. However, AdaRank has not been tested using more evaluation metrics or on state-of-the-art large LETOR datasets. They also did not mention the parameter settings or the training runtime of each technique. Furthermore, the implementation of AdaRank and other LTR approaches does not consider all of the training instances (query-document pairs) in each learning iteration to check the quality of the proposed solution, which causes a drawback in the evaluation values (accuracy) of AdaRank. More details about this approach can be found in (Xu and Li, 2007) and the technique has been implemented in the RankLib package (Dang, 2016).
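The boosting loop described above can be sketched in a few lines of Python. The sketch follows the published AdaRank update rules of Xu and Li (2007) under simplifying assumptions: each weak ranker is a single feature, the evaluation metric is Average Precision with binary labels, and the function names and toy data are hypothetical. It is not the RankLib implementation.

```python
import math

def average_precision(scores, labels):
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    hits, total = 0, 0.0
    for rank, j in enumerate(order, start=1):
        if labels[j] > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def evaluate(weights, queries):
    # Metric value of the combined ranker H(x) on every query.
    return [average_precision([sum(w * v for w, v in zip(weights, row)) for row in X], y)
            for X, y in queries]

def adarank(queries, n_features, rounds):
    m = len(queries)
    P = [1.0 / m] * m                 # weight distribution over training queries
    weights = [0.0] * n_features      # accumulated alpha_t per feature
    # Assumption: weak ranker k simply ranks by feature k.
    perf = [[average_precision([row[k] for row in X], y) for X, y in queries]
            for k in range(n_features)]
    for _ in range(rounds):
        # Pick the weak ranker with the best weighted performance.
        best = max(range(n_features),
                   key=lambda k: sum(p * e for p, e in zip(P, perf[k])))
        E = perf[best]
        num = sum(p * (1.0 + e) for p, e in zip(P, E))
        den = sum(p * (1.0 - e) for p, e in zip(P, E))
        alpha = 0.5 * math.log(num / max(den, 1e-12))  # importance of the weak ranker
        weights[best] += alpha
        # Re-weight queries: poorly ranked queries gain weight.
        exps = [math.exp(-e) for e in evaluate(weights, queries)]
        Z = sum(exps)
        P = [e / Z for e in exps]
    return weights

# Two toy queries, three documents each, two features (hypothetical values).
queries = [
    ([[0.9, 0.1], [0.2, 0.6], [0.1, 0.3]], [1, 0, 0]),
    ([[0.8, 0.2], [0.3, 0.9], [0.1, 0.1]], [0, 1, 0]),
]
weights = adarank(queries, n_features=2, rounds=2)
```

In this toy run the first round selects the first feature, the re-weighting then emphasises the query that this feature ranks poorly, and the second round selects the second feature, so the combined ranker orders both queries correctly.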
4.2.6 RankGP: Learning to Rank Using Genetic Programming
Yeh et al. proposed a new research trend for evolving ranking functions using Genetic Programming, called RankGP (Yeh et al., 2007; Mick, 2016). This approach is a listwise LTR approach. They used the LETOR2 (TD2003 and TD2004) benchmarks (Qin et al., 2010). Their approach outperformed the traditional, probabilistic and machine learning ranking functions (BM25, RankBoost and SVMRank) in terms of IR system effectiveness. The system effectiveness was measured by three IR evaluation measures: Precision at the top 10 retrieved query-document pairs, Mean Average Precision (MAP) (Baeza-Yates and Ribeiro-Neto, 2011) and Normalised Discounted Cumulative Gain (NDCG) (Jarvelin and Kekalainen, October 2002).
However, the computational cost of their approach was high in comparison with other approaches. It took approximately 35 hours to learn a better evolved ranking function for the TD2003 benchmark. The equipment used for these experiments was a PC with a 1.8 GHz Intel Core 2 CPU and 2 GB of memory. The main characteristics of this approach were as follows:
1. Before applying the GP approach, all the features in the training and validation subsets were normalised into values between 0 and 1.
2. This approach used Layered Genetic Programming (Lin et al., 2007a, July, 2012; Mick, 2016) with ramped half-and-half for creating the initial population for the proposed function with a maximum depth of 8 terminals and operators.
3. The function set contained {+, -, *}; division was omitted in order to evolve linear solutions with less computational cost. The terminal set contained all benchmark features (44 features) and 44 constant values between 0 and 1. In addition, the fitness function was the Mean Average Precision (MAP) over all queries.
4. The crossover rate, mutation rate, and the number of generations and reproductions were set according to (Lin et al., 2007a). Furthermore, the mutation rate was set by adaptive mutation rate tuning (AMRT) (Lin et al., 2007a, July, 2012).
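Two of the characteristics listed above, the normalisation of features into [0, 1] and the use of MAP as the fitness function, can be sketched in Python. The sketch assumes column-wise min-max normalisation, which the source does not specify, and it represents one evolved individual as a plain Python function built from the {+, -, *} function set. All names and data values are hypothetical, and the genetic operators themselves are not shown.

```python
def min_max_normalise(rows):
    # Assumption: each feature column is scaled independently to [0, 1].
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(row, lo, hi)] for row in rows]

def average_precision(scores, labels):
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    hits, total = 0, 0.0
    for rank, j in enumerate(order, start=1):
        if labels[j] > 0:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def map_fitness(candidate, queries):
    """Fitness of one individual: MAP of its ranking over all queries."""
    aps = [average_precision([candidate(row) for row in X], y) for X, y in queries]
    return sum(aps) / len(aps)

# Hypothetical individual using only +, -, * and one constant terminal.
def candidate(f):
    return f[0] * f[1] + 0.3 - f[2]

raw = [[10.0, 200.0, 5.0], [20.0, 100.0, 5.0], [30.0, 300.0, 5.0]]
normalised = min_max_normalise(raw)
fitness = map_fitness(candidate, [(normalised, [0, 0, 1])])
```

In a full GP run this fitness would be evaluated for every individual in the population in every generation, which is consistent with the high computational cost reported for the approach.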
The limitations of this approach and all learning to rank approaches using EC referenced in the literature are as follows:
• The computational runtime is higher than for other machine learning applications, as mentioned by Yeh et al. (Yeh et al., 2007). In addition, this technique requires a large problem size to represent a population of the proposed solutions in each evolving iteration compared to (1+1)-Evolutionary Algorithms (ES-Rank technique).
• The state-of-the-art machine learning techniques outperformed this approach in terms of the NDCG and MAP metrics, according to the results reported in the literature and documented in (Tax et al., 2015). However, there is no practical comparison between the state-of-the-art LTR techniques and RankGP on the same datasets that considers both the computational runtimes and the accuracy values.
This technique has been implemented in the LAGEP Package (Mick, 2016).
4.2.7 LambdaRank
Burges et al. proposed the LambdaRank technique, which is based on the RankNET technique (Burges et al., 2006). LambdaRank is a pairwise technique that minimises a surrogate loss function equal to L(W) = −λ, where λ is based on the Normalised Discounted Cumulative Gain (NDCG) of the training query-document pairs in each learning iteration. The λ parameter is equal to \sum_{j=1}^{K} \left( \frac{1}{1 + \exp(s_i - s_j)} \cdot (NDCG_i - NDCG_j) \right), where K is the number of query-document pairs in the retrieved truncated ranking list and s_i and s_j are the ranker scores for documents i and j. Suppose the gradient of the loss function is ∇L(W); then LambdaRank updates the ranking model weight vector in each learning iteration for better accuracy through W = W − η∇L(W), where η is a learning rate parameter that can be chosen at training time. Burges et al. experimented with one-layer and two-layer (with 10 hidden nodes) nets, which were run on a 2.2 GHz 32-bit Opteron machine. Burges et al. only compared this technique with the RankNET technique, and LambdaRank outperformed it in terms of NDCG and of runtime across the speedup procedures. However, the detailed characteristics of the dataset and its source (the feature types and the name of the search engine) were not stated in their paper. In addition, the total runtime of the technique on the dataset was not reported.
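The λ computation can be illustrated with a short Python sketch. It interprets the NDCG term as the absolute change in NDCG obtained by swapping documents i and j in the current ranking, which is the usual reading of LambdaRank and an assumption here, and it only considers pairs in which document i is more relevant than document j. The gain 2^label − 1, the log2 discount, the function names and the example values are likewise assumptions for illustration.

```python
import math

def dcg(labels_in_rank_order):
    return sum((2 ** l - 1) / math.log2(r + 1)
               for r, l in enumerate(labels_in_rank_order, start=1))

def lambdas(scores, labels):
    """Per-document lambda; a positive value means the score should increase."""
    n = len(scores)
    order = sorted(range(n), key=lambda d: -scores[d])
    rank = {d: r for r, d in enumerate(order, start=1)}  # 1-based positions
    ideal = dcg(sorted(labels, reverse=True))
    lam = [0.0] * n
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only pairs where i is more relevant than j
            # Assumption: |delta NDCG| from swapping the positions of i and j.
            gain_diff = 2 ** labels[i] - 2 ** labels[j]
            disc_diff = 1 / math.log2(rank[i] + 1) - 1 / math.log2(rank[j] + 1)
            delta_ndcg = abs(gain_diff * disc_diff) / ideal
            rho = 1.0 / (1.0 + math.exp(scores[i] - scores[j]))
            lam[i] += rho * delta_ndcg  # push the more relevant document up
            lam[j] -= rho * delta_ndcg  # push the less relevant document down
    return lam

# The gradient of the surrogate loss with respect to score s_i is -lam[i].
lam = lambdas(scores=[0.2, 1.0, 0.5], labels=[2, 0, 1])
```

In this example the most relevant document currently has the lowest score, so it receives a positive λ, while the least relevant document, which is ranked first, receives a negative one.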