We use three common performance metrics to rank system configurations: MAP, P@100, and Rprec. These metrics are chosen because they were found to be the least correlated. Three types of L2R techniques have been proposed in the literature, based on point-wise, pair-wise, and list-wise principles. The point-wise approaches aim at learning to predict a relevance score or class for each document, while the pair-wise approaches learn to predict whether one document is more relevant than another. Finally, the list-wise models consider the whole list of documents and optimise a ranking measure. All the L2R models could be suitable for our task: they can rank system configurations in such a way that the best configuration is ranked first. This is the configuration that we want to select. Notice, however, that the relative positions of the elements at lower ranks also matter in L2R models, in particular in pair-wise and list-wise models. The optimisation related to this part of the ranking may not be crucial or necessary for our task, and our learning objective could be different. However, we do not examine this question in this paper.
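The point-wise and pair-wise principles above can be illustrated with minimal loss sketches (the scores and configuration names are hypothetical, not any specific published system): a point-wise loss regresses each score onto its relevance label, while a pair-wise logistic loss only penalises mis-ordered pairs.

```python
import math

def pointwise_loss(score, relevance):
    # Point-wise: regress each candidate's score onto its relevance label.
    return (score - relevance) ** 2

def pairwise_loss(score_i, score_j):
    # Pair-wise (logistic): penalise scoring j above i when i is known
    # to be more relevant than j; small when score_i >> score_j.
    return math.log(1.0 + math.exp(-(score_i - score_j)))

# Hypothetical configurations: the most relevant one should score highest.
scores = {"config_a": 2.1, "config_b": 0.4}
print(pointwise_loss(scores["config_a"], 1.0))
print(pairwise_loss(scores["config_a"], scores["config_b"]))
```

A list-wise loss would instead score the whole ordering at once, e.g. via a ranking measure or a likelihood over permutations.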
The concepts of an e-learning system have been outlined. E-learning is considered in the context of formally and systematically organized teaching and learning activities, in which the instructor and the learner(s) use ICT to facilitate their interaction and collaboration. The use of a data-mining-based e-learning system will definitely impact the quality of the education that is delivered and the deliverability of information through knowledge and information sharing. The newly designed E-learning System using Rank-Based Clustering Algorithm (EUSRBCA) shows an improvement over the existing systems, with better results. From the various evaluations carried out, the performance of the system was found to be good compared to other systems in the e-learning domain.
We develop models for online learning of ranking systems, from explicit but highly restricted feedback. At a high level, we consider a ranking system which interacts with users over a time horizon, in a sequential manner. At each round, the system presents a ranked list of m items to the user, with the quality of the ranked list judged by the relevance of the items to the user. The relevance of the items, reflecting varying user preferences, is encoded as relevance vectors. The system's objective is to learn from the feedback it receives and update its ranker over time, to satisfy as many users as possible. However, the feedback that the system receives at the end of each round is not the full relevance vector, but the relevance of only the top k ranked items, where k << m (typically, k = 1 or 2). We consider two problem settings under the general framework: non-contextual and contextual. In the first setting, we assume that the set of items to be ranked is fixed (i.e., there is no context on items), with the relevance vectors varying according to users' preferences. In the second setting, we assume that the set of items varies, as with traditional query-document lists. We highlight two motivating examples for such a feedback model, encompassing privacy concerns and economic and user-burden constraints.
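As a minimal simulation of this feedback protocol (the greedy ranker and Bernoulli users here are hypothetical illustrations, not the paper's algorithm), each round the system ranks m fixed items but observes only the relevance of the top k:

```python
import random

m, k, rounds = 5, 1, 100
est = [0.0] * m          # running sum of observed relevances per item
seen = [1] * m           # observation counts (start at 1 to avoid div by 0)
rng = random.Random(0)

for t in range(rounds):
    # This user's hidden preferences: a full relevance vector, never revealed.
    true_rel = [rng.choice([0, 1]) for _ in range(m)]
    # Rank items by current estimated relevance (greedy; a real algorithm
    # must also explore, since unranked-at-top items are never observed).
    ranking = sorted(range(m), key=lambda i: est[i] / seen[i], reverse=True)
    # Feedback: only the top-k ranked relevances come back.
    for i in ranking[:k]:
        est[i] += true_rel[i]
        seen[i] += 1
```

The point of the sketch is the information asymmetry: the loss over the whole list is never revealed, only k of the m relevance entries.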
For each sentence in each dataset, the annotators provided as many substitutions for the target word as they found appropriate in the context. Each substitution is then labeled with the number of annotators who listed that word as a good lexical substitution. Experimental setup and Evaluation. On both datasets, we conduct experiments using a 10-fold cross-validation process, and evaluate all learning algorithms on the same train/test splits. The datasets are randomly split into 10 equal-sized folds at the target word level, such that all examples for a particular target word fall into either the training or the test set, but never both. This way, we make sure to evaluate the models on target words not seen during training, thereby mimicking an open-vocabulary paraphrasing system: at test time, paraphrases are ranked for unseen target words, similarly to how the models would rank paraphrases for any words (not necessarily contained in the dataset). For algorithms with tunable parameters, we further divide the training sets into a training and a validation part to find the best parameter settings. For evaluation, we use Generalized Average Precision (GAP) (Kishida, 2005) and Precision at 1 (P@1), i.e., the percentage of correct paraphrases at rank 1.
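Under these definitions, the two metrics can be sketched as follows (assuming the standard formulation of GAP over gold substitution weights, with the ideal ranking obtained by sorting the weights in decreasing order):

```python
def gap(system_weights, gold_weights):
    """Generalized Average Precision (Kishida, 2005) for a ranked list:
    system_weights are the gold annotator counts of the candidates in the
    order the system ranked them; gold_weights are the same counts
    (in any order), from which the ideal ranking is derived."""
    def summed_running_avgs(ws):
        total, cum = 0.0, 0.0
        for i, w in enumerate(ws, start=1):
            cum += w
            if w > 0:                  # only positions holding a true substitute
                total += cum / i       # running average of weights up to rank i
        return total
    ideal = sorted(gold_weights, reverse=True)
    return summed_running_avgs(system_weights) / summed_running_avgs(ideal)

def precision_at_1(system_weights):
    # P@1: is the top-ranked paraphrase a correct one (weight > 0)?
    return 1.0 if system_weights[0] > 0 else 0.0
```

For a perfect ranking, system and ideal lists coincide and GAP is 1.0.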
Lee et al. developed a system to extract attributes from free text and rank them using a typicality score [TL13]. The authors employ the typicality score to measure how typical an attribute is of an instance or concept, on the basis of their co-occurrence in the dataset. IBMiner is another system that improves knowledge bases using text-mining techniques [MGZ13]. It introduced a tool for infobox template suggestion: it collects attributes from different sources and knowledge bases, and orders them by popularity based on their co-occurrences in the dataset. These approaches require expensive processing and indexing of the document set; they are usually very domain-specific and hard to scale to large document collections. A recent approach combines attributes generated from structured knowledge bases with those generated from documents retrieved from the general web [JDW17]. This idea is particularly interesting because it is scalable and combines the advantages of both data sources. Although the approach was introduced in the context of query facet mining, we plan to apply this idea in an entity attribute ranking context.
Non-contextual setting: Existing work loosely related to ranking a fixed set of items to satisfy diverse user preferences (Radlinski et al., 2008, 2009; Agrawal et al., 2009; Wen et al., 2014) has focused on learning an optimal ranking of a subset of items, to be presented to a user, with performance judged by a simple 0-1 loss. The loss in a round is 0 if, among the top k (out of m) items presented to a user, the user finds at least one relevant item. All of this work falls under the framework of online bandit learning. In contrast, our model focuses on optimal ranking of the entire list of items, where the performance of the system is judged by practical ranking measures like DCG and AP. The challenge is to determine when and how efficient learning is possible with such highly restricted feedback. Theoretically, the top k feedback model is neither full-feedback nor bandit-feedback, since not even the loss (quantified by some ranking measure) at each round is revealed to the learner. The appropriate framework in which to study the problem is that of partial monitoring (Cesa-Bianchi, 2006). A very recent paper shows another practical application of partial monitoring in the stochastic setting (Lin et al., 2014). Recent advances in the classification of partial monitoring games tell us that the minimax regret, in an adversarial setting, is governed by a property of the loss and feedback functions called observability (Bartok et al., 2014; Foster and Rakhlin, 2012). Observability is of two kinds: local and global. We instantiate these general observability notions for our problem with top-1 (k = 1) feedback. We prove that, for some ranking measures, namely PairwiseLoss (Duchi et al., 2010), DCG, and Precision@n (Liu et al., 2007a), global observability holds. This immediately shows that the upper bound on regret scales as O(T^{2/3}).
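The ranking measures mentioned above can be sketched as follows (one common convention for DCG's gain and discount is assumed; binary grades are assumed for Precision@n):

```python
import math

def dcg(relevances, k=None):
    # Discounted Cumulative Gain of a ranked list of relevance grades:
    # gain 2^r - 1 at rank i, discounted by log2(i + 1) (ranks are 1-based).
    rels = relevances if k is None else relevances[:k]
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

def precision_at_n(relevances, n):
    # Fraction of the top-n ranked items that are relevant.
    return sum(1 for r in relevances[:n] if r > 0) / n
```

Under top-1 feedback, the learner sees only the first entry of the relevance list, so neither quantity is directly computable at the end of a round, which is what places the problem in the partial-monitoring regime.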
Specifically, for PairwiseLoss and DCG, we further prove that local observability fails when restricted to the top-1 feedback case, showing that their minimax regret scales as Θ(T^{2/3}). However, the generic algorithm
Understanding why the PL loss fails on some datasets is important for designing more effective algorithms, so we conduct experiments to analyse these datasets and identify one principle as a condition for the PL loss: compared to the average number of documents per query, the number of features should be large enough. Therefore, in order to obtain better performance, we have to use more features with the PL loss. There are several ways to enrich the features of a dataset: kernel mapping, neural-network mapping, and gradient boosting. We select gradient boosting with decision trees as weak rankers in this work, for convenient comparison with LambdaMART, and leave the others for future work. A merit of the PL loss is its concise formula for computing functional gradients, Eqn. (2.11), which results in our ranking system, called PLRank.
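For concreteness, a sketch of the Plackett-Luce negative log-likelihood that underlies a PL-style listwise loss (this is the standard PL formulation over score-parameterised softmax probabilities; it is not claimed to be identical to Eqn. (2.11)):

```python
import math

def pl_loss(scores):
    """Negative log-likelihood of the Plackett-Luce model for the order in
    which the documents are listed (assumed to be the ideal order, most
    relevant first). `scores` are the ranker's raw scores in that order."""
    loss = 0.0
    for i in range(len(scores)):
        # log-probability that item i is picked next from the remaining
        # suffix, under a softmax over the remaining scores.
        log_denom = math.log(sum(math.exp(s) for s in scores[i:]))
        loss -= scores[i] - log_denom
    return loss
```

The loss is smooth in the scores, which is what makes functional gradients (and hence boosting with tree weak rankers) straightforward to compute.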
Whether as living creatures or in the software world, bugs are not very popular; they are, however, an inevitable part of their environments. Software development teams spend a lot of time fixing bugs on a regular basis. Currently, the process of debugging and correcting defects in software is a tedious, time-consuming, and expensive task. Many real-world software projects receive a large number of bug reports every day, and addressing them requires much time and effort. Maintenance of software is a resource-consuming activity; even with the increasing automation of software development activities, resources are still scarce. It has been reported that most of the cost of software is devoted to the evolution and maintenance of existing source code. Investigating bugs can account for a large portion of the aggregate cost of a software project. Therefore, there is a pressing need for automated strategies that make development teams' work easier. This issue has motivated extensive work proposing automated troubleshooting solutions for different cases. In summary, the debugging process is as follows: a developer is provided with a bug report and must then replicate the defect and perform several reviews of the source code to find the root cause. In addition, bug reports can vary widely, since they are provided by many people who interact with the software and each of them provides different inputs in their reports; this makes the process of debugging even more difficult.
is an indicator function. Note that the scores do not depend on the expert k and thus represent the consensus preference expressed by the experts. In logistic form the Bradley-Terry model is very similar to another popular pairwise model, the Thurstone model . Extensions of these models in- clude the Elo Chess rating system , adopted by the World Chess Federation FIDE in 1970, and Microsoft’s TrueSkill rating system  for player matching in online games, used extensively in Halo and other games. The popular learning- to-rank model RankNet  is also based on this approach. The Bradley-Terry model was later generalized by Plackett and Luce to a Plackett-Luce model for permutations [23, 27]. A Bayesian framework was also recently introduced for the Plackett-Luce model by placing a Gamma prior on the selection probabilities .
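The logistic form of Bradley-Terry, and its Plackett-Luce generalization to full permutations, can be sketched as:

```python
import math

def bt_prob(s_i, s_j):
    # Bradley-Terry in logistic form: probability that item i beats item j,
    # given consensus scores s_i and s_j (a sigmoid of the score difference).
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

def pl_prob(weights_in_order):
    # Plackett-Luce: probability of observing this full ranking, built by
    # repeatedly selecting the next item with probability proportional to
    # its positive weight among the items still unranked.
    prob, remaining = 1.0, sum(weights_in_order)
    for w in weights_in_order:
        prob *= w / remaining
        remaining -= w
    return prob
```

With only two items, the Plackett-Luce probability reduces to the Bradley-Terry comparison, which is the sense in which it generalizes the pairwise model.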
Situations with an asymmetric distribution of information have also been explored. In weakly supervised learning, the annotation available at training time is less detailed than the output one wants to predict. This situation occurs, e.g., when trying to learn an image segmentation system using only per-image or bounding-box annotation. In multiple instance learning, training labels are given not for individual examples, but collectively for groups of examples. The inverse situation also occurs: for example, in the PASCAL object recognition challenge, it has become a standard technique to incorporate strong annotation in the form of bounding boxes or per-pixel segmentations, even when the goal is just per-image object categorization.
Moreover, neither Twitter's current chronological ranking nor the recently introduced popularity-based ranking can avoid spam. A developer can accumulate hundreds of thousands of followers in a day or so. At the same time, it is not difficult for spammers to create large quantities of retweets. By contrast, content relevance ranking can effectively prevent spammers from cheating. Unlike ranking tweets by chronological order or popularity, a content relevance strategy considers many characteristics of a tweet to determine its ranking level. Thus it is difficult for spammers to game the ranking system with simple methods such as increasing the retweet count or the number of followers.
All techniques for learning to rank require two essential pieces of information: training data, which provides examples for the learning algorithm as to what distinguishes good results from poor results, and an error metric that the algorithm optimizes relative to this training data. Most previous research in learning to rank has assumed a supervised learning setting where training data is provided by some offline mechanism. Such data is often obtained by paying expert relevance judges to provide it, for instance by presenting them with a sequence of recorded search queries and Web documents. The role of the judge is to guess the users' intentions based on the query issued, and provide an appropriate graded relevance score, such as very relevant or somewhat relevant, for each document assessed. However, judgments collected from users would be preferable, as they would reflect the users' true needs, and be much cheaper and faster to collect. With respect to error metrics, most algorithms optimize metrics that aggregate over the judgments made for (query, result) pairs, assessing how well the rankings produced by the learned ranking function agree with the judgments provided by the experts. Again, it would be preferable for error metrics to instead reflect the experiences of interactive information ranking system users.
Some research has been done in the area of parallel or distributed machine learning [53, 42], with the aim of speeding up machine learning computation or increasing the size of the data sets that can be processed with machine learning techniques. However, almost none of these parallel or distributed machine learning studies target the Learning to Rank subfield of machine learning. The field of efficient Learning to Rank has received some attention lately [15, 16, 37, 194, 188], since Liu [135] first stated its growing importance back in 2007. Only a few of these studies [194, 188] have explored the possibilities of efficient Learning to Rank through the use of parallel programming paradigms. MapReduce [68] is a parallel computing model inspired by the Map and Reduce functions that are commonly used in the field of functional programming. Since Google developed the MapReduce parallel programming framework back in 2004, it has grown to be the industry standard model for parallel programming. The release of Hadoop, an open-source implementation of the MapReduce system that was already in use at Google, contributed greatly to MapReduce becoming the industry standard way of doing parallel computation.
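The Map and Reduce functions the model is named after can be sketched in a single-process word-count example (the framework's actual contribution is distributing the map, shuffle, and reduce phases across machines; here the "shuffle" is just in-memory grouping):

```python
from collections import defaultdict
from itertools import chain

def map_fn(line):
    # Map: turn one input record into key/value pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Reduce: combine all values that share a key.
    return word, sum(counts)

def map_reduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    # "Shuffle": group all mapped values by key.
    for key, value in chain.from_iterable(map_fn(r) for r in records):
        groups[key].append(value)
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

print(map_reduce(["to be or not to be"], map_fn, reduce_fn))
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

User code supplies only map_fn and reduce_fn; in Hadoop-style systems the framework handles partitioning, scheduling, and fault tolerance around them.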
The problem of learning to rank has gained attention in the field of Information Retrieval (IR) since 2005. It has been boosted by the ongoing development of the LETOR benchmark data set. Until now, most learning-to-rank research has been directed at developing new techniques and evaluating them on the LETOR data collections. This has resulted in a good understanding of the performance of a range of ranking techniques for this specific data set. However, it is not yet known to what extent their performance will change on other data sets. This paper is a step towards understanding to what extent the results generalize to other data and applications. Learning-to-rank experiments are meaningful for applications that produce a ranked list of items (documents, entities, answers, etc.) that are described by a set of features and a class label according to which they can be ranked. In IR applications, the class label refers to the item's relevance. In the case of QA, relevance is generally defined as a binary variable. On the other hand, all operational QA systems still present a ranked list of answer candidates for each individual input question. For our system for why-QA, we also use binary relevance labeling while aiming at a ranked result list. Although we found that it is to some extent possible to label the answers to why-questions on a multi-level relevance scale, we decided to treat answer relevance as a binary variable (see Section 3.3). This means that our ranking function needs to induce a ranked list from binary relevance judgments.
Modern Information Retrieval (IR) systems have become more and more complex, involving a large number of parameters. For example, a system may choose from a set of possible retrieval models (BM25, language model, etc.), or various query expansion parameters, whose values greatly influence the overall retrieval effectiveness. Traditionally, these parameters are set at a system level based on training queries, and the same parameters are then used for different queries. We observe that it may not be easy to set all these parameters separately, since they can be dependent. In addition, a global setting for all queries may not best fit all individual queries with different characteristics. The parameters should be set according to these characteristics. In this article, we propose a novel approach to tackle this problem by dealing with entire system configurations (i.e., a set of parameters representing an IR system behaviour) instead of selecting a single parameter at a time. The selection of the best configuration is cast as a problem of ranking different possible configurations given a query. We apply learning-to-rank approaches to this task. We exploit both the query features and the system configuration features in the learning-to-rank method, so that the selection of the configuration is query dependent. The experiments we conducted on four TREC ad hoc collections show that this approach can significantly outperform the traditional method of tuning system configurations globally (i.e., grid search) and leads to higher effectiveness than the top performing systems of the TREC tracks. We also perform an ablation analysis of the impact of different features on the model's learning capability and show that query expansion features are among the most important for adaptive systems.
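A minimal sketch of the representation this approach relies on (the feature names and the stub scoring function are hypothetical; in the approach itself the scorer is learned with learning-to-rank): each candidate is a (query, configuration) pair whose feature vector concatenates query features and configuration features, so the configuration choice can be query-dependent.

```python
def candidate_features(query_feats, config_feats):
    # One candidate = query features concatenated with configuration features.
    return query_feats + config_feats

queries = {"q1": [0.3, 1.2]}           # e.g. query length, clarity (hypothetical)
configs = {
    "bm25_qe":  [1.0, 10.0],           # e.g. model id, expansion terms (hypothetical)
    "lm_no_qe": [2.0, 0.0],
}

# Stub scorer standing in for a trained learning-to-rank model.
score = lambda feats: sum(feats)

best = max(configs,
           key=lambda c: score(candidate_features(queries["q1"], configs[c])))
print(best)  # bm25_qe
```

The point of the design is that, unlike grid search, the ranker sees the query features and can therefore prefer different configurations for different queries.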
User feedback in a system was introduced by Rocchio. This method was introduced as relevance feedback and enabled users to communicate their evaluation to the system after every operation. Relevance feedback is an example of explicit feedback, which, as the name suggests, is collected from dedicated interactions with the system. This makes explicit feedback expensive for users, since it takes both their time and effort. Instead, implicit feedback can be used, which is extracted directly from users' natural interactions with the system. An early approach to learning from this type of feedback was presented by Joachims, who showed that it can be used to improve ranking in search engines. Examples of implicit feedback are clicks, mouse movement, and dwell time. Mouse clicks are a good choice of implicit feedback compared to the others, since large quantities can be collected at a low cost. An illustration of the interaction between the user and the ranker (ranking algorithm), in which mouse clicks are evaluated, can be seen in Figure 2.3. The user issues a query to the system, which returns a ranked list. Once the user clicks a document in the list, the click is registered and evaluated, and is used to re-learn and update the ranking function.
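The interaction loop described above can be sketched as follows (a hypothetical perceptron-style pairwise update in the spirit of click-preference learning, not a specific published algorithm): a click is read as "clicked document preferred over the skipped documents ranked above it", and each such preference pair nudges a linear ranker's weights.

```python
def rank(weights, docs):
    # docs: list of (doc_id, feature_vector); highest linear score first.
    score = lambda f: sum(w * x for w, x in zip(weights, f))
    return sorted(docs, key=lambda d: score(d[1]), reverse=True)

def update_from_click(weights, ranked, clicked_id, lr=0.1):
    feats = {doc_id: f for doc_id, f in ranked}
    clicked = feats[clicked_id]
    for doc_id, f in ranked:
        if doc_id == clicked_id:
            break
        # Each document skipped above the click yields a preference pair:
        # move the weights toward the clicked document's features.
        weights = [w + lr * (c - s) for w, c, s in zip(weights, clicked, f)]
    return weights

docs = [("a", [1.0, 0.0]), ("b", [0.0, 1.0])]
w = update_from_click([0.0, 0.0], rank([0.0, 0.0], docs), "b")
```

After the update, re-ranking the same list places the clicked document first, which is the re-learn-and-update step in the loop.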
Abstract. Most existing learning to rank methods neglect query-sensitive information while producing functions to estimate the relevance of documents (i.e., all examples in the training data are treated indistinctly, no matter the query associated with them). This is counter-intuitive, since the relevance of a document depends on the query context (i.e., the same document may have different relevances, depending on the query associated with it). In this paper we show that query-sensitive information is of paramount importance for improving ranking performance. We present novel learning to rank methods. These methods use rules associating document features to relevance levels as building blocks to produce ranking functions. Such rules may have different scopes: global rules (which do not exploit query-sensitive information) and query-level rules. Firstly, we discuss a basic method, RE-GR (Relevance Estimation using Global Rules), which neglects any query-sensitive information, and uses global rules to produce a single ranking function. Then, we propose methods that effectively exploit query-sensitive information in order to improve ranking performance. The RE-SR method (Relevance Estimation using Stable Rules), produces a single ranking function using stable rules, which are rules carrying (almost) the same information no matter the query context. The RE-QR method (Relevance Estimation using Query-level Rules), is much finer-grained. It uses query-level rules to produce multiple query-level functions. The estimates provided by such query-level functions are combined according to the competence of each function (i.e., a measure of how close the estimate provided by a query-level function is to the true relevance of the document). We conducted a systematic empirical evaluation using the LETOR 4.0 benchmark collections. We show that the proposed methods outperform state-of-the-art learning to rank methods in most of the subsets, with gains ranging from 2% to 9%. 
We further show that RE-SR and RE-QR, which use query-sensitive information while producing ranking functions, achieve superior ranking performance when compared to RE-GR.
In early IR research, unsupervised scoring methods such as TF-IDF, Okapi BM25, and language models, among others, were used (Manning et al., 2008). Using only one scoring method in IR systems is not very efficient. Moreover, the accuracy of results produced by models such as Okapi BM25 and language models depends on the relevance judgments (Tonon et al., 2015; Urbano, 2016; Ibrahim and Landa-Silva, 2016). This motivates the use of more than one scoring method for ranking retrieved documents with respect to user queries. In addition, it is also important that other features, such as the business importance of the documents on the web and of the host server, among other desirable features, are considered in the ranking of documents. Recently, Qin et al. initiated a new trend in research on ranking documents by producing the LETOR datasets (Qin et al., 2010). These datasets are benchmarks distilled from search engines and from the well-known TREC conference collections. These benchmarks contain more than one scoring weighting scheme as part of the benchmark features. They also contain some other features that indicate the importance of the documents on the web. The documents in these datasets were mapped into fully judged query-document pairs for Learning to Rank (LTR) research problems.
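As an illustration of one such unsupervised scoring method, a minimal Okapi BM25 scorer (standard formula; k1 and b are the usual free parameters, and the idf variant with +1 inside the log is one common choice):

```python
import math

def bm25_score(query_terms, doc_terms, doc_freqs, n_docs, avg_len,
               k1=1.2, b=0.75):
    """Okapi BM25 score of one document for a query.
    doc_freqs[t] = number of documents in the collection containing t."""
    score, dl = 0.0, len(doc_terms)
    for t in query_terms:
        tf = doc_terms.count(t)          # term frequency in this document
        if tf == 0:
            continue
        df = doc_freqs.get(t, 0)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        # Saturating tf component, normalised by document length.
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_len))
    return score
```

In a LETOR-style feature vector, a score like this is just one column among many (other weighting schemes, link-based importance, etc.) that the learned ranker combines.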
In general, as shown in the 4th row, 3rd column of Figure 1, the importance of the feature sets is independent of the choice of learning to rank technique, since the lines that correspond to the various feature sets are horizontal across the learning to rank techniques. The difference in the effectiveness of NoQI and the other feature sets again shows us that removing the QI features significantly reduces the effectiveness of both learning to rank techniques (see Table 7 & Figure 3). The case of the field-based weighting models (FM) is different from that of QI, in that removing the query-dependent FM features yields a significant gain in the effectiveness of the learned models for CW09A while, in contrast, it causes a significant loss in effectiveness for CW12B (this can also be observed in Figure 1: 2nd row, 4th column), and a marked loss for CW12A. In particular, while the query-dependent features in the WM set encapsulate the anchor text, only the FM feature set allows the learner to separately weight the presence of query terms within the anchor text. We believe that these results suggest that the presence of spam within ClueWeb09 (particularly category A) can mislead the learner as to the usefulness of the anchor text, which will vary according to the prevalence of spam across queries. On the other hand, with the reduced amount of spam in ClueWeb12, the FM feature set is useful for retrieval, and its ablation results in effectiveness degradations, which are significant in the case of CW12B.
Learning the optimal ordering of content is an important challenge in website design and online advertising. The learning to rank (LTR) framework captures such a challenge via a sequential decision-making model. In this setting, a decision-maker repeatedly selects orderings of items (product advertisements, search results, news articles, etc.) and displays them to a user visiting their website. In response, the user opts to click on none, one, or more of the displayed items. The objective of the decision-maker is to maximise the number of clicks received over many iterations of this process. Such an objective is a reasonable and widely used proxy for the most common interests of a decision-maker in this setting, e.g. maximising profit and maximising user satisfaction. As such, methods that achieve this objective can be hugely impactful in real-world settings.