Machine Learning for IR - News vertical search using user-generated content

ply sufficient funding (Macdonald, Soboroff & Ounis, 2009). However, such an approach is limited, as the number of documents that can be judged is determined by the number of participants. Furthermore, the size of the document pools used to assess systems, in comparison to the size of the collections examined, i.e. the completeness of the produced relevance assessments, has been diminishing almost year-on-year (He, Macdonald & Ounis, 2008). This violates the completeness assumption of TREC-style assessment (Voorhees et al., 2005) to an ever greater degree, increasing the probability of error during evaluation. Moreover, for novel tasks or when evaluating on new collections, relevance assessments may not (as yet) be available, hence the task of creating these assessments falls to the researcher. Later in this thesis we describe how we created relevance assessments to evaluate our news search tasks using crowdsourcing (see Chapter 5).

2.5 Machine Learning for IR

Machine learning refers to the field of approaches that automatically learn solutions to problems using prior data (Carbonell, 1990). Machine learning has become closely linked with information retrieval, as many tasks in information retrieval can be formulated in a manner that can be tackled by machine learning approaches, e.g. categorising documents (Yu et al., 2002) or learning how to rank documents (Liu, 2009). Moreover, machine learned approaches have been shown to be effective for many of these tasks (Agichtein et al., 2006; Arguello et al., 2011; Dai et al., 2011; Kang et al., 2011; Zeng et al., 2004). Indeed, commercial Web search engines like Google and Bing use machine learned models to drive their search rankings.

An important concept within machine learning is that of a feature. A feature is some property of the subject of the learning. For example, for information retrieval ranking problems, the features might describe the documents or the user queries. An example of a query feature is query length, while a document feature might be its PageRank (Page et al., 1999) score.
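To make the notion of a feature concrete, the following is a minimal sketch of how the two example features above could be assembled into a feature vector for a query-document pair. The names Document, query_features, document_features and feature_vector are hypothetical, and the PageRank score is assumed to have been computed elsewhere; this is an illustration rather than the feature extraction used in this thesis.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical document record; pagerank is assumed to be precomputed."""
    doc_id: str
    pagerank: float


def query_features(query: str) -> dict:
    """Features describing the query alone, here only its length in terms."""
    return {"query_length": float(len(query.split()))}


def document_features(doc: Document) -> dict:
    """Features describing the document alone, here only its PageRank score."""
    return {"pagerank": doc.pagerank}


def feature_vector(query: str, doc: Document) -> list:
    """Combine query and document features into one fixed-order vector."""
    feats = {**query_features(query), **document_features(doc)}
    # Sorting the names keeps the vector layout identical for every pair.
    return [feats[name] for name in sorted(feats)]


vector = feature_vector("glasgow news today", Document("d1", 0.25))
# Layout is [pagerank, query_length] because names are sorted alphabetically.
```

Fixing the order of the features matters because a learner associates each position in the vector with one property.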

In this thesis, we use two different types of machine learning, associated with two different tasks. In particular, we use machine learning for query classification and for document ranking. In the following two sub-sections, we describe machine learning with respect to these two tasks.

2.5.1 Classification

Classification approaches tackle problems that require the labelling of instances into two or more distinct classes. For example, Web page classification involves the labelling of Web pages into pre-defined categories, e.g. personal homepages, resume pages, etc. (Yu et al., 2002). Classification is a supervised machine learning problem, i.e. the classifier uses a training set containing instances whose class is already known. The classifier builds a model from the training instances that can then be used to estimate the class of new, unseen instances. This is done by first extracting a fixed set of features about each instance; the learner then uses the training data to identify and combine the features with discriminative power, i.e. those that are useful for discriminating between instances of each class. The final combination of features is referred to as the classification model.
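The supervised process described above can be sketched as follows. This is a minimal illustration that assumes a plain logistic regression learner trained by batch gradient descent, chosen only because it is short; it is not the learner used later in this thesis. The function names, the toy data and the hyperparameters (epochs, lr) are all hypothetical.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(instances, labels, epochs=500, lr=0.5):
    """Learn one weight per feature from instances whose class is known."""
    n_features = len(instances[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(instances)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(instances, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = sigmoid(score) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(model, x):
    """Estimate the class (0 or 1) of a new, unseen instance."""
    weights, bias = model
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if sigmoid(score) >= 0.5 else 0


# Toy training set: feature 0 separates the classes, feature 1 does not.
train_x = [[0.0, 1.0], [0.1, 0.0], [0.9, 0.0], [1.0, 1.0]]
train_y = [0, 0, 1, 1]
model = train_logistic(train_x, train_y)
```

In this toy data only the first feature has discriminative power, so the learned model is expected to place most of its weight on it; the pair (weights, bias) plays the role of the classification model.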

There are a variety of different approaches proposed within the literature to learn a classification model. These can be summarised as decision trees, rule-based learners, perceptron learning, statistical learning, instance-based learning and support vector machines (SVM) (Kotsiantis et al., 2007). Decision trees, as their name suggests, build a tree-like structure, where features are used to make a decision at each branching point and leaf nodes are the resultant classes. An example of a decision tree algorithm is C4.5 by Quinlan (1993). Rule-based learners are similar, in that they construct rules (combinations of features) that are each comparable to a single path through a decision tree. However, rule-based learners directly induce the rules from the training instances, rather than building tree-like structures (Fürnkranz, 1999). In contrast, single or multi-class perceptron learners build vectors of perceptrons (Ivakhnenko, 1975), where each perceptron outputs a binary decision based upon a threshold for an input feature. Such a learner is trained by varying the perceptron thresholds until a vector that produces the correct result for all training instances is found (Kotsiantis et al., 2007). Statistical learning approaches differ from the prior approaches in that they define an explicit probability model to describe how an instance is related to each class. In particular, such approaches typically produce a probability estimate that each instance belongs to each class. The most well-known type of statistical learning approach is a Bayesian network, which defines a directed acyclic graph, where each node corresponds to an input feature and connections represent influences between the features (Jensen, 1996). Instance-based learning approaches, also known as lazy learners, are a special type of learning approach that avoids training a model beforehand. Instead, they store the training instances directly, and then compare new instances to the training set to find the closest match. The most well-known instance-based approach is nearest-neighbour search (Aha, 1997). Finally, support vector machines represent each training instance in vector space and attempt to partition this space into distinct classes using hyperplanes (Vapnik, 2000). In particular, they attempt to maximise the margin between each separating hyperplane and the nearest training instances of the classes it separates, as this has been shown to reduce generalisation error.
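As an illustration of the instance-based (lazy) approach described above, the following is a minimal nearest-neighbour sketch. It assumes Euclidean distance and a majority vote over the k closest training instances; the function name, the class labels and the toy feature values are hypothetical.

```python
import math
from collections import Counter


def nearest_neighbour_classify(training, instance, k=1):
    """Label an instance by majority vote among its k closest training instances.

    training is a list of (feature_vector, label) pairs that is stored as-is:
    no model is built beforehand, all work happens at classification time.
    """
    by_distance = sorted(training, key=lambda pair: math.dist(pair[0], instance))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two classes.
training = [
    ([0.0, 0.0], "non-news"),
    ([0.1, 0.2], "non-news"),
    ([0.9, 1.0], "news"),
    ([1.0, 0.8], "news"),
]
label = nearest_neighbour_classify(training, [0.8, 0.9])
```

The cost of this design is that every classification compares the new instance against the whole training set, which is the trade-off made for skipping the training phase.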

Work by Kotsiantis et al. (2007) indicates that, in general, support vector machines produce the most effective classification models, at the cost of learning time. On the other hand, statistical approaches train models more quickly, but the resultant models may be less accurate (Kotsiantis et al., 2007). In this work, we build models comprising hundreds of individual features, across thousands of instances. For this reason, in this thesis we primarily use a faster statistical approach, namely linear logistic regression trees (Landwehr et al., 2003) for classification, although we compare to other classifiers where possible. In particular, we use machine learning to build a real-time news query classifier in Chapter 7.

2.5.2 Learning to Rank

Learning to rank (LTR) approaches use machine learning to tackle document ranking problems. In an information retrieval setting, this typically involves ranking with respect to relevancy, although other ranking criteria are possible. The aim of learning to rank approaches is to improve a given document ranking with respect to some property. This is achieved by re-ranking an initial ranking such that documents with the desired property are promoted into the top ranks.

In their simplest form, learning to rank techniques use initial document rankings for a set of query topics, features about the individual documents within those rankings, and relevance assessments about the individual documents for each query topic, to form a ranking model. This model can then be applied to unseen document rankings, re-ranking them to increase relevancy (or some other desired ranking property). In particular, when building (or training) a model, an initial document ranking is created, referred to as the sample. A sample should have high recall in terms of documents with the desired ranking property, e.g. for relevancy-based rankings, the sample should contain many relevant documents (Macdonald et al., 2012). However, these documents do not need to appear within the top ranks; indeed it is the aim of LTR to achieve this through re-ranking. Next, features about each of the documents are extracted. An effective feature should aid in distinguishing the documents that have the desired property, e.g. relevance to the query. In effect, the LTR approach aims to find a combination of these features that leads to effective ranking. Indeed, given the sample and its features, a learning to rank approach will try different combinations of those features to find those that lead to increased effectiveness when ranking the sample. LTR approaches repeat this process for many document samples to find the feature combination that leads to improved effectiveness across all of those samples. This feature combination is referred to as the ranking model. The idea is that the resultant ranking model will generalise to unseen sample rankings, if the training samples are representative of the types of rankings encountered.
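The training loop described above can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the thesis: the feature combination is a linear weighting, candidate weightings are enumerated by brute force over a small grid, and effectiveness is measured by mean average precision over the training samples. All function names and the toy data are hypothetical.

```python
from itertools import product


def average_precision(ranked_labels):
    """Average precision of one ranking, given 0/1 relevance labels in rank order.

    Normalises by the relevant documents present in the sample, which assumes
    the sample contains every judged relevant document for the query.
    """
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def rerank(sample, weights):
    """Re-rank a sample of (doc_id, features, relevant) by weighted feature sum."""
    def score(entry):
        return sum(w * f for w, f in zip(weights, entry[1]))
    return sorted(sample, key=score, reverse=True)


def mean_ap(samples, weights):
    """Effectiveness of one feature combination across all samples."""
    aps = [average_precision([rel for _, _, rel in rerank(s, weights)])
           for s in samples]
    return sum(aps) / len(aps)


def learn_ranking_model(samples, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Try every weight combination on the grid and keep the most effective."""
    n_features = len(samples[0][0][1])
    return max(product(grid, repeat=n_features),
               key=lambda weights: mean_ap(samples, weights))


# Two toy training samples (one per query); relevant documents start low.
train = [
    [("a1", [0.9, 0.1], 0), ("a2", [0.8, 0.9], 1),
     ("a3", [0.3, 0.8], 1), ("a4", [0.2, 0.2], 0)],
    [("b1", [0.7, 0.2], 0), ("b2", [0.6, 0.7], 1), ("b3", [0.1, 0.1], 0)],
]
model = learn_ranking_model(train)

# Apply the learned model to an unseen sample.
unseen = [("u1", [0.9, 0.3], 0), ("u2", [0.4, 0.8], 1), ("u3", [0.5, 0.1], 0)]
reranked = rerank(unseen, model)
```

Exhaustive search is only feasible for a handful of features; practical learners replace it with a more efficient optimisation, but the inputs (samples, features, relevance assessments) and the output (a feature combination) are the same.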

Learning to rank approaches can be categorised into three different types. Each type of approach uses a different strategy to evaluate the sample ranking. These types are pointwise, pairwise and listwise. Pointwise techniques learn on a per-document basis, i.e. each document is considered independently. Pairwise techniques optimise the number of pairs of documents correctly ranked. Listwise techniques optimise an information retrieval evaluation measure, like mean average precision, that considers the entire ranking list at one time (Liu, 2009). Prior work has indicated that listwise techniques
