to define relatedness between each candidate document and the query, rather than the language models used under Model 2 approaches. More advanced voting techniques such as CombMNZ take into account the number of voting documents for each candidate (Macdonald, 2009):
$$\mathrm{score\_cand}_{\mathrm{CombMNZ}}(Q, D_c) = |R(Q) \cap D_c| \cdot \sum_{d \in R(Q) \cap D_c} \mathrm{score}(d, Q) \tag{2.22}$$
where |R(Q) ∩ Dc| is the number of documents that both match the candidate set c and were retrieved for the query Q. Other voting techniques may consider only a subset of the candidate set. For example, CombMAX scores each candidate based upon its highest scoring document:
$$\mathrm{score\_cand}_{\mathrm{CombMAX}}(Q, D_c) = \max_{d \in R(Q) \cap D_c} \mathrm{score}(d, Q) \tag{2.23}$$
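As an illustration of Equations 2.22 and 2.23, the following minimal sketch computes both voting scores from a toy document ranking. The function name `voting_scores`, the dictionary-based data layout and the example values are assumptions made for illustration; they are not taken from the Voting Model implementation itself.

```python
def voting_scores(ranking, candidate_docs):
    """Compute CombMNZ and CombMAX scores for each candidate.

    ranking: dict mapping doc_id -> score(d, Q) for the documents in R(Q).
    candidate_docs: dict mapping candidate -> set of doc_ids (the set D_c).
    Candidates with no retrieved documents receive no score.
    """
    comb_mnz, comb_max = {}, {}
    for candidate, docs in candidate_docs.items():
        # Votes are the scores of documents in the intersection of R(Q) and D_c.
        votes = [ranking[d] for d in docs if d in ranking]
        if not votes:
            continue
        comb_mnz[candidate] = len(votes) * sum(votes)  # Equation 2.22
        comb_max[candidate] = max(votes)               # Equation 2.23
    return comb_mnz, comb_max


# Hypothetical toy data: three retrieved documents, three candidates.
ranking = {"d1": 2.0, "d2": 1.0, "d3": 0.5}
candidate_docs = {"alice": {"d1", "d3"}, "bob": {"d2", "d4"}, "carol": {"d5"}}
mnz, mx = voting_scores(ranking, candidate_docs)
```

In this toy example, the candidate with two voting documents is rewarded twice under CombMNZ (once through the sum and once through the vote count), whereas CombMAX ignores everything except the single best document.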
More recently, Macdonald & Ounis (2011) proposed a learned variant of their approach that combined different document ranking features as well as aggregate ranking approaches together to form a composite model for ranking aggregates. In particular, under this approach, the scores produced by individual aggregate ranking approaches for a candidate act as features about that candidate. Many candidate features are combined using a learning to rank approach (see Section 2.5) to form an aggregate ranking model based upon a set of pre-defined training instances. The resultant aggregate ranking model can be used to score unseen candidates after the extraction of each feature. Through experimentation, they showed that this learned approach could outperform the individual approaches upon which it was built over multiple datasets.
In summary, aggregate ranking approaches tackle information retrieval problems where the objects (candidates) to be ranked are not represented initially as individual documents, but rather as multiple documents. Aggregate ranking approaches can be divided into two types: those that represent a candidate as a virtual document, and those that score each document individually and then aggregate the document scores to form a final score. One effective aggregate ranking approach is the Voting Model (Macdonald, 2009), which considers each of a candidate’s documents to be a vote for that candidate. In this work, we build upon the Voting Model to tackle one of the news search tasks addressed in this thesis (see Chapter 6).
2.8 Resource Selection/Vertical Search
Resource selection is a specific field of information retrieval that deals with the ranking of content from multiple diverse sources, rather than a single centralised one (Callan, 2000; Craswell, 2000). Resource selection is also known as federated search. Similarly, vertical search is a specialisation of resource
selection, where each of the external sources represents a different domain of content, e.g. fresh news results, standard Web pages or product results.
The motivation behind resource selection is the following. An operating environment might have multiple separate content sources, each providing unique content that we might want to retrieve for the user query, returning an aggregate of those results. In a Web search scenario, vertical search over Web pages, news articles, blogs, etc. is a common instance of resource selection. However, some (or all) content sources may be controlled by third parties, in the form of search services that do not expose the underlying statistics of the collection and/or may be rate limited. Hence, the challenge is to select a subset of the available sources to search and then combine the results in an effective manner for a given query (Craswell, 2000).
In general, the task of resource selection can be described as follows. Given a set of available document sources S that are searchable, and an incoming query Q, select a subset of sources S′ ⊂ S. Next, retrieve the top k documents for the query Q (R(Q)) from each source in S′ and merge the retrieved results into a final composite ranking that satisfies Q. Hence, resource selection can be seen as two distinct tasks. Firstly, the resource selection technique must choose a subset of the available sources to search. Secondly, once documents have been retrieved from each source, the most relevant of these need to be identified and ranked.
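The two tasks described above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions rather than any particular published technique: the function `federated_search`, the `select_score` callback (a stand-in for a source selection technique such as those in Section 2.8.1), the per-source min-max score normalisation used for merging, and the toy sources are all hypothetical.

```python
def federated_search(query, sources, select_score, n_sources=2, k=3):
    """Select a subset of sources, retrieve the top k from each, and merge.

    sources: dict mapping source name -> search function (query, k) -> [(doc_id, score)].
    select_score: function (source name, query) -> float used to rank sources.
    """
    # Task 1: choose the subset S' of sources to search.
    ranked = sorted(sources, key=lambda name: select_score(name, query), reverse=True)
    selected = ranked[:n_sources]

    # Task 2: retrieve from each selected source and merge the result lists.
    merged = []
    for name in selected:
        results = sources[name](query, k)
        if not results:
            continue
        scores = [score for _, score in results]
        lo, hi = min(scores), max(scores)
        for doc_id, score in results:
            # Assumption: min-max normalisation makes scores comparable across sources.
            norm = (score - lo) / (hi - lo) if hi > lo else 1.0
            merged.append((norm, name, doc_id))
    merged.sort(key=lambda item: item[0], reverse=True)
    return selected, merged


def make_source(docs):
    """Build a toy search service that ignores the query and returns its best k documents."""
    return lambda query, k: sorted(docs, key=lambda pair: pair[1], reverse=True)[:k]


sources = {
    "news": make_source([("n1", 12.0), ("n2", 8.0)]),
    "blogs": make_source([("b1", 0.9), ("b2", 0.3)]),
    "products": make_source([("p1", 5.0)]),
}
prior = {"news": 0.7, "blogs": 0.5, "products": 0.1}  # hypothetical selection scores
selected, merged = federated_search("election results", sources, lambda name, q: prior[name])
```

Note that the raw scores of the two selected sources are on very different scales (12.0 versus 0.9), which is why some form of normalisation is needed before the lists can be interleaved.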
Importantly, there are two environments that determine the types of technique that are suitable for resource selection, namely: cooperative and non-cooperative. In a cooperative environment, all of the sources are considered to be working together to produce a final ranking of results. To this end, it is assumed that underlying statistics and indeed documents from each source are available to the merging application. On the other hand, in a non-cooperative environment, each source is only available as a search service. Both the selection of sources to retrieve from and the subsequent merging of result lists are easier to achieve in a cooperative environment than in a non-cooperative one, as more information about each source is available (Craswell, 2000). Techniques that work in non-cooperative environments instead simulate the missing information by querying each source to produce document samples representing those sources (Si & Callan, 2003c).
Another critical distinction is that there are two different strategies for selecting the sources to search, dependent upon the final search goal. For federated search tasks, the aim is to find all of the relevant documents; hence, approaches focus on estimating the probability that each source contains relevant content. In contrast, vertical search tasks tend to assume that relevant content will exist, instead focusing on the types/verticals (e.g. products, flight results, etc.) of content that should be returned to the user based on their query.
In this thesis, we employ resource selection techniques to merge news and user-generated content from different sources together. Notably, due to our experimental setting where we have indices for all of the sources used, we are working in a cooperative environment. In the following three sub-sections, we describe prior work in the field of resource selection. In particular, Section 2.8.1 describes relevance-focused source selection approaches, while Section 2.8.2 details approaches for selecting sources based upon vertical types. In Section 2.8.3, we describe prior techniques for merging result lists from different verticals with varying background statistics.
2.8.1 Source Selection via Expected Relevancy
Federated search tasks aim to select content sources that contain content relevant to the user query. One of the best-known approaches for selecting the sources from which to retrieve content is the CORI algorithm, proposed by Callan et al. (1995). CORI scores different sources by the probability that they contain relevant content using an inference network. In particular, the CORI algorithm is a type of virtual document approach, whereby each source that may be selected is represented as a single large document. Notably, CORI assumes that a virtual document for each source is available; this might be generated from the original document collection underlying each source, or be estimated from a retrieved document sample. CORI scores each source using the document frequency and a variant of the inverse document frequency for each term within the query Q. The document frequency comes from the source to be ranked s, while the inverse document frequency is calculated over all sources S. In particular, the score for a source given a term t ∈ Q can be formulated as follows:
$$\mathrm{score}_{\mathrm{CORI}}(s, S, t) = b + (1 - b) \cdot \frac{df(s, t)}{df(s, t) + 50 + 150 \cdot |c_s| / \mathrm{avg\_terms}(S)} \cdot \frac{\log\left(\frac{|S| + 0.5}{|\mathrm{contain}(t, S)|}\right)}{\log(|S| + 1.0)} \tag{2.24}$$

where b is a constant (typically 0.4 (Callan, 2000)), df(s, t) is the document frequency of the term t in s, |c_s| is the number of terms within s, avg_terms(S) is the average number of terms in all sources S, |S| is the number of sources and |contain(t, S)| is the number of sources that contain the term t. As can be seen, the CORI algorithm above (from left to right) comprises three components: a constant; a variation of Robertson’s term frequency (Robertson & Walker, 1994), where the term frequency has been replaced with document frequency (df) and the constants have been multiplied by a factor of 100 to account for the larger df values; and Turtle’s (Turtle & Croft, 1989) inverse document frequency, where the number of documents has been replaced with the number of sources (|S|). Approaches that use a similar virtual document approach to CORI include CVV (Yuwono & Lee, 1997) and KL-divergence (Xu & Croft, 1999). CORI has been reported to be effective over multiple datasets (Callan,
2000; Craswell et al., 2000; Powell & French, 2003). However, some work has contested these claims (D’Souza et al., 2004; Shokouhi, 2007).
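To make the three components of Equation 2.24 concrete, the following minimal sketch scores a source for a single query term. The function and parameter names are hypothetical, and the normaliser log(|S| + 1.0) follows the commonly cited form of CORI, which is an assumption here. The sketch also assumes that the term occurs in at least one source.

```python
import math


def cori_term_score(df, source_terms, avg_terms, n_sources, n_containing, b=0.4):
    """Score one source for one query term, following Equation 2.24.

    df: document frequency of the term in the source, df(s, t).
    source_terms: number of terms in the source, |c_s|.
    avg_terms: average number of terms over all sources, avg_terms(S).
    n_sources: number of sources, |S|.
    n_containing: number of sources containing the term (assumed >= 1).
    """
    # Document frequency component, with the constants 50 and 150.
    t = df / (df + 50 + 150 * source_terms / avg_terms)
    # Inverse frequency component, computed over sources rather than documents.
    i = math.log((n_sources + 0.5) / n_containing) / math.log(n_sources + 1.0)
    return b + (1 - b) * t * i


# Hypothetical example: ten equally sized sources, term present in two of them.
score_a = cori_term_score(df=200, source_terms=1_000_000, avg_terms=1_000_000,
                          n_sources=10, n_containing=2)
score_b = cori_term_score(df=20, source_terms=1_000_000, avg_terms=1_000_000,
                          n_sources=10, n_containing=2)
score_none = cori_term_score(df=0, source_terms=1_000_000, avg_terms=1_000_000,
                             n_sources=10, n_containing=2)
```

A source in which the term never occurs falls back to the constant b, while sources with a higher document frequency for the term receive a higher score.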
Later, Si & Callan (2003c) noted that CORI often failed when the different sources vary greatly in size. To tackle this, they proposed an alternative to CORI, referred to as ReDDE. ReDDE estimates the number of relevant documents to the user query within each source using a sampling method (assuming a non-cooperative environment). In particular, they query each source to build up a sample of documents from that source beforehand. This sample is then indexed. For a user query, ReDDE submits that query to the index of sample documents, scoring the source based upon the documents retrieved. In particular, each source s is scored for a query Q as follows:
$$\mathrm{score}_{\mathrm{ReDDE}}(s, S, Q) = \sum_{d \in s(sample)} P(rel \mid d, Q) \cdot \frac{1}{|s(sample)|} \cdot |s| \tag{2.25}$$

where s(sample) is a sample set of documents from s, |s(sample)| is the size of this set, and |s| is the (estimated) size of source s. P(rel|d, Q) is the probability that document d in s(sample) is relevant to Q, and is calculated from the retrieval score of d for Q over the combination of all sample document sets for the sources in S. Si and Callan showed that ReDDE performs at least as well as CORI when selecting the top 10 resources from large numbers of resources.
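The scaling step in ReDDE can be illustrated with a short sketch. It assumes, following Si & Callan's description, that each sampled document stands in for |s| / |s(sample)| documents of its source, where |s| is the (estimated) source size. The function name, the input layout and the toy figures are hypothetical, and the relevance probabilities are taken as given rather than derived from a retrieval run.

```python
def redde_scores(sample_results, sample_sizes, source_sizes):
    """Estimate the amount of relevant content in each source.

    sample_results: list of (source, p_rel) pairs, one per sampled document
        retrieved for the query from the centralised sample index.
    sample_sizes: dict mapping source -> number of sampled documents.
    source_sizes: dict mapping source -> (estimated) number of documents in the source.
    """
    scores = {source: 0.0 for source in sample_sizes}
    for source, p_rel in sample_results:
        # Each sampled document represents source_size / sample_size real documents.
        scores[source] += p_rel * source_sizes[source] / sample_sizes[source]
    return scores


# Hypothetical example: equal sample sizes, very different source sizes.
sample_sizes = {"small": 100, "large": 100}
source_sizes = {"small": 1_000, "large": 100_000}
sample_results = [("small", 0.5), ("small", 0.5), ("large", 0.25)]
redde = redde_scores(sample_results, sample_sizes, source_sizes)
```

Although the small source contributes more (and better) sampled documents, the large source is scored higher because each of its sampled documents represents far more underlying documents.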
Shokouhi (2007) proposed an alternative approach to ReDDE that, instead of using a probabilistic measure of relevancy, uses normalised retrieval scores produced from IR document weighting models. This approach is referred to as Central-Rank-based Collection Selection (CRCS). CRCS is calculated as:
$$\mathrm{score}_{\mathrm{CRCS}}(s, S, Q) = \frac{|s|}{\left(\max_{s' \in S} |s'|\right) \cdot |s(sample)|} \cdot \sum_{d \in s(sample)} R(d, Q) \tag{2.26}$$

where |s| is the size of source s, |s(sample)| is the size of the document sample and R(d, Q) is the score for the document d for query Q using a document weighting model. Through experimentation on a testbed using the TREC GOV2 test collection, Shokouhi (2007) showed that CRCS provided superior performance in most cases to both CORI and ReDDE.
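A minimal sketch of the CRCS calculation in Equation 2.26 is given below. The function name and input layout are hypothetical, and the document scores R(d, Q) are supplied directly as toy values rather than being produced by an actual document weighting model.

```python
def crcs_scores(sample_results, sample_sizes, source_sizes):
    """Score each source following Equation 2.26.

    sample_results: list of (source, r) pairs, where r is R(d, Q) for a sampled
        document d of that source retrieved for the query.
    sample_sizes: dict mapping source -> number of sampled documents.
    source_sizes: dict mapping source -> number of documents in the source, |s|.
    """
    max_size = max(source_sizes.values())
    # Sum the document scores contributed by each source's sample.
    totals = {source: 0.0 for source in sample_sizes}
    for source, r in sample_results:
        totals[source] += r
    # Weight each sum by the source size relative to the largest source and its sample size.
    return {
        source: source_sizes[source] / (max_size * sample_sizes[source]) * totals[source]
        for source in sample_sizes
    }


# Hypothetical example with two sources.
sample_sizes = {"a": 100, "b": 200}
source_sizes = {"a": 1_000, "b": 4_000}
sample_results = [("a", 3.0), ("a", 1.0), ("b", 1.0)]
crcs = crcs_scores(sample_results, sample_sizes, source_sizes)
```

Dividing by the size of the largest source keeps the size weighting in the range (0, 1], so the final score is driven by the sampled document scores rather than by absolute collection sizes.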
2.8.2 Source Selection via Vertical Type
In contrast to federated search, vertical search tasks choose sources to search based upon their vertical type (Arguello et al., 2009). For instance, for the query ‘laptop’, a vertical search technique might decide to retrieve results from a product vertical. Related to this type of source selection is the field of query classification into topic categories (Li et al., 2008), i.e. the incoming query is classified based upon its relation to multiple available categories, where each category represents a source (vertical).
For categorisation, query classification approaches typically leverage evidence from outside the query text itself, since user queries are often short and under-specified. Beitzel et al. (2005, 2007) proposed one such approach, referred to as selectional preference. This approach uses machine learning to discover textual relations from a large (unlabelled) query-log, such that queries like ‘laptop’ can be associated with category descriptors such as ‘electronic’ or ‘product’. In contrast, Shen et al. (2005), as part of the 2005 KDD CUP (Li et al., 2005), used documents from each source to determine whether a query belonged to each category, in a similar manner to the ReDDE federated search approach by Si & Callan (2003c). Li et al. (2008) used machine learning in conjunction with lexical features from each query, in addition to category labels inferred from a query-click graph built from a query-log, to classify unseen queries. Similarly, Diaz (2009) also used machine learning to classify queries, although for the news vertical only. In particular, Diaz extracted features from a collection of news articles and from web/vertical query-logs. However, Diaz also introduced click-feedback, i.e. allowing a user’s subsequent clicks to enhance the model over time. Arguello et al. (2009) later expanded upon the approach by Diaz for multiple verticals, extracting features from each vertical considered to build a classification model.
In this thesis, we focus on the news vertical only. Indeed, we perform an initial classification of the user query to determine whether it is news-related. This classification is investigated in Chapter 7. However, unlike in a traditional search setting where we would search only newswire sources, we instead propose to search multiple news and user-generated content sources for relevant content. Hence, in Chapter 9, we also use source selection via expected relevancy to choose which of our different sources to display content from.
2.8.3 Merging of Document Rankings
The task of merging document rankings can be summarised as follows. Given multiple document rankings for a query, $S'_{R(Q)}$, where each individual ranking $s_{R(Q)} \in S'_{R(Q)}$ provides scores for each document, merge $S'_{R(Q)}$ into a single ranking $M_{R(Q)}$. Approaches to merge the document rankings focus on normalisation of the scores for the documents contained within each ranking, such that document scores across rankings become comparable even though they are generated by collections with different underlying statistics.
An effective approach for result merging in a cooperative environment was that deployed by the INQUERY system (Callan, 2000). This approach normalised each document by the (estimated) maximum and minimum scores that could be assigned to any source and the (estimated) maximum and minimum