scores that any document could receive for the query Q, as follows:
\[
\mathrm{score}(d, s, S_0) = \frac{\mathrm{score}(d, s) - d_{\min_s}}{d_{\max_s} - d_{\min_s}} + 0.4 \cdot \frac{\mathrm{score}(d, s) - d_{\min_s}}{d_{\max_s} - d_{\min_s}} \cdot \frac{\mathrm{score}(s) - s_{\min}}{s_{\max} - s_{\min}} \tag{2.27}
\]
where score(d, s) is the score for document d retrieved from source s, $d_{\min_s}$ is the minimum score that any document from source s could receive for query Q, and $d_{\max_s}$ is the maximum score that any document from that source could receive. score(s) is the score for the source s, computed using the CORI source selection approach described in Section 2.8.1. $s_{\min}$ is the minimum score that could be assigned to any source, while $s_{\max}$ is the maximum score that could be assigned to any source.
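To make the arithmetic of Equation 2.27 concrete, the following is a minimal Python sketch rather than a reference implementation. The function and parameter names (`normalise`, `cori_merge_score`, `weight`) are our own illustrative choices, and returning 0 when the minimum and maximum coincide is an assumption that the equation itself does not specify.

```python
def normalise(value, lowest, highest):
    """Min-max normalise `value` into [0, 1].

    Assumption: when the range is empty (lowest == highest) we return 0.0,
    a case that Equation 2.27 leaves undefined.
    """
    if highest == lowest:
        return 0.0
    return (value - lowest) / (highest - lowest)


def cori_merge_score(doc_score, d_min, d_max, source_score, s_min, s_max, weight=0.4):
    """Merged score for one document, following the form of Equation 2.27.

    doc_score    -- score(d, s), the score of document d within source s
    d_min, d_max -- minimum and maximum document scores possible for the query
    source_score -- score(s), the source selection score of s
    s_min, s_max -- minimum and maximum scores that any source could be assigned
    weight       -- the 0.4 constant from the equation (hypothetical parameter name)
    """
    doc_norm = normalise(doc_score, d_min, d_max)
    source_norm = normalise(source_score, s_min, s_max)
    # Normalised document score, boosted in proportion to the normalised source score.
    return doc_norm + weight * doc_norm * source_norm
```

Under these assumptions, a document with a normalised score of 0.5 from the highest-scoring source receives 0.5 + 0.4 · 0.5 · 1 = 0.7, whereas the same document from the lowest-scoring source keeps its normalised score of 0.5.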
Approaches for merging ranked lists in non-cooperative environments have also been proposed (He et al., 2011; Si & Callan, 2003b). In particular, Si & Callan (2003b) proposed to employ semi-supervised learning (SSL) to re-score documents within each ranked list. Firstly, each source is sampled to create document sets representing those sources. Next, a central document set is created by combining documents from all of the sample sets. SSL leverages documents that occur both in the ranked list retrieved from a source and in the central document set, using the ranks at which these matching documents appear to build a regression function. This function can then be used to convert the scores of non-matching documents retrieved from any of the sampled sources into normalised central scores that are comparable and can be used for ranking. Shokouhi & Zobel (2009) later proposed the Sample-Agglomerate Fitting Estimate (SAFE) approach for merging result lists in non-cooperative environments. Like Si and Callan's approach, SAFE uses SSL to perform regression upon documents sampled from each source. However, rather than relying upon document overlap between the central document set and the retrieved documents, it estimates the rank of each document for a query using the uniform sampling assumption. All of the estimated ranks for documents retrieved for each query are then used for regression, rather than only those that match the central document set.
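The regression step of overlap-based merging can be sketched as follows. This is an illustrative simplification and not the procedure of Si & Callan (2003b): we assume a simple linear least-squares fit per source, use each overlapping document's source-specific score as the regression input, and take its score in the central document set as the target. The names `fit_line`, `merge_with_regression`, `source_results` and `central_scores` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two overlapping documents to fit a line")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("overlapping documents all have the same source score")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def merge_with_regression(source_results, central_scores):
    """Merge per-source ranked lists into a single ranking.

    source_results -- dict: source name -> list of (doc_id, source_score)
    central_scores -- dict: doc_id -> score in the central document set
                      (assumed to be computed elsewhere for the same query)
    """
    merged = []
    for results in source_results.values():
        # Overlap documents appear both in this source's list and in the central set.
        overlap = [(score, central_scores[doc_id])
                   for doc_id, score in results if doc_id in central_scores]
        slope, intercept = fit_line([s for s, _ in overlap], [c for _, c in overlap])
        # Map every retrieved document, matching or not, onto the central score scale.
        for doc_id, score in results:
            merged.append((doc_id, slope * score + intercept))
    merged.sort(key=lambda pair: pair[1], reverse=True)
    return merged
```

Because each source gets its own fitted line, scores produced on incompatible scales (for example, one source scoring in [0, 10] and another in [0, 1]) become comparable once mapped to the central scale. A source with fewer than two overlapping documents raises an error in this sketch; a practical system would need a fallback for that case.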
2.9 Conclusions
In this chapter, we have provided a summary of the key concepts within information retrieval (IR) that this thesis builds upon, in addition to prior works in IR that are relevant to our later discussions or experiments. In particular, in Section 2.1 we introduced the field of information retrieval, while Section 2.2 detailed the process of indexing that is used to build the searchable structures that enable document retrieval in an IR setting, including more advanced topics such as large-scale indexing with MapReduce. Section 2.3 described the document retrieval process, whereby an index is used to retrieve documents for a user query, in addition to techniques that we use in later experiments to increase search effectiveness. In Section 2.4, we discussed evaluation in an IR setting, including measures for determining the effectiveness of an IR system and the IR evaluation methodology that we use in later chapters. Section 2.5 introduced the concept of machine learning for IR and how both the classification and ranking tasks that we tackle later can be achieved using machine learning. In Section 2.6, we provided an overview of prior work in the field of topic detection and tracking, which is related to the news search tasks addressed in this thesis. Section 2.7 described approaches to tackle aggregate ranking tasks within IR, which we build upon in a subsequent experimental chapter. Finally, in Section 2.8 we covered prior works in the field of resource selection, which we use to combine content from many news and user-generated sources in our final experimental chapter. In the next chapter, we discuss how user-generated content streams have been used for news search tasks and identify the knowledge gap that this thesis addresses.
Chapter 3
Search and User Generated Content
3.1 Introduction
In the previous chapter, we provided a background review of works in the field of Information Retrieval (IR) that are relevant to this thesis. However, IR research efforts that have traditionally focused upon Web search are increasingly shifting towards search within user-generated content sources, as the availability and volume of such data increase. In this chapter, we provide an overview of recent works that use user-generated content sources for search and other applications that are relevant to this thesis.
In Chapter 1, we defined user-generated content as documents published on the Web by the general public, rather than by paid individuals, corporations or companies. We group user-generated content of similar types or from the same providers into sources/streams. For instance, the blogosphere refers to the collection of all blogs posted by individuals. Using our terminology, the blogosphere is a user-generated content source. In practice, this can be seen as a time-ordered stream of blog posts. Within the context of this thesis, where we focus on a real-time search setting, the terms source and stream are interchangeable.
There is a wide variety of user-generated content sources. We divide these sources into two distinct types, namely explicit and implicit. Explicit user-generated content refers to documents that the user posts upon the Web with the intent for them to be viewed by other users. For instance, a Twitter tweet is an example of explicit user-generated content. Implicit user-generated content, on the other hand, refers to data that is generated by users as a by-product of their online activities, possibly without their knowledge. A high-profile example of implicit user-generated content is the logs of all user queries that Web search engines store (Brenes & Gayo-avello, 2009; Craswell et al., 2009).
In this thesis, we examine whether different user-generated content sources can aid in satisfying news-related queries submitted to universal Web search engines in real-time. In pursuit of this goal, dur-