Property                          Value
# Queries                         2268
Crowdsourced Classification
  # News-Related                  176
  # Non-News-Related              2092
Taxonomy
  # News-Related Breaking         62
  # News-Related Recent           69
  # News-Related Long-Running     24
  # News-Related Generic          23
Table 4.1: Classifications for the 2268 queries sampled from the MSN query log.
• Long-Running: Queries relating to older events that are still of interest, e.g. ‘enron trial’1.
• Generic: General news-page finding queries (e.g. ‘news’ or ‘fox news’). Here the user does not specify any particular news event and is instead searching to find out what is happening at that time.
The ‘Taxonomy’ rows of Table 4.1 report the distribution of queries over these four news-related query classes. From Table 4.1, we see that queries relating to breaking and recent news events are the most common type of news-related query. On the other hand, generic news queries and queries relating to older events are less frequent. The news search framework that we propose supports all four types of news-related query.
4.4 Framework Overview
Figure 4.2 illustrates our proposed news search framework, which describes the functionality of a universal Web search engine and supports a news vertical. Recall from Section 4.1 that our framework comprises four components, each of which tackles one of the news search challenges that we identified in Section 1.2. Each component is represented in Figure 4.2 by a rounded-corner box with a solid border. The rounded-corner box with a dashed border represents the Web vertical that generates the Web search ranking. Solid block arrows denote how a query is processed by each of the components in turn, resulting in the final document ranking. Unfilled block arrows denote the passing of information about currently important events, which is not dependent upon the user query. From Figure 4.2, we observe that when the user submits a query, that query is processed sequentially by three of the four components, namely: News Query Classification, Ranking News-Related Content and News-Related Content Integration. The fourth component, Top Events Identification, supports the News Query Classification and News-Related Content Integration components by providing up-to-date rankings of
1The trial of staff from the U.S. energy, commodities, and services company, Enron, took place during the period of our
Figure 4.2: Overview of the proposed News Search Framework.
important events. We state the purpose of each of the four framework components and formalise their inputs and outputs below:
Top Events Identification: As motivated in Section 1.2, one of the challenges that a news vertical faces in a real-time Web search setting is how to identify, from the set of all recent events, those that are actually important. In particular, assume that within our universal Web search engine, the news vertical is continuously crawling newswire sites and feeds to maintain a set of newswire articles A representing recent events. The aim of the Top Events Identification component is to rank these newswire articles A by their current importance at a point in time t. To facilitate this ranking, we assume that the component has access to a real-time stream of documents, denoted S, from which it can estimate the importance of each newswire article. Hence, the Top Events Identification component can be considered to use a function I that produces a score for each article a ∈ A:
I(a, t, S) (4.1)
where a is a newswire article representing an event to be scored, t is the current point in time and S is a stream of documents to use as evidence. In this thesis, we use newswire articles from the New York Times
and Reuters news providers for a. We use streams of blog posts and tweets for S. The output from the Top Events Identification component is a list of the newswire articles a ranked by their importance I.
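To make the shape of the scoring function I(a, t, S) in Equation 4.1 concrete, the following is a minimal Python sketch. It is an illustration under stated assumptions and not the importance estimator developed in this thesis: the `Doc` type, the `window` parameter and the term-overlap counting heuristic are all hypothetical, chosen only to show an article a, a time t and a stream S being mapped to a score and then to a ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record: raw text plus a publication timestamp."""
    text: str
    timestamp: float  # seconds since an arbitrary epoch


def importance(article: Doc, t: float, stream: list[Doc],
               window: float = 3600.0) -> float:
    """Toy stand-in for I(a, t, S).

    Assumption: importance is the number of stream documents published in
    [t - window, t] that share at least one term with the article.
    """
    article_terms = set(article.text.lower().split())
    score = 0
    for doc in stream:
        in_window = t - window <= doc.timestamp <= t
        if in_window and article_terms & set(doc.text.lower().split()):
            score += 1
    return float(score)


def rank_articles(articles: list[Doc], t: float, stream: list[Doc],
                  window: float = 3600.0) -> list[Doc]:
    """Rank the article set A by descending toy importance at time t."""
    return sorted(articles,
                  key=lambda a: importance(a, t, stream, window),
                  reverse=True)
```

Because the heuristic matches on any shared term, including stopwords, it is far too crude for real use; it serves only to show the inputs and the ranked output of the component.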
News Query Classification: When a user submits a query Q at time t, the news vertical needs to determine whether that query is news-related or not. If the user query is news-related, then news-related content will be selected and merged into the Web search ranking; otherwise, the Web search ranking will be returned unaltered. To make this classification, the classifier has access to one or more real-time streams of documents, e.g. current newswire articles or blogs. As before, we denote these streams S. News Query Classification can be considered as a function C that outputs a binary decision for an input query Q at the time t that the query arrives, using evidence from one or more available content streams S:
C(S_{1..n}, Q, t) (4.2)
where S is a stream of documents, n is the number of available streams, Q is the user query and t is the time at which the query was made. We use newswire articles, blog posts, tweets, Wikipedia pages and Digg streams for S in this thesis. The output from the News Query Classification component is a binary decision for the query Q, i.e. whether it is news-related or not.
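The interface of the classification function in Equation 4.2 can be sketched in the same spirit. The snippet below is a hypothetical illustration and not the classifier proposed in this thesis: the representation of a stream as a list of (timestamp, text) pairs, the `window` and `min_matches` parameters, and the rule that a query is news-related when enough recent documents in any one stream contain all of its terms are assumptions made purely for the example.

```python
# A stream is assumed to be a list of (timestamp, text) pairs.
Stream = list[tuple[float, str]]


def classify_query(streams: list[Stream], query: str, t: float,
                   window: float = 3600.0, min_matches: int = 2) -> bool:
    """Toy stand-in for C(S_1..n, Q, t): returns a binary decision.

    Assumption: Q is news-related if, in at least one stream, at least
    `min_matches` documents published in [t - window, t] contain every
    query term.
    """
    query_terms = set(query.lower().split())
    if not query_terms:
        return False  # an empty query carries no evidence
    for stream in streams:
        matches = sum(
            1
            for timestamp, text in stream
            if t - window <= timestamp <= t
            and query_terms <= set(text.lower().split())
        )
        if matches >= min_matches:
            return True
    return False
```

The sketch takes the n streams as a single list argument, which mirrors the S_{1..n} notation: adding a further evidence stream changes the input but not the function signature.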
Ranking News-Related Content: Once a query Q has been identified as news-related at time t, news-related content relevant to that query needs to be selected, such that it can be integrated into the Web search ranking. Within our news search framework, we consider news-related content selection to be a ranking task, where news-related content from each available newswire and user-generated source is to be ranked for the user query. The Ranking News-Related Content component can be seen as a series of functions R, one for each source S, each of which ranks the documents within S published before time t for the query Q:
R(S, Q, t) (4.3)
where S is the source from which to rank documents, Q is the user query and t is the time at which the query was made. The sources S that we use are the same as the five streams used for news query classification above. The outputs of the Ranking News-Related Content component are n rankings R for the query Q, one for each source S.
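As a further illustration, the per-source ranking of Equation 4.3 can be sketched as follows. This is a minimal example under assumptions, not the ranking approach evaluated in this thesis: sources are represented as lists of (timestamp, text) pairs, relevance is approximated by the number of query terms a document contains, and ties are broken in favour of more recent documents. The function names are hypothetical.

```python
# A source is assumed to be a list of (timestamp, text) pairs.
Source = list[tuple[float, str]]


def rank_source(source: Source, query: str, t: float) -> list[str]:
    """Toy stand-in for R(S, Q, t) over a single source S.

    Only documents published strictly before t are considered. Documents
    are scored by query-term overlap; ties go to the more recent document.
    """
    query_terms = set(query.lower().split())
    scored = []
    for timestamp, text in source:
        if timestamp >= t:
            continue  # not yet published when the query was made
        overlap = len(query_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, timestamp, text))
    # Sort by descending overlap, then by descending timestamp.
    scored.sort(key=lambda item: (-item[0], -item[1]))
    return [text for _, _, text in scored]


def rank_all_sources(sources: dict[str, Source], query: str,
                     t: float) -> dict[str, list[str]]:
    """Produce one ranking per source, i.e. n rankings for n sources."""
    return {name: rank_source(docs, query, t)
            for name, docs in sources.items()}
```

Returning a separate ranking per source, rather than a single merged list, leaves the question of how to combine them to the integration stage.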
News-Related Content Integration: After news-related content has been ranked for the user query Q at time t, the last stage of the news search process is to integrate these results into the Web search ranking.