News Query Classification Datasets

5.2 Datasets Overview

5.2.2 News Query Classification Datasets

The second component of our news search framework that we investigate in this thesis is the News Query Classification (NQC) component. Notably, no standard news query classification datasets exist, hence we develop two new datasets for NQC evaluation. Each dataset contains end-user queries for the component to classify, some of which are news-related. Each dataset also contains multiple aligned news and user-generated corpora from the same timeframe as its queries, representing real-time streams of content. These streams are used by the component as evidence for classification. Finally, for each query to be classified, each dataset includes an assessment identifying whether that query is news-related. These assessments serve as the ground truth against which classification accuracy is computed.
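To make the role of the ground truth concrete, the following is a minimal sketch of how classification accuracy could be computed from the assessments. The function name, the example queries and their labels are hypothetical and are not taken from our datasets; we also assume that a query without a prediction is treated as not news-related.

```python
from typing import Dict


def classification_accuracy(predicted: Dict[str, bool],
                            ground_truth: Dict[str, bool]) -> float:
    """Fraction of assessed queries whose predicted label matches the assessment."""
    if not ground_truth:
        raise ValueError("ground truth must contain at least one assessed query")
    correct = sum(
        1
        for query, is_news in ground_truth.items()
        # Assumption: a query with no prediction is treated as not news-related.
        if predicted.get(query, False) == is_news
    )
    return correct / len(ground_truth)


# Hypothetical queries and labels, for illustration only.
ground_truth = {
    "election results": True,
    "cheap flights": False,
    "earthquake": True,
    "song lyrics": False,
}
predicted = {
    "election results": True,
    "cheap flights": False,
    "earthquake": False,  # a missed news-related query
    "song lyrics": False,
}
accuracy = classification_accuracy(predicted, ground_truth)
```

In this toy example, three of the four queries are labelled correctly.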

The aim of our news query classification evaluation in Chapter 7 is to determine whether the use of user-generated content streams can lead to increased classification accuracy in comparison to using newswire streams. Accordingly, each dataset includes one or more newswire article corpora to use as baseline evidence sources when classifying. Furthermore, where possible, each dataset should also contain a query log from the same timeframe, such that the state-of-the-art news query classification features proposed by Diaz (2009) can be used.


Figure 5.2: An example of news story importance scoring on the nytimes.com website - 10/01/2012.

Dataset       TREC       Crowdsourced   # Queries   Time-Range            Corpora Used
              Dataset?   Assessments?

NQC_May2006   ✖          ✔              1,206       01/05/06 → 31/05/06   BBC_May2006
                                                                          Guardian_May2006
                                                                          Telegraph_May2006
                                                                          Blogs_May2006
                                                                          WikiNews_May2006
                                                                          WikiUpdates_May2006
                                                                          MSN_May2006

NQC_Apr2012   ✖          ✖              2,935       11/04/12 → 23/04/12   BlekkoNewsSnippets_Apr2012
                                                                          BlekkoNewsArticles_Apr2012
                                                                          BlekkoBlogSnippets_Apr2012
                                                                          BlekkoBlogsFull_Apr2012
                                                                          DiggSnippets_Apr2012
                                                                          DiggsFull_Apr2012
                                                                          Tweets_Apr2012
                                                                          WikiUpdates_Apr2012

Table 5.4: News Query Classification datasets used in this thesis. Datasets that were produced by TREC or that contain crowdsourced relevance assessments are denoted with a ✔ in the associated column.


Table 5.4 reports the datasets that we use later in Chapter 7 to evaluate the NQC component of our news search framework. From Table 5.4, observe that we use two different datasets, one from May 2006 and one from April 2012. We denote these datasets as NQC_May2006 and NQC_Apr2012, respectively. The NQC_May2006 dataset contains newswire article, blog post, Wikipedia and query log corpora. However, it does not contain Digg or Twitter corpora, since at that time Twitter had not yet been launched and Digg was still in its infancy. Meanwhile, the later NQC_Apr2012 dataset contains newswire article, blog post, Digg, Twitter and Wikipedia corpora. However, it does not contain an aligned query log, since no Web search engine query logs have been made available to academia for that period.

The NQC_May2006 dataset contains 1,206 queries to be classified. These queries were sampled using a Poisson sampling strategy (Ozmutlu et al., 2004) from the MSN_May2006 query log corpus. For each of these queries, we create a ground truth label identifying it as news-related or not. To create these classification labels, we employ crowdsourced workers to label each query. This is the second of the tasks for which we use crowdsourcing. We describe our methodology for creating these labels and the validation strategies that we employ, and evaluate the quality of the resulting assessments, later in Section 5.5.
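As an illustration of the general idea of Poisson sampling from a query log, the sketch below selects log positions as the arrival points of a Poisson process, i.e. with exponentially distributed gaps between sampled records. This is one possible reading only, and we do not claim that it reproduces the exact procedure of Ozmutlu et al. (2004). The function name, its parameters and the toy log are hypothetical. Note that the size of the resulting sample is itself random and only matches the requested size in expectation.

```python
import random
from typing import List, Sequence


def poisson_process_sample(log: Sequence[str],
                           expected_size: int,
                           seed: int = 0) -> List[str]:
    """Sample records at the arrival points of a Poisson process over the log."""
    if expected_size <= 0 or len(log) == 0:
        return []
    rng = random.Random(seed)
    rate = expected_size / len(log)  # expected number of samples per log record
    sample: List[str] = []
    last_index = -1
    position = rng.expovariate(rate)  # gap before the first sampled record
    while position < len(log):
        index = int(position)
        if index != last_index:  # never pick the same record twice
            sample.append(log[index])
            last_index = index
        position += rng.expovariate(rate)  # exponential gap to the next arrival
    return sample


# Hypothetical toy log; a real query log would be read from disk.
query_log = [f"query-{i}" for i in range(10_000)]
sample = poisson_process_sample(query_log, expected_size=500, seed=1)
```

Because the gaps are drawn independently, the sampled queries preserve the order of the log and are spread across the whole period rather than clustered at its start.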

Dataset                          TREC       Crowdsourced   # Queries   # of          Time Range            Corpus
                                 Dataset?   Assessments?               Assessments

BlogTrack_TopNews-Phase2_2010    ✔          ✔*             68          7,975         14/01/08 → 10/02/09   Blogs08

MicroblogTrack_2011              ✔          ✖              50          60,129        23/01/11 → 08/02/11   Tweets2011

Table 5.5: Ranking News-Related Content datasets used in this thesis. Datasets that were produced by TREC or that contain crowdsourced relevance assessments are denoted with a ✔ in the associated column. Where crowdsourced assessments is marked with a ✔*, the dataset assessments were crowdsourced by ourselves on behalf of TREC.

The NQC_Apr2012 dataset contains 2,935 queries from the 11th to the 23rd of April 2012. Unlike the queries used in the NQC_May2006 dataset, these queries are not a sample from a Web search engine query log, since no such log is available to us from that period (see Figure 5.1). Instead, we build a simulated query log sample based upon the expected proportion of news to non-news queries from prior query log studies (see Section 3.5.2) and our own analysis (see Section 4.3). In particular, we use the Google Trends analytics tool to collect 199 news-related queries. These 199 queries were initially chosen since they were unusually popular during the period. For verification, we manually matched each of these queries to an event, confirming that they were in fact news-related. These 199 queries are event-related, i.e. they encompass the breaking, recent and long-running categories of news-related query from our taxonomy (see Section 4.3). For the fourth category, generic news queries, the author manually selected 26 such queries (with reference to the generic news-related queries identified in Section 4.3). The 199 event-related and 26 generic news queries form the news-related portion of our simulated query log sample.

To create the (larger) news-unrelated portion of the query log sample (between 89% and 93% of queries, from Section 4.3), we randomly sample 2,710 queries from the MSN_May2006 query log. Since these queries come from an older query log, they are very unlikely to be related to events from the time of the NQC_Apr2012 dataset. Combining the 199 event-related queries, 26 generic news queries and 2,710 news-unrelated queries forms our simulated query log sample. Note that we do not need to create additional news/non-news assessments for these queries, as we have done so implicitly during the selection process, e.g. we know that the 199 Google Trends queries are news-related.
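The composition of the simulated query log sample can be checked with a short sketch. The counts (199, 26 and 2,710) and the 89% to 93% range are those reported above; the function name, the placeholder query strings and the size of the older log are hypothetical stand-ins for the real data.

```python
import random
from typing import List, Sequence, Tuple


def build_simulated_sample(event_queries: Sequence[str],
                           generic_queries: Sequence[str],
                           old_log_queries: Sequence[str],
                           n_non_news: int,
                           seed: int = 0) -> List[Tuple[str, bool]]:
    """Return (query, is_news) pairs; the labels follow from how each query was selected."""
    rng = random.Random(seed)
    # Random sample without replacement from the older query log.
    non_news = rng.sample(list(old_log_queries), n_non_news)
    return ([(q, True) for q in event_queries]
            + [(q, True) for q in generic_queries]
            + [(q, False) for q in non_news])


# Placeholder strings standing in for the real queries (hypothetical).
event_queries = [f"event-{i}" for i in range(199)]
generic_queries = [f"generic-{i}" for i in range(26)]
old_log_queries = [f"msn-{i}" for i in range(10_000)]

simulated = build_simulated_sample(event_queries, generic_queries,
                                   old_log_queries, n_non_news=2710)
news_fraction = sum(is_news for _, is_news in simulated) / len(simulated)
non_news_fraction = 1.0 - news_fraction
```

With these counts, the sample holds 2,935 queries, of which 225 (roughly 7.7%) are news-related and roughly 92.3% are news-unrelated, which falls within the 89% to 93% range.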
