Experimental Methodology - News vertical search using user-generated content

6.4 Experimental Methodology

We evaluate our news story ranking approaches in two distinct settings, each leveraging a different user-generated content stream. Firstly, we evaluate our news story ranking approaches within the context of the TREC Blog track 2009 and 2010 top news stories identification tasks, i.e. using a blog stream. The datasets associated with these tasks are BlogTrack_{2009}TopNews and BlogTrack_{2010}TopNews-Phase1, described in Section 5.2.1. Secondly, we evaluate our proposed ranking approaches on a real-time tweet stream provided by Twitter from two points in time and spanning three datasets. The datasets associated with this stream are the Twitter_{Dec2011}TopNews-NYT, Twitter_{Jan2012}TopNews-NYT and Twitter_{Jan2012}TopNews-Reuters datasets. Notably, due to the disparate time-frames for which user-generated content streams are available (as discussed in Section 5.1), the blog stream covers the year of 2008, while the Twitter stream covers the period from late 2011 to early 2012. For brevity, in the remainder of this chapter we refer to the BlogTrack_{2009}TopNews and BlogTrack_{2010}TopNews-Phase1 datasets as Blogs09 and Blogs10, respectively. Similarly, we denote the three Twitter datasets as Twitter11-NYT, Twitter12-NYT and Twitter12-TR, respectively. In Section 6.4.1, we describe our evaluation methodology on the Blog track datasets, while Section 6.4.2 details our methodology for evaluation on the Twitter datasets.

6.4.1 Blog Stream Methodology

The Blogs09 dataset spans the period of January 2008 to February 2009. It comprises the TREC Blogs08 blog post corpus (Macdonald & Ounis, 2006b) and the NYT08 news story corpus, which span the same period. With the dataset, news story importance assessments for 50 ‘topic days’ are provided. Each topic day refers to exactly one day from the year-long corpus. For these topic days, news stories from the NYT08 corpus were manually judged as important or not for that day by human assessors. For evaluation, we have each of our news story ranking approaches assess the importance of each news story published on the 50 topic days. As this ranking task is defined in terms of days, the time of ranking t is considered to be 23:59 on each topic day. For our unsupervised approaches and story ranking features that define a small window r describing the period to rank for, r equals 1 day. The story rankings produced by each of our approaches are evaluated against the stories that the assessors marked as important. Notably, as 2009 was the first year that top news stories identification was run at TREC, the task was formulated as a retrospective ranking task, in which systems were allowed to use evidence from after the time of ranking. In effect, this means that the original TREC runs on the Blogs09 dataset were not truly real-time in nature. However, we are interested in using user-generated content sources for their real-time nature. Hence, in this work we maintain our real-time constraints for runs on this dataset, i.e. unlike the TREC systems, we do not make use of any future evidence.

The Blogs10 dataset similarly spans the period of January 2008 to February 2009. The same Blogs08 blog post corpus (Macdonald & Ounis, 2006b) is used as the user-generated content stream. However, a different news story corpus from the Reuters news agency, denoted TRC2 (Leidner, 2010), was used. Again, ‘topic days’ are provided, with associated newsworthiness assessments for the stories published on those days, against which we evaluate the rankings produced by our approaches. The ranking task is once again defined in terms of days, hence the time of ranking t is considered to be 23:59 on each topic day. However, the TREC 2010 task followed a strict real-time setting, so approaches on the Blogs10 dataset can only use posts from before the time of ranking t.
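The real-time constraint described above can be illustrated with a minimal sketch. The function names, the `(post_id, timestamp)` tuple layout and the inclusive boundary handling are assumptions made for illustration only; they are not taken from the experimental code.

```python
from datetime import date, datetime, time, timedelta


def time_of_ranking(topic_day: date) -> datetime:
    """Time of ranking t: 23:59 on the topic day."""
    return datetime.combine(topic_day, time(23, 59))


def posts_in_window(posts, topic_day, window_days):
    """Keep posts published within `window_days` before t.

    `posts` is a list of (post_id, timestamp) tuples (hypothetical layout).
    Posts published after t are discarded, so no future evidence is used.
    """
    t = time_of_ranking(topic_day)
    start = t - timedelta(days=window_days)
    return [p for p in posts if start <= p[1] <= t]


# Toy data, not drawn from Blogs08.
posts = [
    ("post-a", datetime(2008, 5, 1, 9, 0)),
    ("post-b", datetime(2008, 5, 2, 23, 30)),
    ("post-c", datetime(2008, 5, 3, 0, 5)),  # published after t
]
one_day = posts_in_window(posts, date(2008, 5, 2), window_days=1)
ten_days = posts_in_window(posts, date(2008, 5, 2), window_days=10)
```

With a one-day window only `post-b` survives; widening the window admits `post-a` as well, while `post-c` is always excluded because it was published after t.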

Due to slight differences in setting between the TREC 2009/2010 task formulations, we make the following changes to create a consistent setting and make cross-corpus training possible. Firstly, the TREC 2009 task (NYT08 topics) considered that stories both after and before each topic day might still be relevant due to differences in time-zone, which the 2010 task (TRC2 topics) did not. We follow the TREC 2010 setting and only rank the stories published on each topic day. Secondly, the 2010 task introduced category classification of articles, i.e. each article was judged as to the degree to which it is important on the topic day with regard to one of five news categories. Importantly, these categories can introduce a confounding variable into the evaluation, as even a perfect article ranking system will be heavily penalised should it use a poor classifier. Hence, in this chapter, we report news story ranking performance only and leave category classification to future work.

Of note is that the TRC2 news corpus provides both an article headline for each story and the full article content, whilst the NYT08 corpus provides only the article headline. To make these corpora comparable, we independently crawled the missing article content for the NYT08 corpus, cleaning the resulting text with the BoilerPipe article extractor (see Section 2.2.6).

We use the Terrier information retrieval platform (Ounis, Amati, Plachouras, He, Macdonald & Lioma, 2006) to index the Blogs08 corpus of blog posts, removing standard stopwords and applying Porter’s English stemmer. To generate the ranking of blog posts with respect to a news article, R(a, S_{t−w→t}), we use a w setting (background window size) of 10 days, i.e. 10 days’ worth of blog posts prior to t. For our unsupervised approaches, we use the DPH ranking model from the Divergence from Randomness framework (Amati et al., 2007) (see Section 2.3.3) to rank blog posts for a news story a. In contrast, for our learned approach, to show its generality, we use two effective weighting models, namely: the probabilistic BM25 (Robertson et al., 1992) (Section 2.3.2); and the aforementioned DPH model. To make our results comparable with other systems participating in the TREC Blog track 2009 top news stories identification task, we use the default parameters for the weighting models employed, as no training data was available for Blogs08 at the time the task was run. In particular, we use the default parameters for BM25 of k1 = 1.2, k3 = 1000, b = 0.75 (Robertson et al., 1994). DPH, on the other hand, is a ‘parameter-free’ model, where all parameter values are derived from the collection statistics.
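To make the role of the BM25 defaults concrete, the following is a minimal sketch of the textbook Okapi BM25 term weight with k1 = 1.2, k3 = 1000 and b = 0.75. It is an illustration only: the function and argument names are hypothetical, and Terrier's own implementation may differ in details such as the exact form of the inverse document frequency component.

```python
import math


def bm25_term(tf, qtf, df, n_docs, doc_len, avg_doc_len,
              k1=1.2, k3=1000.0, b=0.75):
    """Textbook BM25 weight of one query term in one document.

    tf: term frequency in the document; qtf: term frequency in the query;
    df: number of documents containing the term; n_docs: collection size.
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Document length normalisation, controlled by b.
    big_k = k1 * ((1 - b) + b * doc_len / avg_doc_len)
    tf_part = ((k1 + 1) * tf) / (big_k + tf)      # saturates as tf grows
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)      # near-linear for k3 = 1000
    return idf * tf_part * qtf_part


def bm25_score(query_tf, doc_tf, df, n_docs, doc_len, avg_doc_len):
    """Sum the term weights over query terms that occur in the document."""
    return sum(
        bm25_term(doc_tf[term], qtf, df[term], n_docs, doc_len, avg_doc_len)
        for term, qtf in query_tf.items()
        if term in doc_tf
    )


# Toy statistics, not drawn from Blogs08.
single = bm25_term(tf=1, qtf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
repeated = bm25_term(tf=3, qtf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=100)
```

For a document of average length, a single occurrence of a term yields exactly the inverse document frequency component, and further occurrences increase the weight with diminishing returns.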

Notably, our story ranking approaches have a variety of parameters that may affect story ranking effectiveness. Unless otherwise stated, for the following experiments, the default size of |R(a, S_{t−w→t})| (ranking depth or |R()|) is 1000, based upon recommendations by Macdonald (2009). The default blog representation is a blog post, and the default story representation (a) is the headline of the article for each story. A detailed analysis of the effect that these parameters have on news story ranking can be found in Appendix A¹. For a state-of-the-art baseline, we compare our story ranking approaches to the best systems participating in TREC 2009 and 2010. For comparison to various approaches, statistical significance is measured using the t-test at p<0.01 and p<0.05. A single ▲ or ▼ indicates a statistically significant increase or decrease at p<0.05, while two such symbols indicate significance at p<0.01.

The LTRS framework described previously uses four components to generate story ranking features, namely: a story representation, a document ranking approach, a document sub-feature and an aggregation model. We sampled a subset of the possible components that we considered a priori might be effective for news story ranking and combined them to create 160 news story ranking features. Table 6.2 lists the instances of each of the four components that are used for feature extraction on the blog stream in our later experiments. Importantly, not all possible components are used here, e.g. we do not make features leveraging the GaussBoost extension to Votes. Our reasoning is two-fold. Firstly, not all components are applicable for the day-centric setting of the TNSI task, e.g. MaxBurst. Secondly, the number of possible features is multiplicative in the number of components, hence the feature space grows very quickly as components are added, increasing training time. To train a learned model using these features, we experiment with two different training regimes, namely Cross-Corpus and Per-Corpus. In particular, under Cross-Corpus training, we train using the topics from one corpus (either NYT08 or TRC2) and then test upon the topics from the other corpus, and vice-versa. Under Per-Corpus training, we train and test on the same topic set using 5-fold cross validation. We train our models using the automatic feature selection (AFS) algorithm (see Section 2.5.2).
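The two training regimes can be sketched as follows. This is an illustrative outline only: the function names, the toy topic identifiers and the interleaved assignment of topics to folds are assumptions, since the text does not specify how topics are allocated to folds.

```python
def cross_corpus_splits(topics_by_corpus):
    """Train on one corpus's topics and test on the other's, in both directions."""
    splits = []
    for train_name, train_topics in topics_by_corpus.items():
        for test_name, test_topics in topics_by_corpus.items():
            if train_name != test_name:
                splits.append((train_topics, test_topics))
    return splits


def per_corpus_folds(topics, k=5):
    """k-fold cross validation over a single topic set.

    Topics are assigned to folds in an interleaved fashion (an assumption);
    each topic appears in exactly one test fold.
    """
    folds = [topics[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        splits.append((train, test))
    return splits


# Hypothetical topic identifiers, used for illustration only.
nyt_topics = [f"NYT08-{i}" for i in range(10)]
trc_topics = [f"TRC2-{i}" for i in range(10)]

cross = cross_corpus_splits({"NYT08": nyt_topics, "TRC2": trc_topics})
per_corpus = per_corpus_folds(nyt_topics, k=5)
```

Under Cross-Corpus training no topic set is ever used for both training and testing in the same run, whereas under Per-Corpus training every topic is tested once by a model trained on the remaining folds.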

¹ Author Note: We do not include the analysis in this chapter as they do not impact the conclusions that we draw, but are

| Component | Name | Description |
|---|---|---|
| Story Representation | Headline | The story headline. |
| | QE Blogs08 | The headline expanded using the Blogs06 blog post corpus (Macdonald & Ounis, 2006b). |
| | QE NYT06 | The headline expanded using 2000 news articles from the New York Times during May 2006. |
| | QE TRC2 | The headline expanded using 13 days of news stories from the TRC2 corpus but before the start of Blogs08 (Leidner, 2010). |
| | Content | The article content. |
| | Entities | Named entities from the article content identified by a Wikipedia-based dictionary (Santos, Macdonald & Ounis, 2010b). |
| | Noun-Phrases | Noun phrases extracted from the article content (Schmid, 1995). |
| | Summary | Story summary generated using part-of-speech tagged article content (Lioma et al., 2006). |
| Document Ranking Approach | BM25_{1000} | The BM25 (Robertson et al., 1992) document ranking model retrieving 1000 blog posts. r is varied using values from 1 day to 10 days. |
| | DPH_{1000} | The DPH (Amati et al., 2007) document ranking model retrieving 1000 blog posts. r is varied using values from 1 day to 10 days. |
| Document Sub-Features | Relevance | The retrieval score for the blog post. |
| Aggregation Model | BM25SUM | Aggregated relevance-based story ranking model (Xu et al., 2010). |
| | Votes | Voting-based story ranking model (McCreadie et al., 2010c). |
| | RWAC (6.2.2) | Voting-based story ranking model (McCreadie et al., 2010c). |

Table 6.2: Blog stream LTRS feature instances used.

6.4.2 Tweet Stream Methodology

We also evaluate our story ranking approaches on three different tweet datasets, namely: Twitter11-NYT; Twitter12-NYT; and Twitter12-TR (see Section 5.2.1). In particular, these datasets contain tweet corpora that cover the time periods from the 17th of December 2011 to the 31st of December 2011 and from the 5th of January 2012 to the 12th of January 2012.

To evaluate the effectiveness of our proposed approaches at identifying the top stories for a point in time, we compare against the rankings observed on the website homepages of major e-newspapers. Note that this differs from our blog stream evaluation, where human assessors were used to judge the importance of each news story. Instead, we use the story ranking from a news website as the ground truth, on the assumption that the editor of the e-newspaper knows what the important stories of the moment are. In particular, in parallel with the creation of the tweet corpora described above, we also downloaded the homepages of two major news providers, namely the New York Times and Reuters, each hour. Homepages were only downloaded during the latter half of each tweet corpus, i.e. we only evaluate on the latter half. This is to allow for a reasonable background of tweets to be collected, avoiding issues regarding unstable background term statistics when few tweets have yet been seen. Duplicate story rankings, i.e. when the homepage has not changed since the last download, are removed. From each homepage download, we extract the set of current news headlines for our system to rank and also the ground truth ranking of stories against which we will compare. We evaluate story ranking effectiveness in terms of Normalised Discounted Cumulative Gain (NDCG) (Järvelin & Kekäläinen, 2002). In this case, the importance score assigned to each newswire headline is used as the relevance label. In this way, NDCG measures the ability of our approaches to promote higher ranked stories. We evaluate the story ranking as a whole (NDCG) and whether our approach correctly places the top story at rank 1 (Success@1). Notably, unlike our blog stream evaluation, the ranking task is defined in
