News Query Classification Features - News vertical search using user-generated content

resulting in an even split between news and non-news classes. On NQC_Apr2012, in a similar manner, we randomly remove 2714 of the 2935 non-news queries, leaving 221 non-news and 225 news queries. Performance is reported using the linear logistic regression trees (LogitBoost) classifier (Landwehr et al., 2003) implemented within the Weka toolkit (Witten & Frank, 2005) to classify news queries. All of our features are normalised into the range [0,1] (based upon the maximum observed range of feature values) to enable comparison of those features in our later experiments.
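The balancing and normalisation steps can be illustrated with a minimal sketch. It assumes that "normalised into the range [0,1]" means min-max scaling over the observed values of each feature, which is one plausible reading rather than a confirmed detail. The function names `downsample` and `min_max_normalise` are hypothetical.

```python
import random


def downsample(majority, target_size, seed=0):
    """Randomly keep `target_size` items from the majority class."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    return rng.sample(majority, target_size)


def min_max_normalise(values):
    """Scale one feature's values into [0, 1] using its observed range.

    Assumption: min-max scaling; a constant feature maps to 0.0.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Placeholder ids standing in for the 2935 non-news queries.
non_news = list(range(2935))
kept = downsample(non_news, 2935 - 2714)
print(len(kept))                             # 221
print(min_max_normalise([2.0, 4.0, 10.0]))   # [0.0, 0.25, 1.0]
```

Scaling each feature independently keeps features with very different raw magnitudes (for example, counts and retrieval scores) on a comparable footing.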

We also evaluate how the evidence provided by each of our seven corpora changes over time. To do so, we create two additional simulated settings. In particular, in one simulated setting, each query in the evaluation dataset was ‘re-timed’ to exactly 6 hours after the time recorded in the query log. In a second simulated setting, each query in the evaluation dataset was re-timed to exactly 24 hours after the time recorded in the query log. Importantly, for each of these two additional settings, all stream features are re-extracted for the re-timed queries, as new content will be available in each of the news and user-generated corpora at these later points in time.
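As an illustration of the re-timing described above, the following sketch shifts a logged query time by a fixed offset. The timestamp is invented for the example and `retime` is a hypothetical helper; in the actual experiments the stream features would then be re-extracted at each shifted time.

```python
from datetime import datetime, timedelta


def retime(query_time, offset_hours):
    """Return the query time shifted forward by `offset_hours` hours."""
    return query_time + timedelta(hours=offset_hours)


# Hypothetical logged query time (not taken from any dataset).
logged = datetime(2012, 4, 10, 22, 30)

# One entry per setting: the original time, +6 hours and +24 hours.
settings = {hours: retime(logged, hours) for hours in (0, 6, 24)}

for hours, when in settings.items():
    print(hours, when.isoformat())
```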

7.4 News Query Classification Features

Under our FANS news query classification approach, we use features about each query to determine to what extent that query is news-related. Recall that in Section 7.2 we divided our features into two types: query-only features and stream features. Query-only features are derived from the query Q. In contrast, stream features are derived from a stream of content S bounded by a time window of size w that ends at time t for the query Q. Notably, some stream features can be applied to any news or user-generated content stream. An example of such a feature is the document frequency, i.e. the number of documents matching one or more query terms. Naturally, this can be extracted from any stream of documents by counting those documents from the stream that fall within the time window and contain the query terms. Other features are stream-specific. Stream-specific features make use of some property of a stream that makes them applicable to that stream only. For example, the Twitter stream might have a feature ‘top tweet retweets’ that counts the number of times the most relevant tweet to the query has been retweeted. Clearly, the number of retweets is applicable only to the Twitter stream, and hence is stream-specific.
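A minimal sketch of the document frequency feature follows. It assumes a stream is a list of (timestamp, text) pairs, that the window covers the half-open interval (t - w, t], and that a document matches if it contains at least one query term after whitespace tokenisation. These representation choices and the name `document_frequency` are illustrative assumptions, not the thesis implementation.

```python
from datetime import datetime, timedelta


def document_frequency(stream, query_terms, t, w):
    """Count documents posted in (t - w, t] that contain any query term."""
    start = t - w
    terms = {term.lower() for term in query_terms}
    count = 0
    for posted_at, text in stream:
        in_window = start < posted_at <= t
        # Naive whitespace tokenisation (an assumption for this sketch).
        if in_window and terms & set(text.lower().split()):
            count += 1
    return count


# Hypothetical stream of (timestamp, text) documents.
stream = [
    (datetime(2012, 4, 10, 10, 0), "earthquake hits coast"),
    (datetime(2012, 4, 10, 11, 30), "Earthquake relief effort begins"),
    (datetime(2012, 4, 10, 11, 45), "football results"),
    (datetime(2012, 4, 10, 12, 30), "earthquake aftershock"),
]
t = datetime(2012, 4, 10, 12, 0)

print(document_frequency(stream, ["earthquake"], t, timedelta(hours=1)))  # 1
print(document_frequency(stream, ["earthquake"], t, timedelta(hours=3)))  # 2
```

The document posted at 12:30 is never counted because it arrives after the query time t, mirroring the constraint that only content available at query time may be used.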

We structure our features into six distinct feature sets, each encapsulating a different type of evidence. Query-only features form one feature set, while we separate stream features into five types, one of which represents our stream-specific features. Each feature set is described below:

• Query-only features encapsulate latent information from within the query. For example, a query-only feature might be the time the query was made, or whether it contains a celebrity name. These features are based upon the ‘query-only’ features proposed by König et al. (2009).

• Frequency features consider evidence from the usage of the query terms within each of the news and user-generated content streams prior to the query time. These features primarily measure how the usage of the query terms has changed over the short term. These features are an expansion of the ‘corpus’ features proposed by König et al. (2009).

• Retrieval features use the document weighting models described in Section 2.3 to find the most relevant documents from within each stream that were posted before the query time. The aggregate relevance of each set of documents retrieved for the query is used as a feature. These features are loosely related to those proposed by Diaz (2009), although we use document weighting models rather than counting matches and we employ them for sources other than query logs.

• Burstiness features use the burst detection algorithms described in Section 3.5.1 to identify query terms that are undergoing a burst in one or more of the document streams at the time of the query.

• Importance features measure the current importance of the query using the importance estimation approaches described previously in Chapter 6.

• Stream-specific features are similar to retrieval features, but rather than measuring the relevance of the top documents retrieved, we instead record stream-specific information about those documents.
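The difference between a retrieval feature and a stream-specific feature can be sketched as follows. The sketch assumes that relevance scores have already been produced by some document weighting model and that each tweet carries a retweet count; the `ScoredTweet` type, the function names and the cut-off `k` are hypothetical and chosen only for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass
class ScoredTweet:
    score: float    # relevance score from a weighting model (assumed given)
    retweets: int   # stream-specific metadata


def top_k(docs, k):
    """Return the k highest-scoring documents."""
    return heapq.nlargest(k, docs, key=lambda d: d.score)


def retrieval_feature(docs, k):
    """Retrieval feature: aggregate relevance of the top-k documents."""
    return sum(d.score for d in top_k(docs, k))


def retweet_feature(docs, k):
    """Stream-specific feature: total retweets of the same top-k documents."""
    return sum(d.retweets for d in top_k(docs, k))


# Hypothetical scored tweets retrieved for one query.
tweets = [
    ScoredTweet(3.5, 12),
    ScoredTweet(1.0, 400),
    ScoredTweet(2.5, 3),
    ScoredTweet(0.5, 0),
]

print(retrieval_feature(tweets, 2))  # 6.0
print(retweet_feature(tweets, 2))    # 15
```

Both features rank documents by relevance; they differ only in what is aggregated over the top-ranked set, which is why the heavily retweeted but low-scoring tweet does not contribute here.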

Recall that we use two different datasets for evaluation in this chapter. These datasets contain different corpora. Hence, the features we extract from each dataset also differ. In Table 7.2, we summarise all of the features, excepting the stream-specific features, extracted from the NQC_May2006 and NQC_Apr2012 datasets. Meanwhile, Table 7.3 summarises the remaining stream-specific features extracted from the two datasets. From Table 7.2, observe that we use 961 query-only and stream features from the NQC_May2006 dataset, while we use 755 query-only and stream features from the NQC_Apr2012 dataset. From Table 7.3, we see that there are no stream-specific features from the corpora contained within the NQC_May2006 dataset, but 10 stream-specific features are extracted from the NQC_Apr2012 dataset. The lack of stream-specific features from the NQC_May2006 dataset is due to the 7 streams within that dataset providing only the text of each document.
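The quoted totals can be cross-checked against the per-feature counts listed in Table 7.2. The short sketch below simply sums those counts per feature set; the dictionary names are arbitrary.

```python
# Feature counts per feature set, taken from the rows of Table 7.2.
may2006 = {
    "query-only": 9 * 1,            # nine single-valued query features
    "frequency": 4 * 14,            # TF, DF, TF-IDF, DF-IDF
    "retrieval": 3 * 140,           # BM25, DPH, LM Dirichlet
    "burstiness": 224 + 224 + 28,   # bursty terms, magnitude, multi-resolution
}
apr2012 = {
    "query-only": 8 * 1 + 173,      # eight single features plus entity flags
    "frequency": 4 * 14,
    "retrieval": 3 * 80,
    "importance": 8 + 136 + 120,    # Votes, RWA, RWA+GaussBoost
    "burstiness": 14,
}

print(sum(may2006.values()))  # 961
print(sum(apr2012.values()))  # 755
```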

A detailed description of all features used in this chapter can be found in Appendix C.¹ In the next section, we discuss the research questions that this chapter investigates. In Sections 7.6 and 7.7 we

¹ Author Note: Although these features and how they are extracted in a real-time streaming environment are a contribution,


NQC_May2006

| Source(s) | Feature Set | Feature | Description | # Features |
|---|---|---|---|---|
| Query | Query-only | # Tokens | Query length (tokens) | 1 |
| Query | Query-only | # Terms | Query length (terms) | 1 |
| Query | Query-only | # Entities | Number of named entities (Wikipedia) | 1 |
| Query | Query-only | # People | Number of person entities (Wikipedia) | 1 |
| Query | Query-only | # Places | Number of place entities (Wikipedia) | 1 |
| Query | Query-only | # Organisations | Number of named organisations (Wikipedia) | 1 |
| Query | Query-only | # Products | Number of product entities (Wikipedia) | 1 |
| Query | Query-only | Contains URL | Does the query contain a URL? | 1 |
| Query | Query-only | Contains ‘news’ | Does the query contain the term news? | 1 |
| 7 streams | Frequency | TFstream | Term frequency | 14 |
| 7 streams | Frequency | DFstream | Document frequency | 14 |
| 7 streams | Frequency | TF-IDFstream | Term frequency inverse document frequency | 14 |
| 7 streams | Frequency | DF-IDFstream | Document frequency inverse document frequency | 14 |
| 7 streams | Retrieval | BM25 | Sum of the scores for the top retrieved documents for the query using the BM25 document weighting model (Equation 2.4) | 140 |
| 7 streams | Retrieval | DPH | Sum of the scores for the top retrieved documents for the query using the DPH document weighting model (Equation 2.9) | 140 |
| 7 streams | Retrieval | LM Dirichlet | Sum of the scores for the top retrieved documents for the query using the LM Dirichlet document weighting model (Equation 2.13) | 140 |
| 7 streams | Burstiness | # BurstyTerms | Number of bursty terms contained | 224 |
| 7 streams | Burstiness | BurstMagnitude | Bursty term scores | 224 |
| 7 streams | Burstiness | MultiResolution | Composite bursty term scores | 28 |
| Total | | | | 961 |

NQC_Apr2012

| Source(s) | Feature Set | Feature | Description | # Features |
|---|---|---|---|---|
| Query | Query-only | # Tokens | Query length (tokens) | 1 |
| Query | Query-only | # Terms | Query length (terms) | 1 |
| Query | Query-only | isMorning | 6am to 1pm | 1 |
| Query | Query-only | isEvening | 5pm to 11pm | 1 |
| Query | Query-only | isWeekend | Saturday or Sunday | 1 |
| Query | Query-only | isNightTime | 11pm to 6am | 1 |
| Query | Query-only | Contains URL | Does the query contain a URL? | 1 |
| Query | Query-only | Contains ‘news’ | Does the query contain the term news? | 1 |
| Query | Query-only | Contains <entity> | Does the query contain a named entity? (AlchemyAPI) | 173 |
| 7 streams | Frequency | TFstream | Stream term frequency | 14 |
| 7 streams | Frequency | DFstream | Stream document frequency | 14 |
| 7 streams | Frequency | TF-IDFstream | Stream term frequency inverse document frequency | 14 |
| 7 streams | Frequency | DF-IDFstream | Stream document frequency inverse document frequency | 14 |
| 8 streams | Retrieval | BM25 | Sum of the scores for the top retrieved documents for the query using the BM25 document weighting model (Equation 2.4) | 80 |
| 8 streams | Retrieval | DPH | Sum of the scores for the top retrieved documents for the query using the DPH document weighting model (Equation 2.9) | 80 |
| 8 streams | Retrieval | LM Dirichlet | Sum of the scores for the top retrieved documents for the query using the LM Dirichlet document weighting model (Equation 2.13) | 80 |
| 8 streams | Importance | DPH+Votes | Number of recently retrieved documents for the query using the DPH document weighting model (Equation 2.9) | 8 |
| 8 streams | Importance | DPH+RWA | Sum of the scores for recently retrieved documents for the query using the DPH document weighting model (Equation 2.9) | 136 |
| 8 streams | Importance | DPH+RWA+GaussBoost | Sum of the scores for recently retrieved documents for the query, normalised by their age | 120 |
| 7 streams | Burstiness | BurstMagnitude | Bursty term scores | 14 |
| Total | | | | 755 |

Table 7.2: All query-only and stream features extracted from the NQC_May2006 and NQC_Apr2012 datasets, excluding stream-specific features.
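The four time-based query-only features listed for the 2012 dataset in Table 7.2 can be sketched as simple boolean tests on the query timestamp. The table gives only the hour ranges, so treating each range as half-open (start inclusive, end exclusive) is an assumption, as is the helper name `time_features`.

```python
from datetime import datetime


def time_features(ts):
    """Derive the time-of-day and day-of-week flags for a query timestamp.

    Assumption: each hour range is half-open, e.g. 6am <= hour < 1pm.
    """
    hour = ts.hour
    return {
        "isMorning": 6 <= hour < 13,             # 6am to 1pm
        "isEvening": 17 <= hour < 23,            # 5pm to 11pm
        "isNightTime": hour >= 23 or hour < 6,   # 11pm to 6am
        "isWeekend": ts.weekday() >= 5,          # Saturday (5) or Sunday (6)
    }


# Hypothetical query times: a Saturday night and a Tuesday morning.
print(time_features(datetime(2012, 4, 14, 23, 15)))
print(time_features(datetime(2012, 4, 10, 9, 0)))
```

Note that queries issued between 1pm and 5pm set none of the three time-of-day flags, since the listed ranges do not cover the afternoon.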


NQC_May2006: no stream-specific features

NQC_Apr2012

| Source(s) | Feature Set | Feature | Description | # Features |
|---|---|---|---|---|
| Digg | Retrieval | # Poster Profile Views | Sum of the digg view counts for diggs retrieved for the query | 1 |
| Digg | Retrieval | # Comments | Sum of the comment counts for diggs retrieved for the query | 1 |
| Digg | Retrieval | # Diggs | Sum of the digg counts for diggs retrieved for the query | 1 |
| Twitter | Retrieval | Retweet Count | Sum of the tweet retweet counts for tweets retrieved for the query | 2 |
| Twitter | Retrieval | User Statuses Count | Sum of the number of tweets made by the authors of tweets retrieved for the query | 1 |
| Twitter | Retrieval | User Favourites Count | Sum of the number of tweets favourited by the authors of tweets retrieved for the query | 1 |
| Twitter | Retrieval | User Friends Count | Sum of the number of friends that the authors of tweets retrieved for the query have | 1 |
| Twitter | Retrieval | User Followers Count | Sum of the number of followers that the authors of tweets retrieved for the query have | 1 |
| Twitter | Retrieval | User Listed Count | Sum of the number of user lists that the authors of tweets retrieved for the query appear in | 1 |
| Total | | | | 10 |

Table 7.3: All 10 corpus-specific stream features extracted from NQC_Apr2012's Digg and Twitter streams.

report on whether using the features described here, when extracted from both news and user-generated content sources, leads to effective news query classification performance in comparison to a baseline classifier that leverages only features from newswire streams.

7.5 Research Questions

In the following two sections, we investigate whether user-generated content can aid in the real-time classification of Web search queries as news-related or not. In particular, we aim to determine whether, by adding features extracted from user-generated content streams, we can more accurately classify news-related queries than baseline classification approaches. For this task, our baselines are a classifier that uses only newswire article streams to drive classification, and (for the NQC_May2006 dataset, where query logs are available) a classifier that uses recent search queries to drive classification, like that proposed by Diaz (2009).

As described in Section 7.3, we evaluate over two datasets, namely NQC_May2006 and NQC_Apr2012. We evaluate each dataset in its own section. In Section 7.6, we evaluate using the NQC_May2006 dataset, while in Section 7.7 we use the NQC_Apr2012 dataset for evaluation. Within each of these two sections, we answer three research questions, namely:

1. Does the addition of features from user-generated content streams enable news-related queries to be more accurately classified than when using our baseline classifiers? (Sections 7.6.1 and 7.7.1)

2. Of the feature sets described in Section 7.4, which provide the most useful features, and from which streams do these come? (Sections 7.6.2 and 7.7.2)
