3.4.5
Summary
In summary, Twitter is a real-time information sharing platform that receives hundreds of millions of posts each day. Posts on Twitter have some unique characteristics that introduce challenges when using them for IR tasks. Twitter is well known as a source of real-time news content, which has led to research in the fields of real-time search and event detection. The Microblog track at TREC was proposed in 2011 to investigate real-time ad hoc search; however, research in the field is still in its infancy. Furthermore, no prior work has examined how tweets can be used within the context of a news vertical of a universal Web search engine. For instance, it has not been shown when or where tweets are useful results to return for news-related queries. In this thesis, we investigate whether Twitter can be used to more effectively satisfy news-related queries submitted to Web search engines. We use Twitter streams later in Chapters 6, 7, 8 and 9.
3.5
Query-Logs
Modern universal Web search engines record each query submitted to them by users, normally in a structure referred to as the query log. The query log normally contains the query each user submitted, the time that query was submitted and the IP address and/or other user identification information (Craswell et al., 2009). A query log is highly useful, since it shows how end-users searched over a period of time. The query log is normally accompanied by a click log that records which document the user clicked on (if any). Due to the large number of users that search each day, query logs are rich sources of information. Indeed, the Google Web search engine serves around a billion user queries each day (Norman, 2011). Web search engines do not often release query or click logs due to privacy concerns. Below we list six query logs that have been released for analysis by researchers, although these may no longer be available:
• Excite, March 1997: 51,473 queries from the Excite search engine (Jansen et al., 1998).
• Fireball, July 1998: 16 million end-user queries from the German Fireball search engine (Hoelscher, 1998).
• AltaVista, August-September 1998: approximately 993 million queries from the AltaVista search engine (Silverstein et al., 1999).
• FAST, February 2001: 500,000 queries from the European search engine FAST (Spink, Ozmutlu, Ozmutlu & Jansen, 2002).
• AOL, March-May 2006: 20 million queries from over 650,000 users from the AOL search engine (Brenes & Gayo-Avello, 2009).
• MSN, May 2006: approximately 15 million queries from the MSN Live Search engine (now Microsoft Bing) (Craswell et al., 2009).
Query logs have been the subject of wide-ranging investigations. As such, we structure our discussion of query logs into two main sub-sections, each representing a set of prior works relevant to news-vertical search. In particular, in Section 3.5.1, we describe works that have analysed how users search in general and over time. Meanwhile, Section 3.5.2 discusses works that categorise user queries and examine vertical search. Section 3.5.3 summarises the findings of this section.
3.5.1
Query Log Analysis
Each of the aforementioned Web search engine query logs has seen some examination. For instance, Jansen et al. (2000) performed a detailed analysis of the Excite query log, while Silverstein et al. (1999) analysed the AltaVista query log. These works expose similar querying behaviour by users. For example, they conclude that users typically issue short queries of only 2-3 terms in length and view only the top ten search results. An analysis of the Excite query log from May 1997 indicated that about 10% to 20% of Web search queries contain query operators, e.g. phrase searches (Jansen et al., 1998). However, in an analysis of 100 queries with operators from the Excite query log of 1 May 2001, Eastman & Jansen (2003) later showed that such query operators had little effect on search performance.
Prior works have noted that query frequency tends to follow a long-tail distribution, where few queries appear often and many queries appear only once or twice. For instance, Jansen & Pooch (2001) reported that 57% of query terms from the Excite log from March 1997 were used only once, while 78% were used fewer than three times. Meanwhile, Silverstein et al. (1999), in their analysis of the AltaVista query log, reported that the top 25 queries represent 1.5% of the total query volume.
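To make these long-tail statistics concrete, the following minimal Python sketch computes the two kinds of figure quoted above: the share of distinct queries that occur exactly once, and the share of total volume taken by the most frequent queries. The function name long_tail_stats and the toy log are illustrative assumptions and are not taken from any of the cited studies; the same computation could be applied to query terms rather than whole queries.

```python
from collections import Counter

def long_tail_stats(queries, top_k=25):
    """Return (share of distinct queries seen exactly once,
    share of total query volume taken by the top_k queries)."""
    counts = Counter(queries)
    total = sum(counts.values())
    singletons = sum(1 for c in counts.values() if c == 1)
    top_volume = sum(c for _, c in counts.most_common(top_k))
    return singletons / len(counts), top_volume / total

# Toy log (invented data): one popular query and several rare ones.
toy_log = ["weather"] * 6 + ["news"] * 2 + ["glasgow trams", "trec microblog"]
singleton_share, head_share = long_tail_stats(toy_log, top_k=1)
print(singleton_share, head_share)
```

On a real query log, a small head share together with a large singleton share would be consistent with the long-tail behaviour described above.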
Queries submitted to Web search engines are time orientated. For example, Beitzel et al. (2004) investigated how the Web query traffic varied hourly. They show that less than 1% of the day’s total queries appear between 5-6am, while 6.7% appear between 9-10pm. However, they also report that the ratio of distinct to total queries in a given hour is nearly constant throughout the day, i.e. the proportion of head and tail queries remains the same. Diaz & Jones (2004) also used the temporal profiles of Web search queries to improve the prediction of average precision for a query, while Chien & Immorlica (2005) used temporal query profiles to find other similar queries.
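The two hourly measurements discussed above, namely the share of the day's queries falling in each hour and the ratio of distinct to total queries within that hour, can be sketched as follows. This is an illustrative computation only, assuming a log of (timestamp, query) pairs; the function name hourly_profile and the toy data are hypothetical and do not reproduce the methodology of Beitzel et al. (2004).

```python
from collections import defaultdict
from datetime import datetime

def hourly_profile(log):
    """Map hour -> (share of all queries issued in that hour,
    ratio of distinct to total queries within that hour)."""
    by_hour = defaultdict(list)
    for timestamp, query in log:
        by_hour[timestamp.hour].append(query)
    total = sum(len(qs) for qs in by_hour.values())
    return {
        hour: (len(qs) / total, len(set(qs)) / len(qs))
        for hour, qs in sorted(by_hour.items())
    }

# Toy log (invented data): one early-morning query, three evening queries.
toy_log = [
    (datetime(2006, 5, 1, 5, 10), "weather"),
    (datetime(2006, 5, 1, 21, 5), "weather"),
    (datetime(2006, 5, 1, 21, 20), "news"),
    (datetime(2006, 5, 1, 21, 40), "weather"),
]
profile = hourly_profile(toy_log)
print(profile)
```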
Other works have examined the identification of news-related queries that experience a burst of activity (Kleinberg, 2002) in tandem with an event. Vlachos et al. (2004) performed one of the first query burst detection studies using an early MSN query log. They built a time series for each query n-gram and used Fourier analysis to identify news-related queries that undergo either short-term or long-term bursts of activity. Jones & Diaz (2007) proposed a classification of queries into those that exhibit no bursts in activity, those that exhibit one large burst and those that exhibit multiple smaller bursts. They relate these bursty queries to events described by newswire article corpora. Subasic & Castillo (2010) later examined one year of query logs, identifying bursty news-related queries based on increased search volumes and clicks. Relatedly, Kulkarni et al. (2011) examined how queries and the documents that users click on for those queries change over time using a proprietary query log from the Bing search engine. They identified four time-based features that can be leveraged to classify user queries that abruptly change their behaviour over time, namely: the number of bursts; the periodicity of those bursts (for those that exhibit repeating patterns); the shape of the burst; and the overall trend (up, down, flat or up then down). Their results showed that news-related queries tend to exhibit a single large burst, with bursts of querying activity for unexpected events dropping off quickly after the event.
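As a simple illustration of what detecting a burst in a query's time series can look like, the sketch below flags days on which a query's volume exceeds the mean of the preceding days by several standard deviations. This moving-window threshold is a generic baseline chosen for brevity; it is not the automaton model of Kleinberg (2002), the Fourier-based approach of Vlachos et al. (2004), or the feature set of Kulkarni et al. (2011). The function name detect_bursts, the window size, the multiplier k and the counts are all assumptions.

```python
from statistics import mean, pstdev

def detect_bursts(daily_counts, window=7, k=3.0):
    """Return indices of days whose count exceeds
    mean + k * standard deviation of the previous `window` days."""
    bursts = []
    for i in range(window, len(daily_counts)):
        history = daily_counts[i - window:i]
        threshold = mean(history) + k * pstdev(history)
        if daily_counts[i] > threshold:
            bursts.append(i)
    return bursts

# Invented daily volumes for one query: steady traffic, then a spike on day 7.
counts = [10, 12, 11, 9, 10, 11, 10, 95, 40, 12, 10]
bursts = detect_bursts(counts)
print(bursts)
```

Note that once a spike enters the history window it inflates the threshold, so the elevated volume on the following day is not flagged; a more careful detector would need to account for this.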
From a news vertical search perspective, the most important aspect of query logs is the bursts of activity that (news-related) queries experience when a related event breaks. However, from these prior works, it is not clear how effective burst detection techniques are when identifying news-related user queries. Indeed, later in Chapter 7, we examine how techniques such as burst detection in query logs can be leveraged to identify news-related queries in real time.
3.5.2
Query Types and Vertical Selection
One research area that has seen extensive investigation is the types of queries that users submit to Web search engines. Various prior works define query taxonomies for different domains. For instance, early work by Spink et al. examined how Web search queries could be categorised into topical categories. They analysed Excite query logs comprised of over a million queries for single days in 1997, 1999, and 2001 (Spink, Jansen, Wolfram & Saracevic, 2002; Spink et al., 2001; Wolfram et al., 2001). Using 2,500 queries from each log, they categorised them into 11 topical categories, including: entertainment and recreation; commerce, travel, employment and economy; and computers and the internet. They found that the mix of search topics had changed over the years. However, little prior work has examined the type of query examined in this thesis, namely news queries.
Broder (2002) proposed a general taxonomy of Web search queries that is comprised of the three categories shown below:
• Navigational: The intent is to reach a particular website, e.g. queries that contain the URL to the site that the user is intending to reach.
• Informational: The intent is to acquire some information assumed to be present on one or more Web pages.
• Transactional: The intent is to perform some activity that requires interaction with a Web service, e.g. online shopping.
News queries can be considered to be informational in nature, where the user’s intent is to acquire information about a news story. Broder’s taxonomy was later extended by other researchers by subdividing the three original categories into further sub-categories, or by adding abstraction layers. For instance, Rose & Levinson (2004) added Directed, Undirected, Advice, Locate and List sub-categories to the informational category. Meanwhile, Jansen et al. (2008) supplied a hierarchical classification of user intents as expressed by Web queries based upon prior works. However, these works do not consider news queries explicitly.
Automatic query categorisation approaches have also been proposed to identify queries of specific types. For example, Pu et al. (2002) proposed an automatic query categorisation approach that attributes to each query the categories of the top Web search results returned for it. Broder et al. (2007) later extended this approach by using a machine-learned classifier to automatically classify the top Web pages retrieved for those queries. Query categorisation approaches can be used to drive vertical selection. In particular, by categorising a query, specialist content for that category from a vertical can be integrated into the Web search ranking and displayed to the user. For instance, if a query can be categorised as relating to a recent event, then news vertical content can be integrated for that query.
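The general idea of attributing the categories of the top search results to the query can be sketched as a simple vote, where the query takes the most common category among its top results provided that category is sufficiently dominant. This is an assumed simplification for illustration only: the majority vote, the min_share threshold and the function name categorise_query are hypothetical and should not be read as the actual procedure of Pu et al. (2002) or Broder et al. (2007).

```python
from collections import Counter

def categorise_query(top_result_categories, min_share=0.5):
    """Assign the query the most common category among its top results,
    or None if no category reaches min_share of the results."""
    if not top_result_categories:
        return None
    category, votes = Counter(top_result_categories).most_common(1)[0]
    if votes / len(top_result_categories) >= min_share:
        return category
    return None

# Invented categories for the top five results of one query.
label = categorise_query(["news", "news", "sports", "news", "shopping"])
print(label)
```

Under this sketch, a query labelled "news" would be a candidate for integrating news vertical content, in line with the vertical selection use case described above.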
Vertical selection for the news vertical has seen some investigation in the literature. In particular, Arguello et al. (2009) examined what types of evidence are useful when selecting a vertical for the user query. They investigated 19 different verticals, including the news vertical. Their results indicate that using machine learning to combine multiple features describing each vertical’s relevance to the query provides the best performance. However, they also show that, of the verticals tested, the news vertical was one of the hardest to select queries for. Arguello et al. (2010) later investigated how to build a general model that can select verticals without vertical-specific training data. This approach uses resource selection techniques, such as those described in Section 2.8, to estimate when a vertical contains relevant content. They report that an effective general model needs to use portable features, i.e. those that work well across multiple verticals. However, their results indicate that the news vertical is difficult to predict, receiving the second-lowest overall accuracy using a trained model under the