Chronotype Discovery through Network Analysis

6.3 Temporal Semantic Query Expansion Approach

6.3.5 Chronotype Discovery through Network Analysis

The previous section details the construction of a topic-specific TSN, based on QE candidate terms found in PRF documents. This section outlines the final stage for identifying optimal candidate QE terms: discovering the topic chronotype and weighting chronotype term importance. Recall that the topic chronotype is the cluster of terms that temporally relate most consistently with one another and, further, are distinctive of the query topic (that is, they are distinctive in the feedback documents). I posit that it is these terms which will lead to optimal QE performance in a time-based collection.

Chronotype discovery is achieved by network analysis of the TSN, determining which candidate QE terms are most valuable based on temporal and non-temporal evidence. For this purpose, I employ PageRank with Priors (PRwP) as proposed by White and Smyth (2003) – a random walk network analysis algorithm – to measure the importance of candidate QE terms based on both kinds of evidence captured in the topic-specific TSN. The characteristics and flexible behaviour that make this established network analysis algorithm suitable for combining evidence are discussed below. Furthermore, the rank assigned to each term by PRwP is employed for weighting the importance of each term in subsequent QE.

A random walk model refers to the mathematical formalisation of a succession of random steps taken by a stochastic process. In this work I am most concerned with PageRank (Page et al., 1999) and its derivatives. Although initially designed to distinguish authoritative web pages based on web link analysis, PageRank is often used more generally to highlight the relative importance of vertices in a graph, based on their connectedness with other vertices. Intuitively, the PageRank model is interpreted as a web surfer infinitely and randomly moving through the web graph, subject to occasional jumps to other randomly selected web pages (depending on the “damping” factor), which prevent the surfer from becoming permanently stuck in dead ends. Consequently, it models the likelihood that the random surfer might visit any of the web pages (or graph vertices) at any given moment in time. Hence, vertices are ranked according to a probability distribution. With an intuitive measure of vertex association in place, PageRank can be used to estimate the relative importance of vertices as a function of their connectedness with other vertices with similarly defined importance.

The rationale for employing a random walk model for network analysis in this scenario, and in particular one derived from PageRank (Page et al., 1999), is that it is suitable for identifying the core of strongly interconnected QE terms contained in the TSN, i.e., those terms with a relatively high temporal semantic similarity among themselves. Furthermore, with some modification, PageRank can be biased towards terms that are most distinctive of the topic in the TSN, i.e., terms with high frequency in the PRF documents.

PageRank with Priors

Temporal evidence is important, but so too is non-temporal evidence when determining valuable QE terms. Neither should be used exclusively, and indeed both must be combined for optimal retrieval performance (cf. Appendix C). It is therefore important to develop an approach that supports setting the precedence of non-temporal versus temporal evidence contained in the TSN for selecting QE terms.

The PageRank with Priors (PRwP) algorithm (White and Smyth, 2003) is used to combine non-temporal and temporal evidence contained in the TSN in an intuitive manner. The PRwP algorithm extends the original PageRank algorithm, such that (i) it integrates edge weights representing the probability of the walker following an edge from a vertex, and (ii) it supports prior probabilities (i.e., biases) representing the relative importance of each vertex in the graph during a random jump. Additionally, rather than a damping factor, it has a back probability β, where 0 ≤ β ≤ 1, which controls how frequently the random surfer jumps to another vertex (the selection of which is biased by the vertex priors).

As PRwP requires directed edges, the undirected TSN graph is adapted into a directed graph by replacing each undirected edge with two opposing same-weight directed edges. Normalized vertex priors are computed as:

$$p(v) = \frac{w_v}{\sum_{v' \in V} w_{v'}} \tag{6.2}$$

and the probability of following an edge e from vertex v as:

$$p_{out}(e, v) = \frac{w_e}{\sum_{e' \in e_{out}(v)} w_{e'}} \tag{6.3}$$
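The graph adaptation and the two normalisations in Equations 6.2 and 6.3 can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the function names, the dictionary layout and the toy weights are all hypothetical, and the vertex weights are assumed to be some non-temporal score such as term frequency in the PRF documents.

```python
from collections import defaultdict


def normalise_priors(vertex_weights):
    """Eq. 6.2: divide each vertex weight by the sum of all vertex weights."""
    total = sum(vertex_weights.values())
    return {v: w / total for v, w in vertex_weights.items()}


def to_directed_transitions(undirected_edges):
    """Replace each undirected edge with two opposing same-weight directed
    edges, then normalise the outgoing weights of every vertex (Eq. 6.3).

    Returns a nested dict {u: {v: probability of moving from u to v}}.
    """
    outgoing = defaultdict(dict)
    for (a, b), weight in undirected_edges.items():
        outgoing[a][b] = weight
        outgoing[b][a] = weight

    transitions = {}
    for u, neighbours in outgoing.items():
        total = sum(neighbours.values())
        transitions[u] = {v: w / total for v, w in neighbours.items()}
    return transitions


# Hypothetical toy values, for illustration only.
vertex_weights = {"lbo": 6.0, "junk bonds": 3.0, "debt service": 1.0}
undirected_edges = {
    ("lbo", "junk bonds"): 0.9,    # assumed temporal semantic similarity
    ("lbo", "debt service"): 0.3,
}

priors = normalise_priors(vertex_weights)
transitions = to_directed_transitions(undirected_edges)
```

Each vertex's outgoing probabilities sum to one, so a vertex with a single neighbour always passes the walker to that neighbour.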

I use the PRwP implementation as described in White and Smyth (2003); however, several other more deterministic and efficient approximate variants exist, e.g., Rodriguez and Bollen (2006). The PageRank, π, for an index term represented by vertex v at iteration i + 1 is computed as:

$$\pi^{(i+1)}(v) = (1 - \beta)\left(\sum_{u \in d_{in}(v)} p(v \mid u)\,\pi^{(i)}(u)\right) + \beta p_v \tag{6.4}$$

Where β is the aforementioned parameter governing random jumps (i.e., the mix between [...] index terms still deemed semantically similar following the graph pruning step). $p(v \mid u)$ is the probability of the random walker following the edge from vertex u to vertex v (i.e., the strength of the semantic similarity between the index terms). And finally, $\pi^{(i)}(u)$ is the PageRank, at the last iteration, of the other semantically similar index term vertex u.

Figure 6.3: Chronotype terms identified through network analysis of the topic-specific temporal semantic network for TREC-1 topic 53 (“leveraged buyouts”). Filtered to edges with a high temporal semantic similarity of > 0.9 between terms for clarity. Terms shown: kohlberg kravis, lbos, corporate restructurings, rjr nabisco, lbo, junk bonds, leveraged buyouts, corporate takeovers, henry kravis, corporate restructuring, nabisco inc, takeover activity, stock market crash, economic downturn, debt service, leveraged buyout.

The number of iterations necessary for the PageRank scores to stabilize varies depending on graph characteristics (Page et al., 1999). Since this is a relatively small and non-complex graph (at least when compared to web-scale graphs), I iterate 40 times to guarantee convergence.
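The update in Equation 6.4, repeated for a fixed number of iterations, can be sketched as below. This is an illustrative sketch rather than the implementation of White and Smyth (2003): the function name, the data layout and the toy graph are hypothetical, starting the scores from the priors is an assumption, and every vertex is assumed to have at least one outgoing edge.

```python
def pagerank_with_priors(transitions, priors, beta, iterations=40):
    """Iterate pi(v) <- (1 - beta) * sum_u p(v|u) * pi(u) + beta * p_v.

    transitions: {u: {v: p(v|u)}}, each vertex's outgoing values summing to 1
    priors:      {v: p_v}, summing to 1
    beta:        back probability, 0 <= beta <= 1
    """
    scores = dict(priors)  # assumption: initialise scores with the priors
    for _ in range(iterations):
        # Random-jump component, biased by the vertex priors.
        updated = {v: beta * priors[v] for v in priors}
        # Edge-following component, weighted by transition probabilities.
        for u, neighbours in transitions.items():
            for v, p_v_given_u in neighbours.items():
                updated[v] += (1.0 - beta) * p_v_given_u * scores[u]
        scores = updated
    return scores


# Hypothetical toy graph, for illustration only.
priors = {"lbo": 0.6, "junk bonds": 0.3, "debt service": 0.1}
transitions = {
    "lbo": {"junk bonds": 0.75, "debt service": 0.25},
    "junk bonds": {"lbo": 1.0},
    "debt service": {"lbo": 1.0},
}

scores = pagerank_with_priors(transitions, priors, beta=0.1)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Sorting the vertices by score, as in `ranking`, gives the ordering from which the highest ranking candidate QE terms would be taken.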

Combining Temporal and Non-temporal Evidence

Adapting the degree to which QE term selection relies on non-temporal and temporal evidence is a desirable feature, since individual queries and collections are likely to have differing temporal characteristics.

The behaviour of PRwP can be adjusted by setting the back probability parameter, denoted as β. When β = 1, the random surfer will always jump randomly, therefore PageRank scores for terms will follow the same distribution as the vertex priors, i.e., using only non-temporal evidence. Conversely, when β = 0, the random surfer will never jump and so it will move using edge transition probabilities only, i.e., only temporal evidence. With 0 < β < 1, a mixture of both non-temporal and temporal evidence will be combined in PageRank computation. Experiments explore a range of β settings to determine the optimal mix of temporal and non-temporal evidence for effective QE.
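These limiting cases can be read off directly by substituting values of β into Equation 6.4. The short check below uses the notation of that equation, taking p_v to be the normalized prior of vertex v (an assumption about the intended symbol); the value β = 0.1 is included only as an example of a mixed setting.

```latex
\begin{align*}
\beta = 1:&\quad \pi^{(i+1)}(v) = p_v \\
\beta = 0:&\quad \pi^{(i+1)}(v) = \sum_{u \in d_{in}(v)} p(v \mid u)\,\pi^{(i)}(u) \\
\beta = 0.1:&\quad \pi^{(i+1)}(v) = 0.9 \sum_{u \in d_{in}(v)} p(v \mid u)\,\pi^{(i)}(u) + 0.1\,p_v
\end{align*}
```

In the first case the edge term vanishes and the scores equal the priors after a single update. In the second case the priors no longer enter the update, so they affect the result only through whatever initial scores are chosen.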

Figure 6.3 shows the strongest chronotype terms identified in the topic-specific TSN for TREC-1 topic 53 (“leveraged buyouts”). PRwP was computed with β = 0.1 (i.e., mixing temporal with a little non-temporal evidence). The resulting 16 highest ranking (i.e., most important) candidate QE terms are shown. Only edges representing a temporal semantic similarity of > 0.9 between terms are included. Bear in mind that this topic-specific TSN covers 1988-89. It would likely look very different if constructed using a more recent collection where new entities, terminology and semantic similarities have emerged, thus supporting the case for using temporal semantic similarity in QE for time-based evolving collections. The highest ranking topic chronotype terms identified by this method are those used for QE in experiments presented in the following section.

6.4 Experimental Setup

I conduct comprehensive retrieval experiments using four time-based document test collections with diverse characteristics. Table 6.1 outlines the characteristics of each collection, such as the duration they cover and the time series temporal resolution used for representing index term document frequency temporal dynamics.

For the news wire collections, AP and WSJ, I use both TREC-1 ad-hoc topics 51-100 and TREC-2 ad-hoc topics 101-150. For the Twitter collection, MB, tweets by users with a non-English default language are discarded. TREC Microblogging Track 2011 topics MB001-MB050 are used. For the blog collection, Blogs06, I use the TREC Blog Track 2006 ad-hoc topics 851-900.

The diverse characteristics of each time-based test collection allow evaluation of the proposed TSQE approach under varying conditions. AP and WSJ are relatively small and clean collections, containing 164,597 and 173,252 documents respectively. MB contains approximately 16 million tweets of up to 140 characters. While tweets have a strong temporal dimension (Teevan et al., 2011), limited document length and noise pose several issues to traditional IR approaches. Blogs06 is a spam-prone web collection containing approximately 3.2 million permalink documents and their associated blog feeds, with no editorial control over content or structure.

The temporal resolutions used to represent temporal dynamics were selected with the volume and velocity of the collections in mind. Twitter changes very rapidly, so a relatively short temporal resolution (i.e., 4 hours) is necessary to reflect change occurring throughout the day. In contrast, the AP and WSJ collections are based on daily reporting of short- and long-term news events, and so I choose a larger temporal resolution (i.e., 7-14 days) which can represent the expected change over the multiple years spanned by both collections.

Table 6.1: Details of experimental test collections, including the period of time the collection covers and the temporal resolution used for representing temporal dynamics.

Collection                   Period                        Temporal Resolution
Associated Press (AP)        12-Feb-1988 to 31-Dec-1989    7 days
Wall Street Journal (WSJ)    1-Dec-1986 to 24-Mar-1992     14 days
Microblogging Track (MB)     23-Jan-2011 to 8-Feb-2011     4 hours
Blog Track (Blogs06)         6-Dec-2005 to 21-Feb-2006     1 day

6.4.1 Obtaining Temporal Dynamics of Candidate QE Terms

All documents in AP, WSJ and MB are used to derive document frequency temporal dynamics of each QE candidate term. However, for Blogs06, the provided RSS/ATOM syndication feeds are mined instead. As the permalinks were crawled 2 weeks after initial creation (to capture reader comments), they contained future content (i.e., content added after the initial publication). Unfortunately, many syndication feeds contained only limited excerpts of the permalinks and so provided restricted insight into term use. AP and WSJ do not contain continuous streams of documents. Both have numerous gaps ranging from days to weeks, which are reflected in all temporal dynamics. Since this issue is consistent for all terms, the semantic similarity method employed is not adversely affected.
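As a rough illustration of what deriving document frequency temporal dynamics at a fixed temporal resolution involves, the sketch below bins timestamped documents and counts, per bin, how many documents contain each term. The function name, the `(timestamp, terms)` input format and the sample documents are hypothetical, and the exact counting and binning scheme used in the experiments is not specified here, so this should be read as one plausible reading only.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def df_time_series(docs, start, end, resolution):
    """Build a document frequency time series for every term.

    docs:       iterable of (timestamp, iterable of terms)
    resolution: bin width as a timedelta (e.g. 7 days, 4 hours)

    Bins without documents stay at zero for every term, so gaps in a
    collection appear in all series alike.
    """
    n_bins = int((end - start) / resolution) + 1
    series = defaultdict(lambda: [0] * n_bins)
    for timestamp, terms in docs:
        if not (start <= timestamp <= end):
            continue  # ignore documents outside the collection period
        bin_index = int((timestamp - start) / resolution)
        for term in set(terms):  # count each document once per term
            series[term][bin_index] += 1
    return dict(series)


# Hypothetical documents, binned at a 7-day resolution.
docs = [
    (datetime(1988, 2, 12), ["lbo", "junk bonds", "lbo"]),
    (datetime(1988, 2, 15), ["lbo"]),
    (datetime(1988, 3, 5), ["lbo", "nabisco inc"]),
]
series = df_time_series(
    docs,
    start=datetime(1988, 2, 12),
    end=datetime(1988, 3, 10),
    resolution=timedelta(days=7),
)
```

In this toy example the second and third bins contain no documents at all, so every term is zero there, which mirrors how collection gaps affect all terms consistently.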
