2.3 IR Evaluation: Test Collections and Evaluation Measures
2.3.5 User-models, Test Collections, and Evaluation Measures
Moffat, Webber, and Zobel (2007) develop methods for selecting which documents to judge from across the submitted systems, so as to differentiate the best systems from the others when evaluating with the RBP measure. In their methods, the judgement effort is shifted towards documents that can help compare the best systems with each other. Given the RBP user model, and the associated RBP score for a document at a given rank, documents are weighted and selected accordingly for judgement by assessors. This work is similar to our work in Chapter 6, wherein we utilize the MSU user model to select updates into a pool for assessment, based on the probability of them being read.
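The weighting idea can be illustrated with a minimal sketch. It assumes the standard RBP weight for rank i, (1 - p) p^(i-1), and simply sums that weight for each document over all runs before picking the highest-weighted documents. It is not the selection procedure of Moffat, Webber, and Zobel (2007), which additionally directs effort towards separating the best systems; the function names, the persistence value p = 0.8, and the toy runs are assumptions made for illustration.

```python
from collections import defaultdict


def rbp_weight(rank, p=0.8):
    """RBP weight of the document at a 1-based rank: (1 - p) * p**(rank - 1)."""
    return (1.0 - p) * p ** (rank - 1)


def select_for_judgement(runs, budget, p=0.8):
    """Return the `budget` documents with the largest RBP weight summed over runs.

    `runs` maps a (hypothetical) system name to its ranked list of document ids.
    """
    total = defaultdict(float)
    for ranking in runs.values():
        for rank, doc_id in enumerate(ranking, start=1):
            total[doc_id] += rbp_weight(rank, p)
    # Highest accumulated weight first; ties broken by document id.
    ordered = sorted(total, key=lambda d: (-total[d], d))
    return ordered[:budget]


# Toy example with two hypothetical runs.
runs = {
    "sysA": ["d1", "d2", "d3"],
    "sysB": ["d2", "d4", "d1"],
}
selected = select_for_judgement(runs, budget=2)
```

With these toy runs, d2 and d1 accumulate the most weight because both appear near the top of both rankings, so they would be judged first under a budget of two.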
Radlinski, Kurup, and Joachims (2008) show that implicit feedback can be utilized to measure retrieval quality. Measures based on implicit feedback include click-through rates on a Search Engine Result Page (SERP), dwell time (the time spent reading a document), the number of page views, and the abandonment rate, amongst others (Agichtein, Brill, and Dumais, 2006). However, such methods are arguably better suited to systems that have a wide user base.
Smucker and Clarke (2012c) argue that test collections created using the Cranfield paradigm need not be at odds with user-oriented evaluation measures. They point out that evaluation measures inherently model the behavior of a user and predict user performance over a given system. Better user models may lead to better evaluation measures, which would in turn lead to better estimates of user performance, provided that the test collection is relatively unbiased and complete.
Recall curves over time (Smucker, Allan, and Dachev, 2012), which in turn are inspired by recall curves over characters read (Lin, 2007), could serve as an alternative to the Latency Comprehensiveness (or Recall) measure of TST. In a similar vein, the concept of trailtext and the U-measure (Sakai and Dou, 2013) could also serve as an alternative for TST evaluation. Yang and Lad (2009) demonstrate an evaluation method similar to ours (Chapter 5); however, it differs in its user model and system interface, and it has been developed over a much smaller corpus than that of TST. It would be interesting to compare against their method by calibrating MSU with a ranked interface presented at every session.
Evaluations over Time Intervals
Dietz, Dalton, and Balog (2013) put forth an approach for time-aware evaluation of streaming data. They carry out their experiments in the context of the KBA time-ordered document stream. Specifically, the documents should be “citable” by the Wikipedia article for the topic. Dietz, Dalton, and Balog (2013) find that, to keep the evaluation of systems fair, the evaluation should be aware of time intervals that contain bursts of intensity for a topic, and propose that the final evaluation score of a run could be the average over the evaluation scores at individual time slices. Kenter, Balog, and Rijke (2015) develop a metric to measure the performance of systems that filter documents from streams, in the context of the KBA track, with applicability to similar stream filtering tasks. They demonstrate that traditional metrics like MAP, F1, and nDCG fail to capture how the performance of a system changes over time. Their method essentially estimates the trend of F1 over batches (of time durations). A line fitted through the F1 scores of the batches (a trend line) allows measurement of the change in performance with time. Such research is aligned with MSU, with the primary difference being that both the time granularity and the bursty access are user-driven for MSU. However, information regarding the effect of bursts on user behavior could greatly help improve upon MSU in its current form.
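As a rough illustration of the two ideas above, the following sketch computes F1 per batch, averages the per-batch scores (a uniform average over time slices is assumed), and fits an ordinary least squares line through the scores against the batch index. It is only a sketch of the general technique and not the metric of Kenter, Balog, and Rijke (2015); the function names and the (tp, fp, fn) counts are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from true positive, false positive and false negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def trend_line(scores):
    """Least squares (slope, intercept) of scores against batch index 0..n-1."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two batches to fit a line")
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical (tp, fp, fn) counts for three consecutive time batches.
batches = [(8, 2, 2), (6, 4, 4), (4, 6, 6)]
scores = [f1(*b) for b in batches]

mean_score = sum(scores) / len(scores)   # average over time slices
slope, intercept = trend_line(scores)    # negative slope: quality degrades
```

In this toy case the average score alone hides the fact that the system gets steadily worse from batch to batch, whereas the negative slope of the fitted line exposes it.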
Chapter 3
The Temporal Summarization Track at TREC
The TREC Temporal Summarization Track (TST) (Aslam et al., 2013; Aslam et al., 2014) promotes research on the development and evaluation of IR systems designed for retrieving updates about breaking news events. Figure 3.1 illustrates the general temporal summarization task. For each topic, a query duration determines the time interval for which a user is interested in the topic. Given a time-ordered document stream, a temporal summarization system processes the documents from the stream in temporal order, and outputs (emits) the updates it deems relevant to the topic. The primary constraint for the task is that, at any given time instant, retrieval (or processing) should not involve data from the future, i.e., the retrieval algorithm should respect the temporal ordering of documents for processing.
Figure 3.1: The Temporal Summarization task: Following a newsworthy event that occurs at some point in time (red arrow), the system must find and emit sentences concerning the event, from a time ordered stream of documents (blue arrow), for as long as the user is interested in the event (the query duration).
The temporal summarization task can also be viewed as an on-line filtering task wherein sentences relevant to a topic are filtered from a stream of sentences. The number of updates to emit and the times at which to emit them are decided by the system; e.g., one system may emit a potentially relevant update as soon as it is processed, while another system may emit a fixed number of updates at regular or suitable intervals.
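The task setting can be sketched as a simple on-line filter. The code below is a toy illustration and not a TST system: it assumes the stream is an iterable of (timestamp, sentences) pairs in time order, uses naive query term matching as a stand-in for relevance, and emits each matching sentence as soon as it is processed. All names, the matching rule, and the example data are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Update:
    emit_time: int   # timestamp at which the system emitted the sentence
    sentence: str


def summarize_stream(stream, query_terms, start, end):
    """Emit matching sentences from a time-ordered stream within [start, end].

    Documents are consumed strictly in temporal order, so a decision at time t
    only ever depends on documents with timestamps <= t.
    """
    terms = {t.lower() for t in query_terms}
    seen = set()        # sentences already emitted (avoid exact repeats)
    updates = []
    last_ts = None
    for ts, sentences in stream:
        if last_ts is not None and ts < last_ts:
            raise ValueError("stream must be time ordered")
        last_ts = ts
        if ts < start:
            continue    # before the query duration
        if ts > end:
            break       # the user is no longer interested
        for sentence in sentences:
            words = set(sentence.lower().split())
            if terms & words and sentence not in seen:
                seen.add(sentence)
                updates.append(Update(ts, sentence))
    return updates


# Hypothetical stream of (timestamp, sentences) documents.
stream = [
    (5, ["Quiet day in town"]),
    (10, ["Earthquake hits city", "Weather is mild"]),
    (20, ["Earthquake hits city", "Earthquake toll rises"]),
    (40, ["Earthquake aid arrives"]),
]
updates = summarize_stream(stream, ["earthquake"], start=10, end=30)
```

A system that instead emits a fixed number of updates at regular intervals would buffer candidate sentences and flush the buffer at each interval boundary, still without looking at documents that arrive later.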
3.1 Temporal Summarization Track 2013
TST 2013 was the first iteration of the Temporal Summarization Track at TREC. Topics in the track were instances of the following event types: accident, bombing, earthquake, shooting, and storm. Given a topic’s query string, the track required that systems (runs) return sentences (updates) relevant to the topic from the TREC KBA Stream corpus 2013 (Frank et al., 2014). Every update is associated with the timestamp at which it was emitted by the system.