
How effective and reusable probability-based pooling methods are, compared to the standard score-based pooling method, is an interesting question for future work. Traditional research on pooling focuses on depth pooling based on system-assigned scores for documents (Cormack, Palmer, and Clarke, 1998; Sparck-Jones and Van Rijsbergen, 1975; Carterette, Allan, and Sitaraman, 2006; Voorhees, Harman, et al., 2005). Furthermore, it is not yet clear how analyses of pool quality (Cormack and Lynam, 2007; Buckley et al., 2007) can be applied to probability-based pooling methods.

One potential disadvantage of probability-based pools is that some users may never encounter relevant updates. Since the probabilities are averaged across users and runs, the resulting pools may not contain updates that are read by only a very small fraction of users. Some of these updates could be relevant, but since they would be absent from the evaluation pool, MSU may report inaccurate values of system performance. One possible solution is to combine P(read) and the system-assigned score for every update to compute a “pool membership” score. A pool can then be formed by ranking updates on their pool membership scores.
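The combination suggested above is left unspecified. One way it could be realised is sketched below, purely as an illustration: the linear interpolation, the weight `alpha`, the min-max normalisation of system scores, and all function names are assumptions introduced here, not part of the thesis.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant (or empty) map yields zeros."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def pool_membership_scores(p_read: Dict[str, float],
                           system_score: Dict[str, float],
                           alpha: float = 0.5) -> Dict[str, float]:
    """Assumed combination: alpha * P(read) + (1 - alpha) * normalised system score."""
    norm = min_max_normalize(system_score)
    return {u: alpha * p_read[u] + (1 - alpha) * norm[u] for u in p_read}


def build_pool(membership: Dict[str, float], depth: int) -> List[str]:
    """Keep the `depth` updates with the highest membership score (ties broken by id)."""
    ranked = sorted(membership, key=lambda u: (-membership[u], u))
    return ranked[:depth]


# Hypothetical toy data: u2 is rarely read but scored highly by the system.
p_read = {"u1": 0.9, "u2": 0.01, "u3": 0.5}
system_score = {"u1": 1.0, "u2": 10.0, "u3": 2.0}
pool = build_pool(pool_membership_scores(p_read, system_score), depth=2)
```

With equal weighting, the rarely read but highly scored update `u2` enters the pool, which is the behaviour the proposed combination is meant to achieve. How to weight the two signals would have to be determined empirically.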

Relevance assessments are required for the probability-based pools to enable pertinent effectiveness analyses and further experiments. Any line of future work is hindered considerably by the lack of relevance judgements for the probability-based pools. Procuring these judgements is essential for comparing probability-based pooling with standard score-based depth pooling.

Chapter 7

Conclusions and Future Work

In this thesis, we mainly examined how to evaluate, from a user-oriented perspective, systems that produce a stream of news updates. We also explored factors that might affect the evaluation of such systems.

7.1 Summary

Following the user-model-oriented evaluation paradigm (Clarke and Smucker, 2014; Clarke et al., 2013; Smucker and Clarke, 2012d), we developed modeled stream utility (MSU) (Baruah, Smucker, and Clarke, 2015), a user-oriented measure for the evaluation of news filtering systems. We demonstrated our measure using the participant systems of the temporal summarization track (TST) 2013. MSU differs considerably from ELG and LC, the evaluation measures developed for TST.

We developed a simple user model for the behavior of a user accessing information from a stream of updates. Essentially, the MSU user model simulates a user alternating between spending time reading updates and spending time away from the system. We simulated users reading content from the stream of updates produced by a system and measured gain for every information nugget contained in relevant updates that were read. Our experiments show that the performance of a system can vary considerably depending on the users that use the system. Different characteristic user behaviors (users that spend different amounts of time reading and away from the system) lead to different amounts of gain experienced by users. This finding shows that system developers would benefit greatly from knowing the behavior of their target user population.
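The alternation between reading and time away can be illustrated with a toy simulation. This is a minimal sketch and not the MSU implementation: the fixed session and away durations, the newest-first reading order, the rule that a user stops when the next update no longer fits in the session, the counting of each nugget once, and all names are assumptions made here for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Update:
    time: float              # emission time, in seconds
    words: int               # length of the update, in words
    nuggets: FrozenSet[str]  # nugget ids in the update (empty if non-relevant)


def simulate_user_gain(updates: List[Update], reading_speed: float,
                       session: float, away: float, end_time: float) -> int:
    """Count distinct nuggets read by one user who alternates away and reading periods."""
    seen = set()
    prev_start = float("-inf")
    start = away  # assumed: the user is away first, then reads
    while start < end_time:
        # Updates that arrived since the previous session began, newest first.
        window = sorted((u for u in updates if prev_start < u.time <= start),
                        key=lambda u: u.time, reverse=True)
        budget = session
        for u in window:
            cost = u.words / reading_speed  # seconds needed to read this update
            if cost > budget:
                break  # out of time; the remaining updates are never read
            budget -= cost
            seen |= u.nuggets
        prev_start = start
        start += session + away
    return len(seen)


# Hypothetical stream: three short updates and one long one.
stream = [
    Update(10, 20, frozenset({"n1"})),
    Update(20, 20, frozenset({"n2"})),
    Update(30, 200, frozenset({"n3"})),
    Update(100, 20, frozenset({"n1"})),
]
long_sessions = simulate_user_gain(stream, reading_speed=4, session=60, away=40, end_time=200)
short_sessions = simulate_user_gain(stream, reading_speed=4, session=8, away=40, end_time=200)
```

On this toy stream the user with 60-second sessions reads all three nuggets, while the user with 8-second sessions reads only one, since the long update blocks the first session. The same system output therefore yields different gain for different user behaviors.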

We observed that duplicate sentences can exist in very large numbers in a web-scale corpus, specifically the KBA stream corpus (Frank et al., 2014). We investigated how including the duplicates of judged sentences in the qrels affects the evaluation of stream filtering systems, specifically the participating systems at TST 2013 and TST 2014. We compared the evaluation with duplicates included to the respective tracks’ evaluation for the measures ELG, LC, and MSU. Our key finding was that the relative ranking of runs does not change significantly, which is noteworthy given that the duplicates of judged updates in the corpus can number 1000 times the judged updates. However, even though the relative performance of systems is not appreciably affected, the absolute scores change significantly for over half the systems across the three metrics for TST 2013. In contrast, TST 2014 runs do not show much change in performance on the track’s measures even when duplicates are added to the qrels. This is mainly because the TST 2014 evaluation does not elide unjudged sentences but instead considers them to be non-relevant. However, the performance of TST 2014 runs does change when they are evaluated using MSU with duplicates added to the qrels.

The MSU user model essentially simulates users reading updates in sessions. The number of updates each modeled user reads in a session depends on that user’s reading speed. If the user encounters a relevant update, the user receives gain. We observed that a large proportion of simulated users do not encounter all the judged relevant updates. About a third of the judged relevant updates are read by fewer than 1% of the modeled users. For evaluating a system, it may be beneficial to judge those updates that are read by most modeled users. Accordingly, we investigated pooling of updates for adjudication, using the probability of an update being read by users as the pooling selection criterion. We explored alternative formulations for the probability of an update being read, and investigated depth pooling based on probabilities as well as pooling based on probability mass cover.
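The two pooling strategies named above can be sketched as follows. The definitions used here are assumptions for illustration: depth pooling is taken to mean keeping the top-k updates ranked by P(read), and probability mass cover is taken to mean keeping the shortest prefix of that ranking whose summed probability reaches a chosen fraction of the total. The function names and the `cover` parameter are hypothetical.

```python
from typing import Dict, List


def rank_by_probability(p_read: Dict[str, float]) -> List[str]:
    """Order update ids by decreasing P(read), breaking ties by id."""
    return sorted(p_read, key=lambda u: (-p_read[u], u))


def depth_pool(p_read: Dict[str, float], depth: int) -> List[str]:
    """Assumed depth pooling: the `depth` updates most likely to be read."""
    return rank_by_probability(p_read)[:depth]


def mass_cover_pool(p_read: Dict[str, float], cover: float) -> List[str]:
    """Assumed mass-cover pooling: shortest ranked prefix holding `cover` of the total mass."""
    total = sum(p_read.values())
    if total <= 0:
        return []
    target = cover * total
    pool, accumulated = [], 0.0
    for u in rank_by_probability(p_read):
        if accumulated >= target:
            break
        pool.append(u)
        accumulated += p_read[u]
    return pool


# Hypothetical probabilities for one run and topic.
p_read = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
top_two = depth_pool(p_read, depth=2)
covered = mass_cover_pool(p_read, cover=0.7)
```

Under these assumptions, a depth pool has a fixed size regardless of how the probability is distributed, whereas a mass-cover pool grows when the probability is spread thinly over many updates.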

We found that pools constructed using TREC-standard depth-based pooling on updates’ scores have less than 45% overlap with pools constructed using the probabilities of updates being read. Furthermore, the overlap in known relevant updates between the two pools does not exceed 70%, even for pools with a depth of 1000. Ascertaining the usefulness of pooling based on user-model-induced probabilities is an interesting avenue for future work. Although such pooling methods may help to alleviate retrieval algorithm bias, they may turn out not to be reusable if the user model changes.
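A pool comparison of this kind could be computed as in the sketch below. The overlap definition (shared updates divided by the size of the smaller pool) is an assumption made for illustration, as are the function names and the toy data; the figures reported above come from the thesis experiments, not from this code.

```python
from typing import Set


def overlap(pool_a: Set[str], pool_b: Set[str]) -> float:
    """Assumed overlap: shared updates as a fraction of the smaller pool."""
    if not pool_a or not pool_b:
        return 0.0
    return len(pool_a & pool_b) / min(len(pool_a), len(pool_b))


def relevant_overlap(pool_a: Set[str], pool_b: Set[str], relevant: Set[str]) -> float:
    """Overlap restricted to updates already known to be relevant."""
    return overlap(pool_a & relevant, pool_b & relevant)


# Hypothetical pools of equal depth and a hypothetical set of judged relevant updates.
score_pool = {"u1", "u2", "u3", "u4"}
probability_pool = {"u3", "u4", "u5", "u6"}
known_relevant = {"u3", "u5"}

pool_overlap = overlap(score_pool, probability_pool)
rel_overlap = relevant_overlap(score_pool, probability_pool, known_relevant)
```

For pools of equal depth, this definition reduces to the number of shared updates divided by the depth.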