sues is the highly contended notion of relevance. NIST-employed human assessors evaluate documents based on their perceived relevance to the query topic, yet relevance is undoubtedly a judgement based on complex individual user and context factors such as prior experience, domain expertise, recent information interaction, learning and, indeed, time. While collections composed of time-stamped information objects (e.g., tweets and news) exist, there are currently no test collections accompanied by repeatedly collected temporal relevance judgements simulating temporal information needs or queries submitted over time.
However, at the time of writing, the NTCIR 12th Evaluation Conference is developing the new “Temporalia” track with some of these goals in mind.
2.4.2
User-oriented Evaluation
For many users, overall satisfaction derives not only from the quality of the results but, increasingly, from the interface that supports the user’s interactions. User-oriented evaluation involves direct, measured studies of real users interacting with an IR system, in order to observe the combination of factors that might affect satisfaction.
Studies of this type are concerned with user satisfaction during typically exploratory interactive IR (IIR) tasks performed using the system. Experiments are often based on giving users complex question-answering tasks to complete, or a simulated situation such as a tourist looking for information on where to visit in a city (Borlund, 2003). Questionnaires are used to elicit user opinions and perceptions, and logged interactions such as result clicks and dwell time can be used to observe and quantify underlying behaviours. User-oriented evaluation offers an avenue for high-quality objective and subjective evaluation; however, user studies are expensive and therefore difficult to scale. Zuccon et al. (2013) recommend hybrid evaluation approaches that mix system-oriented evaluation with cheap crowdsourced and lab-based user studies as a possible route to cost-effective but comprehensive IR system evaluation.
2.5
Evaluating Retrieval Effectiveness
System-oriented evaluation relies on various metrics of retrieval effectiveness to quantify the satisfaction expected to be experienced by a real user. Satisfaction is usually defined in terms of how many relevant results the system provides, and whether those results are highly ranked.
Different effectiveness metrics help to characterise retrieval effectiveness in scenarios where user goals may differ. The most elementary family of retrieval metrics is set-based.
These measures quantify only the existence of relevant results in the top k results (denoted as measure@k) retrieved by an IR system. I examine two common set-based retrieval effectiveness metrics in the following subsections.
2.5.1
Recall
Recall measures the ratio of retrieved relevant documents to all known relevant documents for a query. A high recall is desirable in tasks where the user wishes to see all relevant items, for example, in patent retrieval. Recall is formally defined as:
\[
\text{Recall} = \frac{|\,\text{relevant documents retrieved}\,|}{|\,\text{relevant documents}\,|} \tag{2.3}
\]
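As a minimal sketch of the recall definition, the following Python function computes recall over the top k results of a ranking. The function name `recall_at_k`, the document identifiers and the relevance judgements are hypothetical and chosen purely for illustration.

```python
def recall_at_k(ranking, relevant, k):
    """Fraction of all known relevant documents that appear in the top k results."""
    rel = set(relevant)
    if not rel:
        raise ValueError("recall is undefined when no relevant documents are known")
    retrieved = set(ranking[:k])
    return len(retrieved & rel) / len(rel)


# Hypothetical ranking and judgements (illustrative only).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

recall_1 = recall_at_k(ranking, relevant, k=1)
recall_5 = recall_at_k(ranking, relevant, k=5)
print(recall_1, recall_5)
```

In this toy case two of the four known relevant documents appear in the top five results, so recall at k = 5 is one half.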
2.5.2
Precision
Precision measures the fraction of retrieved documents known to be relevant. High precision is desirable in tasks where the user would rather see only relevant items, for example, web search. Precision is formally defined as:
\[
\text{Precision} = \frac{|\,\text{relevant documents retrieved}\,|}{|\,\text{retrieved documents}\,|} \tag{2.4}
\]
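The precision definition can be sketched in the same way. The function name `precision_at_k` and the example data are hypothetical. The sketch divides by the number of documents actually retrieved in the top k, which is one reasonable reading of Equation 2.4 when fewer than k results are returned.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top k retrieved documents that are known to be relevant."""
    top_k = ranking[:k]
    if not top_k:
        raise ValueError("precision is undefined when no documents are retrieved")
    rel = set(relevant)
    hits = sum(1 for doc in top_k if doc in rel)
    return hits / len(top_k)


# Hypothetical ranking and judgements (illustrative only).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

precision_1 = precision_at_k(ranking, relevant, k=1)
precision_5 = precision_at_k(ranking, relevant, k=5)
print(precision_1, precision_5)
```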
In reality, a trade-off between recall and precision is often necessary. Increasing recall typically reduces precision, as more non-relevant results are introduced into the result ranking, and vice versa. An understanding of the user’s task and goals governs the emphasis placed on either side of this trade-off.
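To make the trade-off concrete, the sketch below computes precision and recall at every cut-off of a single ranking. The helper name `precision_recall_by_cutoff` and the data are hypothetical. On this toy ranking, recall never decreases as k grows, while precision falls from its initial value by the final cut-off.

```python
def precision_recall_by_cutoff(ranking, relevant):
    """Return (k, precision@k, recall@k) for each cut-off k in the ranking."""
    rel = set(relevant)
    if not rel:
        raise ValueError("at least one relevant document is required")
    hits = 0
    points = []
    for k, doc in enumerate(ranking, start=1):
        if doc in rel:
            hits += 1
        points.append((k, hits / k, hits / len(rel)))
    return points


# Hypothetical ranking and judgements (illustrative only).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

points = precision_recall_by_cutoff(ranking, relevant)
for k, precision, recall in points:
    print(f"k={k}  precision={precision:.2f}  recall={recall:.2f}")
```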
Of course, the majority of modern retrieval systems produce a ranked list of results to ensure that users see the most relevant results first. In ranked retrieval, in addition to the presence of relevant results in the first few pages of results, a higher ranking of the most relevant results is also desirable. Ranked effectiveness metrics give higher weight to highly relevant results appearing in the highest ranking positions. I examine two common ranked effectiveness metrics in the following subsections.
2.5.3
Mean Average Precision (MAP)
MAP offers a single metric to quantify the all-round effectiveness of a retrieval system for a representative set of test topics (e.g., the queries cover diverse popular and unpopular information needs) (Voorhees and Harman, 2005). By evaluating an IR system using a selection of queries with different characteristics, a more realistic view of IR system effectiveness can be obtained. I use MAP to characterise retrieval effectiveness in retrieval experiments reported in Chapter 6.
As its name implies, MAP is computed as the mean over the average precision (AP) obtained for each test topic. AP is a rank-sensitive measure of precision, computed by summing the precision values obtained at each rank at which a relevant document is retrieved, up to the rank cut-off position n, and dividing by the number of relevant documents:
\[
\text{AP} = \frac{\sum_{k=1}^{n} \left( P(k) \times rel(k) \right)}{|\,\text{relevant documents}\,|} \tag{2.5}
\]
where P(k) is the precision computed at rank cut-off k, and rel(k) is an indicator function equal to 1 if the item retrieved at rank k is relevant, or 0 if non-relevant.
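The AP computation can be sketched directly from Equation 2.5. The function name `average_precision`, the optional cut-off parameter `n` and the example data are assumptions made for illustration. When `n` is omitted, the sketch scores the whole ranking.

```python
def average_precision(ranking, relevant, n=None):
    """Sum P(k) * rel(k) for k = 1..n, divided by the number of relevant documents."""
    rel = set(relevant)
    if not rel:
        raise ValueError("AP is undefined when no relevant documents are known")
    cutoff = len(ranking) if n is None else min(n, len(ranking))
    hits = 0
    total = 0.0
    for k in range(1, cutoff + 1):
        if ranking[k - 1] in rel:   # rel(k) = 1
            hits += 1
            total += hits / k       # P(k), the precision at rank k
    return total / len(rel)


# Hypothetical ranking and judgements (illustrative only).
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d5", "d8"}

ap = average_precision(ranking, relevant)
print(round(ap, 4))
```

Here relevant documents are retrieved at ranks 1 and 3, contributing precision values of 1 and 2/3. Dividing their sum by the four known relevant documents penalises the two relevant documents that were never retrieved.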
The AP computed for each sample query q is averaged over the set of all test topics Q to produce the final Mean Average Precision (MAP) for the retrieval system:
\[
\text{MAP} = \frac{\sum_{q=1}^{|Q|} \text{AP}(q)}{|Q|} \tag{2.6}
\]
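A minimal sketch of Equation 2.6 follows, averaging per-topic AP over a small, hypothetical topic set. The function names, the dictionary layout of `runs` and `judgements`, and all document identifiers are assumptions for illustration, not part of any evaluation toolkit.

```python
def average_precision(ranking, relevant):
    """AP over the full ranking for a single topic."""
    rel = set(relevant)
    if not rel:
        raise ValueError("AP is undefined when no relevant documents are known")
    hits = 0
    total = 0.0
    for k, doc in enumerate(ranking, start=1):
        if doc in rel:
            hits += 1
            total += hits / k
    return total / len(rel)


def mean_average_precision(runs, judgements):
    """Mean of AP(q) over every topic q in the judgements."""
    if not judgements:
        raise ValueError("MAP requires at least one test topic")
    # A topic with no returned ranking scores an AP of 0.
    scores = [average_precision(runs.get(q, []), rel) for q, rel in judgements.items()]
    return sum(scores) / len(scores)


# Hypothetical rankings and judgements for two topics (illustrative only).
runs = {
    "q1": ["d3", "d7", "d1", "d9", "d4"],
    "q2": ["d2", "d6", "d8"],
}
judgements = {
    "q1": {"d1", "d3", "d5", "d8"},
    "q2": {"d6"},
}

map_score = mean_average_precision(runs, judgements)
print(round(map_score, 4))
```

Averaging over topics rather than pooling all documents gives each topic equal weight, regardless of how many relevant documents it has.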
2.5.4
Chapter Summary
In this chapter I introduced the fundamental aspects of IR. I described the common types of IR tasks performed by users, and how these provide the foundation for IR research and development. I briefly summarised the principles of common retrieval model families, namely Boolean, vector space, traditional probabilistic and language model approaches. Following this, I outlined the system- and user-oriented experimental methodologies typically employed in IR research and development to quantify user satisfaction, and associated retrieval system effectiveness. Finally, I detailed standard set-based and ranked effectiveness evaluation metrics used to characterise retrieval performance.
In the next chapter I provide background and motivation for the role of time in IR. As part of this, I re-examine many of the fundamentals of IR presented in this chapter with respect to time.