Chapter 3 describes the algorithms used for creating and searching signatures in TOPSIG; however, numerous other details must be addressed before a search engine can start producing reasonable results. These range from fundamentals, such as determining where terms in a document begin and end, to algorithms that weight the scores of terms so that important terms are less likely to be overwritten by less important ones. TOPSIG exposes these refinements and their settings through the configuration system.
To determine whether particular refinements work, how effective they are, what settings to use for them, and whether those that reduce indexing or search performance are worthwhile, it is necessary to have some means of empirically evaluating the effectiveness of a search engine. All of the collections being tested have associated relevance judgements, which can be used in conjunction with evaluation tools to produce scores that indicate whether a given refinement is effective.
6.1 Evaluation metrics
In information retrieval it is important to be able to make an empirical determination about how well a given search engine or ranking approach works. This is useful both for improving an engine in isolation and for performing comparative analysis of multiple search engines.
One issue with this is that search engine value is a difficult concept to pin down with a precise definition, and an even harder one to measure objectively. Despite this, there are a
number of popular measures that are used in the information retrieval field.
6.1.1 Dichotomous classification
The simplest form of document classification, and the form that this chapter looks at, is that of dichotomous classification¹ [Järvelin and Kekäläinen, 2000]. In this classification system documents are determined to be either relevant or not relevant with respect to a particular query. Other classification schemes may code certain parts of the document as being relevant while other parts are not, and others may assign a value to each document quantifying the degree to which it is relevant; certain documents may be considered to be of a higher value than others, and although both are useful to the user, it would be preferable to return the more valuable one if a choice had to be made.
For the purposes of search engine evaluation, relevance judgements (as described in Appendix B) are available. These are created by volunteers and consist of lists of documents that are relevant to each query in a set of queries. Each system runs the set of queries and returns a list of documents it considers to be the most relevant, in order of relevance. These lists are evaluated using the relevance judgements as ground truth and one or more information retrieval metrics to produce a final score. The trec_eval [Buckley, 2004] tool was used to perform these evaluations. Popular evaluation metrics that make use of dichotomous classification include precision at n (§ 6.1.3) and mean average precision (§ 6.1.5).
Graded classification
The primary alternative to dichotomous classification is graded classification, in which each document is given a relevance score. Graded classification is used to express the concept that some documents are more relevant than others, and systems evaluated with a graded classification scheme are rewarded not only for finding the right documents, but also for returning them in the right order. Popular evaluation metrics that make use of graded classifications are normalised discounted cumulative gain [Järvelin and Kekäläinen, 2000] and expected reciprocal rank [Chapelle et al., 2009].
As not all of the document collections used in this research have graded classifications available, dichotomous classification is used for all evaluations presented in this monograph.
¹ Also referred to as binary classification.
6.1.2 Precision and recall
The two main measures from which most evaluation metrics are derived are precision and recall [Voorhees, 2007].
Precision
Precision refers to the proportion of documents, out of the documents that have been retrieved for the user, that are relevant to the user’s query.
If an information retrieval system returns 10 documents and 7 of them meet the user’s needs, the system is said to have a precision of 0.7.
Recall
Recall refers to the proportion of relevant documents in the entire collection that have been retrieved for the user in response to the user’s query.
If an information retrieval system returns 10 documents and 7 of them meet the user’s needs, but there are a total of 20 documents in the collection that meet the user’s needs, the system is said to have a recall of 0.35.
These measures look at different needs of search engine users. A user who wants to get a quick overview of a subject will usually have their needs best met by a system with high precision, particularly on the first page of results. A user who wants to thoroughly investigate a given topic will be best suited by a system offering high recall. Effectively, precision penalises type I errors while recall penalises type II errors.
Recall and precision are rarely used as metrics by themselves. A system would be able to maximise recall simply by presenting every document in the collection to the user. While such a system would certainly provide maximum recall, the precision would be too low for the system to be of any practical use to a user. Similarly, a search engine designed to maximise precision may find that this goal is met by only returning one document, especially if the system has a high degree of confidence in this document. The document may be highly relevant to the query, but with such a low degree of recall such a system would be of limited use to most users.
As a result, popular metrics for judging search engine quality attempt to find a balance between the two needs, incorporating precision and/or recall in addition to other metrics to produce a final result.
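The worked figures given above for precision and recall can be reproduced with a short sketch. The function names and document identifiers here are hypothetical and chosen purely for illustration; they are not part of TOPSIG or trec_eval.

```python
def precision(retrieved, relevant):
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(retrieved)


def recall(retrieved, relevant):
    """Fraction of all relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved if doc in relevant)
    return hits / len(relevant)


# Hypothetical identifiers: 20 relevant documents exist in the collection,
# and the system returns 10 documents, 7 of which are relevant.
relevant = {f"rel{i}" for i in range(20)}
retrieved = [f"rel{i}" for i in range(7)] + [f"other{i}" for i in range(3)]

p = precision(retrieved, relevant)  # 7 / 10
r = recall(retrieved, relevant)     # 7 / 20
print(f"precision={p:.2f} recall={r:.2f}")
```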
6.1.3 Precision at n
One of the simplest information retrieval metrics, precision at n is the precision calculated over only the first n documents retrieved. n is typically a fixed value such as 5 or 10, and search engines are ranked based on how well they retrieve these early documents.
The essential idea is that a typical search engine interaction will not involve going past the first page of results. With this measure, the better job a search engine does of filling up this first page with useful information for the user, the better the search engine will rank.
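A minimal sketch of precision at n follows. The function name and the example ranking are hypothetical. Dividing by n rather than by the number of results actually returned is an assumption of this sketch; it means that a result list shorter than n is penalised.

```python
def precision_at_n(ranking, relevant, n):
    """Precision over the first n entries of a ranked result list.

    Assumption of this sketch: the denominator is always n, so a
    ranking with fewer than n results cannot score 1.0.
    """
    top = ranking[:n]
    hits = sum(1 for doc in top if doc in relevant)
    return hits / n


# Hypothetical ranked list (best first) and relevance judgements.
ranking = ["d3", "d9", "d1", "d7", "d4", "d8", "d2", "d6", "d5", "d0"]
relevant = {"d3", "d1", "d4", "d5"}

p5 = precision_at_n(ranking, relevant, 5)    # d3, d1 and d4 are in the top 5
p10 = precision_at_n(ranking, relevant, 10)  # all 4 relevant documents found
print(p5, p10)
```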
6.1.4 Precision at recall
Precision at recall is similar to precision at n, but instead of judging precision after a certain fixed number of results have been returned, precision is instead judged after a certain proportion of the relevant documents in the collection have been returned. For example, the precision at recall of 0.1 is the precision after 10% of the relevant documents have been returned.
This metric normalises differences in results between topics that can arise as a result of different topics having differing levels of representation within the document collection.
Given two queries of similar difficulty, if one query has more relevant documents in the collection, searches for it will naturally result in a higher precision at n score than searches for the other query. Precision at recall also makes it possible to average the scores of different topics, which is highly desirable to ensure that search engine ranking is not unduly influenced by a single topic. With precision at n, the queries with poorer representation in the collection will be correspondingly poorly represented in the final average. To illustrate this, consider two queries of identical difficulty, one with 4 relevant documents in the collection and another with 8, evaluated with precision at 10. A search engine that returns half of the first query’s relevant documents but all of the second query’s will receive an average precision at 10 of (0.2 + 0.8) / 2 = 0.5, while a search engine that retrieves all of the first’s but half of the second’s will only receive a score of (0.4 + 0.4) / 2 = 0.4.
Figure 6.1: An example of a recall-precision curve, calculated from a TOPSIG run with 50 topics on the WSJ87-92 collection
Precision at recall scores are usually interpolated with other scores at neighbouring recall proportions.
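The uninterpolated form of precision at recall described in this section can be sketched as follows. The function name and example data are hypothetical. Returning 0.0 when the ranking never reaches the requested recall is an assumption of this sketch.

```python
import math


def precision_at_recall(ranking, relevant, recall_level):
    """Precision at the rank where recall first reaches recall_level.

    Assumption of this sketch: returns 0.0 if the ranking never
    retrieves enough relevant documents to reach that recall.
    """
    # Number of relevant documents needed to reach the recall level.
    # The small epsilon guards against floating-point error in the product.
    needed = max(1, math.ceil(recall_level * len(relevant) - 1e-9))
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            if hits >= needed:
                return hits / rank
    return 0.0


# Hypothetical ranked list (best first) and relevance judgements.
ranking = ["d3", "d9", "d1", "d7", "d4", "d8", "d2", "d6", "d5", "d0"]
relevant = {"d3", "d1", "d4", "d5"}

p_half = precision_at_recall(ranking, relevant, 0.5)  # 2nd hit at rank 3
p_full = precision_at_recall(ranking, relevant, 1.0)  # 4th hit at rank 9
print(p_half, p_full)
```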
6.1.5 Average precision
Average precision simply refers to the score returned from averaging the precision at recall scores at every possible recall proportion.
A common method of presenting the interpolated precision at recall metric discussed in § 6.1.4 is in a graph, with precision plotted against recall. The average precision for a search is calculated as the average of all the individual precision values. An example of a graph of this nature is Figure 6.1. The average precision of this run is the area under the recall-precision curve; in this case, 0.1262.
The related mean average precision (MAP) score is the mean of the average precision scores of all the topics in a run. This metric offers a way of reducing search engine performance over an entire run down to a single figure for the purposes of comparison.
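Average precision and MAP can be sketched as below, using the common uninterpolated formulation in which precision is sampled at the rank of each relevant document retrieved and relevant documents that are never retrieved contribute zero. The function names and example topics are hypothetical, and the sketch assumes that no document appears twice in a ranking.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each relevant document's rank.

    Relevant documents missing from the ranking contribute 0, because
    the sum is divided by the total number of relevant documents.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(topics):
    """Mean of average precision over (ranking, relevant) pairs."""
    if not topics:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Two hypothetical topics with their rankings and relevance judgements.
topic_a = (["a1", "x", "a2"], {"a1", "a2"})  # AP = (1/1 + 2/3) / 2
topic_b = (["y", "b1"], {"b1", "b2"})        # AP = (1/2) / 2
ap_a = average_precision(*topic_a)
ap_b = average_precision(*topic_b)
map_score = mean_average_precision([topic_a, topic_b])
print(ap_a, ap_b, map_score)
```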
6.2 Performance evaluation
Running the TOPSIG signature search implementation, implemented almost exactly as defined in chapters 3, 4 and 5, with the WSJ87-92 collection (§ B.1) as a test corpus produces a mean average precision (MAP) score of 0.0311, as reported by trec_eval (Table 6.1a). This represents the area under the recall-precision curve (Table 6.1b), which trec_eval has calculated as 3.11%.

IP@r    Score
0.00    0.3733
0.10    0.0775
0.20    0.0530
0.30    0.0320
0.40    0.0244
0.50    0.0000
0.60    0.0000
0.70    0.0000
0.80    0.0000
0.90    0.0000
1.00    0.0000
MAP     0.0311
P@10    0.1760
(a) trec_eval results
(b) A plot of interpolated recall-precision data for the run. The MAP score is the total area under the curve.
Table 6.1: Results from a reference TOPSIG run with all special features disabled
6.2.1 Reported evaluation measures
trec_eval reports the results of a number of evaluation measures when run. The three measures presented in Table 6.1a are:
IP@r (Interpolated precision at recall)
An interpolated form of the precision at recall measurement described in § 6.1.4, sampled at 11 evenly spaced points from 0.00 recall to 1.00 recall inclusive. The interpolation is performed because results may appear at any recall, and sampling precision only at specific recall intervals would result in an unnatural stair-stepping effect in the resulting precision-recall curve. Interpolating the precision values also allows trec_eval to report a sensible result at 0.00 recall.
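One widely used interpolation rule takes the interpolated precision at recall level r to be the highest precision observed at any recall greater than or equal to r. The sketch below applies that rule at the 11 standard recall points. It is an illustration under that assumption, with hypothetical names and data, and is not taken from the trec_eval source.

```python
def interpolated_precision_11pt(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0.

    Assumed rule: the value at recall level r is the maximum precision
    observed at any recall >= r, or 0.0 if that recall is never reached.
    """
    # (recall, precision) measured after each relevant document retrieved.
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))

    curve = []
    for i in range(11):
        level = i / 10
        # The epsilon guards against floating-point error in the comparison.
        candidates = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(candidates, default=0.0))
    return curve


# Hypothetical topic: 4 relevant documents, 3 of which are retrieved.
ranking = ["x", "d1", "d2", "y", "d3"]
relevant = {"d1", "d2", "d3", "d4"}
curve = interpolated_precision_11pt(ranking, relevant)
print([round(value, 4) for value in curve])
```

Because recall never exceeds 0.75 in this example, the interpolated precision is 0.0 from 0.80 recall onwards, while the value at 0.00 recall is simply the best precision reached anywhere in the ranking.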