5.6 Meta-Evaluation with Different References
6.1.3 Evaluating Estimation Accuracy
There are two different aspects that must be considered when evaluating the quality of an estimation: the estimation error of the metric scores, and the correctness of the predicted system orderings.

Mean Absolute Error and Root-Mean-Square Error. An elementary approach for evaluating estimation quality is to calculate the difference between the estimated performance and the “true” performance of a system. We refer to the mean of the absolute values of these differences as the mean absolute error (MAE). Similarly, the classic root-mean-square error (RMSE) can also be calculated, which penalizes large discrepancies more heavily. However, when metric scores are represented as a range rather than as a point, as RBP scores are, the best way to calculate MAE and RMSE is less obvious.

Consider the residual calculated by RBP, as defined in Equation 2.23, as an example. As shown in Figure 6.1, suppose a depth 100 judgment pool can find the majority of relevant documents for a topic, and the solid line shows the “true” effectiveness score of a retrieval system. By definition, the RBP score is represented using an interval, as shown in “Final Range”. When only a set of depth d = 10 judgments is available, as shown by the dashed line, the lower bound score is much smaller than the true effectiveness score (solid line). Using the set of d = 10 judgments, we can make estimations of the true system effectiveness, and there are three possible cases, shown as “A”, “B” and “C” in Figure 6.1. The error values can be calculated using the difference between the estimated values and the two score bounds (LB, UB). Let $\hat{M}_i$ be the estimated score for metric $M$ of system $i$; then the absolute estimation error $\epsilon_i$ for each topic is:
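$$
\epsilon_i =
\begin{cases}
\hat{M}_i - UB, & \text{if } \hat{M}_i > UB \\
0, & \text{if } LB \le \hat{M}_i \le UB \\
LB - \hat{M}_i, & \text{if } \hat{M}_i < LB.
\end{cases}
\tag{6.2}
$$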
This definition respects the residual range, and only gives non-zero values if the estimated effectiveness falls outside the score range arising from the use of “complete judgments” at depth d. Let n be the number of systems estimated; then, for each topic, RMSE is computed as:
$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \epsilon_i^2}, \tag{6.3}
$$
where $\epsilon_i$ is defined in Equation 6.2. The overall RMSE value is averaged over the entire set of topics for each dataset considered.
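The interval-aware error of Equation 6.2 and the per-topic RMSE of Equation 6.3 can be sketched in a few lines of Python. This is a minimal illustration only: the function names `interval_error`, `topic_rmse`, and `overall_rmse`, and the choice of plain lists and `(LB, UB)` tuples as inputs, are assumptions made for the example rather than anything defined in the text.

```python
import math


def interval_error(estimate, lb, ub):
    """Absolute error of an estimate relative to the range [lb, ub] (Equation 6.2)."""
    if estimate > ub:
        return estimate - ub
    if estimate < lb:
        return lb - estimate
    return 0.0  # the estimate falls inside the score range


def topic_rmse(estimates, bounds):
    """RMSE over the n systems for a single topic (Equation 6.3).

    estimates: one estimated score per system.
    bounds: one (LB, UB) pair per system, in the same order.
    """
    errors = [interval_error(m, lb, ub) for m, (lb, ub) in zip(estimates, bounds)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def overall_rmse(per_topic):
    """Average the per-topic RMSE values; per_topic is a list of (estimates, bounds)."""
    return sum(topic_rmse(est, bnd) for est, bnd in per_topic) / len(per_topic)
```

Under these assumptions, an estimate inside the range contributes nothing to the RMSE, so only estimates that overshoot UB or undershoot LB are penalized.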
Weighted Kendall’s τ distance. The second criterion for evaluating the estimation results is how well they approximate system orderings in a batch evaluation setting consisting of n systems. There are various measurements for examining the correlation between two ranked lists, as discussed in Section 2.4.5. Aside from computing the correlation coefficient, examining the distance between two lists is an alternative way of measuring the agreement between two metrics. One straightforward method is to compute Kendall’s τ distance, which counts the number of inverted pairs between two n-item orderings. In IR batch evaluation, where system performance is examined using a set of topics, we consider the relative order of systems $S_i$ and $S_j$ based on effectiveness metric means over a topic set. Let $\sigma_{i,j}$ represent the pairwise relationship between the effectiveness metric means $\bar{M}_i$ and $\bar{M}_j$ of systems $S_i$ and $S_j$ over a set of topics according to one measurement regime, with $\sigma_{i,j} \in \{-1, 0, 1\}$ indicating that $\bar{M}_i < \bar{M}_j$, that $\bar{M}_i = \bar{M}_j$, and that $\bar{M}_i > \bar{M}_j$, respectively. Let $\sigma'_{i,j}$ be the corresponding values for a second measurement regime and the system means that it induces, for example, using pooling to a different depth. Then Kendall’s normalized τ distance is the number of pairs $1 \le i < j \le n$ for which $\sigma_{i,j} \cdot \sigma'_{i,j} < 0$, divided by $n(n-1)/2$ to bring it into the range $0 \le \tau_{\mathrm{dist}} \le 1$, with 0 meaning “identical”.
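The normalized τ distance described above can be sketched as follows. The function name `kendall_tau_distance` and the representation of each measurement regime as a list of per-system mean scores are assumptions made for this illustration.

```python
from itertools import combinations


def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def kendall_tau_distance(means_a, means_b):
    """Normalized Kendall tau distance between two measurement regimes.

    means_a[i] and means_b[i] are the mean scores of system i under the
    first and second regime. A pair (i, j) is discordant when the two
    regimes order the systems in opposite directions.
    """
    n = len(means_a)
    if n != len(means_b) or n < 2:
        raise ValueError("need two score lists of equal length, n >= 2")
    discordant = 0
    for i, j in combinations(range(n), 2):
        s_first = sign(means_a[i] - means_a[j])
        s_second = sign(means_b[i] - means_b[j])
        if s_first * s_second < 0:
            discordant += 1
    return discordant / (n * (n - 1) / 2)
```

Pairs that are tied under either regime give a product of zero and so are not counted as discordant, matching the strict inequality in the definition.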
Paired t-tests are often used to quantify the strength of the relationship between two systems, and the values $\sigma_{i,j}$ and $\sigma'_{i,j}$ can also be defined as continuous values rather than discrete ones. Kumar and Vassilvitskii [53] describe a weighted τ distance that counts the strength of each discordant pair, focusing solely on cases where $\sigma_{i,j} \cdot \sigma'_{i,j} < 0$. In practice we are interested not only in the discordant pairs, but also in pairs that are deemed to be significantly different according to one of the measurement regimes but not the other, even if their overall relationship is concordant. Suppose that $\bar{M}_i > \bar{M}_j$ according to the mean effectiveness scores of the two systems, and that a paired one-tailed statistical test across topics yields the p-value $p_{i,j}$. Values of $p_{i,j}$ near zero indicate a significant superiority of $S_i$ over $S_j$; values close to 0.5 indicate that the observed difference could have arisen by chance. If we define
$$
\sigma_{i,j} =
\begin{cases}
0.5 - p_{i,j} & \text{if } \bar{M}_i > \bar{M}_j \\
0.0 & \text{if } \bar{M}_i = \bar{M}_j \\
p_{j,i} - 0.5 & \text{if } \bar{M}_i < \bar{M}_j,
\end{cases}
$$
then $-0.5 \le \sigma_{i,j} \le 0.5$ is a real-valued quantity that captures both the direction and strength of the relationship between the two systems. We compute $\sigma'_{i,j}$ similarly, considering both direction and strength, and then compare the two alternative rankings of the n systems using:
$$
\mathrm{dist} = \sum_{1 \le i < j \le n} \alpha \cdot \left| \sigma'_{i,j} - \sigma_{i,j} \right|, \tag{6.4}
$$
where $\alpha \ge 0$ is an additional scaling factor. For example, if $\alpha = |\sigma_{i,j}|$ then the strength of the relationship between $S_i$ and $S_j$ according to the first measurement regime also influences the measured distance. Overall, if dist ≈ 0, then the two measurement regimes agree in terms of both the direction of each pairwise relationship $S_i$ versus $S_j$, and also its strength. If dist is substantially greater than zero, then the two measurement regimes give rise to many system pairs for which there are non-trivial disagreements (in both discordant and concordant pairs) over the strength of the measured relationships. Compared with Kendall’s τ distance, Equation 6.4 operates over continuous values, which makes it both resistant to inconclusive changes in rank position, and also sensitive to differences in which the direction of the relationship between $S_i$ and $S_j$ stays the same, but the statistical strength varies markedly.
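A minimal Python sketch of the signed strength values and the distance of Equation 6.4 follows. It assumes that the one-tailed p-values have already been obtained from a paired test across topics, and that the pairwise values are stored in dictionaries keyed by `(i, j)` with `i < j`. The names `signed_strength` and `weighted_distance`, and the modelling of α as a function of the first regime's value, are choices made for this example and are not taken from the text.

```python
def signed_strength(mean_i, mean_j, p_ij, p_ji):
    """Signed strength in [-0.5, 0.5] for the pair (S_i, S_j).

    p_ij is the one-tailed p-value for "S_i is better than S_j" and p_ji
    the one for "S_j is better than S_i"; both are assumed precomputed.
    """
    if mean_i > mean_j:
        return 0.5 - p_ij
    if mean_i < mean_j:
        return p_ji - 0.5
    return 0.0


def weighted_distance(sigma, sigma_prime, alpha=lambda s: 1.0):
    """Distance of Equation 6.4 between two measurement regimes.

    sigma and sigma_prime map each pair (i, j), i < j, to its signed
    strength under the first and second regime. alpha maps the first
    regime's value to a non-negative weight (constant 1 by default).
    """
    return sum(
        alpha(sigma[pair]) * abs(sigma_prime[pair] - sigma[pair])
        for pair in sigma
    )


# Hypothetical values for three systems under two regimes.
sigma = {(0, 1): 0.49, (0, 2): 0.2, (1, 2): -0.1}
sigma_prime = {(0, 1): 0.45, (0, 2): -0.2, (1, 2): -0.1}

unweighted = weighted_distance(sigma, sigma_prime)
weighted = weighted_distance(sigma, sigma_prime, alpha=abs)
```

In this made-up example the pair (0, 2) changes direction and dominates both totals, while the pair (0, 1) stays concordant and contributes only its small change in strength. Passing `alpha=abs` corresponds to the choice α = |σ_{i,j}| mentioned above.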