6.2 Approaches for Estimating System Effectiveness for Deep Metrics
6.2.3 A Two-Stage Estimation Framework
A natural extension of the rank-level estimators is to produce unified, global estimates for unjudged documents.
Preliminaries. Consider the system matrix $S_{k \times n}$ from the left-hand side of Figure 6.3, where each column is a system and each row is a vector of documents. Our aim is to extend the local estimation method presented in the previous section. To do this, we now consider a document-rank representation rather than the system-rank representation that was used to obtain the rank-level estimators. Figure 6.3 indicates that, for each document, we can obtain its ranking information from the set of contributing systems. The same is true of its rank-level estimation vector. Let $k_{i,j}$ be the rank of document $D_i$ given by system $S_j$, and $g^0_{j,m}$ be the rank-level estimate using $G_m(j)$. For any document $D$, we can obtain its rank vector $k_{D,i}$ ($1 \le i \le n$) and the corresponding rank-level estimates. For simplicity, we number the rank-level estimators and refer to them by number henceforth. Consider $D_1$ in Figure 6.6 and suppose the third model is used; then we can immediately represent $D_1$ as $\langle g^0_{1,3}, g^0_{10,3}, g^0_{\infty,3}, g^0_{1,3}, g^0_{1,3}, g^0_{1,3} \rangle$. All documents and all estimators can be treated similarly. We use $g^0_{*,m}$ to represent the rank-level gain vector for a document under the $m$-th model $G_m$, where the subscript “$*$” stands for $\langle k_{D,1}, k_{D,2}, \ldots, k_{D,n} \rangle$ for a given document $D$ (each row of the RHS matrix in Figure 6.6). We still focus on one topic and consider how to estimate system performance by estimating the background effectiveness gain of unjudged documents.

Figure 6.6: The ranking matrix for all documents (LHS), mapped from contributing systems (RHS) in Figure 6.3. The total number of rows is the number of documents in the collection and the total number of columns is the same as the number of contributing systems. An ∞ represents the document not appearing in the top 1,000 of the corresponding system.
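As an illustration of the document-rank representation, the following minimal Python sketch maps a set of ranked lists to per-document rank vectors. The function name `document_rank_matrix`, the toy runs, and the assumption that no run contains duplicate documents are ours and are not part of the method description.

```python
import math

def document_rank_matrix(runs):
    """Map each document to its vector of 1-based ranks, one entry per run.

    `runs` is a list of n ranked lists (one per system). A document that is
    absent from a run receives rank math.inf, mirroring the infinity entries
    of the ranking matrix. Assumes no run lists a document twice.
    """
    docs = sorted({d for run in runs for d in run})
    positions = [{d: r for r, d in enumerate(run, start=1)} for run in runs]
    return {d: [pos.get(d, math.inf) for pos in positions] for d in docs}

# Three hypothetical systems, each returning three documents.
runs = [["D1", "D2", "D3"], ["D2", "D1", "D4"], ["D3", "D4", "D1"]]
ranks = document_rank_matrix(runs)
```

Each value in `ranks` plays the role of one row of the ranking matrix; applying a rank-level estimator element-wise to such a row would give the corresponding rank-level gain vector.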
Overview of the Two-Stage Estimation Framework. To compute score estimates, we propose a two-stage framework, guided by a unified optimization goal, and built on a set of $m \ge 1$ per-topic rank-level estimators. The overall structure of this mechanism is described in Algorithm 5. We omit the process of obtaining rank-level estimates, discussed in the previous section, and assume as our starting point that $m$ different rank-level estimators have been generated, each derived from the judged documents $D \in J'$, and that values for a set of gain functions have been computed, with $g^0_{j,\ell}$ the gain associated with an unjudged document that appears in the $j$-th position of any of the $n$ system rankings, as predicted by the $\ell$-th of the $m$ estimators. Prior to forming the new combined estimates, we first compute the coefficient of covariance $\gamma$ from the judgment set [22], in order to determine whether to use a background “unjudged are not relevant” predictor. Estimation is carried out in steps 4 to 15, with $h_1(\cdot)$ and $h_2(\cdot)$ two parametric combining functions whose parameters are obtained by minimizing a loss function $L(\cdot)$. We discuss the details of Algorithm 5, including the rationale behind the use of $\gamma$, in the next few paragraphs.
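Algorithm 5 Estimation Framework
Input: System matrix $S_{k \times n}$; partial relevance judgments $J'$ with $g^2[D]$ the gain associated with document $D$ for $D \in J'$ and undefined otherwise; and a set of $m$ rank-level background gain estimates, $g^0_{j,\ell}$ for $1 \le j \le k$ and $1 \le \ell \le m$, with $g^0_{*,\ell} \equiv \langle g^0_{j,\ell} \mid 1 \le j \le k \rangle$ and $g^1_*[D] \equiv \langle g^1_\ell[D] \mid 1 \le \ell \le m \rangle$.
Output: Values $g^2[D]$, gain estimates for the documents $D \in J \setminus J'$
 1: for $D \in J \setminus J'$ do $g^2[D] \leftarrow 0$
 2: $\gamma \leftarrow$ COMPUTECV($J'$, $S$)    // compute coefficient of covariance
 3: if $\gamma > \theta$ then    // adjust only if $\gamma$ exceeds threshold
 4:     for $\ell \leftarrow 1$ to $m$ do
 5:         for $D \in J'$ do $g^1_\ell[D] \leftarrow 0$
 6:         $w^1_{\mathrm{opt}} \leftarrow \arg\min_{w^1 \in [0,1]^n} L(h_1(g^0_{*,\ell}, w^1) \mid D \in J')$
 7:         for $D \in J$ do
 8:             $g^1_\ell[D] \leftarrow h_1(g^0_{*,\ell}, w^1_{\mathrm{opt}})$
 9:         end for
10:     end for
11:     $w^2_{\mathrm{opt}} \leftarrow \arg\min_{w^2 \in [0,1]^m} L(h_2(g^1_*[D], w^2) \mid D \in J')$
12:     // get final per-document estimation
13:     for $D \in J \setminus J'$ do
14:         $g^2[D] \leftarrow h_2(g^1_*[D], w^2_{\mathrm{opt}})$
15:     end for
16: end if
17: return $g^2$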
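The control flow of Algorithm 5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the implementation used in our experiments: the names `two_stage_estimate`, `weighted_average`, and `fit_weights` are hypothetical, γ is passed in precomputed, and `uniform_fit` is only a placeholder for the two arg-min steps (it returns equal weights and performs no optimization).

```python
def weighted_average(values, weights):
    """Combining function used for both h1 and h2 in this sketch."""
    return sum(v * w for v, w in zip(values, weights))

def two_stage_estimate(docs, judged, rank_gains, gamma, theta, fit_weights):
    """Control flow of Algorithm 5 (illustrative only).

    docs        -- all documents J
    judged      -- dict: judged document -> known gain
    rank_gains  -- list of m dicts: document -> list of n rank-level gains
    fit_weights -- hypothetical callable(vectors, targets) -> weight list
    """
    g2 = {d: 0.0 for d in docs if d not in judged}             # step 1
    if gamma <= theta:                                         # step 3
        return g2                                              # no adjustment
    g1 = {d: [] for d in docs}
    for gains in rank_gains:                                   # step 4
        w1 = fit_weights({d: gains[d] for d in judged}, judged)    # step 6
        for d in docs:                                         # steps 7-9
            g1[d].append(weighted_average(gains[d], w1))
    w2 = fit_weights({d: g1[d] for d in judged}, judged)       # step 11
    for d in g2:                                               # steps 13-15
        g2[d] = weighted_average(g1[d], w2)
    return g2

def uniform_fit(vectors, targets):
    """Placeholder for the arg-min steps: returns equal weights."""
    size = len(next(iter(vectors.values())))
    return [1.0 / size] * size

docs = ["A", "B", "C"]
judged = {"A": 1.0}
rank_gains = [
    {"A": [1.0, 1.0], "B": [0.5, 0.5], "C": [0.2, 0.4]},   # estimator 1
    {"A": [0.8, 0.8], "B": [0.4, 0.4], "C": [0.1, 0.3]},   # estimator 2
]
adjusted = two_stage_estimate(docs, judged, rank_gains, 0.5, 0.1, uniform_fit)
unadjusted = two_stage_estimate(docs, judged, rank_gains, 0.05, 0.1, uniform_fit)
```

With γ above the threshold, unjudged documents receive combined background gains; with γ below it, they keep the default gain of zero, which corresponds to the “unjudged are not relevant” assumption.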
First Stage. As noted already, one problem with rank-level estimators is the potential inconsistency across runs of the gain attached to any particular document. In this stage, we consider one rank-level estimator at a time and aggregate the $n$ local estimates for each document into a single estimate. That is, the $m$ rank-level estimators are treated separately at first, in the loop at step 4, to obtain a consistent background gain for each document $D$ for each model, denoted $g^1_\ell[D]$. This is done via a combining function $h_1(\cdot)$ that maps a vector to a single value. There are many ways to implement $h_1(\cdot)$, depending on the assumptions made about the relationships among the $n$ systems. For simplicity, we assume that the systems are independent and that they vary in quality. Therefore, for each document $D$, a natural combining function is a weighted average, with $h_1$ (step 6) parameterized by an $n$-element weighting vector $w^1$ that is specific to the $\ell$-th estimator:
$$\forall D \in J', \quad h_1(g^0_{*,\ell}, w^1 \mid D) = \sum_{i=1}^{n} g^0_{k_{D,i},\ell} \cdot w^1_i, \quad \text{with } \sum_{i=1}^{n} w^1_i = 1 \text{ and } w^1_i \in [0, 1], \tag{6.6}$$
and where $g^0_{k_{D,i},\ell}$ applies the $\ell$-th estimator to the rank at which document $D$ appears in the $i$-th of the $n$ runs. One practical issue is that a document may not be retrieved by all systems in their top-$k$ ranked lists, where $k$ is the maximum depth of the lists returned. In such cases, the rank-based background gain of that document for that system is set to the modeled gain at depth $k$.
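The weighted average of Equation 6.6, including the fallback to depth $k$ for documents that a system does not retrieve, might be sketched as follows. The function `toy_estimator` is a hypothetical stand-in for a fitted rank-level estimator, and the weights are arbitrary illustrative values rather than optimized ones.

```python
import math

def h1(rank_vector, weights, estimator, depth):
    """Weighted average of rank-level gains for one document (Equation 6.6).

    rank_vector -- the document's rank in each of the n runs (math.inf if absent)
    weights     -- n weights, each in [0, 1], summing to 1
    estimator   -- callable mapping a rank to a modeled gain
    depth       -- maximum list depth k, used when the document is not retrieved
    """
    if abs(sum(weights) - 1.0) > 1e-9 or any(w < 0 or w > 1 for w in weights):
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    total = 0.0
    for rank, w in zip(rank_vector, weights):
        # A document missing from a run takes the gain modeled at depth k.
        effective_rank = depth if math.isinf(rank) else rank
        total += estimator(effective_rank) * w
    return total

def toy_estimator(rank):
    """Hypothetical rank-level estimator: gain decays as 1/rank."""
    return 1.0 / rank

value = h1([1, 4, math.inf], [0.5, 0.25, 0.25], toy_estimator, depth=10)
```

In this toy case the three runs contribute gains of 1, 0.25, and 0.1 (the last from the depth-10 fallback), giving a combined value of 0.5875.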
To compute a value for $w^1$, we treat the aggregation process as an optimization problem in which the goal is to minimize the estimation error. The estimation error can be viewed at two granularities, based on the two representations in Figure 6.3 and Figure 6.6, respectively. On the set of judged documents, we can measure the difference between system effectiveness calculated using true and estimated relevance values; the first possible optimization goal is to minimize this difference. We refer to this loss as $L_a$. Alternatively, if we consider Figure 6.6, the goal can be to approximate the true relevance value of each document, in which case the loss is the overall error across all documents in the training set. We refer to this second loss as $L_b$. Either $L_a$ or $L_b$ can be used at step 6 of Algorithm 5.
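In the first case, our aim is to minimize the estimation error of the rank-level estimators with regard to the true system performance; as such, the loss function $L_a$ is defined as:

$$L_a(\cdot) = \sqrt{\sum_{i=1}^{n} \sum_{\substack{j=1 \\ s_{j,i} \in J'}}^{k} \left( W_M(j) \cdot \left( h_1(\cdot, w^1 \mid s_{j,i}) - r_{j,i} \right) \right)^2}, \tag{6.7}$$

where $J'$ is the set of judged documents and $h_1(\cdot, w^1 \mid s_{j,i})$ is the combining function defined in Equation 6.6, applied with a rank-level estimator. Since all $m$ rank-level estimators will be considered in turn, we can assume without loss of generality that the $\ell$-th is being processed in the current iteration. Therefore, for system $S_i$ and the pooling-to-$d'$ judgment set $J'$, the parameterized effectiveness score is

$$\sum_{\substack{j=1 \\ s_{j,i} \in J'}}^{k} W_M(j) \cdot h_1(\cdot, w^1 \mid s_{j,i}) = \sum_{\substack{j=1 \\ s_{j,i} \in J'}}^{k} W_M(j) \cdot \left( \sum_{i'=1}^{n} g^0_{k_{s_{j,i},i'},\ell} \cdot w^1_{i'} \right),$$

where the inner index $i'$ ranges over the $n$ systems and is primed to distinguish it from the fixed system index $i$.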
As noted, $L_a$ minimizes the overall estimation error of the evaluation scores for the set of systems, which can be obtained directly by comparing the estimated and true system scores.
The alternative loss function $L_b$ uses the document-position representation $(k_{D,1}, k_{D,2}, \ldots, k_{D,n})$:

$$L_b(\cdot) = \sum_{D \in J'} \sqrt{\sum_{i=1}^{n} \left( W_M(k_{D,i}) \cdot \left( h_1(\cdot, w^1 \mid D) - r_D \right) \right)^2}, \tag{6.8}$$

in which $r_D$ is the relevance value of document $D$, included only once for each document rather than once per document-rank. This equation takes the representation in Figure 6.6 into consideration and computes the difference between the estimated and true relevance for each document in the judgment pool $J'$. Whereas Equation 6.7 considers estimation errors at the system level, this loss function operates at the per-document level, seeking to minimize the overall estimation error for the weighted gain of each document. Either Equation 6.7 or Equation 6.8 is used at step 6 of Algorithm 5, with the combining function $h_1(\cdot)$ and the constraints defined in Equation 6.6. The result is a sequence of $w^1_{\mathrm{opt}}$ vectors, one for each of the $m$ rank-level estimators.
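To make the difference between the two losses concrete, the following sketch evaluates Equations 6.7 and 6.8 on a toy example. Several choices here are assumptions made only for illustration: the rank weight function uses a rank-biased-precision form with p = 0.5, the combined estimates in `est` are fixed numbers rather than outputs of Equation 6.6, and in `loss_b` a run that does not retrieve a document is given zero weight, a case the equations above leave open.

```python
import math

def rbp_weight(rank, p=0.5):
    """Illustrative rank weight W_M(j) = (1 - p) * p^(j - 1)."""
    return (1 - p) * p ** (rank - 1)

def loss_a(runs, rel, est, weight=rbp_weight):
    """System-level loss (Equation 6.7): one term per judged run position."""
    total = 0.0
    for run in runs:
        for j, doc in enumerate(run, start=1):
            if doc in rel:                      # only judged documents count
                total += (weight(j) * (est[doc] - rel[doc])) ** 2
    return math.sqrt(total)

def loss_b(runs, rel, est, weight=rbp_weight):
    """Document-level loss (Equation 6.8): one root-sum-square per document."""
    total = 0.0
    for doc, r in rel.items():
        squares = 0.0
        for run in runs:
            if doc in run:                      # assumption: absent => weight 0
                rank = run.index(doc) + 1
                squares += (weight(rank) * (est[doc] - r)) ** 2
        total += math.sqrt(squares)
    return total

runs = [["A", "B"], ["B", "C"]]     # two hypothetical systems, depth 2
rel = {"A": 1.0, "B": 0.0}          # judged documents and true relevance
est = {"A": 0.5, "B": 0.5}          # combined estimates for judged documents
la = loss_a(runs, rel, est)
lb = loss_b(runs, rel, est)
```

Here `la` pools all three judged positions under a single square root, whereas `lb` computes one root per judged document and sums them, so the two losses generally differ even when the per-position errors are identical.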
Second Stage. Multiple fitting models have been proposed because different assumptions about the underlying relevance distributions across all systems are plausible, with a risk that no single model covers the true hypothesis space. Indeed, the limited, non-random training data means that we may suffer from high variance if only one model is considered, and it is unclear which model is better in the extrapolation stage. Since all of the proposed rank-level estimators capture the characteristics of a topic to some extent, a “meta” optimizer is used to combine the results from the first stage, as described by steps 11 to 15. As in the first stage, we need to define a combining function to aggregate the outputs from all models in the previous step.
A weighted average is used in this role too, considering each document $D$ together with the estimated background gains generated by the $m$ previous computations, $g^1_*[D]$. That combiner, $h_2(\cdot)$ (step 11), is defined via the $m$-vector $w^2$ as:
$$\forall D \in J', \quad h_2(g^1_*[D], w^2) = \sum_{\ell=1}^{m} g^1_\ell[D] \cdot w^2_\ell, \quad \text{with } \sum_{\ell=1}^{m} w^2_\ell = 1 \text{ and } w^2_\ell \in [0, 1]. \tag{6.9}$$
Either $L_a$ or $L_b$ can be used in step 11, and the choice need not be the same as the one made in step 6. Note that the $m$-vector $w^2_{\mathrm{opt}}$, computed at step 11 by minimizing the chosen loss over the combiner of Equation 6.9, provides an indication of the importance of the individual optimizers from the previous stage. Previous work has shown that the expected error of combining loss functions is smaller than the average error of the results output by each optimizer in isolation [133].
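The second-stage fit can be illustrated with a small sketch that searches for $w^2$ on the probability simplex. The optimizer is not prescribed above, so the exhaustive grid search used here is a stand-in chosen for transparency, and the squared-error loss is a simplified substitute for $L_a$ or $L_b$; the names `simplex_grid` and `fit_w2` and all numeric values are hypothetical.

```python
def simplex_grid(m, steps):
    """Yield every m-vector of multiples of 1/steps whose entries sum to 1."""
    def compositions(parts, remaining):
        if parts == 1:
            yield (remaining,)
            return
        for first in range(remaining + 1):
            for rest in compositions(parts - 1, remaining - first):
                yield (first,) + rest
    for combo in compositions(m, steps):
        yield tuple(c / steps for c in combo)

def h2(stage1_gains, w2):
    """Weighted average of the m first-stage gains (Equation 6.9)."""
    return sum(g * w for g, w in zip(stage1_gains, w2))

def fit_w2(stage1, rel, steps=10):
    """Grid-search the weights minimizing squared error on judged documents."""
    m = len(next(iter(stage1.values())))
    def loss(w2):
        return sum((h2(stage1[d], w2) - r) ** 2 for d, r in rel.items())
    return min(simplex_grid(m, steps), key=loss)

# First-stage gains from m = 2 estimators; "C" is unjudged.
stage1 = {"A": [0.9, 0.1], "B": [0.2, 0.6], "C": [0.3, 0.7]}
rel = {"A": 0.5, "B": 0.4}
w2_opt = fit_w2(stage1, rel)
estimate_c = h2(stage1["C"], w2_opt)
```

In this toy case the judged gains are reproduced exactly by equal weights, so the search selects (0.5, 0.5), and the unjudged document "C" receives the estimate 0.5. In practice a constrained numerical optimizer would replace the grid search, since the grid grows quickly with $m$.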
Computing the Coefficient of Covariance. Recall that our entire estimation framework is built on the assumption that a shallow judgment pool cannot find a majority of the relevant documents. However, some topics may have only a small number of relevant documents, and a shallow depth may be sufficient to identify most of them, making adjustment unnecessary. When the shallow judgment pool is sufficient for identifying relevant documents, assuming that unjudged documents are not relevant gives a good estimate of system performance. Only if deeper pooling would identify further relevant documents can score adjustment have an effect on system effectiveness scores. We propose the use of the coefficient of covariance [22] as an indicator, which is computed in step 2 of Algorithm 5.
We consider pooling as a process of sampling with replacement, with an unknown probability of a relevant document being sampled. The judgment process removes duplicate documents from the original pool, ignoring the selection frequency of each document. The intuition behind $\gamma$ is to use that frequency information to describe the sample coverage of relevant documents. In other words, if relevant documents have been selected with high frequency, then it is more probable that the pooling process is sufficient. Consider the system matrix $S$ in Figure 6.3 and a pooling depth $d'$. Each document $s_{j,i}$ ($1 \le j \le d'$, $1 \le i \le n$) has a multiplicity in $S_{d' \times n}$; we then group the documents by that frequency count. Let $f_i$ be the number of relevant documents appearing $i$ times in $S_{d' \times n}$, $R' = \sum_i f_i$ the number of relevant documents, and $C = \sum_i i \cdot f_i$ the total occurrence count of relevant documents. For example, if only $D_8$ and $D_1$ in Figure 6.3 are identified as relevant documents, then we have $f_3 = 1$, $f_5 = 1$, $R' = 2$, and $C = 8$.

[Figure 6.7 plots. x-axis: Pooling Depth (10 to 100); y-axis: Mean γ (×10⁻³). Left panel datasets: TREC5, TREC7, TREC8, Rob04. Right panel datasets: TREC9, TREC10, TB04, TB05, TB06.]

Figure 6.7: The average γ values on NewsWire (left) and Web (right) collections over topics × runs. The values are plotted against different pooling depths.

Based on these elements, the coefficient of covariance $\gamma$ is estimated following Chao and Lee [22]:
$$\gamma^2 = \max\left\{ \frac{|R'|}{1 - f_1/C} \cdot \frac{\sum_i i \cdot (i - 1) \cdot f_i}{C \cdot (C - 1)} - 1,\; 0 \right\}. \tag{6.10}$$
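Equation 6.10 can be computed directly from the occurrence counts of the relevant judged documents, as in the following sketch. The function name `coefficient_gamma` is hypothetical, and raising an error when the expression is undefined (when $C < 2$, or when every relevant document occurs exactly once so that $1 - f_1/C = 0$) is a choice made for this sketch only.

```python
import math
from collections import Counter

def coefficient_gamma(occurrence_counts):
    """Estimate gamma (Equation 6.10) from per-document occurrence counts.

    occurrence_counts -- one entry per relevant judged document, giving how
    many times that document appears in the depth-d' pool across the n runs.
    """
    f = Counter(occurrence_counts)            # f[i]: relevant docs seen i times
    r = sum(f.values())                       # R', number of relevant documents
    c = sum(i * fi for i, fi in f.items())    # C, total occurrence count
    if c < 2 or f[1] == c:
        # Equation 6.10 divides by C(C - 1) and by 1 - f1/C.
        raise ValueError("gamma is undefined for these counts")
    coverage = 1.0 - f[1] / c
    pairs = sum(i * (i - 1) * fi for i, fi in f.items())
    gamma_sq = max((r / coverage) * pairs / (c * (c - 1)) - 1.0, 0.0)
    return math.sqrt(gamma_sq)

gamma_example = coefficient_gamma([3, 5])     # f3 = 1, f5 = 1, R' = 2, C = 8
gamma_skewed = coefficient_gamma([1, 1, 6])   # two singletons, one frequent doc
```

For the worked example with $f_3 = 1$ and $f_5 = 1$, the first argument of the max is $2 \cdot 26/56 - 1 < 0$, so γ is clipped to 0. The second, more skewed set of counts has the same $C = 8$ but gives $\gamma^2 = 8/7$.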
When γ = 0, the probability of sampling a relevant document follows a uniform distribution; and when γ is high, the distribution is skewed, and it is likely that more relevant documents exist due to the low sampling coverage. Based on this, we have two hypotheses:
Hypothesis 1: γ tends to decrease as pooling depth increases.
Hypothesis 2: There is a threshold θ, where if γ < θ, then the existence of unjudged documents will only negligibly affect the estimate of the system performance, and they can be ignored.
Average (over topics) γ values are plotted against pool depth in Figure 6.7, showing that γ decreases as the pooling depth increases. This is as expected, since increasing the pooling depth results in a more complete judgment set. Among the plotted datasets, TB04–TB06 have the largest average γ, which corresponds to a high relevance rate (Table 6.2). All earlier TREC collections have a smaller γ value since they are relatively “complete” compared to TB04–TB06. TREC-5 is a relatively complete test collection, and hence has the lowest γ among the datasets plotted.