2.6 Collection selection
2.6.2 Document-surrogate methods
Document-surrogate methods are typically designed for uncooperative environments where the complete lexicon information of collections is not available. However, these techniques could, in theory, also be applied in cooperative environments. Document-surrogate methods do not rank collections solely based on the computed similarities of queries and representation sets; they also use the ranking of sampled documents for collection selection. This is a step away from treating collections as a large single document or vocabulary distribution (as in lexicon-based methods), and somewhat retains document boundaries.
ReDDE. The relevant document distribution estimation (ReDDE) method [Si and Callan, 2003a] ranks collections according to the estimated number of relevant documents they contain. The broker creates a central index of all sampled documents. Each query is compared against this index before it is submitted to the collections. The number of relevant documents in each collection is estimated according to the contribution of collections in the top-ranked documents of the central index, and the goodness value of each collection is calculated as:
$$\mathrm{Goodness}(q, c_i) = \frac{\widehat{\mathrm{rel}}(q, c_i)}{\sum_{j} \widehat{\mathrm{rel}}(q, c_j)} \tag{2.18}$$
Here, $\widehat{\mathrm{rel}}(q, c_i)$ is the estimated number of relevant documents in collection $c_i$ for the query $q$, and is calculated as:
$$\widehat{\mathrm{rel}}(q, c_i) = \sum_{d \in S_{c_i}} P(\mathrm{rel}, d) \times \frac{\widehat{|c_i|}}{|S_{c_i}|} \tag{2.19}$$
Here, $P(\mathrm{rel}, d)$ is the probability of relevance of document $d$ in the collection $c_i$; $|S_{c_i}|$ is the number of sampled documents downloaded by QBS from collection $c_i$, and $\widehat{|c_i|}$ is the estimated number of documents in $c_i$.
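Equations 2.18 and 2.19 can be illustrated with a minimal Python sketch. It assumes that the relevance probability of every sampled document is already available; the function name `redde_goodness`, the input layout, and the example numbers are hypothetical and are not taken from Si and Callan [2003a].

```python
def redde_goodness(rel_probs, est_sizes):
    """Normalized goodness per collection (Eqs. 2.18 and 2.19).

    rel_probs: collection name -> list of P(rel, d), one per sampled document
    est_sizes: collection name -> estimated collection size
    Both inputs are illustrative assumptions.
    """
    est_rel = {}
    for name, probs in rel_probs.items():
        if not probs:
            est_rel[name] = 0.0  # no samples, so no evidence of relevance
            continue
        # Each sampled document stands in for (estimated size / sample size) documents.
        scale = est_sizes[name] / len(probs)
        est_rel[name] = sum(probs) * scale
    total = sum(est_rel.values())
    if total == 0:
        return {name: 0.0 for name in est_rel}
    return {name: value / total for name, value in est_rel.items()}


# Hypothetical data: two collections with four sampled documents each.
rel_probs = {"A": [1.0, 1.0, 0.0, 0.0], "B": [1.0, 0.0, 0.0, 0.0]}
est_sizes = {"A": 1000, "B": 8000}
scores = redde_goodness(rel_probs, est_sizes)
```

In this toy input, collection B has fewer relevant samples than A, yet it receives the higher goodness value because each of its samples represents many more documents.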
UUM. Si and Callan [2004a] proposed a unified utility maximization framework (UUM) for collection selection in uncooperative environments. In UUM, the samples of all collections are gathered in a central index. A set of queries is then used to train a model that maps the score $w_d$ of any given document $d$ in the central index to its probability of relevance:
$$P(\mathrm{rel} \mid d) = \frac{\exp(a + b \cdot w_d)}{1 + \exp(a + b \cdot w_d)} \tag{2.20}$$
The parameters a and b are estimated in the training stage. Once a query is entered, the central scores of all sampled documents are computed. Having the mapping function and the estimated collection sizes, UUM estimates the relevance probabilities of documents returned by collections.
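The logistic mapping from central score to relevance probability can be sketched as follows. The fitting routine is an assumption made for illustration (plain batch gradient descent on the log loss); the training procedure of Si and Callan [2004a] is not described here, and the names `relevance_probability` and `fit_logistic`, as well as the toy training data, are hypothetical.

```python
import math


def relevance_probability(score, a, b):
    """Logistic mapping exp(a + b*w) / (1 + exp(a + b*w)) of a central score w."""
    z = a + b * score
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))  # stable form for large positive z
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit_logistic(scores, labels, lr=0.1, epochs=2000):
    """Estimate a and b by batch gradient descent on the log loss (an assumption)."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for w, y in zip(scores, labels):
            err = relevance_probability(w, a, b) - y
            grad_a += err
            grad_b += err * w
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Hypothetical training data: central scores and binary relevance judgments.
train_scores = [0.1, 0.2, 0.8, 0.9]
train_labels = [0, 0, 1, 1]
a, b = fit_logistic(train_scores, train_labels)
```

Once `a` and `b` are fitted, `relevance_probability(w, a, b)` can be applied to the central score of any sampled document at query time.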
UUM can be tuned to meet the high-recall goal of collection selection algorithms or the high-precision goal of federated information retrieval systems. To achieve higher recall values, UUM selects collections in a way that maximizes the number of relevant documents in selected collections. The utility function of UUM in such a scenario can be computed as:
$$U = \arg\max_{\vec{U}} \sum_{i=1}^{N_c} I(c_i) \sum_{j=1}^{\widehat{|c_i|}} P_{c_i}(\mathrm{rel}, d_j) \tag{2.21}$$
The total number of collections is represented by $N_c$, and $I(c_i)$ is a Boolean indicator that is set to one when a collection $c_i$ is selected, and to zero otherwise. The estimated size of collection $c_i$ is $\widehat{|c_i|}$, and $P_{c_i}(\mathrm{rel}, d_j)$ is the probability of relevance of the $j$th ranked document ($d_j$) in collection $c_i$. The set of utility values for all possible combinations of documents that are returned by selected collections is represented by $\vec{U}$.
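A minimal sketch of the high-recall selection follows. It rests on two assumptions that are not stated above: the broker may select a fixed number `k` of collections, and the sum over all documents of a collection is approximated by scaling the summed probabilities of its sampled documents by the ratio of estimated size to sample size. The function name and the data are hypothetical.

```python
def select_high_recall(sample_probs, est_sizes, k):
    """Pick the k collections with the most expected relevant documents.

    sample_probs: collection name -> relevance probabilities of its sampled documents
    est_sizes:    collection name -> estimated collection size
    k:            selection budget (an assumed constraint)
    """
    expected_rel = {
        name: sum(probs) * est_sizes[name] / len(probs)
        for name, probs in sample_probs.items()
        if probs  # collections without samples cannot be scored
    }
    ranked = sorted(expected_rel, key=expected_rel.get, reverse=True)
    return ranked[:k]


# Hypothetical data for three collections.
sample_probs = {"A": [0.9, 0.5], "B": [0.2, 0.1], "C": [0.6, 0.4]}
est_sizes = {"A": 100, "B": 10000, "C": 500}
selected = select_high_recall(sample_probs, est_sizes, k=2)
```

With a fixed budget, the indicator values that maximize the sum are simply those of the `k` collections with the largest inner sums, which is why sorting suffices in this sketch.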
The effectiveness of FIR methods is usually measured according to the precision values of the top-ranked documents in their merged results. To maximize the final precision, the UUM utility function can be defined as:
$$U = \arg\max_{\vec{U}} \sum_{i=1}^{N_c} I(c_i) \sum_{j=1}^{z_i} P_{c_i}(\mathrm{rel}, d_j) \tag{2.22}$$
where $z_i$ is the number of documents that are returned by default from collection $c_i$ for any query. There is one major difference between the high-recall and high-precision utility functions. In the high-recall UUM, the utility function is maximized by estimating the probability of relevance for each document in each collection. However, in the high-precision UUM, the function is maximized by estimating the probabilities of relevance of the top-ranked documents for each collection. UUM has been reported to be more effective than ReDDE and CORI [Si and Callan, 2004a].
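The high-precision variant can be sketched in the same spirit. The sketch assumes a fixed selection budget `k` and uses a deliberately crude step approximation: each sampled document, taken in descending order of relevance probability, stands in for a block of consecutive ranks in the full collection. This is a simplification for illustration, not the estimation procedure of Si and Callan [2004a]; all names and numbers are hypothetical.

```python
def top_z_utility(probs, est_size, z):
    """Expected number of relevant documents among the top z results of one collection."""
    ordered = sorted(probs, reverse=True)
    scale = est_size / len(ordered)  # full-collection ranks covered by one sample
    total = 0.0
    for j in range(min(z, int(est_size))):  # j is a 0-based rank in the full collection
        idx = min(int(j / scale), len(ordered) - 1)
        total += ordered[idx]
    return total


def select_high_precision(sample_probs, est_sizes, z, k):
    """Pick the k collections whose top z[name] results look most relevant."""
    utility = {
        name: top_z_utility(probs, est_sizes[name], z[name])
        for name, probs in sample_probs.items()
        if probs
    }
    return sorted(utility, key=utility.get, reverse=True)[:k]


# Hypothetical data: every collection returns 10 documents by default.
sample_probs = {"A": [0.9, 0.5], "B": [0.2, 0.1], "C": [0.6, 0.4]}
est_sizes = {"A": 100, "B": 10000, "C": 500}
z = {"A": 10, "B": 10, "C": 10}
selected = select_high_precision(sample_probs, est_sizes, z, k=2)
```

In this toy input the very large collection B is not selected: its size no longer matters once only the top $z_i$ results count, so collections with highly relevant top-ranked documents are preferred.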
RUM. The collection selection algorithms discussed so far ignore the search effectiveness of available collections. That is, they usually assume that all collections are using effective retrieval models; ignoring the search effectiveness of available collections may significantly alter the final retrieval performance. Hence, the broker should avoid selecting collections that are not likely to return good answers to the queries, even if they have high-ranked representation sets.
The returned utility maximization method (RUM) [Si and Callan, 2005b] has been proposed to address this issue. Here, the ranked lists returned by collections are compared with the output of an effective retrieval model. If the outputs are similar, then the collection is likely to be using an effective retrieval model, and vice versa. For this purpose, RUM downloads a few documents from the returned ranked lists and adds them to the index of sampled documents. RUM operates in six basic steps:
• The samples of all collections are aggregated into a single central index.
• An effective retrieval model is used to run a set of training queries on the central index.
• The training queries are also submitted to the available collections.
• Each collection returns its answers from which a few top-ranked documents are downloaded by the broker.
• The downloaded documents are ranked by the central index using the term statistics of all sampled documents.
• Using the training queries, a mapping function is trained to convert the collection ranks to their approximated global ranks in the central index.
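The last step above can be illustrated with a small Python sketch. It assumes that, for one collection, the training stage has produced pairs of (rank in the collection's own result list, rank assigned by the central index), and it learns the simplest possible mapping: the mean central rank observed at each collection rank. This is an illustrative stand-in, not the mapping model of Si and Callan [2005b]; the function names and data are hypothetical.

```python
from collections import defaultdict


def train_rank_mapping(observations):
    """Learn collection rank -> mean central rank from training observations.

    observations: iterable of (collection_rank, central_rank) pairs gathered
    over the training queries for a single collection (assumed input format).
    """
    buckets = defaultdict(list)
    for coll_rank, central_rank in observations:
        buckets[coll_rank].append(central_rank)
    return {rank: sum(values) / len(values) for rank, values in buckets.items()}


def approx_central_rank(mapping, coll_rank):
    """Approximate the central rank of a document returned at coll_rank.

    Falls back to the nearest collection rank seen during training.
    """
    nearest = min(mapping, key=lambda rank: abs(rank - coll_rank))
    return mapping[nearest]


# Hypothetical observations for one collection.
observations = [(1, 2), (1, 4), (2, 10), (3, 30)]
mapping = train_rank_mapping(observations)
estimate = approx_central_rank(mapping, 5)
```

Under this sketch, a collection whose top results map to low central ranks behaves like the effective retrieval model, whereas large mapped ranks suggest that its own ranking is less reliable.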
Si and Callan [2005b] have compared RUM with UUM, and have shown that ignoring the collection search effectiveness factor can significantly reduce the final precision.
DTF. The decision-theoretic framework (DTF) aims to minimize the typical costs of collection selection, such as time and money, while maximizing the number of relevant documents retrieved. As in UUM, the search effectiveness of collections can be learned by using a set of training queries in advance.
DTF was initially suggested [Fuhr, 1996; 1999b;a] as a promising method for selecting suitable collections. However, the method had not been tested in FIR environments until Nottelmann and Fuhr [2003] showed that the effectiveness of DTF can be competitive with that of CORI for long queries. For short queries, though, DTF is usually worse than CORI. More recently, CORI and DTF were combined in a single framework [Nottelmann and Fuhr, 2004a]. The hybrid model still produces poorer results than CORI for shorter queries, but competitive results for longer queries.
DTF has one of the most solid theoretical models among available collection selection techniques. It incorporates costs (monetary, network) along with relevance into a decision-theoretic framework, and has been used in a few real-world federated retrieval architectures such as the MIND project [Berretti et al., 2003; Nottelmann and Fuhr, 2004b;c]. However, DTF requires a large number of training queries.