Lexicon-based collection selection

2.6 Collection selection

2.6.1 Lexicon-based collection selection

Early collection selection strategies were often designed for cooperative environments where comprehensive representation sets of collections are available [Baumgarten, 1997; 1999; Callan et al., 1995; de Kretser et al., 1998; D’Souza and Thom, 1999; D’Souza et al., 2004a;b; Gravano, 1997; Gravano et al., 1997; 1999; Yuwono and Lee, 1997; Xu and Callan, 1998; Zobel, 1997]. A broker calculates the similarity of the query with the representation sets using the detailed lexicon statistics of collections.

GlOSS. The initial version of GlOSS, also known as bGlOSS [Craswell, 2000; Craswell et al., 2000], only supports Boolean queries. bGlOSS ranks collections based on the estimated number of documents that satisfy the query. For a given n-term query generated from the terms $t_1, t_2, \cdots, t_n$, and a given collection c, the probability that a document in c contains all the query terms is approximated, assuming that the terms occur independently, as:

\[ \frac{f_{t_1,c}}{|c|} \times \cdots \times \frac{f_{t_n,c}}{|c|} \qquad (2.7) \]

where $f_{t_j,c}$ is the frequency of the jth query term in collection c and $|c|$ is the number of documents inside that collection. The bGlOSS method was designed for cooperative environments, and thus the collection size values and term frequency information were assumed to be available to the broker. Overall, bGlOSS estimates the number of documents containing all the query terms as:

\[ \frac{\prod_{j=1}^{n} f_{t_j,c}}{|c|^{n-1}} \qquad (2.8) \]
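As a minimal sketch of Equation 2.8, the following Python fragment computes the bGlOSS estimate for a collection and ranks collections by it. The function name `bgloss_estimate`, the dictionary layout, and the toy statistics are assumptions made for illustration; they do not come from the original GlOSS implementation.

```python
from math import prod


def bgloss_estimate(term_freqs, collection_size):
    """Estimated number of documents containing all query terms (Eq. 2.8).

    term_freqs: per-term frequencies f_{t_j,c} for the query terms.
    collection_size: |c|, the number of documents in the collection.
    """
    if collection_size <= 0:
        return 0.0
    n = len(term_freqs)
    # prod(f_j) / |c|^(n-1) is |c| times the product of f_j / |c| (Eq. 2.7).
    return prod(term_freqs) / collection_size ** (n - 1)


# Hypothetical statistics for a two-term query: (term frequencies, |c|).
stats = {
    "A": ([40, 10], 100),
    "B": ([300, 200], 1000),
}

estimates = {name: bgloss_estimate(f, size) for name, (f, size) in stats.items()}
ranking = sorted(estimates, key=estimates.get, reverse=True)
```

With these toy numbers, collection A is estimated to hold 4 matching documents and collection B 60, so B is ranked first.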

Collections are ranked according to their estimated number of answers for the query. In the vector-space version of GlOSS (vGlOSS) [Gravano et al., 1999], collections are sorted according to their goodness values, defined as:

\[ \mathrm{Goodness}(q, l, c) = \sum_{d \in \mathrm{Rank}(q, l, c)} \mathrm{sim}(q, d) \qquad (2.9) \]

and

\[ \mathrm{Rank}(q, l, c) = \{ d \in c \mid \mathrm{sim}(q, d) > l \} \qquad (2.10) \]

where sim(q, d) is the Cosine similarity [Salton and McGill, 1983; Salton et al., 1983] of the vectors for document d and query q. In other words, the goodness value of a collection for a query is calculated by summing the similarity values of its documents. To avoid possible noise produced by low-weight documents, vGlOSS uses a weight threshold l.
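The definitions in Equations 2.9 and 2.10 can be sketched in Python as follows. This is the ideal goodness value, which requires the document vectors themselves; it is not what a broker can compute from lexicon statistics alone. The sparse dictionary representation and the names `cosine` and `goodness` are assumptions chosen for illustration.

```python
from math import sqrt


def cosine(q, d):
    """Cosine similarity of two sparse term-weight vectors given as dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = sqrt(sum(w * w for w in q.values()))
    norm_d = sqrt(sum(w * w for w in d.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0
    return dot / (norm_q * norm_d)


def goodness(q, docs, l):
    """Sum of sim(q, d) over documents with sim(q, d) > l (Eqs. 2.9, 2.10)."""
    sims = (cosine(q, d) for d in docs)
    return sum(s for s in sims if s > l)


# Hypothetical collection of three documents and a one-term query.
query = {"a": 1.0}
collection = [{"a": 1.0}, {"a": 1.0, "b": 1.0}, {"b": 1.0}]

g_low = goodness(query, collection, l=0.5)   # counts the first two documents
g_high = goodness(query, collection, l=0.8)  # counts only the first document
```

Raising the threshold l drops the second document (similarity about 0.71) from the sum, which illustrates how l filters out low-weight documents.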

As with bGlOSS, the broker is provided with information about the lexicon statistics of collections. For any given term t, the broker stores the number of documents in each collection that include that term. However, this information is still incomplete, and not as comprehensive as the original index. For instance, the weight of terms in each document is not available to the broker. Therefore, two versions of vGlOSS are proposed, based on the following assumptions:

• high correlation: If query terms $t_1$ and $t_2$ appear in collection c, respectively in $df_{t_1,c}$ and $df_{t_2,c}$ documents (where $df_{t_1,c} < df_{t_2,c}$), then any document that contains $t_1$ also contains $t_2$. The high-correlation scenario is also known as the Max(l) version.

• disjoint: If query terms $t_1$ and $t_2$ appear in collection c, then none of the documents indexed by c contains both terms. The disjoint scenario is also known as the Sum(l) version.

Max(l) increases recall by searching more collections, while Sum(l) produces higher precision values. Sum(l) may miss many collections that include relevant documents, and does not perform as well as Max(l) in terms of recall.

CORI. Turtle and Croft [1990; 1991] applied inference networks to document retrieval. A typical inference network for document retrieval is an acyclic graph in which documents are represented by leaves and the information need (query) is the root. However, the inference network designed for collection selection in CORI [Callan et al., 1995; French et al., 1999] is slightly different. The leaves in a CORI network represent collections and are connected to a group of nodes for representation sets in the second level of the graph. The representation node for each collection contains the terms that occur in that collection.

The similarity of a query and the representation sets is measured by the INQUERY retrieval system [Allan et al., 2000; Callan et al., 1992; 1997; Turtle, 1991; Turtle and Croft, 1990; 1991]. INQUERY was originally designed for ranking documents, but in CORI it is modified slightly to become applicable to collection selection. In the INQUERY formula (Equation 2.4), the term frequency component is changed to document frequency, and inverse collection frequency is used instead of the inverse document frequency component (idf). Given that a term t is observed, the belief in collection c is defined as below:

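\[ P(t|c) = \Phi + (1 - \Phi) \cdot T \cdot I \qquad (2.11) \]

where

\[ T = \frac{df_{t,c}}{df_{t,c} + 50 + 150 \cdot cw_c / \overline{cw}} \qquad (2.12) \]

\[ I = \frac{\log\left( \frac{N_c + 0.5}{cf_t} \right)}{\log(N_c + 1.0)} \qquad (2.13) \]

Here $\overline{cw}$ is the average of $cw_c$ over all collections, and the remaining symbols are defined as follows: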

$df_{t,c}$: The document frequency of t in c.

$cw_c$: The number of words in collection c.

$N_c$: The number of collections.

$cf_t$: Collection frequency of t (the number of collections containing t).

$\Phi$: The minimum belief component when t is available in c.

In the original CORI paper, Callan et al. [1995] suggested 0.4 as the default value for Φ. D’Souza et al. [2004b] tested CORI on different testbeds and reported that the default value is not always optimal.
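The belief computation in Equations 2.11 to 2.13 can be sketched in Python as follows. This is a minimal illustration rather than the Lemur implementation: the function name `cori_belief`, its parameter names, the guard for degenerate inputs, and the toy statistics are all assumptions.

```python
from math import log


def cori_belief(df, cw_c, avg_cw, n_collections, cf, phi=0.4):
    """Belief P(t|c) for one query term and one collection (Eqs. 2.11-2.13).

    df: document frequency of the term in the collection.
    cw_c: number of words in the collection.
    avg_cw: average number of words per collection.
    n_collections: total number of collections (N_c).
    cf: number of collections containing the term.
    phi: minimum belief component (0.4 is the default quoted in the text).
    """
    # Assumption: fall back to the minimum belief when the statistics
    # are degenerate, which the equations themselves do not cover.
    if cf <= 0 or avg_cw <= 0 or n_collections <= 0:
        return phi
    t = df / (df + 50 + 150 * cw_c / avg_cw)                        # Eq. 2.12
    i = log((n_collections + 0.5) / cf) / log(n_collections + 1.0)  # Eq. 2.13
    return phi + (1 - phi) * t * i                                  # Eq. 2.11


# Hypothetical statistics: 10 collections, term present in 3 of them.
belief_absent = cori_belief(df=0, cw_c=1000, avg_cw=1000, n_collections=10, cf=3)
belief_present = cori_belief(df=50, cw_c=1000, avg_cw=1000, n_collections=10, cf=3)
```

When the term does not occur in the collection, T is zero and the belief reduces to the minimum value Φ; a larger document frequency pushes the belief above that floor.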

Among lexicon-based methods, CORI was suggested to be the most effective by many researchers [Craswell et al., 2000; French et al., 1999; Powell, 2001; Powell and French, 2003; Rasolofo et al., 2001]. However, some researchers argue that the performance of CORI varies across testbeds and is sometimes significantly worse than that of the alternatives [D’Souza et al., 2004b;a]. Further, Si and Callan [2003a] reported that CORI is not effective in environments where the distribution of collection sizes is skewed (see footnote 5).

CVV. Cue-validity variance (CVV) was proposed by Yuwono and Lee [1997] for collection selection as a part of the WISE index server [Yuwono and Lee, 1996]. The CVV broker only stores the document frequency information of collection lexicons and is thus more efficient than many other lexicon-based methods such as CORI and GlOSS [D’Souza, 2005]. CVV defines the goodness of a given collection c for an m-term query q as below:

\[ \mathrm{Goodness}(c, q) = \sum_{j=1}^{m} CVV_j \cdot df_{j,c} \qquad (2.14) \]

where $df_{j,c}$ represents the document frequency of the jth query term in collection c and $CVV_j$ is the variance of cue-validity ($CV_j$) [Goldberg, 1995] of that term. $CV_{c_i,j}$ shows the degree to which the jth term in the query can distinguish collection $c_i$ from other collections and is computed as:

\[ CV_{c_i,j} = \frac{\dfrac{df_{j,c_i}}{|c_i|}}{\dfrac{df_{j,c_i}}{|c_i|} + \dfrac{\sum_{k \neq i}^{N_c} df_{j,c_k}}{\sum_{k \neq i}^{N_c} |c_k|}} \qquad (2.15) \]

Footnote 5: In the original CORI paper [Callan et al., 1995], parameter T (Equation 2.12) is calculated with a different formula. We use the improved version of CORI [French et al., 1999], as implemented in the Lemur framework, for our experiments.

Here, $|c_k|$ is the number of documents in collection $c_k$ and $N_c$ is the total number of collections. The variance of cue-validity $CVV_j$ can be calculated as:

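\[ CVV_j = \frac{\sum_{i=1}^{N_c} \left( CV_{c_i,j} - \overline{CV_j} \right)^2}{N_c} \qquad (2.16) \]

where $\overline{CV_j}$ is the average $CV_{c_i,j}$ over all collections and is defined as below:

\[ \overline{CV_j} = \frac{\sum_{i=1}^{N_c} CV_{c_i,j}}{N_c} \qquad (2.17) \]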
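Equations 2.14 to 2.17 can be combined into a short Python sketch. The function names, the list-based data layout, the zero-denominator guards, and the toy statistics below are assumptions made for illustration; they are not taken from the WISE index server.

```python
def cue_validity(df_j, sizes, i):
    """CV_{c_i,j}: cue validity of term j for collection i (Eq. 2.15).

    df_j: document frequency of term j in each collection.
    sizes: number of documents in each collection.
    """
    own = df_j[i] / sizes[i]
    rest_df = sum(df_j) - df_j[i]
    rest_size = sum(sizes) - sizes[i]
    # Assumption: treat the density outside c_i as 0 if no other documents exist.
    rest = rest_df / rest_size if rest_size else 0.0
    denom = own + rest
    return own / denom if denom else 0.0


def cvv(df_j, sizes):
    """CVV_j: variance of the cue validities of term j (Eqs. 2.16, 2.17)."""
    cvs = [cue_validity(df_j, sizes, i) for i in range(len(sizes))]
    mean_cv = sum(cvs) / len(cvs)
    return sum((cv - mean_cv) ** 2 for cv in cvs) / len(cvs)


def cvv_goodness(df, sizes):
    """Goodness of every collection for a query (Eq. 2.14).

    df[j][i] is the document frequency of query term j in collection i.
    """
    variances = [cvv(df_j, sizes) for df_j in df]
    return [
        sum(v * df_j[i] for v, df_j in zip(variances, df))
        for i in range(len(sizes))
    ]


# Hypothetical two-collection, two-term example.
sizes = [100, 100]
df = [
    [50, 0],   # term 0 occurs only in collection 0, so it is discriminative
    [10, 10],  # term 1 is spread evenly, so its variance is zero
]
scores = cvv_goodness(df, sizes)
```

In this toy example the evenly distributed term contributes nothing to either score, while the discriminative term gives collection 0 a goodness of 12.5 and collection 1 a goodness of 0.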

Other lexicon-based methods. Several other lexicon-based collection selection strategies have been proposed. Zobel [1997] tested four lexicon-based methods for collection selection. Overall, his Inner-product ranking function was found to produce better results than the other functions such as the Cosine formula [Salton and McGill, 1983; Baeza-Yates, 1992]. CSams [Yu et al., 1999; 2002a; Wu et al., 2001] uses the global frequency of query terms to compute the weights of collections and is proposed for cooperative environments only.

D’Souza and Thom [1999] proposed an n-term indexing method, in which a subset of terms from each document is indexed by the broker. Collections must provide this subset of terms to the broker for every document, so a high level of cooperation is needed. D’Souza et al. [2004a] compared the lexicon-based methods of Zobel [1997], CORI [Callan et al., 1995], and n-term indexing strategies. They showed that the performance of collection selection methods varies across testbeds, and reported that no approach consistently produces the best results.

Baumgarten [1997; 1999] proposed a probabilistic model [Robertson, 1976; 1997] for ranking documents in a federated environment. However, the performance of the suggested approach has not been compared with alternative techniques. Sogrine et al. [2005] combined a group of collection selection methods such as CORI and CVV with a latent semantic indexing (LSI) strategy [Deerwester et al., 1990]. In their approach, instead of the term frequency information of query terms, elements of an LSI matrix are used in collection selection equations. They showed that their suggested approach can slightly improve the performance of some collection selection methods.

Lexicon-based collection selection methods are analogous to centralized IR models, with collections taking the place of documents. This approach, however, discards document boundaries within collections, which can harm the overall performance of such models [Si and Callan, 2003a].