Summary - Combinatoric models of information retrieval ranking methods and performance measures

The general research goal of this dissertation was to use analytic, as contrasted with retrospective, techniques to construct combinatoric models of IR ranking methods and performance measures for weakly-ordered document collections. These models could be used by researchers to predict system performance, to acquire a deeper understanding of some of the factors that influence how IR performance measures work, and to develop more accurate formulas for these measures. The main items of interest in this research were the Average Search Length (ASL), the normalized average position of a relevant document (A), the quality of a ranking method (Q), and the development of performance measures that can be calculated at arbitrary points in a vector of ranked documents and that yield correct results even when the documents are weakly-ordered.

Chapter 2 Background

Retrieval performance measures attempt to indicate how well an information retrieval system performed (if used in a retrospective manner) or is expected to perform (if used in a predictive manner). The Average Search Length (ASL) is the principal measure used in this research. This chapter defines much of the terminology and many of the concepts that appear in the research. It is important to note that the research discussed in this document uses a single term model.

One may naturally wonder, “Why is this research limited to just single term queries?” The main reason is that this single term limit “allows us to fully understand many retrieval characteristics and options that are far more difficult to understand in a multi-term case” (Losee, 1998). Another very important reason is that multiple term queries may introduce confounding factors (Johnson and Christensen, 2004) into a research model. If the researcher is not cognizant of these factors, or the factors are not identified and taken into account, then the study may have poor internal validity. A third reason is that many queries, especially on the Internet, consist of just a single term (Jansen et al., 1998). A number of issues arise with multiple term queries that can be ignored in the single term case. These include the following: If the query terms are not assumed to be independent, then how are term dependencies handled or modeled? Is each query term equally important? If not, how are relative weights specified? Must all of the query terms be present in a document for a match to occur? Do multiple occurrences of a query term in a document carry more weight than fewer occurrences?

Each of the above questions represents an issue that has the potential to complicate a retrieval model. Such complications may hinder the understanding of the characteristics of the information retrieval (IR) model under investigation.

The discussion of the definitions for the terminology and concepts that are used in this research starts by stating that the formula for the Average Search Length (Losee, 1998) is

$$\mathrm{ASL} = N\left(QA + \bar{Q}\bar{A}\right) + \frac{1}{2}, \tag{2.0.1}$$

then proceeds by specifying the roles of the independent variables, followed later with a more in-depth treatment of these entities. Briefly, $N$ is the number of documents to be ranked, $Q$ is the probability that the ranking is optimal, and $A$ is the normalized expected position of a relevant document from the front (i.e., document position 1) of the ranking. In the above formula, $\bar{A}$ is defined as $1 - A$ and $\bar{Q}$ is defined as $1 - Q$. The values of $Q$ and $A$ are in the closed interval $[0, 1]$.

The major part of the process of estimating the ASL involves computing the weighted mean of $A$ and $\bar{A}$, with the weights being $Q$ (the proportion of rankings that are optimal) and $\bar{Q}$ (the proportion of rankings that are worst-case), respectively.
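The weighted-mean structure of Equation 2.0.1 can be illustrated with a minimal Python sketch. The function name `average_search_length` and its argument names are illustrative choices, not part of the original work; the formula itself is the one given above.

```python
def average_search_length(n_docs: int, q: float, a: float) -> float:
    """Return ASL = N * (Q*A + (1 - Q)*(1 - A)) + 1/2 (Equation 2.0.1).

    n_docs -- N, the number of documents to be ranked
    q      -- Q, the probability that the ranking is optimal
    a      -- A, the normalized expected position of a relevant document
    """
    if n_docs < 0:
        raise ValueError("N must be non-negative")
    if not (0.0 <= q <= 1.0 and 0.0 <= a <= 1.0):
        raise ValueError("Q and A must lie in the closed interval [0, 1]")
    q_bar = 1.0 - q  # proportion of rankings that are worst-case
    a_bar = 1.0 - a  # normalized position under a worst-case ranking
    return n_docs * (q * a + q_bar * a_bar) + 0.5


if __name__ == "__main__":
    # Hypothetical values: 100 documents, A = 0.3.
    for q in (1.0, 0.5, 0.0):
        print(q, average_search_length(100, q, 0.3))
```

With Q = 1 the expression reduces to N·A + 1/2, and with Q = 0 it reduces to N·(1 − A) + 1/2, so the two extremes correspond to always-optimal and always-worst-case rankings.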

Hence, given an arbitrary system, its collection of documents, the query, the ranking algorithm, and their collective characterization in terms of $N$, $Q$, and $A$, the expected performance of that system can be calculated. There may be other ways, now and in the future, to estimate the performance of different ranking schemes. They most likely will not be identical to the methods that were the subject of this research. However, anyone interested in doing this kind of performance prediction research will likely use methods that have much in common with those used in this research.

Documents with a binary query feature with frequency $d$ may be presented to the user in one of two distinct orders: all the documents with feature frequency $d$ precede any document with feature frequency $\bar{d} = 1 - d$ (optimal ranking), or vice versa (worst-case ranking). Furthermore, it is assumed that the term weight for $d$ is greater than that for $\bar{d}$. In essence, this holds when the query terms are positive discriminators. If the terms are not positive discriminators, then the features must be switched (re-parameterized) so that the product of $d$ and its term weight is greater than the product of $\bar{d}$ and its term weight. If we let $d = 1$, then, in a best-case (or optimal) ranking, all the documents with feature frequency 1 are retrieved before those with feature frequency 0. Likewise, in a worst-case ranking, all the documents with feature frequency 0 are retrieved before those with feature frequency 1.

The mean position, $A$, on a unit scale, of a relevant document can be computed as the sum of the weighted positions of those relevant documents with feature frequencies $d$ and $\bar{d}$, respectively. These weighted positions are normed to be in the closed interval $[0, 1]$. A document at the front of the ordering has a position of 0 because it is at the low end of the spectrum (good performance), and a document at the back has a position of 1 because it is at the high end (bad performance). $A$ can be viewed as the expected proportion of all documents that must be examined in the search process to reach the average position of a relevant document in the ordering. It can also be viewed as the mean normalized position of a relevant document in the ordering.

The variable $A$ is computed by noting that documents with feature frequency $d$ are at the low end of the $A$ spectrum (good performance) and those with feature frequency $\bar{d}$ are at the high end of the spectrum (poor performance). The formula for $A$ is

$$A = \frac{1 + \Pr(d) - \Pr(d \mid \mathrm{rel})}{2}.$$

Notationally, the equation can be simplified by letting $p = \Pr(d \mid \mathrm{rel})$ and $t = \Pr(d)$:

$$A = \frac{1 + t - p}{2}. \tag{2.0.3}$$
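As a sanity check on Equation 2.0.3, the following Python sketch compares N·A + 1/2 with the expected rank of a relevant document computed directly from document counts under an optimal weak ordering. The function names and the example counts are hypothetical, and the sketch assumes that ties within a level are broken uniformly at random, so that the expected rank of a document in a level is the midpoint of that level.

```python
from fractions import Fraction


def normalized_position(p, t):
    """A = (1 + t - p) / 2, with p = Pr(d | rel) and t = Pr(d)."""
    return (1 + t - p) / 2


def mean_relevant_rank_optimal(rel_with, rel_without, nonrel_with, nonrel_without):
    """Expected rank of a relevant document when every document that has
    the feature precedes every document that lacks it.

    Assumption: documents within a level are in random order, so each
    document's expected rank is the midpoint of its level.
    """
    n_rel = rel_with + rel_without
    if n_rel == 0:
        raise ValueError("at least one relevant document is required")
    n_with = rel_with + nonrel_with
    n_without = rel_without + nonrel_without
    rank_with = Fraction(n_with + 1, 2)                # midpoint of ranks 1..n_with
    rank_without = n_with + Fraction(n_without + 1, 2) # midpoint of the second level
    return (rel_with * rank_with + rel_without * rank_without) / n_rel


if __name__ == "__main__":
    # Hypothetical collection of 100 documents, 10 of them relevant.
    rel_with, rel_without, nonrel_with, nonrel_without = 8, 2, 12, 78
    n = rel_with + rel_without + nonrel_with + nonrel_without
    p = Fraction(rel_with, rel_with + rel_without)  # Pr(d | rel)
    t = Fraction(rel_with + nonrel_with, n)         # Pr(d)
    a = normalized_position(p, t)
    print(a, n * a + Fraction(1, 2))
    print(mean_relevant_rank_optimal(rel_with, rel_without, nonrel_with, nonrel_without))
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the two quantities can be compared without floating-point noise.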

A ranking is an ordering or sequencing. With respect to the ranking of documents, in response to a query, an optimal ranking is a sequence where the documents that contain the query term are at the front of the sequence and any that do not contain the term appear after the last document that contains the term in that sequence. A worst-case ranking is the polar opposite (i.e., all of the documents that contain the term are at the rear of the sequence, all of the other documents are at the front). A random-case ranking is a sequencing where it is equally likely for any document, whether or not it contains the term, to occupy an arbitrary position in that sequence.

2.1 Several Alternative Measures That Are of Interest

Of course, the ASL measure is far from the only measure that can be used to help assess ranking performance. Among the many other measures are the Expected Search Length (ESL) (Equation 2.1.1), the Mean Reciprocal Rank (MRR) (Equation 2.1.3), and the MZ-based E measure (MZE) (Equation 2.1.4). These three measures are of great interest for the last of the three research questions addressed by this dissertation. The discussion of this third research question takes place in Chapter 10 (The ASL Measure and Three Frequently-Used Performance Measures).

In Chapter 10, combinatoric-based models are developed for each of these three measures, and for the Average Search Length (ASL) measure. These models provide an analytic way to calculate the values of these measures and are very prominent in the discussions that occur in Chapter 10.

2.1.1 Expected Search Length

The ESL (Cooper, 1968) is similar to the Average Search Length. The major difference is that it counts the mean number of non-relevant documents retrieved before the kth relevant document is retrieved in a rank-ordered vector V of documents. In other words, it counts the mean number of non-relevant documents retrieved in order to produce a given number k of relevant documents. For a query q, a vector V of ranked documents, and a request for the first x relevant documents, the ESL can be defined as

$$\mathrm{ESL}(V, x) = j + \frac{i \cdot s}{r + 1}, \tag{2.1.1}$$

where $l$ is the level at which the $x$th relevant document occurs, $j$ is the total number of documents irrelevant to $q$ in all levels which precede level $l$ in the weak ordering, $i$ is the number of documents irrelevant to $q$ in level $l$, $s$ is the number such that the $s$th relevant document found in level $l$ of the weak ordering would complete the search for request $q$, and $r$ is the number of documents in level $l$ which are relevant to $q$.
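To make the roles of j, i, s, and r concrete, here is a minimal Python sketch of Equation 2.1.1. It assumes a weak ordering represented as a list of levels, best level first, where each level is a hypothetical `(relevant, non_relevant)` pair of counts; that representation and the function name are illustrative only.

```python
from typing import Sequence, Tuple


def expected_search_length(levels: Sequence[Tuple[int, int]], x: int) -> float:
    """Cooper's ESL = j + i*s / (r + 1) for a request of x relevant documents.

    levels -- (relevant, non_relevant) document counts per level of the
              weak ordering, listed from the best level to the worst
    """
    if x < 1:
        raise ValueError("x must be at least 1")
    found = 0  # relevant documents in the levels that precede the current one
    j = 0      # non-relevant documents in the levels that precede the current one
    for r, i in levels:
        if found + r >= x:
            s = x - found  # the s-th relevant document of this level ends the search
            return j + (i * s) / (r + 1)
        found += r
        j += i
    raise ValueError("the ordering contains fewer than x relevant documents")


if __name__ == "__main__":
    # Hypothetical ordering: level 1 holds 1 relevant and 2 non-relevant
    # documents; level 2 holds 2 relevant and 4 non-relevant documents.
    print(expected_search_length([(1, 2), (2, 4)], x=2))
```

In the usage example the second relevant document lies in level 2, so j = 2, i = 4, s = 1, and r = 2.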

Caution must be taken when referring to the Expected Search Length (ESL), though, because Cooper’s definition is not universally used (Korfhage, 1997). Some researchers in the IR community have defined the ESL to be the mean number of total documents (i.e., both relevant and non-relevant) retrieved in order to obtain the xth relevant document in a rank-ordered vector V of documents. In other words, this alternative ESL definition counts the mean number of total documents retrieved in order to produce a given number x of relevant documents. For example, if the user requests 6 relevant documents and a mean of 4 non-relevant documents are retrieved before the sixth relevant document is retrieved, the Cooper version of the ESL calculates the mean number of retrieved documents as 4 documents whereas the alternate version considers the mean to be 10 documents.
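The relationship between the two definitions can be sketched in a few lines of Python. The function name is hypothetical; the only assumption is that the alternate version counts the x relevant documents in addition to the non-relevant documents counted by Cooper's version.

```python
def total_documents_search_length(cooper_esl: float, x: int) -> float:
    """Convert Cooper's ESL (mean non-relevant documents retrieved) into the
    alternate ESL (mean total documents retrieved) for a request of x
    relevant documents."""
    if x < 1:
        raise ValueError("x must be at least 1")
    return cooper_esl + x


if __name__ == "__main__":
    # The example from the text: 6 relevant documents requested and a mean
    # of 4 non-relevant documents retrieved.
    print(total_documents_search_length(4, 6))  # prints 10
```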

2.1.2 Mean Reciprocal Rank

There are several performance evaluation measures in IR that are based on the concept of reciprocal rank (RR). The most well-known one is the mean reciprocal rank (MRR). It is used very heavily in the TREC Question Answering (QA) tracks (Voorhees and Tice, 1999; Voorhees, 1999) to assess the performance of an IR system on a set of questions Q.

More formally, the reciprocal rank at document cut-off value $k$ on a rank-ordered vector $V$ of answers is defined as

$$\mathrm{RR@}k(V) = \begin{cases} 1/i, & \text{if } \exists\, i \le k \text{ such that } V[i] \text{ is a correct answer and } \forall j < i,\ V[j] \text{ is an incorrect answer;} \\ 0, & \text{otherwise.} \end{cases} \tag{2.1.2}$$

The above expression indicates that if a correct answer occurs among the first k answers in a rank-ordered vector V of answers, then the expression’s value is the reciprocal of the rank that corresponds to the first correct answer. If there is no correct answer among the first k answers, then the reciprocal rank is defined to be 0. For example, assume that k = 5 and that correct answers are at ranks 2 and 3. Then the reciprocal rank is 1/2 because the first correct answer was at rank 2. Now, assume that the first correct answer is at rank 7. In this case, the reciprocal rank is 0 because the first correct answer was at a rank that is greater than 5.
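The two examples above can be reproduced with a short Python sketch of the reciprocal rank at cut-off k. It assumes a ranking is represented as a list of booleans in rank order (`True` marks a correct answer); that representation and the function name are illustrative choices.

```python
from typing import Sequence


def reciprocal_rank_at_k(is_correct: Sequence[bool], k: int) -> float:
    """Return 1/i for the first correct answer at rank i <= k, else 0."""
    # Ranks are 1-based; only the first k answers are inspected.
    for rank, correct in enumerate(is_correct[:k], start=1):
        if correct:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    # Correct answers at ranks 2 and 3, cut-off k = 5.
    print(reciprocal_rank_at_k([False, True, True, False, False], k=5))  # prints 0.5
    # First correct answer at rank 7, which is beyond the cut-off.
    print(reciprocal_rank_at_k([False] * 6 + [True], k=5))  # prints 0.0
```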

According to Lin et al. (2008), two commonly used measures of a QA system’s performance are “the top-1 accuracy and the top-5 mean reciprocal rank.” The top-1 accuracy for a question set $Q$ is the proportion of questions $q$ in $Q$ whose rank-ordered vector of answers $V_q$ has a correct answer at rank 1. The mean reciprocal rank at document cut-off $k$ for a vector $V$ of answers is defined as

$$\mathrm{MRR}(Q)@k(V) = \frac{\sum_{q \in Q} \mathrm{RR@}k(V_q)}{|Q|}, \tag{2.1.3}$$

where $Q$ is a set of questions, $q \in Q$, and $V_q$ is the rank-ordered vector of answers for question $q$. Expressed another way, the MRR is the mean of the reciprocals of the ranks of the first correct answer that occurs among the top $k$ (in TREC, $k = 5$) answers in a ranking for each question. Note that the sets of answers represented by $V$ and $V_q$ are identical.
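A minimal Python sketch of the MRR at cut-off k, together with the top-1 accuracy, might look as follows. It assumes each question's ranking is a list of booleans in rank order (`True` marks a correct answer); the function names and the sample data are hypothetical.

```python
from typing import Sequence


def reciprocal_rank_at_k(is_correct: Sequence[bool], k: int) -> float:
    """Return 1/i for the first correct answer at rank i <= k, else 0."""
    for rank, correct in enumerate(is_correct[:k], start=1):
        if correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank_at_k(rankings: Sequence[Sequence[bool]], k: int) -> float:
    """Mean of RR@k over the rankings, one ranking per question."""
    if not rankings:
        raise ValueError("the question set must not be empty")
    return sum(reciprocal_rank_at_k(v, k) for v in rankings) / len(rankings)


def top1_accuracy(rankings: Sequence[Sequence[bool]]) -> float:
    """Proportion of questions whose rank-1 answer is correct."""
    if not rankings:
        raise ValueError("the question set must not be empty")
    hits = sum(1 for v in rankings if len(v) > 0 and v[0])
    return hits / len(rankings)


if __name__ == "__main__":
    # Three hypothetical questions: first correct answer at rank 1,
    # at rank 2, and not within the top 5, respectively.
    rankings = [
        [True, False, False, False, False],
        [False, True, False, False, False],
        [False, False, False, False, False],
    ]
    print(mean_reciprocal_rank_at_k(rankings, k=5))
    print(top1_accuracy(rankings))
```

For the sample data the reciprocal ranks are 1, 1/2, and 0, so the MRR is their mean, and one of the three questions has a correct answer at rank 1.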

2.1.3 MZ-Based E Measure

This measure is based on measurement theory (Bollmann and Cherniavsky, 1981) (as contrasted to Swets’ E measure which is based on the Receiver Operating Characteristics (ROC) model (Swets, 1969; Pepe, 2003)).

This measurement theory version of the E measure (MZE) (van Rijsbergen, 1979; Baeza-Yates and Ribeiro-Neto, 1999; Manning et al., 2008) is defined as

$$\mathrm{MZE} = 1 - \frac{2}{P^{-1} + R^{-1}}, \tag{2.1.4}$$

where $P$ represents precision and $R$ represents recall.
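Equation 2.1.4 can be sketched in Python as follows. The function name is illustrative. The formula is undefined when precision or recall is 0; returning 1.0 in that case is an assumption made here (it treats the harmonic mean of P and R as 0), not something stated in the text.

```python
def mz_e_measure(precision: float, recall: float) -> float:
    """MZE = 1 - 2 / (1/P + 1/R) (Equation 2.1.4)."""
    if not (0.0 <= precision <= 1.0 and 0.0 <= recall <= 1.0):
        raise ValueError("precision and recall must lie in [0, 1]")
    if precision == 0.0 or recall == 0.0:
        # Assumption: the harmonic mean is taken to be 0 here, so MZE = 1.
        return 1.0
    return 1.0 - 2.0 / (1.0 / precision + 1.0 / recall)


if __name__ == "__main__":
    # Hypothetical precision/recall pairs.
    for p, r in [(1.0, 1.0), (0.5, 0.5), (0.5, 1.0)]:
        print(p, r, mz_e_measure(p, r))
```

Because 2 / (1/P + 1/R) is the harmonic mean of precision and recall, lower MZE values indicate better performance, with 0 reached only when both P and R equal 1.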
