
4.2 Query Driven Retrieval Approaches

4.2.1 Content Only Based Retrieval

Research into IR approaches which index the terms in documents to facilitate content-based retrieval has been ongoing since the idea was first proposed in the 1950s [Luhn, 1957]. Luhn proposed that keywords could be extracted from documents in a collection, either manually or automatically, to create a representation of those documents.

Each document’s representation (D) can be expressed as a vector, as shown in Equation 4.1, where tk is a term in the document representation (see footnote 2). Queries (Q) can similarly be represented as vectors, as shown in Equation 4.2, where qk is a term in the query representation.

D = (t1, t2, ..., tn) (4.1)

Q = (q1, q2, ..., qm) (4.2)

Footnote 2: In modern retrieval systems, automatically generating document representations for indexing purposes (to create vector representations of the documents in a collection) can involve tokenization, stop word removal, stemming, etc. The reader is referred to [Manning et al., 2009] for a good introduction to this topic.

In its simplest form, retrieval of relevant textual documents which match a user’s text query operates by searching the document collection for documents containing the query terms, and returning these documents to the user as candidate relevant documents. In other words, documents either contain the query terms (in which case they are retrieved for the user) or they do not (in which case they are not retrieved). This type of search uses Boolean algebra, and is hence referred to as Boolean search. Using Boolean algebra, individuals can also create more complex queries with the Boolean operators AND, OR and NOT (in this case a query might be something like Q = ((q1 AND q2) OR q3 ...)). Boolean search is described more fully in [Manning et al., 2009, van Rijsbergen, 1979].
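The Boolean matching described above can be sketched with an inverted index and set operations. This is a minimal illustration only: the toy collection, the document ids and the helper names `build_index` and `postings` are assumptions made for the example, not part of any system discussed in this chapter.

```python
# Toy collection; the ids and texts are invented for illustration.
docs = {
    "d1": "information retrieval with boolean queries",
    "d2": "boolean algebra and logic",
    "d3": "ranking documents by relevance",
}


def build_index(collection):
    """Map each term to the set of ids of the documents containing it."""
    index = {}
    for doc_id, text in collection.items():
        for term in text.split():
            index.setdefault(term, set()).add(doc_id)
    return index


index = build_index(docs)


def postings(term):
    """Return the ids of documents containing the term (empty set if unseen)."""
    return index.get(term, set())


# Q = (boolean AND retrieval) OR ranking
result = (postings("boolean") & postings("retrieval")) | postings("ranking")

# Q = boolean AND NOT algebra
not_result = postings("boolean") - postings("algebra")

print(sorted(result))      # ids matching the first query
print(sorted(not_result))  # ids matching the second query
```

Set intersection, union and difference play the roles of AND, OR and NOT. The output is an unordered set of matching documents, with no indication of which match is the more relevant.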

A problem with simple Boolean search is that only documents which exactly match the constraints of the Boolean query are returned to the user as candidate relevant documents. That is, a simple Boolean matching approach performs ‘exact matches’: it checks for the presence or absence of query terms in documents, in combination with the constraints of the specified Boolean query, to determine which documents may be relevant to the query. Another problem with this approach is that documents are not weighted by their likely relevance, so the output cannot be ranked by level of relevance.

When determining if a document is relevant to a given query, it is useful to establish a degree of likely relevance of each document and to rank result lists for users based on these degrees of relevance. This is useful since the documents which are perceived to be most relevant to a query will appear at the top of a result list, making them more accessible to the individual performing the query. This led to the development of best (or partial) match retrieval approaches, which produce result lists ordered according to the similarity of documents to the user query, where this similarity is some function of the number of search terms a query and document have in common. Using best match approaches, documents and queries can be represented as vectors of weighted terms in a t-dimensional space, where t is the number of terms in the document collection representation (i.e. the number of indexed terms). Equations 4.3 and 4.4 show the term vectors for a document (D) and query (Q) using this approach, where wti is the weight assigned to term ti in the document representation and wqi is the weight assigned to term ti in the query. In these vector representations wti (or wqi) is set to 0 when the term does not occur in D’s (or Q’s) representation.

D = (wt1, wt2, ..., wtt) (4.3)

Q = (wq1, wq2, ..., wqt) (4.4)

To calculate the similarity between a query vector and a document vector, the vectors are compared using, for example, the dot product shown in Equation 4.5 [Salton and Buckley, 1988]. In the simplest best match approach, terms occurring in a document (or query) are assigned a weight of 1. The query-document similarity given by Equation 4.5 is then simply a count of the number of query terms present in the document. Documents are then ranked in order of decreasing matching score. This is referred to as coordination-level matching.

$$\mathrm{Sim}(Q, D) = \sum_{i=1}^{t} w_{t_i} \cdot w_{q_i} \qquad (4.5)$$
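As a minimal sketch of coordination-level matching, the example below applies the dot product of Equation 4.5 to binary term vectors. The four-term vocabulary, the documents and the helper names are hypothetical and chosen only to keep the example small.

```python
# Indexed terms of a toy collection (hypothetical); t = 4 here.
vocabulary = ["cat", "dog", "fish", "bird"]


def binary_vector(terms):
    """Weight 1 for each vocabulary term that is present, 0 otherwise."""
    present = set(terms)
    return [1 if term in present else 0 for term in vocabulary]


def dot_product(query_vec, doc_vec):
    """Sim(Q, D): sum over all indexed terms of w_ti * w_qi."""
    return sum(wq * wt for wq, wt in zip(query_vec, doc_vec))


query = binary_vector(["cat", "fish"])
documents = {
    "d1": binary_vector(["cat", "dog"]),
    "d2": binary_vector(["cat", "fish", "bird"]),
    "d3": binary_vector(["dog"]),
}

# With binary weights the score is the number of query terms in the document.
scores = {doc_id: dot_product(query, vec) for doc_id, vec in documents.items()}

# Rank documents by decreasing matching score.
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Unlike Boolean search, a document that contains only some of the query terms still receives a non-zero score and a place in the ranking.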

Coordination-level matching assumes that all terms are equally discriminating in a document collection, and hence equally useful in determining the likely relevance of a document to a query, which is not the case. Hence, best match retrieval approaches which weight terms according to their discriminating power were developed. Term weighting allows for selectivity, where a good query term is one which has a high chance of selecting relevant documents from the many which will be non-relevant [S. E. Robertson and S. Jones, 1994]. The three commonly used characteristics of terms and document collections which are used to weight the occurrence of a query term in a document are: term frequency, inverse document frequency and document length.

Term frequency (tf) counts the number of occurrences of a term in a document. The rationale is that the more times a term occurs in a document, the more representative of that document it is. The term frequency can be normalised, for example by dividing by the maximum term frequency (maxtf) as shown in Equation 4.6 [Salton and Buckley, 1988], where tf_{d,t} is the term frequency of term t in a document d. Variations of this weighting function are possible, such as that shown in Equation 4.7.

$$tf_{d,t} = \frac{tf_{d,t}}{maxtf} \qquad (4.6)$$

$$tf_{d,t} = 0.5 + 0.5 \cdot \frac{tf_{d,t}}{maxtf} \qquad (4.7)$$
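The two normalisations in Equations 4.6 and 4.7 can be sketched as follows. The function names and the sample token list are illustrative assumptions, and maxtf is taken here to be the largest term count within the document being weighted.

```python
from collections import Counter


def normalised_tf(tokens):
    """Equation 4.6: raw count divided by the largest count in the document."""
    counts = Counter(tokens)
    max_tf = max(counts.values())  # assumes a non-empty token list
    return {term: count / max_tf for term, count in counts.items()}


def augmented_tf(tokens):
    """Equation 4.7: 0.5 + 0.5 * (raw count / largest count)."""
    counts = Counter(tokens)
    max_tf = max(counts.values())  # assumes a non-empty token list
    return {term: 0.5 + 0.5 * count / max_tf for term, count in counts.items()}


tokens = "the cat sat on the mat the end".split()
ntf = normalised_tf(tokens)
atf = augmented_tf(tokens)
print(ntf["the"], ntf["cat"])
print(atf["the"], atf["cat"])
```

Under Equation 4.7 every term present in the document receives a weight of at least 0.5, so a single occurrence still counts for half as much as the most frequent term.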

Inverse document frequency (idf): The concept underlying idf is that query terms which occur in few documents in the collection are more selective, and hence more useful, than those which occur in many [Sparck-Jones, 1972]. An idf weight considers the number of documents a query term t occurs in (df(t)) relative to the number of documents in the collection (N), as shown in Equation 4.8. This idf weighting function never gives negative weights: terms which occur in all documents receive a weight of 0, which can be desirable given that such terms offer no distinguishing power, and all other terms receive a positive weight. Other approaches, for example that shown in Equation 4.9, do not reduce the idf score to zero for terms which occur in all documents. Some idf approaches, for example that shown in Equation 4.10, give negative weights to terms occurring in more than half the documents in the collection. Depending on the collection format this may or may not be desirable. The logs in idf scoring functions can be taken to any convenient base.

$$idf(t) = \log \frac{N}{df(t)} \qquad (4.8)$$

$$idf(t) = \log\left(\frac{N}{df(t) + 1}\right) + 1 \qquad (4.9)$$

$$idf(t) = \log \frac{N - df(t) + 0.5}{df(t) + 0.5} \qquad (4.10)$$
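To make the difference in behaviour concrete, the sketch below evaluates Equations 4.8 and 4.10 for rare and common terms. The function names and the collection size of 1000 documents are assumptions for illustration; natural logarithms are used, although any base would do.

```python
import math


def idf_basic(n_docs, df):
    """Equation 4.8: log(N / df(t)); zero when the term is in every document."""
    return math.log(n_docs / df)


def idf_smoothed_ratio(n_docs, df):
    """Equation 4.10: log((N - df(t) + 0.5) / (df(t) + 0.5))."""
    return math.log((n_docs - df + 0.5) / (df + 0.5))


n_docs = 1000  # hypothetical collection size
for df in (10, 500, 600, 1000):
    print(df, round(idf_basic(n_docs, df), 3), round(idf_smoothed_ratio(n_docs, df), 3))
```

A term occurring in all 1000 documents scores 0 under Equation 4.8, while under Equation 4.10 the weight crosses zero at df(t) = N/2 and is negative for any term occurring in more than half of the collection.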

tf×idf: Idf scores are commonly multiplied by term frequency in term scoring. This is referred to as tf×idf (term frequency times inverse document frequency) [Salton and Yang, 1973, Salton et al., 1975b]. tf×idf weights allow for term discrimination, the idea being that terms which occur frequently in a document, but infrequently in the collection as a whole, allow for the identification of individual documents from within a collection, and hence are the best terms for identifying the content of a document [Salton and Buckley, 1988].
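A minimal sketch of tf×idf weighting over a toy collection is given below. It assumes raw term counts for tf and the idf of Equation 4.8; the documents and the names `tf_idf` and `weights` are hypothetical.

```python
import math
from collections import Counter

# Toy tokenised collection (invented for illustration).
docs = {
    "d1": "apple banana apple".split(),
    "d2": "banana cherry".split(),
    "d3": "banana banana date".split(),
}
n_docs = len(docs)

# df(t): the number of documents each term occurs in.
df = Counter(term for tokens in docs.values() for term in set(tokens))


def tf_idf(tokens):
    """Weight each term by raw tf times log(N / df(t))."""
    counts = Counter(tokens)
    return {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}


weights = {doc_id: tf_idf(tokens) for doc_id, tokens in docs.items()}
print(weights["d1"])
```

Here "banana" occurs in every document and so receives a weight of 0 however often it is repeated, while "apple", frequent in d1 but absent elsewhere, receives the highest weight in d1.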

Document length normalisation: Longer documents may contain more occurrences of a query term simply because they are long relative to much shorter documents. This can result in a long document receiving a higher term score than a shorter document simply because it is longer, rather than because it holds greater potential relevance to an individual’s querying need. Several techniques to account for this have been proposed, such as normalising using a vector length normalisation factor (as shown next).

The vector space model [Salton et al., 1975a] uses the vector dot product function, shown earlier in Equation 4.5. It performs document length normalisation on the term weight by dividing the dot product by the moduli of the two vectors, as shown in Equation 4.11.
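The normalisation described above, dividing the dot product by the moduli of the two vectors, can be sketched as below. The function name and the example vectors are assumptions for illustration, and returning 0 for an all-zero vector is a convention adopted here rather than something specified in the text.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Dot product of the two vectors divided by the product of their moduli."""
    dot = sum(wq * wt for wq, wt in zip(query_vec, doc_vec))
    q_norm = math.sqrt(sum(w * w for w in query_vec))
    d_norm = math.sqrt(sum(w * w for w in doc_vec))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (q_norm * d_norm)


query = [1.0, 1.0, 0.0]
short_doc = [1.0, 1.0, 0.0]
long_doc = [3.0, 3.0, 0.0]   # same term proportions, three times the weight
other_doc = [0.0, 0.0, 2.0]  # shares no terms with the query

print(cosine_similarity(query, short_doc))
print(cosine_similarity(query, long_doc))
print(cosine_similarity(query, other_doc))
```

The unnormalised dot product would score `long_doc` three times higher than `short_doc`; after normalisation the two receive essentially the same score, since they differ only in length.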

An alternative to the vector space model in IR is the probabilistic retrieval model. This seeks to measure the probability of a document (item) being relevant to a query, given that the document possesses certain attributes (typically words or phrases) occurring in the user’s request. Full details of the theory underlying the probabilistic model in IR are given in [Robertson and Sparck-Jones, 1976].

A well-proven implementation of probabilistic IR is the Okapi BM25 model [Robertson et al., 1992, Robertson et al., 1993, S. E. Robertson and S. Jones, 1994]. Various term frequency and length normalisation approaches have been explored in the Okapi model. Equation 4.13 shows the term weighting approach used in Okapi BM25 [S. E. Robertson and S. Jones, 1994]: for a given document d and a given query term t, the BM25 weighting function calculates a term weight (w_{t,d}).

The overall probability of relevance (matching score ms(q,d)) for a document d is the sum of the weights of the query terms present in the document, shown in Equation 4.12.

$$ms(q, d) = \sum_{t \in q \cap d} w_{t,d} \qquad (4.12)$$

$$w_{t,d} = idf(t) \cdot \frac{tf_{d,t} \cdot (k_1 + 1)}{k_1 \cdot \left((1 - b) + b \cdot \frac{l_d}{avl_d}\right) + tf_{d,t}} \qquad (4.13)$$

where,

idf(t) = log(N / df(t)) in the implementation of BM25 presented in [S. E. Robertson and S. Jones, 1994], where N is the number of documents in the collection and df(t) is the number of documents term t occurs in. Other approaches can of course be used to calculate the idf, as discussed earlier in this section.

tf_{d,t} is the number of occurrences of term t in document d.

l_d is the length of d.

avl_d is the average length of all documents in the collection.

k_1 is a tunable parameter which modifies the extent of the influence of term frequency. Higher values of k_1 increase the influence of tf; k_1 = 0 eliminates the influence altogether.

b is a tunable parameter, ranging between 0 and 1, which modifies the effect of document length normalisation. Setting b = 1 corresponds to the assumption that documents are long simply because they are repetitive; setting b = 0 corresponds to the assumption that documents are long because they are multi-topic.
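Putting the definitions above together, the matching score of Equation 4.12 with the term weight of Equation 4.13 can be sketched as follows. This is an illustrative sketch rather than the implementation used in later chapters: the function name `bm25_score`, the toy collection and the default values k1 = 1.2 and b = 0.75 are assumptions, document length is measured in tokens, and idf is computed as log(N / df(t)).

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_tokens, collection, k1=1.2, b=0.75):
    """Sum of BM25 term weights over query terms present in the document.

    Assumes doc_tokens is itself a member of collection, so that df(t) >= 1
    for every term found in the document. k1 and b defaults are illustrative.
    """
    n_docs = len(collection)
    avg_len = sum(len(tokens) for tokens in collection) / n_docs
    tf = Counter(doc_tokens)
    # Length normalisation component: (1 - b) + b * l_d / avl_d
    length_norm = (1 - b) + b * len(doc_tokens) / avg_len
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue  # only terms in both query and document contribute
        df = sum(1 for tokens in collection if term in tokens)
        idf = math.log(n_docs / df)
        score += idf * tf[term] * (k1 + 1) / (k1 * length_norm + tf[term])
    return score


collection = [
    "apple banana apple".split(),
    "banana cherry".split(),
    "cherry date egg fig".split(),
]

for tokens in collection:
    print(tokens, round(bm25_score(["apple", "cherry"], tokens, collection), 3))
```

With k1 = 0 the fraction in Equation 4.13 reduces to 1, so each matching term contributes only its idf, consistent with the description of k_1 above.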

In Chapter 5.3 we explore the use of BM25 for content only based retrieval of items in PL collections. Justification for use of BM25 in these investigations is provided in Chapter 5.2.