Vector Space Model (VSM) - Information Retrieval Models

2.2 Information Retrieval Models


There are other approaches to extending the Boolean model, such as (Waller and Kraft, 1979) and (Paice, 1984), but they offer no improvement over p-norm. There is also research on extending the standard Boolean model using fuzzy theory, such as (Bordogna et al., 1995; Kraft and Buell, 1983; Bookstein, 1980; Bordogna and Pasi, 1993). State-of-the-art research related to the Boolean model is reported in (Pohl et al., 2012, 2010; Smith, 1990; Lee, 1994).

2.2.2 Vector Space Model (VSM)

The Vector Space Model (VSM) is a type of TVM representation. In VSM, the document and the query are represented as vectors in the document space (Baeza-Yates and Ribeiro-Neto, 2011; Salton et al., 1975). Each dimension of the document space corresponds to an index term, and the value along it is the weight given to that term in the test collection. In other words, the documents in the IR collection contain text words. After pre-processing, we obtain the index terms or keywords representing each document. We then assign a weight to each index term in the test collection. This weight represents the importance or information content of that index term in a given document. Each document is stored in the following form:

$$d = (w_1, w_2, \ldots, w_n)$$

where d is a document in the test collection, w_i is the weight of term i in the index-term collection, and n is the number of index terms that represent the information content of the test collection. Term weights can be assigned statistically, or manually by a trained indexer with expertise in the content of the test collection. When users type their queries as text, the IR system automatically assigns a weight to each search keyword to build the query vector.
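The representation above can be sketched in a few lines of Python. This is only an illustration: the function names `build_vocabulary` and `to_vector` and the toy documents are hypothetical, and raw occurrence counts stand in for the weights w_i, which is an assumption rather than the weighting used in this work.

```python
from collections import Counter

def build_vocabulary(tokenised_docs):
    """Return the sorted list of distinct index terms across all documents."""
    return sorted({term for doc in tokenised_docs for term in doc})

def to_vector(tokens, vocabulary):
    """Map a token list to a weight vector d = (w_1, ..., w_n).

    Raw occurrence counts are used as placeholder weights (an assumption);
    any term-weighting scheme could be substituted here.
    """
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Toy, already pre-processed documents (hypothetical data).
docs = [
    ["vector", "space", "model"],
    ["boolean", "model", "model"],
]
vocabulary = build_vocabulary(docs)
doc_vectors = [to_vector(doc, vocabulary) for doc in docs]

# The query is mapped into the same n-dimensional space as the documents.
query_vector = to_vector(["vector", "model"], vocabulary)
```

Documents and query share one vocabulary, so position i always refers to the same index term in every vector.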

Term-Weighting Schemes in VSM

A good index term has a high discrimination value: assigning it to the collection decreases the similarity between documents (Salton et al., 1975). The simplest term-weighting scheme uses the number of occurrences of a term in a given document, called the term frequency (tf). This scheme has a drawback: a term that is repeated in every document receives a high weight in all of them, which makes it a poor discriminator between documents. (Jones, 2004) proposed another weighting scheme called Inverse Document Frequency (idf), given by log(N/n), where N is the total number of documents in the collection and n is the number of documents to which the term is assigned.
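As a rough sketch of the two quantities just described, the following Python computes tf for one document and idf = log(N/n) for a collection. The function names and the toy collection are hypothetical, and the natural logarithm is an assumption, since the text does not fix a base.

```python
import math
from collections import Counter

def term_frequencies(tokens):
    """tf: number of occurrences of each term in one document."""
    return Counter(tokens)

def inverse_document_frequencies(tokenised_docs):
    """idf = log(N / n) for every term that occurs in the collection.

    N is the number of documents and n the number of documents that
    contain the term. The natural logarithm is assumed.
    """
    N = len(tokenised_docs)
    document_frequency = Counter()
    for doc in tokenised_docs:
        document_frequency.update(set(doc))  # count each term once per document
    return {term: math.log(N / n) for term, n in document_frequency.items()}

# Toy collection (hypothetical data).
docs = [
    ["model", "vector"],
    ["model", "boolean"],
    ["model", "fuzzy", "fuzzy"],
]
tf = term_frequencies(docs[2])
idf = inverse_document_frequencies(docs)
```

Here "model" occurs in every document, so its idf is log(3/3) = 0, which reflects the drawback of tf alone: a term present everywhere does not discriminate between documents.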

(Salton and Buckley, 1988) proposed several weighting schemes for automatic text retrieval; these are shown in Figure 2.4. Salton and Buckley classified a term-weighting scheme according to three main components: term frequency, collection frequency and normalisation. One of these combinations is Term Frequency-Inverse Document Frequency (TF-IDF). TF-IDF is now the best-known term-weighting scheme in VSM and has been widely used in the literature, for example in (Liu, 2011; Reed et al., 2006a; Greengrass, 2000).
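For orientation, the basic product form of TF-IDF can be written as follows. This is a sketch of the simplest variant only; Salton and Buckley list several alternative choices for each component, and the indices (term j in document i) are chosen here for illustration.

$$w_{ij} = tf_{ij} \cdot \log\frac{N}{n_j}$$

As a worked example under assumed values, with N = 1000 documents, a term that appears in n_j = 10 of them, a term frequency tf_ij = 3 and the natural logarithm, the weight is 3 · ln(100) ≈ 13.82.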

Figure 2.4: Term Weighting Components by (Salton and Buckley, 1988).

(Lee, 1995) discussed each term-weighting component proposed by Salton and Buckley and experimentally investigated whether cosine normalisation plays an important role in retrieving different sets of documents. Lee concluded that cosine normalisation is a more important factor than maximum normalisation in retrieving different sets of documents. Additionally, Lee studied the properties of different weighting schemes and showed that significant improvements are obtained by combining the results retrieved by weighting schemes with different properties. Earlier, (Fox and Shaw, 1994) used multiple document representations and multiple query representations to improve retrieval effectiveness. (Belkin et al., 1993) improved effectiveness by using multiple query representations built from different Boolean query formulations. (Harman, 1993) suggested that combining multiple retrieval runs can improve retrieval effectiveness.

Lee also classified weighting schemes into three classes according to the term-weighting component used. These classes are as follows:

1. Class C: weighting schemes of this class perform cosine normalisation. Their advantage is in retrieving relevant single-topic documents from collections with varying document lengths. Their disadvantage is the difficulty in retrieving relevant multiple-topic documents and longer documents.

2. Class M: weighting schemes of this class perform maximum normalisation but not cosine normalisation. Their advantage is that they may alleviate the problem cosine normalisation has in retrieving relevant documents that cover multiple topics. However, they cannot normalise document length, so they tend to retrieve longer documents regardless of their relevance.

3. Class N: weighting schemes of this class perform neither cosine normalisation nor maximum normalisation. This class favours longer documents over shorter ones, which may affect the retrieval of documents relevant to users' queries.
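The trade-off between classes C and M can be illustrated with a small Python sketch. The normalisation functions below are simplified assumptions (dividing by the Euclidean length and by the largest weight, respectively) and may differ from the exact components Lee evaluated; the vectors are hypothetical.

```python
import math

def cosine_normalise(weights):
    """Divide each weight by the Euclidean length of the vector."""
    length = math.sqrt(sum(w * w for w in weights))
    return [w / length for w in weights] if length else list(weights)

def maximum_normalise(weights):
    """Divide each weight by the largest weight in the vector."""
    largest = max(weights, default=0)
    return [w / largest for w in weights] if largest else list(weights)

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors: both documents carry the same on-topic weights,
# but the second one also covers three further terms (other topics).
query = [1, 1, 0, 0, 0]
single_topic = [1, 2, 0, 0, 0]
multi_topic = [1, 2, 2, 2, 2]

for name, normalise in [("C", cosine_normalise), ("M", maximum_normalise)]:
    single_score = inner_product(query, normalise(single_topic))
    multi_score = inner_product(query, normalise(multi_topic))
    print(f"class {name}: single={single_score:.3f} multi={multi_score:.3f}")
```

With these vectors, cosine normalisation lowers the score of the multiple-topic document relative to the single-topic one, while maximum normalisation scores both equally because the extra terms do not shrink the on-topic weights.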

(Reed et al., 2006a) studied the relationship between the number of terms in the test collection and the document frequency distribution, where the document frequency of a term is the number of documents in the test collection in which it occurs. They concluded that adding new documents to a small test collection has a major effect on document frequencies, whereas adding new documents to a large test collection has only a minor effect. This led them to name the idf in a large test collection the inverse corpus frequency (ICF), and they conducted experiments with the following term-weighting scheme:

$$w_{ij} = \log(1 + tf_{ij}) \cdot \log\left(\frac{N + 1}{n_j + 1}\right) \qquad (2.2.11)$$

where tf_ij is the term frequency of term j in document i and ICF = log((N + 1)/(n_j + 1)), where N is the number of documents in the corpus and n_j is the number of documents in the corpus in which term j appears. (Reed et al., 2006a) conducted their studies on three test collections: Reuters-21578 (Lewis, 1997), SMART (Salton, 2013) and 20 Newsgroups (Rennie, 2015). In their work they used two similarity functions: Euclidean distance and cosine similarity.
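Equation 2.2.11 translates directly into code. The sketch below is illustrative only: the function name `tf_icf_weight` is hypothetical, and the natural logarithm is an assumption because the base is not stated.

```python
import math

def tf_icf_weight(tf_ij, n_j, N):
    """Weight of term j in document i following Equation 2.2.11.

    tf_ij: occurrences of term j in document i
    n_j:   number of corpus documents in which term j appears
    N:     number of documents in the corpus
    The natural logarithm is assumed.
    """
    return math.log(1 + tf_ij) * math.log((N + 1) / (n_j + 1))

# Hypothetical values: the same term frequency, but a rare and a common term.
rare = tf_icf_weight(tf_ij=3, n_j=1, N=100)
common = tf_icf_weight(tf_ij=3, n_j=50, N=100)
```

The added 1 in each factor keeps the weight defined when tf_ij or n_j is zero, and a term that appears in every document (n_j = N) receives weight zero.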

Similarity Matching Between Document and Query in VSM

Once the document vectors and the query vector of term weights have been computed using a term-weighting scheme (TWS), the next step is to calculate the similarity between the query vector and each document vector in the test collection. The documents are then retrieved in descending order of their similarity values, so the highest-ranking document is the one most similar to the query. The similarity matching procedure simulates the system's automatic measurement of how relevant each document is to the query. The more accurate the similarity matching function, the more effective the IR system. Five well-known similarity matching functions are widely used in VSM: inner product, cosine similarity, Dice, Jaccard and Euclidean distance between document and query vectors (Greengrass, 2000; Baeza-Yates and Ribeiro-Neto, 2011). Several studies of these functions are reported in (Greengrass, 2000). Chapters 5 and 6 use cosine similarity, the most widely used and most efficient similarity function in VSM. The cosine similarity between document d and query q, Cosine Similarity(d, q), is defined by:

$$\text{Cosine Similarity}(d, q) = \frac{\sum_{i=1}^{n} W_{id} \cdot W_{iq}}{\sqrt{\sum_{i=1}^{n} W_{id}^{2} \cdot \sum_{i=1}^{n} W_{iq}^{2}}} \qquad (2.2.12)$$

In the above equation, n is the number of index terms that exist in the document d and query q, W_id is the weight of term i in document d and W_iq is the weight of the same term i in query q.
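A minimal Python sketch of Equation 2.2.12 and of ranking by descending similarity follows. The function names and the toy vectors are hypothetical, and returning 0.0 for an all-zero vector is an assumption made here to avoid division by zero; the equation itself is undefined in that case.

```python
import math

def cosine_similarity(d, q):
    """Cosine similarity between a document vector d and a query vector q.

    Both vectors must list weights for the same n index terms in the
    same order. Returns 0.0 if either vector is all zeros (an assumption).
    """
    numerator = sum(w_id * w_iq for w_id, w_iq in zip(d, q))
    denominator = math.sqrt(sum(w * w for w in d) * sum(w * w for w in q))
    return numerator / denominator if denominator else 0.0

def rank_documents(doc_vectors, q):
    """Return document indices in descending order of similarity to q."""
    return sorted(
        range(len(doc_vectors)),
        key=lambda i: cosine_similarity(doc_vectors[i], q),
        reverse=True,
    )

# Toy weight vectors over three index terms (hypothetical data).
doc_vectors = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
]
query = [1, 1, 0]
ranking = rank_documents(doc_vectors, query)
```

In this example the second document points in the same direction as the query and ranks first, while the third shares no terms with the query and ranks last.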

In document Evolutionary algorithms and machine learning techniques for information retrieval (Page 52-56)