## MODULE 18

## Application - Document Recognition

## LESSON 41

### Document Classification and Retrieval

### Keywords: Vector Space, Cosine Similarity, Information Retrieval, Document Classification


• Classification: We consider the following set of documents. The first three were used in the previous lesson; here, we additionally attach class labels.

– D1: The good old teacher teaches several courses (from class teacher)

– D2: In the big old college in the big old town (from class big)

– D3: The good teacher teaches in the evenings (from class teacher)

– D4: It is a big town (from class big)

– D5: The old college is the biggest in the town (from class big)

– D6: Courses taught by the teacher are good (from class teacher)

We represent these documents using the binary vectors shown in Table 1. This is similar to the representation used in the previous lesson; here, we also need the class labels.

| Doc | big | college | course | evening | good | old | several | teach | town | class label |
|-----|-----|---------|--------|---------|------|-----|---------|-------|------|-------------|
| 1   | 0   | 0       | 1      | 0       | 1    | 1   | 1       | 1     | 0    | teacher     |
| 2   | 1   | 1       | 0      | 0       | 0    | 1   | 0       | 0     | 1    | big         |
| 3   | 0   | 0       | 0      | 1       | 1    | 0   | 0       | 1     | 0    | teacher     |
| 4   | 1   | 0       | 0      | 0       | 0    | 0   | 0       | 0     | 1    | big         |
| 5   | 1   | 1       | 0      | 0       | 0    | 1   | 0       | 0     | 1    | big         |
| 6   | 0   | 0       | 1      | 0       | 1    | 0   | 0       | 1     | 0    | teacher     |

Table 1: Binary representations of the six documents

• Now let us consider a document D given by D: Teachers in the evening college are good. Note that the corresponding vector in the 9-dimensional space based on the terms shown in Table 1 is ‘0 1 0 1 1 0 0 1 0’.

• NNC: Now the distances between D and D1, D2, D3, D4, D5, and D6, using Hamming distance, are 5, 6, 1, 6, 6 and 3 units respectively. So, the nearest neighbor is D3 from class teacher, and using NNC we classify D as a member of class teacher.

• KNNC: The first three nearest neighbors of D are D3, D6, and D1, all from class teacher. So, KNNC with K = 3 assigns D to class teacher.
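• The NNC and KNNC steps above can be sketched in a few lines of Python (an illustrative sketch, not part of the lesson; the vectors are copied from Table 1):

```python
# Sketch: NNC and KNNC with Hamming distance over the binary vectors of Table 1.
# Term order: big, college, course, evening, good, old, several, teach, town
docs = {
    "D1": ([0, 0, 1, 0, 1, 1, 1, 1, 0], "teacher"),
    "D2": ([1, 1, 0, 0, 0, 1, 0, 0, 1], "big"),
    "D3": ([0, 0, 0, 1, 1, 0, 0, 1, 0], "teacher"),
    "D4": ([1, 0, 0, 0, 0, 0, 0, 0, 1], "big"),
    "D5": ([1, 1, 0, 0, 0, 1, 0, 0, 1], "big"),
    "D6": ([0, 0, 1, 0, 1, 0, 0, 1, 0], "teacher"),
}
d = [0, 1, 0, 1, 1, 0, 0, 1, 0]  # "Teachers in the evening college are good"

def hamming(u, v):
    # Count the positions where the two binary vectors disagree.
    return sum(a != b for a, b in zip(u, v))

dists = {name: hamming(d, vec) for name, (vec, _) in docs.items()}
nearest = min(dists, key=dists.get)      # NNC: single nearest neighbor
top3 = sorted(dists, key=dists.get)[:3]  # KNNC with K = 3
votes = [docs[name][1] for name in top3]
label = max(set(votes), key=votes.count)  # majority vote among the K neighbors
```

Running this reproduces the distances 5, 6, 1, 6, 6 and 3, the nearest neighbor D3, and the assigned class teacher.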

• Document retrieval: In document retrieval, the user poses a query and documents are retrieved based on the matching between the query terms and terms in the documents.

• Let us consider the query good teacher. In the simplest scheme based on the boolean model for retrieval, we retrieve the documents that have occurrences of both good and teacher. Note that D1, D3 and D6 are retrieved against this query.

• If we strictly insist on good and teacher being present next to each other, in that order, then only D3 should be retrieved. Such retrieval is possible when phrase queries are used.

• For example, the query “good teacher” could mean: retrieve documents in which the phrase good teacher occurs, leading to the retrieval of only D3.

• However, using the simple boolean model, we retrieve D1, D3 and D6, because all the words in the query appear, though not necessarily as a phrase, in these documents.
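• Boolean retrieval amounts to intersecting the posting lists of the query terms. A minimal sketch (the posting lists below are read off Table 1; the stemmed terms good and teach stand in for the query words, as in the tables):

```python
# Sketch of boolean AND retrieval over an inverted index built from Table 1.
index = {
    "good": {"D1", "D3", "D6"},
    "teach": {"D1", "D3", "D6"},
    "big": {"D2", "D4", "D5"},
    "town": {"D2", "D4", "D5"},
}

def boolean_and(query_terms):
    # Intersect the posting lists of all query terms.
    result = None
    for term in query_terms:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    return result or set()

hits = boolean_and(["good", "teach"])  # retrieves D1, D3 and D6
```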

Vector Space Representation

• So far, we have considered only whether a term is present or absent in a document to arrive at the representation of the document. Even though this is simple and meaningful to use in some applications, it has the following problems:

1. It has no mechanism to rank order the retrieved documents. Most of the practical needs require the retrieved documents to be ranked based on some notion of similarity.

2. Users may find it difficult to pose appropriate queries in a formal manner. This may impose on users a requirement to have an understanding of the boolean logic to pose the query.

3. In practice, users would prefer to pose queries freely in a natural language.

4. Further, an inherent weakness of the boolean model is that it gives the same importance to a term irrespective of whether it occurred once or more frequently in a document.

• The vector space model offers a solution to ranking because the notion of similarity between vectors is well understood and popularly used in areas like data mining, machine learning, and pattern recognition.

• It is possible to exploit the vector space model in document retrieval if both the documents and queries are represented as vectors.

• The most popular scheme for representing documents as vectors is based on term frequency (TF) and inverse document frequency (IDF).

• Intuitively, TF and IDF together are meant to give more importance to terms that occur frequently in a document but not across the entire collection of documents.

• The term frequency of the ith term (t_i) in the jth document (D_j), denoted by TF_ij, is defined as the number of times t_i occurs in D_j. The TF values of the 9 terms in the six documents, shown in Table 1, are given in Table 2.

• The inverse document frequency of term t_i in a document collection D of size n is given by

IDF_i = log(n / df_i)

where df_i is the number of documents in which t_i occurs.

| Doc | big | college | course | evening | good | old | several | teach | town | class label |
|-----|-----|---------|--------|---------|------|-----|---------|-------|------|-------------|
| 1   | 0   | 0       | 1      | 0       | 1    | 1   | 1       | 2     | 0    | teacher     |
| 2   | 2   | 1       | 0      | 0       | 0    | 2   | 0       | 0     | 1    | big         |
| 3   | 0   | 0       | 0      | 1       | 1    | 0   | 0       | 2     | 0    | teacher     |
| 4   | 1   | 0       | 0      | 0       | 0    | 0   | 0       | 0     | 1    | big         |
| 5   | 1   | 1       | 0      | 0       | 0    | 1   | 0       | 0     | 1    | big         |
| 6   | 0   | 0       | 1      | 0       | 1    | 0   | 0       | 2     | 0    | teacher     |

Table 2: TF values

| big  | college | course | evening | good | old  | several | teach | town |
|------|---------|--------|---------|------|------|---------|-------|------|
| 0.30 | 0.48    | 0.48   | 0.78    | 0.48 | 0.30 | 0.78    | 0.30  | 0.30 |

Table 3: IDF values

• The IDF values of the 9 terms in the collection of 6 documents are given in Table 3.
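• The Table 3 values can be checked directly, assuming the base-10 logarithm (which the numbers bear out, e.g. log10(6/3) ≈ 0.30 for big). A small sketch:

```python
import math

# Sketch checking Table 3: IDF_i = log10(n / df_i) with n = 6 documents.
# df counts (documents containing each term) are read off Table 1.
# Note: by Table 1, "good" occurs in D1, D3 and D6 (df = 3), which gives
# 0.30 rather than the 0.48 shown in Table 3.
n = 6
df = {"big": 3, "college": 2, "course": 2, "evening": 1,
      "good": 3, "old": 3, "several": 1, "teach": 3, "town": 3}
idf = {term: round(math.log10(n / count), 2) for term, count in df.items()}
```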

• The weight or importance of t_i in document D_j, characterized by TF_ij × IDF_i (the product of the TF and IDF values), is given in Table 4.

• Now each of the six documents is represented as a 9-dimensional vector.

• A popular similarity value between two vectors D_i and D_j is given by the cosine of the angle (θ) between the two vectors:

cos(θ) = (D_i^t D_j) / (|| D_i || || D_j ||)

| Doc | big  | college | course | evening | good | old  | several | teach | town | class label |
|-----|------|---------|--------|---------|------|------|---------|-------|------|-------------|
| 1   | 0    | 0       | 0.48   | 0       | 0.48 | 0.30 | 0.78    | 0.60  | 0    | teacher     |
| 2   | 0.60 | 0.48    | 0      | 0       | 0    | 0.60 | 0       | 0     | 0.30 | big         |
| 3   | 0    | 0       | 0      | 0.78    | 0.48 | 0    | 0       | 0.60  | 0    | teacher     |
| 4   | 0.30 | 0       | 0      | 0       | 0    | 0    | 0       | 0     | 0.30 | big         |
| 5   | 0.30 | 0.48    | 0      | 0       | 0    | 0.30 | 0       | 0     | 0.30 | big         |
| 6   | 0    | 0       | 0.48   | 0       | 0.48 | 0    | 0       | 0.60  | 0    | teacher     |

Table 4: TF-IDF values

where || D_i || is the Euclidean norm of D_i, that is, its distance from the origin.
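• The cosine similarity above can be sketched as follows (an illustrative sketch, applied here to the TF-IDF vectors of D3 and D6 from Table 4):

```python
import math

# Sketch: cosine of the angle between two TF-IDF vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

d3 = [0, 0, 0, 0.78, 0.48, 0, 0, 0.60, 0]
d6 = [0, 0, 0.48, 0, 0.48, 0, 0, 0.60, 0]
sim = cosine(d3, d6)  # about 0.6, the D3-D6 entry of Table 5
```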

• The similarity values between pairs of documents are given in Table 5.

• Clustering: Using the similarity, in the form of the cosine of the angle between two vectors, we group D1, D3 and D6, as the similarity between any pair is at least 0.6. Similarly, D2, D4 and D5 form a cluster with a minimum pairwise similarity of 0.6. So, the clusters are {D1, D3, D6} and {D2, D4, D5}.

• Classification: Let us consider again document D given by D: Teachers in the evening college are good.

Its representation using TF-IDF weights is D: 0, 0.48, 0, 0.78, 0.48, 0, 0, 0.30, 0

• NNC: The cosine similarities between D and the documents D1 to D6 are 0.49, 0.21, 0.87, 0.0, 0.5 and 0.42 respectively. So, D3 is the nearest neighbor with a similarity of 0.87; the class label of D3 is teacher. As a consequence, D is classified as a member of class teacher using the NNC.

| Doc/Doc | D1   | D2   | D3  | D4   | D5   | D6   |
|---------|------|------|-----|------|------|------|
| D1      | 1.0  | 0.23 | 0.7 | 0.0  | 0.16 | 1.17 |
| D2      | 0.23 | 1.0  | 0.0 | 0.61 | 0.95 | 0.0  |
| D3      | 0.7  | 0.0  | 1.0 | 0.0  | 0.0  | 0.6  |
| D4      | 0.0  | 0.61 | 0.0 | 1.0  | 0.6  | 0.0  |
| D5      | 0.16 | 0.95 | 0.0 | 0.6  | 1.0  | 0.0  |
| D6      | 1.17 | 0.0  | 0.6 | 0.0  | 0.0  | 1.0  |

Table 5: Similarity between the six documents

• KNNC: The first three nearest neighbors of D are D3, D5 and D1. So, KNNC (K = 3) assigns D to class teacher.

• MDC: The centroids of the two classes are:

teacher: (0.0, 0.0, 0.32, 0.26, 0.48, 0.1, 0.26, 0.6, 0.0)
big: (0.4, 0.32, 0.0, 0.0, 0.0, 0.3, 0.0, 0.0, 0.3)

The similarities between D and these centroids are 0.62 and 0.21 respectively. So, based on MDC, we assign D to the class teacher.
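• The MDC computation can be sketched as follows (an illustrative sketch; the TF-IDF vectors are copied from Table 4, and the similarity is the cosine defined earlier):

```python
import math

# Sketch of the minimum-distance (centroid) classifier: average the TF-IDF
# vectors of each class and assign D to the class with the most similar centroid.
teacher_docs = [
    [0, 0, 0.48, 0, 0.48, 0.30, 0.78, 0.60, 0],  # D1
    [0, 0, 0, 0.78, 0.48, 0, 0, 0.60, 0],        # D3
    [0, 0, 0.48, 0, 0.48, 0, 0, 0.60, 0],        # D6
]
big_docs = [
    [0.60, 0.48, 0, 0, 0, 0.60, 0, 0, 0.30],     # D2
    [0.30, 0, 0, 0, 0, 0, 0, 0, 0.30],           # D4
    [0.30, 0.48, 0, 0, 0, 0.30, 0, 0, 0.30],     # D5
]

def centroid(vectors):
    # Component-wise mean, rounded to two decimals as in the lesson.
    return [round(sum(col) / len(vectors), 2) for col in zip(*vectors)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

d = [0, 0.48, 0, 0.78, 0.48, 0, 0, 0.30, 0]  # TF-IDF vector of the test document
sim_teacher = cosine(d, centroid(teacher_docs))  # about 0.62
sim_big = cosine(d, centroid(big_docs))          # about 0.21
label = "teacher" if sim_teacher > sim_big else "big"
```

This reproduces the centroids and the similarities 0.62 and 0.21 quoted above.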

• Decision Tree Classifier: In this example, the terms big and town characterize the class big perfectly, and the terms good and teach are present only in documents of class teacher and are absent in class big. So, a decision tree based on any one rule of the form big > 0 → class big, town > 0 → class big, good > 0 → class teacher, or teach > 0 → class teacher is adequate.

• Linear Discriminant: Consider the weight vector w^t = (−1, −1, 1, 1, 1, 0, 1, 1, −1). The w^t D_i values for D1 to D6 are 2.34, −1.38, 1.86, −0.6, −1.08 and 1.56 respectively; the patterns of class teacher have positive values and the patterns of class big have negative values. Further, for the test pattern D, w^t D = 1.08 > 0; so, D is assigned to the positive class, that is, the class teacher.

• Document Retrieval: Here, we use the similarity between the query Q and each document to rank the documents against a query; the higher the similarity, the higher the document's rank. Let us consider the query Q = good teacher. The corresponding vector based on the TF-IDF representation is

Q = (0.0, 0.0, 0.0, 0.0, 0.48, 0.0, 0.0, 0.3, 0.0).

• Now the cosine similarity values between Q and the documents D1 to D6 are 0.92, 0.0, 0.66, 0.0, 0.0 and 0.8 respectively.

• So, D1 is the most similar to Q, with the largest similarity value of 0.92. The next most similar document is D6 with a similarity of 0.8, followed by D3 with a similarity of 0.66.
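• Ranked retrieval can be sketched as scoring every document against the query vector and sorting (an illustrative sketch; the vectors are copied from Table 4, and only the three documents containing the query terms receive nonzero scores):

```python
import math

# Sketch: rank the documents of Table 4 against the query "good teacher".
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

docs = {
    "D1": [0, 0, 0.48, 0, 0.48, 0.30, 0.78, 0.60, 0],
    "D2": [0.60, 0.48, 0, 0, 0, 0.60, 0, 0, 0.30],
    "D3": [0, 0, 0, 0.78, 0.48, 0, 0, 0.60, 0],
    "D4": [0.30, 0, 0, 0, 0, 0, 0, 0, 0.30],
    "D5": [0.30, 0.48, 0, 0, 0, 0.30, 0, 0, 0.30],
    "D6": [0, 0, 0.48, 0, 0.48, 0, 0, 0.60, 0],
}
q = [0, 0, 0, 0, 0.48, 0, 0, 0.30, 0]  # TF-IDF vector of "good teacher"
ranking = sorted(docs, key=lambda name: cosine(q, docs[name]), reverse=True)
```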

• Note that the vector space representation is helpful in ranking the results, unlike the boolean model.

• However, document D3, which contains the phrase good teacher, is ranked third; documents D1 and D6 are ranked higher. D3 may be ranked better if retrieval based on phrases is used.

Further Reading

• Here, we have illustrated processing and recognition of documents using a small collection of documents.

• Specifically, we have considered the boolean and vector space models.

• We have considered clustering, classification, and document retrieval tasks.

• In practice, the document collections considered are much larger. For example, the Reuters RCV1 benchmark dataset has 806,791 documents and 391,523 terms. In such cases, the decision tree classifier may not be the best choice, as it has to consider each of the 391,523 terms for a possible split. Similarly, classifiers like the NNC and KNNC may not perform well, as in high-dimensional spaces the nearest neighbors of a point may not be well separated from its farthest neighbors.

• There are several other benchmark datasets.

• Also, there are several multimedia benchmarks. However, we have considered only text documents.

• Further, we have considered a situation where each document is assigned to a single class. However, in practice, it is more realistic to consider situations where a document is assigned to more than one class. For example, one or more paragraphs in a document may deal with sports while other paragraphs may be about politics.

• More details on vector space representations and the popular classifiers used can be obtained from the references given below.

Assignment

1. Consider the following 3 documents drawn from classes Keeper and big.

D1: The old night keeper keeps the keep in the town (from class Keeper)

D2: In the big old house in the big old gown (from class big)

D3: Where the old night keeper never did sleep (from class Keeper)

Consider the following 2 test documents. How do you classify them using the Boolean model?

Test1: The night keeper keeps the keep in the night
Test2: And keeps in the dark and sleeps in the light

2. Solve problem 1 using the vector space model. Use TF-IDF based weights and cosine similarity.

3. Let us consider the following example documents from the book Pattern Recognition: An Introduction, V. Susheela Devi, M. Narasimha Murty, Universities Press, Hyderabad, 2011. Here, each document is a part of a paragraph in a chapter of the book. The first 5 documents (D1 to D5) are from the chapter Support Vector Machines and the remaining 5 documents (D6 to D10) are from the chapter on the Bayes classifier. Using vector space representations, with D5 and D10 as test documents, classify them using NNC, KNNC (K=3), MDC, Perceptron, and SVM. You may use a subset of features (terms) that occur at least twice in the collection after ignoring the stop words.

(a) D1: Support Vector Machine (SVM) is a binary classifier. It abstracts a decision boundary in the multi-dimensional space using an appropriate subset of the training set of vectors.

(b) D2: It is easy to observe that there are possibly infinite ways of realizing the decision boundary, a line in this two-dimensional case, which can separate patterns belonging to the two classes.

(c) D3: Another useful notion is the distance between a point x and the decision boundary.

(d) D4: In general, it is possible to obtain the decision boundary in the form of a hyperplane if we learn the vector w.

(e) D5: A linear discriminant function is ideally suited to separate patterns belonging to two classes that are linearly separable.

(f) D6: Bayes classifier is popular in pattern recognition because it is an optimal classifier. It is possible to show that the resultant classification minimizes the average probability of error.

(g) D7: It is based on the assumption that information about classes in the form of prior probabilities and distributions of patterns in the class are known.

(h) D8: It employs Bayes theorem to convert the prior probability into posterior probability based on the pattern to be classified and using the likelihood values.

(i) D9: In general, in pattern classification, we are interested in the average probability of error, which corresponds to a weighted probability of errors corresponding to patterns in a collection.

(j) D10: There are several schemes for estimating the probabilities from the training data. Two popular schemes are the Bayesian scheme and the maximum likelihood scheme.

1. C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

2. D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization”, Nature, Issue 6755, pages 788-791, 1999.

3. Datasets for single-label text categorization: