10.2 Results analysis agent

10.2.3 Document classification

After retrieval of result pages, feature extraction, and representation, the next step for the results analysis agent is to discover general patterns present across the multiple results automatically and to group the results according to their similarities. In other words, the agent attempts to classify and group similar documents based on the extracted features.

This subsection discusses one of the most well-established methods for text classification in information retrieval and briefly summarizes some well-known clustering methods that the results analysis agent could employ.

Term frequency inverse document frequency (TFIDF) document classification

The idea behind TFIDF is to represent each document, treated as a bag of words, as a vector in a multidimensional Euclidean space. Each component $d_t$ of the document vector corresponds to a word or token $t$ and is calculated as the product of two quantities: the term frequency $TF(d, t)$, the number of times term $t$ appears in document $d$, and the inverse document frequency $IDF(t) = \log(D / DF_t)$, where $D$ is the number of documents in the collection and $DF_t$ is the number of documents in the collection in which term $t$ appears at least once.

The final coordinate of document d in axis t is then given by

$d_t = TF(d, t) \cdot IDF(t)$ (10.1)
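As an illustration of equation 10.1, the following minimal Python sketch computes sparse TFIDF vectors for a small collection. The function name `tfidf_vectors`, the whitespace tokenization and the three sample documents are assumptions made for this example only; they are not part of the system described here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per tokenized document.

    Each weight follows equation 10.1: TF(d, t) * log(D / DF_t).
    """
    n_docs = len(docs)  # D: number of documents in the collection
    # DF_t: number of documents in which term t appears at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)  # TF(d, t): raw count of term t in document d
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


# Hypothetical three-document collection, tokenized on whitespace.
docs = [
    "web mining agents mine web pages".split(),
    "agents cluster result pages".split(),
    "topic maps describe topics".split(),
]
vectors = tfidf_vectors(docs)
```

Under this definition a term that occurs in every document of the collection receives a weight of zero, since $\log(D / D) = 0$, so only terms that help to tell documents apart contribute to the vector.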

To classify a new document, the document is represented as a vector in the same vector space as the document vectors generated above, and the distance between the new vector and the existing vectors is measured.

A cosine similarity measure could be used to measure the distance between two such vectors. The same idea applies to user queries: a query q is treated as a document and transformed into a vector in the same vector space as the document vector d, and the distance between q and d is determined by a cosine similarity measure. This scheme could also be adapted to handle queries containing search phrases ("terms"), word exclusions and inclusions (-term, +term) and single-word queries [93].
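The comparison of a query vector with document vectors might be sketched as follows. The sparse dictionaries and their weights are made-up illustrative values, and the helper name `cosine_similarity` is an assumption of this example. Note that cosine similarity grows as vectors become more alike, so the closest document is the one with the highest score.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical weighted vectors for two documents and one query.
doc_a = {"web": 2.2, "mining": 1.1, "agents": 0.4}
doc_b = {"topic": 1.1, "maps": 1.1}
query = {"web": 1.1, "agents": 0.4}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    [("a", doc_a), ("b", doc_b)],
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```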

The TFIDF scheme briefly summarized above represents a key classification method, and variants of it are used by many content mining and meta-search systems [87, 91].

Clustering methods

In addition to the TFIDF scheme described above, other methods of document classification and categorization are also available. Document clustering techniques can be used to group similar documents into clusters, with each cluster representing a topic or subtopic. This idea of automatically grouping similar documents is of key importance to the results analysis agent. If the agent is able to group topically related documents, it could automatically create a taxonomy of results grouped by topic. The clustered results could then be presented to the user in a much more structured way and/or the generated taxonomy could be used as the basis for ranking and filtering operations on the result set or even further query specialization for submission to the query agent [68].

Clustering algorithms can be divided into two broad classes: bottom-up and top-down. Bottom-up clustering initially places each document in a group of its own. Groups are then merged according to some similarity measure until the desired number of clusters has been formed. Top-down clustering requires that the desired number of clusters be declared beforehand. Documents are then assigned to these clusters until all documents have been assigned.
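The bottom-up procedure could be sketched in Python as shown below. The text leaves the similarity measure open, so this example assumes single-link merging (the similarity of two groups is that of their most similar pair of documents) with cosine similarity over sparse vectors. All names and sample vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def bottom_up_clusters(vectors, k):
    """Merge singleton groups until only k groups of document indices remain."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[i] for i in range(len(vectors))]  # one group per document

    def link(a, b):
        # Single-link assumption: similarity of the most similar pair.
        return max(cosine(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > k:
        # Pick the pair of groups (a < b) with the highest similarity.
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: link(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical vectors: two documents about web mining, two about topic maps.
vectors = [
    {"web": 1.0, "mining": 1.0},
    {"web": 1.0, "mining": 0.8},
    {"topic": 1.0, "maps": 1.0},
    {"topic": 0.9, "maps": 1.0},
]
groups = bottom_up_clusters(vectors, 2)
```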

The most promising clustering method for the purposes of the results analysis agent is top-down clustering. Using this method, the agent could supply the algorithm with partitions for clustering similar to the topics defined in the ODP-taxonomy introduced in chapter 6. The document vectors of the various results can then be compared and assigned to these predefined topic clusters. Additionally, after all documents in the result set have been assigned to topic clusters, the results analysis agent could access the individual ODP-tree kept for a specific user and rank the results in each cluster according to how frequently the discriminating words indicated in the individual ODP-tree appear in each document. This is easily done, as it merely requires a lookup in the document vector created for the document. By combining clustering and results re-ranking in this way, the results analysis agent can present a structured, personalized view of results to the user agent for display to the end user.

Two potential top-down clustering methods that the results analysis agent could use to create the ODP-based results taxonomy are briefly introduced below [93, 95]:
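The re-ranking step within a single topic cluster might look like the following sketch. It assumes that each result is available as a sparse vector of term counts keyed by URL and that the user's individual ODP-tree yields a set of discriminating words for the topic. The function name, URLs and words are hypothetical.

```python
def rank_cluster(cluster_docs, discriminating_words):
    """Order the URLs of one cluster by how often the user's words occur.

    cluster_docs maps each URL to its sparse {term: count} document vector;
    scoring a document is a simple lookup in that vector.
    """

    def score(url):
        vector = cluster_docs[url]
        return sum(vector.get(word, 0) for word in discriminating_words)

    return sorted(cluster_docs, key=score, reverse=True)


# Hypothetical cluster for one ODP topic and one user's discriminating words.
cluster = {
    "http://example.org/a": {"agent": 1, "search": 2},
    "http://example.org/b": {"agent": 4, "mining": 3},
    "http://example.org/c": {"recipe": 5},
}
ranking = rank_cluster(cluster, {"agent", "mining"})
```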

• k-Means clustering. In this algorithm, documents are represented using document vectors and each cluster is represented as the centroid of the documents belonging to that cluster.

Documents are initially grouped into k groups and k corresponding document centroids are computed accordingly. The algorithm then proceeds as shown in figure 10.3.

If this method is chosen for use by the results analysis agent, the ODP taxonomy could supply the groups into which documents are placed. Next, a representative set of default documents can be chosen from the ODP for each group and the corresponding centroids calculated. This initial training data can then be used by the algorithm for document assignment.

Initialize cluster centroids to arbitrary vectors
while further improvement is possible
    for each document d do
        find the cluster c whose centroid is most similar to d
        assign d to cluster c
    end for
    for each cluster c do
        recalculate the centroids of cluster c based on the documents assigned to it
    end for
end while

Figure 10.3: The k-Means algorithm [93].

• Principal direction divisive partitioning. Boley presents a top-down clustering algorithm that could be used by the results analysis agent [95]. The algorithm constructs a binary tree in which each node represents a data structure containing the documents associated with that node. Leaf nodes in the tree represent unsplit clusters. The algorithm repeatedly selects a leaf node in the binary tree and splits it into two child subclusters. The algorithm terminates when the number of leaf nodes in the binary tree is equal to the desired number of partitions and then returns the entire tree for further processing. The interested reader is referred to the work by Boley for a comprehensive discussion of this algorithm [95].
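Returning to the first of the two methods, the k-Means loop of figure 10.3 could be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: centroids are seeded from labelled vectors (standing in for centroids computed from default ODP documents) instead of arbitrary vectors, "further improvement" is taken to mean that at least one assignment changed, and similarity is cosine similarity. All names and data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def k_means(docs, seed_centroids, max_iterations=100):
    """Assign each document to the label of its most similar centroid."""
    centroids = dict(seed_centroids)
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            max(centroids, key=lambda label: cosine(doc, centroids[label]))
            for doc in docs
        ]
        if new_assignment == assignment:
            break  # assumption: no change means no further improvement
        assignment = new_assignment
        for label in centroids:
            members = [d for d, a in zip(docs, assignment) if a == label]
            if members:  # an empty cluster keeps its previous centroid
                centroids[label] = centroid(members)
    return assignment, centroids


# Hypothetical seed centroids for two ODP-style topics and three results.
seeds = {
    "Computers": {"web": 1.0, "agent": 1.0},
    "Science": {"cell": 1.0, "gene": 1.0},
}
docs = [
    {"web": 2.0, "mining": 1.0},
    {"gene": 3.0, "cell": 1.0},
    {"agent": 1.0, "web": 1.0},
]
assignment, centroids = k_means(docs, seeds)
```

Seeding the centroids with topic labels keeps the resulting clusters aligned with the predefined taxonomy, which is what allows the assignments to be read directly as topic labels.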

10.2.4 Topics-links database

After the clustering process has been completed, a taxonomy that organizes the web documents into different clusters or topics will have been generated. This taxonomy can be seen as new knowledge gained by the agent and is stored in the topics-links database for future reference and/or possible knowledge sharing between agents.

The representation format most appropriate for explicitly representing this taxonomy is the XTM topic map representation discussed in chapter 3. The taxonomy generated by the clustering procedure described in the previous subsection can be encoded into the XTM format by using the convention that clusters represent topics and documents represent occurrences of the topic. Associations between topics can also be represented in the topic map.
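The convention of clusters as topics and documents as occurrences might be encoded as in the sketch below, which uses Python's standard `xml.etree.ElementTree` module. It assumes the XTM 1.0 element names (`topicMap`, `topic`, `baseName`, `baseNameString`, `occurrence`, `resourceRef`) and namespaces; the function name, topic identifier and URLs are hypothetical, and associations between topics are omitted for brevity.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed from the XTM 1.0 and XLink specifications.
XTM_NS = "http://www.topicmaps.org/xtm/1.0/"
XLINK_NS = "http://www.w3.org/1999/xlink"
ET.register_namespace("", XTM_NS)
ET.register_namespace("xlink", XLINK_NS)


def clusters_to_xtm(clusters):
    """Serialize {topic_id: (name, [urls])} as a topic map string.

    Each cluster becomes a topic; each document URL becomes an occurrence.
    """
    topic_map = ET.Element(f"{{{XTM_NS}}}topicMap")
    for topic_id, (name, urls) in clusters.items():
        topic = ET.SubElement(topic_map, f"{{{XTM_NS}}}topic", {"id": topic_id})
        base_name = ET.SubElement(topic, f"{{{XTM_NS}}}baseName")
        ET.SubElement(base_name, f"{{{XTM_NS}}}baseNameString").text = name
        for url in urls:
            occurrence = ET.SubElement(topic, f"{{{XTM_NS}}}occurrence")
            ET.SubElement(
                occurrence,
                f"{{{XTM_NS}}}resourceRef",
                {f"{{{XLINK_NS}}}href": url},
            )
    return ET.tostring(topic_map, encoding="unicode")


# Hypothetical cluster produced by the clustering step.
xml_text = clusters_to_xtm(
    {"computers": ("Computers", ["http://example.org/a", "http://example.org/b"])}
)
```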

The results analysis agent could maintain an XTM topic map for each user of the search system, representing the topics and associated URLs found by the system for the user’s queries. Space does not permit a complete discussion of the XTM syntax, but the interested reader is referred to the XTM topic map specification for a complete description thereof [28].

In document A multi-agent collaborative personalized web mining system model. (Page 176-179)