vocabulary. Note that although this attribute is important for determining the performance of semantic indexing, that performance does not depend exclusively on this single attribute. From Table D.1 in Appendix D, we can see that there are datasets with similar Nearest Neighbour Similarity that have contrasting performance with respect to semantic indexing. Hence, the use of meta-learning allows us to leverage the other attributes to improve the accuracy of our prediction.
3.3
Chapter Summary
In this chapter, we investigated the benefit of semantic indexing for text classification. We used the GVSM framework in our study to test four different knowledge-resource-based approaches and three different distributional approaches for computing semantic relatedness. Semantic indexing with the knowledge-resource-based approaches showed very little improvement, with many of the results being significantly worse than not using semantic indexing. Note that while these WordNet-based metrics have been widely evaluated on linguistic tasks such as synonymy detection and word pair association, to the best of our knowledge, this is the first time such a comprehensive evaluation of these metrics has been reported for text classification.
In contrast, distributional approaches showed more potential for semantic indexing, with substantial gains in text classification performance. However, the performance of distributional semantic relatedness approaches also revealed that semantic indexing does not always improve text classification performance and may sometimes even be harmful. Our results suggest that datasets with documents written in a more professional and consistent style benefit more from semantic indexing. We also observed that datasets with fewer and shorter documents benefited less from semantic indexing.
Considering that semantic indexing introduces additional overhead to the process of text representation, we set out to determine when and when not to apply semantic indexing using meta-learning. Accordingly, we presented a case-based approach for predicting when to use semantic indexing. Results show that our case-based approach is able to correctly predict the performance of semantic indexing on a range of datasets with over 80% accuracy. Again, to the best of our knowledge, this is the first time any attempt has been made to predict when to apply semantic indexing.
An important consideration when building a case-based system is the choice of attributes for case representation. The attributes we used were obtained from several statistical metrics that capture various important characteristics of text datasets. These range from statistics of document frequencies of terms to measures of clustering of document neighbourhood. The high accuracy achieved in predicting when to use semantic indexing indicates that the attributes used for case representation capture characteristics of text datasets that are predictive of the performance of semantic indexing.
We further used a genetic algorithm to learn the relative importance of our attributes. The high weight assigned to the Nearest Neighbour Similarity attribute indicates the importance of the structure of a dataset in determining the performance of semantic indexing. From Table D.1 in Appendix D, we observe that the incident report datasets, for which semantic indexing did not work, all had a much lower Nearest Neighbour Similarity than the other datasets. This implies that, for the incident report datasets in particular, the sparseness of the datasets affected the quality of the semantic relatedness extracted. Sparseness in these datasets can be attributed to the short length of the documents, which means that any one document contains only a few terms from the vocabulary, thereby reducing the similarity between documents.
Chapter 4
Relevance Weighted Semantic Indexing
Semantic indexing has not resulted in consistent improvement in text classification performance. Our intuition is that the semantic indexing process does not properly capture the relevance of terms in document representations. It is well known that not all terms in a corpus have the same importance, with some terms being better at discriminating between classes, making them more relevant to the classification task. For example, to identify documents that belong to the class Sports, the terms “goal”, “match”, “team” and “football” are more relevant than terms like “rain”, “happy” and “glass”. Thus, it is important for semantic indexing that such class-indicative terms are recognised and assigned higher importance or weight in document representations. While semantic indexing captures the semantic relatedness between terms, we argue that it is not good at capturing the class-indicativeness or relevance of terms.
In this chapter, we introduce a novel framework called Relevance Weighted Semantic Indexing (RWSI), which extends the GVSM by capturing both local (within-document) and global (collection-wide) term relevance for semantic indexing. Global relevance of terms can be learned directly from the training corpus using supervised term weighting functions.
A second aim of this chapter is to demonstrate the utility of supervised indexing for text classification. Accordingly, we demonstrate how the RWSI framework can be used exclusively for supervised document indexing, using an approach we call Relevance Weighted Indexing (RWI). A comparative evaluation of our RWI with the standard tf-δ(t) approach (see Section 2.4) shows that RWI leads to much more consistent improvement in text classification performance.
This chapter is organised as follows: in Section 4.1 we provide a detailed analysis of the inner workings of the GVSM. In Section 4.2 we present an analysis of how term weights can be adversely affected by semantic indexing and demonstrate how this can be addressed using vector normalisation. In Section 4.3 we highlight the need for relevance weighting and present the RWSI framework which extends the GVSM framework by introducing relevance weights of terms for semantic indexing. In Section 4.5, we demonstrate the RWI approach which utilises the RWSI framework for supervised document indexing. Evaluations are presented in Section 4.6. We conclude this chapter with a summary in Section 4.7.
4.1
Analysis of GVSM
The traditional vector space model (VSM) assumes independence between terms. However, this independence assumption is an oversimplification because different terms within an indexing vocabulary often have related or even identical meanings. The implication of the term independence assumption is that the similarity between related documents can only be correctly estimated if these documents share the exact same lexical terms. The GVSM framework was proposed for capturing the relevant dependencies between terms in document representations (Wong et al. 1987). In this section, we provide a comprehensive analysis of semantic indexing using the GVSM. In Section 2.2.4 we formally presented the GVSM. For the sake of completeness, we repeat some of the mathematical equations that are the basis for the GVSM. Given any two documents $q$ and $d$, their similarity can be computed in the GVSM as:
$$Sim(q, d) = \sum_{i=1}^{n} \sum_{j=1}^{n} u_i \vec{t}_i \cdot w_j \vec{t}_j \tag{4.1}$$
Where $n$ is the dimension of the vector space (i.e. the number of terms in the indexing vocabulary), $u_i$ and $w_j$ are the initial weights (tf-idf, binary, etc.) of the terms $t_i$ and $t_j$ in the query $q$ and document $d$ respectively, and $\vec{t}_i$ and $\vec{t}_j$ are the vector representations of $t_i$ and $t_j$ respectively. The product of the two term vectors, $\vec{t}_i$ and $\vec{t}_j$, provides the relatedness between the corresponding terms $t_i$ and $t_j$. Thus, the product of the two term vectors, $\vec{t}_i$ and $\vec{t}_j$, in Equation 4.1 can be replaced by a relatedness function $Rel(t_i, t_j)$. Accordingly, Equation 4.1 can be rewritten as follows:
$$Sim(q, d) = \sum_{i=1}^{n} \sum_{j=1}^{n} u_i w_j Rel(t_i, t_j) \tag{4.2}$$
$$Sim(q, d) = \sum_{i=1}^{n} u_i \sum_{j=1}^{n} w_j Rel(t_i, t_j) \tag{4.3}$$
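As an illustration of Equations 4.2 and 4.3, the following is a minimal sketch rather than the implementation used in this work. The three-term vocabulary, the relatedness values in `rel`, and the function names are assumptions made purely for the example. It shows a query and a document that share no terms, so their plain VSM dot product is zero, while the GVSM similarity is non-zero because the two terms are related.

```python
import numpy as np

def gvsm_similarity_double_sum(u, w, rel):
    """Equation 4.3: sum_i u_i * (sum_j w_j * Rel(t_i, t_j))."""
    n = len(u)
    total = 0.0
    for i in range(n):
        inner = 0.0
        for j in range(n):
            inner += w[j] * rel[i][j]
        total += u[i] * inner
    return total

def gvsm_similarity_matrix(u, w, rel):
    """The same quantity written as the matrix product u^T R w."""
    return float(np.asarray(u) @ np.asarray(rel) @ np.asarray(w))

# Hypothetical vocabulary: ["football", "match", "rain"].
# rel[i][j] plays the role of Rel(t_i, t_j); the values are illustrative only.
rel = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
])

q = np.array([1.0, 0.0, 0.0])  # query contains only "football"
d = np.array([0.0, 1.0, 0.0])  # document contains only "match"

vsm_sim = float(q @ d)                             # no shared terms
gvsm_sim = gvsm_similarity_double_sum(q, d, rel)   # related terms contribute
```

Setting `rel` to the identity matrix recovers the standard VSM dot product, which is one way to see that the term independence assumption is a special case of the GVSM.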
Introducing the function $Rel(t_i, t_j)$ allows for using any approach for computing the relatedness between terms $t_i$ and $t_j$ without restricting to the vector product of term vectors. Recall that document $d$ is represented as a vector $\vec{d}$ in Euclidean space whose dimension is the size of the vocabulary $V$, as shown in Equation 4.4.
$$\vec{d} = (w_1, w_2, \ldots, w_n) \tag{4.4}$$
Where the corresponding weight, $w_i \in \vec{d}$, of each term $t_i \in V$ is non-zero only if $t_i$ occurs in $d$, and zero otherwise. The same applies for $\vec{q}$. Therefore, from Equation 4.3, for each term $t_i \in V$, the original weight of $t_i$ in $\vec{d}$ (including zero weight if $t_i$ is absent in $d$) is replaced by $\sum_{j=1}^{n} w_j Rel(t_i, t_j)$. Accordingly, even if $t_i$ does not occur in $d$, it now gets a corresponding weight $w'_i = \sum_{j=1}^{n} w_j Rel(t_i, t_j)$ in the new semantic representation of $d$, if $t_i$ is related to one or more terms $t_j \in d$ with non-zero weight. This is illustrated in Equation 4.5.
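$$d' = \left( \sum_{j=1}^{n} w_j Rel(t_1, t_j),\ \sum_{j=1}^{n} w_j Rel(t_2, t_j),\ \ldots,\ \sum_{j=1}^{n} w_j Rel(t_n, t_j) \right) \tag{4.5}$$
$$w'_i = \sum_{j=1}^{n} w_j Rel(t_i, t_j) \tag{4.6}$$
$$d' = (w'_1, w'_2, \ldots, w'_n) \tag{4.7}$$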
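The effect of Equations 4.5 to 4.7 on a single document can be sketched as follows. This is an illustrative example only: the vocabulary, the relatedness values and the function name `semantic_index` are assumptions, and the relatedness matrix is assumed to be symmetric so that multiplying on the left or on the right gives the same result.

```python
import numpy as np

def semantic_index(d, rel):
    """Return d' with w'_i = sum_j w_j * Rel(t_i, t_j), as in Equation 4.6."""
    d = np.asarray(d, dtype=float)
    rel = np.asarray(rel, dtype=float)
    return rel @ d

# Hypothetical vocabulary: ["football", "match", "rain"].
# rel[i][j] stands in for Rel(t_i, t_j); the values are illustrative only.
rel = np.array([
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.1],
    [0.0, 0.1, 1.0],
])

# The document contains only "match", with an initial weight of 2.0.
d = np.array([0.0, 2.0, 0.0])

d_prime = semantic_index(d, rel)
# "football" and "rain" were absent from d but now carry non-zero weights,
# because each is related to "match" under the assumed relatedness values.
```

In this toy case the absent term "football" receives a weight of 0.8 × 2.0, which is how a term that never occurs in a document can still contribute to its semantic representation.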
Where $w'_i$ is the new semantic weight of term $t_i$ in $d'$. Observe from Equation 4.5 that $d'$ is simply the product of the document vector $d$ and an $n \times n$ matrix, which we will call $T$, where each entry $\tau_{i,j}$ in $T$ corresponds to the value $Rel(t_i, t_j)$. In other words, the matrix $T$ captures the semantic relatedness of all pairs of terms $t_i$ and $t_j$ in $V$. Each column $j$ of $T$ corresponds to a vector
4.2. Preserving Local (Within-Document) Relevance 72