which we will call D. Hence from Equation 4.5, the transformation of the entire document-term matrix can be expressed as:
$D' = D \times T$    (4.8)
Equation 4.8 requires semantic relatedness values to be computed for all pairs of terms $t_i$ and $t_j$ in $V$, and used to populate the term-term semantic relatedness matrix $T$. Each vector $\vec{\tau}_i \in T$ provides the semantic relatedness of the corresponding term $t_i$ with all terms $t_j \in V$. Because no term can be more similar to another term than it is to itself, all entries on the leading diagonal of $T$ (i.e. $i = j$) are assigned a value of 1. All other entries in $T$ are therefore required to be normalised between 0 and 1, with 1 corresponding to identical term pairs and 0 to entirely dissimilar pairs. This normalisation ensures that a term can never be more related to another term than it is to itself. The effect of Equation 4.8 is to boost the presence of related terms that were not contained in the original documents, which in turn has the beneficial effect of making the vector representations of documents that belong to the same class more similar.
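The constraints on $T$ and the product in Equation 4.8 can be sketched in a few lines of plain Python. This is only an illustrative sketch: the helper names `check_relatedness_matrix` and `semantic_index`, and the three-term toy data, are assumptions made for the example and do not come from the text.

```python
def check_relatedness_matrix(T, tol=1e-9):
    """Return True if T is square, has 1s on its diagonal and entries in [0, 1]."""
    n = len(T)
    for i, row in enumerate(T):
        if len(row) != n:
            return False
        if abs(row[i] - 1.0) > tol:
            return False
        if any(v < 0.0 or v > 1.0 for v in row):
            return False
    return True


def semantic_index(D, T):
    """Compute D' = D x T (Equation 4.8) for list-of-lists matrices."""
    n_terms = len(T)
    return [
        [sum(doc[j] * T[j][i] for j in range(n_terms)) for i in range(n_terms)]
        for doc in D
    ]


# Hypothetical toy data: three terms, one document containing only t1 and t2.
T = [
    [1.0, 0.4, 0.6],
    [0.4, 1.0, 0.0],
    [0.6, 0.0, 1.0],
]
D = [[0.5, 0.2, 0.0]]

assert check_relatedness_matrix(T)
D_prime = semantic_index(D, T)
print(D_prime)  # t3 was absent from the document but now has a non-zero weight
```

In this toy case the third term, which does not occur in the document, acquires a positive weight purely through its relatedness to the first term, which is the boosting effect described above.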
4.2 Preserving Local (Within-Document) Relevance
The initial weight $w_i$ assigned to a term $t_i$ in a document $d$ is designed to reflect the importance or relevance of $t_i$ to $d$. However, note from Equation 4.5 that the weight $w'_i$ of $t_i$ in the semantic document representation $d'$ is not exclusively determined by the original weight $w_i$ of term $t_i$. Rather, $w'_i$ is strongly influenced by the weights $w_j$ of the terms $t_j \in d$ that $t_i$ is semantically related to, and also by the strength of this semantic relatedness, $\mathrm{Rel}(t_i, t_j)$. This means that if $t_i$ is strongly related to many other terms $t_j \in d$, then $t_i$ receives a relatively high weight $w'_i$, regardless of its original relevance to $d$. The reverse is also the case, i.e., if $t_i$ is related to only a few terms $t_j \in d$, then $t_i$ receives a relatively low weight. This is an undesired consequence of semantic indexing because, if $t_i$ was initially assigned a relatively low weight $w_i$ due to it being less important or relevant to document $d$, the aggregation of the semantic relatedness of $t_i$, if $t_i$ is related to enough other terms, could result in a high weight $w'_i$ in $d'$. In other words, the relevance of $t_i$ to $d$ is easily lost during semantic indexing, in favour of the semantic relatedness between $t_i$ and the other terms $t_j \in d$.
This problem of loss of local relevance in term weights is of particular concern in situations where semantic relatedness is computed from corpus co-occurrence statistics. In a typical corpus, any term $t_i$ is likely to have non-zero co-occurrence with many other terms $t_j$ in the collection. Thus, a term $t_i$ which is initially absent from, or assigned a low weight in, the vector of a document $d_j$ can easily end up having the highest weight after semantic indexing if it co-occurs often with many other terms in the corpus. Hence, by computing term weights as an aggregation of semantic relatedness, the cumulative effect of less important terms can result in significant amounts of noise being added to document representations.
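The cumulative effect described above can be made concrete with a small numeric sketch. Everything here is hypothetical: the function name `aggregated_weight`, the 20-term document and the relatedness values are invented for illustration, and the sum used is simply the per-term form of the matrix product in Equation 4.8.

```python
def aggregated_weight(doc_weights, relatedness_to_doc_terms):
    """Semantic weight of one term: sum over document terms of w_j * Rel(t_i, t_j)."""
    return sum(w * r for w, r in zip(doc_weights, relatedness_to_doc_terms))


# Hypothetical document with 20 terms: one important term (0.9), 19 minor ones (0.5).
doc_weights = [0.9] + [0.5] * 19

# A term that occurs in the document with weight 0.9 and is related only to itself.
rel_present = [1.0] + [0.0] * 19

# A term absent from the document but weakly related (0.15) to every document term.
rel_absent = [0.15] * 20

w_present = aggregated_weight(doc_weights, rel_present)
w_absent = aggregated_weight(doc_weights, rel_absent)

print(w_present, w_absent)
print("absent term outranks present term:", w_absent > w_present)
```

Under these assumed values, twenty weak relatedness scores are enough for the absent term to overtake the document's most important term, even though no single relatedness value exceeds 0.15.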
D =
        t1     t2     t3     t4     t5
  d1    0.0    0.7    0.6    0.0    0.0
  d2    0.8    0.0    0.0    0.5    0.0
  d3    0.1    0.9    1.0    0.0    0.0
  d4    0.3    1.0    0.7    0.0    0.0

T =
        t1     t2     t3     t4     t5
  t1    1.0    0.5    0.8    0.7    0.3
  t2    0.5    1.0    0.2    0.2    0.3
  t3    0.8    0.2    1.0    0.0    0.1
  t4    0.7    0.2    0.0    1.0    0.3
  t5    0.3    0.3    0.1    0.3    1.0

D' =
        t1     t2     t3     t4     t5
  d1    0.83   0.82   0.74   0.14   0.27
  d2    1.15   0.50   0.64   1.06   0.39
  d3    1.35   1.15   1.26   0.25   0.40
  d4    1.36   1.29   1.14   0.41   0.46

Figure 4.1: Example of semantic indexing using the GVSM
We illustrate this point further with the aid of an example. Figure 4.1 shows a sample document-term matrix $D$ with 4 documents and 5 terms, a matrix $T$ which captures the semantic relatedness between all pairs of terms in the vocabulary, and a semantic document-term matrix $D'$ containing semantic document representations derived from $D$ and $T$ using Equation 4.8. Note from Figure 4.1 that document $d_1$ in $D$ does not contain the term $t_1$. However, after semantic indexing, term $t_1$ has the highest weight in $d'_1$. A similar result is seen in $d_3$ and $d_4$, where $t_1$ has low weights of 0.1 and 0.3 respectively. However, after semantic indexing, $t_1$ again has the highest weight in $d'_3$ and $d'_4$. This is because $t_1$ is semantically related to all the terms in $d_1$, $d_3$ and $d_4$. However, $t_1$ could have been absent or assigned low weights in $d_1$, $d_3$ and $d_4$ because it is not directly important to these documents. For example, if these documents had been about cars and $t_1$ was the term 'Honda', then even though 'Honda' is relevant to the topic of cars, it would be an overstatement to treat 'Honda' as the most important term in $d_1$, $d_3$ and $d_4$ simply because 'Honda' is semantically related to the other terms in these documents. Indeed, many documents about cars will have nothing to do with 'Honda'. Likewise, many documents containing the term 'Honda' could be about the company or motorcycles and have nothing to do with cars. It is clear then that local (within-document) term importance is ignored by the approach in Equation 4.8, resulting in noisy representations. This problem is even more acute in real-world situations where, because of the high dimensionality of document vectors, larger discrepancies can easily result from aggregating semantic relatedness over all terms in a document.
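The example in Figure 4.1 can be checked with a short sketch that recomputes $D'$ from the values of $D$ and $T$ given in the figure and reports the highest-weighted term per document before and after semantic indexing. The helper names `matmul` and `top_term` are assumptions introduced only for this illustration.

```python
# Values of D and T copied from Figure 4.1.
D = [
    [0.0, 0.7, 0.6, 0.0, 0.0],  # d1
    [0.8, 0.0, 0.0, 0.5, 0.0],  # d2
    [0.1, 0.9, 1.0, 0.0, 0.0],  # d3
    [0.3, 1.0, 0.7, 0.0, 0.0],  # d4
]
T = [
    [1.0, 0.5, 0.8, 0.7, 0.3],
    [0.5, 1.0, 0.2, 0.2, 0.3],
    [0.8, 0.2, 1.0, 0.0, 0.1],
    [0.7, 0.2, 0.0, 1.0, 0.3],
    [0.3, 0.3, 0.1, 0.3, 1.0],
]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def top_term(row):
    """Zero-based index of the largest weight in a document vector."""
    return max(range(len(row)), key=lambda i: row[i])


D_prime = matmul(D, T)  # Equation 4.8

for name, before, after in zip(["d1", "d2", "d3", "d4"], D, D_prime):
    print(
        name,
        [round(x, 2) for x in after],
        "top before: t%d," % (top_term(before) + 1),
        "top after: t%d" % (top_term(after) + 1),
    )
```

Running the sketch yields the rounded $D'$ values shown in the figure; the top term of $d_1$, $d_3$ and $d_4$ shifts from $t_2$, $t_3$ and $t_2$ respectively to $t_1$ in every case.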
To address the problem of loss of local relevance from semantic relatedness aggregation, we introduce a modification to the approach in Equation 4.8, which is to normalise all row vectors $\vec{d} \in D$ and all column vectors $\vec{\tau} \in T$ to unit length before taking their product. Normalisation is achieved by dividing each of the vectors $\vec{d}$ and $\vec{\tau}$ by its L2 norm. This ensures that the lengths of the vectors are taken into account, i.e. terms that are semantically related to many document terms are now penalised, which prevents such terms from dominating document representations. The computation of the L2 norm of a vector $v$ is given in Equation 4.9.
$\lVert v \rVert = \sqrt{\sum_{i=1}^{n} v_i^{2}}$    (4.9)
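The normalisation step just described can be sketched as follows: compute the L2 norm of Equation 4.9, scale the rows of $D$ and the columns of $T$ to unit length, and then take their product. The data are the $D$ and $T$ of Figure 4.1; the helper names, and the choice to leave all-zero vectors unchanged, are assumptions made for this illustration rather than details taken from the text.

```python
import math


def l2_norm(v):
    """L2 norm of a vector, as in Equation 4.9."""
    return math.sqrt(sum(x * x for x in v))


def transpose(M):
    return [list(col) for col in zip(*M)]


def unit_rows(M):
    """Scale each row to unit L2 length (assumption: zero rows are left as-is)."""
    out = []
    for row in M:
        n = l2_norm(row)
        out.append([x / n for x in row] if n > 0 else list(row))
    return out


def unit_columns(M):
    """Scale each column to unit L2 length by normalising rows of the transpose."""
    return transpose(unit_rows(transpose(M)))


def matmul(A, B):
    cols = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def top_term(row):
    """Zero-based index of the largest weight in a document vector."""
    return max(range(len(row)), key=lambda i: row[i])


# D and T copied from Figure 4.1.
D = [
    [0.0, 0.7, 0.6, 0.0, 0.0],
    [0.8, 0.0, 0.0, 0.5, 0.0],
    [0.1, 0.9, 1.0, 0.0, 0.0],
    [0.3, 1.0, 0.7, 0.0, 0.0],
]
T = [
    [1.0, 0.5, 0.8, 0.7, 0.3],
    [0.5, 1.0, 0.2, 0.2, 0.3],
    [0.8, 0.2, 1.0, 0.0, 0.1],
    [0.7, 0.2, 0.0, 1.0, 0.3],
    [0.3, 0.3, 0.1, 0.3, 1.0],
]

D_rn = unit_rows(D)      # rows of D scaled to unit length
T_cn = unit_columns(T)   # columns of T scaled to unit length
D_prime = matmul(D_rn, T_cn)

for name, row in zip(["d1", "d2", "d3", "d4"], D_prime):
    print(name, [round(x, 2) for x in row], "top term: t%d" % (top_term(row) + 1))
```

On these data the sketch ranks $t_2$, $t_3$ and $t_2$ highest in $d_1$, $d_3$ and $d_4$ respectively, which are the same terms that carry the highest original weights in $D$. These values are computed by the sketch itself and are not copied from any figure in the source.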
Thus, we can modify Equation 4.8 to reflect this normalisation as follows:
$D' = D_{rn} \times T_{cn}$    (4.10)
where $D_{rn}$ is the document-term matrix $D$ with all rows L2-normalised, and $T_{cn}$ is the semantic relatedness matrix $T$ with all columns L2-normalised. Figure 4.2 shows the semantic document-term matrix $D'$ from Figure 4.1 with L2 normalisation applied before taking the product of the matrices $D$ and $T$. Note that the distribution of terms in the document vectors of $D'$ better reflects their original distribution in $D$. For example, $t_1$ no longer has the highest weight in