6.5 Evaluation with text clustering
7.1.2 Utilizing relations among concepts
In general, there are two ways to utilize semantic relations among concepts: using a specific type of relation, or quantifying the overall relatedness based on all types of relations. This thesis takes the second approach, which is more generic and does not require any parameterization (see Section 1.3). For example, although dieting and smoking are related, it is difficult to define the exact type of relation that connects them.
This approach requires an effective concept relatedness measure. For WordNet, Katoa uses the path-length-based measure of Leacock and Chodorow (1997) (LCH), which computes the relatedness of two concepts from the length of the shortest path between them in WordNet. For Wikipedia, it uses a measure based on the Wikipedia hyperlink structure (Milne and Witten, 2008b). Both measures have been tested against human judgements of semantic relatedness and have been shown to be both effective and efficient (Strube and Ponzetto, 2006; Milne and Witten, 2008a).
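As a rough illustration of a path-length-based measure, the sketch below computes the Leacock and Chodorow score, -log(len / 2D), where len is the number of nodes on the shortest path between two concepts and D is the maximum depth of the taxonomy. The tiny taxonomy and the names `PARENT`, `path_nodes`, and `lch` are hypothetical, chosen only for illustration; they are not Katoa's implementation or real WordNet data.

```python
import math
from collections import deque

# Hypothetical toy is-a taxonomy (child -> parent); not real WordNet data.
PARENT = {
    "dog": "canine", "wolf": "canine", "canine": "mammal",
    "cat": "feline", "feline": "mammal", "mammal": "animal",
}

def build_graph(parent):
    """Turn the child -> parent map into an undirected adjacency map."""
    graph = {}
    for child, par in parent.items():
        graph.setdefault(child, set()).add(par)
        graph.setdefault(par, set()).add(child)
    return graph

GRAPH = build_graph(PARENT)

def path_nodes(a, b):
    """Number of nodes on the shortest path from a to b, endpoints included."""
    seen = {a}
    queue = deque([(a, 1)])
    while queue:
        node, length = queue.popleft()
        if node == b:
            return length
        for nxt in GRAPH[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, length + 1))
    raise ValueError("concepts are not connected")

def depth(node):
    """Depth of a node counted in nodes, with the root at depth 1."""
    d = 1
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d

MAX_DEPTH = max(depth(n) for n in GRAPH)

def lch(a, b):
    """LCH score: shorter paths give higher relatedness."""
    return -math.log(path_nodes(a, b) / (2.0 * MAX_DEPTH))

# Siblings under "canine" score higher than concepts that meet only at "mammal".
dog_wolf = lch("dog", "wolf")
dog_cat = lch("dog", "cat")
```

In this toy taxonomy, "dog" and "wolf" are three nodes apart while "dog" and "cat" are five nodes apart, so the first pair receives the higher score.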
Katoa implements three methods for considering concept relatedness during clustering, all intended to enhance plain concept-based clustering. The first two assess how representative a concept is of a context (its centrality with respect to that context) and highlight the more central concepts by reweighting each concept by its centrality. We investigated two types of centrality: local centrality, a concept’s centrality with respect to its surrounding context, and relative centrality, its centrality with respect to another document. Reweighting by local centrality is performed once for each document before clustering begins, whereas reweighting by relative centrality is performed during clustering for each document pair. Centrality reflects the variation in a concept’s importance when it is mentioned in different contexts by taking its semantic relations with the context into account, whereas the traditional bag-of-words model simply equates importance with the number of occurrences.
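The two reweighting schemes can be sketched as follows. This is a minimal illustration under stated assumptions: centrality is taken here to be the mean relatedness between a concept and the other concepts of the context, which is one plausible reading and not necessarily the exact formula used by Katoa. The relatedness scores and the function names (`rel`, `centrality`, `reweight_local`, `reweight_relative`) are hypothetical.

```python
# Hypothetical pairwise relatedness scores in [0, 1]; illustrative values only.
RELATEDNESS = {
    frozenset(("diet", "nutrition")): 0.8,
    frozenset(("diet", "obesity")): 0.7,
    frozenset(("nutrition", "obesity")): 0.6,
    frozenset(("diet", "table")): 0.05,
}

def rel(a, b):
    """Symmetric relatedness lookup; unknown pairs are treated as unrelated."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)

def centrality(concept, context):
    """Assumed definition: mean relatedness to the other concepts in context."""
    others = [c for c in context if c != concept]
    if not others:
        return 0.0
    return sum(rel(concept, c) for c in others) / len(others)

def reweight_local(doc):
    """Scale each weight by the concept's centrality in its own document.

    Depends on one document only, so it can be computed before clustering.
    """
    return {c: w * centrality(c, doc) for c, w in doc.items()}

def reweight_relative(doc, other):
    """Scale each weight by the concept's centrality in the other document.

    Depends on the document pair, so it is recomputed for every comparison.
    """
    return {c: w * centrality(c, other) for c, w in doc.items()}

# "table" occurs most often but is only weakly related to the document's theme.
doc = {"diet": 2.0, "nutrition": 1.0, "obesity": 1.0, "table": 3.0}
other = {"nutrition": 1.0, "obesity": 2.0}

local = reweight_local(doc)
relative = reweight_relative(doc, other)
```

With these toy scores, "table" has the highest raw weight but ends up below "diet" after local reweighting, because it is only weakly related to the rest of the document.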
However, the problem of connecting texts with different surface forms still exists. Reweighting by relative centrality only influences pairs with some overlap; those with no concepts in common still receive zero similarity. The third method targets this orthogonality problem by altering the similarity measure to take relatedness among different concepts into account. Given a pair of texts, it bridges the surface difference by enriching each text’s representation with concepts that are missing from that text but mentioned in the other, using concept relations to determine the weight of the enriched concepts. In essence, an enriched concept receives greater weight if it has a strong connection with the current text and is closely related in general. Consequently, if an enriched concept is unrelated to every concept mentioned in the current text (i.e., it has zero relatedness to all of them), its weight is zero and it is effectively not added at all.
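The enrichment idea can be illustrated with a short sketch. It assumes, purely for illustration, that an enriched concept's weight is the weight-averaged relatedness between that concept and the concepts already in the text, and that similarity is the cosine of the enriched vectors; the actual weighting in Katoa may differ. The relatedness scores and the names `enrich` and `enriched_similarity` are hypothetical.

```python
import math

# Hypothetical pairwise relatedness scores in [0, 1]; illustrative values only.
RELATEDNESS = {
    frozenset(("smoking", "tobacco")): 0.9,
    frozenset(("smoking", "lung cancer")): 0.7,
    frozenset(("tobacco", "lung cancer")): 0.6,
}

def rel(a, b):
    """Symmetric relatedness lookup; unknown pairs are treated as unrelated."""
    if a == b:
        return 1.0
    return RELATEDNESS.get(frozenset((a, b)), 0.0)

def cosine(u, v):
    """Cosine similarity between two sparse concept -> weight vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def enrich(doc, other):
    """Add concepts found only in `other`, weighted by relatedness to `doc`."""
    enriched = dict(doc)
    total = sum(doc.values())
    if total == 0.0:
        return enriched
    for concept in other:
        if concept in doc:
            continue
        # Assumed weighting: weight-averaged relatedness to the text's concepts.
        weight = sum(w * rel(concept, c) for c, w in doc.items()) / total
        if weight > 0.0:  # concepts unrelated to everything are not added
            enriched[concept] = weight
    return enriched

def enriched_similarity(a, b):
    """Cosine similarity after enriching each text with the other's concepts."""
    return cosine(enrich(a, b), enrich(b, a))

doc_a = {"smoking": 2.0}
doc_b = {"tobacco": 1.0, "lung cancer": 1.0}
doc_c = {"football": 1.0}

plain_ab = cosine(doc_a, doc_b)
enriched_ab = enriched_similarity(doc_a, doc_b)
enriched_ac = enriched_similarity(doc_a, doc_c)
```

Here `doc_a` and `doc_b` share no concepts, so their plain cosine similarity is zero, yet the enriched similarity is positive because their concepts are related. `doc_a` and `doc_c` remain at zero because none of their concepts are related.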
Empirical results provide strong support for the second part of the thesis hypothesis: considering the relations among concepts can significantly improve text clustering over using the bag-of-words model. Furthermore, they show that the plain concept-based clustering method can be further improved (see Table 5.3).
Reweighting by local centrality is the most effective of the three methods: it achieves statistically significant improvements over the plain method in 11 out of 16 cases. The success of local centrality suggests that Katoa’s concept-based representation models are indeed quite exhaustive and distinctive in capturing important thematic information in the texts. By unifying synonyms and eliminating semantic ambiguity, they provide a strong basis for clustering, and stressing the representative concepts makes the concept-based models even stronger.
The enriched similarity measure is the second most effective method. It consistently improves upon the plain method, but only with the hierarchical clustering algorithm; performance with the k-means algorithm was less consistent (see Tables 5.4 and 5.5). Reweighting by relative centrality (centrality with respect to the other document) is the least effective, although intuitively it should highlight the aspects most relevant to a pair of documents.
Comparing performance across the two concept systems, we found that Wikipedia was again more effective than WordNet: with Wikipedia concepts, the number of statistically significant improvements over the plain method is three times the number obtained with WordNet concepts (see Table 5.3). Analysis revealed two possible reasons. The first is the effectiveness of the concept relatedness measures: the LCH measure for WordNet concepts seems less informative than the WLM measure for Wikipedia concepts. The second is that lexical concepts are not necessarily relevant to a text’s theme, which biases the computation of context centrality and thus affects all three methods.
It is not clear a priori whether a concept’s number of occurrences in a text should be taken into account when calculating its context centrality. If it is, relatedness to the more frequently mentioned concepts is emphasized over relatedness to those mentioned only occasionally. Our investigation showed that, somewhat surprisingly, the binary scheme, which considers only the presence or absence of concepts, is generally more effective than taking frequency into account (i.e., the weighted scheme). The exception is the enriched similarity measure with Wikipedia, which is consistently more effective with the weighted scheme than with the binary one. For WordNet concepts, considering occurrence frequencies is likely to do more harm, because lexical features that are thematically unrelated can occur frequently throughout a text (as part of a common expression, for example), which introduces even more bias.
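The difference between the two schemes can be made concrete with a small sketch. It assumes that centrality is an average of relatedness scores over the other concepts in the text, either unweighted (binary scheme) or weighted by occurrence count (weighted scheme); the exact formulas used in the experiments may differ. The scores, the function name `centrality`, and the `weighted` parameter are hypothetical.

```python
# Hypothetical pairwise relatedness scores in [0, 1]; illustrative values only.
SCORES = {
    frozenset(("diet", "nutrition")): 0.8,
    frozenset(("diet", "obesity")): 0.7,
}

def rel(a, b):
    """Symmetric relatedness lookup; unknown pairs are treated as unrelated."""
    return SCORES.get(frozenset((a, b)), 0.0)

def centrality(concept, doc, weighted):
    """Centrality of `concept` in `doc`, a concept -> occurrence-count map.

    weighted=False: binary scheme, every other concept counts once.
    weighted=True:  weighted scheme, each concept counts once per occurrence.
    """
    others = {c: n for c, n in doc.items() if c != concept}
    if not others:
        return 0.0
    if weighted:
        total = sum(others.values())
        return sum(n * rel(concept, c) for c, n in others.items()) / total
    return sum(rel(concept, c) for c in others) / len(others)

# "thing" stands in for a frequent but thematically unrelated lexical concept.
doc = {"diet": 1, "nutrition": 1, "obesity": 1, "thing": 6}

binary_score = centrality("diet", doc, weighted=False)
weighted_score = centrality("diet", doc, weighted=True)
```

In this toy document the frequent but unrelated concept dominates the weighted average, so the centrality of "diet" is considerably lower under the weighted scheme than under the binary one. This mirrors the bias described above for frequently occurring lexical features.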
7.1. REVISITING THE THESIS HYPOTHESIS