6.5 Evaluation with text clustering
7.1.3 Learning document similarity with concepts
Concepts, and the relations among them, provide additional perspectives for modelling the thematic similarity between two texts. Instead of handcrafting an ad hoc formula to combine the information from different aspects, we used machine learning techniques, specifically regression algorithms, to learn the right combination from a small amount of training data: texts whose similarities to each other are already known.
Four types of feature were designed, each capturing a single perspective: the overall document similarity, measured with different representation models and similarity measures, and the one-to-many, the one-to-one, and the many-to-many relations among the concepts in each text (see Table 6.1). The last three perspectives were represented by the distribution of both types of centrality, the strongest semantic connection between individual concepts, and the distribution of relatedness among their concept cliques, respectively. All these features are generic: they are independent of any specific dataset and can be applied to any texts.
Features were evaluated both individually and in combination. Not every feature is equally informative. For example, features that describe the distribution of a concept’s local centrality are less informative because they focus on characteristics of the texts themselves rather than the relations among them. In contrast, relative centrality features turned out to be more predictive (see Table 6.4). This indicates that relative centrality, which directly depicts the relation between texts, can be indicative when utilized appropriately.
Two types of evaluation were conducted: against manually assigned similarities and in the task of clustering. For the former, the goal is to test whether the learned similarity measure can predict similarity as consistently with an average human labeler as human labelers are amongst themselves. The average inter-labeler consistency, in terms of the Pearson correlation coefficient, served as the baseline. Empirical results showed that the learned similarity measure could be even more consistent with humans than they are with each other.
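The comparison against human judgements can be illustrated with a minimal sketch. It assumes that each labeler rates the same list of document pairs, and that the learned measure is compared with the mean human rating per pair; that choice, the function names, and the toy numbers are all illustrative assumptions rather than the exact protocol used in the experiments.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def inter_labeler_baseline(ratings):
    """Average correlation over all pairs of labelers (one row per labeler)."""
    return mean(pearson(a, b) for a, b in combinations(ratings, 2))


def model_consistency(predicted, ratings):
    """Correlation between predicted scores and the mean human rating per pair."""
    mean_rating = [mean(column) for column in zip(*ratings)]
    return pearson(predicted, mean_rating)


# Hypothetical data: three labelers rating five document pairs on a 1-5 scale.
ratings = [
    [1, 2, 3, 4, 5],
    [1, 3, 2, 5, 4],
    [2, 1, 3, 4, 5],
]
# Hypothetical similarity scores predicted by a learned measure.
predicted = [0.1, 0.3, 0.5, 0.7, 0.9]

baseline = inter_labeler_baseline(ratings)
consistency = model_consistency(predicted, ratings)
```

Under this reading, the learned measure matches human-level consistency when `consistency` reaches `baseline`.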
The second evaluation is of greater interest from the point of view of practical application. First, the training dataset is tiny, whereas the test datasets used in this evaluation (the four experimental datasets described in Section 3.1) are much bigger. Second, and more importantly, the learned model was tested on previously unseen documents, which come from different sources and domains from those in the training dataset. The learned similarity measure was used to replace the standard cosine similarity measure for predicting the similarity between any documents or cluster centroids during clustering. Empirical results showed that it consistently and effectively improves on the enriched clustering method (with reweighting by binary local context centrality), which was also the best baseline in this evaluation. We also found that for the learned similarity measure to be effective with the k-means algorithm, the standard mean vector representation of clusters needs to be adapted to represent a cluster by its members instead. This is because the learned measure was trained on relations between individual documents, and the latter representation is a better fit with the underlying model.
There are three options regarding the set of features. The first relates to concept groups. Concepts mentioned in the same document can form tight groups according to their relatedness (see the concept groups section in Table 6.1). Whether or not stray concepts that cannot be assigned to any existing groups are treated as singleton groups produces the full and strict models respectively. The fact that the latter always outperformed the former (see Section 6.4.4) suggests that a certain abstraction is necessary: it is beneficial to focus on the major topics and ignore the less important ones, that is, the singleton groups.
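The adaptation of k-means mentioned earlier, in which a cluster is represented by its members rather than by a mean vector, can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, a cluster's score for a document is assumed to be the average pairwise similarity to its members, and Jaccard overlap between concept sets stands in for the learned similarity measure.

```python
from statistics import mean


def jaccard(a, b):
    """Stand-in pairwise similarity; the learned measure would be used here."""
    return len(a & b) / len(a | b)


def cluster_score(doc, members, sim):
    """Average pairwise similarity between a document and a cluster's members."""
    return mean(sim(doc, member) for member in members)


def member_kmeans(docs, seed_indices, sim, iterations=10):
    """k-means-style clustering that never builds a centroid vector."""
    k = len(seed_indices)
    # Each cluster starts with a single seed document.
    clusters = [[docs[i]] for i in seed_indices]
    labels = None
    for _ in range(iterations):
        # Assignment step: compare each document with cluster members.
        new_labels = [
            max(range(k), key=lambda c: cluster_score(doc, clusters[c], sim))
            for doc in docs
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # Update step: a cluster is simply the documents assigned to it.
        regrouped = [
            [d for d, label in zip(docs, labels) if label == c] for c in range(k)
        ]
        # Keep the previous members if a cluster would otherwise become empty.
        clusters = [new or old for new, old in zip(regrouped, clusters)]
    return labels


# Hypothetical documents, each reduced to a set of concept names.
docs = [
    {"cat", "dog", "pet"},
    {"cat", "pet", "vet"},
    {"dog", "pet", "leash"},
    {"stock", "bond", "market"},
    {"stock", "market", "trade"},
    {"bond", "market", "yield"},
]
labels = member_kmeans(docs, seed_indices=[0, 3], sim=jaccard)
```

Because the measure is only ever applied to pairs of individual documents, this form stays within the setting in which the measure was trained. The cost is more similarity computations per iteration than a centroid-based assignment.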
The second option concerns occurrence frequencies. Whether or not to take the number of occurrences of a concept into account affects most features, and empirical results showed no significant differences between the binary and weighted schemes (see Table 6.4).
The third option relates to the concept systems. Empirical results again showed Wikipedia’s advantages over WordNet. With WordNet the learned similarity measure only approximated the average consistency between human raters, whereas with Wikipedia a significantly greater consistency was achieved.
Last but not least, the choice of regression algorithm can also affect the learned similarity measure’s effectiveness. We tested four commonly used regression algorithms, and the one that uses support vector machines for regression with the RBF kernel turned out to be the most effective.
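For reference, the RBF kernel used by such a regressor compares two feature vectors through their squared Euclidean distance. The sketch below shows only the kernel function, not the regression algorithm itself; the value of `gamma` and the feature values are hypothetical.

```python
from math import exp


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-gamma * squared_distance)


# Hypothetical feature vectors, each describing one pair of documents.
pair_a = [0.8, 0.1, 0.4]
pair_b = [0.7, 0.2, 0.4]
pair_c = [0.1, 0.9, 0.0]

close = rbf_kernel(pair_a, pair_b)  # similar feature vectors, value near 1
far = rbf_kernel(pair_a, pair_c)    # dissimilar feature vectors, smaller value
```

The kernel equals 1 for identical vectors and decays towards 0 as they move apart, with `gamma` controlling how quickly.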
7.1.4 Summary
Based on the findings in each investigation, we can draw the following conclusions:
• Katoa’s concept-based representation models are consistently more effective for text clustering than the traditional bag-of-words model, with the most effective clustering algorithms tested (k-means and hierarchical clustering with group-average-link criterion).
• Prior knowledge about topic distributions in a given collection can help select the most appropriate concept system to use.
• Wikipedia is more effective than WordNet in general, and should be the default choice unless the given text collection is known to be well separable.
• Using concepts from both systems does not usually improve clustering, due to the additional redundancy that is introduced.
• The k-means algorithm and hierarchical agglomerative clustering using the group-average-link criterion are the most effective clustering algorithms.
• Reweighting concepts based on their binary local centrality consistently improves the clustering performance of the Wikipedia-based representation.
• Using the enriched similarity measure with weighted centrality consistently improves the clustering performance of the Wikipedia-based representation if hierarchical clustering (using group-average-link) is used.
• The learned similarity measure is more consistent with an average human than humans are with themselves, with Wikipedia as the concept system.
• The learned similarity measure is the most effective method for concept-based clustering, for both hierarchical agglomerative clustering (using group-average-link) and the adapted k-means method that represents a cluster by its members.
These conclusions strongly favour the concept-based representation models over the bag-of-words model. They also provide a useful guide for applying Katoa in real-world clustering tasks. Furthermore, Katoa’s representation models and similarity measures are not restricted to clustering, but are applicable to any task that involves representing texts and computing their thematic similarities.