Effectiveness of Topic Labeling - Variations of Topic Labeling

Chapter 9 Postprocessing Automatic Labeling of Topic Models

9.4 Variations of Topic Labeling

9.5.2 Effectiveness of Topic Labeling

We first show some sample results of our topic labeling method in Table 9.3; for comparison, we also show the human-generated labels for the same topics. It is clear that the automatically generated labels can all capture the meaning of the topic to some extent; indeed, most of them are as good as human generated labels (e.g., “clustering algorithm” and “data streams”), though some are not (e.g., “air force”). Some topics are difficult to interpret even by human (e.g., “death sentence”).

To quantitatively evaluate the effectiveness of the automatic labeling methods, we ask three human assessors to compare the results generated by different methods. Specifically, for each of the most salient topics generated with PLSA (12 topics from SIGMOD and 18 topics from AP), we present to the annotators the labels generated by different methods in a random order, together with the word distribution and the most relevant documents to this topic to help a human assessor interpret the topic. A baseline method is included in comparison, which simply uses the top k terms in the word distribution as the topic labels. The other methods included in the comparison are shown in Table 9.4.

System Cand. Labels Relevance Score

NGram-1 Ngrams First-order

NGram-0-U Ngrams 0-order, uniform normaliz.

NGram-0-B Ngrams 0-order, norm. with p (w|B) .

Chunk-1 NP Chunks First-order

Table 9.4: Systems Compared in Human Evaluation

Default parameter setting: λ = 0.2, µ = 0.7 for AP; λ = 0.2, µ = 1 for SIGMOD

it is ranked at the k’th place. If the labels from several systems are difficult to be distinguished/ranked, we first give them an arbitrary ranking, and then equal their scores. For example, if there are three systems, one is significantly better, and the other two are hard to tell, we will assign score 2 to the first system, and 0.5 to each of the rest two systems. We then average the scores of each system over all topics.

Baseline v.s. Zero-order v.s. First-order

Dataset #Label Baseline Ngram-0-B Ngram-1

SIGMOD 1 0.76 0.75 1.49

SIGMOD 5 0.36 1.15 1.51

AP 1 0.97 0.99 1.02

AP 5 0.85 0.66 1.48

Table 9.5: Effectiveness of topic labeling

A higher score means that the system tends to be ranked higher.

Basic results: In Table 9.5, we compare the labels generated using the baseline method (i.e., picking high probability words), 0-order relevance (ngrams, normalized with background probability p(w|B)), and 1st- order relevance. For each group of systems, we compare both the top 1 label, and the top 5 labels they generate. From this table, we can make several observations: (1) In all cases, the labels extracted with 1st-order relevance are most preferred by the assessors, indicating that the first-order relevance method is overall the best presumably due to the fact that it can capture the overall topic distribution through context. (2) The preference of first-order relevance labels over the baseline labels is more significant on SIGMOD than on AP. This is likely because phrases are more frequently used and more discriminative in scientific literature than in the news domain. For example, informative phrases such as “mining association rules” are quite common in database literature, whereas common phrases in news articles tend to be general terms such as “united states” and “last year.” This suggests that phrases are generally good labels for scientific topics, but for other genres of text, it may be interesting to explore other candidate labels, such as short sentences. (3) The preference of the first-order relevance labels over the baseline labels is stronger when five labels are considered than when one label is considered. This may be because the preference is amplified when more labels are considered. (4) The labels of 0-order relevance seem to be comparable with those of the baseline except in one case (i.e., 5 labels on SIGMOD) when the 0-order relevance labels are strongly preferred. This again suggests that phrases are not so useful for labeling topics in the news domain, but they are more useful

for the literature domain. Overall these results show that the first-order relevance method for automatic labeling of topic models is the best among all the methods.

Noun Phrases v.s. Ngrams

Dataset #Label Chunk-1 Ngram-1

SIGMOD 1 0.40 0.60

SIGMOD 5 0.16 0.83

AP 1 0.41 0.59

AP 5 0.55 0.44

Table 9.6: Ngrams vs. noun phrases as labels

Ngrams vs. noun phrases as candidate labels: To see which of the two methods for generating candidate labels (i.e., ngrams and noun phrases) is better, we compare them on both data sets in Table 9.6. Interestingly, for the SIGMOD dataset, using statistically significant ngrams as labels is much better than using noun phrases generated by the NLP Chunker, while on the AP dataset the performance of the two types of candidate labels is closer, and in some cases the noun phrases perform even better than ngrams. This may be because the models used by the NLP Chunker are trained on general domains, and not tuned for parsing scientific literature. In general, using significant ngrams appears to be more robust; moreover, this method can also be applied to any genre of texts.

Normalization in 0-order relevance: We now look into the influence of the normalization strategy on the performance of 0-order relevance. In Table 9.7, we compare 1st-order relevance with 0-order relevance when using two different normalization strategies – uniform normalization (normalization with uniform distribution) and background normalization (normalization with a background distribution). We see that background normalization, although intuitively appealing, does not really help here. The using of background normalization fails to decrease the difference between the 0-order relevance to the better method, the 1-order relevance. Indeed, it even makes the 0-order labels worse when applied on AP. The reason is because the topic models extracted with PLSA are already discriminative due to the use of a background component model (see Section 9.5.1), thus further normalization with background is not useful, and uniform normalization is actually more robust in this case.

Normalization Dataset NGram-0 NGram-1 Diff.

Using SIGMOD 0.36 0.64 0.28

Uniform AP 0.43 0.57 0.14

Using SIGMOD 0.37 0.63 0.26

Background AP 0.26 0.74 0.48

Table 9.7: Uniform vs. background normailzation for 0-order relevance (5 labels)

models extracted with LDA, where neither is a background model included, nor are the stopwords pruned. The results are selectively presented in Table 9.8.

Model Labels Model Labels

the, of, 1st-order: the, of 1st-order:

a, and, foreign key, data integration, a, to iceberg cube, data cube,

in, data, schema matching, query rewrite, and, is data cubes, two types,

... web sites, deep web ... fact table

constraints 0-order(u): data 0-order(u):

database data integration, data sources, cube data cube, user query,

integration real data, their data, query large data, data cubes,

sources data model, database design system data structure, over data

content 0-order(b): information 0-order(b):

design integrity constraints, dynamic olap iceberg cube, m se,

information content, sql statements, foreign multimedia data cubes, data cube,

schema key, schema matching algorithm line analytical

Model Labels

the, of, 1st-order:

a, and clustering algorithm, clustering

to, data structure, data bubbles, distance

... function, very large

clustering 0-order(u):

time large data, data quality,

clusters series data, data applications,

databases high data, clustering algorithm

large 0-order(b):

performance transitive closure, subsequence

quality matching, data bubbles, clustering

algorithm algorithm, pattern matching

Table 9.8: Labeling LDA topics: 1st-order relevance is most robust

In Table 9.8, we show three sample topics extracted with LDA and their corresponding labels generated using 1st-order relevance and 0-order relevance with different normalization methods. Without pruning stopwords or using a background model, we end up having many non-informative words on the top of each topic model; to better illustrate the meaning of each topic, at the bottom part of the topic word distribution, we also present some more discriminative terms. We see that the 1st-order relevance still generates good discriminative labels even though the high probability words of the original topic model are all non-informative words. This is because the 1st-order relevance captures the entire context of the topic model. In contrast, with uniform normalization, the 0-order relevance would be biased to assign high scores to non-informative phrases such as “real data” and “their data” when the top probability terms of the topic model are non-informative (e.g., “the”, “their”) or too general (e.g., “data”, “large”). With normalization by background model p(w), we can penalize a phrase with non-informative or general words, thus alleviate this problem. However, in this way, the top ranked labels tend to be too specific to cover the general meaning of the topic (e.g., “integrity constraints”, “transitive closure”, etc). Thus overall we see that modeling the semantic relevance with first-order relevance is most robust because the semantics is inferred based on the context of the entire distribution.

Upper bound analysis: How much room is there to further improve the topic labeling method? We can answer this question by looking into how much worse the automatically generated labels are than those generated manually. Thus we ask a human annotator to generate topic labels manually, and ask two different

assessors to compare the system-generated labels with the human-generated labels. In Table 9.9, we see that although the system-generated labels are good, the assessors still consider human-generated labels to be better. This implies that there is still much room to improve the quality of the automatically generated topic labels. Interestingly, the difference between system generated and human generated labels is less significant on the SIGMOD data than on the AP data, suggesting that literature topics may be easier to label than news topics.

System v.s. Human

Dataset #Label Ngram-1 Human

SIGMOD 1 0.35 0.65

SIGMOD 5 0.25 0.75

AP 1 0.24 0.76

AP 5 0.21 0.79

Table 9.9: Comparison with human generated labels

9.5.3 Labeling Document Clusters

In Section 9.4.1, we discussed that the topic labeling method could actually be applied to any text information management tasks where a multinomial word distribution is involved. Here we look into one such application – labeling document clusters. In this experiment, we cluster the SIGMOD abstracts with the K-Medoids algorithm [69], and try to utilize the topic labeling method to label the clusters. Specifically, we estimate a multinomial word distribution for each cluster based on its member documents using the maximum likelihood estimator. The proposed topic labeling technique can then be applied on the estimated term distributions, and the top ranked phrases are used to label the original cluster.

Cluster Labels |d| Cluster Medoids (Title)

multivalued A complete axiomatization for

dependencies, 167 functional and multivalued dep-

functional dependencies endencies in database relations

two locking, 86 Performance of B-tree

concurrency control concurrency control algorithms

nearest neighbor, 69 Optimal multi-step k-nearest

similarity search neighbor search

approximate answering, 184 Approximate XML query

approximate query answers

Table 9.10: Labeling document clusters: K-Medoids

The generated cluster labels, along with the number of documents and the title of the medoid document of each cluster are shown in Table 9.10. By comparing the cluster labels with the medoid documents, we see that the topic labeling technique is also effective to label document clusters. This experiment shows that

in which such a multinomial term distribution is involved.

In document Contextual text mining (Page 159-164)