Chapter 8 Postprocessing Discovering Evolutionary Theme Patterns
8.4.4 Parameter Tuning
In our methods for ETP discovery, there are a few parameters that are meant to provide the necessary flexibility for a user to control the pattern analysis results. We now discuss them in some detail.
In the mixture model for theme extraction, λB controls the strength of our background model, which is meant to absorb non-informative common words. The more such words we believe our data set has, the larger λB should be. Our experiments have shown that, in ordinary English
text, λB can be set to a value between 0.9 and 0.95. Within this range, the setting of λB does
not affect the extracted themes significantly, but it does affect the top words with the highest probabilities; a smaller λB tends to cause non-informative common words to show up in the top word list. Parameter k
represents the number of subtopics that one believes a subcollection contains prior to theme extraction. In our experiments, we determine the number of themes by trying multiple values of k and dropping the themes with a significantly low value of $\frac{1}{|D_i|}\sum_{d \in D_i} \pi_{d,j}$.
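The theme-pruning step can be illustrated with a minimal sketch. It assumes the document-theme weights πd,j for one subcollection Di are available as a list of rows (one row per document); the function names and the cutoff `min_avg_weight` are hypothetical, since the text does not state a numeric threshold for "significantly low".

```python
def average_theme_weights(pi):
    """Return, for each theme j, (1/|D_i|) * sum over documents d of pi[d][j].

    pi is a list of rows, one per document in the subcollection D_i;
    each row holds that document's mixing weights over the k themes.
    """
    num_docs = len(pi)
    k = len(pi[0])
    return [sum(row[j] for row in pi) / num_docs for j in range(k)]


def keep_strong_themes(pi, min_avg_weight):
    """Return indices of themes whose average weight reaches the cutoff.

    min_avg_weight is an illustrative, user-chosen cutoff (an assumption,
    not a value taken from the text).
    """
    avg = average_theme_weights(pi)
    return [j for j, weight in enumerate(avg) if weight >= min_avg_weight]


if __name__ == "__main__":
    # Two documents, three candidate themes; the third is barely used.
    pi = [[0.60, 0.38, 0.02],
          [0.50, 0.49, 0.01]]
    print(keep_strong_themes(pi, min_avg_weight=0.05))
```

In practice one would run this for each candidate k and keep the themes that survive the cutoff.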
Another parameter is the evolution distance threshold ξ. This parameter has a “zooming” effect for the discovered theme evolution graph. A tighter (smaller) ξ would only allow us to see the strongest evolutionary transitions, whereas a looser (larger) ξ would allow us to examine some weaker evolutionary transitions as well.
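The zooming effect of ξ can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure: themes are represented as dictionaries mapping words to probabilities, the distance from an earlier theme θ1 to a later theme θ2 is taken to be the KL divergence D(θ2 || θ1), and a small constant `eps` stands in for proper smoothing of unseen words. The function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D(p || q) for word distributions given as dicts word -> probability.

    eps replaces zero probabilities in q; this is a crude stand-in for
    smoothing and is an assumption of this sketch.
    """
    total = 0.0
    for word, pw in p.items():
        if pw > 0.0:
            total += pw * math.log(pw / max(q.get(word, 0.0), eps))
    return total


def evolution_edges(themes_by_period, xi):
    """Return edges ((i, a), (j, b), distance) with period i before period j.

    themes_by_period is a time-ordered list; each entry is the list of
    themes extracted for that period. An edge is kept only when the
    distance is below the threshold xi.
    """
    edges = []
    for i, earlier in enumerate(themes_by_period):
        for j in range(i + 1, len(themes_by_period)):
            for a, theta1 in enumerate(earlier):
                for b, theta2 in enumerate(themes_by_period[j]):
                    distance = kl_divergence(theta2, theta1)
                    if distance < xi:
                        edges.append(((i, a), (j, b), distance))
    return edges


if __name__ == "__main__":
    periods = [
        [{"aid": 0.5, "relief": 0.5}, {"wave": 0.9, "quake": 0.1}],
        [{"aid": 0.6, "relief": 0.4}],
    ]
    # A tight threshold keeps only the strongest transition;
    # a loose one admits weaker transitions as well.
    print(len(evolution_edges(periods, xi=0.1)))
    print(len(evolution_edges(periods, xi=50.0)))
```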
Yet another parameter is the size of the sliding window W, which is used in analyzing theme life cycles. W controls the number of supporting documents used when computing the strength of theme θ at time t. When W is set smaller, the strength curve of theme θ over time is more precise and sensitive. However, the sharp variations make it difficult to see the overall trend of themes. When W is set larger, the strength curve is smoother. However, some local variation patterns, which may sometimes be useful, are also smoothed out. Given the “report delay” problem in the news domain, a reasonable value for W appears to be 7-15 days (3-7 days on each side of time t).
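The trade-off controlled by W can be seen in a small sketch. It assumes that, for one theme, a per-day count of the text attributed to that theme is already available (for example, from the decoded HMM segments); the data layout and function names are hypothetical, and W is assumed to be an odd number of days so that the window is centered on t.

```python
def windowed_strength(daily_counts, t, window):
    """Sum the theme's daily counts over a window of `window` days centered at t.

    daily_counts maps an integer day index to the amount of text attributed
    to the theme on that day; missing days count as zero. `window` is
    assumed odd, e.g. window=7 covers 3 days on each side of t.
    """
    half = window // 2
    return sum(daily_counts.get(day, 0) for day in range(t - half, t + half + 1))


def strength_curve(daily_counts, days, window):
    """Return the windowed strength at every time point in `days`."""
    return [windowed_strength(daily_counts, t, window) for t in days]


if __name__ == "__main__":
    counts = {1: 10, 2: 0, 3: 50, 4: 5}
    days = range(1, 5)
    print(strength_curve(counts, days, window=1))  # sharp, day-level curve
    print(strength_curve(counts, days, window=3))  # smoother curve
```

With `window=1` the spike on day 3 stands out but the curve is jagged; with `window=3` the curve is smoother and the spike is spread over neighboring days.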
8.5 Related Work
While temporal text mining has not been well studied, there are several lines of research related to our work. Temporal data mining on numerical data has been well studied [19, 70, 160]. However, these techniques cannot be used for discovering theme evolution patterns from a text stream. Novelty Detection and Event Tracking [13, 144, 99] aims to detect the emergence of a new topic and identify the boundaries of existing events. Morinaga and Yamanishi’s work in 2004 tracks the dynamics of topic trends in a real-time text stream [124]. They assume that the dynamics of each topic follow a Gaussian distribution and use a finite mixture model to learn the distribution online. The major differences between our work and theirs are that (1) they do not consider relations between topics, and (2) we aim to mine ETP in large static text collections, while they focus on detecting topic trends online. Trend Detection [74, 150] detects emerging trends of topics from text. However, those works either do not represent topics with themes or do not use unsupervised methods to discover themes and trends.
A work related to theme life cycle analysis is [137], where Perkio and others used a Multinomial PCA model to extract themes from a text collection. To represent the strength of theme j in a time period, they summed a hidden theme-document weight, which is similar to πd,j in Section 8.3.1, over all documents in that time period. The major difference between our work and theirs is that we model the theme transitions with an HMM, which explicitly segments the whole text stream into corresponding themes, and use the size of the segments of each theme to measure the theme strength.
Text clustering is another well-studied problem relevant to our work. Specifically, the aspect models studied in [61, 184, 10] are related to the mixture theme model we use to extract themes. However, these works do not consider temporal structures in text. Nallapati and others studied how to discover sub-clusters in a news event and structure them by their dependency, which could also generate a graph structure [126]. A major difference between our work and theirs is that they perform document-level clustering, while we perform theme-level word clustering. Another difference is that they do not consider the variations of subtopics in different time periods, while we analyze the life cycles of themes.
Since a theme evolution graph and theme life cycle can serve as a good summary of a collection, our work is also partially related to document summarization (e.g., [77, 2]). Allan and others presented a news summarization method based on ranking and selecting sentences obeying temporal order [2]. However, summarization intends to retain the explicit information in text in order to maintain fidelity, while we aim at extracting non-obvious implicit themes and their evolutionary patterns.
8.6 Summary
Text streams often contain latent temporal theme structures which reflect how different themes influence each other and evolve over time. Discovering such evolutionary theme patterns can not only reveal the hidden topic structures, but also facilitate navigation and digestion of information based on meaningful thematic threads. In this chapter, we propose general probabilistic approaches to discover evolutionary theme patterns from text streams in a completely unsupervised way. To discover the evolutionary theme graph, our method first generates word clusters (i.e., themes) for each time period and then uses the Kullback-Leibler divergence measure to discover coherent themes over time. Such an evolution graph can reveal how themes change over time and how a theme in one time period has influenced other themes in later periods. We also propose a method based on hidden Markov models for analyzing the life cycle of each theme. This method first discovers the globally interesting themes and then computes the strength of a theme in each time period. This allows us not only to see the trends of strength variations of themes, but also to compare the relative strengths of different themes over time.
We evaluated our methods using two different data sets. One is a stream of 50 days’ news articles about the tsunami disaster that happened recently in Asia, and the other is the abstracts of the KDD conference proceedings from 1999 to 2004. In both cases, the proposed methods can generate meaningful temporal theme structures and allow us to summarize and analyze the text data from a temporal perspective. Our methods are generally applicable to any text stream data and thus have many potential applications in temporal text mining.
This chapter serves as an illustration of how to postprocess the basic contextual patterns extracted in the context modeling phase of contextual text mining in order to extract refined contextual patterns.
Evolutionary theme patterns, including theme evolution transitions, theme evolution threads, evolutionary theme graphs, and theme life cycles, are good examples of refined contextual patterns. Using contextual topic models such as PLSA, what we get are basic contextual patterns, i.e., conditional distributions like p(w|θ) and p(θ|d). In this chapter, we presented how to further analyze these basic patterns and extract refined patterns such as the evolutionary theme graph and the theme life cycle. The former is constructed by comparing the similarity between context models using KL-divergence. The latter is constructed by figuring out the active domain of each context with the help of a hidden Markov model. These two ways of pattern postprocessing apply not only to this particular application, but also to other tasks that involve different contexts and contextual language models.