
In document Contextual text mining (Page 77-81)

Chapter 5 Contextual Topic Analysis with Explicit Context

5.4 Related Work

The aim of this chapter is to demonstrate the effectiveness of the CPLSA model and its special versions on a variety of real-world text mining tasks.

The work most relevant to CPLSA is the Probabilistic Latent Semantic Analysis (PLSA) model proposed by Hofmann [59, 61]. Our CPLSA model is a natural extension of PLSA that incorporates context other than topics. To avoid overfitting in PLSA, Blei and co-authors proposed a generative aspect model called Latent Dirichlet Allocation (LDA), which can also extract a set of topics from a document collection [10]. The same contextual extension can be expected to apply to LDA.
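For reference, the standard PLSA formulation on which this extension builds can be written as a mixture over latent topics. The notation below is assumed here for illustration and may differ from the notation of Chapter 4: d is a document in the collection C, w is a word in the vocabulary V, z_1, ..., z_k are the k topics, and c(w, d) is the count of w in d. The CPLSA likelihood itself is not reproduced here.

```latex
p(w \mid d) = \sum_{j=1}^{k} p(z_j \mid d)\, p(w \mid z_j),
\qquad
\log p(\mathcal{C}) = \sum_{d \in \mathcal{C}} \sum_{w \in V} c(w, d)\,
  \log \sum_{j=1}^{k} p(z_j \mid d)\, p(w \mid z_j)
```

Roughly speaking, a contextual extension in the spirit of CPLSA lets these distributions depend on context variables such as time, location, or author; the exact form is the one given in Chapter 4.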

Some recent work on topic modeling has considered specific types of context. For example, temporal context is considered in [51, 137, 20, 113], multi-collection context is analyzed in [184], and author-topic analysis is proposed in [162]. Li et al. proposed a probabilistic model to detect retrospective news events by explaining the generation of the “four Ws”³ from each news article [91]. Our work generalizes these studies of specific contexts and provides a general probabilistic model that can incorporate various kinds of context together with topics.

Besides the general CPLSA model, the particular text mining applications we select are related to the following work.

Spatiotemporal Text Mining

Temporal context is also addressed in Kleinberg’s work on discovering bursty and hierarchical structures in streams [72] and in work on topic/event/trend detection and tracking (e.g., [1, 13, 99, 74, 126]). However, most of this work assumes that each document belongs to only one topic, and it cannot be easily generalized to analyze other contexts.

To the best of our knowledge, the problem of spatiotemporal text mining has not been well studied in existing work.

Most existing text mining work (e.g., [124, 113]) does not consider the temporal and location context of text. Li and others proposed a probabilistic model to detect retrospective news events by explaining the generation of the “four Ws”³ from each news article [91]. However, their work treats time and location as independent variables, and aims at discovering the recurring peaks of events rather than extracting the spatiotemporal patterns of topics.

[51] presents a generative model similar to the one in [10] to extract scientific topics from PNAS abstracts, followed by a post hoc analysis of topic popularity to detect hot and cold topics. Our work differs from theirs in that we model the temporal dynamics of topics simultaneously with topic extraction in the statistical model. Another work related to topic life cycle analysis is [137], where a Multinomial PCA model is used to extract topics from text and analyze temporal trends. However, none of this work models the spatial information of a text collection.

Spatiotemporal data mining on numerical data and moving objects has been well studied [41, 101]. [128] presents a spatiotemporal clustering method to detect emerging space-time clusters. However, these techniques aim at analyzing explicit data objects and cannot be used to extract and analyze latent topic patterns from a text collection.

³ who, when, where, and what (keywords)

Weblog Analysis

Another line of research related to our work is weblog analysis and mining. Existing work has explored either structural analysis of communities [167, 76] or temporal analysis of blog contents [50, 53]. Our work differs from the existing work in two aspects: (1) we model the multiple topics within each blog article; (2) we correlate the contents, location, and time of articles in a unified probabilistic model. Neither has been done in previous work on weblog analysis.

Kumar and others showed that the structure and interest clusters on blogspace are highly correlated with the locality property of weblogs [77]. Although this work considered the temporal and spatial distribution of weblogs, neither it nor any other previous work has addressed content analysis with spatiotemporal information. We consider this work important evidence that spatial analysis of weblog content is needed.

Some existing work further explored the content and structure evolution of weblogs for higher-level tasks. For example, Gruhl and others’ work in 2004 modeled information diffusion through blogspace by categorizing temporal topic patterns into spikes and chatter [53]. Their follow-up work in 2005 explored the spike patterns in the discussion of books to predict spikes in their sales rank [52].

The general spatiotemporal topic analysis methods proposed in our work can provide fundamental utilities to facilitate such higher-level predictions.

Author-Topic Analysis

Author-topic analysis was first introduced in [162], which extends LDA to model the association between authors and topics. [159] extends this work to model the senders and receivers in a collection of emails. Unlike CPLSA, this work cannot be generalized to handle contexts other than the author. In the Temporal-Author-Topic analysis we presented, a new time context is incorporated together with the contexts of author and topic.

Event-Impact Analysis

Topic detection and tracking [13, 144, 99] aims at detecting emerging new topics and identifying the boundaries of existing events. However, most of this work focuses on the detection of “events” rather than on summarizing the impact of the events.

5.5 Summary

In this chapter, we demonstrated the effectiveness of the general contextual topic model (i.e., CPLSA) in real-world text mining applications. Three applications are presented: spatiotemporal topic analysis in weblogs, temporal-author-topic analysis, and event-impact analysis in scientific literature.

Although these tasks are quite different, they share the same characteristic: all of them are instantiations of the general contextual text mining problem and, more specifically, of the contextual topic analysis problem. All the tasks involve topics and additional explicit contexts.

With empirical experiments on real-world data, we see that the CPLSA model and its two special versions, Fixed-View CPLSA and Fixed-Coverage CPLSA, are quite effective in these real-world tasks. The findings demonstrate the effectiveness of both the functional framework (contextual topic analysis) and the contextual topic model (CPLSA) introduced in Chapter 4.

Besides demonstrating the effectiveness of the CPLSA model, the solutions to the three applications also make significant contributions to text mining.

Weblogs usually have a mixture of subtopics and exhibit spatiotemporal content patterns. Discovering topics and modeling their spatiotemporal patterns are beneficial not only for weblog analysis, but also for many other applications and domains. The results show that our method can effectively discover major interesting topics from text and model their spatiotemporal distributions and evolutions. The mining results can be used to further support higher level analysis tasks such as user behavior prediction, information diffusion and blogspace evolution analysis.

The proposed spatiotemporal topic model is completely unsupervised and can be applied to any text collection with time stamps and location labels, such as news articles and customer reviews. The method thus has many potential applications, such as (1) Search result summarization: Provide a summary for blog search results, which consists of topics, snapshots of spatial distributions of topics, and temporal evolution patterns of topics. (2) Public opinion monitoring: Extract the major public concerns for a given event, compare the spatial distributions of these concerns, and monitor their changes over time. (3) Web analysis: Extract major topics and model the macro-level information spreading and evolution patterns on the blogspace. (4) Business intelligence: Facilitate the discovery of customer opinions/concerns and the analysis of their spatial distributions and temporal evolutions.

Besides the spatiotemporal topic analysis in weblogs, the temporal-author-topic analysis and the event-impact analysis also provide a novel and effective way to digest the topics and contextual patterns of topics in scientific literature.
