Time series analysis for event detection in text data

(1)

c

(2)

TIME SERIES ANALYSIS FOR EVENT DETECTION IN TEXT DATA

BY RONGDA ZHU

THESIS

Submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science

in the Graduate College of the

University of Illinois at Urbana-Champaign, 2016

Urbana, Illinois Adviser:

(3)

ABSTRACT

With the rise of social media and online newswire, text streams are attracting more and more research interest. These streams are presented in the form of time series by nature, therefore, how to efficiently analyze these time series and extract useful information from them are of great importance. Modern time series analysis (TSA) has been applied widely in areas such as finance, physics and signal processing, however, there is not so much working explor-ing time series analysis in the field of text minexplor-ing. While traditional time series analysis tasks are relatively well defined such as modeling and fore-casting, we now need to adapt the tasks to meet the requirement of different text mining problems.

Event detection is the general task of finding any emerging events, such as significant changes in stock price, anomalies in climate data, and outbreaks of a certain disease, depending on the data we are interested in. While in text mining, event detection, which is identifying the significant new stories, is attracting more research attention given the increasing popularity of social media and digital journalism. In time series analysis, there is also a common task, change point detection, which focuses on a similar challenge. In this thesis work, we first examine the features presented by the time series of counts of terms in corpus. We then explore applying existing change point detection methods to event detection, and also propose a novel TSA based method for event detection.

(4)

(5)

ACKNOWLEDGMENTS

First and foremost, I would like to express my sincerest gratitude and respect to my advisor, Professor Chengxiang Zhai, who is a great researcher, and al-ways encouraging and helping me along my way. I will alal-ways remember and appreciate his help. I would like to thank University of Illinois for its support and rich research resources. I would also like to thank all my friends who contributed to my efforts. Finally, I want to address my deepest gratitude to my families, whose love and care are pure and unconditional. This thesis work is dedicated to all of them.

(6)

CHAPTER 1

INTRODUCTION

1.1

Background

The increasing popularity of social media leads to a vast amount of text stream data, which can be viewed as time series by nature. The emergence of this rich data makes it necessary for us to find efficient ways to mine knowledge from it. Although text stream mining [1] has been extensively studied in a lot of works [2, 3, 4, 5], and time series analysis (TSA) is also a well-established research area [6, 7], there is very limited work that really uncovers the relation between the two.

There are several interesting tasks in text stream mining. Text stream clustering [8, 9, 10] can be useful for news filtering, topic detection & tracking (TDT) and recommender systems. Text stream classification is also studied on different kinds of data [8, 11, 12, 13] with many applications such as spam filtering and sentiment analysis. Though work in time series clustering and classification can also be applied to these tasks, our work mainly focuses on

event detection in text streams.

Event detection, or first story detection (FSD) is the process to track and identify the first article that covers a specific event or topic. This problem can be formulated as marking each story as “first story” or “non-first story”. A common way is to find a way to score each document, where the score is the confidence of a story being a first story [14]. For example, when the Paris Attacks occurred in November 2015, we want to know what the first story on CNN reporting or what the first tweet about this event was. In a large text corpus, we would also like to get an idea on the events happened without reading each article. Or in a real time text stream, we want to classify each new story as first or not in an online fashion.

(8)

the batch setting. We want to start our observation and develop our method from the perspective of time series data, which has not been well studied before, and then compare our results with the baseline methods.

1.2

Organization

The remainder of this thesis will be organized as follows: Chapter 2 will give a overview of time series analysis including concepts and applications. Details of the event detection problem and our proposed approach will be presented in Chapter 3. We demonstrate the experimental results in Chapter 4 and conclusions will be drawn in Chapter 5.

(9)

CHAPTER 2

TIME SERIES ANALYSIS OVERVIEW

In this chapter, we briefly present time series analysis as an independent research problem. We will also discuss different time series data in text streams and modern social media. Specifically, we introduce change point detection in TSA to shed some light on the event detection problem.

2.1

Introduction

Time series analysis (TSA) is the system approach of analyzing time series data in order to extract meaningful statistics and other characteristics [7]. A time series is a sequence of data points that:

1. Consists of successive measurements made over a time interval. 2. The time interval is continuous.

3. The distance in this time interval between any two consecutive data point is the same.

4. Each time unit in the time interval has at most one data point.

The most important TSA tasks include regression (modeling) [15, 16, 17], forecasting [18, 19], classification and clustering [20, 21, 22], and change detection [23, 24, 25, 26].

As an example, the time series data given in Figure 2.1 is the closing price of Alphabet stock each trade day1. In this series, each data point represents a price at a specific date, and the next data point is the value for the next trading date. We can see that the general trend of this series is increasing at first and then decreasing. Modeling and predicting the stock price will be of

(10)

great importance. Change detection will also be useful if we want to identify some factors that significantly influenced the stock price.

0 20 40 60 80 100 120 580 600 620 640 660 680 700 720 740 760 780 day Price

Figure 2.1: Closing price of Alphabet Inc. from Sept 1, 2015 to Feb 19, 2016

Time series modeling is the process of analyzing the past observations of a time series, and finding the best model to describe the series [7]. Therefore, it is also associated with regression in many cases. One of the most frequent-ly used model is the Autoregressive Integrated Moving Average (ARIMA) model, popularized in [27]. The model incorporates the autoregressive mod-el and the moving average modmod-el, with the assumption that the value of time series data points should depend on its previous value, and that the random noises from each point are independent and identically distributed. The model is also extended to Multiplicative Seasonal ARIMA (SARIMA) Models [28, 29, 30], where a season factor is added to account for the season-al or periodicseason-al behaviors. For example, monthly data may present a strong seasonal character with lag s = 12, for some underlying annual factors [7]. Forecasting works based on ARIMA models have also gained success across

(11)

various communities [31, 32, 33, 34].

Time series clustering aims at grouping time series into homogeneous clus-ters based on their statistic features. The k-means [35] and k-medoids [36] algorithms are still valid if we define reasonable distance measures, and they can achieve pretty good results. Similarly, we also have fuzzy c-means [37] and fuzzyc-medoids methods [38], where we allow one instance belong to two or more clusters. Mixtures of Markov chains [39] and HMM [40, 41] have also been utilized in this clustering problem, and in an interesting work [42], the authors propose a mixture of ARMA models to do clustering. They use an expectation maximization (EM) algorithm to learn the mixture coefficients and the number of clusters can also be automatically determined.

Time series classification is the process of assigning time series to prede-fined classes. We have supervised learning [43, 44, 45] and semi-supervised learning approach [46]. Features are extracted from time series, and conven-tional classification methods such as SVM can be applied.

Frequency domain approaches are another group of methods attracting research attention. Periodicity can always be efficiently detected in frequency domain [5, 47], and Wavelet transform [48, 49] has been proved to be useful in different TSA tasks.

2.2

Time Series Data in Text

With the ubiquity of text stream data, we have a lot of data in the form of time series in text. For example, the number of hashtags of a certain term on Twitter has already been viewed as a sign of popularity. Analysts may use that data to estimate the value of Superbowl commercials2_{, or predict}

the result of a presidential election3_.

Since external time series data is more ready to get, there are also works that use text mining techniques to help analyze the external time series. Blog content is used in [50] to predict the influenza and disease. Query logs of search engine can also be used similarly for this problem [51]. While in [52], the authors discuss and explore the domains that search can really help predict, and for those with rich domain knowledge, the prediction may not

2_{http://marketingland.com/hashtags-super-bowl-163101}

(12)

be that powerful. They use query volume to predict the opening weekend box-office revenue and the rank of songs on the Billboard Hot 100 chart. In [53], the authors propose to correlate financial time series, such as stock price, with micro-blogging activities. They collect information from several companies and detect events that lead to change in financial market. They also take user interaction into consideration.

The number of hashtags is a feature special for Twitter, however, we still have other more general time series in other text data such as term counts. In fact, in this work we will mainly focus on the time series of term counts, while there is very little work utilizing term counts as time series data.

2.3

Change Point Detection

We will intuitively come up with the problem of change point detection in TSA for its similarity to the event detection problem by nature. Change detection tries to identify times of the changes in probability distribution of a certain time series. There are also batch and online settings in time series change point detection. In the batch setting, we can examine the whole series and try to decide when it changes. While in online setting, we have new data coming at a rate and we need to determine whether a new data point is a change point or not using the information from past data. We may not have enough memory to store all the past data either, so sometimes we need to apply a window. Other variation includes single or multiple change point, and univariate or multivariate series.

Early change detection methods include Bayesian or sample theory based hypothesis testing, model parameter estimation using Bayesian, maximum likelihood or minimum mean square error methods [23]. Most of these meth-ods target directly at the classification of change point of some significant statistics, such as mean or variance [54], by more traditional statistical meth-ods [25, 26, 55, 56]. The advantage of these methmeth-ods is straightforwardness and simplicity, however, they are largely limited by the statistical framework. Another survey on Bayesian methods is [57]. The author mainly considers estimating the position of the change point and applying Bayesian hypothe-sis testing methods under the assumption that there is at most one change in the time series. There are more works studying change point of different

(13)

statistics [58, 59, 60], but still limited to the scope of hypothesis testing and other statistical methods.

Another category of methods is segmentation-based. The general idea is trying to cut the time series into segments where the sequence within each segment is coherent and the segments are homogeneous. Change points will affect the homogeneity of adjacent segments. For example, in [24], the authors consider thek-segmentation problem where we want to cut the series into k segments. The users may define the number of segments k. They use the mean of each segment to model homogeneity. The cost function is just the sum of squared deviation from the mean. [61] extends the study using a mean vector for the multivariate case. In [62], an extensive review on different strategies including sliding window and top-down. They also propose another algorithm which uses a buffer and incorporates sliding window and bottom-up processes. There are also methods using non-linear functions. Polynomial basis functions are used in [63].

Time series modeling can also be helpful in change point detection problem. One assumption is that the series is generated by an underlying model, how-ever, the change of model parameters results change in time series. Partition regression is proposed in [64]. The study is extended to multiple regression in [65]. Since regression model can also be incorporated into segmentation, there is a modified approach in [66]. In [67], the detection of change point is done by based on the likelihood ratio test for changes in model parameters. The authors also study the problem of choosing appropriate threshold values. Other important approaches include non-parametric methods, as described in [25, 68]. These methods don’t require any priori information about the data, therefore, are more useful for many real applications. There are also methods based on the assumption that the data should be forecasted well if there is no change taking place [69]. Therefore, when some data point is deviating very much from the forecast, we will identify a change [70]. This method can also be extended to Bayesian [71, 72], non-Bayesian [73], and other statistical models [74]. Subspace methods [75, 76, 77] are also attracting a lot of attention, where a subspace is generated based on principal component analysis (PCA) or other subspace learning algorithms.

(14)

CHAPTER 3

EVENT DETECTION IN TEXT DATA

In this chapter, we first look at the time series data in text streams and give the formal problem setting. We also discuss the problem of event detection in text data. We then introduce fused lasso in time series, which our approach is based on. Finally, the details of our method is described.

3.1

Time series in Text Streams

3.1.1

Formal Problem Setting

We now describe the formal setting of our data and problem. Given a doc-ument sequence of size D, {d1, d2, . . . , dD}, we first mix all the documents from the same date together to form a concatenated document. Assume there are T time stamps t1, t2, . . . , tT in the total time span of the data set, we can denote the concatenated documents as {S1, S2, . . . , ST}. Given the vocabulary of corpus as V, we denote the terms as w1, w2, . . . , wV. We use the counts of each term at each time pointcount(ti, wj) = ci,j to formV time series of each term:

ts1 ={c1,1,c2,1, . . . , cT,1},

ts2 ={c1,2,c2,2, . . . , cT ,2},

.. .

tsV ={c1,V,c2,V, . . . , cT ,V}.

The output of an event detection model should be a set of events, where each event is denoted either by

(15)

dis-cusses this event, or

2. a set of terms that can describe the event.

Assume there areM-labeled events in the corpora, we want to come up with the top documents, which are labeled as first story, where the events first happened. There are typically two settings in the problem:

1. Batch Setting: We have all term counts data and we want to detect all the documents which contains first story of an event.

2. Online Setting: For a given time point ti, we will have all the data points from t1, t2, . . . , ti−1, and we need to decide whether or not a

document on ti is a first story.

Note that the number of documents detected by the algorithm is not nec-essarily equal toM. Intuitively, if we identify more documents, then we tend to make more false alarms; if we identify less documents, we tend to make more misses. As a result, we need to compromise between false alarm and missing rates.

3.1.2

Normalization

When we are looking at the data, the total term counts of different time stamps can vary. If on a specific time stamp there are much more terms than other stamps, more counts for a specific term will not be indicating that this term is more popular. Therefore, normalization is added to neutralize this undesired effect. We will let

Ci,j =ci,j/ V X k=1 ci,k =p(wj|ti), T Si ={C1,i, C2,i, . . . , CT ,i}.

In this way, the normalized data now represents the probability of a specific term on a specific time stamp. Later we will show that for the three datasets we use, since the total term counts of each day are relatively close, so the results will not change very much with or without normalization. From now

(16)

on, our results we refer to will be generated by experiments with normal-ization. We will also adopt other text preprocessing techniques such as stop word removal and stemming, which will described later.

3.2

Event Detection

The problem of event detection in text data has been studied extensively. Conventional event detection aims at finding anomaly or trends from data including surveillance, physical data, scientific exposure, newswire or social media, and financial data. While in text mining, we need to use the features from text data. Among the early efforts, the New Event Detection (NED) or First Story Detection (FSD) problem is defined and gets popular in the topic detection and tracking (TDT) project [78], where the authors studied several TDT problems, and new event detection is one of the tasks. A decision must be made if a story discusses a new event. Vector Space Model (VSM) is used in early works such as [79, 80], where the authors present the document as a feature vector, and define distance measurement such as lexical or cosine similarity. If a new document doesn’t have a high similarity with any previous document, then it is identified as a new document. However, this approach relies on the search for the nearest neighbor, which is time consuming.

An efficient algorithm of FSD on Twitter is proposed in [81]. The authors focus on locality-sensitive hashing (LSH) and limit the number of documents in a bucket to a constant. In this way, they manage to achieve a comparable performance with much less time consumption. The authors also consider variance reduction to make the results more stable. Document classifiers based on semantic and syntactic analysis are utilized in [82], where the user modify the similarity measure using the two classifiers.

With the rapid growth of social media, there are increasing efforts focusing on blogs [83, 84] and especially Twitter [81, 85, 86]. In [87], the authors apply conventional event detection with various heuristics. However, since Tweets are posted and commented in real time, and the content is not always of high news quality, including personal status update, mood and conversation. These will bring a lot of noise to the detection process. Therefore, in [81], the authors study threads of tweets, which are a group of tweets discussing the same topic. A new event is more likely to happen when certain news

(17)

threads are growing very fast.

3.3

Fused Lasso in Time Series

Fused Lasso is first proposed in [88]. While conventional lasso penalizes a least squares The basic idea is adding a smoothing regularization to the conventional lasso. The conventional lasso is given as the problem:

b β = arg minX i yi− X j xijβj 2 s.t. X j |βj| ≤s

The fused lasso is proposed as: b β = arg minX i yi− X j xijβj 2 s.t. p X j=1 |βj| ≤s1 and p X j=2 |βj−βj−1| ≤s2

We can also write it in the Lagrange form:

b β = arg minX i yi− X j xijβj 2 +λ1 p X j=1 |βj|+λ2 p X j=2 |βj−βj−1|

The fused lasso is applied to the scenario where sparsity lies in the difference of coefficients. This enhances the local consistency in coefficients.

If we view it from the perspective of time series, we will let βj be a time series, then fused lasso can smooth the change of this time series along time. The fused lasso is applied to compressed sensing of discrete time series in [89], where the authors model time-varying signals as vector time series. Change point detection algorithms based on group fused lasso are proposed in [90, 91]. While the authors don’t particularly focus on time series data, they manage to detect multiple change points. However, what they do is only conventional change detection, which is different from document-level event detection.

(18)

3.4

Our Approach

3.4.1

Framework

We now present our proposed method. Instead of using a square loss like conventional fused lasso, we use a negative log likelihood of generating the documents as our loss function to minimize [92]. The negative log likelihood together with the regularization term forms our objective function.

Let Ci,j be the normalized count of wj on ti, and we will get the log-likelihood of generating the concatenated documentSiincluding all the terms on this time stamp will be

logp(Si) = X

j

Ci,jlogp(wj, ti)

We must model the probability of generating a term wj on ti. We will use a mixture model. Assume there are K topics z1, z2, . . . , zK, where each topic is featured by a probability distribution over all the terms. With the assumption that the timestamp ti is drawn uniformly, we have

We let θi,1, θi,k, . . . , θi,K whereθi,k =p(zk|ti) as the topic mixture coefficients onti, andβk,j =p(wj|zk) is the probability of generating term wj by a topic zk. We now can write the log likelihood of the corpora and our objective function in the form of fused lasso as follows:

(19)

Note that here we use θi to denote the vector (θi,1, . . . , θi,k). We don’t have same `-1 regularizer with [88], since we don’t actually need to emphasize on the sparsity of θ itself. Instead, we add regularizer on the difference between adjacent θto do smoothing. Since events have time span, the topic mixtures over time stamps should keep some extent of consistency. When the topic mixture present sharp changes, some events are very likely to happen.

Intuitively, the value of λ controls the penalization on difference of θ. When λ gets smaller, we will not put penalty on the difference of θ. There-fore, the problem will be maximization of the log likelihood. For example, if we use any topic model such as LDA to generateβ, then the solution will just be the topic mixture from the topic model, with maximal number of change points. Whenλ gets larger, we will put heavy penalty on the difference of θ, making less change points in the optimal solution. When λ goes to infinity, we will just have the optimal solution θ∗_i,k = 1/K.

0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 20 40 60 80 100 120 140 160 180 λ

Number of change points detected

(20)

We also notice that there should be the constraints V X j=1 βk,j = 1, k= 1,2, . . . K K X k=1 θi,k = 1, i= 1,2, . . . , T

Instead of writing the Lagrange form with this constraint optimization, we will use the following transform

θ0_i,k = e θi,k

PK

m=1eθi,m

Finally, our problem is formalized as the following optimization problem (βb,θ) = arg minb β,θ − T X i=1 V X j=1 Ci,jlog K X k=1 eθi,k PK m=1eθi,m βk,j +λ T−1 X i=1 K X k=1 |θi,k−θi+1,k| (3.1) This may require a two-step optimization process on β and θ respectively. We fix β and optimize over θ, and then optimize over β with θ fixed. We will just use gradient descent for each step.

We use L(β,θ) to denote the our loss function and the gradient over θ is given as: ∂L ∂θi,k =− V X j=1 Ci,j eθi,k_β k,j PK m=1eθi,mβm,j − e θi,k PK m=1eθi,m (3.2)

After updating θ, we get the gradient over β as:

∂L ∂βk,j =− T X i=1 Ci,j eθi,k PK m=1eθi,mβm,j (3.3)

(21)

We will update θ and β θ(n+1) =θ(n)−γ1 ∂L ∂θ β(n+1) =β(n)−γ2 ∂L ∂β

The stopping criterion is the difference of objective function between itera-tions.

3.4.2

Batch Setting

In batch setting, we want to detect the first stories in the corpora with all data available. In problem 3.1, the optimization is overβandθ. We consider a two-step process over β and θ. However, when the number of documents in the corpora is large, V will also be large and Ci,j, i = {1, . . . , T}, j = {1, . . . , V} as a matrix will not be sparse. This will make our algorithm time consuming updating too many entries in β. In real experiments, we also use

AdaGrad [93] to help with the choice of step size γ1.

Since we are in a batch setting and the full dataset is available for us, we can first get fixedβfrom a topic model, such as LDA, and omit the updating process of β in the iterations. Since the size of θ is much smaller thanβ, in this way, we are able to reduce time consumption significantly.

After getting the optimal solutionθ∗, we will just use the simple criterion, whether θi,k > 3θi−1,k, to detect a change point in the k-th topic. After identifying the changing time stamp, we match the documents with the k -th topic and select -the top documents represented by -the topic. Since -the popularity of an event can grow very fast and the discussion on it may keep going up for a long time, we only identify the first change point in a row as valid. While this sometimes makes it difficult to identify two events of the same topic (e.g. politics or sports) whose time spans have overlap, it helps us to reduce the false alarm rate significantly. We can also choose the number of documents considering tradeoff between false alarm and missing rate.

(22)

3.4.3

Event Detection Score

In TDT or FSD tasks, we typically need to assign a score for the detected documents. The higher the score is, the more confidence we put in it as a first story. For example in [79], this score of a document is called novelty score, which is the document’s distance from the closest previous document. The higher this score is, the more different the document is from any previous document. Therefore, this distance is a valid score.

We consider the regularization parameterλin 3.1. As discussed in previous sections, a larger λ will result in less change points. Therefore, significant changes can still be detected when λ gets larger, while insignificant ones cannot. Then the value of λis indicating the significance of the event. Also, the detected document should be relevant to the topic where change point is found. We will use the topic mixture of a document. We will use the following process to generate the score:

1. Start from λ= 0 and record all the change points. 2. Increase the value of λ by a certain step size.

3. Find the last λvalueλ1 where each change point can last be detected.

4. For a document d on the detected time stamp t for topic k, we let πk d is the topic mixture coefficient if d onk, and we will let

score(d) =λ1∗πdk∗e −ηt_.

We can see thatηis a time decaying factor where a later published document will get heavier penalty to prefer the early documents. We can set a threshold on the score to control the number of events detected, thus compromising between false alarm and missing rate.

(23)

CHAPTER 4

EVALUATION

4.1

Dataset

We first describe the dataset we use in our experiments. We have three time-stamped data sets, including CNN TV transcripts and standard TDT5 dataset. To generate time series, we use day as our time unit and first put all the stories of the same day together, as described previously. Then we do stemming and generate term counts for each day with normalization. We will introduce the datasets respectively.

4.1.1

CNN TV Transcripts

We collect TV Transcripts from CNN throughout the year 2008 to the year 2012. TV transcripts are the on-screen text in a television program. S-ince those scripts are not always clean or well punctuated, we use Stan-ford CoreNLP1 tools to detect the sentence boundaries. We also build an SVM-based classifier to segment the transcripts into stories, which is the well-studied story segmentation problem [94, 95, 96, 97]. Our training da-ta is labeled manually and the features we use include semantic similarity, sentence length, and key phrases.

The transcripts include several CNN programs including new broadcast and interviews. We mark 10 events each year from 2009 to 2012, and man-ually label the first stories of each event, which we will use as ground truth. The reason why we may mark more than one document for an event as first story is that, in this newswire dataset, timeliness is relatively less important, and some documents are reporting the same events with their published time very close. A list of all marked events is given in Table 4.1.

(24)

News Event Date Labeled

Death of Whitney Houston Feb 11, 2012 Shooting of Trayvon Martin Feb 26, 2012 Jerry Sandusky’s Trial Jun 11, 2012 Aurora Shooting July 20, 2012 London Olympics July 27, 2012 Hurricane Isaac Aug 21, 2012 Benghazi Attack Sept 11, 2012 Hurricane Sandy Oct 22, 2012 Presidential Election Nov 6, 2012 Sandy Hook Shooting Dec 14, 2012 Table 4.1: Labeled Events in 2012 CNN Transcripts

We can see that for some of the new events, they are emergencies bursting at some point. For these events, we will have no story report them before this point. However, for the other events, the dates of them are scheduled and there will be news reports long before the date on the same topic, such as London Olympics or Presidential Election. Then these reports will bring potential noise for detection of the events.

4.1.2

TDT5 Dataset

TDT5 dataset is one of the most popular datasets used for event detection (first story detection). In our experiment, we use the stories from TDT5 dataset, containing 407503 news stories in text, spanning six months from April to September 2003. The stories are in different languages including English, Mandarin Chinese and Arabic, and we are using the English part here. There are 278109 English documents from several different news media including New York Times, CNN, LA Times and Xinhua. Here is a summary of the content: We also have the English translation of the Chinese and Arabic news. There are 250 annotated topics in the stories, and each topic is assigned with a “first story”, or “seed story”, which features the first article that covered that topic. We will use this as the ground truth.

Note that first story detection is just one of the tasks in TDT 2004. In their notation, a topic is basically featured by an event, i.e. one topic can only have one event or first story to detect, while there could possibly be other directly related events. They have emphasized the directness of the

(25)

relation to avoid mixing too many events together to a single topic.

4.2

Evaluation Metric

As proposed in [98], we will introduce the penalty function

Cdet =CmissP(miss)P(target) +Cf aP(f a)(1−P(target)) (4.1) where Cmiss and Cf a are the costs of missed detection and false alarm re-spectively. We also have

Pmiss = #Missed Detection/#Target Pf a = #False Alarm/#Non-Target

P(target) is the prior probability of finding a first story. According to [99], we should put more penalty on a miss than on a false alarm. We choose

Cmiss = 1 andCf a = 0.1. Assume we have 250 first stories in the corpora of 250000 documents, then P(target) = 0.001.

In real applications, we also need to normalize the cost to a more mean-ingful range, in order to better interpret the results [99]. We will divide Cdet by the minimum expected cost by classifying all documents as “first story” or “non-first story”. The first one will give Pmiss = 0 and Pf a = 1, while the second one means Pmiss = 1 and Pf a = 0. After normalization, the cost

News Media Doc Count

AFE (Agence France Presse) 95432 APE (Associated Press) 104,941

CNE (CNN) 1117

LAT (LA Times/Washington Post) 6692

NYT (New York Times) 12,024

UME (Ummah) 1101

XIE (Xinhua) 56,802

Total 278,109

(26)

function is given as:

(Cdet)norm =Cdet/min(CmissPtarget, Cf a(1−Ptarget))

The normalizedDET of a reasonable algorithm will fall into the range [0,1). When the score is equal to 1, it means that the algorithm is not performing better than simply guessing all the documents as first or non-first stories. We will now just use Cdet to denote the normalized DET.

We will compare our algorithm with three baselines: 1. Original topic mixtures by time stamp

2. UMASS system [79] for TDT-5

3. Locality sensitive hashing (LSH) based method proposed in [81] The first one is simply using the topic mixture. As we know, the mixture of a topic can indicate how relevant the document are to this topic at a given time stamp, so it’s to some extent an indicator of a topic’s “strength”. When a change point is detected, we can assume that there is a relevant event going on. And we directly detect the change points of the topic mixtures along time and use these points as the time stamp of first stories.

4.3

Event Detection in CNN News

Before conducting experiments on the standard TDT5 dataset, we first use the CNN newswire to get an idea of the performance of our method. We first look at the average of the topic mixtures along time, which can indicate an event to some extent. However, there are far too many change points, most of which represent no real events and cause false alarm, just like what we can see from Figure 4.1(a).

Then we present the result of our algorithm on the same topic. When λ

is getting larger from 0.04 to 0.1, the change points detected are less and we have a clearer idea where the real events are. When we chooseλ= 0.2, we can see that the only detected events are Hurricane Isaac2 _{and Hurricane Sandy}3_.

2_{https://en.wikipedia.org/wiki/Hurricane Isaac (2012)} 3_{https://en.wikipedia.org/wiki/Hurricane Sandy}

(27)

0 50 100 150 200 250 300 350 400 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Day Topic Mixture

(a) Original Topic Mixture.

0 50 100 150 200 250 300 350 400 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Day c1 ( θ ) (b) λ= 0.04. 0 50 100 150 200 250 300 350 400 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Day c1 ( θ ) (c) λ= 0.1. 0 50 100 150 200 250 300 350 400 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Day c1 ( θ ) (d) λ= 0.2.

Figure 4.1: Mixture of a topic along the timeline

In Figure 4.3, we can see that the value ofλ has a clear effect on the missing and false alarm rate, under the setting of a certain threshold. In fact, this topic is about hurricanes and disasters. We can see that with a proper value of regularization parameter λ, our algorithm manages to filter the “noise” and keep only the most significant events. Thescoresof Hurricane Isaac and Hurricane Sandy are 0.28 and 0.46 respectively.

We now show the false alarm and missing rate against λ we use. For the baseline, these two measurements don’t depend on any other parameters.

We can see that generally speaking, our algorithm can achieve reasonable performance on the CNN dataset. If we look into the labeled events in Table 4.1, we can get the score as the last value of λ where the event can still be detected, in Table 4.4.

(28)

λ False Alarm (%) Missing Rate (%) 0.01 32.8 7.5 0.02 27.8 10.0 0.04 17.7 12.5 0.06 14.7 27.5 0.09 9.0 37.5 0.12 8.4 45.0 0.16 6.6 70.0 0.20 6.4 72.5 0.30 5.4 75.0 0.50 3.2 82.5

Table 4.3: False alarm and missing rate against λ

News Event Score

Death of Whitney Houston 0.24 Shooting of Trayvon Martin 0.41 Jerry Sandusky’s Trial 0.31

Aurora Shooting 0.40 London Olympics 0.27 Hurricane Isaac 0.28 Benghazi Attack 0.27 Hurricane Sandy 0.46 Presidential Election 0.52 Sandy Hook Shooting 0.45

Table 4.4: Labeled Events in 2012 CNN Transcripts

and missing rate and find the optimal value. We will list the optimal values

Cmin of different algorithms in Table 4.5. We can see that our algorithm beat the original topic mixture by a significant margin on CNN dataset, achieving a comparable performance with the other two baselines.

Method Ours Original UMASS LSH

Cmin 0.805 0.911 0.792 0.797

(29)

4.4

Event Detection in TDT5 Dataset

The task of first story detection on this standard dataset has drawn a lot of research attention. Among all the works, UMASS system [79] is outperform-ing others in the early efforts and the method proposed in [81] is an effective and time efficient one which can perform comparably to the UMASS sys-ten on TDT datasets. We have added the variance reduction part proposed in [81].

We use the evaluation metrics defined in 4.1. With this metric, we need to find the best threshold for our method and the baseline method to minimize the final penalty. We show the curves of missing and false alarm rate of dif-ferent algorithms in Figure 4.2. Since the original mixture always gives very high false alarm rate, we don’t show its curve in the figure. The LSH method produces very similar results to the UMASS system, with their contribution mainly in the efficiency aspect. Therefore, we also omit their results. We also show the normalized cost Cmin in Table 4.6.

0 10 20 30 40 50 60 70 80 90 0 10 20 30 40 50 60 70 80 90

False Alarm Rate in %

Missing Rate in %

Our Method UMASS Random

(30)

Method Ours Original UMASS LSH

Cmin 0.67 0.95 0.69 0.71

Table 4.6: Summary of results on TDT

From Figure 4.2 ,we can see that generally, our method can produce per-formance comparable with the UMASS system. When we are allowed to detect only very few candidate events, i.e. the missing rate is high, our method manages to achieve a much lower false alarm rate than the baseline. Therefore, our method has a lower minimum normalized cost, as shown in Table 4.6. However, since our method is not generally designed to work at a document level, the performance drops when the model is supposed to de-tect more first stories in the corpus. Even though, the direct comparison between our method and the document-level baseline shows that our method is performing reasonably well.

(31)

CHAPTER 5

CONCLUSION

In this thesis work, we explore the application of time series analysis in the problem of event detection in text mining. We compare the similarities and differences between traditional TSA and event detection, and propose a novel method in batch setting. Experiments on various datasets show that our method can generate performance which is competitive with other baseline methods.

There are also many aspects where our method can be improved and prob-lems that are worthy of study in the future. First of all, we can extend our batch setting to online setting to better detect events in a real time fashion, modifying the updating process in gradient descent. Another potential di-rection is applying our method to other time series data. Since change point detection is an important problem in many fields of research, it is meaningful that we adapt our algorithm to take new challenges from other disciplines.

(32)

REFERENCES

[1] C. C. Aggarwal and C. Zhai, Eds., Mining Text Data. Springer, 2012. [2] L. AlSumait, D. Barbar´a, and C. Domeniconi, “On-line lda: Adaptive topic models for mining text streams with applications to topic detection and tracking,” in Data Mining, 2008. ICDM’08. Eighth IEEE Interna-tional Conference on. IEEE, 2008, pp. 3–12.

[3] G. P. C. Fung, J. X. Yu, P. S. Yu, and H. Lu, “Parameter free bursty events detection in text streams,” inProceedings of the 31st international conference on Very large data bases. VLDB Endowment, 2005, pp. 181– 192.

[4] Q. Mei and C. Zhai, “Discovering evolutionary theme patterns from text: an exploration of temporal text mining,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. ACM, 2005, pp. 198–207.

[5] Q. He, K. Chang, and E.-P. Lim, “Analyzing feature trajectories for event detection,” in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR ’07. New York, NY, USA: ACM, 2007. [Online]. Available: http://doi.acm.org/10.1145/1277741.1277779 pp. 207–214. [6] J. D. Hamilton,Time series analysis. Princeton university press

Prince-ton, 1994, vol. 2.

[7] R. H. Shumway and D. S. Stoffer, Time series analysis and its applica-tions: with R examples. Springer Science & Business Media, 2010. [8] C. C. Aggarwal,Data streams: models and algorithms. Springer Science

& Business Media, 2007, vol. 31.

[9] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clus-tering evolving data streams,” in Proceedings of the 29th international conference on Very large data bases-Volume 29. VLDB Endowment, 2003, pp. 81–92.

(33)

[10] C. C. Aggarwal and S. Y. Philip, “A framework for clustering massive text and categorical data streams.” in SDM. SIAM, 2006, pp. 479–483. [11] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “On demand classifica-tion of data streams,” in Proceedings of the tenth ACM SIGKDD inter-national conference on Knowledge discovery and data mining. ACM, 2004, pp. 503–508.

[12] S. Pan, Y. Zhang, and X. Li, “Dynamic classifier ensemble for posi-tive unlabeled text stream classification,” Knowledge and information systems, vol. 33, no. 2, pp. 267–287, 2012.

[13] X. Li, S. Y. Philip, B. Liu, and S.-K. Ng, “Positive unlabeled learning for data stream classification.” in SDM, vol. 9. SIAM, 2009, pp. 257–268. [14] J. Allan, V. Lavrenko, and H. Jin, “First story detection in tdt is hard,” inProceedings of the Ninth International Conference on Information and

Knowledge Management, ser. CIKM ’00. New York, NY, USA: ACM,

2000. [Online]. Available: http://doi.acm.org/10.1145/354756.354843 pp. 374–381.

[15] L. C. Alwan and H. V. Roberts, “Time-series modeling for statistical process control,” Journal of Business & Economic Statistics, vol. 6, no. 1, pp. 87–95, 1988.

[16] E. Parzen, “Some recent advances in time series modeling,” Automatic Control, IEEE Transactions on, vol. 19, no. 6, pp. 723–730, 1974. [17] A. D. Back and A. C. Tsoi, “Fir and iir synapses, a new neural network

architecture for time series modeling,”Neural Computation, vol. 3, no. 3, pp. 375–385, 1991.

[18] J. D. Farmer and J. J. Sidorowich, “Predicting chaotic time series,”

Physical review letters, vol. 59, no. 8, p. 845, 1987.

[19] J. J. Choi, S. Hauser, and K. J. Kopecky, “Does the stock market predict real activity? time series evidence from the g-7 countries,” Journal of Banking & Finance, vol. 23, no. 12, pp. 1771–1792, 1999.

[20] X. Xi, E. Keogh, C. Shelton, L. Wei, and C. A. Ratanamahatana, “Fast time series classification using numerosity reduction,” in Proceedings of the 23rd international conference on Machine learning. ACM, 2006, pp. 1033–1040.

[21] P. Geurts, “Pattern extraction for time series classification,” in Prin-ciples of data mining and knowledge discovery. Springer, 2001, pp. 115–127.

(34)

[22] T. W. Liao, “Clustering of time series dataa survey,”Pattern recognition, vol. 38, no. 11, pp. 1857–1874, 2005.

[23] S. Shaban, “Change point problem and two-phase regression: an anno-tated bibliography,” International statistical review, vol. 48, no. 1, pp. 83–93, 1980.

[24] D. Hawkins and D. Merriam, “Optimal zonation of digitized sequential data,” Mathematical Geology, vol. 5, no. 4, pp. 389–395, 1973.

[25] G. Bhattacharyya and R. A. Johnson, “Nonparametric tests for shift at an unknown time point,” The Annals of Mathematical Statistics, pp. 1731–1743, 1968.

[26] L. D. Broemeling, “Bayesian inferences about a changing sequence of random variables,” Communications in Statistics-Theory and Methods, vol. 3, no. 3, pp. 243–255, 1974.

[27] G. E. Box and G. M. Jenkins, Time series analysis: forecasting and control, revised ed. Holden-Day, 1976.

[28] K. Irvine and A. Eberhardt, “Multiplicative, seasonal arima models for lake erie and lake ontario water levels1,” 1992.

[29] L.-M. Liu, “Identification of seasonal arima models using a filtering method,” Communications in Statistics-Theory and Methods, vol. 18, no. 6, pp. 2279–2288, 1989.

[30] S. Mohan and S. Vedula, “Multiplicative seasonal arima model for longterm forecasting of inflows,” Water resources management, vol. 9, no. 2, pp. 115–126, 1995.

[31] V. S¸. Ediger and S. Akar, “Arima forecasting of primary energy demand by fuel in turkey,” Energy Policy, vol. 35, no. 3, pp. 1701–1708, 2007. [32] U. Kumar and V. Jain, “Arima forecasting of ambient air pollutants

(o3, no, no2 and co),” Stochastic Environmental Research and Risk As-sessment, vol. 24, no. 5, pp. 751–760, 2010.

[33] R. G. Kavasseri and K. Seetharaman, “Day-ahead wind speed forecast-ing usforecast-ing f-arima models,” Renewable Energy, vol. 34, no. 5, pp. 1388– 1393, 2009.

[34] F.-M. Tseng, G.-H. Tzeng, H.-C. Yu, and B. J. Yuan, “Fuzzy arima model for forecasting the foreign exchange market,” Fuzzy sets and sys-tems, vol. 118, no. 1, pp. 9–19, 2001.

(35)

[35] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley sympo-sium on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA, USA., 1967, pp. 281–297.

[36] W. S. Sarle, “Finding groups in data: An introduction to cluster anal-ysis,” Journal of the American Statistical Association, vol. 86, no. 415, pp. 830–833, 1991.

[37] J. C. Bezdek,Pattern recognition with fuzzy objective function algorithm-s. Springer Science & Business Media, 2013.

[38] R. Krishnapuram, A. Joshi, O. Nasraoui, and L. Yi, “Low-complexity fuzzy relational clustering algorithms for web mining,” Fuzzy Systems, IEEE Transactions on, vol. 9, no. 4, pp. 595–607, 2001.

[39] I. Cadez, D. Heckerman, C. Meek, P. Smyth, and S. White, “Visualiza-tion of naviga“Visualiza-tion patterns on a web site using model-based clustering,” in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2000, pp. 280–284. [40] P. Smyth et al., “Clustering sequences with hidden markov models,”

Advances in neural information processing systems, pp. 648–654, 1997. [41] C. Li and G. Biswas, “A bayesian approach to temporal data clustering

using hidden markov models.” in ICML, 2000, pp. 543–550.

[42] Y. Xiong and D.-Y. Yeung, “Mixtures of arma models for model-based time series clustering,” in Data Mining, 2002. ICDM 2003. Proceedings. 2002 IEEE International Conference on. IEEE, 2002, pp. 717–720. [43] L. Chen and M. S. Kamel, “Design of multiple classifier systems for

time series data,” in Multiple Classifier Systems. Springer, 2005, pp. 216–225.

[44] A. Nanopoulos, R. Alcock, and Y. Manolopoulos, “Feature-based clas-sification of time-series data,” International Journal of Computer Re-search, vol. 10, no. 3, pp. 49–61, 2001.

[45] E. Keogh and S. Kasetty, “On the need for time series data mining benchmarks: a survey and empirical demonstration,” Data Mining and knowledge discovery, vol. 7, no. 4, pp. 349–371, 2003.

[46] L. Wei and E. Keogh, “Semi-supervised time series classification,” in

Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 748–753.

(36)

[47] J. H. Horne and S. L. Baliunas, “A prescription for period analysis of unevenly sampled time series,” The Astrophysical Journal, vol. 302, pp. 757–763, 1986.

[48] A. Grinsted, J. C. Moore, and S. Jevrejeva, “Application of the cross wavelet transform and wavelet coherence to geophysical time series,”

Nonlinear processes in geophysics, vol. 11, no. 5/6, pp. 561–566, 2004. [49] D. B. Percival and A. T. Walden,Wavelet methods for time series

anal-ysis. Cambridge university press, 2006, vol. 4.

[50] C. Corley, A. R. Mikler, K. P. Singh, and D. J. Cook, “Monitoring influenza trends through mining social media.” inBIOCOMP, 2009, pp. 340–346.

[51] J. Ginsberg, M. H. Mohebbi, R. S. Patel, L. Brammer, M. S. Smolins-ki, and L. Brilliant, “Detecting influenza epidemics using search engine query data,” Nature, vol. 457, no. 7232, pp. 1012–1014, 2009.

[52] S. Goel, J. M. Hofman, S. Lahaie, D. M. Pennock, and D. J. Watts, “What can search predict,” WWW10, 2010.

[53] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes, “Correlat-ing financial time series with micro-blogg“Correlat-ing activity,” inProceedings of the fifth ACM international conference on Web search and data mining. ACM, 2012, pp. 513–522.

[54] E. Page, “Continuous inspection schemes,” Biometrika, vol. 41, no. 1/2, pp. 100–115, 1954.

[55] R. L. Brown, J. Durbin, and J. M. Evans, “Techniques for testing the constancy of regression relationships over time,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 149–192, 1975.

[56] B. Darkhovskh, “A nonparametric method for the a posteriori detection of the disorder time of a sequence of independent random variables,”

Theory of Probability & Its Applications, vol. 21, no. 1, pp. 178–183, 1976.

[57] S. Zacks, “Survey of classical and bayesian approaches to the change-point problem: fixed sample and sequential procedures of testing and estimation,” Recent advances in statistics, vol. 25, pp. 245–269, 1983. [58] T. L. Lai, “Sequential changepoint detection in quality control and

dy-namical systems,” Journal of the Royal Statistical Society. Series B (Methodological), pp. 613–658, 1995.

(37)

[59] P. Hrishnaiah and B. Miao, “Review about estimation of change point,”

Handbook in Statistics, Quality Control and Reliability, North-Holland, vol. 7, pp. 375–402, 1988.

[60] J. Chen and A. K. Gupta, “On change point detection and estimation,”

Communications in statistics-simulation and computation, vol. 30, no. 3, pp. 665–697, 2001.

[61] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivo-nen, “Time series segmentation for context recognition in mobile de-vices,” in Data Mining, 2001. ICDM 2001, Proceedings IEEE Interna-tional Conference on. IEEE, 2001, pp. 203–210.

[62] E. Keogh, S. Chu, D. Hart, and M. Pazzani, “An online algorithm for segmenting time series,” inData Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on. IEEE, 2001, pp. 289–296.

[63] V. Guralnik and J. Srivastava, “Event detection from time series data,” in Proceedings of the fifth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 1999, pp. 33–42.

[64] S. B. Guthery, “Partition regression,” Journal of the American Statisti-cal Association, vol. 69, no. 348, pp. 945–947, 1974.

[65] J. Bai, “Estimation of a change point in multiple regression models,”

Review of Economics and Statistics, vol. 79, no. 4, pp. 551–563, 1997. [66] D. M. Hawkins, “Point estimation of the parameters of piecewise

regres-sion models,” Applied Statistics, pp. 51–57, 1976.

[67] M. A. Mahmoud, P. A. Parker, W. H. Woodall, and D. M. Hawkins, “A change point method for linear profile data,” Quality and Reliability Engineering International, vol. 23, no. 2, pp. 247–268, 2007.

[68] E. Brodsky and B. S. Darkhovsky, Nonparametric methods in change point problems. Springer Science & Business Media, 2013, vol. 243. [69] S. Boriah, “Time series change detection: Algorithms for land cover

change,” Ph.D. dissertation, University of Minnesota, 2010.

[70] D. Lu, P. Mausel, E. Brondizio, and E. Moran, “Change detection tech-niques,” International journal of remote sensing, vol. 25, no. 12, pp. 2365–2401, 2004.

[71] R. Garnett, M. A. Osborne, S. Reece, A. Rogers, and S. J. Roberts, “Se-quential bayesian prediction in the presence of changepoints and faults,”

(38)

[72] R. Garnett, M. A. Osborne, and S. J. Roberts, “Sequential bayesian prediction in the presence of changepoints,” in Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009, pp. 345–352.

[73] J. Antoch and M. Huˇskov´a, “Permutation tests in change point analy-sis,” Statistics & probability letters, vol. 53, no. 1, pp. 37–46, 2001. [74] M. Steyvers and S. Brown, “Prediction and change detection,” in

Ad-vances in neural information processing systems, 2005, pp. 1281–1288. [75] V. Moskvina and A. Zhigljavsky, “Application of the singular spectrum

analysis for change-point detection in time series,” Journal of Time Se-ries Analysis, submitted, 2001.

[76] T. Id´e and K. Tsuda, “Change-point detection using krylov subspace learning.” in SDM. SIAM, 2007, pp. 515–520.

[77] Y. Kawahara, T. Yairi, and K. Machida, “Change-point detection in time-series data based on subspace identification,” in Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on. IEEE, 2007, pp. 559–564.

[78] J. Allan, J. G. Carbonell, G. Doddington, J. Yamron, and Y. Yang, “Topic detection and tracking pilot study final report,” 1998.

[79] J. Allan, V. Lavrenko, D. Malin, and R. Swan, “Detections, bounds, and timelines: Umass and tdt-3,” in Proceedings of topic detection and tracking workshop, 2000, pp. 167–174.

[80] Y. Yang, T. Pierce, and J. Carbonell, “A study of retrospective and on-line event detection,” inProceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998, pp. 28–36.

[81] S. Petrovi´c, M. Osborne, and V. Lavrenko, “Streaming first story detec-tion with applicadetec-tion to twitter,” inHuman Language Technologies: The 2010 Annual Conference of the North American Chapter of the Asso-ciation for Computational Linguistics. Association for Computational Linguistics, 2010, pp. 181–189.

[82] N. Stokes and J. Carthy, “Combining semantic and syntactic document classifiers to improve first story detection,” in Proceedings of the 24th annual international ACM SIGIR conference on Research and develop-ment in information retrieval. ACM, 2001, pp. 424–425.

(39)

[83] N. Bansal and N. Koudas, “Blogscope: a system for online analysis of high volume text streams,” in Proceedings of the 33rd international conference on Very large data bases. VLDB Endowment, 2007, pp. 1410–1413.

[84] S. Nakajima, J. Tatemura, Y. Hino, Y. Hara, and K. Tanaka, “Discov-ering important bloggers based on analyzing blog threads,” in Annual Workshop on the Weblogging Ecosystem, 2005.

[85] M. Osborne, S. Petrovic, R. McCreadie, C. Macdonald, and I. Ounis, “Bieber no more: First story detection using twitter and wikipedia,” in

SIGIR 2012 Workshop on Time-aware Information Access, 2012. [86] S. Petrovi´c, M. Osborne, and V. Lavrenko, “Using paraphrases for

im-proving first story detection in news and twitter,” in Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Associa-tion for ComputaAssocia-tional Linguistics, 2012, pp. 338–346.

[87] G. Luo, C. Tang, and P. S. Yu, “Resource-adaptive real-time new even-t deeven-teceven-tion,” in Proceedings of the 2007 ACM SIGMOD international conference on Management of data. ACM, 2007, pp. 497–508.

[88] R. Tibshirani, M. Saunders, S. Rosset, J. Zhu, and K. Knight, “Sparsity and smoothness via the fused lasso,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 67, no. 1, pp. 91–108, 2005.

[89] D. Angelosante, G. B. Giannakis, and E. Grossi, “Compressed sensing of time-varying signals,” in Digital Signal Processing, 2009 16th Inter-national Conference on. IEEE, 2009, pp. 1–8.

[90] K. Bleakley and J.-P. Vert, “The group fused lasso for multiple change-point detection,” arXiv preprint arXiv:1106.4199, 2011.

[91] C. Levy-leduc and Z. Harchaoui, “Catching change-points with lasso,” in

Advances in Neural Information Processing Systems, 2008, pp. 617–624. [92] X. Wang, C. Zhai, X. Hu, and R. Sproat, “Mining correlated bursty topic patterns from coordinated text streams,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2007, pp. 784–793.

[93] J. Duchi, E. Hazan, and Y. Singer, “Adaptive subgradient methods for online learning and stochastic optimization,” The Journal of Machine Learning Research, vol. 12, pp. 2121–2159, 2011.

(40)

[94] A. G. Hauptmann and M. J. Witbrock, “Story segmentation and de-tection of commercials in broadcast news video,” in Research and Tech-nology Advances in Digital Libraries, 1998. ADL 98. Proceedings. IEEE International Forum on. IEEE, 1998, pp. 168–179.

[95] A. Merlino, D. Morey, and M. Maybury, “Broadcast news navigation using story segmentation,” inProceedings of the fifth ACM international conference on Multimedia. ACM, 1997, pp. 381–391.

[96] N. Stokes, J. Carthy, and A. F. Smeaton, “Select: a lexical cohesion based news story segmentation system,” AI Communications, vol. 17, no. 1, pp. 3–12, 2004.

[97] W. Hsu, S.-F. Chang, C.-W. Huang, L. Kennedy, C.-Y. Lin, and G. Iyen-gar, “Discovery and fusion of salient multimodal features toward news story segmentation,” in Electronic Imaging 2004. International Society for Optics and Photonics, 2003, pp. 244–258.

[98] R. Manmatha, A. Feng, and J. Allan, “A critical examination of tdt’s cost function,” in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2002, pp. 403–404.

[99] J. G. Fiscus and G. R. Doddington, “Topic detection and tracking eval-uation overview,” Topic detection and tracking, pp. 17–31, 2002.

Time series analysis for event detection in text data