Evaluation on Clusterings using Reactive Approach

5.3 Automated Change Detection by Clustering

5.5.4 Evaluation on Clusterings using Reactive Approach

5.6 Summary . . . 98

5.1 Introduction

Clustering evolving data streams not only finds new emerging patterns in the data, but it can also detect changes in the patterns, and data distribution. Detecting changes in the clustering structure can be seen in many real world applications.

For example, understanding how clustering model evolves can help us find out the sources responsible for spreading diseases [Barbará 2001]. Clustering evolving data streams can help to analyse the evolution of group behavior in social networks. Analysis of the evolution of clustering structure can provide the foundation for promptly making decisions. This chapter presents a method for detecting changes in multivariate streaming data by using the geometric and clustering approach. We then present method for building and maintaining of clustering by using the reactive approach in which this clustering-based change detection method is used to determine when to rebuild clustering.

The concept of change in this Chapter is understood as follows. Our change detection method uses the model fitting approach in which a change occurs when a new data item or block of data items do not fit the existing clustering.

5.2 Formal Model

The problem is formulated as follows. A data stream is an infinite sequence of elements

S = {(X1, T1) , .., (Xj, Tj) , ...} (5.2.1)

Each element is a pair (Xj, Tj) where Xj is a d-dimensional vector Xj =

(x1, x2, ..., xd) arriving at the time stamp Tj. Change detection can reduce to the

problem of hypothesis testing below (

H0 ¬change

H1 change

(5.2.2)

where the hypothesis is a logical expression change which indicates whether a change occurs.

Let W denote a sliding window. Let SC be the stream of clusterings that are continuously created by the stream clustering algorithm over the window W . As such, A : S, W → SC is a mapping that creates a stream of clusterings SC from the stream of data points S by continuously applying the algorithm A on each window of data items W as the window slides on the data stream S.

The first truly scalable algorithm for clustering data stream called BIRCH [Zhang 1996] constructs a clustering structure in a single scan over the data with limited memory. BIRCH can work with limited resources. BIRCH consists of the following characteristics: incremental clustering, compact representation of clusters, and the ability to process data in a single pass. With the above characteristics, BIRCH is well suited for clustering large database as well as the evolving data streams. The underlying concept behind BIRCH called the cluster feature vector is defined as follows

5.2. Formal Model 85

Definition 5.2.1. Clustering Feature [Zhang 1996] Given d-dimensional data points in a cluster:n ~Xowherei = 1, 2, .., N , the Clustering Feature (CF) vector of the cluster is a triple:CF = (N, LS, SS) where N is the number of data points in the cluster,LS =

N −1

i=0

Xi is the linear sum of the data points in the cluster, and

SS =

N −1

i=0

X_i2is the squared sum of the N data points. The cluster created by merg- ing two above disjoint clustersCF1 and, CF2 has the cluster feature is defined as

follows

CF = (N1+ N2, LS1+ LS2, SS1 + SS2) (5.2.3)

The advantages of this CF summary are:

• it does not require to store all the data points in the cluster.

• it provide sufficient information for computing all the measurements neces- sary for making clustering decisions [Zhang 1996].

Aggarwal et al. [Aggarwal 2003] extends this concept cluster feature vector for streaming context by adding the temporal components.

Definition 5.2.2. Micro-cluster [Aggarwal 2003]. A micro-cluster for a set of d−dimensional points Xi1, ..., XiN with time stamps Ti1, ..., Tin is the (2d + 3)-

tuple

CF 2x_{, CF 1}x_{, CF 2}t_{, CF 1}t_{, N}_{, wherein CF 2}x_and_{CF 1}x_{each corresponds to a}

vector ofd entries. The definition of each of these entries is as follows

• For each dimension, the sum of the squares of the data values is maintained inCF 2x_{. Thus,}_{CF 2}x_contains_{d values. The p − th entry of CF 2}x_{is equal}

to N P j=1 X_ip j 2 .

• For each dimension, the sum of the data values is maintained in CF 1x_{. Thus,}

CF 1x _contains_{d values. The p − th entry of CF 1}x_{is equal to} N P j=1 X_ip j .

• The sum of the squares of the time stamps Ti1, ..., Tin is maintained inCF 2

t_.

• The sum of the time stamps Ti1, ..., TiN is maintained inCF 1

t_.

• The number of data points is maintained in N .

Definition 5.2.2is the first definition of micro-cluster. In the later work, vari- ants of micro-clusters are proposed to meet the specific requirements of the stream clustering algorithms. For example, in DenStream [Cao 2006], micro-clusters fall

x r1 r2 r3 d2 d3 d1 d3’ d2’ d1’ y C1 C2 C3 x is a change point y is not a change point

Figure 5.1: Change detection by clustering

into types such as core-micro-clusters, potential-c-micro-clusters, or outlier micro- clusters, which are used for density-based clustering.

Recently, Agarwal et al. have introduced the concept mergable summaries that are useful for many problems [Agarwal 2012]. As micro clusters can be merged into new one based on the additive property, micro clusters are mergable summaries that are useful for the problem of clustering streaming data as well as for the other problems.

5.3 Automated Change Detection by Clustering

This section presents an automated change detection algorithm in streaming multivariate data by clustering.

5.3.1 Automated Change Detection

As described in Section 5.2, depending on how to establish the logical expression change, we can derive the different change detection methods. This section presents an automated method for detecting changes in multivariate streaming data by using clustering and geometric approach.

Let x denote a recently arriving data item. A data point can be considered a change point if it is not a member of any cluster. In other words, data point x is a

5.3. Automated Change Detection by Clustering 87

change point if the following logical expression is true: change = K∧

i=1[d (x, center (Ci)) > radius (Ci)] (5.3.1)

where d (x, center (Ci)) is the distance between a newly incoming data point x

and the center of cluster Ci, and radius (Ci) is the radius of the cluster Ci, for

i ∈ 1, .., K.

Figure 5.1 illustrates how the clustering-based method for detecting changes works. There is a clustering of three clusters C1, C2, C3 in the reference window. As all the distances from data point x to three centers of clusters C1, C2, C3 are greater than the radiuses of three corresponding clusters, data point x is a change point while data point y is non-change point, because it is a member of cluster C1. In terms of model fitting, data point x does not fit to the existing clustering while data point y fits to the existing clustering.

Expression 5.3.1shows how to check whether a data point is a change point. However, this logical expression is only used for one incoming data point. We now design a logical expression for checking whether the changes occur when sliding step is a block of b data points x1, x2, .., xb. A block of data points is considered

change if at least one change point belongs to this block. Therefore, the logical expression for checking whether a block of data points change in comparison with the reference window is as follows.

change (x1, x2, .., xb) = change (x1) ∨ change (x2) .. ∨ change (xb) (5.3.2)

where change (xi) is determined by Expression5.3.1.

All the data points are encoded as micro-clusters. A micro-cluster is a temporal cluster feature vector that is extended from the cluster feature vector. The center and the radius of each cluster can be easily computed from its micro-cluster. As described in Section 5.3.1, as the shape of cluster should be sphere cluster, the clustering algorithm used in the reference window is K −means, and the similarity measure is the Euclidean distance. The clustering-based algorithm for detecting change works as follows.

• Initialization: Read the first N data items from the incoming stream into the reference window w1. The current window is the content of the reference

window that slides one step to capture new data item. Windows w1 and w2 are overlapping by N − 1 data points. The next step is building a clustering the reference window w1.

• Continuous monitoring: Check whether a change occurs by testing criteria for change as in Expression5.3.1or5.3.2. If a change is detected, the detector makes an alarm, at the current time tc, and the current window becomes

the reference window, construct the new clustering for the new reference window. The current window w2 always slides one step forward whether the recently arriving data item changes or does not change.

Algorithm6describes an clustering-based change detection algorithm with the arbitrary sliding step step, where 1 < step < N − 1 and N is the window width. The input to the algorithm is a multivariate data stream S and the number of clusters K. If a change occurs, detector reports the change point. Our clustering-based change detector is based on the overlapping windows model. As the value of the data decreases over time, instead of storing for later analysis, data is immediately analyzed as it is produced. In particular, there are two windows: the reference window w1 and the current window w2. The current window is used to capture new items.

We note that the boolean function change(C1, blk) determined by Expression

5.3.2checking whether at least a change point exists in the block of data points blk.

In a naive approach, we would compute the distances of the data points in the window with a newly arriving data item. Based these distances, we determine whether the recently arriving data point is a change point. As such, if the window of size N , we need to computes the distance function N times.

By using change condition specified in Expression 5.3.1 and 5.3.2, the number of comparisons reduces from N to K, where N is the size of sliding window, K is the number of clusters, K N . The clustering-based method for detecting change is an automated method. Furthermore ,it can detect the changes in multivariate streaming data. If the changes frequently occur, we need to run the clustering algorithm K − means in order to create the clustering from the reference window many times. This algorithm is efficient for the rarely occurring changes such as anomaly changes.

If the purpose is only to report whether a change exists in the recently incoming block of data, we can improve the performance of the clustering-based change detection as follows. Whenever the first change occurs in the newly incoming block, the algorithm makes an alarm that a change occurs without checking the rest of the block.

5.3.2 Maintenance of Clustering using Reactive Approach

The goal of clustering maintenance is to maintain a streaming clustering structure undergoing insertion and deletion of items when the sliding window moves on the data stream. This section describes how to builds and to maintain the clusterings emerging from data stream over a sliding window of fixed size using the reactive approach.

5.3. Automated Change Detection by Clustering 89

Algorithm 6: Clustering-based algorithm for detecting changes by using overlapping windows model

input : A data stream S, N is the window width, slide is sliding step, K is the number of clusters

output: The messages reporting the changes occurred, and time t at which changes occurred

1. Initialization: begin

t ←− 0;

w1 ←− first N points from time t; _{/* Each data point is} encoded as a micro-cluster */

C1 ←− Kmeans(w1, K);

w2 ←− slide(w1, step); blk ←− newItemBlock(w2); 2. Continuous monitoring:

while not at the end of the stream do if change(C1, blk) then

t ←− current time;

Report change occurred at time t; w1 ←− w2;

C1 ←− Kmeans(w1, K);

w2 ←− slide(w2, step); blk ←− newItemBlock(w2);

A difficult problem in clustering of streaming data is that the underlying distribution of data stream evolves overtime. These changes can induce more or less changes in the clustering structure emerging from the data stream. Algorithm7de- scribes how to build and maintain a clustering over sliding window by using the reactive approach. The input to the algorithm is a multivariate data stream S. If a change occurs, it rebuilds a new clustering. This clustering is returned as the result of some clustering query, or is stored in the history of clustering in streaming data warehouse.

Algorithm 7: Algorithm for building and maintenance of clustering over sliding window using the reactive approach

input : A data stream S and number of clusters K, the sliding step step, the window width N

output: Return clustering over sliding window at anytime 1. Initialization:

begin t ←− 0;

w1 ←− first N points from time t ; /* Each data point is encoded as a micro-cluster */

C1 ←− K − means(w1, K);

2. Continuous maintenance of clustering: while not at the end of the stream do

w2 ←− slide(w1, step); x ←− newItemBlock(w2); if change then t ←− current time; w1 ←− w2; C1 ←− Kmeans(w1, K);

5.4 Related Work

The clustering-based change detection method proposed here is related to the work on automatic change detection and change detection in multivariate streaming data, and clustering-based change detection, and the reactive work on building and main- taing of model (as introduced in Section 1.3.2. Related work on the clustering- based methods for detecting changes were presented in Section 1.1.2. The following sections present related work on automatic change detection and change detection in multivariate streaming data.

5.4. Related Work 91

5.4.1 Automated Change Detection

As automated systems require the capability of realtime processing, and adapt- ing to the changing environments, automated change detection plays an important role in many automated systems. One of the models of realtime processing is data stream processing. For example, sensor networks need the automated change detection methods in which detection threshold must be adaptive to these changes of the environment. Automated change detection method also plays important role in many mobile robotic applications [Neuman 2011]. For example, Neuman et al. [Neuman 2011] have proposed an online change detection method for mobile robots based on the segmentation approach. Recently Hirte et al. [Hirte 2012] have developed Data3, a Kinect interface for human motion detection. In fact, Data3 is a kind of system capable of detecting the changes in spatial-temporal streaming data. Automated systems should be capable of automatically detecting the changes without the given detection threshold. Some change detection methods can automatically tune the detection thresholds so that the rate of false alarms is not greater than a given rate of false alarms. Gustafson and Palmquist deal with the problem of automated tuning of change detectors with given false alarm rate [Gustafsson 1997]. Their approach computes the detection threshold by estimat- ing a parametric distribution. The advantage of this method is that they can predict detection threshold with no or few false alarms from the used data. However, it is parametric method.

The automatic selection of threshold is of special importance. Alippi et al. [Alippi 2012] have presented an automated change detection method for streaming data based on Hidden Markov Models. This HMM-based method is an automated change detection method by thresholding. Their algorithm consists of the following steps: model the relationships among data streams a sequence of time invariant linear dynamic system; model the evolution of the estimated parameters of the model by Hidden Markov Model; evaluate the likelihood of new parameters; detect change based on a given threshold. If the likelihood is less than a given threshold, a change is detected.

5.4.2 Change Detection in Multivariate Streaming Data

Most methods for detecting changes in streaming multivariate data are based on the multivariate tests. As the problem of testing statistical hypotheses in high dimensional data is particularly challenging and the change detection in streaming data requires to respond to the changes in nearly real-time, change detection in streaming multivariate data is challenging.

There are two approaches to dealing with the problem of change detection in streaming multivariate data. The first approach is based on the transformation that

converts a multivariate data stream into a univariate data stream. Change detection is then performed on the univariate data stream. Dasu et al. [Dasu 2009] have used this approach to design a change detection method for streaming multivariate data. In particular, a multivariate data stream is converted into a univariate data stream. The change detection task is performed by using Kolmogorov-Smirnov test. Similarly, Kim et al. [Kim 2009] have recently proposed the concept called the detection stream. A detection stream is a univariate stream that is generated by mapping a multivariate stream into stream of dissimilarity measures quantifying the difference between two windows.

The second approach is developing new methods for detecting the changes in multivariate streaming data. Kuncheva has recently presented a general method for detecting changes in multivariate streaming data by using likelihood ratio-based test [Kuncheva 2011]. Closely related to our work is the segment-based method for detecting in multivariate data stream proposed by Chen et al. [CHEN 2009]. Gret- ton et al. [Gretton 2012] present a kernel two-sample test that can check whether two multivariate samples coming from the same distribution. Furthermore, a kernel two-sample test with the linear computational complexity suitable for streaming environment is proposed.

In contrast to the previous work, Section5.3 introduces a new method for detecting the changes in multivariate streaming data by using the geometric and clustering approach.

5.5 Evaluation

The clustering-based change detector was written in Java. An algorithm for detecting changes in streaming data should be evaluated in three aspects scalability, accuracy, and monitoring capability. The detection accuracy of a change detection method depends on the window width and the number of clusters. The experiments on change detector using clustering were divided into two groups in terms of detection accuracy, running time, and memory consumption. The first group of experiments evaluated the effectiveness of window width on the clustering-based change detector. The next group of experiments studied the effectiveness of the number of clusters on the performance of the clustering-based change detector. We analyzed the effect of the window width, number of clusters on the number of change points detected by the clustering-based detectors for change.

One of the challenges in assessing a change detection algorithm is the lack of ground truth data. In particular, the evaluation of a method for detecting changes in multivariate streaming data is more challenging. Therefore, the synthetic data was used for evaluating the performance of the change detection algorithms. The synthetic data set to evaluate the accuracy of our clustering-based change detec-

5.5. Evaluation 93 200 400 600 800 1000 1200 1400 1600 1800 2000 0 10 20 30 40 50 60

Window width in number of tuples

Running time in seconds

continuous change detector restarting detector

Figure 5.2: Effectiveness of window width on the performance of clustering-based method for detecting changes in terms of running time

tion algorithm is the streaming data set HyperP [Zhu 2010]. HyperP stream is a synthetic data stream of gradually evolving (drifting) concepts. The Hyper Plane stream consists of 100000 instances, and each instance consists of 10 attributes. These instances may fall into one of 5 classes (clusters). To see the advantages of the continuous detector of change over the restarting detector of change, we ran both continuous detector and restarting detector. Restarting detection of changes means that a detector will be reset if a change is detected while non-restarting detector continuously detects the changes. Non-restarting detection of change is also called continuous change detection.

We ran the detectors by clustering for two cases. First, to see the effect of the window width on detection performance of the detectors, we fixed the number of clusters while changing the window width. Second, to see the effect of the number of clusters on the detection performance of the detectors, we fixed the window width while the number of clusters changes.

5.5.1 Effectiveness of Window Width

To study the effectiveness of window width on the performance of clustering-based change detector, the number of clusters was fixed to 5, and the window width was

In document Change Detection in Streaming Data (Page 103-121)