5.3 Automated Change Detection by Clustering
5.5.4 Evaluation on Clusterings using Reactive Approach
5.6 Summary
5.1 Introduction
Clustering evolving data streams not only finds newly emerging patterns in the data but can also detect changes in those patterns and in the data distribution. Detecting changes in the clustering structure is useful in many real-world applications.
For example, understanding how a clustering model evolves can help identify the sources responsible for spreading diseases [Barbará 2001]. Clustering evolving data streams can also help to analyse the evolution of group behavior in social networks. Analysis of the evolution of the clustering structure can provide a foundation for making decisions promptly. This chapter presents a method for detecting changes in multivariate streaming data by using a geometric and clustering approach. We then present a method for building and maintaining a clustering by using the reactive approach, in which this clustering-based change detection method determines when to rebuild the clustering.
The concept of change in this chapter is understood as follows. Our change detection method uses the model fitting approach, in which a change occurs when a new data item or block of data items does not fit the existing clustering.
5.2 Formal Model
The problem is formulated as follows. A data stream is an infinite sequence of elements
\[ S = \{(X_1, T_1), \ldots, (X_j, T_j), \ldots\} \tag{5.2.1} \]
Each element is a pair $(X_j, T_j)$, where $X_j = (x_1, x_2, \ldots, x_d)$ is a d-dimensional vector arriving at time stamp $T_j$. Change detection can be reduced to the following hypothesis testing problem:
\[ \begin{cases} H_0: \neg\,\mathit{change} \\ H_1: \mathit{change} \end{cases} \tag{5.2.2} \]
where the hypothesis is the logical expression change, which indicates whether a change occurs.
Let W denote a sliding window. Let SC be the stream of clusterings that are continuously created by the stream clustering algorithm over the window W . As such, A : S, W → SC is a mapping that creates a stream of clusterings SC from the stream of data points S by continuously applying the algorithm A on each window of data items W as the window slides on the data stream S.
BIRCH [Zhang 1996], the first truly scalable algorithm for clustering data streams, constructs a clustering structure in a single scan over the data with limited memory. BIRCH has the following characteristics: incremental clustering, compact representation of clusters, and the ability to process data in a single pass. With these characteristics, BIRCH is well suited for clustering large databases as well as evolving data streams. The underlying concept behind BIRCH, called the cluster feature vector, is defined as follows.
Definition 5.2.1. Clustering Feature [Zhang 1996]. Given N d-dimensional data points in a cluster, $\{\vec{X}_i\}$ where $i = 1, 2, \ldots, N$, the Clustering Feature (CF) vector of the cluster is a triple $CF = (N, LS, SS)$, where N is the number of data points in the cluster, $LS = \sum_{i=1}^{N} \vec{X}_i$ is the linear sum of the data points in the cluster, and $SS = \sum_{i=1}^{N} \vec{X}_i^2$ is the squared sum of the N data points. The cluster created by merging two disjoint clusters with cluster features $CF_1 = (N_1, LS_1, SS_1)$ and $CF_2 = (N_2, LS_2, SS_2)$ has the cluster feature defined as follows:
\[ CF = (N_1 + N_2,\; LS_1 + LS_2,\; SS_1 + SS_2) \tag{5.2.3} \]
The advantages of this CF summary are:
• it does not require storing all the data points in the cluster.
• it provides sufficient information for computing all the measurements necessary for making clustering decisions [Zhang 1996].
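The additive property and the derived measurements can be illustrated with a short sketch. The Python class below is only an illustration under stated assumptions, not BIRCH's implementation: the name ClusteringFeature and its methods are hypothetical, SS is stored as a scalar sum of squared norms, and the radius is taken to be the root-mean-square distance of the points to the centroid, computed as $\sqrt{SS/N - \lVert LS/N \rVert^2}$.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ClusteringFeature:
    """CF = (N, LS, SS) summary of a set of d-dimensional points (hypothetical helper)."""
    n: int            # number of points
    ls: List[float]   # per-dimension linear sum of the points
    ss: float         # sum of squared norms of the points (assumed scalar)

    @classmethod
    def from_point(cls, x: Sequence[float]) -> "ClusteringFeature":
        # A single point is a cluster of size one.
        return cls(1, list(x), sum(v * v for v in x))

    def merge(self, other: "ClusteringFeature") -> "ClusteringFeature":
        # Additive property (Equation 5.2.3): component-wise sums.
        return ClusteringFeature(
            self.n + other.n,
            [a + b for a, b in zip(self.ls, other.ls)],
            self.ss + other.ss,
        )

    def centroid(self) -> List[float]:
        return [v / self.n for v in self.ls]

    def radius(self) -> float:
        # Root-mean-square distance to the centroid, from the summary alone.
        c = self.centroid()
        variance = self.ss / self.n - sum(v * v for v in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero
```

Merging the summaries of the points (0, 0) and (2, 0) gives N = 2, LS = (2, 0), SS = 4, hence a centroid of (1, 0) and a radius of 1, without revisiting the original points.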
Aggarwal et al. [Aggarwal 2003] extend the cluster feature vector to the streaming context by adding temporal components.
Definition 5.2.2. Micro-cluster [Aggarwal 2003]. A micro-cluster for a set of d-dimensional points $X_{i_1}, \ldots, X_{i_N}$ with time stamps $T_{i_1}, \ldots, T_{i_N}$ is the (2d + 3)-tuple $(CF2^x, CF1^x, CF2^t, CF1^t, N)$, wherein $CF2^x$ and $CF1^x$ each correspond to a vector of d entries. The definition of each of these entries is as follows:
• For each dimension, the sum of the squares of the data values is maintained in $CF2^x$. Thus, $CF2^x$ contains d values. The p-th entry of $CF2^x$ is equal to $\sum_{j=1}^{N} (X_{i_j}^{p})^2$.
• For each dimension, the sum of the data values is maintained in $CF1^x$. Thus, $CF1^x$ contains d values. The p-th entry of $CF1^x$ is equal to $\sum_{j=1}^{N} X_{i_j}^{p}$.
• The sum of the squares of the time stamps $T_{i_1}, \ldots, T_{i_N}$ is maintained in $CF2^t$.
• The sum of the time stamps $T_{i_1}, \ldots, T_{i_N}$ is maintained in $CF1^t$.
• The number of data points is maintained in N.
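The entries listed above can be maintained incrementally, as the following sketch suggests. It is a minimal illustration rather than the implementation of [Aggarwal 2003]: the class name MicroCluster and the methods insert, merge, and as_tuple are hypothetical names chosen here.

```python
from typing import Sequence, Tuple


class MicroCluster:
    """(2d + 3)-tuple summary: CF2x, CF1x, CF2t, CF1t, N (hypothetical helper)."""

    def __init__(self, d: int) -> None:
        self.cf2x = [0.0] * d   # per-dimension sum of squared data values
        self.cf1x = [0.0] * d   # per-dimension sum of data values
        self.cf2t = 0.0         # sum of squared time stamps
        self.cf1t = 0.0         # sum of time stamps
        self.n = 0              # number of data points

    def insert(self, x: Sequence[float], t: float) -> None:
        # Absorb one point with its time stamp; only the sums are kept.
        for p, value in enumerate(x):
            self.cf2x[p] += value * value
            self.cf1x[p] += value
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other: "MicroCluster") -> "MicroCluster":
        # Every entry is a sum, so two micro-clusters merge by addition.
        merged = MicroCluster(len(self.cf1x))
        merged.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        merged.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        merged.cf2t = self.cf2t + other.cf2t
        merged.cf1t = self.cf1t + other.cf1t
        merged.n = self.n + other.n
        return merged

    def as_tuple(self) -> Tuple[float, ...]:
        # Flattened view with 2d + 3 entries.
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)
```

For d = 2, inserting the point (1, 2) at time 1 into one micro-cluster and (3, 4) at time 2 into another, then merging, yields the 7-tuple (10, 20, 4, 6, 5, 3, 2).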
Definition 5.2.2 is the first definition of a micro-cluster. In later work, variants of micro-clusters were proposed to meet the specific requirements of particular stream clustering algorithms. For example, in DenStream [Cao 2006], micro-clusters fall into types such as core-micro-clusters, potential-c-micro-clusters, or outlier micro-clusters, which are used for density-based clustering.
Figure 5.1: Change detection by clustering (clusters C1, C2, C3 with radii r1, r2, r3; x is a change point, y is not a change point)
Recently, Agarwal et al. introduced the concept of mergeable summaries, which are useful for many problems [Agarwal 2012]. Because micro-clusters can be merged into a new micro-cluster through their additive property, they are mergeable summaries, useful for clustering streaming data as well as for other problems.
5.3 Automated Change Detection by Clustering
This section presents an automated algorithm for detecting changes in streaming multivariate data by clustering.
5.3.1 Automated Change Detection
As described in Section 5.2, different change detection methods can be derived depending on how the logical expression change is established. This section presents an automated method for detecting changes in multivariate streaming data by using clustering and a geometric approach.
Let x denote a recently arriving data item. A data point can be considered a change point if it is not a member of any cluster. In other words, data point x is a change point if the following logical expression is true:
\[ \mathit{change} = \bigwedge_{i=1}^{K} \left[ d(x, \mathit{center}(C_i)) > \mathit{radius}(C_i) \right] \tag{5.3.1} \]
where $d(x, \mathit{center}(C_i))$ is the distance between a newly incoming data point x and the center of cluster $C_i$, and $\mathit{radius}(C_i)$ is the radius of cluster $C_i$, for $i \in \{1, \ldots, K\}$.
Figure 5.1 illustrates how the clustering-based method for detecting changes works. The reference window holds a clustering of three clusters C1, C2, C3. Because the distances from data point x to the centers of C1, C2, C3 are all greater than the radii of the corresponding clusters, x is a change point, whereas data point y is not a change point because it is a member of cluster C1. In terms of model fitting, x does not fit the existing clustering while y does.
Expression 5.3.1 shows how to check whether a single incoming data point is a change point. We now design a logical expression for checking whether a change occurs when the sliding step is a block of b data points $x_1, x_2, \ldots, x_b$. A block of data points is considered a change if it contains at least one change point. Therefore, the logical expression for checking whether a block of data points changes in comparison with the reference window is as follows:
\[ \mathit{change}(x_1, x_2, \ldots, x_b) = \mathit{change}(x_1) \lor \mathit{change}(x_2) \lor \cdots \lor \mathit{change}(x_b) \tag{5.3.2} \]
where $\mathit{change}(x_i)$ is determined by Expression 5.3.1.
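Expressions 5.3.1 and 5.3.2 translate almost directly into code. The sketch below assumes that each cluster is represented as a (center, radius) pair and that the distance is Euclidean; the function names is_change_point and block_has_change are hypothetical.

```python
import math
from typing import Iterable, Sequence, Tuple

Cluster = Tuple[Sequence[float], float]  # assumed representation: (center, radius)


def is_change_point(x: Sequence[float], clusters: Iterable[Cluster]) -> bool:
    # Expression 5.3.1: x lies outside every cluster.
    return all(math.dist(x, center) > radius for center, radius in clusters)


def block_has_change(block: Iterable[Sequence[float]],
                     clusters: Sequence[Cluster]) -> bool:
    # Expression 5.3.2: at least one point of the block is a change point.
    return any(is_change_point(x, clusters) for x in block)


clusters = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]
x = (3.0, 3.0)   # outside both clusters
y = (0.5, 0.0)   # inside the first cluster
print(is_change_point(x, clusters), is_change_point(y, clusters))
```

Because `any` stops at the first element that evaluates to true, the block test returns as soon as one change point is found. Note that with an empty list of clusters every point counts as a change point, since the conjunction is vacuously true.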
All the data points are encoded as micro-clusters. A micro-cluster is a temporal cluster feature vector that extends the cluster feature vector. The center and the radius of each cluster can be easily computed from its micro-cluster. As described in Section 5.3.1, clusters are assumed to be spherical, so the clustering algorithm used in the reference window is K-means, and the similarity measure is the Euclidean distance. The clustering-based algorithm for detecting changes works as follows.
• Initialization: Read the first N data items from the incoming stream into the reference window w1. The current window w2 is the content of the reference window after sliding one step to capture the new data item, so windows w1 and w2 overlap by N − 1 data points. The next step is to build a clustering of the reference window w1.
• Continuous monitoring: Check whether a change occurs by testing the change criterion in Expression 5.3.1 or 5.3.2. If a change is detected, the detector raises an alarm at the current time tc, the current window becomes the reference window, and a new clustering is constructed for the new reference window. The current window w2 always slides one step forward, whether or not the recently arriving data item is a change.
Algorithm 6 describes a clustering-based change detection algorithm with an arbitrary sliding step step, where 1 < step < N − 1 and N is the window width. The input to the algorithm is a multivariate data stream S and the number of clusters K. If a change occurs, the detector reports the change point. Our clustering-based change detector is based on the overlapping windows model. Because the value of data decreases over time, data is analyzed immediately as it is produced instead of being stored for later analysis. In particular, there are two windows: the reference window w1 and the current window w2. The current window is used to capture new items.
We note that the boolean function change(C1, blk), determined by Expression 5.3.2, checks whether at least one change point exists in the block of data points blk.
In a naive approach, we would compute the distances between a newly arriving data item and all the data points in the window. Based on these distances, we would determine whether the newly arriving data point is a change point. As such, for a window of size N, we need to compute the distance function N times.
By using the change conditions specified in Expressions 5.3.1 and 5.3.2, the number of comparisons is reduced from N to K, where N is the size of the sliding window, K is the number of clusters, and K ≪ N. The clustering-based method for detecting changes is an automated method. Furthermore, it can detect changes in multivariate streaming data. If changes occur frequently, however, the K-means clustering algorithm must be run many times to rebuild the clustering from the reference window. The algorithm is therefore efficient when changes occur rarely, as with anomalies.
If the purpose is only to report whether a change exists in the most recent block of data, the performance of the clustering-based change detection can be improved as follows. As soon as the first change point is found in the newly incoming block, the algorithm raises an alarm without checking the rest of the block.
5.3.2 Maintenance of Clustering using Reactive Approach
The goal of clustering maintenance is to maintain a streaming clustering structure that undergoes insertions and deletions of items as the sliding window moves along the data stream. This section describes how to build and maintain the clusterings emerging from a data stream over a sliding window of fixed size by using the reactive approach.
Algorithm 6: Clustering-based algorithm for detecting changes by using overlapping windows model
input : A data stream S, the window width N, the sliding step step, the number of clusters K
output: The messages reporting the changes that occurred, and the time t at which each change occurred
1. Initialization:
begin
    t ←− 0;
    w1 ←− first N points from time t; /* Each data point is encoded as a micro-cluster */
    C1 ←− Kmeans(w1, K);
    w2 ←− slide(w1, step);
    blk ←− newItemBlock(w2);
2. Continuous monitoring:
while not at the end of the stream do
    if change(C1, blk) then
        t ←− current time;
        Report change occurred at time t;
        w1 ←− w2;
        C1 ←− Kmeans(w1, K);
    w2 ←− slide(w2, step);
    blk ←− newItemBlock(w2);
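The control flow of Algorithm 6 can be sketched in Python as follows. This is an illustrative sketch under several assumptions, not the Java detector evaluated in this chapter: kmeans is a plain Lloyd's iteration with a fixed random seed, the radius of a cluster is taken to be the largest member-to-center distance, raw points are used instead of micro-clusters, and the position in the stream stands in for the time stamp. The names kmeans and detect_changes are hypothetical.

```python
import math
import random
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Sequence[float]
Cluster = Tuple[List[float], float]  # assumed representation: (center, radius)


def kmeans(points: List[Point], k: int, iters: int = 20, seed: int = 0) -> List[Cluster]:
    """Plain Lloyd's K-means; returns (center, radius) for each non-empty cluster."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    groups: List[List[Point]] = []
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Recompute each center as the mean of its group; keep empty ones unchanged.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    # Assumption: radius = largest distance from a member to its center.
    return [
        (centers[i], max(math.dist(p, centers[i]) for p in g))
        for i, g in enumerate(groups) if g
    ]


def detect_changes(stream: Iterable[Point], n: int, step: int, k: int) -> Iterator[int]:
    """Yield the stream position at which each change is reported."""
    it = iter(stream)
    w1 = list(islice(it, n))          # reference window
    clusters = kmeans(w1, k)
    w2 = list(w1)                     # current window
    t = len(w1)
    while True:
        blk = list(islice(it, step))  # newly arriving block
        if len(blk) < step:
            return                    # end of the stream
        w2 = w2[step:] + blk          # slide the current window by one step
        t += step
        # Expression 5.3.2: the block changes if any point is outside all clusters.
        if any(all(math.dist(x, c) > r for c, r in clusters) for x in blk):
            yield t
            w1 = w2                   # current window becomes the reference window
            clusters = kmeans(w1, k)  # rebuild the clustering
```

For instance, on a stream that repeats four fixed points and contains a single distant outlier, `list(detect_changes(stream, n=8, step=2, k=2))` reports one change, at the block that contains the outlier.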
A difficult problem in clustering streaming data is that the underlying distribution of the data stream evolves over time. These changes can induce larger or smaller changes in the clustering structure emerging from the data stream. Algorithm 7 describes how to build and maintain a clustering over a sliding window by using the reactive approach. The input to the algorithm is a multivariate data stream S. If a change occurs, the algorithm rebuilds the clustering. This clustering is returned as the result of a clustering query or is stored in the history of clusterings in a streaming data warehouse.
Algorithm 7: Algorithm for building and maintenance of clustering over sliding window using the reactive approach
input : A data stream S, the number of clusters K, the sliding step step, the window width N
output: The clustering over the sliding window at any time
1. Initialization:
begin
    t ←− 0;
    w1 ←− first N points from time t; /* Each data point is encoded as a micro-cluster */
    C1 ←− Kmeans(w1, K);
    w2 ←− w1;
2. Continuous maintenance of clustering:
while not at the end of the stream do
    w2 ←− slide(w2, step);
    x ←− newItemBlock(w2);
    if change(C1, x) then
        t ←− current time;
        w1 ←− w2;
        C1 ←− Kmeans(w1, K);
5.4 Related Work
The clustering-based change detection method proposed here is related to work on automatic change detection, change detection in multivariate streaming data, clustering-based change detection, and the reactive approach to building and maintaining models (as introduced in Section 1.3.2). Related work on clustering-based methods for detecting changes was presented in Section 1.1.2. The following sections present related work on automatic change detection and on change detection in multivariate streaming data.
5.4.1 Automated Change Detection
As automated systems require the capability of real-time processing and of adapting to changing environments, automated change detection plays an important role in many automated systems. One of the models of real-time processing is data stream processing. For example, sensor networks need automated change detection methods in which the detection threshold adapts to changes in the environment. Automated change detection also plays an important role in many mobile robotic applications [Neuman 2011]. For example, Neuman et al. [Neuman 2011] proposed an online change detection method for mobile robots based on the segmentation approach. Recently, Hirte et al. [Hirte 2012] developed Data3, a Kinect interface for human motion detection. In fact, Data3 is a kind of system capable of detecting changes in spatial-temporal streaming data.
Automated systems should be capable of detecting changes automatically, without a given detection threshold. Some change detection methods can automatically tune the detection threshold so that the rate of false alarms does not exceed a given rate. Gustafsson and Palmqvist deal with the problem of automated tuning of change detectors with a given false alarm rate [Gustafsson 1997]. Their approach computes the detection threshold by estimating a parametric distribution. The advantage of this method is that it can predict a detection threshold that yields no or few false alarms on the data used. However, it is a parametric method.
The automatic selection of the threshold is of special importance. Alippi et al. [Alippi 2012] presented an automated change detection method for streaming data based on Hidden Markov Models (HMMs). This HMM-based method detects changes by thresholding. Their algorithm consists of the following steps: model the relationships among data streams as a sequence of time-invariant linear dynamic systems; model the evolution of the estimated parameters of the model by a Hidden Markov Model; evaluate the likelihood of new parameters; and detect a change based on a given threshold. If the likelihood is less than the given threshold, a change is detected.
5.4.2 Change Detection in Multivariate Streaming Data
Most methods for detecting changes in streaming multivariate data are based on multivariate tests. Because testing statistical hypotheses in high-dimensional data is particularly challenging, and change detection in streaming data must respond to changes in near real time, change detection in streaming multivariate data is challenging.
There are two approaches to the problem of change detection in streaming multivariate data. The first approach is based on a transformation that converts a multivariate data stream into a univariate data stream. Change detection is then performed on the univariate data stream. Dasu et al. [Dasu 2009] used this approach to design a change detection method for streaming multivariate data: the multivariate data stream is converted into a univariate data stream, and the change detection task is performed by using the Kolmogorov-Smirnov test. Similarly, Kim et al. [Kim 2009] proposed the concept of a detection stream. A detection stream is a univariate stream generated by mapping a multivariate stream into a stream of dissimilarity measures that quantify the difference between two windows.
The second approach is to develop new methods for detecting changes directly in multivariate streaming data. Kuncheva presented a general method for detecting changes in multivariate streaming data by using a likelihood ratio-based test [Kuncheva 2011]. Closely related to our work is the segment-based method for detecting changes in multivariate data streams proposed by Chen et al. [CHEN 2009]. Gretton et al. [Gretton 2012] present a kernel two-sample test that can check whether two multivariate samples come from the same distribution. Furthermore, they propose a kernel two-sample test with linear computational complexity that is suitable for streaming environments.
In contrast to previous work, Section 5.3 introduces a new method for detecting changes in multivariate streaming data by using the geometric and clustering approach.
5.5 Evaluation
The clustering-based change detector was written in Java. An algorithm for detecting changes in streaming data should be evaluated in three aspects: scalability, accuracy, and monitoring capability. The detection accuracy of a change detection method depends on the window width and the number of clusters. The experiments on the clustering-based change detector were divided into two groups, each measured in terms of detection accuracy, running time, and memory consumption. The first group of experiments evaluated the effect of the window width on the clustering-based change detector. The second group studied the effect of the number of clusters on its performance. We analyzed the effect of the window width and the number of clusters on the number of change points detected by the clustering-based change detectors.
One of the challenges in assessing a change detection algorithm is the lack of ground truth data. The evaluation of a method for detecting changes in multivariate streaming data is particularly challenging. Therefore, synthetic data was used to evaluate the performance of the change detection algorithms. The synthetic data set used to evaluate the accuracy of our clustering-based change detection algorithm is the streaming data set HyperP [Zhu 2010]. The HyperP (Hyper Plane) stream is a synthetic data stream of gradually evolving (drifting) concepts. It consists of 100000 instances, and each instance consists of 10 attributes. These instances may fall into one of 5 classes (clusters). To see the advantages of the continuous change detector over the restarting change detector, we ran both. Restarting detection of changes means that the detector is reset whenever a change is detected, while a non-restarting detector keeps detecting changes continuously. Non-restarting detection of change is also called continuous change detection.
Figure 5.2: Effectiveness of window width on the performance of clustering-based method for detecting changes in terms of running time (x-axis: window width in number of tuples; y-axis: running time in seconds; series: continuous change detector, restarting detector)
We ran the clustering-based detectors for two cases. First, to see the effect of the window width on the detection performance, we fixed the number of clusters while varying the window width. Second, to see the effect of the number of clusters on the detection performance, we fixed the window width while varying the number of clusters.
5.5.1 Effectiveness of Window Width
To study the effectiveness of window width on the performance of clustering-based change detector, the number of clusters was fixed to 5, and the window width was