
Related Work

2.2 Continuous Spatio-Temporal Queries

2.2.5 Clustering Analysis

Clustering is a well-studied area in mathematics and computer science. It is related to many different areas, including classification, databases, data mining, and spatial range searching, and as such it has received a lot of attention. Some of the works include [21, 31, 47, 63, 20, 86]. For an elaborate survey on clustering, readers are referred to [46]. Most of the previous work focuses on using clustering to analyze static or dynamic data and to discover patterns of interest in it. Instead, I propose to use clustering as a means to abstract data (to enable cluster-based load shedding) and to achieve scalable processing (minimizing processing time) of continuous queries on moving objects.

To the best of my knowledge, this is the first work to use clustering for internal optimization of continuous query processing on streaming spatio-temporal data.

Clustering Data Streams

The commonly used clustering algorithm for offline (or non-incremental) clustering, K-means, is described in [21, 63] and introduced in more detail in Chapter 7. The objective of the algorithm is to minimize the average squared distance from data points to their closest cluster centers; the closely related K-median problem minimizes the average (unsquared) distance instead. A further variant, k-center clustering, covers the points by k balls such that the radius of the largest ball is minimized [38].
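As a minimal sketch of the batch K-means iteration mentioned above, the following assumes the standard alternation of an assignment step and an update step with squared Euclidean distance. The names `kmeans` and `mean_cost` and the parameters `iterations` and `seed` are illustrative choices, not taken from [21, 63].

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    """Batch K-means sketch: k must be chosen before the data is seen."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    for _ in range(iterations):
        # Assignment step: attach every point to its closest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                dim = len(group[0])
                new_centers.append(
                    tuple(sum(p[d] for p in group) / len(group) for d in range(dim))
                )
            else:
                new_centers.append(centers[i])  # keep a center that lost all points
        if new_centers == centers:
            break  # assignments no longer change
        centers = new_centers
    return centers


def mean_cost(points, centers):
    """Average squared distance from each point to its closest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points) / len(points)
```

Every iteration rescans all points, and k is an input, which are the two properties that make this batch form awkward for streams of location updates.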

Given a sequence of points, the objective of Guha et al. [28] is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. They give constant-factor approximation algorithms for the K-median problem in the data stream model in a single pass. They study the performance of a divide-and-conquer algorithm, called Small-Space, that divides the data into pieces, clusters each piece, and then clusters the resulting centers (where each center is weighted by the number of points closer to it than to any other center). The authors also propose another algorithm, Smaller-Space, that is similar to this piecemeal approach except that instead of re-clustering only once, it repeatedly re-clusters the weighted centers [28]. The trade-off in Small(er)-Space is that some quality of the clustering approximation is sacrificed to obtain an algorithm that uses less memory. Their model and analysis have similarities to incremental clustering and online models, but the approach differs slightly: they maintain a “forest” of assignments, which they complete to k trees, and all the nodes in a tree are assigned to the median denoted by the root of the tree. The disadvantage of this algorithm is similar to that of K-means, namely that the number of clusters must be known in advance (the number of clusters is an input parameter to the K-means algorithm).
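The divide-and-conquer idea behind Small-Space can be sketched roughly as follows. This is an illustration under stated assumptions, not the algorithm of [28]: the inner subroutine `weighted_cluster` is a simple weighted Lloyd-style stand-in rather than a constant-factor K-median approximation, and `small_space`, `chunk_size`, and the other names are hypothetical.

```python
import random


def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(p, centers):
    """Index of the center closest to point p."""
    return min(range(len(centers)), key=lambda j: sqdist(p, centers[j]))


def weighted_cluster(points, weights, k, iterations=20, seed=0):
    """Stand-in subroutine: weighted Lloyd iterations (an assumption, not K-median).

    Returns the centers and the total weight attracted to each center.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, min(k, len(points)))
    dim = len(points[0])
    for _ in range(iterations):
        sums = [[0.0] * dim for _ in centers]
        mass = [0.0] * len(centers)
        for p, w in zip(points, weights):
            i = nearest(p, centers)
            mass[i] += w
            for d in range(dim):
                sums[i][d] += w * p[d]
        centers = [
            tuple(s / m for s in sums[i]) if m > 0 else centers[i]
            for i, m in enumerate(mass)
        ]
    # Final pass: weight of each center = total weight of the points closest to it.
    mass = [0.0] * len(centers)
    for p, w in zip(points, weights):
        mass[nearest(p, centers)] += w
    return centers, mass


def small_space(points, k, chunk_size):
    """Cluster each piece, keep only weighted centers, then cluster those centers."""
    kept_centers, kept_weights = [], []
    for start in range(0, len(points), chunk_size):
        chunk = points[start:start + chunk_size]
        centers, mass = weighted_cluster(chunk, [1.0] * len(chunk), k)
        for c, m in zip(centers, mass):
            if m > 0:
                kept_centers.append(c)
                kept_weights.append(m)
    # Second level: only the weighted centers are clustered, never the raw points.
    return weighted_cluster(kept_centers, kept_weights, k)
```

Only one piece plus the retained weighted centers has to be held in memory at a time, which is where the memory saving comes from; the price is that second-level centers are computed from summaries rather than from the original points.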

Competitive Learning Clustering and the basic Leader-Follower Clustering are two algorithms for online (i.e., incremental) clustering presented in [21]. John A. Hartigan had already proposed the latter in an early publication on clustering algorithms [39]. One disadvantage of the Leader-Follower Clustering algorithm is that it cannot keep the number of clusters constant, so a large number of clusters might be created (potentially as many as there are data points). This disadvantage could, however, be addressed by merging clusters where appropriate. Competitive Learning Clustering can be transformed into a single-scan algorithm to reduce clustering time. In its basic form, however, it depends on a convergence criterion that makes several iterations over the data necessary. Other sources also describe these two algorithms but name them differently: in [51] they are called Growing K-means Clustering and Sequential Leader Clustering¹.
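A minimal sketch of a Leader-Follower style single-pass clustering is shown below. It assumes the common textbook formulation in which a point either pulls its nearest center toward itself or, if it is farther away than a distance threshold, starts a new cluster. The parameter names `threshold` and `learning_rate` are illustrative and are not taken from [21] or [39].

```python
import math


def leader_follower(stream, threshold, learning_rate=0.1):
    """Single-pass clustering: the number of clusters is not fixed in advance."""
    centers = []
    for x in stream:
        if centers:
            # Find the closest existing center (the "leader").
            i = min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))
            if math.dist(x, centers[i]) <= threshold:
                # Close enough: nudge the leader toward the new point.
                centers[i] = tuple(
                    c + learning_rate * (a - c) for c, a in zip(centers[i], x)
                )
                continue
        # No center within the threshold: the point founds a new cluster.
        centers.append(tuple(x))
    return centers
```

With a small threshold every point may found its own cluster, which is the growth in cluster count noted above; a periodic merge of nearby centers would be one way to bound it.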

In [43], the author uses a clustering algorithm that pre-processes the data points that arrive each second. Based on the clustering model, a Markov model is learned and finally used for a prediction task. The K-means algorithm is used to pre-process the data. The final result, called the Extended Leader-Follower (ELF) algorithm, is capable of clustering streaming data and learning a Markov model with one scan over the data set. Again, because the K-means algorithm requires the number of clusters to be known in advance, and because the clustering might change with each data point update, this approach does not work well for highly dynamic data points that represent the location updates of moving objects and queries.

A list of considerations and criteria for incremental data stream algorithms is given in [20, 86, 6]. Incremental algorithms should use only a small and constant amount of memory, so that a compact representation of the current model is accessible at all times. The running time, and hence the computational complexity, should be such that new incoming data points can be processed as they arrive. The algorithm should be capable of distinguishing between outliers, emerging patterns, and noise.

¹ In this thesis, I refer to these algorithms as Competitive Learning clustering and Leader-Follower clustering.

Barbara in [6] presents an overview of the clustering algorithms BIRCH [89], COBWEB [23], STREAM [28, 59], and Fractal Clustering [5], all of which could be used for incremental clustering of data streams. He describes the advantages and shortcomings of these algorithms with respect to compactness, functionality, and outliers.

BIRCH clusters data points using a CF-tree, a height-balanced tree (analogous to a B-tree). One of the drawbacks of the BIRCH method is that after some time it spills over into secondary memory. Even though it tries to minimize the number of I/Os for clustering a new point, it still takes a considerable amount of time to do so [6]. The processing is better done in batches to amortize the overall cost.
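To make the CF-tree entries more concrete, the sketch below shows a clustering feature as it is commonly defined in the BIRCH literature: a triple of point count, per-dimension linear sum, and sum of squared norms. The text above does not spell out this definition, so treat it as background, and note that the class and method names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusteringFeature:
    """Compact summary of a group of points: (N, linear sum, square sum)."""
    n: int
    linear_sum: tuple
    square_sum: float

    @classmethod
    def from_point(cls, p):
        return cls(1, tuple(p), sum(x * x for x in p))

    def merge(self, other):
        """Summaries are additive, so two groups merge without revisiting points."""
        return ClusteringFeature(
            self.n + other.n,
            tuple(a + b for a, b in zip(self.linear_sum, other.linear_sum)),
            self.square_sum + other.square_sum,
        )

    def centroid(self):
        return tuple(s / self.n for s in self.linear_sum)

    def radius(self):
        """Root-mean-square distance of the summarized points to the centroid."""
        c = self.centroid()
        variance = self.square_sum / self.n - sum(x * x for x in c)
        return math.sqrt(max(variance, 0.0))  # guard against rounding below zero
```

Because the summary is additive, absorbing a new point or merging two entries costs time proportional to the dimension only; the I/O cost discussed above comes from traversing and rebalancing the tree, not from updating a single entry.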

COBWEB [23] implements hierarchical clustering via a classification tree. The classification tree is not height-balanced, which often causes the space (and time) complexity to degrade dramatically. This makes COBWEB an unattractive choice for data stream clustering.

STREAM [28, 59] aims to provide guaranteed performance by minimizing the sum of the squared distances of points to the centroids (similar to K-means). STREAM processes data streams in batches of points by first clustering the points in each batch and then keeping the weighted cluster centers (i.e., the centroids weighted by the number of points attracted to them). STREAM then clusters the weighted centers to obtain the overall clustering model. If the batches are of equal size, the first clustering iteration has a constant processing time. For the second iteration of clustering, however, the time can increase without bound as more batches of data arrive. Moreover, its authors acknowledge [59] that it takes longer than K-means to find a bounded solution.

Fractal Clustering (FC) [5] groups points that show self-similarity by placing each point in the cluster on which it has the minimal fractal impact. FC works with several layers of grids (the cardinality of each dimension increases fourfold with each successive layer), and even though only occupied cells are kept in memory, the method suffers from high memory usage [8].

Gaber et al. [25] have proposed algorithms for incremental clustering, among them a simple, so-called one-look clustering algorithm that takes into account the available resources of a machine. In [15], Chaudhuri presents various considerations for clustering algorithms, such as which actions are possible or necessary when new data points are added to an existing model.

Clustering Motion

The difficulty in computing and maintaining clusters on moving objects lies in the underlying kinetic nature of the environment [29]. Once the clusters have been computed at a certain time and time progresses, the clustering may change and deteriorate. To retain a “high-quality” clustering (i.e., one in which the cluster sizes are small compared to the size of the optimal clustering), one needs to maintain the clustering either by re-clustering the points every once in a while or by moving points from one cluster to another. The number of such “maintenance” events can be extremely large and may dominate the overall running time of the algorithm.

In [89], Zhang et al. describe micro-clustering, i.e., grouping data that are so close to each other that they can be treated as one unit. In [54], Li et al. extend the concept to moving micro-clusters: groups of objects that are not only close to each other at the current time but also likely to move together for a while.

In [38], the authors study motion clustering analytically. They define the clustering motion problem as follows: let $P[t]$ be a set of moving points in $\mathbb{R}^d$ with degree of motion $\mu$; namely, for a point $p[t] \in P[t]$, we have $p(t) = (p^1(t), \ldots, p^d(t))$, where $p^j(t)$ is a polynomial of degree $\mu$ in the time parameter $t$, for $j = 1, \ldots, d$. The authors in [38] demonstrate that if one is willing to compromise on the number of clusters used, then clustering becomes considerably easier computationally and can be done quickly. Furthermore, one can trade off the quality of the clustering against the number of clusters used. Hence, one can quickly compute a clustering with a large number of clusters and then cluster those clusters in a second stage, so as to obtain a more reasonable k-clustering. The authors also propose an algorithm for picking a “small” subset of the moving points by computing a fine clustering and picking a representative from each cluster. The size of this subset, known as a coreset, is independent of n (the number of data points), and it represents the k-clustering of the moving points at any time. Namely, instead of clustering all the points, only the representative points are clustered. This implies that one can construct a data structure that can report the approximate clustering at any time.
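The motion model above can be illustrated with a short sketch: each moving point is stored as one polynomial per coordinate, its position is evaluated at a query time, and the positions are clustered. The clustering step here is the classic greedy farthest-point heuristic for k-center, used as a stand-in; it is not claimed to be the construction of [38], and the names `position` and `greedy_k_center` are hypothetical.

```python
import math


def position(coeffs, t):
    """Evaluate a moving point at time t.

    coeffs[j] holds the polynomial coefficients of coordinate j,
    constant term first, so [10.0, 1.0] means 10 + t.
    """
    return tuple(sum(c * t ** i for i, c in enumerate(poly)) for poly in coeffs)


def greedy_k_center(points, k):
    """Farthest-point heuristic: repeatedly add the point farthest from all centers.

    Returns the chosen centers and the radius of the largest resulting ball.
    """
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < min(k, len(points)):
        far = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[far])
        # Each point keeps the distance to its closest center so far.
        dist = [min(d, math.dist(p, points[far])) for d, p in zip(dist, points)]
    return centers, max(dist)


# Example with linear motion: two points at rest, two drifting along x.
movers = [
    [[0.0], [0.0]],
    [[1.0], [0.0]],
    [[10.0, 1.0], [0.0]],
    [[11.0, 1.0], [0.0]],
]
snapshot = [position(m, 5) for m in movers]
centers, radius = greedy_k_center(snapshot, k=2)
```

Re-running the clustering on a snapshot at every query time is the naive baseline; the coreset idea described above instead restricts this step to a small set of representatives whose size does not depend on n.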


In document Continuous Query Processing on Spatio-Temporal Data Streams (Page 29-35)