A Fast Incremental Spectral Clustering for Large Data Sets

Tengteng Kong¹, Ye Tian¹, Hong Shen¹,²

¹ School of Computer Science, University of Science and Technology of China

² School of Computer Science, University of Adelaide, Australia

Abstract—Spectral clustering is an emerging research topic that has numerous applications, such as data dimension reduction and image segmentation. In spectral clustering, as new data points are added continuously, dynamic data sets are processed in an on-line way to avoid costly re-computation. In this paper, we propose a new representative measure to compress the original data sets and maintain a set of representative points by continuously updating the Eigen-system with the incidence vector. According to these extracted points, we generate instant cluster labels as new data points arrive. Our method is efficient and able to process large data sets due to its low time complexity. Experimental results over various real evolving data sets show that our method provides fast and relatively accurate results.

Index Terms—Spectral Clustering, Incremental, Eigen-Gap, Representative Point

I. INTRODUCTION

Spectral clustering uses information contained in the spectrum of the data affinity matrix to detect the structure of data distributions. Recently, it has become increasingly popular both for its fundamental advantages over “traditional algorithms” [6] and for its simplicity of implementation with standard linear algebra methods [2], [5]. It has been used in various applications ranging from data dimension reduction to computer vision, image segmentation and speech recognition. Classical algorithms usually have to make explicit assumptions about the data sets before implementation (e.g., the EM algorithm assumes that the data follow a Gaussian mixture model [1]). Therefore, these methods usually fail when the data are arranged in a more complex way [3], [4]. Compared with these algorithms, spectral clustering can achieve surprisingly good results by analyzing the spectrum of the data set.

Before applying spectral clustering, we need to construct a similarity matrix and compute its spectrum. This is computationally expensive, and the situation is more severe when facing massive data. Therefore, it is necessary to compress the data sets and apply spectral clustering in an on-line way to avoid costly re-computation as data evolve. However, almost all existing spectral clustering methods are off-line and do not use data compression. Hence, it becomes difficult to apply spectral clustering when data sets are large and evolving.

In response to the above problems, there are mainly two kinds of solutions. One relies on simulating the change of the Eigen-system to avoid re-computation as new data points arrive: in [8], an incremental spectral clustering algorithm is proposed to handle the changes among the objects. The method introduces an incidence vector to represent the insertion or deletion of data points and continuously updates the Eigen-system by analyzing the approximate relations between the changes of eigenvalues and eigenvectors. It achieves good accuracy but suffers from uncertain convergence and works only with a constant number of clusters. The other relies on extracting representative points to compress the data set: in [9], a self-adaptive algorithm is proposed to inspect the clusters as new data points are added. Instead of computing the affinity matrix of all entries, it maintains only a few representative data points and hence works more efficiently. However, the use of only one representative data point in each cluster may introduce significant errors. In general, these methods cluster the data sets incrementally in different ways, but have not achieved the desired efficiency.

This paper was partially supported by the "100 Talents" Project of Chinese Academy of Sciences, NSFC grant #622307, and Provincial Natural Science Fund of Anhui #11040606Q52. The corresponding author is Hong Shen.

In this paper, we propose an incremental spectral clustering algorithm to deal with evolving large data sets by extending the NJW spectral clustering algorithm [1]. Our algorithm efficiently assigns instant cluster labels to newly arriving data according to the representative sets estimated by our proposed measure, and updates the Eigen-system [6] with the incidence vector [7] to detect changes in the number of clusters. Compared with re-computation of the solution in NJW, our algorithm achieves a similar accuracy at a much lower computational cost.

The rest of the paper is organized as follows. In Section II, we give some background knowledge used in the NJW algorithm. In Section III, we introduce our incremental spectral clustering algorithm. The experimental results are reported in Section IV, followed by concluding remarks.

II. PRELIMINARIES

First, we state some notation used in this paper. Script letters, such as ξ and φ, represent sets. Capital letters, such as $L$ and $W$, represent matrices. Lower-case letters in vector form, such as $\vec{v}_i$ and $\vec{u}_j$, represent column vectors. We use subscripts to index the elements of matrices and vectors. In addition, eigenvalues are listed in ascending order, and the first $k$ eigenvectors are the eigenvectors corresponding to the $k$ smallest eigenvalues.

2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies

A. NJW Spectral Clustering Algorithm

The NJW algorithm, which is one of the most common spectral clustering algorithms, introduces a particular way to use the first $k$ eigenvectors and gives conditions under which the algorithm can be expected to do well. It can be outlined as follows using the notation in [2].

Algorithm 1 NJW algorithm

Input: Affinity matrix $W \in \mathbb{R}^{n \times n}$, number $k$ of clusters to construct.

1) Compute the Laplacian matrix $L = D - W$, where $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{n} W_{ij}$.

2) Compute the first $k$ eigenvectors $u_1, \ldots, u_k$ of the eigenproblem $Lu = \lambda Du$; let $Z \in \mathbb{R}^{n \times k}$ be the matrix containing the vectors $u_1, \ldots, u_k$ as columns.

3) Cluster $y_1, \ldots, y_n$ by the K-means algorithm into clusters $c_1, \ldots, c_k$, where $y_i$ corresponds to the $i$-th row of $Z$.

Output: Clusters $A_1, \ldots, A_k$ with $A_i = \{ j \mid y_j \in c_i \}$.
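The three steps of Algorithm 1 can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `njw_cluster`, the dense eigen-solver, the reduction of $Lu = \lambda Du$ to a symmetric problem, and the farthest-point initialisation of K-means are all assumptions made here for brevity, and every node is assumed to have a positive degree.

```python
import numpy as np

def njw_cluster(W, k, n_iter=50):
    """Sketch of Algorithm 1: spectral embedding followed by K-means."""
    d = W.sum(axis=1)                        # degrees D_ii = sum_j W_ij (assumed > 0)
    L = np.diag(d) - W                       # Laplacian L = D - W
    # Solve L u = lambda D u through the symmetric problem
    # D^{-1/2} L D^{-1/2} v = lambda v, with u = D^{-1/2} v.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    eigvals, V = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    Z = d_inv_sqrt[:, None] * V[:, :k]       # row i of Z is the embedded point y_i
    # Plain Lloyd K-means; farthest-point initialisation is an illustrative choice.
    centers = [Z[0]]
    for _ in range(1, k):
        dist = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        sq = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = np.argmin(sq, axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = Z[labels == c].mean(axis=0)
    return labels, eigvals[:k]
```

For large sparse matrices a Lanczos-type solver would normally replace the dense `np.linalg.eigh` call; the dense version is used here only to keep the sketch self-contained.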

As the input to the algorithm, the construction of the affinity matrix is very important. We use the k-nearest neighbor graph to construct the similarity matrix and use the Gaussian similarity function to measure the similarity between points [2]:

$$A_{ij} = \exp\left(-\frac{d(s_i, s_j)^2}{2\sigma^2}\right) \tag{1}$$

It is simple to work with and results in a sparse affinity matrix whose first $k$ eigenvectors can be computed efficiently. However, it is computationally expensive to re-solve the generalized eigenvalue system as new data points arrive. By analyzing the spectrum of the Laplacian matrix constructed from all data entries, the original data can be compressed into a certain number of representative points.
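A minimal sketch of building the k-nearest-neighbor affinity of Eq. (1) is given below. The function name `knn_gaussian_affinity` is hypothetical, and the Euclidean distance is assumed for $d(\cdot,\cdot)$, which the text leaves generic. The result is in general not symmetric, because neighborhood is not a symmetric relation.

```python
import numpy as np

def knn_gaussian_affinity(points, k, sigma):
    """A_ij = exp(-d(s_i, s_j)^2 / (2 sigma^2)), kept only for the k nearest neighbours of s_i."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diff = points[:, None, :] - points[None, :, :]
    dist2 = (diff ** 2).sum(axis=2)          # squared Euclidean distances (assumed metric)
    A = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(dist2[i])
        neighbours = [j for j in order if j != i][:k]   # skip the point itself
        A[i, neighbours] = np.exp(-dist2[i, neighbours] / (2.0 * sigma ** 2))
    return A
```

Each row holds at most $k$ non-zero entries, which is what makes the matrix sparse when $k \ll n$.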

B. Incidence Vector

As new data arrive, it is necessary to represent the dynamic changes in the Laplacian matrix. A solution was proposed in [8] that introduced the incidence vector to update the Eigen-system.

Definition 1. An incidence vector $\sqrt{c_{ij}}\,\vec{r}_{ij}$ is a column vector with two nonzero elements: the $i$-th element equals $\sqrt{c_{ij}}$ and the $j$-th element equals $-\sqrt{c_{ij}}$, indicating that data points $i$ and $j$ have a similarity $c_{ij}$. In addition, we let $R$ be the matrix that contains all the incidence vectors as columns, in any order.

Obviously, there are at most $(n^2 - n)/2$ columns in $R$ if the affinity matrix $W$ is generated by a fully connected graph. Fortunately, the actual number of columns in $R$ is far less than $(n^2 - n)/2$, since $W$ is sparse.

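Proposition 2. The Laplacian matrix $L = D - W$ can be decomposed as $L = RR^T$ [10]. If data points $v_i$ and $v_j$ have a similarity change $\Delta c_{ij}$ corresponding to the incidence vector $\sqrt{\Delta c_{ij}}\,\vec{r}_{ij}$, the new graph Laplacian $\tilde{L}$ can be decomposed as $\tilde{L} = \tilde{R}\tilde{R}^T$, where $\tilde{R} = [R, \sqrt{\Delta c_{ij}}\,\vec{r}_{ij}]$.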
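The decomposition in Proposition 2 can be checked numerically on a small graph. The sketch below uses hypothetical edge weights and a hypothetical helper `incidence_matrix`; it covers only a positive similarity change, so that the square root stays real.

```python
import numpy as np

def incidence_matrix(W):
    """Stack one column sqrt(c_ij) * r_ij per edge (i < j) with c_ij = W_ij > 0."""
    n = W.shape[0]
    cols = []
    for i in range(n):
        for j in range(i + 1, n):
            if W[i, j] > 0:
                r = np.zeros(n)
                r[i], r[j] = 1.0, -1.0       # +1 at position i, -1 at position j
                cols.append(np.sqrt(W[i, j]) * r)
    return np.column_stack(cols)

# Hypothetical 3-node graph.
W = np.array([[0.0, 0.2, 0.4],
              [0.2, 0.0, 0.3],
              [0.4, 0.3, 0.0]])
L = np.diag(W.sum(axis=1)) - W
R = incidence_matrix(W)
assert np.allclose(R @ R.T, L)               # L = R R^T

# A similarity increase delta_c on edge (0, 1) appends one more column to R.
delta_c = 0.1
r01 = np.array([1.0, -1.0, 0.0])
R_new = np.column_stack([R, np.sqrt(delta_c) * r01])
W_new = W.copy()
W_new[0, 1] += delta_c
W_new[1, 0] += delta_c
L_new = np.diag(W_new.sum(axis=1)) - W_new
assert np.allclose(R_new @ R_new.T, L_new)   # new Laplacian = R_new R_new^T
```

The second assertion also shows that the Laplacian changes by exactly $\Delta c_{ij}\,\vec{r}_{ij}\vec{r}_{ij}^{\,T}$ when one column is appended.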

Consider a newly arriving data point $v_l$: it can simply be decomposed into a series of incidence vectors appended to $R$. However, it is worth noting that after updating $R$, the matrices $W$, $D$, and $L$ change as well. According to Proposition 2, the increments of $L$ and $D$ with respect to $\sqrt{\Delta c_{ij}}\,\vec{r}_{ij}$ can be expressed as:

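$$\Delta L = \tilde{L} - L = \tilde{R}\tilde{R}^T - RR^T = \Delta c_{ij}\,\vec{r}_{ij}\vec{r}_{ij}^{\,T} \tag{2}$$

$$\Delta D = \Delta c_{ij}\,\mathrm{diag}\{m_{ij}\} \tag{3}$$

where $m_{ij}$ is a column vector whose $i$-th and $j$-th elements equal 1 while all others equal 0. Since the first-order approximation of the change in $\lambda$ can be computed as:

$$\Delta\lambda = \frac{x^T(\Delta L - \lambda\,\Delta D)\,x}{x^T D x} \tag{4}$$

we can further specify Eq. (4) according to Eq. (2) and Eq. (3) with the incidence vector $\sqrt{\Delta c_{ij}}\,\vec{r}_{ij}$ as:

$$\Delta\lambda = \Delta c_{ij}\,\frac{x^T\left(\vec{r}_{ij}\vec{r}_{ij}^{\,T} - \lambda\,\mathrm{diag}\{m_{ij}\}\right)x}{x^T D x} \tag{5}$$

C. Eigen-gap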

Choosing the number of clusters is a general problem for all clustering algorithms, and various methods have been devised for it. Here, we adopt the Eigen-gap heuristic [11], which is particularly designed for spectral clustering. It is known that for $k$ completely disconnected clusters the first $k$ eigenvalues are exactly 0, while there is a gap between $\lambda_k$ and $\lambda_{k+1}$, which is called the Eigen-gap. A similar situation holds in the general case according to matrix perturbation theory. Therefore, the number of clusters $k$ can be detected by the Eigen-gap and expressed as follows:

$$k = \arg\min_{i}\,(\max(g_i)) \tag{6}$$

where $g_i = \lambda_{i+1} - \lambda_i$ for $i = 1, \ldots, n-1$, and $n$ is the number of data points.
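As a small illustration, the Eigen-gap rule can be written as follows. The sketch reads Eq. (6) as "the smallest index $i$ at which the gap $g_i$ attains its maximum", which is an interpretation assumed here; the function name `eigengap_k` is hypothetical.

```python
def eigengap_k(eigenvalues):
    """Return the smallest 1-based index i that maximizes g_i = lambda_{i+1} - lambda_i."""
    lam = sorted(eigenvalues)                       # eigenvalues in ascending order
    gaps = [lam[i + 1] - lam[i] for i in range(len(lam) - 1)]
    return gaps.index(max(gaps)) + 1                # list.index returns the first maximum

# Three near-zero eigenvalues followed by a clear jump suggest three clusters.
k = eigengap_k([0.0, 0.01, 0.02, 0.9, 1.1])
```

With these hypothetical eigenvalues the largest gap lies between the third and fourth eigenvalue, so the rule selects three clusters.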

D. Representative Measurement Analysis

There are several methods to compute the central or representative points of a cluster. However, these methods are mostly based on density, distance or propinquity and cannot reflect the complex relationships among points in clusters generated by spectral clustering. Here, we heuristically illustrate the relevance of points.

Consider the case of $k$ connected components whose vertices are ordered according to the cluster they belong to. Then the affinity matrix is block diagonal, and the same is true for $L$:

$$L = \begin{bmatrix} L_1 & & 0 \\ & \ddots & \\ 0 & & L_k \end{bmatrix}$$

where each $L_i$ is the Laplacian of a connected graph, which has eigenvalue 0 with the constant one vector as its eigenvector.

We know that the first $k$ eigenvectors of $L$ are piecewise constant with corresponding eigenvalues 0. Hence, 0 is a repeated eigenvalue with multiplicity $k$. Thus the eigensolver could return any set of orthogonal vectors spanning the same space as the first $k$ eigenvectors of $L$. In [3], the authors defined a cost function as:

$$J = \sum_{i=1}^{n}\sum_{j=1}^{k} \frac{X_{ij}^2}{M_i^2} \tag{7}$$

where $M_i = \max_j X_{ij}$. Minimizing $J$ with cluster number $k$ recovers the rotation that best aligns the columns of $X$ with the canonical coordinate system. Furthermore, minimizing $J$ means incorporating as few columns as possible to contain a bigger data gap, that is, preserving pronounced indicators while reducing faint ones. This accords with our clustering target and gives some expression to the label information of the corresponding points. A similar result arises in the general case with perturbed data. Therefore, it is reasonable for us to measure the representativeness of points using a similar cost function.

III. OUR PROPOSED METHOD

By evaluating the points in every cluster with our proposed measure, we compress the original data into a set of representative points. Instant cluster labels can then be generated according to these extracted representative points as new data points are added. However, as new data continuously arrive, the original representative points may no longer represent their clusters well. Hence, we apply the incidence vector to propagate the change of data into the Eigen-system and keep an up-to-date set of representative points. In this section, we discuss these problems in detail.

A. Extracting Representative Points and Their Number

1) Representative Measurement: Once we obtain the clusters by applying the NJW algorithm, it makes sense to analyze the representativeness of each point in its submanifold. There are many general algorithms designed for this problem [12]; however, most of them are based on distance, density or mode estimation and hence cannot reflect the internal and external relations between clusters. For this purpose, we define a new cost function to measure the representative reliability of the points in each cluster according to its eigenvectors. Inspired by Eq. (7), we define the representative reliability $R_i$ of point $v_i$ in cluster $C_j$ as:

$$R_i = \sum_{j=1}^{k} \frac{X_{ij}^2}{M_i^2} \tag{8}$$

where $M_i = \max_j X_{ij}$, and a better representative point has a smaller $R_i$.

Fig. 1 shows a toy example of a graph that evolves from (a) to (b) as a new type of data point $D$, accompanied by an edge $BD$, is added. In Fig. 1(a), the representative point should be $B$, while in Fig. 1(b) the representative point should be $A$. That is to say, the measure of Eq. (8) prefers points with more similarity inside their own cluster and less similarity to other clusters. Hence, a connection with other clusters reduces a point's representative reliability.

[Figure 1: graph on nodes A, B, C, D with weighted edges. (a) Before evolution. (b) After evolution.]

Figure 1: A toy example of incremental data. The dashed lines are the edges to be added.
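The measure of Eq. (8) can be sketched directly from its definition. In the sketch below, `X` is a list of rows of the eigenvector matrix; the function names are hypothetical, the values are made up for illustration, and $M_i$ is taken as $\max_j |X_{ij}|$ (an assumption that makes the score independent of eigenvector sign). Rows are assumed to be non-zero.

```python
def representative_reliability(X):
    """R_i = sum_j X_ij^2 / M_i^2 for each row i of X, with M_i = max_j |X_ij|."""
    scores = []
    for row in X:
        m = max(abs(x) for x in row)              # assumed non-zero
        scores.append(sum(x * x for x in row) / (m * m))
    return scores

def pick_representatives(X, members, count):
    """Indices of the `count` cluster members with the smallest R_i."""
    R = representative_reliability(X)
    return sorted(members, key=lambda i: R[i])[:count]

# Hypothetical embedding: row 0 loads on a single column, row 1 is spread over two.
X = [[1.0, 0.0], [0.8, 0.6], [0.0, -1.0]]
scores = representative_reliability(X)
```

A row concentrated on one column reaches the minimum score of 1, while a row spread over several columns scores higher, which matches the preference described above for points that belong clearly to one cluster.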

2) The Number of Representative Points: The next problem is to select the number of representative points. We want to choose enough points to represent a cluster while keeping their number as small as possible to avoid redundant computation. We can solve this problem by analyzing the Eigen-gap of each cluster and fixing the number by Eq. (6). Furthermore, if time is particularly constrained and a certain error is allowed, we can approximate the spectrum of each sub-cluster $C_j$ with the corresponding columns and rows of $Z$, where $Z$ is the spectrum of the whole data set, and denote the reduced matrix as $Z_{C_j} \in \mathbb{R}^{|C_j| \times |C_j|}$. Thus the eigenvalues of cluster $C_j$ can be approximately expressed as:

$$\lambda_i^{C_j} = \frac{(x_i^{C_j})^T L\, x_i^{C_j}}{(x_i^{C_j})^T D\, x_i^{C_j}} \tag{9}$$

where $x_i^{C_j}$ corresponds to the $i$-th column of $Z_{C_j}$. We can then use Eq. (6) to detect the number of representative points.
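Eq. (9) is a Rayleigh quotient, which the following sketch evaluates for a given vector. The function name `rayleigh_quotient` is hypothetical, and the matrices `L` and `D` are simply taken as arguments: how they are restricted to a sub-cluster is not spelled out in the text, so that choice is left to the caller.

```python
import numpy as np

def rayleigh_quotient(x, L, D):
    """Approximate eigenvalue x^T L x / x^T D x for a (possibly inexact) eigenvector x."""
    x = np.asarray(x, dtype=float)
    return float(x @ L @ x) / float(x @ D @ x)

# Two nodes joined by one edge of weight 0.5 (hypothetical values).
L = np.array([[0.5, -0.5], [-0.5, 0.5]])
D = np.diag([0.5, 0.5])
lam_const = rayleigh_quotient([1.0, 1.0], L, D)    # constant vector
lam_split = rayleigh_quotient([1.0, -1.0], L, D)   # vector separating the two nodes
```

When `x` is an exact eigenvector of $Lu = \lambda Du$ the quotient returns its eigenvalue; for an approximate eigenvector it returns an estimate whose quality depends on how close `x` is to a true eigenvector.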

B. Updating Representative Sets and Re-initializing the Algorithm when Cluster Number Changes

As new data arrive incrementally, the error accumulates. This is also a problem in many other algorithms. Here we re-initialize the NJW algorithm to avoid a collapse. The question is then when to apply the re-initialization step. We could simply apply it once a certain pre-set number of points have been added; however, a constant number is hardly adequate, since the added data may have very different similarity connections with the original data points. Hence, we expect to gain a better result by continuously detecting the change of the cluster number in an approximate way. The current cluster number can be detected by the Eigen-gap as:

$$\begin{aligned} \tilde{k} &= \arg\min_i\,(\max(\tilde{\lambda}_{i+1} - \tilde{\lambda}_i)) \\ &= \arg\min_i\,(\max((\lambda_{i+1} + \Delta\lambda_{i+1}) - (\lambda_i + \Delta\lambda_i))) \\ &= \arg\min_i\,(\max(g_i + (\Delta\lambda_{i+1} - \Delta\lambda_i))) \end{aligned} \tag{10}$$

Thus, we can get the current number of clusters $\tilde{k}$ by Eq. (10) and Eq. (5), and apply the re-initialization step when $\tilde{k} \neq k$.
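The update can be sketched without any matrix products, because for an edge $(i, j)$ the numerator of Eq. (5) reduces to $(x_i - x_j)^2 - \lambda (x_i^2 + x_j^2)$ and the denominator to $\sum_k D_{kk} x_k^2$. The function names below are hypothetical, and Eq. (10) is read as the smallest index with the largest gap, which is an assumption.

```python
def delta_lambda(lam, x, degrees, i, j, delta_c):
    """First-order eigenvalue change of Eq. (5) for a similarity change delta_c on edge (i, j)."""
    numerator = (x[i] - x[j]) ** 2 - lam * (x[i] ** 2 + x[j] ** 2)
    denominator = sum(d * xk * xk for d, xk in zip(degrees, x))   # x^T D x
    return delta_c * numerator / denominator

def updated_cluster_number(eigvals, eigvecs, degrees, i, j, delta_c):
    """Re-estimate the cluster number from the perturbed eigenvalues (Eq. (10))."""
    new_vals = [lam + delta_lambda(lam, x, degrees, i, j, delta_c)
                for lam, x in zip(eigvals, eigvecs)]
    gaps = [new_vals[t + 1] - new_vals[t] for t in range(len(new_vals) - 1)]
    return gaps.index(max(gaps)) + 1

# Arithmetic check with hypothetical values: lam = 0.5, x = (1, 0, 0), degrees = (2, 1, 1).
change = delta_lambda(0.5, [1.0, 0.0, 0.0], [2.0, 1.0, 1.0], 0, 1, 0.5)
```

Here `eigvecs` is a list with one eigenvector per entry of `eigvals`, and `degrees` holds the diagonal of $D$. A re-initialization would be triggered whenever the returned value differs from the current $k$.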

In III-A, we have chosen $k_i$ representative points by […] data assigned to $C_i$ but without changing the magnitude of $k_i$.

In this situation, the previously extracted representative points still work, since no new type of point has been generated. However, when $k_i$ increases, the previously extracted points can hardly suffice. Therefore, we adopt a strategy similar to the above discussion of the re-initialization step and solve this problem by simply adding the point that caused the change of $k_i$ to the representative set.

C. A Fast Incremental Spectral Clustering Algorithm for Large Data Set

1) The Algorithm: Summarizing Sections III-A and III-B, we propose a new incremental spectral clustering algorithm and describe it as follows:

Algorithm 2 A Fast Incremental Spectral Clustering Algorithm

Input: Number of clusters $k$, affinity matrix $W \in \mathbb{R}^{n \times n}$ at time $t$. Newly arriving data points $v_l$ after $t$.

1) Apply Algorithm 1 with parameters $k$ and $W$ to generate $k$ clusters $C_1, \ldots, C_k$. Denote by $X$ the matrix containing the first $k$ eigenvectors as columns and by $Z$ the matrix containing all eigenvectors.

2) For each cluster $C_i$, compute the representative reliability $R_j$ of every point $v_j \in C_i$ according to Eq. (8), and choose the first $k_{C_i}$ points to represent cluster $C_i$, denoted $\bar{C}_i$. $k_{C_i}$ is computed by Eq. (6), where the corresponding parameter $\lambda^{C_i}$ is given by Eq. (9). Note that the first $k_{C_i}$ points are the points corresponding to the $k_{C_i}$ smallest $R_j$.

3) For every newly added point $v_l$, compute the average distance $Dis$ from $v_l$ to cluster $C_j$ and assign $v_l$ to the cluster $C_m$ that gives the smallest value of $Dis$:

$$Dis = \frac{\sum_{i \in \bar{C}_j} d(v_l, v_i)}{|\bar{C}_j|}$$

4) Compute the current cluster number $\tilde{k}$ according to Eq. (10), where the change of the eigenvalues is given by Eq. (5) in the form of the incidence vector. If $\tilde{k} \neq k$, go back to step 1 to re-initialize the algorithm with $k = \tilde{k}$; otherwise continue.

5) Compute the current number $\tilde{k}_{C_m}$ of representative points of $C_m$ as in step 4. If $\tilde{k}_{C_m} > k_{C_m}$, add $v_l$ to $\bar{C}_m$; otherwise continue.

6) Go to step 3.

Output: Instant cluster labels of points $v_l$.
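Step 3 of Algorithm 2 is the part executed for every arriving point, and it can be sketched as below. The function name `assign_to_cluster`, the dictionary layout of the representative sets, and the Euclidean distance (via `math.dist`, Python 3.8+) are assumptions made for illustration.

```python
import math

def assign_to_cluster(v_new, representative_sets):
    """Step 3 sketch: choose the cluster whose representatives have the smallest mean distance to v_new."""
    best_cluster, best_dis = None, math.inf
    for cluster_id, reps in representative_sets.items():
        # Dis = sum of d(v_new, v_i) over the representatives, divided by their number.
        dis = sum(math.dist(v_new, v) for v in reps) / len(reps)
        if dis < best_dis:
            best_cluster, best_dis = cluster_id, dis
    return best_cluster, best_dis

# Hypothetical representative sets for two clusters.
reps = {0: [(0.0, 0.0), (1.0, 0.0)], 1: [(10.0, 0.0), (11.0, 0.0)]}
label, dis = assign_to_cluster((0.5, 0.0), reps)
```

Because only the representative points are visited, the cost of labelling one new point depends on the size of the representative sets rather than on the size of the whole data set.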

2) Discussions: It is known that computing the spectrum of a standard matrix needs $O(n^3)$ operations, which can be further reduced to $O(n^{3/2})$ if the Laplacian matrix is sparse. However, the computational cost is still very high. Hence, the NJW algorithm may fail when the data scale is large or new data arrive too frequently. In contrast, our algorithm remains applicable: it is fast and relatively accurate. Here, we briefly analyze the time complexity to illustrate this. It needs $O(n)$ operations to compute the representative points in each cluster at initialization and $O(\tilde{n}^{3/2})$ operations to generate cluster labels and update representative sets as new data arrive, where $n$ and $\tilde{n}$ denote the sizes of the data set and the representative set respectively. $\tilde{n}$ is usually much smaller than $n$ and relatively stable; hence, our method is efficient and able to process large data sets.

IV. EXPERIMENTS

A. Parameter Settings

As mentioned before, we use the k-nearest neighbor graph to construct the sparse affinity matrix. However, this may lead to a non-symmetric matrix. Fortunately, we can make it symmetric by simply setting both $W_{ij}$ and $W_{ji}$ to the similarity of $v_i$ and $v_j$ if either $W_{ij}$ or $W_{ji}$ is non-zero. In this experiment, we adopt the Gaussian similarity function to measure local neighborhoods between points, and its parameter $\sigma$ is selected in the self-tuning way suggested in [3]. Moreover, we employ ARPACK (a variant of the Lanczos method) to compute the spectrum of $D^{-1}L$ and choose $k = 20$ to construct the k-nearest neighbor graph.
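The symmetrization rule can be sketched in one line. Since the similarity of a pair is itself symmetric, a non-zero entry and its zero mirror entry can be merged with an element-wise maximum; the function name `symmetrize_knn` and the sample matrix are hypothetical.

```python
import numpy as np

def symmetrize_knn(W):
    """Set W_ij = W_ji to the pair's similarity whenever either entry is non-zero."""
    W = np.asarray(W, dtype=float)
    return np.maximum(W, W.T)        # a missing (zero) entry takes its mirror's value

# Node 2 lists node 1 as a neighbour, but not the other way round.
W = np.array([[0.0, 0.6, 0.0],
              [0.6, 0.0, 0.0],
              [0.0, 0.1, 0.0]])
W_sym = symmetrize_knn(W)
```

The element-wise maximum relies on the similarities being non-negative, which holds for the Gaussian similarity function.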

B. Data Sets

The data set is a collection of about 810,000 documents known as RCV1 (Reuters Corpus Volume I) [14]. It is manually categorized into 350 classes and split into 23,139 training documents and 781,256 test documents. We use the category codes based on the industries vocabulary and preprocess the data set by removing multi-labeled documents and categories with fewer than 500 documents. Thus, we get about 200,000 documents in 103 categories. In this experiment, we extract a subset ϕ from the 200,000 documents to initialize our algorithm and simulate the growth of the data set by adding data points to ϕ from the rest of the 200,000 documents.

C. Quality Measure

We evaluate our algorithm by computing the Clustering Accuracy (CA) and the Normalized Mutual Information (NMI) between the labels generated by our algorithm and the real ones [13]:

$$CA = \max_{map} \frac{\sum_{i=1}^{n} \delta(y_i, map(c_i))}{n}$$

where $n$ denotes the number of documents, and $y_i$ and $c_i$ denote the real label and the generated label of document $v_i$ respectively. The function $\delta(y, c)$ equals 1 if $y = c$ and 0 otherwise. The permutation function $map(\cdot)$ maps each generated label to a real one, and the optimal mapping can be found as in [15]. The value of CA is between 0 and 1, and a higher CA score means a better clustering quality.
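For a handful of clusters, CA can be computed by trying every one-to-one mapping, as sketched below. This brute-force search is a stand-in chosen for brevity, not the matching method of [15]; its cost grows factorially with the number of clusters, so it is only meant to illustrate the definition. The function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations

def clustering_accuracy(true_labels, cluster_labels):
    """CA = max over one-to-one maps of the fraction of points with y_i == map(c_i)."""
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # Pad with None so every cluster can receive a distinct target (None = unmatched).
    targets = classes + [None] * max(0, len(clusters) - len(classes))
    best = 0
    for perm in permutations(targets, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(1 for y, c in zip(true_labels, cluster_labels) if mapping[c] == y)
        best = max(best, hits)
    return best / len(true_labels)

# Hypothetical labels: the cluster ids are permuted and one point is misplaced.
ca = clustering_accuracy([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 0])
```

In this example the best mapping sends cluster 1 to class 0, cluster 0 to class 1 and cluster 2 to class 2, which matches five of the six points.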

$$NMI = \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} n_{ij}\log\left(\frac{n \cdot n_{ij}}{n_i \cdot n_j}\right)}{\sqrt{\left(\sum_{i} n_i \log\frac{n_i}{n}\right)\left(\sum_{j} n_j \log\frac{n_j}{n}\right)}}$$

where $n$ denotes the number of documents, $n_i$ and $n_j$ denote the numbers of documents in cluster $i$ and category $j$, and $n_{ij}$ denotes the number of documents in both cluster $i$ and category $j$. The value of NMI is between 0 and 1, and a higher NMI score means a better clustering quality.
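The NMI score can be computed from three sets of counts, as in the sketch below. It assumes the square-root normalization used in [13] and at least two clusters and two categories, since a single group makes the denominator zero; the function name `nmi` is hypothetical.

```python
import math
from collections import Counter

def nmi(cluster_labels, category_labels):
    """Normalized mutual information between a clustering and the true categories."""
    n = len(cluster_labels)
    n_i = Counter(cluster_labels)                        # documents per cluster
    n_j = Counter(category_labels)                       # documents per category
    n_ij = Counter(zip(cluster_labels, category_labels)) # documents per (cluster, category)
    mutual = sum(c * math.log(n * c / (n_i[a] * n_j[b]))
                 for (a, b), c in n_ij.items())
    h_i = sum(c * math.log(c / n) for c in n_i.values())
    h_j = sum(c * math.log(c / n) for c in n_j.values())
    return mutual / math.sqrt(h_i * h_j)

same = nmi([0, 0, 1, 1], [1, 1, 0, 0])          # same partition, permuted ids
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # statistically independent partitions
```

Pairs with $n_{ij} = 0$ never appear in the counter, so they contribute nothing to the sum, consistent with the convention $0 \log 0 = 0$.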

[Figure 2, three panels plotting K-means, Modified NJW and incremental spectral clustering against the number of points (3000 to 6000): (a) NMI; (b) Accuracy; (c) Time.]

Figure 2: A clustering quality and runtime comparison between K-means, Modified NJW and Alg. 2 using the RCV1 data set. For Alg. 2, we use 3000 points as initialization and incrementally add another 3000 points from the rest of the data set. For K-means, each value is the mean of 10 replicates.

D. Results

Fig. 2(a) and Fig. 2(b) show the NMI and CA scores on the RCV1 data set. Both results confirm that our algorithm achieves a relatively good clustering quality, between those of NJW and K-means. Although the values of NMI and CA may drop gradually as more points are added, this can be rectified by the automatic re-initialization operation of Alg. 2. Furthermore, it performs much better as the number of points increases, which is crucial for large data sets. Fig. 2(c) reports the runtime on the RCV1 data set. It can be seen that the runtime of Alg. 2 is close to that of K-means and much less than that of NJW. In addition, its runtime does not increase as sharply as that of NJW as new points are added; instead, it becomes relatively stable and approaches that of K-means. Hence, compared with re-computation by NJW, it achieves similar accuracy at a much lower computational cost.

V. CONCLUSIONS

A fast incremental spectral clustering algorithm for large data sets is proposed in this paper. It extends the NJW algorithm to handle dynamic data and incorporates a new measure to compress the original data sets into a certain number of representative points. Instead of evaluating the whole data set, the algorithm incrementally maintains representative sets and generates instant cluster labels as new points arrive. Therefore, the algorithm is fast and can be efficiently applied to large data sets. Moreover, by analyzing the Eigen-gap in the form of incidence vectors, changes in the number of clusters can be detected automatically. Experimental results over real evolving data sets show that our method provides fast and relatively accurate results.

REFERENCES

[1] A.Y. Ng, M.I. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, in: T.G. Dietterich, S. Becker, Z. Ghahramani (Eds.), Proceedings of the Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 2002, pp. 849–856.

[2] von Luxburg U. (2007). A tutorial on spectral clustering. Stat. Comput. 17, 395–416

[3] L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 1601-1608. MIT Press, Cambridge, MA, 2005.

[4] J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence 22 (8) (2000) 888–905.

[5] F. Bach and M. Jordan. Learning spectral clustering. In Proc. of NIPS-16. MIT Press, 2004.

[6] F. R. K. Chung. Spectral Graph Theory. American Mathematical Society, 1997.

[7] B. Bollobas, Modern Graph Theory, Springer, New York, 1998.

[8] H. Ning, W. Xu, Y. Chi, Y. Gong, and T. Huang. Incremental spectral clustering with application to monitoring of evolving blog communities. In SIAM Int. Conf. on Data Mining, 2007.

[9] C. Valgren, T. Duckett, and A. Lilienthal. Incremental spectral clustering and its application to topological mapping. In Proc. IEEE Int. Conf. on Robotics and Automation, pages 4283–4288, 2007.

[10] F.R.K. Chung, Spectral Graph Theory, in: CBMS Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, Providence, RI, 1997.

[11] Bhatia, R.: Matrix Analysis. Springer, New York (1997)

[12] D. Chaudhuri, C.A. Murthy, and B.B. Chaudhuri, “Finding a Subset of Representative Points in a Data Set,” IEEE Trans. Systems, Man, and Cybernetics, vol. 24, no. 9, pp. 1416-1424, 1994.

[13] Wen-Yen Chen, Yangqiu Song, Hongjie Bai, Chih-Jen Lin, and Edward Y. Chang. Parallel Spectral Clustering in Distributed Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010.

[14] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361–397, 2004.

[15] L. Lovasz, M. Plummer, Matching Theory, Akademiai Kiado, North-Holland, Budapest, 1986.