Privacy Preserving Distributed Cell based K means Clustering Algorithm

(1)

2017 International Conference on Mathematics, Modelling and Simulation Technologies and Applications (MMSTA 2017) ISBN: 978-1-60595-530-8

Privacy Preserving Distributed Cell-based K-means Clustering Algorithm

Fang SU, Yun-xiao ZU

*

_{and Wei-hai LI}

School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China

*Corresponding author

Keywords: K-means, Cell, Privacy preserving, Distributed system.

Abstract. In the traditional K-means clustering algorithm, the number of clusters is difficult to be

estimated and the selection of the initial cluster centers has a great influence on the clustering results. To solve these problems, we proposed cell-based K-means clustering algorithm. The objects closer to a cluster center are dense, and the objects farther away from a cluster center are sparse. Therefore, we analyzed the distribution of objects by dividing the data space into cells. Theoretical analysis and experimental results show that the improved algorithm not only can get the number of clusters, but also can select the initial clustering centers close to the true cluster centers. In addition, we applied the improved algorithm to distributed system, which can achieve both the security of data and the accuracy of clustering.

Introduction

Data mining is a process of discovering useful knowledge in a large amount of data. Data mining techniques include association rules, classification, prediction, clustering and outlier analysis [1]. Clustering analysis is a very important branch of data mining. The goal of clustering is that the objects in the same group are similar, while the objects in different groups are different. Clustering analysis is widely used in the fields of market research, speech recognition, image processing and data compression. According to different needs, different clustering methods have been developed, such as partition-based, hierarchical-based, density-based and grid-based [2].

The goal of partition-based clustering is to divide a data set containing n objects into k groups,

each of which represents a cluster. K-means clustering is a representative algorithm of partition-based method. Distance is used as the evaluation standard of similarity, that is, the closer the two objects are, the higher the similarity is. K-means clustering algorithm has the advantages of scalability, high efficiency and nearly linear time complexity, but it also has some disadvantages [3,4]. Therefore, this paper proposes the cell-based K-means clustering algorithm (hereinafter referred to as “CK-means clustering algorithm”). We divide the data space into cells, and each object is mapped in a specific cell. By analyzing the cells, we can get the number of clusters and effective initial clustering centers.

(2)

CK-means Clustering Algorithm

K-means Clustering Algorithm

The idea of K-means clustering algorithm is that, randomly select k objects from the data set as the

initial cluster centers, calculate the distance between each object and each cluster center, assign the objects to the nearest cluster, recalculate the cluster centers and reassign the objects. Iterate through the process until the termination condition is met. The termination condition may be one of the following conditions:

1) The cluster centers do not change anymore. 2) Each object is not reassigned.

3) The clustering criterion function converges. One of the common criterion functions is as follows:

2

1 i j

k

i j j x W

E x 

 



 

 (1) Here, x_i is the object in cluster W_j, _j is the mean of cluster W_j.

Related Definitions

Suppose the data set X has n objects { , , , }x x1 2  xn , and each object has m

attributes{ , , , }a a1 2  am . After preprocessing the data, the value range of the objects in dimension

1,2, , (j )

j   m is from min_j to max_j. We divide each dimension of the data space into d intervals

equally [7]. Thus the data space is divided into m

d independent cells. U represents a cell.

Definition 1. Cell ID: The spatial location of a cell can be uniquely identified by an integer

sequence ( , , ,u u₁ ₂  u_m), and uj is the id in dimension j of the cell.

Definition 2. Cell side length in dimension j, l_j:

max min

, 1, 2, ,

j j

j m

l

d j





  (2) Definition 3. The mapping of object x_i ( ,x x_i₁ _i₂, , x_{i m}), ( , , ,y y₁ ₂  y_m):

min

1 , 1, 2, ,

ij j j

j x

y j m

l

  

_  _ 

 

   (3) Definition 4. The density of a cell, Den: Each cell has the same volume, so the density of a cell is

the number of objects mapped in the cell. If Den is greater than 0, the cell is a non-empty cell. If Den is equal to 0, the cell is an empty cell.

Definition 5. The sum of the objects in a cell, Sum(Sum Sum₁, ₂, , Sum_m):

i

i x U

Sum x







(4) Definition 6. Density threshold, DenTh:

n DenTh

count

 (5)

Here, count is the number of non-empty cells.

(3)

Definition 8. Adjacent cells: U₁( , , ,a a₁ ₂  a_m) and U2 ( , , , )b b1 2  bm are cells in the data

space. If a_ib_i 1(i[1,m]) and aj bj(j1, 2, , i1,i1, , ) m , U1 and U2 are adjacent cells.

Definition 9. Connected area of the dense cell, A: Mark a dense cell and find its adjacent cells. If

some of its adjacent cells are dense cells, make the same mark. Continue to scale out until none of their adjacent cells is a dense cell. Cells with the same mark form a connected area.

Taking two-dimensional space as an example, as shown in Fig. 1, the number in the cell is cell ID.

Assume that (2, 2) is a dense cell and mark (2, 2) as a. If its adjacent cell (2,3) is a dense cell, mark

(2,3) as a. Scale out to find adjacent cells of (2,3) . If cell (2, 4) is a dense cell, and mark it as a.

Except (2, 2) , (2,3) and (2, 4) , none of their adjacent cells is a dense cell. Cells marked a form a

[image:3.612.238.373.211.319.2]

connected area A{(2, 2), (2,3), (2, 4)}.

Figure 1. Two-dimensional space.

CK-means Clustering Algorithm The algorithm flow is as follows:

Input: A data set consisting of n objects X { , , , }x x1 2  xn

Output: kclusters

1) Preprocess the data set and divide the data space into cells. Each object is mapped in a specific cell.

2) Calculate the density of each cell. 3) Calculate the density threshold DenTh.

4) Select the cell with the largest density, search for its connected area and mark it as A₁. In the

unmarked dense cells, select the cell with the largest density, search for its connected area and mark it as A₂.Continue this process until all the dense cells are marked. In this process, if there is more than

one cell with the largest density, select the cell farthest away from the previous connected areas [8]. 5) The number of connected areas can be used as the number of clusters k. Calculate the mean of

the objects in the connected areas A_j and take it as initial cluster center cj:

, 1, 2 ,

j

U A j

U A Sum

c j k

Den 











 (6) 6) Calculate the distance between object x i_i( 1,2 , )n and cluster center c_j(j1, 2 , ) k , and

assign each object to the nearest cluster. This paper uses Euclid Distance:

2

1

( , )_i _j m ( _{i p} _{j p})

p

distance x c x c 





 (7) 7) Calculate the mean of cluster W_j and take it as cluster center c_j.

(4)

PPDCK-means Clustering Algorithm

In distributed system, we randomly select a site as the master site, and take the other sites as the slave sites. Every site processes the local data, and the slave sites send the processed data to the master site, and the master site processes the global data [9,10]. Every site does not reveal its original private data. Suppose the distributed system contains r sites { , , , }(S S₁ ₂  S_r r3), and the corresponding data

sets are { ,X X1 2, , Xr}. The global data set is

1 r i i X X 



_

. Data set X i_i( 1, 2, , ) r has ni objects,

and every object has m attributes. Suppose Sr is the master site, and { , , ,S S1 2  Sr1} are the slave

sites.

The algorithm flow is as follows: Input: Data sets { ,X X₁ ₂, , X_r} Output: kclusters

1) Every site takes the unified method to preprocess the local data set and divide the data space into cells. Each object is mapped in a specific cell.

2) Every site calculates ( ,ID Den Sum, ) of each local cell.

3) Slave site (S_i i1, 2, , r1) randomly divides each local cell data (Den Sum_i, _i) into r1

sub-data

1 1 2 2 1 1

{( , ),( , ), ,( , )}

r r

i i i i i i

Den Sum Den Sum Den Sum

 

 . Here, 1

1 t r i i t Den Den   



, 1 1 t r i i t Sum Sum   



. Si sends 1 sub-data to each of the other r2 slave sites. In other words, Si sends

( , , )

t t

i i

ID Den Sum to S_t( 1,2, , 1, 1, ,t i i  r1) and keeps ( , , )

i i

ID Den Sum .

4) Slave site Si receives r2 sub-data sent from the other slave sites. For the same ID, Si

calculates ' 1

1 i r i t t Den Den  





and ' 1

1 i r i t t Sum Sum   



. 5) Slave site Si sends

' '

(ID Den Sum, i, i) to master site Sr.

6) Master site Sr receives r1 updated cell data sent from slave sites { , , ,S S1 2  Sr1}. For the

same ID, S_r calculates

1 ' global 1 r i r i

Den Den Den







 and 1 '

global 1 + r i r i

Sum Sum Sum







.

7) Master site S_r calculates the global density threshold DenTh



Den_global count.



Den_global

is the number of the objects, and count is the number of the global non-empty cells.

8) Master site Sr filters out the global dense cells and searches for the connected areas of the

global dense cells. The number of the connected areas can be used as the number of clusters k. Sr

calculates the mean of the objects in connected areas A_j and takes it as initial global cluster center

j c :

global

, 1, 2 ,

j j U A j U A Sum

c j k

Den 











 (8) 9) Master site S_r broadcasts the global cluster centers { , , , }c c₁ ₂  c_k .

(5)

11) Slave site Si calculates local cluster data ( ,n Csumij ij). nij is the number of the objects in

cluster W_j on site S_i. Csum_ij, a m-dimensional variable, is the sum of the objects in cluster W_j on

site Si. Si randomly divides ( ,n Csumij ij) into r1 sub-data. Si sends 1 sub-data to each of the

other r2 slave sites. In other words, S_i sends ( , )

t t

ij ij

n Csum to S_t( 1,2, , 1, 1, ,t i i r1) and

keeps ( , )

i i

ij ij

n Csum .

12) Slave site Si receives r2 sub-data sent from the other r2 slave sites and calculates

1 ' 1 i r ij tj t n n  





and ' 1

1 i r ij tj t Csum Csum   



.

13) Slave site Si sends

' '

( ,n Csum_ij _ij) to master site S_r.

14) Master site S_r calculates local cluster data ( ,n Csum_rj _rj) and receives r1 updated cluster

data sent from slave sites { , , ,S S1 2  Sr1}. Sr calculates the mean of cluster Wj and takes it as

cluster center c_j:

1 ' 1 1 ' 1 1,

, 2, ,

r ij rj i r ij rj i j Csum Csum j k n c n       







 (9) 15) Iterate through process 9)-14) until the cluster centers do not change anymore.

Experimental Results

[image:5.612.125.488.491.528.2]

In order to verify the effectiveness of the improved K-means clustering algorithm, we compared the traditional K-means algorithm and the improved K-means algorithm. The test data sets are Iris from UCI database and Sim randomly generated by MATLAB simulation tool. They have certain classifications. Therefore, the accuracy of clustering can be used to evaluate the performance of the improved algorithm.

Table 1. Data set description.

Data Set Number of Objects Number of Attributes Number of Clusters Source Iris 150 4 3 UCI Sim 1000 2 3 Artificial

In this paper, K-means clustering algorithm, CK-means clustering algorithm and PPDCK-means clustering algorithm experiments were carried out ten times, and take the mean as the clustering results. The distributed system contained sites { , , }S S S1 2 3 , and the test data were distributed

horizontally on 3 sites.

The clustering results are shown in Fig. 2 and Fig. 3. The formula of the accuracy of clustering is as follows:

1





k _j j

accuracy z

n (10)

(6)

Figure 2. Number of iterations. Figure 3. Accuracy of clustering.

Conclusion

Selection of the number of clusters and the initial cluster centers will influence the clustering results of K-means algorithm [11]. This paper optimized the algorithm for both of the problems. We get the dense cells by mapping the objects to the corresponding cells, search for the connected areas, take the number of connected areas as the number of clusters and take the mean of the connected areas as the initial cluster centers. Compared with the traditional K-means algorithm, CK-means clustering algorithm successfully reduces the number of iterations and improves the accuracy of clustering. Therefore, we apply the improved algorithm to privacy preserving distributed systems. By analyzing the experimental results, we found that PPDCK-means clustering algorithm achieves both the security of data and the accuracy of clustering.

Acknowledgement

This work is supported by Beijing Key Laboratory of Space-ground Interconnection and Convergence (Beijing University of Posts and Telecommunications).

References

[1] Ming-xing Wang. Data Mining Algorithm Optimization Research and Application [D]. Hefei: Anhui University, 2014.

[2] Yu-hui Han, Shu-liang Zhao, Meng-meng Liu. Multi-scale Clustering Mining Algorithm [J]. Computer Science, 2016, 43(8): 244-248.

[3] Yan-yun Cheng, Peng Zhou. Improved K-means Clustering Algorithm for Dynamically Allocating Clustering Centers [J]. Computer Technology and Development, 2017, 27(2): 33-36, 41. [4] Xiao-yu Li, Li-ying Yu, Hang Lei. Parallel Implementation and Application of A K-means Improved Algorithm [J]. Journal of University of Electronic Science and Technology of China, 2017, 46(1): 61-68.

[5] Lei Zhou. Reflections on Distributed Data Mining Solutions [J]. Journal of Changzhou Vocational College of Information Technology, 2016, 15(5): 22-24.

[6] Wu Yuan. Research on Privacy Preserving Data Mining in Distributed Environment [D]. Nanjing: Nanjing University of Posts and Telecommunications, 2015.

(7)

[8] Xing-shu Chen, Xiao-song Wu, Wen-xian Wang. K-means Initial Clustering Center Optimization Algorithm based on Characteristic Degree of Association [J]. Journal of Sichuan University (Engineering Science Edition), 2015, 47(1): 13-19.

[9] Liu-sheng Huang, Miao-miao Tian, He Huang. Research on Big Data Privacy Preserving Cryptography [J]. Journal of Software, 2015, 26(4): 945-959.

[10] Xiang Wu, Xiao Yu, Huan-huan Wang. Design of Anonymized Private Data Publishing System for Data Mining [J]. Application of Electronic Technique, 2016, 42(11): 62-65.