Text Document Clustering Based on Density K means

(1)

2016 International Conference on Computer, Mechatronics and Electronic Engineering (CMEE 2016) ISBN: 978-1-60595-406-6

Text Document Clustering Based on Density K-means

Di WU

1

, Yan ZENG

2

and Yin-chuan QU

2

1_{College of Computer, National University of Defense, Chang Sha, China} 2_{Beijing Gaodi Information Technology Co., Ltd., Beijing, China}

Keywords: K-means, Density, Text document, Clustering.

Abstract. K-means is one of the most fundamental techniques in clustering. It has been applied in many fields, such as image processing and Natural Language Processing. It has good performance in many cases, especially in dealing with large data sets. However, how to choose the initial cluster centers is a hard problem, different choice may cause the clustering results by K-means unstable even get the local optimum. To solve this problem, many methods have be proposed, while these methods only apply in some certain fields and perform disappointed when we use for text documents clustering. In this paper, we designed a novel density K-means algorithm and apply it in the text document clustering. The experimental results show that it performs better than most of the existing methods in Chinese corpus. Furthermore, compared with other algorithms, our algorithm can effectively decrease the iterations.

Introduction

(2)

distribution of the data to decide the initial cluster centers. However, this algorithm cannot solve the large iterations of K-means and leaded to high computational complexity. While the mentioned algorithms can solve the initial centers for K-means to some extent, in some certain fields. In addition, most of methods under the assumption that the data obeys mixed Gaussian distributions. However, most data sets are more complicated, especially the information of Text documents, and that will lead to the disappointed performance with these methods.

In this paper, we have two contributions to text document clustering. To solve the problem that caused by choosing initial clustering centers randomly and the large number of iteration, we proposed a new algorithm for center initialization for K-means. Then for text document clustering we put forward a novel clustering method based on density peaks. Experiment demonstrates that the new algorithm has strong ability in text document clustering.

The remainder of paper proceeds as follows. In section 2, we describe the detail of K-means algorithm. Section 3 introduces our method Cluster center initialization of K-means with density peak (CCIDP). In Section 4, we propose a system that used for Text Document clustering. In next section, experimental results demonstrate that our method performs better than other three methods. In the final Section, we give a summary and discuss our future work.

The K-Means Algorithm

In this section, we describe the original K-means algorithm. Let 𝐷 = {𝑑_𝑖, 𝑖 = 1, ⋯ , 𝑛} be a data set with k clusters, 𝐶 = {𝑐_𝑖, 𝑖 = 1, ⋯ , 𝑘} be a set of k centers and 𝑆_𝑗 = {𝑑|𝑑 𝑖𝑠 𝑎 𝑚𝑒𝑚𝑏𝑒𝑟 𝑜𝑓 𝑡ℎ𝑒 𝑐𝑙𝑢𝑠𝑡𝑒𝑟} be the set of samples that belong to the k-th cluster.

The main idea is to classify the given set of data D into k disjoint clusters, and k is a parameter which needs us to impute before clustering. First, we select k centers for each cluster randomly. Then assign each point to the nearest center. After that calculate the average coordinates of all points in each cluster and regard it as the new cluster centroid. The algorithm updates the centers and assigns the point repeatedly until the centers do not change anymore.

K-means algorithm can be described as follows [16]:

Algorithm 1. The K-means algorithm

1. Input: 𝐷 = {𝑑𝑖, 𝑖 = 1, ⋯ , 𝑛}, 𝑘.

2. Output: a set of k clusters.

3. Initiate algorithm by randomly select initial cluster center 𝑐𝑘.

4. Assign each 𝑑𝑖to the cluster 𝑆𝑘, which has minimum distance d(𝑑𝑖, 𝑐𝑘)

5. Calculate new 𝑐𝑘 centers as

𝑐𝑘= ∑𝑑𝑖∈𝑆𝑘_|𝑆 𝑑𝑖

𝑘| . (1)

6. Repeat steps 2 and 3 till there is no movement in cluster centers.

The distance between two data points x(𝑥1, 𝑥2, 𝑥3, ⋯ , 𝑥𝑛) and y(𝑦1, 𝑦2, 𝑦3, ⋯ , 𝑦𝑛) is computed

as follows:

d(x, y) = √(𝑥1− 𝑦1)2+ (𝑥2− 𝑦2)2+ ⋯ + (𝑥𝑛− 𝑦𝑛)2 . (2)

The results of K-means are not unique because of the different random initial cluster centers. To get better results, we need to run K -means plenty of times with different random initial centers. In next section, we propose our cluster center initialization algorithm based on finding the centers with density peaks.

Cluster Center Initialization of K-Means with Density Peak

(3)

large distance with each other. For each point 𝑑_𝑖 in data set D, two parameters determine whether the point has potential to become the next cluster center: 𝑐_𝑗+1: 𝜌_𝑖, local density and its distance d from points with higher density. The two parameters both based on the distance between points.

Definition 1: 𝜌_𝑖 is the density of the point 𝑑_𝑖. In this paper, instead of defining the density of 𝑑_𝑖 as the number of points that within a radius r around 𝑑_𝑖, we use a Gaussian kernel to determine the value of 𝜌𝑖. 𝑑𝑖,𝑗 , which denote the distance between 𝑑𝑖 and 𝑑𝑗, and r is a cut off distance.

𝜌_𝑖 = ∑𝑛−1_𝑗=1𝑒𝑑𝑖,𝑗𝑟 . (3)

Definition 2: 𝛿𝑖 is the minimal distance from point i to the point with higher local density.

𝛿_𝑖 = min_𝑗:𝜌_𝑗_>𝜌_𝑖𝑑_𝑖,𝑗. (4)

If point i has the highest density, we take 𝛿_𝑖 = max_𝑗(𝑑_𝑖,𝑗). It means that only the point is the local or global density maximum. Its 𝛿𝑖 is much higher than the distance from the nearest point with

higher density.

Definition 3: 𝑟_𝑖 is the potential of the point 𝑑_𝑖 to become the next cluster center. We combine the 𝜌_𝑖 with 𝛿_𝑖. In our assumption, the cluster center point 𝑑_𝑖 should have higher 𝜌_𝑖 and 𝛿_𝑖.

[image:3.595.199.412.479.639.2]

𝑟_𝑖 = 𝜌_𝑖 × 𝛿_𝑖 . (5) We can combine the following Figure to express the main idea of the algorithm more vividly. Figure 1 shows 28 points distributed in two dimensions. Firstly, we compute 𝜌_𝑖 for each point using formula (3). Then we compute the distance from other points with higher density and Take the minimum value as 𝛿_𝑖. We can plot 𝛿_𝑖 as the function of 𝜌_𝑖 of each point like figure 2. After that, we compute 𝑟_𝑖 for each point to denote the potential of becoming the cluster center. In figure 2, point 1 and point 10 are “outliers” with high 𝛿 and 𝜌 which means high 𝑟 value. So point 1 and point 10 are cluster centers. Finally, we can assign the remaining points to the cluster of the nearest neighbor with high density. Figure 3 shows the result of clustering.

(4)

[image:4.595.193.409.72.430.2] [image:4.595.207.387.80.225.2]

Figure 2. Plot 𝛿 as a function of 𝜌.

Figure 3. The clustering result and different colors represent different clusters.

The algorithm CCIDP is described as follows:

Algorithm 2. CCIDP

1. Input: 𝐷 = {𝑑𝑖, 𝑖 = 1, ⋯ , 𝑛}, 𝑘.

2. Output: a set of k clusters.

3. Compute 𝜌𝑖and 𝛿𝑖 of each 𝑑𝑖in set D using equation (3) and equation (4), respectively.

4. Set 𝑚 = 0,𝐶 = { }, 𝑟𝑖= 𝜌𝑖× 𝛿𝑖 .

5. Add 𝑑𝑖 to C, if 𝑟𝑖= max( 𝑟𝑗) , 𝑗 = 1,2, ⋯ , 𝑛.

6. Remove 𝑑𝑖 from D, m=m+1. 7. Repeat step 5 and 6 until m==k

8. Assign each 𝑑𝑖 to corresponding cluster 𝑆𝑘, when 𝑑𝑖 is most closely to 𝑐𝑘 with a minimum distance d(𝑑𝑖, 𝑐𝑘).

9. Compute new 𝑐𝑘 as:

𝑐𝑘=

∑_𝑑_𝑖_∈𝑆_𝑘𝑑_𝑖 |𝑆𝑘|

10. Repeat step 9 and 10 until there is no movement in cluster centers.

Using this CCIPD, we can find good initial center set and get better results with less iterations. Comparing to other algorithms requires the iterative optimization of the objective function. In our algorithm, clustering is done in one step.

Text Document Clustering Based on CCIPD

The Vector Representation of Texts

(5)

classical technique of text representation is based on the VSM (vector space model), each dimension is composed of a feature item and its weight. The weight of a feature item is calculated by the TF-IDF (term frequency-inverse document frequency) method:

w(𝑡_𝑖, 𝑑) = 𝑡𝑓(𝑡𝑖,𝑑)×𝑙𝑜𝑔(

𝑁 𝑛𝑖+0.01)

√∑ 𝑡𝑓(𝑡𝑖,𝑑)×𝑙𝑜𝑔(

𝑁 𝑛𝑖+0.01) 𝑡𝑖∈𝑑

. (6)

[image:5.595.120.472.264.400.2]

where w(𝑡_𝑖, 𝑑) denotes the weight of the feature item 𝑡_𝑖 in document d; 𝑡𝑓(𝑡_𝑖, 𝑑) denotes the term frequency of 𝑡_𝑖 in document d; N is the total number of documents in the training corpus; 𝑛_𝑖 represents the number of documents where the term 𝑡_𝑖 appears. The denominator is a normalization factor. Then the document d is represented as: d = (𝑡₁: 𝑤₁, 𝑡₂: 𝑤₂, ⋯ , 𝑡_𝑚: 𝑤_𝑚), that is, 𝑡₁is the i-th feature item in the document d, 𝑤_𝑖 means its weight and m represents the number of feature items in the vector.

Figure 4. The process of Text Document clustering.

Calculate the Distance between Two Text Documents

The cosine similarity is often used to calculate the similarity between two text documents, while in K-means, we need to calculate the distance between them. It requires a structure that is capable of converting text similarity into distance. The conversion method should satisfy the following two conditions:

1) The greater the similarity of the two text documents, the smaller distance between them; on the contrary, the similarity is smaller, and the distance should be greater.

2) When the similarity between the two text documents is the greatest (the angle of cosine is 1), the distance between them should be 0; on the contrary, when the similarity between them is minimal (the angle of cosine is 0), the distance between them should be positive infinity.

In order to satisfy the above conditions, we use the angle of cosine to calculate the degree of similarity between two text documents [17,18]. Then we transform the raw similarity to its log value to represent the distance between two texts. The logarithm value should be in the range (0, 1). Thus, for given two text vectors 𝑑_𝑖, 𝑑_𝑗:

𝑑𝑖𝑠(𝑑𝑖,𝑑𝑗) = log (𝑠𝑖𝑚(𝑑𝑖, 𝑑𝑗)) = log 𝑑𝑖×𝑑𝑗

|𝑑𝑖|×|𝑑𝑗| . (7)

Text Document Clustering Based on CCIPD

As the K-means randomly select initial centers will lead to local optimum and cannot get stable results, we use CCIPD after calculating the distance between the two text documents. The process of algorithm see Fig.4.

Experiments and Discussion

(6)

[image:6.595.85.511.149.251.2]

from http://www.sogou.com/labs/dl/c.html/, and we selected 5 topics in this data set, including car, information, sports, education, militaries. To compare the performance of different algorithms, we use F-value to evaluate their clustering results. Besides, precision, the rate of recall and the time of algorithm are other factors we considered during experiments.

Table 1. Summary of Index to evaluate different algorithm’s performance.

abbreviation meaning

TP The rate of extracting features

TT The total time of the algorithm

P The precision of the clustering results

R Recall rate

[image:6.595.80.514.292.366.2]

F F-value

Table 2. Performance Comparison of Four algorithms on the data set when TP = 0.5%

algorithm TT[ms] P[%] R[%] F[%]

K-means 112590 71.45 59.87 65.15 Lei’s approach [14] 103923 70.05 60.10 64.69 Jianpei Zhang’s approach [15] 90932 71.67 63.57 67.38

CCIPD 89920 72.01 63.64 68.12

Table 3. Performance Comparison of Four algorithms on the data set when TP = 1%

algorithm TT[ms] P[%] R[%] F[%]

K-means 112348 65.17 68.92 66.99 Lei’s approach [14] 105446 66.10 70.44 68.20 Jianpei Zhang’s approach [15] 92299 71.52 71.49 71.50

CCIPD 91310 71.18 71.77 71.66

Table 2 summarizes the performance of four document clustering algorithms, when the TP is 0.5%. In this case, we can find that the running time of CCIPD and Jianpei’s approach is very similar. Besides, both of them are lower than the other two ways. In view of precision of cluster results, CCIPD is still work well than other results, improving 0.56% compared with K-means. Considering the recall rate, CCIPD improve a lot compared with the original K-means, and also ahead of other two algorithms. In the F-value, CCIPD is higher and standing at 68.12%, while the K-means, Jianpei Zhang’s approach and Lei’s approach is 65.15%, 64.69%, and 67.38% respectively. From the Table III, similarly, the CCIPD has better performance. However, when the TP = 1%, the running time of CCIPD is little higher than Jianpei Zhang’s approach [15]. AS only higher about 11 ms, we think that it is acceptable to improve other performance at the expense of little running time. We can see the result in figure5 and figure 6 more directly.

[image:6.595.86.513.406.479.2]

(7)

[image:7.595.177.417.66.207.2]

Figure 5. Performance Comparison of Four algorithms on the data set when TP = 0.5%.

Figure 6. Performance Comparison of Four algorithms on the data set When TP = 1%.

Conclusion and Future Work

In this paper, we proposed a method that optimally select cluster centers based on finding density peak. Our idea is on the assumption that the cluster center is in high density and each of them keeps relative distance. Compared with other algorithms for solving center initialization problem for K-means, our method CCIPD has less strict conditions and get better performance when processing the text information. In order to demonstrate the applicability of our algorithm in solving the text document clustering, we conduct CCIPD and other three existing algorithms on Chinese document data sets. The experimental results indicate that CCIPD has less running time and higher precision than other algorithms. Furthermore, the F-value of CCIPD is higher than other algorithms. In the future, we will study on the pre-process of document cluster, especially in the similarity of two text documents.

Acknowledgement

The work is supported by National Basic Research and Development Program (No.2013CB329601, 2013CB329604) and National Natural Science Foundation of China (No.61502517, No.61472433, No.61372191, No.61572492).

References

[1] H. Jiawei and M. Kamber, “Data mining: concepts and techniques,” SanFrancisco, CA, itd: Morgan Kaufmann, vol. 5, 2001.

[2] W. S. Sarle, “Algorithms for clustering data,” Technometrics, vol. 32,no. 2, pp. 227-229, 1990.

[3] M. Steinbach, G. Karypis, V. Kumar et al., “A comparison of document clustering techniques,” in KDD workshop on text mining, vol. 400, no. 1. Boston, 2000, pp. 525-526.

0 20 40 60 80 100 120

TT[ms*1000] P[%] R[%] F[%]

[image:7.595.162.438.232.374.2]

(8)

[4] M. B. Cohen, S. Elder, C. Musco, C. Musco, and M. Persu, “Dimensionality reduction for k-means clustering and low rank approximation,” in Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing. ACM, 2015, pp. 163-172.

[5] A. C. Joshi, V. R. Padghan, J. R. Vyawahare, and S. P. Saner, “Enforcing document clustering for forensic analysis using weighted matrix method(wmm),” 2015.

[6] A. Fujita, D. Y. Takahashi, and A. G. Patriota, “A non-parametric method to estimate the number of clusters,” Computational Statistics & Data Analysis, vol. 73, pp. 27-39, 2014.

[7] C. Wang, Y. Song, A. El-Kishky, D. Roth, M. Zhang, and J. Han, “Incorporating world knowledge to document clustering via heterogeneous information networks,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1215-1224.

[8] M. Bernotas, K. Karklius, R. Laurutis, and A. Slotkien˙e, “The peculiarities of the text document representation, using ontology and tagging-based clustering technique,” Information Technology And Control,vol. 36, no. 2, 2015.

[9] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14. Oakland, CA, USA., 1967, pp. 281-297.

[10] G. W. Milligan and M. C. Cooper, “An examination of procedures for determining the number of clusters in a data set,” Psychometrika, Vol. 50, no. 2, pp. 159-179, 1985.

[11] A. Rodriguez and A. Laio, “Clustering by fast search and find of density peaks,” Science, vol. 344, no. 6191, pp. 1492-1496, 2014.

[12] E. W. Forgy, “Cluster analysis of multivariate data: efficiency versus interpretability of classifications,” Biometrics, vol. 21, pp. 768-769, 1965.

[13] S. S. Khan and A. Ahmad, “Cluster center initialization algorithm for k-means clustering,” Pattern recognition letters, vol. 25, no. 11, pp.1293-1302, 2004.

[14] Xiao-Feng and K. Xie, “ k-means,” Journal of Software, vol. 19, no. 7, 2008.

[15] D. Zhuo, J. Yu, F. Gao, and L. Yu, “Text document clustering based on maximium distance of selecting initial centers,” Computer science application’s Research, vol. 31, no. 3, pp. 713-715, 2014.

[16] K. A. Nazeer and M. Sebastian, “Improving the accuracy and efficiency of the k-means clustering algorithm,” in Proceedings of the World Congress on Engineering, vol. 1, 2009, pp. 1-3.

[17] J. Zhao and X. Hu, “Similarity computation based on text categorization”, Microcomputer Applications, vol. 24, no. 12, pp. 46-47, 2008.