Density Based Balanced Clustering: A recursive approach

(1)

1 | P a g e

Density Based Balanced Clustering: A recursive approach

Sonam Gupta

¹

, Dr. Amit Sharma

²

1

Mtech Scholar Computer Science & Engineering Department Vedant College of Engineering and Technology Bundi, Rajasthan, India

2 Professor, Computer Science & Engineering Department, Vedant College of Engineering and Technology, Bundi, Rajasthan, India

ABSTRACT-

Density based clustering has drawn much attention due to its ability to recognize irregularly shaped clusters.

But introducing the concept of balanced clusters has not been attempted in a density based clustering. This paper proposes a divide and conquer approach to output clusters which are balanced in terms of population and density. Two new metric are proposed to measure the extent of balance in the output clustering. All concepts of popular density based clustering are used for the purpose.

Keywords: Clustering, Density-based clustering, balance, DBSCAN, divide and conquer strategy

I. INTRODUCTION

Clustering is a widely used data mining approach when it comes to handling large bulk of data for analysis, searching, sorting, storing and updating purposes. Many clustering algorithms partition a given dataset into groups or clusters of different sizes. However, certain applications like image segmentation, load balancing in WSNs, logistics planning, content-based image browsing and more pose an additional requirement of having balanced clusters. Balance can be achieved in different aspects: size of each cluster, amount of object space assigned to each cluster, density of each cluster, or in terms of variance or frequencies of data points in each cluster. There various approaches to clustering datasets: density-based [1], grid-based [2], partition-based [3], hierarchy-based [4], subspace clustering [5] and more. Among these, density-based clustering is most appropriate for balanced output because the concept of density is closer to the inherent concept of grouping, that is, a high dense area in the object space is naturally a cluster. For measurement of density and connectivity, local distribution of nearest neighbors is considered.

Various approaches of density based clustering differ in their definition of density. A common characteristic is that density is a quantity which indicates that how many other objects may be similar to a given object.

Simple definitions of density appears in the work of Ester et al [1], called DBSCAN. Though definition is strong, the corresponding clustering method is not effective. DDC by Rodriguez and Laio [6] used these definitions of density to provide a proper clustering approach. But later, Liang and Chen [7] indicated that DDC might suffer from decision graph fraud problem. Hence, 3DC, a divide and conquer approach was proposed in [7] to alleviate the problem. It can output both complete and partial clustering. A different definition of density inspired from field theory of physics is proposed in [8]. The clustering results of this algorithm are better than

(2)

2 | P a g e DDC though the clustering methods are very similar. The concepts of DBSCAN, DDC and the divide and conquer strategy of 3DC algorithm are used to achieve balanced clusters in the paper.

The paper can be organized as follows. Section II discusses the background of the proposal explaining the different concepts considered. Section III discusses the proposed density-based balanced clustering algorithm.

Section IV discusses the experimental results on some synthetically generated datasets. Section V discusses the comparison results in terms of two newly proposed performance metrics.

II. BACKGROUND

A. Concepts from DBSCAN

The term density in DBSCAN [1] refers to the minimum number of points (𝑀𝑖𝑛𝑃𝑡𝑠) within a specified radius(𝐸𝑝𝑠), also referred to as core-points while the others are the border points. For a point 𝑎 to be categorized as a core point among the dataset 𝐷 of 𝑛 points, it should follow the following equation.

𝑁𝐸𝑝𝑠 𝑎 = 𝑏 ∈ 𝐷 |𝑑𝑖𝑠𝑡(𝑎, 𝑏) ≤ 𝐸𝑝𝑠 (1)

A directly density reachable point (say b) is a point falling within the 𝐸𝑝𝑠 region of some core point (say a). the density reachability can be equated through the following equations

1. 𝑏 ∈ 𝑁𝐸𝑝𝑠 𝑎 (2) 2. 𝑁_𝐸𝑝𝑠(𝑎) ≥ 𝑀𝑖𝑛𝑃𝑡𝑠 (3)

For a chain of points, 𝑝_𝑖, … , 𝑝_𝑛, where 𝑝_𝑖 = 𝑎 and 𝑝_𝑛 = 𝑏, density reachability between points a and b is possible if 𝑝𝑖+1 is directly density reachable from point 𝑝𝑖.

B. Concepts from DDC

Rodriguez and Liao [6] have proposed a Delta-Density based Clustering (DDC) technique where density is not a property of cluster in this context, rather density is a value associated with each data point.

The “cut off kernel” density represented through the symbol ′𝜌^′defines the density according to the DDC framework and can be denoted through the equation

𝜌𝑖 = 𝜒 𝑑𝑖𝑗 − 𝑑𝑐 𝑗

(4)

Where 𝑖 and 𝑗 are two samples and 𝑑_𝑖𝑗 is the distance between them. 𝑑_𝑐 represents the cut off distance scale and is the only user parameter in the algorithm whose value has to be decided through expert knowledge and 𝜒 is an activation function whose value turns 1 when 𝜒 𝑥 = 0 and 0 when 𝜒 𝑥 = 1.

The parameter 𝛿 denotes the minimum distance between a sample i and other samples of higher density, denoted by the equation

𝛿𝑖 = min

𝑗 :𝑝_𝑗>𝑝_𝑖 𝑑𝑖𝑗 (5) Taking a sample with highest density, the parameter Delta (δ) can be denoted as

(3)

3 | P a g e 𝛿𝑖= max

𝑗 𝑑𝑖𝑗 (6)

The term “Border Density” for a cluster denotes the highest average density that a sample lying in the border region and its neighbor from another cluster can have. Denoted through the symbol 𝑏𝑑, it is calculated in the border region consisting of a set of samples lying at a distance 𝑑𝑐 from other clusters.

Halo Assignation refers to the partial clustering step of the DDC algorithm where the samples with density less than the border density of their clusters are counted as outliers and do not get assigned to any of the clusters.

For the assessment of the anomalous large density-delta examples, which served as a challenge for the DDC algorithm, a parameter Gamma (γ) is used. The value of Gamma for an object is product of its cut-off kernel density and delta distance. A large value of gamma then would imply large values for both quantities. The same can be denoted as

𝛾_𝑖 = 𝜌_𝑖. 𝛿𝑖 (7)

C. 3DC Algorithm

3DC clustering is proposed by Liang and Chen [7]. The idea behind Liang and Chen’s proposal is to borrow definitions of direct density reachability and density reachability from DBSCAN and apply the concept to decide whether to further partition a cluster or not. The former algorithm is listed in Fig 1.

III. PROPOSED DENSITY BASED BALANCED CLUSTERING

A. Development of Idea

The basic idea behind balancing the clusters is that whenever a cluster is unbalanced, it can be split into smaller clusters till all are balanced. This process can be implemented as a recursive process, a divide-and-conquer approach similar to 3DC. The criteria for deciding whether a cluster is balanced or not is one of the following:

The entire population of the data objects should be evenly distributed among the clusters to form a balanced output. Since, exactly equal number of objects in each cluster is not a practical option, it is better to have almost equal members in clusters. This “almost equal” is numerically taken in this dissertation as following

The density of each cluster should be equal. This criterion implies that object space is equally divided among the clusters as per their requirement assessed by the number of objects in it.

Algorithm: 3DC Clustering Input: Data, 𝑑𝑐

Output: Result obtained through clustering Phase I

Step 1: Gamma value and decision graph of the data are calculated.

Phase II

Step 1: Gamma value and decision graph of the subset are calculated.

Step 2: For each subset, 𝑥1, 𝑥2 with the largest gamma values are selected as centres Step 3: If 𝑥₂ is density reachable from 𝑥₁ with respect

(4)

4 | P a g e Step 2: Two samples with the largest gamma values

are selected as centres

Step 3: The rest of the samples are assigned to these two centres

Step 4: Partitioning of data into subsets and calculation of border densities is done

to 𝑑𝑐 and the border density, then Step 4: Output the current subset as a new cluster Step 5: Else

Step 6: Assignment of the remaining samples to the two clusters is done.

Step 7: Partition of the subset data into tiny subsets and computation of the border densities is done.

Step 8: Repeat phase II for all subsets until all the clusters have been decided.

An analogue of density of matter in physics is not required here. The concept of density as computed for 3DC can be directly used. Again density of all clusters cannot be exactly equal, rather equal within a margin.

Thus, a balanced clustering problem imposes a desired population per cluster. Also, it may require an objective function to be described in terms of cluster densities. Hence, the proposed method uses a quantity called balancing parameter which regulates the number of objects in a cluster to produce a balance of population and density.

B. Balancing Parameter

The proposed algorithm takes two parameters as input. First is the cut-off distance 𝑑𝑐which is used to compute cut-off kernel densities, as in 3DC. Second parameter is introduced for this proposed work. It is called balancing parameter because it is used to control the degree of uniform distribution of objects among the clusters. Let β denote the balancing parameter. The maximum number of objects in a cluster is then computed as

𝑀𝑎𝑥 − 𝑝𝑜𝑝𝑢𝑙𝑎𝑡𝑖𝑜𝑛 = ^𝑁_𝛽+ 0.01 ∗ 𝑁 (8)

The value of β needs to be judged through a verification sample by expert. It is a user- defined parameter for the proposed algorithm.

C. Proposed Algorithm

The proposed algorithm is a divide-and-conquer approach. Each cluster at any instant before it is declared as final is treated as an independent object space. The values of gamma for all members of such temporary cluster are used to select two cluster peaks. If these cluster peaks are reachable from each other or the cluster has lesser objects than Max-population, then the cluster is declared as final cluster. In case any of the condition is violated, the cluster is divided further. The algorithm begins with entire dataset as single temporary cluster and divides it into sub-clusters. Every time a group is declared as final cluster, the divide step of the approach stops there for that region of object space. All final clusters are part of output. The formal algorithm is given in Algorithm

(5)

5 | P a g e IV. EXPERIMENTAL RESULTS

A. Criteria for Performance Evaluation

There are no standard metrics to judge an algorithm’s performance for balancing the cluster output. Since the focus of this paper is balanced clustering, proper metrics should be developed so that the algorithm can be evaluated and compared against other works. Considering the concepts of density used in the proposed method and the concept of balance assumed for the proposed work, following performance metrics can be designed:

i. Standard Deviation of Cluster Densities (SDCD) ii. Standard Deviation of Cluster Populations (SDCP)

Prior to defining these metrics, the term “cluster density” needs to be defined. Cluster density, analogous to density of matter in physics, should represent how compact the data objects of a cluster are within the entire object space. Since, we already have used such parameter for clustering purposes, the cut-off kernel density, defined for every data object, it can be simply extended to define cluster density as

Δ_C = 𝑎𝑣𝑔 𝜌_𝑖 , ∀𝑖 ∈ 𝐶 (9) That is, cluster density is the average kernel density of its member data objects.

The definition of SDCD follows immediately. It is the standard deviation of cluster densities of all clusters output from the algorithm. Mathematically,

𝑆𝐷𝐶𝐷 = 𝑠𝑡𝑑 ∆_𝑐 , ∀𝐶 (10) Algorithm: Density-Based Balanced Clustering

Input: Data, 𝑑𝑐, β

Output: Result obtained through clustering Phase I

Step 1: Max-population is computed using (1)

Step 2: Gamma values and decision graph of the data are calculated.

Step 3: Two samples with the largest gamma values are selected as peaks

Step 4: The rest of the samples are assigned to these two peaks to form two temporary subsets Step 5: Execute Phase II for each subset separately

Phase II

Step 1: Gamma value and decision graph of the subset are calculated.

Step 2: For each subset, 𝑥1, 𝑥2 with the largest gamma values are selected as peaks

Step 3: If 𝑥2 is density reachable from 𝑥1 with respect to 𝑑_𝑐 and the border density, OR number of members in the subset is less than Max- population then

Step 4: Output the current subset as a new cluster Step 5: Else

Step 6: Assignment of the remaining samples to the two peaks is done to partition current subset into two subsets.

Step 7: Repeat phase II for all subsets until all the clusters have been decided.

(6)

6 | P a g e A low value of SDCD indicates that densities of all clusters are almost equal, hence a balanced cluster structure is output. For the algorithms which do not output balanced clusters the value SDCD is high.

Population of a cluster is simply the number of members in the cluster. A perspective of balance can be to have clusters with similar population. In this context, standard deviation of population of clusters will give a direct indication. That is, if SDCP is high the clusters are not balanced; if SDCP is low the clusters are balanced.

SDCP is the standard deviation of population of all clusters in the output. Mathematically, it is defined as 𝑆𝐷𝐶𝑃 = 𝑠𝑡𝑑 𝐶 , ∀𝐶 (11)

Where 𝐶 is the number of members in cluster C.

B. Results

The results of experiments recorded are in the form of population of each cluster, cluster densities and time taken for execution. Then these results are processed to compute SDCD and SDCP. A variety of synthetically generated datasets have been considered for experiments [9]. Value of the balancing parameter 𝛽 is decided according to the ground truth clusters of the dataset. Dataset description and values of parameters are listed in Table I.

TABLE I

VALUES OF PARAMETERS TAKEN IN EXPERIMENTS FOR DIFFERENT DATASETS

Dataset Instances Ground-truth

clusters

𝒅𝒄 𝜷

Aggregation 788 7 2 4

Compound 399 6 2 4

Flame 240 2 3 2

Jain 373 2 2 4

Touching 73 2 2 2

Unbalanced 6500 8 5000 3

D31 3160 13 6 4

(7)

7 | P a g e Fig. 3 Results on datasets using (A) 3DC algorithm and (B) Proposed Algorithm

(8)

8 | P a g e On the listed values of parameters 𝑑𝑐 and β, the 3DC and the proposed algorithms are run for the different datasets. Fig. 3 give proper illustration of the cluster formation using the two algorithms. In the figure, graph with label A denote the formation by 3DC algorithm. B label represents output using the proposed algorithm.

For touching and Unbalanced dataset, same cluster formation is obtained by both the algorithms. A higher number of clusters in case of proposed algorithm in some datasets indicate further partition of larger clusters for balancing population and density as desired.

V. COMPARISON OF PERFORMANCE

The proposed metrics of performance – SDCD and SDCP – are computed for both the algorithms. The comparison is shown through column charts. Fig 4 shows the comparison, dataset wise, between the proposed and 3DC algorithms for values of SDCP. Since populations are too varied among the datasets, Fig 5 shows the difference in SDCP of both the algorithms. It clearly indicates that both algorithms perform equally for Touching and Unbalanced dataset. The performance of the proposed algorithm is much better than 3DC in other datasets, and is more emphasized as the size of dataset becomes large. D31 is largest dataset and records a significant improvement in performance by the proposed algorithm over 3DC.

Values of SDCD for both algorithms obtained over different datasets are shown comparatively as column chart in Fig 6. A lower value of SDCD indicates better performance. The proposed algorithm performs better than 3DC over most of the datasets.

Fig. 4 Comparing Standard Deviation of Cluster Populations in different datasets produced by proposed algorithm and 3DC

0 50 100 150 200 250 300 350 400 450 500

3DC Proposed Standard Deviation of Cluster Population

(9)

9 | P a g e Fig. 5 Difference in SDCP of 3DC and proposed algorithms for all experiments

Fig. 6 Comparing Standard Deviation of Cluster Densities in different datasets produced by proposed algorithm and 3DC

VIII. CONCLUSION

The most widely used approach of all the clustering approaches is the density-based approach. The reason for considering this aspect for clustering is that the density of clusters corresponds to the coherence within the objects of a cluster so a high dense region will almost coincide with the naturally occurring cluster. However, not much of the proposals concentrate on achieving balance in the clusters formed. In our paper, A divide-and- conquer approach for balanced clustering is proposed. Balanced clusters in terms of population and density can be produced by setting two parameters of the proposed parameterized balanced density-based clustering.

0 50 100 150 200 250 300 350 400

Difference in SDCP

0 1 2 3 4 5 6 7 8 9 10

3DC Proposed Standard Deviation of Cluster Densities

(10)

10 | P a g e Suggestions for assessing the values of parameters to be set according to particular requirements of dataset are provided. Two new performance metrics – namely, SDCD and SDCP – are proposed that are helpful to judge the outcomes of a clustering algorithm for the degree of balance among clusters. The proposed clustering algorithm is a variant of 3DC. Experiments including a variety of datasets are designed and results recorded to demonstrate that the proposed algorithm produces much balanced cluster structure as compared to 3DC. The divide-and-conquer approach saves time but produces balanced output region-wise. That is, if the dataset inherently has few subregions, the output will be balanced within a subregion only. This is an advantage as it follows the naturally occurring patterns closely. Yet, if application requires a more uniform and consistent balance in the cluster structure a straight forward approach need to be designed. Making the proposed algorithm parameter free is an aspect worth researching further. A more generalized form is always welcome.

REFERENCES

[1] M. Ester, H.P. Kriegel. J. Sander and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise”, Proceedings of the 2nd ACM SIGKDD Conference, pp. 226-231, 1996.

[2] W. Wang, J. Jang and R. Muntz, “STING: a statistical information grid approach to spatial data mining”, Proceedings of the 23^rd Conference on VLDB, pp. 186-195, 1997.

[3] S. P. Lloyd, “Least squares quantization in pcm”, IEEE Trans. Inf. Theory, Vol. 28, No. 2, pp. 129–137, 1982.

[4] L. Kaufman and P. Rousseeuw, “Finding Groups in Data: An Introduction to Cluster Analysis”, John Wiley and Sons, New York, 1990.

[5] C. C. Aggarwal, A. Hinneburg and D. A. Keim, “On the surprising behavior of distance metrics in high dimensional space”, IBM Research report, RC 21739, 2000.

[6] A. Rodriguez and A. Laio, “Clustering by fast search and ﬁnd of density peaks”, Science, Vol. 344, pp.

1492–1496, 2014.

[7] Z. Liang and P.Chen, Delta-Density based clustering with a Divide-and-Conquer strategy: 3DC clustering, Pattern Recognition Letters, Vol. 73, Issue C, pp. 52-59, 2016.

[8] Chen, J. Y., & He, H. H. (2016). A fast density-based data stream clustering algorithm with cluster centers self-determined for mixed data. Information Sciences, 345, 271-293.

[9] D. Saini and M. Singh, “Variance Based Clustering For Balanced Clusters In Growing Datasets”, Proceedings of the International Congress on Information and Communication Technology (ICICT-2015), Udaipur, 2015.

[10] M. H. Bateni, A. Bhaskara, S. Lattanzi and V. Mirrokni, “Distributed Balanced Clustering via Mapping Coresets”, Advances in Neural Information Processing Systems(NIPS 2014), Vol 27, 2014.

[11] A. Hinneburg. and D. Keim, “An efficient approach to clustering large multimedia databases with noise”, Proceedings of the 4th ACM SIGKDD Conference, pp. 58-65, 1998.

(11)

11 | P a g e [12] M. Ester, H.P. Kriegel. J. Sander and X. Xu, “A density-based algorithm for discovering clusters in large

spatial databases with noise”, Proceedings of the 2nd ACM SIGKDD Conference, pp. 226-231, 1996.

[13] M. Ankerst, M. Breunig, H.P. Kriegel and J. Sander, “OPTICS: Ordering points to identify clustering structure”, Proceedings of the ACM SIGMOD Conference, pp. 49-60, 1999.

[14] A. Rodriguez and A. Laio, “Clustering by fast search and ﬁnd of density peaks”, Science, Vol. 344, pp.

1492–1496, 2014.

[15] X. Xu., M. Ester, H. P. Kriegel and J. Sander, “A distribution-based clustering algorithm for mining in large spatial databases”, Proceedings of 14th ICDE, pp. 324-331, 1998.

[16] Available at https://cs.joensuu.fi/sipu/datasets/