A Comparative study on data mining clustering algorithms

Fenil Shah 1, Harsh Doshi 2, Malav Shah 3, Mitchell D’silva 4

1, 2, 3 Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

4 Assistant Professor, Department of IT, Dwarkadas J. Sanghvi College of Engineering, Mumbai, India

ABSTRACT

Clustering algorithms have proved effective and popular in recent times. They are used to group similar data and separate it from dissimilar data. Many organizations use clustering algorithms to extract knowledge from datasets and to generate results that support vital decisions. The main clustering techniques include partition based clustering, hierarchical clustering and density based clustering such as DBSCAN. However, the main concern lies in selecting the type of algorithm to use in a specific situation. This paper gives an overview of these algorithms along with their advantages and disadvantages, and then compares them on a set of parameters that can help in choosing an algorithm for different databases.

Keywords: comparative analysis, clustering algorithms, DBSCAN, hierarchical, partitional based clustering.

I. INTRODUCTION

Data clustering, in its simplest sense, means grouping together data that are similar in their properties or characteristics. The technique gives the dataset a refined structure by segmenting it into a number of groups. These groups may appear discrete or overlapping depending on the dispersion of the data [3]. Clustering is used to analyze the relations between data points based on the similarities they exhibit under different conditions, and then to derive knowledge from them. Data points residing in different clusters are dissimilar to each other. The clustering criterion influences the quality of the clusters. Clustering algorithms can be applied to multivariate datasets that consist of unstructured data or for which there is no prior knowledge. Such datasets can contain different data types and may vary in the number of objects. A variety of clustering algorithms are used on different kinds of datasets, and because these datasets are enormous it is very difficult to estimate which algorithm will suit each kind. This paper describes three widely used types of algorithm: partition based clustering, hierarchical clustering and the density based DBSCAN algorithm [1].

II. PARTITION BASED CLUSTERING ALGORITHM

Partitioning algorithms produce an initial partitioning of a database D into k clusters, where the number of partitions k is specified in advance, and then refine it with an iterative strategy. The most renowned partitioning algorithms are K-means and K-medoid. K-means uses central points, or centroids, to represent clusters, whereas in K-medoid each cluster is represented by one of its data points located near its center. The basic procedure is to determine the number of clusters required and to assign each data point to the cluster closest to it in terms of distance. This yields clusters with a restrictive convex structure. This type of algorithm is used on huge databases with scattered data points to find similarities within the database [2].

A. K-Means

K-means mostly works for numeric data. Each cluster consists of data points that are similar in their characteristics or attributes. Every cluster has one central point, or centroid, and the remaining data points are compared with it on the basis of Euclidean distance. The distance from each point to every centroid is calculated, and the point is assigned to the cluster whose centroid is nearest.

It is an unsupervised learning algorithm, which is generally used when the data is unlabeled. It is an iterative algorithm which assigns the data point to a cluster based on its characteristics.

Algorithm

1. Decide the number of clusters K required by the user.

2. Randomly select the initial values of the K centroids.

3. Assign each data point to the cluster whose centroid is nearest to it.

4. Recalculate the centroid of each cluster as the average of the data points currently assigned to it.

5. Repeat steps 3 and 4 until the cluster memberships and the centroid values remain the same from one iteration to the next [4].
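The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `k_means`, the `max_iter` and `seed` parameters, and the choice of drawing the initial centroids from the data points themselves are assumptions made for the example.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids (an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 3: assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 5: stop once no assignment changes between two iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centroids


points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
```

On this toy dataset the first three points end up in one cluster and the last three in the other; which cluster receives index 0 depends on the random initialization.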

B. K-Medoids

K-medoid is similar to k-means in that it groups the data points into clusters around a reference point. The difference between the two is the choice of that reference point. In k-means, the average of all the data points in a cluster is taken as its centroid. In k-medoid, the reference point is the medoid: an actual data point of the cluster whose total dissimilarity to the other points in the cluster is minimal (for one-dimensional data this is a middle value when the points are arranged in ascending order). It is an iterative algorithm and the steps to generate the clusters are similar to those of k-means. Because the medoid is less affected by extreme values than the average, k-medoid copes better with outliers and noise [3].
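For multi-dimensional data, a common way to make the medoid concrete is to pick the cluster member with the smallest total distance to the other members. The short sketch below assumes that definition together with Euclidean distance; the helper name `medoid` and the sample values are illustrative only. It contrasts the medoid with the average used by k-means when one outlier is present.

```python
import math


def medoid(cluster):
    """Return the member of `cluster` with the smallest total distance to the rest."""
    return min(cluster, key=lambda c: sum(math.dist(c, p) for p in cluster))


# Four points close together plus one far-away outlier.
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (50.0, 50.0)]
mean = tuple(sum(x) / len(cluster) for x in zip(*cluster))

print(medoid(cluster))  # (3.0, 3.0): stays inside the dense group
print(mean)             # (12.0, 12.0): dragged towards the outlier
```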

Advantages of Partition based Clustering

1. It is a scalable and flexible algorithm, as the cluster size changes according to the value of the centroid and the number of clusters required, K.

2. It is effective for datasets with spherical clusters that are well separated.

3. It is more reliable, because objects or data points can leave one cluster and enter another to improve the criterion.

4. It is much more efficient than hierarchical clustering if the dataset is huge.

Disadvantages of Partition based Clustering

1. The value of K is determined by the user, which is not always accurate.

2. Different values of K produce different results.

3. It is ineffective in high dimensional spaces, where the distances between data points grow large and differ little from the average distance between two points.

III. HIERARCHICAL BASED CLUSTERING

Clustering organizes data such that intra-cluster similarity is high and inter-cluster similarity is low. It is a type of unsupervised learning, as it does not rely on any predefined set of classes. Unlike flat clustering methods, hierarchical clustering uses a tree structure to group the data into clusters. It partitions the data recursively and generates clusters using a top down or bottom up approach, which is a way of mining huge datasets [10]. In the bottom up approach, every object starts as a cluster of its own and clusters are merged recursively until the complete tree structure is formed. The two significant types of hierarchical clustering methods are Agglomerative clustering (Bottom up) and Divisive clustering (Top down).

A. Agglomerative Clustering

[...] linkage in the tree structure. The different types of linkage in this type of clustering are complete-linkage, single-linkage and average-linkage, the last of which is based on the mean distance between the clusters.

Algorithm:

Let A = {a1, a2, a3, ..., an} be the set of data points.

1. Start with every data point as its own unconnected cluster at the lowest level, B(0) = 0, and set the sequence number n = 0.

2. Find the pair of clusters (p), (k) with the shortest distance in the current clustering, that is dist[(p),(k)] = min dist[(i),(j)], where the minimum is taken over all pairs of clusters (i), (j) that currently exist.

3. Increment the sequence number n by 1 and merge clusters (p) and (k) into a single new cluster. Set the level of this clustering to B(n) = dist[(p),(k)].

4. Update the distance matrix dist by deleting the rows and columns that belong to clusters (p) and (k) and adding a row and column for the newly formed cluster (p,k). The distance between the new cluster and an existing cluster (g) is defined as dist[(g),(p,k)] = min(dist[(g),(p)], dist[(g),(k)]).

5. If all data points have been unified into a single cluster, stop; otherwise repeat from step 2.
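The procedure above, with the minimum-distance update of step 4 (single linkage), can be sketched as follows. The function name `single_linkage`, the representation of clusters as tuples of point indices and the dictionary used as a distance matrix are assumptions made for this illustration.

```python
import math


def single_linkage(points):
    """Merge clusters bottom up; return the history as (p, k, level) tuples,
    where level corresponds to B(n) in the steps above."""
    # Step 1: every point starts as its own cluster (a tuple of point indices).
    clusters = [(i,) for i in range(len(points))]
    # Distance matrix, keyed by unordered pairs of clusters.
    dist = {
        frozenset((a, b)): math.dist(points[a[0]], points[b[0]])
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    }
    merges = []
    # Step 5: iterate until a single cluster remains.
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters (p), (k).
        pair = min(dist, key=dist.get)
        p, k = sorted(pair)
        level = dist[pair]
        # Step 3: merge them and record the level B(n).
        merged = p + k
        merges.append((p, k, level))
        clusters = [c for c in clusters if c not in (p, k)]
        # Step 4: drop the entries of p and k, add entries for the merged cluster.
        for g in clusters:
            d_gp = dist.pop(frozenset((g, p)))
            d_gk = dist.pop(frozenset((g, k)))
            dist[frozenset((g, merged))] = min(d_gp, d_gk)
        del dist[pair]
        clusters.append(merged)
    return merges


points = [(0, 0), (0, 1), (5, 0), (5, 1.5), (20, 0)]
merges = single_linkage(points)
for p, k, level in merges:
    print(p, k, level)
```

The recorded levels are what a dendrogram would plot on its height axis. Replacing `min(d_gp, d_gk)` with `max(d_gp, d_gk)` would give complete linkage instead.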

IV. DIVISIVE CLUSTERING

Divisive clustering is a type of hierarchical clustering that works in the reverse direction to agglomerative clustering: it uses a top down method. It is an efficient form of hierarchical clustering that is suitable for large datasets, and its lower computational complexity makes it applicable to a wider range of datasets. In this method all the data initially forms one huge cluster. This large cluster is then divided into sub-clusters using heuristics: the divisive clustering function chooses the object with the most dissimilarities within the cluster, thus separating a group of similar objects as one cluster from the dissimilar data objects [9]. Divisive clustering is therefore the recursive splitting of unified clusters into smaller clusters with high intra-cluster similarity. Applying the agglomerative algorithm in reverse gives the divisive algorithm.
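The text does not fix a particular splitting heuristic, so the following is only one possible sketch of a single top down split. It assumes that "most dissimilar" means the largest average Euclidean distance to the rest of the cluster, and that objects move to the new group when they are on average closer to it. The name `split_cluster` is hypothetical.

```python
import math


def split_cluster(cluster):
    """Split one cluster in two: seed a splinter group with the most
    dissimilar object, then move over the objects that sit closer to it."""
    def avg_dist(p, group):
        others = [q for q in group if q is not p]
        if not others:
            return 0.0
        return sum(math.dist(p, q) for q in others) / len(others)

    remaining = list(cluster)
    # The object with the largest average dissimilarity starts the splinter group.
    seed = max(remaining, key=lambda p: avg_dist(p, remaining))
    remaining.remove(seed)
    splinter = [seed]
    moved = True
    while moved and len(remaining) > 1:
        moved = False
        for p in list(remaining):
            # Move p if it is, on average, closer to the splinter group.
            if len(remaining) > 1 and avg_dist(p, splinter) < avg_dist(p, remaining):
                remaining.remove(p)
                splinter.append(p)
                moved = True
    return splinter, remaining


cluster = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
splinter, remaining = split_cluster(cluster)
print(sorted(splinter), sorted(remaining))
```

Applying `split_cluster` recursively to the resulting groups, until each group is small or homogeneous enough, would yield the top down hierarchy.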

Advantages of Hierarchical Clustering

1. Simple and easy to implement.

2. Structured and informative form of representation.

3. Better results using structural classification.

Disadvantages of Hierarchical Clustering

1. A step, once performed, cannot be modified: merges and splits are immutable, as no undo option is available in this type of clustering.

2. Higher computational cost.

3. Lesser flexibility for uncertain databases.

4. Difficulty in handling large clusters, difficulty in identifying clusters from the dendrogram and the high time complexity of the algorithm are further weaknesses of this type of clustering.

V. DENSITY BASED CLUSTERING

Spatial data may contain a mixture of pattern distributions with different densities, together with background noise. Clusters of different densities can be modeled as belonging to point processes with different intensities. Clustering such data is a challenging problem in data mining, and a number of values may have to be calculated in order to differentiate between the density distributions.

A. DBSCAN

DBSCAN has only two input parameters, Eps and MinPts. It begins with a point p, and if p is a core point, a cluster that includes the points in its neighborhood is generated. The problem with this algorithm is that it does not work well with datasets having different densities.

DBSCAN Algorithm

Let P = {p1, p2, p3, ..., pn} be a set of n data points. DBSCAN requires only two parameters: ε (Eps) and MinPts.

1. Start with a random point that has not been visited yet.

2. Calculate the neighborhood of this point using ε.

3. If the neighborhood contains at least the threshold number of points (MinPts), start a cluster at this point and mark the point as visited; otherwise mark the point as noise.

4. If a point is part of a cluster, its ε neighborhood is also part of that cluster, and the above procedure is repeated for all points in the ε neighborhood.

5. Pick another unvisited point and process it in the same way, which leads to the discovery of either a further cluster or noise.

6. Continue until all points are labeled as visited.
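A compact sketch of these steps is given below. It is illustrative only: it visits the points in index order rather than in random order, finds neighbours by brute force, and the names `dbscan`, `neighbours` and `NOISE` are chosen for the example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    def neighbours(i):
        # Step 2: all points within distance eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):  # Steps 1, 5 and 6: visit every unvisited point.
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:  # Step 3: too few neighbours, mark as noise.
            labels[i] = NOISE
            continue
        cluster += 1  # Step 3: point i is a core point, so a new cluster starts.
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:  # Step 4: grow the cluster through eps-neighbourhoods.
            j = queue.pop()
            if labels[j] == NOISE:  # noise reachable from a core point joins as a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # only core points extend the cluster further
                queue.extend(nbrs)
    return labels


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is never passed in; it follows from `eps` and `min_pts`, and the isolated point at (50, 50) is reported as noise.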

B. DBCLUM

DBCLUM works as an extension of the DBSCAN algorithm. Clusters are first formed separately using MinPts and Eps and are then merged. This overcomes the problem of handling datasets with different densities. DBSCAN requires only two parameters [8], but DBCLUM requires three: Threshold, MinPts and Eps. Eps is the radius used to find neighbors, MinPts is the minimum number of points required to form a cluster, and Threshold is the value that decides whether two clusters will be merged or not. The algorithm begins with a point P and calculates the distances of its neighbours with respect to Eps. If more than MinPts points are found, a cluster is formed and labeled [8]; otherwise the point is termed noise. The next step is to visit another point and try to form a cluster. This process is repeated until every point has been assigned either to a cluster or to noise.

Advantages of Density Based Algorithm

1. The number of clusters in the data need not be specified.

2. It is not affected by outliers to a great extent.

3. It can find clusters that do not have a definite shape, that is, clusters of arbitrary shape.

4. It requires few parameters and is largely independent of the ordering of the points.

Disadvantages of Density based Algorithm

1. Its sensitivity to outliers is very low or almost nil.

2. It cannot be used for high-dimensional data.

VI. COMPARATIVE ANALYSIS

| Factors | Partitional based Clustering | Hierarchical Clustering | Density based Clustering |
|---|---|---|---|
| Performance | Excellent performance if number of clusters specified | Low as computational cost is high | Good |
| Huge data and data variations | Easy to handle huge datasets | Can handle huge data efficiently as compared to density based clustering | Cannot handle huge variations in data |
| Handling noisy data and outliers | Inefficient | Efficient | More efficient than partitional based clustering |
| Time consumption | Average time taken to form clusters | More time taken to form clusters as compared to k-means | [...] |
| Accuracy | Low | Higher than partitional based algorithms | Average |
| Time Complexity | O(nkl), where n is the number of patterns, k is the number of clusters and l is the number of iterations taken to converge | O(n^2 log n), where n is the number of patterns | O(m * time taken to find points in the neighbourhood) |
| Space Complexity | O(k+n), and additional storage is required for storing the data | O(n^2) | O(m) |

CONCLUSION

The three main algorithms above have been compared on a set of important factors that need to be considered when performing clustering analysis. The results may vary depending on the type and number of datasets used. However, they hold under most of the conditions specified above.

REFERENCES

[1]. Deepti Sisodia, Lokesh Singh, Sheetal Sisodia, Khushboo Saxena, “Clustering Techniques: A Brief Survey of Different Clustering Algorithms”, pp. 82-87, September 2012.

[2]. Zonghu Wang, Zhijing Liu, Donghui Chen, Kai Tang, “A New Partitioning Based Algorithm For Document Clustering”, 2011 Eighth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), pp. 1741-1745.

[3]. A. Dharmarajan, T. Velmurugan, “Applications of Partition based Clustering Algorithms: A Survey”, 2013 IEEE International Conference on Computational Intelligence and Computing Research.

[4]. Gurpreet Singh, Jaskaranjit Kaur, MD. Yusuf Mulge, “Performance Evaluation of Enhanced Hierarchical and Partitioning Based Clustering Algorithm (EPBCA) in Data Mining”, pp. 805-810.

[5]. M. Ankerst, M. Breunig, H.P. Kriegel, J. Sander, “OPTICS: Ordering points to identify the clustering structure”, Proceedings of 1999 ACM-SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, USA, June 1999, pp. 49-60.

[6]. M. Ester, H.P. Kriegel, J. Sander, X. Xu, “A density-based algorithm for discovering clusters in large spatial databases”, Proceedings of 1996 International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, USA, August 1996, pp. 226-231.

[7]. R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, “Automatic sub-space clustering of high dimensional data for data mining applications”, Proceedings of 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, USA, June 1998, pp. 94-105.

[8]. M. F. Hassanin, M. Hassan, A. Shoeb, “DDBSCAN: Different Densities-Based Spatial Clustering of Applications with Noise”, 2015 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT), Kumaracoil, 2015, pp. 401-404.

[9]. Zahra Nazari, Dongshik Kang, M. Reza Asharif, Yulwan Sung, Seiji Ogawa, “A New Hierarchical Clustering Algorithm”, ICIIBMS 2015, Track 2: Artificial Intelligence, Robotics, and Human-Computer Interaction, Okinawa, Japan, IEEE, 2015, ISBN 978-1-4799-8562-3.