Validity Measure of Cluster Based On the Intra-Cluster and Inter-Cluster Distance


Rahul Malik ¹, Raj Kumar ²

¹ ²Department of CSE, Jind Institute of Engineering and Technology, Jind (Haryana), India

¹M. Tech. Scholar

²Assistant Professor

Abstract- The k-means method has been shown to be effective in producing good clustering results for many practical applications. However, a direct implementation of k-means requires time proportional to the product of the number of patterns and the number of clusters per iteration, which is computationally very expensive, especially for large datasets. The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this paper we present a simple validity measure, based on intra-cluster and inter-cluster distance measures, which allows the number of clusters to be determined automatically. The basic procedure involves producing the segmented dataset for every number of clusters from 2 up to Kmax, where Kmax represents an upper limit on the number of clusters. Our validity measure is then calculated for each clustering, and the best clustering is the one that gives the minimum value of the measure.

Keywords – K-means, Clustering, Dataset.

I. INTRODUCTION

Clustering is a data mining technique that automatically groups objects with similar characteristics into meaningful or useful clusters. To make the concept clearer, consider a clothing showroom. A showroom stocks clothes in a wide range of categories, and the challenge is to arrange them so that customers can pick out several items of a specific type without hassle. Using a clustering technique, we can keep clothes that share some kind of similarity in one cluster, or on one shelf, and label it with a meaningful name. A customer who wants clothes of a particular type then only needs to go to that shelf instead of searching the whole showroom. Clustering problems arise in many different applications, such as data mining and knowledge discovery, data compression and vector quantization, and pattern recognition and pattern classification. The notion of what constitutes a good cluster depends on the application, and there are many methods for finding clusters subject to various criteria, both ad hoc and systematic.

Among clustering formulations that are based on minimizing a formal objective function, perhaps the most widely used and studied is k-means clustering. Given a set of n data points in real d-dimensional space, $\mathbb{R}^d$, and an integer k, the problem is to determine a set of k points in $\mathbb{R}^d$, called centers, so as to minimize the mean squared distance from each data point to its nearest center. This measure is often called the squared-error distortion, and this type of clustering falls into the general category of variance-based clustering. Clustering based on k-means is closely related to a number of other clustering and location problems.

These include the Euclidean k-medians (or the multisource Weber problem) in which the objective is to minimize the sum of distances to the nearest center and the geometric k-center problem in which the objective is to minimize the maximum distance from every point to its closest center. There are no efficient solutions known to any of these problems and some formulations are NP-hard. An asymptotically efficient approximation for the k-means clustering problem has been presented by Matousek, but the large constant factors suggest that it is not a good candidate for practical implementation. One of the most popular heuristics for solving the k-means problem is based on a simple iterative scheme for finding a locally minimal solution. This algorithm is often called the k-means algorithm.
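For comparison, the three objectives described above can be written side by side. The notation is assumed here for illustration and does not appear in the text: $P = \{p_1, \ldots, p_n\} \subset \mathbb{R}^d$ is the point set and $C = \{c_1, \ldots, c_k\}$ is the set of centers.

```latex
\begin{align*}
\text{k-means:}   &\quad \min_{C}\; \frac{1}{n}\sum_{i=1}^{n} \min_{1 \le j \le k} \lVert p_i - c_j \rVert^2 \\
\text{k-medians:} &\quad \min_{C}\; \sum_{i=1}^{n} \min_{1 \le j \le k} \lVert p_i - c_j \rVert \\
\text{k-center:}  &\quad \min_{C}\; \max_{1 \le i \le n} \min_{1 \le j \le k} \lVert p_i - c_j \rVert
\end{align*}
```

The three differ only in how the per-point distances to the nearest center are aggregated: mean of squares, sum, or maximum.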

A. K-means algorithm –

The simplest and most popular among iterative and hill climbing clustering algorithms is the K-means algorithm (KMA). As mentioned above, this algorithm may converge to a suboptimal partition. Since stochastic optimization approaches are good at avoiding convergence to a locally optimal solution, these approaches could be used to find a globally optimal solution. The stochastic approaches used in clustering include those based on simulated annealing, genetic algorithms, evolution strategies and evolutionary programming [1, 2].

Figure 1. K-means clustering with Iris data

The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this paper we present a simple validity measure, based on intra-cluster and inter-cluster distance measures, which allows the number of clusters to be determined automatically. The basic procedure involves producing the segmented dataset for every number of clusters from 2 up to Kmax, where Kmax represents an upper limit on the number of clusters. Our validity measure is then calculated for each clustering, and the best clustering is the one that gives the minimum value of the measure. The validity measure is tested on the LIC dataset, for which the number of clusters is known.
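The selection procedure can be sketched as follows. The text does not give a formula for the validity measure, so this sketch assumes one common choice: intra-cluster distance is the mean squared distance of each point to its cluster center, inter-cluster distance is the smallest squared distance between two centers, and validity is their ratio. The function names (`kmeans`, `validity`, `choose_k`) and the single random start are illustrative assumptions, not the authors' implementation.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means from one random start; returns (centers, labels)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            # An empty cluster keeps its previous center.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    return centers, assign(points, centers)

def validity(points, centers, labels):
    """ASSUMED measure: intra-cluster / inter-cluster distance (smaller is better)."""
    intra = sum(sq_dist(p, centers[l]) for p, l in zip(points, labels)) / len(points)
    inter = min(sq_dist(a, b) for i, a in enumerate(centers) for b in centers[i + 1:])
    return intra / inter if inter > 0 else float("inf")

def choose_k(points, k_max):
    """Cluster for K = 2..k_max and return the K with the minimum validity."""
    scores = {}
    for k in range(2, k_max + 1):
        centers, labels = kmeans(points, k)
        scores[k] = validity(points, centers, labels)
    return min(scores, key=scores.get), scores
```

Because k-means can stop in a local optimum, a practical version would probably run several random starts per K and keep the best clustering before computing the measure.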

B. Introduction to SVM –

Identification of distinct clusters of documents in text collections has traditionally been addressed by assuming that the data instances can only be represented by homogeneous and uniform features. Many real-world datasets, on the other hand, comprise multiple types of heterogeneous interrelated components, such as web pages and hyperlinks, or online scientific publications with their authors and publication venues, to name a few. In this paper, we present K-SVMeans, a clustering algorithm for multi-type interrelated datasets that integrates the well-known K-Means clustering with the highly popular Support Vector Machines. The experimental results on authorship analysis of two real-world web-based datasets show that K-SVMeans can successfully discover topical clusters of documents and achieve better clustering solutions than homogeneous data clustering.

We present K-SVMeans, which clusters datasets with heterogeneous similarity characteristics. K-SVMeans simultaneously clusters along one dimension of the data while learning a classifier in another dimension, which, in turn, affects the intermediate cluster assignment decisions in the original dimension. K-SVMeans clustering is a hybrid clustering solution that merges the well-known K-Means clustering algorithm with Support Vector Machines (SVM), a highly popular supervised learning algorithm that has been shown to be highly effective, especially for text and graph area clustering.

II. PROPOSED ALGORITHM

A. Modifying algorithm –

In this section, we describe the modifying algorithm. As mentioned earlier, the algorithm is based on storing the multidimensional data points in a kd-tree. For completeness, we summarize the basic elements of this data structure.

Define a box to be an axis-aligned hyper-rectangle. The bounding box of a point set is the smallest box containing all the points. A kd-tree is a binary tree which represents a hierarchical subdivision of the point set's bounding box. Each node of the tree is associated with a box, called its cell; the root's cell is the bounding box of the point set. If the cell contains at most one point (or, more generally, fewer than some small constant), then it is declared to be a leaf. Otherwise, it is split into two hyper-rectangles by an axis-orthogonal hyperplane. The points of the cell are then partitioned to one side or the other of this hyperplane. (Points lying on the hyperplane can be placed on either side.) The resulting subcells are the children of the original cell, thus leading to a binary tree structure. There are a number of ways to select the splitting hyperplane. One simple way is to split orthogonally to the longest side of the cell through the median coordinate of the associated points.

Given n points, this produces a tree with O(n) nodes and O(log n) depth. We begin by computing a kd-tree for the given data points. For each internal node u in the tree, we compute the number of associated data points, u.count, and the weighted centroid, u.wgtCent, which is defined to be the vector sum of all the associated points. The actual centroid is just u.wgtCent / u.count. It is easy to modify the kd-tree construction to compute this additional information in the same space and time bounds given above. The initial centers can be chosen by any method desired. (Lloyd's algorithm does not specify how they are to be selected. A common method is to sample the centers at random from the data points.) Recall that, for each stage of Lloyd's algorithm, for each of the k centers, we need to compute the centroid of the set of data points for which this center is closest. We then move this center to the computed centroid and proceed to the next stage. For each node of the kd-tree, we maintain a set of candidate centers. This is defined to be a subset of center points that might serve as the nearest neighbor for some point lying within the associated cell.
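A minimal sketch of such an augmented kd-tree is given below. It assumes the simple splitting rule mentioned earlier (split orthogonally to the longest side of the cell at the median point) and a leaf size of one point; the `Node` class and the names `build`, `centroid`, and `wgt_cent` are illustrative, not taken from the authors' code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    lo: Point                      # lower corner of the cell
    hi: Point                      # upper corner of the cell
    count: int                     # number of associated data points
    wgt_cent: Point                # vector sum of the associated points
    points: List[Point] = field(default_factory=list)  # filled only at leaves
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def bounding_box(points):
    """Smallest axis-aligned box containing all the points."""
    return (tuple(min(c) for c in zip(*points)),
            tuple(max(c) for c in zip(*points)))

def build(points, lo=None, hi=None):
    """Build the tree top-down, storing count and weighted centroid per node."""
    if lo is None:
        lo, hi = bounding_box(points)
    node = Node(lo, hi, len(points), tuple(sum(c) for c in zip(*points)))
    if len(points) <= 1:
        node.points = list(points)          # leaf
        return node
    # Split orthogonally to the longest side, through the median point.
    axis = max(range(len(lo)), key=lambda a: hi[a] - lo[a])
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    split = ordered[mid][axis]
    node.left = build(ordered[:mid], lo, hi[:axis] + (split,) + hi[axis + 1:])
    node.right = build(ordered[mid:], lo[:axis] + (split,) + lo[axis + 1:], hi)
    return node

def centroid(node):
    """Actual centroid: weighted centroid divided by count."""
    return tuple(c / node.count for c in node.wgt_cent)
```

Sorting at every level makes this construction O(n log² n); a linear-time median selection would be needed to match the bounds usually quoted for kd-tree construction.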

The candidate centers for the root consist of all k centers. We then propagate candidates down the tree as follows:

For each node u, let C denote its cell and let Z denote its candidate set. First, compute the candidate $z^* \in Z$ that is closest to the midpoint of C. Then, for each of the remaining candidates $z \in Z \setminus \{z^*\}$, if no part of C is closer to z than it is to $z^*$, we can infer that z is not the nearest center to any data point associated with u and, hence, we can prune, or filter, z from the list of candidates. If u is associated with a single candidate (which must be $z^*$), then $z^*$ is the nearest neighbor of all its data points. We can assign them to $z^*$ by adding the associated weighted centroid and counts to $z^*$. Otherwise, if u is an internal node, we recurse on its children. If u is a leaf node, we compute the distances from its associated data point to all the candidates in Z and assign the data point to its nearest center. It remains to describe how to determine whether there is any part of cell C that is closer to candidate z than to $z^*$. Let H be the hyperplane bisecting the line segment $\overline{z z^*}$. H defines two halfspaces: one that is closer to z and the other to $z^*$. If C lies entirely to one side of H, then it must lie on the side that is closer to $z^*$ (since C's midpoint is closer to $z^*$) and so z may be pruned. To determine which is the case, consider the vector $\vec{u} = z - z^*$, directed from $z^*$ to z.

Let $v(H)$ denote the vertex of C that is extreme in this direction, that is, the vertex of C that maximizes the dot product $(v(H) \cdot \vec{u})$. Thus, z is pruned if and only if $\mathrm{dist}(z, v(H)) \ge \mathrm{dist}(z^*, v(H))$. (Squared distances may be used to avoid taking square roots.) Our implementation differs somewhat from those of Alsabti et al. and Pelleg and Moore.
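The pruning test described above can be sketched in a few lines. The cell is assumed to be given by its lower and upper corners, and the function names (`is_farther`, `filter_candidates`) are illustrative. The extreme vertex is found coordinate by coordinate: take the upper bound wherever the direction from the closest candidate to z is positive, and the lower bound otherwise.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (square roots are not needed for comparisons)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def is_farther(z, z_star, cell_lo, cell_hi):
    """True if no part of the cell is closer to z than to z_star (z can be pruned)."""
    # Vertex of the cell that maximizes the dot product with u = z - z_star.
    v = tuple(hi if zc > sc else lo
              for zc, sc, lo, hi in zip(z, z_star, cell_lo, cell_hi))
    return sq_dist(z, v) >= sq_dist(z_star, v)

def filter_candidates(candidates, cell_lo, cell_hi):
    """Return (z_star, surviving candidates) for one cell."""
    midpoint = tuple((lo + hi) / 2 for lo, hi in zip(cell_lo, cell_hi))
    z_star = min(candidates, key=lambda z: sq_dist(z, midpoint))
    kept = [z for z in candidates
            if z is z_star or not is_farther(z, z_star, cell_lo, cell_hi)]
    return z_star, kept
```

For the unit square with candidates (0.5, 0.5), (5.0, 0.5) and (1.2, 0.5), the distant candidate is pruned, while (1.2, 0.5) survives because the right edge of the cell is closer to it than to the cell's midpoint candidate.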

Alsabti et al.'s implementation of the modifying algorithm uses a less effective pruning method based on computing the minimum and maximum distances to each cell, as opposed to the bisecting hyperplane criterion. Pelleg and Moore's implementation uses the bisecting hyperplane, but they define z_ (called the owner) to be the candidate that minimizes the distance to the cell rather than the midpoint of the cell. Our approach has the advantage that if two candidates lie within the cell, it will select the candidate that is closer to the cell's midpoint.

B. Modifying Extraction algorithm –

Support Vector Machines are well known for their generalization performance and their ability to handle high-dimensional data, which is a common case in document classification problems. Considering the binary classification case, let $((x_1, y_1), \ldots, (x_n, y_n))$ be the training dataset, where the $x_i$ are the feature vectors that represent the observations and $y_i \in \{-1, +1\}$ are the two labels that each observation can be assigned to. From these observations, SVM builds an optimum hyperplane (a linear discriminant in the kernel-transformed higher-dimensional feature space) that separates the two classes by the widest margin, by minimizing the following objective function

$$\min_{w,\, b,\, \xi_i} \; \frac{1}{2} w^T w + C \sum_{i=1}^{N} \xi_i \qquad (1)$$

where w is the normal vector of the hyperplane, b is the offset, $y_i$ are the labels and $\xi_i$ are the slack variables that permit the non-separable case by allowing misclassification of training instances.
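As a small worked example, the objective in (1) can be evaluated for a fixed linear hyperplane. The text does not state the constraints, so this sketch assumes the standard soft-margin ones, $y_i(w^T x_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, under which the smallest feasible slack is $\max(0, 1 - y_i(w^T x_i + b))$. A linear kernel is also assumed, and the function name is illustrative.

```python
def soft_margin_objective(w, b, X, y, C):
    """Value of 0.5 * w.w + C * sum(slacks) for a fixed linear hyperplane (w, b).

    ASSUMPTION: slacks are the smallest values satisfying the standard
    soft-margin constraints, i.e. the hinge loss of each training instance.
    """
    def dot(a, c):
        return sum(p * q for p, q in zip(a, c))

    slacks = [max(0.0, 1.0 - yi * (dot(w, xi) + b)) for xi, yi in zip(X, y)]
    return 0.5 * dot(w, w) + C * sum(slacks), slacks
```

With w = (1, 0), b = 0 and C = 1, the points (2, 0) and (-2, 0) with labels +1 and -1 lie outside the margin and need no slack, while the point (0.5, 0) with label +1 lies inside the margin and contributes a slack of 0.5.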


We start with a brief overview of traditional K-Means. Given n data objects $x_1, x_2, \ldots, x_n$, $\forall x_i \in \mathbb{R}^w$, where w is the size of the feature space and each $x_i$ is normalized such that $\lVert x_i \rVert = 1$, K-Means partitions the $x_i$ into k disjoint clusters $\pi_1, \pi_2, \ldots, \pi_k$, so that

$$\bigcup_{i=1}^{k} \pi_i = \{x_1, x_2, \ldots, x_n\}, \quad \text{where } \pi_i \cap \pi_j = \emptyset,\ i \neq j,$$

where the centroid $c_i$ of each cluster $\pi_i$ is defined, up to a normalizing factor, by

$$c_i \propto \sum_{x_k \in \pi_i} x_k$$

The goal of the clusterer is to maximize the similarity between the data objects and their assigned clusters; hence, the objective function becomes

$$\max Q = \sum_{j=1}^{k} \sum_{x_i \in \pi_j} x_i^T c_j \qquad (5)$$

K-Means optimizes the objective function iteratively by following two steps: A cluster assignment step, where each data object is assigned to a cluster with the closest centroid, followed by a cluster centroid update step.

The algorithm terminates when the change in the objective function value between two successive iterations is below a given threshold. Upon the termination of the algorithm, each data object belongs to one of the k clusters.
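The two steps and the stopping rule can be sketched as follows for unit-length vectors with dot-product similarity. The centroid update used here (the normalized vector sum of the members) and the names `kmeans_dot` and `tol` are assumptions made for illustration; the initial centers are supplied by the caller.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v) if n > 0 else tuple(v)

def kmeans_dot(points, centers, tol=1e-9, max_iter=100):
    """K-means on unit vectors; returns (labels, centers, objective Q)."""
    points = [normalize(p) for p in points]
    centers = [normalize(c) for c in centers]
    labels, q, prev_q = [], 0.0, -math.inf
    for _ in range(max_iter):
        # Assignment step: each object goes to the most similar centroid.
        labels = [max(range(len(centers)), key=lambda j: dot(p, centers[j]))
                  for p in points]
        q = sum(dot(p, centers[l]) for p, l in zip(points, labels))
        # Stop when the objective changes by less than the threshold.
        if q - prev_q < tol:
            break
        prev_q = q
        # Update step (ASSUMED form): normalized sum of the cluster members.
        for j in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = normalize(tuple(sum(c) for c in zip(*members)))
    return labels, centers, q
```

For example, starting from centers (1, 0) and (0, 1), the points (1, 0), (0.9, 0.1), (0, 1) and (0.1, 0.9) are split into two clusters of two points each.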

This partitioning, however, is done on a single dimension.

Consider that the instances in the set X = (x1, x2, · · · , xn), for which we want to obtain a clustering solution, are related to another set U = (u1, u2, · · · , um) in some way. Each xi can be related to one or multiple uj's in an X → U mapping, where objects in U denote a unique property of xi. The reverse map U → X lets us represent each u as a mixture of the xi's that are connected to it. Let T denote the relationship matrix, where Tij = 1 if xi is related to uj, and zero otherwise.
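As a small illustration, the relationship matrix T and the reverse map can be built from a list of (i, j) pairs. The pair-list input format and the function names are assumptions made for this sketch; indices are zero-based.

```python
def relation_matrix(relations, n, m):
    """n-by-m 0/1 matrix T with T[i][j] = 1 if x_i is related to u_j."""
    T = [[0] * m for _ in range(n)]
    for i, j in relations:
        T[i][j] = 1
    return T

def related_objects(T, j):
    """Reverse map U -> X: indices of the x_i that are connected to u_j."""
    return [i for i, row in enumerate(T) if row[j] == 1]
```

For three documents and two authors, with document 1 written by both authors, `relation_matrix([(0, 0), (1, 0), (1, 1), (2, 1)], 3, 2)` gives the rows [1, 0], [1, 1], [0, 1], and `related_objects(T, 0)` gives [0, 1].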

During the clustering process, the intermediate cluster assignments in K-SVMeans are determined by two conditions. In the first condition, a data object xi is reassigned from a cluster πi to πj if xi is closer to πj's centroid than to πi's centroid, and the u's of xi are classified into the positive class by πj's SVM and into the negative class by πi's SVM.


K-SVMeans Cluster Assignment Definitions:

$x_i$: objects to be clustered

$d_{ij}$: distance of object $x_i$ to cluster $\pi_j$

$m(i)$: assigned cluster of $x_i$

$l(\pi_i)$: SVM learner of cluster $\pi_i$

$\hat{y}(u, \pi) = \sum_{z=1}^{n} \alpha_z^{\pi} K^{\pi}(u, u_z^{\pi}) + b^{\pi}$: SVM decision value of u for cluster $\pi$

$\lambda$: penalty term
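The decision value and the first reassignment condition can be sketched as below. The coefficients are assumed to be signed (each already multiplied by the label of its support vector), the kernel is passed in as a function, and the condition is read as requiring every related u to satisfy it. These choices and all names are assumptions for illustration; the second condition and the role of the penalty term are not covered.

```python
def decision_value(u, support_vectors, alphas, bias, kernel):
    """SVM decision value: sum_z alpha_z * K(u, u_z) + b for one cluster's SVM."""
    return sum(a * kernel(u, s) for a, s in zip(alphas, support_vectors)) + bias

def should_reassign(d_cur, d_new, related_us, svm_cur, svm_new):
    """First condition: move x from its current cluster to the new one.

    d_cur, d_new : distances of x to the current and the new centroid
    related_us   : the u's related to x
    svm_cur/new  : callables returning the decision value of u for each cluster
    """
    closer = d_new < d_cur
    positive_in_new = all(svm_new(u) > 0 for u in related_us)
    negative_in_cur = all(svm_cur(u) < 0 for u in related_us)
    return closer and positive_in_new and negative_in_cur

def linear_kernel(a, b):
    return sum(x * y for x, y in zip(a, b))
```

With support vectors (1, 0) and (0, 1), coefficients 1 and -1, zero bias and the linear kernel, the decision value of u = (2, 1) is 2 - 1 = 1, so u falls in the positive class of that cluster's SVM.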

III. EXPERIMENT AND RESULT

Working on the MATLAB software platform [3, 4], our first task is to load the dataset. We select the following dataset, on which we want to run the algorithm. The columns of the dataset are distinguished by colour, as described below:

• This colour shows the Policy Number in Dataset.

• This colour shows the Initial Name in Dataset.

• This colour shows the Premium Amount in Dataset.

• This colour shows the Age in Dataset.

• This colour shows the Agent Code in Dataset.

The number of clusters that can be formed is shown in Figure 2:


[Figure 2: screenshot with panels labelled "Identified clusters for the clustering" and "Data set chosen and cluster formed"]


Clicking the Start button yields the 5 clusters, which group similar data into different groups, as shown in Figure 3.

IV. CONCLUSION

From all the above calculations we conclude that the K-means algorithm is an excellent algorithm when dealing with small or medium-sized data. It provides a good performance vector every time. However, a direct implementation of k-means requires time proportional to the product of the number of patterns and the number of clusters per iteration, which is computationally very expensive, especially for large datasets. The main disadvantage of the k-means algorithm is that the number of clusters, K, must be supplied as a parameter. In this paper we presented a simple validity measure, based on intra-cluster and inter-cluster distance measures, which allows the number of clusters to be determined automatically. The basic procedure involves producing the segmented dataset for every number of clusters from 2 up to Kmax, where Kmax represents an upper limit on the number of clusters. Our validity measure is then calculated for each clustering, and the best clustering is the one that gives the minimum value of the measure. The validity measure was tested on the LIC dataset, for which the number of clusters is known.

V. REFERENCES

[1] A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, "A support vector clustering method", in International Conference on Pattern Recognition, 2000.

[2] F. Bonchi and C. Lucchese, "On condensed representations of constrained frequent patterns", Knowl Inf Syst 9(2):180–201, 2006.

[3] B. R. Hunt, R. L. Lipsman, and J. M. Rosenberg (with K. R. Coombes, J. E. Osborn, and G. J. Stuck), "A Guide to MATLAB: for beginners and experienced users", Cambridge University Press, 2001.

[4] http://www.mathworks.com
