Evolving Efficient Clustering Patterns in Liver Patient Data through Data Mining Techniques

(1)

Evolving Efficient Clustering Patterns in Liver Patient

Data through Data Mining Techniques

Pankaj Saxena

Reader

RBS Management Technical Campus, Agra

Vineeta Singh

Professor,ISS

DR. B. R. Ambedkar University, Agra

Sushma Lehri

Professor, IET

DR. B. R. Ambedkar University, Agra

ABSTRACT

Clustering is one of the most important research areas in the field of data mining. In simple words, clustering is a division of data into different groups. Data are grouped into clusters in such a way that data of the same group are similar and those in other groups are dissimilar. It aims to minimize intra-class similarity while to maximize interclass dissimilarity. Clustering is an unsupervised learning technique. Clustering is useful to obtain interesting patterns and structures from a large set of data. Clustering can be applied in many areas, such as marketing studies, DNA analysis, city planning, text mining, and web documents classification. Large datasets with many attributes make the task of clustering complex. Many methods have been developed to deal with these problems. In this paper, two well known partitioning based methods – k-means and k-medoids are compared over health data. This paper also proposes an improved k-means medoids clustering algorithm. The proposed algorithm is evaluated using the health dataset i.e Liver dataset and compare the results with other previous algorithms. The proposed algorithm is more effective in terms of computation time as compared to K means and K-medoids clustering algorithm. The algorithms under consideration, is evaluated with Rand Index, Jaccard Coefficient, Folkes and Mallows and Run Time as four metrics. Experimental results are obtained on WEKA, a data mining tool.

General Terms

Clusterization, k-means, k-medoids,WEKA Tool

Keywords

Rand index (RI), Jaccard Coefficient, Folkes and Mallows (FM) index, Silhouette Index

1. INTRODUCTION

Clustering is a division of data objects into groups of similar objects. Such groups are called clusters. Objects possessed by same cluster tend to be similar, while dissimilar objects are possessed by different clusters.These clusters represent groups of data and provide simplification by representing many data objects by fewer clusters. And, this helps to model data by its clusters. Clustering is a method of unsupervised learning and a well known technique for statistical data analysis. It is used in many fields such as machine learning, image analysis, pattern recognition, outlier detection, and health informatics. Various researchers have proposed different methods to achieve clustering. Along with managing a very large dataset, a robust clustering method must satisfy some requirements such as scalability, dealing different types of attributes, discovering clusters of arbitrary shape, high dimensionality,

ability to deal with noise and outliers, interpretability and usability. With clustering, time complexity increases with large number of dimensions and large set of data objects. Also the effectiveness depends upon the definition of similarity (or dissimilarity) among objects. Along with this, the output of clustering can be interpreted in different ways. Different clustering methods can be classified into various categories such as partitioning based methods, hierarchical methods, grid-based methods, density-based methods, model-based methods, methods for high dimensional data and constraint-based clustering. Among all these methods, this paper is aimed to explore two methods – k-means and k-medoids (PAM) – which are partitioning based clustering methods and proposes new Efficient Distance Based Improved K- Medoids Clustering algorithm.

The rest of the paper is organised as follows: section 2 gives related work in clustering. Section 3 includes the description of used dataset. Section 4 describes the proposed Improved K- Medoids Clustering Algorithm. Section 5 deals with performance comparison of various previous algorithms with some simulation results.

2. LITERATURE REVIEW

K-Means[2][5][6] is one of the simplest unsupervised non-hierarchical learning methods among all partitioning based clustering methods. It classifies[7][15] a given set of n data objects in k clusters, where k is the number of desired clusters and it is required in advance. But K-Means has some limitations like, it is applicable only when the mean of a cluster is defined, not applicable to categorical data. It is unable to handle noisy data and outliers. To solve this problem, k-medoids clustering method [4][11]is used where representive objects are called medoids instead of centroids because it is based on most centrally located object in a cluster. But it is relatively more costly as its complexity is O( i k (n-k)2), where i is the total number of iterations, k is the total number of clusters, and n is the total number of objects. Singh, S.S and Chauhan, N. C., [16] have compared two well known partitioning based methods – k-means and k-medoids. The advantage of k-means is its low computation cost, while drawback is sensitivity to noisy data and outliers. Compared to this, k-medoid is not sensitive to noisy data and outliers, but it has high computation cost.

(2)

Bala Sundar V et al [1] proposed technique that classifies the group of the objects based on attributes into K number of groups. The grouping is done by minimizing the sum of squares of distances between data using Euclidean distance formula and the corresponding cluster centroid.

In our proposed algorithm the initial medoids are not chosen randomly, rather distance matrix using Manhattan distance is used. For evaluating the algorithms under consideration, we used Rand Index, Jaccard Coefficient, Folkes - Mallows and Run Time as four metrices.

3. DATA SET USED

ILPD (Indian Liver Patient Dataset) Data Set is used. This data set contains 416 liver patient records and 167 non liver patient records. The data set was collected from north east of Andhra Pradesh, India. Selector is a class label used to divide into groups (liver patient or not). This data set contains 441 male patient records and 142 female patient records. This secondary data is collected from UCI repository [3], available on the web (http://archive.ics.uci.edu/ml/darasets.html), While Primary data is collected from Bisariya Pathological Lab, Etah, having same attributes and 2000 records.

Attribute Information

1. Age → Age of the patient 2. Gender → Gender of the patient 3. TB → Total Bilirubin

4. DB → Direct Bilirubin

5. Alkphos →Alkaline Phosphotase 6. Sgpt → Alamine Aminotransferase 7. Sgot → Aspartate Aminotransferase 8. TP → Total Protiens

9. ALB → Albumin

10. A/GRatio → Albumin and Globulin Ratio

11. Selector field used to split the data into two sets (labeled by the experts)

4 ALGORITHMS AND

IMPLEMENTATIONS

4.1 K-MEANS

K-Means [4] is one of the simplest unsupervised learning methods among all partitioning based clustering methods. It classifies a given set of n data objects in k clusters, where k is the number of desired clusters and it is required in advance. A centroid is defined for each cluster. All the data objects are placed in a cluster having centroid nearest (or most similar) to that data object. After processing all data objects, k-means, or centroids, are recalculated, and the entire process is repeated. All data objects are bound to the clusters based on the new centroids.In each iteration centroids change their location step by step. In other words, centroids move in each iteration.This process is continued until centroid remains unchanged. As a result, k clusters are found representing a set of n data objects. An algorithm for k-means method is given below:

Input : ‘k’, the number of clusters to be partitioned; ‘n’, the

number of objects.

Output: A set of ‘k’ clusters based on given similarity function.

Algorithm

Steps:

i) Arbitrarily choose ‘k’ objects as the initial cluster centers;

ii) Repeat,

a. (Re)assign each object to the cluster to which the object is the most similar; based on the given similarity function; b. Update the centroid (cluster means), i.e., calculate the mean value of the objects for each cluster;

iii) Until no change

Weaknesses of K-Means:

 Applicable only when the mean of a cluster is defined; not applicable to categorical data.

 Need to specify k, the total number of clusters in advance.

 Not suitable to discover clusters with non-convex shape, or clusters of very different size.

 Unable to handle noisy data and outliers.  May terminate at local optimum.

 Result and total run time depends upon initial partition.

4.2 K-MEDOIDS CLUSTERING

The k-means method uses centroid to represent the cluster and it is sensitive to outliers. This means, a data object with an extremely large value may disrupt the distribution of data. K-medoids method [4][12]overcomes this problem by using medoids to represent the cluster rather than centroid. A medoid is the most centrally located data object in a cluster. Here, k data objects are selected randomly as medoids to represent k cluster and remaining all data objects are placed in a cluster having medoid nearest (or most similar) to that data object. After processing all data objects, new medoid is determined which can represent cluster in a better way and the entire process is repeated. Again all data objects are bound to the clusters based on the new medoids. In each iteration, medoids change their location step by step. Or in other words, medoids move in each iteration. This process is continued until no any medoid move. As a result, k clusters are found representing a set of n data objects. It follows the principle of minimizing the sum of dissimilarities between each object and its corresponding reference point. An algorithm for K-Medoids method is given below:

Input : k: the number of clusters. D: a data set containing n objects.

Output : A set of k clusters.

Algorithm

Steps

i). Randomly choose k objects in D as the initial representative objects;

ii). for all objects in the data set D

a. Find the cluster C which is nearest to object i by using the dissimilarity measure;

b. assign object i to cluster C;

c. set the member object in cluster C having minimum intra cluster variance as new centroid of C

(3)

The following weakness of existing algorithm prompted to develop new algorithm.

Weaknesses of K-Medoids

 Relatively more costly; complexity is O( i k (n-k)2), where i is the total number of iterations, k is the total number of clusters, and n is the total number of objects.

 Relatively not so much efficient.

 Need to specify k, the total number of clusters in advance.

4.3 PROPOSED ALGORITHM

The proposed algorithm is an enhancement of K-Medoids algorithm (Partition Around Medoid). Here initial centroids are not chosen randomly. The distance matrix is found once and its sum values of objects are computed and sorted in ascending order. First k (which is given) sorted objects are chosen as initial objects. Each remaining object is clustered with the representative object to which it is most similar. For the remaining iterations, the object with minimum distance with other object in its cluster is found and assigned as a new object. Here maximum execution time is reduced, as representative objects are reduced by non representative object with in cluster only. This iterative process is continued until there is no change in medoids. This algorithm is tested with different values of k and their maximum Silhouette index cluster is calculated and compared, to find the optimal number of clusters.

Algorithm

Input

D: A dataset containing n objects with p attributes. Output

A set of K clusters Method

The proposed algorithm consists of following steps : Step 1 Initialize k (which is already provided)

Step 2 Calculate distance between every pair of objects by distance matrix using Manhattan measure as follows:

Step 3 Calculate pij to make the initial guess at the centre of clusters

; j = 1, ……, n

Step 4 Compute the sum value at each object and sort them in ascending order.

Step 5 Select k objects having minimum value as initial representative object.

Step 6 Assign each remaining object to cluster with the nearest representative object

Step 7 Calculate the cost, that is sum the distances of all objects to their medoids.

Step 8 Calculate Silhouette Index for the newly formed cluster.

Step 9

1) Find new medoid, for each clustered objects by finding minimum distance object among cluster objects using Manhattan distance in distance matrix.

2) Assign minimum distance object as a new representative object, cluster remaining objects and compute cluster cost.

3) Calculate the difference between current and previous cost.

Step 10. Stop, if the current cost is same as previous cost otherwise go to step 9.

The algorithm is tested with different values of K and their maximum Silhouette index cluster is calculated and compared. The Silhouette Index value which is higher for k is considered the correct number of clusters for the given dataset.

5 .EXPERIMENT AND DATA ANALYSIS

5.1 SILHOUETTE INDEX

Silhouette Index is a cluster validity index that is used to judge the quality of any clustering solution C. Here, a

represents the average distance of a point from the other points of its cluster, and b represents the minimum of the average distances of the point from the points of the other clusters.

Let be the dataset and let C = be its clustering into k clusters. Let be the distance between xk and x1. Let be the jth_{cluster, j=1, …,k, where} _{. The average distance} aji between the ith vector in the cluster Cj and the other vectors in the same cluster is given by Equation (1).

…(1)

The minimum average distance between the ith_{vector in the} cluster Cj and all the vectors clustered in the clusters Ck, k = 1, …, k, k ≠j is given by the Equation (2).

….(2) n = 1, …, k ; n ≠j

Then the Silhouette width of the ith vector in the cluster Cj is defined in the following way,

…. (3)

From Equation (1), it follows that We can now define the Silhouette of the cluster Cj as:

…. (4)

Finally, the global Silhouette Index of the clustering is given by:

….. (5)

(4)

The value of Silhouette Index varies from – 1 to 1 and a higher value indicates better clustering result.

[image:4.595.49.289.244.459.2]

Distance based K-medoids clustering was tested with various health datasets with different values of k. For each k value, the received cluster output is validated using Silhouette Index. The Silhouette Index value which is higher for k is considered the correct number of clusters for the given dataset. Table 1 represents the received Silhouette Index for the ILPD (Indian Liver Patient Dataset) dataset for different values of k from 2 to 12. For this dataset, the maximum Silhouette value of 0.4912 was obtained at k = 4. For other values of k, the Silhouette Index is less. So, it clearly concluded that this dataset can be divided into four clusters.

Table 1. Number of Clusters and Silhouette Index for Indian Patient Liver Dataset (IPLD)

Number of

Clusters

Mean Silhouette Index

2 .3667

3 .4243

4 .4912

5 .3578

6 .4011

7 .3267

8 .3110

9 .2534

10 .2278

11 .2167

12 .2312

5.2 METRICS USED FOR EVALUATION

In order to measure the performance of a clustering and classification system, a suitable metric will be needed. For evaluating the algorithms under consideration, we used Rand Index, Jaccard Coefficient,Folkes - Mallows and Run Time as four metrics.

5.3 CLUSTER VALIDITY

The procedure of evaluating the results of a clustering algorithm is known as cluster validity. The validation step permits to evaluate the goodness of clustering results using different measures. Supervised measures [7] like Rand Index, Adjusted Rand Index, Jaccard Coefficient and Folkes and Mallows Index are used here for evaluating the cluster output. All these measures evaluate the results according to class labels.

Given a set of n elements and two partitions (U and V) of S to compare,

a – The number of pairs of elements in S that are in the ‘same set’ in U and in the ‘same set’ in V.

b – The number of pairs of elements in S that are in the ‘same set’ in U and in the ‘different sets’ in V.

c- The number of pairs of elements in S that are in the ‘different sets’ in U and in the ‘same set’ in V.

d- The number of pairs of elements in S that are in the ‘different sets’ in U and in the ‘different sets’ in V.

where a, b, c and d are computed for all pairs of data points i and j and their respective cluster assignments.

M = a+b+c+d,

Which is the maximum number of all pairs in the dataset (M = N*(N-1)/2, where N is the total number of points in the dataset.

Following four indices are used to measure the degree of similarity between U and V :

1. Rand Index (R)

2. Adjusted Rand Index (ARI)

3. Jaccard Coefficient (J)

4. Folkes and Mallows Index (FM)

Where

The four indices have proven that high values of indices great similarity between U and V.

Cluster validity using distance-based K-medoids algorithm for primary & secondary datasets is represented in Table 2. Considering Rand Index (R), the obtained accuracy is 89% for primary dataset and 92% for secondary dataset. Performance analysis of various methods is performed by using WEKA tool.

Table 2: Cluster Validity using Improved Algorithm for

Primary & Secondary Liver Dataset

S.

No.

Datasets ARI R J FM

Index

1. Indian Liver Patient Dataset (Primary)

0.76 0.89 0.72 0.84

2. Indian Liver Patient Dataset ( Secondary)

[image:4.595.333.481.320.496.2]

(5)

For evaluating the algorithms under consideration, we used Rand Index, Jaccard Coefficient,Folkes - Mallows and Run Time as four measures.

The performance of improved algorithm is simulated over 2000 samples of primary data by using Rand Index, Jaccard Coefficient,Folkes - Mallows and Run Time as four cluster validity measures. Table 3 summarizes the results, where the adjusted Rand indices were reported in a), the Jaccard Coefficient in b),the Folkes & Mallows in c) and computation time in d).

Table 3 (a) Adjusted Rand Index(R)

N K-Means PAM Proposed

Method

800 0.657 0.691 0.764

1000 0.689 0.691 0.772

1200 0.591 0.662 0.756

1400 0.615 0.621 0.762

1600 0.666 0.674 0.774

1800 0.561 0.567 0.662

2000 0.598 0.591 0.771

Table 3 (b) Jaccard Coefficient (J)

Method

800 0.524 0.672 0.772

1000 0.612 0.711 0.792

1200 0.667 0.724 0.782

1400 0.692 0.798 0.811

1600 0.598 0.675 0.724

1800 0.612 0.678 0.728

2000 0.698 0.715 0.756

Table 3 (c) Folkes and Mallows Index (FM)

Method

800 0.774 0.792 0.852

1000 0.715 0.778 0.877

1200 0.784 0.811 0.862

1400 0.712 0.794 0.897

1600 0.811 0.842 0.882

1800 0.722 0.794 0.875

2000 0.677 0.716 0.859

Table 3 (d) Computation Time ( in seconds)

Method

800 5.124 4.912 4.711

1000 5.675 5.475 5.321

1200 6.167 6.001 5.881

1400 6.723 6.565 6.115

1600 7.356 7.189 6.567

1800 7.992 7.764 7.234

2000 8.452 8.189 7.992

An efficient distance based improved K- Medoids Clustering Algorithm is reported to be better than previous algorithms. The complexity of proposed algorithm is O(nk) which is better than of PAM.

6. Conclusion

In this paper , an efficient distance – based K –medoids algorithm has been used for clustering. This improved algorithm was applied on primary and secondary Liver dataset. The result of proposed algorithm are more accurate and easily found with less computation time. Compared to other algorithms like k-means algorithm and Parition Around Medoids (PAM), it uses less number of iterations to produce more accurate results. In this method , initial medoids are selected from distance matrix , using Manhattan distance. It avoids scanning of large database every time as it updates the Medoids, using the Manhattan distance matrix. Optimal number of clusters are chosen from Silhouette index.This improved algorithm is less effected by outliers. The experimental results exhibits that by using this algorithm, one can obtain compact clusters very quickly and efficiently.

4. ACKNOWLEDGMENTS

Our thanks to Bisaria Patholgy Lab. Etah, and UCI repository for providing primary, secodary Liver dataset respectively.

5. REFERENCES

[1] Bala Suder, V., Devi, T. and Saravanan N. 2012 “Development of a Data Clustering Algorithm for Predicting Heart” International Journal of Computer Applications” Vol 48, Issue 7, pp 0975-888.

[2] Eisten, M., Spellman, P., Brown, P. and Botstein, D. 1998, “Cluster Analysis and Display of Genome-Wide Expression Patterns”, in Proc. Natl. Acad. Science USA, Vol. 95, No. 25, pp. 14863 – 14868.

[3] ftp://ftp.ics.uci.edu/pub/machine-learning-databases [4] Han, J. W. and Kamber, M., 2001 Data Mining Concepts

and Techniques, Higher Education Press, Beijing. [5] Han, J., Kamber, M. and Tung, A. 2001. Spatial

[image:5.595.311.549.87.240.2] [image:5.595.48.287.202.354.2] [image:5.595.49.288.570.726.2]

(6)

[6] Hartigan, J., A. and Wong, M., A. 1979, “ A K-Means Clustering Algorithm”, Applied Statistics, Vol. 28, No. 1, pp. 100-108.

[7] Jiang, D., Tang, C. and Zhang, A. 2004, “Clustering Analysis for Gene Expression Data: A Survey”, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, pp. 1370–1386.

[8] Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York .

[9] Li, S, Wu X and Tan M 2008, “Gene Selection Using Hybrid Particle Swarm Optimization and Genetic Algorithm”, Soft Computing, Vol. 12, No. 11, pp. 1039–1048.

[10] Lu, Y, Lu, S, Fotoulhi F, Denf Y. and Brown, S. 2004, “Incremental Genetics K-Means Algorithm and Its Applications in Gene Expression Data Analysis”, BMC Bioinformatics, Vol. 5, pp. 172– 180. [11] Park, H-S and Jun, C-H 2009, “A Simple and Fast

Algorithm for K-Medoids Clustering”, Expert Systems with Applications, Vol. 36, No. 2, pp. 3336 – 3341.

[12] Raghuvira, P. A., Vani, K. S. and Rao, K. N. 2011. An Efficient Density Based Improved k- medoids Clustering Algorithm. International Journal of

Advanced Computer Science & Application. Vol 02, No.6,49-54.

[13] Ranga Raj, R. Punithavalli . 2012. “ Evaluation of Enhanced K-means Algorithm to Student Dataset.International Journal of Advanced Networking & Application”. Vol 04,Issue 02, pp 1578-80.

[14] Raymond, T. Ng and Jiawei Han 2002, “CLARANS: A Method for Clustering Obejects for Spatial Data Mining”, IEE Transactions on Knowledge and Data Engineering, Vol. 14, No. 5, pp. 1003–1016. [15] Selim, S., Z. and Ismail, M., A. 1984, “K-Means Type

Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality”, IEEE Trans. Pattern Anal. Mach. Intel., Vol. 6, No. 1, pp. 81– 87.