A New Method for Dimensionality Reduction using K- Means Clustering Algorithm for High Dimensional Data Set

(1)

A New Method for Dimensionality Reduction using

K-Means Clustering Algorithm for High Dimensional Data Set

D.Napoleon

Assistant Professor Department of Computer Science

Bharathiar University Coimbatore – 641 046

S.Pavalakodi

Research Scholar Department of Computer Science

Bharathiar University Coimbatore–641046

ABSTRACT

Clustering is the process of finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups. Dimensionality reduction is the transformation of high-dimensional data into a meaningful representation of reduced dimensionality that corresponds to the intrinsic dimensionality of the data. K-means clustering algorithm often does not work well for high dimension, hence, to improve the efficiency, apply PCA on original data set and obtain a reduced dataset containing possibly uncorrelated variables. In this paper principal component analysis and linear transformation is used for dimensionality reduction and initial centroid is computed, then it is applied to K-Means clustering algorithm.

Keywords

Clustering; Dimensionality Reduction, Principal component analysis, k-means algorithm, Amalgamation.

1. INTRODUCTION

Data Mining refers to the mining or discovery of new information in terms of patterns or rules from vast amounts of data. Data mining is a process that takes data as input and outputs knowledge. One of the earliest and most cited definitions of the data mining process, which highlights some of its distinctive characteristics, is provided by Fayyad, Piatetsky-Shapiro and Smyth (1996), who define it as “the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.”Some popular and widely used data mining clustering techniques such as hierarchical and k-means clustering techniques are statistical techniques and can be applied on high dimensional datasets [6]. A good survey on clustering methods is found in Xu et al. (2005).

High dimensional data are often transformed into lower dimensional data via the principal component analysis (PCA)(Jolliffe, 2002) (or singular value decomposition) where coherent patterns can be detected more clearly [14]. Such unsupervised dimension reduction is used in very broad areas such as meteorology, image processing, genomic analysis, and information retrieval [9]. It is also common that PCA is used to project data to a lower dimensional subspace and K-means is then applied in the subspace (Zha et al., 2002)[8]. In other cases, data are embedded in a low-dimensional space such as the eigenspace of the graph Laplacian, and K-means is then applied (Ng et al., 2001)[3].The main basis of PCA-based dimension reduction is that PCA picks up the dimensions with the largest variances. Mathematically, this is equivalent to finding the best low

rank approximation (in L2 norm) of the data via the singular value decomposition (SVD) (Eckart & Young, 1936). However, this noise reduction property alone is inadequate to explain the effectiveness of PCA[5]

Dimension reduction is the process of reducing the number of random variables under consideration, and can be divided into feature selection and feature extraction [4]. As dimensionality increases, query performance in the index structures degrades. Dimensionality reduction algorithms are the only known solution that supports scalable object retrieval and satisfies precision of query results [16]. Feature transforms the data in the high-dimensional space to a space of fewer dimensions [9].The data transformation may be linear, as in principal component analysis (PCA), but any nonlinear dimensionality reduction techniques also exist [18]. In general, handling high dimensional data using clustering techniques obviously a difficult task in terms of higher number of variables involved. In order to improve the efficiency the noisy and outlier data may be removed and minimize the execution time, we have to reduce the no. of variables in the original data set. To do so, we can choose dimensionality reduction methods such as principal component analysis (PCA), Singular value decomposition (SVD), and factor analysis (FA). Among this, PCA is preferred to our analysis and the results of PCA are applied to a popular model based clustering technique [7].

Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering for unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering [5].The main linear technique for dimensionality reduction, principal component analysis, performs a linear mapping of the data to a lower dimensional space in such a way, that the variance of the data in the low-dimensional representation is maximized. In practice, the correlation matrix of the data is constructed and the eigenvectors on this matrix are computed. The eigenvectors that correspond to the largest eigenvalues (the principal components) can now be used to reconstruct a large fraction of the variance of the original data. Moreover, the first few eigenvectors can often be interpreted in terms of the large-scale physical behavior of the system. The original space (with dimension of the number of points) has been reduced (with data loss, but hopefully retaining the most important variance) to the space spanned by a few eigenvectors. Many applications need to use unsupervised techniques where there is no previous knowledge about patterns inside samples and its

(2)

grouping, so clustering can be useful. Clustering is grouping samples base on their similarity as samples in different groups should be dissimilar. Both similarity and dissimilarity need to be elucidated in clear way. High dimensionality is one of the major causes in data complexity. Technology makes it possible to automatically obtain a huge amount of measurements. However, they often do not precisely identify the relevance of the measured features to the specific phenomena of interest. Data observations with thousands of features or more are now common, such as profiles clustering in recommender systems, personality similarity, genomic data, financial data, web document data and sensor data. However, high-dimensional data poses different challenges for clustering algorithms that require specialized solutions. Recently, some researchers have given solutions on high-dimensional problem. Our main objective is proposing a framework to combine relational definition of clustering with dimension reduction method to overcome aforesaid difficulties and improving efficiency and accuracy in K-Means algorithm to apply in high dimensional datasets. Kmeans clustering algorithm is applied to reduced datasets which is done by principal component analysis dimension reduction method.

2. METHODOLOGIES

Cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas[12].Clustering is the process of finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups. A good clustering method will produce high quality clusters with high intra-cluster similarity and low inter-cluster similarity [17]. The quality of a clustering result depends on both the similarity measure used by the method and its implementation and also by its ability to discover some or all of the hidden patterns [15].

2.1 Clustering

Cluster analysis is one of the major data analysis methods widely used for many practical applications in emerging areas[12].Clustering is the process of finding groups of objects such that the objects in a group will be similar to one another and different from the objects in other groups. A good clustering method will produce high quality clusters with high intra-cluster similarity and low inter-cluster similarity [17]. The quality of a clustering result depends on both the similarity measure used by the method and its implementation and also by its ability to discover some or all of the hidden patterns [15].

2.2 K-Means Clustering Algorithm

K-means is a commonly used partitioning based clustering technique that tries to find a user specified number of clusters (k), which are represented by their centroids, by minimizing the square error function [13]. Although K-means is simple and can be used for a wide variety of data types. The K-means algorithm is one of the partitioning based, nonhierarchical clustering methods. Given a set of numeric objects X and an integer number k, the K-means algorithm searches for a partition of X into k clusters that minimizes the within groups sum of squared errors. The K-means algorithm starts by initializing the k cluster centers [1]. The input data points are then allocated to one of the existing clusters according to the square of the Euclidean distance from the clusters, choosing the closest. The mean (centroid) of each cluster is then computed so as to update the cluster center [1]. This update occurs as a result of the change in the membership of each cluster. The processes of re-assigning the input vectors and the update of the cluster centers is repeated until no more

change in the value of any of the cluster centers. The steps of the K-means algorithm are written below:

Input: X = {d1, d2,……..,dn} // set of n data items.

Output: A set of k clusters

Step 1: Initialization: choose randomly K input vectors (data

points) to initialize the clusters.

Step 2: Nearest-neighbor search: for each input vector, find the

cluster center that is closest, and assign that input vector to the corresponding cluster.

Step 3: Mean update: update the cluster centers in each cluster

using the mean (centroid) of the input vectors assigned to that cluster

Step 4: Stopping rule: repeat steps 2 and 3 until no more change in

the value of the means.

2.3 Principal Component Analysis

Principal component analysis (PCA) involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. Depending on the field of application, it is also named the discrete KarhunenLoève transform (KLT), the Hostelling transform or proper orthogonal decomposition (POD).PCA was invented in 1901 by Karl Pearson.[4] Now it is mostly used as a tool in exploratory data analysis and for making predictive models. PCA involves the calculation of the eigenvalue decomposition of a data covariance matrix or singular value decomposition of a data matrix, usually after mean centering the data for each attribute. The results of a PCA are usually discussed in terms of component scores and loadings.

PCA is the simplest of the true eigenvector-based multivariate analyses. Often, its operation can be thought of as revealing the internal structure of the data in a way which best explains the variance in the data. If a multivariate dataset is visualized as a set of coordinates in a high-dimensional data space (1 axis per variable), PCA supplies the user with a lower-dimensional picture, a "shadow" of this object when viewed from its (in some sense) most informative viewpoint.PCA is closely related to factor analysis; indeed, some statistical packages deliberately conflate the two techniques. True factor analysis makes different assumptions about the underlying structure and solves eigenvectors of a slightly different matrix.

2.4 Principal Components

Technically, a principal component (PCS) can be defined as a linear combination of optimally weighted observed variables which maximize the variance of the linear combination and which have zero covariance with the previous PCs. The first component extracted in a

(3)

principal component analysis accounts for a maximal amount of total variance in the observed variables. The second component extracted will account for a maximal amount of variance in the data set that was not accounted for by the first component and it will be uncorrelated with the first component. The remaining components that are extracted in the analysis display the same two characteristics: each component accounts for a maximal amount of variance in the observed variables that was not accounted for by the preceding components, and is uncorrelated with all of the preceding components. When the principal component analysis will complete, the resulting components will display varying degrees of correlation with the observed variables, but are completely uncorrelated with one another. PCs are calculated using the Eigen value decomposition of a data covariance matrix/ correlation matrix or singular value decomposition of a data matrix, usually after mean centering the data for each attribute. Covariance matrix is preferred when the variances of variables are very high compared to correlation. It would be better to choose the type of correlation when the variables are of different types. Similarly the SVD method is used for numerical accuracy [19].After finding principal components reduced dataset is applied to kmeans clustering. Here also we have proposed a new method to find the initial centroids to make the algorithm more effective and efficient. The main advantage of this approach stems from the fact that this framework is able to obtain better clustering with reduced complexity and also provides better accuracy and efficiency for high dimensional datasets

3. DATASET DESCRIPTION

Experiments are conducted on a breast cancer data set which data is gathered from uci web site. This web site is for finding suitable partners who are very similar from point of personality’s view for a person. Based on 8 pages of psychiatric questions personality of people for different aspects is extracted. Each group of questions is related to one dimension of personality. To trust of user some questions is considered and caused reliability of answers are increased. Data are organized in a table with 90 columns for attributes of people and 704 rows which are for samples. There are missing values in this table because some questions have not been answered, so we replaced them with 0. On the other hand we need to calculate length of each vector base on its dimensions for further process. All attributes value in this table is ordinal and we arranged them with value from 1 to 5, therefore normalizing has not been done. There is not any correlation among attributes and it concretes an orthogonal space for using Euclidean distance. All samples are included same number of attributes.

4. EXPERIMENTAL SETUP

In all experiments we use MATLAB software as a powerful tool to compute clusters and windows XP with Pentium 2.1 GHZ. Reduced datasets done by principal component analysis reduction method is applied to kmeans clustering. As a similarity metric, Euclidean distance has been used in kmeans algorithm.The steps of the Amalgamation k-means clustering algorithm are as follows.

Input: X = {d1, d2,……..,dn} // set of n data items. Output: A set of k clusters

Phase-1: Apply Pca to Reduce the Dimension of the Breast

Cancer Data Set

Step 1: Organize the dataset in a matrix X. Step 2: Normalize the data set using Z-score.

Step 3: Calculate the singular value decomposition of the data matrix. X =UDV T

Step 4: Calculate the variance using the diagonal elements of D. Step 5: Sort variances in decreasing order.

Step 6: Choose the p principal components from V with largest variances.

Step 7: Form the transformation matrix W consisting of those p

PCs.

Step 8: Find the reduced projected dataset Y in a new coordinate axis

by applying W to X. Phase-2: Find the Initial Centroids

Step 1: For a data set with dimensionality, d, compute

the variance of data in each dimension(column). Step 2: Find the column with maximum variance and

call it as max and sort it in any order.

Ste p 3: Divide the data points of cvmax into K subsets, where K is the desired number of clusters. Step 4: Find the median of each subset.

Step 5: Use the corresponding data points (vectors) for Step 6: each median to initialize the cluster centers.

Phase-3: Apply K-Means Clustering With Reduced Datasets. Step 1: Initialization: choose randomly K input vectors (data

points) to initialize the clusters.

Step 2: Nearest-neighbor search: for each input vector, find the cluster center that is closest, and assign that input vector to the corresponding cluster.

Step 3: Mean update: update the cluster centers in each cluster using the mean (centroid) of the input vectors assigned to that cluster.

Step 4: Stopping rule: repeat steps 2 and 3 until no more change in the value of the means.

4.1 Experimental Results

Breast cancer original dataset is reduced using principal component analysis reduction method. Dataset consists of 569 instances and 30 attributes. Here the Sum of Squared Error (SSE), representing distances between data points and their cluster centers have used to measure the clustering quality.

(4)

Step 1: Normalizing the original data set

Using the Normalization process, the initial data values are scaled so as to fall within a small-specified range. An attribute value V of an attribute A is normalized to V’ using Z-Score as follows:

V’= (V-mean (A))/std (A) (1)

It performs two things i.e. data centering, which reduces the square mean error of approximating the input data and data scaling, which standardizes the variables to have unit variance before the analysis takes place. This normalization prevents certain features to dominate the analysis because of their large numerical values.

Step 2: Calculating the PCs using Singular Value Decomposition of the normalized data matrix

The number of PCs obtained is same with the number of original variables. To eliminate the weaker components from this PC set we have calculated the corresponding variance, percentage of variance and cumulative variances in percentage, which is shown in Table 1. Then we have considered the PCs having variances less than the mean variance, ignoring the others. The variance in percentage is evaluated and the cumulative variance in percentage first value is same as percentage in variance, second value is summation of cumulative variance in percentage and variance in percentage. Similarly other values of cumulative variance is calculated.

Step 3: Finding the reduced data set using the reduced PCs The transformation matrix with reduced PCs is formed and this transformation matrix is applied to the normalized data set to produce the new reduced projected dataset, which can be used for further data analysis,also applied the PCA on three biological dataset and the reduced no. of attributes obtained for each dataset .

Step 4: Finding intial centroids

The proposed algorithm finds a set of medians extracted from the dimension with maximum variance to initialize clusters of the means[10]. The method can give better results when applied to k-means.

Step 5: Reduced datasets are applied to k-means algorithm with computed centroids

The reduced breast cancer dataset is applied to the standard k-means clustering[2] to. The SSE value obtained and the time taken in ms for reduced breast cancer datasets with original Amalamation k-means is given in Table 2

Table 1.The Variances, Variances in Percentages, and Cumulative Variances in Percentages Corresponding to Pcs

Table 2. Shows Results Of Amalgamation Of K-means With Number Of Clusters, SSE and Execution Time

Amalgamation K –Means Algorithm

Dataset No of Clusters SSE Execution Time(in ms) Breast Cancer Reduced Dataset 1 12603 0.598 2 9513 0.623 3 7641 0.791 4 4841 0.771 5 811 0.889

(5)

The above results show that the Amalgamation kmeans algorithm provides sum of squared error distance and Execution time of corresponding clusters. Figure 1. shows graph of SSE and Number of clusters. In this figure, when number of clusters increases, sum of squared error distance values decreases.Figure 2.shows number of clusters increases, Execution time increases.

Figure 1. Shows SSE and Number of Clusters

Figure 2. Shows Execution Time and Number Of Clusters

The following Figure 3. And Figure 4. shows the Cluster Results for BREAST CANCER DataSet in 2Dand 3D using Amalgamation K-Means Algorithm with the number of clusters K=5

Figure 3. Shows Clusters For Breast Cancer Dataset Using Amalgamation K-means Algorithm in 2D

Figure 4. Shows Clusters for Breast Cancer Dataset Using Amalgamation K-means Algorithm in 3D

5. CONCLUSION

In this paper a dimensionality reduction through PCA, is applied to k-means algorithm. Using Dimension reduction of principal component analysis, original breast cancer dataset is compact to reduced data set which was partitioned in to k clusters in such a way that the sum of the total clustering errors for all clusters was reduced as much as possible while inter distances between clusters are maintained to be as large as possible. We propose a new algorithm to initialize the clusters which is then applied to k-means algorithm. The experimental results show that principal component analysis is used to reduce attributes and reduced dataset is applied to k-means clustering with computed centroids. Evolving some dimensional reduction methods like canon pies can be used for high dimensional datasets is suggested as future work.

6. REFERENCES

[1] Bradley, P. S., Bennett, K. P., & Demiriz, A. (2000).Constrained k-means clustering (Technical ReportMSR-TR-2000-65). Microsoft Research, Redmond, WA.

[2] C Ding,”Principal Component Analysis and Effective K-means Clustering”

[3] Chao Shi and Chen Lihui, 2005. Feature dimension reduction for microarray data analysis using locally linear embedding, 3rd Asia Pacific Bioinformatics Conference, pp. 211-217.

[4] Chris Ding and Xiaofeng He, “K-Means Clustering via Principal Component Analysis”, In proceedings of the 21st

International Conference on Machine Learning, Banff, Canada, 2004

[5] Davy Michael and Luz Saturnine, 2007. Dimensionality reduction for active learning with nearest neighbor classifier in text categorization problems, Sixth International Conference on Machine Learning and Applications, pp. 292-297

[6] IEEEI.T Jolliffe, “Principal Component Analysis”, Springer, second edition.

(6)

[7] Kiri Wagsta- Claire Cardie ,”Constrained K-means Clustering with Background Knowledge”

[8] .Maaten L.J.P., Postma E.O. and Herik H.J. van den, 2007. Dimensionality reduction: A comparative review”, Tech. rep.University of Maastricht.

[9] Moth’d Belal. Al-Daoud , (2005).A New Algorithm for Cluster Initialization, World Academy of Science, Engineering and Technology.

[10]. O Shamir,”Model Selection and Stability in k-means Clustering”

[11] Rand, W. M. (1971). Objective criteria for the evaluation of clustering met hods. Journal of the AmericanStatistical Association, 66, 846-850.

[12] .RM Suresh, K Dinakaran, P Valarmathie,“Model based modified k-means clustering for microarray data”,

[13] International Conference on Information Management and Engineering, Vol.13, pp 271-273, 2009, [14].Valarmathie P., Srinath M. and Dinakaran K., 2009. An increased performance of clustering high dimensional data through dimensionality reduction technique, Journal of Theoretical and Applied Information Technology, Vol. 13, pp. 271-273

[14]. Wagsta_, K., & Cardie, C. (2000). Clustering with instance-level constraints. Proceedings of the Seventeenth International Conference on Machine Learning (pp. 1103{1110). Palo Alto, CA: Morgan Kaufmann.

[15] Wray Buntine,” K-means Clustering and PCA”, National ICT Australia

[16]. Xu R. and Wunsch D., 2005. Survey of clustering algorithms, IEEE Trans. Neural Networks, Vol. 16, No. 3, pp. 645-678. [17] Yan Jun, Zhang Benyu, Liu Ning, Yan Shuicheng, Cheng

Qiansheng, Fan Weiguo, Yang Qiang, Xi Wensi, and Chen Zheng,2006. Effective and efficient dimensionality reduction for large-scale and streaming data preprocessing, IEEE transactions on Knowledge and Data Engineering, Vol. 18, No. 3, pp. 320-333.

[18] Yeung Ka Yee and Ruzzo Walter L., 2000. An empirical study on principal component analysis for clustering gene expressionData”,Tech. Report, University of Washington.