Efficient Algorithm to Find Information Rich Subset in High Dimensional Data


International Journal of Emerging Technology and Advanced Engineering

Website: www.ijetae.com (ISSN 2250-2459,ISO 9001:2008 Certified Journal, Volume 3, Issue 11, November 2013)


Radhika K R¹, Dr. Thriveni J²

¹Asst. Prof., Dept. of CSE, BMSIT
²Assoc. Prof., Dept. of CSE, UVCE

Abstract-- Clustering in high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. High-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary. Subspace clustering is the task of detecting all clusters in all subspaces. It is an extension of traditional clustering that seeks to find clusters in different subspaces within a data set. In high-dimensional data, many of the dimensions are often irrelevant. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all the instances in a dataset to be nearly equidistant from each other, completely masking the clusters. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces. The existing Information Rich Subset (IRS) algorithm can find effective information-rich subspace trends, but it uses the K-means algorithm to find suitable clusters. K-means can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The K-means method is also sensitive to noise and outlier data points, because a small number of such points can substantially influence the mean value. Here, an extension is proposed that optimizes the IRS algorithm to find suitable subspaces, and its effectiveness is evaluated against existing subspace clustering algorithms.

Keywords— High Dimensional Data, Clustering, k-means, Subspace

I. INTRODUCTION

Data mining refers to extracting or mining knowledge from large amounts of data. Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks. It uses machine learning, statistical and visualization techniques to discover and present knowledge in a form that is easily comprehensible to humans.

Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules. Data mining, also popularly known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown and potentially useful information from data in databases. Data mining systems can be categorized according to the kinds of knowledge they mine, based on data mining functionalities such as characterization, discrimination, association and correlation. One of the primary data mining tasks is clustering, which is intended to help a user discover and understand the natural structure or grouping in a data set.

Clustering is the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. The major categories of clustering methods are:

Partition methods: A partition method partitions a data set containing n objects into a set of k clusters.

Hierarchical methods: A hierarchical method creates a hierarchical decomposition of the given set of data objects.

Density-based methods: Clustering is based on density (a local cluster criterion), such as density-connected points.

Grid-based methods: A grid-based method quantizes the object space into a finite number of cells that form a grid structure.

Model-based methods: Model-based methods hypothesize a model for each of the clusters and find the best fit of the data to the given model.
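As an illustration of the partition category, the following is a minimal sketch of k-means (Lloyd's algorithm), the partition method discussed later in this paper. The function name `kmeans`, its parameters, and the toy data are assumptions made for this example only; the sketch is not the implementation used in the IRS algorithm.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Lloyd's algorithm: partition n points into k clusters (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its members; keep the old
        # center if a cluster happens to be empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated toy groups in 2-D (made-up data).
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centers = kmeans(data, k=2)
print(labels)
```

Note that the update step requires a mean to exist for each cluster, which is why the method does not apply directly to categorical attributes.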


High-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and in the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.

Subspace clustering is the task of detecting all clusters in all subspaces. It is an extension of traditional clustering that seeks to find clusters in different subspaces within a data set. Traditional clustering algorithms consider all of the dimensions of an input data set in an attempt to learn as much as possible. In high-dimensional data, many of the dimensions are often irrelevant. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. In very high dimensions it is common for all the instances in a dataset to be nearly equidistant from each other, completely masking the clusters. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces.

Today, many data sets consist of a high-dimensional feature space. Most clustering techniques use the distance or similarity between objects as a measure to build clusters, but in high-dimensional spaces the distances between points become relatively uniform. Most interesting patterns cannot be revealed using global methods, which consider the entire data and feature spaces during their analysis. Identifying interesting patterns in large-scale high-dimensional data is usually accomplished using popular techniques such as dimensionality reduction and feature selection.
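The claim that distances become relatively uniform can be checked empirically. The sketch below is an illustration added here, not an experiment from this paper: it draws uniform random points and reports the relative contrast (max - min) / min of the distances from a random query point. The function name `relative_contrast` and the sample sizes are assumptions.

```python
import numpy as np

def relative_contrast(n_points, n_dims, seed=0):
    """(max - min) / min of the distances from one query point to uniform random data."""
    rng = np.random.default_rng(seed)
    data = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(data - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as the number of dimensions grows, so the nearest
# and the farthest neighbour become hard to tell apart.
for d in (2, 10, 100, 1000):
    print(d, round(relative_contrast(500, d), 3))
```

A small contrast means that a distance-based criterion has little left to separate clusters with, which is the motivation for searching in subspaces.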

The existing Information Rich Subset (IRS) algorithm can find effective information-rich subspace trends, but it uses the K-means algorithm to find suitable clusters. K-means can be applied only when the mean of a cluster is defined. This may not be the case in some applications, such as when data with categorical attributes are involved. The K-means method is also sensitive to noise and outlier data points, because a small number of such points can substantially influence the mean value. Here, an extension is proposed that optimizes the IRS algorithm to find suitable subspaces, and its effectiveness is evaluated against existing subspace clustering algorithms.
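The sensitivity of the mean to outliers is easy to see numerically. The following small sketch uses made-up one-dimensional values; the median is shown only for contrast and is not a method proposed in this paper.

```python
import numpy as np

# Five values that form a tight cluster (made-up data).
cluster = np.array([1.0, 1.2, 0.9, 1.1, 0.8])
# The same cluster with a single outlier added.
with_outlier = np.append(cluster, 50.0)

# How far one outlier moves the mean compared with the median.
mean_shift = abs(with_outlier.mean() - cluster.mean())
median_shift = abs(np.median(with_outlier) - np.median(cluster))
print(mean_shift, median_shift)
```

A single outlying point moves the mean by several units while the median barely changes, so a K-means center can be pulled far away from the bulk of its cluster.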

Applications of Data mining

Data mining is used by businesses to improve their marketing and to understand the buying patterns of their clients. Attrition analysis, customer segmentation and cross-selling are the most important ways in which data mining helps a business increase its revenue.

Data mining is used in the banking sector for credit card fraud detection by identifying the patterns involved in fraudulent transactions. It is also used to reduce credit risk by classifying a potential client and predicting bad loans.

II. CHALLENGES TO FIND SUBSPACE IN HIGH-DIMENSIONAL DATA

Many factors have to be considered for a clustering algorithm in data mining, namely efficiency, shape of clusters, sensitivity to outliers and the requirements of parameters. A clustering algorithm assumes a certain set of criteria for a cluster, and criteria are also needed to determine which subspaces have good clustering.

a) Attribute subset selection: This is commonly used for data reduction by removing irrelevant or redundant dimensions. Given a set of attributes, attribute subset selection finds the subset of attributes that are most relevant to the data mining task. Attribute subset selection involves searching various attribute subsets and evaluating these subsets using certain criteria.

b) Criterion of high coverage: The first criterion that has to be considered is coverage. This is a reasonable criterion, since a subspace with more distinguished clusters will have high coverage, whereas a subspace with a close to random data distribution will have low coverage.

c) Correlation of dimensions: Finding subspaces with good clustering may not always be helpful; the dimensions of the subspace should also be correlated. The reason is that although a subspace may contain clusters, it may not be interesting if the dimensions are independent of each other.

d) Most clustering techniques use the distance or similarity between objects as a measure to build clusters.

e) The K-means method is sensitive to noise and outlier data points, because a small number of such points can substantially influence the mean value.

f) Factors to be considered for a clustering algorithm in data mining, namely efficiency, shape of clusters, sensitivity to outliers and the requirements of parameters.

g)The dimensionality reduction methods used for linear


III. RELATED WORK

Several organizations and institutions have carried out research in these areas. Some of the important methods studied widely in the literature that are relevant to the problem are reviewed here. Dimensionality reduction is an effective approach to downsizing data. Most machine learning and data mining techniques may not be effective for high-dimensional data, because query accuracy and efficiency degrade rapidly as the dimension increases. Dimensionality reduction attempts to extract the meaningful dimensions from a large pool of features or to develop a meaningful low-dimensional representation. A classical method for linear dimensionality reduction that is widely used in practical applications is PCA (Principal Component Analysis), a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. The objective of PCA is to reduce the dimensionality of the dataset while retaining most of the original variability in the data. However, PCA can only produce a linear mapping into a low-dimensional space, whereas in many datasets the underlying variability of the features creates a highly non-linear structure. Moreover, PCA creates new features, and it is difficult to obtain an intuitive understanding of the data using the new features only.
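The PCA procedure described above can be sketched with a singular value decomposition. This is a minimal illustration on made-up data; the function name `pca` and the toy dataset are assumptions for this example and are not taken from the works cited here.

```python
import numpy as np

def pca(data, n_components):
    """Project data onto its first principal components (illustrative sketch)."""
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    # Fraction of the total variance captured by each direction.
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ components.T, components, explained[:n_components]

rng = np.random.default_rng(0)
x = rng.normal(size=200)
# Made-up 3-D data lying close to the line (1, 2, 0).
data = np.column_stack([x,
                        2 * x + 0.01 * rng.normal(size=200),
                        0.01 * rng.normal(size=200)])
projected, components, explained = pca(data, 1)
print(components[0], explained[0])
```

Each principal component is a linear mixture of all original attributes, which illustrates why the new features are harder to interpret than a selected subset of the original ones.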

Yi-Hong Chu et al. [1] devised a novel subspace clustering model to discover clusters based on the relative region densities in the subspaces, where clusters are regarded as regions whose densities are relatively high compared with the other region densities in a subspace. Based on this idea, different density thresholds are adaptively determined to discover clusters in different subspace cardinalities. Because previous techniques are infeasible in this novel clustering model, they also devised an innovative algorithm, referred to as DENCOS (density conscious subspace clustering), which adopts a divide-and-conquer scheme to efficiently discover clusters satisfying different density thresholds in different subspace cardinalities.
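The idea of a density threshold that adapts to the subspace cardinality can be illustrated with a simple grid count. The sketch below is not DENCOS; it is a simplified stand-in that flags grid cells whose point count exceeds a multiple of the count expected under a uniform distribution in that subspace. The function name `dense_cells`, the parameters `bins` and `factor`, and the data are all assumptions for this example.

```python
import numpy as np
from collections import Counter

def dense_cells(data, dims, bins, factor):
    """Grid cells in subspace `dims` holding more than `factor` times the
    count expected under a uniform distribution (data assumed in [0, 1))."""
    sub = data[:, dims]
    # Map each coordinate to a bin index; clip so that 1.0 falls in the last bin.
    cells = np.clip((sub * bins).astype(int), 0, bins - 1)
    counts = Counter(tuple(int(v) for v in row) for row in cells)
    # The expected count per cell shrinks as the subspace cardinality grows,
    # so the threshold adapts to the number of dimensions.
    threshold = factor * len(data) / bins ** len(dims)
    return {cell: c for cell, c in counts.items() if c > threshold}

rng = np.random.default_rng(0)
background = rng.random((300, 3))
# 100 made-up points clustered in dimensions 0 and 1 only; dimension 2 is noise.
cluster = np.column_stack([rng.normal(0.3, 0.01, 100),
                           rng.normal(0.7, 0.01, 100),
                           rng.random(100)])
data = np.vstack([background, cluster])

print(dense_cells(data, [0, 1], bins=4, factor=3))  # subspace containing the cluster
print(dense_cells(data, [2], bins=4, factor=3))     # irrelevant dimension
```

With this toy data the cluster shows up as a dense cell only in the subspace spanned by dimensions 0 and 1, while the irrelevant dimension yields no dense cell.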

McLachlan [2] considered the problem of assessing the significance of groups in high-dimensional data and proposed a resampling approach, used in conjunction with factor analytic models to generate the bootstrap samples, for assessing the number of groups.

Data mining applications place special requirements on clustering algorithms including the ability to find clusters embedded in subspaces of high dimensional data, scalability, end-user comprehensibility of the results, non-presumption of any canonical data distribution, and insensitivity to the order of input records.

Another line of work concerns mining the representative subspace clusters in high-dimensional data. Subspace clusters can be clustered into several groups, and several representative clusters can be generated from each group. Unfortunately, when the size of the set of representative clusters is specified, the problem of finding the optimal set is NP-hard. To solve this problem efficiently, Chen et al. [3] presented an approximate method, PCoC. The greatest advantage of this method is that it needs only a subset of the subspace clusters as input. The performance study shows the effectiveness and efficiency of the method.

Subspace clustering is the task of detecting all clusters in all subspaces. It is an extension of traditional clustering that seeks to find clusters in different subspaces within a data set. Traditional clustering algorithms consider all of the dimensions of an input data set in an attempt to learn as much as possible. In high-dimensional data, many of the dimensions are often irrelevant. These irrelevant dimensions confuse clustering algorithms by hiding clusters in noisy data. Subspace clustering algorithms localize the search for relevant dimensions, allowing them to find clusters that exist in multiple, possibly overlapping subspaces.

A new clustering method for high-dimensional data streams, called WSCStream, is proposed in [4]. This method incorporates a fading cluster structure and a dimensional weight matrix. The weight associated with each dimension indicates the importance of that dimension to the corresponding cluster. The weighted distance between a cluster and a data point is used to obtain the final clusters as new data points arrive over time. Experimental results on real and synthetic datasets demonstrate that WSCStream has higher clustering quality.
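The role of per-dimension weights can be illustrated with a weighted Euclidean distance. This is one common form, shown under the assumption that weights are non-negative; it is not necessarily the exact formula used by WSCStream in [4], and the function name `weighted_distance` and all values are hypothetical.

```python
import numpy as np

def weighted_distance(point, center, weights):
    """Euclidean distance in which each dimension is scaled by a cluster-specific weight."""
    return float(np.sqrt(np.sum(weights * (point - center) ** 2)))

center = np.array([0.0, 0.0, 0.0])
point = np.array([0.1, 0.2, 9.0])      # far from the center only in the third dimension
weights = np.array([0.5, 0.5, 0.0])    # hypothetical: third dimension is irrelevant to this cluster
uniform = np.full(3, 1.0 / 3.0)        # all dimensions treated as equally important

print(weighted_distance(point, center, weights))
print(weighted_distance(point, center, uniform))
```

With the cluster-specific weights the point is close to the center, whereas with uniform weights the irrelevant dimension dominates the distance and the point would appear to lie far outside the cluster.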

A fast subspace partition data streams clustering


Furthermore, this approach has better scalability with different dimensionality and different partition granularity.

Identifying information-rich subsets in high-dimensional spaces and representing them as order-revealing patterns (or trends) is an important and challenging research problem. The information quotient of large-scale high-dimensional datasets is significantly reduced by the curse of dimensionality, which makes traditional clustering and association analysis methods unsuitable. Most interesting patterns cannot be revealed using global methods, which consider the entire data and feature spaces during their analysis. Identifying interesting patterns in large-scale high-dimensional data is usually accomplished using popular techniques such as dimensionality reduction, feature selection and subspace clustering. Though these methods are able to identify the groupings in the feature subsets and localized neighborhood data subspaces, none of them extracts the latent patterns that are present in local information-rich subsets of the data.

Analyzing databases with many attributes per object is a recent challenge. For such high-dimensional data it is known that traditional clustering algorithms fail to detect meaningful patterns. As a solution, Gunnemann et al. [6] proposed subspace clustering techniques, which analyze arbitrary subspace projections of the data to detect clustering structures. In their demonstration, they introduced the first subspace clustering extension for the well-established KNIME data mining framework. While KNIME offers many data mining functionalities, the novel extension provides a multitude of algorithms, data generators, evaluation measures, and visualization techniques specifically designed for subspace clustering. It also integrates into the KNIME framework, allowing a flexible combination of the existing KNIME features with the novel subspace components.

Mining temporal multivariate data by clustering is an important research topic. In complex data, interesting patterns are often bound neither to the whole dimensional extent nor to the whole temporal extent of the domain. This challenge is met by temporal subspace clustering methods. Under these conditions, however, existing temporal subspace clustering approaches miss patterns contained in the data. Kremer et al. [7] proposed a novel clustering method that mines temporal subspace clusters reflected by sets of objects and relevant data intervals.

They enabled flexible handling of misaligned time series by adaptively shifting time series in the time domain, and they achieved robustness to measurement errors by allowing certain fractions of deviating values at each relevant point in time.

An algorithm called GKM is proposed by Yufen Sun et al. [8] to generalize the k-means algorithm for high-dimensional data. In GKM, a weight vector is associated with each cluster to indicate which dimensions are relevant to that cluster. To prevent the value of the objective function from decreasing because of the elimination of dimensions, virtual dimensions are added to the objective function. The values of data points on the virtual dimensions are set artificially to ensure that the objective function is minimized when the real subspace clusters or the clusters in the original space are found. This algorithm preserves the advantages of k-means and identifies subspace clusters with linear time complexity.

Snehal Pokharkar et al. [9] proposed identifying "subspace trends" in high-dimensional datasets by focusing on information-rich subsets, and developed a new algorithm to extract such subspace trends.

Subspace clustering algorithms are used in [15] to improve the matching of images of the same object captured with different devices under different conditions. Experiments were performed on two distinct databases containing urban scenes, which were tested using state-of-the-art matching algorithms. The starting point was the hypothesis that weakly discriminative local point descriptors lead to misclassification, which can be reduced by employing clustering techniques as filters. Significant results were obtained for the two tested databases, which indicates that subspace clustering techniques have much to contribute to this research area.

IV. PROPOSED WORK


V. CONCLUSION

A new algorithm has to be proposed to find subspaces in high-dimensional data. The algorithm should be able to find suitable subspaces that give more information by making use of the IRS algorithm, and its efficiency has to be calculated.

REFERENCES

[1] Yi-Hong Chu, Jen-Wei Huang, Kun-Ta Chuang, and De-Nian Yang, "Density Conscious Subspace Clustering for High-Dimensional Data", IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 1, pp. 16-30, 2010.

[2] Geoff McLachlan, "Assessing the Significance of Groups in High-dimensional Data", In Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 1-6, 2010.

[3] Guanhua Chen, Xiuli Ma, Dongqing Yang, Shiwei Tang, and Meng Shuai, "Mining Representative Subspace Clusters in High-dimensional Data", In Proceedings of 6th IEEE International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 1, pp. 490-494, 2009.

[4] Lining Li and Changzhen Hu, "A Weighted Subspace Clustering Algorithm In High-Dimensional Data Streams", In Proceedings of 4th IEEE International Conference on Innovative Computing and Information Control (ICICIC), pp. 631-634, 2009.

[5] Zhongping Zhang and Hao Wang, "A Fast Subspace Partition Clustering Algorithm for High Dimensional Data Streams", In Proceedings of IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), vol. 1, pp. 491-495, 2009.

[6] Gunnemann Stephan, Kremer Hardy, Musiol Richard, Haag Roman, and Seidl Thomas, "A Subspace Clustering Extension for the KNIME Data Mining Framework", In Proceedings of 12th IEEE International Conference on Data Mining Workshops (ICDMW), pp. 886-889, 2012.

[7] Kremer Hardy, Gunnemann Stephan, Held Arne, and Seidl Thomas, "Effective and Robust Mining of Temporal Subspace Clusters", In Proceedings of 12th IEEE International Conference on Data Mining (ICDM), pp. 369-378, 2012.

[8] Yufen Sun, Gang Liu, and Kun Xu, "A k-Means-Based Projected Clustering Algorithm", In Proceedings of 3rd IEEE International Joint Conference on Computational Science and Optimization (CSO), vol. 1, pp. 466-470, 2010.

[9] Snehal Pokharkar and Chandan K. Reddy, "Identifying Information-Rich Subspace Trends in High-dimensional Data", In Proceedings of SIAM International Conference on Data Mining (SDM), pp. 557-568, April 2009.

[10] Margaret H Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, 2006.

[11] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd edition, Morgan Kaufmann Publishers, March 2006.

[12] Haiyan Bian, Finding Interesting Subspace Clusters from High Dimensional Datasets, University of Cincinnati, 2006.

[13] Rakesh Agrawal, Dimitrios Gunopulos, Prabhakar Raghavan and Johannes Gehrke, Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications, IBM, 1998.

[14] Yan Zhang and Yiyu Jia, "EProbe: An Efficient Subspace Probing Framework", 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), pp. 841-848, 2011.

[15] Coelho M., Valle E., Santos Junior C., De Albuquerque Araújo A., "Subspace Clustering for Information Retrieval in Urban Scene Databases", 24th IEEE International Conference on Graphics, Patterns and Images (Sibgrapi), pp. 173-180, 2011.

[Figure: block diagram of the proposed method. Recoverable block labels: IRS identification, Finding subspace, Bicluster, Pattern Graph Generation, Identification of independent paths, Trends.]