74 | P a g e
HIGH DIMENSIONAL DATA ANALYSIS ALGORITHM BASED ON CLUSTERING FOR
FEATURE SUBSET SELECTION
Kaushal V¹, C.N. Chinnaswamy², T.H. Sreenivas³
¹ IS&E, The National Institute of Engineering, Mysore (India)
² Associate Professor, IS&E, The National Institute of Engineering, Mysore (India)
³ Professor, CS&E, The National Institute of Engineering, Mysore (India)
ABSTRACT
Feature selection is a pre-processing step for machine learning that is effective in reducing dimensionality, removing irrelevant data, improving learning accuracy, and enhancing result comprehensibility. However, the rapid growth in the dimensionality of data poses a severe challenge to many existing feature selection methods with respect to both efficiency and effectiveness. The proposed algorithm identifies and removes irrelevant and redundant features. It works in two steps. In the first step, features are divided into clusters using graph-theoretic clustering methods. In the second step, the most representative feature of each cluster, the one most strongly related to the target class, is selected to form the feature subset. Feature subset selection research has mainly focused on searching for relevant features; the proposed algorithm also minimizes redundancy and thereby improves the accuracy of the feature subset. Because features in different clusters are relatively independent of one another, the clustering-based strategy has a high probability of producing a subset of useful and independent features. We adopt the efficient minimum spanning tree (MST) clustering method to ensure the efficiency of the proposed algorithm. An empirical study is also used to evaluate the efficiency and effectiveness of the proposed algorithm.
Keywords: Feature Selection, Feature Subset Clustering, Graph Theoretic-Based Clustering, MST-Based Clustering / Construction, Correlation Calculation
I. INTRODUCTION CLASSIFICATION
Data mining refers to the use of a variety of tools and techniques to identify and extract useful information so that it can be put to use in areas such as decision support, prediction, microarray analysis, text mining and forecasting. The data is of very high volume but, as it stands, of low value, since no direct use can be made of it; it is the hidden information in the data that is of greater use. Data mining tools have to infer a model from the database, and in supervised learning the user has to define one or more classes. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes. A class is defined by a combination of values of the predicted attributes. When learning classification rules, the system has to find the rules that predict the class from the predicting attributes: first the user defines conditions for each class, and the data mining system then constructs descriptions for the classes.
Given a case or tuple with certain known attribute values, the system should be able to predict which class the case belongs to. Once the classes are defined, the system should infer the rules that govern the classification, and it should therefore be able to find the description of each class.
With the aim of choosing a subset of good features with respect to the target concepts, feature subset selection is an effective way of reducing dimensionality and removing redundant features, which improves learning accuracy and increases result comprehensibility. Many feature subset selection methods have been proposed and studied for machine learning applications. They can be divided into four broad categories:
1) Embedded, 2) Wrapper, 3) Filter, and 4) Hybrid approaches. The embedded methods incorporate feature selection as a part of the training process and are usually specific to a given learning algorithm, and therefore they may be more efficient than the other three categories. Traditional machine learning algorithms such as decision trees or artificial neural networks are examples of embedded approaches. The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, so the accuracy of the learning algorithm is usually high. However, the computational complexity is high and the generality of the selected features is limited. The filter methods are independent of learning algorithms and have good generality.
Their computational complexity is low, but the accuracy of the learning algorithm is not guaranteed. The hybrid methods are a combination of filter and wrapper methods: a filter method is used to reduce the search space that will be considered by the subsequent wrapper method. They aim to achieve the best possible performance with a particular learning algorithm at a time complexity similar to that of the filter methods. The wrapper methods are computationally expensive and tend to overfit on small training sets. The filter methods, in addition to their generality, are usually a good choice when the number of features is very large. Thus, we focus on the filter method in this paper. The remainder of the paper is organized as follows. In Section II, the new feature subset selection algorithm FAST is proposed. In Section III, experimental results with time analysis support the proposed FAST algorithm. Finally, in Section IV, we summarize the present study and draw some conclusions.
II. FEATURE SELECTION
2.1 Framework and Definitions of feature selection
Feature selection, also known as attribute selection, variable selection or variable subset selection, is the process of selecting a subset of relevant features for model construction. The central assumption when using a feature selection technique is that the data contains many irrelevant or redundant features.
Redundant features provide no more information than the currently selected features, and irrelevant features provide no information that is useful with respect to the context. Feature selection techniques are a subset of feature extraction techniques. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the original features. Feature selection techniques are mainly used in domains where there are many features and comparatively few samples (data points). A typical example is the analysis of DNA microarrays, where there are thousands of features and a few tens to hundreds of samples. Feature selection techniques provide three main benefits when constructing predictive models: 1) improved model interpretability, 2) enhanced generalization by reducing overfitting, and 3) shorter training times. Feature selection is also a useful part of the data analysis process, as it shows which features are important for prediction and how these features are related. The accuracy of learning machines is severely affected by irrelevant features and redundant attributes. Thus, feature subset selection should identify and remove as much irrelevant and redundant information as possible. Features in a good feature subset are highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) one another. Graph-theoretic methods are used in many applications, and their results sometimes have the best agreement with human performance. In graph-theoretic clustering, any edge in the graph that is much longer or shorter (according to some criterion) than its nearest neighbors is deleted. The result is a forest in which each tree represents a cluster. In our approach we apply graph-theoretic clustering methods to features, and the method we adopt is the minimum spanning tree (MST)-based clustering algorithm.
Fig. 1 Framework of the Proposed Feature Subset Selection Algorithm
III. IRRELEVANT FEATURES REMOVAL
Irrelevant feature removal and redundant feature elimination are the two connected components of the feature selection framework. The former obtains the features relevant to the target concept by removing the irrelevant ones. The latter removes redundant features from the relevant ones by choosing representatives from different feature clusters, and thus produces the final subset. Irrelevant feature removal is straightforward once the right relevance measure is defined or selected. Redundant feature elimination, however, is somewhat more sophisticated. In our proposed algorithm it involves three steps: first, the construction of a minimum spanning tree (MST) from a weighted complete graph; second, the partitioning of the MST into a forest in which each tree represents a cluster; and finally, the selection of representative features from the clusters.
3.1 Load Data and Classify
First, the data is loaded and preprocessed to remove missing values, outliers and noise. The given data set is then converted into the ARFF format, which is the standard format of the WEKA toolkit. The attributes and values are extracted from the ARFF file and stored in the database. The last column of the data set is taken to be the class attribute, from which the distinct class labels are obtained, and the entire data set is classified with respect to these class labels.
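The grouping step described above can be illustrated with a minimal Python sketch. It assumes the preprocessed data set is already available in memory as a list of rows whose last column is the class attribute; the function name `split_by_class` and the toy rows are hypothetical and not part of the original work.

```python
from collections import defaultdict


def split_by_class(rows):
    """Group rows by the value in their last column (the class attribute)."""
    groups = defaultdict(list)
    for row in rows:
        *features, label = row
        groups[label].append(features)
    return dict(groups)


# Toy data set (hypothetical): two features followed by the class label.
rows = [
    ["sunny", "hot", "no"],
    ["rainy", "mild", "yes"],
    ["sunny", "mild", "yes"],
]
groups = split_by_class(rows)
class_labels = sorted(groups)  # the distinct class labels
```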
3.2 Information Gain Computation
Relevant features have a strong correlation with the target concept and are therefore necessary for the best subset, whereas redundant features are not, because their values are completely correlated with one another. In this module, the information gain is calculated to find the relevance of each attribute with respect to the class label. Information gain is also known as the mutual information measure: it measures the amount by which the joint distribution of the feature values and target classes differs from statistical independence. It is a nonlinear estimate of the correlation between feature values, or between feature values and target classes. The symmetric uncertainty (SU) is used to evaluate the goodness of features for classification. The symmetric uncertainty is given by:

$$SU(X, Y) = \frac{2 \times Gain(X \mid Y)}{H(X) + H(Y)}$$

where H(X) is the entropy of a discrete random variable X and Gain(X|Y) = H(X) - H(X|Y) is the information gain, i.e. the amount by which the entropy of X decreases when Y is known.
3.3 T-Relevance Calculation
The relevance between the feature Fi and the target concept C is referred to as the T-Relevance of Fi and C, and is denoted by SU(Fi, C). If SU(Fi, C) is greater than a predetermined threshold, Fi is said to be a strong T-Relevance feature. Once the relevance values have been found, the irrelevant attributes are discarded according to the threshold value.
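The following Python sketch illustrates how the symmetric uncertainty and the T-Relevance filter could be computed for discrete features. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the handling of constant variables (SU set to 0) and the toy data are all assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy H(X) of a discrete sequence, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(X|Y): entropy of x within each group of equal y, weighted by group size."""
    n = len(x)
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * Gain(X|Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:  # both variables are constant; treated as uncorrelated (assumption)
        return 0.0
    gain = hx - conditional_entropy(x, y)
    return 2.0 * gain / (hx + hy)


def relevant_features(features, target, threshold):
    """Return the names of features whose T-Relevance SU(Fi, C) exceeds threshold."""
    return [name for name, column in features.items()
            if symmetric_uncertainty(column, target) > threshold]


# Toy data (hypothetical): F1 copies the class, F2 is independent of it.
C = [0, 0, 1, 1]
features = {"F1": [0, 0, 1, 1], "F2": [0, 1, 0, 1]}
selected = relevant_features(features, C, threshold=0.1)
```

Here F1 is kept and F2 is dropped, since SU(F1, C) is 1 and SU(F2, C) is 0. Continuous attributes would have to be discretized first, which this sketch does not cover.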
3.4 F-Correlation Calculation
The correlation between any pair of features Fi and Fj (Fi, Fj ∈ F ∧ i ≠ j) is called the F-Correlation of Fi and Fj and is denoted by SU(Fi, Fj). The symmetric uncertainty equation that is used to find the relevance between an attribute and the class is applied again to find the similarity between two attributes with respect to each label.
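As an illustration, the pairwise F-Correlation values can be collected as the weighted edges of a complete graph over the features. The sketch below is an assumption-laden example: it computes SU through the identity Gain(X|Y) = H(X) + H(Y) - H(X, Y), and the function name `f_correlation_edges` and the toy features are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(values):
    """Shannon entropy of a discrete sequence, in bits."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y), using Gain(X|Y) = H(X) + H(Y) - H(X, Y)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:  # both constant; treated as uncorrelated (assumption)
        return 0.0
    hxy = entropy(list(zip(x, y)))  # joint entropy H(X, Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)


def f_correlation_edges(features):
    """Return (weight, Fi, Fj) for every unordered pair of distinct features."""
    return [(symmetric_uncertainty(features[a], features[b]), a, b)
            for a, b in combinations(sorted(features), 2)]


# Toy features (hypothetical): F1 and F2 are identical, F3 is independent of both.
features = {"F1": [0, 0, 1, 1], "F2": [0, 0, 1, 1], "F3": [0, 1, 0, 1]}
edges = f_correlation_edges(features)
```

For m features this produces m(m-1)/2 edges, one per feature pair.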
3.5 MST Construction
A minimum spanning tree is built from the F-Correlation values calculated above. For this we use Kruskal's algorithm. Kruskal's algorithm is a greedy algorithm that finds a minimum spanning tree (MST) for a connected weighted graph: it finds a subset of the edges that forms a tree including every vertex, such that the total weight of all the edges in the tree is minimized. If the graph is not connected, it finds a minimum spanning forest (a minimum spanning tree for each connected component). Description:
1) Create a forest F (a set of trees) in which each vertex of the graph is a separate tree.
2) Create a set S containing all the edges in the graph.
3) While S is nonempty and F is not yet spanning:
4) Remove an edge with minimum weight from S.
5) If that edge connects two different trees, add it to the forest F, combining the two trees into a single tree.
6) Otherwise discard that edge.
When the algorithm terminates, the forest forms a minimum spanning forest of the graph. If the graph is connected, the forest has a single component and forms a minimum spanning tree (MST). An example of such a tree is shown in the following diagram.
Fig 2: Example of Clustering Step
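The Kruskal procedure described in Section 3.5 can be sketched in Python as follows. This is a minimal illustration, assuming edges are given as (weight, u, v) tuples; the union-find helper and the toy graph are illustrative choices, not details taken from the paper.

```python
def kruskal(vertices, edges):
    """Return the edges of a minimum spanning forest.

    edges is an iterable of (weight, u, v) tuples for an undirected graph.
    """
    parent = {v: v for v in vertices}  # every vertex starts as its own tree

    def find(v):
        # Follow parent links to the root, halving the path on the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for weight, u, v in sorted(edges):  # cheapest edges first
        root_u, root_v = find(u), find(v)
        if root_u != root_v:            # the edge joins two different trees
            parent[root_u] = root_v     # merge them into a single tree
            forest.append((weight, u, v))
        # otherwise the edge would close a cycle and is discarded
    return forest


# Toy weighted graph (hypothetical weights).
vertices = ["A", "B", "C", "D"]
edges = [(1, "A", "B"), (4, "A", "C"), (3, "B", "C"), (2, "C", "D"), (5, "B", "D")]
mst = kruskal(vertices, edges)
total_weight = sum(w for w, _, _ in mst)
```

Sorting the edges once replaces the repeated removal of a minimum-weight edge from S, and the loop simply runs over all edges instead of stopping as soon as the forest is spanning; the result is the same.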
ALGORITHM
inputs: D(F1, F2, F3, ..., Fm, C) - the given data set
        threshold - the T-Relevance threshold
output: S - the selected feature subset
//==== Part 1: Removal of Irrelevant Features ====
for i = 1 to m do
    T-Relevance = SU(Fi, C)
    if T-Relevance > threshold then
        S = S ∪ {Fi}
//==== Part 2: Construction of Minimum Spanning Tree ====
G = NULL    // G is a complete graph
for each pair of features {Fi, Fj} ⊂ S do
    F-Correlation = SU(Fi, Fj)
    add Fi and/or Fj to G with F-Correlation as the weight of the corresponding edge
minSpanTree = KRUSKALS(G)    // generate the minimum spanning tree using Kruskal's algorithm
//==== Part 3: Tree Partition and Representative Feature Selection ====
Forest = minSpanTree
for each edge Eij ∈ Forest do
    if SU(Fi, Fj) < SU(Fi, C) ∧ SU(Fi, Fj) < SU(Fj, C) then
        Forest = Forest - Eij
S = ∅
for each tree Ti ∈ Forest do
    FjR = argmax over Fk ∈ Ti of SU(Fk, C)
    S = S ∪ {FjR}
return S
In the tree given above, the vertices represent the relevance values and the edges represent the F-Correlation values. The complete graph G reflects the correlations among all the target-relevant features. Unfortunately, graph G has k vertices and k(k-1)/2 edges. For high-dimensional data it is heavily dense, and the edges with different weights are strongly interwoven. Moreover, the decomposition of a complete graph is NP-hard. Thus, for graph G, we build an MST (minimum spanning tree) that connects all the vertices such that the sum of the weights of the edges is minimum, using the well-known Kruskal's algorithm. The weight of edge (Fi', Fj') is the F-Correlation SU(Fi', Fj').
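As a quick check of the edge counts mentioned above, the short Python sketch below compares the complete graph with a spanning tree on the same vertices. The value k = 1000 is a hypothetical number of target-relevant features, chosen only for illustration.

```python
def complete_graph_edges(k):
    """Number of edges in a complete graph on k vertices: k(k-1)/2."""
    return k * (k - 1) // 2


def spanning_tree_edges(k):
    """Number of edges in a spanning tree on k vertices (k >= 1): k - 1."""
    return k - 1


k = 1000  # hypothetical number of target-relevant features
dense = complete_graph_edges(k)   # 499500 edges in the complete graph G
sparse = spanning_tree_edges(k)   # 999 edges in its MST
```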
3.6 Tree Partition and Representative Feature Selection
After the minimum spanning tree (MST) has been built, the next step is to remove from it every edge (Fi', Fj') whose weight is smaller than both of the T-Relevances SU(Fi', C) and SU(Fj', C). After these unnecessary edges have been removed, a forest is obtained. Each tree Tj ∈ Forest represents a cluster, denoted V(Tj), which is also the vertex set of Tj. As indicated above, the features within a cluster are redundant, so for each cluster V(Tj) we choose the representative feature FjR whose T-Relevance SU(FjR, C) is the greatest.
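The partitioning and selection step can be sketched in Python as follows. The sketch assumes that the MST edges and the T-Relevance values have already been computed; the function name `partition_and_select` and all numeric values are hypothetical and are not taken from any experiment.

```python
def partition_and_select(vertices, mst_edges, relevance):
    """Cut weak MST edges, then pick one representative feature per tree.

    mst_edges: (weight, Fi, Fj) tuples, weight = F-Correlation SU(Fi, Fj).
    relevance: dict mapping each feature to its T-Relevance SU(Fi, C).
    """
    # Keep an edge unless its weight is below the T-Relevance of both endpoints.
    kept = [(w, a, b) for w, a, b in mst_edges
            if not (w < relevance[a] and w < relevance[b])]

    # Build adjacency lists for the remaining forest.
    neighbours = {v: [] for v in vertices}
    for _, a, b in kept:
        neighbours[a].append(b)
        neighbours[b].append(a)

    # Each connected component of the forest is one cluster.
    seen, selected = set(), []
    for start in vertices:
        if start in seen:
            continue
        cluster, stack = [], [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            cluster.append(v)
            for n in neighbours[v]:
                if n not in seen:
                    seen.add(n)
                    stack.append(n)
        # Representative: the feature with the greatest T-Relevance.
        selected.append(max(cluster, key=relevance.get))
    return selected


# Toy values (hypothetical).
vertices = ["F1", "F2", "F3", "F4"]
relevance = {"F1": 0.5, "F2": 0.7, "F3": 0.6, "F4": 0.4}
mst_edges = [(0.8, "F1", "F2"), (0.3, "F2", "F3"), (0.5, "F3", "F4")]
subset = partition_and_select(vertices, mst_edges, relevance)
```

In this toy example only the edge between F2 and F3 is removed, which leaves the clusters {F1, F2} and {F3, F4}, represented by F2 and F3 respectively.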
V. CONCLUSION
In this paper, we have presented FAST, a novel clustering-based feature subset selection algorithm for high-dimensional data.
Clustering is a grouping process in which data objects are grouped into classes or clusters, so that objects within a cluster are highly similar to one another but very dissimilar to objects in other clusters. The algorithm involves (i) removing irrelevant features, which have no or only weak correlation with the target concept, (ii) constructing a minimum spanning tree (MST) from the relevant ones, and (iii) partitioning the MST and selecting representative features, i.e. eliminating redundant features. In the proposed algorithm, every cluster consists of a set of features: redundant features are gathered in a cluster, and the most representative feature is taken out of the cluster for the final subset. Each cluster is thus treated as a single feature, and dimensionality is greatly reduced.
VI. ACKNOWLEDGMENT
In carrying out my research, I had the help and guidance of several respected persons, who deserve my greatest gratitude. Completing the project has given me much pleasure. I would like to express my humble gratitude to Dr. K. Raghuveer, Head of the Department (HOD), Information Science and Engineering (IS&E). This institution has given me a great atmosphere in which to learn about the subject, and good guidance throughout this project through numerous consultations with various faculty members. I would also like to thank all those who have directly and indirectly guided me in writing this project.
In addition, a big thank you to C.N. Chinnaswamy, Associate Professor, Information Science & Engineering, and T.H. Sreenivas, Professor, Computer Science & Engineering, of The National Institute of Engineering (NIE), who introduced me to the methodology of this work, and whose tremendous passion for data mining had a lasting impact on me.
Many people, especially my classmates and team members, have made valuable comments and suggestions on this proposal, which inspired me to improve my project. I thank all of them for their direct and indirect help in completing my work.