MULTI-FEATURED DATA ANALYSIS ALGORITHM FOR FEATURE SUBSET SELECTION BASED ON CLUSTERING

(1)

MULTI-FEATURED DATA ANALYSIS ALGORITHM FOR FEATURE SUBSET SELECTION BASED ON

CLUSTERING

T Bhavana Bhat

¹

, M V Trupthi

²

, Meghana Satish

³

1,2,3

Information Science, NIE, Mysore, (India)

ABSTRACT

Feature selection, as a preprocessing step to machine learning and is very effective in reducing dimensionality, irrelevant data removal, improving learning accuracy, and enhancing result comprehensibility. However, the exponential increase of dimensionality of data poses a severe challenge to many existing feature selection methods with respect to efficiency and effectiveness. Proposed Algorithm can be used to Identify and remove irrelevant data set. This proposed algorithm implements using two different steps that is graph theoretic clustering methods and representative feature cluster is selected. . In the second step, subset of features is formed by the feature that is strongly related to target classes. The main focus of feature subset selection research is on searching for relevant features. The proposed logic has focused on minimized redundant data set and improves the feature subset accuracy. The clustering-based strategy of proposed algorithm has a high probability of producing a subset of independent and useful features since features in different clusters are relatively independent. We adopt the efficient minimum-spanning tree (MST) clustering method to ensure efficiency of proposed algorithm . Empirical study is used to evaluate the efficiency and effectiveness of the proposed algorithm.

Keywords: Feature Subset Selection, Filter Technique, Feature Clustering, Graph-Based Clustering

I. INTRODUCTION CLASSIFICATION

Data mining refers to "using a variety of tools and techniques to identify useful information and extracting them in such a way that they can be put to use in various areas like decision support, prediction, microarray, text mining and forecasting. The data is of huge volume, but as it stands of low value as no direct use can be made of it; hidden information in the data is of greater use”. Data mining tools have to refer a standard model from the database, and supervised learning requires theuser to define one or more classes.

The database contains attributes that denote the class of a tuple it is one in number or more and these are known as predicted attributes whereas the rest of the attributes are called predicting attributes. Class is defined by a combination of values for the predicted attributes. When learning classification rules the system has to find the rules that predict the class from the predicting attributes so firstly the user has to define conditions for each class, the data mining system then constructs descriptions for the classes. Basically the system given a case or tuple with certain known attribute values should be able to predict what class this case belongs to, once classes are defined the system should infer rules that govern the classification therefore the system should be able to find the description of each class. With the aim of choosing a subset of good features with respect to the target concepts, feature subset selection is the best way for dimensionality reduction, irrelevant data removal,

(2)

improving learning accuracy and increasing result comprehensibility. Various feature subset selection methods have been proposed and studied for machine learning applications. They are divided into four broad categories:

the Embedded, Wrapper, Filter, and Hybrid approaches. The embedded methods use feature selections a part of the training process and are usually specific to given learning algorithms, and therefore maybe more efficient than the other three categories. Traditional algorithms of machine learning like decision trees or artificial neural networks are examples of embedded approaches.

The wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets, the learning accuracy of the learning algorithms is usually high. However, the computational complexity is large and the generality of the selected features is limited. The filter methods are independent of learning algorithms, with good generality. Their complexity of computation is low, but the learning algorithms’s accuracy is not guaranteed the hybrid methods area combination of filter and wrapper methods byusing a filter method to reduce search space that will be considered by the subsequent wrapper. This mainly focus on combining filter and wrapper methods to achieve the best possible performance with a particular learning algorithm with similar time complexity of the filter methods. The computation of wrapper methods are extreamly expensive and tend to overfit on small training sets. The filter methods, along with their generality, are usually a good choice when large number of features are present. Thus, we intend to focus on the filter method in this paper.

The rest of the paper is organized as follows In Section II, the new feature subset selection algorithm FAST is proposed. In Section III, experimental results to support the proposed FAST algorithm are reported. Finally, in Section IV, we summarize the present study and draw some conclusions.

II FEATURE SELECTION 2.1 Framework and Definitions

Feature selection, also known as attribute selection, variable selection or variable subset selection, is a process of selecting a subset of relevant features for model construction. The central assumption is that when using a feature selection technique, the data contains many redundant or irrelevant features.

Redundant features provide no more information than the features currently selected, and irrelevant features provide no useful information with respect to a context. Feature selection techniques are a subset of feature extraction. New features are created in feature extraction from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are usually used in domains where there are many features and comparatively few samples (or data points). Feature selection is used in analyzing DNA microarrays, where there are thousands of features, and a few tens to hundreds of samples. The three main benefits provided by feature selection techniques when constructing predictive models are.

 Improved model interpretability,

 Enhanced generalization by reducing over fitting.

 Shorter training times

Feature selection shows which features are important for prediction, and how these features are related and thus is a useful part of data analysis process. The accuracy of the learning machines is severely affected by Irrelevant

(3)

features and redundant features. Thus, feature subset selection should identify and remove as much of the irrelevant and redundant information as possible. Features in good feature subsets are highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other

Fig. 1 Framework of the Proposed Feature Subset Selection Algorithm

Graph-theoretic methods in clustering are used in many applications. Their results sometimes have had the best agreement with human performance. The general graph-theoretic clustering is as follows: computing a neighborhood graph of the instances, then removing any edge in the graph that is much longer/shorter (according to some criterion) than its neighbors. The result is a forest with each tree in the forest represents a cluster. In our approach we apply graph-theoretic clustering methods to features. We adopt the minimum spanning tree (MST)-based clustering algorithms, in particular.

Fig 2 Architecture of Proposed Method

III IRRELEVANT FEATURES REMOVAL

Irrelevant feature removal and redundant feature elimination are the two connected components in feature selection framework.

(4)

Features relevant to the target concept are obtained by the former. This is done by eliminating irrelevant ones.

The latter removes redundant features from relevant ones by choosing representatives from different feature clusters in order to produce the final subset. The irrelevant feature removal is pretty straightforward once the right relevance measure is selected or defined. However the redundant feature elimination is a bit of sophisticated. In our proposed algorithm, it involves 1) construction of minimum spanning tree from a weighted complete graph; 2) partitioning of the MST into a forest where each tree represents a cluster; and 3) the selection of representative features from the clusters.

3.1 Load Data and Classify

The process is loaded with the data. The data has to be preprocessed for the removal of missing values, outliers and noise. The given dataset must then be converted into the arff format which is the standard format for the WEKA toolkit. Attributes and values from the arff format are extracted and then stored in the database. The last column of the dataset is considered as the class attribute and distinct class labels from that are selected and the entire dataset is classified with respect to class labels.

3.2 Information Gain Computation

Relevant features are necessary for a best subset, since they have strong correlation with the target concept. On the other hand while redundant features have their values completely correlated with each other and thus don’t give out the best subset. In this module Information gain is calculated to find the relevance of each attribute with the class label. Also said to be Mutual Information measure, it measures the amount by which the distribution of the feature values and target classes differ with respect to the statistical independence. This estimates nonlinearly the correlation between feature values or between feature values and target classes. The symmetric uncertainty (SU) evaluates the worth of a set of attributes. It is used to evaluate the goodness of features for classification

The symmetric uncertainty is defined as follows:

Gain (X|Y) = H(X) − H(X|Y) …..(1)

= H(Y)−H(Y│X)…..(2)

In order to calculate gain, we need to find the entropy and conditional entropy values. Th equations are as follows:

……(3)

…….(4)

3.3 T-Relevance Calculation

It is denoted by SU(Fi,C). The relevance between the feature and the target concept C is referred to as the T- Relevance of feature Fi and concept C. If SU(Fi,C) is greater than a predetermined threshold, we say that Fi is a

(5)

strong T-Relevance feature. , =2 × + After finding the relevance value, the redundant attributes will be removed with respect to the threshold value.

3.4 F-Correlation Calculation

The correlation between any pair of features Fi and Fj (Fi,Fj € ^ F ^ i ≠ j) is called the F-Correlation of Fi and Fj, and denoted by SU(Fi, Fj). The equation symmetric uncertainty which is used for finding the relevance between the attribute and the class is again applied to find the similarity between two attributes with respect to each label.

3.5 MST Construction

With the F-Correlation value computed above in F-correlation calculation the Minimum Spanning tree can be constructed. For which we use Kruskals algorithm which form minimum spanning tree ( MST ) effectively. Kruskal's algorithm is a type of greedy algorithm in which graph theory that finds a minimum spanning tree for a connected weighted graph. This means that it finds a subset of the edges that forms a tree that includes every vertex in which, where the weight of all the total edges in the tree is minimized. If there is no connection in the graph, then it easily finds a minimum spanning forest (a minimum spanning tree for each of connected component).

Description:

 Create a forest F (a set of trees), in which each vertex in the graph is separated by tree.

 Create a set S which contains all the edges in the graph

 While S is nonempty and also F is not yet spanning.

 Then Remove an edge with minimum weight from set S

 If that edge in which it contains two different trees, then add it to the forest F, combining the two

trees into a single tree.

 Otherwise remove that edge.

At the end ( termination ) of the algorithm, the forest forms a minimum spanning forest of the graph.

If the graph is connected, then the forest has a single component and forms a minimum spanning tree ( MST ). The sample tree is as indicated in the diagram,

Fig 3 Example of the clustering step ALGORITHM

_________________

inputs: D(F1, F2, ...,Fm , C ) - the given data set - the T-Relevance threshold.

(6)

output: S - selected feature subset .

//==== Part 1 : Irrelevant Feature Removal ====

1 for i = 1 to m do

2 T-Relevance = SU (Fi , C) 3 if T-Relevance > then 4 S=SU {Fi};

//==== Part 2: Minimum Spanning Tree Construction ====

5 G = NULL; //G is a complete graph 6 for each pair of features {Fi,Fj}<S do 7 F-Correlation = SU (F,Fj)

8 Add F iand/or Fj to Gwith F-correlation as The weight of the corresponding edge, 9 minSpanTree = KRUSKALS(G); //Using KRUSKALS Algorithm to generate the minimum spanning tree

//==== Part 3: Tree Partition and Representative Feature Selection ===

10 Forest = minSpantree

11for eachedge belongs to forest do

12 if SU( Fi, Fj)<SU(Fi,C)^SU(Fi,Fj)<SU(Fj,c) Then

13Forest=Forest- Ei 14 S = pie

15for each belongs to forest do

16Fjr=argmaxFkbelongstoTiSU(F,k,C) 17 S=S U {FjR};

18 returnS

In this tree as shown above, the vertices represent the relevance value and the edges represent the

F-Correlation value. The complete graph G reflects the correlations shows all the target-relevant

features. Unfortunately, when graph G has k vertices and k(k-1)/2 edges. For high-dimensional data

when used , it is heavily dense and the edges with different weights are strongly interwoven (

connected). Moreover, its decomposition of complete graph is NP-hard respectively. Thus for graph G,

we build an minimum spanning tree (MST), which connects all the vertices such that the total sum of

the weights of the edges is minimum, using the well-known algorithm I,e Kruskal algorithm. The

weight of edge is (Fi`,Fj`) which is nothing but F-Correlation SU(Fi`,Fj`).

(7)

3.6 Cluster Formation

After building or the completion of the minimum spanning tree (MST), in the third step, we firstly remove the edges whose weights are smaller than both of the T-Relevance SU(Fi`, C) and SU(Fj`, C), from the minimum spanning tree (MST). After removing all the unwanted edges, a forest Forest F is obtained. Each tree Tj € Forest indicates a cluster that is denoted as V (Tj), which is the vertex set of Tj as well. As we have illustrated above, the features in each cluster are redundant, so for each cluster V (Tj) we choose a representative feature Fj R who‟s T-Relevance SU(Fj R,C) is the greatest.

Fig 4 Data Flow Diagram IV EXPERIMENTAL RESULTS

The experimental results of selected features, considering proportion of features selected, classification accuracy, time to obtain the feature subset and win/draw/loss record. Here, a nonparametric Friedman test is performed for the purpose of exploring statistical significance of results followed by Nemenyi post-hoc test, on multiple data sets to statistically compare algorithms.

4.1 Proportion of Selected Features

For each data set, the proportion of selected features of six feature selection algorithms are recorded. It is observed that by selecting only a small portion of the original features, all the six algorithms achieve significant reduction of dimensionality. FAST obtains the best proportion of selected features of 1.82% on an average.

FAST wins other algorithms which are shown by win/draw/loss records. The proportion of selected features has been improved by each of the six algorithms compered with that on given data sets for microarray data. This indicates that the six algorithms work well with microarray data. Only CFS among six algorithms, cannot choose features for two data sets whose dimensionalities are 19994 and 49152 respectively, FAST ranks 1 again with the proportion of selected features of 0.71%. FAST ranks 1 again with a margin of 0.48% to the second best algorithm FOCUS-SF for the text data. TABLE 2: Proportion of selected features of the six feature selection algorithms.

(8)

The classification accuracy of Naïve Bayes. It is observed that :1) The classification accuracy of Naïve Bayes has been improved by FAST,CFS AND FCBF by 12.86%, 6.62% and 4.32% respectively is compared with original data.. 2) Unfortunately, Relief, Consist, and FOCUS-SF have decreased the classification accuracy by 0.32%, 1.35%, and 0.86%, respectively. FAST ranks 1 with a margin of 6.24% to the second best accuracy 80.60% of CFS. The Win/Draw/Loss records show that FAST outer forms all other five algorithms at the same time. 3) The classification accuracy of Naive Bayes has been improved by all the six algorithms FAST, CFS, FCBF, ReliefF, Consist, and FOCUS-SF by 16.24%, 12.09%, 9.16%, 4.08%, 4.45%, and 4.45%, respectively for microarray data. FAST ranks 1 with a mar-gin of 4.16% to the second best accuracy 87.22% of CFS. When using Naive Bayes to classify microarray data, this shows the FAST is more effective than others. 4) For text data, FAST and CFS have improved the classification accuracy of Naive Bayes by 13.83% and 1.33%, respectively. For other four algorithms Re-liefF, Consist, FOCUS-SF, and FCBF the accuracy has been decreased by 7.36%, 5.87%, 4.57%, and 1.96%, respectively. FAST ranks 1 with a margin of 12.50% to the second best accuracy 70.12% of CFS.

From the classification accuracy of RIPPER it is observed that: The classification ac-curacy of RIPPER has been improved by the five feature selection algorithms FAST, CFS, FCBF, Consist, and FOCUS-SF by 7.64%, 4.08%, 4.51%, 5.48%, and 5.32%, respectively compared with the original data; and FAST ranks 1 with a margin of 2.16% to the second best accuracy 78.46% of Consist has been decreased by relief by 2.04%. FAST outperforms all other algorithms which are shown by the win/draw/loss records.

4.2 Time Complexity Analysis

The computation of SU values for T-Relevance and F-Correlation, which has linear complexity in terms of the number of instances in a given data set are involved by the amount of work for algorithms. A linear time complexity O(m) in terms of the number of features m is the first part of algorithm. Only one feature is selected by assuming k(1≤ k ≤m) features are selected as relevant ones in the first part, when k = 1. Thus ,the complexity is O(m) and there is no need to continue the rest parts of the algorithm. The second part of the algorithm first constructs a complete graph from relevant features and the complexity is O (k^2), and then generates an MST from the graph using Prim algorithm whose time complexity is O (k^2) when 1 < k ≤ m. The third part partitions of the MST is to choose the representative features with the complexity of O(k).

Thus the complexity of the algorithm is O(m+k^2) when 1 < k ≤ m. FAST has linear complexity O(m), when k≤

square root of m while obtains the worst complexity O(m^2) when k = m. However, in the implementation of FAST, k is heuristically set to be | square root of m * log m|. Hence, the complexity is O(m*lg^2 m, which is typically less than O(m^2) since lg^2 m<m. This can be explained as follows. Let f (m) = m – lg^2 m, so the derivative fʹ(m)= 1- 2 lg e/m, which is greater than zero when m > 1. So f(m) is a increasing function and it is greater than fð1Þ which is equal to 1, i.e. ,m > lg^2 m, when m > 1. This means the bigger the m is, the farther the time complexity of FAST deviates from O(m^2). Thus, the time complexity of FAST is far more less than O(m^2) on high-dimensional data. With the high-dimensional data, FAST has a better runtime performance.

(9)

V. CONCLUSION

In this paper, for high-dimensional data we have presented a novel clustering-based feature subset selection algorithm. The algorithm involves (I) irrelevant features are removed, (ii) a minimum spanning tree from relative ones are constructed, and (iii) selecting representative features by partitioning the MST. In the proposed algorithm, a cluster consists of features. Thus dimensionality is drastically reduced by treating each cluster as a single feature. The performance of the proposed algorithm is compared with those of the five well-known feature selection algorithms Relief, FCBF, CFS, FOCUS-SF and consist on the 35 publicly available images, microarray, and text data from the four different aspects of the proportion of selected features, runtime, the Win/draw/Loss record and accuracy of given classifier. Here, for Naïve Bayes, C4.5, and RIPPER the proposed algorithm obtained the best proportion of selected features, the best runtime and the second best classification accuracy for IB1.The Win/draw/Loss records confirmed the conclusions. For microarray data it is found that FAST obtains the rank of 1, the rank of 2 for text data, and the rank of 3 for image data for classification accuracy and a good alternative is CFS.

FCBF is a good alternative for image and text data at the same time. The alternatives for text data can be Consist and FOCUS-SF. To explore different types of correlation measures, and study some formal properties of feature space can be the future plan.

VI ACKNOWLEDGMENT

The success and final outcome of this project, Multi-Featured Data Analysis Algorithm for Feature Subset Selection Based on Clustering, required a lot of guidance and assistance from many people and we are extremely fortunate to have got this all along the completion of our project work. It was quiet exciting and worthwhile for us to work on this project , as during this work, we have gained a lot of practical and theoretical knowledge of great significance. The satisfaction would be incomplete without acknowledging the people without whom this project wouldn’t have had been a success. First and foremost, we would like to thank Dr. G.

L. Shekar, Principal, NIE, Mysore, for his moral support towards the completion of our project work. We deeply express our sincere gratitude to our guide Dr C.N Chinnaswamy, Assnt Professor, NIE, Mysore for his valuable guidance, constant encouragement, support and suggestions for improvement. .Also, we would like to extend our sincere regards to all the non-teaching staff of the IS&E department for their timely support. Finally, an honorable applause to our families and friends for extending their support in completing this project. Without the help of the above mentioned, it wouldn’t be possible to successfully get through this project! Hence I extend my deepest thanks to all of them. This being one of our first design experiences has proved to be extremely advantageous and will surely be of great help for the overall development of our future

REFERENCES

[1] Almuallim H. and Dietterich T.G., Algorithms for Identifying Relevant Features, In Proceedings of the 9th Canadian Conference on AI, pp 38-45,1992.

[2] Bell D.A. and Wang, H., A formalism for relevance and its application in feature subset selection, Machine Learning, 41(2), pp 175-195, 2000.

[3] Biesiada J. and Duch W., Features election for high-dimensional data ła Pearson redundancy based filter, Advances in Soft Computing, 45, pp 242C249,2008.

(10)

[4] Qinbao Song, Jingjie Ni, and Guangtao Wang, “A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data” IEEE Transactions on knowledge and data engineering, Vol. 25, No. 1,January 2013.

[5] L.D. Baker and A.K. McCallum, “Distributional Clustering of Words for Text Classification,” Proc. 21st Ann. Int’l ACM SIGIR Conf. Research and Development in information Retrieval, pp. 96-103, 1998.