Categories of FS Methods

Feature selection methods (Liu H and Motoda H (1998)) fall into three categories:

1. Complete methods: these include exhaustive and some non-exhaustive methods.

Exhaustive methods are guaranteed to find an optimal subset by generating and checking all possible candidate subsets; however, they are only applicable when time is not an issue and the whole set of relevant features is small. With more features, an optimal subset can in some cases still be guaranteed without checking every candidate, using search strategies such as Branch and Bound.
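As a rough illustration of exhaustive search, the sketch below checks candidate subsets in order of increasing size and returns the first one that is consistent, meaning that no two instances agree on the selected features but carry different labels. This is only in the spirit of consistency-based exhaustive methods such as Focus, not a reproduction of any published algorithm; the function names and the toy data are assumptions made for the example.

```python
from itertools import combinations

def is_consistent(rows, labels, subset):
    """True if no two rows agree on the features in `subset` but differ in label."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True

def exhaustive_min_consistent_subset(rows, labels):
    """Try every subset, smallest first; the first consistent one is minimal."""
    n_features = len(rows[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(rows, labels, subset):
                return subset
    return None  # inconsistent even when all features are used

# Hypothetical toy data: the label is feature 0 XOR feature 2; feature 1 is noise.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
labels = [a ^ c for a, _, c in rows]

print(exhaustive_min_consistent_subset(rows, labels))  # (0, 2)
```

In the worst case the loop visits all 2^n subsets, which is why this approach is only practical for small feature sets.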

Figure 2.1. A hierarchy of feature selection methods.

A study by Dash M and Liu H (2003) compared five search algorithms: exhaustive (Focus), complete (ABB), heuristic (Set Cover), probabilistic (LVF), and a hybrid of complete and probabilistic search (QBB). The results can be read as guidelines for choosing an algorithm under particular circumstances. Despite their time cost, Focus and ABB were preferable because they guarantee the smallest consistent subsets. In the usual case of limited computing time, however, a user is best advised to choose between LVF and QBB.

Pudil P and Novovicova J (1998) set out guidelines for choosing a feature selection method based on knowledge of the problem to be solved. They built a preliminary flowchart indicating which methods to choose according to the characteristics of the problem. For example, if the total number of features is greater than 30, sequential feature selection methods are recommended; otherwise a branch and bound search is suggested.

[Figure 2.1 node labels: Feature Selection Methods; Complete, Heuristic, Random; Exhaustive (Focus), Non-exhaustive (Branch & Bound); Evolutionary Algorithm, Instance Based, Relief, Relief F, Analytic, CFS, PCA, MIT, ANN]

An optimal feature selection method cannot be improved in terms of accuracy, but its time complexity leaves a lot to be desired. An improved branch and bound method (IBAB), proposed by Chen X (2003), aims to reduce the search time that conventional branch and bound usually requires. Partial paths, which are sub-paths of branch and bound paths, are searched for. If a partial path is found whose criterion function value is less than the current stored best for partial paths, then all full paths containing this partial path are ignored. However, by reducing the time taken to perform the branch and bound search, optimality is compromised.
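For orientation, here is a minimal sketch of conventional branch and bound for feature selection, not of the IBAB variant. It assumes a monotonic criterion, that is, one where removing a feature never increases the score, so any branch whose score is already no better than the best complete subset found so far can be discarded. The function name, the additive toy criterion and the weights are all assumptions made for the example.

```python
def branch_and_bound(n_features, target_size, criterion):
    """Return (subset, score) maximising a monotonic `criterion` at `target_size`."""
    best = {"subset": None, "score": float("-inf")}

    def search(subset, start):
        score = criterion(subset)
        # Prune: every descendant is a subset of `subset`, so by monotonicity
        # none of them can score higher than `score`.
        if score <= best["score"]:
            return
        if len(subset) == target_size:
            best["subset"], best["score"] = subset, score
            return
        # Remove features in increasing index order so each subset is visited once.
        for f in subset:
            if f >= start:
                search(tuple(g for g in subset if g != f), f + 1)

    search(tuple(range(n_features)), 0)
    return best["subset"], best["score"]

# Hypothetical monotonic criterion: sum of non-negative per-feature weights.
weights = [9, 1, 5, 7, 3]

def additive_criterion(subset):
    return sum(weights[f] for f in subset)

print(branch_and_bound(5, 2, additive_criterion))  # ((0, 3), 16)
```

The pruning step is what separates this from exhaustive enumeration: the result is still optimal as long as the monotonicity assumption holds, but fewer candidates are evaluated.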

2. Heuristic methods: sequential search methods. Although these algorithms may not guarantee minimal-size subsets, they are efficient at generating consistent subsets of close to minimal size in much less time when both the number of relevant features and the total number of features are large.
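A sequential search of this kind can be sketched as greedy forward selection: start from the empty set and repeatedly add the single feature that most improves a criterion. The sketch below is a generic illustration rather than any specific published algorithm; the function name and the toy criterion table are assumptions, chosen so that the greedy result is visibly not the optimal pair.

```python
def sequential_forward_selection(n_features, target_size, criterion):
    """Greedily add the feature that gives the best criterion value at each step."""
    selected = []
    remaining = set(range(n_features))
    while len(selected) < target_size and remaining:
        best_feature = max(
            sorted(remaining),
            key=lambda f: criterion(tuple(sorted(selected + [f]))),
        )
        selected.append(best_feature)
        remaining.remove(best_feature)
    return tuple(sorted(selected))

# Hypothetical criterion: feature 0 is best alone, but features 1 and 2
# are the best pair together.
scores = {
    (0,): 5, (1,): 4, (2,): 3,
    (0, 1): 7, (0, 2): 6, (1, 2): 9,
}

def table_criterion(subset):
    return scores[subset]

print(sequential_forward_selection(3, 2, table_criterion))  # (0, 1)
```

Only five subsets are evaluated here, but the greedy choice of feature 0 in the first step rules out the better pair (1, 2), which is the trade-off between speed and optimality that heuristic methods accept.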

Jain A and Zongker D (1997) evaluate different feature selection methods, looking specifically at their advantages and disadvantages for particular problems. The experiments conducted in the study demonstrated the existence of the curse of dimensionality, also known as the Hughes paradox or the peaking phenomenon. For a feature selection algorithm there appears to be an optimal number of features that can be selected; adding more features causes the classification error to rise. This effect seems counterintuitive: the more information about a problem is used, the fewer mistakes should be expected. It has been attributed to the fact that real datasets are finite in size and, as such, yield only imperfect estimates of the underlying probability distributions.

Kudo M and Sklansky J (2000) also compare feature selection algorithms for classifiers. The study compares branch and bound methods, sequential algorithms and genetic algorithms on a variety of small, medium and large datasets. It concludes that the sequential algorithms can give better results than the other methods on the small and medium-sized datasets.

Gadat S and Younes L (2007) introduce a new model addressing feature selection from a large dictionary of variables that can be computed from a signal or an image. Features are extracted according to an efficiency criterion, on the basis of specified classification or recognition tasks. This is done by estimating a probability distribution P on the complete dictionary, which gives the most probability to the more efficient, or informative, components. A stochastic gradient descent algorithm is implemented by using the probability as a state variable and optimizing a goodness-of-fit criterion for classifiers based on variables randomly chosen according to P. Classifiers are then generated from the optimal distribution of weights learned on the training set. The approach is tested on several pattern recognition problems, including face detection, handwritten digit recognition, spam classification and micro-array analysis. The results show that performance is significantly improved over an initial rule in which features are simply uniformly distributed. The Optimal Feature Weighting method (Scherf M and Brauer W (1997)) is moreover competitive with other feature selection algorithms and leads to an algorithm that does not depend on the nature of the classifier used, whereas, for instance, RFE and L0-SVM are based only on SVM.

3. Random methods: these methods generate the candidate subsets randomly but often use supervised guidance, which allows mutation in the search logic so that alternative areas of the feature space are explored. Random methods cannot guarantee the discovery of the optimal subset.
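A simplified random search in the style of a Las Vegas filter can be sketched as follows: draw random subsets smaller than the current best and keep any whose inconsistency rate stays within a threshold. This is a loose sketch, not the published LVF procedure; the function names, parameters and toy data are assumptions, and the full feature set is assumed to satisfy the threshold at the start.

```python
import random
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Fraction of rows outside the majority label among rows matching on `subset`."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][label] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return clashes / len(rows)

def random_subset_search(rows, labels, max_tries=200, threshold=0.0, seed=0):
    """Keep the smallest randomly drawn subset within the inconsistency threshold."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    best = tuple(range(n_features))  # assumption: the full set meets the threshold
    for _ in range(max_tries):
        if len(best) <= 1:
            break  # nothing smaller left to try
        size = rng.randint(1, len(best) - 1)  # only consider strictly smaller subsets
        candidate = tuple(sorted(rng.sample(range(n_features), size)))
        if inconsistency_rate(rows, labels, candidate) <= threshold:
            best = candidate
    return best

# Hypothetical toy data: the label is feature 0 XOR feature 2; feature 1 is noise.
rows = [(0, 0, 0), (0, 1, 1), (1, 0, 0), (1, 1, 1), (0, 1, 0), (1, 0, 1)]
labels = [a ^ c for a, _, c in rows]

best = random_subset_search(rows, labels)
print(best, inconsistency_rate(rows, labels, best))
```

The returned subset always meets the threshold, but whether it is the smallest such subset depends on the random draws and the number of tries, which is why no optimality guarantee can be given.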

Juliusdottir T et al. (2005) investigate a simple evolutionary algorithm/classifier combination on two microarray cancer datasets, where the combination is applied twice: once for feature selection, and once for further selection and classification. Their contributions are: (further) demonstration that a simple EA/classifier combination is capable of good feature discovery and classification performance with no initial dimensionality reduction; and demonstration that a simple repeated EA/k-NN approach is capable of competitive or better performance than methods using more sophisticated pre-processing and classifier methods.
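To make the idea of an evolutionary search over feature subsets concrete, the sketch below evolves bit masks (1 means the feature is selected) using tournament selection, bit-flip mutation and elitism. It is a generic illustration and not the algorithm used in the study above; in an EA/k-NN setting the fitness would typically be a classifier's validation accuracy, whereas here a hypothetical toy fitness stands in for it, and all names and parameter values are assumptions.

```python
import random

def evolve_feature_mask(n_features, fitness, pop_size=20, generations=30,
                        mutation_rate=0.1, seed=0):
    """Return (best_mask, history) where history[i] is the best fitness so far."""
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    history = [fitness(best)]
    for _ in range(generations):
        children = [best]  # elitism: carry the best mask forward unchanged
        while len(children) < pop_size:
            # Binary tournament: the fitter of two random individuals is the parent.
            a, b = rng.sample(population, 2)
            parent = a if fitness(a) >= fitness(b) else b
            # Bit-flip mutation: each bit flips with probability `mutation_rate`.
            child = tuple(bit ^ int(rng.random() < mutation_rate) for bit in parent)
            children.append(child)
        population = children
        best = max(population, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical fitness: reward selecting features 0 and 2, penalise subset size.
def toy_fitness(mask):
    return 2 * (mask[0] + mask[2]) - sum(mask)

best_mask, history = evolve_feature_mask(6, toy_fitness)
print(best_mask, history[-1])
```

Because the best mask is always carried into the next generation, the best fitness never decreases, although nothing guarantees that the global optimum is reached.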

Even though complete methods guarantee an optimal candidate subset, they are expensive to implement because of their high computational complexity. For this reason, and especially given the size of today's datasets in bioinformatics, heuristic and random methods are increasingly considered and widely implemented despite offering no guarantee of optimality.

In document Data mining of many-attribute data : investigating the interaction between feature selection strategy and statistical features of datasets (Page 39-43)