This section reviews typical traditional feature selection methods, some of which are used in this thesis for comparison with the newly developed algorithms.
2.6.1 Wrapper Feature Selection Approaches
Wrapper feature selection algorithms are usually computationally more expensive than filters because each evaluation involves training and testing the classification algorithm [53]. Moreover, since the search space of a feature selection problem with n features has 2^n possible points, it is usually impossible to search the whole search space exhaustively. Therefore, most existing wrappers employ greedy or stochastic search strategies [4].
Sequential forward selection (SFS) [109] and sequential backward selection (SBS) [110] are two commonly used wrapper feature selection algorithms. Both use a greedy hill-climbing search strategy to search for the optimal feature subset. SFS starts with an empty set of features and iteratively adds one feature at a time until no improvement in classification accuracy can be achieved. By contrast, SBS starts with the full feature set and sequentially removes features until the further removal of any feature does not increase the classification accuracy. However, both SFS and SBS suffer from the so-called nesting effect, which means that once a feature is selected (discarded) it cannot be discarded (selected) later. Therefore, both SFS and SBS are easily trapped in local optima [5]. In addition, both SFS and SBS require long computational time when the number of features is large [5].
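The greedy forward search described above can be sketched as follows. This is a minimal illustration rather than the implementation of [109]: the names `sequential_forward_selection` and `loo_1nn_accuracy`, the choice of a leave-one-out 1-NN classifier as the wrapped learner, and the toy data are all assumptions made for the example.

```python
from typing import Callable, FrozenSet, List

Score = Callable[[FrozenSet[int]], float]


def sequential_forward_selection(n_features: int, score: Score) -> List[int]:
    """Greedy SFS: add one feature at a time until the score stops improving."""
    selected: List[int] = []
    best = score(frozenset())
    while len(selected) < n_features:
        candidates = [f for f in range(n_features) if f not in selected]
        # Evaluate every one-feature extension; ties go to the lowest index.
        top = max(candidates, key=lambda f: score(frozenset(selected + [f])))
        top_score = score(frozenset(selected + [top]))
        if top_score <= best:
            break  # no single addition improves the score
        selected.append(top)  # nesting effect: a selected feature is never removed
        best = top_score
    return selected


def loo_1nn_accuracy(X, y, subset) -> float:
    """Hypothetical wrapper score: leave-one-out 1-NN accuracy on `subset`."""
    if not subset:
        return 0.0  # assumption: the empty subset scores zero
    correct = 0
    for i in range(len(X)):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in subset),
        )
        correct += y[nearest] == y[i]
    return correct / len(X)


# Toy data (made up): feature 0 separates the classes, features 1 and 2 are noise.
X = [[0.0, 5, 1], [0.1, 1, 4], [0.2, 4, 2], [1.0, 2, 5], [1.1, 5, 3], [1.2, 1, 1]]
y = [0, 0, 0, 1, 1, 1]
chosen = sequential_forward_selection(3, lambda s: loo_1nn_accuracy(X, y, s))
```

Every call to `score` trains and tests the wrapped classifier, which is why the cost grows quickly with the number of features. SBS mirrors this loop by starting from the full set and removing one feature per step.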
In order to avoid the nesting effect, Stearns [111] proposed a “plus-l-take-away-r” method in which SFS was applied for l forward steps and SBS was then applied for r backtracking steps. However, determining the best values of (l, r) is a challenging task. In order to solve this problem, Pudil et al. [112] proposed two floating selection methods, sequential backward floating selection (SBFS) and sequential forward floating selection (SFFS), which determine the values of (l, r) automatically. In SBFS and SFFS, the values of (l, r), which denote the numbers of forward and backtracking steps, are controlled dynamically rather than fixed as in the “plus-l-take-away-r” method. Although the floating methods are claimed to be at least as good as the best sequential method, they are still likely to become trapped in a local optimum even when the criterion function is monotonic and the scale of the problem is small [113].
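The floating idea can be illustrated with a simplified sketch of SFFS. It assumes a target subset size `k` with 1 <= k <= n_features and a caller-supplied `score` function; these names, the tie-breaking rule, and the made-up score table are illustrative assumptions and are not taken from [112].

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Score = Callable[[FrozenSet[int]], float]


def sffs(n_features: int, k: int, score: Score) -> List[int]:
    """Simplified sequential forward floating selection up to size k."""
    current: FrozenSet[int] = frozenset()
    best: Dict[int, Tuple[float, FrozenSet[int]]] = {}  # best subset per size

    def record(subset: FrozenSet[int]) -> bool:
        """Store `subset` if it beats the best known subset of its size."""
        value = score(subset)
        if len(subset) not in best or value > best[len(subset)][0]:
            best[len(subset)] = (value, subset)
            return True
        return False

    while len(current) < k:
        # Forward step: add the single most useful remaining feature.
        remaining = sorted(set(range(n_features)) - current)
        current = max((current | {f} for f in remaining), key=score)
        record(current)
        # Conditional backward steps: keep removing features while the reduced
        # subset is strictly better than the best known subset of that size.
        while len(current) > 2:
            reduced = max((current - {f} for f in sorted(current)), key=score)
            if not record(reduced):
                break
            current = reduced
    return sorted(best[k][1])


# Made-up scores in which feature 0 is the best single feature but is not part
# of the best subset of size three.
table = {
    (0,): 0.6, (1,): 0.5, (2,): 0.4, (3,): 0.3,
    (0, 1): 0.7, (0, 2): 0.65, (0, 3): 0.6, (1, 2): 0.8, (1, 3): 0.5, (2, 3): 0.5,
    (0, 1, 2): 0.85, (0, 1, 3): 0.75, (0, 2, 3): 0.7, (1, 2, 3): 0.95,
}
floating_choice = sffs(4, 3, lambda s: table[tuple(sorted(s))])
```

On this table, plain forward selection stops at features 0, 1 and 2, whereas the backward step drops feature 0 after it has been selected, so the number of backtracking steps is decided by the data rather than fixed in advance.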
Based on the best-first algorithm and SFFS, Gutlein et al. [114] proposed linear forward selection (LFS), in which the number of features considered in each step is restricted. Because only a small number of features is evaluated in each step, LFS improves the computational efficiency of sequential forward methods while maintaining comparable accuracy of the selected feature subset. However, LFS starts by ranking all the individual features without considering the presence or absence of other features, which in turn limits its performance in problems where there are interactions between features.
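A rough sketch of this restriction is given below. It assumes a fixed candidate pool of the `k` individually best features followed by plain forward selection inside that pool; the function name, the parameter `k`, and the score table are hypothetical, and the details of [114] may differ.

```python
from typing import Callable, FrozenSet, List

Score = Callable[[FrozenSet[int]], float]


def linear_forward_selection(n_features: int, k: int, score: Score) -> List[int]:
    """Forward selection restricted to the k individually best features."""
    # Rank features one by one, ignoring all other features.
    ranked = sorted(range(n_features),
                    key=lambda f: score(frozenset([f])), reverse=True)
    pool = ranked[:k]
    selected: List[int] = []
    best = score(frozenset())
    while True:
        candidates = [f for f in pool if f not in selected]
        if not candidates:
            break
        top = max(candidates, key=lambda f: score(frozenset(selected + [f])))
        top_score = score(frozenset(selected + [top]))
        if top_score <= best:
            break
        selected.append(top)
        best = top_score
    return selected


# Made-up scores: feature 2 is weak on its own but complements feature 0.
table = {
    (): 0.0, (0,): 0.6, (1,): 0.55, (2,): 0.3,
    (0, 1): 0.62, (0, 2): 0.9, (1, 2): 0.6, (0, 1, 2): 0.9,
}
lookup = lambda s: table[tuple(sorted(s))]
restricted = linear_forward_selection(3, 2, lookup)
unrestricted = linear_forward_selection(3, 3, lookup)
```

With a pool of two features the search never evaluates feature 2, although it forms the best pair with feature 0. This is the interaction problem noted above.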
Recently, evolutionary computation techniques, such as PSO [115], GAs [8], GP [116], and ACO [117], have been applied to wrapper feature selection. Typical methods will be reviewed in Section 2.7.
2.6.2 Filter Feature Selection Approaches
A filter feature selection algorithm searches for the optimal feature subset in the search space based on a certain evaluation criterion, which is independent of any learning/classification algorithm.
Different criteria, including distance measures [118], dependency measures [119], consistency measures [120], and information measures [121], have been applied to develop filter feature selection algorithms. Besides the evaluation criterion, how to search for the best feature subset is another important factor in feature selection methods. Among the existing feature selection algorithms, two classical filter-based methods are FOCUS [122, 123] and Relief [124]. The FOCUS algorithm was originally defined for noise-free Boolean domains [123]. It starts with an empty feature subset, exhaustively examines all subsets of features, and selects the minimal subset of features that is sufficient to determine the class labels for all instances in the training set. However, this exhaustive search is computationally expensive.
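The FOCUS search can be sketched as an enumeration of subsets in order of increasing size that stops at the first subset consistent with the class labels. The function name `focus` and the Boolean toy data are assumptions for illustration only.

```python
from itertools import combinations, product
from typing import List, Optional, Sequence


def focus(X: Sequence[Sequence[int]], y: Sequence[int]) -> Optional[List[int]]:
    """Return a smallest feature subset that determines every class label."""
    n_features = len(X[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            seen = {}
            consistent = True
            for row, label in zip(X, y):
                key = tuple(row[f] for f in subset)
                # Two instances that agree on the subset must share a label.
                if seen.setdefault(key, label) != label:
                    consistent = False
                    break
            if consistent:
                return list(subset)
    return None  # only reached if identical instances carry different labels


# Noise-free Boolean toy problem: the class is x0 XOR x2, x1 is irrelevant.
X = [list(bits) for bits in product([0, 1], repeat=3)]
y = [row[0] ^ row[2] for row in X]
minimal_subset = focus(X, y)
```

In the worst case the two outer loops visit all 2^n subsets, which is the computational cost mentioned above.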
The Relief algorithm is another popular filter feature selection method that assigns a relevance weight to each feature [124]. The weight is intended to denote the relevance of the feature to the target concept. Relief samples instances randomly from the training set and updates the relevance values based on the difference between the selected instance and the two nearest instances of the same and opposite class (the “near-hit” and “near-miss”). However, the Relief algorithm does not deal with redundant features, because it attempts to find all relevant features regardless of the redundancy between them [125], which is referred to as feature interaction, a challenge in feature selection tasks.
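The weight update can be sketched as follows for a two-class problem. The sketch assumes numeric features scaled to [0, 1], squared differences in the update, and a fixed random seed; the function name, its parameters, and the toy data are illustrative and not taken from [124].

```python
import random
from typing import List, Sequence


def relief(X: Sequence[Sequence[float]], y: Sequence[int],
           n_samples: int, seed: int = 0) -> List[float]:
    """Relief-style relevance weights for a two-class data set."""
    rng = random.Random(seed)
    n_features = len(X[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        near_hit = min(hits, key=lambda j: distance(X[i], X[j]))
        near_miss = min(misses, key=lambda j: distance(X[i], X[j]))
        for f in range(n_features):
            # Reward features that differ on the near-miss,
            # penalise features that differ on the near-hit.
            weights[f] += ((X[i][f] - X[near_miss][f]) ** 2
                           - (X[i][f] - X[near_hit][f]) ** 2) / n_samples
    return weights


# Toy data (made up): feature 0 equals the class, feature 1 is noise,
# and feature 2 is an exact copy of feature 0.
X = [[0, 0.1, 0], [0, 0.9, 0], [0, 0.5, 0], [1, 0.2, 1], [1, 0.8, 1], [1, 0.4, 1]]
y = [0, 0, 0, 1, 1, 1]
weights = relief(X, y, n_samples=20)
```

Features 0 and 2 receive identical high weights even though one of them is redundant, which illustrates why Relief alone does not remove redundant features.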
Decision trees use only the relevant features that are needed to completely classify the training set and remove all other features. Cardie [126] proposed a filter-based feature selection algorithm that used a decision tree algorithm to select a subset of features for a nearest neighbour algorithm. Experiments showed that the feature subset generated by a decision tree helped the nearest neighbour algorithm to reduce its classification error rate.
Yu and Liu [119] claimed that feature relevance alone was insufficient for efficient feature selection of high-dimensional data. They proposed a feature selection algorithm that took both relevance and redundancy into account. The algorithm, however, is limited to problems that only have discrete features.
Mutual Information for Filter Feature Selection
Since mutual information is capable of evaluating the relationship between variables, it has been applied to feature selection to measure the relationship between the selected features and the class labels.
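For discrete variables, mutual information can be estimated from observed frequencies. The sketch below assumes a plug-in estimate measured in bits; the function name `mutual_information` is chosen here for illustration.

```python
from collections import Counter
from math import log2
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


labels = [0, 0, 1, 1]
informative = mutual_information([0, 0, 1, 1], labels)    # feature equals the label
uninformative = mutual_information([0, 1, 0, 1], labels)  # feature independent of the label
```

A feature that reproduces a balanced binary label carries one bit of information about it, while an independent feature carries none.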
Hall [127] proposed a correlation-based filter feature selection method (Cfs), which uses mutual information to evaluate the correlation between the features and the class labels and thereby the goodness of the selected features. Kwak and Choi [128] developed a greedy search based feature selection method, where mutual information was used to evaluate the goodness of the selected features. The algorithm stopped when a desired number of features was reached. There are also some other filter methods using mutual information, but most of them suffer from two problems [129]. The first is that they need a predefined weighting parameter to balance the relative importance of the relevance (reflecting the classification performance) and the redundancy (reflecting the number of features) of the selected feature subset, and this parameter is usually difficult to determine. The second is that redundancy is measured by the mutual information between two features, without considering the class labels. Because of feature interaction, two correlated features may become complementary to each other when the class labels are considered [129].
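One common form of such a weighted criterion is shown here as an illustration; the source does not state the formula, so the symbols are assumptions: f is a candidate feature, S is the set of already selected features, C denotes the class labels, and β is the predefined weighting parameter.

```latex
J(f) = I(f; C) - \beta \sum_{s \in S} I(f; s)
```

Both problems are visible in this form: β has to be fixed in advance, and the redundancy term I(f; s) is computed between pairs of features without reference to C.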
To address these problems, Peng et al. [130] combined the use of mutual information (filter) with a wrapper method that considers the class labels. A feature selection evaluation criterion named minimal-redundancy-maximal-relevance (mRMR) was first developed, where mutual information was used to measure both the relevance and the redundancy of the selected features. Based on mRMR, a two-stage algorithm was developed by combining mRMR with other more sophisticated feature selectors (e.g., wrappers), which successfully selected a small number of features and maintained or increased the classification performance. To avoid the determination of the weighting parameter, Foithong et al. [129] also combined a mutual information based criterion with a wrapper method, where a multilayer perceptron (MLP) [131] with a single hidden layer was trained by the back-propagation algorithm to evaluate the goodness of the feature subsets. Later, Liu et al. [132] developed a feature selection method based on dynamic mutual information, where the mutual information of each candidate feature was re-calculated on unlabelled instances, rather than the whole sampling space. These existing works have shown that the concept of mutual information can be used for feature selection, but it has never been applied together with EC-based algorithms for feature selection.
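The first (filter) stage of mRMR can be sketched as a greedy search that maximises relevance minus average redundancy. The sketch assumes discrete features stored column by column and the difference form of the criterion; the function names, the parameter `k`, and the toy data are illustrative and not the implementation of [130].

```python
from collections import Counter
from math import log2
from typing import List, Sequence


def mutual_information(xs: Sequence[int], ys: Sequence[int]) -> float:
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (cx[x] * cy[y])) for (x, y), c in cxy.items())


def mrmr(columns: Sequence[Sequence[int]], labels: Sequence[int], k: int) -> List[int]:
    """Greedily select k feature columns by relevance minus mean redundancy."""
    relevance = [mutual_information(col, labels) for col in columns]
    # Start with the most relevant feature; ties go to the lowest index.
    selected = [max(range(len(columns)), key=lambda f: relevance[f])]
    while len(selected) < k:
        def criterion(f: int) -> float:
            redundancy = sum(
                mutual_information(columns[f], columns[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        remaining = [f for f in range(len(columns)) if f not in selected]
        selected.append(max(remaining, key=criterion))
    return selected


# Toy data (made up): the label encodes two independent bits a and b;
# column 1 is an exact copy of column 0.
a = [0, 0, 1, 1, 0, 0, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
labels = [2 * p + q for p, q in zip(a, b)]
picked = mrmr([a, list(a), b], labels, k=2)
```

Although the copied column is as relevant as the original, its redundancy with the already selected column cancels that relevance, so the second pick is the column carrying new information.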
Different evolutionary computation techniques have been applied to develop filter feature selection algorithms and typical algorithms will be reviewed in Section 2.7.