
2.5 Conventional, “Flat” Feature Selection

2.5.2 The Filter Approach

Unlike the wrapper approach, the filter approach evaluates the quality of a feature or feature subset using a quality measure that is independent of the classification algorithm that will be applied to the selected features. As shown in the flow-chart in Figure 2.10, a subset of features is chosen from the original full set of features according to a certain selection criterion (or feature relevance measure). The selected feature subset is then input into the classification algorithm; the classifier is built, and its predictive accuracy is measured on the testing set and reported to the user. Note that the classifier is built and evaluated only once at the end of the process, rather than being iteratively built and evaluated in a loop as in the wrapper approach (Figure 2.9). This means that the filter approach is, in general, much faster than the wrapper approach. In this thesis, we propose three filter feature selection methods, which will be described in detail in Chapter 4.

Filter feature selection methods can be mainly categorised into two groups. The first group focuses on measuring the quality (relevance) of each individual feature without taking into account the interaction with other features. Basically,

Chapter 2. Background on Data Mining 27

Figure 2.10 Flow-Chart of the Filter Feature Selection Approach - Adapted from [84]. Boxes, in order: Full Set of Features; Feature Relevance Measure; Select Subset of Features; Building Classifier & Testing; Report Accuracy.

the relevance of each feature is evaluated by a certain criterion, such as the mutual information with the class variable or the information gain [131]. All features are then ranked in descending order of this relevance measure, and only the top-k features are selected for the classification stage, where k is a user-defined parameter. This type of method is simple, but it ignores the interaction between features and can therefore select redundant features.
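The ranking procedure described above can be sketched as follows. This is a minimal illustration, not the implementation used in this thesis: it assumes discrete feature values, uses empirical mutual information as the relevance measure, and the names `mutual_information`, `select_top_k` and the toy data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2(p(x,y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[x] * count_y[y]))
    return mi


def select_top_k(features, labels, k):
    """Score each feature on its own, rank in descending order, keep the top k."""
    scores = {name: mutual_information(values, labels)
              for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data (hypothetical): f1 copies the class, f2 duplicates f1,
# and f3 carries no information about the class.
labels = [0, 0, 1, 1, 0, 1, 0, 1]
features = {
    "f1": [0, 0, 1, 1, 0, 1, 0, 1],
    "f2": [0, 0, 1, 1, 0, 1, 0, 1],
    "f3": [0, 1, 0, 1, 1, 0, 0, 1],
}
selected, scores = select_top_k(features, labels, k=2)
print(selected, scores)
```

With k = 2, both f1 and f2 are selected even though f2 adds nothing beyond f1, which illustrates the redundancy problem of univariate ranking.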

The second group of filter methods aims at selecting a subset of features to be used for classification by considering the interaction between features within each evaluated candidate subset of features. For example, one of the best-known multivariate filter feature selection methods is Correlation-based Feature Selection (CFS) [48,49,137], which is based on the following hypothesis:

“A good feature subset is one that contains features highly correlated with (predictive of) the class, yet uncorrelated with (not predictive of) each other” – Hall, 1999.

The CFS method evaluates the relevance (Merit) of a candidate subset of features according to Equation 2.9, which follows from the above hypothesis and uses Pearson’s linear correlation coefficient (r), computed on standardised numerical feature values.

\[ Merit_S = \frac{k \, \overline{r_{cf}}}{\sqrt{k + k(k-1) \, \overline{r_{ff}}}} \qquad (2.9) \]

In Equation 2.9, $k$ denotes the number of features in the current feature subset $S$; $\overline{r_{cf}}$ denotes the average correlation between the class and the features in that subset; $\overline{r_{ff}}$ denotes the average correlation between all pairs of features in that subset. The numerator measures the predictive power of all features within that subset, which is to be maximised, while the denominator measures the degree of redundancy among those features, which is to be minimised.
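As a small worked illustration of the Merit measure, the sketch below computes it for a few candidate subsets. It is only a sketch under stated assumptions: the features and the class are numerical, absolute values of the Pearson correlations are averaged (an assumption made here so that negative correlations also count), and the names `pearson`, `cfs_merit` and the toy data are hypothetical rather than taken from the Weka implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson's linear correlation coefficient between two numeric sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cfs_merit(subset, features, target):
    """Merit of a subset: k * r_cf / sqrt(k + k * (k - 1) * r_ff)."""
    k = len(subset)
    # Average (absolute) feature-class correlation.
    r_cf = sum(abs(pearson(features[f], target)) for f in subset) / k
    # Average (absolute) feature-feature correlation over all pairs.
    pairs = list(combinations(subset, 2))
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(features[a], features[b]))
                   for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Toy data (hypothetical): a and b are uncorrelated with each other and
# jointly determine the target; dup is an exact copy of a.
features = {
    "a": [1, -1, 1, -1],
    "b": [1, 1, -1, -1],
    "dup": [1, -1, 1, -1],
}
target = [2, 0, 0, -2]

merit_a = cfs_merit(["a"], features, target)
merit_ab = cfs_merit(["a", "b"], features, target)
merit_adup = cfs_merit(["a", "dup"], features, target)
print(merit_a, merit_ab, merit_adup)
```

In this toy example, adding the complementary feature b raises the Merit, whereas adding the redundant copy dup leaves it no higher than that of a alone, since the larger denominator cancels the gain in the numerator.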

Another component of CFS is the search strategy used to explore the feature subset space. Many heuristic search methods have been applied, e.g. Hill-climbing search, Best First search and Beam search [103], and more recently genetic algorithms [64,65]. However, the CFS method based on genetic algorithms addresses multi-label classification, where an instance can be assigned two or more class labels simultaneously; this is a more complex type of classification task, which is outside the scope of this thesis.

The search strategy of the Weka version of CFS used in our experiments, reported in other chapters, is Backward-Greedy-Stepwise, which conducts a backward greedy search in the feature subset space. The search terminates when the deletion of any remaining feature leads to a decrease in the validation results.
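A backward greedy search of this kind can be sketched generically as below. This is not Weka's code: the function `backward_greedy_search`, the choice to accept a deletion that leaves the score unchanged, and the additive toy scoring function are all assumptions made for illustration. In CFS, the `evaluate` argument would be the Merit of the subset.

```python
def backward_greedy_search(all_features, evaluate):
    """Start from the full set and greedily delete one feature at a time.

    Stops when every possible single deletion decreases the score.
    """
    current = list(all_features)
    best_score = evaluate(current)
    while len(current) > 1:
        # Score every subset obtained by deleting exactly one feature.
        candidates = [[f for f in current if f != removed]
                      for removed in current]
        scored = [(evaluate(c), c) for c in candidates]
        top_score, top_subset = max(scored, key=lambda pair: pair[0])
        if top_score < best_score:
            break  # any deletion makes things worse: terminate
        current, best_score = top_subset, top_score
    return current, best_score


# Hypothetical additive scoring function: features a and b help,
# while c and d hurt. A real filter would use a subset quality measure.
weights = {"a": 1.0, "b": 0.8, "c": -0.2, "d": -0.5}


def toy_score(subset):
    return sum(weights[f] for f in subset)


subset, score = backward_greedy_search(["a", "b", "c", "d"], toy_score)
print(subset, score)
```

Here the search first deletes d, then c, and stops at the subset containing a and b because removing either of them lowers the score.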

Another example of a multivariate filter method is Markov Blanket-based feature selection [10,42,105,132,138,139]. Given a Directed Acyclic Graph (DAG) where each node represents a variable, the Markov Blanket Mf of an individual feature f is defined as the set of all parent and child features of f, together with the other features that are parents of f’s child features. The features within Mf are the most relevant features with respect to f, since f is statistically independent of all features outside the Markov Blanket given Mf. An example is shown in Figure 2.11, where only the nodes in black denote the features within the Markov Blanket of the Class attribute.
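The definition above (parents, children, and the other parents of the children) can be read directly off a known DAG, as in the following sketch. The function name `markov_blanket` and the small example graph are hypothetical; the graph is not the one shown in Figure 2.11.

```python
def markov_blanket(parents, node):
    """Markov Blanket of `node` in a DAG.

    `parents` maps every node to the set of its parent nodes.
    """
    # Children: nodes that list `node` among their parents.
    children = {n for n, ps in parents.items() if node in ps}
    # Other parents of those children (often called spouses).
    spouses = {p for c in children for p in parents[c]} - {node}
    return set(parents.get(node, set())) | children | spouses


# Hypothetical DAG: E -> A -> Class -> C -> D, and B -> C.
parents = {
    "Class": {"A"},
    "C": {"Class", "B"},
    "D": {"C"},
    "A": {"E"},
    "B": set(),
    "E": set(),
}
blanket = markov_blanket(parents, "Class")
print(sorted(blanket))
```

In this example the blanket of Class contains its parent A, its child C, and B (the other parent of C), while D and E fall outside it.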

A well-known Markov Blanket discovery algorithm is Incremental Association Markov Blanket (IAMB) [114]. IAMB consists of two stages, namely the Grow stage and the Shrink stage. In the Grow stage, features that are not yet in the Candidate Markov Blanket (CMB) are considered for addition to it; some of the added features may later be removed at the Shrink stage. The construction of CMB starts from an empty set. Each feature is then heuristically evaluated to check whether its inclusion in the existing CMB maximises a heuristic function f(X;T|CMB), e.g. the mutual information, which measures the degree of relevance between feature X and the target attribute T given the set of features in the CMB. Before formally adding each candidate feature X to CMB, IAMB checks that feature X and target feature T are not independent given CMB, written as ¬I(X;T|CMB). At the second stage (Shrink stage), IAMB removes in turn the features from CMB that are independent of T given CMB excluding those features, i.e. those features X satisfying I(X;T|CMB \ {X}).
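The Grow and Shrink stages can be sketched as follows. This is a simplified illustration under several assumptions, not the published algorithm's exact form: the data are discrete, the heuristic function is the empirical conditional mutual information, and independence is approximated by comparing that value with a hypothetical `threshold` parameter rather than by a statistical test. The toy data and the helper `summary` are also invented for the example.

```python
from collections import Counter
from math import log2


def cond_mutual_info(data, x, t, cond):
    """Empirical I(x; t | cond) in bits; `data` is a list of dict rows."""
    n = len(data)

    def z_of(row):
        return tuple(row[c] for c in cond)

    n_xtz = Counter((row[x], row[t], z_of(row)) for row in data)
    n_xz = Counter((row[x], z_of(row)) for row in data)
    n_tz = Counter((row[t], z_of(row)) for row in data)
    n_z = Counter(z_of(row) for row in data)
    return sum((c / n) * log2(c * n_z[z] / (n_xz[(xv, z)] * n_tz[(tv, z)]))
               for (xv, tv, z), c in n_xtz.items())


def iamb(data, target, features, threshold=0.05):
    cmb = []
    # Grow stage: repeatedly add the feature most associated with the
    # target given the current CMB, as long as it is not independent.
    while True:
        candidates = [f for f in features if f not in cmb]
        if not candidates:
            break
        best = max(candidates,
                   key=lambda f: cond_mutual_info(data, f, target, cmb))
        if cond_mutual_info(data, best, target, cmb) <= threshold:
            break
        cmb.append(best)
    # Shrink stage: drop features that are independent of the target
    # given the rest of the CMB.
    for f in list(cmb):
        rest = [g for g in cmb if g != f]
        if cond_mutual_info(data, f, target, rest) <= threshold:
            cmb.remove(f)
    return cmb


def summary(x1, x2):
    """Hypothetical coarse summary of (x1, x2), redundant given both."""
    if x2 == 0:
        return 0
    return 1 if x1 == 0 else 1 + x2


# Toy data: T is determined by X1 and X2; S summarises them; N is noise.
data = [{"X1": x1, "X2": x2, "S": summary(x1, x2), "N": n, "T": 3 * x1 + x2}
        for x1 in (0, 1) for x2 in (0, 1, 2) for n in (0, 1)]
result = iamb(data, "T", ["S", "X1", "X2", "N"])
print(result)
```

In this toy run, S is added first because it is individually the most informative feature, X1 and X2 are added afterwards, and the Shrink stage then removes S because it is independent of T once X1 and X2 are both in the CMB. The noise feature N is never added.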

Figure 2.11 Example of the Markov Blanket for the Class Attribute (nodes: Class and X1 to X12)