Filter Methods - Feature Selection and Reduction

2.4 Feature Selection and Reduction

2.4.1 Filter Methods

In filter methods, features are ranked according to their importance: a highly ranked feature is considered more important and a lowly ranked feature less important. A number of criteria have been proposed for filter methods to estimate the goodness of a feature, such as the Fisher score (Gu et al., 2012), the Pearson correlation coefficient (Guyon and Elisseeff, 2003), mutual information (Peng et al., 2005; Yu and Liu, 2003) and ReliefF (Robnik-Šikonja and Kononenko, 2003; Moore and White, 2007). These criteria are briefly described in the following sections.

Fisher score

The Fisher score aims to find a subset of features such that the distances between data points in different classes are large, while the distances between data points in the same class are small. Precisely, the Fisher score (Gu et al., 2012) for the ith feature $F_i$ is calculated as in Equation 2.4:

$$F_i = \frac{\sum_{j=1}^{n} n_j (\bar{x}_{ij} - \bar{x}_i)^2}{\sum_{j=1}^{n} n_j \sigma_{ij}^2} \qquad (2.4)$$

where the sums run over the n classes, $\bar{x}_{ij}$ and $\sigma_{ij}^2$ are the mean and variance of the ith feature in the jth class respectively, $n_j$ is the number of instances in the jth class, and $\bar{x}_i$ is the mean of the ith feature. The disadvantage of the Fisher score is that it is not good at handling irrelevant and redundant features.

The top-ranked features are then selected after computing the Fisher score. However, because the scores are computed independently for each feature, the selected subset may be suboptimal. Most importantly, because features are not evaluated in combination with other features, features that have a low Fisher score but combine effectively with others may be eliminated.
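As an illustration of Equation 2.4, the following minimal Python sketch computes the Fisher score of a single feature. The function name `fisher_score`, the use of the population variance within each class, the handling of a zero denominator and the toy data are all assumptions made for this example rather than details taken from Gu et al. (2012).

```python
from collections import defaultdict


def fisher_score(values, labels):
    """Fisher score of one feature, following Equation 2.4.

    values: feature values, one per instance
    labels: class labels, one per instance
    """
    overall_mean = sum(values) / len(values)

    # Group the feature values by class.
    by_class = defaultdict(list)
    for v, c in zip(values, labels):
        by_class[c].append(v)

    between = 0.0  # numerator: sum_j n_j * (mean_ij - mean_i)^2
    within = 0.0   # denominator: sum_j n_j * var_ij
    for group in by_class.values():
        n_j = len(group)
        mean_j = sum(group) / n_j
        # Assumption: population variance (divide by n_j) within each class.
        var_j = sum((v - mean_j) ** 2 for v in group) / n_j
        between += n_j * (mean_j - overall_mean) ** 2
        within += n_j * var_j

    if within > 0:
        return between / within
    # Assumption: with zero within-class variance, a separating feature
    # gets an infinite score and a constant feature gets zero.
    return float("inf") if between > 0 else 0.0


# Toy data (made up): feature_a separates the classes, feature_b does not.
labels = [0, 0, 0, 1, 1, 1]
feature_a = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
feature_b = [1.0, 5.0, 9.0, 2.0, 5.0, 8.0]
score_a = fisher_score(feature_a, labels)
score_b = fisher_score(feature_b, labels)
```

On this toy data the class means of `feature_a` lie far apart relative to the within-class spread, so it receives the higher score, while `feature_b` has identical class means and scores zero.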

Pearson correlation coefficient

The Pearson correlation coefficient is one of the simplest methods for understanding the relationship between the dependent and independent variables. The Pearson correlation coefficient (PCC) used in Guyon and Elisseeff (2003) ranks features by calculating linear correlations between individual features and class labels in classification. It is defined as in Equation 2.5:

$$R_i = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}} \qquad (2.5)$$

where cov designates the covariance and var the variance. The estimate of $R_i$ is given by Equation 2.6:

$$R_i = \frac{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)(y_k - \bar{y})}{\sqrt{\sum_{k=1}^{m} (x_{k,i} - \bar{x}_i)^2 \sum_{k=1}^{m} (y_k - \bar{y})^2}} \qquad (2.6)$$

where $x_{k,i}$ is the value of the ith feature for the kth sample and $\bar{x}_i$ is the mean of these feature values over the m samples; $y_k$ is the label of the kth sample and $\bar{y}$ is the mean of all $y_k$ in the sample.

The major advantage of this method is that it is fast and easy to calculate, which makes it suitable for ranking features in subset selection (Guyon and Elisseeff, 2003). It is most appropriate when there is a high correlation between a feature and the class of the data under consideration (Shardlow, 2016).
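The estimate in Equation 2.6 can be sketched in a few lines of Python. The function names `pearson_score` and `rank_by_pearson`, the choice to rank by the absolute value of the coefficient, the 0/1 label encoding and the toy data are assumptions for illustration only.

```python
import math


def pearson_score(x, y):
    """Sample estimate of R_i (Equation 2.6) for one feature x and labels y."""
    m = len(x)
    x_mean = sum(x) / m
    y_mean = sum(y) / m
    cov = sum((xk - x_mean) * (yk - y_mean) for xk, yk in zip(x, y))
    x_ss = sum((xk - x_mean) ** 2 for xk in x)
    y_ss = sum((yk - y_mean) ** 2 for yk in y)
    denom = math.sqrt(x_ss * y_ss)
    # Assumption: a constant feature (or constant label vector) scores 0.
    return cov / denom if denom > 0 else 0.0


def rank_by_pearson(features, y):
    """Return feature names sorted by |R_i|, strongest first."""
    scores = {name: pearson_score(col, y) for name, col in features.items()}
    return sorted(scores, key=lambda name: abs(scores[name]), reverse=True)


# Toy data (made up): two classes encoded as 0 and 1.
y = [0, 0, 1, 1]
features = {
    "f1": [1.0, 2.0, 3.0, 4.0],
    "f2": [2.0, 1.0, 2.0, 1.0],
    "f3": [4.0, 4.0, 1.0, 1.0],
}
ranking = rank_by_pearson(features, y)
```

Ranking by the absolute value treats a strong negative correlation as equally informative as a strong positive one. Because the coefficient only captures linear dependence, a feature related to the class in a non-linear way can still receive a score near zero.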

Mutual Information

Mutual information (Peng et al., 2005; Yu and Liu, 2003) is one of the popular feature selection methods, as it is computationally efficient and simple to interpret. For a feature to be important, there should be shared information between the feature and the class. Mutual information is used to calculate the information gain between the ith feature $f_i$ and the class labels C, given by Equation 2.7:

$$IG(f_i, C) = H(f_i) - H(f_i \mid C) \qquad (2.7)$$

where $H(f_i)$, as in Equation 2.8, is the entropy of $f_i$ and $H(f_i \mid C)$, as in Equation 2.9, is the conditional entropy of $f_i$ given C:

$$H(f_i) = -\sum_{j} p(x_j)\log_2 p(x_j) \qquad (2.8)$$

$$H(f_i \mid C) = -\sum_{k} p(c_k) \sum_{j} p(x_j \mid c_k)\log_2 p(x_j \mid c_k) \qquad (2.9)$$

The advantage of this method is that it is independent of the classification scheme used, so any classifier can subsequently be used to provide error rates. It is also able to treat multi-class cases directly rather than breaking them into several two-class problems (Guyon and Elisseeff, 2003). However, the disadvantage is that filters based on mutual information perform generic feature selection, which is not fine-tuned by the learning algorithm.
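Equations 2.7 to 2.9 can be illustrated with a short Python sketch for a discrete feature. The function names, the estimation of probabilities by relative frequencies and the toy data are assumptions for this example; a continuous feature would first have to be discretised, which is not shown here.

```python
import math
from collections import Counter


def entropy(values):
    """H(f) = -sum_j p(x_j) log2 p(x_j) for a discrete variable (Equation 2.8)."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def conditional_entropy(values, labels):
    """H(f|C): entropy of f within each class, weighted by p(c_k) (Equation 2.9)."""
    n = len(values)
    total = 0.0
    for c_k, count in Counter(labels).items():
        subset = [v for v, c in zip(values, labels) if c == c_k]
        total += (count / n) * entropy(subset)
    return total


def information_gain(values, labels):
    """IG(f, C) = H(f) - H(f|C) (Equation 2.7)."""
    return entropy(values) - conditional_entropy(values, labels)


# Toy data (made up): the first feature mirrors the class, the second is unrelated.
labels = ["a", "a", "b", "b"]
ig_informative = information_gain([0, 0, 1, 1], labels)
ig_noise = information_gain([0, 1, 0, 1], labels)
```

The feature that mirrors the class has one bit of entropy that is fully explained by the class, so its information gain is one bit; the unrelated feature keeps all of its entropy within each class and its gain is zero.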

ReliefF

Relief selects features that separate instances from different classes (Kira and Rendell, 1992; Robnik-Šikonja and Kononenko, 2003; Moore and White, 2007). The Relief score of the ith feature $S_i$ is defined by Equation 2.10:

$$S_i = \frac{1}{2}\sum_{k=1}^{l} \left[ d(X_{ik} - X_{iM_k}) - d(X_{ik} - X_{iH_k}) \right] \qquad (2.10)$$

where $M_k$ denotes the values on the ith feature of the nearest instances to $x_k$ with a different class label (the nearest misses), while $H_k$ denotes the values on the ith feature of the nearest instances to $x_k$ with the same class label (the nearest hits), so that a feature scores highly when it differs between classes and agrees within a class. d is a distance measure. Relief was originally designed for two-class problems (Kira and Rendell, 1992) and could not handle noise efficiently. However, a multi-class extension, ReliefF, was introduced by Kononenko (1994); it extends Equation 2.10 and also improves its noise-handling capabilities. Because ReliefF cannot eliminate redundancy, Liu et al. (2015) proposed RS-ReliefF, which has been shown to remove redundant data and also improve classification rates on many datasets. Wu and Wang (2015) combined sequential forward selection (SFS) with ReliefF (ReliefF-SFS), which was shown to remove redundant features more effectively than ReliefF alone and also to improve classification accuracy on a music genre dataset. Finally, another method that has shown improvement over ReliefF is Tuned ReliefF (TuRF), proposed by Moore and White (2007). The method systematically removes the worst-performing attributes and re-estimates the ReliefF weights on the remaining ones.
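The basic two-class Relief idea can be sketched in Python as follows. This is a simplified illustration, not the ReliefF, RS-ReliefF, ReliefF-SFS or TuRF procedures of the cited papers: it assumes that every instance is visited once, that a single nearest neighbour of each kind is found with the Manhattan distance over all features, that d is the absolute difference on one feature, and that each class contains at least two instances. The function name `relief_scores` and the toy data are hypothetical.

```python
def relief_scores(X, y):
    """Basic two-class Relief-style scores, one per feature.

    For each instance, add half the difference to the nearest instance of a
    different class and subtract half the difference to the nearest instance
    of the same class, feature by feature.
    """
    n_features = len(X[0])

    def distance(a, b):
        # Assumption: Manhattan distance over all features.
        return sum(abs(p - q) for p, q in zip(a, b))

    scores = [0.0] * n_features
    for k, x_k in enumerate(X):
        others = [j for j in range(len(X)) if j != k]
        same = [j for j in others if y[j] == y[k]]
        different = [j for j in others if y[j] != y[k]]
        nearest_same = X[min(same, key=lambda j: distance(x_k, X[j]))]
        nearest_diff = X[min(different, key=lambda j: distance(x_k, X[j]))]
        for i in range(n_features):
            scores[i] += 0.5 * (
                abs(x_k[i] - nearest_diff[i]) - abs(x_k[i] - nearest_same[i])
            )
    return scores


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.1], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]]
y = [0, 0, 1, 1]
scores = relief_scores(X, y)
```

With this sign convention a feature scores highly when its values differ across classes and agree within a class, so the separating feature in the toy data receives a positive score and the noisy feature a negative one.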

The merits of all the filter methods above are that they are classifier-independent and computationally inexpensive (Lee et al., 2012). They also scale to large datasets with many features, and are therefore faster than other methods, especially wrapper methods. The major disadvantage of filter methods is that they may select redundant variables because they do not consider the relationships between variables.

One major work that addresses this disadvantage of filter methods is the correlation-based feature selection method (Hall, 1999). Hall found its performance and accuracy to be similar to those of wrappers, and under some conditions even better. Unlike a wrapper, which depends on the learning algorithm, this approach requires no learning algorithm and no threshold settings, and it can measure the correlation between variables. The correlation-based approach is based on the assumption that good features are highly correlated with the class yet uncorrelated with each other (Hall, 1999; Chandrashekar and Sahin, 2014). It eliminates well over half the features, is computationally faster than wrappers and evaluates pairs of features and feature subsets (Hall, 1999). The steps involved in the correlation-based approach are:

• Nominal and numeric data are first discretized

• Calculate the feature-class correlation and feature-feature inter-correlations

• Search feature subsets using either best-first, forward selection or backward selection algorithms, based on their merit, which is calculated using Equation 2.11

$$M_s = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}} \qquad (2.11)$$

where $M_s$ is the merit of a feature subset containing k features, $\bar{r}_{cf}$ is the mean feature-class correlation and $\bar{r}_{ff}$ is the mean feature-feature inter-correlation.
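The merit in Equation 2.11 and a simple forward search over subsets can be sketched in Python as follows. The sketch assumes that the feature-class and feature-feature correlations have already been computed (discretisation and the correlation measure itself are not shown), and it uses greedy forward selection that stops when the merit no longer improves. The function names and the correlation values are made up for illustration and are not taken from Hall (1999).

```python
import math


def cfs_merit(k, mean_rcf, mean_rff):
    """Merit of a k-feature subset (Equation 2.11)."""
    return (k * mean_rcf) / math.sqrt(k + k * (k - 1) * mean_rff)


def forward_select(rcf, rff):
    """Greedy forward selection that maximises the merit.

    rcf: feature-class correlation of each feature
    rff: symmetric matrix of feature-feature correlations
    """
    selected, best = [], 0.0
    remaining = set(range(len(rcf)))
    while remaining:
        candidates = []
        for f in remaining:
            subset = selected + [f]
            k = len(subset)
            mean_rcf = sum(rcf[i] for i in subset) / k
            pairs = [(a, b) for n, a in enumerate(subset) for b in subset[n + 1:]]
            mean_rff = sum(rff[a][b] for a, b in pairs) / len(pairs) if pairs else 0.0
            candidates.append((cfs_merit(k, mean_rcf, mean_rff), f))
        merit, f = max(candidates)
        if merit <= best:
            break  # no candidate improves the merit
        selected.append(f)
        remaining.remove(f)
        best = merit
    return selected, best


# Made-up correlations: features 0 and 1 are both predictive but redundant.
rcf = [0.8, 0.75, 0.6]
rff = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
selected, merit = forward_select(rcf, rff)
```

In this toy case feature 1 has the second-highest correlation with the class, but it is nearly a copy of feature 0, so adding it lowers the merit. The search instead adds feature 2, which is less predictive on its own but contributes information that feature 0 does not.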


In document Automatic classification of flying bird species using computer vision techniques (Page 52-56)