2.2 Feature selection methods
2.2.1 Filter methods
Filter methods rely on a performance evaluation metric calculated directly from the data, without direct feedback from the predictors that will eventually be used on the data with a reduced number of features (Guyon, 2006). As mentioned above, these algorithms are usually computationally less expensive than wrappers or embedded methods. This subsection describes the most popular filters, which will be used throughout this thesis.
Chapter 2. Foundations of feature selection
2.2.1.1 Chi-squared
This is a univariate filter based on the χ² statistic (Liu & Setiono, 1995) which evaluates each feature independently with respect to the classes. The higher the value of chi-squared, the more relevant the feature is with respect to the class.
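As an illustration of the univariate scoring described above, the following minimal sketch computes the Pearson χ² statistic between one discrete feature and the class, and ranks two features by it. It assumes the features are already discrete; the function name `chi_squared` and the toy data are hypothetical and do not come from Liu & Setiono (1995).

```python
from collections import Counter

def chi_squared(feature, labels):
    """Pearson chi-squared statistic between a discrete feature and the class."""
    n = len(feature)
    observed = Counter(zip(feature, labels))  # joint counts per (value, class)
    value_counts = Counter(feature)           # marginal counts per feature value
    class_counts = Counter(labels)            # marginal counts per class
    stat = 0.0
    for v, n_v in value_counts.items():
        for c, n_c in class_counts.items():
            expected = n_v * n_c / n          # count expected under independence
            stat += (observed[(v, c)] - expected) ** 2 / expected
    return stat

# Toy data (made up for illustration): f1 follows the class, f2 does not.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = ["a", "a", "a", "a", "b", "b", "b", "b"]
f2 = ["x", "y", "x", "y", "x", "y", "x", "y"]

scores = {"f1": chi_squared(f1, labels), "f2": chi_squared(f2, labels)}
ranking = sorted(scores, key=scores.get, reverse=True)  # most relevant first
```

Each feature is scored on its own, so the cost grows linearly with the number of features.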
2.2.1.2 Information Gain
The Information Gain filter (Quinlan, 1986) is one of the most common univariate attribute evaluation methods. This filter evaluates the features according to their information gain and considers a single feature at a time. It provides a ranking of all the features, and a threshold is then required to select a certain number of them according to the order obtained.
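The ranking-plus-threshold procedure can be sketched as follows. This is a minimal example for discrete features, with information gain taken as H(class) minus H(class | feature); the helper names (`entropy`, `information_gain`, `rank_and_select`), the toy data, and the threshold value are assumptions made for illustration.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def information_gain(feature, labels):
    """H(class) - H(class | feature) for a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for v, n_v in Counter(feature).items():
        subset = [y for x, y in zip(feature, labels) if x == v]
        conditional += (n_v / n) * entropy(subset)
    return entropy(labels) - conditional

def rank_and_select(features, labels, threshold):
    """Rank features by information gain and keep those reaching the threshold."""
    gains = {name: information_gain(col, labels) for name, col in features.items()}
    ranking = sorted(gains, key=gains.get, reverse=True)
    return ranking, [name for name in ranking if gains[name] >= threshold]

# Toy data (made up for illustration).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": ["a", "a", "a", "a", "b", "b", "b", "b"],  # determines the class
    "f2": ["x", "y", "x", "y", "x", "y", "x", "y"],  # independent of the class
}
ranking, selected = rank_and_select(features, labels, threshold=0.5)
```

The threshold could equally be expressed as a number of top-ranked features to keep; the filter itself only supplies the ordering.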
2.2.1.3 Correlation-based Feature Selection, CFS
This is a simple multivariate filter algorithm that ranks feature subsets according to a correlation-based heuristic evaluation function (M. A. Hall, 1999). The evaluation function is biased toward subsets that contain features that are highly correlated with the class and uncorrelated with each other. Irrelevant features should be ignored because they will have low correlation with the class. Redundant features should be screened out as they will be highly correlated with one or more of the remaining features. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features.
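To make the bias of the evaluation function concrete, the sketch below evaluates the merit heuristic of Hall (1999), Merit_S = k·r_cf / sqrt(k + k(k−1)·r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The correlation values are invented for illustration and are assumed to be non-negative and precomputed; `cfs_merit` is a hypothetical name, and the subset search strategy that CFS pairs with this function is not shown.

```python
from itertools import combinations
from math import sqrt

def cfs_merit(subset, class_corr, feature_corr):
    """Merit of a feature subset: high class correlation, low inter-correlation."""
    k = len(subset)
    if k == 0:
        return 0.0
    r_cf = sum(class_corr[f] for f in subset) / k  # mean feature-class correlation
    pairs = list(combinations(subset, 2))
    r_ff = (sum(feature_corr[frozenset(p)] for p in pairs) / len(pairs)
            if pairs else 0.0)                     # mean feature-feature correlation
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

# Invented correlations: b is redundant with a, c is irrelevant, d is complementary.
class_corr = {"a": 0.8, "b": 0.8, "c": 0.1, "d": 0.8}
feature_corr = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.0,
    frozenset(("a", "d")): 0.1,
}

merit_redundant = cfs_merit(["a", "b"], class_corr, feature_corr)
merit_irrelevant = cfs_merit(["a", "c"], class_corr, feature_corr)
merit_complementary = cfs_merit(["a", "d"], class_corr, feature_corr)
```

With these numbers, adding the complementary feature d yields the highest merit, adding the redundant feature b yields less, and adding the irrelevant feature c yields the least.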
2.2.1.4 Consistency-based Filter
The filter based on consistency (Dash & Liu, 2003) evaluates the worth of a subset of features by the level of consistency in the class values when the training instances are projected onto the subset of attributes. From the space of features, the algorithm generates a random subset S in each iteration. If S contains fewer features than the current best subset, the inconsistency index of the data described by S is compared with the inconsistency index of the best subset. If S is at least as consistent as
the best subset, S becomes the best subset. The inconsistency criterion, which is the key to the success of this algorithm, specifies to what extent the dimensionality of the data can be reduced. If the inconsistency rate of the data described by the selected features is smaller than a set threshold, the reduction in dimensionality is acceptable. Notice that this method is multivariate.
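The inconsistency measure used to compare subsets can be sketched as below. The example assumes the usual definition of the inconsistency rate: instances that share the same values on the chosen features are grouped, each group contributes its size minus the count of its majority class, and the total is divided by the number of instances. The function name and the toy data are hypothetical, and the random subset generation is omitted.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Fraction of instances that disagree with the majority class of their pattern."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        pattern = tuple(row[i] for i in subset)  # projection onto the subset
        groups[pattern][y] += 1
    inconsistent = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return inconsistent / len(rows)

# Toy data (made up): feature 0 determines the class, feature 1 is noise.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]

rate_f0 = inconsistency_rate(rows, labels, [0])       # consistent projection
rate_f1 = inconsistency_rate(rows, labels, [1])       # inconsistent projection
rate_all = inconsistency_rate(rows, labels, [0, 1])   # full feature set
```

Here the subset containing only feature 0 is as consistent as the full feature set while being smaller, so it would replace the full set as the best subset.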
2.2.1.5 Fast Correlation-Based Filter, FCBF
The fast correlation-based filter method (Yu & Liu, 2003) is a multivariate algorithm that measures feature-class and feature-feature correlation. FCBF starts by selecting a set of features that is highly correlated with the class based on symmetrical uncertainty (SU), which is defined as twice the information gain between two variables divided by the sum of their entropies. Then, it applies three heuristics that remove the redundant features and keep the features that are more relevant to the class. FCBF was designed for high-dimensionality data and has been shown to be effective in removing both irrelevant and redundant features. However, it fails to take into consideration the interaction between features.
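The symmetrical uncertainty measure can be sketched as follows, using SU(X, Y) = 2·IG(X | Y) / (H(X) + H(Y)) with the information gain computed as H(X) + H(Y) − H(X, Y). The sketch assumes discrete variables; the helper names and the toy data are hypothetical, and the redundancy-removal heuristics of FCBF are not reproduced.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def symmetrical_uncertainty(x, y):
    """SU in [0, 1]: 0 for independent variables, 1 when each determines the other."""
    h_x, h_y = entropy(x), entropy(y)
    if h_x + h_y == 0:
        return 0.0                                   # both variables are constant
    info_gain = h_x + h_y - entropy(list(zip(x, y)))  # mutual information
    return 2.0 * info_gain / (h_x + h_y)

# Toy data (made up for illustration).
labels = [0, 0, 0, 0, 1, 1, 1, 1]
f1 = ["a", "a", "a", "a", "b", "b", "b", "b"]  # determines the class
f2 = ["x", "y", "x", "y", "x", "y", "x", "y"]  # independent of the class

su_f1 = symmetrical_uncertainty(f1, labels)
su_f2 = symmetrical_uncertainty(f2, labels)
```

The normalisation by the sum of the entropies compensates for the bias of information gain toward features with many values, and makes the measure symmetric in its two arguments.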
2.2.1.6 INTERACT
The INTERACT algorithm (Z. Zhao & Liu, 2007) uses the same goodness measure as the FCBF filter, i.e. SU, but it also includes the consistency contribution, which is an indicator of how significantly the elimination of a feature will affect consistency. The algorithm consists of two major parts. In the first part, the features are ranked in descending order based on their SU values. In the second part, features are evaluated one by one starting from the end of the ranked feature list. If the consistency contribution of a feature is less than an established threshold, the feature is removed; otherwise it is selected. The authors stated that this method can handle feature interaction and efficiently selects relevant features.
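The second part of the algorithm can be sketched as a backward pass over the ranked list. The sketch assumes that the consistency contribution of a feature is the increase in inconsistency rate caused by removing it from the current set, and that the SU ranking has been computed beforehand and is passed in as `ranked`. The function names, the threshold value and the XOR toy data are hypothetical.

```python
from collections import Counter, defaultdict

def inconsistency_rate(rows, labels, subset):
    """Fraction of instances that disagree with the majority class of their pattern."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][y] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values()) / len(rows)

def interact_select(rows, labels, ranked, threshold):
    """Backward elimination from the end of a ranked feature list."""
    selected = list(ranked)
    for f in reversed(ranked):
        without_f = [g for g in selected if g != f]
        # Assumed definition: loss of consistency when f is dropped.
        contribution = (inconsistency_rate(rows, labels, without_f)
                        - inconsistency_rate(rows, labels, selected))
        if contribution < threshold:
            selected = without_f  # f adds too little consistency: remove it
    return selected

# Toy data (made up): class = feature 0 XOR feature 1, feature 2 is noise.
rows = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
labels = [a ^ b for a, b, _ in rows]

kept = interact_select(rows, labels, ranked=[0, 1, 2], threshold=0.01)
```

In this XOR example neither feature 0 nor feature 1 is informative on its own, yet removing either makes the data inconsistent, so both are kept while the noise feature is discarded. This is the kind of feature interaction that a purely pairwise measure would miss.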
2.2.1.7 ReliefF
The filter ReliefF (Kononenko, 1994) is an extension of the original Relief algorithm (Kira & Rendell, 1992). The original Relief works by randomly sampling an instance
from the data and then locating its nearest neighbor from the same and opposite class. The attribute values of the nearest neighbors are compared with those of the sampled instance and used to update relevance scores for each attribute. The rationale is that a useful attribute should differentiate between instances from different classes and have the same value for instances from the same class. ReliefF adds the ability to deal with multiclass problems and is also more robust and capable of dealing with incomplete and noisy data. This method may be applied in all situations, has low bias, includes interaction among features and may capture local dependencies which other methods miss.
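The weight update of the original Relief can be sketched as follows. This is a minimal two-class version for numeric features assumed to be scaled to [0, 1], using Manhattan distance; it is not ReliefF itself, which extends the scheme to multiclass, noisy and incomplete data. The function name, the fixed random seed and the toy data are hypothetical.

```python
import random

def relief(rows, labels, n_samples, seed=0):
    """Original two-class Relief: reward attributes that separate the classes."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    weights = [0.0] * n_features

    def distance(a, b):
        return sum(abs(u - v) for u, v in zip(a, b))

    for _ in range(n_samples):
        i = rng.randrange(len(rows))  # randomly sampled instance
        x, y = rows[i], labels[i]
        hits = [r for j, r in enumerate(rows) if j != i and labels[j] == y]
        misses = [r for r, l in zip(rows, labels) if l != y]
        near_hit = min(hits, key=lambda r: distance(x, r))
        near_miss = min(misses, key=lambda r: distance(x, r))
        for f in range(n_features):
            # Differing from the nearest miss is good, differing from the nearest hit is bad.
            weights[f] += (abs(x[f] - near_miss[f]) - abs(x[f] - near_hit[f])) / n_samples
    return weights

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
rows = [(0.0, 0.1), (0.1, 0.9), (0.2, 0.5), (0.8, 0.2), (0.9, 0.8), (1.0, 0.4)]
labels = [0, 0, 0, 1, 1, 1]

weights = relief(rows, labels, n_samples=20)
```

Because neighbors are found using all attributes at once, the score of each attribute depends on the context provided by the others, which is why the method is regarded as multivariate.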
2.2.1.8 minimum Redundancy Maximum Relevance, mRMR
The mRMR method (H. Peng, Long, & Ding, 2005) selects features that have the highest relevance with the target class and are also minimally redundant, i.e., it selects features that are maximally dissimilar to each other. Both optimization criteria (Maximum-Relevance and Minimum-Redundancy) are based on mutual information.
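A greedy selection that combines the two criteria can be sketched as below. The sketch assumes discrete features and scores each candidate by its mutual information with the class minus its mean mutual information with the features already selected (a difference-based combination; other combinations, such as a quotient, are possible). The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(x, y):
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def mrmr(features, labels, k):
    """Greedy selection: maximise relevance minus mean redundancy."""
    relevance = {n: mutual_information(col, labels) for n, col in features.items()}
    selected = [max(relevance, key=relevance.get)]  # start with the most relevant
    while len(selected) < min(k, len(features)):
        def score(name):
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance[name] - redundancy
        candidates = [n for n in features if n not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data (made up): f1_copy duplicates f1, f3 is weaker but not redundant.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1":      [0, 0, 0, 1, 1, 1, 1, 1],
    "f1_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "f3":      [0, 1, 0, 0, 1, 1, 0, 1],
}
chosen = mrmr(features, labels, k=2)
```

Although the duplicated feature is as relevant as f1, its redundancy with f1 outweighs its relevance, so the less relevant but dissimilar f3 is picked second.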
2.2.1.9 Md
The Md filter (Seth & Principe, 2010) is an extension of mRMR which uses a measure of monotone dependence (instead of mutual information) to assess relevance and redundancy. One of its contributions is the inclusion of a free parameter (λ) that controls the relative emphasis placed on relevance and redundancy. In this thesis, two values of λ will be tested: 0 and 1. When λ is equal to zero, the effect of the redundancy disappears and the measure is based only on maximizing the relevance. On the other hand, when λ is equal to one, it is more important to minimize the redundancy among variables. These two values of λ were chosen in this thesis because we are interested in checking the performance of the method when the effect of the redundancy disappears. Also, Seth and Principe (2010) stated that λ = 1 performs better than other λ values.
Table 2.2 reports the main characteristics of the filters employed in this thesis. With regard to the computational cost, note that some of the proposed filter techniques are univariate. This means that each feature is considered separately, thereby ignoring feature dependencies, which may lead to worse classification performance when compared to other types of feature selection techniques. However, they
have the advantage, in theory, of being scalable. Multivariate filter techniques were introduced, aiming to incorporate feature dependencies to some degree, but at the cost of reducing their scalability.
Table 2.2: Summary of filters
Filter             Uni/Multivariate   Ranker/Subset
Chi-Squared        Univariate         Ranker
Information Gain   Univariate         Ranker
ReliefF            Multivariate       Ranker
mRMR               Multivariate       Ranker
Md                 Multivariate       Ranker
CFS                Multivariate       Subset
FCBF               Multivariate       Subset
INTERACT           Multivariate       Subset
Consistency        Multivariate       Subset