Methods of feature selection - Recent work on missing data

3.3 Recent work on missing data

3.4.2 Methods of feature selection

Feature selection is divided into filter and wrapper methods. Filter methods are considered a preprocessing step that ranks features and selects from them according to the rank. In wrapper methods, the performance of data mining algorithms is used to determine how good a feature subset is in an iterative manner. Wrapper methods are highly computational given a large number of features because they require repeat evaluations with prediction algorithms in order to reach the feature set. Too many iterations can cause this method to tend to over fit (Bolón-Canedo et al. (2014); Chandrashekar and Sahin (2014); Tuv et al. (2009). The GRiST dataset has too many features for a wrapper approach and so the work presented in this research is based on filter methods.

The main idea behind filter methods is to rank features according to a predefined criterion, which is simple yet often successful in various practical applications (Bolón-Canedo et al.,

3.4 Feature selection

2014; Chandrashekar and Sahin, 2014). The filter method is applied before classification to enhance the quality of the subset or data provided (Chandrashekar and Sahin, 2014; Tang et al., 2014). The idea behind improving the quality of the provided subset is to choose one that is most relevant to the output class. Relevance is measured as its dependence on the class labels. Hence features with low relevance to the class are discarded (Chandrashekar and Sahin, 2014; Tang et al., 2014).

Redundancy and irrelevant features are additional problems for filter methods. An irrelevant feature is one that is not related to the output variable and a redundant one does not introduce any additional information compared to those already selected. The presence of irrelevant or redundant features degrades the created classification models and may need to be removed to improve the speed of learning algorithm as well as produce a better classification model (Bolón-Canedo et al., 2014; Chandrashekar and Sahin, 2014; Tang et al., 2014).

The most common filter criteria used are correlation and information measures which are explained next.

Correlation criteria

A criterion that is commonly used to measure the relationship between two variables is correlation. Equation 3.1 shows the Pearson correlation coefficient r between feature xiand class Y (Chandrashekar and Sahin, 2014; Krzysztof et al., 2007)

r = cov(x_i,Y )

pvar(x_i)⇤ var(Y ), (3.1)

where xi is the ith feature; Y is the class, cov(xi,Y ) is the covariance between the ith

feature and the class; and var(xi)⇤ var(Y ) is the multiplication of variance of the ithfeature by variance of the class.

Correlation-based Feature Selection (CFS) is a filter-based algorithm that uses correlation to select a subset of features. The CFS applies a forward selection method and the subset evaluation is the correlation criteria.

First, two matrices are created, feature-feature correlation and feature-class correlation, from the training dataset. CFS uses these matrices to choose features that maximise correla-tion with the class vector and minimise correlacorrela-tion between each other (Hall, 2000; Senliol et al., 2008). This would reduce the redundancy between selected features and thus increase the quality of the chosen subset as a representation of the dataset (Hall, 2000; Senliol et al., 2008). The proposed methodology will be based on correlation and mutual information methods, it will be compared to CFS approach in chapter 5.

Pearson correlation is a measure of the strength of linearity between one variable and another. It is a simple and fast calculation but is not useful if linearity does not exist (Chandrashekar and Sahin, 2014; Krzysztof et al., 2007). Mutual information is a measure that is not restricted to linear relation and is described next.

Mutual Information

Chosen features should be highly informative of various classes of the dataset. Mutual information measures the amount of information between two random variables X and Y by calculating entropy. Shannon’s entropy H(Y ) is defined as a measure of the uncertainty of a certain random variable. It is a representation of how much information is absent when the outcome if not known. This measure shows how many bits of information can be used by all combinations of a random variable Y by using Equation 3.2 where p(y)is the probability of occurrence of y (Chandrashekar and Sahin, 2014; Liu et al., 2014).

H(Y ) =

Â

p(y)log₂p(y) (3.2)

3.4 Feature selection

After calculating the uncertainty about Y , the conditional entropy H(Y |X), which is the amount of information gained from Y given the value of X, is represented by Equation 3.3, where p(x,y)is the probability of co-occurrence of x and y; and p(y|x) is the probability of occurrence of y given the value of x.

H(Y |X) =

Â

p(x,y)log₂p(y|x) (3.3)

I(Y ;X) = H(Y ) H(Y |X) (3.4)

Thus the mutual information between variable X and the class Y is defined by Equation 3.4. If X is completely independent of Y then the mutual information between them will be zero because the uncertainty of Y is not changed by knowing X, which is completely uninformative. The advantage of mutual information is that it is independent of the linear relation between the two variables. It is commonly used to evaluate the redundancy of features (Doquire and Verleysen, 2012; Peng et al., 2005; Yu and Liu, 2004).

Minimum redundancy maximum relevance

The Minimum Redundancy Maximum Relevance (mRMR) method combines two constraints in order to select a candidate feature. Both constraints are based on mutual information. Max-imum relevance D(S,c) tries to find a feature that increases the average mutual information of the selected subset, S, with the output class, c, when added as shown in Equation 3.5. |S| is the number of features chosen and I(xi;c) represent the mutual information between feature xifrom the chosen subset and class c (Peng et al., 2005).

max D(S,c), D = 1

|S|

Â

xieS

I(x_i;c) (3.5)

The other constraint is minimum relevance, R(S), which attempts to add a feature that has the least average mutual information among the chosen set of features thus reducing redundancy as shown in Equation 3.6.

min R(S), R = 1

|S|²

Â

xi,xjeSI(xi,xj) (3.6) Maximizing the difference between the maximum relevance and the minimum redundancy constraints produced the mRMR criteria shown in Equation 3.7. This equation is intended to choose the mth feature from the set of available features denoted X Sm 1.

xjeX Smaxm 1

In this research, correlation, mutual information and mRMR will be applied and tested for the selection of features. More details of their experiments and method of use will be presented in chapters 4 and 5.

In document A novel dynamic feature selection and prediction algorithm for clinical decision involving high-dimensional and varied patient data (Page 56-60)