Consistency and selection of features

3.10 Simulation Study

3.10.1 Consistency and selection of features

Note that the use of RFE for feature selection has been in practice for well over a decade and it is a well-accepted technique in classification. The main aim of this section is to evaluate our consistency results, and a method for selection of the subset of features. We consider two different data-generating mechanisms, one in the classical classification setting and the other in regression. For each of these examples we again look at three different scenarios. For the first scenario, the total number of covariates is 15 of which only 4 are important. For the second scenario, there are 30 covariates with only 7 important ones. The third scenario has 50 covariates with 3 that are important. For the classification example we consider the hinge loss LHL as the surrogate loss and the SVM function is computed using the Gaussian RBF kernel kγ(x1, x2) =

exp{−1

γ2kx1−x2k

2}. The covariatesXwere generated uniformly on the segment [−1, 1]

and the output vectorY was generated as Y = sign(ω0X), where ω is the vector of coefficients with the first few elements non-zero, corresponding to the important features, chosen at random from a list of coefficients [−1,−0.5,0.5,1] and the rest are zero. We initialize the original SVM function using a 5-fold cross validation on the kernel width

SVM-RBF d= 15,d0= 4 d= 30,d0= 7 d= 50,d0= 3

(vs LASSO) n=100 n=200 n=400 n=100 n=200 n=400 n=100 n=200 n=400

Prop no errors (a) 0.97 (0.94) 1 (1) 1 (1) 0.62 (0.46) 1 (0.96) 1 (1) 0.95 (0.97) 1 (1) 1 (1) Prop 1 error (b) 0.03 (0.06) 0 (0) 0 (0) 0.34 (0.49) 0 (0.04) 0 (0) 0.05 (0.93) 0 (0) 0 (0) Prop>1 error (c) 0 (0) 0 (0) 0 (0) 0.04 (0.05) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)

SVR-Linear d= 15,d0= 4 d= 30,d0= 7 d= 50,d0= 3

(vs LASSO) n=100 n=200 n=400 n=100 n=200 n=400 n=100 n=200 n=400

Prop no errors (a) 1 (0.93) 1 (1) 1 (1) 1 (0.47) 1 (0.91) 1 (1) 1 (0.98) 1 (1) 1 (1) Prop 1 error (b) 0 (0.07) 0 (0) 0 (0) 0 (0.48) 0 (0.09) 0 (0) 0 (0.02) 0 (0) 0 (0) Prop>1 error (c) 0 (0) 0 (0) 0 (0) 0 (0.05) 0 (0) 0 (0) 0 (0) 0 (0) 0 (0)

Table 3.1: Accuracy of RFE (vs LASSO)

γ and the regularization parameterλ and they were chosen from the set of values

nλ, γ

= (0.01×10i, j) i={0,1,2,3,4}, j ={1,2,3,4} (3.12)

wheren is the sample size for the given setting.

In the second case we used an SVR function with a linear kernelk(x1, x2) =hx1, x2i

to treat the regression setting. The loss function we considered is the-insensitive Loss

L(x, y, f(x)) = max{0,|y−f(x)| −}with = 0.1. Covariates are generated as before while Y is now generated as Y =ω0X+1₃Ndim(X)(0,1). As before we initialize with a

5-fold Cross Validation onλ.

We repeat the process for different sample sizesn={100,200,400}. We also repeat the simulations 100 times each to note down the proportion of times the RFE made no errors (a), made only one error (b) or made more than 1 error (c) (See Table 3.1), where a mistake is made if the rank of any non-important feature is found to be higher than that of any important one. We compare the performance of RFE with LASSO in both settings (logistic regression with LASSO or linear regression with LASSO de- pending on the example), and the results for LASSO are given in the parentheses. The simulated relationships between the outputY and the input X being linear, we should

Figure 3.1: Reverse Scree Graph for one run of the simulations for (a) SVM with Gaussian Kernel (b) SVR with Linear Kernel withd= 30, d0 = 7

expect LASSO to work as well in these settings. However as seen in Table 3.1, RFE dominates LASSO in smaller sample sizes, while in larger sample sizes both perform equally well. The entire methodology was implemented in the MATLAB environment. For the implementation we used the SPIDER library for MATLAB4_{, which already has}

a feature elimination algorithm based on RFE and we modified it accordingly to suit our criterion for reduction. The codes for the algorithm and the simulations are given in 3.13.

One important question we inevitably face in feature elimination is when to stop. Note that our theoretical results suggest the existence of a gap0 and our results show

that asymptotically the difference in the empirical versions of the objective functions exceed it whenever we move beyond the correct dimension. Practically it is almost impossible to characterize this gap for a given setting, but the existence of this gap can be observed from the values of the objective function at each stage of the algorithm. One idea that can be implemented is that of a ‘reverse Scree graph’ (See section on Scree graphs in chapter 6 of Jolliffe (2002)). Implementation of the Scree graph is a

4_{The Spider library for Matlab can be downloaded from}_{http://www.kyb.tuebingen.mpg.de/bs/} people/spider/

Figure 3.2: Linear-Quadratic mixture change point analysis for (a) SVM with Gaussian Kernel for comparable cross validation values of λ and kernel width γ and (b) SVR with Linear Kernel for comparable cross validation values ofλ, withd = 30,d0 = 7 for

varying sample sizes. The bold dots represent the estimated change points.

Figure 3.3: Linear-Quadratic mixture change point analysis for (a) SVM with Gaussian Kernel for comparable cross validation values of λ and kernel width γ and (b) SVR with Linear Kernel for comparable cross validation values ofλ, withd = 50,d0 = 3 for

varying sample sizes. The bold dots represent the estimated change points.

well-formulated idea in choosing the correct number of Principal Components in PCA and that same idea can be applied here as well. We plot the values of the objective function inf i∈Z\Jλ f_D,λ,HJ∪{i} 2 HJ∪{i} +RL,D fD,λ,HJ∪{i}

at each run of the algorithm in a graph. Figure 3.1 justifies such an argument.

features to be eliminated, we tried some ad-hoc model diagnostic tools. From a heuristic standpoint, the phenomenon captured in Figure 3.1 seems to suggest that if we fit a regression model to the observed objective function values in the scree plot, we will expect a change in the slope of the regression line right after we start eliminating significant covariates because of the aforementioned gap. One plausible way to analyze this gap is to fit a change point regression model of the observed values on the number of cycles of RFE and to infer that the estimated change point is the ad-hoc stopping rule, so as to eliminate all features ranked below that point. For the asymptotic belief that the change in the objective function is negligible to the left of the change point, we fit a linear trend there. However to the right of the change point, these changes might show non-linear trends, and hence we tried linear and quadratic trends to model that. The quadratic trend seemed to work better. Some plots (see Figures 3.2, 3.3) are given here to show our analysis where we show the mixture of linear-quadratic fits.

So heuristically it is possible to justify the choice of the correct dimensions (features) based on a reverse scree graph. Otherwise some other user-defined choices for the gap size can be used to determine how many features are required in a specific setting.

In document Dasgupta_unc_0153D_14888.pdf (Page 83-87)