An information criterion for variable selection in Support Vector Machines.


Gerda Claeskens, Christophe Croux and Johan Van Kerckhoven

DEPARTMENT OF DECISION SCIENCES AND INFORMATION MANAGEMENT (KBI)

Faculty of Economics and Applied Economics


ORSTAT and University Center for Statistics Katholieke Universiteit Leuven

B-3000 Leuven, Belgium

Abstract

Using support vector machines for classification problems has the advantage that the curse of dimensionality is circumvented. However, it has been shown that even here a reduction of the dimension of the input space leads to better results. For this purpose, we propose two information criteria which can be computed directly from the definition of the support vector machine. We assess the predictive performance of the models selected by our new criteria and compare them to a few existing variable selection techniques in a simulation study. Results of this simulation study show that the new criteria are very competitive compared to the others in terms of out-of-sample error rate while being much easier to compute. When we repeat this comparison on a few real-world benchmark datasets, we arrive at the same findings.

Key words: variable selection, support vector machine, information criterion, supervised classification

1 Introduction

In many statistical applications, and in particular in regression analysis for predictive purposes, it is advised to select a subset of the variables to model the available training data. The reasons for this are manifold. First of all, having to estimate many parameters leads to an ill-fitting model for the data. Another good reason for employing variable selection is that fitting a model with too many variables results in poor predictions. This is especially true in classification problems, as the likelihood of achieving a perfect separation of the training data increases with the number of predictor variables. Moreover, adding more variables leads to an increased variability.

In this paper, we study classification using the support vector machine (SVM). We start from a training set $\{(x_i, y_i)\}$ containing $n$ observations. Each $p$-dimensional observation $x_i$ has a class label $y_i$ assigned to it, which can be either $+1$ or $-1$. We then want to find a function $f(\cdot)$ such that, for an observation $x$, the predicted class is $\hat{y} = +1$ if $f(x)$ is positive, and $\hat{y} = -1$ if $f(x)$ is negative. Naturally, we want this function to assign the correct class labels to the training observations (low in-sample error rate). This can be better achieved by using all $p$ available variables. On the other hand, we also want this function to accurately classify new observations (low out-of-sample error rate). For this reason, it is better to perform classification with only a few important variables. To be able to determine which set of variables should best be included, we propose two information criteria.

It can be argued that variable selection would not be really necessary in the support vector machine setting (see for example Cristianini and Shawe-Taylor, 2000, Hastie, Tibshirani, and Friedman, 2001, or Schölkopf and Smola, 2002), since it manages to circumvent the so-called “curse of dimensionality”. This reasoning is, however, only true to some extent. While the SVM approach avoids fitting a number of parameters equal to the dimension of the input space, the high probability of a perfect separation in high-dimensional problems remains. Hence, the risk of obtaining a decision rule with poor generalisation properties (high out-of-sample error rate) cannot be avoided. Guyon et al. (2002) illustrate this and show on an example that variable selection can even further improve the SVM’s performance.

[...] criterion. This assigns a “goodness of fit” value to each subset of the variables under consideration, and selects the one with the best value for the criterion as the most appropriate model. Examples of such criteria are the Akaike information criterion (AIC, Akaike, 1973) and the Bayesian information criterion (BIC, Schwarz, 1978) for (generalised) linear regression. For support vector machines, however, we find that so far only a few information criteria have been developed. One of these existing criteria is the kernel regularisation information criterion (KRIC) of Kobayashi and Komaki (2006), which was originally proposed for parameter tuning in the SVM rather than for variable selection. In this paper, we propose two new information criteria: one shares some properties with the AIC, the other with the BIC. We compare their performance to various other variable selection techniques for the SVM. More precisely, we want these new criteria to select models with good predictive properties, and at the same time, we want the criteria to be easily computable.

Although we restrict ourselves to using a criterion to select the “best” subset of input variables, this is by no means the only possibility. It is also possible to do a form of variable selection in feature space (for example Shih and Cheng, 2005) instead of in input space. Another possibility is to select, in input space, a set of “maximally separating directions” (Fortuna and Capson, 2004). Various other authors have suggested using a different formulation for the support vector machine, such that this formulation is automatically capable of variable selection. Examples of this can be found in Bi et al. (2003), Zhu et al. (2004), Neumann, Schnörr and Steidl (2005), Lee et al. (2006), Wang, Zhu, and Zou (2006), Zhang (2006), and Lin and Zhang (to appear).

In Section 2 of this paper we give a short refresher on the support vector machine setting, and introduce the notation that we use in the rest of this paper. We also review some of the already available techniques which assign “goodness-of-fit” values to each subset of input variables. Finally, we look at techniques to speed up the variable selection process without losing too much optimality of the selected subset of variables. Then, in Section 3, we define the new information criteria and highlight the advantages that the new criteria have with respect to the ones already developed. In Section 4 we describe the results of a simulation study to compare the performance of the new criteria with respect to the methods for variable selection in support vector machines described in Section 2.2. In Section 5 we compare the different techniques on a few real-world benchmark datasets. Finally, we summarise our findings and phrase some conclusions in Section 6.

2 Problem Setting

We introduce the support vector machine setting and some notation in Section 2.1. In Section 2.2 we review some existing criterion-based variable selection techniques, and in Section 2.3 we investigate some methods to speed up the computation time.

2.1 The support vector machine

Assume that we have a training sample $(x_i, y_i)$, $1 \le i \le n$, where $x_i$ is a $p$-dimensional vector containing the explicative variables, and $y_i \in \{-1, +1\}$ is the class label. The goal is to estimate a target function $f(x)$ in the space of explicative variables such that $f(x_i) > 0$ for $y_i = +1$, and $f(x_i) < 0$ for $y_i = -1$. Taking a complicated function $f(\cdot)$ such that $f(x) = 0$ perfectly separates both classes leads to poor generalisation properties, whereas taking the function too smooth fails to catch some specific properties of the training data.

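We start with linear support vector machines, where $f(x)$ is of the form

$$f(x) = w'x + b.$$

The weight vector $w$ and the constant $b$ are obtained by solving the minimisation problem

$$\min_{w,b,\xi_i} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \right\} \tag{1}$$

subject to

$$y_i(w'x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n.$$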

The $\xi_i$ in the problem are slack margin variables, indicating how close a point $x_i$ lies to the separating boundary (if $\xi_i < 1$), or how badly it is misclassified (if $\xi_i > 1$), and $C$ is a tuning parameter, controlling how much weight is put on trying to achieve perfect separation.

The dual problem can be solved more easily, and has the following form:

$$\min_{\alpha} \left\{ \frac{1}{2}\alpha' Q \alpha - \sum_{i=1}^{n} \alpha_i \right\}$$

subject to

$$0 \le \alpha_i \le C, \quad i = 1, \dots, n, \qquad \sum_{i=1}^{n} y_i\alpha_i = 0.$$

In this problem, $\alpha_i$ is the weight given to the observation $(x_i, y_i)$ in the solution, and $Q$ is a positive semi-definite matrix with entries $Q_{i,j} = y_i y_j x_i' x_j$. The vector $w$ can be found using the following relation:

$$w = \sum_{i=1}^{n} y_i \alpha_i x_i.$$

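The negative intercept $b$ can be found by computing

$$b = \frac{r_2 - r_1}{2},$$

where

$$r_1 = \frac{\sum_{0 < y_i\alpha_i < C} (Q\alpha)_i - 1}{\sum_{0 < y_i\alpha_i < C} 1}, \qquad r_2 = \frac{\sum_{0 > y_i\alpha_i > -C} (Q\alpha)_i - 1}{\sum_{0 > y_i\alpha_i > -C} 1}.$$

If no $i$ exists for which $0 < y_i\alpha_i < C$, then take

$$r_1 = \frac{1}{2}\left( \min_{\alpha_i = 0,\, y_i = 1} (Q\alpha)_i - \max_{\alpha_i = C,\, y_i = 1} (Q\alpha)_i \right),$$

and analogously for $r_2$, with $y_i = -1$. This leads to a decision function of the form

$$f(x) = w'x + b.$$

Note that we can write $\xi_i = [1 - y_i a_i]_+$, where $[x]_+ = \max\{0, x\}$ and where $a_i = f(x_i)$.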
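The relations $w = \sum_i y_i\alpha_i x_i$ and $\xi_i = [1 - y_i a_i]_+$ can be illustrated with a minimal Python sketch. The toy data, the dual weights `alpha` and the intercept `b` below are hypothetical values chosen for illustration; they are not the output of an actual SVM solver.

```python
def weight_vector(X, y, alpha):
    """w = sum_i y_i * alpha_i * x_i for a linear SVM."""
    p = len(X[0])
    w = [0.0] * p
    for x_i, y_i, a_i in zip(X, y, alpha):
        for j in range(p):
            w[j] += y_i * a_i * x_i[j]
    return w


def decision_value(w, b, x):
    """a = f(x) = w'x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


def margin_slacks(X, y, w, b):
    """xi_i = [1 - y_i * f(x_i)]_+ for every training observation."""
    return [max(0.0, 1.0 - y_i * decision_value(w, b, x_i))
            for x_i, y_i in zip(X, y)]


# Hypothetical toy data and dual weights (they satisfy sum_i y_i alpha_i = 0).
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]]
y = [+1, +1, -1, -1]
alpha = [0.25, 0.0, 0.25, 0.0]
b = 0.0  # assumed intercept, for illustration only

w = weight_vector(X, y, alpha)
slacks = margin_slacks(X, y, w, b)
```

In this toy example the second observation lies inside the margin ($0 < \xi_i < 1$) and the fourth is misclassified ($\xi_i > 1$), matching the interpretation of the slack variables given above.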

The linear SVM as described above can be extended towards more complex decision functions in a rather straightforward way. To do so, we replace the inner products $x_i' x_j$ in the definition of $Q$ by a more general kernel function $K(x_i, x_j)$. See Cristianini and Shawe-Taylor (2000) for the properties that these kernel functions must have. In turn, this leads to a more general decision function

$$f(x) = \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b.$$

Popular choices of kernel function are the linear kernel, which corresponds to the situation above, the polynomial kernel of the form $K(x, z) = (c_0 + \gamma x'z)^d$, and the radial basis kernel $K(x, z) = \exp(-\gamma\|x - z\|^2)$, where $c_0$, $\gamma$ and $d$ are regularisation parameters which can be tuned for optimal performance of the classifier. In this more general setting, we have

$$\|w\|^2 = \sum_{i,j=1}^{n} y_i y_j \alpha_i \alpha_j K(x_i, x_j) = \alpha' Q \alpha$$

for the squared norm of the weight vector, where $Q_{i,j} = y_i y_j K(x_i, x_j)$.
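As a small illustration of these definitions, the following Python sketch implements the three kernels, the matrix $Q$ and the quadratic form $\alpha'Q\alpha$. The function names, default parameter values and toy inputs are assumptions made for this example only.

```python
import math


def linear_kernel(x, z):
    """K(x, z) = x'z."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, c0=1.0, gamma=1.0, d=2):
    """K(x, z) = (c0 + gamma * x'z)^d."""
    return (c0 + gamma * linear_kernel(x, z)) ** d


def rbf_kernel(x, z, gamma=1.0):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def q_matrix(X, y, kernel):
    """Q[i][k] = y_i * y_k * K(x_i, x_k)."""
    n = len(X)
    return [[y[i] * y[k] * kernel(X[i], X[k]) for k in range(n)]
            for i in range(n)]


def squared_norm(alpha, Q):
    """||w||^2 = alpha' Q alpha."""
    n = len(alpha)
    return sum(alpha[i] * Q[i][k] * alpha[k]
               for i in range(n) for k in range(n))


# Hypothetical two-point example with the linear kernel.
X = [[1.0, 0.0], [-1.0, 0.0]]
y = [+1, -1]
alpha = [0.5, 0.5]
norm_sq = squared_norm(alpha, q_matrix(X, y, linear_kernel))
```

For the linear kernel the result agrees with computing $w = \sum_i y_i\alpha_i x_i = (1, 0)$ directly and taking its squared length.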

2.2 Existing variable selection techniques

The issue of variable selection (or feature selection) in support vector machines has been investigated by a few authors. Most techniques which have been developed so far rely on either a transformation of the input space, or on using a different objective function in the SVM formulation (1). We will not concern ourselves with those, but only with the techniques that use the data as they are. One of the more common techniques consists of ranking the variables in some way according to importance and picking out the most important ones (e.g. Guyon et al., 2002, Rakotomamonjy, 2003, Shih and Cheng, 2005, and Zhang et al., 2006), where the number of retained variables is determined a priori. More details on this are in Section 2.3. A drawback of this method is that outside information is needed to determine the number of retained variables.

It is of course preferable to select the variables using only information contained in the data. Kearns et al. (1997) give an overview and compare the use of cross-validated error rate (CV), guaranteed risk minimisation (GRM) by Vapnik (1982), and an application of minimum description length by Rissanen (1989). Their experiments demonstrated that none of these criteria is consistently better than the others as the sample size $n$ varies. More recent work includes the kernel regularisation information criterion (KRIC) by Kobayashi and Komaki (2006). Although this criterion was originally developed to tune the constant $C$ in the SVM definition (1), and by extension to tune the kernel parameters, it can be used without much adjustment for variable selection purposes. For comparison purposes in the remainder of this paper, we consider variable selection based on cross-validation, GRM, and on the KRIC. Each of these will be explained in more detail below.

In cross-validation based selection, we estimate the out-of-sample error rate of each model under consideration using a 10-fold cross-validation scheme. Each of these models corresponds to a subset $S$ of input variables, where $S$ contains exactly those variables included in the model. This will return an estimate $\hat{\varepsilon}(S)$ of the out-of-sample error rate. Then, we select the model with the lowest value of $\hat{\varepsilon}(S)$, where $S$ ranges over all subsets of variables under consideration.

[...] estimated out-of-sample error rate, using

$$\mathrm{GRM}(S) = \hat{\varepsilon}(S) + \frac{|S|}{n}\left( 1 + \sqrt{1 + \hat{\varepsilon}(S)\,(n/|S|)} \right). \tag{2}$$

Here, |S| stands for the number of input variables in the set S. Recall that n is the number of observations in the training sample. Once again, we select the model with the lowest value of GRM (S), where S ranges over all subsets of variables under consideration. Since we need to train 10 support vector machines to estimate the out-of-sample error rate for just one model, it is easily seen that the computational overhead can be immense.
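Formula (2) is straightforward to evaluate once the cross-validated error rate is known. The sketch below is a minimal Python illustration; the error rates in `cv_errors` are made-up numbers, not results from this paper.

```python
import math


def grm(cv_error, subset_size, n):
    """Guaranteed risk minimisation score as in (2).

    cv_error    : estimated out-of-sample error rate of the model
    subset_size : |S|, the number of input variables in the model
    n           : number of training observations
    """
    ratio = subset_size / n
    return cv_error + ratio * (1.0 + math.sqrt(1.0 + cv_error / ratio))


# Hypothetical cross-validated error rates, keyed by subset size |S|.
cv_errors = {1: 0.30, 4: 0.18, 10: 0.17}
n = 100

scores = {size: grm(err, size, n) for size, err in cv_errors.items()}
best_size = min(scores, key=scores.get)
```

In this made-up example the model with ten variables has the lowest error rate, but the penalty in (2) makes the four-variable model the GRM choice.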

Before we recall the definition of the KRIC of Kobayashi and Komaki (2006), we need to introduce some notation. First, let $x_{i,S}$ denote the subvector of $x_i$ consisting of the elements $x_{ij}$ with $j \in S$. Then, we estimate an SVM

$$\min_{w,b,\xi} \left\{ \frac{1}{2}\|w_S\|^2 + C \sum_{i=1}^{n} \xi_{i,S} \right\}$$

subject to

$$y_i(w_S' x_{i,S} + b_S) \ge 1 - \xi_{i,S}, \qquad \xi_{i,S} \ge 0, \qquad i = 1, \dots, n,$$

on the observations $(x_{i,S}, y_i)$. For the dual problem this results in training

$$\min_{\alpha_S} \left\{ \frac{1}{2}\alpha_S' Q_S \alpha_S - \sum_{i=1}^{n} \alpha_{i,S} \right\}$$

subject to

$$0 \le \alpha_{i,S} \le C, \quad i = 1, \dots, n, \qquad \sum_{i=1}^{n} y_i \alpha_{i,S} = 0,$$

where $\alpha_S = (\alpha_{1,S}, \dots, \alpha_{n,S})$ and $[Q_S]_{i,k} = y_i y_k K(x_{i,S}, x_{k,S})$. The decision rule as a result of training this model is

$$f_S(x) = \sum_{i=1}^{n} y_i \alpha_{i,S} K(x_{i,S}, x_S) + b_S,$$

with $x_S$ defined in a similar way as $x_{i,S}$, but for the vector $x$. Once again, we observe [...] length $n$, with components

$$t_{i,S} = \frac{\eta^2 \exp(-\eta a_{i,S} y_i)}{(1 + \exp(-\eta a_{i,S} y_i))^2} \qquad \text{and} \qquad m_{i,S} = -\eta\, \frac{y_i \exp(-\eta a_{i,S} y_i)}{1 + \exp(-\eta a_{i,S} y_i)}, \qquad i = 1, \dots, n.$$

Here we choose $\eta = \log(2)$ such that $\log(1 + \exp(-\eta x))$ and $\eta[1 - x]_+$ coincide for $x = 0$; see Kobayashi and Komaki (2006) for further motivation. Taking $\lambda = C^{-1}\log 2$, Kobayashi and Komaki (2006) define the kernel regularisation information criterion (KRIC) for the logistic Bayesian model for SVMs as

$$\mathrm{KRIC}(S) = 2 \left[ \sum_{i=1}^{n} \log\big(1 + \exp(-\eta a_{i,S} y_i)\big) + \mathrm{trace}\Big( (Q_S \mathrm{diag}(t_S) + \lambda I_n)^{-1} Q_S \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \Big) \right]. \tag{3}$$

This is an estimate of the regularisation information criterion for the logistic Bayesian model for support vector machines, which can be recognised as a regularisation model (Kobayashi and Komaki, 2006). Alternatively, we use Sollich’s Bayesian model for SVMs (Sollich, 2002). This approach leads to a KRIC which has a similar form to the one in (3). Using

$$\nu(a_{i,S}) = (1 + \exp(-2C))^{-1}\big( \exp(-C[1 - a_{i,S}]_+) + \exp(-C[1 + a_{i,S}]_+) \big),$$

the expression for the KRIC for the Sollich Bayesian model for SVMs can be written as

$$\mathrm{KRIC}_S(S) = \mathrm{KRIC}(S) - 2n \log \sum_{i=1}^{n} \nu(a_{i,S}). \tag{4}$$

We notice that the computation of the KRIC includes inverting an n×n-matrix with only a few zeroes. Because of this, computing the KRIC is a very time-consuming operation if the sample size n becomes large.

Because of the additional computation time needed to compute the CV error rate or the KRIC, these criteria are less useful when a large number of different models needs to be evaluated.


2.3 Ranking techniques

A major problem is the exponential growth of the number of models under consideration as the number of variables grows. For this reason, a full subset search is computationally infeasible even for problems with only a small number of dimensions ($p = 15$ for example). To avoid this problem, several techniques have been introduced to dramatically decrease the number of models considered while still selecting a model that is “almost” the best model. Chen, Li and Li (2005) for example suggest a genetic algorithm to arrive at the selected subset of variables, while Peng, Long and Ding (2005) suggest a combined backward elimination/forward selection strategy. However, both of these techniques still suffer from the possibility that a very large number of models needs to be checked before arriving at a solution.

For this reason we follow the technique of variable ranking. This consists of assigning a “value of importance” to each variable and sorting the variables according to their importance. This results in a series of $p$ stacked models, and as such, only $p$ evaluations of the variable selection criterion will be needed. The most commonly used algorithm is the SVM recursive feature elimination (SVM-RFE) technique from Guyon et al. (2002). For a linear SVM, they suggest ranking the variables by $w_j^2$, with $w_j$ the $j$-th component of the weight vector $w$. Note that this ranking technique only makes sense when the variables are standardised to have mean 0 and variance 1. Otherwise it is possible that an irrelevant variable gets selected as most important when its standard deviation is several orders of magnitude larger than the standard deviations of the other variables. Rakotomamonjy (2003) extends this ranking criterion to make it applicable to support vector machines with a non-linear kernel, and proposes several other ranking criteria. Additionally, Zhang et al. (2006) suggest using

$$s_j = |w_j (m_{j,+1} - m_{j,-1})|,$$

[...] method is suggested by Shih and Cheng (2005), who propose using the Fisher score

$$S_j = \frac{|m_{j,+1} - m_{j,-1}|}{\sqrt{\sigma^2_{j,+1} + \sigma^2_{j,-1}}}$$

for a linear SVM, where $\sigma^2_{j,+1}$ and $\sigma^2_{j,-1}$ are the within-class variances of variable $j$. The main advantage of using the Fisher score is that it is not necessary to train any support vector machine to rank the variables by this criterion.
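Because no SVM needs to be trained, the Fisher score ranking takes only a few lines. The Python sketch below assumes that $m_{j,+1}$ and $m_{j,-1}$ are the within-class means of variable $j$ and that the variances are computed with divisor equal to the class size; both are assumptions of this illustration, as is the toy data.

```python
import math


def fisher_scores(X, y):
    """Fisher score S_j for every variable j (labels are +1 / -1).

    Assumes m_{j,+1}, m_{j,-1} are within-class means and uses the
    population (divide-by-class-size) within-class variance.
    A variable that is constant within both classes would cause a
    division by zero; this sketch does not guard against that.
    """
    p = len(X[0])
    scores = []
    for j in range(p):
        stats = {}
        for label in (+1, -1):
            values = [x[j] for x, lab in zip(X, y) if lab == label]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            stats[label] = (mean, var)
        (m_pos, v_pos), (m_neg, v_neg) = stats[+1], stats[-1]
        scores.append(abs(m_pos - m_neg) / math.sqrt(v_pos + v_neg))
    return scores


def rank_by_score(scores):
    """Variable indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


# Hypothetical data: variable 0 separates the classes, variable 1 does not.
X = [[2.0, 0.0], [4.0, 1.0], [-2.0, 1.0], [-4.0, 0.0]]
y = [+1, +1, -1, -1]
scores = fisher_scores(X, y)
ranking = rank_by_score(scores)
```

The resulting ranking defines the nested sequence of models (first variable only, first two variables, and so on) on which a selection criterion is then evaluated.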

In this paper, we use the Fisher score to rank the variables, as well as the SVM-RFE algorithm with variable influence

$$\Delta\|w_S\|^2(j) \stackrel{\Delta}{=} \left|\, \|w_S\|^2 - \|w_{S\setminus\{j\}}\|^2 \,\right|$$

as suggested by Rakotomamonjy (2003). For clarity, we will briefly recall the SVM-RFE algorithm below:

Step 1: Initialise $S = \{1, \dots, p\}$, the subset of unranked features, and $r = ()$, the list of ranked features.

Step 2: Repeat the following steps until $S = \emptyset$.

Step 2a: Train a support vector machine on $(x_{i,S}, y_i)$, and let $\|w_S\|^2 = \alpha_S' Q_S \alpha_S$ be the squared norm of the weight vector, where $x_{i,S}$, $Q_S$ and $\alpha_S$ are defined as in Section 2.2.

Step 2b: For each $j \in S$, train a new support vector machine on $(x_{i,S\setminus\{j\}}, y_i)$. This gives $\|w_{S\setminus\{j\}}\|^2 = \alpha_{S\setminus\{j\}}' Q_{S\setminus\{j\}} \alpha_{S\setminus\{j\}}$ for each $j \in S$.

Step 2c: Denote $m = \operatorname{argmin}_j \left|\, \|w_S\|^2 - \|w_{S\setminus\{j\}}\|^2 \,\right|$ [...]

The vector $r$ contains the ranked variables, with the first element the most important one. A disadvantage of this method is that the number of SVMs to be trained is $O(p^2)$. However, this can be overcome by using $\alpha_S$ instead of $\alpha_{S\setminus\{j\}}$ in Step 2b, such that $\|w_{S\setminus\{j\}}\|^2 \approx \alpha_S' Q_{S\setminus\{j\}} \alpha_S$. Rakotomamonjy (2003) argues that this will not affect the ranking significantly, while still allowing a major reduction in computational time, bringing the number of SVMs to be estimated to $O(p)$. We employ this approximation in the simulation study in Section 4 and in the real data example in Section 5.
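The elimination loop can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `norm_sq` is a hypothetical callable that returns $\|w_S\|^2$ for a given subset (by retraining an SVM, or via the $\alpha_S' Q_{S\setminus\{j\}} \alpha_S$ approximation), and the update after Step 2c (removing variable $m$ from $S$ and placing it at the front of $r$) is assumed from the usual recursive feature elimination scheme. Variables are indexed from 0.

```python
def svm_rfe(p, norm_sq):
    """Rank p variables by recursive feature elimination.

    norm_sq(S) is a hypothetical callable returning ||w_S||^2 for the
    SVM trained on the variables in the tuple S.
    Returns the ranking r, most important variable first.
    """
    remaining = list(range(p))  # S, the unranked variables
    ranking = []                # r, the ranked variables
    while remaining:
        full = norm_sq(tuple(remaining))
        # Influence of variable j: change in ||w||^2 when j is left out.
        influence = {
            j: abs(full - norm_sq(tuple(k for k in remaining if k != j)))
            for j in remaining
        }
        least = min(remaining, key=lambda j: influence[j])
        # Assumed update step: drop the least influential variable and
        # put it in front of the variables ranked so far.
        remaining.remove(least)
        ranking.insert(0, least)
    return ranking


# Stand-in for SVM training: ||w_S||^2 is the sum of fixed squared weights.
toy_weights = [3.0, 0.1, 1.0]


def toy_norm_sq(S):
    return sum(toy_weights[j] ** 2 for j in S)


ranking = svm_rfe(3, toy_norm_sq)
```

With the stand-in `toy_norm_sq`, the influence of variable $j$ is simply its squared weight, so the ranking orders the variables by decreasing weight.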

3 The new Information Criteria

In this section, we propose two new information criteria and give a motivation for their use. We also list a major advantage compared to the variable selection techniques discussed in Section 2.2.

As stated in the previous section, evaluating the CV error rate or the KRIC of a particular support vector machine model requires a large number of additional computations. For this reason, we propose a new criterion which uses information already available in the SVM, without additional complicated computations. This criterion is based on how badly the SVM violates the margin constraints, as measured by the sum

$$\sum_{i=1}^{n} \xi_{i,S},$$

where $\xi_{i,S}$ is the margin slack of observation $i$ in the support vector machine trained on the variables with indices in $S$, where $S$ is a subset of $\{1, \dots, p\}$. Alternatively, we could use the logarithm of this sum, analogous to Bai and Ng (2002) for selecting the number of factors in factor analysis. However, in the SVM setting this has the drawback that the value is undefined if the sum equals zero, which can happen if the data are perfectly separable. Also, Bai and Ng (2002) advise using a log-transform for scale invariance reasons. Since we follow the advice to standardise the variables before training the SVM, for better ranking as explained in Section 2.3, we automatically have scale invariance of the sum of the margin slacks. For these reasons, we choose not to take the log-transform.

Generally (but not always), $\sum_i \xi_{i,S}$ will decrease as more variables are added. Therefore we must also add a penalty term related to the number of included variables to ensure a tradeoff between accuracy and simplicity of the chosen model. We suggest adding a linear penalty term, such that we get an information criterion of the form

$$\mathrm{IC}(S) = \sum_{i=1}^{n} \xi_{i,S} + C(n)|S|,$$

where $S$ is the set of variables included in the model.

The first choice for $C(n)$ is $C(n) = 2$, a constant. This leads to our first support vector machine Information Criterion (SVMIC):

$$\mathrm{SVMIC}_a(S) = \sum_{i=1}^{n} \xi_{i,S} + 2|S|. \tag{5}$$

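The $\mathrm{SVMIC}_a$ is an easily computable approximation of the KRIC of Kobayashi and Komaki (2006), up to constant factors, for the linear support vector machine. To better understand this, note first that $\log(1 + \exp(-\eta a_{i,S} y_i))$ is a continuous approximation of the hinge loss function $\eta[1 - y_i a_{i,S}]_+ = \eta\xi_{i,S}$ for all $1 \le i \le n$. Hence, the first term in the KRIC can be approximated, up to a constant factor, by $\sum_i \xi_{i,S}$. For the approximation in the second term, note first that

$$\begin{aligned}
&(Q_S \mathrm{diag}(t_S) + \lambda I_n)^{-1} Q_S \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \\
&\quad = (Q_S \mathrm{diag}(t_S) + \lambda I_n)^{-1} Q_S\, \mathrm{diag}(t_S)\, \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \\
&\quad \approx V\, \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big),
\end{aligned}$$

with $V = (Q_S \mathrm{diag}(t_S) + \lambda I_n)^{-1} Q_S\, \mathrm{diag}(t_S)$ a symmetric, positive semi-definite matrix. The matrix $V$ can then be written as $V = U_1 L_1 U_1'$, where $U_1$ is an orthogonal matrix, and $L_1$ a diagonal matrix consisting of the eigenvalues of $V$. Since $Q_S$ is of rank $|S|$ for $|S| < n$, we have that [...] for $\lambda$ small, with $A^-$ the pseudo-inverse of a matrix $A$. As a result,

$$L_1 \approx \begin{pmatrix} I_{|S|} & 0 \\ 0 & 0_{n-|S|} \end{pmatrix}.$$

Hence we find that

$$\begin{aligned}
&\mathrm{trace}\Big( (Q_S \mathrm{diag}(t_S) + \lambda I_n)^{-1} Q_S \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \Big) \\
&\quad \approx \mathrm{trace}\left( U_1 \begin{pmatrix} I_{|S|} & 0 \\ 0 & 0_{n-|S|} \end{pmatrix} U_1'\, \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \right) \\
&\quad \approx \mathrm{trace}\left( \begin{pmatrix} I_{|S|} & 0 \\ 0 & 0_{n-|S|} \end{pmatrix} U_1'\, \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big)\, U_1 \right).
\end{aligned}$$

The selector matrix in the first factor indicates that only the first $|S|$ diagonal elements are added. Hence, we can approximate the above expression by

$$\frac{|S|}{n}\, \mathrm{trace}\Big( U_1'\, \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big)\, U_1 \Big)$$

or by

$$\frac{|S|}{n}\, \mathrm{trace}\Big( \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \Big),$$

since the trace is invariant to an orthogonal transformation. A quick calculation reveals that the $i$-th diagonal element of this matrix is equal to

$$\frac{n-1}{n}\, t_{i,S}^{-1} m_{i,S}^2 = \frac{n-1}{n} \exp(-\eta a_{i,S} y_i).$$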
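The quick calculation can be spelled out using only the definitions of $t_{i,S}$ and $m_{i,S}$ from Section 2.2 and the fact that $y_i^2 = 1$. Write $e_i = \exp(-\eta a_{i,S} y_i)$, a shorthand introduced here for brevity. The $i$-th diagonal element of the second factor is

$$\big[\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big]_{ii} = m_{i,S}^2 - \frac{1}{n} m_{i,S}^2 = \frac{n-1}{n}\, m_{i,S}^2,$$

and, since $\mathrm{diag}(t_S)^{-1}$ is diagonal, multiplying by it scales this element by $t_{i,S}^{-1}$. Furthermore,

$$t_{i,S}^{-1} m_{i,S}^2 = \frac{(1 + e_i)^2}{\eta^2 e_i} \cdot \frac{\eta^2 y_i^2 e_i^2}{(1 + e_i)^2} = e_i = \exp(-\eta a_{i,S} y_i),$$

which gives the stated expression.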

To further facilitate computations we replace this by 1. Although this approximation might be crude for a single term, we found empirically that it works well for the summation over the entire training set. Hence, we arrive at

$$\frac{|S|}{n}\, \mathrm{trace}\Big( \mathrm{diag}(t_S)^{-1} \big(\mathrm{diag}(m_S)^2 - n^{-1} m_S m_S^t\big) \Big) \approx \frac{|S|}{n}\, \mathrm{trace}(I_n) = |S|,$$

which is the linear penalty term in the $\mathrm{SVMIC}_a$. The newly proposed criterion $\mathrm{SVMIC}_a$ for support vector machines shares the form of the penalty with the [...] value of the maximised log likelihood of the model, plus two times the number of parameters to be estimated (that is, $2|S|$). Because the penalty $2|S|$ does not depend on the sample size $n$, we expect that both criteria share some properties, such as the tendency not to select the most parsimonious model. For the AIC, Woodroofe (1982) has shown that in the limit for $n \to \infty$, the expected number of superfluous parameters is less than one.

Our second proposed criterion follows the spirit of Schwarz’s (1978) Bayesian information criterion (BIC). This criterion is defined similarly to the AIC, but instead of the penalty $2|S|$, it uses $\log(n)|S|$. The BIC has been shown to be consistent (Haughton 1988, 1989). This means that if the true model is contained in the search list, the criterion will (in the limit for $n \to \infty$) select this correct model. For a related construction for factor models, see Bai and Ng (2002).

This motivates us to take $C(n) = \log(n)$, and we define our second criterion

$$\mathrm{SVMIC}_b(S) = \sum_{i=1}^{n} \xi_{i,S} + \log(n)|S|. \tag{6}$$
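Definitions (5) and (6) translate directly into code. The Python sketch below evaluates both criteria on a sequence of nested models and picks the minimiser; the slack values in `slacks_by_size` are invented for illustration and do not come from a fitted SVM.

```python
import math


def svmic_a(slacks, subset_size):
    """SVMIC_a(S) = sum_i xi_{i,S} + 2 |S|, as in (5)."""
    return sum(slacks) + 2.0 * subset_size


def svmic_b(slacks, subset_size):
    """SVMIC_b(S) = sum_i xi_{i,S} + log(n) |S|, as in (6)."""
    return sum(slacks) + math.log(len(slacks)) * subset_size


def select_subset_size(slacks_by_size, criterion):
    """Return the subset size |S| whose model minimises the criterion."""
    return min(slacks_by_size,
               key=lambda size: criterion(slacks_by_size[size], size))


# Hypothetical margin slacks of n = 100 observations for nested models
# containing the 1, 2, 3 and 4 top-ranked variables.
slacks_by_size = {
    1: [0.40] * 100,
    2: [0.30] * 100,
    3: [0.27] * 100,
    4: [0.26] * 100,
}

size_a = select_subset_size(slacks_by_size, svmic_a)
size_b = select_subset_size(slacks_by_size, svmic_b)
```

With these invented slacks, the heavier $\log(n)$ penalty leads $\mathrm{SVMIC}_b$ to a smaller model than $\mathrm{SVMIC}_a$, in line with the AIC/BIC analogy.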

It is immediately obvious that the computational cost of both SVMICs is much lower than that of the cross-validated error rate (10 more SVMs to train for 10-fold cross-validation) and of the kernel regularisation information criterion KRIC (which needs computations of the order $O(n^3)$ due to the matrix inversion). The best case is when the $\xi_{i,S}$ are directly available. Computing the SVMICs is only an $O(n)$ computation in that case, and usually even less when employing the property that

$$\xi_{i,S} \ne 0 \Leftrightarrow \alpha_{i,S} = 1.$$

When only $\alpha_S$ and $Q_S$ are available, $\xi_{i,S}$ has to be computed using the relation

$$\xi_{i,S} = \Big[ 1 - y_i \sum_{\substack{j=1 \\ \alpha_{j,S} > 0}}^{n} \alpha_{j,S} [Q_S]_{ij} \Big]_+.$$

This means that in the worst case, the computation time of the SVMICs is $O(n^2)$, [...]

4 Simulation Setup

In this experiment, we perform $M = 100$ simulation runs with the following settings. We generate $n \in \{25, 50, 100, 200\}$ independent observations $x_i$, $1 \le i \le n$, of dimension $p \in \{25, 50, 100, 200\}$, with distribution $N(0, \sigma^2 I_p)$ where $\sigma^2 = 1$. For each observation, we also generate a class label $y_i \in \{-1, +1\}$, with $P(y_i = 1) = 1/2$. Finally, we let $\mu = (+1/2, -1/2, -1/2, +1/2, 0, \dots, 0)$ of dimension $p$, and replace $x_i$ by $x_i + y_i\mu$ to separate the two classes to some extent. This is equivalent to drawing the observations $x_i$ with class label $y_i = +1$ from a normal distribution $N(\mu, \sigma^2 I_p)$, and the observations $x_j$ with class label $y_j = -1$ from a normal distribution $N(-\mu, \sigma^2 I_p)$. This implies that the optimal separating hyperplane would be $x'\mu = 0$, such that $\hat{y} = +1$ if $x'\mu > 0$. This hyperplane would yield an out-of-sample misclassification rate of

$$\begin{aligned}
&P(Y = +1)\, P(X'\mu + \|\mu\|_2^2 < 0) + P(Y = -1)\, P(X'\mu - \|\mu\|_2^2 > 0) \\
&\quad = \frac{1}{2} P(\sigma\|\mu\|_2 Z < -\|\mu\|_2^2) + \frac{1}{2} P(\sigma\|\mu\|_2 Z > \|\mu\|_2^2) \\
&\quad = \Phi\left( -\frac{\|\mu\|_2}{\sigma} \right),
\end{aligned}$$

where $Z$ is a standard normally distributed random variable. In our example, with $\sigma = 1$ and $\|\mu\|_2 = 1$, we find an optimal out-of-sample misclassification error of 0.159.
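The data-generating process and the value $\Phi(-1) \approx 0.159$ can be checked with a short Python sketch that uses only the standard library. The function names, the random seed and the Monte Carlo sample size are choices made for this illustration.

```python
import math
import random


def normal_cdf(z):
    """Standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def optimal_error(mu, sigma=1.0):
    """Misclassification rate Phi(-||mu||_2 / sigma) of the rule sign(x'mu)."""
    norm = math.sqrt(sum(m * m for m in mu))
    return normal_cdf(-norm / sigma)


def simulate(n, p, sigma=1.0, rng=None):
    """Draw n observations of dimension p (p >= 4) from the two-class design."""
    rng = rng or random.Random(0)
    mu = [0.5, -0.5, -0.5, 0.5] + [0.0] * (p - 4)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else -1
        # x_i ~ N(y_i * mu, sigma^2 I_p)
        X.append([rng.gauss(label * m, sigma) for m in mu])
        y.append(label)
    return X, y, mu


X, y, mu = simulate(n=25, p=25)
bayes_error = optimal_error(mu)  # Phi(-1), approximately 0.1587

# Monte Carlo check of the optimal rule on a larger hypothetical sample.
X_big, y_big, mu_big = simulate(n=20000, p=4, rng=random.Random(1))
mistakes = sum(
    1 for x, label in zip(X_big, y_big)
    if (1 if sum(a * b for a, b in zip(x, mu_big)) > 0 else -1) != label
)
empirical_error = mistakes / len(y_big)
```

The empirical error of the rule $\hat{y} = \mathrm{sign}(x'\mu)$ should be close to `bayes_error`, up to Monte Carlo noise.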

During each simulation run, we first standardise the variables to improve the numerical performance of the SVM algorithm. We also need this standardisation for the variable ranking step which comes next. Then, we rank the variables using either the Fisher score or the variable influence as described in Section 2.3. For each of the nested models obtained in the variable ranking step, we compute $\mathrm{SVMIC}_a$ and $\mathrm{SVMIC}_b$; see equations (5) and (6) for the definitions. We compare the performance of these two criteria with the CV error rate criterion, using a ten-fold cross-validation scheme, and Vapnik’s GRM as in (2). An important remark is that we do not compute the CV error rate in the usual way, namely determining it for each model under consideration, because this can lead to a (severely) biased estimate (see Zhang et al., 2006, for an illustration). We will instead employ the CV2 method, which includes the feature selection procedure in the cross-validation, as suggested by Zhang et al. (2006). When using the regular CV method, the input variables are first ranked, and the error rate of the model including the first $k$ ranked variables is then estimated using cross-validation, where $k$ ranges from 1 to $p$. In the CV2 method, the variable ranking is done in each part of the 10-fold cross-validation, and can vary from one part to the other.
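The difference between the two schemes lies only in where the ranking happens. The Python sketch below illustrates the CV2 idea: the ranking is recomputed on the training part of every fold. It is a schematic example, not the procedure of Zhang et al. (2006) in detail. The callables `rank_fn` and `fit_fn`, the fold assignment, and the nearest-class-mean classifier used as a stand-in for the SVM are all assumptions of this illustration.

```python
def class_means(X, y):
    """Per-class mean vector for labels +1 and -1."""
    means = {}
    for label in (+1, -1):
        rows = [x for x, lab in zip(X, y) if lab == label]
        means[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return means


def rank_by_mean_gap(X, y):
    """Stand-in ranking: sort variables by absolute class-mean difference."""
    means = class_means(X, y)
    gaps = [abs(a - b) for a, b in zip(means[+1], means[-1])]
    return sorted(range(len(gaps)), key=lambda j: gaps[j], reverse=True)


def fit_nearest_mean(X, y):
    """Stand-in classifier (not an SVM): predict the nearest class mean."""
    means = class_means(X, y)

    def predict(x):
        dist = {lab: sum((a - b) ** 2 for a, b in zip(x, m))
                for lab, m in means.items()}
        return min(dist, key=dist.get)

    return predict


def cv2_error(X, y, k, rank_fn, fit_fn, n_folds=10):
    """CV error of the model with the k top-ranked variables, where the
    ranking is redone inside every fold (the CV2 idea)."""
    n = len(X)
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    errors = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        # The ranking only sees the training part of this fold.
        top = rank_fn(X_tr, y_tr)[:k]
        predict = fit_fn([[x[j] for j in top] for x in X_tr], y_tr)
        errors += sum(1 for i in test_idx
                      if predict([X[i][j] for j in top]) != y[i])
    return errors / n


# Hypothetical data: variable 0 is informative, variable 1 is not.
y = [+1 if i % 2 == 0 else -1 for i in range(20)]
X = [[2.0 * y[i] + 0.1 * (i % 3), float(i % 5 - 2)] for i in range(20)]
err = cv2_error(X, y, k=1, rank_fn=rank_by_mean_gap, fit_fn=fit_nearest_mean)
```

In the regular CV scheme, `rank_fn` would instead be called once on the full sample before the loop over the folds, so that the held-out observations already influence which variables are used.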

Finally, we compare the new selection criteria with the KRICs proposed in Kobayashi and Komaki (2006), even though these criteria were originally meant only to be used to tune the regularisation parameter $C$ in the support vector machine. The expression for the KRIC for the logistic Bayesian model for SVMs can be found in (3), and in (4) for the KRIC for the Sollich model for SVMs. As stated before, the major drawback of these criteria is the computation time when the sample size becomes large. In that case, they become even slower to compute than the CV error rate and guaranteed risk minimisation criteria.

The experiment will be repeated two times, each time with a different kernel $K(x_1, x_2)$ for the support vector machine:

• Linear kernel: $K(x_1, x_2) = x_1' x_2$, which is the standard inner product, and which leads to a linear decision rule.

• Quadratic kernel: $K(x_1, x_2) = (\gamma x_1' x_2 + 1)^2$, with $\gamma = 1/p$, the inverse of the number of variables. This will lead to a quadratic decision rule.

The tuning parameter C in each support vector machine that we train is chosen to be C = 1, as we standardise the explicative variables a priori. This setting for C is the standard setting for the svm procedure in the R software package. Finally, we test the accuracy of the selected models by estimating their out-of-sample error rate, based on a test sample of 10000 observations. These observations are generated in the same way as the training sample.

Tables 1 and 2 report the estimated out-of-sample error rates. For the first table we used variable ranking based on the input variable’s influence on $\|w\|^2$. The second table shows results for variable ranking using the Fisher scores $S_j$. An overall observation is that the error-rate based selection criteria (CV and GRM) have the worst performance. The performances of the KRICs and the new SVMICs are comparable. More precisely, we observe that the KRICs are better as a variable selection method for small sample sizes ($n = 25$), while the SVMICs give better results for larger sample sizes. This is especially apparent when the quadratic kernel is used. The additional computational overhead of the KRICs still steers our preference towards our much simpler $\mathrm{SVMIC}_a$ and $\mathrm{SVMIC}_b$, though. For a small number of observations compared to the number of variables, we also note that $\mathrm{SVMIC}_a$ slightly outperforms $\mathrm{SVMIC}_b$ in terms of out-of-sample error rate, and that the converse is true with many observations and fewer variables. As to which variable ranking criterion is better, for linear kernels there is a strong preference for ranking with the Fisher score over ranking with variable influence on $\|w\|^2$. For the quadratic kernel, however, it is slightly better to rank the variables based on variable influence on $\|w\|^2$.

Upon closer inspection, we observe that the differences in out-of-sample error rates become smaller as the number of variables grows. However, $\mathrm{SVMIC}_a$ and $\mathrm{SVMIC}_b$ are still somewhat ahead, and have the advantage that they are much easier (and less time-intensive) to compute than the other criteria, especially the KRICs, whose computational time is of order $O(n^3)$. Also, as $n$ grows larger, we see that the out-of-sample error rates of the models obtained by our two suggested criteria are converging towards the theoretically obtained minimal out-of-sample error rate of 15.9%.

                         linear                  quadratic
                  p = 25  p = 50 p = 100    p = 25  p = 50 p = 100
n = 25   SVMICa    32.2%   34.6%   37.4%     31.3%   35.8%   43.3%
         SVMICb    32.6%   35.3%   37.3%     34.2%   39.3%   48.3%
         CV        33.5%   35.3%   37.8%     33.8%   39.6%   42.8%
         GRM       36.2%   37.4%   38.6%     37.7%   43.6%   49.2%
         KRIC      31.3%   34.4%   37.0%     29.5%   33.3%   37.1%
         KRICS     31.5%   34.4%   37.1%     30.2%   33.9%   37.7%
n = 50   SVMICa    24.4%   28.5%   30.9%     22.7%   24.4%   26.4%
         SVMICb    24.6%   27.7%   29.1%     25.0%   26.8%   30.8%
         CV        27.1%   29.5%   31.0%     26.7%   29.8%   34.1%
         GRM       31.1%   31.4%   32.1%     31.8%   33.9%   40.3%
         KRIC      25.7%   29.8%   31.0%     23.6%   27.6%   31.1%
         KRICS     26.0%   30.2%   31.3%     24.8%   29.1%   32.5%
n = 100  SVMICa    19.9%   22.9%       -     19.4%   19.7%       -
         SVMICb    19.6%   20.2%       -     19.9%   19.8%       -
         CV        24.6%   25.8%       -     23.8%   24.2%       -
         GRM       30.1%   29.9%       -     30.6%   30.5%       -
         KRIC      21.8%   26.9%       -     20.0%   22.6%       -
         KRICS     22.3%   27.3%       -     21.7%   24.7%       -
n = 200  SVMICa    17.8%       -       -     20.1%       -       -
         SVMICb    16.9%       -       -     17.1%       -       -
         CV        22.7%       -       -     22.4%       -       -
         GRM       28.9%       -       -     29.4%       -       -
         KRIC      18.7%       -       -     18.3%       -       -
         KRICS     19.2%       -       -     20.3%       -       -

Table 1: Out-of-sample error rates for the different simulation settings where the variable ranking has been done by variable influence on $\|w\|^2$. The left panel shows these rates for SVMs with a linear kernel, and the right panel shows these rates for a quadratic kernel. The top row shows the number of variables p, while the leftmost column shows the number of observations n. For each setting, six numbers are given, corresponding to the different selection criteria: SVMICa, SVMICb, cross-validated error rate (CV), general risk minimisation (GRM), and the KRICs based on the logistic Bayesian model (KRIC) and on Sollich's Bayesian model for SVMs (KRICS). A dash marks a combination of n and p for which no rate is reported.

                         linear                  quadratic
                  p = 25  p = 50 p = 100    p = 25  p = 50 p = 100
n = 25   SVMICa    29.4%   31.6%   33.9%     30.7%   35.3%   43.3%
         SVMICb    31.6%   32.6%   35.0%     33.8%   38.5%   48.4%
         CV        31.8%   33.5%   34.4%     32.9%   38.5%   42.7%
         GRM       34.5%   35.4%   35.7%     36.6%   42.6%   48.7%
         KRIC      29.0%   33.2%   34.9%     28.4%   33.0%   37.1%
         KRICS     29.9%   33.2%   34.9%     30.1%   34.1%   38.2%
n = 50   SVMICa    21.6%   23.3%   24.6%     21.3%   23.0%   25.6%
         SVMICb    23.2%   24.8%   25.0%     24.3%   26.8%   30.2%
         CV        25.5%   26.3%   28.0%     25.9%   28.1%   33.8%
         GRM       29.6%   30.5%   30.9%     31.7%   33.5%   40.1%
         KRIC      24.9%   28.7%   30.1%     22.5%   27.1%   30.9%
         KRICS     25.9%   29.7%   30.8%     25.1%   29.3%   32.8%
n = 100  SVMICa    18.5%   19.2%       -     18.5%   18.5%       -
         SVMICb    18.9%   19.0%       -     19.1%   19.5%       -
         CV        23.8%   25.4%       -     19.2%   22.0%       -
         GRM       30.1%   29.6%       -     30.2%   30.7%       -
         KRIC      20.6%   26.8%       -     20.0%   22.6%       -
         KRICS     21.7%   27.8%       -     22.0%   25.1%       -
n = 200  SVMICa    17.0%       -       -     20.3%       -       -
         SVMICb    16.8%       -       -     16.8%       -       -
         CV        21.5%       -       -     21.4%       -       -
         GRM       29.3%       -       -     29.6%       -       -
         KRIC      18.0%       -       -     18.1%       -       -
         KRICS     18.9%       -       -     20.6%       -       -

Table 2: As in Table 1, but this time the variable ranking has been done using the Fisher scores $S_j$.

Since we would expect selection based on the cross-validated error rate to perform best, it is interesting to see which models are actually chosen by the different criteria, and hence why the CV error rate performs so poorly as a model selection criterion. This information is reported in Table 3 for a few specific settings. For each of these settings, it shows how many times the correct subset of input variables, containing only the first four input variables, was chosen (C, correct). The table also shows how many times a too-sparse group of variables was selected (U, underfitting), and how many times a too-rich group of variables was chosen (O, overfitting). The good performance of SVMICa and SVMICb might be due to the fact that these criteria tend to select a set of variables which includes all significant ones as the number of observations grows. The simulation results indicate that SVMICa behaves like AIC, with its tendency to overfit. SVMICb seems to share the property of BIC that it selects the correct model more often, at least if this true model is one of the candidates to select from. The cross-validated error rate, and the general risk minimisation in particular, tend to ignore variables which are nevertheless important. As a consequence, the models that these criteria select are of poor predictive quality. The two KRICs of Kobayashi and Komaki (2006) share the overselection property exhibited by SVMICa, but the KRICs select excessive variables even more frequently than SVMICa. This can explain why these criteria perform somewhat worse when the number of observations is large, and why they outperform the proposed SVMICs when the number of observations is small, since the latter tend to underfit the model in the case of few observations.
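The four outcome categories used in Table 3 can be expressed as set comparisons between the selected variables and the significant ones. The sketch below is one way to tally them over repeated runs. The function name `classify_selection`, the 0-based variable indices, and the example runs are hypothetical, chosen only for illustration.

```python
from collections import Counter


def classify_selection(selected, significant):
    """Label a selected variable subset as C, U, O or R.

    C: exactly the significant variables.
    U: only some of the significant variables, and no others.
    O: all significant variables plus at least one other.
    R: anything else.
    """
    selected, significant = set(selected), set(significant)
    if selected == significant:
        return "C"
    if selected < significant:   # proper subset
        return "U"
    if selected > significant:   # proper superset
        return "O"
    return "R"


# The first four input variables are the significant ones (0-based here).
significant = {0, 1, 2, 3}

# Hypothetical outcomes of four simulation runs.
runs = [[0, 1, 2, 3], [0, 1], [0, 1, 2, 3, 7], [0, 5]]
tally = Counter(classify_selection(r, significant) for r in runs)
```

Each row of Table 3 corresponds to such a tally over the simulation runs for one criterion and one setting.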

This concludes the results for the case of two populations coming from an identical distribution, differing only in mean. Another case that we examined is that in which the variances of the two populations differ from each other, for example when the observations in class +1 are sampled from a $N(\mu_1, \Sigma_1)$ distribution and the observations in class −1 are sampled from a $N(\mu_2, \Sigma_2)$ distribution. We performed a simulation study, similar to the previous one, in which the samples were drawn from $N(\mu, I_p)$ for class +1 and from $N(-2\mu, 4I_p)$ for class −1.
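A minimal sketch of this sampling scheme is given below, using only the Python standard library. The value of µ (0.5 on the first four coordinates and 0 elsewhere) and the alternating assignment of class labels are assumptions made for illustration; this section does not state them. Note that a covariance of 4Ip corresponds to a standard deviation of 2 per coordinate.

```python
import random


def sample_unequal_variance(n, p, mu, seed=0):
    """Draw n observations: class +1 from N(mu, I_p) and class -1
    from N(-2 mu, 4 I_p). Labels alternate (an assumption)."""
    assert len(mu) == p
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        if i % 2 == 0:
            label, centre, sd = 1, mu, 1.0
        else:
            # Variance 4 per coordinate means standard deviation 2.
            label, centre, sd = -1, [-2.0 * m for m in mu], 2.0
        X.append([rng.gauss(c, sd) for c in centre])
        y.append(label)
    return X, y


# Hypothetical mean vector: only the first four variables carry signal.
p = 25
mu = [0.5] * 4 + [0.0] * (p - 4)
X, y = sample_unequal_variance(50, p, mu)
```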

The results of this simulation are summarised in Tables 4 and 5. In this setting, we observe results similar to those in the setting where both populations had equal variance. Selection based on the CV error rate and on GRM still performs rather badly, especially for larger sample sizes. As before, the performances of the KRICs and SVMICs are similar. More precisely, the SVMICs outperform the KRICs when the sample size is large (n ≥ 50) and the linear kernel is used, while the KRICs work slightly better for small sample sizes (n = 25). For the quadratic kernel, we notice a good performance of the KRICs, which is only matched by SVMICa for larger sample sizes.

Linear; n = 25; p = 25             Quadratic; n = 25; p = 25
            C    U    O    R                   C    U    O    R
SVMICa      1   22    1   76       SVMICa      3   36    0   61
SVMICb      0   42    0   58       SVMICb      0   64    0   36
CV          0   38    4   58       CV          1   40    5   54
GRM         0   77    0   23       GRM         0   75    0   25
KRIC        1    1    7   91       KRIC        0    1   25   74
KRICS       0    0    9   91       KRICS       0    0   49   51

Linear; n = 200; p = 25            Quadratic; n = 200; p = 25
            C    U    O    R                   C    U    O    R
SVMICa     22    0   76    2       SVMICa      2    0   98    0
SVMICb     77    9   10    4       SVMICb     67   14    6   13
CV          7   48   43    2       CV          4   43   49    4
GRM         1   98    1    0       GRM         1   99    0    0
KRIC        6    0   93    1       KRIC        8    0   84    8
KRICS       1    0   99    0       KRICS       0    0  100    0

Linear; n = 25; p = 100            Quadratic; n = 25; p = 100
            C    U    O    R                   C    U    O    R
SVMICa      0    8    0   92       SVMICa      0   35    0   65
SVMICb      0   20    0   80       SVMICb      0   63    0   37
CV          0   23    6   71       CV          0   33   10   57
GRM         0   56    0   44       GRM         0   64    0   36
KRIC        0    1    0   99       KRIC        0    0   41   59
KRICS       0    0    1   99       KRICS       0    0   56   44

Table 3: Frequencies with which models have been selected by SVMICa, SVMICb, cross-validated error rate (CV), general risk minimisation (GRM), KRIC, and KRICS for each simulation setting. The tables give the results when the variables are ranked by variable influence on $\|w\|^2$. The column labelled C (correct) gives the number of times the four significant variables were selected, without any others. The column labelled U (underfit) gives the number of times that only some of the significant variables, and no others, were selected, while the column O (overfit) gives the number of times all significant variables and at least one non-significant variable were selected. The last column, labelled R, reports the number of times none of the three former situations occurred.

Table 6 shows the models selected by the different criteria. From this table, we make the same observations about the criteria's behaviour as in the setting where the populations have equal variances, but this time only when the linear kernel is used for the SVM. For the quadratic kernel we see that, in the setting with different population variances, the SVMICs have more difficulty selecting all the relevant variables than the KRICs, which explains why the KRICs perform better in that case.

Finally, we have conducted a simulation experiment where the input variables were strongly correlated. First, the observations were generated as in the first simulation experiment. Then, we applied the transformation

$$x_{ij} = \rho\, x_{i k_j} + \varepsilon_{ij} \quad \text{with} \quad \varepsilon_{ij} \sim N(0, \rho^2) \ \text{i.i.d.},$$

where $i = 1, \ldots, n$, $k_j$ is chosen arbitrarily between 1 and 4, and $4 < j \le p/2$, such that about half of the unimportant input variables are correlated with the four important ones. The parameter $|\rho| < 1$ controls the degree of correlation. We have chosen $\rho = 0.8$ and found similar results (not reported) as for the case where the variances of the two class populations differ.
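The transformation above can be sketched as follows. The sketch assumes that "chosen arbitrarily" means a uniform random choice of kj for each column j, and it takes the noise variance ρ² as printed, so the noise standard deviation is |ρ|. The function name `correlate_noise_variables` is illustrative only.

```python
import random


def correlate_noise_variables(X, rho=0.8, seed=0):
    """Replace columns j with 4 < j <= p/2 (1-based) by
    rho * x_{i,k_j} + eps_ij, where eps_ij ~ N(0, rho^2)."""
    rng = random.Random(seed)
    p = len(X[0])
    out = [row[:] for row in X]
    for j in range(5, p // 2 + 1):   # 1-based column index j
        k = rng.randint(1, 4)        # k_j: one of the four important variables
        for row_in, row_out in zip(X, out):
            noise = rng.gauss(0.0, abs(rho))   # standard deviation |rho|
            row_out[j - 1] = rho * row_in[k - 1] + noise
    return out


# Small example: n = 6 observations, p = 10 variables of standard normal noise.
data_rng = random.Random(42)
X_raw = [[data_rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(6)]
X_corr = correlate_noise_variables(X_raw, rho=0.8)
```

With p = 10 only the fifth column is replaced; the four important variables and the columns beyond p/2 are left untouched.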

5 Tests on Real Data

We compare the performance of the new methods with that of the other discussed criteria on several real-world datasets. For comparative purposes, we will use some of the benchmark datasets used in Rakotomamonjy (2003) and in Rätsch et al. (2001).


                             linear                          quadratic
                  p = 25  p = 50 p = 100 p = 200    p = 25  p = 50 p = 100 p = 200
n = 25   SVMICa    28.9%   33.9%   35.6%   36.5%     29.2%   35.1%   42.1%   50.1%
         SVMICb    30.1%   34.2%   35.7%   36.4%     31.8%   39.6%   48.2%   50.1%
         CV        30.4%   35.1%   36.0%   36.4%     31.8%   38.1%   42.2%   44.7%
         GRM       32.7%   35.3%   36.9%   36.6%     35.4%   42.8%   49.4%   50.1%
         KRIC      29.0%   32.7%   34.8%   36.4%     25.7%   30.5%   35.0%   38.9%
         KRICS     28.8%   32.5%   34.8%   36.1%     25.8%   31.3%   36.2%   40.4%
n = 50   SVMICa    23.3%   27.1%   28.3%       -     20.5%   23.1%   26.5%       -
         SVMICb    23.9%   25.7%   27.4%       -     23.5%   26.1%   30.4%       -
         CV        26.1%   27.7%   28.7%       -     25.9%   28.3%   34.5%       -
         GRM       28.9%   29.1%   29.9%       -     30.6%   33.2%   40.5%       -
         KRIC      24.2%   27.7%   28.4%       -     19.0%   23.8%   28.2%       -
         KRICS     24.6%   27.6%   28.4%       -     19.5%   25.1%   30.1%       -
n = 100  SVMICa    19.0%   21.8%       -       -     14.6%   17.9%       -       -
         SVMICb    18.1%   19.3%       -       -     18.5%   18.4%       -       -
         CV        22.7%   23.5%       -       -     20.8%   22.0%       -       -
         GRM       27.6%   26.9%       -       -     27.8%   27.7%       -       -
         KRIC      20.5%   24.8%       -       -     14.2%   18.1%       -       -
         KRICS     21.0%   25.0%       -       -     14.5%   19.5%       -       -
n = 200  SVMICa    17.0%       -       -       -      9.9%       -       -       -
         SVMICb    15.9%       -       -       -     12.9%       -       -       -
         CV        21.4%       -       -       -     19.6%       -       -       -
         GRM       27.0%       -       -       -     29.3%       -       -       -
         KRIC      17.9%       -       -       -     10.1%       -       -       -
         KRICS     18.3%       -       -       -      9.7%       -       -       -

Table 4: Out-of-sample error rates for the different simulation settings with the unequal variance populations where the variable ranking has been done by variable influence on $\|w\|^2$. The left panel shows these rates for SVMs with a linear kernel, and the right panel shows these rates for a quadratic kernel. The top row shows the number of variables p, while the leftmost column shows the number of observations n. For each setting, six figures are given, corresponding to the different selection criteria: SVMICa, SVMICb, cross-validated error rate (CV), general risk minimisation (GRM), and the KRICs based on the logistic Bayesian model (KRIC) and on Sollich's Bayesian model for SVMs (KRICS). A dash marks a combination of n and p for which no rate is reported.


                             linear                          quadratic
                  p = 25  p = 50 p = 100 p = 200    p = 25  p = 50 p = 100 p = 200
n = 25   SVMICa    28.0%   30.2%   31.5%   33.2%     28.9%   35.8%   41.7%   50.1%
         SVMICb    29.2%   31.3%   32.3%   34.4%     31.8%   40.0%   48.1%   50.1%
         CV        28.4%   31.4%   32.6%   34.2%     28.7%   37.6%   42.3%   44.4%
         GRM       31.6%   33.1%   33.7%   35.6%     34.7%   42.4%   48.7%   50.1%
         KRIC      27.5%   30.7%   32.6%   33.5%     24.9%   30.8%   36.0%   40.0%
         KRICS     27.7%   30.5%   33.0%   33.7%     26.2%   32.3%   38.1%   41.8%
n = 50   SVMICa    20.5%   21.7%   23.1%       -     19.3%   22.2%   25.8%       -
         SVMICb    21.9%   22.7%   23.7%       -     22.2%   26.2%   30.4%       -
         CV        24.9%   25.2%   25.2%       -     24.5%   27.6%   33.7%       -
         GRM       28.6%   28.4%   28.7%       -     30.2%   32.7%   40.4%       -
         KRIC      23.6%   26.8%   26.7%       -     19.1%   23.9%   28.8%       -
         KRICS     24.3%   27.1%   27.5%       -     19.9%   26.1%   32.3%       -
n = 100  SVMICa    17.4%   17.8%       -       -     15.2%   17.0%       -       -
         SVMICb    17.4%   18.0%       -       -     16.4%   17.8%       -       -
         CV        21.5%   22.7%       -       -     19.9%   21.5%       -       -
         GRM       27.6%   27.0%       -       -     27.1%   28.3%       -       -
         KRIC      20.0%   25.0%       -       -     14.5%   18.5%       -       -
         KRICS     20.9%   25.5%       -       -     14.9%   20.3%       -       -
n = 200  SVMICa    16.1%       -       -       -      9.8%       -       -       -
         SVMICb    15.6%       -       -       -     13.2%       -       -       -
         CV        20.7%       -       -       -     17.6%       -       -       -
         GRM       27.0%       -       -       -     26.8%       -       -       -
         KRIC      17.0%       -       -       -     10.3%       -       -       -
         KRICS     17.8%       -       -       -      9.8%       -       -       -

Table 5: As in Table 4, but this time the variable ranking has been done using the Fisher scores $S_j$.


Linear; n = 25; p = 25             Quadratic; n = 25; p = 25
            C    U    O    R                   C    U    O    R
SVMICa      0   22    1   77       SVMICa      1   36    0   63
SVMICb      0   47    0   53       SVMICb      1   57    0   42
CV          1   40    1   58       CV          1   39    8   52
GRM         0   76    0   24       GRM         0   70    0   30
KRIC        0    0    6   94       KRIC        0    0   25   75
KRICS       0    0    8   92       KRICS       0    0   50   50

Linear; n = 200; p = 25            Quadratic; n = 200; p = 25
            C    U    O    R                   C    U    O    R
SVMICa     11    0   85    4       SVMICa      0   20    0   80
SVMICb     69   10   16    5       SVMICb      0   45    0   55
CV          6   56   37    1       CV          0   33    4   63
GRM         0  100    0    0       GRM         0   56    0   44
KRIC        5    0   93    2       KRIC        0    0   40   60
KRICS       0    0   99    1       KRICS       0    0   53   47

Linear; n = 25; p = 200            Quadratic; n = 25; p = 200
            C    U    O    R                   C    U    O    R
SVMICa      0    1    0   99       SVMICa      0   52    0   48
SVMICb      0    8    0   92       SVMICb      0   54    0   46
CV          0   22    2   76       CV          0   22    5   73
GRM         0   46    0   54       GRM         0   54    0   46
KRIC        0    1    0   99       KRIC        0    0   46   54
KRICS       0    0    0  100       KRICS       0    0   56   44

Table 6: Frequencies with which models have been selected by SVMICa, SVMICb, cross-validated error rate (CV), general risk minimisation (GRM), KRIC, and KRICS for each simulation setting with the unequal variance populations. The tables give the results when the variables are ranked by variable influence on $\|w\|^2$. The column labelled C (correct) gives the number of times the four significant variables were selected, without any others. The column labelled U (underfit) gives the number of times that only some of the significant variables, and no others, were selected, while the column O (overfit) gives the number of times all significant variables and at least one non-significant variable were selected. The last column, labelled R, reports the number of times none of the three former situations occurred.


The datasets used are the Pima Indians Diabetes database (768 observations, 8 variables), the Statlog Cleveland Heart Disease database (303 observations, 14 variables), and Leo Breiman's ringnorm and twonorm datasets (both 7400 observations, 20 variables). These datasets are available from the UCI Machine Learning Repository (the first two) and the Delve Repository (the last two). We perform 100 random splits of the data into a training sample and a test sample, where the size of the training sample is chosen as $\sqrt{2n}$, with n the total number of observations in the dataset. We chose the size of the training set such that there is a sufficient number of observations in the test sample to estimate the out-of-sample error rate. On the other hand, we chose a reasonably small training sample to keep the computational time manageable, because training an SVM, and especially computing the KRIC afterwards, takes increasingly longer for larger sample sizes. For each of these partitions, we perform the variable selection scheme on the training sample exactly as in the simulation study. We first rank the variables to obtain p stacked subsets of input variables, and then use the information criteria to select the variables that best explain the training data. Then, we predict the class labels for the test sample, and use these predictions to estimate the out-of-sample error rate. As before, we use both variable ranking based on variable influence on $\|w\|^2$ and on the Fisher score, and we use both a linear kernel and a quadratic kernel.
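The repeated-split protocol can be sketched as follows. The sketch reads the training size as the square root of 2n and rounds it to the nearest integer; both the reading and the rounding are assumptions. The `evaluate` callback is a hypothetical stand-in for the whole pipeline (ranking, criterion-based selection, SVM fit, and test-set error), which is not implemented here.

```python
import math
import random


def random_splits(n_total, n_splits=100, seed=0):
    """Yield (train_indices, test_indices) pairs with a training
    sample of size round(sqrt(2 * n_total))."""
    rng = random.Random(seed)
    n_train = round(math.sqrt(2 * n_total))
    indices = list(range(n_total))
    for _ in range(n_splits):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield shuffled[:n_train], shuffled[n_train:]


def mean_error_rate(n_total, evaluate, n_splits=100, seed=0):
    """Average the test error returned by evaluate(train, test)
    over all random splits."""
    errors = [evaluate(train, test)
              for train, test in random_splits(n_total, n_splits, seed)]
    return sum(errors) / len(errors)


# Diabetes data: 768 observations give a training sample of 39.
splits = list(random_splits(768))
train_size = len(splits[0][0])
test_size = len(splits[0][1])
```

Under this reading, the training samples contain 39 of the 768 diabetes observations, which leaves 729 observations for estimating the out-of-sample error rate.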

The estimated out-of-sample error rates are presented in Table 7 for each dataset and estimation setting. We observe that the KRICs are the preferred choice of variable selection criterion in terms of out-of-sample error rate for the 'twonorm' and 'heart' datasets. For the 'ringnorm' and 'diabetes' datasets the difference in performance between the KRICs and our newly proposed SVMICs is less clear. These results are consistent across all settings. We also observe that the CV error rate and especially the GRM perform poorly, which is in line with the results obtained in the simulation.

Linear kernel, ranking by $\Delta\|w\|^2_{(j)}$
          diabetes     heart  ringnorm   twonorm
SVMICa       28.6%     27.0%     31.1%      9.9%
SVMICb       29.0%     27.6%     34.9%     13.5%
CV           28.6%     27.6%     33.9%     20.5%
GRM          29.6%     29.3%     39.2%     31.4%
KRIC         28.5%     25.4%     30.1%      8.0%
KRICS        28.6%     25.3%     29.9%      7.5%

Linear kernel, ranking by $S_j$
          diabetes     heart  ringnorm   twonorm
SVMICa       28.0%     27.6%     30.8%     10.1%
SVMICb       28.6%     28.2%     35.2%     15.0%
CV           28.8%     26.8%     32.8%     21.0%
GRM          29.1%     28.8%     39.3%     30.8%
KRIC         27.5%     24.5%     29.6%      6.8%
KRICS        28.3%     25.2%     29.2%      6.6%

Quadratic kernel, ranking by $\Delta\|w\|^2_{(j)}$
          diabetes     heart  ringnorm   twonorm

Quadratic kernel, ranking by $S_j$
          diabetes     heart  ringnorm   twonorm

Table 7: Out-of-sample error rates obtained on the different data sets. The variable ranking scheme and the kernel used are given in the title line of each panel. The top row shows the dataset used, while the leftmost column shows the model selection criteria: SVMICa, SVMICb, cross-validated error rate (CV), general risk minimisation (GRM), and the KRICs based on the logistic Bayesian model (KRIC) and on Sollich's Bayesian model for SVMs (KRICS).

From these results, and the results obtained in Section 4, we suggest using either SVMICa or SVMICb if a preliminary analysis of the data or a priori knowledge indicates that the true decision function is almost linear. When it differs strongly from a linear function, the researcher has a choice between the ease of computation of the support vector machine information criteria and the somewhat improved predictive performance, at a higher computational cost, of the kernel regularisation information criterion.

6 Conclusions

In this paper we considered the issue of variable selection in support vector machines. We proposed two new information criteria, SVMICa and SVMICb, which allow us to evaluate the suitability of the selected subset of variables for predictive purposes without much computational overhead. We motivated these criteria by linking SVMICa to the KRIC of Kobayashi and Komaki (2006), and by justifying SVMICb with the need for a consistent selection criterion. We demonstrated the effectiveness of these criteria in a simulation study, where we compared their predictive performance to the aforementioned KRIC, and to selection based on cross-validated error rate and general risk minimisation. Especially for decision functions which are close to an affine function, we found that SVMICa and SVMICb performed the best of all tested criteria, and were also the easiest to compute. For more complicated decision functions, we found that SVMICa still performs well for selecting models with good out-of-sample properties. We repeated the experiment on several real data examples, and the results confirmed the good properties of these newly proposed criteria.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle, in Second International Symposium on Information Theory, (eds. B. Petrov and F. Csáki), Akadémiai Kiadó, Budapest, 267–281.


Bai, J. and Ng, S. (2002). Determining the number of factors in approximate factor models. Econometrica, 70, 191–221.

Bi, J., Bennett, K. P., Embrechts, M., Breneman, C. M. and Song, M. (2003). Dimensionality Reduction via Sparse support vector machines. Journal of Machine Learning Research, 3, 1229–1243.

Chen, S.-W., Li, Z.-R. and Li, X.-Y. (2005). Prediction of antifungal activity by support vector machine approach. Journal of molecular structure: THEOCHEM, 731, 73–81.

Cristianini, N. and Shawe-Taylor, J. (2000). An introduction to support vector machines and other kernel-based learning methods. Cambridge University Press, Cambridge.

Fortuna, J. and Capson, D. (2004). Improved support vector classification using PCA and ICA feature space modification. Pattern Recognition, 37, 1117–1129.

Guyon, I., Weston, J., Barnhill, S. and Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.

Hastie, T., Tibshirani, R. and Friedman, J. (2001). The elements of statistical learning: data mining, inference, and prediction. Springer, New York.

Haughton, D. (1988). On the choice of a model to fit data from an exponential family, The Annals of Statistics, 16, 342–355.

Haughton, D. (1989). Size of the error in the choice of a model to fit data from an exponential family, Sankhyā, Series A, 51, 45–58.

Kearns, M., Mansour, Y., Ng, A. Y. and Ron, D. (1997). An Experimental and Theoretical Comparison of Model Selection Methods. Machine Learning, 27, 7–50.

Kobayashi, K. and Komaki, F. (2006). Information Criteria for support vector machines. IEEE Transactions on Neural Networks, 17, 571–577.

Lee, Y., Kim, Y., Lee, S. and Koo, J.-Y. (2006). Structured multicategory support vector machines with analysis of variance decomposition. Biometrika, 93, 555–571.


Lin, Y. and Zhang, H. H. (to appear). Component Selection and Smoothing in Multivariate Nonparametric Regression. Annals of Statistics.

Neumann, J., Schnörr, C. and Steidl, G. (2005). Combined SVM-Based Feature Selection and Classification. Machine Learning, 61, 129–150.

Peng, H., Long, F. and Ding, C. (2005). Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27, 1226–1238.

Rakotomamonjy, A. (2003). Variable Selection Using SVM-based Criteria. Journal of Machine Learning Research, 3, 1367–1370.

Rätsch, G., Onoda, T. and Müller, K.-R. (2001). Soft Margins for AdaBoost. Machine Learning, 42, 287–320.

Rissanen, J. (1989). Stochastic complexity in statistical inquiry, World Scientific Series in Computer Science, volume 15. World Scientific, Singapore.

Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press, Cambridge.

Schwarz, G. (1978). Estimating the dimension of a model, The Annals of Statistics, 6, 461–464.

Shih, F. Y. and Cheng, S. (2005). Improved feature reduction in input and feature spaces. Pattern Recognition, 38, 651–659.

Sollich, P. (2002). Bayesian Methods for support vector machines: evidence and predictive class probabilities. Machine Learning, 46, 21–52.

Woodroofe, M. (1982). On model selection and the arc sine laws. The Annals of Statistics, 10, 1182–1194.

Vapnik, V. N. (1982). Estimation of dependences based on empirical data. Springer, New York.

Wang, L., Zhu, J. and Zou, H. (2006). The doubly regularized support vector machine. Statistica Sinica, 16, 589–615.

Zhang, H. H. (2006). Variable Selection for SVM via Smoothing Spline ANOVA.


Zhang, X., Lu, X., Shi, Q., Xu, X.-Q., Leung, H.-C. E., Harris, L. N., Iglehart, J. D., Miron, A., Liu, J. S. and Wong, W. H. (2006). Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics, published 10 April 2006.

Zhu, J., Rosset, S., Hastie, T. and Tibshirani, R. (2004). 1-norm support vector machines. Neural Information Processing Systems, 16.