The new information criteria - Predictive modelling: variable selection and classification effi

As stated in the previous section, evaluating the CV error rate or the KRIC of a particular support vector machine model requires a high number of additional computations. For this reason, we propose two new criteria which use information already available in the SVM, without additional complicated computations.

The criteria are based on how badly the SVM violates the margin constraints, which is measured by the sum $\sum_{i=1}^{n} \xi_{i,S}$, where $\xi_{i,S}$ is the margin slack of observation $i$ in the support vector machine trained on the variables with indices in $S$, and $S$ is a subset of $\{1, \ldots, p\}$. Alternatively, we can use the logarithm of this sum, analogous to Bai and Ng (2002) for selecting the number of factors in factor analysis.

However, in the SVM setting this has the drawback that the value is undefined if the sum equals zero, which can happen if the data are perfectly separable. Also, Bai and Ng (2002) advise using a log-transform for scalar invariance reasons.

Since we follow the advice to standardise the variables before training the SVM, for better ranking as explained in Section 4.2.3, we automatically have scalar invariance of the sum of the margin slacks. For these reasons, we choose not to take the log-transform.

Generally (but not always), $\sum_i \xi_{i,S}$ will decrease as more variables are added.

Therefore we add a penalty term related to the number of included variables to ensure a tradeoff between accuracy and simplicity of the chosen model. We suggest adding a linear penalty term, such that we get an information criterion of the form

$$\mathrm{IC}(S) = \sum_{i=1}^{n} \xi_{i,S} + C(n)\,|S|, \tag{4.7}$$

where S is the set of variables included in the model.

A first choice is to take $C(n)$ constant in (4.7). It is interesting to note that $\mathrm{IC}(S)$ is then, up to constant factors, an easily computable approximation of the KRIC of Kobayashi and Komaki (2006), thereby providing a theoretical justification for its use. To better understand this, note first that $\log(1 + \exp(-\eta a_{i,S} y_i))$ is a continuous approximation of the hinge loss function $\eta[1 - y_i a_{i,S}]_+ = \eta \xi_{i,S}$ for all $1 \le i \le n$. Hence, the first term of the KRIC can be approximated, up to a constant factor, by $\sum_i \xi_{i,S}$. For the approximation of the second term in (4.5), rewrite

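$$\begin{aligned} W &= \bigl(Q_S \operatorname{diag}(t_S) + \lambda I_n\bigr)^{-1} Q_S \bigl(\operatorname{diag}(m_S)^2 - n^{-1} m_S m_S^t\bigr) \\ &= V \operatorname{diag}(t_S)^{-1} \bigl(\operatorname{diag}(m_S)^2 - n^{-1} m_S m_S^t\bigr), \end{aligned}$$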

with $V = (A + \lambda I_n)^{-1} A$ a symmetric, positive semi-definite matrix and $A = Q_S \operatorname{diag}(t_S)$. Denoting by $A^-$ the generalised inverse of $A$, and using a series expansion around $\lambda = 0$, gives that the leading term of $V = A^-(I + \lambda A^-)^{-1} A$ is equal to $A^- A$. This expansion converges as long as the eigenvalues of $\lambda A^-$ are strictly less than one, which can be obtained by taking $\lambda$ small enough. We now use a singular value decomposition of both $A$ and $A^-$, together with the fact that the singular values of $A^-$ are the reciprocals of the non-zero singular values of $A$, to obtain that the product $A^- A$ is an $n \times n$ diagonal matrix with $|S|$ ones on the diagonal and the remaining entries zero. Thus, the leading term of $\operatorname{trace}(W)$ equals the sum of $|S|$ diagonal entries of the matrix $\operatorname{diag}(t_S)^{-1}\bigl(\operatorname{diag}(m_S)^2 - n^{-1} m_S m_S^t\bigr)$.

Chapter 4. A new information criterion for SVM 69

The $i$-th diagonal element of this matrix is equal to

$$\frac{n-1}{n}\, t_{S,i}^{-1}\, m_{S,i}^{2} = \frac{n-1}{n} \exp(-\eta a_{i,S} y_i).$$

To further facilitate computations we replace this by 1, motivated by the fact that $\eta a_{i,S} y_i$ is often small. Although this approximation might be crude for a single term, we found empirically that it works well for the summation over the entire training set. Hence, we arrive at the approximation $\operatorname{trace}(W) \approx |S|$, which is the linear penalty term in (4.7).

Taking the constant value $C(n) = 2$ leads to our first new support vector machine information criterion (SVMIC):

$$\mathrm{SVMICa}(S) = \sum_{i=1}^{n} \xi_{i,S} + 2|S|. \tag{4.8}$$

The newly proposed criterion SVMICa for support vector machines shares the form of the penalty with the well-known Akaike (1973) information criterion.

This AIC is defined as minus twice the value of the maximised log likelihood of the model, plus two times the number of parameters to be estimated (that is, 2|S|). Because the penalty 2|S| is not dependent on the sample size n, we expect that both criteria share some properties, such as having the tendency to not select the most parsimonious model. For the AIC, Woodroofe (1982) has shown that in the limit for n → ∞, the expected number of superfluous parameters is less than one.

To support the definition of SVMICa, we ran a simulation experiment and compared the values of KRIC and SVMICa for 100 models. The sample size is n = 50, with 10 variables, of which only the first 4 are informative about the class (their entries in the mean vector µ are different from zero). A detailed description of the simulation setting can be found in Section 4.4.

We used a linear kernel. Figure 4.1 reports these numerical results and shows a high correlation (0.975) between the values of the two criteria. Other simulation settings gave comparable correlation values.

Figure 4.1: Values of KRIC and SVMICa in a simulation experiment, showing high correlation (0.975). Axes: SVMICa and KRIC.

Our second proposed criterion follows the spirit of Schwarz’s (1978) Bayesian information criterion (BIC). This criterion is defined similarly to the AIC, but instead of the penalty 2|S|, it uses log(n)|S|. The BIC has been shown to be consistent (Haughton 1988, 1989). This means that if the true model is contained in the search list, the criterion will (in the limit for n → ∞) select this correct model. For a related construction for factor models, see Bai and Ng (2002). This motivates us to take C(n) = log(n), and we define our second criterion

$$\mathrm{SVMICb}(S) = \sum_{i=1}^{n} \xi_{i,S} + \log(n)\,|S|. \tag{4.9}$$
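As a minimal sketch of (4.8) and (4.9), assuming the margin slacks of a trained SVM are already available as a list of non-negative numbers, both criteria reduce to a sum plus a penalty. The function name, argument names and example values below are illustrative and not taken from the original text.

```python
import math

def svmic(slacks, n_selected, penalty="a"):
    """Sum of margin slacks plus C(n) * |S|.

    penalty="a" uses C(n) = 2 as in (4.8); penalty="b" uses C(n) = log(n)
    as in (4.9). `slacks` holds one value per training observation.
    """
    n = len(slacks)
    if penalty == "a":
        c_n = 2.0
    elif penalty == "b":
        c_n = math.log(n)
    else:
        raise ValueError("penalty must be 'a' or 'b'")
    return sum(slacks) + c_n * n_selected

# Hypothetical slacks for n = 8 observations and a model with |S| = 3 variables.
example_slacks = [0.0, 0.0, 0.4, 1.2, 0.0, 0.0, 0.7, 0.0]
svmic_a = svmic(example_slacks, 3, "a")   # sum of slacks + 2 * 3
svmic_b = svmic(example_slacks, 3, "b")   # sum of slacks + log(8) * 3
```

Since log(n) exceeds 2 as soon as n ≥ 8, SVMICb penalises each additional variable more heavily than SVMICa for all sample sizes considered in the simulations.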

It is immediate that the computational cost of both SVMICs is much lower than that of the cross-validated error rate (10 more SVMs to train for 10-fold cross-validation) and of the kernel regularisation information criterion KRIC (which needs computations of the order $O(n^3)$ due to the matrix inversion). The best case is when the $\xi_{i,S}$ are directly available. Computing the SVMICs is only an $O(n)$ computation in that case, and usually even less when employing the property that

$$\xi_{i,S} \neq 0 \iff \alpha_{i,S} = 1.$$

When only $\alpha_S$ and $Q_S$ are available, $\xi_{i,S}$ is computed using the relation

$$\xi_{i,S} = \Bigl[\, 1 - y_i \sum_{\substack{j=1 \\ \alpha_{j,S} > 0}}^{n} \alpha_{j,S}\, [Q_S]_{ij} \Bigr].$$

This means that in the worst case, the computation time of the SVMICs is $O(n^2)$, which is still faster than using either the CV error rate or the KRIC.
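The relation above can be sketched as follows. This is an illustration under assumptions not stated in this section: `q` is taken to be an n × n nested list holding the entries of Q_S, the sum runs only over the support vectors, and negative values are clipped at zero so that the result is a valid slack, in line with the hinge form used earlier. All names are hypothetical.

```python
def margin_slacks(alpha, q, y):
    """Return the slack of every observation from alpha_S, Q_S and the labels y."""
    n = len(alpha)
    # Only observations with alpha_j > 0 contribute to the inner sum.
    support = [j for j in range(n) if alpha[j] > 0]
    slacks = []
    for i in range(n):
        fitted = sum(alpha[j] * q[i][j] for j in support)
        # Clipping at zero is an assumption (positive part of the hinge).
        slacks.append(max(0.0, 1.0 - y[i] * fitted))
    return slacks

# Toy input: identity matrix for Q_S, two support vectors out of three points.
toy_q = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
toy_slacks = margin_slacks([1.0, 0.0, 0.5], toy_q, [1, -1, 1])
```

The double loop touches at most n entries per observation, which is where the worst-case $O(n^2)$ cost comes from; with few support vectors the inner sum is correspondingly shorter.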

4.4 Simulation results

We perform M = 100 simulation runs with the following settings. We generate $n \in \{25, 50, 100, 200\}$ independent observations $x_i$, $1 \le i \le n$, of dimension $p \in \{25, 50, 100, 200\}$, with distribution $N(0, \sigma^2 I_p)$ where $\sigma^2 = 1$. For each observation we generate a class label $y_i \in \{-1, +1\}$, with $P(y_i = 1) = 1/2$. Finally, we let $\mu = (1/2, -1/2, -1/2, 1/2, 0, \ldots, 0)$ of dimension $p$, and set $x_i \leftarrow x_i + y_i \mu$ to separate the two classes to some extent. This implies that the optimal separating hyperplane is $x'\mu = 0$, such that $\hat{y} = +1$ if $x'\mu > 0$, resulting in a generalization error rate of $\Phi(-\|\mu\|_2/\sigma)$, with $\Phi$ the cumulative distribution function of a standard normal. In our example, with $\sigma = 1$ and $\|\mu\|_2 = 1$, we find an optimal generalization error rate of 0.159.
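The data-generating step and the optimal error rate can be reproduced with a short sketch. It uses only the Python standard library; the function names, the seed and the choice n = 50, p = 25 are illustrative assumptions, not part of the original study.

```python
import math
import random

def simulate(n, p, sigma=1.0, seed=0):
    """Draw n labelled observations of dimension p (p >= 4) as described above."""
    rng = random.Random(seed)
    mu = [0.5, -0.5, -0.5, 0.5] + [0.0] * (p - 4)
    xs, ys = [], []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else -1               # P(y = 1) = 1/2
        x = [rng.gauss(0.0, sigma) + y * m for m in mu]   # x_i <- x_i + y_i * mu
        xs.append(x)
        ys.append(y)
    return xs, ys, mu

def optimal_error(mu, sigma=1.0):
    """Phi(-||mu||_2 / sigma), with Phi the standard normal cdf."""
    norm = math.sqrt(sum(m * m for m in mu))
    return 0.5 * (1.0 + math.erf(-norm / (sigma * math.sqrt(2.0))))

xs, ys, mu = simulate(n=50, p=25)
optimal_rate = optimal_error(mu)   # approximately 0.159
```

Here $\|\mu\|_2 = \sqrt{4 \cdot (1/2)^2} = 1$, so the optimal rate is $\Phi(-1)$, in agreement with the value quoted in the text.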

During each simulation run, we standardize the variables to improve the numerical performance of the SVM algorithm. The variables are ranked using either the Fisher score or based on the variable influence on w, as described in Section 4.2.3. For each of the nested models obtained in the variable ranking step, we compute (i) SVMICa and (ii) SVMICb as in (4.8) and (4.9). We compare their performance to (iii) ten-fold CV, (iv) Vapnik’s GRM as in (4.4), (v) KRIC for the logistic Bayesian model for SVMs as in (4.5), and (vi) KRIC for the Sollich model for SVMs as in (4.6). An important remark is that for ten-fold CV, we employ the CV2 method, which includes the feature selection procedure in each cross-validation step, as suggested by Zhang et al. (2006). Computing the CV error rate in the usual way can lead to a (severely) biased estimate of the generalization error, and using CV2 reduces this bias.

The experiment is repeated with two different kernels: (i) a linear kernel $K(x_1, x_2) = x_1' x_2$, leading to a linear decision rule, and (ii) a quadratic kernel $K(x_1, x_2) = (\gamma x_1' x_2 + 1)^2$, with $\gamma = 1/p$, the inverse of the number of variables, leading to a quadratic decision rule. The tuning parameter C in each SVM that we train is chosen to be C = 1, as we standardize the explanatory variables a priori. This is also the standard setting for C for the svm procedure in the R software package.
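For concreteness, the two kernels can be written out as a small sketch for vectors given as Python lists. The function names are illustrative, and taking the default of `gamma` from the length of the input vector is an assumption that matches $\gamma = 1/p$ only when the full vector of p variables is passed in.

```python
def linear_kernel(x1, x2):
    """K(x1, x2) = x1' x2."""
    return sum(a * b for a, b in zip(x1, x2))

def quadratic_kernel(x1, x2, gamma=None):
    """K(x1, x2) = (gamma * x1' x2 + 1)^2, with gamma = 1/p by default."""
    if gamma is None:
        gamma = 1.0 / len(x1)
    return (gamma * linear_kernel(x1, x2) + 1.0) ** 2

# Small worked example with p = 2.
k_lin = linear_kernel([1.0, 2.0], [3.0, 4.0])      # 1*3 + 2*4 = 11
k_quad = quadratic_kernel([1.0, 2.0], [3.0, 4.0])  # (11/2 + 1)^2 = 42.25
```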

We experimented with other values of C in the range from 0.1 up to 10, and found only minor differences in the simulation outcomes. We test the accuracy of the classifiers computed from the selected input variables by estimating their generalization (out-of-sample) error rate from a test sample of 10000 new observations. These observations are generated in the same way as the training sample.

Table 4.1 reports the generalization error rates, obtained by averaging over the 100 simulation runs. An overall observation is that the error-rate based selection criteria (CV and GRM) have the worst performance. The performances of the KRICs and the new SVMICs are comparable. More precisely, we observe that the KRICs are better as a variable selection method for small sample sizes (n = 25), while the SVMICs give better results for larger sample sizes. This is especially apparent when the quadratic kernel is used. For a small number of observations compared to the number of variables, we also note that SVMICa slightly outperforms SVMICb in terms of generalization error rate, and that the opposite is true with many observations and fewer variables. The differences in generalization error rates become smaller as the number of variables grows. This is particularly true for CV, whose relative performance becomes better at large sample sizes. But SVMICa and SVMICb are still somewhat ahead, and have the advantage that they are much easier (and less time-intensive) to compute than the other criteria, including the KRICs, which have a computation time of order $O(n^3)$.

Note that, as n grows, the generalization error rates of the models obtained by our two suggested criteria converge towards the theoretically obtained minimal generalization error rate of 15.9%. Comparing the two variable ranking criteria, we find for linear kernels a strong preference for ranking with the Fisher score. For the quadratic kernel, it is slightly better to rank the variables based on variable influence on $\|w\|^2$.

Linear kernel

| n | p | SVMICa | SVMICb | CV | GRM | KRIC | KRICS |
|---|---|---|---|---|---|---|---|
| 25 | 25 | 32.2 / 29.4 | 32.6 / 31.6 | 33.5 / 31.8 | 36.2 / 34.5 | 31.3 / 29.0 | 31.5 / 29.9 |
| 25 | 50 | 34.6 / 31.6 | 35.3 / 32.6 | 35.3 / 33.5 | 37.4 / 35.4 | 34.4 / 33.2 | 34.4 / 33.2 |
| 25 | 100 | 37.4 / 33.9 | 37.3 / 35.0 | 37.8 / 34.4 | 38.6 / 35.7 | 37.0 / 34.9 | 37.1 / 34.9 |
| 50 | 25 | 24.4 / 21.6 | 24.6 / 23.2 | 27.1 / 25.5 | 31.1 / 29.6 | 25.7 / 24.9 | 26.0 / 25.9 |
| 50 | 50 | 28.5 / 23.3 | 27.7 / 24.8 | 29.5 / 26.3 | 31.4 / 30.5 | 29.8 / 28.7 | 30.2 / 29.7 |
| 50 | 100 | 30.9 / 24.6 | 29.1 / 25.0 | 31.0 / 28.0 | 32.1 / 30.9 | 31.0 / 30.1 | 31.3 / 30.8 |
| 100 | 25 | 19.9 / 18.5 | 19.6 / 18.9 | 24.6 / 23.8 | 30.1 / 30.1 | 21.8 / 20.6 | 22.3 / 21.7 |
| 100 | 50 | 22.9 / 19.2 | 20.2 / 19.0 | 25.8 / 25.4 | 29.9 / 29.6 | 26.9 / 26.8 | 27.3 / 27.8 |
| 200 | 25 | 17.8 / 17.0 | 16.9 / 16.8 | 22.7 / 21.5 | 28.9 / 29.3 | 18.7 / 18.0 | 19.2 / 18.9 |

Quadratic kernel

| n | p | SVMICa | SVMICb | CV | GRM | KRIC | KRICS |
|---|---|---|---|---|---|---|---|
| 25 | 25 | 31.3 / 30.7 | 34.2 / 33.8 | 33.8 / 32.9 | 37.7 / 36.6 | 29.5 / 28.4 | 30.2 / 30.1 |
| 25 | 50 | 35.8 / 35.3 | 39.3 / 38.5 | 39.6 / 38.5 | 43.6 / 42.6 | 33.3 / 33.0 | 33.9 / 34.1 |
| 25 | 100 | 43.3 / 43.3 | 48.3 / 48.4 | 42.8 / 42.7 | 49.2 / 48.7 | 37.1 / 37.1 | 37.7 / 38.2 |
| 50 | 25 | 22.7 / 21.3 | 25.0 / 24.3 | 26.7 / 25.9 | 31.8 / 31.7 | 23.6 / 22.5 | 24.8 / 25.1 |
| 50 | 50 | 24.4 / 23.0 | 26.8 / 26.8 | 29.8 / 28.1 | 33.9 / 33.5 | 27.6 / 27.1 | 29.1 / 29.3 |
| 50 | 100 | 26.4 / 25.6 | 30.8 / 30.2 | 34.1 / 33.8 | 40.3 / 40.1 | 31.1 / 30.9 | 32.5 / 32.8 |
| 100 | 25 | 19.4 / 18.5 | 19.9 / 19.1 | 23.8 / 19.2 | 30.6 / 30.2 | 20.0 / 20.0 | 21.7 / 22.0 |
| 100 | 50 | 19.7 / 18.5 | 19.8 / 19.5 | 24.2 / 22.0 | 30.5 / 30.7 | 22.6 / 22.6 | 24.7 / 25.1 |
| 200 | 25 | 20.1 / 20.3 | 17.1 / 16.8 | 22.4 / 21.4 | 29.4 / 29.6 | 18.3 / 18.1 | 20.3 / 20.6 |

Table 4.1: Simulated average generalization error rate (%) for the six methods using two different kernels. For each method, the number on the left resulted from ranking by variable influence on $\|w\|^2$, and the number on the right is from ranking by the Fisher scores $S_j$.

Figure 4.2 presents the values of the 100 simulated generalization errors as boxplots, giving insight into the variability of the variable selection methods. In most cases cross-validation turns out to be highly variable, while GRM has a small variability. This good property of GRM is, however, accompanied by a much higher average generalization error rate. Comparing the different information criteria shows that SVMICa is quite comparable to the KRICs, whereas SVMICb has a larger variability. In the setting with small sample size (n = 25) and a relatively large number of variables (100), all methods except GRM are comparable with respect to variability, but GRM again has the largest median error rate. Our main conclusion from this analysis is that SVMICa has a variability similar to that of the KRICs, but SVMICb has a larger variability. Recall that the average error rates, as reported in Table 4.1, were of similar magnitude for all four information criteria. Hence, when having to choose between the two newly proposed information criteria, we have a preference for SVMICa.

Figure 4.2: Generalization error rates for 100 simulation experiments, for n = 100, p = 25 with (a) linear kernel, ranking with $\|w\|^2$, (b) linear kernel, ranking with Fisher score, (c) quadratic kernel, ranking with $\|w\|^2$, and for (d) n = 25, 100 variables, linear kernel and ranking with $\|w\|^2$. Each panel shows boxplots for SVMICa, SVMICb, CV, GRM, KRIC and KRICS.

Given the variability of the generalization errors over the 100 simulation runs (see the boxplots in Figure 4.2), it is important to test whether the averages reported in Table 4.1 are also significantly different from each other. We performed standard t-tests, and most differences are indeed significant. For example, for the settings presented in Figure 4.2, we obtained that, at the 1% level, (a) all differences are significant, except between SVMICb and the two KRICs; (b) all differences are significant, except between SVMICa and the two KRICs; (c) all differences are significant, except between SVMICb and the two KRICs; (d) the differences with the GRM method are significant, the others are not.

Furthermore, we investigate which models are actually chosen by the different criteria. This information is reported in Table 4.2. For each setting, it shows how many times the correct subset of input variables, containing only the first four input variables, was chosen (C, correct). This table also shows how many times a too-sparse group of variables was selected (U, underfitting), and how many times a too-rich group of variables was chosen (O, overfitting). So an overfit means that all correct variables are selected, but in addition some superfluous ones, while an underfit selects a subset of the important variables, but no irrelevant variables are included. The good performance of SVMICa and SVMICb might be due to the fact that these criteria seem to have the tendency to select a set of variables which includes all significant ones as the number of observations grows.
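The categories used in this comparison can be expressed as simple set comparisons. The sketch below is illustrative: the function name is hypothetical, variables are identified by their indices, and the remaining category 'R' collects every selection that is neither correct, nor a pure underfit, nor a pure overfit.

```python
def classify_selection(selected, relevant):
    """Label a selected variable set as 'C', 'U', 'O' or 'R'."""
    selected, relevant = set(selected), set(relevant)
    if selected == relevant:
        return "C"   # exactly the relevant variables
    if selected < relevant:
        return "U"   # proper subset: some relevant variables missing, none irrelevant
    if selected > relevant:
        return "O"   # all relevant variables plus superfluous ones
    return "R"       # any other situation

relevant = {1, 2, 3, 4}   # the first four input variables
labels = [
    classify_selection({1, 2, 3, 4}, relevant),
    classify_selection({1, 2}, relevant),
    classify_selection({1, 2, 3, 4, 7}, relevant),
    classify_selection({1, 2, 9}, relevant),
]
```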

The simulation results indicate that SVMICa behaves like AIC with its tendency to overfit. The SVMICb seems to share the property of BIC that it selects the correct model more often, provided that this true model is one of the possibilities to select from. The cross-validated error rate, and the general risk minimisation in particular, seem to have the tendency to ignore variables which nevertheless are important. As a consequence, the models that these criteria select are of poor predictive quality. The two KRICs of Kobayashi and Komaki (2006) share the overselection property exhibited by SVMICa, but the KRICs select superfluous variables even more frequently than SVMICa. This can explain why these criteria perform somewhat worse when the number of observations is large, and why they outperform the proposed SVMICs when the number of observations is small, since the latter tend to underfit the model in the case of few observations.

| Setting | Criterion | Linear C | Linear U | Linear O | Linear R | Quadratic C | Quadratic U | Quadratic O | Quadratic R |
|---|---|---|---|---|---|---|---|---|---|
| n = 25; p = 25 | SVMICa | 1 | 22 | 1 | 76 | 3 | 36 | 0 | 61 |
| | SVMICb | 0 | 42 | 0 | 58 | 0 | 64 | 0 | 36 |
| | CV | 0 | 38 | 4 | 58 | 1 | 40 | 5 | 54 |
| | GRM | 0 | 77 | 0 | 23 | 0 | 75 | 0 | 25 |
| | KRIC | 1 | 1 | 7 | 91 | 0 | 1 | 25 | 74 |
| | KRICS | 0 | 0 | 9 | 91 | 0 | 0 | 49 | 51 |
| n = 200; p = 25 | SVMICa | 22 | 0 | 76 | 2 | 2 | 0 | 98 | 0 |
| | SVMICb | 77 | 9 | 10 | 4 | 67 | 14 | 6 | 13 |
| | CV | 7 | 48 | 43 | 2 | 4 | 43 | 49 | 4 |
| | GRM | 1 | 98 | 1 | 0 | 1 | 99 | 0 | 0 |
| | KRIC | 6 | 0 | 93 | 1 | 8 | 0 | 84 | 8 |
| | KRICS | 1 | 0 | 99 | 0 | 0 | 0 | 100 | 0 |
| n = 25; p = 100 | SVMICa | 0 | 8 | 0 | 92 | 0 | 35 | 0 | 65 |
| | SVMICb | 0 | 20 | 0 | 80 | 0 | 63 | 0 | 37 |
| | CV | 0 | 23 | 6 | 71 | 0 | 33 | 10 | 57 |
| | GRM | 0 | 56 | 0 | 44 | 0 | 64 | 0 | 36 |
| | KRIC | 0 | 1 | 0 | 99 | 0 | 0 | 41 | 59 |
| | KRICS | 0 | 0 | 1 | 99 | 0 | 0 | 56 | 44 |

Table 4.2: Simulated frequencies of selected models, with variable ranking done by influence on $\|w\|^2$. Here ‘C’ denotes correct selection, ‘U’ is underfitting, ‘O’ is overfitting, and ‘R’ stands for all other situations.

This concludes the results for the case of two populations coming from an identical distribution, differing only in mean. Another case that we examined is where the variances of the two populations differ from each other. We performed a simulation study, in the same way as the previous one, where the samples have been drawn from $N(\mu, I_p)$ for class +1, and from $N(-2\mu, 4I_p)$ for class −1.

Linear kernel

| n | p | SVMICa | SVMICb | CV | GRM | KRIC | KRICS |
|---|---|---|---|---|---|---|---|
| 25 | 25 | 28.9 / 28.0 | 30.1 / 29.2 | 30.4 / 28.4 | 32.7 / 31.6 | 29.0 / 27.5 | 28.8 / 27.7 |
| 25 | 50 | 33.3 / 30.2 | 34.2 / 31.3 | 35.1 / 31.4 | 35.3 / 33.1 | 32.7 / 30.7 | 32.5 / 30.5 |
| 25 | 100 | 35.6 / 31.5 | 35.7 / 32.3 | 36.0 / 32.6 | 36.9 / 33.7 | 34.8 / 32.6 | 34.8 / 33.0 |
| 25 | 200 | 36.5 / 33.2 | 36.4 / 34.4 | 36.4 / 34.2 | 36.6 / 35.6 | 36.4 / 33.5 | 36.1 / 33.7 |
| 50 | 25 | 23.3 / 20.5 | 23.9 / 21.9 | 26.1 / 24.9 | 28.9 / 28.6 | 24.2 / 23.6 | 24.6 / 24.3 |
| 50 | 50 | 27.1 / 21.7 | 25.7 / 22.7 | 27.7 / 25.2 | 29.1 / 28.4 | 27.7 / 26.8 | 27.6 / 27.1 |
| 50 | 100 | 28.3 / 23.1 | 27.4 / 23.7 | 28.7 / 25.2 | 29.9 / 28.7 | 28.4 / 26.7 | 28.4 / 27.5 |
| 100 | 25 | 19.0 / 17.4 | 18.1 / 17.4 | 22.7 / 21.5 | 27.6 / 27.6 | 20.5 / 20.0 | 21.0 / 20.9 |
| 100 | 50 | 21.8 / 17.8 | 19.3 / 18.0 | 23.5 / 22.7 | 26.9 / 27.0 | 24.8 / 25.0 | 25.0 / 25.5 |
| 200 | 25 | 17.0 / 16.1 | 15.9 / 15.6 | 21.4 / 20.7 | 27.0 / 27.0 | 17.9 / 17.0 | 18.3 / 17.8 |

Quadratic kernel

| n | p | SVMICa | SVMICb | CV | GRM | KRIC | KRICS |
|---|---|---|---|---|---|---|---|
| 25 | 25 | 29.2 / 28.9 | 31.8 / 31.8 | 31.8 / 28.7 | 35.4 / 34.7 | 25.7 / 24.9 | 25.8 / 26.2 |
| 25 | 50 | 35.1 / 35.8 | 39.6 / 40.0 | 38.1 / 37.6 | 42.8 / 42.4 | 30.5 / 30.8 | 31.3 / 32.3 |
| 25 | 100 | 42.1 / 41.7 | 48.2 / 48.1 | 42.2 / 42.3 | 49.4 / 48.7 | 35.0 / 36.0 | 36.2 / 38.1 |
| 25 | 200 | 50.1 / 50.1 | 50.1 / 50.1 | 44.7 / 44.4 | 50.1 / 50.1 | 38.9 / 40.0 | 40.4 / 41.8 |
| 50 | 25 | 20.5 / 19.3 | 23.5 / 22.2 | 25.9 / 24.5 | 30.6 / 30.2 | 19.0 / 19.1 | 19.5 / 19.9 |
| 50 | 50 | 23.1 / 22.2 | 26.1 / 26.2 | 28.3 / 27.6 | 33.2 / 32.7 | 23.8 / 23.9 | 25.1 / 26.1 |
| 50 | 100 | 26.5 / 25.8 | 30.4 / 30.4 | 34.5 / 33.7 | 40.5 / 40.4 | 28.2 / 28.8 | 30.1 / 32.3 |
| 100 | 25 | 14.6 / 15.2 | 18.5 / 16.4 | 20.8 / 19.9 | 27.8 / 27.1 | 14.2 / 14.5 | 14.5 / 14.9 |
| 100 | 50 | 17.9 / 17.0 | 18.4 / 17.8 | 22.0 / 21.5 | 27.7 / 28.3 | 18.1 / 18.5 | 19.5 / 20.3 |
| 200 | 25 | 9.9 / 9.8 | 12.9 / 13.2 | 19.6 / 17.6 | 29.3 / 26.8 | 10.1 / 10.3 | 9.7 / 9.8 |

Table 4.3: As Table 4.1, but now for two populations with different variances.

The results of this simulation are summarized in Tables 4.3 and 4.4. We observe results similar to those in the case where both populations had equal variance. Selection based on the CV error rate and on GRM still performs rather poorly. As before, the performances of the KRICs and SVMICs are similar. More precisely, the SVMICs have an improved performance with respect to the KRICs when the sample size is large (n ≥ 50) and the linear kernel is used, and the KRICs work slightly better for small sample sizes (n = 25). For the quadratic kernel, we notice a good performance of the KRICs, which is only matched by SVMICa for larger sample sizes. From Table 4.4 we can again make the same observations as before when the linear kernel is used. For the quadratic kernel the SVMICs have more difficulty selecting all the relevant variables than the KRICs, which explains why the latter criteria have an improved performance here.

| Setting | Criterion | Linear C | Linear U | Linear O | Linear R | Quadratic C | Quadratic U | Quadratic O | Quadratic R |
|---|---|---|---|---|---|---|---|---|---|
| n = 25; p = 25 | SVMICa | 0 | 22 | 1 | 77 | 1 | 36 | 0 | 63 |
| | SVMICb | 0 | 47 | 0 | 53 | 1 | 57 | 0 | 42 |
| | CV | 1 | 40 | 1 | 58 | 1 | 39 | 8 | 52 |
| | GRM | 0 | 76 | 0 | 24 | 0 | 70 | 0 | 30 |
| | KRIC | 0 | 0 | 6 | 94 | 0 | 0 | 25 | 75 |
| | KRICS | 0 | 0 | 8 | 92 | 0 | 0 | 50 | 50 |
| n = 200; p = 25 | SVMICa | 11 | 0 | 85 | 4 | 0 | 20 | 0 | 80 |
| | SVMICb | 69 | 10 | 16 | 5 | 0 | 45 | 0 | 55 |
| | CV | 6 | 56 | 37 | 1 | 0 | 33 | 4 | 63 |
| | GRM | 0 | 100 | 0 | 0 | 0 | 56 | 0 | 44 |
| | KRIC | 5 | 0 | 93 | 2 | 0 | 0 | 40 | 60 |
| | KRICS | 0 | 0 | 99 | 1 | 0 | 0 | 53 | 47 |
| n = 25; p = 200 | SVMICa | 0 | 1 | 0 | 99 | 0 | 52 | 0 | 48 |
| | SVMICb | 0 | 8 | 0 | 92 | 0 | 54 | 0 | 46 |
| | CV | 0 | 22 | 2 | 76 | 0 | 22 | 5 | 73 |
| | GRM | 0 | 46 | 0 | 54 | 0 | 54 | 0 | 46 |
| | KRIC | 0 | 1 | 0 | 99 | 0 | 0 | 46 | 54 |
| | KRICS | 0 | 0 | 0 | 100 | 0 | 0 | 56 | 44 |

Table 4.4: As Table 4.2, but now for two populations with different variances.

We also conducted a simulation experiment where the input variables were strongly correlated. First, the observations were generated as in the first simulation experiment. Then, we applied the transformation

$$x_{ij} = \rho\, x_{i k_j} + \varepsilon_{ij} \quad \text{with } \varepsilon_{ij} \sim N(0, \rho^2) \text{ i.i.d.},$$

where $i = 1, \ldots, n$, $k_j$ is chosen arbitrarily between 1 and 4, and $4 < j \le p/2$, such that about half of the unimportant input variables are correlated with the four important ones. The parameter $|\rho| < 1$ controls the degree of correlation. We have chosen $\rho = 0.8$ and found similar results (not reported) as for the case where the variances of the two class populations differ.
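The transformation can be sketched as follows for data stored as a list of rows. Two points are assumptions rather than details from the text: each $k_j$ is drawn uniformly at random from {1, ..., 4}, once per column, and the noise is generated with standard deviation $|\rho|$ so that its variance is $\rho^2$. The function name and the toy input are hypothetical.

```python
import random

def correlate(xs, rho=0.8, seed=0):
    """Replace columns j with 4 < j <= p/2 (1-based) by rho * x_{i,k_j} + eps_ij."""
    rng = random.Random(seed)
    p = len(xs[0])
    out = [row[:] for row in xs]              # leave the input untouched
    for j in range(5, p // 2 + 1):            # 1-based column index j
        k = rng.randint(1, 4)                 # k_j, assumed uniform on {1, ..., 4}
        for row_in, row_out in zip(xs, out):
            eps = rng.gauss(0.0, abs(rho))    # variance rho^2
            row_out[j - 1] = rho * row_in[k - 1] + eps
    return out

# Toy input with n = 3 rows and p = 10 columns: only column j = 5 is transformed.
toy = [[float(i + j) for j in range(10)] for i in range(3)]
toy_correlated = correlate(toy)
```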

In document Predictive modelling: variable selection and classification efficiencies.. (Page 83-95)