Investigation - Breaking the SVM - Application of Statistical Computing to Statistical Learning

3.3 Breaking the SVM

3.3.4 Investigation

We shall use two randomly generated multivariate datasets: one for a training

set, and the other for a test set. Both sets are obtained using an equicovariance

matrix for Σ, with ρ = 0.8, and a proportional mean difference for µ as shown in

Appendix C. The dimensions of the test set are 1000 × 20, whereas the training set

comes in three different sizes: 30 × 20, 100 × 20 and 1000 × 20.

We use different training set sizes in order to study the behaviour

of each classifier as the training sample size varies.

Thereafter, we train each classifier three times; each time a different training set

size is used. After a classifier is trained, it is tested on 100 different randomly

generated test sets [...] each class. This procedure helps to eliminate biased conclusions, because different

aspects of the data are accounted for. The error rate of a classifier for each training

sample size is the average error rate over the 100 different test sets. In Table 3.1, we

present the average error rates of each classifier alongside those of the SVM.

Sample   Sample size   IFDA     EFDA     RFDA     FDA      SVM
1        n = 30        0.2528   0.013    0.0483   0.0482   0.0289
2        n = 100       0.0671   0.0105   0.0218   0.0213   0.0230
3        n = 1000      0.0154   0.0091   0.0093   0.0091   0.0161

Table 3.1: Average error rates for each classifier, given different sizes of the training sets. A linear kernel was used for SVM, which in effect amounts to using the datasets in their untransformed form. This avoids introducing bias into the error rates of SVM relative to the error rates of the other classifiers.
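The simulation protocol behind Table 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code used to produce the table: the mean vector `mu` is a hypothetical stand-in for the one given in Appendix C, the two classes are assumed to be of equal size with equal priors, and plain FDA with a pooled covariance estimate stands in for the classifiers compared. All function names are illustrative.

```python
import numpy as np

def equicov(p, rho):
    """Equicovariance matrix: 1 on the diagonal, rho everywhere else."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def sample_two_classes(n, mu, sigma, rng):
    """Draw n rows, split evenly between N(0, sigma) and N(mu, sigma)."""
    n1 = n // 2
    n0 = n - n1
    x0 = rng.multivariate_normal(np.zeros(len(mu)), sigma, size=n0)
    x1 = rng.multivariate_normal(mu, sigma, size=n1)
    X = np.vstack([x0, x1])
    y = np.concatenate([np.zeros(n0, dtype=int), np.ones(n1, dtype=int)])
    return X, y

def fit_fda(X, y):
    """Fisher's linear discriminant with a pooled covariance estimate."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    r0, r1 = X[y == 0] - m0, X[y == 1] - m1
    pooled = (r0.T @ r0 + r1.T @ r1) / (len(y) - 2)
    w = np.linalg.solve(pooled, m1 - m0)
    c = w @ (m0 + m1) / 2.0  # midpoint threshold (equal priors assumed)
    return w, c

def average_error(n_train, n_test=1000, n_rep=100, p=20, rho=0.8, seed=0):
    """Train once on n_train rows, then average the error over n_rep test sets."""
    rng = np.random.default_rng(seed)
    sigma = equicov(p, rho)
    mu = 0.05 * np.arange(1, p + 1)  # ASSUMPTION: not the Appendix C values
    w, c = fit_fda(*sample_two_classes(n_train, mu, sigma, rng))
    errors = []
    for _ in range(n_rep):
        X, y = sample_two_classes(n_test, mu, sigma, rng)
        predicted = (X @ w > c).astype(int)
        errors.append(np.mean(predicted != y))
    return float(np.mean(errors))

if __name__ == "__main__":
    for n in (30, 100, 1000):
        print(n, average_error(n))
```

Because `mu` is assumed, the printed error rates will not reproduce the entries of Table 3.1. Only the structure of the experiment follows the text: one training set per size, 100 fresh test sets of size 1000 × 20, and an averaged error rate.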

Differences in error rates for each training sample size can be regarded as marginal,

except in the case of IFDA for the n = 30 sample size. Here, we observed a

marked difference in comparison with the error rates of the other classifiers. Overall,

IFDA is the worst-performing classifier of those compared,

and we attribute this to the absence of estimates of the covariances

among pairs of explanatory variables.
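The role of the missing covariance estimates can be illustrated with a small population-level calculation. The sketch below assumes that IFDA amounts to using only the diagonal of the covariance matrix, which is our reading of the remark above rather than a definition taken from the text. The mean difference `delta` is again hypothetical, so the numbers are not comparable with Table 3.1.

```python
import numpy as np
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def population_error(w, delta, sigma):
    """Error rate of the midpoint rule along direction w, for two
    equal-prior normal classes with common covariance sigma whose
    means differ by delta (requires w @ delta > 0)."""
    return normal_cdf(-0.5 * (w @ delta) / np.sqrt(w @ sigma @ w))

p, rho = 20, 0.8
sigma = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
delta = 0.05 * np.arange(1, p + 1)  # ASSUMPTION: hypothetical mean difference

w_full = np.linalg.solve(sigma, delta)  # direction using all covariances
w_diag = delta / np.diag(sigma)         # direction ignoring the covariances

err_full = population_error(w_full, delta, sigma)
err_diag = population_error(w_diag, delta, sigma)
print(err_full, err_diag)
```

With strongly correlated variables (ρ = 0.8), the direction that ignores the off-diagonal terms gives a larger error than the one that uses the full matrix. The size of the gap depends on the assumed `delta`.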

FDA and RFDA have almost identical error rates, and we attribute

this to the fact that we used only numerically stable datasets in the

investigation. As a result, the value of the regularization parameter λ required for

RFDA is as small as 0.0001, which in general is not enough to cause major

changes in the structure of the covariance matrix. In other words, the effect on

the covariance matrix is similar to that of λ = 0. This further suggests

that RFDA is superior to FDA only when datasets are numerically unstable.
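To see why λ = 0.0001 changes so little on a well-conditioned problem, consider the following sketch. It assumes that RFDA replaces the covariance matrix Σ by Σ + λI, a common ridge-type regularization. The text above does not state the exact form used, and `delta` is a hypothetical mean difference.

```python
import numpy as np

p, rho, lam = 20, 0.8, 1e-4
sigma = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
delta = 0.05 * np.arange(1, p + 1)  # ASSUMPTION: hypothetical mean difference

# Discriminant directions without and with the (assumed) ridge term.
w_fda = np.linalg.solve(sigma, delta)
w_rfda = np.linalg.solve(sigma + lam * np.eye(p), delta)

cond = np.linalg.cond(sigma)  # ratio of largest to smallest eigenvalue
rel_change = np.linalg.norm(w_rfda - w_fda) / np.linalg.norm(w_fda)
print(cond, rel_change)
```

The eigenvalues of this equicovariance matrix are 1 − ρ = 0.2 (19 times) and 1 + 19ρ = 16.2, so the matrix is far from singular. Adding 0.0001 to each eigenvalue changes the discriminant direction by well under one part in a thousand.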

where we remarked that if data are normal with different means, and a common

covariance matrix, then FDA is ideal. In the case of n = 30, we find a situation

where the parameters are poorly estimated, unlike in the cases involving 100 and

1000 sample sizes. For this reason, it was not out of place that SVM performed

better when n = 30, unlike in the other two sample sizes where FDA performed

relatively better.

The error rate of EFDA is the lowest (or joint lowest) for every size of the training

set. This shows that an informed modification of FDA, under some assumptions, can

give rise to a classifier that consistently performs better than SVM. The problem

with EFDA is that the same effect is not replicated on real-world datasets.

For instance, we applied the classifier to three datasets, namely the Appendicitis,

Australia and Coil2000 datasets described in Chapters 4 and 5, and obtained error rates

of 0.2188, 0.3140 and 0.3424 respectively. The error rates of SVM on the

same datasets are 0.1875, 0.1400 and 0.0597 respectively. In comparison, EFDA

performed rather poorly. In spite of this outcome on real-world datasets,

the performance of EFDA on the simulated datasets suggests that the classifier can

consistently perform better than SVM, on any real world dataset that meets the

Regression Discriminant Analysis (RDA)

4.1 Introduction

In Chapter 1, we identified regression and classification as valid tools for prediction,

and listed a number of characteristics that the two

prediction tools have in common. In particular, we mentioned that:

(a) There exists a matrix of input data and a vector of outputs, required for training

and testing of both regression and classification models.

(b) Since the dimensions of the input data are in most cases at least

n × 2, there can be concern about the numerical stability of methods in both cases.

obtain a regression or classification function that optimally performs may be a

challenging task.

(d) We added that the assessment of prediction tools is via prediction error, and the

method of calculating such error depends on the prediction tool in question. For

instance, in regression we use the mean squared error, whereas in classification

we often use the error rate.

(e) Finally, we remarked that a very important characteristic is the fact that the equations

describing both regression and classification functions can be similarly

expressed (see (1.2) and (1.3)).

In the light of these, we claim that it is possible to use regression as a tool for

classification. We therefore propose a classification function based on multiple

regression, and claim that it is identical to FDA; hence we name this classifier

Regression Discriminant Analysis (RDA). We further claim (Section 4.3) that regression

variants, namely ridge regression and the Lasso, can be used as valid binary

classifiers.

In the section that follows, we shall provide mathematical backing for the claim

that RDA is identical to FDA. Also, some data-based illustrations will be provided.
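Ahead of the formal argument, the claim can be checked numerically. The sketch below relies on a standard result: regressing a two-class label on the inputs by least squares gives a coefficient vector proportional to the FDA direction. It is an independent illustration on simulated data with a hypothetical mean shift, not the construction of RDA developed in the next section. The comparison covers the direction only, not the decision threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 5
y = np.arange(n) % 2                   # two classes, coded 0 and 1
X = rng.normal(size=(n, p))
X[y == 1] += np.linspace(0.2, 1.0, p)  # ASSUMPTION: hypothetical mean shift

# FDA direction: inverse pooled covariance times the mean difference.
m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
r0, r1 = X[y == 0] - m0, X[y == 1] - m1
pooled = (r0.T @ r0 + r1.T @ r1) / (n - 2)
w_fda = np.linalg.solve(pooled, m1 - m0)

# Multiple regression of the class label on the inputs, with an intercept.
design = np.column_stack([np.ones(n), X])
coef, *_ = np.linalg.lstsq(design, y.astype(float), rcond=None)
beta = coef[1:]                        # drop the intercept

# Cosine of the angle between the two directions.
cosine = beta @ w_fda / (np.linalg.norm(beta) * np.linalg.norm(w_fda))
print(cosine)
```

A cosine equal to 1 up to rounding error indicates that the regression coefficients and the FDA direction point the same way, so both rules project the data onto the same line.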
