3.3 Breaking the SVM
3.3.4 Investigation
We shall use two randomly generated multivariate datasets; one for a training
set, and the other for a test set. Both sets are obtained using an equicovariance
matrix for Σ, with ρ = 0.8, and a proportional mean difference for µ as shown in
Appendix C. The dimensions of the test set are 1000 × 20, whereas the training set
consists of three different sets with dimensions: 30 × 20, 100 × 20 and 1000 × 20.
The reason for using different training set sizes is to study how each classifier
behaves as the training sample size varies.
Thereafter, we train each classifier three times; each time a different training set
size is used. After a classifier is trained, it is tested on 100 different randomly
each class. This procedure helps to eliminate biased conclusions, because different
aspects of the data are accounted for. The error rate of a classifier for each training
sample size is the average error rate over the 100 different test sets. In Table 3.1, we
present these average error rates alongside those of the SVM.
Sample  Sample size  IFDA    EFDA    RFDA    FDA     SVM
1       n = 30       0.2528  0.013   0.0483  0.0482  0.0289
2       n = 100      0.0671  0.0105  0.0218  0.0213  0.0230
3       n = 1000     0.0154  0.0091  0.0093  0.0091  0.0161
Table 3.1: Average error rates for each classifier for different training set sizes. A linear kernel was used for the SVM, which in effect amounts to using the datasets in their untransformed form. This avoids introducing bias into the error rates of the SVM when they are compared with the error rates of the other classifiers.
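The simulation design described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the code used to produce Table 3.1: the mean vector `mu` is a placeholder (the actual values are those of Appendix C), the classes are assumed to be equally sized, only plain two-class FDA is fitted, and all function names are hypothetical.

```python
import numpy as np

def equicorrelation(p, rho):
    """p x p matrix with 1 on the diagonal and rho everywhere else."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def sample_two_classes(n_per_class, mu, sigma, rng):
    """Draw n_per_class rows from N(0, sigma) and n_per_class rows from N(mu, sigma)."""
    p = len(mu)
    x0 = rng.multivariate_normal(np.zeros(p), sigma, size=n_per_class)
    x1 = rng.multivariate_normal(mu, sigma, size=n_per_class)
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)

def fit_fda(X, y):
    """Two-class Fisher discriminant: direction w and midpoint threshold c."""
    m0, m1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    r0, r1 = X[y == 0] - m0, X[y == 1] - m1
    s_pooled = (r0.T @ r0 + r1.T @ r1) / (len(y) - 2)
    w = np.linalg.solve(s_pooled, m1 - m0)
    return w, w @ (m0 + m1) / 2.0

def error_rate(w, c, X, y):
    """Fraction of rows assigned to the wrong class."""
    return np.mean((X @ w > c).astype(int) != y)

rng = np.random.default_rng(0)
p, rho = 20, 0.8
sigma = equicorrelation(p, rho)
mu = 0.08 * np.arange(1, p + 1)  # ASSUMPTION: placeholder, not the Appendix C values

avg_error = {}
for n_train in (30, 100, 1000):
    X_tr, y_tr = sample_two_classes(n_train // 2, mu, sigma, rng)
    w, c = fit_fda(X_tr, y_tr)
    # Average the test error over 100 independently generated 1000 x 20 test sets.
    errs = [error_rate(w, c, *sample_two_classes(500, mu, sigma, rng))
            for _ in range(100)]
    avg_error[n_train] = float(np.mean(errs))

print(avg_error)
```

Averaging over many test sets, rather than relying on a single one, reduces the chance that a conclusion is driven by one particular draw of the test data.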
Differences in error rates for each training sample size can be regarded as marginal,
except in the case of IFDA for the n = 30 sample size. Here, we observed a
marked difference in comparison with the error rates of the other classifiers. By and
large, IFDA is the worst-performing classifier, and we attribute this to the absence
of estimates of the covariances between pairs of explanatory variables.
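The cost of ignoring covariances can be illustrated at the population level. The sketch below assumes that IFDA amounts to FDA with the off-diagonal entries of the covariance matrix set to zero (an interpretation of the remark above, not a definition taken from this section) and uses a placeholder mean difference `delta` rather than the Appendix C values. For two normal classes with common covariance and equal priors, a linear rule along direction w with a midpoint threshold has error rate Φ(−wᵀδ / (2√(wᵀΣw))).

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def linear_rule_error(w, delta, sigma):
    """Error of the midpoint rule along w for N(0, sigma) versus N(delta, sigma)."""
    return normal_cdf(-(w @ delta) / (2.0 * math.sqrt(w @ sigma @ w)))

p, rho = 20, 0.8
sigma = (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))
delta = 0.08 * np.arange(1, p + 1)  # ASSUMPTION: placeholder mean difference

w_full = np.linalg.solve(sigma, delta)  # uses the full covariance matrix
w_diag = delta / np.diag(sigma)         # ASSUMPTION: covariances ignored (diagonal only)

err_full = linear_rule_error(w_full, delta, sigma)
err_diag = linear_rule_error(w_diag, delta, sigma)
print(err_full, err_diag)
```

With strongly correlated variables (ρ = 0.8), the diagonal rule discards information that the full-covariance rule exploits, so its error rate is higher even when the parameters are known exactly.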
FDA and RFDA have almost identical error rates, and we attribute this to the fact
that we used only numerically stable datasets in the investigation. As a result, the
value of the regularization parameter λ required for RFDA is as small as 0.0001,
which in general is not enough to cause major changes in the structure of the
covariance matrix. In other words, the effect on the covariance matrix is similar to
that obtained with λ = 0. This further shows that RFDA is superior to FDA only
when datasets are numerically unstable.
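The negligible effect of such a small λ on a well-conditioned covariance matrix can be checked numerically. The sketch below assumes a ridge-type regularization of the form Σ + λI, which may differ from the exact RFDA formulation used in this work, and again uses a placeholder mean difference. It compares the relative change in the discriminant coefficients for a stable matrix (ρ = 0.8) and for a nearly singular one (ρ very close to 1, chosen here purely for illustration).

```python
import numpy as np

def equicorrelation(p, rho):
    """p x p matrix with 1 on the diagonal and rho everywhere else."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))

def relative_change(sigma, delta, lam):
    """Relative change in the coefficients when sigma is replaced by sigma + lam * I."""
    w0 = np.linalg.solve(sigma, delta)
    w_lam = np.linalg.solve(sigma + lam * np.eye(len(delta)), delta)
    return np.linalg.norm(w_lam - w0) / np.linalg.norm(w0)

p, lam = 20, 1e-4
delta = 0.08 * np.arange(1, p + 1)  # ASSUMPTION: placeholder mean difference

stable = equicorrelation(p, 0.8)           # well conditioned
unstable = equicorrelation(p, 1.0 - 1e-6)  # ASSUMPTION: nearly singular example

cond_stable = np.linalg.cond(stable)
cond_unstable = np.linalg.cond(unstable)
change_stable = relative_change(stable, delta, lam)
change_unstable = relative_change(unstable, delta, lam)
print(cond_stable, change_stable)
print(cond_unstable, change_unstable)
```

In the stable case the smallest eigenvalue (1 − ρ = 0.2) dwarfs λ, so the coefficients barely move; in the nearly singular case λ exceeds the smallest eigenvalue and the coefficients change substantially.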
where we remarked that if data are normal with different means, and a common
covariance matrix, then FDA is ideal. In the case of n = 30, we find a situation
where the parameters are poorly estimated, unlike in the cases with sample sizes of
100 and 1000. For this reason, it is not surprising that SVM performed better when
n = 30, whereas FDA performed relatively better for the other two sample sizes.
The error rates of EFDA are consistently the lowest or joint lowest, irrespective of
the size of the training set. This shows that an informed modification of FDA, under
some assumptions, can give rise to a classifier that consistently performs better than
SVM. The problem with EFDA is that the same effect is not replicated on real world
datasets.
For instance, we applied the classifier to three datasets, namely the Appendicitis,
Australia and Coil2000 datasets described in Chapters 4 and 5, and obtained error
rates of 0.2188, 0.3140 and 0.3424 respectively. The error rates of SVM on the
same datasets are 0.1875, 0.1400 and 0.0597 respectively. In comparison, EFDA
performed rather poorly. But in spite of this outcome on real world datasets,
the performances of EFDA on the simulated datasets show that the classifier can
consistently perform better than SVM, on any real world dataset that meets the
Chapter 4. Regression Discriminant Analysis (RDA)
4.1 Introduction
In Chapter 1, we identified regression and classification as valid tools for prediction,
and further listed a number of characteristics that the two prediction tools share.
In particular, we mentioned that:
(a) There exists a matrix of input data and a vector of outputs, both required for
training and testing regression and classification models.
(b) Since the dimensions of the input data are in most cases at least n × 2, there
can be concern about the numerical stability of methods in both cases.
(c) Where the input data are high dimensional, obtaining a regression or
classification function that performs optimally may be a
challenging task.
(d) The assessment of prediction tools is via prediction error, and the method of
calculating this error depends on the prediction tool in question. For instance,
in regression we use the mean square error, whereas in classification we often
use the error rate.
(e) Finally, a very important characteristic is that the equations describing
regression and classification functions can be expressed similarly
(see (1.2) and (1.3)).
In the light of these, we claim that it is possible to use regression as a tool for
classification. We therefore propose a classification function based on multiple
regression, and claim that it is identical to FDA; hence we name this classifier
Regression Discriminant Analysis (RDA). We further claim (Section 4.3) that
regression variants, namely ridge regression and the Lasso, can be used as valid
binary classifiers.
In the section that follows, we shall provide mathematical backing for the claim
that RDA is identical to FDA. Some data-based illustrations will also be provided.
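Before the formal argument, the plausibility of this claim can be checked numerically. The sketch below is an illustration under stated assumptions rather than the derivation of this chapter: it uses synthetic two-class data, codes the classes as −1 and +1 (one of many possible codings), fits an ordinary least squares regression with an intercept, and compares the resulting slope vector with the FDA direction computed from the within-class scatter matrix. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_class, p = 50, 5

# Synthetic data: two normal classes differing only in their mean vectors.
X0 = rng.normal(size=(n_per_class, p))
X1 = rng.normal(size=(n_per_class, p)) + 1.0
X = np.vstack([X0, X1])
y = np.repeat([-1.0, 1.0], n_per_class)  # ASSUMPTION: -1/+1 class coding

# Multiple regression of the coded labels on X, with an intercept column.
A = np.column_stack([np.ones(len(y)), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0][1:]  # drop the intercept

# FDA direction: inverse within-class scatter times the mean difference.
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
w = np.linalg.solve(S_w, m1 - m0)

# Cosine of the angle between the two coefficient vectors.
cosine = float(beta @ w / (np.linalg.norm(beta) * np.linalg.norm(w)))
print(cosine)
```

A cosine equal to 1 up to rounding error indicates that the regression slopes and the FDA direction differ only by a positive scale factor, so both rules project the data onto the same line.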