2.6 Conclusion
2.7.4 Support vector machines
Support Vector Machines (SVM) are a family of supervised machine learning techniques. They were originally introduced by Vapnik and other co-authors (Boser et al. [79]) and several extensions were successively proposed. When used for a two-class classification problem where the set of binary labeled training patterns is linearly separable, the SVM separates both classes with a
hyperplane that is maximally distant from them (‘the maximal margin hyper- plane’). If linear separation is not possible, the feature space is enlarged using basis expansions such as polynomials or splines. However, explicit specification of this transformation is not necessary, but a kernel function that computes inner products in the transformed space is required. We have fitted the SVM models with the svm function available in the library e1071 of the R system (Dimitriadou et al. [80]), which offers an interface to the award-winningC++implementation, LIBSVM, by Chan & Lin. The data set is described by n training vectorsxi, yi, i= 1,2, ..., nwhere the p-dimensional vectors xi contain the predictor features and the n labelsyi{−1,1} identify the class of each vector. Among the several variants of SVM existing in the R library e1071, following Meyer (2004) we have used C-classification with the Radial Basis Kernel because of the small number of parameters to be tuned (only two). The primal quadratic programming problem to be solved is:
minω,b,ξ 12ωtω+Cni=1ξ i= 1,2, ..., n
C >0 is a parameter controlling the trade-off between margin and error, and
n
i=1ξ is an upper bound on the sum of distances of the wrongly classified cases to their correct plane. The dual problem is
minα 12αtQα−etα 0≤α ≤C ytα = 0
where e is the n-vector of all ones, and Q is a positive semi-definite matrix defined by Qij =yiyjK(xi, xj), i, j = 1,2, ..., n, being K(xi, xj) =φ(xi)φ(xj) the kernel function. No explicit construction of the nonlinear mapping φ(x) is needed, what is called the kernel trick. A vector x is classified by the decision function
sign(ni=1yiαiK(xi, x) +b)
depending on the margins mi = ni=1yiαiK(xi, x) +b, i = 1,2, ..., n. The greater the absolute value of the margin, the more reliable is the computed classification. The exploratory analysis of the margins could help to enlighten the SVM model (Furey et al. [81]). The Radial Basis Function (RBF) was our choice for K:
K(u, v) = exp(−γu−v2)
Note that the solution to the quadratic programming problem is global, avoiding the non optimality of the neural network training algorithms. So, two parameters must be tuned: C and γ. An estimation of the probability of both classes can be computed with the following expressions:
P[Class labeled“0”] = 1+1emi; P[Class labeled“1”] = 1+emiemi
Currently we used default value for the parameter C and γ of the svm function in the R library e1071, defined as 1/p, with p being the number of predictors, while C = 1.
Measurement and Visualization
of Data Complexity for
Classification Problems with
Imbalanced Data
We introduce a complexity measure for classification problems that takes ac- count of deterioration in classifier performance due to class imbalance. The measure is based on k-nearest neighbors. We explore the choices of k and the distance metric through a simulation study, and illustrate the use of our measure, and related data visualization techniques, with real data sets from the literature.
3.1
Introduction
Theoretically [1] and empirically, there is general understanding that different types of data require different kinds of classification. Not every imbalanced data set can be a problem for learning [2, 3]. It seems that the problem is rather with the small class in the presence of other factors such as class overlap [4]. The experiment conducted by Jo and Japkowicz [2] also suggested that degradation in class learning is not directly related to class imbalance
alone, but rather to small disjunct (small subsets of isolated examples). Given the suggestion that the imbalanced class distribution may not be the only problem, we see the need for a structured way of investigating and ex- plaining what intrinsic features of the data are affecting the degraded learning performance of an imbalanced data set. We think that the answer could be found through data complexity analysis or exploratory data analysis (EDA), and that data complexity measures can be used to devise a structured study for learning about class imbalance problems.
The classification error of an optimum Bayes classifier is an important parameter for pattern classification (the science of making inferences from perceptual data) and feature selection [5]. This parameter is recognized as a benchmark for other classification techniques and also as a “goodness” crite- rion for feature selection used in the classification. The Bayes error provides the lower error bound that can be achieved by any pattern classifier [6, 7]. This error rate will be greater than zero whenever class distributions over- lap. When all the class priors and conditional likelihoods are completely known, in theory the Bayes rate can be obtainable [7], but only in simple situations and for Gaussian distributions can it be calculated directly. To measure the divergence between two Gaussian distributions the statistical literature suggests: Mahalanobis distance, Euclidean distance (most popu- lar), J-Coefficient, Kullback-Leibler divergence, χ2 divergence, Minkowsi L2 distance, Hellinger coefficient and Bhattacharyya distance. For more details see H.-H. Bock [8], Chapter: Dissimilarity Measures for Probability Distribu- tions. However when the pattern distributions are unknown, the Bayes error cannot be readily computed. Thus, one cannot know how much classification error is due to class density overlap and how much is due to limitations of the training data (such as class imbalance) or deficiencies in the classifier. It is important to not only design a good classifier but to have a limit or bound on the achievable classification rate for a given data set [9]. Such an estimate will help researchers to decide whether to improve the current classifier, to use another classifier on the same data set or to acquire more data.
In the literature data complexity measures have been reviewed by Ho [10], who composed different complexity methods and was able to uncover
different data structures in real data sets, observing a relationship between the overall classification performance and data complexity. Details of these measures are given below. Another work by Weng and Poon [11] also looked at data complexity and tried to analyze classifier performance using the data complexity measurements of Ho [10]. They looked at just two data sets so did not do any correlation analysis to make the relation more explicit.
In this chapter we utilize data complexity to investigate the relationship between the degradation of classifier performance and certain features of the training data, specifically class imbalance and class overlap. Unlike alterna- tive performance measures such as G-Mean and Sensitivity, our complexity is independent of the choice of the classifier. Data complexity in our context is the quantification of the degree of difficulty in class learning from a given data set. In this study we investigate class imbalance with theoretical known Bayes error. Our approach is also motivated geometrically in that we are looking at the location of cases as data points in a suitable space and how they are distributed within each class. This approach leads naturally to the consideration of data visualization as an additional tool for examining data complexity. We propose a new complexity measure designed to be sensitive to class imbalance, and investigate its properties and uses.
In the next section we review existing complexity measures. We then introduce our proposed measure based onk-Nearest Neighbors for estimation of Bayes error, and discuss visualization methods. Section 3.4 compares the properties of several complexity measures via a simulation study. In section 3.5 we illustrate our methodology using real data sets from the literature. We conclude with a discussion and suggestions for further research.