Support vector machines - Complexity measurement for dealing with class imbalance problems in c

2.6 Conclusion

2.7.4 Support vector machines

Support Vector Machines (SVM) are a family of supervised machine learning techniques. They were originally introduced by Vapnik and other co-authors (Boser et al. [79]) and several extensions were successively proposed. When used for a two-class classiﬁcation problem where the set of binary labeled training patterns is linearly separable, the SVM separates both classes with a

hyperplane that is maximally distant from them (‘the maximal margin hyperplane’). If linear separation is not possible, the feature space is enlarged using basis expansions such as polynomials or splines. However, explicit specification of this transformation is not necessary, but a kernel function that computes inner products in the transformed space is required. We have fitted the SVM models with the svm function available in the library e1071 of the R system (Dimitriadou et al. [80]), which offers an interface to the award-winningC₊₊implementation, LIBSVM, by Chan & Lin. The data set is described by n training vectorsx_i, y_i, i= 1,2, ..., nwhere the p-dimensional vectors x_i contain the predictor features and the n labelsy_i{−1,1} identify the class of each vector. Among the several variants of SVM existing in the R library e1071, following Meyer (2004) we have used C-classification with the Radial Basis Kernel because of the small number of parameters to be tuned (only two). The primal quadratic programming problem to be solved is:

min_ω,b,ξ 1₂ωtω+Cn_i₌₁ξ i= 1,2, ..., n

C >0 is a parameter controlling the trade-oﬀ between margin and error, and

i=1ξ is an upper bound on the sum of distances of the wrongly classiﬁed cases to their correct plane. The dual problem is

min_α 1₂αtQα−etα 0≤α ≤C ytα = 0

where e is the n-vector of all ones, and Q is a positive semi-definite matrix defined by Q_ij =y_iy_jK(x_i, x_j), i, j = 1,2, ..., n, being K(x_i, x_j) =φ(x_i)φ(x_j) the kernel function. No explicit construction of the nonlinear mapping φ(x) is needed, what is called the kernel trick. A vector x is classified by the decision function

sign(n_i₌₁y_iα_iK(x_i, x) +b)

depending on the margins m_i = n_i₌₁y_iα_iK(x_i, x) +b, i = 1,2, ..., n. The greater the absolute value of the margin, the more reliable is the computed classiﬁcation. The exploratory analysis of the margins could help to enlighten the SVM model (Furey et al. [81]). The Radial Basis Function (RBF) was our choice for K:

K(u, v) = exp(−γu−v2)

Note that the solution to the quadratic programming problem is global, avoiding the non optimality of the neural network training algorithms. So, two parameters must be tuned: C and γ. An estimation of the probability of both classes can be computed with the following expressions:

P[Class labeled“0”] = ₁₊1_e_mi; P[Class labeled“1”] = ₁₊emi_e_mi

Currently we used default value for the parameter C and γ of the svm function in the R library e1071, deﬁned as 1/p, with p being the number of predictors, while C = 1.

Measurement and Visualization

of Data Complexity for

Classiﬁcation Problems with

Imbalanced Data

We introduce a complexity measure for classiﬁcation problems that takes ac- count of deterioration in classiﬁer performance due to class imbalance. The measure is based on k-nearest neighbors. We explore the choices of k and the distance metric through a simulation study, and illustrate the use of our measure, and related data visualization techniques, with real data sets from the literature.

3.1 Introduction

Theoretically [1] and empirically, there is general understanding that different types of data require different kinds of classification. Not every imbalanced data set can be a problem for learning [2, 3]. It seems that the problem is rather with the small class in the presence of other factors such as class overlap [4]. The experiment conducted by Jo and Japkowicz [2] also suggested that degradation in class learning is not directly related to class imbalance

alone, but rather to small disjunct (small subsets of isolated examples). Given the suggestion that the imbalanced class distribution may not be the only problem, we see the need for a structured way of investigating and ex- plaining what intrinsic features of the data are aﬀecting the degraded learning performance of an imbalanced data set. We think that the answer could be found through data complexity analysis or exploratory data analysis (EDA), and that data complexity measures can be used to devise a structured study for learning about class imbalance problems.

The classification error of an optimum Bayes classifier is an important parameter for pattern classification (the science of making inferences from perceptual data) and feature selection [5]. This parameter is recognized as a benchmark for other classification techniques and also as a “goodness” crite- rion for feature selection used in the classification. The Bayes error provides the lower error bound that can be achieved by any pattern classifier [6, 7]. This error rate will be greater than zero whenever class distributions overlap. When all the class priors and conditional likelihoods are completely known, in theory the Bayes rate can be obtainable [7], but only in simple situations and for Gaussian distributions can it be calculated directly. To measure the divergence between two Gaussian distributions the statistical literature suggests: Mahalanobis distance, Euclidean distance (most popu- lar), J-Coefficient, Kullback-Leibler divergence, χ2 divergence, Minkowsi L2 distance, Hellinger coefficient and Bhattacharyya distance. For more details see H.-H. Bock [8], Chapter: Dissimilarity Measures for Probability Distribu- tions. However when the pattern distributions are unknown, the Bayes error cannot be readily computed. Thus, one cannot know how much classification error is due to class density overlap and how much is due to limitations of the training data (such as class imbalance) or deficiencies in the classifier. It is important to not only design a good classifier but to have a limit or bound on the achievable classification rate for a given data set [9]. Such an estimate will help researchers to decide whether to improve the current classifier, to use another classifier on the same data set or to acquire more data.

In the literature data complexity measures have been reviewed by Ho [10], who composed diﬀerent complexity methods and was able to uncover

different data structures in real data sets, observing a relationship between the overall classification performance and data complexity. Details of these measures are given below. Another work by Weng and Poon [11] also looked at data complexity and tried to analyze classifier performance using the data complexity measurements of Ho [10]. They looked at just two data sets so did not do any correlation analysis to make the relation more explicit.

In this chapter we utilize data complexity to investigate the relationship between the degradation of classifier performance and certain features of the training data, specifically class imbalance and class overlap. Unlike alterna- tive performance measures such as G-Mean and Sensitivity, our complexity is independent of the choice of the classifier. Data complexity in our context is the quantification of the degree of difficulty in class learning from a given data set. In this study we investigate class imbalance with theoretical known Bayes error. Our approach is also motivated geometrically in that we are looking at the location of cases as data points in a suitable space and how they are distributed within each class. This approach leads naturally to the consideration of data visualization as an additional tool for examining data complexity. We propose a new complexity measure designed to be sensitive to class imbalance, and investigate its properties and uses.

In the next section we review existing complexity measures. We then introduce our proposed measure based onk-Nearest Neighbors for estimation of Bayes error, and discuss visualization methods. Section 3.4 compares the properties of several complexity measures via a simulation study. In section 3.5 we illustrate our methodology using real data sets from the literature. We conclude with a discussion and suggestions for further research.

In document Complexity measurement for dealing with class imbalance problems in classification modelling : a thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, Massey University, 2012 (Page 79-84)