2.3 Dealing with Class Imbalance Problems
2.3.2 Cost-Sensitive Learning
Class imbalance and cost-sensitive learning are closely related. The costs of different misclassification errors can differ widely. Cost-sensitive learning solutions, which combine data-level and algorithm-level approaches, assign higher misclassification costs to samples in the minority class and seek to minimize these high-cost errors. A cost-sensitive learning system can be used in applications where the misclassification costs are known. Such systems attempt to reduce the total cost of misclassified examples rather than the classification error. These methods allow for the fact that the value of correctly identifying the positive (rare) class outweighs the value of correctly identifying the common class. For two-class problems this is done by associating a greater cost with false negatives than with false positives. This strategy is appropriate for most medical diagnosis tasks because a false positive typically leads to more comprehensive (i.e., expensive) testing procedures that will ultimately discover the error, whereas a false negative may cause a life-threatening condition to go undiagnosed, which could lead to death. Assigning a greater cost to false negatives than to false positives will improve performance with respect to the positive (rare) class.
If, for example, the cost ratio of false negatives to false positives is 3:1, then a region that has ten majority class examples and four minority class examples will nonetheless be labeled as the minority class: labeling it as the majority class misclassifies four minority examples at a total cost of 4 × 3 = 12, whereas labeling it as the minority class misclassifies ten majority examples at a total cost of 10 × 1 = 10. Thus non-uniform costs can bias the classifier to perform well on the minority class, and in this case the bias is desirable. One problem with this approach is that specific cost information is rarely available. This is partially due to the fact that these costs often depend on multiple considerations that are not easily compared. For example, in the medical diagnosis task the considerations include the probability that an undiagnosed condition will lead to death, the "cost" of a false positive to a patient's well-being, etc. Thus, without specific cost information, it may be more practical to predict only the rare class and generate an ordered list of the best minority-predicting rules. One can then decide where to place the threshold after data mining is complete.
In mathematical notation, let the (i, j) entry in a cost matrix C be the cost of predicting class i when the true class is j. If i = j the prediction is correct, so the cost of a correct classification is zero, i.e., C(i, i) = 0, while if i ≠ j the prediction is incorrect. The optimal prediction for an example x is the class i that minimizes the expected cost
$$L(x, i) = \sum_{j} P(j \mid x)\, C(i, j) \tag{2.1}$$
For each i, L(x, i) is a sum over the alternative possibilities for the true class of x. In this framework, the role of a learning algorithm is to produce a classifier that, for any example x, can estimate the probability P(j|x) of each class j being the true class of x. Making the prediction i means acting as if i is the true class of x.
It is noteworthy that research on class-dependent cost-sensitive learning has produced good solutions for learning from imbalanced data sets [7, 14]. Cost-sensitive learning has therefore been suggested as a good solution to class-imbalance tasks, yet it is not clear how class imbalance affects a cost-sensitive classifier.
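The minimum expected cost rule of Equation 2.1 can be illustrated with a minimal Python sketch. The function names, the class encoding (0 = majority, 1 = minority) and the cost values are assumptions chosen for illustration; the probabilities simply reuse the region with ten majority and four minority examples and the 3:1 cost ratio discussed above.

```python
from typing import Sequence


def expected_cost(probs: Sequence[float],
                  cost: Sequence[Sequence[float]],
                  i: int) -> float:
    """L(x, i) = sum_j P(j|x) * C(i, j).

    cost[i][j] is the cost of predicting class i when the true class is j.
    """
    return sum(p_j * cost[i][j] for j, p_j in enumerate(probs))


def min_cost_class(probs: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Return the class i that minimizes the expected cost L(x, i)."""
    return min(range(len(cost)), key=lambda i: expected_cost(probs, cost, i))


# Assumed encoding: class 0 = majority, class 1 = minority (rare).
# A false negative (predict 0, truth 1) costs 3; a false positive costs 1.
COST_3_TO_1 = [[0.0, 3.0],
               [1.0, 0.0]]
UNIFORM_COST = [[0.0, 1.0],
                [1.0, 0.0]]

# Region with ten majority and four minority examples:
# class probabilities are estimated by the class frequencies.
region_probs = [10 / 14, 4 / 14]

print(min_cost_class(region_probs, UNIFORM_COST))  # majority class (0)
print(min_cost_class(region_probs, COST_3_TO_1))   # minority class (1)
```

With uniform costs the region is labeled with the more probable class, while the 3:1 cost matrix flips the decision to the minority class, which is the bias described in the text.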
Liu and Zhou [35] carried out an empirical study of the influence of class imbalance on cost-sensitive learning. From their experiments they conclude that class imbalance often affects the performance of cost-sensitive classifiers. Cost-sensitive classifiers generally favour the original class distribution when misclassification costs differ only slightly, while a balanced class distribution is more favourable when costs differ substantially.
MetaCost [36] is another method for making a classifier cost-sensitive. It should produce results similar to passing the base learner to Bagging (see section 2.4), which is in turn passed to a cost-sensitive classifier operating on minimum expected cost, i.e., Equation 2.1. The difference is that MetaCost produces a single cost-sensitive classifier of the base learner. It is a two-phase procedure. In the first phase, an internal cost-sensitive model is formed by bagging the base learner to estimate class probabilities, and each training example is relabeled with its minimum expected cost class. In the second phase, the base learner is retrained on the relabeled training set to produce the final model.
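The relabel-and-retrain idea behind MetaCost can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure of [36] in full: the base learner `train_nearest_mean`, the one-dimensional toy data, the use of vote fractions as probability estimates, and the parameters `n_bags` and `seed` are all hypothetical choices made here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]      # (single feature value, class label)
Model = Callable[[float], int]   # maps a feature value to a predicted class


def train_nearest_mean(data: Sequence[Example]) -> Model:
    """Hypothetical base learner: predict the class whose mean feature is closest."""
    sums: dict = {}
    counts: dict = {}
    for x, y in data:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))


def metacost_sketch(data: Sequence[Example],
                    cost: Sequence[Sequence[float]],
                    n_classes: int,
                    learner: Callable[[Sequence[Example]], Model] = train_nearest_mean,
                    n_bags: int = 25,
                    seed: int = 0) -> Tuple[Model, List[Example]]:
    rng = random.Random(seed)
    data = list(data)
    # Bagging: train one model per bootstrap resample of the training set.
    models = [learner([rng.choice(data) for _ in data]) for _ in range(n_bags)]

    relabeled: List[Example] = []
    for x, _ in data:
        # Estimate P(j|x) as the fraction of bagged models voting for class j.
        votes = [0] * n_classes
        for model in models:
            votes[model(x)] += 1
        probs = [v / n_bags for v in votes]
        # Relabel with the class of minimum expected cost (Equation 2.1).
        best = min(range(n_classes),
                   key=lambda i: sum(probs[j] * cost[i][j] for j in range(n_classes)))
        relabeled.append((x, best))

    # Retrain the base learner once on the relabeled data: a single final model.
    return learner(relabeled), relabeled


# Toy data (assumed): class 0 is the majority, class 1 the minority.
train = [(0.0, 0), (0.4, 0), (0.8, 0), (1.2, 0),
         (1.6, 0), (2.0, 0), (2.4, 1), (3.0, 1)]
cost = [[0.0, 5.0],   # predicting 0 when the truth is 1 costs 5
        [1.0, 0.0]]   # predicting 1 when the truth is 0 costs 1
final_model, relabeled_train = metacost_sketch(train, cost, n_classes=2)
```

Because the final model is a single instance of the base learner, it keeps the base learner's prediction speed and interpretability, which is the practical difference from keeping the whole bagged ensemble.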
AdaBoost (see section 2.3.4) has also been made cost-sensitive [37], so that misclassified examples belonging to the minority class are assigned higher weights than misclassified examples belonging to the majority class. The resulting system, AdaCost, has been shown empirically to produce lower cumulative misclassification costs than AdaBoost.
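A single round of a cost-adjusted boosting weight update can be sketched as follows. The labels are assumed to be in {-1, +1}, the per-example costs in [0, 1], and the cost adjustment `beta` (0.5 + 0.5c for misclassified examples, 0.5 - 0.5c for correctly classified ones) is one commonly quoted form; it is an assumption here and should be checked against [37] before being relied upon.

```python
import math
from typing import List, Sequence


def cost_adjusted_update(weights: Sequence[float],
                         labels: Sequence[int],
                         preds: Sequence[int],
                         costs: Sequence[float],
                         alpha: float) -> List[float]:
    """One round of a cost-adjusted, AdaBoost-style weight update.

    labels and preds take values in {-1, +1}; costs lie in [0, 1].
    The form of beta is an assumption made for this sketch.
    """
    updated = []
    for w, y, h, c in zip(weights, labels, preds, costs):
        if y == h:
            beta = 0.5 - 0.5 * c   # costly examples lose less weight when correct
        else:
            beta = 0.5 + 0.5 * c   # costly examples gain more weight when wrong
        updated.append(w * math.exp(-alpha * y * h * beta))
    total = sum(updated)
    return [w / total for w in updated]  # renormalize to a distribution


# Assumed toy round: +1 is the minority class with a high cost (0.9),
# -1 is the majority class with a low cost (0.1).
weights = [0.25, 0.25, 0.25, 0.25]
labels = [+1, -1, +1, -1]
preds = [-1, +1, +1, -1]          # the first two examples are misclassified
costs = [0.9, 0.1, 0.9, 0.1]
new_weights = cost_adjusted_update(weights, labels, preds, costs, alpha=0.5)
```

In this toy round the misclassified minority example ends up with the largest weight, so the next weak learner concentrates on it more than on the misclassified majority example.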
2.3.2.1 Tuning Parameters for Classifiers
Drummond and Holte [38] reported that, with the classifier C4.5 [39] at its default settings, over-sampling is ineffective, often producing little or no change in performance when misclassification costs or the class distribution are modified. Moreover, they noted that over-sampling prunes less, and therefore generalizes less, than under-sampling, and that modifying C4.5's parameters to increase the influence of pruning and other over-fitting avoidance factors can improve the performance of over-sampling. Similarly, Japkowicz and Stephen [40] argued that, for severely imbalanced data sets, unpruned C4.5 models are better than the pruned versions. Wu and Chang [41] proposed an algorithm for Support Vector Machines (SVMs) that modifies the kernel function to improve the accuracy on the minority class. Veropoulos et al. [42] proposed using a different penalty constant for each class to improve the accuracy on the minority class. On the basis of these studies we can argue that tuning the classifier's parameters (details are given in Appendix 2.7) can help in building better models for imbalanced data sets.
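The effect of class-dependent penalty constants can be illustrated with a small sketch of a soft-margin objective for a linear classifier, in which slack on positive (minority) examples is weighted by `c_pos` and slack on negative examples by `c_neg`. The function name, the toy data and the two candidate hyperplanes are assumptions made for illustration; the sketch only evaluates the objective and does not solve the optimization problem.

```python
from typing import Sequence


def weighted_svm_objective(w: Sequence[float], b: float,
                           X: Sequence[Sequence[float]], y: Sequence[int],
                           c_pos: float, c_neg: float) -> float:
    """0.5 * ||w||^2 + c_pos * (slack of positives) + c_neg * (slack of negatives).

    Labels are in {-1, +1}; slack is the hinge loss max(0, 1 - y * (w.x + b)).
    """
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        slack = max(0.0, 1.0 - margin)
        penalty += (c_pos if yi == 1 else c_neg) * slack
    return regularizer + penalty


# Assumed 1-D toy data: four negatives and two positives, one of which
# (x = 2.5) lies inside the region occupied by the negatives.
X = [[0.0], [1.0], [2.0], [3.0], [2.5], [4.0]]
y = [-1, -1, -1, -1, +1, +1]

boundary_a = ([2.0], -7.0)   # boundary at x = 3.5, sacrifices the positive at 2.5
boundary_b = ([2.0], -4.5)   # boundary at x = 2.25, sacrifices the negative at 3.0

for c_pos in (1.0, 5.0):
    obj_a = weighted_svm_objective(*boundary_a, X, y, c_pos=c_pos, c_neg=1.0)
    obj_b = weighted_svm_objective(*boundary_b, X, y, c_pos=c_pos, c_neg=1.0)
    print(c_pos, obj_a, obj_b)
```

With equal penalties the boundary that misclassifies the overlapping positive example has the lower objective; raising the positive-class penalty makes the boundary that protects the minority class preferable.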
Kaizhu Huang et al. [43] presented the Biased Minimax Probability Machine (BMPM) to address the imbalance problem. Given reliable mean vectors and covariance matrices for the minority and majority classes, BMPM derives the decision hyperplane by adjusting the lower bound on the real accuracy of the testing set.
2.3.2.2 One-Class Learning
One-class learning is recognition-based learning, in which a model is created from examples of the target class only. It provides an alternative to discrimination-based learning. One-class SVM [44] is an example of one-class learning: Manevitz and Yousef [44] found it to be competitive with two-class learning. However, they believe that a classifier trained using only positive class examples will not perform as well as one trained using both positive and negative class examples.
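The recognition-based idea can be illustrated with a deliberately simple stand-in for a one-class learner: a centroid and a distance threshold fitted to target-class examples only. This is not a one-class SVM; the function names, the `quantile` parameter and the toy points are assumptions introduced for this sketch.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
OneClassModel = Tuple[List[float], float]   # (centroid, acceptance radius)


def fit_one_class(targets: Sequence[Point], quantile: float = 0.9) -> OneClassModel:
    """Fit a centroid-and-radius description using target-class examples only."""
    dim = len(targets[0])
    center = [sum(p[d] for p in targets) / len(targets) for d in range(dim)]
    dists = sorted(math.dist(p, center) for p in targets)
    # Radius: distance at the chosen quantile of the training distances.
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return center, dists[idx]


def is_target(model: OneClassModel, point: Point) -> bool:
    """Recognize a point as target class if it lies within the learned radius."""
    center, radius = model
    return math.dist(point, center) <= radius


# Only minority (target) examples are used for training; no negatives are needed.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
model = fit_one_class(minority)

print(is_target(model, (0.5, 0.6)))   # close to the training examples
print(is_target(model, (5.0, 5.0)))   # far from the training examples
```

The model never sees a majority class example, so the decision is about resemblance to the target class rather than about a boundary between two classes.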
Kubat et al. [26] introduced the SHRINK technique to cope with the problem of imbalance; it is another example of one-class learning. The system labels a mixed region (one where both minority and majority class examples are found) as positive (minority class), regardless of whether the positive examples prevail in the region or not. This changes the learner's focus: it then searches for the best positive region, i.e., the one with the best positive-to-negative ratio.
Ripper [45] is a rule induction method that uses a separate-and-conquer approach to build rules on the training set iteratively. Each rule is grown by adding new conditions until it covers no examples of the other classes. It normally generates rules for each class in turn, from the rarest to the most common, so it can be restricted to learning rules for the minority class only and can then be viewed as one-class learning.
Raskutti and Kowalczyk [46] compared one-class SVM and two-class SVM, suggesting that one-class learning is useful for extremely unbalanced data sets with high-dimensional feature spaces. They argue that one-class learning is related to feature selection methods (see below), but is more practical, as feature selection is too expensive to apply.