2.3 Dealing with Class Imbalance Problems
2.3.1 Re-sampling Techniques
2.3.1.1 Under-Sampling Techniques
The simplest under-sampling method is Random Under-Sampling (RUS) [10], a non-heuristic method that balances the data set by randomly eliminating examples of the majority class. Its main disadvantage is that it can discard information that would have been useful to the classifier.
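As an illustration, RUS can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the function name `random_under_sample`, the list-based data layout, and the choice to reduce the majority class to exactly the size of the minority class are assumptions made for the example.

```python
import random

def random_under_sample(X, y, majority_label, seed=0):
    """Randomly drop majority-class examples until both classes have equal size."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    # Assumed target ratio: keep as many majority examples as there are minority ones.
    kept = rng.sample(majority, min(len(majority), len(minority)))
    indices = sorted(kept + minority)
    return [X[i] for i in indices], [y[i] for i in indices]

# Toy one-dimensional data: six majority examples (0) and two minority examples (1).
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (5.0,), (5.1,)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_under_sample(X, y, majority_label=0)
```

Which majority examples survive depends only on the random seed, which is exactly why potentially useful examples can be lost.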
Many empirical under-sampling methods have also been proposed; they can be grouped under the name Neighborhood Cleansing Techniques. They are based on the noise model hypothesis, which treats the examples near the decision boundary between the two classes as noise to be eliminated.
Condensed Nearest Neighbor Rule. Hart's Condensed Nearest Neighbor Rule (CNN) [22] is used to find a consistent subset of examples. A subset X̂ ⊆ X is consistent with X if a 1-Nearest Neighbor (1-NN) classifier that uses X̂ as its stored reference set correctly classifies all examples in X. Under 1-NN, each case receives the class of the stored example at minimum distance from it, so a case whose nearest stored example belongs to a different class is misclassified. An algorithm to create a subset X̂ from X as an under-sampling method is defined as follows: first, randomly draw one majority class example, take all examples from the minority class, and put these examples in X̂. Afterwards, use a 1-NN classifier over the examples in X̂ to classify the examples in X. Every misclassified example from X is moved to X̂. The idea behind this implementation of a consistent subset is to eliminate the examples from the majority class that are distant from the decision border, since such examples might be considered less relevant for learning. This method is only effective in data sets with little class overlap; when the overlap between the classes is high, no significant reduction of the training set can be achieved.
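The procedure above can be sketched as follows. This is a minimal sketch under stated assumptions: the names `nearest_label` and `cnn_under_sample` are hypothetical, Euclidean distance is assumed, and the loop repeats until a full pass adds no example, a stopping rule that goes beyond the single classification pass described in the text.

```python
import math
import random

def nearest_label(point, store_X, store_y):
    """Label of the stored example closest to `point` (1-NN rule)."""
    best = min(range(len(store_X)), key=lambda i: math.dist(point, store_X[i]))
    return store_y[best]

def cnn_under_sample(X, y, majority_label, seed=0):
    """Return a subset holding all minority examples plus the majority examples 1-NN needs."""
    rng = random.Random(seed)
    majority = [i for i, label in enumerate(y) if label == majority_label]
    # Start the store with every minority example and one random majority example.
    store = [i for i, label in enumerate(y) if label != majority_label]
    store.append(rng.choice(majority))
    in_store = set(store)
    changed = True
    while changed:  # assumed stopping rule: repeat until a pass adds nothing
        changed = False
        for i in majority:
            if i in in_store:
                continue
            store_X = [X[j] for j in store]
            store_y = [y[j] for j in store]
            if nearest_label(X[i], store_X, store_y) != y[i]:
                store.append(i)  # misclassified: move it into the store
                in_store.add(i)
                changed = True
    store.sort()
    return [X[i] for i in store], [y[i] for i in store]

# Toy data with well-separated classes: one majority example is enough.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.5,), (5.0,), (5.1,)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_cnn, y_cnn = cnn_under_sample(X, y, majority_label=0)
```

On this toy data the store ends up with a single majority example, because every other majority example is far from the decision border and is already classified correctly.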
Wilson's Edited Nearest Neighbor Rule (ENN) [23]. ENN removes any example from the data set whose class label differs from the class of at least two of its three nearest neighbors.
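A minimal sketch of this rule is given below. The helper names `k_nearest` and `enn` are hypothetical, Euclidean distance is assumed, and neighbors are computed once on the original data set rather than updated as examples are removed.

```python
import math

def k_nearest(i, X, k=3):
    """Indices of the k examples closest to X[i], excluding X[i] itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def enn(X, y):
    """Drop every example whose label disagrees with at least two of its three neighbors."""
    keep = []
    for i in range(len(X)):
        disagree = sum(1 for j in k_nearest(i, X) if y[j] != y[i])
        if disagree < 2:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: the class-1 example at 0.15 sits inside the class-0 cluster.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.15,), (5.0,), (5.1,), (5.2,), (5.3,)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
X_enn, y_enn = enn(X, y)
```

Only the isolated class-1 example at 0.15 is removed here; note that the rule as stated applies to examples of either class.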
Neighborhood Cleaning Rule. The Neighborhood Cleaning Rule (NCL) [24] modifies ENN in order to increase the data cleaning. For a two-class problem the algorithm can be described in the following way: for each example Xi in the training set, its three nearest neighbors are found. If Xi belongs to the majority class and the classification given by its three nearest neighbors contradicts the original class of Xi, then Xi is removed. If Xi belongs to the minority class and its three nearest neighbors misclassify Xi, then the nearest neighbors that belong to the majority class are removed. This method is only effective in data sets with little class overlap; when the overlap between the classes is high, all majority class examples near the decision boundary are eliminated, and the resulting training set yields a poor model for the majority class.
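The two-class rule can be sketched as follows. The names `k_nearest` and `ncl` are hypothetical, Euclidean distance is assumed, and "misclassified by its three nearest neighbors" is read as "at least two of the three neighbors carry the other label", which is only valid in the two-class setting.

```python
import math

def k_nearest(i, X, k=3):
    """Indices of the k examples closest to X[i], excluding X[i] itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:k]

def ncl(X, y, majority_label):
    """Two-class NCL sketch: only majority-class examples are ever removed."""
    remove = set()
    for i in range(len(X)):
        neighbors = k_nearest(i, X)
        disagree = sum(1 for j in neighbors if y[j] != y[i])
        if disagree < 2:
            continue  # the 3-NN vote agrees with the label of X[i]
        if y[i] == majority_label:
            remove.add(i)  # misclassified majority example
        else:
            # Misclassified minority example: drop its majority-class neighbors.
            remove.update(j for j in neighbors if y[j] == majority_label)
    keep = [i for i in range(len(X)) if i not in remove]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: a minority example (0.24) inside the majority cluster and
# a majority example (5.05) inside the minority cluster.
X = [(0.0,), (0.1,), (0.2,), (0.3,), (0.4,), (0.24,), (5.0,), (5.1,), (5.2,), (5.05,)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
X_ncl, y_ncl = ncl(X, y, majority_label=0)
```

On this toy data a single overlapping minority example costs the majority class three of its examples, which illustrates why heavy class overlap leads to a poor model for the majority class.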
Tomek links. Tomek links [25] consider the examples near the borderline to be more important. The method can be defined as follows: given two examples Xi and Xj belonging to different classes, and d(Xi, Xj) the distance between Xi and Xj, a pair (Xi, Xj) is called a Tomek link if there is no example Xl such that d(Xi, Xl) < d(Xi, Xj) or d(Xj, Xl) < d(Xi, Xj). If two examples form a Tomek link, then either one of these examples is noise or both examples lie on the borderline. Tomek links can be used as an under-sampling method or as a data cleaning method. As an under-sampling method, only the majority class examples that form Tomek links are eliminated; as a data cleaning method, the Tomek link examples of both classes are removed. The method must be used with caution in highly imbalanced data sets with heavily overlapping classes, since it may remove a large part of the majority class and thereby seriously degrade the accuracy on that class.
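The definition translates directly into a brute-force check, sketched below. The names `tomek_links` and `remove_tomek_links` and the `cleaning` flag are hypothetical, Euclidean distance is assumed, and the triple loop is written for clarity rather than speed.

```python
import math

def tomek_links(X, y):
    """Index pairs (i, j), i < j, of opposite-class examples that form a Tomek link."""
    n = len(X)
    links = []
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue
            d_ij = math.dist(X[i], X[j])
            # No third example may be closer to X[i] or to X[j] than they are to each other.
            blocked = any(
                math.dist(X[i], X[l]) < d_ij or math.dist(X[j], X[l]) < d_ij
                for l in range(n) if l != i and l != j
            )
            if not blocked:
                links.append((i, j))
    return links

def remove_tomek_links(X, y, majority_label, cleaning=False):
    """Under-sampling drops majority members of links; cleaning drops both members."""
    drop = set()
    for pair in tomek_links(X, y):
        for idx in pair:
            if cleaning or y[idx] == majority_label:
                drop.add(idx)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: the examples at 2.9 (majority) and 3.0 (minority) face each other.
X = [(0.0,), (1.0,), (2.0,), (2.9,), (3.0,), (4.0,), (5.0,)]
y = [0, 0, 0, 0, 1, 1, 1]
links = tomek_links(X, y)
X_us, y_us = remove_tomek_links(X, y, majority_label=0)
X_cl, y_cl = remove_tomek_links(X, y, majority_label=0, cleaning=True)
```

The pair at 2.0 and 3.0 is not a link, because the example at 2.9 is closer to 2.0 than 3.0 is. This brute-force version compares every opposite-class pair against every other example, so its cost grows quickly with the size of the data set.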
One-sided selection (OSS). OSS [26] is an under-sampling method resulting from the application of Tomek links followed by the application of CNN. Tomek links are used as an under-sampling method to remove noisy and borderline majority class examples. Borderline examples can be considered unsafe, since a small amount of noise can make them fall on the wrong side of the decision border. CNN then removes examples from the majority class that are distant from the decision border. The remaining examples, i.e., the retained majority class examples and all minority class examples, are used for learning.
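The two-step pipeline can be sketched as follows. This is a minimal sketch, not the authors' implementation: the helper names are hypothetical, Euclidean distance is assumed, the CNN step repeats until a pass adds nothing (an assumed stopping rule), and the code assumes at least one majority example survives the Tomek step.

```python
import math
import random

def tomek_majority(X, y, majority_label):
    """Indices of majority examples that take part in a Tomek link."""
    n, drop = len(X), set()
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue
            d_ij = math.dist(X[i], X[j])
            if not any(math.dist(X[i], X[l]) < d_ij or math.dist(X[j], X[l]) < d_ij
                       for l in range(n) if l not in (i, j)):
                drop.update(k for k in (i, j) if y[k] == majority_label)
    return drop

def cnn_store(X, y, majority_label, rng):
    """Indices of a consistent subset: all minority examples plus the needed majority ones."""
    majority = [i for i in range(len(X)) if y[i] == majority_label]
    store = [i for i in range(len(X)) if y[i] != majority_label]
    store.append(rng.choice(majority))
    changed = True
    while changed:
        changed = False
        for i in majority:
            if i in store:
                continue
            nearest = min(store, key=lambda j: math.dist(X[i], X[j]))
            if y[nearest] != y[i]:
                store.append(i)
                changed = True
    return sorted(store)

def one_sided_selection(X, y, majority_label, seed=0):
    rng = random.Random(seed)
    # Step 1: Tomek links as under-sampling (noisy and borderline majority examples).
    drop = tomek_majority(X, y, majority_label)
    X1 = [x for i, x in enumerate(X) if i not in drop]
    y1 = [c for i, c in enumerate(y) if i not in drop]
    # Step 2: CNN (majority examples distant from the decision border).
    keep = cnn_store(X1, y1, majority_label, rng)
    return [X1[i] for i in keep], [y1[i] for i in keep]

X = [(0.0,), (1.0,), (2.2,), (2.9,), (3.0,), (4.0,), (5.0,)]
y = [0, 0, 0, 0, 1, 1, 1]
X_oss, y_oss = one_sided_selection(X, y, majority_label=0)
```

On this toy data the borderline majority example at 2.9 is removed by the Tomek step, all minority examples are kept, and the CNN step retains the majority example at 2.2 because it is the one needed to classify its own neighborhood correctly.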
CNN + Tomek links. This is one of the methods proposed by Batista et al. [18]. It is similar to OSS, but the method to find the consistent subset is applied before the Tomek links. The authors' objective is to verify its competitiveness with OSS: as finding Tomek links is computationally demanding, it is computationally cheaper to perform this step on a reduced data set.