
2.3 Dealing with Class Imbalance Problems

2.3.1 Re-sampling Techniques

2.3.1.1 Under-Sampling Techniques

The simplest under-sampling method is Random Under-Sampling (RUS) [10], a non-heuristic method that balances the data set by randomly eliminating examples of the majority class. Its main disadvantage is that it can discard information that would have been useful to the classifier.
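As a minimal sketch of RUS, the following Python snippet randomly drops examples until every class has as many examples as the smallest one. The function name `random_under_sample`, the toy data, and the choice to balance down to exactly the minority class size are illustrative assumptions, not something prescribed by [10].

```python
import random
from collections import Counter

def random_under_sample(X, y, seed=0):
    """Randomly keep `target` examples per class, where `target` is the
    size of the smallest class (an assumed balancing target)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        # The smallest class is kept whole; larger classes lose examples.
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]

# Toy data: five majority examples (label 0), two minority examples (label 1).
X = [(0.0, 0.0), (0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.5, 0.5),
     (2.0, 2.0), (2.1, 1.9)]
y = [0, 0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_under_sample(X, y)
print(Counter(y_bal))
```

Three of the five majority examples are discarded without regard to how informative they are, which is exactly the drawback noted above.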

Many empirical under-sampling methods have been proposed that can be grouped under the name Neighborhood Cleansing Techniques. They are based on the noise model hypothesis, which treats the examples near the decision boundary between the two classes as noise to be eliminated.

Condensed Nearest Neighbor Rule Hart's Condensed Nearest Neighbor Rule (CNN) [22] is used to find a consistent subset of examples. A subset X̂ ⊆ X is consistent with X if a 1-Nearest Neighbor (1-NN) classifier that uses X̂ as its reference store correctly classifies every example in X. (A 1-NN classifier assigns each case the class of its closest stored example, so a case whose closest stored example belongs to a different class is misclassified.) An algorithm that creates such a subset X̂ from X as an under-sampling method is defined as follows: first, randomly draw one majority class example, take all examples from the minority class, and put these examples in X̂. Afterwards, use a 1-NN classifier over the examples in X̂ to classify the examples in X. Every misclassified example from X is moved to X̂. The idea behind this implementation of a consistent subset is to eliminate the majority class examples that are distant from the decision border, since such examples can be considered less relevant for learning. This method is effective only in data sets with little class overlap; when the overlap between the classes is high, no substantial reduction of the training set is achievable.
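The procedure above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the implementation from [22]: distances are Euclidean, the names `nearest_label` and `condensed_nn` are hypothetical, and the pass over the majority class is repeated until no example is added to the store (as in Hart's iterative formulation) so that the result is guaranteed to be consistent.

```python
import math
import random

def nearest_label(point, store):
    """Label of the stored example closest to `point` (1-NN, Euclidean)."""
    return min(store, key=lambda ex: math.dist(point, ex[0]))[1]

def condensed_nn(X, y, minority_label, seed=0):
    """Return the indices of a consistent subset of (X, y)."""
    rng = random.Random(seed)
    majority_idx = [i for i, lab in enumerate(y) if lab != minority_label]
    # Start with all minority examples plus one random majority example.
    in_store = {i for i, lab in enumerate(y) if lab == minority_label}
    in_store.add(rng.choice(majority_idx))
    changed = True
    while changed:  # assumption: repeat passes until the store is stable
        changed = False
        for i in majority_idx:
            if i in in_store:
                continue
            store = [(X[j], y[j]) for j in in_store]
            if nearest_label(X[i], store) != y[i]:
                in_store.add(i)  # misclassified, so move it into the store
                changed = True
    return sorted(in_store)

# Toy data: majority class 0, minority class 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3), (4, 4), (5, 5)]
y = [0, 0, 0, 0, 0, 1, 1]

kept = condensed_nn(X, y, minority_label=1)
print(kept)
```

In this toy set the majority examples in the cluster around the origin are largely redundant for a 1-NN rule, so most of them are left out of the store, while all minority examples are retained.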

Wilson's Edited Nearest Neighbor Rule (ENN) [23] ENN removes any example from the data set whose class label differs from the class of at least two of its three nearest neighbors.
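A minimal Python sketch of this rule is given below. Euclidean distance, tie-breaking by index order, and the function names `three_nearest` and `edited_nn` are assumptions made for the illustration. Neighbors are always looked up in the full, unedited data set.

```python
import math

def three_nearest(i, X):
    """Indices of the three examples closest to X[i], excluding X[i] itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:3]

def edited_nn(X, y):
    """Return the indices kept by ENN."""
    keep = []
    for i in range(len(X)):
        disagree = sum(1 for j in three_nearest(i, X) if y[j] != y[i])
        if disagree < 2:  # fewer than two of three neighbors disagree
            keep.append(i)
    return keep

# Two clean clusters plus one example (index 8) labeled 1 inside the 0 cluster.
X = [(0, 0), (0, 1), (1, 0), (1, 1),
     (5, 5), (5, 6), (6, 5), (6, 6),
     (0.5, 0.5)]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

keep = edited_nn(X, y)
print(keep)  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Only the example at index 8 is removed, because all three of its nearest neighbors carry the other label.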

Neighborhood Cleaning Rule Neighborhood Cleaning Rule (NCL) [24] modifies ENN in order to increase the data cleaning. For a two-class problem the algorithm can be described as follows: for each example X_i in the training set, its three nearest neighbors are found. If X_i belongs to the majority class and the classification given by its three nearest neighbors contradicts the original class of X_i, then X_i is removed. If X_i belongs to the minority class and its three nearest neighbors misclassify X_i, then the nearest neighbors that belong to the majority class are removed. This method is effective only in data sets with little class overlap; when the overlap between the classes is high, all majority class examples near the decision boundary are eliminated and the resulting training set yields a poor model for the majority class.
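The two-class rule above can be sketched as follows. The sketch assumes Euclidean distance, a majority vote among the three neighbors as "the classification given by its three nearest neighbors", and removal decisions computed on the original training set before anything is deleted; the names `three_nearest` and `neighborhood_cleaning` are hypothetical.

```python
import math

def three_nearest(i, X):
    """Indices of the three examples closest to X[i], excluding X[i] itself."""
    others = [j for j in range(len(X)) if j != i]
    return sorted(others, key=lambda j: math.dist(X[i], X[j]))[:3]

def neighborhood_cleaning(X, y, minority_label):
    """Return the indices kept by NCL for a two-class problem."""
    to_remove = set()
    for i in range(len(X)):
        neighbors = three_nearest(i, X)
        minority_votes = sum(1 for j in neighbors if y[j] == minority_label)
        predicted_minority = minority_votes >= 2  # majority vote of three
        if y[i] != minority_label and predicted_minority:
            # Majority example contradicted by its neighbors: remove it.
            to_remove.add(i)
        elif y[i] == minority_label and not predicted_minority:
            # Misclassified minority example: remove its majority neighbors.
            to_remove.update(j for j in neighbors if y[j] != minority_label)
    return [i for i in range(len(X)) if i not in to_remove]

# Majority class 0 (indices 0-4), minority class 1 (indices 5-8).
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5.3, 5.3),
     (5, 5), (5, 6), (6, 5), (2.5, 2.5)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1]

keep = neighborhood_cleaning(X, y, minority_label=1)
print(keep)  # [0, 5, 6, 7, 8]
```

The majority example at index 4 sits inside the minority cluster and is removed directly, while the isolated minority example at index 8 causes its three majority neighbors (indices 1, 2 and 3) to be removed. No minority example is ever deleted, which shows how quickly the majority class can shrink when the classes overlap.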

Tomek links Tomek links [25] consider the examples near the borderline to be more important. The method can be defined as follows: given two examples X_i and X_j belonging to different classes, and d(X_i, X_j) the distance between X_i and X_j, the pair (X_i, X_j) is called a Tomek link if there is no example X_l such that d(X_i, X_l) < d(X_i, X_j) or d(X_j, X_l) < d(X_i, X_j). If two examples form a Tomek link, then either one of them is noise or both lie on the borderline. Tomek links can be used as an under-sampling method or as a data cleaning method. As an under-sampling method, only the majority class examples that form Tomek links are eliminated; as a data cleaning method, the Tomek link examples from both classes are removed. This must be used with caution in highly imbalanced data sets with highly overlapped classes, because it may heavily reduce the majority class and thereby seriously affect the accuracy on the majority class.
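The definition translates almost literally into code. The following Python sketch assumes Euclidean distance for d; the names `tomek_links` and `remove_tomek` and the `both_classes` flag are hypothetical and only serve to show the two uses described above.

```python
import math

def tomek_links(X, y):
    """Return all index pairs (i, j) that form a Tomek link."""
    links = []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue  # a link needs two different classes
            d_ij = math.dist(X[i], X[j])
            closer_exists = any(
                math.dist(X[i], X[k]) < d_ij or math.dist(X[j], X[k]) < d_ij
                for k in range(n) if k != i and k != j
            )
            if not closer_exists:
                links.append((i, j))
    return links

def remove_tomek(X, y, majority_label, both_classes=False):
    """Indices kept after removing Tomek-link examples.

    both_classes=False: under-sampling (drop only majority members).
    both_classes=True:  data cleaning (drop both members of each link).
    """
    in_link = {k for pair in tomek_links(X, y) for k in pair}
    return [i for i in range(len(X))
            if i not in in_link
            or (not both_classes and y[i] != majority_label)]

# Majority class 0 (indices 0-3), minority class 1 (indices 4-6).
X = [(0, 0), (0, 1), (1, 0), (2, 2), (2.3, 2.3), (4, 4), (4, 5)]
y = [0, 0, 0, 0, 1, 1, 1]

print(tomek_links(X, y))                                    # [(3, 4)]
print(remove_tomek(X, y, majority_label=0))                 # [0, 1, 2, 4, 5, 6]
print(remove_tomek(X, y, majority_label=0, both_classes=True))  # [0, 1, 2, 5, 6]
```

The brute-force search compares every opposite-class pair against every other example, so its cost grows cubically with the number of examples.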

One-sided selection (OSS) OSS [26] is an under-sampling method resulting from the application of Tomek links followed by the application of CNN. Tomek links are used as an under-sampling method to remove noisy and borderline majority class examples. Borderline examples can be considered unsafe, since a small amount of noise can make them fall on the wrong side of the decision border. CNN then removes majority class examples that are distant from the decision border. The remaining examples, i.e. the retained majority class examples and all minority class examples, are used for learning.

CNN + Tomek links This is one of the methods proposed by Batista et al. [18]. It is similar to OSS, but the method to find the consistent subset is applied before the Tomek links. Their objective is to verify its competitiveness with OSS. Because finding Tomek links is computationally demanding, it is cheaper to perform that step on a reduced data set.
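OSS and CNN + Tomek links differ only in the order of the same two steps, which the following Python sketch makes explicit. It is an illustration under assumptions, not the implementation from [26] or [18]: distances are Euclidean, the CNN step repeats its pass until the store is stable, and all function names are hypothetical.

```python
import math
import random

def tomek_majority_removal(idx, X, y, majority_label):
    """Drop majority examples that form a Tomek link within `idx`."""
    def is_link(i, j):
        d_ij = math.dist(X[i], X[j])
        return not any(
            math.dist(X[i], X[k]) < d_ij or math.dist(X[j], X[k]) < d_ij
            for k in idx if k != i and k != j
        )
    linked = {i for i in idx for j in idx
              if y[i] == majority_label and y[j] != majority_label
              and is_link(i, j)}
    return [i for i in idx if i not in linked]

def cnn_subset(idx, X, y, majority_label, rng):
    """Consistent subset of `idx`: all minority examples, one random
    majority example, plus every majority example the store misclassifies."""
    majority = [i for i in idx if y[i] == majority_label]
    store = [i for i in idx if y[i] != majority_label]
    store.append(rng.choice(majority))
    changed = True
    while changed:
        changed = False
        for i in majority:
            if i in store:
                continue
            nearest = min(store, key=lambda j: math.dist(X[i], X[j]))
            if y[nearest] != y[i]:
                store.append(i)
                changed = True
    return sorted(store)

def one_sided_selection(X, y, majority_label, seed=0):
    """OSS order: Tomek links first, then CNN."""
    idx = tomek_majority_removal(list(range(len(X))), X, y, majority_label)
    return cnn_subset(idx, X, y, majority_label, random.Random(seed))

def cnn_then_tomek(X, y, majority_label, seed=0):
    """Reversed order: CNN first, then Tomek links on the reduced set."""
    idx = cnn_subset(list(range(len(X))), X, y, majority_label,
                     random.Random(seed))
    return tomek_majority_removal(idx, X, y, majority_label)

# Majority class 0 (indices 0-3), minority class 1 (indices 4-6).
X = [(0, 0), (0, 1), (1, 0), (2, 2), (2.3, 2.3), (4, 4), (4, 5)]
y = [0, 0, 0, 0, 1, 1, 1]

oss = one_sided_selection(X, y, majority_label=0)
cnn_tl = cnn_then_tomek(X, y, majority_label=0)
print(oss, cnn_tl)
```

In the reversed order the cubic-cost Tomek link search only sees the examples that survive CNN, which is the computational saving mentioned above. The exact subsets depend on which majority example CNN draws first, so the two orders do not necessarily return the same result.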

In document Complexity measurement for dealing with class imbalance problems in classification modelling : a thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy, Massey University, 2012 (Page 47-49)