5.2 Over-sampling using Complexity Measure (OSCM)
5.2.3 SMOTE using Complexity Measure (SCM)
In this chapter we propose an algorithm that combines the Synthetic Minority Over-sampling Technique (SMOTE) with the complexity measure. We use SMOTE to improve accuracy on the minority class, and we use the complexity measure so that accuracy over the entire data set is not sacrificed and so that the method can adapt to the different natures of data sets. The major goal is to model the minority class better by giving the classifier not only the minority class examples that were difficult to learn (those detected by the complexity measure), but also a broader representation of those observations. By focusing on the difficult minority class examples, we aim to improve the overall accuracy for the minority class.
In the literature, over-sampling schemes such as simple random over-sampling, SMOTE [11], and Borderline-SMOTE [17] give equal weight to every minority class example they over-sample. We believe that minority class examples should not all be over-sampled equally, but in proportion to their complexity (level of difficulty). Han et al. [17] presented a modification of the SMOTE technique [11] known as Borderline-SMOTE (BSM). BSM selects the minority class examples that are considered to lie on the border of the minority decision region in the feature space and applies SMOTE only to those instances, rather than over-sampling all examples or a random subset. Once again, however, all observations considered borderline are over-sampled equally. Moreover, if none of the nearest neighbors of a minority class example is from the same class, the example is regarded as noise and is not over-sampled; for more detail see [17]. In fact, in a severely imbalanced data set we will see the majority of the minority class examples completely surrounded by the majority class. If we ignore these examples during over-sampling, we give the classifier little additional information about them, which results in much poorer predictive performance on these examples, and we have to over-sample the remaining examples heavily to achieve a good predictive model for the minority class.
Since our aim is to reduce the bias against the minority class that is inherent in the learning procedure due to class imbalance, we need to increase the sampling weights for the minority class. By applying the SMOTE procedure in proportion to the complexity measurement, we are particularly interested in increasing the probability of selection for the difficult minority class cases that are dominated by majority class points. SMOTE has two parameters: the number of nearest neighbors (k) and the amount of over-sampling (N), both usually at the user's discretion. In SMOTE and Borderline-SMOTE the number of nearest neighbors (k) is fixed at 5, and a number of iterations are used to decide the optimal level of over-sampling (N). In our approach, once the threshold α for the complexity measure CM is decided, the rest of the procedure is automatic, i.e., no decision about the number of nearest neighbors or the amount of over-sampling is required. Unlike the existing over-sampling methods, our method over-samples (strengthens) only those examples that are very difficult to learn. The details of our procedure are outlined below.
First, calculate the complexity of each observation in the minority class. For those observations whose nearest neighbors are mostly from the other class, synthetic examples are generated in proportion to the number of neighbors from the other class and added to the original training set. As defined earlier, the whole training data is $TS$, the minority class is $TS_1$, and the majority class is $TS_0$, where
$TS_1 = \{TS_{11}, TS_{12}, \ldots, TS_{1n_1}\}$ and $TS_0 = \{TS_{01}, TS_{02}, \ldots, TS_{0n_0}\}$
Step 1 For every $TS_{1i}$ ($i = 1, 2, \ldots, n_1$) in the minority class $TS_1$, we calculate its $k$ nearest neighbors in the whole training data $TS$. The complexity of a minority class data point is measured as:
$$CM_{1i} = I\left(\frac{\text{number of patterns} \in TS_1 \text{ among the } k \text{ nearest neighbors of } TS_{1i} \text{ in } TS}{k} \le 0.5\right) \qquad (5.7)$$
The number of majority class examples among the $k$ nearest neighbors is denoted by $\acute{n}_0$.
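Step 2 In this step we generate $\acute{n}_0$ minority class examples using SMOTE. For each minority example $TS_{1i}$ with complexity $CM_{1i}$, we calculate its $k$ nearest neighbors in $TS_1$. Using SMOTE as defined in Section 5.2.2.2, we generate $CM_{1i} \cdot \acute{n}_0$ synthetic examples, which form the set $TS_{1SMOTE}$.
Step 3 $TS_{1new} = TS_1 \cup TS_{1SMOTE}$
Step 4 We repeat the above procedure for each $CM_{1i}$.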
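Steps 1 to 4 can be illustrated with a minimal sketch in plain Python. This is not the author's implementation. The function names (`k_nearest`, `scm_oversample`), the toy data, the use of Euclidean distance, and the choice to exclude a point from its own neighbor list are all assumptions made for illustration. The value of k is passed in directly rather than derived from equation 5.6, and the threshold is fixed at 0.5 as in equation 5.7.

```python
import math
import random

def k_nearest(points, pool, i, k):
    """Indices of the k points in `pool` closest to points[i], excluding i itself.
    Euclidean distance and self-exclusion are assumptions of this sketch."""
    others = [j for j in pool if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]

def scm_oversample(points, labels, k, rng):
    """Return (complexity, synthetic).
    complexity maps each minority index to (CM, n0_prime);
    synthetic is the list of SMOTE-style points generated for difficult examples."""
    everyone = list(range(len(points)))
    minority = [i for i in everyone if labels[i] == 1]
    complexity, synthetic = {}, []
    for i in minority:  # Step 4: repeat for every minority example
        # Step 1: neighbors in the whole training set, count majority ones (n0')
        nbrs = k_nearest(points, everyone, i, k)
        n0_prime = sum(1 for j in nbrs if labels[j] == 0)
        cm = 1 if (k - n0_prime) / k <= 0.5 else 0
        complexity[i] = (cm, n0_prime)
        # Step 2: neighbors within the minority class, then CM * n0' synthetic points
        same_class = k_nearest(points, minority, i, k)
        if not same_class:
            continue  # a lone minority example has nothing to interpolate with
        for _ in range(cm * n0_prime):
            j = rng.choice(same_class)
            gap = rng.random()  # position along the segment from point i to point j
            synthetic.append(tuple(a + gap * (b - a)
                                   for a, b in zip(points[i], points[j])))
    return complexity, synthetic

# Hypothetical toy data: label 1 = minority, label 0 = majority.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0),
          (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
labels = [1, 1, 1, 1, 0, 0, 0]

complexity, synthetic = scm_oversample(points, labels, k=3, rng=random.Random(0))
ts1_new = [points[i] for i in complexity] + synthetic  # Step 3: union with TS1
```

In this toy set only the minority point at (1.0, 1.0) has a majority-dominated neighborhood, so it alone receives synthetic examples, one per majority neighbor. The three clustered minority points have a complexity of 0 and are left untouched.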
It should be kept in mind that k, the number of nearest neighbors, has already been determined using equation 5.6. According to the definition of SMOTE, new synthetic data are generated along the line between a minority class example and its nearest neighbors from the same class. In another variation, we use simple over-sampling driven by our complexity measure of the minority class: instead of applying SMOTE to the difficult minority class examples, we replicate them in proportion to the number of majority class examples among their nearest neighbors.
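The simple over-sampling variation can be sketched in the same way. The code below is an illustrative assumption rather than the author's code. The name `replicate_difficult` is hypothetical, Euclidean distance is assumed, and each difficult minority example is copied once per majority class neighbor, which is one reading of "in proportion to".

```python
import math

def replicate_difficult(points, labels, k):
    """Duplicate each difficult minority example (label 1) n0' times, where n0'
    is the number of majority examples (label 0) among its k nearest neighbors."""
    copies = []
    for i, y in enumerate(labels):
        if y != 1:
            continue
        # k nearest neighbors in the whole training set, excluding the point itself
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        n0_prime = sum(1 for j in order[:k] if labels[j] == 0)
        # Difficult means at most half of the neighbors share the minority label
        if (k - n0_prime) / k <= 0.5:
            copies.extend([points[i]] * n0_prime)
    return copies

# Hypothetical toy data: label 1 = minority, label 0 = majority.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0),
          (1.1, 1.0), (1.0, 1.1), (1.1, 1.1)]
labels = [1, 1, 1, 1, 0, 0, 0]

copies = replicate_difficult(points, labels, k=3)
```

Unlike the SMOTE-based version, this variation adds no new locations in feature space. It only increases the weight of the difficult examples that already exist.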