2.2 Techniques for handling imbalanced class distribution
2.2.6 Sampling-based Methods
Sampling is one of the techniques dedicated to handling data sets with imbalanced classes; it is regarded as such because it was the first in which the imbalance ratio (IR) featured in the derivations and influenced the overall modelling results. The main idea behind sampling-based techniques is to balance the classes. This approach has become one of the most popular ways of handling imbalanced data because it is easy to use: it changes the number of data items per class, either by increasing the minority class, known as oversampling [94][95], or by reducing the majority class, known as undersampling.
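As a minimal sketch of the quantities involved, the snippet below computes the imbalance ratio (majority count divided by minority count) and the number of items that oversampling would add, or undersampling would remove, to balance two classes. The function names and the toy labels are illustrative assumptions, not part of the cited works.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return the majority-class count divided by the minority-class count."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes")
    return max(counts.values()) / min(counts.values())


def resampling_targets(labels):
    """Items to add (oversampling) or remove (undersampling) to reach IR = 1."""
    counts = Counter(labels)
    gap = max(counts.values()) - min(counts.values())
    return {"add_to_minority": gap, "remove_from_majority": gap}


# Hypothetical toy labels: 90 majority items and 10 minority items.
labels = ["neg"] * 90 + ["pos"] * 10
print(imbalance_ratio(labels))  # 9.0
print(resampling_targets(labels))
```

Both routes close the same gap; they differ in whether the data set grows (oversampling) or shrinks (undersampling).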
Oversampling
The oversampling technique was made popular by the pioneering work of [94] through a process called the Synthetic Minority Over-sampling Technique (SMOTE). It involves artificially generating data items to enlarge the minority class until the imbalance ratio (IR), which is the ratio of majority-class to minority-class items, is approximately 1. SMOTE data is generated according to Equation 2.11.
$$x_f = x_i + \mathrm{rand}(0,1)\,(x_j - x_i) \tag{2.11}$$
Here x_i is an original minority-class data item, x_j is one of the k nearest neighbours of x_i, x_f is the newly generated data item, and the factor multiplying (x_j − x_i) is a random number in (0, 1). Although SMOTE has clear advantages, particularly in addressing class imbalance, it also introduces issues such as misclassification cost [96], and some researchers have encountered overfitting, which stems from creating replicas of the same data and inheriting the errors intrinsic to it. Hence new approaches were needed, and various modifications of oversampling have been proposed. One example is Borderline-SMOTE [97], in which data items at the borderline of the k-nearest neighbours are oversampled. Another is random oversampling, used by [98], which chooses training data by random selection; although this method improved accuracy, it led to slow execution and overfitting on large data sets. A generative oversampling technique was used by [99], in which new data is created by learning from the training data. The created data thus retains the basic characteristics of the existing data, which maintains data integrity, but the accuracy improvement is limited because the characteristics of the training data are still preserved.
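The interpolation in Equation 2.11 can be sketched in a few lines. This is an illustrative sketch only, assuming Euclidean distance, points stored as tuples, and a hypothetical function name `smote_sketch`; it is not the reference implementation from [94].

```python
import math
import random


def k_nearest(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda o: math.dist(point, o))[:k]


def smote_sketch(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating as in Eq. (2.11)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x_i = minority[i]
        # Neighbours are searched among the other minority items only.
        others = minority[:i] + minority[i + 1:]
        x_j = rng.choice(k_nearest(x_i, others, k))
        gap = rng.random()  # random number in [0, 1)
        x_f = tuple(a + gap * (b - a) for a, b in zip(x_i, x_j))
        synthetic.append(x_f)
    return synthetic


# Hypothetical toy minority class with two features per item.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points)
```

Each synthetic item lies on the line segment between an original item and one of its neighbours, so the generated data stays inside the region already occupied by the minority class.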
Adaptive Synthetic Sampling
Adaptive Synthetic Sampling (ADASYN) is another popular oversampling technique. It differs from SMOTE in how it oversamples (generates) the minority data items. While SMOTE uses the k-nearest neighbours of the minority class to decide which data to produce, ADASYN uses a distribution of how difficult the minority-class items are to learn. The minority data items that are hardest to learn in the training data are therefore the ones that are oversampled (generated).
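One common way to express this "difficulty" is the share of majority-class items among the k nearest neighbours of each minority item; synthetic items are then allocated in proportion to that share. The sketch below illustrates only this allocation step, under that assumption; the function name `adasyn_allocation` and the toy points are hypothetical.

```python
import math


def adasyn_allocation(minority, majority, n_new, k=5):
    """Split n_new synthetic items across minority points by learning difficulty."""
    # Label 0 marks minority items, label 1 marks majority items.
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for idx, x_i in enumerate(minority):
        # All points except x_i itself, ranked by distance to x_i.
        others = labelled[:idx] + labelled[idx + 1:]
        nearest = sorted(others, key=lambda item: math.dist(x_i, item[0]))[:k]
        # Difficulty: share of majority-class items among the k neighbours.
        ratios.append(sum(label for _, label in nearest) / k)
    total = sum(ratios)
    if total == 0:
        return [0] * len(minority)  # no minority item has majority neighbours
    # Rounding means the counts may not sum to n_new exactly.
    return [round(n_new * r / total) for r in ratios]


# Hypothetical toy data: three "easy" minority items and one near the majority.
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (2.0, 2.0)]
majority = [(2.1, 2.0), (2.0, 2.1), (1.9, 2.0), (5.0, 5.0)]
print(adasyn_allocation(minority, majority, n_new=10, k=2))
```

In this toy example only the last minority item is surrounded by majority items, so it receives the whole allocation; the synthetic items themselves could then be generated by interpolation as in SMOTE.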
Undersampling
An alternative technique, undersampling, is the opposite of oversampling: it reduces the number of majority-class data items to balance the classes in the data set. These methods have also attracted keen research interest in academia. [100] presented two methods of undersampling, random and informative: random undersampling chooses and eliminates data from the existing class until the classes are balanced, while informative undersampling eliminates observations from the data set based on a pre-selected criterion to achieve balance. [101] used a process known as active undersampling, which discards the data items that are far away from the decision boundary. These sampling methods have performance problems with large data sets and can remove important data items. Multiple resampling techniques were employed by [44], as they provide better tuning results with every cycle of resampling.
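Random undersampling, the simpler of the two variants described above, can be sketched as follows. This is a minimal illustration assuming in-memory lists and a hypothetical function name `random_undersample`; it is not the procedure of [100].

```python
import random
from collections import Counter


def random_undersample(items, labels, seed=0):
    """Randomly drop items until every class matches the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    keep = []
    for cls in counts:
        idx = [i for i, y in enumerate(labels) if y == cls]
        # Keep a random subset of `target` items from each class.
        keep.extend(rng.sample(idx, target))
    keep.sort()  # preserve the original ordering of the retained items
    return [items[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: 8 majority items and 2 minority items.
items = list(range(10))
labels = ["neg"] * 8 + ["pos"] * 2
kept_items, kept_labels = random_undersample(items, labels)
print(Counter(kept_labels))
```

The discarded majority items are chosen blindly, which is why this variant can remove informative observations; an informative variant would replace `rng.sample` with a selection based on a chosen criterion.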
[102] proposed a way of integrating oversampling with cross-validation to improve general performance. Cluster sampling has also been used by [103], which introduces cluster density and boundary density thresholds to determine the cluster and sampling boundary. [104] used a method called Bi-directional Sampling based on K-Means clustering, which performed very well on data with too much noise and few samples. Each sampling technique has its pros and cons, which are subjective and depend on the context of application and usage [105].
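One plausible way to combine oversampling with cross-validation is to resample only the training part of each fold and leave the held-out part untouched, so that no duplicated item leaks into evaluation. The sketch below illustrates that idea with random oversampling; it is an assumption about how such an integration might look, not the method of [102], and the function names are hypothetical.

```python
import random
from collections import Counter


def random_oversample(indices, labels, rng):
    """Duplicate randomly chosen indices of smaller classes until balanced."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced.extend(members + extra)
    return balanced


def cv_with_oversampling(labels, n_folds=5, seed=0):
    """Yield (balanced_train_indices, untouched_test_indices) for each fold."""
    rng = random.Random(seed)
    order = list(range(len(labels)))
    rng.shuffle(order)
    # Simple (non-stratified) split: every n_folds-th shuffled index per fold.
    folds = [order[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        test = folds[f]
        train = [i for g in range(n_folds) if g != f for i in folds[g]]
        yield random_oversample(train, labels, rng), test


# Hypothetical toy labels: 40 majority items and 10 minority items.
labels = ["neg"] * 40 + ["pos"] * 10
for train_idx, test_idx in cv_with_oversampling(labels):
    print(Counter(labels[i] for i in train_idx), len(test_idx))
```

A model would be fitted on `train_idx` and scored on `test_idx` inside the loop; the split here is not stratified, so a stratified split may be preferable when the minority class is very small.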
A technique that results in improved performance in one context might not show the same performance in a different one. Researchers have therefore continued to present modifications and improvements to existing sampling techniques based on local properties of the data set. For instance, some undersampling methods use the mean of the attribute values as the metric for deriving the sampled data [106]. One of the main disadvantages of oversampling is the risk of overfitting caused by generating replicas of existing data [107]. For undersampling, the main disadvantage is the possibility of discarding data that might carry potentially useful information, particularly during variable selection that is cross-dependent on other variables, or when the data item is far from the central means of the attribute values.