Class imbalance, sometimes referred to as unbalanced data, occurs when a dataset does not have an (approximately) equal number of instances from each class, and the imbalance may be quite severe in some applications [65]. Unbalanced data research focuses on situations where the class balance is nowhere near 50%; rather, the minority class makes up only a small fraction of the data (1%, 2%, 5%, 10%, etc.). Datasets with severe class imbalance require that an algorithm's figure of merit be addressed in a different manner than if there were only concept drift in the data. Consider datasets in which the minority class makes up 1%, 2%, and 5% of the instances, and suppose a classifier is generated that yields accuracies of 99%, 98%, and 95%, respectively. The classifier used in this simple example can be a majority class classifier, which simply classifies a new instance with the label of the class that occurs most often. The majority class classifier will have a high overall accuracy, which is generally a good quality; however, it will not be able to identify any of the instances that belong to the minority class. From this very generic example, it is clear that error is not the best statistic for identifying how well the algorithm is performing across all classes. Therefore, researchers dealing with imbalanced datasets will typically present results with statistics other than overall error to assess an algorithm on an experiment. Before analyzing measures other than error, let us focus on issues associated with learning from imbalanced data.
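The majority class classifier example above can be reproduced in a few lines. The following is a minimal sketch, assuming a hypothetical dataset of 1000 instances with a 1% minority class (label 1); the function names are illustrative and do not come from any particular library.

```python
from collections import Counter


def majority_class_classifier(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    return lambda instance: majority_label


def accuracy(y_true, y_pred):
    """Fraction of all instances that were labelled correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, label):
    """Fraction of the instances of `label` that were predicted as `label`."""
    relevant = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in relevant) / len(relevant)


# Hypothetical dataset: 10 minority instances (1) and 990 majority instances (0).
labels = [1] * 10 + [0] * 990
predict = majority_class_classifier(labels)
predictions = [predict(None) for _ in labels]  # the instance itself is ignored

overall_accuracy = accuracy(labels, predictions)   # high: 990 of 1000 correct
minority_recall = recall(labels, predictions, 1)   # no minority instance is found
majority_recall = recall(labels, predictions, 0)   # every majority instance is found
```

Reporting the per-class recall alongside the overall accuracy exposes the failure that the accuracy alone hides.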
2.3.1 Why Do Classifiers Perform Poorly on a Minority Class?
Class imbalance arises from the under-representation of at least one class in a learning problem. Unfortunately, the minority class is the target class for many classification tasks. Recall the example of credit card fraud in Chapter 1: the number of legitimate transactions dwarfs the number of fraudulent transactions. How, then, can the fraudulent transactions be learned when there are so few instances, and how can the classifier resist biasing towards the majority class? The fraudulent class in this scenario is difficult to learn for a couple of reasons:
• there are so few instances that there may not be a clear representation of the minority
class feature space
• many classification algorithms tend to minimize an error function, which may not
favor learning a minority class.
The first point is obvious and is partially a motivation for incremental learning; however, the second point is worth further discussion. Classifiers typically minimize a global error function during the training process and do not take information about the class distribution of the data into account. As a result, instances from the majority class are classified with high accuracy, whereas examples from the minority class tend to be misclassified. For example, algorithms such as the multi-layer perceptron neural network (MLPNN) minimize an error function, generally mean-squared error (MSE), during the training phase of the neural network [66]. So, if a classifier is likely to bias its decisions towards a majority class, what can be done to reduce this effect? Sampling methods, cost-sensitive learning models, and ensembles have demonstrated favourable qualities for avoiding bias towards the majority class, as discussed in Sections 2.3.2, 2.3.3, and 2.3.4.
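To illustrate why a global error function may not favor the minority class, the sketch below compares two hypothetical sets of model outputs under MSE. The targets, the two "models", and their error counts are assumptions chosen purely for illustration; no actual network is trained.

```python
def mean_squared_error(targets, outputs):
    """Global MSE: every instance contributes equally, regardless of its class."""
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)


# Hypothetical targets: 10 minority instances (1.0) followed by 990 majority (0.0).
targets = [1.0] * 10 + [0.0] * 990

# Model A ignores the minority class and outputs 0.0 for every instance.
outputs_a = [0.0] * 1000

# Model B detects all 10 minority instances but mislabels 20 majority instances.
outputs_b = [1.0] * 10 + [1.0] * 20 + [0.0] * 970

mse_a = mean_squared_error(targets, outputs_a)  # 10 errors out of 1000
mse_b = mean_squared_error(targets, outputs_b)  # 20 errors out of 1000
```

Under these assumptions the global error prefers Model A, even though Model B is the only one that identifies any minority instance.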
2.3.2 Sampling Methods
There are several popular methods of handling class imbalance, and they generally operate at either the data level or the algorithmic level [67]. The data level approaches generally employ some form of sampling to generate a new dataset, similar to the use of bootstrap datasets with ensembles. Simple data level approaches use random over-sampling of a minority class or under-sampling of the majority class to reduce the imbalance. Random sampling must be done with care, because both forms come with repercussions. Random under-sampling discards instances from the majority class, and therefore risks discarding information that can be useful to the classification problem. Over-sampling, on the other hand, does not discard majority class data; rather, it adds exact replicates of minority class instances. Using this approach to re-balance a dataset runs the risk of generating a classifier that overfits the minority class.
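Random under-sampling and over-sampling can be sketched as follows. This is a minimal illustration, assuming two-dimensional synthetic data and a target of equal class sizes; in practice the target ratio is a design choice and the imbalance need only be reduced, not removed. All names are hypothetical.

```python
import random


def random_under_sample(majority, minority, rng):
    """Keep all minority instances and an equally sized random majority subset.

    The majority instances that are not drawn are discarded.
    """
    kept_majority = rng.sample(majority, k=len(minority))
    return kept_majority, list(minority)


def random_over_sample(majority, minority, rng):
    """Keep all majority instances and add exact replicates of minority
    instances (drawn with replacement) until both classes have equal size."""
    n_extra = len(majority) - len(minority)
    replicates = [rng.choice(minority) for _ in range(n_extra)]
    return list(majority), list(minority) + replicates


# Hypothetical data: 95 majority and 5 minority instances.
rng = random.Random(0)
majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(95)]
minority = [(rng.gauss(2.0, 1.0), rng.gauss(2.0, 1.0)) for _ in range(5)]

under_majority, under_minority = random_under_sample(majority, minority, rng)
over_majority, over_minority = random_over_sample(majority, minority, rng)
```

Note that the over-sampled minority set contains no new points, only repeated copies of the original five, which is the source of the overfitting risk described above.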
Synthetic sampling can be used to reduce the undesirable qualities of random over/under-sampling of minority/majority class data. Synthetic sampling methods generally over-sample a dataset; however, the instances added to the new dataset are synthetic, and not exact replicates of the minority data as in random over-sampling. The synthetic data are generated such that they are "similar" to other instances in the minority class population. Some synthetic sampling methods focus not only on the generation of synthetic data but also on the location of the synthetic instances. For example, synthetic sampling methods may generate synthetic instances that lie near a decision boundary. Synthetic sampling methods have been shown to be less prone to over-fitting classifiers to the minority class.
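One common way to generate instances that are "similar" to the minority population is to interpolate between a minority instance and one of its nearest minority neighbours. The sketch below assumes this SMOTE-style interpolation; the text above does not commit to a specific method, so the function name, the parameter `k`, and the data are illustrative assumptions.

```python
import math
import random


def synthesize_minority(minority, n_synthetic, k, rng):
    """Create synthetic points on the line segment between a minority instance
    and one of its k nearest minority neighbours (SMOTE-style interpolation)."""
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # Rank the other minority instances by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority class: 10 two-dimensional instances.
rng = random.Random(0)
minority = [(rng.gauss(2.0, 0.5), rng.gauss(2.0, 0.5)) for _ in range(10)]
synthetic = synthesize_minority(minority, n_synthetic=40, k=3, rng=rng)
```

Because each synthetic point lies between two existing minority instances, the new data stay within the region occupied by the minority class without repeating any instance exactly.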
Over/under-sampling or synthetic sampling does not guarantee that the minority class can be adequately learned. Rather, sampling is a fast, cost-effective method that generally increases performance on a minority class. Studies have shown that some classifiers on particular datasets are not affected by sampling methods. Regardless, many imbalanced datasets benefit from sampling. Popular sampling approaches can be found in Section 3.3.1.
2.3.3 Cost Sensitive Learning
The sampling methods described in the previous section attempt to develop a new dataset that contains less class imbalance. Note that sampling methods may not convert the imbalanced learning problem into one that is balanced; rather, they create a less imbalanced learning problem. Cost-sensitive learning algorithms instead assign penalties based on a cost matrix, which represents the penalties for the different possible correct/incorrect classifications; that is, the cost matrix is a numerical representation of the penalty of classifying examples from one class as another. The objective of cost-sensitive learning is to generate a classifier that minimizes the overall cost, not the error, on the training dataset; this cost is usually the Bayes conditional risk [1, 7]. Cost-sensitive learning problems generally lend themselves better to theoretical analysis than sampling methods.
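The role of the cost matrix can be illustrated with a minimum conditional risk decision. The sketch below assumes the convention that `cost[i][j]` is the penalty for predicting class i when the true class is j, and it uses hypothetical cost values and class posteriors for the credit card fraud example; none of these numbers come from the cited works.

```python
def conditional_risk(cost, posteriors, predicted):
    """Expected cost of predicting `predicted`, given the class posteriors."""
    return sum(cost[predicted][true] * p for true, p in enumerate(posteriors))


def min_cost_decision(cost, posteriors):
    """Choose the class whose prediction has the lowest conditional risk."""
    return min(
        range(len(cost)),
        key=lambda i: conditional_risk(cost, posteriors, i),
    )


# Hypothetical cost matrix; class 0 = legitimate, class 1 = fraudulent.
cost = [
    [0.0, 50.0],  # predict legitimate: no cost if correct, 50 if it was fraud
    [1.0, 0.0],   # predict fraudulent: 1 if it was legitimate, no cost if correct
]

# Hypothetical posteriors P(legitimate | x) and P(fraudulent | x) for one instance.
posteriors = [0.95, 0.05]

# Minimum error rule: pick the most probable class.
min_error_label = max(range(len(posteriors)), key=lambda j: posteriors[j])

# Minimum cost rule: pick the class with the lowest conditional risk.
min_cost_label = min_cost_decision(cost, posteriors)
```

With these assumed values the minimum error rule labels the instance as legitimate, while the minimum cost rule labels it as fraudulent, because a missed fraud is penalized far more heavily than a false alarm.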
2.3.4 Ensemble Methods
Ensemble methods are not only popular for reducing the error of the final hypothesis, but are also employed to learn an under-represented class. The last two sections focused on using sampling or cost-sensitive learning to increase the performance on a minority class. Ensembles are widely used for learning under class imbalance by combining multiple classifiers with sampling and cost-sensitive learning. Several existing ensemble techniques minimize the overall cost during training, not the error. Several different ensemble methods are discussed in more depth in the literature review of class imbalance presented in the next chapter.
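As a rough illustration of combining multiple classifiers with sampling, the sketch below trains each ensemble member on all minority instances plus a different random under-sample of the majority class, then combines the members by majority vote. This is a generic construction and not one of the specific ensemble methods reviewed in the next chapter; the nearest-centroid base learner and the synthetic data are assumptions made to keep the example self-contained.

```python
import random


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def train_nearest_centroid(majority, minority):
    """Illustrative base learner: predict 1 (minority) if the instance is
    closer to the minority centroid than to the majority centroid."""
    c_maj, c_min = centroid(majority), centroid(minority)

    def predict(x):
        d_maj = sum((a - b) ** 2 for a, b in zip(x, c_maj))
        d_min = sum((a - b) ** 2 for a, b in zip(x, c_min))
        return 1 if d_min < d_maj else 0

    return predict


def train_balanced_ensemble(majority, minority, n_members, rng):
    """Train each member on all minority data and an equally sized,
    independently drawn random subset of the majority data."""
    members = []
    for _ in range(n_members):
        subset = rng.sample(majority, k=len(minority))
        members.append(train_nearest_centroid(subset, minority))
    return members


def majority_vote(members, x):
    """Label x as minority (1) if more than half of the members vote for it."""
    votes = sum(member(x) for member in members)
    return 1 if 2 * votes > len(members) else 0


# Hypothetical data: 200 majority and 10 minority instances.
rng = random.Random(0)
majority = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(200)]
minority = [(rng.gauss(4.0, 1.0), rng.gauss(4.0, 1.0)) for _ in range(10)]
ensemble = train_balanced_ensemble(majority, minority, n_members=11, rng=rng)
```

Each member sees a balanced training set, while the ensemble as a whole still draws on many different majority instances, which limits the information loss of a single under-sample.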