1.3 Research Questions

1.3.2 Ensemble Methods for Class Imbalance Learning

A variety of ensemble methods based on Bagging and Boosting have been proposed to deal
with class imbalance problems, and they show superiority in certain applications. Although
their standard forms are not very effective in imbalanced scenarios, they can be easily
modified to work with other techniques, such as oversampling and undersampling, in
order to emphasize small classes. The common strategy for Bagging-based methods is
to undersample data from the majority class and then build component classifiers from
relatively balanced training subsets. AdaBoost, the best-known member of the Boosting
family (Schapire, 2002), is often modified by applying an advanced oversampling
strategy at each step of sequential training. It has also been made cost-sensitive by
manipulating its weight-updating rule so that rare examples are assigned higher costs
than common ones (Sun et al., 2007).
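
The Bagging-based strategy described above can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the helper name `balanced_bootstrap`, the toy labels and the choice of sampling every class down to the size of the smallest class are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return the indices of one relatively balanced training subset.

    Hypothetical helper: every class contributes as many examples as the
    smallest class contains, drawn with replacement, so that the majority
    class is undersampled.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    smallest = min(len(indices) for indices in by_class.values())
    subset = []
    for indices in by_class.values():
        subset.extend(rng.choices(indices, k=smallest))
    return subset


# Assumed toy data: 90 majority (0) and 10 minority (1) examples.
labels = [0] * 90 + [1] * 10
rng = random.Random(0)

# One subset per component classifier of the ensemble.
subsets = [balanced_bootstrap(labels, rng) for _ in range(5)]
```

Each component classifier would then be trained on one of these subsets. Every subset leaves out most majority-class examples, which is the source of the information loss discussed for undersampling.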

However, both resampling-based and cost-sensitive ensemble solutions suffer from some
known drawbacks. Undersampling majority-class examples can discard potentially
useful information (He and Garcia, 2009). Besides, deciding how many and which examples
should be removed is always an issue. Advanced oversampling techniques often
require careful parameter settings to manipulate training data before use. For example,
some data generation methods were reported to suffer from the over-generalization
problem (He and Garcia, 2009), depending on the setting for creating synthetic examples. The
major reason for these problems is that these techniques address “imbalance” by changing
the training data directly, which can be risky. Cost-sensitive methods do not work at
the data level, but demand clear cost information about the classes prior to learning, which is not

available in most real-world applications. Generally speaking, it is always a crucial and

for these methods.

Hence, we are motivated to seek an alternative way to overcome the problems of
existing ensemble methods. Meanwhile, it should have good generalization, especially
on the minority class. Encouraged by the results in chapter 3, where diversity shows a
positive impact on the minority-class and overall performance, we ask if and how we
can take advantage of ensemble diversity to better deal with class imbalance problems.
Following this line, we devote special attention to negative correlation learning (NCL),
a successful ensemble technique that encourages diversity explicitly during training and
presents very good generalization ability with a solid theoretical grounding (Liu and Yao,
1999a,b; Brown et al., 2005). A further related question is “Can NCL methods be good
solutions to class imbalance problems?”. At least, they have the advantage of improving
generalization without changing the training data or requiring much prior data knowledge.
There has been very little work on this topic so far, and this thesis sets out to answer the question.
On the one hand, it gives a further understanding of the effect of ensemble diversity in solving
a specific and important type of classification problem. On the other hand, it provides a
new way of dealing with real-world problems that suffer from the class imbalance difficulty.
It opens up a practical and novel use of ensemble learning algorithms.
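
To make concrete how NCL encourages diversity explicitly, the sketch below computes the correlation penalty in the form usually given for NCL with real-valued outputs: each member's squared error is extended by a term lambda * p_i, where p_i = (f_i - f_mean) * sum over j != i of (f_j - f_mean). This section does not spell the formula out, so the exact form, the function names and the toy numbers are assumptions for illustration. The sketch covers the regression-style penalty only and is not AdaBoost.NC.

```python
def ncl_penalties(outputs):
    """Correlation penalty p_i for every ensemble member on one example.

    p_i = (f_i - f_mean) * sum_{j != i} (f_j - f_mean)
    """
    mean = sum(outputs) / len(outputs)
    deviations = [f - mean for f in outputs]
    total = sum(deviations)
    # total - d is the sum of the other members' deviations.
    return [d * (total - d) for d in deviations]


def ncl_errors(outputs, target, lam):
    """Per-member error: squared error plus the weighted penalty."""
    penalties = ncl_penalties(outputs)
    return [
        0.5 * (f - target) ** 2 + lam * p
        for f, p in zip(outputs, penalties)
    ]


# Assumed toy case: three members predict one example whose target is 1.0.
outputs = [0.2, 0.6, 1.0]
penalties = ncl_penalties(outputs)
errors = ncl_errors(outputs, target=1.0, lam=0.5)
```

Because the deviations from the ensemble mean sum to zero, each penalty equals minus the squared deviation of that member from the mean. Minimising the penalised error therefore rewards members for moving away from the ensemble mean, with `lam` controlling the trade-off against individual accuracy.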

In chapter 5, we explore and exploit diversity through NCL methods to facilitate class
imbalance learning with comprehensive and systematic analyses. In particular, we study
in depth the effectiveness of AdaBoost.NC, a new NCL method for classification ensembles,
in the presence of imbalanced data. The algorithm is proposed in chapter 4, and more
background and motivation for it can be found in the next section.

Another emerging research area in class imbalance learning is concerned with multi-class
imbalance problems, where multiple minority and/or majority classes exist in the data.
Most current efforts are focused on two-class tasks. In practice, many applications have
more than two classes with uneven class distributions, such as protein fold classification
(Zhao et al., 2008; Chen et al., 2006; Tan et al., 2003) and weld flaw classifica-

not observed in two-class problems and have not been addressed so far. Many useful
techniques for two-class tasks were found ineffective on multi-class tasks (Zhou and Liu,
2006b). More investigations are necessary to explain what problems multiple classes can cause
and how they affect the classification performance. Among the limited solutions for multi-class
imbalance problems, most attention in the literature has been devoted to class decomposition,
which breaks the whole problem into several binary sub-problems. This simplifies
the problem. However, each individual classifier is trained without full data knowledge,
which can cause classification ambiguity or uncovered data regions with respect to each type
of decomposition (Tan et al., 2003; Jin and Zhang, 2007; Valizadegan et al., 2008). It is
desirable to develop a more effective method without increasing the number of learning problems.
Given the current progress on this topic, we would like to explore new approaches to
tackling the multi-class difficulties in class imbalance learning effectively and directly.
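
Class decomposition can be illustrated with a minimal sketch of one common scheme, one-against-all. The scheme, the helper name `one_vs_rest` and the toy class sizes are assumptions chosen for illustration; the cited works consider several decomposition types.

```python
from collections import Counter


def one_vs_rest(labels):
    """Map each class to the binary labels of its own sub-problem.

    In the sub-problem of class c, examples of c are labelled 1 and all
    other classes are merged into a single "rest" class labelled 0.
    """
    return {
        c: [1 if y == c else 0 for y in labels]
        for c in sorted(set(labels))
    }


# Assumed toy data: three classes with uneven sizes.
labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 10
problems = one_vs_rest(labels)

# (positive count, negative count) of every binary sub-problem.
sizes = {
    c: (Counter(binary)[1], Counter(binary)[0])
    for c, binary in problems.items()
}
```

Each binary classifier sees only "its class versus the rest" and loses the distinctions among the merged classes. In this toy case the sub-problem for the smallest class is also more imbalanced (10 against 90) than any pair of original classes.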

In chapter 6, we study the impact of two types of multi-class setting, i.e. multi-minority and
multi-majority, on the performance of two basic resampling strategies that are widely
used in two-class imbalance problems. Based on the results, we propose to use the best
NCL strategy obtained in chapter 5 to handle multi-class imbalance problems.

1.3.3 Negative Correlation Learning for Classification Ensem-