Using Ensemble Classifiers to Solve Multi-class Classification Problems

This section provides a review of “ensemble” methods for solving the multi-class classification problem. An ensemble model is a composite model comprising a number of learners (classifiers), called base learners or weak learners, that are used together to obtain better classification performance than can be obtained using a single “stand-alone” model. Classification algorithms such as decision trees, Naive Bayes, CARM and neural networks can be utilised to generate the base classifiers. If the base learners in an ensemble model all use the same classification algorithm, the ensemble model is referred to as a homogeneous learner; when different classification algorithms are used, it is referred to as a heterogeneous learner [109]. In general, most ensemble methods are categorised as homogeneous learners [109]. Many researchers [47, 49, 57, 78] have demonstrated that generating a “good” ensemble requires base classifiers that tend to make errors on different groups of examples.

Much research has been directed at ensemble classification because of the potential benefits of the method with respect to classification effectiveness. The history of ensemble methods goes back to 1977, when the idea of an ensemble made up of two linear regression models was reported in [97]. More recently, Luo and Liu [66] reported that the work of Hansen and Salamon [47], using ensembles of neural networks, was the most significant in the context of better performance and reduced generalisation error (see footnote 4). Many researchers have demonstrated that using multiple classifiers reduces the generalisation error [8, 29, 77, 85]. In addition, theoretical evidence that bias error can be reduced by using ensembles of classifiers was presented in [7]. Work on a novel multi-strategy (hybrid) ensemble that combined a number of ensemble approaches, reported in [101], noted that ensembles of ensembles were more accurate than their component ensembles.

Although many researchers have demonstrated that ensembles often outperform their “base classifiers” when used on their own [27, 44, 79, 109], few have provided a reasonable answer to the question “why are ensembles superior to stand-alone classifiers?” A suggested answer was provided in [79], which related the better performance of ensembles over single classifiers to the use of all available classification information. A more comprehensive answer was provided by Dietterich in [27], who considered the answer under the following three headings: (i) statistical, (ii) computational and (iii) representational. More specifically:

1. Statistical reason. The nature of the data is such that it is often not possible to choose a particular classification model; there are often many different competing classification models that provide the same accuracy on the dataset. Consequently, combining these classifiers produces an average result that is better than that of the individual classifiers. This avoids the risk of choosing the wrong classifier and circumvents the unrelated errors of individual classifiers.

2. Computational reason. Using ensembles avoids fruitless, and computationally expensive, searches for the “best” classifier.

3. Representational reason. It is assumed that a given learning algorithm is looking for a “best” hypothesis within the hypothesis space. In most machine learning applications the hypothesis space might not contain the true target function; however, adopting an ensemble approach can produce a good approximation.

Footnote 4. Generalisation: “The most central concept in machine learning, which characterises how well the result learned from a given training dataset can be applied to unseen new data” [109].
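The statistical argument can be illustrated with a small calculation. The sketch below is not taken from [27]; it assumes a two-class problem, an odd number of base classifiers that all have the same accuracy, and fully independent errors (a strong simplification that real ensembles only approximate). The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p_correct: float) -> float:
    """Probability that a strict majority of independent classifiers is correct.

    Assumes a binary problem and that every classifier is correct with the
    same probability p_correct, independently of the others.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    majority = n_classifiers // 2 + 1
    # Sum the binomial probabilities of k = majority .. n correct votes.
    return sum(
        comb(n_classifiers, k)
        * p_correct ** k
        * (1 - p_correct) ** (n_classifiers - k)
        for k in range(majority, n_classifiers + 1)
    )


for n in (1, 5, 21):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these assumptions the combined accuracy rises with the number of classifiers when each one is better than chance, and falls when each one is worse than chance. This is one reason the errors of the base classifiers need to be as unrelated as possible.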

Given the consensus that the ensemble concept is a general methodology for improving the accuracy of “stand-alone” classification algorithms, the ensemble approach can be employed in all areas where classification techniques can be applied. Examples of application domains where ensembles have been used include: text categorisation [73], bioinformatics [106] (due to their ability to deal with high dimensionality and complex data structures), manufacturing [69], e-learning evaluation systems [54] and medical diagnosis [93].

According to Rokach [89], four main factors can be used to characterise the various ensemble methods:

1. Inter-classifier relationship. This refers to the relationships between the classifiers forming the ensemble and how these classifiers affect each other. Two main types of ensemble can be differentiated: concurrent (parallel) and sequential (cascading). The hierarchical ensemble, which is a much more recent approach, can be considered as a special case of a sequential ensemble. Most proposed ensemble models fall into the concurrent category. In a concurrent ensemble the classifiers are independent and their results are combined using some combination scheme (see factor 2 below). In a sequential ensemble the classifiers are arranged sequentially (or hierarchically). More details concerning concurrent, sequential and hierarchical ensembles are provided in the following sub-sections.

2. Adopted combination scheme. Regardless of how an ensemble system might be configured, an important issue is how results are combined to produce a final classification. The simplest approach is to use some kind of voting system [8]. Voting algorithms can be divided into two types: those that adaptively change the distribution of the training set based on the performance of previous classifiers (as in boosting methods) and those that do not (as in bagging). Averaging is another scheme for combining the results of several classifiers, and is suitable for use with classifiers that generate (say) confidence or probability values. A more complicated combination method can be adopted that utilises the concept of a “meta learner”, such as stacking [105]. Stacking is usually used to combine models of different types; however, it is not widely used.

3. Ensemble size. This refers to the number of classifiers forming the ensemble. A number of issues should be taken into consideration here: (i) accuracy, (ii) computational complexity and (iii) the number of available processors. Some researchers have claimed that the usage of large numbers of classifiers improves classification accuracy [47]; however, this is clearly not true with respect to the disjoint partitioning methods, where, if the subset sizes are too small, insufficient information will be available for learning effective classifiers with which to populate the ensemble [89].

4. Diversity. The concept of the diversity of an ensemble refers to the generation of a set of base classifiers that are as diverse as possible so that they will produce uncorrelated errors; it is suggested that consequently a better overall effectiveness (classification accuracy) can be obtained [51]. The simplest way to obtain a diversified ensemble is to use different representations of the training data, in other words to manipulate the training examples, as in bagging where each classifier is learned using a different subset of the original training data. Manipulating the attribute set is another way of obtaining diversity; however, it is not commonly used. The idea is to assign a different attribute set to each classifier [89].
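Several of the factors above (a concurrent arrangement, diversity obtained by manipulating the training examples, and combination by voting or averaging) can be sketched together in a few lines. The example below is a minimal illustration under stated assumptions, not an implementation from the cited literature: the nearest-centroid base learner, the score normalisation, the synthetic three-class data and all function names are hypothetical choices made to keep the sketch self-contained.

```python
import random
from collections import Counter


def train_centroid_classifier(samples):
    """Hypothetical base learner: store the mean feature vector of each class."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_scores(model, features):
    """Return normalised per-class scores; closer centroids score higher."""
    raw = {}
    for label, centroid in model.items():
        dist_sq = sum((f - c) ** 2 for f, c in zip(features, centroid))
        raw[label] = 1.0 / (1.0 + dist_sq)
    total = sum(raw.values())
    return {label: score / total for label, score in raw.items()}


def bagging_ensemble(samples, n_classifiers, rng):
    """Train each base classifier on a bootstrap sample drawn with replacement."""
    models = []
    for _ in range(n_classifiers):
        bootstrap = [rng.choice(samples) for _ in samples]
        models.append(train_centroid_classifier(bootstrap))
    return models


def majority_vote(models, features):
    """Each classifier casts one vote for its top-scoring class."""
    votes = Counter()
    for model in models:
        scores = predict_scores(model, features)
        votes[max(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0]


def average_scores(models, features):
    """Average the per-class scores over all classifiers, then pick the best."""
    totals = Counter()
    for model in models:
        for label, score in predict_scores(model, features).items():
            totals[label] += score / len(models)
    return max(totals, key=totals.get)


# Synthetic, well-separated three-class data (an assumption for illustration).
rng = random.Random(0)
centres = {"a": (0.0, 0.0), "b": (5.0, 5.0), "c": (0.0, 5.0)}
training = [
    ((cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)), label)
    for label, (cx, cy) in centres.items()
    for _ in range(20)
]
ensemble = bagging_ensemble(training, n_classifiers=7, rng=rng)
print(majority_vote(ensemble, (4.8, 5.1)), average_scores(ensemble, (4.8, 5.1)))
```

The base classifiers here are trained independently of one another, so the ensemble is concurrent in the sense of factor 1. Manipulating the attribute set instead would amount to giving each classifier a different subset of the feature positions, which is not shown.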

Before continuing with the discussion on the usage of ensemble classifiers to solve the multi-class classification problem, a number of open issues associated with the ensemble methodology should first be considered. These can be summarised as follows:

1. The best way to construct an ensemble of classifiers. It is generally acknowledged that there is no “best” ensemble, simply because there is no “best” classification algorithm. However, some researchers have recommended ways of constructing ensembles for specific situations; for example, one study recommended not using a sequential ensemble when the data set is highly noisy [85].

2. No comprehensive comparison available in the literature. The available studies vary with regard to: (i) the ensemble approaches used, (ii) the evaluation data sets and (iii) the evaluation criteria.

3. Computational cost. It is clear that combining a set of classifiers is computationally more expensive than using a single “stand-alone” classifier. However, the promised benefit, more accurate classification, is generally considered to be worthwhile. In order to address the issue of complexity associated with ensemble systems two options have been suggested: (i) the usage of parallel processing, especially for concurrent ensembles, as suggested by Breiman [12] and (ii) the elimination of similar representations from ensembles of classifiers, in other words pruning, as suggested by Dietterich [26].

4. Difficulty in understanding the final classification decision. For example, and as noted by Dietterich [26], it is easy to understand the classification result of a single decision tree. However, it is difficult to understand the final classification result of an ensemble comprising two hundred decision trees.
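The pruning option mentioned under computational cost can be sketched as follows. This is an illustrative heuristic only and is not the procedure described in [26]: it assumes that similarity is measured as the fraction of validation examples on which two classifiers agree, and the threshold `max_agreement`, the function names and the toy predictions are all hypothetical.

```python
def agreement(preds_a, preds_b):
    """Fraction of validation examples on which two classifiers agree."""
    return sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)


def prune_similar(predictions, max_agreement=0.95):
    """Greedily keep a classifier only if it differs enough from those kept.

    predictions maps a classifier name to its predicted labels on a shared
    validation set; max_agreement is an illustrative threshold.
    """
    kept = []
    for name, preds in predictions.items():
        if all(agreement(preds, predictions[k]) <= max_agreement for k in kept):
            kept.append(name)
    return kept


validation_predictions = {
    "tree_1": ["a", "b", "c", "a", "b"],
    "tree_2": ["a", "b", "c", "a", "b"],  # identical to tree_1, so redundant
    "tree_3": ["a", "c", "c", "b", "b"],
}
print(prune_similar(validation_predictions))
```

A greedy pass of this kind depends on the order in which the classifiers are visited, and it considers only redundancy, not accuracy. A practical pruning method would usually also take the accuracy of each retained classifier into account.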

The rest of this section is divided into four parts. Part 1 provides a general overview of concurrent ensembles and presents the most popular concurrent ensemble approaches. Part 2 then goes on to consider sequential ensembles and provides a review of the most well-known sequential ensemble approaches. Part 3 provides a detailed survey of the domain of binary tree based hierarchical ensemble classification. This is followed, in Part 4, by a discussion of DAG based hierarchical ensemble classification. The reason for this division is that the work on hierarchical ensemble classification presented later in this thesis can also be divided into binary tree and DAG based approaches.
