Chapter 2: Machine learning algorithms

2.4 Ensembles of decision trees

2.4.2 Diversity

Diversity is essential to ensemble machine learning. The classifiers must be as diverse
as possible, whilst remaining within the bounds of the problem being considered. The
classifiers must also remain consistent with one another in order to produce
meaningful results. The required diversity can be generated using a variety of
methods. The training data can be manipulated, the feature space can be partitioned, or
each classifier can be targeted at a different subset of the problem. In addition,
manipulation of the inducer itself and hybridization of the various types of inducer can
be used to create diversity within an ensemble.

Manipulation of the inducer is probably the simplest way of generating diversity.
Variability can often be generated by manipulating the parameters of the induction
algorithm, for example by altering the threshold parameter in the C4.5 decision
tree [4] or altering the topology of a neural network. The starting point for training the
inducer can also be altered, e.g., the initial weights of a neural network. The method
used by an inducer to traverse the so-called 'hypothesis space' can be varied, leading
the different classifiers to develop varied hypotheses for a classification problem. This
can be done by introducing random variance, or by a method such as the collective
performance-based strategy [17], whereby a cost penalty that encourages diversity is
introduced into the training algorithm.
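
As a rough illustration of altering the starting point of the inducer, the sketch below trains several perceptrons that differ only in their randomly drawn initial weights and combines them by majority vote. The perceptron, the toy data and the names train_perceptron, predict and majority_vote are assumptions made for this example; they are not taken from the cited work.

```python
import random


def train_perceptron(data, seed, epochs=200, lr=0.1):
    """Train a perceptron whose initial weights are drawn using `seed`."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    weights = [rng.uniform(-1, 1) for _ in range(n_features)]
    bias = rng.uniform(-1, 1)
    for _ in range(epochs):
        for x, y in data:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            error = y - (1 if activation > 0 else 0)
            # Standard perceptron update; no change when the point is correct.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias


def predict(model, x):
    weights, bias = model
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def majority_vote(models, x):
    votes = sum(predict(m, x) for m in models)
    return 1 if 2 * votes > len(models) else 0


# Toy, linearly separable data (logical OR); purely illustrative.
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

# Same algorithm and data, different starting points.
ensemble = [train_perceptron(data, seed=s) for s in range(5)]
```

Each member ends at a different separating line because it started from a different point in the hypothesis space, even though the training data are identical.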

The training data can be split into sub-sets, with each classifier being trained on an
overlapping or disjoint sub-set. Resampling is used to generate overlapping sub-sets of
the data. Some methods use the distribution of the training data. Others use a random
sub-set of the data, rather than sampling with replacement.
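
The difference between sampling with replacement and drawing a random sub-set can be sketched as follows. The function names bootstrap_sample and random_subset, the 50% fraction and the toy data are assumptions chosen for illustration.

```python
import random


def bootstrap_sample(data, rng):
    """Draw len(data) items with replacement, so items may repeat."""
    return [rng.choice(data) for _ in data]


def random_subset(data, fraction, rng):
    """Draw a fraction of the items without replacement, so no item repeats."""
    k = max(1, int(len(data) * fraction))
    return rng.sample(data, k)


rng = random.Random(0)
data = list(range(20))

# One training set per classifier; the sets overlap with one another.
bags = [bootstrap_sample(data, rng) for _ in range(3)]
subsets = [random_subset(data, 0.5, rng) for _ in range(3)]
```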

An important method of generating variety is to create new training examples based
on the distribution of the training data. These examples are combined with the training
data to form a new training set. The DECORATE algorithm [18] creates these examples
to give maximum variance from the training data. The training is iterative, with the
first iteration on the training data and subsequent iterations with the addition of
artificial examples.
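
The general idea of augmenting the training set with artificial examples can be sketched as below. This is a simplified illustration rather than the DECORATE algorithm itself: it assumes numeric features modelled by an independent Gaussian per feature and binary labels, and current_predict is a hypothetical stand-in for the classifiers trained so far.

```python
import random
import statistics


def make_artificial_examples(X, n_new, rng):
    """Sample new points from a per-feature Gaussian fitted to X."""
    columns = list(zip(*X))
    means = [statistics.mean(c) for c in columns]
    stdevs = [statistics.pstdev(c) for c in columns]
    return [tuple(rng.gauss(m, s) for m, s in zip(means, stdevs))
            for _ in range(n_new)]


def label_against(predict, points):
    """Label each point with the class the current model does NOT predict
    (binary labels 0/1), so the next classifier is pushed to disagree."""
    return [(p, 1 - predict(p)) for p in points]


rng = random.Random(1)
X = [(1.0, 2.0), (1.5, 1.8), (3.0, 4.2), (3.3, 3.9)]
y = [0, 0, 1, 1]


def current_predict(p):
    """Hypothetical stand-in for the prediction of the current ensemble."""
    return 1 if p[0] > 2.0 else 0


artificial = label_against(current_predict, make_artificial_examples(X, 3, rng))

# Training set for the next iteration: real data plus artificial examples.
augmented = list(zip(X, y)) + artificial
```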

Variance across the ensemble can be introduced by partitioning the data into
disjoint partitions. This is often done randomly, and overcomes the bottleneck created
by the size of the data. Each classifier is trained on a disjoint sub-set, but the whole
ensemble processes the total amount. It is also possible to use clustering techniques;
e.g., SVM cabins [19] partitions the data for training multiple SVMs in order to predict
protein solvent accessibility. Both these approaches offer an improvement in accuracy
and a way to overcome performance bottlenecks.
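
Random partitioning into disjoint sub-sets can be sketched in a few lines. The function name disjoint_partitions and the toy data are assumptions for this example; clustering-based partitioning is not shown.

```python
import random


def disjoint_partitions(data, k, rng):
    """Shuffle a copy of the data and deal it into k disjoint partitions."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


rng = random.Random(0)
data = list(range(10))

# Each classifier would be trained on exactly one of these partitions.
parts = disjoint_partitions(data, 3, rng)
```

Every example appears in exactly one partition, so each classifier sees only a fraction of the data while the ensemble as a whole covers all of it.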

Rather than diversifying the data or changing the way it is represented, search space
partitioning introduces variation by directing the classifiers in the ensemble to explore
different areas of the search space. Each of these models is constructed independently,
and the models are then aggregated. The subspaces of the feature space can overlap or
be disjoint, and how much overlap, if any, to allow between subspaces is an important
consideration. The divide and conquer approach divides the instance space into
sub-sets. The instance space can be divided using either clustering techniques, such as
k-means clustering, or a naïve Bayes tree [20]. The feature sub-set selection approach
manipulates the input attribute set. Each of the classifiers is given a different sub-set
of the features, and thus receives a different projection of the training set. The features
can be divided up by random selection or by using reducts [2]. A reduct is the smallest
set of features that can be chosen whilst retaining the same predictive power as the
whole feature set. This has the limitation that the ensemble size cannot be larger than
the feature set. A collective feature-based strategy [21] is also possible, whereby after
the initial random feature selection the sub-sets are refined using an iterative method,
such as genetic algorithms or a hill climbing approach.
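
Random feature sub-set selection can be sketched as follows: each classifier receives its own list of feature indices and the corresponding projection of the training set. The names random_feature_subsets and project and the toy data are assumptions for this example; reducts and the iterative refinement step are not shown.

```python
import random


def random_feature_subsets(n_features, n_classifiers, subset_size, rng):
    """Pick one random set of feature indices per classifier."""
    return [sorted(rng.sample(range(n_features), subset_size))
            for _ in range(n_classifiers)]


def project(X, feature_idx):
    """Keep only the selected feature columns of each training instance."""
    return [tuple(row[i] for i in feature_idx) for row in X]


rng = random.Random(0)
X = [(1, 2, 3, 4), (5, 6, 7, 8)]

subsets = random_feature_subsets(n_features=4, n_classifiers=3,
                                 subset_size=2, rng=rng)

# One projected training set per classifier.
projections = [project(X, idx) for idx in subsets]
```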

Diversity can also be generated by using several different types of classifier to form
the ensemble. This approach also covers combining several classifiers with
mathematical or analytical methods. The different classifiers may identify different
aspects of the training data, which goes some way towards overcoming the natural
bias of each individual classifier. For example, Zhou and Jiang combine the
C4.5 decision tree with neural networks [22]. They first train a neural network ensemble.
This ensemble enhances the training set by adjusting the class labels and adding new
examples. The new training set is then used to generate a C4.5 tree. This is analogous to
the Trepan [23] method, described elsewhere in this thesis, in that it provides increased
comprehensibility of the results.
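
The relabel-and-augment step can be sketched as below. This is a toy illustration under strong assumptions, not Zhou and Jiang's implementation: teacher is a hypothetical stand-in for the trained neural network ensemble, and a one-feature threshold rule (fit_stump) stands in for the C4.5 tree.

```python
import random


def teacher(x):
    """Hypothetical stand-in for a trained neural network ensemble."""
    return 1 if x > 0.5 else 0


def fit_stump(examples):
    """Fit a rule 'predict 1 if x > threshold' by minimising training errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in examples:
        errors = sum((1 if x > threshold else 0) != y for x, y in examples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


rng = random.Random(0)

# The label of 0.4 disagrees with the teacher and will be adjusted.
original = [(0.1, 0), (0.4, 1), (0.6, 1), (0.9, 1)]

# Step 1: replace the class labels with the teacher's predictions.
relabelled = [(x, teacher(x)) for x, _ in original]

# Step 2: add new random examples, also labelled by the teacher.
extra = [(x, teacher(x)) for x in (rng.random() for _ in range(20))]

# Step 3: train the comprehensible model on the enhanced training set.
threshold = fit_stump(relabelled + extra)
```

The resulting rule is a single readable threshold that mimics the teacher, which is the sense in which the second model offers increased comprehensibility.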

In document Data mining techniques for protein sequence analysis (Page 75-77)