
2.5 Related Work: EAs for Classification (with Balanced Data)

2.5.1 GP for Classification

Genetic programming (GP) has been widely used to evolve reliable and accurate classifiers over a range of classification problems [176][62][80][158][61][111][155]. While GP for classification (with balanced data) represents a large area of work, this section provides a brief overview of the four main concepts that pertain specifically to this thesis: GP classification models, classification strategies, the fitness function, and GP for ensemble learning.


[Figure 2.7 shows an example genetic program, (+ (* 0.45 F2) (IF (− F1 F3) 0.7 F3)), drawn as a tree. Its numeric output, ProgOut, is mapped to a class label by the rule: IF ProgOut > 0 THEN Class1 ELSE Class2.]

Figure 2.7: Classification strategy in GP.

Classification Models in GP

In tree-based GP, the flexibility of the GP representation allows different kinds of models to be used to solve a given classification task. Two common approaches represent individuals as decision trees or as discriminant functions [64]. In decision trees, leaf nodes represent the class labels while internal nodes represent conditions on the features; the path from the root node to a leaf node represents the process of classifying an input instance. In a discriminant function, the GP classifier is represented as a mathematical expression in which different operators are applied to the features of the input instance to be classified. This thesis uses discriminant functions for classification.

As a mathematical expression, a discriminant function typically computes a single floating-point number, which is the output of the GP tree [176][80][158][111][155]. This single output value is then mapped to one of a set of class labels. In binary classification, the division between positive and negative numbers is typically used as the boundary between the two classes [181][176][80][158][111]. For example, Figure 2.7 illustrates how the numeric output of a genetic program is used for binary classification, where F1, F2 and F3 represent three input features and ProgOut denotes the genetic program output. Here an input instance is assigned to class1 if the genetic program output is greater than zero; otherwise, the input instance is assigned to class2. Using this strategy, the class threshold is fixed at zero.
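The static strategy can be illustrated with the example program from Figure 2.7. The following is a minimal sketch only: the semantics of the IF function are not defined in this section, so the version below (return the second argument if the first is negative, otherwise the third) is an assumption, and the function names are hypothetical.

```python
def if_function(cond, a, b):
    # ASSUMPTION: IF returns its second argument when the first argument
    # is negative, and its third argument otherwise. The section does not
    # define IF, so this choice is illustrative only.
    return a if cond < 0 else b


def prog_out(f1, f2, f3):
    # Evaluates the example program (+ (* 0.45 F2) (IF (- F1 F3) 0.7 F3)).
    return 0.45 * f2 + if_function(f1 - f3, 0.7, f3)


def classify(output, threshold=0.0):
    # Static strategy: the class threshold is fixed at zero.
    return "Class1" if output > threshold else "Class2"


if __name__ == "__main__":
    for features in [(1.0, 2.0, 3.0), (3.0, -2.0, -1.0)]:
        out = prog_out(*features)
        print(features, out, classify(out))
```

Under the assumed IF semantics, the first instance yields a positive output and is assigned to Class1, while the second yields a negative output and is assigned to Class2.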

Dynamic Classification Strategies in Multi-class GP

Recently, new dynamic classification strategies have been developed to translate the numeric output of a GP individual to a set of class labels for multi-class classification tasks (with balanced data) [186][185][154][120]. In these works, dynamic class boundaries are determined on an individual-by-individual basis for each member of the population, whereas in the traditional strategy discussed above for binary classification, the class boundaries remain fixed for all individuals in the population (i.e. zero is the class threshold). These new approaches include Dynamic Range Selection (DRS) [120], Centred Dynamic Class Boundaries (CDCB) [185], Slotted Dynamic Class Boundaries (SDCB) [154] and a probabilistic classification strategy [186]. As one of the goals of this thesis is to determine whether the traditional (static) strategy is sufficient for binary classification tasks with unbalanced data compared to a dynamic (non-static) strategy, these approaches are briefly outlined below.

In DRS [120] and SDCB [185], the number line is divided into a fixed number of “slots”, and the real-valued output of an individual (when evaluated on an input instance) is mapped to a particular slot (using a truncation operator). Once all inputs are processed, the class which has the most inputs in a given slot is taken as the class label of that slot. However, a major limitation is that the slot sizes, slot range and the truncation operator must be determined a priori; these settings can be sensitive to the training data, and poor initial settings can cause overfitting.
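The slot-based idea can be sketched as follows. This is an illustrative sketch rather than the procedure of [120] or [185]: the output range, the number of slots, the clipping of out-of-range outputs and all function names are assumptions chosen for the example.

```python
from collections import Counter, defaultdict


def slot_index(output, lo=-5.0, hi=5.0, n_slots=10):
    # ASSUMPTION: outputs are clipped to [lo, hi] and the range is split
    # into n_slots equal-width slots; these values must be fixed a priori.
    width = (hi - lo) / n_slots
    clipped = min(max(output, lo), hi)
    idx = int((clipped - lo) / width)  # truncation to a slot number
    return min(idx, n_slots - 1)       # keep output == hi in the last slot


def label_slots(outputs, labels, **slot_params):
    # Count the classes of the training outputs that fall into each slot,
    # then label each slot with its most frequent class.
    counts = defaultdict(Counter)
    for out, lab in zip(outputs, labels):
        counts[slot_index(out, **slot_params)][lab] += 1
    return {slot: c.most_common(1)[0][0] for slot, c in counts.items()}


def predict(output, slot_labels, default=None, **slot_params):
    # Slots that received no training outputs fall back to `default`.
    return slot_labels.get(slot_index(output, **slot_params), default)


if __name__ == "__main__":
    outs = [-4.2, -4.8, -4.5, 0.3, 0.7, 3.1]
    labs = ["a", "a", "b", "b", "b", "c"]
    slots = label_slots(outs, labs)
    print(slots, predict(0.5, slots), predict(2.0, slots))
```

Slots that receive no training outputs have no label, which is one way the a priori slot settings interact with the training data.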

In CDCB [154], the class threshold is selected as the mid-point between two adjacent class centres, one for each class. A class centre is the average of the outputs when all individuals in the population are evaluated on all input instances from a particular class. While this approach requires no prior parameter configuration, the class threshold(s) depend on all individuals in the population at the current generation. This means that more training can be needed to converge on a good class threshold and an accurate GP classifier.
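The description of class centres and mid-point thresholds above can be sketched as follows. This is a minimal illustration of that description, not the implementation from [154]; representing each individual as a Python callable and the function names are assumptions.

```python
def class_centres(population, instances, labels):
    # A class centre is the mean output of ALL individuals in the
    # population over all training instances of that class.
    sums, counts = {}, {}
    for program in population:
        for x, y in zip(instances, labels):
            sums[y] = sums.get(y, 0.0) + program(x)
            counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def midpoint_thresholds(centres):
    # Order classes by centre; each threshold is the mid-point between
    # two adjacent class centres.
    ordered = sorted(centres.items(), key=lambda kv: kv[1])
    bounds = [(a[1] + b[1]) / 2 for a, b in zip(ordered, ordered[1:])]
    return [lab for lab, _ in ordered], bounds


def classify(output, ordered_labels, bounds):
    # Return the class whose interval on the number line contains output.
    for lab, bound in zip(ordered_labels, bounds):
        if output < bound:
            return lab
    return ordered_labels[-1]


if __name__ == "__main__":
    # Two hypothetical individuals, each a function of one feature.
    population = [lambda x: x[0], lambda x: 2 * x[0]]
    instances = [(1.0,), (3.0,), (-1.0,), (-3.0,)]
    labels = ["pos", "pos", "neg", "neg"]
    centres = class_centres(population, instances, labels)
    ordered, bounds = midpoint_thresholds(centres)
    print(centres, bounds, classify(0.5, ordered, bounds))
```

Because the centres are averaged over the whole population, the thresholds shift whenever the population changes between generations.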

In [186], the outputs of each individual are modelled using Gaussian distributions, one for each class, and a probabilistic technique is used to find the point(s) of least overlap between these class distributions. This approach requires no extra parameter configuration (unlike DRS and SDCB), is relatively fast to compute, and shows good results compared to the traditional (static) strategy on a range of multi-class tasks.
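One plausible reading of "least overlap" for two classes is the point at which the two fitted Gaussian densities are equal. The sketch below computes such points analytically; it is an assumption made for illustration and is not claimed to be the technique of [186]. The function names and the equal-prior treatment of the classes are likewise assumptions.

```python
import math


def normal_pdf(x, m, s):
    # Density of a Gaussian with mean m and standard deviation s.
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))


def equal_density_points(m1, s1, m2, s2):
    # Solve log N(x; m1, s1) = log N(x; m2, s2), which reduces to the
    # quadratic a*x^2 + b*x + c = 0 with the coefficients below.
    a = 1 / (2 * s2 ** 2) - 1 / (2 * s1 ** 2)
    b = m1 / s1 ** 2 - m2 / s2 ** 2
    c = m2 ** 2 / (2 * s2 ** 2) - m1 ** 2 / (2 * s1 ** 2) + math.log(s2 / s1)
    if abs(a) < 1e-12:
        # Equal variances: the equation is linear (or has no unique root).
        return [] if abs(b) < 1e-12 else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


if __name__ == "__main__":
    # Hypothetical per-class output statistics for one GP individual.
    for p in equal_density_points(0.0, 1.0, 3.0, 2.0):
        print(p, normal_pdf(p, 0.0, 1.0), normal_pdf(p, 3.0, 2.0))
```

With equal standard deviations the single crossing point is the mid-point between the two class means; with unequal standard deviations there can be two crossing points, only one of which typically lies between the means.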


Fitness Function

In classification, the fitness function defines a measure to calculate the accuracy of a solution, by comparing the predicted class labels with the target (or actual) class labels in the training set. The traditional fitness function for classification uses the overall classification accuracy (this measure was previously shown in Section 2.1.3). Recall that this measures the number of examples correctly labeled by a classifier as a proportion of the total number of training examples.

However, when data is unbalanced, using the overall classification accuracy in the fitness function is known to drive the evolutionary search toward biased classifiers which have high majority class accuracy but poor minority class accuracy [141][57][88][172][130][69]. As discussed previously, this is because the overall accuracy is dominated by the larger majority class. Two common approaches to address this learning bias in GP are to use the average accuracy of the minority and the majority classes, or the AUC, in the fitness function [141][57][88]. The reader is directed to Section 3.3 (in the next chapter) for a detailed analysis and discussion of the major advantages and disadvantages of these three fitness functions when used in GP for classification with unbalanced data.
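The bias described above can be seen in a small sketch of the three measures. The function names are hypothetical, the minority class is assumed to be the positive class, and the AUC is estimated here with the Wilcoxon-Mann-Whitney statistic, which is an assumption since this section does not state how the AUC is computed.

```python
def overall_accuracy(tp, tn, fp, fn):
    # Correctly labelled examples as a proportion of all examples.
    return (tp + tn) / (tp + tn + fp + fn)


def average_class_accuracy(tp, tn, fp, fn):
    # Mean of the minority (positive) and majority (negative) accuracies.
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def auc_wmw(minority_outputs, majority_outputs):
    # Wilcoxon-Mann-Whitney estimate of the AUC: the fraction of
    # (minority, majority) pairs ranked correctly, with ties counted as 0.5.
    wins = 0.0
    for p in minority_outputs:
        for n in majority_outputs:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(minority_outputs) * len(majority_outputs))


if __name__ == "__main__":
    # 10 minority and 90 majority examples; every example is labelled
    # as the majority class.
    print(overall_accuracy(tp=0, tn=90, fp=0, fn=10))
    print(average_class_accuracy(tp=0, tn=90, fp=0, fn=10))
    print(auc_wmw([0.9, 0.4], [0.1, 0.5, 0.3]))
```

In this example a classifier that never predicts the minority class scores 0.9 on overall accuracy but only 0.5 on average class accuracy, which is why the latter is less prone to rewarding biased solutions.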

Several other related approaches which develop new fitness functions in GP specifically for classification with unbalanced data are discussed in more detail later in this chapter in Section 2.6.2 (which focuses on related works for class imbalance problems).

GP for Ensemble Learning and Combining Classifiers

GP has also shown success in evolving ensembles of classifiers for classification with balanced data [110][111][126][29][165]. In [110][111], multiple trained classifiers obtained from several learning algorithms, such as Naive Bayes, C4.5 decision trees and ANNs, are combined into a single genetic program solution. Each base learner is trained using different partitions of the input data and adapted to output a real (floating-point) number (when evaluated on an input instance). Combining classifiers in this way is shown to outperform the individual classifiers (in terms of AUC) on two benchmark binary tasks from the UCI repository. In [126], an anti-correlation measure is used in the fitness function along with the overall accuracy to encourage diversity between individuals, using a grammar-guided GP (on a 6-multiplexer problem). After evolution, the entire population is combined into an ensemble.

In teaming approaches [29][165], teams and individual programs are co-evolved in parallel. In [29], teams are evolved using linear GP and evaluated with several ensemble combination schemes on two benchmark classification tasks (from the UCI repository) and a regression task. To create teams, the population is divided into demes, which are then sub-divided into teams of individual programs. The ensemble combination schemes include the average of each member’s outputs, a majority vote, two winner-takes-all strategies, and two weighting schemes in which teams and weights are either co-evolved in parallel or optimised after each generation (using a perceptron). The best combination scheme is found to be problem-specific (none shows the best results for all tasks), but the majority vote and weighting schemes consistently show good results.
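Two of the combination schemes mentioned above, output averaging and majority voting, can be sketched for a binary task as follows. This is an illustrative sketch, not the implementation from [29]: the zero threshold, the tie-breaking rule and the function names are assumptions.

```python
def average_output_vote(member_outputs, threshold=0.0):
    # Combine by averaging the members' numeric outputs, then apply
    # the (assumed) zero class threshold to the average.
    mean_output = sum(member_outputs) / len(member_outputs)
    return "Class1" if mean_output > threshold else "Class2"


def majority_vote(member_outputs, threshold=0.0):
    # Each member votes for a class from its own output; the class with
    # more than half of the votes wins. ASSUMPTION: ties go to Class2.
    votes_for_class1 = sum(1 for out in member_outputs if out > threshold)
    return "Class1" if votes_for_class1 > len(member_outputs) / 2 else "Class2"


if __name__ == "__main__":
    # Hypothetical outputs of three team members on one input instance.
    outputs = [5.0, -0.1, -0.2]
    print(average_output_vote(outputs), majority_vote(outputs))
```

The example outputs show how the two schemes can disagree: one member with a large positive output dominates the average, while the majority vote follows the two members with small negative outputs.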

In [165], four teaming-based selection methods are compared, including a new class of “orthogonal evolutionary” selection algorithms (on two multi-class UCI classification tasks). In teaming, selection is done exclusively between teams or between individuals, while the new algorithms use individual selection with team replacement, and vice versa. The new methods produce better results than canonical methods, but the best selection method is found to be problem-specific.

It is important to note two major differences between teaming in GP and the ensemble methods used in this thesis. Firstly, teaming produces teams of weak individuals that cooperate strongly together, as shown in both [29] and [165]. Weak individuals have very poor individual classification ability and are only effective when combined with other weak individuals in a team. Secondly, in teaming, two selection strategies are typically used (as discussed in [165]): selection of individuals within a team, and selection of teams. In this thesis, the GP classifiers are relatively strong individuals with good accuracy on the two classes, and individuals in the population are only combined into an ensemble after the training phase.