
class regions, or 0 otherwise.

However, preliminary experiments that compared Corr and Dist using either Ave or the indicator function Izt on the tasks found that the Ave-based approach produced slightly lower AUC results in the evolved solutions than the indicator function-based approach. These preliminary results are omitted here as they are not the main focus of this chapter, but can be seen in Appendix B (Section B.2.1). The indicator function Izt outperforms the Ave-based approach because Izt has the desirable property that solutions whose outputs do not adhere to the desired class ordering are assigned poor fitness values early in the evolution. These solutions are then phased out of the evolution relatively early in the process due to selection pressure. In contrast, the Ave-based approach adopts a “fairer” strategy which assigns moderate-level fitness values to solutions whose outputs only partially adhere to the desired class ordering. These solutions then remain in the population for longer.

For these reasons, the indicator function Izt is the preferred method in Corr and Dist to ensure that majority and minority class outputs are negative and non-negative, respectively, in the evolved solutions.

4.4 Experimental Setup

This section outlines the GP evolutionary parameters and the statistical significance testing techniques used in the experimental results.

4.4.1 GP Evolutionary Parameters

The same evolutionary parameters from the previous chapter are also used in these experiments. To recap, crossover, mutation and elitism rates are 60%, 35% and 5%, respectively, and tournament selection is used with a tournament size of 7. The maximum program depth is 8 to restrict very large programs in the population, and the population size is 500. The evolution is allowed to run for a maximum of 50 generations, or is terminated early if a solution with a maximum fitness value on the training set is found.
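The parameter values above can be collected in a single settings object so that they stay identical across all fitness functions. The following is only a minimal sketch: the class name `GPConfig` and its field names are hypothetical and are not taken from the implementation used in these experiments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPConfig:
    """Hypothetical container for the GP parameters listed in the text."""
    crossover_rate: float = 0.60
    mutation_rate: float = 0.35
    elitism_rate: float = 0.05
    tournament_size: int = 7
    max_depth: int = 8
    population_size: int = 500
    max_generations: int = 50

    def validate(self) -> None:
        # Assumption: the three rates partition the next generation,
        # so they should sum to 1 (60% + 35% + 5%).
        total = self.crossover_rate + self.mutation_rate + self.elitism_rate
        if abs(total - 1.0) > 1e-9:
            raise ValueError(f"operator rates sum to {total}, expected 1.0")


config = GPConfig()
config.validate()

# Number of individuals preserved by elitism in a population of 500.
n_elite = round(config.elitism_rate * config.population_size)
```

Freezing the dataclass is a design choice for this sketch: it prevents the configuration from being changed accidentally between runs that use different fitness functions.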

As discussed in the previous chapter, this configuration of parameters is recommended in the literature. To concentrate on the effects of the fitness functions in the GP algorithm, it is important that the configuration of evolutionary parameters is kept consistent. As the goal of this chapter is to compare the different fitness functions in the evolution, fine-tuning this configuration of evolutionary parameters for better classification performances is outside the scope of this study.

4.4.2 Statistical Significance Testing of the AUC

Similar to the experimental results in the previous chapter, Tukey’s Honestly Significant Difference (HSD) [166] is used to find the statistically significant differences in AUC for the solutions evolved using the fitness functions. Tukey’s multiple comparisons test compares the average AUC of the fittest evolved solutions from each GP system (using a particular fitness function) to all others, and outputs a confidence interval for each pairwise comparison between GP systems. However, as these experimental results compare 11 different fitness functions, 55 confidence intervals are returned from Tukey’s multiple comparisons test when each GP system is compared to all others, as shown below.

$$n = \frac{k(k-1)}{2} = \frac{11(11-1)}{2} = 55$$

where k is the number of GP fitness functions. This means that 55 confidence intervals (of the AUC) for the different fitness functions must be compared to one another to find the statistically significantly better mean AUC values. A confidence interval between two fitness functions is calculated using Eq. (4.11) below for all fitness functions.

$$\bar{y}_i - \bar{y}_j \pm \frac{q(\alpha, k, M-k)}{\sqrt{2}}\, SE \sqrt{\frac{2}{n}} \quad \text{for all } i, j = 1, 2, \ldots, k \text{ where } i \neq j \qquad (4.11)$$

In Eq. (4.11), $\bar{y}_i$ and $\bar{y}_j$ are the mean AUC for two fitness functions, n is the number of GP runs (50), k is the number of fitness functions (11), and SE is the standard deviation of the entire sample. The constant value q is the critical value for the studentised range statistic Q [121]. This is obtained using a look-up table¹ for three variables: α, k and M. Here, α is the level of significance, k is the number of fitness functions, and M is the total sample size (i.e. total number of experiments for all fitness functions for a given task). As k is 11, M is 550 (50 runs of k fitness functions), and α is 0.05 (5% level of significance), the look-up value for q according to [121] is 4.51, as shown below.

$$q(\alpha, k, M-k) = q(0.05, 11, 539) = 4.51$$
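As a worked illustration of Eq. (4.11), the sketch below computes the number of pairwise comparisons and the interval for one pair of fitness functions. The function names are hypothetical, and the mean AUC values and the value of SE in the example are made-up placeholders rather than results from these experiments; only q = 4.51, k = 11 and n = 50 come from the text.

```python
import math


def num_pairwise_comparisons(k: int) -> int:
    """Number of unordered pairs among k fitness functions: k(k-1)/2."""
    return k * (k - 1) // 2


def tukey_half_width(q: float, se: float, n: int) -> float:
    """Half-width in Eq. (4.11): (q / sqrt(2)) * SE * sqrt(2 / n)."""
    return (q / math.sqrt(2.0)) * se * math.sqrt(2.0 / n)


def tukey_interval(mean_i: float, mean_j: float, q: float, se: float, n: int):
    """Confidence interval for the difference between two mean AUC values."""
    diff = mean_i - mean_j
    hw = tukey_half_width(q, se, n)
    return diff - hw, diff + hw


pairs = num_pairwise_comparisons(11)

# Values from the text: q = 4.51, n = 50 runs.
# Placeholder values (assumptions for illustration only): SE and both means.
Q_CRIT, N_RUNS = 4.51, 50
se_example = 0.05
lower, upper = tukey_interval(0.86, 0.80, Q_CRIT, se_example, N_RUNS)

# A pairwise difference is significant when its interval excludes zero.
significant = lower > 0.0 or upper < 0.0
```

Note that the two square-root factors simplify, so the half-width is equal to q · SE / √n; this is why the same value applies to every pair of fitness functions on a given task.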

¹ The distribution of Q has been tabulated and appears in many textbooks on statistics or online.

[Figure 4.3, panels (a) and (b): horizontal axis AUC; one interval bar per fitness function, labelled Corr, Dist, Amse, AucF, AucE, Wmw, AveM, Acc, Ave, Bands, Incr; task Ion; legend in (b): 1 >> 3−4, 2 >> 4.]

Figure 4.3: Confidence intervals of the AUC for the different fitness functions for the Ion task. In (a), the interval for Acc is statistically significantly poorer than Dist and Corr. In (b), the confidence intervals are labelled with their s-ranks where the legend shows significantly better s-ranks.

In Eq. (4.11), the term $\frac{q(\alpha, k, M-k)}{\sqrt{2}}\, SE \sqrt{\frac{2}{n}}$ remains constant for all pairwise comparisons between fitness functions. As a result, these confidence intervals can be visualised for easier interpretability, as shown in Figure 4.3(a) for the Ion task. In Figure 4.3(a), each bar represents the 95% confidence interval of the mean AUC for a particular fitness function, where the horizontal axis shows the AUC. Two fitness functions are significantly different to one another only if their intervals are disjoint, and are not significantly different to one another if their intervals overlap.

For example, Figure 4.3(a) shows that the fitness function Acc is significantly different to Dist and Corr (in terms of average AUC), as the interval for Acc (highlighted in blue) does not overlap with the intervals for Dist and Corr (highlighted in red). However, as the interval for Acc does overlap with all other intervals (dashed), Acc is not statistically significantly different to these fitness functions.

Figure 4.3(a) allows each interval to be easily compared to all other intervals to determine the statistically significant AUC values for the different fitness functions.
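The comparison of intervals described above can be sketched as a simple overlap test. This is an illustrative sketch only: the helper names are hypothetical, the interval end points are made-up placeholders (not read from Figure 4.3), and treating intervals that merely touch at an end point as overlapping is an assumption.

```python
from typing import Dict, List, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound) of the AUC interval


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def significantly_different_from(name: str, intervals: Dict[str, Interval]) -> List[str]:
    """Fitness functions whose intervals are disjoint from the named one."""
    target = intervals[name]
    return [
        other
        for other, interval in intervals.items()
        if other != name and not overlaps(target, interval)
    ]


# Placeholder intervals for illustration only.
example_intervals = {
    "Corr": (0.860, 0.880),
    "Dist": (0.855, 0.875),
    "Wmw": (0.835, 0.855),
    "Acc": (0.820, 0.840),
}

different_from_acc = significantly_different_from("Acc", example_intervals)
```

With these placeholder values, the interval for Acc is disjoint from those of Corr and Dist but overlaps with Wmw, which mirrors the pattern described for Figure 4.3(a).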

4.4.3 Significance Ranking using S-rank

To summarise which fitness functions have a significantly better AUC compared to others (i.e. when each interval is compared to all others) for a given task, an identifying number is assigned to each fitness function. This number, called the significance rank (or s-rank), represents a group of fitness functions that are statistically significantly different to other groups on a particular task. The fitness function(s) with the highest average AUC is assigned the best s-rank (1), and s-rank values increase (the s-rank gets worse) as the average AUC of the fitness functions gets worse, as shown in Figure 4.3(b).

In Figure 4.3(b), the fitness function intervals are shown for Ion when each interval has been labelled with the corresponding s-rank. The legend in Figure 4.3(b) shows which s-rank values are statistically significantly better than other s-rank values, where the symbol >> denotes a significantly better s-rank. For example, “1 >> 3−4” shows that the fitness function(s) with an s-rank of 1 (Dist and Corr in this case) have a significantly better AUC than the fitness functions with s-ranks 3 and 4. Likewise, “2 >> 4” shows that the fitness function(s) with an s-rank of 2 have a significantly better AUC than those with an s-rank of 4.

The following procedure assigns s-rank values to the fitness functions for a given task.

1. Sort the fitness functions in ascending order (using their average AUC values), as shown in Figure 4.3(a). Select the fitness function (or interval) with the highest AUC as the current interval, and initialise the s-rank to 1.

2. Find all other intervals that are significantly worse than the current interval (i.e. other intervals that do not overlap with the current interval). For example, if the current interval is Corr in Figure 4.3(b), the other intervals that do not overlap with Corr are those with s-rank values of 3 and 4.

3. Find all other intervals that are not significantly different from the current interval (i.e. other intervals that overlap with the current interval). For example, if the current interval is Corr in Figure 4.3(b), the other intervals that overlap with Corr are those with s-rank values of 1 and 2.

4. Using the intervals from Step (3), find those intervals that do not overlap with any of the intervals in the set from Step (2). For example, if the current interval is Corr in Figure 4.3(b), the only interval from Step (3) that overlaps with none of the intervals from Step (2) is Dist, since all other intervals from Step (3) (Wmw, AucE, AucF, and Amse) overlap with at least one interval from Step (2).

5. Assign the intervals from Step (4) with the current s-rank value (e.g. this is 1 on the first iteration), and then increment the s-rank. For example, if the
