3.1.4 When Minimum Generation is Known
To test empirically the validity of these three methods for generating confidence intervals, we ran experiments on datasets in which very large numbers of runs had been executed². The four datasets were:
¹ Although Keijzer et al. did not clearly specify their method, they did state that they split the executed runs into two groups. We study that variation in section 3.2.1.
• Ant: Christensen and Oppacher’s 27,755 runs [22] of the artificial ant on the Santa-Fe trail; panmictic population of 500; best estimate of the true computational effort 479,344 at generation 18³; P(18) = 2421/27755 = 0.0872
• Parity: 3,400 runs of even-4-parity without ADFs [71, 72]; panmictic population of 16,000; best estimate of true computational effort 421,074 at generation 23; P(23) = 3349/3400 = 0.985
• Symbreg: Gagné’s 1,000 runs⁴ of a symbolic regression problem (x^4 + x^3 + x^2 + x) [71]; panmictic population of 500; best estimate of true computational effort 33,299 at generation 12; P(12) = 593/1000 = 0.593
• Multiplexor: Gagné’s 1,000 runs⁵ of the 11-multiplexor problem [71]; panmictic population of 4,000; best estimate of true computational effort 163,045 at generation 25; P(25) = 947/1000 = 0.947
The computational effort calculations for each dataset (utilising every run) were treated as a best estimate of the true minimum generation and true minimum computational effort.
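The best-estimate figures quoted for the four datasets can be checked with a few lines of Python. This is a minimal sketch, not the code used for the experiments: it assumes that the computational effort at generation i is M(i+1)·ln(1−z)/ln(1−P(i)) with z = 0.99 and with no ceiling applied to the number of runs. That formula is not restated in this section, and the function name `computational_effort` is illustrative. Under that assumption the sketch agrees with the four quoted values to within rounding.

```python
import math


def computational_effort(pop_size, generation, successes, runs, z=0.99):
    """Effort at a known minimum generation (ceiling-free form, assumed here)."""
    p = successes / runs  # cumulative success proportion P(i)
    # Number of independent runs needed for probability z of success,
    # deliberately not rounded up (an assumption of this sketch).
    runs_needed = math.log(1 - z) / math.log(1 - p)
    # Generations are counted from 0, so generation i costs (i + 1) populations.
    return pop_size * (generation + 1) * runs_needed


# (population size, minimum generation, successes, total runs) per dataset
datasets = {
    "Ant": (500, 18, 2421, 27755),
    "Parity": (16000, 23, 3349, 3400),
    "Symbreg": (500, 12, 593, 1000),
    "Multiplexor": (4000, 25, 947, 1000),
}

for name, args in datasets.items():
    print(f"{name}: {computational_effort(*args):,.0f}")
```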
For each dataset and for each confidence interval generating method, the following method was applied. A subset of the whole dataset’s runs was randomly selected (uniformly, with replacement). The subset sizes were 25, 50, 75, 100, 200 and 500 runs. These sizes are typical of published work (often 25 to 100 runs, sometimes fewer [71, 72]) and of recommendations by statisticians (200 to 500 runs [22, 92]). 10,000 subsets were selected and the confidence interval generating method was applied to each subset. This simulated 10,000 genetic programming experiments on each of the four problem domains for each of the six run sizes.
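The subsampling procedure above can be sketched as follows. The sketch makes two assumptions that go beyond this section: it uses the standard Wilson score interval as a stand-in for the interval method under test, and it checks coverage of the true success proportion P rather than of the computational effort itself (if the effort is a monotonic function of P at a fixed generation, the two coverage checks coincide). The names `wilson_interval` and `simulate_coverage` are illustrative.

```python
import math
import random


def wilson_interval(successes, n, z=1.96):
    """Standard Wilson score interval for a binomial proportion."""
    p_hat = successes / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def simulate_coverage(successes, total_runs, subset_size, trials=10_000, seed=0):
    """Fraction of simulated experiments whose interval contains the true P."""
    rng = random.Random(seed)
    true_p = successes / total_runs
    # 1 marks a run that succeeded by the minimum generation, 0 one that did not.
    outcomes = [1] * successes + [0] * (total_runs - successes)
    hits = 0
    for _ in range(trials):
        subset = rng.choices(outcomes, k=subset_size)  # uniform, with replacement
        low, high = wilson_interval(sum(subset), subset_size)
        hits += low <= true_p <= high
    return hits / trials


# Symbreg counts (593 successes in 1,000 runs) for each of the six run sizes.
for size in (25, 50, 75, 100, 200, 500):
    print(size, simulate_coverage(593, 1000, size))
```

Swapping `wilson_interval` for another interval function with the same signature gives the corresponding coverage estimate for that method.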
Results and Discussion
For each of the four problem domains and each of the three confidence interval generation methods, table 3.2 gives the average coverage and the average number of valid confidence intervals that were produced from the 10,000 simulated experiments. Table 3.3 gives the same statistics but by run size and method.
So, for example, table 3.2 shows that for the normal approximation method on the Ant problem domain, an average of 97.1% of the confidence intervals
³ This occurred at generation 18 as, like Koza, we have counted the first generation as generation 0, whereas Christensen and Oppacher labelled it generation 1.
⁴ Our thanks go to Christian Gagné for this dataset.
⁵ Thanks again to Christian Gagné for this dataset.
Method \ Problem    Ant      Parity   Symbreg  Multiplexor  Average
Normal              97.1%    49.4%    95.3%    79.1%        80.3%
                    7,049    1,787    9,954    4,752        5,885
Wilson-Dependent    95.2%    95.3%    94.9%    95.1%        95.1%
                    10,000   10,000   10,000   10,000       10,000
Resampling          93.2%    69.9%    94.1%    88.9%        86.5%
                    10,000   10,000   10,000   10,000       10,000
Table 3.2: Average coverage percentages and average validity statistics by problem domain when the minimum generation is known. Averages are over 25–500 runs.
Method \ Runs       25       50       75       100      200      500      Average
Normal              48.1%    70.5%    72.9%    98.0%    96.3%    95.9%    80.3%
                    2,582    3,704    5,007    6,439    7,907    9,674    5,885
Wilson-Dependent    94.6%    95.7%    95.6%    94.7%    95.5%    94.7%    95.1%
                    10,000   10,000   10,000   10,000   10,000   10,000   10,000
Resampling          72.0%    82.8%    86.9%    88.8%    94.4%    94.3%    86.5%
                    10,000   10,000   10,000   10,000   10,000   10,000   10,000
Table 3.3: Average coverage percentages and average validity statistics by run size when the minimum generation is known. Averages are over the four problem domains.
included the true value of the minimum computational effort (compare that to the expected result of approximately 95%). This average was produced over simulated experiment sizes of 25–500 runs. The table also shows that, for the same setup, an average of 7,049 of the 10,000 simulated experiments produced valid confidence intervals.
The resampling method had a very poor minimum average coverage of 69.9% for the Parity domain (see table 3.2). The Normal method also did poorly for that domain with a coverage score of 49.4%. In contrast, the Wilson-Dependent method achieved very good coverage levels across all domains and all run sizes with a minimum coverage of 93.3% (on the Parity domain with 100 runs).
The advantage of the Wilson-Dependent method over the normal approximation method is clearly demonstrated by the validity statistics for the Parity problem. Because the probability of success is so high (0.985 over 3,400 runs), the samples with a low number of runs (25–200) were often unable to satisfy the normal method’s validity criterion of n(1−p) > 5. Even when the validity criterion was satisfied, for the small run sizes (i.e. 50 and 75 runs) none of the confidence intervals included the best estimate of the true computational effort. The Wilson-Dependent method, on the other hand, produced valid confidence intervals for all 10,000 samples at every run size, with a coverage of 95.3% for the experiments in that domain. Where it was fair to make a comparison, the widths of the confidence intervals were similar.
The Ant domain exemplifies a low probability of success (P(18) = 0.087). In this case the Normal method had difficulty satisfying its np > 5 criterion, producing valid confidence intervals for only 6% of the samples with 25 runs and 43% with 50 runs. For the confidence intervals that it did produce, however, the proportions that included the true value either exceeded or were very close to the intended 95%. Yet again the Wilson-Dependent method was the method of choice, as it produced confidence intervals for every sample with an average coverage of 95.2%. Further, for almost every run size the Wilson-Dependent method produced notably tighter confidence intervals.
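The two validity checks discussed for the Parity and Ant domains can be illustrated with a short sketch of the normal-approximation interval for a proportion. It assumes that the checks np > 5 and n(1−p) > 5 are applied to the proportion observed in the sample, which this section does not state explicitly; the function name `normal_interval` and the `threshold` parameter are illustrative.

```python
import math


def normal_interval(successes, n, z=1.96, threshold=5):
    """Normal-approximation interval for a proportion, or None if a check fails."""
    p_hat = successes / n
    # Validity rule quoted in the text: both np and n(1 - p) must exceed 5.
    # Applying it to the sample proportion is an assumption of this sketch.
    if n * p_hat <= threshold or n * (1 - p_hat) <= threshold:
        return None
    half = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half


print(normal_interval(2, 25))    # Ant-like sample: np = 2, so no interval
print(normal_interval(99, 100))  # Parity-like sample: n(1 - p) = 1, so no interval
print(normal_interval(30, 50))   # Symbreg-like sample: both checks pass
```

With success proportions near 0 or 1, small samples rarely contain more than five runs of the rarer outcome, which is why so few of the 25-run and 50-run samples yielded a valid interval.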
Finally, the Symbreg domain, with its non-extreme cumulative probability of success (P(12) = 0.593), levelled the playing field for the Normal method. The Normal method produced very good average coverage of 95.3% for an average of 99.5% of the samples. The Wilson-Dependent method did only slightly better in this instance, although its confidence intervals were a little tighter.
The Resampling method did very poorly at the lower run counts (25–100) for the Parity problem (coverages of 32%–78%). This was due to the low probability that a sample of the population would contain a run that did not find a solution before the minimum generation. It can now be seen that the resampling method is inappropriate for data where the cumulative success rate is very high at the minimum generation.
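The size of this effect can be estimated directly. The sketch below assumes only that runs are drawn independently and with replacement from the Parity dataset, so that each drawn run is a success with probability 3349/3400; the function name `prob_no_failure` is illustrative. It computes the probability that a sample contains no unsuccessful run at all.

```python
def prob_no_failure(p_success, n):
    """Probability that all n independently drawn runs are successes."""
    return p_success ** n


# Proportion of Parity runs that had succeeded by the minimum generation.
p_parity = 3349 / 3400

for n in (25, 50, 75, 100, 200, 500):
    print(f"{n:>3} runs: P(no failed run in sample) = {prob_no_failure(p_parity, n):.3f}")
```

For 25 runs the probability is roughly two in three, so most small samples contain no failed run from which a resampling procedure could recover any variability.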