Although Koza measured minimum computational effort in terms of “individuals to be processed”, his expectation was made clear when he wrote “E [minimum computational effort] is the minimum number of fitness evaluations” [72, page 268]. The transition from considering fitness evaluations to considering individuals to be processed came from Koza’s expectation that the number of fitness evaluations was directly proportional to the number of individuals processed. This assumption was fair given that throughout all of Koza’s books the number of fitness evaluations executed per individual (and indeed per generation) remained constant.
Given that the number of fitness evaluations varies under fitness-based incremental evolution, maintaining Koza’s intention requires measuring computational effort in terms of the number of fitness evaluations. Fortunately, the number of fitness evaluations per individual remains constant across a given generation, so another option is to weight each generation by the number of fitness-case evaluations executed within it: a generation with 100 fitness cases should be considered twice as computationally expensive as a generation where only 50 fitness cases were evaluated.
8.4.1 Related Work
For his PhD thesis, Chris Gathercole studied standard GP with a modification he termed dynamic subset selection (DSS) [44, chapter 6]. For each generation, DSS uses only a subset of the total fitness cases. Selection of the subset is biased towards fitness cases that have proved difficult over the previous generations. Gathercole struggled to compare his results to standard GP and left it up to his readers to decide which assessment was appropriate: “DSS [matched] GP results using many more generations, but only 20% of the number of tree evaluations”. He terminated his runs using neither a maximum number of generations nor a maximum number of evaluations (in fact, his termination criterion is unclear) and yet still compared their final results. This leaves much to be desired. Because of his very limited number of runs, he made no other analysis of his results, but had he executed more runs he would have benefited from the use of adjusted generations.
Users of DSS, however, would not generally be interested in comparing their performance to standard GP. For example, Stephenson et al. used Gathercole’s technique to reduce their computational requirements [105]; they ran their test and control experiments (both using DSS) to 50 generations and then measured the speed-up of their approach. Their implicit assumption was that the cost of one generation was equivalent to the next. Given that DSS uses a fairly consistent number of fitness cases per generation, this assumption holds true. It is only when one wishes to compare DSS against a method that uses a different number of evaluations that adjusted generations would show a benefit.
If, however, the DSS algorithm were modified such that the number of fitness cases was also dynamic, then just as incremental evolution benefits from adjusted generations, so too would modified DSS. Gathercole himself introduced another idea that would have benefited from adjusted generations: Limited Error Fitness (LEF) [44, chapter 7] terminates the evaluation of an individual if the individual fails to correctly solve a threshold of fitness cases. In this way the number of evaluations per individual is no longer constant.
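The effect of early termination on the evaluation count can be illustrated with a minimal sketch. This is not Gathercole's implementation: the function name, the `solves_case` predicate, and the choice to stop once the failures exceed (rather than reach) the limit are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Case = TypeVar("Case")


def evaluate_with_error_limit(
    solves_case: Callable[[Case], bool],
    fitness_cases: Iterable[Case],
    error_limit: int,
) -> Tuple[int, int]:
    """Evaluate one individual, stopping early once it has failed too often.

    Returns (cases_solved, cases_evaluated). The second value is what a
    per-generation evaluation tally would need to record.
    """
    failures = 0
    evaluated = 0
    for case in fitness_cases:
        evaluated += 1
        if not solves_case(case):
            failures += 1
            # Assumption: evaluation stops when failures exceed the limit.
            if failures > error_limit:
                break
    return evaluated - failures, evaluated
```

Under this sketch, two individuals in the same generation can consume different numbers of evaluations, so the cost of a generation would have to be tallied per individual rather than inferred from the number of fitness cases.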
In a similar vein to Gathercole’s DSS, Qureshi’s PhD thesis considered the comparison of standard GP to GP where a fixed number (but random selection) of fitness cases were evaluated per generation [98, section 4.7]. He too had a difficult time in that his comparison had neither the same number of generations nor the same number of evaluations. Fortunately for him, no matter which comparison he took, even one that disadvantaged his test cases, his technique outperformed his control experiment. Thus he avoided having to make a more rigorous comparison. Nonetheless an interesting point in his discussion was that some of his readers may feel that the comparison should be done on an individuals-evolved basis. Please note that adjusting the weightings of the generations does not alter the number of individuals processed.
Finally, if fitness-based rather than success-based analysis is desired, Steffen Christiansen’s y-test was designed specifically for comparisons based on different evaluation counts [21, 23]. Although it appears an excellent method to distinguish between two experimental results, it is unclear how one might compare two differing y-test results. Adjusting the generations does not leave this as an issue.
8.4.2 Calculating Adjusted Generations
To calculate the adjusted generations, for each run we stored the number of fitness cases evaluated in each generation. This list of evaluations was summed, giving the total number of evaluations for the run. The total number of evaluations was then divided by the number of fitness cases in the final stage. This number was used as the cost of the run as measured in “adjusted generations”.
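The calculation described above can be sketched in a few lines. The function and parameter names are hypothetical, and the example figures (50 and 100 fitness cases) are illustrative only, not taken from the experiments.

```python
from typing import Sequence


def adjusted_generations(
    cases_per_generation: Sequence[int], final_stage_cases: int
) -> float:
    """Cost of one run, expressed in adjusted generations.

    cases_per_generation: number of fitness cases evaluated in each
        generation of the run (one entry per generation).
    final_stage_cases: number of fitness cases used in the final stage.
    """
    if final_stage_cases <= 0:
        raise ValueError("final_stage_cases must be positive")
    total_evaluations = sum(cases_per_generation)
    return total_evaluations / final_stage_cases


# Hypothetical run: four generations at 50 cases, then three at 100 cases.
incremental_run = [50] * 4 + [100] * 3
print(adjusted_generations(incremental_run, 100))  # 5.0
```

For a standard run in which every generation uses the full set of fitness cases, this value equals the ordinary generation count, which is what makes the two directly comparable.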
One might claim that this measure is only a scaled version of measuring the cost of a run in terms of evaluations rather than generations. This is true. The advantage in scaling is that direct comparison with standard GP becomes possible: one can immediately compare success effort and (to a lesser extent) minimum computational effort measures.
Note that this process meant we could post-analyse a standard run and so we did not re-run the experiments for adjusted generations, but instead just re-considered their results given the new generations-to-failure or generations-to-success. Note also that this process does not change whether a run succeeded or failed; it only adjusts a run’s cost.
Calculating the three measures on adjusted generations required modification only to the minimum computational effort method. The runs were binned into groups one generation wide, effectively meaning the ceiling of the adjusted generations was used. Neither success effort nor final success proportion required modification to their methods. (Note that the use of adjusted generations has no impact at all on the final success proportion measure.)
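As a rough illustration of the binning step, the sketch below applies the ceiling to each successful run's adjusted-generation cost and then evaluates Koza's effort formula, with R(z) = ceil(log(1 - z) / log(1 - P)), over the resulting bins. The function name, the parameter names, the default z = 0.99, and the decision to treat the ceiling as the number of generations processed are assumptions; the exact indexing used for the results in this thesis may differ.

```python
import math
from typing import Optional, Sequence


def min_computational_effort(
    success_costs: Sequence[float],
    total_runs: int,
    population_size: int,
    z: float = 0.99,
) -> Optional[int]:
    """Minimum computational effort over ceiling-binned adjusted generations.

    success_costs: adjusted-generation cost of each successful run.
    total_runs: number of runs executed, successful or not.
    Returns None when no run succeeded.
    """
    if not success_costs:
        return None
    # Bin each run into the one-generation-wide group given by the ceiling.
    bins = [math.ceil(cost) for cost in success_costs]
    best = None
    for g in range(1, max(bins) + 1):
        successes = sum(1 for b in bins if b <= g)
        if successes == 0:
            continue
        p = successes / total_runs  # cumulative success proportion by bin g
        if p >= 1.0:
            runs_needed = 1
        else:
            runs_needed = math.ceil(math.log(1 - z) / math.log(1 - p))
        # Assumption: g generations of population_size individuals per run.
        effort = population_size * g * runs_needed
        if best is None or effort < best:
            best = effort
    return best
```

For example, with hypothetical successful-run costs of 2.5, 3.0 and 7.2 adjusted generations out of four runs and a population of 100, the costs fall into bins 3, 3 and 8, and the minimum occurs at bin 3.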
Other than staying true to Koza’s intentions, and giving a much closer approximation to the true cost of using genetic programming, this modification has little impact on the qualities of his measure. Minimum computational effort with adjusted generations still provides “a hardware-independent, software-independent, and algorithm-independent way of comparing the performance of adaptive algorithms” [72, page 293]. However, as will be discussed in the following section, when adjusted generations are used, the measure may become only an upper bound for Koza’s minimum computational effort.
This modification does have an impact on the results calculated using success effort, but there is no impact on the final success proportion.