
5. COEVOLUTIONARY AUTOMATED SOFTWARE CORRECTION


5.3.5. General Experimentation Results

Figure 5.16 shows the percentage of runs that yielded a solution for each program using both MOOP and SOOP.

Figure 5.17 shows the success rate for the runs ordered by the number of bugs present in the program. With a single bug present, the system performed well on all programs, achieving an average success rate of 86.25% on these runs. With two bugs, the average success rate falls to 74.58%. With three bugs present, the system struggled, achieving an average success rate of only 26.88%. This highlights the system’s need for the addition of partial solution identification techniques.

Table 5.7: Scoring Table Used for triangleClassification

        Equ.    Isc.    Scl.    Inv.    NA
Equ.    1.00    0.75    0.25    0.00    -1.00
Isc.    0.75    1.00    0.50    0.00    -1.00
Scl.    0.25    0.50    1.00    0.00    -1.00
Inv.    0.00    0.00    0.00    1.00    -1.00
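Table 5.7 can be read as a simple lookup from a pair of classifications to a score. The sketch below encodes it that way. It assumes, without confirmation from this section, that rows are the expected classification, that columns are the classification the program produced, and that NA stands for a run with no usable output; the function name `score` is hypothetical.

```python
# Table 5.7 as a nested lookup.
# ASSUMPTION: outer key = expected class, inner key = class the program produced.
# ASSUMPTION: "NA" means the program gave no usable output.
SCORES = {
    "Equ.": {"Equ.": 1.00, "Isc.": 0.75, "Scl.": 0.25, "Inv.": 0.00, "NA": -1.00},
    "Isc.": {"Equ.": 0.75, "Isc.": 1.00, "Scl.": 0.50, "Inv.": 0.00, "NA": -1.00},
    "Scl.": {"Equ.": 0.25, "Isc.": 0.50, "Scl.": 1.00, "Inv.": 0.00, "NA": -1.00},
    "Inv.": {"Equ.": 0.00, "Isc.": 0.00, "Scl.": 0.00, "Inv.": 1.00, "NA": -1.00},
}


def score(expected: str, produced: str) -> float:
    """Return the Table 5.7 score for one expected/produced pair."""
    return SCORES[expected][produced]
```

Over the four classification labels the table is symmetric, so swapping the assumed row and column roles only matters for the NA column.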

Figure 5.16: Percentage of Runs Yielding a Solution in General Experiments Ordered by Program

MOOP outperformed SOOP on the majority of the programs considered, and performed at least as well on all programs. This was the expected result, as even when the problem objectives are not conflicting, MOOP can still be expected to perform at least as well as SOOP. If the three-bug runs are omitted, MOOP achieved an average success rate of 93.75%, whereas SOOP achieved only 67.08%. Both optimization methods struggled when there were three bugs present in the program, though MOOP still performed better than SOOP on average in these cases.
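The averages quoted here are success rates grouped along different axes (optimizer, number of bugs). A minimal sketch of that grouping follows, using hypothetical run records; the record layout and the function name `success_rates` are assumptions, not the system's actual data format. The last lines check that the reported averages agree with each other, which they should if every configuration has the same number of runs (also an assumption).

```python
import math
from collections import defaultdict


def success_rates(runs, key):
    """Percentage of runs that yielded a solution, grouped by runs[i][key]."""
    solved = defaultdict(int)
    total = defaultdict(int)
    for run in runs:
        total[run[key]] += 1
        if run["solved"]:
            solved[run[key]] += 1
    return {group: 100.0 * solved[group] / total[group] for group in total}


# HYPOTHETICAL records, for illustration only (not experimental data).
runs = [
    {"optimizer": "MOOP", "bugs": 1, "solved": True},
    {"optimizer": "MOOP", "bugs": 2, "solved": True},
    {"optimizer": "SOOP", "bugs": 1, "solved": True},
    {"optimizer": "SOOP", "bugs": 2, "solved": False},
]
by_optimizer = success_rates(runs, "optimizer")
by_bugs = success_rates(runs, "bugs")

# Consistency check on the reported figures: with equal run counts per
# configuration, averaging over optimizers and averaging over bug counts
# (one- and two-bug runs only) must give the same overall mean.
mean_over_optimizers = (93.75 + 67.08) / 2
mean_over_bug_counts = (86.25 + 74.58) / 2
consistent = math.isclose(mean_over_optimizers, mean_over_bug_counts)
```

Both means come to 80.415%, so the per-optimizer and per-bug-count figures are mutually consistent under that assumption.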

Figure 5.17: Percentage of Runs Yielding a Solution in General Experiments Ordered by Number of Bugs Present

Solutions presented by the system were subjected to manual testing to determine whether each was a true solution. This testing used a hand-crafted set of test cases, designed specifically to demonstrate functionality implemented in the ES(s) for the bugs (not just the functionality affected by the bug(s)). The output of each solution for these tests was compared to that of the original unmodified program; if no differences were found, the solution was considered a true solution. Figure 5.18 shows the percentage of solutions presented by the system that are true solutions, ordered by the program used for the runs. Figure 5.19 shows the same percentage, ordered instead by the number of bugs present in the program used for the runs.
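The true-solution check described above is a differential comparison against the original program. A minimal sketch, assuming programs can be modeled as Python callables rather than compiled executables; `is_true_solution` and both remainder functions are hypothetical stand-ins:

```python
def is_true_solution(candidate, original, test_inputs):
    """True if candidate matches original on every hand-crafted test input."""
    for args in test_inputs:
        if candidate(*args) != original(*args):
            return False
    return True


# HYPOTHETICAL stand-in for an unmodified program.
def original_remainder(a, b):
    return a % b


# HYPOTHETICAL false solution: agrees on positive inputs only.
def false_remainder(a, b):
    return a - b * int(a / b)


narrow_tests = [(7, 3), (9, 4)]   # misses the border case
broad_tests = [(7, 3), (-7, 3)]   # includes a negative dividend

accepted_by_narrow = is_true_solution(false_remainder, original_remainder, narrow_tests)
accepted_by_broad = is_true_solution(false_remainder, original_remainder, broad_tests)
```

The false solution passes the narrow test set and fails the broad one, which is why the hand-crafted tests need to cover more than the functionality directly affected by the bug.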

Figure 5.18: Percentage of Solutions Yielded in General Experiments that are True Solutions Ordered by Program

Programs with a more complicated test case space will naturally be more prone to the presentation of false solutions, as these programs often have more border cases and/or rarely traversed execution paths. For example, the test case space for the triangleClassification program is defined by a finite set of relationships between the input values. The test case space for the replace program, however, is dramatically more complex, defined by pattern elements, pattern element order, pattern modifiers present, replacement text used, etc. With all other aspects equal, replace will therefore have a higher likelihood of producing a false solution than triangleClassification, since its test case space is so much larger. This is reflected in Figure 5.18, which shows that solutions presented for triangleClassification were always true solutions and that the number of true solutions found for replace decreased as the number of code elements increased.

The percentage of solutions generated that are true solutions is indicative of the CASC verification system’s performance. Like the system’s correction module, the verification module performed very well with up to two bugs present in the program, achieving an average true solution rate of 96.04% with one bug and 85.81% with two bugs. The performance of the verification system cannot be effectively assessed when three bugs were present, since few solutions were presented overall by the system.

Figure 5.19: Percentage of Solutions Yielded in General Experiments that are True Solutions Ordered by Number of Bugs Present

Figure 5.20 shows the average number of verification cycles used in the runs that yielded a solution. The values shown indicate the number of times the system entered the verification EA in the Testing and Verification module. All runs that yield a solution result in at least one verification cycle; values greater than one for this statistic indicate how often false positive solutions were identified and rejected by the system. Where Figure 5.18 and Figure 5.19 indicate how often the verification system did not catch a false solution, Figure 5.20 indicates how often these solutions were caught by the system, after which the system went on to find another solution.
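The verification-cycle statistic can be illustrated with a small counting loop. This is only a sketch of the bookkeeping, not the verification EA itself: `count_verification_cycles` and the `passes_verification` callback are hypothetical names, and candidates are assumed to arrive one at a time from the correction phase.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

P = TypeVar("P")


def count_verification_cycles(
    candidates: Iterable[P],
    passes_verification: Callable[[P], bool],
) -> Tuple[Optional[P], int]:
    """Verify candidates in order; return the accepted one and the cycle count.

    Each candidate costs one verification cycle. A rejected candidate is a
    false positive that was caught, and the search continues with the next.
    """
    cycles = 0
    for candidate in candidates:
        cycles += 1
        if passes_verification(candidate):
            return candidate, cycles
    return None, cycles


# HYPOTHETICAL candidates: two false positives, then an accepted solution.
accepted, cycles = count_verification_cycles(
    ["false-1", "false-2", "accepted"],
    lambda candidate: candidate == "accepted",
)
```

In this example the run uses three verification cycles: two rejections followed by one acceptance. A run whose first candidate is accepted uses exactly one.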

In general, the remainder and triangleClassification programs used more verification cycles than the other programs for all configurations. This is due to the rate at which the test case population is able to converge on a genotypic structure that generates an error in the prevailing members of the program population. Both the remainder and triangleClassification test case spaces are much smaller than those of printtokens2 and replace, making convergence much easier. With rapid convergence possible, the test case population quickly clusters around identified optima. When a program is then identified that corrects the error being focused on, it passes all of the test cases that were exploiting that error. However, while correcting the original error, additional errors can be introduced and go unidentified because of the focused nature of the test case population. When the candidate solution is identified and the system enters the Testing and Verification module, the error introduced during correction is identified trivially, since the test case population is no longer focused on a specific genotypic configuration.

Figure 5.21 displays box plots for the number of evaluations performed when the solution for a run was found. The boxes shown are bounded by the first and third quartiles of this value, the cross is at the median, and the endpoints of the lines on the top and bottom of the boxes are at the maximum and minimum values, respectively. An evaluation is defined as a single program-test case pairing. This gives an overview of the effort used to generate a solution for the problems considered. For the one- and two-bug problems, most solutions were presented in fewer than five million evaluations.
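The five values that define each box in these plots can be computed as sketched below. The evaluation counts are hypothetical, the function name `box_plot_summary` is an assumption, and the quartile method (Python's "inclusive" interpolation) is one of several conventions; the text does not state which one was used.

```python
import statistics


def box_plot_summary(values):
    """Return min, first quartile, median, third quartile, and max."""
    ordered = sorted(values)
    # ASSUMPTION: inclusive quartile interpolation; other conventions exist.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "min": ordered[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": ordered[-1],
    }


# HYPOTHETICAL evaluation counts, for illustration only.
evaluations = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
summary = box_plot_summary(evaluations)
```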

The remainder runs generally took more evaluations to generate a solution than the others. This is due to both the number of verification cycles used in these runs and the arithmetic nature of the program (making it very sensitive to arbitrary modification).

Figure 5.20: Average Number of Verification Cycles Used in Successful Runs

Figure 5.21: Box Plot for the Number of Evaluations Used to Generate a Solution

Figure 5.22 shows box plots for the CPU time, in seconds, used in all of the experimental runs. The majority of runs finished in less than 20,000 seconds (approximately five and a half hours). A large amount of time during runs is spent waiting for non-terminating programs to be killed, which is currently done at one-second resolution. These times could be reduced through higher-resolution program timing and the addition of more search guidance to the system.
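As an illustration of higher-resolution program timing, the sketch below kills a non-terminating child process after a sub-second limit. It uses Python's standard `subprocess` module and is not the mechanism used by the system described here; the helper name `finishes_within` and the 0.25-second limit are assumptions.

```python
import subprocess
import sys


def finishes_within(command, timeout_s):
    """Run command; return True if it exits before timeout_s seconds elapse.

    subprocess.run kills the child and raises TimeoutExpired when the limit
    is exceeded, so a non-terminating program costs at most about timeout_s.
    """
    try:
        subprocess.run(
            command,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return True
    except subprocess.TimeoutExpired:
        return False


# A terminating program and a non-terminating one, both run via the interpreter.
terminates = finishes_within([sys.executable, "-c", "pass"], 10.0)
loops_forever = finishes_within([sys.executable, "-c", "while True: pass"], 0.25)
```

With a fractional timeout, each non-terminating candidate costs a fraction of a second rather than a full second, which matters when many such candidates are evaluated per run.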

Figure 5.22: Box Plot for the CPU Time Used for the Experimental Runs in Seconds