
Traditionally, when comparing repair techniques, one would make use of an existing set of pre-processed bugs from the literature, such as ManyBugs or GenProg’s TSE 2012 benchmarks. The efficacy of repair actions would then be evaluated by determining the number of bugs for which a repair can be found within a fixed resource window [Le Goues et al., 2012a; Long and Rinard, 2015; Mechtaev et al., 2016]. However, such an approach is inappropriate when assessing repair models, for a number of reasons:

1. Diversity: To produce a fair comparison of techniques, one must subject the repair models to a large sample of bugs, taken from a diversity of programs, in order to minimise the effects of sampling errors and to mitigate the potential for bias. At present, the only publicly available benchmark suites for automated program repair presented in the literature (ManyBugs, IntroClass, GenProg TSE 2012, Defects4J, SPR) consist of at most several hundred bugs, across fewer than 15 different programs. Furthermore, the composition of these benchmarks is unlikely to be truly random, and may be subject to unintentional selection biases; for example, one may omit bug scenarios that are particularly expensive or difficult to replicate, such as bugs within the Linux kernel.

2. Overfitting: Operating on a relatively small set of bugs allows one to unintentionally overfit to the specifics of that sample. By analysing the composition of the corpus, one may quite simply introduce a number of specific repair actions, tailored to fix certain bugs. Such criticism may be levelled at the approach taken by PAR to the design and evaluation of its repair actions, each of which bears the watermarks of solving particular bugs from the benchmark it was tested on. Because its repair model was evaluated on such a small and well-studied set of benchmarks, we are unable to determine whether the repair model continues to hold the same utility in a more general context.

A potential solution to this problem of overfitting would be to use separate datasets for training (i.e., learning a suitable repair model) and testing (i.e., assessing the generality of the learned repair model). To avoid misleading results, such a dataset would need to represent a truly random, uniform sampling of real-world bug scenarios. To our knowledge, no such dataset is publicly available. Existing real-world datasets are either hand-picked (e.g., ManyBugs), and thus subject to selection bias, or represent a narrow category of bugs (e.g., IntroClass contains bugs in simple programming assignments). Given the costs and complexities associated with sourcing and evaluating a suitable dataset, the only feasible way to assess the effectiveness of a repair model, in general, is through the use of bug fix mining.

3. Repair Quality: Instead of determining the ability of repair actions to aid in the construction of high-quality fixes, one may unintentionally end up finding the repair actions which best exploit weaknesses in the test suites used by each of the bug scenarios. For example, one may trivially introduce an “Append Exit” repair action, which appends a statement containing exit(0); after the selected statement. For many of the bugs within the TSE benchmarks, and a subset of the ManyBugs scenarios, this would yield an acceptable fix, as only the exit status of the program is checked, rather than its outputs. (For more details, see Section 3.2.1.)

4. Feasibility: Even if one were to possess a sufficiently rich variety of bugs, evaluating the effectiveness of each repair action by using each of them to perform search may take prohibitively long. One could sample a subset of the candidate fixes generated by each repair action to reduce the running time, but doing so would degrade the accuracy of the evaluation and tie the measured performance of the repair actions to the underlying search technique.
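To make the Repair Quality concern (item 3) concrete, the following minimal Python sketch models a program as a function returning an exit status and an output string. Everything in it is a hypothetical stand-in chosen for illustration (the toy program, the two candidate patches, and the two test oracles); it is not drawn from the TSE or ManyBugs benchmarks. It only shows why an oracle that inspects the exit status alone would accept an “Append Exit” style patch.

```python
from typing import Callable, Tuple

# Hypothetical model: a program maps an integer input to (exit_status, output).
Program = Callable[[int], Tuple[int, str]]


def buggy(x: int) -> Tuple[int, str]:
    """Toy buggy program: fails with a non-zero status on negative input."""
    if x < 0:
        return 1, ""
    return 0, str(x * 2)


def append_exit_patch(x: int) -> Tuple[int, str]:
    """Degenerate patch: behaves as if exit(0) ran before any real work."""
    return 0, ""


def plausible_human_fix(x: int) -> Tuple[int, str]:
    """Assumed intended behaviour, for illustration only."""
    return 0, str(abs(x) * 2)


def weak_test(program: Program) -> bool:
    """Oracle that checks the exit status only."""
    status, _ = program(-3)
    return status == 0


def strong_test(program: Program) -> bool:
    """Oracle that checks both the exit status and the output."""
    return program(-3) == (0, "6")


if __name__ == "__main__":
    for name, prog in [("buggy", buggy),
                       ("append_exit", append_exit_patch),
                       ("human_fix", plausible_human_fix)]:
        print(name, "weak:", weak_test(prog), "strong:", strong_test(prog))
```

Under the weak oracle the degenerate patch is indistinguishable from the plausible fix; only the output-checking oracle separates them.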

Given the difficulties involved in this approach, we opt to assess the prevalence and graftability of repair actions through software repository mining instead, allowing a far larger corpus to be analysed, without the need for expensive test suite evaluations. The steps of our analysis are as follows:

1. A corpus of human bug fixes is mined from over 200 of the most popular repositories on GitHub containing C source files, using a custom-written, open-source repository mining tool, BugHunter.

2. A set of abstract syntax trees (ASTs) and AST differences is computed for each of the files modified by each fix, using GumTree [Falleri et al., 2014].

3. From these ASTs and their associated differences, a set of repair action instances is mined using the detection rules.

4. For each modified AST, a series of abstract and concrete donor pools are generated from its contents. Using these pools, we determine the graftability of each of the proposed repair actions.
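The graftability check in step 4 can be illustrated with a small Python sketch. It makes several simplifying assumptions that are not taken from the study: code fragments are compared as token strings rather than as AST nodes, the concrete donor pool is taken to be the set of fragments present in the pre-fix file, and the abstract donor pool is the same set with identifiers and integer literals replaced by placeholders. The function names and the keyword list are hypothetical.

```python
import re
from typing import Iterable

# Small, deliberately incomplete list of C keywords kept verbatim when abstracting.
C_KEYWORDS = {"if", "else", "while", "for", "return", "break", "continue", "sizeof"}

# Identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def normalise(fragment: str) -> str:
    """Canonical concrete form: tokens joined by single spaces."""
    return " ".join(TOKEN.findall(fragment))


def abstract(fragment: str) -> str:
    """Abstract form: identifiers become ID, integer literals become NUM."""
    out = []
    for tok in TOKEN.findall(fragment):
        if tok in C_KEYWORDS:
            out.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)


def graftability(inserted: str, pre_fix_fragments: Iterable[str]) -> str:
    """Classify an inserted fragment against donor pools built from the pre-fix file."""
    fragments = list(pre_fix_fragments)
    concrete_pool = {normalise(f) for f in fragments}
    abstract_pool = {abstract(f) for f in fragments}
    if normalise(inserted) in concrete_pool:
        return "concrete"   # the exact fragment already exists in the file
    if abstract(inserted) in abstract_pool:
        return "abstract"   # only the shape exists; names or literals differ
    return "none"


if __name__ == "__main__":
    pool = ["x = y + 1;", "free(buf);", "if (n > 0) return n;"]
    for candidate in ["free( buf );", "free(tmp);", "while (1) {}"]:
        print(candidate, "->", graftability(candidate, pool))
```

A real implementation would operate on the GumTree-derived ASTs rather than on token strings; the sketch only conveys the distinction between finding the exact fragment and finding a fragment of the same shape.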

Trade-Offs

Although this approach overcomes the identified problems involved in evaluating repair actions by incorporating them into the search procedure, it also comes with its own set of trade-offs. These trade-offs, and the steps taken to minimise their effects upon the results of the analysis, are as follows:

• Precision: Although this approach allows us to evaluate repair actions across a much larger corpus of bugs, it comes with the drawback that it excludes correct repairs that are not syntactically equivalent to the human bug fix. To partially mitigate this problem, we check whether the human repair is of the same kind as the action (e.g., does it modify an if guard?) in addition to checking whether the exact repair can be crafted from the elements of the program. This step also allows us to reason about the effectiveness of synthesis-driven automated repair approaches, and generative repair models, more like those from traditional applications of genetic programming.

• Irrelevant Changes: Human repairs may often contain modifications to the source code that are irrelevant to the bug fix. These changes might include aesthetic or structural changes, or they may have been bundled together with the bug fix commit. In theory, one could alleviate this problem by using delta-debugging to minimise the AST difference. In practice, however, there may be no (readily accessible) test suite, and so this technique is unusable. Instead, in this study, only bug fixes pertaining to a single file and function are used to perform the analysis of repair actions.
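The single-file, single-function restriction described under Irrelevant Changes amounts to a simple filter over the mined fixes. The Python sketch below assumes a hypothetical record type, Fix, in which each changed hunk has already been attributed to a (file path, enclosing function) pair; neither the type nor the field names come from BugHunter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Fix:
    """Hypothetical record for one mined bug-fixing commit."""
    commit: str
    # One (file path, enclosing function name) pair per changed hunk.
    touched: Tuple[Tuple[str, str], ...]


def is_single_function_fix(fix: Fix) -> bool:
    """True when every changed hunk lies in the same function of the same file."""
    return len(set(fix.touched)) == 1


def select_fixes(fixes: List[Fix]) -> List[Fix]:
    """Keep only the fixes confined to a single file and function."""
    return [f for f in fixes if is_single_function_fix(f)]


if __name__ == "__main__":
    fixes = [
        Fix("a1", (("src/io.c", "read_all"), ("src/io.c", "read_all"))),
        Fix("b2", (("src/io.c", "read_all"), ("src/io.c", "write_all"))),
        Fix("c3", (("src/io.c", "read_all"), ("src/net.c", "connect_to"))),
    ]
    print([f.commit for f in select_fixes(fixes)])
```

The filter trades corpus size for a lower chance that unrelated edits are counted as part of the repair; it does not remove irrelevant changes made within the same function.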

Although the results of this analysis are likely to underestimate the utility of certain repair actions, the overall results help us to understand the composition of human repairs, the effectiveness of the plastic surgery hypothesis, and how we might go about designing a more effective repair model.
