Discussion & Conclusion - Advanced Techniques for Search-Based Program Repair

From the results of our evaluation, we observe relatively little benefit in incorporating information learned from the evaluation of candidate patches into the fault localisation, in comparison to previous attempts to use mutation analysis to locate faults. We believe there may be a number of reasons for this result:

• Lack of mutants: for a number of bug scenarios, we found that excessively long test suite evaluation and compilation times prevented the search from producing an adequate sample of mutants at each statement.

• Lack of passing test coverage: in cases where a large number of statements are covered by no passing tests, all of these statements will be assigned a suspiciousness score of 1.0 by µP2F(s). Consequently, this layer will either suppress the fixed statement, if it is covered by any positive test cases, or it will fail to identify it amongst the many statements without positive test coverage.

• All-or-nothing f2p response: within GenProg’s search space, we observe erratic negative test case behaviour. In some cases, the only statements to pass a negative test case were those that could be repaired. In other cases, negative test case passes were much more common, whilst no mutants at the statement that was repaired (excluding the solutions) passed any negative tests. If one knew which type of f2p response one was dealing with, a more accurate fault localisation might be possible. In the future, we plan to explore whether the rarity of negative test passes might be used to determine whether a given failing-to-passing event is a coincidence, or indicative of a potential repair at that statement.

• Coarsely-grained mutation operators: one explanation for the relative lack of success in incorporating the results of the mutation analysis into the fault localisation may be the coarsely-grained nature of the repair operators within GenProg’s search space. With such actions, it may be difficult to expose subtle bugs within the statement that might otherwise be identified using finer-grained mutation testing operators. In our preliminary analysis, we find that most repairs tend to either have no effect on the outcomes of the test suite, or to cause all of their covering tests to fail; this all-or-nothing behaviour may be a consequence of the granularity of the search operators.

• Combining information: to combine each of the proposed layers of fault localisation from our evaluation, we computed the product of each of the layers; a necessarily arbitrary decision. It is not immediately clear how to combine information from multiple sources in a more meaningful and effective way, nor how to deal with conflicting or corroborating suspiciousness values.
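As a rough illustration of the last point, the sketch below combines two layers of suspiciousness scores by taking their product and, for contrast, by a weighted average. The layer names, scores, and weights are hypothetical and chosen only to show how the two schemes treat a conflict between layers; this is not the implementation used in the evaluation.

```python
from math import prod


def combine_by_product(layers, statement):
    """Multiply the suspiciousness that each layer assigns to a statement."""
    return prod(layer[statement] for layer in layers)


def combine_by_weighted_average(layers, weights, statement):
    """Weighted average of per-layer suspiciousness (weights assumed to sum to 1)."""
    return sum(w * layer[statement] for layer, w in zip(layers, weights))


# Hypothetical scores for three statements under two layers.
spectrum_layer = {"s1": 0.9, "s2": 0.5, "s3": 0.5}
mutation_layer = {"s1": 1.0, "s2": 1.0, "s3": 0.0}
layers = [spectrum_layer, mutation_layer]

product_scores = {s: combine_by_product(layers, s) for s in spectrum_layer}
average_scores = {
    s: combine_by_weighted_average(layers, [0.5, 0.5], s) for s in spectrum_layer
}
```

Under the product, a score of zero in any single layer removes a statement from consideration regardless of what the other layers report, whereas the average lets the layers partially offset one another. Which behaviour is preferable depends on how far each layer can be trusted.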

Going forward, to translate the potential of mutation analysis approaches such as MUSE into efficiency gains in automated program repair, we intend to explore the following:

• Richer repair models: in an effort to avoid the all-or-nothing behaviour exhibited by mutants generated using GenProg’s coarsely-grained statement-level operators, we intend to explore the utility of lower-level repair operators including, but not limited to, those traditionally used within mutation testing. Beyond the use of mutation testing operators, we intend to explore whether expression-level operators, such as the replacement of the LHS or RHS of an assignment, or the modification of function call parameters, may be used to predict the location of the fault.

• Shape prediction: in addition to investigating whether mutants can be used to predict the location of the fault, we are interested to see whether the results of particular types of mutants can be used to refine suspiciousness beyond the level of the statement, and if they can be used to predict the type of repair that might be needed (for instance, if replacing an if condition appears to have no effect, that might suggest that a replacement condition is needed).

• Ensemble learning: rather than combining fault localisation layers by taking their product, or using a simple weighted average, we may see better results (when using a different model) if ensemble learning techniques are used to find more effective ways of combining these multiple sources of information.

In conclusion, we find that mutation analysis appears to offer little potential for online fault localisation within GenProg’s search space. From observation of the mutants, we see that GenProg exhibits an all-or-nothing landscape, where most edits are either neutral or fail all of their covering tests. Given the previous success of Metallaxis and MUSE, we believe that this all-or-nothing behaviour may be partly responsible for the lack of improvement. Alternatively, it may be that the results observed by Metallaxis and MUSE fail to scale to large real-world programs. In future work, we intend to investigate the assumptions behind these techniques more deeply.

To benefit from the knowledge of its mutants’ test case results, we believe a set of more finely grained mutation operators is required; a requirement that will most likely allow a larger number of bugs to be solved at the same time.

Repair Model

Motivated by our findings and the findings of others—that the statement-level repair model used by GenProg is ineffective at finding repairs—in this chapter we conduct an empirical study of bugs in real-world C programs to determine a more effective repair model. Specifically, we explore the viability of extending the ideas of plastic surgery—using code from existing sources to craft the materials necessary for a repair—beyond the level of statements, and to a larger set of richer, more granular changes, capable of fixing a greater number of bugs.

Using a new bug fix mining tool, BugHunter, we automatically identify bug fixing commits within Git repositories, before extracting instances of AST-level repair actions and collecting a pool of donor code snippets from the program. Equipped with this information, we determine the fraction of bug fixes which involve a particular repair action, and the fraction of repair action instances that can be grafted [Barr et al., 2014] from existing code within the program. In an effort to reduce the search space, and backed by the findings of previous studies regarding the effectiveness of plastic surgery, we limit the membership of the donor code pool to snippets taken from the file where the fix occurred. To avoid the rejection of potential snippets due to differently labelled variables, we also explore the effectiveness of plastic surgery when such labels are removed; we refer to the labelled and unlabelled forms of the donor pool as the concrete and abstract donor pools, respectively.
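The distinction between the concrete and abstract donor pools can be sketched as follows. This is a simplified illustration and not BugHunter's implementation: it works on source text with a regular expression rather than on the AST, it abstracts every non-keyword identifier (including function names, not only variables), and both the placeholder scheme (`$0`, `$1`, ...) and the deliberately incomplete keyword list are assumptions.

```python
import re

# Deliberately incomplete list of C keywords that must not be abstracted.
C_KEYWORDS = {"if", "else", "while", "for", "return", "sizeof",
              "int", "char", "void", "long", "unsigned", "struct"}
IDENTIFIER = re.compile(r"\b[A-Za-z_]\w*\b")


def normalise(snippet):
    """Collapse runs of whitespace so that formatting does not affect matching."""
    return " ".join(snippet.split())


def abstract(snippet):
    """Replace each distinct identifier with a placeholder, numbered by first use."""
    names = {}

    def rename(match):
        token = match.group(0)
        if token in C_KEYWORDS:
            return token
        return names.setdefault(token, "$" + str(len(names)))

    return IDENTIFIER.sub(rename, normalise(snippet))


def is_graftable(snippet, donor_pool, use_abstract=False):
    """Check whether a required snippet can be found in the donor pool."""
    form = abstract if use_abstract else normalise
    return form(snippet) in {form(donor) for donor in donor_pool}


# Hypothetical donor pool taken from the file under repair.
donor_pool = ["count = count + 1;", "if (p == NULL)"]
needed = "total = total + 1;"
```

Here `needed` cannot be grafted from the concrete pool, because no donor uses the name `total`, but it can be grafted from the abstract pool, where both it and the first donor reduce to `$0 = $0 + 1;`.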

From analysis of the results, we find that more granular repair actions, at and below the level of statements, are better suited to plastic surgery than more block-level changes. By removing labels from donor code snippets and treating them as templates, we observe a substantial increase in graftability, rising from 0–58% to 16–94%. To fix a larger number of bugs, we suggest incorporating a subset of the most frequent repair actions into the repair model, and using an abstract donor pool to craft repairs.

The contributions of this chapter are as follows:

• We present BugHunter, a repair action mining tool, capable of identifying bug fixing commits within Git repositories and discovering potential instances of automated repair actions at the AST-level.

• We build upon previous definitions of repair models [Martinez and Monperrus, 2013] and provide inference rules for a set of repair actions, used to perform the study.

• We examine the frequency of 23 repair actions, inspired by repair models used within existing repair techniques, across 10,000 identified bug fixes in 200 open-source C projects.

• We determine the graftability [Barr et al., 2014] of each of these repair actions within a set of labelled and unlabelled donor code pools.

• From the observed frequencies and graftabilities of repair actions and donor code pools, we make a number of suggestions for constructing a future repair model, capable of addressing a greater number of bugs.

The rest of this chapter is structured as follows: Section 5.1 gives a brief review of the related literature. Section 5.2 elaborates on the motivation of this study and outlines our research questions. Section 5.3 discusses our methodology; Section 5.4 expands upon the definition of repair models and provides descriptions, in the form of inference rules, for each of the repair actions studied. Section 5.5 outlines our approach to mining repair action instances from real-world software projects. Section 5.6 presents and discusses the results of our study. Finally, Section 5.7 summarises our findings, provides suggestions for the construction of more effective repair models, and outlines future directions for research.

5.1. Related Work

In this section, we briefly discuss previous research of relevance to repair models and plastic surgery, as well as highlighting where our study differs and builds upon this body of work.

The Plastic Surgery Hypothesis

Barr et al. [2014] summarise the “plastic surgery hypothesis” as follows: Changes to a codebase contain snippets that already exist in the codebase at the time of the change, and these snippets can be efficiently found and exploited.

The plastic surgery hypothesis underlies various genetic-programming-based approaches to automated program repair, optimisation, and improvement, all of which use existing code to find solutions.

Given this definition, Barr et al. [2014] break down this hypothesis into two assumptions: (1) that changes to the program are repetitive, relative to their parent, and (2) that this repetitiveness may be efficiently exploited to construct those changes. In particular, techniques such as GenProg rely on the existence of donor snippets within the current, buggy form of the program. To test the first assumption, they measure the graftability of 15,273 commits, taken from several large-scale Java projects. The graftability of a change is defined as the number of its snippets for which a matching snippet can be found within the search space. For the purposes of the study, source code lines (with whitespace removed), rather than AST-level entities, are treated as snippets. Barr et al. [2014] measure this quantity across three different search spaces, each containing snippets taken from the following sources, respectively:

• The parent of the change

• All non-parental ancestors of the change

• The most recent version of a foreign software project
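The graftability measurement described above can be sketched as follows, treating whitespace-stripped source lines as snippets. The function names and the toy inputs are assumptions made for illustration; the sketch follows the definition as stated in the text and is not Barr et al.'s tooling.

```python
def strip_whitespace(line):
    """Remove all whitespace from a source line, as done for line-level snippets."""
    return "".join(line.split())


def snippet_set(source_text):
    """Collect the non-empty, whitespace-stripped lines of a search space."""
    stripped = (strip_whitespace(line) for line in source_text.splitlines())
    return {line for line in stripped if line}


def graftability(change_lines, search_space_text):
    """Count the snippets of a change that have a match in the search space."""
    space = snippet_set(search_space_text)
    return sum(1 for line in change_lines if strip_whitespace(line) in space)


# Hypothetical change and two of the three search spaces.
change = ["int a=0;", "free(p);"]
parent = "int a = 0;\nreturn a;\n"
foreign_project = "free( p );\n"

in_parent = graftability(change, parent)
in_foreign = graftability(change, foreign_project)
```

Each search space is queried independently, so the same change can be compared across the parent, its non-parental ancestors, and a foreign project simply by swapping the second argument.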

To test the validity of the second assumption, that the space of donor snippets can be efficiently explored, Barr et al. [2014] also measure the density of matching snippets within each space.

From the results of their analysis, Barr et al. [2014] found that, in most cases, donor grafts could be found within the current version of the program and that rarely was it necessary to search non-parental ancestors for the graft. Moreover, 30% of grafts could be found within the same file at which the human-written change occurred. These results support both the plastic surgery hypothesis and GenProg’s interpretation of that hypothesis: restricting the attention of the search to snippets within the current versions of the faulty files allows a large number of grafts to be found much more efficiently.

Our study builds upon the work by Barr et al. [2014] by investigating redundancy at the level of program repair actions, rather than at the level of source code lines. We choose to study source code redundancy within the context of particular repair actions since this allows us to more accurately determine the effectiveness of plastic surgery when applied to program repair.

Empirical Inquiry into Redundancy Assumptions of APR

Martinez et al. [2014] investigate the underlying assumption of plastic-surgery-driven APR techniques, such as GenProg, PAR, and SearchRepair: that the ingredients used by a fix already exist within the program. To investigate this assumption, the authors examine six open-source Java projects and determine the fraction of (version control) commits (including those which do not pertain to bug fixes) that are “temporally redundant”. A commit is deemed temporally redundant if it can be composed in its entirety from code introduced by previous commits. The authors measure temporal redundancy at both the line-level and token-level, and within the same file (termed local temporal redundancy) and across all files (termed global temporal redundancy).

The results of the study demonstrate a stark contrast in temporal redundancy at the line-level and token-level: 2–17% of commits are temporally redundant at the line-level, whereas 8–52% are temporally redundant at the token-level. This finding suggests that more granular changes to the program are easier to compose from previous versions of the program, although this benefit comes at the cost of a significantly increased search space.
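The difference between the two granularities can be illustrated with a small sketch. The tokeniser below is a crude assumption (identifiers, integer literals, and single punctuation characters), and the function names and inputs are hypothetical; Martinez et al.'s actual tokenisation and matching may differ.

```python
import re

# Crude tokeniser: identifiers, integer literals, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenise(line):
    return TOKEN.findall(line)


def is_temporally_redundant(added_lines, history_lines, level="line"):
    """Check whether a commit can be composed entirely from previously seen code."""
    if level == "line":
        seen = {line.strip() for line in history_lines}
        return all(line.strip() in seen for line in added_lines)
    if level == "token":
        seen = {tok for line in history_lines for tok in tokenise(line)}
        return all(tok in seen for line in added_lines for tok in tokenise(line))
    raise ValueError("level must be 'line' or 'token'")


# Hypothetical history (code introduced by earlier commits) and a new commit.
history = ["x = y + 1;", "return z;"]
commit = ["return y + 1;"]

line_redundant = is_temporally_redundant(commit, history, level="line")
token_redundant = is_temporally_redundant(commit, history, level="token")
```

The commit is not redundant at the line-level, since `return y + 1;` never appeared before, but it is redundant at the token-level, since every token it uses already exists in the history. The token pool admits far more recombinations than the line pool, which is the source of the larger search space.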

From analysis of global and local redundancy, the authors find that 8–29% of tokens can be found within previous versions of the same file, compared to 31–52% across previous versions of all files. Importantly, the size of the local pool was two to three orders of magnitude smaller than the global pool. The cost effectiveness of searching the donor pool lends support to GenProg’s decision to restrict the composition of its donor statements to those within the files under repair.

Our study differs from that conducted by Martinez et al. [2014] in two important respects. Firstly, we focus our attention on the effectiveness of plastic surgery within C programs, rather than Java programs. Secondly, we measure redundancy within the context of concrete repair actions, rather than generically measuring it at the line- or statement-level. For instance, we determine the graftability of a “Replace If Guard” action by measuring redundancy at an expression level.

Mining Software Repair Models

Martinez and Monperrus [2013] conduct an empirical study of the frequency of 41 different AST-level repair actions within real-world bug fixes in Java programs, and argue that not all probabilistic repair models are equally effective. As the dataset for their analysis, the authors automatically identify the subset of bug fixes in the CVS-Vintage dataset [Monperrus and Martinez, 2012]: a dataset containing 89,993 source-code versioning transactions across 14 open-source Java programs. To find AST-level changes for each fix, they employ ChangeDistiller [Gall et al., 2009], an AST differencing tool which describes modifications to ASTs using 41 different types of changes (e.g., “Statement Insertion”, “Statement Update”, “Statement Deletion”, “Addition of final to class declaration”).

Martinez and Monperrus [2013] look at the different outcomes produced by using different methods of bug identification. They find that the number of source code changes is a good predictor of whether the changes within that transaction differ from those of normal software evolution; transactions with fewer changes tend to be the most different. In contrast, when transactions were selected purely based on the presence of indicators (e.g., “bug” or “fix”) within their associated messages, the observed distribution of repair action frequencies was almost identical to the distribution across all transactions, suggesting such measures are ineffective at isolating fixes to the source code.

When considering the set of transactions that contain only a single change, as reported by ChangeDistiller, the top five change types were as follows: Statement Update (38%), Add Function (14%), Condition Change (13%), Statement Insertion (12%), Statement Deletion (6%). Between them, these change types account for 83% of all changes.
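The two selection strategies, and the change-type distribution computed over the selected transactions, can be sketched as below. The transaction records, field names, and keyword list are hypothetical; in the original study the change types are produced by ChangeDistiller, which this sketch does not invoke.

```python
from collections import Counter


def select_by_keyword(transactions, keywords=("bug", "fix")):
    """Keep transactions whose commit message mentions any indicator keyword."""
    return [t for t in transactions
            if any(k in t["message"].lower() for k in keywords)]


def select_by_size(transactions, max_changes=1):
    """Keep transactions containing between one and max_changes AST-level changes."""
    return [t for t in transactions if 1 <= len(t["changes"]) <= max_changes]


def change_type_distribution(transactions):
    """Relative frequency of each change type across the given transactions."""
    counts = Counter(c for t in transactions for c in t["changes"])
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()} if total else {}


# Hypothetical transactions, each with pre-computed change types.
transactions = [
    {"message": "Fix null check", "changes": ["Condition Change"]},
    {"message": "Refactor parser; fix typo in docs",
     "changes": ["Statement Update", "Statement Insertion", "Add Function"]},
    {"message": "Tidy up", "changes": ["Statement Update"]},
]

keyword_selected = select_by_keyword(transactions)
single_change = select_by_size(transactions, max_changes=1)
single_change_distribution = change_type_distribution(single_change)
```

In this toy data the keyword filter admits a large refactoring commit that merely mentions "fix", which hints at why message-based selection may yield a distribution close to that of ordinary software evolution.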

The focus of the study by Martinez and Monperrus [2013] is on the frequency of particular repair actions within Java projects. In contrast, although we also measure the frequency of repair actions (albeit for C), our study is primarily focused on the effectiveness of plastic surgery in the context of APR.

Critical Review of PAR

As an alternative to GenProg’s coarsely-grained repair model, Kim et al. [2013] propose a hand-crafted repair model, based on frequently observed bug fix patterns (e.g., insertion of a null check). They incorporate this repair model into PAR, an evolutionary program repair technique aimed at repairing Java bugs. Compared to an implementation of GenProg for Java, PAR was able to repair more bugs (27 out of 119, vs. 16). The authors also demonstrate the acceptability of PAR’s patches through the use of a human study.

Despite achieving impressive results (and an ACM SIGSOFT Distinguished Paper Award), both PAR and the methodology used in its evaluation have since been the subject of criticism by Monperrus [2014]. This criticism has focused on both the unbalanced composition of the bug scenario benchmark used in the evaluation, and the methodology used to conduct the human study. The latter is of relevance to this study. Principally, Monperrus [2014] highlights potential biases within the human study, which may lead participants to judge the correctness of a patch based on its visual similarity to human-repaired bug fixes, rather than its semantics. He argues that conflating correctness with appearance may unfairly lead to the dismissal of more alien-looking but otherwise correct patches.

In this study, we estimate the utility of a repair action (i.e., its ability to generate repairs) by its frequency within a corpus of mined human-written bug fixes. In practice, this decision may not accurately reflect the utility of some repair actions (e.g., it may be possible to fix a bug using a certain repair action, but such an action is unlikely to be performed by a human programmer, owing to its aesthetics or other such non-functional properties). Nonetheless, our results provide a reasonable approximation of utility. For a more detailed discussion of both PAR and its critique, see Section 2.2.5.

Bug Fix Patterns in Java

In an effort towards realising a more effective repair model for Java programs, Soto et al. [2016] use software repository mining to determine the frequency (and, to some extent, the composition) of certain bug fix patterns within human-written repairs for Java. As a corpus for their study, the authors use the publicly available September 2015/GitHub dataset, provided by Boa [Dyer et al., 2013], containing 4.5 million identified bug fixes in over 500,000 Java projects.

As the first part of their study, Soto et al. [2016] assess how many files are changed by each fix, where file insertions, deletions and modifications are all considered to be changes. Across all file types, including non-source code files, the authors observe a median of 2 changes, and a surprisingly high mean of 11.3 changes, suggesting a long-tailed distribution. When only Java source code files are considered, the authors find that each bug fix changes a mean of 4.47 files (the median number of changes is omitted from the paper). Although this adjusted mean is still high, the relatively low (and more relevant) median suggests that most bug fixes involve
