2.2 Benchmarking Clone Detection
2.2.6 Measuring Precision
Compared to recall, it is easier to measure precision without an oracle. The precision of a clone detection
tool (for a given subject software system) can be estimated by manually validating a statistically significant sample of its detected clones. Precision is then the fraction of the validated clones that are judged to be true clones.
This is repeated for a collection of diverse software systems from a variety of programming domains, and the
precision measurement is averaged. While this procedure is rather simple, there are still some challenges.
Clone validation is a very effort-intensive process, and validating detected clones in a variety of software
systems can take a significant amount of time. Clone validation is also subjective [9, 19, 21], so precision measured by different individuals could vary significantly.
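The estimation procedure described above can be illustrated with a short sketch. It assumes each sampled clone has already been judged by a human validator; the system names and judgements below are invented for illustration and are not results from any study.

```python
from statistics import mean


def estimate_precision(judgements):
    """Fraction of the validated sample that was judged to be true clones."""
    if not judgements:
        raise ValueError("need at least one validated clone")
    return sum(judgements) / len(judgements)


# Hypothetical validation results per subject system (True = judged a true clone).
samples = {
    "system_a": [True, True, False, True],
    "system_b": [True, False, True, True, True],
}

# Precision per subject system, then averaged across systems.
per_system = {name: estimate_precision(j) for name, j in samples.items()}
overall = mean(per_system.values())
```

Because the judgements come from a human validator, two validators given the same sample may produce different lists, and therefore different precision estimates.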
Part I
Synthetic Clone Benchmarking with Mutation Analysis
In this part of the thesis, we present our work with synthetic clone benchmarking. We introduce the Mutation
and Injection Framework and use it in a number of tool comparison studies. The Mutation Framework
evaluates clone detection recall using synthetic clones in a mutation-analysis procedure. Synthetic benchmarking
is needed to evaluate clone detection recall at a fine granularity for the different kinds of clones that can exist. Another advantage of synthetic benchmarking is that it allows controlled recall experiments to
be conducted, reducing or removing biases in the results. Fine-grained and controlled recall measurement
is more difficult with real-world benchmarks, but real-world benchmarks evaluate tools against complex and realistic
(developer-produced) clones, which is why both synthetic and real-world benchmarks are needed. In Part II,
we discuss our real-world clone benchmark, BigCloneBench.
The Mutation and Injection Framework procedure was previously proposed in related work [107] and
prototyped [111] for a single clone detection tool (NiCad). For this thesis, we improved the framework in
the following ways: (1) we generalized the framework for compatibility with most clone detection tools, (2) we improved the mutation operators and mutation process for better accuracy and control, (3) we designed an
evaluation procedure to allow recall to be compared across the clone types and clone edit types without
bias, and (4) we implemented the framework as an extensible tool. The framework enables users to perform
custom and fully automated recall evaluation experiments, which can then be shared, examined, repeated
and extended by the community. We discuss the methodology and design of the Mutation Framework in
Chapter 3.
In Chapter 4, we use our Mutation Framework to evaluate state-of-the-art clone detection tools per clone type. We compare our measurements against our expectations for the tools, and against the popular
earlier clone benchmark, Bellon’s Benchmark [13]. In this experiment, we validate the accuracy of the
Mutation and Injection Framework, and demonstrate the need for synthetic benchmarking. We also show
that Bellon’s Benchmark may not be accurate for modern clone detection tools, creating the need for a new
real-world clone benchmark, which is a motivation for our BigCloneBench (Part II).
In Chapter 5, we evaluate and compare the recall of state-of-the-art tools at a fine granularity using our
Mutation and Injection Framework. Specifically, we measure the recall of the tools per edit type from the editing taxonomy for block and function granularity clones in Java, C and C# systems. In this study we
demonstrate the advantage of the Mutation Framework’s synthetic approach in evaluating the capabilities of
the tools and pin-pointing their individual strengths and weaknesses.
In Chapter 6, we demonstrate how the Mutation Framework can be extended with custom mutation
operators to evaluate clone detection tools for any kind of clone. In our case study, we synthesize Type-3
clones with a single dissimilar gap of variable length. We evaluate the robustness of Type-3 clone detectors
against Type-3 clones with gaps of various sizes. We generate a reference corpus that can evaluate the
robustness of clone detection tools to small and large dissimilar gaps in otherwise identical Type-3 clones. We find that even the best of the state-of-the-art tools struggle to detect Type-3 clones with a single dissimilar gap.
In Chapter 7, we adapt the Mutation Framework technologies to create ForkSim – a framework for
generating datasets of artificial software variants (i.e., software forks) with known similarities and differences.
These datasets can be used to evaluate tools for software variant analysis, such as for migrating variants
towards a software product line architecture. ForkSim demonstrates how our mutation analysis technology can be used to benchmark clone detection and other software analysis tools for various applications.
Chapter 3
The Mutation and Injection Framework
In this chapter, we present the Mutation and Injection Framework, a synthetic clone benchmarking
framework that precisely evaluates clone detection recall at a fine granularity using a mutation-analysis
procedure. The framework begins by selecting a random code fragment from a large repository of sample
source code. It duplicates and mutates this code fragment to produce a code clone of a known clone type
and with a known difference. The mutation operators used in clone synthesis are based on a comprehensive
and empirically validated taxonomy of the types of edits developers make on copied and pasted code. The
clone is then injected into a software system, evolving the system by a single copy-paste and modify clone.
The clone detection tool is then executed for this software system and recall is measured for the injected
clone. Since the framework created the clone itself, it is able to precisely evaluate the tool’s detection of the clone, including whether it appropriately handled the clone-type specific differences between the cloned fragments. This
is repeated many thousands of times across all of the edit types in the taxonomy, allowing a comprehensive
and exhaustive measurement of recall. The framework fully automates the recall experiment, and allows all
aspects of the experiment to be customized and controlled.
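The select, mutate, inject, detect and measure loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework's implementation: the three mutation operators are toy stand-ins for entries in the editing taxonomy, the detector is a deliberately naive placeholder for a real clone detection tool, and all function names are hypothetical.

```python
import random
from itertools import combinations


def change_whitespace(lines, rng):
    """Toy Type-1 edit: alter indentation only."""
    return ["    " + ln for ln in lines]


def rename_identifier(lines, rng):
    """Toy Type-2 edit: rename the identifier `total`."""
    return [ln.replace("total", "acc") for ln in lines]


def insert_line(lines, rng):
    """Toy Type-3 edit: insert one statement at a random position."""
    pos = rng.randrange(len(lines) + 1)
    return lines[:pos] + ["log(total)"] + lines[pos:]


OPERATORS = {
    "whitespace": change_whitespace,
    "rename": rename_identifier,
    "insert": insert_line,
}


def toy_detector(system):
    """Placeholder tool: reports fragment pairs identical after stripping whitespace."""
    norm = {name: [ln.strip() for ln in frag] for name, frag in system.items()}
    return {frozenset(p) for p in combinations(norm, 2) if norm[p[0]] == norm[p[1]]}


def measure_recall(repository, subject_system, detector, trials, seed=0):
    """Recall per mutation operator over `trials` injected clones each."""
    rng = random.Random(seed)
    injected = frozenset(("injected_original", "injected_mutant"))
    recall = {}
    for op_name, operator in OPERATORS.items():
        detected = 0
        for _ in range(trials):
            original = rng.choice(repository)        # select a random fragment
            mutant = operator(list(original), rng)   # duplicate and mutate it
            system = dict(subject_system)            # evolve a copy of the system
            system["injected_original"] = original
            system["injected_mutant"] = mutant
            if injected in detector(system):         # was the known clone reported?
                detected += 1
        recall[op_name] = detected / trials
    return recall


repository = [
    ["total = 0", "for x in xs:", "total += x", "return total"],
    ["total = 1", "for y in ys:", "total *= y", "return total"],
]
subject_system = {"existing": ["print('hello')", "print('world')"]}
recall = measure_recall(repository, subject_system, toy_detector, trials=20)
```

Because each injected clone is produced by a known operator, recall can be reported per operator rather than only per clone type. With this placeholder detector, only the whitespace-only clones are found, which is the kind of per-edit weakness the procedure is designed to expose.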
We created the Mutation Framework to overcome challenges in Bellon’s Benchmark [13], which has
been the standard benchmark in clone detection for many years. Bellon built his benchmark by manually
validating 2% of the clones detected by six contemporary (2002) tools for eight subject systems, requiring 77
hours of manual clone validation effort. While the union of the tools’ detected clones may provide good relative performance evaluation between participating tools [13], there is no guarantee that the subject tools have collectively detected all clones
within the subject systems and therefore the measure of absolute performance is questionable. The reference
corpus is therefore biased by the types of clones the participating tools detect. Baker [9] raised concerns
with problems in the creation of Bellon’s benchmark, including clone validation procedures. Charpentier et
al. [19] revalidated a number of the clones and found disagreement in the results. The Mutation Framework
overcomes these challenges by synthesizing clone benchmarks that are independent of the clone detection
tools themselves, and which require no subjective manual validation.
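The difference between relative and absolute recall can be made concrete with a small sketch. The clone identifiers and tool names are invented, the full set of true clones is assumed to be known (which it is not in practice), and every detected clone is assumed to have been validated as true, which simplifies away the sampling step used to build Bellon's Benchmark.

```python
def recall(found, reference):
    """Fraction of the reference clones that the tool reported."""
    return len(found & reference) / len(reference)


# Hypothetical data for illustration only.
true_clones = {"c1", "c2", "c3", "c4", "c5"}  # unknown in a real subject system
detected = {"tool_a": {"c1", "c2"}, "tool_b": {"c2", "c3"}}

# Union-style reference corpus: only clones that some participating tool found.
union_corpus = set().union(*detected.values())

relative = {tool: recall(found, union_corpus) for tool, found in detected.items()}
absolute = {tool: recall(found, true_clones) for tool, found in detected.items()}
```

Here both tools score about 0.67 against the union corpus but only 0.4 against the full set of clones, because the clones that no participating tool detected never enter the reference corpus.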
The Mutation and Injection Framework has some distinct advantages in measuring recall. It supports
three programming languages (Java, C and C#) and two clone granularities (function and block). These are abstracted from the procedure, and the framework could be extended to additional languages and granularities.
The framework includes mutation operators for every type of edit developers make on copied and pasted code. This allows recall
to be comprehensively measured at a finer granularity than clone type, allowing a tool’s specific capabilities
to be measured. The user configures the properties of the clones to be included in the synthesized reference
corpus, including clone size, syntactical similarity, mutations and granularity. The user can therefore create a custom benchmark corpus for any general or specific cloning context to evaluate their tool against. Recall
experiments produced by the framework can be easily replicated, duplicated, shared, modified and extended.
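As an illustration of what such a user-defined corpus specification might capture, consider the following sketch. The class name, field names, default values and operator names are all hypothetical and do not reflect the framework's actual configuration format; only the listed properties (language, granularity, clone size, syntactical similarity and mutations) come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusConfig:
    """Hypothetical description of the clones to include in a synthesized corpus."""
    language: str = "java"            # one of the three supported languages
    granularity: str = "function"     # "function" or "block"
    min_lines: int = 10               # illustrative size bounds, not real defaults
    max_lines: int = 100
    min_similarity: float = 0.7       # illustrative syntactical similarity threshold
    operators: tuple = ("rename_identifier", "insert_line")  # invented names

    def __post_init__(self):
        if self.language not in {"java", "c", "cs"}:
            raise ValueError(f"unsupported language: {self.language}")
        if self.granularity not in {"function", "block"}:
            raise ValueError(f"unsupported granularity: {self.granularity}")
        if not 0.0 <= self.min_similarity <= 1.0:
            raise ValueError("min_similarity must lie in [0, 1]")
        if self.min_lines > self.max_lines:
            raise ValueError("min_lines must not exceed max_lines")

    def accepts(self, clone_lines, similarity):
        """Whether a synthesized clone satisfies the size and similarity constraints."""
        in_size_range = self.min_lines <= clone_lines <= self.max_lines
        return in_size_range and similarity >= self.min_similarity


cfg = CorpusConfig(language="c", granularity="block", min_lines=6)
```

Keeping the specification in a single immutable value is one way to make an experiment easy to share and repeat: two users with the same specification describe the same kind of corpus.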
This chapter is based on a (currently unpublished) manuscript entitled “The Mutation and Injection
Framework” and authored by myself and Chanchal K. Roy. The manuscript has been edited and reformatted
to better fit this thesis.
This chapter is organized as follows. We discuss essential background knowledge in Section 3.1. We describe the framework’s methodology in Section 3.2, and its usage in Section 3.3. We discuss related
work in Section 3.5, and conclude this work in Section 3.6.
3.1 Background
In this section, we provide additional background knowledge for this chapter. General background on clones
and clone detection benchmarking can be found in Chapter 2. Here we describe the clone similarity metric used in this chapter and by the Mutation Framework. We also describe the editing taxonomy for cloning,
which is essential to our framework’s clone synthesis process.