
2.2 Benchmarking Clone Detection

2.2.6 Measuring Precision

Compared to recall, it is easier to measure precision without an oracle. The precision of a clone detection tool (for a given subject software system) can be estimated by manually validating a statistically significant sample of its detected clones. Precision is then the ratio of the validated clones that are judged as true clones to the total number of validated clones. This is repeated for a collection of diverse software systems from a variety of programming domains, and the precision measurement is averaged. While this procedure is rather simple, there are still some challenges. Clone validation is a very effort-intensive process, and validating detected clones in a variety of software systems can take a significant amount of time. Clone validation is also subjective [9, 19, 21], so precision measured by different individuals could vary significantly.
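The sampling procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the manual judgements are assumed to be recorded as booleans, and the margin of error uses the standard normal approximation for a proportion, which the text does not prescribe.

```python
import math
import random


def draw_validation_sample(detected_clones, sample_size, seed=0):
    """Pick a random subset of a tool's detected clones for manual validation."""
    rng = random.Random(seed)
    return rng.sample(detected_clones, min(sample_size, len(detected_clones)))


def estimate_precision(judgements):
    """Ratio of validated clones judged to be true clones.

    `judgements` holds one boolean per manually validated clone
    (True = the judge considered it a true clone).
    """
    if not judgements:
        raise ValueError("at least one validated clone is required")
    return sum(judgements) / len(judgements)


def margin_of_error(precision, sample_size, z=1.96):
    """Normal-approximation margin of error (assumption: 95% level by default)."""
    return z * math.sqrt(precision * (1.0 - precision) / sample_size)


def average_precision(per_system):
    """Average the per-system precision estimates (system name -> precision)."""
    return sum(per_system.values()) / len(per_system)
```

For example, `estimate_precision([True, True, False, True])` gives 0.75, and averaging hypothetical per-system estimates of 0.5 and 1.0 also gives 0.75. The subjectivity noted above is not addressed by this sketch: two judges may produce different `judgements` lists for the same sample.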

Part I

Synthetic Clone Benchmarking with Mutation Analysis

In this part of the thesis, we present our work on synthetic clone benchmarking. We introduce the Mutation and Injection Framework and use it in a number of tool comparison studies. The Mutation Framework evaluates clone detection recall using synthetic clones in a mutation-analysis procedure. Synthetic benchmarking is needed to evaluate clone detection recall at a fine granularity for the different kinds of clones that can exist. Another advantage of synthetic benchmarking is that it allows controlled recall experiments to be conducted, reducing or removing biases in the results. Fine-grained and controlled recall measurement is more difficult with real-world benchmarks, but real-world benchmarks evaluate tools against complex and realistic (developer-produced) clones, which is why both synthetic and real-world benchmarks are needed. In Part II we discuss our real-world clone benchmark, BigCloneBench.

The Mutation and Injection Framework procedure was previously proposed in related work [107] and prototyped [111] for a single clone detection tool (NiCad). For this thesis, we improved the framework in the following ways: (1) we generalized the framework for compatibility with most clone detection tools, (2) we improved the mutation operators and mutation process for better accuracy and control, (3) we designed an evaluation procedure that allows recall to be compared across the clone types and clone edit types without bias, and (4) we implemented the framework as an extensible tool. The framework enables users to perform custom and fully automated recall evaluation experiments, which can then be shared, examined, repeated and extended by the community. We discuss the methodology and design of the Mutation Framework in Chapter 3.

In Chapter 4, we use our Mutation Framework to evaluate state-of-the-art clone detection tools per clone type. We compare our measurements against our expectations for the tools, and against the previous and popular clone benchmark, Bellon’s Benchmark [13]. In this experiment, we validate the accuracy of the Mutation and Injection Framework, and demonstrate the need for synthetic benchmarking. We also show that Bellon’s Benchmark may not be accurate for modern clone detection tools, creating the need for a new real-world clone benchmark, which is a motivation for our BigCloneBench (Part II).

In Chapter 5, we evaluate and compare the recall of state-of-the-art tools at a fine granularity using our Mutation and Injection Framework. Specifically, we measure the recall of the tools per edit type from the editing taxonomy for block and function granularity clones in Java, C and C# systems. In this study, we demonstrate the advantage of the Mutation Framework’s synthetic approach in evaluating the capabilities of the tools and pinpointing their individual strengths and weaknesses.

In Chapter 6, we demonstrate how the Mutation Framework can be extended with custom mutation operators to evaluate clone detection tools for any kind of clone. In our case study, we synthesize Type-3 clones with a single dissimilar gap of variable length. We evaluate the robustness of Type-3 clone detectors against Type-3 clones with gaps of various sizes. We generate a reference corpus that can evaluate the robustness of clone detection tools to small and large dissimilar gaps in otherwise identical Type-3 clones. We find that even the best of the state-of-the-art tools struggle to detect Type-3 clones with a single dissimilar gap.
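A custom gap operator of the kind used in this case study might look like the following minimal sketch. It is not the framework's implementation: the function names are hypothetical, fragments are modeled as lists of source lines, and the gap is filled with placeholder statements rather than genuinely dissimilar code.

```python
def inject_gap(fragment, gap_length, position):
    """Return a copy of `fragment` with one gap of `gap_length` lines inserted.

    The gap lines are placeholders (an assumption for illustration); a real
    operator would insert code that is dissimilar to the original fragment.
    """
    if not 0 <= position <= len(fragment):
        raise ValueError("position must lie within the fragment")
    gap = [f"gap_statement_{k}();" for k in range(gap_length)]
    return fragment[:position] + gap + fragment[position:]


def gap_corpus(fragment, gap_lengths, position):
    """Build one (original, clone) pair per gap length, keyed by gap length."""
    return {n: (fragment, inject_gap(fragment, n, position)) for n in gap_lengths}
```

Calling `gap_corpus(fragment, range(1, 11), position=2)` would yield ten otherwise identical clone pairs whose only difference is a single gap of one to ten lines, which is the shape of corpus the robustness evaluation needs.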

In Chapter 7, we adapt the Mutation Framework technologies to create ForkSim, a framework for generating datasets of artificial software variants (i.e., software forks) with known similarities and differences. These datasets can be used to evaluate tools for software variant analysis, such as for migrating variants towards a software product line architecture. ForkSim demonstrates how our mutation analysis technology can be used to benchmark clone detection and other software analysis tools, for various applications.

Chapter 3 The Mutation and Injection Framework

In this chapter, we present the Mutation and Injection Framework, a synthetic clone benchmarking framework that precisely evaluates clone detection recall at a fine granularity using a mutation-analysis procedure. The framework begins by selecting a random code fragment from a large repository of sample source code. It duplicates and mutates this code fragment to produce a code clone of a known clone type and with a known difference. The mutation operators used in clone synthesis are based on a comprehensive and empirically validated taxonomy of the types of edits developers make on copied and pasted code. The clone is then injected into a software system, evolving the system by a single copy-paste-and-modify clone. The clone detection tool is then executed for this software system and recall is measured for the injected clone. Since the framework created the clone itself, it is able to precisely evaluate the tool’s detection of the clone, including whether it appropriately handled the clone-type-specific differences between the cloned code fragments. This is repeated many thousands of times across all of the edit types in the taxonomy, allowing a comprehensive and exhaustive measurement of recall. The framework fully automates the recall experiment, and allows all aspects of the experiment to be customized and controlled.
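The select, mutate, inject, detect and measure loop described above can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: fragments are lists of source lines, a "system" is a list of fragments, both the original fragment and its mutated copy are appended to the system, and the two mutation operators and the detector are deliberately simplistic stand-ins rather than the framework's own.

```python
import random


def add_whitespace(fragment, rng):
    """Toy Type-1 style edit: append trailing whitespace to one line."""
    k = rng.randrange(len(fragment))
    fragment[k] = fragment[k] + "  "
    return fragment


def insert_line(fragment, rng):
    """Toy Type-3 style edit: insert one new statement at a random position."""
    k = rng.randrange(len(fragment) + 1)
    fragment.insert(k, "int injected = 0;")
    return fragment


def exact_match_detector(system):
    """Toy detector: reports index pairs whose whitespace-normalized text is identical."""
    normalized = [tuple(" ".join(line.split()) for line in frag) for frag in system]
    return {
        (i, j)
        for i in range(len(system))
        for j in range(i + 1, len(system))
        if normalized[i] == normalized[j]
    }


def run_recall_experiment(repository, system, operators, detector, trials, seed=0):
    """Measure recall per mutation operator over `trials` injected clones each."""
    rng = random.Random(seed)
    recall = {}
    for name, mutate in operators.items():
        hits = 0
        for _ in range(trials):
            original = rng.choice(repository)        # select a random fragment
            clone = mutate(list(original), rng)      # duplicate and mutate it
            mutant_system = system + [original, clone]  # inject the clone pair
            injected_pair = (len(system), len(system) + 1)
            if injected_pair in detector(mutant_system):  # run tool, check detection
                hits += 1
        recall[name] = hits / trials
    return recall
```

Because the sketch creates each clone itself, it knows exactly which pair the detector must report, which is what makes the recall measurement precise. With this toy detector, the whitespace operator is always detected and the line-insertion operator never is, so the per-operator breakdown exposes the detector's weakness directly.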

We created the Mutation Framework to overcome challenges in Bellon’s Benchmark [13], which has been the standard benchmark in clone detection for many years. Bellon built his benchmark by manually validating 2% of the clones detected by six contemporary (2002) tools for eight subject systems, requiring 77 hours of manual clone validation effort. While this union of the tools’ validated clones may provide good relative performance evaluation between the participating tools [13], there is no guarantee that the subject tools have collectively detected all clones within the subject systems, and therefore the measure of absolute performance is questionable. The reference corpus is therefore biased by the types of clones the participating tools detect. Baker [9] raised concerns with problems in the creation of Bellon’s Benchmark, including clone validation procedures. Charpentier et al. [19] revalidated a number of the clones and found disagreement in the results. The Mutation Framework overcomes these challenges by synthesizing clone benchmarks that are independent of the clone detection tools themselves, and which require no subjective manual validation.
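The bias of a union-based reference corpus can be seen in a small worked example. The clone identifiers and tool outputs below are invented purely for illustration and do not come from Bellon's data.

```python
def recall(detected, reference):
    """Fraction of the reference clones that the tool detected."""
    return len(detected & reference) / len(reference)


# Hypothetical ground truth: five true clones exist in the subject system.
true_clones = {"c1", "c2", "c3", "c4", "c5"}

# Hypothetical tool outputs; neither tool finds c4 or c5.
tool_a = {"c1", "c2"}
tool_b = {"c2", "c3"}

# A reference corpus built from the union of the tools' validated clones
# can only contain clones that at least one participating tool reported.
union_reference = (tool_a | tool_b) & true_clones

recall_vs_union = recall(tool_a, union_reference)  # 2 of 3 reference clones
recall_vs_truth = recall(tool_a, true_clones)      # 2 of 5 true clones
```

Here tool A scores about 0.67 against the union-based corpus but only 0.4 against the full set of true clones. The ranking between the tools is unaffected, which is why relative comparisons remain useful, but the absolute recall is overstated.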

The Mutation and Injection Framework has some distinct advantages in measuring recall. It supports three programming languages (Java, C and C#) and two clone granularities (function and block). These are abstracted from the procedure, and the framework could be extended to additional languages and granularities. The framework includes mutation operators for every type of edit developers make on copied and pasted code. This allows recall to be comprehensively measured at a finer granularity than clone type, allowing a tool’s specific capabilities to be measured. The user configures the properties of the clones to be included in the synthesized reference corpus, including clone size, syntactical similarity, mutations and granularity. The user can therefore create a custom benchmark corpus for any general or specific cloning context to evaluate their tool against. Recall experiments produced by the framework can be easily replicated, duplicated, shared, modified and extended.

This chapter is based on a (currently unpublished) manuscript entitled “The Mutation and Injection Framework” and authored by myself and Chanchal K. Roy. The manuscript has been edited and reformatted to better fit this thesis.

This chapter is organized as follows. We discuss essential background knowledge in Section 3.1. We describe the framework’s methodology in Section 3.2, and its usage in Section 3.3. We discuss the related work in Section 3.5, and conclude this work in Section 3.6.

3.1 Background

In this section, we provide additional background knowledge for this chapter. General background on clones and clone detection benchmarking can be found in Chapter 2. Here we describe the clone similarity metric used in this chapter and by the Mutation Framework. We also describe the editing taxonomy for cloning, which is essential to our framework’s clone synthesis process.
