Evaluation - Large-Scale Clone Detection and Benchmarking

As a demonstration of ForkSim’s primary use case, tool evaluation, we evaluated NiCad’s performance for

similarity detection between software variants. While NiCad is a clone detector designed for single systems,

it can be used to detect similarity between forks by executing it for the entire dataset and trimming the

intra-project clone results from its output. To evaluate NiCad, we generated a ForkSim dataset of 5 Java

forks. We used JHotDraw54b1 as the subject system and Java6 as the source repository. The generation

parameters used are listed in Table 7.4. The NiCad clone detector is capable of detecting function and block

granularity near-miss clones. It uses TXL to parse source elements of these granularities from an input system, and uses a diff-like algorithm to detect clones after these source elements have been normalized to

In this experiment, we extended NiCad to support the detection of clones at the file granularity.
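The trimming of intra-project results described above can be sketched as follows. This is only an illustration and not NiCad's own output handling: it assumes that each reported clone pair is a pair of file paths, and that the fork a file belongs to is its first path component (for example fork1/...). Both conventions, and the function names, are hypothetical.

```python
from pathlib import PurePosixPath


def fork_of(path: str) -> str:
    """Return the fork a file belongs to.

    Assumption of this sketch: the fork is the first path component.
    """
    return PurePosixPath(path).parts[0]


def inter_fork_pairs(pairs):
    """Keep only the clone pairs whose two fragments lie in different forks."""
    return [(a, b) for a, b in pairs if fork_of(a) != fork_of(b)]


# Hypothetical detector output for a dataset of forks.
pairs = [
    ("fork1/src/Figure.java", "fork2/src/Figure.java"),  # inter-fork: kept
    ("fork1/src/Figure.java", "fork1/src/Shape.java"),   # intra-fork: trimmed
    ("fork3/util/Geom.java", "fork5/util/Geom2.java"),   # inter-fork: kept
]
kept = inter_fork_pairs(pairs)
```

Only the pairs that span two forks remain in kept, which is the portion of the output relevant to fork similarity analysis.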

Using NiCad, we detected the file and function clones in the dataset. NiCad was set to detect clones

3-5000 lines long, with at most 30% difference. It was configured to pretty print the source, blind rename

the identifiers, and normalize the literal values in the dataset before detection. The clones were collected

both in clone pair (pairs of similar files or functions) and clone class (set of similar files or functions) format.

Overall, NiCad found 363 file clone classes (16,553 pairs) and 1831 function clone classes (2,198,636 pairs).

To evaluate NiCad’s recall, we converted the known similarities between the forks into file and function

clone classes. Each file injected into multiple forks was converted into a file clone class, as were the files

contained in directories injected into multiple forks. File clone classes were created for each of the files the

forks inherited from the subject system, with the files modified due to function injection trimmed from these

classes. Each function injected into multiple forks was converted into a function clone class. Lastly, a function

clone class was created for each function the forks inherited from the subject system. These clone classes

were also converted to clone pair format.
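The conversion from clone class format to clone pair format can be sketched as below. A class of n members expands into n(n-1)/2 unordered pairs. The member identifiers and function names are hypothetical, and the sketch assumes that every pair within a class counts as a clone pair.

```python
from itertools import combinations


def class_to_pairs(clone_class):
    """Expand one clone class into all of its unordered clone pairs."""
    members = sorted(set(clone_class))
    return list(combinations(members, 2))


def classes_to_pairs(classes):
    """Expand several clone classes into a single set of clone pairs."""
    pairs = set()
    for clone_class in classes:
        pairs.update(class_to_pairs(clone_class))
    return pairs


# A function inherited by all five forks forms a class of five members.
inherited = ["fork1:draw", "fork2:draw", "fork3:draw", "fork4:draw", "fork5:draw"]
inherited_pairs = class_to_pairs(inherited)
```

A class of five members yields ten pairs. This is consistent with the recall table for this case study, where the 2886 function clone classes from original files correspond to 28860 function clone pairs.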

NiCad’s recall performance is summarized in Table 7.5. Recall was measured per clone granularity (file or

function), and per origin of similarity (file/directory/function injection or original subject system files). As

can be seen, NiCad had 100% recall for all sources of file clone classes, and 98-99% for function clone classes.

If we consider clone pairs instead of clone classes, we see that the function clone detection is marginally

better (+0.1%). These are very promising results for NiCad as a fork similarity analysis tool. These results

are specific to the dataset’s generation parameters. In the future, we plan to evaluate NiCad’s recall performance on many datasets with varied parameters; for example, with larger and smaller max mutation values.
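The recall figures reported for this case study follow directly from the detected and total counts. The sketch below measures recall as the fraction of reference clone pairs found in a tool's output, and then recomputes the function clone percentages from the counts in Table 7.5. The exact-match criterion and the function names are assumptions of this sketch.

```python
def recall(reference, detected):
    """Fraction of reference clone pairs that appear in the detected set.

    Pairs are treated as unordered, so (a, b) and (b, a) are the same pair.
    """
    reference = {frozenset(p) for p in reference}
    detected = {frozenset(p) for p in detected}
    if not reference:
        raise ValueError("reference set is empty")
    return len(reference & detected) / len(reference)


def percent(found, total):
    """Recall as a percentage rounded to one decimal place."""
    return round(100 * found / total, 1)


example = recall(
    reference=[("a", "b"), ("a", "c"), ("b", "c"), ("d", "e")],
    detected=[("b", "a"), ("c", "a"), ("c", "b"), ("x", "y")],
)

# Counts taken from Table 7.5 (function clones only).
class_injected = percent(75, 76)
class_original = percent(2869, 2886)
pair_injected = percent(332, 336)
pair_original = percent(28708, 28860)
```

The recomputed percentages are 98.7 and 99.4 for clone classes and 98.8 and 99.5 for clone pairs, which matches the marginal (+0.1%) improvement noted above.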

Due to time constraints, we did not perform a full precision analysis for this experiment. However, NiCad

is known to have high precision [110]. Using known similarities, we were able to validate 20.7% of NiCad’s

reported file clone pairs, but only 1.46% of its reported function clone pairs. NiCad is reporting a large

amount of cloned code beyond that of the known similarities. Part of this is due to unknown similarities arising from clones within the original subject system. However, a large fraction of this is due to the NiCad

clone size settings used. A minimum clone size of 3 lines was required to ensure that all cloned functions were

detected. However, small standard functions such as getters and setters are very similar after normalization, which was a source of a large number of these clone pairs. Likewise, interfaces and simple classes are likely

to be detected as similar after identifier normalization. For practical usage, these small similarities would

likely be filtered out in preference of the larger similarities. In summary, NiCad has very good detection

performance of similarities between forks, but the quantity of output would make its usage difficult. A

post-processing step needs to be added to extract the most useful and important similarity features from its

output.

Table 7.4: ForkSim Generation Parameters: NiCad Case Study

Parameter               Value
Subject System          JHotDraw54b1
Source Repository       Java6
Language                Java
# Forks                 5
# Files                 100
# Directories           25
# Functions             100
Function Size           20-100 lines
Max Injections          5
Uniform Injection Rate  50%
Mutation Rate           files: 50%, directories(files): 50%, functions: 50%
Rename Rate             files: 50%, directories: 50%

Table 7.5: NiCad Case Study Recall Results

Type                  File Injections  Directory Injections  Function Injections  Original Files
File Clone Class      100% (41/41)     100% (117/117)        -                    100% (260/260)
Function Clone Class  -                -                     98.7% (75/76)        99.4% (2869/2886)
Function Clone Pair   -                -                     98.8% (332/336)      99.5% (28708/28860)
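One possible form of the post-processing step suggested for the NiCad output in this case study is a size filter that drops the small similarities and ranks the remainder by size. The sketch below is an assumption about how such a step could look, not a feature of NiCad: the record layout, the names, and the 10-line threshold are all hypothetical.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A code fragment given by its file and inclusive line range."""
    path: str
    start: int
    end: int

    @property
    def lines(self) -> int:
        return self.end - self.start + 1


class ClonePair(NamedTuple):
    left: Fragment
    right: Fragment

    @property
    def size(self) -> int:
        """Size of a pair, taken here as the size of its smaller fragment."""
        return min(self.left.lines, self.right.lines)


def filter_and_rank(pairs, min_lines=10):
    """Drop pairs smaller than min_lines and list the largest pairs first."""
    kept = [p for p in pairs if p.size >= min_lines]
    return sorted(kept, key=lambda p: p.size, reverse=True)


# Hypothetical detector output: a getter pair and two larger similarities.
reported = [
    ClonePair(Fragment("fork1/A.java", 10, 12), Fragment("fork2/A.java", 10, 12)),
    ClonePair(Fragment("fork1/B.java", 20, 59), Fragment("fork3/B.java", 22, 61)),
    ClonePair(Fragment("fork2/C.java", 5, 19), Fragment("fork4/C.java", 5, 21)),
]
ranked = filter_and_rank(reported)
```

With these inputs the 3-line getter pair is removed, and the 40-line pair is listed ahead of the 15-line pair.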

7.8 Conclusion

In this chapter we have introduced ForkSim, a tool for generating customizable datasets of synthetic forks with

known similarities and differences. These datasets can be used in any research on the detection, visualization,

and comprehension of code similarity amongst software variants. ForkSim datasets allow similarity detection

tools to be evaluated in terms of recall (automatically) and precision (semi-automatically), and can be useful

in experiments aiming at evaluating the usability and visualization of similarity tools. We demonstrated

ForkSim using a case study evaluating NiCad’s cross-project similarity detection for a set of five ForkSim-

Part II

Real-World Large-Scale Clone Benchmarking

In this part, we present our work with real-world and large-scale inter-project clone benchmarking. In

Chapter 4, we showed that the previous leading real-world clone benchmark, Bellon’s Benchmark, is not

appropriate for evaluating modern clone detection tools. While we had already delivered a high quality

synthetic benchmark with the Mutation and Injection Framework, a new real-world benchmark was warranted and needed by the community. Additionally, no existing clone benchmark was appropriate for evaluating

clone detection tools in the context of inter-project and large-scale clone detection, an emerging research

topic that we explore in this thesis, so such a benchmark was needed by ourselves and the community.

For these reasons, we introduce BigCloneBench: our real-world big clone benchmark for evaluating all

flavors of clone detection, including inter-project and large-scale clone detection. We built BigCloneBench

by mining IJaDataset, a big inter-project source-code dataset, for clones of distinct functionalities. We

designed a mining and validation procedure capable of building a large benchmark while minimizing the

clone validation efforts and minimizing the subjectivity in the validation results. We built a big benchmark

of eight million reference clones spanning the four clone types as well as the entire spectrum of syntactical

similarity, including intra-project, inter-project and semantic clones. We describe our clone mining procedure, the contents and properties of our benchmark, and its usages in Chapter 8.

We used BigCloneBench to conduct a clone detection tool comparison study where we measured recall

per clone type, including for each region of syntactical similarity. We measured and compared recall for

intra-project vs inter-project clones, and evaluated how well the tools capture the reference clones using

multiple clone-matching algorithms. We compared the results of our real-world benchmark against those

from our synthetic clone benchmark to demonstrate the need for both styles of benchmarking to get a full

understanding of a tool’s recall performance. This study is presented in Chapter 9.

To make this benchmarking procedure accessible to the community, we distilled our tool evaluation

experiment procedure into a customizable framework called BigCloneEval. This framework makes the execution

of recall evaluation experiments with BigCloneBench easy, and handles tool execution, tool scalability, and recall measurement automatically for the user. The evaluation experiments are customizable, including a

plug-in architecture for using custom clone-matching algorithms. Importantly, BigCloneEval creates a

reference standard for tool evaluation with BigCloneBench. BigCloneEval is presented in Chapter 10.

Later in this thesis (Part III, Chapter 12), we use BigCloneBench to evaluate our CloneWorks clone

detector for large-scale clone detection, including the measurement of recall, precision, execution time and

Chapter 8 BigCloneBench

There are multiple flavors of clone detection tools. Classical clone detection tools locate syntactically

similar code within a single software system or small repository. These tools have been traditionally used

to cancel out the effects of ad-hoc code reuse (e.g., copy and paste) [108] in software systems. Semantic

clone detectors locate code that implements the same or similar functionalities. These tools target the clones

the classical detectors miss due to a lack of syntactical similarity. Recently, new applications for clone

detection and search have emerged relying on detected clones among a large number of software systems.

Since classical clone detection tools do not support the needs of such emerging applications, new large-

scale clone detection and clone search algorithms are being proposed as an embedded part of the emerging

applications. For example, large-scale clone detection and clone search (e.g., [81]) is used to find similar

mobile applications [22], intelligently tag code snippets [97], find code examples [66], and so on.

A limitation with existing benchmarks is that they only target the classical clone detectors which focus

on the detection of syntactically similar intra-project clones. They typically only consider the clones within a couple of software systems. Many “flavors” of clone detection cannot be evaluated by these benchmarks.

Semantic clone detectors require a benchmark of semantically similar clones with a wide range of syntactical

similarity. Large-scale clone detectors require a benchmark with many inter-project clones. Clone search

algorithms must be evaluated against large clone classes. By targeting a single benchmark scope, the

benchmark becomes limited to a specific sub-class of clones. The community needs a standard benchmark that

covers the full range of clone types and clone detection applications.

In this chapter, we introduce BigCloneBench, a large-scale clone benchmark of true and false clones in

IJaDataset 2.0 [4] (25,000 subject systems, 2.3 million files, 250 MLOC). Unlike previous real-world benchmarks

(namely Bellon’s Benchmark [13]), we did not use clone detectors to build our benchmark. Rather,

we mined IJaDataset for clones of frequently used functionalities. We used search heuristics to automatically

identify code snippets in IJaDataset that might implement a target functionality. These candidate snippets

were manually tagged as true or false positives of the target functionality by expert judges. The benchmark is populated with the true and false clones oracled by the tagging process. We used TXL-based [27] automatic

source transformation and analysis technologies to typify these clones and measure their syntactical

similarity.
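The mining idea outlined above can be illustrated with a small sketch. It is not the actual BigCloneBench tooling. It assumes a keyword-based search heuristic, hypothetical snippet identifiers and judge tags, and one plausible way of deriving clone pairs from the tags: pairs of true positives are recorded as true clones, and pairs of a true positive with a false positive as false clones.

```python
from itertools import combinations, product


def is_candidate(snippet, keywords):
    """Heuristic: a snippet is a candidate if it mentions every keyword."""
    text = snippet.lower()
    return all(k.lower() in text for k in keywords)


# Hypothetical snippets; the target functionality is "copy a file".
snippets = {
    "s1": "FileInputStream in = new FileInputStream(src); "
          "FileOutputStream out = new FileOutputStream(dst);",
    "s2": "try (FileInputStream in = open(a); FileOutputStream out = create(b)) "
          "{ pump(in, out); }",
    "s3": "FileInputStream in = new FileInputStream(f); "
          "// FileOutputStream is only mentioned in a comment",
    "s4": "int sum = a + b;",
}
keywords = ["FileInputStream", "FileOutputStream"]

# Step 1: the heuristic proposes candidates automatically.
candidates = sorted(sid for sid, code in snippets.items() if is_candidate(code, keywords))

# Step 2: judges tag each candidate (tags are supplied by hand here).
judge_tags = {"s1": True, "s2": True, "s3": False}
true_pos = [sid for sid in candidates if judge_tags[sid]]
false_pos = [sid for sid in candidates if not judge_tags[sid]]

# Step 3: derive reference clone pairs from the tags (assumption of this sketch).
true_clones = set(combinations(true_pos, 2))
false_clones = set(product(true_pos, false_pos))
```

The heuristic only narrows the search: s3 is proposed as a candidate but rejected by the judge, which is why manual tagging remains part of the procedure.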

measure the recall of all flavors of clone detection. The benchmark contains many intra-project clones for

evaluating classical clone detection tools. Every clone is of a functionality, with a wide variety of syntactical

similarity, which makes it ideal for evaluating semantic clone detection tools. Its large number of inter-project

clones makes it an ideal target for evaluating large-scale detectors. It contains large clone classes of distinct functionalities, which can be used as targets for evaluating clone search algorithms. While not all tools scale to

large inputs, they can be evaluated with BigCloneBench by executing them on subsets of the benchmark within

their scalability constraint. As a standard benchmark, BigCloneBench is the ideal target for comparing the

execution time and scalability of clone detection tools. BigCloneBench also documents 288 thousand known

false positive clones discovered during the mining process. These can be used to evaluate the accuracy of the

clone detectors, but cannot replace a traditional measurement of precision by manual validation of a clone

detection tool’s output. We have focused our efforts on building a benchmark for measuring recall for any

flavor of clone detection.

This chapter is an updated and extended version of our manuscript [125] “Towards a Big Data Curated

Benchmark of Inter-Project Code Clones”, which was published in the International Conference on Software

Maintenance and Evolution, © 2014 IEEE. I was the lead author on this work, and my co-authors include Judith F. Islam, Iman Keivanloo, Chanchal K. Roy and Mohammad Mamun Mia. Iman Keivanloo and

Chanchal K. Roy acted as supervisors for this project, while Judith F. Islam and Mohammad Mamun Mia

contributed functionality selection and clone validation efforts. The publication has been updated to reflect the latest work on BigCloneBench, and re-formatted for this thesis.

The rest of this chapter is organized as follows. In Section 8.1 we discuss the related work. In Section 8.2

we describe our methodology for building BigCloneBench, in Section 8.3 we discuss our efforts executing

this procedure, and in Section 8.4 we overview the contents of the final benchmark. Then in Section 8.5 we describe how the benchmark can be used to evaluate clone detection tools, and in Section 8.6 we describe

how it can be used to evaluate clone search tools. We close with a description of the distribution of the

benchmark in Section 8.7, the threats to the validity of the benchmark in Section 8.8, and our conclusions

in Section 8.9.

8.1 Related and Previous Work

Benchmark experiments have been performed that measure the recall and precision of classical clone detection

tools that scale to a single system. However, measuring recall has traditionally been very challenging. Some

experiments have ignored recall, and measured tool precision by manually validating a small sample of a tool’s

candidate clones [38, 52, 70, 76, 82]. Other experiments have tackled the recall problem by accepting the union

of multiple tools’ candidate clones as the reference set, possibly with some manual validation [13,18,35,94,112]. For some experiments, very small subject systems were manually inspected for clones [18, 71, 110]. An ideal

is not feasible except for toy systems. For example, when considering only clones between functions in the

relatively small system Cook, there are nearly a million function pairs to manually inspect [137].

Large-scale (Big Data) analysis is a very popular and rewarding field in both industry and academia.

As the benefits and utility of large-scale analysis have become clearer [87], the number of technologies

(e.g., [6, 29, 36, 37, 132]) that enable it has grown. To develop, improve, and compare these technologies, quality

benchmarks are needed. This need has been recognized by the international community in such conferences

or workshops as Big Data Benchmarking [140], which began in 2010. There have been significant efforts in

evaluating large-scale analysis technologies [5, 7, 25, 40, 98, 101]. For example, BigBench [40] models a typical

large-scale scenario and generates a large-scale benchmark problem. The major large-scale technologies could

be compared using BigBench.

In contrast, BigCloneBench is a domain-specific large-scale benchmark for evaluating all types of clone

detection and clone search technologies, especially those that scale to large inputs. It can be used to measure

recall, estimate precision, and compare the execution time and scalability of clone detection and search

tools. The benchmark consists of a curated collection of true and false clone pairs in the large-scale inter-

project repository IJaDataset. Recall and precision are measured by comparing the tool’s output against the

benchmark. Execution time and scalability are compared by using IJaDataset as a common target for the

detection tools.
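Comparing a tool's output against the benchmark requires a rule for deciding when a detected clone pair captures a reference clone pair. The sketch below uses a coverage-based rule as one example of such a clone-matching algorithm. The 70% coverage threshold, the fragment layout (path, start line, end line), and all names are assumptions of this sketch and not the benchmark's prescribed procedure.

```python
def coverage(ref, det):
    """Fraction of the reference fragment's lines covered by the detected one."""
    ref_path, ref_start, ref_end = ref
    det_path, det_start, det_end = det
    if ref_path != det_path:
        return 0.0
    overlap = min(ref_end, det_end) - max(ref_start, det_start) + 1
    return max(overlap, 0) / (ref_end - ref_start + 1)


def matches(ref_pair, det_pair, threshold=0.7):
    """A detected pair matches if it covers both reference fragments well enough.

    The fragments of a pair are unordered, so both orientations are tried.
    """
    (r1, r2), (d1, d2) = ref_pair, det_pair
    straight = coverage(r1, d1) >= threshold and coverage(r2, d2) >= threshold
    swapped = coverage(r1, d2) >= threshold and coverage(r2, d1) >= threshold
    return straight or swapped


def recall_with_matcher(reference, detected, threshold=0.7):
    """Fraction of reference pairs matched by at least one detected pair."""
    hits = sum(any(matches(r, d, threshold) for d in detected) for r in reference)
    return hits / len(reference)


# Hypothetical reference clones and tool output.
reference = [
    (("A.java", 10, 29), ("B.java", 5, 24)),
    (("C.java", 1, 10), ("D.java", 1, 10)),
]
detected = [
    (("B.java", 5, 20), ("A.java", 8, 30)),   # covers the first reference pair
    (("A.java", 10, 15), ("B.java", 5, 24)),  # covers too little of A.java
]
measured = recall_with_matcher(reference, detected)
```

Here the first detected pair covers 100% and 80% of the two fragments of the first reference pair, so that reference clone counts as detected. The second reference pair is missed, giving a recall of 0.5.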

While the Mutation and Injection Framework could be adapted to large-scale datasets by injecting artificial clones into them, the importance of evaluating the tools with real data is widely discussed [12, 75, 77, 128]. Krutz

and Le [77] oracled all method pairs between randomly selected files in a subject system using several judges.

While their data has high confidence, their benchmark is very small, only 66 method clone pairs.

These existing benchmarks are not suitable for evaluating the other flavors of clone detection tools, for

example, the emerging large-scale clone detection algorithms. The existing benchmarks are too small, and

only consider intra-project clones from a handful of subject systems. Large-scale clones span thousands

of subject systems, and inter-project clones may have significantly different properties from intra-project

clones. In this chapter, we present a large-scale benchmark that contains millions of inter-project clones, spans thousands of subject systems, was built without the use of clone detectors, and has a very clear oracling

procedure.
