Mining Code Snippets - Large-Scale Clone Detection and Benchmarking

8.2 Methodology

8.2.1 Mining Code Snippets

In this step, we select a target functionality and mine IJaDataset for code snippets that implement the functionality. IJaDataset contains 24 million (method granularity) snippets, equating to 0.288 quadrillion method

heuristic to identify the snippets that might implement the target functionality. This smaller set of candidate

snippets is manually inspected as described in Section 8.2.2. Snippet mining occurs over five sub-steps as follows:

(1) Select Target Functionality: We begin by selecting a functionality we believe is needed in open-source

Java projects as our target functionality. We research and select the functionality by browsing Internet

resources used by open-source developers, such as Stack Overflow, Java tutorial sites, standard library documentation, and popular third-party library documentation. We also draw from our own development experience,

ask developers for sample functions, and investigate sample open-source systems using the GitHub

API [102]. We select a functionality we believe will appear many times in IJaDataset.

As a running example, we chose the functionality “Shuffle an Array in Place”. Developers sometimes

need to randomly shuffle the contents of an array. This is reflected in the inclusion of a shuffle method for

List data structures in Java’s standard library. However, no such method is provided for array structures,

and sometimes developers need to shuffle arrays that are too large to be stored or temporarily converted

into a List type object. Thus, while there is no guarantee, we expect IJaDataset to contain methods that

implement an in-place shuffle for arrays.
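
To illustrate the gap described above, the following minimal sketch shows the workaround a developer faces for a primitive array: the values are boxed into a temporary List, shuffled with Collections.shuffle, and copied back. The class and method names are hypothetical and chosen only for illustration; the temporary copy is the overhead that an in-place array shuffle avoids.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration, not taken from IJaDataset.
final class ListShuffleWorkaround {
    // Shuffles a primitive array by way of a temporary boxed List.
    // This is NOT in place: it allocates a second copy of the data.
    static void shuffleViaList(int[] a, Random random) {
        List<Integer> boxed = new ArrayList<>(a.length);
        for (int value : a) {
            boxed.add(value);
        }
        Collections.shuffle(boxed, random);
        for (int i = 0; i < a.length; i++) {
            a[i] = boxed.get(i);
        }
    }
}
```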

(2) Identify Implementations: Before we design a search heuristic, we need to identify how the

functionality might be implemented in Java. We review Internet discussion (e.g., Stack Overflow) and API

documentation (e.g., JavaDoc) to identify the common implementations of the target functionality. These resources are frequently used by open-source developers, so we expect similar implementations to appear in

IJaDataset. Implementations may use different support methods, data structures and APIs, or they may

express the algorithms using different syntactic constructs (e.g., different control flow statements).

For each of the identified implementations, we collect a sample snippet that uses the implementation to

achieve the functionality. The sample snippets are minimum working examples, and include only the steps

and features necessary to implement the functionality. The sample snippets are added to IJaDataset as

additional crawled open-source code. These play a role in the identified true and false clone pairs. They are

also used to improve the efficiency and accuracy of snippet tagging.

For the running example, our research found that the most common implementation of an in-place array

shuffle is the Fisher-Yates shuffling algorithm. It shuffles an array in place by iterating through the array and

swapping the value at the current index with the value at a randomly selected index. If iteration

begins from the start of the array, the random index is chosen from the indices greater than or equal to the current index. The

reverse applies if iteration begins at the end of the array. Beyond this, implementations differ in the array data type they shuffle (primitive type or object type) and in which syntactic loop structure they use. We chose

three sample snippets that exemplify these differences. One of these snippets is shown in Figure 8.2.

public static <T> void shuffle(T[] a) {
    int length = a.length;
    Random random = new Random();
    for (int i = 0; i < length; i++) {
        int j = i + random.nextInt(length - i);
        T tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

Figure 8.2: Sample Snippet: Shuffle Array
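
For comparison with Figure 8.2, a variant of the kind described above might iterate from the end of the array, operate on a primitive type, and use a different loop construct. The following is a minimal sketch under those assumptions; it is not one of the sample snippets used in the study, and the class name is hypothetical.

```java
import java.util.Random;

// Hypothetical variant for illustration only.
final class ReverseShuffleSketch {
    // Fisher-Yates shuffle iterating from the last index down to 1.
    // The swap partner j is drawn from [0, i], so it may equal i.
    static void shuffle(int[] a, Random random) {
        int i = a.length - 1;
        while (i > 0) {
            int j = random.nextInt(i + 1);
            int tmp = a[i];
            a[i] = a[j];
            a[j] = tmp;
            i--;
        }
    }
}
```

Passing the Random instance as a parameter is a design choice of this sketch that makes the behavior reproducible with a fixed seed.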

or features a snippet must realize to be a true positive of the target functionality. We derive our specification

from our research into the functionality, as well as the sample snippets. Our specification is what we believe the average developer would expect the functionality to minimally perform. We validate this with our taggers,

who are developers and software engineering researchers. The specification is designed to be inclusive of the

possible implementations of the target functionality. It should not preclude any implementation, including

any that may not have been identified in step 2.

For our running example, and from our research into its implementations and sample snippets, we decided

our formal specification should simply be “shuffle an array of some data type in place”. This specification is

open enough that it accepts any variation of Fisher-Yates, but will also accept any snippet that shuffles an

array in place using an alternate algorithm.

(4) Create Search Heuristic: We create a search heuristic to locate snippets in IJaDataset that might

implement the target functionality. Drawing from the identified implementations, the sample snippets, and

the specification, we identify the keywords and source code patterns that are intrinsic to the implementations

of the functionality. We expect snippets that implement the functionality to contain some combination

of these keywords and source patterns. These keywords and source patterns are implemented as regular

expression matches and are combined into a logical expression. Keywords and patterns that are expected

to appear together in an implementation are ‘AND’ed, while groups of keywords and patterns that are from

different implementations are ‘OR’ed. The heuristic returns true if an input snippet satisfies the logical expression.
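
The logical structure described above can be sketched as a list of pattern groups: patterns within a group are ANDed, and the groups are ORed. This is a minimal sketch, not the tooling used in the study; the class name, the nested-list representation, and the decision to skip empty groups are assumptions made for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of an AND/OR heuristic over regular expressions.
final class SearchHeuristicSketch {
    // Outer list: alternative implementations (OR).
    // Inner list: patterns expected together in one implementation (AND).
    private final List<List<Pattern>> implementations;

    SearchHeuristicSketch(List<List<Pattern>> implementations) {
        this.implementations = implementations;
    }

    // Returns true if every pattern of at least one group is found in the snippet.
    boolean matches(String snippet) {
        for (List<Pattern> group : implementations) {
            if (group.isEmpty()) {
                continue; // an empty group would otherwise match everything
            }
            boolean allFound = true;
            for (Pattern pattern : group) {
                if (!pattern.matcher(snippet).find()) {
                    allFound = false;
                    break;
                }
            }
            if (allFound) {
                return true;
            }
        }
        return false;
    }
}
```

Using find() rather than matches() lets each pattern hit anywhere inside the snippet text, which suits keyword-style searching over whole methods.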

We design the heuristic with respect to the sample snippets (Step 2). We try to select keywords and

source patterns that not only appear in our identified implementations, but may also appear in unidentified implementations. We design the heuristic to balance two opposing constraints: (1) the heuristic should

identify as many true positive snippets as possible, but (2) should not identify so many false positives that

the judges are overburdened in their tagging efforts. We balance completeness and required tagging efforts,

preferring completeness. We try several search heuristics and review their results before choosing the one

that provides the best balance.

As mentioned in Step 2, the Fisher-Yates algorithm iterates through the array and swaps the value at the current index

with the value at a randomly selected index, with the range of the selected index depending on whether iteration begins at

the start or the end of the array. If looping from the start, then the selected index is chosen using the

source code pattern: “j = i + random.nextInt(length-i)”. If looping from the end, the pattern is instead: “j = random.nextInt(i+1)”. We therefore included both of these patterns, combined using an OR clause. We implemented

the patterns as regular expressions, allowing any valid white space as well as any valid alternative variable

names.
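
One way the two index-selection patterns could be written as regular expressions is sketched below. The exact expressions used in the study are not given in the text, so everything here is an assumption: the class and constant names are hypothetical, identifiers are matched loosely, and the forward pattern uses a backreference to require that the same loop variable appears on both sides.

```java
import java.util.regex.Pattern;

// Hypothetical regex rendering of the two patterns; not the study's actual heuristic.
final class ShuffleHeuristicSketch {
    // A Java-like identifier, and a dotted name such as "a.length" or "this.random".
    private static final String ID = "[A-Za-z_$][A-Za-z0-9_$]*";
    private static final String NAME = ID + "(?:\\s*\\.\\s*" + ID + ")*";

    // Looping from the start: j = i + random.nextInt(length - i)
    static final Pattern FROM_START = Pattern.compile(
        ID + "\\s*=\\s*(" + ID + ")\\s*\\+\\s*" + NAME + "\\s*\\.\\s*nextInt\\s*\\(\\s*"
            + NAME + "\\s*-\\s*\\1\\s*\\)");

    // Looping from the end: j = random.nextInt(i + 1)
    static final Pattern FROM_END = Pattern.compile(
        ID + "\\s*=\\s*" + NAME + "\\s*\\.\\s*nextInt\\s*\\(\\s*" + ID + "\\s*\\+\\s*1\\s*\\)");

    // The two cases are ORed, as described in the text.
    static boolean mightShuffleInPlace(String snippet) {
        return FROM_START.matcher(snippet).find() || FROM_END.matcher(snippet).find();
    }
}
```

Under these assumptions, the statement "int j = i + random.nextInt(length-i);" from Figure 8.2 satisfies the first pattern, while a call such as "random.nextInt(length)" alone satisfies neither.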

(5) Build Candidate Set: The search heuristic is executed for every snippet in IJaDataset to identify

candidate snippets that might implement the target functionality. For the current release, we search

all method granularity snippets. Manual inspection is required to identify which are true or false positives

of the target functionality, as discussed in Section 8.2.2 below.

For our running example, the search heuristic we designed in Step 4 identified 281 snippets that include

one of the index selection source code patterns. These form our candidate set for the “Shuffle Array in Place” functionality.
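
Building the candidate set amounts to filtering the snippets through the heuristic. The following minimal sketch assumes the snippets are already available as strings in memory and that the heuristic is exposed as a Predicate; both are simplifications for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of candidate-set construction.
final class CandidateSetSketch {
    // Keeps every snippet the heuristic accepts, preserving input order.
    // The result still contains false positives and needs manual inspection.
    static List<String> buildCandidateSet(List<String> snippets, Predicate<String> heuristic) {
        List<String> candidates = new ArrayList<>();
        for (String snippet : snippets) {
            if (heuristic.test(snippet)) {
                candidates.add(snippet);
            }
        }
        return candidates;
    }
}
```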
