Analyzing the Predictability of Source Code and its Application in Creating Parallel Corpora for English-to-Code Statistical Machine Translation


Musfiqur Rahman

A Thesis in the Department of Computer Science and Software Engineering

Presented in Partial Fulfillment of the Requirements

For the Degree of Master of Computer Science (Computer Science) at Concordia University

Montréal, Québec, Canada

March 2018


Concordia University

School of Graduate Studies

This is to certify that the thesis prepared

By: Musfiqur Rahman

Entitled: Analyzing the Predictability of Source Code and its Application in Creating Parallel Corpora for English-to-Code Statistical Machine Translation

and submitted in partial fulfillment of the requirements for the degree of

Master of Computer Science (Computer Science)

complies with the regulations of this University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

Chair: Dr. Tse-Hsun Chen

Examiner: Dr. Sabine Bergler

Examiner: Dr. Weiyi Shang

Supervisor: Dr. Peter Rigby

Approved by

Dr. Volker Haarslev, Graduate Program Director

23 March 2018

Dr. Amir Asif, Dean


Abstract

Analyzing the Predictability of Source Code and its Application in Creating Parallel Corpora for English-to-Code Statistical Machine Translation

Musfiqur Rahman

Analyzing source code using computational linguistics and exploiting the linguistic properties of source code have recently become popular topics in the domain of software engineering. In the first part of the thesis, we study the predictability of source code and determine how well source code can be represented using language models developed for natural language processing. In the second part, we study how well English discussions of source code can be aligned with code elements to create parallel corpora for English-to-code statistical machine translation. This work is organized as a “manuscript” thesis whereby each core chapter constitutes a submitted paper.

The first part replicates recent works that have concluded that software is more repetitive and predictable, i.e., more natural, than English texts. We find that much of the apparent “naturalness” is artificial and is the result of language-specific tokens. For example, the syntax of a language, especially the separators, e.g., semi-colons and brackets, makes up 59% of all uses of Java tokens in our corpus. Furthermore, 40% of all 2-grams end in a separator, implying that a model for autocompleting the next token would have a trivial separator as its top suggestion 40% of the time. By using the standard NLP practice of eliminating punctuation (e.g., separators) and stopwords (e.g., keywords), we find that code is less repetitive and predictable than was suggested by previous work. We replicate this result across 7 programming languages.

Continuing this work, we find that unlike the code written for a particular project, API code usage is similar across projects. For example, a file is opened and closed in the same manner irrespective of domain. When we restrict our n-grams to those contained in the Java API, we find that the entropy for 2-grams is significantly lower than that of the English corpus. This repetition perhaps explains the successful literature on API usage suggestion and autocompletion.

We then study the impact of the representation of code on repetition. The n-gram model assumes that the current token can be predicted by the sequence of n−1 previous tokens. When we extract program graphs of size 2, 3, and 4 nodes, we see that the abstract graph representation is much more concise and repetitive than the n-gram representations of the same code. This suggests that future work should focus on graphs that include control and data flow dependencies and not linear sequences of tokens.


The second part of this thesis focuses on cleaning English and code corpora to aid in machine translation. Generating source code API sequences from an English query using Machine Translation (MT) has gained much interest in recent years. For any kind of MT, the model needs to be trained on a parallel corpus. We clean StackOverflow, one of the most popular online discussion forums for programmers, to generate parallel English-code corpora. We contrast three data cleaning approaches: standard NLP, title only, and software task. We evaluate the quality of each corpus for MT. We measure the corpus size, percentage of unique tokens, and per-word maximum likelihood alignment entropy. While many works have shown that code is repetitive and predictable, we find that English discussions of code are also repetitive. Creating a maximum likelihood MT model, we find that English words map to a small number of specific code elements, which partially explains the success of using StackOverflow for search and other tasks in the software engineering literature and paves the way for MT. Our scripts and corpora are publicly available.


Acknowledgments

I would like to begin by acknowledging the guidance, encouragement, and support of my supervisor, Dr. Peter Rigby. Without his supervision this work would not have come into existence. I wholeheartedly thank him for the patience, effort, and time he devoted to me and my thesis.

I am also thankful to my fellow colleagues and paper collaborators, especially Dharani Kumar Palani, Nafiz-al-Naharul Islam, and Nicolas Chausseau-Gaboriault.

I am thankful to Dr. Sabine Bergler for her valuable suggestions that helped me generate new ideas for the research. Additionally, I would like to thank Dr. Tien Nguyen (University of Texas at Dallas) and Dr. Christoph Treude (University of Adelaide) for their time and guidance.

Last but certainly not least, I thank my parents who encouraged and supported me throughout the time of my research.


List of Figures

1 Pipeline for experiments performed in this study

2 SelfCrossEntropy of programming languages with and without SimpleSyntaxTokens

3 Comparing the Java API SelfCrossEntropy with raw Java source code, Java source code without SimpleSyntaxTokens, and English

4 The top 20% of the n-grams and n-node graphs account for the y-axis % of the usages. For example, the top 20% of the n-node graphs account for 80.6% of all usages.

5 A Groum representing iteration through a HashMap, which is an abstraction of the code in Listings 1 through 4.

6 An example of a StackOverflow post that answers in both English and code the question, “How can I refresh the cursor from a CursorLoader?” English words, such as “discarded”, can be aligned with code elements, such as restartLoader(), for use in MT. Post available at: https://stackoverflow.com/a/11092861/1055441

7 Per-word maximum likelihood alignment entropy


List of Tables

1 Corpus size in tokens per language

2 Percentage of language specific tokens, i.e., SimpleSyntaxTokens, for each programming language corpus

3 Percentage increase in SelfCrossEntropy after the removal of SimpleSyntaxTokens

4 The cumulative proportion of n-node graphs and n-grams from 0% to 100% in 10 point increments for all usages. For example, the top 40% of the n-node graphs account for over 89% of all usages. The table shows the left skew of the distributions.

5 Simple size measures of the corpora


Chapter 1

Introduction

Leveraging methods and algorithms fundamentally developed for Natural Language Processing [29] in modelling and analyzing programming languages is an interesting topic of research in the domain of software engineering. In this thesis we first investigate the regularity of multiple programming languages using n-gram language models and then use various NLP techniques to process software engineering documentation to create a bilingual English-code parallel corpus.

Language models are a very popular approach in the fields of Statistical Machine Translation (SMT) [32] and Natural Language Processing (NLP) [29]. The growing popularity of this approach has resulted in the application of language modelling techniques in diverse fields. In the field of Software Engineering, recent works have exploited the benefits of language modelling to study the ‘naturalness’ of software source code [26, 50, 13, 57]. Although the term naturalness does not in itself refer to any mathematical notion, it has been expressed mathematically using the theory of statistical language modelling [26]. In essence, language models trained on a large corpus assign higher naturalness to previously seen code and lower naturalness to unseen or rarely seen code. For example, Campbell et al. [13] showed that language models mark code which is syntactically faulty as unlikely or less likely.

The goal of Chapter 3 is to explain the repetitive behaviour of source code for multiple programming languages and to compare the repetitiveness of n-grams with graph representations of code. We perform our experiments from the point of view of the token distribution to determine if naturalness can be found in popular programming languages. We want to understand if there are any language specific features that make one language more repetitive than the others. We compare the n-gram representation (i.e., sequences of n tokens) used by previous works [71, 26, 13] with a graph based representation. We conjecture that low-level lexical tokens may artificially inflate the repetitiveness of source code. For example, in most of the popular programming languages the if token is always followed by a ( token. This trivial repetitiveness is not present in graphs. We particularly focus on the following:

• the impact of different types of tokens on the repetitiveness of source code. We examine keywords, operators, and separators to observe their repetitiveness in source code,

• changes in token repetitiveness after the removal of language specific tokens,

• repetitiveness of API element usage, and

• graph-based representation of source code and its impact on source code repetitiveness.

Chapter 3 is broken into the following sections. In Section 3.3, we describe our data. We replicate previous results in Section 3.4. In Section 3.5, we study the impact of types of tokens on repetitiveness. In Section 3.6, we examine the repetitiveness of API code elements. In Section 3.7, we compare graphs with n-grams. Since we extract different tokens and graphs, we describe the extraction methodology in the section where it is used. In Section 3.8, we discuss limitations of our work and threats to validity. In Section 3.9, we position our work in the context of the literature. In Section 3.10, we summarize our contribution and conclude the chapter.

In Chapter 4, we leverage our findings regarding the ‘naturalness’ of source code. Since source code API usages are much more repetitive than natural languages, we process source code and software engineering discussions in order to automatically translate from English to source code.

The process of translating between two languages automatically is known as Machine Translation (MT). Recent advances and growing computational power have increased the popularity of MT. MT techniques can be broadly classified into three classes: Statistical Machine Translation (SMT) [33, 2, 38], Example-based Machine Translation (EBMT) [69], and Neural Machine Translation (NMT) [6, 46]. Although MT approaches differ in terms of theory, algorithms, and efficiency, all approaches require a high volume and low noise parallel corpus [31, 7] or bitext [25]. Application of MT algorithms is not limited to translation between natural languages. In recent years, MT techniques have been used to translate from natural language to programming languages [24, 49, 56].

StackOverflow can be seen as a bilingual corpus that discusses programming in both English and code. However, StackOverflow posts are noisy because people write posts in an informal manner. Examples of noise in StackOverflow include incorrect spelling, inappropriate use of punctuation, use of acronyms without elaboration, and grammatical mistakes. From a linguistic point of view, this results in a degradation of the quality of corpus texts. Removing the noise from the StackOverflow data without any significant loss of relevant information is challenging. In Chapter 4, our goal is to clean the data using techniques ranging from general Natural Language Processing (NLP) [29] to Software Engineering specific techniques [70], and determine which techniques yield a corpus that can be used for MT. We process data using three different methods and determine the quality of the processed corpora using three evaluation metrics.


Chapter 4 is structured as follows. In Section 4.3, we detail our data and data cleaning approaches. In Section 4.4, we evaluate each corpus for MT. Finally, in Section 4.5, we conclude the chapter by summarizing our contribution and briefly discussing potential future work.

This thesis is organized as a “manuscript” thesis whereby each core chapter constitutes a submitted paper. There is also a background section and conclusion section that combine the manuscripts into a thesis. Chapter 2 briefly introduces the literature and background necessary for the thesis. Chapter 3 describes our paper on the “natural” properties of software. Chapter 4 is our paper on cleaning StackOverflow for machine translation. Chapter 5 summarizes our contributions and suggests potential future directions.


Chapter 2

Background and Literature

We break the related work into the following categories:

• Application of NLP in software engineering

• Research on code validation

• Research into autocompletion and recommenders

• Statistical translation

• Use of StackOverflow in software engineering research

2.1 Application of NLP in software engineering

Basic research into understanding redundancy and measuring entropy in languages has a long history. Shannon [68] developed statistical measures of entropy for the English language. Gabel and Su [22] noted high levels of redundancy in code and Hindle et al. [26] continued this work, demonstrating that software is highly repetitive and predictable. Recent work has replicated these software findings on a giga-token corpus [3] and looked at the entropy in local code contexts [71]. Others have examined repetition at the line level [59] or in other domains such as Android Apps [36, 5]. In each case, code has been found to be repetitive and predictable.

Besides language entropy, many software engineering researchers have used other NLP approaches, such as text classification and latent semantic analysis, in their works. For example, Huang et al. [27] and Maldonado et al. [42] took the text classification approach for identifying self-admitted technical debt. Lormans et al. used latent semantic indexing for designing implementation by linking requirements and test cases [39]. The Latent Dirichlet Allocation (LDA) technique was used by Wang et al. [72] to study developers’ interaction on StackOverflow. In [65], the authors studied what mobile App developers ask about on StackOverflow. They also use the LDA technique for topic identification from the discussion forum.

2.2 Research on code validation and checking

Most existing tools to find defects and other code faults use static analysis. Recent works have focused on using the statistical properties of the languages to find bugs and to suggest patches. For example, Campbell et al. [13] find that syntax errors can be identified using n-gram language models. Ray et al. [58] identified bugs and bug fixes in code because buggy code is less natural and has a higher entropy. Santos and Hindle [67] used the n-gram cross entropy of text in commit messages to successfully identify commits that were likely to make a build fail.

2.3 Research into autocompletion and suggestions

Modern IDEs contain an autocompletion feature that usually uses the structure of the language to make suggestions. Researchers working on code suggestion have long known intuitively that code is repetitive. For example, textual similarity of program code [4], commit messages [14], and API usage patterns [44] have been exploited to guide developers during their engineering activities. Building on this work, Zimmermann et al. [76] used association rule mining on CVS data to recommend source code that is potentially relevant to a given change task. Recent work by Azad et al. [5] has extended this work to make change rule predictions from a large community of similar Apps and the code discussed in StackOverflow discussions.

Advanced autocompletion techniques have leveraged the history of applications and the repetitive nature of programming to suggest code elements to developers. Robbes and Lanza [62] filtered the suggestions made by code completion algorithms based on, for example, where the developer had been working in the past and the changes he or she had made. Bruch et al. [9] suggested appropriate method calls for a variable based on an existing code base that makes similar calls to a library. Buse and Weimer [10] automatically generate code snippets from a large corpus of applications that use an API. Duala-Ekoko and Robillard [21] use structural relationships between API elements, such as the method responsible for creating a class, to suggest related elements to developers. Work by Nguyen et al. [50] uses statistical language models to autocomplete code accurately. Nguyen and Nguyen [47] expanded this work to graphs in order to create suggestions that are syntactically valid.


2.4 Statistical translation

Recent works have mirrored the success of Statistical Machine Translation in natural languages, e.g., Google Translate, and applied these approaches to translating English to code. For example, SWIM [56] uses a corpus of queries from Bing to align code and English and generates sequences of API usages. DeepAPI [24] uses recurrent neural networks trained on source code comments aligned with code to translate longer sequences of API calls. T2API [49] uses alignments between English and code on StackOverflow to generate a set of API calls. These calls are then rearranged based on the likelihood of existing program graphs. T2API can generate long graphs of common API usages from English. Our work provides a frame in which to understand these works. For example, the sequences of SWIM and DeepAPI tend to be short and simplistic as they are restricted by a left-to-right processing of tokens. In contrast, T2API, which re-orders API elements in a graph, can produce more complex usages.

2.5 Use of StackOverflow in software engineering research

Many researchers have used StackOverflow posts to perform experiments on issues related to empirical software engineering, program comprehension, code completion, etc. This data source is very popular among the software engineering research community due to its availability as well as its volume. Wong et al. mined StackOverflow data to autogenerate source code comments [73]. Pinto et al. studied software energy consumption from StackOverflow discussions in [53]. Wong et al. in [73] studied how developers interact in StackOverflow discussions. In a similar work, Chowdhury et al. worked on filtering out off-topic discussion from online discussion forums by mining StackOverflow [15]. Rigby et al. in [61] developed a tool for extracting salient code elements from StackOverflow posts, which we use in this thesis.


Chapter 3

Natural Software Revisited

Note: This chapter has been submitted to a conference and has been included verbatim in this manuscript thesis.

3.1 Abstract

Recent works have concluded that software is more repetitive and predictable, i.e., more natural, than English texts. These works included “simple/artificial” syntax rules in their language models. We find that while syntax is important, it is trivially predictable. For example, in “while (...)”, a bracket always follows “while”; the compiler has a rule for this. When we remove these and other SimpleSyntaxTokens, we find that code is still repetitive and predictable but only at levels slightly above English. Furthermore, previous works have compared individual Java programs to general English corpora, such as Gutenberg. Gutenberg contains a historically large range of styles and subjects (e.g., Saint Augustine to Oscar Wilde). We perform an additional comparison of StackOverflow English discussions with source code and find that this restricted English is almost as repetitive as code. Our results hold across seven programming languages.

Although we find that code is less repetitive than previously thought, we suspect that API code element usage will be repetitive across software projects. For example, a file is opened and closed in the same manner across domains. When we restrict our n-grams to those contained in the Java API we find that the entropy for 2-grams is significantly lower than the English corpus. This repetition partially explains the successful literature on API usage recommendation and autocompletion.

Previous works have focused on linear sequences of tokens. While n-grams work well for sequential natural languages, we suspect that they obscure abstract patterns in code. When we extract program graphs of size 2, 3, and 4 nodes, we see that the abstract graph representation is much more concise and repetitive than the n-gram representations of the same code. This suggests that future work should focus on graphs that include control and data flow dependencies and search for new representations that go beyond linear sequences of tokens. Our replication package makes our scripts and data available to future researchers [1].

3.2 Introduction

Language modelling is a popular approach in the fields of Statistical Machine Translation (SMT) [32] and Natural Language Processing (NLP) [29]. The growing popularity of this approach has resulted in the application of language modelling techniques in diverse fields. In the field of Software Engineering, language modelling has revealed power-law distributions and an apparent ‘naturalness’ of software source code [26, 50, 13, 57]. Although the term naturalness is vague, it has been expressed mathematically with statistical language models [26]. In essence, language models trained on a large corpus assign higher naturalness to previously seen code and lower naturalness to unseen or rarely seen code. For example, Campbell et al. [13] showed that language models mark code which is syntactically faulty as unlikely or less likely than code without syntax errors. The goal of this paper is to revisit the “natural” code hypothesis in new contexts. As in NLP, the SE tasks and context will require different tuning and cleaning of a corpus. For example, if the goal is to create an English grammar correction tool, then stopwords such as ‘the’ are necessary. In contrast, if the goal is to extract news topics, then stopwords must be removed as these dominant tokens will introduce noise and reduce the quality of predictions. Analogously, if the goal is to find syntax errors, then the corpus must include SimpleSyntaxTokens. In contrast, if the goal is to recommend multi-element API usages, then SimpleSyntaxTokens will dilute predictions. For example, Hindle et al. [26] did not remove SimpleSyntaxTokens, and their autocompletion model suggests a SimpleSyntaxToken approximately 50% of the time. As a result, a recommender tool would suggest an obvious separator before a useful token such as an API call.

In this work, we examine the repetitive behaviour of source code for multiple programming languages, we determine the impact of SimpleSyntaxTokens on repetition, we quantify how repetitive API usages are, and we compare the repetitiveness of n-grams vs graph representations of code. We examine each topic in the following four research questions.

RQ1, Replication: how repetitive and predictable is source code?

We replicate the work of Hindle et al. [26]. We also examine 6 additional programming languages: C#, C, JavaScript, Python, Ruby, and Scala. Our replication gives us confidence that our dataset is large and diverse enough to test the “naturalness” hypothesis in new contexts.

RQ2, Artificial Repetition: how repetitive and predictable is code once we remove SimpleSyntaxTokens?


In NLP, it is standard practice to remove punctuation and stopwords. We examine the contribution of three types of SimpleSyntaxTokens to the language distribution: separators, such as bracket and semi-colon; keywords, such as if and else; and operators, such as plus and minus signs.
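As a rough illustration of this kind of filtering, the sketch below removes separators, keywords, and operators from an already tokenized snippet and measures how many 2-grams end in a separator. The token sets and the function names `strip_simple_syntax` and `share_of_bigrams_ending_in_separator` are hypothetical and deliberately tiny; they are not the token classes or tooling used in this study.

```python
# Hypothetical, deliberately small token classes; a real lexer grammar
# defines many more separators, keywords, and operators.
SEPARATORS = {"(", ")", "{", "}", "[", "]", ";", ",", "."}
KEYWORDS = {"if", "else", "while", "for", "return", "int", "new"}
OPERATORS = {"+", "-", "*", "/", "=", "==", "<", ">", "++"}
SIMPLE_SYNTAX_TOKENS = SEPARATORS | KEYWORDS | OPERATORS


def strip_simple_syntax(tokens):
    """Return the token sequence without separators, keywords, and operators."""
    return [t for t in tokens if t not in SIMPLE_SYNTAX_TOKENS]


def share_of_bigrams_ending_in_separator(tokens):
    """Fraction of 2-grams whose second token is a separator."""
    bigrams = list(zip(tokens, tokens[1:]))
    if not bigrams:
        return 0.0
    hits = sum(1 for _, second in bigrams if second in SEPARATORS)
    return hits / len(bigrams)


# Assumed input: tokens already produced by a lexer for a small Java-like snippet.
tokens = ["if", "(", "count", "<", "limit", ")", "{", "count", "++", ";", "}"]
print(strip_simple_syntax(tokens))
print(share_of_bigrams_ending_in_separator(tokens))
```

Only the identifiers survive the filter in this toy input, which is the intuition behind measuring repetition with and without SimpleSyntaxTokens.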

RQ3, API Usages: how repetitive and predictable are Java API usages?

Frameworks and APIs provide reusable functionality to developers. Unlike the code written for a particular project, API code is similar across projects. For example, a file is opened and closed in the same manner whether it is used in banking or healthcare. We examine only Java API tokens and determine how repetitive and predictable their usage is. Given the large and successful literature on API usage recommendations and autocompletions, we suspect that API elements may be more repetitive and predictable than general program code.

RQ4, Code Graphs: how repetitive and predictable are graph representations of Java code?

An n-gram language model assumes that the current token can be predicted by the sequence of n−1 previous tokens. However, compilers and humans do not process programs sequentially. In the case of compilers, parse trees or syntax trees are generated to provide abstract representations of code. Eye-tracking studies of developers reading code show a nonlinear movement along the control and data flow of the program [11]. We extract the Graph-based Object Usage Model (Groum) [51] from Java programs and compare how repetitive graphs of 2, 3, and 4 nodes are with equivalent sized n-grams from the same Java programs.

The remainder of this paper is structured as follows. In Section 3.3, we describe our data. In Sections 3.4, 3.5, 3.6, and 3.7, we report the results of our experiments for each of the research questions. Since we extract different tokens and graphs, we describe the extraction methodology in the section in which it is used. In Section 3.8, we discuss limitations of our work and threats to validity. In Section 3.9, we position our work in the context of the literature. In Section 3.10, we summarize our contribution and conclude the paper. We also publicly release a replication package [1] which includes all processed n-gram and graph data as well as the scripts used in our processing pipeline.

3.3 Data Sources

Project Source Code: We create our source code corpus from 134 open source projects on GitHub. As a starting point, we select the Java and Python projects used in a prior study [71]. To ensure that we processed a consistent number of tokens for each language, between 20M and 25M tokens, we added Java and Python projects as well as projects from 5 additional programming languages. These projects were selected from the most popular projects on GitHub for each language.1 For all the projects, we examine only the master branch. Since each research question requires the source code to be processed differently, e.g., n-grams vs graphs, we describe the extraction methodology for each research question. The list of projects, scripts, and the processed n-grams and graphs can be found in our replication package [1]. A summary for each programming language is shown in Table 1.

1 Top GitHub projects per language:

Table 1: Corpus size in tokens per language

Language      Files     Total Tokens    Unique Tokens
Java          26,938    24,091,076      388,399 (1.61%)
C#            23,186    24,217,086      389,800 (1.61%)
C             10,932    25,255,417      938,434 (3.72%)
JavaScript    10,544    25,157,297      257,606 (1.02%)
Python        15,454    23,198,691      513,728 (2.21%)
Ruby          60,371    25,896,601      715,157 (2.76%)
Scala         34,242    23,634,250      333,794 (1.41%)

English and StackOverflow text: Following Hindle et al. [26], we process the Gutenberg corpus. We use a subset of the Gutenberg corpus which includes over 3.4k English works [35]. The corpus represents a range of styles, topics, and time periods, making Gutenberg a diverse corpus. In contrast, the programming corpora are for single programming languages. To make a more comparable English corpus, we process StackOverflow posts that discuss programming tasks in English for each programming language.

We extract 200,000 posts from StackOverflow by removing code and keeping only the English text.2 Furthermore, we use the following constraints to reduce noise and poorly constructed English when selecting posts:

1. We only use posts which are the accepted answer.

2. Each post has at least 10 positive votes. The corresponding question post has at least 1 positive vote.

3. We take posts which have at least 300 characters in the text body excluding the code snippet and any code words in the text. This ensures that our corpus has sufficient English tokens. Although we exclude code words, we take only posts that contain a code snippet to ensure that the discussion is about code and not, for example, configuration of an IDE.

To extract the English tokens in StackOverflow posts, we extract the necessary data (body without code) with a Python HTML library. We merge the posts into a single file and perform the NLP process steps of stemming, lemmatization, lexicalization and stopword removal.

2 https://archive.org/details/stackexchange, September 2016
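The selection constraints and the removal of code from post bodies can be sketched as follows. This is a minimal illustration using Python's standard `html.parser` module, not the scripts from the replication package; the dictionary keys `is_accepted`, `score`, `question_score`, and `body`, and the assumption that code snippets appear in `<pre>` elements and code words in `<code>` elements, are hypothetical.

```python
from html.parser import HTMLParser


class EnglishTextExtractor(HTMLParser):
    """Collect the text that lies outside <pre> and <code> elements."""

    def __init__(self):
        super().__init__()
        self._code_depth = 0
        self.has_snippet = False  # True once a <pre> block has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("pre", "code"):
            self._code_depth += 1
        if tag == "pre":
            self.has_snippet = True

    def handle_endtag(self, tag):
        if tag in ("pre", "code") and self._code_depth > 0:
            self._code_depth -= 1

    def handle_data(self, data):
        if self._code_depth == 0:
            self.parts.append(data)


def english_text(body_html):
    """Return (English text without code, whether the post has a code snippet)."""
    parser = EnglishTextExtractor()
    parser.feed(body_html)
    parser.close()
    text = " ".join("".join(parser.parts).split())
    return text, parser.has_snippet


def keep_post(post, min_chars=300, min_score=10, min_question_score=1):
    """Apply the three selection constraints to one answer post (hypothetical keys)."""
    text, has_snippet = english_text(post["body"])
    return (
        post["is_accepted"]
        and post["score"] >= min_score
        and post["question_score"] >= min_question_score
        and has_snippet
        and len(text) >= min_chars
    )


example = {
    "is_accepted": True,
    "score": 12,
    "question_score": 3,
    "body": "<p>Call <code>restartLoader()</code> to refresh.</p>"
            "<pre><code>getLoaderManager().restartLoader(0, null, this);</code></pre>",
}
print(english_text(example["body"]))
```

The example post is shorter than 300 characters, so `keep_post(example)` rejects it with the default threshold; stemming, lemmatization, and stopword removal would follow as separate steps.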

3.4 Replication

RQ1: How repetitive and predictable is software?

We replicate the work of Hindle et al.[26] to ensure that the data we sample produces similar results. We also examine C#, C, JavaScript, Python, Ruby, and Scala. We want to understand if the language and programming paradigm influence the repetitive nature of programming.

3.4.1 Theoretical background and methodology

We give the definitions of n-gram language models, cross entropy, and SelfCrossEntropy, and describe how we extract n-grams.

n-gram Language Model

We use the term language model (LM) to mean a probability distribution over a sequence of n tokens, $P(k_1, k_2, \ldots, k_n)$. An LM is trained on a corpus containing sequences of tokens from the language. Using this LM, our goal is to assign high probability to n-grams with high likelihood, and low probability to n-grams with lower likelihood. The primary purpose of modelling a language statistically using LMs is to model the uncertainty of the language by determining the most probable sequence of tokens for a given input.

Figure 1: Pipeline for experiments performed in this study

Consider a sequence of tokens $k_1, k_2, k_3, \ldots, k_{n-1}, k_n$ in a document $D$. n-gram models statistically calculate the likelihood of the nth token given the previous n−1 tokens. We can estimate the probability of a document as the product of a series of conditional probabilities:

$$P(D) = P(k_1)\,P(k_2 \mid k_1)\,P(k_3 \mid k_1, k_2) \cdots P(k_n \mid k_1, k_2, \ldots, k_{n-1})$$

Here, $P(D)$ is the probability of the document and each factor is the conditional probability of a token given the tokens that precede it. We can write the above equation in a more general form:

$$P(k_1, k_2, k_3, \ldots, k_{n-1}, k_n) = \prod_{i=1}^{n} P(k_i \mid k_1, \ldots, k_{i-1})$$

n-gram models simplify this product using the Markov Property, which assumes that token occurrences are influenced only by a limited prefix of length n [75]. Furthermore, we can consider this as a Markov Chain, which assumes that the outcome of the next token depends only on the previous n−1 tokens [52]. Thus we can write:

$$P(k_i \mid k_1, \ldots, k_{i-1}) \approx P(k_i \mid k_{i-(n-1)}, \ldots, k_{i-1})$$

This equation requires prior knowledge of the conditional probabilities for each possible n-gram. These conditional probabilities are calculated from the n-gram frequencies. We use these n-grams to determine the entropy of a language corpus including source code.
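The frequency-based estimate of these conditional probabilities can be sketched as below. This is a minimal, unsmoothed maximum likelihood estimate, given only as an illustration: the helper names `ngrams` and `mle_conditional` are hypothetical, and the sketch does not reproduce the smoothing applied by the toolkit used in this study.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token windows of the sequence as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mle_conditional(tokens, n):
    """Estimate P(k_i | previous n-1 tokens) as count(n-gram) / count(context).

    The context count is taken over the same n-gram windows, so every
    context counted here is followed by at least one token.
    """
    windows = ngrams(tokens, n)
    gram_counts = Counter(windows)
    context_counts = Counter(w[:-1] for w in windows)
    return {g: c / context_counts[g[:-1]] for g, c in gram_counts.items()}


# Assumed toy corpus of already lexed tokens.
corpus = "if ( x ) { } if ( y ) { }".split()
probabilities = mle_conditional(corpus, 2)
print(probabilities[("if", "(")])
print(probabilities[("(", "x")])
```

In this toy corpus the separator after `if` is fully predictable, while the identifier that follows the bracket is not, which mirrors the distinction drawn between trivial syntax and project-specific tokens.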

SelfCrossEntropy

Hindle et al. [26] calculate the average number of bits, i.e. entropy, required to predict the nth token of the n-grams in a document. They use the standard formula for cross-entropy, defined in the context of n-grams. Given a language model $M$, the entropy of a document $D$ with $n$ tokens is

$$H(D, M) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(k_i \mid k_1 \ldots k_{i-1})$$

They use cross-entropy in a unique manner to define SelfCrossEntropy. Instead of estimating the language model M from another document or corpus, they divide a single corpus into 10 folds. M is then calculated from 9 of the folds and H(D, M) is calculated with D being the remaining fold. The final SelfCrossEntropy is the average value across all folds.
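The fold-based procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's implementation: the function names are hypothetical, the model is a simple n-gram model with add-one smoothing (chosen only to keep the sketch short), and the training folds are concatenated directly, which introduces one spurious n-gram at each join.

```python
import math
from collections import Counter


def train(tokens, n):
    """Count n-grams and their (n-1)-token contexts in the training tokens."""
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter()
    for gram, count in grams.items():
        contexts[gram[:-1]] += count
    return grams, contexts, len(set(tokens))


def cross_entropy(test_tokens, model, n):
    """Average bits per n-gram of the test tokens under the model."""
    grams, contexts, vocab = model
    windows = [tuple(test_tokens[i:i + n])
               for i in range(len(test_tokens) - n + 1)]
    bits = 0.0
    for gram in windows:
        # Add-one smoothing (an assumption of this sketch): one extra
        # slot in the denominator is reserved for unseen tokens.
        p = (grams[gram] + 1) / (contexts[gram[:-1]] + vocab + 1)
        bits -= math.log2(p)
    return bits / len(windows)


def self_cross_entropy(tokens, n, folds=10):
    """Train on all folds but one, test on the held-out fold, average."""
    size = len(tokens) // folds
    scores = []
    for f in range(folds):
        held_out = tokens[f * size:(f + 1) * size]
        training = tokens[:f * size] + tokens[(f + 1) * size:]
        scores.append(cross_entropy(held_out, train(training, n), n))
    return sum(scores) / folds
```

A highly repetitive token stream, such as `["a", "b"] * 50`, yields a value close to zero bits, while a stream of tokens drawn at random from a larger vocabulary yields a much higher value.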

Extracting n-grams

We replicate Hindle et al. [26] using the same tools and methodology, as shown in Figure 1. We remove the source code comments. We lexicalize each source file in the project using ANTLR to extract code tokens. Then we merge all the lexicalized files to create a corpus. For example, to get the SelfCrossEntropy of the Java language, we process all .java files. Then we merge the processed files to create our final corpus. To calculate the SelfCrossEntropy, a single corpus is split into 10 folds. Ten-fold cross validation is used with the probability estimated from 90% of the data and validated on the remaining 10%. The results are averaged over the 10 test folds. We use the MIT Language Model (MITLM) toolkit to calculate the SelfCrossEntropy for each data set. MITLM uses techniques for n-gram smoothing to deal with unseen n-grams in the test fold (see Hindle et al. [26] for further discussion). We calculate the SelfCrossEntropy for token sequences, i.e. n-grams, from 1-grams to 10-grams for each programming corpus, the Gutenberg corpus, and the corpus of English text on StackOverflow. The processing pipeline for the experiments is shown in Figure 1.

Figure 2: SelfCrossEntropy of programming languages with and without SimpleSyntaxTokens. (a) SelfCrossEntropy with SimpleSyntaxTokens (raw source code). (b) SelfCrossEntropy without SimpleSyntaxTokens.

3.4.2

Replication Result

How repetitive and predictable is software?

Figure 2a shows the replication of Hindle et al.’s [26] work, including six additional programming languages and StackOverflow posts. All the programming languages under consideration for this study show the same pattern of SelfCrossEntropy. The highest SelfCrossEntropy is observed for unigram language models. The value of SelfCrossEntropy declines significantly for bigram and trigram models. From 3-grams to 10-grams the SelfCrossEntropy remains nearly constant. Since we are able to replicate Hindle et al.’s result, we are confident that our dataset is large and diverse enough to test the “naturalness” hypothesis in new contexts.


While the pattern is the same, the values of SelfCrossEntropy are substantially different for each language, with Scala being much less repetitive than C#. We conclude that the pattern of decreasing SelfCrossEntropy across n-grams reported in Hindle et al.’s work holds. However, the differences among languages lead us to conjecture that the syntax of the language is artificially reducing its SelfCrossEntropy.

3.5

Artificial Repetition

RQ2. How repetitive and predictable is code once we remove SimpleSyntaxTokens?

Standard preprocessing steps in NLP involve the removal of stopwords and punctuation [66, 45]. Stopwords, including articles, e.g., “the”, and prepositions, e.g., “of”, are removed in information retrieval tasks because they introduce noise in the data set, reducing the likelihood of retrieving interesting information. In our work, we examine the impact of three types of SimpleSyntaxTokens: separators, such as brackets and semi-colons; keywords, such as if and else; and operators, such as plus and minus signs. Hindle et al. did not remove these SimpleSyntaxTokens, and their autocompletion model suggests a SimpleSyntaxToken approximately 50% of the time. As a result, an autocompletion tool would suggest an obvious separator before a useful token such as an API call. In this section, we examine the impact of each type of SimpleSyntaxToken on the apparent repetitiveness of code.

3.5.1

Background and Methodology

To identify the SimpleSyntaxTokens for each programming language, we examined the language specification to identify the keywords, separators, and operators. We calculate the percentage of SimpleSyntaxTokens in each programming language. Then we remove SimpleSyntaxTokens from the corpus and measure the entropy of n-grams without the language-specific tokens. We report the change in SelfCrossEntropy of the n-grams after the removal of language-specific tokens and answer the following questions:

1. What percentage of total tokens are SimpleSyntaxTokens?

2. What is the change in SelfCrossEntropy after removing SimpleSyntaxTokens?

3. How repetitive is code without SimpleSyntaxTokens compared to English? (For the English corpora we removed the standard stopwords with the NLTK toolkit.)
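The first two steps, measuring the share of SimpleSyntaxTokens and removing them from a token stream, can be sketched as follows. The token sets below are a small hypothetical subset chosen for illustration; the study derives the full sets from each language specification, and the function names are assumptions of this sketch.

```python
from collections import Counter

# Illustrative subset only (assumption): not the full sets used in the study.
SIMPLE_SYNTAX = {
    "separator": {"(", ")", "{", "}", "[", "]", ";", ",", "."},
    "keyword": {"if", "else", "for", "while", "return", "new", "class"},
    "operator": {"+", "-", "*", "/", "=", "==", "<", ">"},
}


def token_kind(token):
    """Return the SimpleSyntaxToken category of a token, or None."""
    for kind, members in SIMPLE_SYNTAX.items():
        if token in members:
            return kind
    return None


def syntax_share(tokens):
    """Percentage of all tokens that fall into each category."""
    kinds = Counter(token_kind(t) for t in tokens)
    return {kind: 100 * kinds[kind] / len(tokens) for kind in SIMPLE_SYNTAX}


def strip_simple_syntax(tokens):
    """Keep only tokens that are not separators, keywords, or operators."""
    return [t for t in tokens if token_kind(t) is None]


if __name__ == "__main__":
    toy = "if ( x > 0 ) { return x ; }".split()
    print(syntax_share(toy))
    print(strip_simple_syntax(toy))
```

The stripped token list can then be passed to the same n-gram pipeline as the raw corpus to compare entropy before and after removal.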

Table 2: Percentage of language-specific tokens, i.e. SimpleSyntaxTokens, for each programming language corpus

Language     Separators   Keywords   Operators   Total
Java         44.00%       9.36%      5.85%       59.21%
C#           42.57%       10.96%     7.55%       61.08%
C            39.23%       5.50%      15.14%      59.87%
JavaScript   47.21%       6.87%      6.53%       60.61%
Python       41.98%       4.99%      6.42%       53.39%
Ruby         23.37%       8.37%      8.93%       40.67%
Scala        39.27%       7.40%      7.28%       53.95%

3.5.2

Results and Discussion

What percentage of total tokens are SimpleSyntaxTokens? Stopwords are removed during natural language information retrieval tasks because their high prevalence introduces noise, reducing the likelihood of retrieving high-value information. For our programming corpora, Table 2 shows that SimpleSyntaxTokens account for a high percentage of total tokens. Across the programming languages, C# and JavaScript have the highest percentages of SimpleSyntaxTokens at about 61% of total tokens each, while the smallest percentage is 41% for Ruby. Separators account for the largest proportion of SimpleSyntaxTokens, between 23% and 47% of all tokens.

The main implication from Table 2 is that SimpleSyntaxTokens dominate the tokens in all corpora and, when included, make code look artificially repetitive.

What is the change in SelfCrossEntropy after removing SimpleSyntaxTokens?

We remove the SimpleSyntaxTokens and recalculate the SelfCrossEntropy. In Table 3 we see that the increase in SelfCrossEntropy, i.e. a decrease in repetitiveness, is dramatic. For Java, we see that from 1-grams to 6-grams we need, respectively, 60%, 67%, 90%, 95%, 97%, and 99% more bits. After 6-grams we need a nearly constant 100% increase in bits. Clearly more information is required to encode Java programs without the artificially repetitive SimpleSyntaxTokens.

How repetitive is code without SimpleSyntaxTokens compared to English?

We investigate the difference in SelfCrossEntropy between programming languages and English by reporting the number of additional bits necessary to encode English. Hindle et al. [26] report a maximum average per-word entropy of approximately 8 bits for English and 2 bits for Java, which means that English requires 4 times as many bits, while for 2-grams and 3-grams, English requires 2 and 2.7 times as many bits. Similarly, we find that before removing SimpleSyntaxTokens, English requires 1.7, 2.3, 2.7, 2.8, and 2.9 times as many bits as Java for 1-grams to 5-grams. After 5-grams the ratio is constant at 2.9 times.

Table 3: Percentage increase in SelfCrossEntropy after the removal of SimpleSyntaxTokens

Language     1-gram  2-gram  3-gram  4-gram  5-gram  6-gram  7-gram  8-gram  9-gram  10-gram
Java         60.18   67.40   90.17   94.70   97.49   98.75   99.67   100.55  100.75  101.00
C#           56.73   48.26   77.66   84.34   87.52   89.42   90.04   90.65   90.94   91.16
C            46.75   64.50   81.33   84.95   87.85   90.30   91.80   92.26   92.66   92.85
JavaScript   42.48   35.72   34.61   31.47   32.58   33.03   33.43   33.60   33.66   33.69
Python       57.67   62.81   82.20   86.07   89.45   90.11   90.55   90.53   90.65   90.70
Scala        18.48   15.29   12.74   11.60   11.34   11.22   11.13   10.75   10.78   10.85
Ruby         31.18   36.82   42.91   44.65   45.55   45.87   45.99   46.02   46.05   46.07

However, without SimpleSyntaxTokens the number of additional bits required is substantially less for Java: 1.0, 1.4, 1.4 additional bits for 1-grams to 3-grams, remaining constant at 1.5 from 4-grams to 10-grams. This provides further evidence that SimpleSyntaxTokens clearly account for a large proportion of the repetitiveness in Java. With slight variation in the actual number, this result generalizes to the other programming languages in Figure 2b.

As we discussed in the data section, the Gutenberg corpus contains a wide range of English writing styles, topics, and authors. In contrast, the programming corpora used in our work and in that of Hindle et al. are for single programming languages. To provide a more comparable English corpus, we processed StackOverflow posts related to each programming language. We find that the SelfCrossEntropy of English on StackOverflow is highly similar to that of code. For example, Java requires 0.9 times as many bits as StackOverflow English to encode 1-grams. Clearly the vocabulary on StackOverflow is very limited. For 2-grams, 1.1 times as many bits are required, and this number remains constant at 1.2 for 3-grams to 10-grams. After 2-grams we see that sequences of token usages are larger in StackOverflow. This is likely because classes and methods tend to be used together in Java. However, compared to the originally reported 4 times as many bits, or 300% more bits, the removal of SimpleSyntaxTokens yields 1.1 to 1.2 times as many bits, or 10 to 20% more bits. This result is consistent across programming languages.
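The two ways of stating a difference used in this section, "times as many bits" and "% more bits", are related by simple arithmetic. The short Python sketch below makes the conversion explicit using the approximate 8-bit and 2-bit figures quoted above; the function names are hypothetical.

```python
def times_as_many(bits_a, bits_b):
    """How many times as many bits corpus A needs compared with corpus B."""
    return bits_a / bits_b


def percent_more(bits_a, bits_b):
    """The same comparison expressed as a percentage increase over B."""
    return (bits_a / bits_b - 1) * 100


if __name__ == "__main__":
    # Approximate per-token entropies quoted in the text: English 8, Java 2.
    print(times_as_many(8, 2), percent_more(8, 2))
```

A ratio of 4 therefore corresponds to 300% more bits, and a ratio of 1.2 to roughly 20% more bits.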


3.5.3

Concluding discussion on SimpleSyntaxTokens

Hindle et al. were “worried” by the questions that we ask in this section [26]. They asked “is the increased regularity we are capturing in software merely a difference between the English and Java languages themselves? Java is certainly a much simpler language than English, with a far more structured syntax.” To answer this question, they conducted an experiment where they compared the SelfCrossEntropy of a single program with the cross entropy of predicting the tokens in one Java program with those in other Java programs. They conclude that, because the entropy for single programs is lower than the entropy between programs, the regularity of software is “not an artifact of the programming language syntax.” However, in both cases the programs were written in the same language, Java, using the same syntax. Their experiment clearly does not control for simple syntactical regularities in the Java language. In contrast, in our study we remove SimpleSyntaxTokens and find that the regularity of programs drops dramatically. We conclude that the syntax of programming languages artificially reduces the entropy of software. Our findings suggest that software engineers should follow the NLP practice of removing stopwords and punctuation, in this case SimpleSyntaxTokens, to reduce the noise they introduce and to make higher-value autocompletion suggestions.

3.6

API Usages

RQ3. How repetitive and predictable are Java API usages?

API code is used across multiple projects in the same manner regardless of the domain of the project. We extract Java API tokens and determine how predictable their usage is. We conjecture that sequences of API elements, i.e. API usages, should be more repetitive and predictable than general program code.

3.6.1

Background and Methodology

We extract the Java API elements from the Java Platform Library Standard Edition 7 Specification [23]. We remove all tokens from the Java corpus which are not part of the Java standard libraries. The set of API elements includes package, class, field, and method names. For the Java corpus, we calculate the SelfCrossEntropy for API usages of size 1 to 10-grams.

Figure 3: Comparing the Java API SelfCrossEntropy with raw Java source code, Java source code without SimpleSyntaxTokens, and English

3.6.2

Results and Discussion for API Usages

Figure 3 compares the SelfCrossEntropy of n-gram API usages in Java to raw Java, Java without SimpleSyntaxTokens, StackOverflow English, and Gutenberg. We find that Java API usages are less repetitive and predictable than the raw corpus, which contains SimpleSyntaxTokens. This result derives from the high proportion of SimpleSyntaxTokens, i.e. 59% of tokens in Java are SimpleSyntaxTokens (Table 2). Java that excludes SimpleSyntaxTokens but includes internal code requires 20% more bits for 1-grams and a consistent 30% more for 2 to 10-grams compared with the Java API. This is likely because the domain-specific tokens, for example, the “BankAccount” class in a banking application, are used much less repetitively than the API code, such as the “String” or “InputStreamReader” classes in the standard Java 7 libraries.

The corresponding numbers for English on StackOverflow are 30% to 60% more bits. For Gutenberg, which includes a diverse set of English texts, 50% to 90% more bits are required. These differences are substantially smaller than the difference between Gutenberg and raw Java, where between 70% and 190% more bits are required to encode the Gutenberg corpus.

We conclude that raw Java code that contains SimpleSyntaxTokens is more repetitive than the Java API usages, likely due to the repetitive use of syntax rules. In contrast, we find that the Java API is more repetitive than general Java code that does not contain SimpleSyntaxTokens. Our finding that Java API usages are quite repetitive quantifies the truth underlying the large and successful literature on suggesting sophisticated API autocompletions (e.g., [44, 5, 62, 10, 50]).

3.7

Code Graphs

RQ4. How repetitive and predictable are graph representations of Java code?

The assumption made by the n-gram language model is that the current token can be predicted by the sequence of n−1 previous tokens. For many natural languages the assumption holds, as they are interpreted sequentially from left to right. In contrast, compilers and humans do not usually process programs sequentially. In the case of compilers, parse trees or syntax trees are generated to provide abstract representations. Eye-tracking studies of developers reading code show a nonlinear movement along the control and data flow of the program [11], which differs from natural language reading strategies [18], for example, by focusing on method signatures [63] and following beacons [17] in the code. In this section, our goal is to measure how repetitive an abstract graph representation of code is and to understand if it reveals repetitions that cannot be identified with n-grams.


3.7.1

Background and Methodology

In order to determine how repetitive code graphs are, we need a graph extraction technique that is able to satisfy the following requirements:

1. Extract the code graphs from a large number of projects that may not be able to compile due to, for example, external dependencies.

2. Filter out granular information, such as variables and expressions, to include only control and data dependencies among class objects and methods in the code graphs.

3. Identify isomorphic code graphs to determine the occurrence frequency of each graph.

We evaluated the Eclipse AST parser, and found that it had critical limitations:

1. The Java project dependencies must be present for each project.

2. The AST includes low-level details, such as variable names, which would artificially reduce graph frequencies.

3. Techniques [34, 60, 28] to identify structural similarities in the code using ASTs are computationally expensive [48].

In summary, the Eclipse AST parser is designed for static analysis, but is not appropriate for statistics-based recommendations.

In contrast, GrouMiner [47] was designed to extract GRaph-based Object Usage Models (Groums) and to efficiently identify isomorphic graphs. Below we describe the steps necessary to extract the frequency of Java code graphs:

1. Recoder is used to extract an AST without the need to compile the program [41].

2. GrouMiner transforms the AST for each method body into a Groum. The nodes in a Groum represent constructors, method invocations, field accesses, and branching points for control structures. The edges represent temporal, data, and control dependencies between nodes.

3. Graph induction is used to generate subgraphs of the Groum for a specified size, in our case 2, 3, and 4-node graphs.

4. GrouMiner computes the occurrence frequencies of each Groum using the technique of [48].
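To illustrate what counting the occurrence frequency of isomorphic graphs involves, the Python sketch below reduces each small labelled, directed graph to a canonical form by trying every node ordering and then counts identical forms. This brute-force approach is an assumption made for illustration and is only practical for graphs of a few nodes; it is not the technique of [48] and not GrouMiner's implementation. The function names and the graph encoding (a list of node labels plus a list of index pairs) are hypothetical.

```python
from collections import Counter
from itertools import permutations


def canonical_form(labels, edges):
    """Smallest (labels, edges) encoding over all node orderings.

    Two labelled directed graphs are isomorphic exactly when their
    canonical forms are equal. Cost grows factorially with node count.
    """
    best = None
    for order in permutations(range(len(labels))):
        # position[node] is the new index of `node` under this ordering.
        position = {node: idx for idx, node in enumerate(order)}
        encoded = (
            tuple(labels[node] for node in order),
            tuple(sorted((position[a], position[b]) for a, b in edges)),
        )
        if best is None or encoded < best:
            best = encoded
    return best


def graph_frequencies(graphs):
    """Count how often each isomorphism class occurs."""
    return Counter(canonical_form(labels, edges) for labels, edges in graphs)


if __name__ == "__main__":
    g1 = (["Map.entrySet", "Map.Entry.getKey", "FOR"], [(0, 2), (2, 1)])
    # Same graph as g1 with its nodes listed in a different order.
    g2 = (["FOR", "Map.entrySet", "Map.Entry.getKey"], [(1, 0), (0, 2)])
    print(graph_frequencies([g1, g2]))
```

With this encoding, the two example graphs fall into the same class even though their nodes are listed in different orders.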

3.7.2

Data

We use GrouMiner to capture the occurrence frequency of each Groum in the Java projects used in the previous n-gram sections. In the previous section we found that API code tends to be more repetitive and predictable across multiple projects. As a result, we capture Groums containing API usages from the Java Platform Standard Edition 7 Specification [23]. We include Groums that contain at least one Java API node. We eliminate Groums which contain only control flow structures or only internal code. To perform a fair comparison with n-grams, we use the same inclusion and exclusion criteria to filter the n-gram tokens. Our goal is to study the inherent degree of repetition for the two representations, graphs and n-grams. In the previous sections, we calculated the SelfCrossEntropy by predicting the nth token for n-grams in 10-fold cross validation. Since graphs are not sequential, the most appropriate prediction comparison is unclear. To avoid this problem, we examine the underlying frequency distribution for each set of n-grams and n-node graphs on the same set of Java projects. This strategy of examining the distribution has been employed in many previous works examining code structure [40, 74, 8, 16]. The more left skewed the distribution, the more repetitive and predictable the representation.
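The distribution-based comparison can be sketched as a function that reports what share of all occurrences is covered by the most frequent fraction of distinct items. This is a minimal illustration with a hypothetical function name; the input is assumed to be a list of occurrence counts, one per distinct n-gram or graph.

```python
def top_share(frequencies, top_fraction=0.2):
    """Percentage of all occurrences covered by the most frequent items.

    `frequencies` holds one occurrence count per distinct n-gram or
    graph; `top_fraction` selects the most frequent share of them.
    """
    counts = sorted(frequencies, reverse=True)
    k = max(1, round(len(counts) * top_fraction))
    return 100 * sum(counts[:k]) / sum(counts)


if __name__ == "__main__":
    skewed = [80, 5, 5, 5, 5]   # one item dominates
    uniform = [1] * 10          # every item equally frequent
    print(top_share(skewed), top_share(uniform))
```

A strongly skewed distribution gives a value far above the selected fraction, whereas a uniform one gives a value equal to it.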

3.7.3

Results and Discussion for Java Code Graphs

We collect Groums with 2, 3, and 4 nodes and the corresponding n-grams. We measure the occurrence frequencies of each Groum and n-gram across the Java projects. Since graphs represent an abstraction of code, we conjecture that, on the same code, Groums will have a stronger Pareto-type distribution than n-grams, i.e. graphs will be more repetitive and left skewed. In Figure 4 we plot the top 20% of the n-grams and n-node Groums against the percentage of total n-grams and n-node Groums, respectively. We see that both n-grams and n-node Groums are highly left skewed. For example, the top 20% of n-grams account for 76%, 58%, and 51% of all instances of 2, 3, and 4-grams, respectively. The top 20% of n-node Groums account for 81%, 73%, and 72% of instances of 2, 3, and 4-node graphs, respectively. The top 20% of graphs thus cover 5, 15, and 21 percentage points more instances than the top 20% of n-grams. Furthermore, the drop between 2-nodes and 3-nodes is much smaller than between 2-grams and 3-grams, indicating that graphs remain highly repetitive with increasing size.

Table 4 shows the complete distribution from 10% to 90% for graphs and n-grams. Only the column at 20% is represented in Figure 4; for space reasons we cannot plot the remaining values, as this would require 18 lines. The table shows that the pattern remains clear, with n-node graphs being more left skewed than n-grams. We conclude that graph representations are much more repetitive than sequential n-gram representations.

3.7.4

Illustration of Graphs

We have quantitatively determined that Groums are more repetitive and predictable than n-grams. In this section, we provide illustrations of why they are more repetitive. For example, an n-gram sequence will not capture the relationship between File.open() and File.close(), because there will always be other tokens, such as File.read(), between these API calls. Although we removed SimpleSyntaxTokens in this section, if they are included the problem is exacerbated because obvious tokens lie between related API calls. In contrast, Groums will always contain a data dependency edge between File.open() and File.close() even when internal classes are present. The temporal program flow will still be captured by control edges.

A more complex example from our corpus of Java programs illustrates the transformation of separate program code fragments into a common abstract Groum with 4 nodes. The Groum in Figure 5 represents the API usage pattern of iterating through a java.util.HashMap with an enhanced for loop. The Groum is an abstract representation of the code in Listings 3.1 to 3.4 as well as 23 other classes in the Neo4J project. Specifically, the Groum contains the data and control flow dependencies between Map.entrySet(), Map.Entry.getKey(), Map.Entry.getValue(), and an enhanced for loop. For example, in Listing 3.2 the code iterates through a hashmap of tracked client sessions and in Listing 3.1 the code iterates through a hashmap of throughput reports.

Below we use the listings to show the important differences between the Groum and n-gram models.

Abstraction: From examining the listings, it is clear that no n-gram model would consider these code fragments as identical. There are many internal classes and SimpleSyntaxTokens between these API elements. Even when only API elements are considered, there would be no direct sequence with Map.entrySet() preceding Map.Entry.getValue(). This relationship is only captured as a data dependency in a graph.

Size: The size of the n-gram necessary to capture each of these code fragments would be much larger than the 4-node Groum. For example, if we include SimpleSyntaxTokens, for the respective listings we need sequences with 34, 32, 30, and 38 tokens to represent the code in each listing. Without SimpleSyntaxTokens the corresponding number of tokens is smaller but still quite large at 14, 15, 13, and 15 tokens per listing.

We conclude that Groums capture information about the control and data flow at a higher level of abstraction, which makes them a more repetitive representation of code than n-grams. Graphs are also a more realistic representation of code than sequential n-grams, as compilers and humans do not process code sequentially. Graphs are more appropriate for statistical code autocompletion because they can suggest non-sequential relationships that cannot be represented in an n-gram model.

Figure 4: The top 20% of the n-grams and n-node graphs account for the y-axis% of the usages. For example, the top 20% of the n-node graphs account for 80.6% of all usages.

Table 4: The cumulative proportion of n-node graphs and n-grams from 0% to 100% in 10 point increments for all usages. For example, the top 40% of the n-node graphs account for over 89% of all usages. The table shows the left skew of the distributions.

Cumulative Percentage   2-node graph   3-node graph   4-node graph   2-gram   3-gram   4-gram
0                       0.00           0.00           0.00           0.00     0.00     0.00
10                      71.46          62.18          61.50          66.22    47.34    38.72
20                      80.58          72.90          72.06          75.55    58.06    50.97
30                      85.82          79.21          78.38          80.95    65.96    57.13
40                      89.14          84.11          83.53          85.58    70.82    63.26
50                      92.21          87.75          87.14          87.98    75.69    69.38
60                      93.76          90.20          89.71          90.38    80.55    75.50
70                      95.32          92.65          92.28          92.79    85.41    81.63
80                      96.88          95.10          94.86          95.19    90.27    87.75
90                      98.44          97.55          97.43          97.60    95.14    93.88
100                     100.00         100.00         100.00         100.00   100.00   100.00

Figure 5: A Groum representing iteration through a HashMap, which is an abstraction of the code in Listings 3.1 through 3.4.

Listing 3.1: TransactionThroughputChecker.java

    private void printThroughputReports( PrintStream out ) {
        out.println( "Throughput reports (tx/s):" );
        ...
            out.println( "\t" + entry.getKey() + "  " + entry.getValue() );
        }
        out.println();
    }

Listing 3.2: GlobalSessionTrackerState.java

    public GlobalSessionTrackerState newInstance() {
        GlobalSessionTrackerState copy = new GlobalSessionTrackerState();
        copy.logIndex = logIndex;
        for ( Map.Entry<MemberId,LocalSessionTracker> entry : sessionTrackers.entrySet() ) {
            copy.sessionTrackers.put( entry.getKey(), entry.getValue().newInstance() );
        }
        return copy;
    }

Listing 3.3: ListAccumulatorMigrationProgressMonitor.java

    public Map<String,Long> progresses() {
        Map<String,Long> result = new HashMap<>();
        for ( Map.Entry<String,AtomicLong> entry : events.entrySet() ) {
            result.put( entry.getKey(), entry.getValue().longValue() );
        }
        return result;
    }

Listing 3.4: ExpectedTransactionData.java

    private Map<Node,Set<String>> cloneLabelData( Map<Node,Set<String>> map ) {
        Map<Node,Set<String>> clone = new HashMap<>();
        for ( Map.Entry<Node,Set<String>> entry : map.entrySet() ) {
            clone.put( entry.getKey(), new HashSet<>( entry.getValue() ) );
        }
        return clone;
    }

3.8

Limitations and Validity

because it does not require the project to be compilable. Recoder, like the PPA tool [19], has known limitations that lead to unknown nodes in a graph. When a node is unknown we are unable to generate a Groum. For 2, 3, and 4-node graphs, 4.5%, 8.0%, and 10.6% of graphs, respectively, contain an unknown node. These percentages are in line with the 90% accuracy of state-of-the-art partial program analysis and code snippet analysis tools [19, 41, 61].

A second limitation is that identifying isomorphic graphs using GrouMiner [48] is computationally expensive. In this work, we calculated Groum sizes up to 4 nodes. Based on our analysis, we have seen that the probability distributions of 3-node and 4-node graphs remain nearly constant, indicating that, like n-grams, larger n-node graphs exhibit similar degrees of repetition. Furthermore, since graphs are at a higher degree of abstraction, fewer nodes are necessary to represent the same block of code when compared to sequential n-grams.

Limitations of SelfCrossEntropy: In terms of entropy calculations, SelfCrossEntropy is an extension of cross entropy whereby 10-fold cross validation is used to calculate the per-token average of the probability with which the language model generates the test data [26]. Ideally, we would calculate all possible combinations of the next token; however, as Shannon [68] points out, this is impractical, with complexity $O(t^N)$, where t is the number of unique tokens and N is the total number of tokens in the corpus. For each language in our corpus there are over 300k unique tokens and 20 million total tokens. As a result, SelfCrossEntropy serves as a good approximation of entropy.

Reliability and External Validity: By examining a diverse set of languages we increase the generalizability of our results. Furthermore, in RQ1 our goal was to replicate previous work and to ensure that our data and scripts produced consistent results. We were successful in this replication, increasing the validity of the data used in the novel work in subsequent research questions. In our replication package [1], we have included all processed n-gram and graph data as well as the scripts used in our processing pipeline to allow other researchers to validate and extend our work.

3.9

Related Work

Research into language entropy.

Basic research into understanding redundancy and measuring entropy in languages has a long history. Shannon [68] developed statistical measures of entropy for the English language. Gabel and Su [22] noted high levels of redundancy in code, and Hindle et al. [26] continued this work, demonstrating that software is highly repetitive and predictable. Recent works have replicated these software findings on a giga-token corpus [3] and looked at the entropy in local code contexts [71]. Others have examined repetition at the line level [59] or in other domains such as Android Apps [36, 5]. In each case, code has been found to be repetitive and predictable. In our work, research question 1 replicates Hindle et al.’s work, expanding it to multiple programming languages. We noted differences among programming languages and conjectured that these differences may be due to syntax. Following NLP practices of removing stopwords and punctuation, we remove operators, separators, and keywords, and find that without these highly repetitive tokens software is much less repetitive and predictable. While we support the general conclusion that code is repetitive and predictable, we find that it is not much more repetitive than English. This conclusion is important because it will reframe the ease with which statistical predictions about software can be made.

Research on code validation and checking.

Most existing tools to find defects and other code faults use static analysis. Recent works have focused on using the statistical properties of the languages to find bugs and to suggest patches. For example, Campbell et al. [13] find that syntax errors can be identified using n-gram language models. Ray et al. [58] identified bugs and bug fixes in code because buggy code is less natural and has a higher entropy. Santos and Hindle [67] used the n-gram cross entropy of text in commit messages to successfully identify commits that were likely to make a build fail. Our research confirms that statistical code checking will work much better on syntax or APIs than on internal classes, as the former are much more repetitive.

Research into autocompletion and suggestions.

Modern IDEs contain an autocompletion feature that usually uses the structure of the language to make suggestions. Research into code suggestion has long known intuitively that code is repetitive. For example, textual similarity of program code [4], commit messages [14], and API usage patterns [44] have been exploited to guide developers during their engineering activities. Building on this work, Zimmermann et al. [76] used association rule mining on CVS data to recommend source code that is potentially relevant to a given change task. Recent work by Azad et al. [5] has extended this work to make change rule predictions from a large community of similar Apps and the code discussed in StackOverflow discussions.

Advanced autocompletion techniques have leveraged the history of applications and the repetitive nature of programming to suggest code elements to developers. Robbes and Lanza [62] filtered the suggestions made by code completion algorithms based on, for example, where the developer had been working in the past and the changes he or she had made. Bruch et al. [9] suggested appropriate method calls for a variable based on an existing code base that makes similar calls to a library. Buse and Weimer [10] automatically generate code snippets from a large corpus of applications that use an API. Duala-Ekoko and Robillard [21] use structural relationships between API elements, such as the method responsible for creating a class, to suggest related elements to developers. Nguyen et al. [50] use statistical language models to autocomplete code accurately. Nguyen and Nguyen [47] expanded this work to graphs in order to create suggestions that are syntactically valid. Much of this work focuses on suggesting API elements. Our work suggests that API usages are substantially more repetitive and predictable than general code, explaining the success of API suggestion approaches. Furthermore, we show why graphs are a more appropriate representation of code, and we hope this will encourage future researchers to focus on graph abstractions instead of sequential tokens.

Statistical translation.

Recent works have mirrored the success of Statistical Machine Translation in natural languages, e.g., Google Translate, and applied these approaches to translating English to code. For example, SWIM [56] uses a corpus of queries from Bing to align code and English and generates sequences of API usages. DeepAPI [24] uses recurrent neural networks, trained on source code comments aligned with code, to translate longer sequences of API calls. T2API [49] uses alignments between English and code on StackOverflow to generate a set of API calls. These calls are then rearranged based on the likelihood of existing program graphs. T2API can generate long graphs of common API usages from English. Our work provides a frame in which to understand these works. For example, the sequences of SWIM and DeepAPI tend to be short and simplistic, as they are restricted by a left-to-right processing of tokens. In contrast, T2API, which re-orders API elements in a graph, can produce more complex usages.

3.10 Conclusion

Our findings confirm previous work showing that code is repetitive and predictable. However, it is not as repetitive and predictable as Hindle et al. [26] suggested. We have found that the repetitive syntax of the programming language makes software look artificially much more repetitive than English. For example, language-specific SimpleSyntaxTokens account for 59% of the total Java tokens in our corpus. We conclude that the researcher must ensure that the corpus is tuned and cleaned for the prediction task. If the goal is to statistically recommend tokens that are related to complex software engineering tasks, for example, completing a set of API calls, then suggesting SimpleSyntaxTokens, such as semicolons, that are encoded as rules in a compiler will simply distract from more interesting recommendations.
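The effect of syntax tokens on a corpus can be illustrated with a minimal sketch. The token set, the regular-expression lexer, and the function names below are assumptions made for illustration only; they are not the SimpleSyntaxTokens definition or the tooling used in this thesis.

```python
import re
from typing import List

# Hypothetical stand-in for the SimpleSyntaxTokens category: a few
# separators, operators, and keywords. Not the list used in the thesis.
SIMPLE_SYNTAX_TOKENS = {
    ";", ",", ".", "(", ")", "{", "}", "[", "]", "=",
    "public", "private", "static", "void", "return", "new",
}

# Crude lexer (an assumption): identifiers/keywords, integer literals,
# or any single non-whitespace symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code: str) -> List[str]:
    """Split a code string into tokens with the crude lexer above."""
    return TOKEN_PATTERN.findall(code)


def syntax_share(tokens: List[str]) -> float:
    """Fraction of tokens that belong to the simple syntax set."""
    if not tokens:
        return 0.0
    simple = sum(1 for t in tokens if t in SIMPLE_SYNTAX_TOKENS)
    return simple / len(tokens)


def drop_simple_syntax(tokens: List[str]) -> List[str]:
    """Keep only tokens outside the simple syntax set."""
    return [t for t in tokens if t not in SIMPLE_SYNTAX_TOKENS]


snippet = "int total = list.size(); return total;"
tokens = tokenize(snippet)
print(syntax_share(tokens))
print(drop_simple_syntax(tokens))
```

Under these assumptions, more than half of the tokens in even this short snippet are syntax, and the filtered sequence retains only the identifiers and types that a recommender would plausibly want to predict.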

We make our scripts, n-grams, and graphs available in our replication package [1] and hope that our work will be used by researchers to select appropriate corpora with sufficient repetition. For example, we conducted a failed experiment to suggest patches based on past fixes using an n-gram language model. Had we had our current analysis, there would have been little need to conduct the experiment, as it would have been obvious that internal class tokens and usages were too infrequent to be used successfully in a statistical model. Future work to complement static analysis with statistical models could allow for appropriate recommendations even when a class is used infrequently.
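Why infrequent tokens defeat an n-gram model can be seen in a small sketch. It assumes a bigram model with add-one smoothing and a single slot for unseen tokens, which is a simplification chosen for brevity and not necessarily the model or smoothing used in our experiments. The token names, including `FooCache` and `evictStale`, are hypothetical.

```python
import math
from collections import Counter


def train_bigram(tokens):
    """Count contexts and bigrams from a training token sequence."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens, tokens[1:]))
    vocab = set(tokens)
    return contexts, bigrams, vocab


def cross_entropy(model, tokens):
    """Average bits per token of `tokens` under an add-one smoothed bigram model."""
    contexts, bigrams, vocab = model
    v = len(vocab) + 1  # one extra slot shared by all unseen tokens
    pairs = list(zip(tokens, tokens[1:]))
    total = 0.0
    for prev, cur in pairs:
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + v)
        total -= math.log2(p)
    return total / len(pairs)


# A highly repetitive "API usage" training sequence (toy data).
model = train_bigram(["open", "read", "close"] * 20)

frequent = cross_entropy(model, ["open", "read", "close"])
# Hypothetical internal class tokens never seen in training.
rare = cross_entropy(model, ["FooCache", "evictStale", "FooCache"])
print(frequent, rare)
```

In this toy setting the unseen internal tokens cost far more bits per token than the repeated API sequence, which mirrors the reason the patch-suggestion experiment had too little repetition to succeed.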

The success of API usage recommendations flows naturally from our findings. When the vocabulary is tuned to API code tokens and the usage of these API elements is examined across many programs, there is sufficient repetition to make accurate recommendations.

Software recommender tools are moving from simple single element autocompletions to multi-element, non-sequential recommendations of code blocks. Our work shows that different representations of code have different degrees of repetition. Graph representations, such as Groums, allow for a higher degree of abstraction and the data and control flow allow for non-sequential relationships. Furthermore, the abstract nature of graphs allows for a more concise representation that reduces the number of noise tokens in code predictions. We hope that future work will focus on new code representations that are tailored to statistical code suggestion allowing for complex and useful recommendations.

Chapter 4

Cleaning StackOverflow for use in Machine Translation

Note: This chapter has been submitted to a conference and has been included verbatim in this manuscript thesis.

4.1 Abstract

Generating source code API sequences from an English query using Machine Translation (MT) has gained much interest in recent years. For any kind of MT, the model needs to be trained on a parallel corpus. In this paper we clean StackOverflow, one of the most popular online discussion forums for programmers, to generate a parallel English-Code corpus. We contrast three data cleaning approaches: standard NLP, title only, and software task. We evaluate the quality of each corpus for MT. We measure the corpus size, percentage of unique tokens, self-cross entropy, and per-word maximum likelihood alignment entropy. While many works have shown that code is repetitive and predictable, we find that English discussions of code are also repetitive. Creating a maximum likelihood MT model, we find that English words map to a small number of specific code elements, which partially explains the success of using StackOverflow for search and other tasks in the software engineering literature and paves the way for MT. Our scripts and corpora are publicly available.
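Two of the corpus measures named above can be sketched on toy data. The sketch assumes that every English word in a post is aligned with every code token in the same post and that alignment probabilities are relative co-occurrence frequencies; this is one simple reading of a maximum likelihood alignment, not necessarily the exact procedure of this chapter. The function names and the toy pairs are hypothetical.

```python
import math
from collections import Counter, defaultdict


def unique_token_percentage(tokens):
    """Percentage of tokens in the corpus that are distinct."""
    if not tokens:
        return 0.0
    return 100.0 * len(set(tokens)) / len(tokens)


def alignment_entropy(pairs):
    """Per-word entropy (bits) of P(code token | English word).

    `pairs` is a list of (english_tokens, code_tokens) tuples.
    Probabilities are relative co-occurrence counts (an assumption).
    """
    cooc = defaultdict(Counter)
    for english, code in pairs:
        for word in english:
            cooc[word].update(code)
    entropy = {}
    for word, counts in cooc.items():
        total = sum(counts.values())
        entropy[word] = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
    return entropy


# Toy parallel corpus (hypothetical, for illustration only).
pairs = [
    (["open", "file"], ["FileReader"]),
    (["read", "file"], ["FileReader", "readLine"]),
]
entropy = alignment_entropy(pairs)
print(entropy)
print(unique_token_percentage(["a", "b", "a", "c"]))
```

A word with low entropy maps to few code elements, which is the property the abstract describes; a word whose mass is spread over many code tokens has high entropy and is a weaker translation signal.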

4.2 Introduction

The process of translating between two languages automatically is known as Machine Translation (MT). Recent advances and greater computational power have increased the popularity of MT. MT techniques can be broadly classified into three classes: Statistical Machine Translation (SMT) [33, 2, 38], Example-based Machine Translation (EBMT) [69], and Neural Machine Translation (NMT) [6, 46]. Although MT approaches differ in terms of theory, algorithms, and efficiency, all approaches require a high-volume, low-noise parallel corpus [31, 7] or bitext [25]. Application of MT algorithms is not limited to translation between natural languages. In recent years, MT techniques have been used to translate from natural language to programming languages [24, 49, 56].

StackOverflow, a developer question and answer forum, can be seen as a bilingual corpus because it discusses programming in both English and code. For example, in Figure 6 we see a post that answers, in both English and code, the question “How can I refresh the cursor from a CursorLoader?” English words in th
