Style - Source code authorship attribution

Authorship attribution relies on stylistic analysis to identify common authorship. This is unlike related fields such as plagiarism detection, which have the easier task of finding common content.

2.3. STYLE

Given the importance of style in authorship attribution, this section provides background material on writing style, coding style, and their differences. We then discuss how style evolves over time, and practices that are used for style obfuscation and general wrongdoing.

2.3.1 Writing Style

All authors exhibit personal preferences in their writing, as outlined in the Chapter 1 introduction. These preferences can be observed by measuring the parts of language and the writing structure in samples written by a single author. For example, Koppel et al. [2003] demonstrated how individual style comes about using the following three similar sentences:

• “John was lying on the couch next to the window.”
• “John was reclining on the sofa by the window.”
• “John had been lying on the couch near the window.”

These sentences all have some key words that remain unchanged (such as “John” and “window”), but there are others that are easily interchangeable such as the function words. Function words, such as conjunctions, are the common words that act as glue between other words in natural language, and have little meaning of their own. Function words are also known as stop words in information retrieval, as discussed in Section 2.5.1. Given that all authors need to use function words as the glue in their writing, the way that they are used can indicate individual style and idiom. An absence of function words might indicate unnatural content comprising “word salads” for search engine spamdexing [Lavergne, 2006].
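As a minimal sketch of the idea, the function words in the three example sentences can be counted as follows. The `FUNCTION_WORDS` list and the `function_word_profile` helper are assumptions made for illustration; they are not taken from Koppel et al. [2003], and real studies use much longer word lists.

```python
import re
from collections import Counter

# Hypothetical, deliberately small function-word list (an assumption for
# this sketch); treating "next" as part of the preposition "next to".
FUNCTION_WORDS = {"was", "had", "been", "on", "the", "to", "by", "near", "next"}

def function_word_profile(sentence):
    """Return counts of the listed function words that occur in the sentence."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    return Counter(t for t in tokens if t in FUNCTION_WORDS)

sentences = [
    "John was lying on the couch next to the window.",
    "John was reclining on the sofa by the window.",
    "John had been lying on the couch near the window.",
]

profiles = [function_word_profile(s) for s in sentences]
for sentence, profile in zip(sentences, profiles):
    print(sorted(profile.items()), "<-", sentence)
```

The content words “John” and “window” are ignored, so the three profiles differ only in the interchangeable glue words (“next to”, “by”, “near”, “was”, “had been”).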

2.3.2 Coding Style

The “John/window” example above demonstrates that there are some free components in natural language sentence structure where stylistic preference can be expressed. Some people may argue that individual style cannot be expressed in source code, particularly when coding standards [Cannon et al., 1997; Geotechnical Software Services, 2008; Sun Microsystems, 1997] are followed. However, there are numerous common source code components (such as operators and keywords), which effectively act as function words that can be used to form a coding style.
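The analogy can be illustrated with a rough sketch that counts keywords and operators in two functionally equivalent C fragments. The token lists, the `style_profile` helper, and both fragments are assumptions made for this example. The tokeniser is deliberately naive: it ignores comments and string literals, and covers only a subset of C.

```python
import re
from collections import Counter

# Illustrative subset of C keywords and operators (not the full language).
C_KEYWORDS = {"for", "while", "if", "else", "return", "int", "do", "switch"}
# Longer operators are listed first so that "++" is not matched as two "+".
C_OPERATORS = ["++", "--", "+=", "-=", "==", "!=", "<=", ">=", "&&", "||",
               "+", "-", "*", "/", "<", ">", "=", "!"]

TOKEN_RE = re.compile(
    r"[A-Za-z_]\w*|" + "|".join(re.escape(op) for op in C_OPERATORS)
)

def style_profile(source):
    """Count the keywords and operators that appear in a C fragment."""
    tokens = TOKEN_RE.findall(source)
    return Counter(t for t in tokens if t in C_KEYWORDS or t in C_OPERATORS)

# Two hypothetical fragments that both sum the first n elements of an array.
author_a = "int i; for (i = 0; i < n; i++) { sum += a[i]; }"
author_b = "int i = 0; while (i < n) { sum = sum + a[i]; i = i + 1; }"

profile_a = style_profile(author_a)
profile_b = style_profile(author_b)
print(sorted(profile_a.items()))
print(sorted(profile_b.items()))
```

Both fragments compute the same result, yet one profile is dominated by `for`, `++` and `+=` while the other relies on `while`, `=` and `+`.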

Soloway [1986] reported that research involving novice programmers “suggests that language constructs do not pose major stumbling blocks for novices learning to program. Rather, the real problems novices have lie in ‘putting the pieces together’, composing and coordinating components of a program”. Therefore the way that even novice programmers use language constructs and put them together demonstrates individual style.

The presence of coding standards also generates stylistic differences in itself, as there is a “lack of consensus” between publications on programming style and standards [Oman and Cook, 1990], and “no guidelines on how to resolve conflicts between rules” [Oman and Cook, 1988].

2.3.3 Similarities and Differences between Writing and Coding Style

Oman and Cook [1988] argued why writing style and programming style are similar. They explained that “effective writing is more than observing established conventions for spelling, grammar and sentence structure”, similar to how effective programming is more than just following style guides. In writing there is “perception and judgement a writer exercises in selecting from equally correct expressions, the one best suited to his material, audience, and intention” [Oman and Cook, 1988]. Similarly, a programmer must choose the appropriate operators, keywords and library functions from which many equally correct options could be chosen. Moreover, Oman and Cook [1988] explained that some books on programming style have been derived from books on natural language style.

A key difference between writing and coding style is the lower amount of flexibility that can be demonstrated when coding, as “computers are far less forgiving than humans of imprecision and difference in usage” [Michaelson, 1996], and “in computers the compiler and run-time system are the ultimate arbiters of program acceptability” [Michaelson, 1996].

Another key difference is the disparity between the vocabulary size in natural language and the number of constructs in code. For example, there are around one million English words today [Ling, 2001], which is far more than the number of features in the C programming language, with thirty-two keywords, thirty-nine operators, and fifteen modest header files, containing the standard library functions and constants [Kelly and Pohl, 1997]. The disparity may be further increased with the introduction of words with spelling mistakes in natural language, since there is less scope for mistakes in source code that must follow strict syntax rules for compilation.

2.3.4 Evolving Style

Since authorial writing style evolves over time, the earliest work samples become the least reliable indicators of current writing style. For example, Can and Patton [2004] studied the changes in writing style of two Turkish authors spanning twenty-seven and fifty-six years respectively. With work samples organised into “old” and “new” categories, they found a statistically significant difference in the average word length between these categories.
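The measure itself is simple to compute. The sketch below is only an illustration of average word length: the `average_word_length` helper and both sample strings are hypothetical, the samples are not data from Can and Patton [2004], and no significance test is attempted.

```python
import re

def average_word_length(text):
    """Mean number of letters per word; 0.0 for text that has no words."""
    # [^\W\d_]+ matches runs of Unicode letters, so accented and
    # non-English letters (e.g. Turkish) are counted as well.
    words = re.findall(r"[^\W\d_]+", text)
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

# Hypothetical placeholder samples standing in for "old" and "new" works.
old_sample = "the cat sat on the mat"
new_sample = "considerable deliberation followed"

print(average_word_length(old_sample))
print(average_word_length(new_sample))
```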


In another study, Pennebaker and Stone [2003] found that as individuals age, they “use more positive and fewer negative affect words, use fewer self-references, use more future-tense and fewer past-tense verbs, and demonstrate a general pattern of increasing cognitive complexity”.

To the best of our knowledge, no previous research has empirically evaluated how the programming style of individual programmers evolves. Instead, Kemerer and Slaughter [1999] researched the evolution of twenty-three software projects, spanning a twenty-year period and 25,000 change events. However, this is not helpful for studying the evolution of programming style in individuals. Kemerer and Slaughter [1999] noted that “it is not surprising that empirical research on software evolution is scarce. The researcher has to collect data at a minimum of two different points in time. This creates practical difficulties in terms of sustaining support for the project over this period and/or finding an organisation that collects and retains either relevant software measurement data or the software artefacts themselves”. Therefore, we expect that our work in Chapter 6 is the first to empirically evaluate the evolution of programming style in individuals, with our experiments that use a collection of student programming assignments spanning six distinct points in time.

2.3.5 Dishonest Style

Kacmarcik and Gamon [2006] demonstrated that substituting just 14 words per 1,000 is sufficient to reduce correct authorship attributions by 83%. The key is to “identify the features that a typical authorship attribution technique will use as markers and then adjust the frequencies of these terms to render them less effective on the target document”. Kacmarcik and Gamon [2006] have also commented that “idiosyncratic formatting, language usage and spelling” are tell-tale signs of authorship, and that simple use of spelling and grammar checkers, for example, will return documents to “conventional norms” making authorship attribution more difficult.

All of the above strategies can be used to mask authorship when dealing with law enforcement. But there are also more legitimate reasons for anonymisation, such as organisational whistle-blowers who feel the need to report bad behaviour and wish to avoid drawing attention to themselves [Kacmarcik and Gamon, 2006].

Palkovskii [2009] wrote about counter plagiarism detection software that has been used to obfuscate assignments, so that plagiarism detection software is rendered useless. These algorithms involve substituting characters from one natural language to another that appear identical to the human eye, but are from different character sets and are hence treated differently in plagiarism detection software. For example, the Greek letters ‘α’ (alpha) and ‘ν’ (nu) closely resemble English letters ‘a’ and ‘v’. However, some differences cannot be discerned at all by the human eye. For example, Palkovskii [2009] suggested the replacement of the English letter ‘o’ with a similar circular Russian character. Likewise, other substitutions involve replacing all spaces with a non-space character in a white colour, so that it visually appears as a regular space. Methods to detect the use of this kind of software include transforming the content into ordinary ASCII text, so that obfuscations that are effective in word processing software would be undone. Authorship attribution software needs to be robust against the substitutions described above. One such robust method is the use of n-grams, which we apply in Chapter 5 and introduce in the next section.
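The ASCII transformation mentioned above can be sketched roughly as follows. This is an illustration under stated assumptions, not the method of Palkovskii [2009]: the `HOMOGLYPHS` table, the `to_ascii` and `mixed_script_words` helpers, and the sample string are all hypothetical, and the table covers only the three look-alikes named in the text. White-coloured filler characters depend on formatting and cannot be detected from plain text in this way.

```python
import unicodedata

# Hypothetical, tiny look-alike table; a real detector would need far more.
HOMOGLYPHS = {
    "\u03b1": "a",  # Greek small letter alpha
    "\u03bd": "v",  # Greek small letter nu
    "\u043e": "o",  # Cyrillic small letter o
}

def to_ascii(text):
    """Replace known look-alikes, then drop anything still outside ASCII."""
    mapped = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    return mapped.encode("ascii", "ignore").decode("ascii")

def mixed_script_words(text):
    """Return words whose letters come from more than one script."""
    flagged = []
    for word in text.split():
        # The first word of a Unicode character name is the script for the
        # letters used here, e.g. "LATIN", "GREEK" or "CYRILLIC".
        scripts = {
            unicodedata.name(ch, "UNKNOWN").split()[0]
            for ch in word
            if ch.isalpha()
        }
        if len(scripts) > 1:
            flagged.append(word)
    return flagged

# Hypothetical obfuscated text: each "o" is the Cyrillic look-alike.
obfuscated = "c\u043epied w\u043erk"

print(to_ascii(obfuscated))
print(len(mixed_script_words(obfuscated)))
```

Dropping unmapped non-ASCII characters is lossy and would damage legitimately non-English text, so the mixed-script check is the gentler of the two steps: it flags suspicious words without altering them.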
