Dispute, Litigation, and Theft - Source code authorship attribution

2.2 Rationale

2.2.3 Dispute, Litigation, and Theft

Authorship attribution has also been used widely outside of academia to resolve disputes, litigation, and theft. Perhaps the most well-known authorship attribution dispute is that of the Federalist Papers [Mosteller and Wallace, 1963]. This case involves seventy-seven newspaper essays “published anonymously in 1787-1788 by Alexander Hamilton, John Jay and James Madison to persuade the citizens of the State of New York to ratify the Constitution” [Mosteller and Wallace, 1963]. Authorship of these papers is generally agreed, except for twelve, which could have been written by either Alexander Hamilton or James Madison. The problem has become popular among linguists because the correct answer is believed to be exactly one of these two authors for each sample, making the problem well contained. Bosch and Smith [1998] reported that “through the use of statistical inference, Mosteller and Wallace came to the conclusion that the odds are overwhelmingly in favour of Madison having been the author of all twelve of the disputed papers”, but this has not prevented others from also attempting this classical problem [Zhao and Zobel, 2007a].

Figure 2.4: The RentACoder web site is a software development marketplace where users can submit software project proposals for competitive bidding between prospective developers. The web site has sometimes been used for academic dishonesty [D’Souza et al., 2007]. Permission to use this screenshot was provided by Ian Ippolito from Exhedra Solutions on 7 May 2010.


Other well-known disputes have concerned the works of Shakespeare. The success of his plays and poems has motivated others to claim authorship of some of his work, whether rightly or wrongly. According to Elliott and Valenza [1991], there are fifty-eight claimed “true authors” of Shakespearean work, of which thirty-seven are testable [Elliott and Valenza, 1996]. Other studies have instead explored the verification problem, to attempt to attribute newly discovered works to Shakespeare [Kolata, 1986].

Many more recent cases demonstrate that courts of law may be required to resolve plagiarism claims, copyright infringement, and authorship disputes between parties [Krsul and Spafford, 1997]. These cases often involve a manual inspection process in which experts draw conclusions about any combination of writing skill, coding skill, or motive.

In one case, Wong [2004] described an incident where a publisher suspected plagiarised work by a textbook author. The iThenticate service confirmed the plagiarism, but the publisher chose to revise later editions of the work to protect the author. iThenticate [iParadigms, 2007a] is a version of Turnitin, previously described in Section 2.1.4, that is targeted towards publishers, lawyers, and corporations.

Similarly, our research also has application to unauthorised code reuse in the corporate sector. For example, one role of members of the Software Freedom Law Centre [2010] is to investigate possible violations of software licences, such as the GNU General Public License. It is difficult to discover well-hidden violations using manual code inspections, and there is a need for tools to determine whether one project is a derivative work of another.

MacDonell et al. [2004] reported on a suspected theft case, where they were approached to determine whether a former employee stole and incorporated code in a product from a rival company. The two systems were examined for similarity, and it was found that the degree of similarity was no more than coincidental. The investigation also considered that both products originated from public domain efforts. With this knowledge, the company decided to withdraw from litigation.

There is also a need for whole organisations to protect themselves against plagiarism and copyright infringement. For example, in March 2003 the SCO Group, the inheritor of Unix operating system intellectual property, sued IBM for more than one billion dollars for allegedly incorporating Unix code in its Unix-like AIX operating system [Shankland, 2003].

In another case, Edward Waters College in Jacksonville, Florida, had its accreditation revoked in 2004, after plagiarised content was found in documentation sent to its accreditation agency [Bollag, 2004]. Accreditation was regained six months later after legal proceedings [Lederman, 2005]. This incident resulted in reduced enrolments and threatened current students with loss of funding.

[...] attribution. Again, we mention that plagiarism detection software is of no help if the offending samples are not in the same collection. Taking source code for example, authorship disputes can arise since “programmers tend to feel a sense of ownership of their programs” [Glass, 1985], which can lead to code reproduction in successive organisations. Therefore it is imperative for organisations to monitor their coding styles, so that code that may have been obtained inappropriately is identified early and later problems are avoided. Likewise, identifying the author of code is relevant in proving the violation of no-competition contract clauses, whereby programmers are forbidden to work for rival companies for a fixed period after the end of an employment arrangement [Lange and Mancoridis, 2007].

Stamatatos [2008] described two other important uses of authorship attribution with legal implications. First, it is very helpful in the intelligence community to identify and relate authors of terrorism messages. Moreover, it is helpful in criminal law to identify authorship of harassing messages and suicide notes.

Finally, we warn that great care needs to be taken when using authorship attribution techniques in legal proceedings, as there have been some failures that have undermined confidence. For example, the cusum (cumulative sum chart) technique uses writing statistics such as verb frequencies, sentence lengths, and word classes as evidence of changes in writing style over intervals in writing [Holmes and Tweedie, 1995]. This technique has historically been used to gather writing statistics to prove or cast doubt on the authorship of works presented to court, and it has been problematic when explaining evidence to a judge and jury. In another case, work described as “banal” was initially incorrectly attributed to Shakespeare, which infuriated scholars [Grieve, 2005].
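
As a rough illustration of the cusum idea mentioned above, the following Python sketch accumulates the deviation of each sentence's length from the mean sentence length of a text. The function names, the naive sentence splitter, and the sample text are assumptions made for illustration only; they are not taken from Holmes and Tweedie [1995], and the sketch covers just one of the writing statistics listed above (sentence length).

```python
import re
from itertools import accumulate


def sentence_lengths(text):
    """Return the word count of each sentence (hypothetical helper).

    Sentences are split naively on '.', '!' and '?'; this is an
    assumption for illustration, not a robust tokeniser.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def cusum(values):
    """Running sum of each value's deviation from the overall mean."""
    mean = sum(values) / len(values)
    return list(accumulate(v - mean for v in values))


# Made-up sample text with alternating short and long sentences.
sample = (
    "Short one. This sentence is rather longer than the first. "
    "Tiny. Another fairly long sentence closes the sample text."
)

lengths = sentence_lengths(sample)  # one word count per sentence
chart = cusum(lengths)              # points of the cumulative sum chart

for length, point in zip(lengths, chart):
    print(f"{length:3d} words  cusum = {point:+.2f}")
```

By construction the final point of such a chart returns to zero, so it is the shape of the path in between that is inspected. A stretch that drifts steadily away from zero marks an interval where sentences are consistently longer or shorter than the average, and how much weight such a drift deserves as evidence of a change in author is the point of contention noted above.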
