2.5 Source-Code Plagiarism Detection Tools

2.5.2 Content comparison techniques

Content comparison techniques are often referred to as structure-metric systems in the literature. These systems convert programs into tokens and then search for matching contiguous substrings within the programs. Similarity between programs depends on the percentage of the text matched. Mozgovoy [100] classified content comparison techniques into string matching-based algorithms, parameterised matching algorithms and parse tree comparison algorithms.

String-matching algorithms

Most recent plagiarism detection systems rely on comparing the structure of programs. These systems use attribute-counting metrics, but they also compare the program structure in order to improve plagiarism detection. String-matching based systems employ comparison algorithms that are more complex than attribute-counting algorithms. Most string-matching algorithms work by converting the programs into tokens and then using a sophisticated searching algorithm to find common substrings within two programs.

Such systems were initially proposed by Donaldson et al. [37], who identified simple techniques that novice programming students use to disguise plagiarism. These techniques are:

• renaming variables,

• reordering statements in ways that do not affect the program result,

• changing format statements, and

• breaking up statements, such as multiple declarations and output statements.

The techniques identified by Donaldson et al. [37] are among the most basic forms of attack. Whale [139], and Joy and Luck [62], provide more detailed lists of attacks, and a much more detailed list is provided by Prechelt et al. [115]. We also provide an exhaustive list of plagiarism attacks in [26].

The program developed by Donaldson et al. [37] scans source-code files and stores information about certain types of statements. Then, each statement type that is significant in describing the structure is assigned a single code character. Each assignment is then represented as a string of these characters. If the string representations of two programs are identical or similar, then the pair of programs is returned as similar.
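The encoding step described above can be illustrated with a minimal sketch. The code table `STATEMENT_CODES`, the function name `structure_string` and the crude line-based statement recognition are all assumptions made for illustration; the text does not give the actual statement types or code characters used in [37].

```python
# Hypothetical statement-type codes; the table used in [37] is not given in the text.
STATEMENT_CODES = {
    "if": "I",
    "while": "W",
    "for": "F",
    "print": "P",
    "return": "R",
}


def structure_string(source: str) -> str:
    """Map each recognised statement to a single code character."""
    codes = []
    for line in source.splitlines():
        # Separate a leading keyword from an opening parenthesis, e.g. "print(x)".
        words = line.strip().replace("(", " ").split()
        if not words:
            continue  # skip blank lines
        first = words[0].rstrip(":")
        if first in STATEMENT_CODES:
            codes.append(STATEMENT_CODES[first])
        elif "=" in words:
            codes.append("A")  # assumed code for an assignment statement
    return "".join(codes)


original = "total = 0\nfor i in range(10):\n    total = total + i\nprint(total)\n"
renamed = "s = 0\nfor k in range(10):\n    s = s + k\n\nprint(s)\n"
print(structure_string(original))  # AFAP
print(structure_string(renamed))   # AFAP
```

Because only statement types are recorded, renaming variables and changing the layout leave the string unchanged, which is why such an encoding resists the simplest disguises listed above.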

Some recent string-matching based systems include Plague [139], YAP3 (Yet Another Plague) [144], JPlag [115], and Sherlock [62].

In most string-matching based systems, including the ones mentioned above, the first stage is called tokenisation. At the tokenisation stage, the contents of each source-code file are replaced by predefined and consistent tokens; for example, different types of loops in the source-code may be replaced by the same token name regardless of their loop type (e.g. while loop, for loop). Each tokenised source-code file is thus represented as a string of tokens. The programs are then compared by searching for matching substrings of tokens. Programs whose number of matching tokens is above a given threshold are returned as similar. Similarity between two files is most commonly computed as the coverage of matched tokens between the detected files.
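As a rough illustration of tokenisation, the following sketch maps every loop keyword to one token and every identifier to another. The names `TOKEN_RE`, `KEYWORD_TOKENS` and `tokenise`, and the token vocabulary itself, are hypothetical; real tools use a proper lexer or parser for the target language.

```python
import re

# Identifiers/keywords, integer literals, or any single non-space character.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")

# Assumed token vocabulary: all loop keywords collapse to the same token.
KEYWORD_TOKENS = {
    "while": "LOOP",
    "for": "LOOP",
    "do": "LOOP",
    "if": "IF",
    "else": "ELSE",
    "return": "RETURN",
}


def tokenise(source: str) -> list[str]:
    """Replace lexemes by consistent token names."""
    tokens = []
    for lexeme in TOKEN_RE.findall(source):
        if lexeme in KEYWORD_TOKENS:
            tokens.append(KEYWORD_TOKENS[lexeme])
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")   # every identifier becomes the same token
        elif lexeme[0].isdigit():
            tokens.append("NUM")  # every integer literal becomes the same token
        else:
            tokens.append(lexeme)  # punctuation and operators are kept as-is
    return tokens


print(tokenise("while (i < 10) { i = i + 1; }"))
# ['LOOP', '(', 'ID', '<', 'NUM', ')', '{', 'ID', '=', 'ID', '+', 'NUM', ';', '}']
```

Two fragments that differ only in identifier names, constants or loop keyword produce identical token strings, so the later matching stage is not misled by such changes.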

Plague [139] first creates a sequence of tokens for each file, and compares the tokenised versions of selected programs using a string matching technique. Results are displayed as a list of matched pairs ranked in order of similarity, measured by the length of the matched portion of the token sequences between the two files. Plague is known to have a few problems [25]. One of the problems with Plague is that it is time-consuming to adapt it to another programming language. Furthermore, the results are displayed in two lists ordered by indices, which makes interpretation of the results not immediately obvious. Finally, Plague is not very efficient, and it relies upon a number of additional Unix tools, which causes portability issues.

YAP3 converts programs into a string of tokens and compares them using a token matching algorithm, the Running Karp-Rabin Greedy String Tiling algorithm (RKR-GST), in order to find similar source-code segments [144]. YAP3 pre-processes the source-code files prior to converting them into tokens. Pre-processing involves removing comments, translating upper case letters to lower case, mapping synonyms to a common form (e.g., function is mapped to procedure), reordering the functions into their calling order, and removing all tokens that are not from the lexicon of the target language (i.e., removing all terms that are not language reserved terms). YAP3 was developed mainly to detect the splitting of functions into multiple functions, and to detect the reordering of independent source-code segments. The RKR-GST algorithm works by comparing two strings (the pattern and the text), searching the text for substrings that match substrings of the pattern. Matches of substrings are called tiles. Each tile is a match which contains a substring from the pattern and a substring from the text. Once a match is found, the status of the tokens within the tile is set to marked. Tiles whose length is below a minimum-match length threshold are ignored. The RKR-GST algorithm aims to find maximal matches of contiguous substrings that contain tokens that have not been covered by other substrings, and therefore to maximise the number of tokens covered by tiles.
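The tiling idea can be sketched as follows. This is a simplified Greedy String Tiling that finds matches by direct comparison; it omits the Running Karp-Rabin hashing that RKR-GST uses to speed up the search, so it should be read as an illustration of the tiling logic rather than the algorithm of [144]. The function names, the default `min_match` value and the similarity formula (twice the covered tokens divided by the combined length) are assumptions.

```python
def greedy_string_tiling(pattern, text, min_match=3):
    """Return tiles as (pattern_start, text_start, length) tuples."""
    marked_p = [False] * len(pattern)
    marked_t = [False] * len(text)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Phase 1: collect the longest matches made only of unmarked tokens.
        for p in range(len(pattern)):
            for t in range(len(text)):
                k = 0
                while (p + k < len(pattern) and t + k < len(text)
                       and pattern[p + k] == text[t + k]
                       and not marked_p[p + k] and not marked_t[t + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(p, t, k)], k
                elif k == max_len:
                    matches.append((p, t, k))
        if not matches:
            break  # nothing of at least min_match tokens is left
        # Phase 2: turn matches into tiles and mark their tokens.
        for p, t, k in matches:
            if any(marked_p[p:p + k]) or any(marked_t[t:t + k]):
                continue  # overlaps a tile placed earlier in this round
            for i in range(k):
                marked_p[p + i] = True
                marked_t[t + i] = True
            tiles.append((p, t, k))
    return tiles


def tile_similarity(pattern, text, min_match=3):
    """Share of tokens covered by tiles (one common normalisation, assumed here)."""
    total = len(pattern) + len(text)
    if total == 0:
        return 0.0
    covered = sum(k for _, _, k in greedy_string_tiling(pattern, text, min_match))
    return 2 * covered / total


print(greedy_string_tiling(list("abcdefgh"), list("efghabcd")))
# [(0, 4, 4), (4, 0, 4)]
```

In the example the two halves of the sequence have been swapped, yet both are still covered by tiles, which shows why tiling is robust to the reordering of independent code segments.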

JPlag [115] uses the same comparison algorithm as YAP3, but with optimised run time efficiency. In JPlag the similarity is calculated as the percentage of token strings covered. One of the problems of JPlag is that files must parse to be included in the comparison for plagiarism, and this causes similar files to be missed. Also, JPlag’s user-defined minimum-match length parameter is set to a default number. Changing this number can alter the detection results (for better or worse), and to alter this number one may need an understanding of the algorithm behind JPlag (i.e., RKR-GST). JPlag is implemented as a web service and contains a simple but efficient user interface. The user interface displays a list of similar file pairs and their degree of similarity, and a display for comparing the detected similar files by highlighting their matching blocks of source-code fragments.

Sherlock [62] also implements a similar algorithm to YAP3. Sherlock converts programs into tokens and searches for sequences of lines (called runs) that are common to two files. Similarly to the YAP3 algorithm, Sherlock searches for runs of maximum length. Sherlock’s user interface displays a list of similar file pairs and their degree of similarity, and indicates the matching blocks of source-code fragments found within detected file pairs. In addition, Sherlock displays a quick visualisation of results in the form of a graph where each vertex represents a single source-code file and each edge shows the degree of similarity between the two files. The graph only displays similarity (i.e., edges) between files above the given user-defined threshold. One of the benefits of Sherlock is that, unlike JPlag, the files do not have to parse to be included in the comparison and there are no user-defined parameters that can influence the system’s performance. Sherlock is an open-source tool and its token matching procedure is easily customisable to languages other than Java [62]. Sherlock is a stand-alone tool and not a web-based service like JPlag and MOSS. A stand-alone tool may be preferable to academics for checking student files for plagiarism when confidentiality issues are taken into consideration.

Plaggie [1] is a similar tool to JPlag, but employs RKR-GST without any speed optimisation algorithms. Plaggie converts files into tokens and uses the RKR-GST algorithm to compare files for similarity. The idea behind Plaggie is to create a tool similar to JPlag but that is stand-alone (i.e. installed on a local machine) rather than a web-based tool, and that has the extra functionality of enabling the academic to exclude code from the comparison, e.g. code given to students in class.

Another plagiarism detection tool is that developed by Mozgovoy et al., called Fast Plagiarism Detection System (FPDS) [102, 103]. FPDS aims to improve the speed of plagiarism detection by using an indexed data structure to store files. Initially, files are converted into tokens and the tokenised files are stored in an indexed data structure. This allows for fast searching of files, using an algorithm similar to that in YAP3. This involves taking substrings from a test file and searching for matching substrings in the collection file. The matches are stored in a repository and thereafter used for computing the similarity between files. Similarity is the ratio of the total number of tokens matched in the collection file to the total number of tokens in the test file. This ratio gives the coverage of matched tokens. File pairs with a similarity value above the given threshold are retrieved. One of the problems with the FPDS tool is that its results cannot be visualised by means of displaying similar segments of code [101]. Therefore, the authors combined FPDS with the Plaggie tool for conducting the file-file comparison and visualising the results.
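The benefit of indexing the collection once and then querying it can be sketched as below. The text does not specify the indexed data structure used by FPDS, so this sketch substitutes a simple inverted index over token k-grams; `build_index`, `coverage_against` and the parameter `k` are hypothetical names, and the coverage computed here only approximates the ratio described above.

```python
from collections import defaultdict


def build_index(collection, k=3):
    """Map each k-gram of tokens to the names of the files that contain it."""
    index = defaultdict(set)
    for name, tokens in collection.items():
        for i in range(len(tokens) - k + 1):
            index[tuple(tokens[i:i + k])].add(name)
    return index


def coverage_against(test_tokens, index, k=3):
    """For each indexed file, the fraction of test-file tokens it covers."""
    if not test_tokens:
        return {}
    covered = defaultdict(set)
    for i in range(len(test_tokens) - k + 1):
        # One dictionary lookup replaces a scan over every collection file.
        for name in index.get(tuple(test_tokens[i:i + k]), ()):
            covered[name].update(range(i, i + k))
    return {name: len(pos) / len(test_tokens) for name, pos in covered.items()}


collection = {"a": list("abcdef"), "b": list("xyzxyz")}
index = build_index(collection)
print(coverage_against(list("abcdxy"), index))
```

Here four of the six test tokens are covered by k-grams shared with file `a`, so its coverage is 4/6, while file `b` shares no k-gram with the test file and is not reported. A threshold on these values would then select the file pairs to retrieve.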

Parameterized matching algorithms

Parameterized matching algorithms are similar to the ordinary token matching techniques (as discussed in Section 2.5.2) but with a more advanced tokenizer [100]. These algorithms begin by converting files into tokens. A parameterized match (also called a p-match) matches source-code segments whose variable names have been substituted (renamed) systematically.

The Dup tool [6] is based on a p-match algorithm, and was developed to detect duplicate code in software. It detects identical and parameterized sections of source-code. Initially, using a lexical analyzer, the tool scans the source-code file, and for each line of code a transformed line is created. The source-code is transformed into parameters, which involves transforming identifiers and constants into the same symbol P, and generating a list of the parameter candidates. For example, the line x = fun(y) + 3 * x is transformed into P = P(P) + P * P and a list containing x, fun, y, 3 and x is generated. Then the lexical analyzer creates an integer that represents each transformed line of code. Given a threshold length, Dup aims to find source-code p-matches between files. Similarity is determined by computing the percentage of p-matches between two files. The Dup tool outputs the percentage of similarity between the two files, a profile showing the p-matched lines of code between two files, and a plot which displays the location of the matched code. According to [46], the Dup tool fails to detect similar files if false statements are inserted, or when a small block of statements is reordered. Other parameterized matching algorithms exist in the literature [3, 7, 56, 119, 44].
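The line transformation and the notion of a systematic renaming can be sketched as follows. This is not Dup's implementation: `parameterise` and `is_p_match` are hypothetical names, the regular expression treats every identifier and integer constant as a parameter, and "systematic" is interpreted here as a consistent one-to-one mapping between the two parameter lists.

```python
import re

# Identifiers and integer constants are treated as parameter candidates.
PARAM_RE = re.compile(r"[A-Za-z_]\w*|\d+")


def parameterise(line: str):
    """Return the transformed line and the list of parameter candidates."""
    params = PARAM_RE.findall(line)
    skeleton = re.sub(r"\s+", "", PARAM_RE.sub("P", line))
    return skeleton, params


def is_p_match(line_a: str, line_b: str) -> bool:
    """True if the lines are equal up to a consistent one-to-one renaming."""
    skel_a, params_a = parameterise(line_a)
    skel_b, params_b = parameterise(line_b)
    if skel_a != skel_b:
        return False
    forward, backward = {}, {}
    for a, b in zip(params_a, params_b):
        # Each parameter must always map to the same partner, in both directions.
        if forward.setdefault(a, b) != b or backward.setdefault(b, a) != a:
            return False
    return True


print(parameterise("x = fun(y) + 3 * x"))
# ('P=P(P)+P*P', ['x', 'fun', 'y', '3', 'x'])
print(is_p_match("x = fun(y) + 3 * x", "a = g(b) + 3 * a"))  # True
print(is_p_match("x = fun(y) + 3 * x", "a = g(b) + 3 * b"))  # False
```

The last call fails because x would have to map to both a and b. The transformed lines alone are identical in all three cases, so it is the parameter list that distinguishes a systematic renaming from an unrelated line of the same shape.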

Parse tree comparison algorithms

Parse tree comparison algorithms implement comparison techniques that compare the structure of files [100]. The Sim utility [46] represents each file as a parse tree, and then compares file pairs for similarity using string representations of the parse trees. The comparison process begins by converting each source-code file into a string of tokens drawn from a fixed set of tokens. After tokenisation, the token streams of the files are divided into sections and their sections are aligned. This technique allows shuffled code fragments to be detected.

Sim employs ordinary string matching algorithms to compare file pairs represented as parse trees. As with most string matching algorithms, Sim searches for the maximal common subsequence of tokens. Sim outputs the degree of similarity (in the range 0.0 to 1.0) between file pairs, and was developed to detect similarity in C programs. According to the authors of Sim, it can be easily modified to work with programs written in languages other than C. It can detect similar files that contain common modifications such as name changes, reordering of statements and functions, and adding or removing comments and white space [46].

A more recent application of parse tree comparison algorithms is that implemented in the Brass system [10]. Brass represents each program as a structure chart, which is a graphical design consisting of a tree representation of each file. The root of the tree specifies the file header, and the child nodes specify the statements and control structures found in the file. The tree is then converted into a binary tree and a symbol table (data dictionary) which holds information about the variables and data structures used in the file. Comparison in Brass is a three-part process. The first two algorithms involve comparing the structure trees of files and the third involves comparing the data dictionary representations of the files. Initially, similar file pairs are detected using the first comparison algorithm. Thereafter, the second and third comparison algorithms are applied to the detected file pairs.

Tree comparison involves using algorithms more complex than string matching algorithms. From this it can be assumed that systems based on tree comparison algorithms are slower than systems based on string matching algorithms [100]. The filtering technique used by Brass, in which only the file pairs detected by the first comparison algorithm are passed to the later algorithms, speeds up the comparison process.

According to Mozgovoy [100], not much research has been carried out in the area of parse tree algorithms, and it is not known whether these algorithms perform better than string matching algorithms with regard to detecting similar source-code file pairs.

In document An approach to source code plagiarism detection investigation using latent semantic analysis (Page 50-59)