Key observations were made when inspecting the volumes of each individual token. For example, the ratio of new lines to carriage returns was found to be 16:1. This suggests that at least one out of sixteen samples was not developed in our predominantly Unix-based environment, and this gives a useful authorship marker concerning choice of operating system for programming assignment development. The white space and literal tokens were all found in large quantities, as the nine tokens that make up these categories were all within the top fifteen when totalling the volume of each token. Of the remaining token classes, the parenthesis was the most prevalent operator token, int was the most prevalent keyword, NULL was the most prevalent input/output token, and strlen was the most prevalent function token.
In summary, in this section we have explored sixty-three feature sets, and we found that the white space, operator, and keyword classes were most effective. In particular, this combination of classes (named “Feature Set 50”) was also the most effective combination of all those evaluated. The MRR score increased from 75.59% to 82.28% (+6.69%) and the MAP score increased from 26.70% to 41.33% (+14.63%) compared to the previous scores from Section 5.2. Therefore, this feature set is carried forward for all experiments that follow.
5.4 Classification
The experiments in Sections 5.2 and 5.3 have used MRR and MAP to evaluate the quality of the ranked lists. These scores have allowed us to make decisions for key parameters in our model, such as choice of n-gram size, choice of information retrieval similarity measure, and choice of feature set. Having made these decisions, the next step is to make actual authorship classification decisions to measure the accuracy level of our model. In this section, we evaluate three methods for deciding authorship, followed by a comparison of our accuracy scores to the reimplemented baselines presented in Chapter 4.
5.4.1 Overall Results
In this section, we evaluate three metrics for determining classification accuracy. The MRR and MAP measurements, which were used in Sections 5.2 and 5.3 only to compare the quality of ranked lists, are therefore no longer used.
First, the single best result metric attributes authorship to the author of the top ranked document. The proportion of times that this is correct is used to calculate overall accuracy. This metric is the only one that uses a single document from the ranked list for the authorship decision.
Sample  Sample  Similarity  |  Classified Author  Single Best Result
Rank    Author  Score       |  A                  1.0
1       A1      0.8         |  B                  0.0
2       B1      0.5         |  C                  0.0
3       B2      0.5         |  Classified Author  Average Scaled Score
4       C1      0.5         |  A                  (0.8 + 0.2 + 0.2 + 0.2 + 0.2) / 5 = 0.32
5       B3      0.5         |  B                  (0.5 + 0.5 + 0.5 + 0.2) / 4 = 0.43
6       B4      0.2         |  C                  (0.5) / 1 = 0.5
7       A2      0.2         |  Classified Author  Average Precision
8       A3      0.2         |  A                  (1/1 + 2/7 + 3/8 + 4/9 + 5/10) / 5 = 0.52
9       A4      0.2         |  B                  (1/2 + 2/3 + 3/5 + 4/6) / 4 = 0.61
10      A5      0.2         |  C                  (1/4) / 1 = 0.25
Table 5.4: A ranked list of ten samples (left) and how the single best result, average scaled score and average precision measurements vary depending on the order of samples in the ranked list and the similarity scores (right).
Next, the average scaled score metric uses the Okapi BM25 similarity scores returned by the search engine. The scores are first normalised against the score from the top-ranked document, which is otherwise not considered in the ranked list. Then the normalised scores for each author are averaged, and authorship is assigned to the author with the highest average score. Overall accuracy is again the proportion of times that this is correct. This is the only metric that uses the absolute similarity measurements of the search engine.
Finally, with the average precision metric, we calculate the average precision for the documents of each candidate author in turn, and assign authorship to the author with the highest average precision score. Again, the proportion of times this is correct is used for calculating overall accuracy. This metric is the only one that uses the relative similarity measurements of the search engine.
Table 5.4 provides example calculations of the three metrics on a dummy ranked list comprising samples from three authors named Author A, Author B, and Author C. For each metric, the author with the highest score is classified as the author of the query sample. Therefore, according to the scores shown, the query sample (not shown) is classified as Author A for the single best result metric, Author C for the average scaled score metric, and Author B for the average precision metric. Accuracy is then the proportion of times that the classified author matches the actual author of the query sample.
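The three calculations in Table 5.4 can be reproduced with a short sketch. This is an illustration rather than the implementation used in our experiments: the function names are hypothetical, and the similarity scores are assumed to be already normalised as described above.

```python
from collections import defaultdict

# Ranked list from Table 5.4: (author, normalised similarity score), best first.
RANKED = [("A", 0.8), ("B", 0.5), ("B", 0.5), ("C", 0.5), ("B", 0.5),
          ("B", 0.2), ("A", 0.2), ("A", 0.2), ("A", 0.2), ("A", 0.2)]


def single_best_result(ranked):
    """Score 1.0 for the author of the top-ranked sample and 0.0 for the rest."""
    top_author = ranked[0][0]
    return {author: float(author == top_author) for author, _ in ranked}


def average_scaled_score(ranked):
    """Mean normalised similarity score over each author's samples."""
    scores = defaultdict(list)
    for author, score in ranked:
        scores[author].append(score)
    return {author: sum(vals) / len(vals) for author, vals in scores.items()}


def average_precision(ranked):
    """Mean precision, taken at each rank that holds one of the author's samples."""
    precisions = defaultdict(list)
    for rank, (author, _) in enumerate(ranked, start=1):
        hits = len(precisions[author]) + 1  # author's samples seen so far
        precisions[author].append(hits / rank)
    return {author: sum(vals) / len(vals) for author, vals in precisions.items()}


def classify(scores):
    """Attribute authorship to the author with the highest score."""
    return max(scores, key=scores.get)


for name, metric in (("single best result", single_best_result),
                     ("average scaled score", average_scaled_score),
                     ("average precision", average_precision)):
    print(name, "->", classify(metric(RANKED)))
```

On the Table 5.4 list, the three metrics select Author A, Author C, and Author B respectively.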
We present our classification experiment results in Table 5.5. When using the “single best result” classification method, we correctly classified work in 76.78% of cases for the ten-class problem. This is the best of our methods compared to “average scaled score” (76.47%, p = 0.52) and “average precision” (74.68%, p = 1.23 × 10^-5), but the difference was not statistically significant in the first case as indicated.
Num   Single Best Result        Average Scaled Score      Average Precision
Auth  Correct          Percent  Correct          Percent  Correct          Percent
7      8,773 / 11,153  78.66%    8,795 / 11,153  78.86%    8,561 / 11,153  76.76%
8      9,954 / 12,686  78.46%    9,977 / 12,686  78.65%    9,739 / 12,686  76.77%
10    12,261 / 15,969  76.78%   12,212 / 15,969  76.47%   11,925 / 15,969  74.68%
12    14,381 / 19,110  75.25%   14,127 / 19,110  73.92%   13,909 / 19,110  72.78%
20    23,087 / 31,859  72.47%   21,885 / 31,859  68.69%   21,725 / 31,859  68.19%
29    32,782 / 46,387  70.67%   30,210 / 46,387  65.13%   30,469 / 46,387  65.68%
30    33,677 / 47,785  70.48%   30,864 / 47,785  64.59%   31,224 / 47,785  65.34%
46    50,309 / 73,362  68.58%   44,723 / 73,362  60.96%   46,200 / 73,362  62.98%
Table 5.5: Overall comparison of the three classification methods. Varying problem sizes are shown including the ten-class problem that is largely used in our work, and other problem sizes from seven-class to forty-six-class that have been used by other researchers (see Table 3.3, p. 73).
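The text does not name the significance test behind the p-values quoted for the ten-class problem. As an illustration only, the sketch below assumes a pooled two-proportion z-test on the ten-class counts from Table 5.5. This treats the two sets of results as independent samples, which is a simplification because both methods classify the same queries. The function name is hypothetical.

```python
import math


def two_proportion_p_value(correct_a, correct_b, total):
    """Two-sided p-value of a pooled two-proportion z-test (equal sample sizes)."""
    p_a, p_b = correct_a / total, correct_b / total
    pooled = (correct_a + correct_b) / (2 * total)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / total)
    z = abs(p_a - p_b) / std_err
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


TOTAL = 15969  # ten-class queries in Table 5.5

# Single best result (12,261 correct) against the other two methods.
p_scaled = two_proportion_p_value(12261, 12212, TOTAL)
p_precision = two_proportion_p_value(12261, 11925, TOTAL)
print(f"average scaled score: p = {p_scaled:.2f}")
print(f"average precision:    p = {p_precision:.2e}")
```

Under this assumed test, the first comparison gives a p-value of roughly 0.52 and the second a p-value on the order of 10^-5, in line with the reported values. A paired test over the individual queries might be more appropriate, but would need the per-query decisions rather than the totals.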
In Table 5.5, we also present results for seven other problem sizes ranging from seven-class to forty-six-class, which have been used in previous source code authorship attribution approaches, as reviewed in Section 3.4 (p. 70). The motivation for repeating our experiment on the other problem sizes is to consider whether the length of the ranked lists affects the three metrics evaluated. We found that the “single best result” method was again most effective for the harder authorship attribution problems (twelve-class up to forty-six-class). However, the “average scaled score” results were marginally higher for the seven-class and eight-class problems, by 0.20 and 0.19 percentage points respectively. Most interestingly, the margin in favour of the single best result metric widens as the problem size grows, possibly due to the other metrics being forced to process many potentially unhelpful results.
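The comparison across problem sizes can be checked directly against the counts in Table 5.5. The sketch below is a convenience for the reader, with hypothetical function names. It recomputes each accuracy, identifies the best method per problem size, and reports the margin of the single best result metric over the better of the other two.

```python
# Correct counts per method and total queries, copied from Table 5.5.
METHODS = ("single best result", "average scaled score", "average precision")
TABLE_5_5 = {
    7: ((8773, 8795, 8561), 11153),
    8: ((9954, 9977, 9739), 12686),
    10: ((12261, 12212, 11925), 15969),
    12: ((14381, 14127, 13909), 19110),
    20: ((23087, 21885, 21725), 31859),
    29: ((32782, 30210, 30469), 46387),
    30: ((33677, 30864, 31224), 47785),
    46: ((50309, 44723, 46200), 73362),
}


def accuracies(num_authors):
    """Percentage accuracy of each method for one problem size."""
    correct, total = TABLE_5_5[num_authors]
    return {method: 100 * c / total for method, c in zip(METHODS, correct)}


def best_method(num_authors):
    """Name of the method with the highest accuracy for one problem size."""
    acc = accuracies(num_authors)
    return max(acc, key=acc.get)


def single_best_margin(num_authors):
    """Single best result accuracy minus the better of the other two methods."""
    acc = accuracies(num_authors)
    return acc[METHODS[0]] - max(acc[METHODS[1]], acc[METHODS[2]])


for n in TABLE_5_5:
    print(f"{n:>2}-class: best = {best_method(n)}, "
          f"margin = {single_best_margin(n):+.2f}")
```

Margins computed from the raw counts can differ in the second decimal place from differences between the rounded percentages printed in the table.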
In summary, the “single best result” metric is most effective for most of the problem sizes considered, and we carry this forward for all experiments that follow. The other metrics are not used in the remainder of this thesis. We also note that additional metrics could have been considered, but these have been left for future work. In particular, a voting model could be implemented where the results of multiple metrics are combined, and authorship is attributed to the author with the highest score for the greatest number of metrics.
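To make the voting idea concrete, one possible form is sketched below. It is hypothetical and was not implemented or evaluated in this thesis. Each metric votes for its highest-scoring author, and the tie-breaking rule (prefer the choice of the earliest-listed metric) is an assumption added for illustration.

```python
from collections import Counter


def vote(metric_scores):
    """Return the author chosen by the greatest number of metrics.

    metric_scores: list of {author: score} dicts, one per metric, listed in
    priority order. Ties are broken in favour of the earliest-listed metric
    whose choice is among the tied authors (an assumed rule).
    """
    choices = [max(scores, key=scores.get) for scores in metric_scores]
    counts = Counter(choices)
    top_votes = max(counts.values())
    leaders = {author for author, votes in counts.items() if votes == top_votes}
    return next(choice for choice in choices if choice in leaders)


# Scores per author as in the Table 5.4 example (values rounded).
single_best = {"A": 1.0, "B": 0.0, "C": 0.0}
scaled = {"A": 0.32, "B": 0.43, "C": 0.50}
precision = {"A": 0.52, "B": 0.61, "C": 0.25}

print(vote([single_best, scaled, precision]))
```

On the Table 5.4 example the three metrics each choose a different author, so the vote is tied and the assumed rule falls back to the single best result choice, Author A. This shows that any voting model would need an explicit tie-breaking policy.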
5.4.2 Comparing Accuracy to the Reimplemented Work
Figure 5.11 shows our ten-class result from Table 5.5 (using the single best result metric) against the reimplemented baseline results from Section 4.5 for Coll-A. Our 76.78% classification accuracy score is 10.38 percentage points above the next best accuracy score of 66.40% by Frantzeskou et al. [2006a]. We only show Coll-A results for now as this collection is used for developing our approach. Further improvements follow in Chapter 6, resulting from our investigation of factors that affect the accuracy of our approach. The full comparison is given later in Section 7.1 (p. 173).
An interesting result from Figure 5.11 is the general drop in results from the other researchers compared to those reported in their original publications, as summarised in Table 4.1. We have reimplemented the previous work as closely as possible, as explained in Section 4.4; however, the problem sizes are of course different. Figure 5.11 uses the ten-class problem consistently, whereas Table 4.1 uses the problem sizes in the publications, from seven-class to forty-six-class. However, a surprising trend is that all of the larger problem size results in Table 4.1 are higher than the ten-class results in Figure 5.11, with the exception of Lange and Mancoridis [2007]. The discrepancy could be explained by the previous problems being easier based on the collections chosen, or by possible statistical anomalies from the use of modest collections. There is scope to further explore this area by repeating the Figure 5.11 experiment for other problem sizes, but this remains as future work, as the aim of this thesis is to provide results consistently for the ten-class problem, given the time requirements for each experiment. However, we briefly explore multiple problem sizes in Section 6.1.1.