Up to this point, white space has been represented by a single feature which, as concluded in the previous section, does not capture indentation information well. We now consider forty white space tokens representing contiguous sequences of one to forty spaces to better represent deep levels of indentation. Similar to Table 6.9, in Table 6.10 we show how the entropy of these new features varies between tasks and authors, for all white space features with at least ten instances per sample on average.
The “SPACE01” token is still most prevalent but has now been reduced to 16.38% of its prior volume. Space tokens in multiples of three come next (“SPACE03”, “SPACE06”, “SPACE09”, “SPACE12”), indicating the strong preference for code blocks to be indented in multiples of three spaces. The “SPACE15” and “SPACE18” tokens are not the next most prevalent, but have the next highest between-author entropy scores. Of most interest is the drop in between-author entropy scores for the remaining five white space features (“SPACE02”, “SPACE04”, “SPACE08”, “SPACE07” and “SPACE05”), indicating a wider spread of scores and good choices for authorship markers. The “SPACE02” token had the lowest entropy of all, but we found that the same outlier sample shown in Figure 6.12 partly contributed to this score.
When analysing the spread of scores for the “SPACE02” token similar to Figure 6.14, we again noticed the extreme contribution of the same outlier discussed in Figure 6.12, but in this case the trend still held after omitting this program, unlike the square-bracket token case demonstrated in Table 6.9.
6.7. IMPROVING ACCURACY WITH HIGHLY DISCRIMINATING FEATURES
[Figure 6.14 (image not reproduced): two histograms titled "Parenthesis Usage" and "Carriage Return Usage". Each plots the percentage of total tokens used (axis range 0 to 12) against the number of instances (axis range 0 to 220).]
Figure 6.14: Comparison of the use of parenthesis and carriage return features over all authors in
Token        Count        En–Task   En–Auth
"SPACE01"    1,675,961    2.51      7.98
"SPACE03"      366,565    2.50      7.89
"SPACE06"      229,320    2.48      7.88
"SPACE09"      131,577    2.46      7.76
"SPACE12"       88,908    2.47      7.71
"SPACE02"       68,845    2.42      6.03
"SPACE04"       64,918    2.53      7.05
"SPACE15"       47,415    2.46      7.48
"SPACE18"       25,261    2.44      7.28
"SPACE08"       24,078    2.53      6.65
"SPACE07"       22,958    2.49      6.91
"SPACE05"       17,344    2.49      6.74
Maximum                   2.58      8.09
Table 6.10: Entropy of feature distribution for the six tasks (En–Task) and the 272 authors (En–Auth) for white space features with at least ten instances per work sample on average. Smaller entropy values indicate larger variation (dispersion) within task and author groups. Bold cases are discussed in the text.
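The maxima in Table 6.10 are consistent with Shannon entropy measured in bits, since log2(6) is about 2.58 and log2(272) is about 8.09. The following is a minimal sketch under that assumption; the exact entropy definition is given elsewhere in the thesis, and the function name and example counts here are hypothetical.

```python
from math import log2

def distribution_entropy(counts):
    """Shannon entropy (in bits) of how a token's instances are shared among groups.

    Assumption: each entry of `counts` is the number of instances of one token
    contributed by one group (one task, or one author).
    """
    total = sum(counts)
    if total == 0:
        return 0.0
    shares = [c / total for c in counts if c > 0]
    return -sum(p * log2(p) for p in shares)

# Hypothetical counts: a token spread evenly over the groups reaches the maximum.
uniform_tasks = distribution_entropy([100] * 6)      # equals log2(6)
uniform_authors = distribution_entropy([100] * 272)  # equals log2(272)

# A token concentrated in one group scores lower, i.e. it is more dispersed.
skewed_tasks = distribution_entropy([500, 20, 20, 20, 20, 20])
```

Under this reading, a token such as "SPACE02" with a between-author entropy well below the maximum is used unevenly across authors, which is what makes it a candidate authorship marker.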
Next we note that we have 2,918 instances of the “SPACE40” token indicating that there may still be some indentation more than 40 spaces deep, which were treated as a “SPACE40” token plus another quantity of spaces. However, this amount represents only one to two instances per sample on average, so we do not try to introduce more features.
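The mapping from runs of spaces to tokens, including the overflow behaviour for runs longer than 40, can be sketched as follows. This is an illustration only: the function names are hypothetical, only the space character is handled (the treatment of tabs is not described in this section), and the thesis tokeniser may differ in detail.

```python
import re

MAX_RUN = 40  # longest run that has its own token, per the text

def space_tokens(run_length, max_run=MAX_RUN):
    """Split one run of spaces into SPACEnn tokens.

    Runs longer than max_run become one or more SPACE40 tokens
    plus a token for the remaining spaces.
    """
    tokens = []
    while run_length > max_run:
        tokens.append(f"SPACE{max_run:02d}")
        run_length -= max_run
    if run_length > 0:
        tokens.append(f"SPACE{run_length:02d}")
    return tokens

def tokenize_spaces(line):
    """Return the white space tokens for every run of spaces in a line."""
    out = []
    for match in re.finditer(r" +", line):
        out.extend(space_tokens(len(match.group())))
    return out

example = tokenize_spaces("   if (x)  {")
deep = space_tokens(45)
```

Here `example` is `["SPACE03", "SPACE01", "SPACE02"]` and `deep` is `["SPACE40", "SPACE05"]`, so an indentation of 45 spaces contributes one "SPACE40" instance.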
Using the new white space features, we repeated our ten-class authorship attribution experiment on Coll-T, and found that accuracy jumped from 78.52% (11,700/15,000) to 82.35% (12,352/15,000) with this feature change alone. We also repeated the experiment on Coll-A and found that accuracy jumped by a similar margin from 76.78% (12,261/15,969) to 79.66% (12,700/15,943). Both of these results are statistically significant at the 95% confidence level using a two-sample test for equality of proportions with continuity correction (p = 2.20 × 10⁻¹⁶ and p = 1.50 × 10⁻¹⁰ respectively). Further feature refinements like the one presented may yield more effective results, but we would expect these improvements to be more subtle, as the change made with white space affected the feature represented by the largest number of tokens in Coll-T. Another reasonable refinement could be to interpret literals differently to identify authors who use short or long identifier names, but we leave these refinements for future work.
6.8 Summary
In this chapter, we explored several factors that affect the accuracy of our approach. These factors comprised the number of authors, number of samples per author, sample lengths, stylistic strength, timestamp, and entropy. A key finding is that it takes about one semester for student coding style to stabilise according to our data, which has implications for practitioners who deal with source code quality control. The end result for our final model was a statistically significant improvement in accuracy compared with the version from the previous chapter. Next, in Chapter 7 we compare our final model to the identified baselines, and investigate improvements to those baselines.
Chapter 7
Improving Contributions in the Field
The work presented in Chapter 5 introduced our information retrieval approach and the choice of key parameters including similarity measure, n-gram size, and feature set, using the data in Coll-A. Then in Chapter 6 we explored several factors that affect the accuracy of our approach, including topical and temporal patterns in the data. The timestamp experiments using Coll-T identified a further improvement to our feature set involving how white space is used. It remains to be seen how our approach performs on a variety of programming languages in a variety of settings, as the experiments to this point in our approach have used C programming assignments only.
In Section 7.1, we provide accuracy results for our approach on all of the collections introduced in Chapter 3. We then benchmark these against the reimplemented approaches from Section 4.5 (p. 106), to show how much our work has improved the state-of-the-art in source code authorship attribution accuracy. In Sections 7.2 and 7.3, we next implement some extensions for the previous approaches that used n-grams and software metrics respectively. Finally, we summarise all of the key results for this thesis in Section 7.4 before summarising the chapter in Section 7.5.
7.1 Overall Results for the Information Retrieval Approach
When reporting our results for collections Coll-A, Coll-T, Coll-P, and Coll-J, we first reiterate that a different number of runs is used depending on the collection. For example, in the Section 5.1 (p. 110) methodology based on Coll-A, we described how the experiment is repeated for 100 runs, with each run using a random subset of 10 authors. We have taken care that roughly the same number of queries is processed for each collection. However, this number will never be exact, as the total number of queries depends upon the authors that are randomly selected for each run, and the number of samples per author varies greatly, as shown in Figure 3.2 (p. 69). The only exception is Coll-T, which has exactly six samples for every author. In short, we used 100 runs for Coll-A, 250 runs for Coll-T, 150 runs for Coll-P, and 250 runs for Coll-J.
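As a rough check on the query counts, Coll-T gives 250 runs × 10 authors × 6 samples = 15,000 queries. The sketch below assumes that every sample of every selected author is used as a query once per run; the author identifiers, the figure of 272 authors, and the uneven sample counts are illustrative assumptions, not data from the collections.

```python
import random

def total_queries(samples_per_author, runs, authors_per_run=10, seed=0):
    """Count queries over all runs when each run draws a random subset of authors."""
    rng = random.Random(seed)
    authors = sorted(samples_per_author)
    total = 0
    for _ in range(runs):
        chosen = rng.sample(authors, authors_per_run)
        total += sum(samples_per_author[a] for a in chosen)
    return total

# Coll-T-like case: every author has exactly six samples, so the total is fixed.
coll_t_like = {f"author{i:03d}": 6 for i in range(272)}
coll_t_queries = total_queries(coll_t_like, runs=250)

# Hypothetical uneven case: the total now depends on which authors are drawn.
uneven = {f"author{i:03d}": 2 + (i % 9) for i in range(100)}
uneven_queries = total_queries(uneven, runs=100)
```

With a fixed number of samples per author the total is the same for every seed, whereas the uneven case changes with the random draw, which is why the query counts can only be matched approximately across collections.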
To enable us to make some generalisations about the differences between our accuracy scores, we performed a post-hoc analysis to obtain an estimate of the statistical power of these experiments. Assuming that we want high power (0.8, meaning an 80% chance of avoiding a Type II error) at 95% confidence, that we have 15,000 queries, and that accuracy is 80% for a given method, our tests are powerful enough to reject a false null hypothesis when a second method differs in accuracy by 1.28% or more. We note that this statistical power will remain fairly constant, as each experiment has a similar number of queries.
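The minimum detectable difference can be approximated with the standard normal approximation for a two-sided comparison of two proportions. This is a sketch under assumptions: the function name is hypothetical, both methods are assumed to process 15,000 queries, the second method is assumed to be the more accurate one, and the procedure actually used for the post-hoc analysis may differ.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_increase(p1, n, alpha=0.05, power=0.8, iterations=50):
    """Smallest increase d such that p1 versus p1 + d is detected with the given power.

    Uses the normal approximation with unpooled variances and n queries per
    method. Because the variance depends on d, d is found by fixed-point iteration.
    """
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    d = 0.0
    for _ in range(iterations):
        p2 = p1 + d
        d = z_total * sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n)
    return d

d = min_detectable_increase(p1=0.80, n=15000)
```

Under these assumptions `d` comes out at roughly 0.0128, in line with the difference of about 1.28 percentage points quoted above.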
We report two sets of results in this section concerning the index construction methods reviewed in this thesis. The first variation was presented in Section 5.1 (p. 110), where the query document is indexed, but is removed from the results list (presumably the first rank). This variation was used throughout Chapters 5 and 6 in order to reduce the number of indexes required when developing our approach.
The second variation from Section 5.5 (p. 132) uses a strict separation of the training and testing data, where all samples except the query sample are indexed for each run, and each sample is treated as the query in turn. This variation requires one index per query instead of one index per run, hence it is much slower. However, since this variation uses a strict leave-one-out cross validation design, it more closely models the previous work, and is therefore more appropriate for comparison purposes.
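The difference between the two index construction variations can be illustrated with a toy ranker. This is only a sketch: the scoring function (summed inverse document frequency of shared terms) is a stand-in chosen for brevity and is not the similarity measure used in the thesis, and all names and sample data are hypothetical.

```python
import math
from collections import Counter

def idf(docs):
    """Inverse document frequency computed from whatever is in the index."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    return {term: math.log(1 + n / count) for term, count in df.items()}

def rank(query_terms, index_docs):
    """Rank indexed documents by the summed IDF of the terms shared with the query."""
    weights = idf(index_docs)
    scores = {}
    for doc_id, terms in index_docs.items():
        shared = set(query_terms) & set(terms)
        scores[doc_id] = sum(weights[t] for t in shared)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

def rank_query_indexed(query_id, samples):
    """First variation: the query stays in the index and is dropped from the results."""
    ranked = rank(samples[query_id], samples)
    return [doc_id for doc_id in ranked if doc_id != query_id]

def rank_query_held_out(query_id, samples):
    """Second variation: the index is rebuilt without the query sample."""
    index_docs = {d: t for d, t in samples.items() if d != query_id}
    return rank(samples[query_id], index_docs)

samples = {
    "a1": ["for", "SPACE03", "(", ")"],
    "a2": ["for", "SPACE03", "{", "}"],
    "b1": ["while", "SPACE04", "(", ")"],
    "b2": ["while", "SPACE04", "{", "}"],
}
indexed = rank_query_indexed("a1", samples)
held_out = rank_query_held_out("a1", samples)
```

In the first variation the index can be built once per run and reused for every query, whereas the second rebuilds it for each query. The collection statistics (here, the IDF weights) therefore include the query sample in the first case but not in the second, which is the only way the two rankings can differ.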
The first variation is referred to as lenient leave-one-out (or Lenient), and the second variation is labelled strict leave-one-out (or Strict) henceforth. The Lenient approach achieved 80.59% accuracy for Coll-A, 81.88% for Coll-T, 88.81% for Coll-P, and 81.87% for Coll-J. The Strict approach achieved 79.70% accuracy for Coll-A, 81.29% for Coll-T, 89.34% for Coll-P, and 80.76% for Coll-J. These results are summarised in Table 7.1 with the p-values for Z-tests for two proportions.
In absolute terms, the Strict variation was more accurate for Coll-P, and the Lenient variation was more accurate for the other collections. However, considering the p-values at the 95% confidence level, we note that two of the four differences are not statistically significant, and the Coll-A result is borderline (p = 0.05). We therefore consider either method satisfactory. This finding is consistent with the conclusion in Section 5.5 (p. 132) drawn from Coll-A alone. For the remainder of this chapter, we use the slower Strict variation as it is more consistent with the previous work reviewed in Section 4.3 (p. 90). The Lenient variation is not discussed further in this chapter.
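A Z-test for two proportions of the kind used for these comparisons can be sketched as follows. The function name is hypothetical, and the example assumes 15,000 queries for each variation, which is only approximately true for Coll-A, so the exact p-values in Table 7.1 may differ slightly.

```python
from math import erfc, sqrt

def two_proportion_z_test(p1, n1, p2, n2):
    """Two-sided Z-test for the equality of two proportions (pooled variance)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability of the normal
    return z, p_value

# Coll-A: Lenient 80.59% versus Strict 79.70%, assuming 15,000 queries each.
z, p = two_proportion_z_test(0.8059, 15000, 0.7970, 15000)
```

With these assumed query counts the test gives a p-value close to 0.05, matching the borderline Coll-A result.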
Next, it is remarkable to note the higher accuracy obtained for the freelance collections (Strict). Coll-P accuracy results were around 7% higher than the academic collection accuracy results, and the