Machine Learning Fundamentals - Source code authorship attribution

documents. Common uses are P@10 for evaluating the relevance of the first page of results returned by current, popular Internet search engines, and P@1 for evaluating just the correctness of the top result.

Finally, the F-measure [Rennie, 2004] (or F1 score) is used to generate a value that represents a compromise between precision and recall:

$$F_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \qquad (2.6)$$
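As a minimal sketch, Equation 2.6 can be computed as follows. The function name `f1_score` and the convention of returning zero when both inputs are zero are assumptions made for this illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (Equation 2.6)."""
    if precision + recall == 0:
        # Assumed convention: avoid division by zero by reporting 0.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example: half of the returned results are relevant, and all relevant
# documents were returned.
print(f1_score(0.5, 1.0))
```

Because the F-measure is a harmonic mean, it sits closer to the lower of the two inputs than an arithmetic mean would.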

Reciprocal Rank and Mean Reciprocal Rank

Reciprocal rank measures the reciprocal value of the position of the first relevant (or correct) result from a ranked list. So if the first correct result was at position 3, then the reciprocal rank would be 1/3. Then, Mean Reciprocal Rank (MRR) is used to average multiple reciprocal rank scores to measure overall system performance.
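The two measures can be sketched as below. The representation of a ranked list as a sequence of booleans, the function names, and the choice to score a list with no correct result as zero are assumptions for this illustration.

```python
from fractions import Fraction


def reciprocal_rank(relevance):
    """Return 1/rank of the first relevant result, or 0 if there is none.

    `relevance` is a list of booleans in rank order (position 1 first).
    """
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return Fraction(1, rank)
    return Fraction(0)


def mean_reciprocal_rank(ranked_lists):
    """Average the reciprocal ranks over several ranked lists."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)


# First correct result at position 3, as in the example in the text.
print(reciprocal_rank([False, False, True]))
```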

Average Precision and Mean Average Precision

Average precision measures the precision of every relevant (or correct) result from a ranked list and takes the average. For example, if correct results are at positions 2, 3, and 9 in a ranked list of ten documents, then average precision is (1/2 + 2/3 + 3/9)/3 = 1/2. Similar to MRR, Mean Average Precision (MAP) is used to average multiple average precision scores to measure overall system performance.
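The worked example can be reproduced with the sketch below. It assumes that the average is taken over the relevant results that appear in the list, as in the example. The function names are illustrative.

```python
from fractions import Fraction


def average_precision(relevant_positions):
    """Average of precision values measured at each relevant position.

    `relevant_positions` holds the 1-based ranks of the relevant results.
    """
    positions = sorted(relevant_positions)
    if not positions:
        return Fraction(0)  # assumed convention when nothing is relevant
    # At the i-th relevant result (rank `pos`), precision is i / pos.
    precisions = [Fraction(i, pos) for i, pos in enumerate(positions, start=1)]
    return sum(precisions) / len(precisions)


def mean_average_precision(runs):
    """Average the average-precision scores over several ranked lists."""
    return sum(average_precision(r) for r in runs) / len(runs)


print(average_precision([2, 3, 9]))  # the example from the text
```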

2.6 Machine Learning Fundamentals

Cunningham et al. [1997] described a purpose of machine learning as “to devise algorithms that can supplement, or supplant, domain experts in knowledge engineering situations”. Cunningham et al. [1997] also mentioned a key link between information retrieval and machine learning that forms a significant component of this thesis:

“Using learning algorithms to automate information retrieval processes such as document classification ... can alleviate the workload of information workers and reduce inconsistency introduced by human error.”

This section provides an overview of machine learning to ensure that enough is understood for classification in authorship attribution. We begin by describing the training and testing phases for a typical classification problem. Then we cover cross-validation, feature selection, and discretisation, topics that specifically arise in this thesis. Numerous textbooks are available for further reading on machine learning, such as the data mining book by Witten and Frank [2005].

2.6.1 Training and Testing

This thesis introduces information retrieval ranking for source code authorship attribution in Chapter 5. However, most of the prior source code authorship attribution work (covered in Chapter 4) has used machine learning classification algorithms to learn how to attribute authorship. Therefore, we begin by discussing how machine learning algorithms can be used to attribute authorship.

A training phase is needed so that a classification algorithm can learn how to classify existing work samples of established authorship. Then, when a new sample is presented, the classification algorithm can assign authorship to the most likely author based on the learned traits. In experimentation, this step is a separate testing phase, where the effectiveness of multiple classification algorithms is often compared in order to identify the most suitable algorithm for the problem at hand.

For authorship attribution with machine learning methods, accuracy is defined as the proportion of times the classification algorithm assigns the testing samples to the correct author. The specific classification algorithms that appear in this thesis are covered in Section 2.7.

2.6.2 Cross Validation

When classification experiments are organised into training and testing phases, a separate data set is needed for each component. If accuracy scores were only reported for the training data, the results would be overfitted to that data and not representative of results obtained for new, unseen problems.

The simplest way to organise an experiment into training and testing phases is to divide the data in half, and use one part for training and the other for testing. However, this approach is problematic for small data sets in particular, as the amount of test data is reduced by half.

A solution is to repeat the above experiment a second time with the roles of the data reversed. That is, use the testing half for training and the training half for testing for a second fold, and then combine the results with the first fold. This approach allows every sample to be used for testing in turn, which increases the size of the result set. This experiment design is known as two-fold cross validation [Witten and Frank, 2005, pp. 125–127].

Another problem still persists when using two-fold cross validation: half of the data is unavailable for training. This is particularly problematic when conducting source code authorship attribution experiments on student programming assignments, for example, as the students' coding styles may still be evolving, and only some training samples may be truly helpful when attributing each sample. It may be particularly difficult if the best samples are in the testing set when each sample in the training set is classified in turn.

Fold  Instances
1     A1 B1 C1 D1 E1
2     A2 B2 C2 D2 E2
3     A3 B3 C3 D3 E3
4     A4 B4 C4 D4 E4
5     A5 B5 C5 D5 E5
6     A6 B6 C6 D6 E6

Figure 2.8: Thirty instances and five classes (A-E) spread across six folds. All folds have a document from each class, and therefore all classes are perfectly stratified.

To overcome this second problem, the size of the training fold needs to be increased such that more samples are available when constructing a model for classification. This can be done by simply increasing the two-fold experiment design to a larger number of folds, such as ten-fold. In ten-fold cross validation, nine of the folds are used for training, and the remaining fold is used for testing. This is then repeated nine more times, where the remaining folds are treated as the test fold in turn. The size of the training set is therefore increased from 50% to 90% of the samples in the collection.
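The fold rotation can be sketched as follows. The round-robin assignment of samples to folds and the function name `k_fold_splits` are assumptions for this illustration. Any partition into equally sized folds would serve.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Samples are dealt to folds round-robin; each fold is the test
    fold exactly once, and the other k - 1 folds form the training set.
    """
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in range(n_samples) if i not in held_out]
        yield train, test_fold


for k in (2, 10):
    train, test = next(k_fold_splits(30, k))
    print(k, "folds:", len(train), "training samples,", len(test), "test samples")
```

For a collection of thirty samples, the two-fold design trains on 15 samples per fold and the ten-fold design trains on 27, which corresponds to the increase from 50% to 90% described above.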

Another problem concerns collection properties such as the number of samples per author. For example, consider a scenario with thirty samples in a collection, where the samples belong to five authors with six samples per author. Figures 2.8 and 2.9 depict what the folds could look like when this collection is organised into six folds and ten folds respectively. There is a problem here, in that the chance of identifying the author at random depends on the experiment design. When one fold is separated for testing in the six-fold design (Figure 2.8), 25 training samples remain, five of which belong to the correct author, giving a 5/25 chance. In the ten-fold design (Figure 2.9), 27 training samples remain, five of which belong to the correct author, giving a 5/27 chance.
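The chance values can be checked directly from the fold layouts in Figures 2.8 and 2.9. The sketch below assumes that "random chance" means picking one training sample at random and predicting its author. The variable and function names are illustrative.

```python
from fractions import Fraction

AUTHORS = "ABCDE"
# All thirty instances as (author, sample number) pairs: A1..A6, B1..B6, ...
INSTANCES = [(a, n) for a in AUTHORS for n in range(1, 7)]

# Figure 2.8: fold n holds sample n from every author (six folds of five).
six_folds = [[(a, n) for a in AUTHORS] for n in range(1, 7)]
# Figure 2.9: instances dealt in order across ten folds (ten folds of three).
ten_folds = [INSTANCES[j::10] for j in range(10)]


def chance_levels(folds):
    """Set of distinct chance-level probabilities over all test instances."""
    levels = set()
    for test_fold in folds:
        training = [inst for fold in folds if fold is not test_fold
                    for inst in fold]
        for author, _ in test_fold:
            same_author = sum(1 for a, _ in training if a == author)
            levels.add(Fraction(same_author, len(training)))
    return levels


print(chance_levels(six_folds))
print(chance_levels(ten_folds))
```

Under this assumption, every test sample in the six-fold layout has five same-author samples among 25 training samples, and every test sample in the ten-fold layout has five among 27.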

To alleviate this third problem, another cross-validation design is available called leave-one-out cross validation [Witten and Frank, 2005, pp. 127–128]. In this variant, the number of folds is maximised and set to the number of samples in the collection. This variation causes an efficiency trade-off, as it can be slower to execute since the number of folds has been maximised. However, this experiment design maximises the amount of training data, and is more suited to dealing with collections with a varying number of samples per author, since problems demonstrated in Figures 2.8 and 2.9 cannot occur. The leave-one-out cross validation experiment design is used throughout this thesis.

Fold  Instances
1     A1 B5 D3
2     A2 B6 D4
3     A3 C1 D5
4     A4 C2 D6
5     A5 C3 E1
6     A6 C4 E2
7     B1 C5 E3
8     B2 C6 E4
9     B3 D1 E5
10    B4 D2 E6

Figure 2.9: Thirty instances and five classes (A-E) spread across ten folds. Each fold is only represented by three of the five classes.

2.6.3 Feature Selection

The purpose of feature selection is to “identify and remove as much irrelevant and redundant information as possible prior to learning”, which can generate “enhanced performance, a reduced hypothesis search space, and, in some cases, reduced storage requirement” [Hall and Smith, 1998]. This contradicts the idea of monotonicity, which asserts that “increasing the number of features can never decrease performance” [Hall and Smith, 1998]. However, monotonicity often does not apply to machine learning, since “adding irrelevant or distracting attributes to a dataset often ‘confuses’ machine learning systems” [Witten and Frank, 2005, p. 232].

Feature selection algorithms involve processes such as adding or removing one feature at a time to a model until no further change improves classification accuracy. Alternatively, the value of individual features can be evaluated one at a time for possible inclusion in a model. However, we do not go into this further, as this thesis mostly uses existing feature set classes instead of exploring the use of feature selection algorithms in detail.
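The first process, adding one feature at a time, can be sketched as a greedy forward search. The `evaluate` callback, the feature names, and the toy scores are hypothetical. In practice `evaluate` would return a classification accuracy, for example one obtained by cross validation.

```python
def forward_selection(features, evaluate):
    """Greedily add the feature that most improves the score, until none does."""
    selected = []
    best_score = evaluate(selected)
    while True:
        best_candidate = None
        for feature in features:
            if feature in selected:
                continue
            score = evaluate(selected + [feature])
            if score > best_score:
                best_score, best_candidate = score, feature
        if best_candidate is None:
            return selected, best_score  # no single addition helps
        selected.append(best_candidate)


# Toy scoring function (an assumption): two helpful features, one distracting.
WEIGHTS = {"a": 0.2, "b": 0.1, "c": -0.05}


def toy_evaluate(subset):
    return 0.5 + sum(WEIGHTS[f] for f in subset)


print(forward_selection(["a", "b", "c"], toy_evaluate))
```

In this toy setting the distracting feature "c" is never added, because including it lowers the score at every step.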

2.6.4 Discretisation

Machine learning algorithms often need to process features with discrete or continuous values. Discrete features are categorical. For example, gender as a feature has two categorical values: “male” and “female”. Conversely, features that require measurement, such as height, are often continuous.

Continuous features can prove difficult for machine learning algorithms to process, and a solution is to organise the possible values into a number of fixed ranges. This process is called bucketing, binning, or discretisation [Shevertalov et al., 2009]. For example, when using height as a feature for
