This section is concerned with the process of defect prediction: the construction and use of classifiers that are intended to predict the presence of defects in future code units. As discussed in Chapter 3, when trying to estimate potential real-world predictive performance, it is crucial for classifiers to be tested on unseen data, that is, data that was not used during model construction. Therefore, the original data sample should typically be re-sampled into one or more training and testing set pairs. Some of the methods of doing this were described in Section 3.3.1; stratified 10-fold cross-validation is usually a good choice. As discussed in the previous section, each training set may require its own pre-processing. It is often necessary for this pre-processing to be carried out on only the training data, as otherwise the assumption of unseen data could be broken (see Chapters 4 and 5). For example, if a high level of class imbalance was identified during data analysis, an undersampling and/or oversampling technique may need to be applied to only the training data (see Section 3.4). Also, as will be shown in the next section, repeated data points may need to be addressed in only the training data.
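The partitioning described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis: the function names `stratified_k_fold` and `undersample_majority` are hypothetical, only the Python standard library is used, and random undersampling to the size of the rarest class is just one possible training-only pre-processing step.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs in which each fold
    roughly preserves the class proportions of the whole sample."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    # Deal each class's shuffled indices round-robin across the folds.
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for test_fold in range(k):
        test = sorted(folds[test_fold])
        train = sorted(
            i for f, fold in enumerate(folds) if f != test_fold for i in fold
        )
        yield train, test


def undersample_majority(train_indices, labels, seed=0):
    """Randomly reduce every class in the training set to the size of
    the rarest class. Test indices are never passed in, so the test
    set stays unmodified."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index in train_indices:
        by_class[labels[index]].append(index)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        kept.extend(rng.sample(indices, target))
    return sorted(kept)


# Hypothetical data: 20 defective (1) and 80 non-defective (0) units.
labels = [1] * 20 + [0] * 80
for train, test in stratified_k_fold(labels, k=10):
    balanced_train = undersample_majority(train, labels)
    # A classifier would be fitted on balanced_train and evaluated on
    # the untouched test indices.
```

The point of the sketch is the ordering: the split is made first, and the sampling step only ever sees training indices.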
When trying to produce as good a classification model as possible, it is important to carry out a model optimisation phase (see Section 3.5). Note that this is not always possible for some simple classifiers, and that there are other, more sophisticated classification methods (random forests for example) that can perform competitively without such tuning; these methods typically do not require an additional, explicit model optimisation phase, although one can often be carried out if desired. Nevertheless, for many classifiers (such as SVMs) an explicit model optimisation phase is often a prerequisite for obtaining a satisfactory classification model. The pseudo-code for a typical, full classification experiment which includes model optimisation was given in Figure 3.6. Note that during this process it is imperative for the test data to remain entirely unseen. Also note that it is good practice for a coarse parameter search to be undertaken to begin with, and then a finer search on the area identified as best by the coarse search [Hsu 2003].
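A coarse search followed by a finer one can be sketched as below for a classifier with two parameters, such as an SVM with cost C and kernel width gamma. This is an illustration under stated assumptions rather than the procedure of Figure 3.6: `score_fn` is a hypothetical callable that must estimate performance from the training data only (for example by inner cross-validation), and the exponent ranges and step sizes are illustrative choices.

```python
import itertools


def frange(start, stop, step):
    """Inclusive range of evenly spaced floats."""
    count = int(round((stop - start) / step))
    return [start + i * step for i in range(count + 1)]


def grid_search(score_fn, c_exponents, gamma_exponents):
    """Return (best_score, log2_c, log2_gamma) over an exponential grid."""
    best = None
    for log2_c, log2_gamma in itertools.product(c_exponents, gamma_exponents):
        score = score_fn(2.0 ** log2_c, 2.0 ** log2_gamma)
        if best is None or score > best[0]:
            best = (score, log2_c, log2_gamma)
    return best


def coarse_to_fine_search(score_fn):
    # Coarse pass: wide grid with a step of 2 in log2 space (assumed ranges).
    _, c0, g0 = grid_search(score_fn, frange(-5, 15, 2), frange(-15, 3, 2))
    # Fine pass: step of 0.25 within +/-2 of the coarse optimum.
    return grid_search(
        score_fn,
        frange(c0 - 2, c0 + 2, 0.25),
        frange(g0 - 2, g0 + 2, 0.25),
    )
```

Because `score_fn` never receives test data, the test set remains entirely unseen throughout both passes.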
8.3.1 How to address the issues caused by repeated data points
As already mentioned, the simplest way to address the issues caused by repeated data points is to discard them as part of the contextual data cleansing process, making each consistent data point unique (for details see Section 6.1.2.6). However, carrying out such a pre-processing step may not be the best approach, as repeated data points may occur in the real world. For this reason, I propose the following method for addressing the issues caused by repeated data points in artificial (as opposed to real-world) classification experiments. This method was first proposed in an EASE journal paper [Gray 2012], which was an extended version of the conference paper described in Section 6.1. The full journal paper can be found in Appendix E.
1. After the initial divide into training and testing set, discard all training data points with feature vectors common to the testing set. This ensures that performance is measured on unseen data, while the test set remains unmodified.
2. If the class distribution of the training set is adversely altered as a consequence of step 1, consider sampling techniques (see [He 2009, Chawla 2002]) to help maintain the original (or a more balanced) distribution. If oversampling is required, I recommend the synthetic minority oversampling technique (SMOTE [Chawla 2002]) rather than oversampling by duplication, as it reduces the likelihood of overfitting [Chawla 2002, Chawla 2003, Cieslak 2006].
3. During model optimisation/tuning (if performed), remove all duplicates from the validation set, as recommended by [Kołcz 2003]. Next, discard all validation set data points with feature vectors common to the corresponding training set (the training subset). If the class distribution of the validation set is adversely altered, consider the use of sampling techniques, as described in step 2. The purpose of this step is to help avoid overfitting.
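The filtering parts of the three steps above can be sketched as follows. This is a minimal illustration, not the code used in [Gray 2012]: the helper names are hypothetical, each data point is assumed to be a (feature vector, class label) pair, and duplicates are assumed to mean points that share both feature vector and label. The sampling of step 2 (for example SMOTE) is not implemented; `class_counts` only helps judge whether it is needed.

```python
from collections import Counter


def drop_shared(points, reference):
    """Discard points whose feature vector also occurs in `reference`.
    The reference set itself is left unmodified."""
    reference_vectors = {tuple(features) for features, _ in reference}
    return [
        (features, label)
        for features, label in points
        if tuple(features) not in reference_vectors
    ]


def drop_duplicates(points):
    """Keep the first occurrence of each (feature vector, label) pair."""
    seen = set()
    unique = []
    for features, label in points:
        key = (tuple(features), label)
        if key not in seen:
            seen.add(key)
            unique.append((features, label))
    return unique


def class_counts(points):
    """Count points per class, to check for a shifted distribution."""
    return Counter(label for _, label in points)


# Hypothetical toy data.
train = [((1, 2), 0), ((1, 2), 0), ((3, 4), 1), ((5, 6), 0)]
test = [((3, 4), 1), ((7, 8), 0)]

# Step 1: remove training points that also appear in the test set.
clean_train = drop_shared(train, test)

# Step 2: compare class_counts(train) with class_counts(clean_train)
# and apply a sampling technique to the training data if required.

# Step 3, for a validation set and its training subset during tuning:
# validation = drop_shared(drop_duplicates(validation), training_subset)
```

In this toy example the only defective training point is shared with the test set, so step 1 removes the minority class from the training data altogether, which is exactly the situation step 2 is intended to catch.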
This proposed approach is most suitable when researchers believe the repeated data points to be genuine, not noise. This is because test sets remain unmodified. With the simple approach of removing all repeated data points during data cleansing, test sets are indirectly modified before the data separation process has occurred. Note that a possible addition to the proposed approach is to remove all duplicates from the training set, to further reduce the possibility of overfitting. The best place for this to occur would be between steps 1 and 2.
8.3.2 Summary
To summarise this section, during the process of defect prediction:
• Help ensure that classifiers will be tested against unseen data by partitioning the data into one or more training and testing set pairs. Stratified 10-fold cross-validation is typically a good method of doing this (see Section 3.3.1).
• Carry out a model optimisation phase if suitable to do so based on the classification method being used (see Section 3.5). Good practice is for a coarse parameter search to be undertaken to begin with, and then a finer search on the area identified as best by the coarse search [Hsu 2003].
• As discussed in the previous section, if there are genuine repeated data points present in the data that cannot be addressed by gathering more features (to better discriminate them), then a suitable approach is described in Section 8.3.1. This approach keeps test sets unmodified, while preventing any contamination of training and testing data.