4.1.2 Experiments & Findings
4.2 Lessmann et al. 2008
4.3 Elish & Elish 2008
4.4 Liebchen & Shepperd 2008
4.5 An SLR on Fault Prediction Performance
4.5.1 Findings
4.6 Summary
This chapter is concerned with the defect prediction literature that is most relevant to this dissertation. In the sections that follow, a limited set of the most influential studies is discussed in detail. After this, the main findings from a systematic literature review (SLR) on the performance of software defect predictors are presented. I was directly involved in this SLR, which was a substantial, collaborative piece of work between Brunel University and the University of Hertfordshire.

The influential studies described in the following sections are: [Menzies 2007b], as this is a very well-cited defect prediction study; [Lessmann 2008], as this involved perhaps the largest-scale defect prediction experiment to date; [Elish 2008], as this is focused on defect prediction with SVMs; and [Liebchen 2008], as this highlights data quality awareness as an issue in modern, empirical software engineering research.
4.1 Menzies et al. 2007
The study carried out by Menzies et al. in 2007 [Menzies 2007b] is perhaps the most widely cited piece of modern defect prediction literature. It involved many defect prediction experiments, with various learners and publicly available data sets. The source of these data sets was the NASA Metrics Data Program Repository.¹
¹ Previously available at http://mdp.ivv.nasa.gov/ and currently available in a more basic, less documented form at http://filesanywhere.com/fs/v.aspx?v=896a648c5e5e6f799b.
42 Chapter 4. Literature Review
4.1.1 The NASA Metrics Data Program Repository
The NASA Metrics Data Program (MDP) Repository currently contains 13 module-level data sets² explicitly intended for software metrics research. Each data set represents a NASA software system/subsystem and contains the static code metrics and fault data for each constituent module. The static code metrics recorded include:
• LOC-count measures such as the number of lines of code and comments.
• Halstead measures such as unique operand and operator counts.
• McCabe measures such as the cyclomatic complexity.
All such non-fault related software metrics within each of the data sets were generated using McCabeIQ 7.1, a commercial tool for the automated collection of static code metrics. The primary fault data in these data sets takes the form of an error-count metric. This metric was reportedly calculated from the number of error reports issued for each module via a bug-tracking system. From the details given at the original NASA MDP Repository, it is unclear precisely how these error reports were mapped back to the individual modules; however, it was stated that: “If a module is changed due to an error report (as opposed to a change request), then it receives a one up count. It cannot receive more than a one up for a given error report.” It was also stated that the error-count metric describes “the number of changes due to error.”
Because the error-count metric is discrete, it is often binarised by those using these data sets in classification experiments. Without binarisation, there will typically need to be a class for each unique error-count value. This can leave some classes represented by very few instances, which may be problematic for learners (because of the class imbalance problem, see Section 3.4). The most common binarisation process used with the NASA MDP data sets (NASA data sets hereafter) was defined in [Menzies 2007b], as shown in Equation 4.1:
defective? = (error_count ≥ 1)    (4.1)
This process, despite its drawbacks (see Section 2.2.1), was carried out in each of the NASA-based studies described in this chapter. A thorough analysis of the NASA data sets is provided in Section 6.1, where many data quality issues are highlighted.
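The binarisation in Equation 4.1 can be illustrated with a minimal Python sketch. The function name and the example error counts are hypothetical and are not taken from any NASA data set; the sketch only shows the mapping itself and how many classes would be needed if the counts were left unbinarised.

```python
def binarise_error_counts(error_counts):
    """Map each module's error count to a defective? label (Equation 4.1)."""
    return [count >= 1 for count in error_counts]


# Hypothetical error counts for five modules (illustrative values only).
example_counts = [0, 3, 1, 0, 7]

# One Boolean label per module: defective if at least one error was recorded.
labels = binarise_error_counts(example_counts)

# Without binarisation, each distinct count would become its own class.
distinct_classes = sorted(set(example_counts))

print(labels)
print(distinct_classes)
```

In this toy example the unbinarised data would need four classes for five modules, three of which contain a single module each, which is the situation the preceding paragraph warns about.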
² Note that there is evidence to suggest the existence of a 14th data set, known as KC2. To the
4.1.2 Experiments & Findings
Three classifiers were used in [Menzies 2007b]: OneR, C4.5 (J48) and naïve Bayes. OneR (one rule) is a decision stump, a classification method similar to a decision tree but which can only split on a single attribute. The C4.5 algorithm is a well-known decision tree learner. An explanation of both this and the naïve Bayes method can be found in Section 3.2. The implementation used for each of these algorithms came from the popular data mining tool Weka. In addition to these learning techniques, this study also utilised a feature selection method. Such methods aim to reduce the number of features in a data set by locating and removing those that contain irrelevant information. The method used in this study, InfoGain, scores each feature by its information gain with respect to the class label, measured in bits.
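As an illustration of what an information gain score in bits means, the following Python sketch computes it for a single discrete feature. It is a textbook formulation under stated assumptions, not Weka's implementation: the function names are hypothetical, the feature is assumed to be already discrete (continuous metrics would need to be discretised first), and the example values are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(feature_values, labels):
    """Reduction in class entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: four modules, two of them defective.
defective = [True, True, False, False]
informative = ["high", "high", "low", "low"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]           # tells us nothing about the class

gain_informative = info_gain(informative, defective)
gain_uninformative = info_gain(uninformative, defective)
```

A feature selector of this kind would rank features by their gain and keep the highest-scoring ones.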
With 10 repetitions of a 10-fold cross-validation experiment using 8 of the NASA data sets, the individual performance of all three learners was compared. Naïve Bayes was identified as the best method, with statistically better performance than the other classifiers. With regard to the ‘best’ features chosen by InfoGain, the claimed results were that using only the 2 or 3 ‘best’ features gave performance equal to that obtained when all 38 features were used. Furthermore, the frequencies of the ‘best’ features varied heavily between data sets. This led to the following two conclusions: firstly, that there is no single set of ‘best’ metrics; secondly, that defect prediction experiments should be carried out using all available metrics, and that these should then be passed on to a feature selection technique such as InfoGain.
Classifier performance in this study was measured using the two metrics used in receiver operating characteristic (ROC) analysis: the true positive rate and the false positive rate (see Section 3.3.3). In the journal article these measures were referred to as the probability of detection (pd) and the probability of false alarm (pf), respectively. The performance obtained by the naïve Bayes classifiers, a mean pd of 0.71 and a mean pf of 0.25, was described as being “demonstrably useful”. This claim was questioned later the same year, however, by Zhang and Zhang, who demonstrated the choice of performance metrics to be inappropriate in the context of class-imbalanced data [Zhang 2007]. As this subject comprises a major part of this dissertation, it is discussed further in Section 6.2.
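One way to see why pd and pf alone can mislead on class-imbalanced data is to derive the precision they imply for a given proportion of defective modules. The Python sketch below is an illustration only and does not reproduce the analysis in [Zhang 2007]: the function names, the confusion-matrix counts and the two defect rates (10% and 50%) are hypothetical assumptions, while the pd and pf values are the means reported above.

```python
def pd_pf(tp, fn, fp, tn):
    """Return (pd, pf): the true positive rate and the false positive rate."""
    pd = tp / (tp + fn)
    pf = fp / (fp + tn)
    return pd, pf


def precision_from_rates(pd, pf, defect_rate):
    """Precision implied by pd and pf when a fraction defect_rate is defective."""
    true_positive_share = pd * defect_rate
    false_positive_share = pf * (1 - defect_rate)
    return true_positive_share / (true_positive_share + false_positive_share)


# Hypothetical confusion matrix for 1000 modules, 100 of them defective,
# chosen so that pd and pf match the reported means of 0.71 and 0.25.
observed_pd, observed_pf = pd_pf(tp=71, fn=29, fp=225, tn=675)

# Same pd and pf, two assumed class distributions.
imbalanced = precision_from_rates(0.71, 0.25, defect_rate=0.10)
balanced = precision_from_rates(0.71, 0.25, defect_rate=0.50)

print(f"pd={observed_pd:.2f} pf={observed_pf:.2f}")
print(f"precision at 10% defective: {imbalanced:.2f}")
print(f"precision at 50% defective: {balanced:.2f}")
```

Under these assumptions, fewer than a quarter of the modules flagged as defective in the imbalanced case actually are defective, even though pd and pf are unchanged.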
As well as the choice of performance measures used in [Menzies 2007b], aspects of the experimental design have also been called into question. A recent study by Song et al. involved a meticulous examination of the feature selection process [Song 2011]. It was found that the feature selection process used in [Menzies 2007b] violated the assumption of unseen data (see Chapter 3). This is because feature selection was carried out for each whole data set (not just the training set) using a method (InfoGain) which utilises class label data. Therefore, data which would not exist in the real world (the test set labels) were used in the model construction process, invalidating the experiment.
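The difference between the two designs can be sketched in Python: to respect the unseen-data assumption, features are ranked inside each cross-validation fold using the training rows only, so the test labels never influence the selection. This is a minimal sketch under stated assumptions, not the procedure from [Song 2011] or [Menzies 2007b]: all function names are hypothetical, features are assumed discrete, folds are simple interleaved splits without shuffling or stratification, and the data are made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def info_gain(column, labels):
    """Information gain of one discrete feature column w.r.t. the labels."""
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def top_k_features(rows, labels, k):
    """Rank feature indices by information gain using only the rows given."""
    n_features = len(rows[0])
    gains = [info_gain([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: gains[j], reverse=True)[:k]


def fold_indices(n_rows, n_folds):
    """Yield (train, test) index lists for simple interleaved k-fold splits."""
    for fold in range(n_folds):
        test = [i for i in range(n_rows) if i % n_folds == fold]
        train = [i for i in range(n_rows) if i % n_folds != fold]
        yield train, test


def select_per_fold(rows, labels, n_folds, k):
    """Select features separately per fold, from training rows only."""
    selections = []
    for train, test in fold_indices(len(rows), n_folds):
        train_rows = [rows[i] for i in train]
        train_labels = [labels[i] for i in train]  # test labels are never read
        selections.append((top_k_features(train_rows, train_labels, k), test))
    return selections


# Hypothetical data: feature 0 is constant, feature 1 mirrors the class label.
labels = [True, True, False, False, True, False]
rows = [(0, int(y)) for y in labels]
selections = select_per_fold(rows, labels, n_folds=3, k=1)
```

The flawed design would instead call `top_k_features(rows, labels, k)` once on the whole data set before splitting, which lets the test set labels shape the model.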