Summary - Software Defect Prediction Using Static Code Metrics: Formulating a Methodology

When analysing which independent variables were being used in studies, we fre- quently found claims that using LOC-count-based measures alone provided equal and if not greater performance than using complexity or object-oriented (OO) metrics. This finding corresponds with the findings in [Oram 2010], where the complexity metrics of thousands of open-source C projects were statistically analysed, and all found to be highly correlated with LOC-count. This led to the conclusion that “lines of code should be considered as the first and only metric for (defect prediction) models”. Despite this, there were studies in the SLR claiming that product metrics other than LOC-count, such as various OO metrics (see [Chidamber 1994]), were useful. It is worth noting that care must be taken when comparing the results of studies using exclusively OO-based or exclusively complexity-based metrics, as they are often predicting faults at different levels of granularity (see Chapter 2). The studies in the SLR reporting best performance initially used all available metrics, but then used a feature selection technique to use only a subset of the features estimated as most useful for classification.

With regard to which learning methods are best for defect prediction, our findings are in agreement with the findings reported in [Lessmann 2008] that there is no universal ‘best’ method. However, an interesting observation is that simple methods, such as naïve Bayes and logistic regression, seem to be performing very competitively. Conversely, complex, state-of-the-art methods, such as SVMs, seem to be performing fairly poorly. This may be because of a lack of data mining expertise in this domain.

4.6 Summary

To summarise the key findings described in this chapter: [Menzies 2007b] claimed their naïve Bayes predictors were “demonstrably useful”, with a mean TPR (or pd) of 0.71 and a mean FPR (or pf ) of 0.25. These claims were swiftly questioned by [Zhang 2007] however, who demonstrated that the choice of performance measures was inappropriate in the context of the data. The experimental design in [Men- zies 2007b] was also recently questioned by [Song 2011], who highlighted the flawed feature selection process.

Perhaps the largest-scale defect prediction experiment to date was carried out by [Lessmann 2008], who found the best 17 of 22 classifiers to have statistically indistinguishable performance. This led to the conclusion that choice of classification technique may not be as important as previously thought. The same year, [Elish 2008] carried out a defect prediction study intended to demonstrate the capa- bility of SVMs for defect prediction. This study contained several major technical shortcomings. The technical shortcomings in both this study and [Menzies 2007b] indicate that defect prediction is an immature research area, which may be lacking the necessary skills. In addition to technical problems, [Liebchen 2008] have highlighted data quality awareness as an issue in modern, empirical software engineering research.

50 Chapter 4. Literature Review

The most up-to-date SLR on defect prediction was carried out by [Hall 2012]. The focus of this SLR was predictor performance. A contextual quality check of the 208 identified studies resulted in 172 (83%) being rejected for further synthesis. The most common cause for rejection was that basic application context information had not been reported. This was problematic as investigating the effect of such context was a key research question. The main findings from this study include that:

• Simple learners tend to perform competitively, while more sophisticated techniques tend to perform poorly.

• Using LOC-count-based measures alone often yields predictors that are com- petitive with those built using additional metrics.

Chapter 5 SVMs for Defect Prediction

Contents

5.1 Initial Classification Experiment . . . 51 5.1.1 Issues & Shortcomings . . . 53 5.2 Examining Predictive Models . . . 53 5.2.1 Issues & Shortcomings . . . 60 5.3 Conclusions . . . 61

I

n this chapter I describe two experiments where support vector machines (SVMs) were used for defect prediction. These experiments, despite their shortcomings (which were discovered after their completion and will be discussed in full), were the beginning of a learning process regarding methodological issues in fault prediction. The experience gained from these studies was substantial, and a key influence on the methodology proposed in Chapter8.

The first experiment, which involved using SVMs to predict defects in the NASA data sets (see Section4.1.1), led to publication in the 2009 International Conference on Engineering Applications of Neural Networks (EANN) [Gray 2009]. It was during this study that I first identified data quality issues. Details are given in Section5.1, and the full paper can be found in AppendixA. The second experiment, concerned with classification model analysis and described in Section5.2, led to publication in the 2010 International Joint Conference on Neural Networks (IJCNN) [Gray 2010]. The full paper can be found in Appendix B. Note that for both these experiments the SVM software used was LIBSVM [Chang 2001], an excellent, open-source library for support vector classification.

5.1 Initial Classification Experiment

To gain experience at machine learning, a classification experiment was undertaken utilising 11 of the 13 NASA data sets (described in chapters 4 & 6). These data sets were chosen as they have been heavily used in defect prediction research. The classifiers used in this study were SVMs (see Section 3.6); these were chosen as, at that time, SVMs had not been widely used for defect prediction. Additionally, SVMs are highly sophisticated classifiers, and (with suitable parameters) often work as well as, or if not better than many other learning techniques [Segaran 2007].

52 Chapter 5. SVMs for Defect Prediction

Analysis of the data sets prior to classification led to the discovery of many data quality issues. The most severe of these issues was that after basic pre-processing (described in Section 6.1.2.1), many of the data sets contained large proportions of non-unique, or repeated data points. Such data points can be conceptualised as repeated rows in a spreadsheet. The major potential problem caused by repeated data points is that after the data set divide into training set and testing set, it becomes possible for both sets to contain common data points. Recall that the term ‘set’ in all terms: ‘data set’, ‘training set’ and ‘testing set’ does not correspond to the mathematical notion of the term; therefore, it is possible for these collections (perhaps most suitably thought of as mathematical lists) to contain duplicate elements. Thus, when data set S contains duplicate elements and is divided (using one of the methods described in Section 3.3.1) into training set R and testing set E, it is possible for the intersection of R and E to not equal the empty set (R ∩ E 6= ∅). A standard classification experiment based on R and E will therefore be compromised, as the test data contains seen data points: data points that were used in the construction of the model. The assumption of unseen data will therefore have been violated in such an experiment (more clarification on this is provided in chapters 3 &6).

There are many more data quality issues specific to the context of machine learning with the NASA data sets: these are covered in great detail in Section

6.1. For this experiment significant quantities of suspicious/problematic data was removed from each data set during initial pre-processing. This included the removal of all data points with non-unique feature vectors, such that each unique, consistent data point was only represented once (there is more on this in Chapter 6).

After binarisation of the error-count data (see chapters4& 6), the NASA data sets exhibit various levels of class imbalance: there are typically many more data points labelled as ‘non-defective’ than there are that are labelled as ‘defective’. Precise details will be given in Chapter 6 during a full analysis of the data sets; however, to alleviate the class imbalance problem in this experiment, I used an undersampling technique (see Section 3.4). This simple technique involves pseudo- randomly discarding majority class data points until there is an equal number of data points representing each class. In this experiment, undersampling was carried out during initial pre-processing, prior to any separation into training and testing sets. After pre-processing, the experiment involved 10 repetitions of 5-fold cross- validation for each data set. A further 5-fold cross-validation was carried out for each training set, for the purpose of model optimisation (see Section 3.5). This model optimisation process yielded a cost and gamma value-pair estimated as optimal for each (withheld) test set, which was subsequently used to train each corresponding final model.

Because undersampling was carried out on each whole data set during initial pre-processing, all data sets had a precisely balanced class distribution (an equal number of instances representing each class). This led to accuracy being the selected performance measure used to assess predictive performance (see Section3.3.2). The results from this initial experiment are not presented here, for reasons described in the following section.

In document Software Defect Prediction Using Static Code Metrics: Formulating a Methodology (Page 61-65)