6.2 Experiments
6.2.1 Software Data Sets
The data sets used in this study come from four projects: Eclipse, Camel, and Ant. Eclipse is a multi-language software development environment consisting of the base workspace and extensible plug-ins that customize the environment. The environments include the Eclipse Java development tools (JDT) for Java and Scala, Eclipse CDT for C/C++ and Eclipse PDT for PHP, among others. We utilized the defect content in three successive releases of Eclipse, 2.0, 2.1 and 3.0, at two levels of granularity: files and packages. Several versions of the Eclipse data sets have been used to study defect prediction [130, 131]. In our study, we use the Eclipse data sets introduced by Zimmermann et al. [122], which are publicly available. Zimmermann et al. used the Java parsers for Eclipse (visitors and aggregators) to aggregate the metrics to the file and package levels. More specifically, the visitors are implemented to compute standard metrics for methods, classes, or files (compilation units), while the aggregators are used to compute single values for each level. They computed the average (avg), maximum (max) and total (total) values for each metric, except NOCU, the number of files. The complexity metrics for each package/file can be computed from the archived builds of Eclipse. These data sets have been used in several recent studies as well [100, 132, 133]. Table ?? presents the metrics included in the Eclipse data sets.
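The avg/max/total aggregation described above can be illustrated with a small sketch. This is not the original tooling of Zimmermann et al.; the function names, the metric names (MLOC, VG) and the sample values are assumptions chosen only for illustration.

```python
from statistics import mean

def aggregate_metric(values):
    """Collapse the per-method values of one metric into avg, max and total."""
    if not values:
        raise ValueError("need at least one value to aggregate")
    return {"avg": mean(values), "max": max(values), "total": sum(values)}

def aggregate_file(method_metrics):
    """method_metrics maps a metric name to the values measured inside one file."""
    row = {}
    for name, values in method_metrics.items():
        for stat, value in aggregate_metric(values).items():
            row[f"{name}_{stat}"] = value
    return row

# Hypothetical file with three methods (illustrative metric names and values).
example = {"MLOC": [10, 4, 22], "VG": [2, 1, 5]}
file_row = aggregate_file(example)
```

The same two functions could be applied a second time to roll file-level values up to the package level.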
In addition, we applied active learning to data sets from two projects, Camel and Ant, which are publicly available in the PROMISE repository. Both Camel and Ant consist of three releases. Each instance in the three projects represents a class (.java) file and consists of twenty software metrics.
A summary of the data sets used in our study is reported in Table 6.1. The table lists the number of instances, actual defect rate and the number of metrics used in each release of the four projects. We note that the sample sizes of Ant releases are particularly low. For example, Ant 1.3 consists of 125 files and only 20 are defective. Table 6.2 and Table 6.3 provide the annotation for metrics in the three projects, respectively. For more details on the projects, see [122, 134].
6.2.2 Experimental Setting
Defect prediction between successive releases of the same product is practical because we expect minimal changes in the development environment and, consequently, similar defect characteristics. Further, the defect content of modules from an earlier release is known as a consequence of defect reporting. If the community of users is sufficiently large, the reports are likely to cover a big portion of the existing defects. Suggesting that humans serve as “oracles” for some modules in the upcoming release does not represent an extraordinary burden on the development team. For example, in Eclipse the defects reported in the six months prior to the release date are called pre-defects. Generally, development teams perform pre-release assessment, debugging and defect removal through unit testing, code walk-through, inspection and other forms of software verification.
The active learning defect prediction approach investigated in this study simply introduces a discipline in the selection of modules that need to be exposed to more thorough verification. Depending on project practices, this requirement may induce additional development cost. However, if the defect prediction model performs well, the cost of post-release maintenance should be lower. Whether this value proposition is valid or not remains an open question not only for the proposed defect prediction approach but for the entire research area [135]. However, it is clear that our approach (like any other active learning method) should use the oracle sparingly, requesting as few pre-release module assessments as possible.
In this section, we report experimental results from active learning defect prediction on four projects, nine releases in total, using visual analysis such as graphs and tables. In Section 6.2.6, we supplement the visual analysis with appropriate statistical tests. The experiments will help us understand:
• The defect prediction performance of active learning between subsequent releases;
• The impact of active learning variants: random selection vs. uncertainty-based selection of modules that need the oracle’s assessment;
• The impact of feature selection techniques when applied prior to active learning;
• The impact of dimensionality reduction techniques when applied prior to active learning;
• The impact of data size and defect rate on the prediction performance of active learning.
In Section 6.1.1 we showed that all feature selection techniques perform similarly (see Figure 6.1). Hence, for further experiments we selected only one of them, the information gain feature selection (InfoGain). We also learned that RF similarity coupled with the dimensionality reduction technique MDS outperforms Euclidean proximity. Therefore, we experiment only with MDS based on RF similarity. The six experimental approaches we analyzed and their abbreviations are:
1. Act: Active learning with uncertainty-based selection;
2. Rand: Active learning with random selection;
3. IG Act: Information Gain feature selection, IG, followed by Act;
4. IG Rand: Information Gain feature selection, IG, followed by Rand;
5. MDS Act: MDS with RF similarity followed by Act;
6. MDS Rand: MDS with RF similarity followed by Rand.
Each of the six active learning approaches above is applied to every release. For example, release 2.0 of Eclipse is used to build a defect prediction model that predicts defect-prone modules in release 2.1. Next, release 2.1 is used for training and release 3.0 for prediction. Every experiment is run 10 times and the average values are reported for comparison.
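The release pairing and the averaging over repeated runs can be sketched as follows. The helper names are hypothetical, and `fake_run` is only a placeholder for a real experiment that would train a model and return a score such as AUC.

```python
import random
from statistics import mean

def successive_pairs(releases):
    """Pair each release with its successor: (training release, prediction release)."""
    return list(zip(releases, releases[1:]))

def average_over_runs(run_once, n_runs=10, seed=0):
    """Call run_once(rng) n_runs times with differently seeded generators and average."""
    return mean(run_once(random.Random(seed + i)) for i in range(n_runs))

eclipse_pairs = successive_pairs(["2.0", "2.1", "3.0"])

# Hypothetical stand-in for one experiment run (not a real result).
def fake_run(rng):
    return 0.8 + rng.uniform(-0.05, 0.05)

avg_score = average_over_runs(fake_run)
```

Seeding each run separately keeps the ten repetitions reproducible while still letting the random components differ between runs.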
Random Forest (RF) is selected as the base algorithm in the active learning experiments due to its consistent performance in [71, 136]. Our previous studies also showed that random forest outperforms other supervised learning algorithms when the data are imbalanced and noisy.
At each iteration of active learning, a fixed number of modules (∼1% of the unlabeled modules) are selected for the assignment of their true defect labels. For example, with the Eclipse data, 4 packages (79 files) are selected at each iteration when predicting on release 2.1, and 7 packages (106 files) when predicting on release 3.0.
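A minimal sketch of the two selection strategies is given below. The text does not specify the uncertainty measure or the rounding rule, so two assumptions are made here: uncertainty is taken as closeness of the predicted defect probability to 0.5, and the batch size is 1% of the unlabeled modules rounded to the nearest integer. All function names and module counts are illustrative.

```python
import random

def batch_size(n_unlabeled, fraction=0.01):
    """Number of modules sent to the oracle per iteration (at least one)."""
    return max(1, round(fraction * n_unlabeled))

def select_uncertain(probs, k):
    """Indices of the k modules whose predicted P(defective) is closest to 0.5."""
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]

def select_random(n_unlabeled, k, rng):
    """Indices of k modules drawn uniformly without replacement."""
    return rng.sample(range(n_unlabeled), min(k, n_unlabeled))

# Illustrative predicted probabilities for five unlabeled modules.
probs = [0.05, 0.48, 0.93, 0.55, 0.20]
uncertain_idx = select_uncertain(probs, 2)
random_idx = select_random(len(probs), 2, random.Random(0))
```

Both strategies return indices into the current unlabeled pool, so they can be swapped inside the same loop without other changes.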
To track the prediction performance at each iteration of active learning, we do not set an a priori stopping criterion. The algorithm continues until it runs out of unlabeled modules (i.e., all unlabeled modules are labeled). Of course, in practice we are interested in the prediction performance of models that use as few modules analyzed by the oracle as possible, likely no more than 20%. A classic supervised learning experiment with random forest (RF) is the same as the first iteration of our experiment, before the active learning process starts. At that point, modules from previous release(s) are used as training data and all modules from the current release represent test data.
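The loop described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: `centroid_scorer` is a toy one-feature stand-in for the random forest base learner, uncertainty is taken as closeness to 0.5, and all names and data values are hypothetical.

```python
import math
import random

def centroid_scorer(labeled, pool):
    """Toy stand-in for the base learner: logistic score around the midpoint of class means."""
    pos = [x for x, y in labeled if y == 1]
    neg = [x for x, y in labeled if y == 0]
    mid = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 / (1 + math.exp(-(x - mid))) for x in pool]

def active_learning(train, pool, oracle, fit_predict, k, strategy="uncertainty", seed=0):
    """Run until the pool of unlabeled modules is empty; return one record per iteration."""
    rng = random.Random(seed)
    labeled, pool, history = list(train), list(pool), []
    while pool:
        # Train on everything labeled so far and score all still-unlabeled modules.
        probs = fit_predict(labeled, pool)
        if strategy == "uncertainty":
            picked = sorted(range(len(pool)), key=lambda i: abs(probs[i] - 0.5))[:k]
        else:
            picked = rng.sample(range(len(pool)), min(k, len(pool)))
        history.append({"pool": list(pool), "probs": probs,
                        "queried": [pool[i] for i in picked]})
        # Ask the oracle for true labels and move the queried modules to the training set.
        for i in sorted(picked, reverse=True):
            x = pool.pop(i)
            labeled.append((x, oracle(x)))
    return history

# Hypothetical data: one numeric feature per module, label 1 means defective.
train = [(1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1)]   # previous release
pool = [0.5, 3.0, 4.8, 5.5, 7.0, 9.5, 1.5, 6.1]    # current release, unlabeled
oracle = lambda x: int(x > 5.0)
history = active_learning(train, pool, oracle, centroid_scorer, k=2)
history_rand = active_learning(train, pool, oracle, centroid_scorer, k=2, strategy="random")
```

In this sketch `history[0]` corresponds to the plain supervised baseline, since it is produced before any module of the current release has been labeled.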
Performance measures for active learning can be derived by tracking the predictions, i.e., P(Y_u = 1 | X_u), at each iteration. Following the best practices in [87] and [122], we computed the AUC, Precision, Recall and Accuracy measures. The fault prediction at each iteration reflects the performance of the trained model on all the unlabeled modules of the current release.
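As a sketch of how these four measures can be obtained from the predicted probabilities, the code below computes them in plain Python. The 0.5 classification threshold, the function names and the sample values are assumptions made for illustration; AUC is computed as the fraction of (defective, clean) pairs that are ranked correctly, with ties counted as one half.

```python
def precision_recall_accuracy(y_true, probs, threshold=0.5):
    """Threshold the probabilities and derive Precision, Recall and Accuracy."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, probs):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / len(y_true)
    return precision, recall, accuracy

def auc(y_true, probs):
    """Probability that a random defective module is scored above a random clean one."""
    pos = [p for y, p in zip(y_true, probs) if y == 1]
    neg = [p for y, p in zip(y_true, probs) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one module of each class")
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Illustrative labels and predictions for six unlabeled modules.
y_true = [1, 0, 1, 0, 0, 1]
probs = [0.9, 0.2, 0.4, 0.6, 0.1, 0.7]
precision, recall, accuracy = precision_recall_accuracy(y_true, probs)
auc_value = auc(y_true, probs)
```

Unlike Precision, Recall and Accuracy, AUC does not depend on the threshold, which makes it convenient for comparing iterations whose defect rates differ.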