In this chapter, we used a case-control study, for the first time in the field of software engineering, to study software fault proneness. We presented a detailed methodology for building a model by eliminating unnecessary interactions and confounders. The second contribution of this chapter was including in the models the interactions between confounders, in addition to the individual confounders, and examining how they affected software fault proneness. These interactions were not considered in any related work using explanatory studies [3, 4, 5, 6, 7, 8, 9]. The results showed that the interactions had significant effects on software fault proneness.
Further, the replicated study showed consistent results for Bugfixes (i.e., prerelease bugs), Developers, and the interaction between them. Some confounders, such as Age and Code churn, were not included for the Apache projects due to high correlation. Some static code confounders, such as Complexity AMC and Number of Public Methods (NPM), showed a slightly higher effect in the Apache projects than in the Eclipse datasets.
The results identified confounders that can be used to explain Postrelease bugs. Specifically, Bugfixes, Developers, and Age had the highest odds ratios (OR). These were seen more in cases than in controls, which means they contributed more than other confounders to software fault proneness. The results for Bugfixes and Developers were consistent across all Eclipse and Apache projects. The results for Age were consistent only in Europa and Ganymede. The strongest consistent interaction was between Bugfixes and Developers. The interaction between Complexity and Bugfixes was also consistent among the Apache projects. The interaction between Bugfixes and Age was consistent in the Eclipse projects but not in the Apache projects.
The results of this study are informative and explain the impact of confounders and interactions on software fault proneness. Our future work will apply the case-control methodology to other software projects to further explore its generalizability. We will also explore using case-control studies on software development effort. Extracting more confounders is another possible direction for our future research.
Chapter 4
Software Fault Proneness Prediction
This chapter focuses on the prediction part of software fault proneness. We employed the algorithm used in the previous chapter (i.e., conditional logistic regression) and the models obtained from the case-control study. We tested the prediction performance of these models using the performance metrics explained in Section 4.2. We used the same matched samples created in the previous chapter for the case-control models using conditional logistic regression (CLR) and for five other classifiers widely used in the area of software fault proneness: logistic regression (LR), naive Bayes (NB), decision tree (J48), random forest (RF), and decision list (PART). In addition, we applied the group lasso regression (G-Lasso) algorithm, which has not been used previously for software fault proneness prediction.
This chapter starts with the introduction and motivation for the prediction work in Section 4.1. The approach of this study, including a brief explanation of the algorithms used, the statistical analysis applied, and the performance metrics used, is discussed in Section 4.2. A brief explanation of the datasets and the main features used is given in Section 4.3, followed by a discussion of the results in Section 4.4. Threats to validity for the prediction research are discussed in Section 4.5, and the chapter is concluded in Section 4.6.
4.1 Introduction and Motivation
Software faults are problematic when they are not detected and fixed early because they may cause the software to fail to perform its required function. Therefore, predicting fault-prone software units (i.e., files) is essential because it helps fix faults before the product is deployed to end users. The sooner software faults are detected, the lower the software development cost and effort.
Many studies have focused on predicting fault-prone software units using different types of metrics, data sets, and machine learning algorithms [34, 162]. While explanatory studies address questions like 'what' and 'how', prediction studies address questions like 'where' and 'when'. Prediction studies use confounders to predict what will happen to a software unit in the future, that is, whether it will be faulty or not.
Many classifiers (e.g., LR, NB, J48) have been used to predict fault proneness on many software projects (e.g., Eclipse, Apache), applying different types of metrics (e.g., static code, change metrics). Using different classifiers is essential because not all classifiers perform at the same level. To assess prediction performance, we use performance metrics such as recall (i.e., the proportion of true positives among all actual positives) and precision (i.e., the proportion of true positives among all predicted positives). Performance metrics in this context are different from the features, variables, metrics, and confounders, which are synonymous and used interchangeably in this study.
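The definitions of recall and precision given above can be written out as a minimal sketch. The function names and the counts below are hypothetical and chosen only for illustration; they are not results from this study.

```python
def recall(tp: int, fn: int) -> float:
    """Proportion of actually faulty files that were predicted faulty."""
    actual_positives = tp + fn
    return tp / actual_positives if actual_positives else 0.0


def precision(tp: int, fp: int) -> float:
    """Proportion of files predicted faulty that are actually faulty."""
    predicted_positives = tp + fp
    return tp / predicted_positives if predicted_positives else 0.0


# Hypothetical confusion-matrix counts (illustrative only).
tp, fp, fn = 40, 10, 20
print(recall(tp, fn))     # 40 / (40 + 20)
print(precision(tp, fp))  # 40 / (40 + 10)
```

Both functions return 0.0 when the denominator is empty, which is one common convention; other conventions (e.g., reporting the value as undefined) are equally valid.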
Many approaches have sought to improve prediction performance through the selection of classifiers, the selection of features, improving the distribution of the response variable, or improving the distribution of the independent variables. Classifiers differ in terms of their performance. Therefore, many classifiers have been used in the area of software fault proneness. Some provided fair performance, and others were very robust. Further, different types of features (i.e., metrics) were used to improve prediction performance. Other studies focused on the type of data and worked on improving its distribution (e.g., resampling, cost-sensitive learning, data transformation) [163, 164, 165, 166, 167, 168, 169]. The samples we used in this chapter were stratified based on the lines of code (LOC), which required selecting fault-free files (from the controls group) with sizes similar to those of the files in the faulty files group (i.e., the cases group). This serves the same purpose as other sampling techniques that were developed to increase the number of minority-class instances and balance the two classes.
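The idea of selecting fault-free files of similar size can be illustrated with a minimal sketch. It assumes a greedy one-to-one nearest-LOC pairing, which is only one possible way to match on size; the actual matching procedure is the one described in Chapter 3 and may differ. The function name, file names, and LOC values are hypothetical.

```python
from typing import Dict, List, Tuple


def match_on_loc(cases: Dict[str, int],
                 controls: Dict[str, int]) -> List[Tuple[str, str]]:
    """Pair each faulty file (case) with the unused fault-free file
    (control) whose LOC is closest. Greedy, without replacement."""
    available = dict(controls)  # copy so the caller's dict is untouched
    pairs = []
    # Process cases from smallest to largest for a deterministic result.
    for case_file, case_loc in sorted(cases.items(), key=lambda kv: kv[1]):
        if not available:
            break  # no controls left to match
        best = min(available,
                   key=lambda name: (abs(available[name] - case_loc), name))
        pairs.append((case_file, best))
        del available[best]
    return pairs


# Hypothetical files and sizes (illustrative only).
cases = {"A.java": 120, "B.java": 900}
controls = {"C.java": 100, "D.java": 950, "E.java": 5000}
pairs = match_on_loc(cases, controls)
print(pairs)
```

Because every case contributes exactly one control, the matched sample is balanced by construction, which is the property the paragraph above compares to other sampling techniques.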
In this chapter, we measure the prediction performance of the explanatory models that were built in Chapter 3, with the goal of finding out whether they are as useful for prediction as they were for explanation. We built the explanatory models using Eclipse and Apache projects based on a case-control methodology. These models did not use the whole set of confounders because we eliminated confounders based on the results of the correlation test, as discussed in Chapter 3. The process started with the group of confounders and their interactions; insignificant interactions and confounders were then eliminated using the backward hierarchical approach. The final model contained only significant interactions and confounders and was much smaller than the initial model. For instance, for Europa we started with six confounders and 15 interactions, and the final model had three metrics and three interactions.
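The figure of 15 interactions for six confounders is consistent with taking every pairwise (two-way) interaction, since C(6, 2) = 15. A minimal sketch of that enumeration follows; the placeholder names X1 to X6 are hypothetical stand-ins, not the actual Europa confounders.

```python
from itertools import combinations
from math import comb

# Placeholder names for six confounders (hypothetical).
confounders = ["X1", "X2", "X3", "X4", "X5", "X6"]

# Every unordered pair of distinct confounders is one two-way interaction.
interactions = list(combinations(confounders, 2))

print(len(interactions))   # number of pairwise interaction terms
print(interactions[:3])    # first few pairs
assert len(interactions) == comb(len(confounders), 2)
```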
Further, we used the same matched samples on other widely used classifiers to compare their performance with the performance of our models.
The second contribution of this chapter is applying, for the first time in software fault proneness prediction, an algorithm that performs variable shrinkage and selection using lasso (least absolute shrinkage and selection operator). The method eliminates unnecessary metrics from the model by assigning a penalty (i.e., λ) to regularize the model, which results in a sparse model (i.e., a model with fewer metrics). The coefficients of some metrics are reduced to very low values (i.e., shrinkage), and other metrics are eliminated (i.e., their coefficients become zero). In the case-control studies, we eliminated variables based on the correlation test, goodness of fit, and significance level in the model. With lasso regression, this can be automated by assigning the penalty, which is estimated through a k-fold cross-validation of the whole dataset. This method is well suited to prediction because it takes less time and effort. However, it cannot handle interactions as conditional logistic regression can. The algorithm we applied is G-Lasso, which is an extension of the linear lasso regression. G-Lasso is designed to fit the binary format of our response variable (i.e., fault-prone and fault-free files).
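For intuition about how the penalty λ shrinks some coefficients and sets others exactly to zero, the following is a minimal sketch of the soft-thresholding operator. This operator gives the lasso solution in the special case of a linear model with orthonormal predictors (with a suitable scaling of λ). It is not the G-Lasso procedure used in this chapter, which fits a model for a binary response iteratively; the metric names, coefficient values, and λ are hypothetical.

```python
def soft_threshold(coef: float, lam: float) -> float:
    """Move coef toward zero by lam; return exactly 0.0 if |coef| <= lam."""
    if coef > lam:
        return coef - lam
    if coef < -lam:
        return coef + lam
    return 0.0


# Hypothetical unpenalized coefficients (illustrative only).
unpenalized = {"metric_a": 1.50, "metric_b": -0.40, "metric_c": 0.05}
lam = 0.25

penalized = {name: soft_threshold(b, lam) for name, b in unpenalized.items()}

# Metrics whose coefficient survives the penalty stay in the sparse model.
selected = [name for name, b in penalized.items() if b != 0.0]
print(penalized)
print(selected)
```

Larger values of λ remove more metrics; in practice λ is chosen by cross-validation, as described above.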
Specifically, we measured the performance of conditional logistic regression models that were built using a case-control method, accounted for matching, and involved interactions. Further, we measured the performance of G-Lasso. We then compared their performance with that of five other machine learning algorithms (i.e., LR, NB, J48, PART, and RF) on 12 releases from the Eclipse and Apache projects at the file level. The following performance metrics were used for comparison: area under the curve (AUC), recall, precision, false positive rate (FPR), F-score (the harmonic mean of recall and precision), and G-score (the harmonic mean of recall and 1-FPR). We also applied statistical tests to compare the differences among all performance metrics of all classifiers. The research questions we address in this chapter can be summarized as follows:
• RQ1: Does CLR perform better than other classifiers?
• RQ2: How does G-Lasso perform compared to other classifiers?
• RQ3: What is the ranking of the classifiers in terms of the performance measures (i.e., recall, precision, FPR, G-score, F-score, and AUC)?
• RQ4: Does the dataset affect the prediction performance of the CLR or the prediction performance of G-Lasso?
• RQ5: Does the CLR using reduced models with interactions (i.e., achieved by the case-control methodology) perform better than other algorithms used in related studies?
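The F-score and G-score referred to in RQ3 are both harmonic means, as defined before the research questions. A minimal sketch of the two computations follows; the function names and the input rates are hypothetical and are not results from this study.

```python
from statistics import harmonic_mean


def f_score(recall: float, precision: float) -> float:
    """Harmonic mean of recall and precision."""
    return harmonic_mean([recall, precision])


def g_score(recall: float, fpr: float) -> float:
    """Harmonic mean of recall and (1 - FPR)."""
    return harmonic_mean([recall, 1.0 - fpr])


# Hypothetical rates (illustrative only).
print(f_score(recall=0.8, precision=0.5))  # 2 * 0.8 * 0.5 / (0.8 + 0.5)
print(g_score(recall=0.8, fpr=0.2))        # harmonic mean of 0.8 and 0.8
```

Because a harmonic mean is dominated by the smaller of its two inputs, a classifier scores well on either measure only when both components are high.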