Gradient Tree Boosting Approach for Software Defect Prediction


K. Eswara Rao¹,*, G. Appa Rao², S. Anuradha³

¹Dept. of CSE, Aditya Institute of Technology and Management, Tekkali, AP

²,³Dept. of CSE, GITAM University, Visakhapatnam, AP, INDIA-530045.

E-mail: [email protected] , [email protected] [email protected]

Abstract

Predicting the actual defects in software remains a challenging task for software developers. Any deviation in counting or estimating the number of defects may lead to serious problems such as unexpected outcomes. In particular, defect-related factors such as time, cost, and effort have to be estimated effectively in the initial phase of software development.

Several early data mining approaches were developed for quality analysis and defect prediction. For large-scale software development, however, these techniques do not perform well because of the highly nonlinear nature of the data. This paper proposes a novel gradient boosting based machine learning approach for the effective prediction of software defects. The proposed method has been analyzed with respect to various performance-related factors and found to be superior to nine competing machine learning approaches.

Keywords: software defect prediction, gradient boosting, machine learning, software quality

1. Introduction

Information technology (IT) is one of the most advanced and active areas of research. Advances in IT have led to rapid growth in the development of new software and, with it, in software scale. Demand for software production is high, and the scale of software development is increasing day by day. This rapid growth may result in software defects (SD). An SD is an error or bug in software that affects its functionality and leads to failures and unreliable outcomes. Most software defects originate in the source code, and a single SD may crash the entire software system. Predicting SDs is therefore an important concern during software system production (SSP). Software defect prediction (SDP) techniques are used to detect SDs. SDP methods help in allocating additional resources to the most complex problems and in assuring quality, which has made SDP a hot topic in software engineering (SE). The key aim of SDP is to enhance the quality as well as the performance of the software. Two types of strategies are used for predicting software defects: static and dynamic defect prediction. Static defect prediction (STDP) focuses mainly on the complexity and size of the software, which makes error reduction easier. Dynamic defect prediction (DDP) focuses on the software lifecycle and identifies errors using standard models such as the Rayleigh distribution model (Qian et al., 2010), Hydra (Xia et al., 2016), etc.

SDP is affected by several factors. The relationship between defects and attributes, the absence of benchmark measures, and class imbalance are some of the major ones (Arora et al., 2015). Faulty attributes are another factor that is difficult to recognize; to address this problem, Jiang et al. (2007) experimented with data taken from the NASA MDP (Metrics Data Program). Koru & Liu (2005) developed a prediction model that uses static measures as variables and resolved the faulty attribute problem. Some researchers recommend using product metrics such as Halstead's complexity (Halstead, 1975) and McCabe's cyclomatic complexity (McCabe, 1976). Various code size measures have also been recommended for predicting problematic modules (Nagappan et al., 2006; Challagulla et al., 2006). Another factor that needs to be addressed is the lack of local data (LLD) available to the risk management (RM) team of an organization. To cope with LLD problems, Turhan et al. (2009) developed an SDP model using within-company (WC) data. SDP metrics play a major role in establishing statistical prediction techniques (Nam, 2014).

SDP metrics fall into two categories: code metrics, which are collected directly from existing source code, and process metrics, which are gathered from the historical data held in software repositories.

Other significant SDP metrics include source code metrics (Menzies et al., 2006), authorship (Rahman & Devanbu, 2011), process metrics such as churn (Nagappan & Ball, 2005), change (Moser et al., 2008), entropy (Hassan, 2009), popularity (Bacchelli et al., 2010), and ownership (Bird et al., 2011), network measures (Meneely et al., 2008), and anti-patterns (Taba et al., 2013). Using these metrics, SDP can be applied to allocate resources for inspection and testing (Menzies et al., 2006), to assess cost-effectiveness (58), and to select or prioritize test cases (Rahman et al., 2012).

Undoubtedly, SDP is a challenging problem in computer science. To address it, many techniques, tools, and models have been proposed in domains such as data mining (DM), fuzzy logic (FL), and machine learning (ML). FL plays an important role in SDP and has been applied to early defect prediction using software size metrics (Yadav et al., 2012). DM also plays a crucial role, and several DM techniques have been applied to SDP. Among these fields, ML has been identified as one of the benchmark methodologies because of its wide use in resolving SDP problems. ML algorithms such as Bayesian algorithms (BA), decision trees (DTs), clustering algorithms (CAs), artificial neural networks (ANNs), and ensemble learning algorithms (ELA) have been used for SDP (Hassan et al., 2018).

Logistic regression (LR), naive Bayes (NB), random forest (RF), k-nearest neighbor (kNN), support vector machine (SVM), and artificial neural networks (ANN) are some of the algorithms used by SDP models (Raukas, 2017). These approaches have been identified as suitable for the SDP problem, and the studies indicate that ML is more effective at resolving SDP problems than other state-of-the-art approaches.

From the literature, it is evident that there are two types of approaches for predicting software defects: (a) data level and (b) algorithm level. Many studies report that these methods are less effective for the quality development of software products, and that machine learning approaches are more efficient at improving effectiveness and quality. Inspired by this, a novel gradient tree boosting approach is developed for the prediction of software defects. Because it is built on decision trees, the gradient tree boosting approach trains models quickly and performs well. It trains the model in a gradual, additive, and sequential way.

Moreover, the gradient boosting approach can optimize a user-specified cost function rather than a fixed loss function. The remainder of the paper is organized as follows. Section 2 surveys some important related literature. Section 3 presents the proposed method for defect prediction. The experimental setup, information about the datasets, and the result analysis are given in Section 4. Section 5 concludes the work with some possible future directions.

2. Literature Survey

SDP is a necessary element in analyzing the quality of software. Because many papers on SDP using machine learning approaches exist in the literature, only a selection of them is considered in this survey. Xu et al. (2019) combined two standard techniques, KPCA (Kernel Principal Component Analysis) and WELM (Weighted Extreme Learning Machine), to solve the SDP problem. The authors named the methodology KPWE. A GRBF (Gaussian Radial Basis Function) was incorporated into KPCA as the kernel function during the feature extraction phase. NASA datasets were used for performance evaluation, and the proposed method yielded better accuracy for SDP. Yalciner & Ozdes (2019) made an empirical analysis of the significance of machine learning algorithms for the SDP problem. Standard NASA datasets were taken from the publicly available PROMISE repository and used for performance evaluation. Accuracy, precision, etc. were considered as performance metrics, and distinct algorithms (MLP, RBF, SVM, Bagging, etc.) were considered for estimating SDs. The authors claimed that all the considered techniques yield good accuracy for SDP, and that bagging yields the best accuracy among them.

Kakkar et al. (2019) developed a strategy for handling missing values in the data used for SDP. Several versions of the ant dataset were used to evaluate the model, and standard methods from the literature such as SVM, kNN, etc. were considered. RMSE and RR were used as performance factors. Based on their experiments, the authors claimed that LR (Linear Regression) combined with CBF (Correlation Based Feature) was the best combination for constructing SDP models.

Miholca et al. (2018) hybridized two benchmark methods from the literature, GRAR (Gradual Relational Association Rules) and ANN, for SDP. The proposed classifier was named HyGRAR. To estimate its performance, standard datasets such as Ar1, Ar3, etc. were used with leave-one-out (LOO) cross-validation. The classifier achieved higher classification accuracy for SDP. Shao et al. (2018) proposed ACAR, an atomic class-association rule method for SDP. Redundancy pruning was incorporated into the proposed method, and fifteen standard datasets were considered for performance evaluation. A comparison was made with state-of-the-art learners such as SVM, CBA2 (Classification Based on Associations 2), DT (Decision Tree), etc. ACAR was found to perform better than CBA2 in terms of balance, etc.

Yang et al. (2018) proposed a model for SDP based on Fourier learning. A NASA dataset was used, and performance was estimated with cross-fold validation. A comparison was made with standard algorithms such as kNN, RF, etc. The authors claimed that the FLA (Fourier Learning Algorithm) yields better and more stable performance than the compared techniques. Arar and Ayan (2017) developed the FDNB (Feature Dependent Naive Bayes) methodology, supported by preprocessing, to solve the SDP problem. To evaluate the efficacy of the model, the standard NASA PROMISE datasets CM1, PC1, KC3, etc. were used. FDNB was compared with the benchmark NB method, and the authors claimed that their method performs better on SDP than the compared one. Andreou & Chatzis (2016) proposed a method for SDP based on SBN (Stochastic Belief Networks) and named it SBPRN (Stochastic Belief Poisson Regression Network). For the regression of the data, the authors established a hierarchical BM (Bayesian Model) within the proposed model. DSPP (Doubly Stochastic Poisson Processes) was used in the formulation, and backpropagation was used to estimate the parameters. The strategy was compared with standard techniques from the literature such as NN (Neural Network), etc., and a better prediction rate for SDP was found with the proposed model. Arar and Ayan (2015) developed a classification method that hybridizes ANN (Artificial Neural Network) and ABC (Artificial Bee Colony). The authors used ABC to find the optimal weights of the ANN. Standard NASA datasets (KC1, KC2, etc.) were applied to the proposed method, and a comparison was made with benchmark techniques such as Naive Bayes (NB), random forest (RF), etc. Based on the experiments, the proposed classifier achieved higher performance in predicting software defects. Czibula et al. (2014) developed a classification model for SDP based on RARM (Relational Association Rule Mining). Standard NASA datasets were used to evaluate the proposed model. Mean, standard deviation, and absolute correlation were used as performance metrics, and a comparison was made with benchmark methods such as CBA2. The proposed model achieved better accuracy in predicting SDs than the compared techniques. Some other machine learning approaches used for the SDP problem are listed in table 1.

Table 1. SDP using other Machine Learning Approaches

| S.no | Method used for the study | Application Area | Problem Type | Compared method | Performance Factor | Ref |
|---|---|---|---|---|---|---|
| 1 | SPFCNN | High dimensionality Reduction | SDP | LSTM, DSNN, DBN, RNN etc | PD, PF etc | Zhao et al. (2019) |
| 2 | PMOFES | Feature Selection | SDP | NB, LR, KNN | Complexity, Count | Ni et al. (2019) |
| 3 | MULTI based on NSGA-II | Multi-objective Optimization | JIT-SDP | RSA, GA | ACC, P_OPT | Chen et al. (2018) |
| 4 | Nonlinear MDM | Feature Selection | SDP | BBN, NB etc | Accuracy, F-Measure | Ghosh et al. (2018) |
| 5 | CoMOGP | Classification | SDP | MOGP, msMOGP | Fitness Function, population size etc | Mauša & Grbac (2017) |
| 6 | rejoELM, IrejoELM | Classification | SDP | rejoNB, rejoRBF | NOC, WMC etc | Mesquita et al. (2016) |
| 7 | DTB | NSR | CCDP | TNB, NB | PD, PF | Chen et al. (2015) |
| 8 | Enhanced APE | Defect Classification | SDP | W-SVM | Decision Count, Condition Count etc | Laradji et al. (2015) |
| 9 | NB classifier | Enhancing recall rate using AM | SDP | - | Precision, accuracy | Rana et al. (2015) |
| 10 | CSForest | Classification | SDP | SVM, CSTree | Cost Metrics (True Positive, False Negative etc) | Siers & Islam (2015) |


3. Proposed Method

The proposed method is based on the Gradient Boosting (Mason et al., 1999) machine learning approach. Let $D = \{X_1, X_2, \ldots, X_n\}$ be the training set, where $X_i = (x_i, y_i)$ is the $i^{th}$ input vector with input observation $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,d})$ and associated class label $y_i$. Let $x = \{x_1, x_2, \ldots, x_n\}$ be the observation vectors in D without class labels. The dataset D has 10885 instances and 22 attributes: five LOC measures, three McCabe metrics, four base Halstead measures, eight derived Halstead measures, one branch count, and one target field. It has no missing attributes. More details about the attributes are given in table 2. The dataset considered in this paper is JM1 (software defect prediction), which is publicly available in the PROMISE repository (Sayyad et al., 2005). In the proposed algorithm design, $x_i$ is composed of the values of attributes 1 to 20, and $y_i \in \{0, 1\}$ is the target attribute, i.e. number 22 ("defects").

Table 2. Attribute Information of the JM1 dataset

| Sl No | Attribute Name | Information | Type of attribute |
|---|---|---|---|
| 1 | loc | McCabe's line count of code | numeric |
| 2 | v(g) | McCabe "cyclomatic complexity" | numeric |
| 3 | ev(g) | McCabe "essential complexity" | numeric |
| 4 | iv(g) | McCabe "design complexity" | numeric |
| 5 | n | Halstead total operators + operands | numeric |
| 6 | v | Halstead "volume" | any (default) |
| 7 | l | Halstead "program length" | numeric |
| 8 | d | Halstead "difficulty" | any (default) |
| 9 | i | Halstead "intelligence" | any (default) |
| 10 | e | Halstead "effort" | any (default) |
| 11 | b | Halstead | numeric |
| 12 | t | Halstead's time estimator | any (default) |
| 13 | lOCode | Halstead's line count | numeric |
| 14 | lOComment | Halstead's count of lines of comments | numeric |
| 15 | lOBlank | Halstead's count of blank lines | numeric |
| 16 | lOCodeAndComment | comments | numeric |
| 17 | uniq_Op | unique operators | numeric |
| 18 | uniq_Opnd | unique operands | numeric |
| 19 | total_Op | total operators | numeric |
| 20 | total_Opnd | total operands | numeric |
| 21 | defects | reported defects | boolean (default) |


Proposed Software Defect Prediction Method

1. Initialize $G_0(x) = \arg\min_{\hat{y}} \sum_{i=1}^{n} L(y_i, \hat{y}_i)$, where $\hat{y}_i = DT(x_i)$ is the prediction of $y_i$ from the decision tree on the input $x_i$.

2. Repeat for $m = 1$ to $M$

2.1. Compute $r_{im} = -\left[ \frac{\partial L(y_i, G(x_i))}{\partial G(x_i)} \right]_{G = G_{m-1}}$, for $i = 1$ to $n$.

2.2. Compute $R_{jm} = DT_{Reg}(x_i, r_{im})$, for $i = 1$ to $n$, where $DT_{Reg}(x_i, r_{im})$ is the fitting of a regression tree to the targets $r_{im}$, resulting in the terminal regions $R_{jm}$, $j = 1$ to $J_m$.

2.3. Find $\hat{y}_{jm} = \arg\min_{\hat{y}} \sum_{x_i \in R_{jm}} L(y_i, G_{m-1}(x_i) + \hat{y})$, for $j = 1$ to $J_m$.

2.4. Update $G_m(x) = G_{m-1}(x) + \sum_{j=1}^{J_m} \hat{y}_{jm} I(x \in R_{jm})$.

3. Return $\hat{G}(x) = G_M(x)$

The model G is initialized with $\hat{y}_i$ such that $\sum_{i=1}^{n} L(y_i, \hat{y}_i)$ is minimum, where $\hat{y}_i$ is the prediction of the decision tree on the input $x_i$, i.e. $\hat{y}_i = DT(x_i)$. In the above steps of the algorithm, the gradient (pseudo-residual) $r_{im}$ is computed from the loss function $L(y_i, G(x_i))$ for all $x_i$, where $L(\cdot)$ is a differentiable loss function. The base learner, a decision tree regressor $DT_{Reg}(x_i, r_{im})$, is fitted to $r_{im}$ for all $x_i$ in $x$, which results in the terminal regions $R_{jm}$. The optimal $\hat{y}_{jm}$ is obtained by solving the optimization problem $\arg\min_{\hat{y}} \sum_{x_i \in R_{jm}} L(y_i, G_{m-1}(x_i) + \hat{y})$. Finally, the model is updated: $G_m$ is obtained from $G_{m-1}$, $\hat{y}_{jm}$, and the indicator function $I(\cdot)$.
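The boosting steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: it assumes the logistic (log) loss, uses depth-1 regression trees (stumps) as the base learner, initializes the model with the constant log-odds rather than a decision tree prediction, and adds a `learning_rate` shrinkage factor. All function names and the toy data are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, r):
    """Fit a depth-1 regression tree to the residuals r (squared-error split).
    Assumes at least one feature takes two or more distinct values."""
    best = None  # (sse, feature index, threshold)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            sse = 0.0
            for keep_left in (True, False):
                part = [ri for row, ri in zip(X, r) if (row[f] <= thr) == keep_left]
                mean = sum(part) / len(part)
                sse += sum((ri - mean) ** 2 for ri in part)
            if best is None or sse < best[0]:
                best = (sse, f, thr)
    return best[1], best[2]

def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Gradient tree boosting for binary labels (0/1) under log loss.
    Assumes both classes are present in y."""
    pos_rate = sum(y) / len(y)
    f0 = math.log(pos_rate / (1.0 - pos_rate))  # assumed constant initial model
    scores = [f0] * len(y)
    stumps = []
    for _ in range(n_rounds):
        probs = [sigmoid(s) for s in scores]
        # Step 2.1: pseudo-residuals; for log loss the negative gradient is y - p
        residuals = [yi - pi for yi, pi in zip(y, probs)]
        # Step 2.2: fit a regression tree to the residuals (two terminal regions)
        feature, thr = fit_stump(X, residuals)
        # Step 2.3: one Newton step per terminal region approximates the arg min
        leaf_values = {}
        for side in (True, False):
            idx = [i for i, row in enumerate(X) if (row[feature] <= thr) == side]
            num = sum(residuals[i] for i in idx)
            den = sum(probs[i] * (1.0 - probs[i]) for i in idx)
            leaf_values[side] = num / den if den > 0 else 0.0
        # Step 2.4: additive update of the model on the training points
        for i, row in enumerate(X):
            scores[i] += learning_rate * leaf_values[row[feature] <= thr]
        stumps.append((feature, thr, leaf_values))
    return f0, stumps, learning_rate

def predict(model, row):
    f0, stumps, lr = model
    score = f0 + sum(lr * leaves[row[f] <= thr] for f, thr, leaves in stumps)
    return 1 if sigmoid(score) >= 0.5 else 0

# Toy one-feature data (hypothetical): modules with larger values are defective.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
model = fit_gradient_boosting(X, y)
predictions = [predict(model, row) for row in X]
```

Each round adds one small tree fitted to the current residuals, which is the gradual, additive, and sequential behaviour the method relies on. A practical run on JM1 would use a library implementation with deeper trees rather than this sketch.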

4. Experimental Set Up and Result Analysis

This section describes the simulation environment, the system and parameter setup, and the dataset information, followed by the result analysis.


4.1 Simulation Environment, System and Parameter Setup

The experiment was conducted on an HP ProDesk 600 G2 MT desktop running Windows 10 Pro 64-bit (10.0, Build 17134) (17134.rs4_release.180410-1804), with an Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz (8 CPUs), ~3.4GHz, and 4096MB of RAM. The Pandas and NumPy frameworks were used for data handling, the Matplotlib framework for data visualization, and the sklearn framework with its classification metrics for data analysis. The parameter setup is given in table 3.

Table 3. Parameter Setup

| Classifiers | Parameter Setup |
|---|---|
| SGD Classifier | {loss='modified_huber', shuffle=True, random_state=3} |
| Random Forest | {n_estimators=100} |
| Naive Bayes | {random_state=3} |
| LR | {random_state=3} |
| KNN | {n_neighbors=3} |
| Decision Tree | {max_depth=1, min_samples_leaf=1} |
| LDA | {random_state=3} |
| MLP, QDA | {random_state=3} |
| Proposed Gradient Boosting | {n_estimators=8000, random_state=3} |
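As a rough illustration, the parameter setup in table 3 could be expressed with scikit-learn estimators as sketched below. The mapping from the table's names to classes is an assumption (for example, `GaussianNB` for Naive Bayes and `LogisticRegression` for LR), and the dictionary name `classifiers` is hypothetical. In scikit-learn, `GaussianNB`, `LinearDiscriminantAnalysis`, and `QuadraticDiscriminantAnalysis` do not accept a `random_state` argument, so it is omitted for those three here.

```python
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumed mapping of the table's classifier names to scikit-learn classes.
classifiers = {
    "SGD": SGDClassifier(loss="modified_huber", shuffle=True, random_state=3),
    "RF": RandomForestClassifier(n_estimators=100),
    "NB": GaussianNB(),                      # no random_state parameter exists
    "LR": LogisticRegression(random_state=3),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "DT": DecisionTreeClassifier(max_depth=1, min_samples_leaf=1),
    "LDA": LinearDiscriminantAnalysis(),     # no random_state parameter exists
    "MLP": MLPClassifier(random_state=3),
    "QDA": QuadraticDiscriminantAnalysis(),  # no random_state parameter exists
    "Proposed Gradient Boosting": GradientBoostingClassifier(
        n_estimators=8000, random_state=3
    ),
}
```

Each estimator can then be trained with `fit(X_train, y_train)` and evaluated with `predict(X_test)`; with 8000 boosting rounds, training the proposed model takes noticeably longer than training the other classifiers.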

4.2 Result Analysis

In this paper, ten machine learning approaches are applied to software defect prediction. The proposed Gradient Boosting method is compared with nine other methods: DT, SGD, RF, NB, LR, KNN, LDA, MLP, and QDA. Fig 2 presents the training and testing error analysis of all classifiers along with the Gradient Boosting method. The authors conclude from it that the proposed method has a lower error rate in both training and testing than the other considered approaches. Table 4 gives the results of all the methods in terms of several performance indicators: accuracy, TP, FP, TN, FN, TPR, FPR, Precision, TNR, F1, and ROC-AUC.

Among the nine compared methods, RF (92.30) and DT (91.77) have the highest accuracy. The accuracy of the proposed method is 93.12, which exceeds that of all the other approaches. Similarly, the TP count of the proposed method is 1572, which is the highest among all methods: SGD (97), RF (1406), NB (352), LR (181), KNN (843), DT (1431), LDA (266), MLP (55), and QDA (419). For the other indicators as well, the performance of the proposed Gradient Tree Boosting method is encouraging for the effective prediction of software defects.
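The derived indicators in table 4 follow directly from the four confusion-matrix counts. The short sketch below recomputes them for the proposed method from its reported TP, FP, TN, and FN values; the function name `confusion_metrics` is hypothetical, and the formulas are the standard definitions, which the paper does not spell out.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Standard indicators derived from confusion-matrix counts."""
    precision = tp / (tp + fp)
    tpr = tp / (tp + fn)              # true positive rate (recall)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "tpr": tpr,
        "fpr": fp / (fp + tn),        # false positive rate
        "tnr": tn / (tn + fp),        # true negative rate
        "precision": precision,
        "f1": 2 * precision * tpr / (precision + tpr),
    }

# Counts reported for the proposed Gradient Tree Boosting method in table 4.
proposed = confusion_metrics(tp=1572, fp=214, tn=8565, fn=534)
for name, value in proposed.items():
    print(f"{name}: {value:.4f}")
```

The recomputed values agree with the reported row up to rounding in the last digit, which also serves as a quick consistency check on the other rows of the table.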


Fig 2. Training and Testing Error analysis of all classifiers

Fig 3 presents the ROC curves of all the methods, i.e. a) DT, b) KNN, c) LDA, d) LR, e) MLP, f) NB, g) QDA, h) RF, i) SGD and j) Proposed Gradient Boosting. From the ROC analysis, it can be concluded that the proposed method outperforms the other methods for both classes. The training error and testing error of the proposed method are 0.0392 ± 0.0377 and 0.1938 ± 0.0065 respectively (table 5).
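For readers unfamiliar with the ROC-AUC indicator, the following sketch computes it from labels and classifier scores using its rank interpretation: the probability that a randomly chosen defective module receives a higher score than a randomly chosen non-defective one. The function name `roc_auc` and the four-sample data are hypothetical and unrelated to the JM1 results.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores; ties count as half a win."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical scores: three of the four positive/negative pairs are ordered correctly.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example_auc)  # 0.75
```

A value of 0.5 corresponds to random ranking, which helps in reading the ROC-AUC column of table 4: values near 0.5 indicate a classifier that barely separates the two classes.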

Table 4. Results of all the methods along with Proposed Gradient Tree Boosting

| Models | Accuracy | TP | FP | TN | FN | TPR | FPR | Precision | TNR | F1 | ROC-AUC |
|---|---|---|---|---|---|---|---|---|---|---|---|
| SGD | 80.18 | 97 | 148 | 8631 | 2009 | 0.0460 | 0.0168 | 0.3959 | 0.98 | 0.0825 | 0.51 |
| RF | 92.30 | 1406 | 138 | 8641 | 700 | 0.6676 | 0.0157 | 0.9106 | 0.98 | 0.7704 | 0.82 |
| NB | 80.47 | 352 | 371 | 8408 | 1754 | 0.1671 | 0.0422 | 0.4868 | 0.95 | 0.2488 | 0.56 |
| LR | 80.89 | 181 | 155 | 8624 | 1925 | 0.0859 | 0.0176 | 0.5386 | 0.98 | 0.1482 | 0.53 |
| KNN | 83.09 | 843 | 577 | 8202 | 1263 | 0.4002 | 0.0657 | 0.5936 | 0.93 | 0.4781 | 0.66 |
| DT | 91.77 | 1431 | 430 | 8349 | 475 | 0.7744 | 0.0489 | 0.7913 | 0.95 | 0.7828 | 0.86 |
| LDA | 81.25 | 266 | 200 | 8579 | 1840 | 0.1263 | 0.0227 | 0.5708 | 0.97 | 0.2068 | 0.55 |
| MLP | 55.85 | 55 | 33 | 8746 | 2051 | 0.0261 | 0.0037 | 0.625 | 0.99 | 0.0501 | 0.51 |
| QDA | 80.04 | 419 | 485 | 8294 | 1687 | 0.1989 | 0.0552 | 0.4634 | 0.94 | 0.2784 | 0.57 |
| Proposed Gradient Tree Boosting | 93.12 | 1572 | 214 | 8565 | 534 | 0.7464 | 0.0243 | 0.8801 | 0.97 | 0.8078 | 0.86 |


Fig. 3. ROC Curve Analysis of a) DT, b) KNN, c) LDA, d) LR, e) MLP, f) NB, g) QDA, h) RF, i) SGD and j) Proposed Gradient Boosting

Table 5. Comparison of Training and Testing error

| Models | Train Error | Test Error |
|---|---|---|
| Gradient Boosting | 0.0392 ± 0.0377 | 0.1938 ± 0.0065 |
| SGD | 0.1968 | 0.2180 |
| DT | 0.1925 | 0.1925 |
| KNN | 0.2323 | 0.1325 |
| LDA | 0.1855 | 0.1818 |
| LR | 0.1962 | 0.1889 |
| MLP | 0.4252 | 0.3273 |
| NB | 0.1935 | 0.1944 |
| QDA | 0.2011 | 0.1965 |
| RF | 0.0064 | 0.1772 |

5. Conclusion

Owing to its many advantages and real-world applications, the software industry is growing rapidly. At the same time, the quality of software development is a primary concern for software developers. Every phase of the software life cycle may introduce errors and defects, so failures must be predicted early in the development process to avoid large losses at later stages. The key objective of this paper is to propose a machine learning based approach that effectively handles the issue of software failures or defects. A novel gradient tree boosting approach is developed and found to be more efficient than other standard machine learning approaches at predicting defects in software development. Several performance parameters are considered to test the effectiveness of the proposed method, and the simulation results show that it is more efficient than the others. The accuracy of the proposed approach is 93.12, which is higher than that of the other compared methods.

As future work, other ML approaches such as AdaBoost, XGBoost, etc. may be applied to test for and identify software defects.

References

[1] Andreou, A. S., & Chatzis, S. P. (2016). Software defect prediction using doubly stochastic Poisson processes driven by stochastic belief networks. Journal of Systems and Software, 122, 72-82.

[2] Arar, Ö. F., & Ayan, K. (2017). A feature dependent Naive Bayes approach and its application to the software defect prediction problem. Applied Soft Computing, 59, 197-209.

[3] Arar, Ö. F., & Ayan, K. (2015). Software defect prediction using cost-sensitive neural network. Applied Soft Computing, 33, 263-277.

[4] Arora, I., Tetarwal, V., & Saha, A. (2015). Open issues in software defect prediction. Procedia Computer Science, 46, 906-912.

[5] Bacchelli, A., D'Ambros, M., & Lanza, M. (2010, March). Are popular classes more defect prone? In International Conference on Fundamental Approaches to Software Engineering (pp. 59-73). Springer, Berlin, Heidelberg.

[6] Bird, C., Nagappan, N., Murphy, B., Gall, H., & Devanbu, P. (2011, September). Don't touch my code!: examining the effects of ownership on software quality. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering (pp. 4-14). ACM.

[7] Boetticher, G. (2007). The PROMISE repository of empirical software engineering data. http://promisedata.org/repository.

[8] Challagulla, V. U., Bastani, F. B., & Yen, I. L. (2006, November). A unified framework for defect data analysis using the mbr technique. In 2006 18th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'06) (pp. 39-46). IEEE.

[9] Chen, L., Fang, B., Shang, Z., & Tang, Y. (2015). Negative samples reduction in cross-company software defects prediction. Information and Software Technology, 62, 67-77.

[10] Chen, X., Zhao, Y., Wang, Q., & Yuan, Z. (2018). MULTI: Multi-objective effort-aware just-in-time software defect prediction. Information and Software Technology, 93, 1-13.


[11] Czibula, G., Marian, Z., & Czibula, I. G. (2014). Software defect prediction using relational association rule mining. Information Sciences, 264, 260-278.

[12] El Emam, K., Melo, W., & Machado, J. C. (2001). The prediction of faulty classes using object-oriented design metrics. Journal of Systems and Software, 56(1), 63-75.

[13] Ghosh, S., Rana, A., & Kansal, V. (2018). A Nonlinear Manifold Detection based Model for Software Defect Prediction. Procedia computer science, 132, 581-594.

[14] Halstead, M. H. (1975). Elements of SW Science. Elsevier, North-Holland, 1975.

[15] Hassan, A. E. (2009, May). Predicting faults using the complexity of code changes. In Proceedings of the 31st International Conference on Software Engineering (pp. 78-88). IEEE Computer Society.

[16] Hassan, F., Farhan, S., Fahiem, M. A., & Tauseef, H. (2018). A Review on Machine Learning Techniques for Software Defect Prediction. Technical Journal, 23(02), 63-71.

[17] Jiang, Y., Cukic, B., & Menzies, T. (2007, November). Fault prediction using early lifecycle data. In The 18th IEEE International Symposium on Software Reliability (ISSRE'07) (pp. 237-246). IEEE.

[18] Kakkar, M., Jain, S., Bansal, A., & Grover, P. S. (2019, February). Evaluating Missing Values for Software Defect Prediction. In 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon) (pp. 30-34). IEEE

[19] Koru, A. G., & Liu, H. (2005). Building effective defect-prediction models in practice. IEEE software, 22(6), 23-29.

[20] Laradji, I. H., Alshayeb, M., & Ghouti, L. (2015). Software defect prediction using ensemble learning on selected features. Information and Software Technology, 58, 388-402.

[21] Mason, L., Baxter, J., Bartlett, P. L., & Frean, M. (1999). Boosting algorithms as gradient descent. In S. A. Solla, T. K. Leen, & K. Müller (Eds.), Advances in Neural Information Processing Systems 12 (pp. 512-518). MIT Press.

[22] Mauša, G., & Grbac, T. G. (2017). Co-evolutionary multi-population genetic programming for classification in software defect prediction: An empirical case study. Applied Soft Computing, 55, 331-351.

[23] McCabe, T. J. (1976). A complexity measure. IEEE Transactions on Software Engineering, (4), 308-320.

[24] Meneely, A., Williams, L., Snipes, W., & Osborne, J. (2008, November). Predicting failures with developer networks and social network analysis. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering (pp. 13-23). ACM.

[25] Menzies, T., Greenwald, J., & Frank, A. (2006). Data mining static code attributes to learn defect predictors. IEEE transactions on software engineering, 33(1), 2-13.

[26] Moser, R., Pedrycz, W., & Succi, G. (2008, May). A comparative analysis of the efficiency of change metrics and static code attributes for defect prediction. In Proceedings of the 30th international conference on Software engineering (pp. 181-190). ACM.

[27] Mesquita, D. P., Rocha, L. S., Gomes, J. P. P., & Neto, A. R. R. (2016). Classification with reject option for software defect prediction. Applied Soft Computing, 49, 1085-1093.

[28] Miholca, D. L., Czibula, G., & Czibula, I. G. (2018). A novel approach for software defect prediction through hybridizing gradual relational association rules with artificial neural networks. Information Sciences, 441, 152-170.

[29] Nagappan, N., Ball, T., & Murphy, B. (2006, November). Using historical in-process and product metrics for early estimation of software failures. In 2006 17th International Symposium on Software Reliability Engineering (pp. 62-74). IEEE.

[30] Nagappan, N., & Ball, T. (2005, May). Use of relative code churn measures to predict system defect density. In Proceedings of the 27th international conference on Software engineering (pp. 284-292). ACM.

[31] Nam, J. (2014). Survey on software defect prediction. Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Tech. Rep.

[32] Ni, C., Chen, X., Wu, F., Shen, Y., & Gu, Q. (2019). An empirical study on pareto based multi-objective feature selection for software defect prediction. Journal of Systems and Software, 152, 215-238.

[33] Qian, L., Yao, Q., & Khoshgoftaar, T. M. (2010). Dynamic Two-phase Truncated Rayleigh Model for Release Date Prediction of Software. Journal of Software Engineering and Applications, 3(06), 603.

[34] Rana, Z. A., Mian, M. A., & Shamail, S. (2015). Improving Recall of software defect prediction models using association mining. Knowledge-Based Systems, 90, 1-13.


[35] Rahman, F., & Devanbu, P. (2011, May). Ownership, experience and defects: a fine-grained study of authorship. In Proceedings of the 33rd International Conference on Software Engineering (pp. 491-500). ACM.

[36] Rahman, F., Posnett, D., & Devanbu, P. (2012, November). Recalling the imprecision of cross-project defect prediction. In Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering (p. 61). ACM.

[37] Raukas, H (2017). Some Approaches for Software Defect Prediction.

[38] Sayyad Shirabad, J. and Menzies, T.J. (2005) The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada . Available: http://promise.site.uottawa.ca/SERepository

[39] Shao, Y., Liu, B., Wang, S., & Li, G. (2018). A novel software defect prediction based on atomic class-association rule mining. Expert Systems with Applications, 114, 237-254.

[40] Siers, M. J., & Islam, M. Z. (2015). Software defect prediction using a cost sensitive decision forest and voting, and a potential solution to the class imbalance problem. Information Systems, 51, 62-71.

[41] Taba, S. E. S., Khomh, F., Zou, Y., Hassan, A. E., & Nagappan, M. (2013, September). Predicting bugs using antipatterns. In 2013 IEEE International Conference on Software Maintenance (pp. 270-279). IEEE.

[42] Turhan, B., Menzies, T., Bener, A. B., & Di Stefano, J. (2009). On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering, 14(5), 540-578.

[43] Xia, X., Lo, D., Pan, S. J., Nagappan, N., & Wang, X. (2016). Hydra: Massively compositional model for cross-project defect prediction. IEEE Transactions on software Engineering, 42(10), 977-998.

[44] Xu, Z., Liu, J., Luo, X., Yang, Z., Zhang, Y., Yuan, P., ... & Zhang, T. (2019). Software defect prediction based on kernel PCA and weighted extreme learning machine. Information and Software Technology, 106, 182-200.

[45] Yadav, D. K., Chaturvedi, S. K., & Misra, R. B. (2012). Early software defects prediction using fuzzy logic. International Journal of Performability Engineering, 8(4), 399-408.

[46] Yalçıner, B., & Özdeş, M. (2019, September). Software Defect Estimation Using Machine Learning Algorithms. In 2019 4th International Conference on Computer Science and Engineering (UBMK) (pp. 487-491). IEEE.

[47] Yang, K., Yu, H., Fan, G., Yang, X., Zheng, S., & Leng, C. (2018, December). Software Defect Prediction Based on Fourier Learning. In 2018 IEEE International Conference on Progress in Informatics and Computing (PIC) (pp. 388-392). IEEE.

[48] Zhao, L., Shang, Z., Zhao, L., Zhang, T., & Tang, Y. Y. (2019). Software defect prediction via cost-sensitive Siamese parallel fully-connected neural networks. Neurocomputing, 352, 64-74.


Appendix

AM: Association Mining

APE: Average Probability Ensemble

CCDP: Cross Company Defects Prediction

CoMOGP: Colonization based Multi Objective Genetic Programming

CSForest: Cost Sensitive Forest

CSTree: Cost Sensitive Tree

DTB: Double Transfer Boosting

GA: Genetic Algorithm

IrejoELM: Imbalanced reject option based Extreme Learning Machine

JIT-SDP: Just In Time Software Defect Prediction

MDM: Manifold Detection based Model

MOGP: Multi Objective Genetic Programming

MsMOGP: Multi subpopulation based Multi Objective Genetic Programming

NOC: Number Of Children

NSGA-II: Non dominated Sorting Genetic Algorithm-II

NSR: Negative Samples Reduction

PD: Probability Detection

PF: Probability of False Alarm

PMA: Pareto based Multi Objective Optimization Algorithm

rejoELM: Reject option based Extreme Learning Machine

rejoNB: reject option based Naive Bayes

rejoRBF: reject option based Radial Basis Function

RSA: Random Search Algorithm

SPFCNN: Siamese Parallel Fully-Connected Neural Networks

TNB: Transfer Naive Bayes

WMC: Weighted Method Count

W-SVM: Weighted Support Vector Machine

Authors Biography

K. Eswara Rao received the Bachelor's degree in Engineering (CSE) from GITAM Engineering College, Visakhapatnam, AP, India, in 2005, and the Masters degree in Engineering (CSE-NN) from JNT University, Kakinada, AP, India, in 2009. He is currently working as a Sr. Assistant Professor in Aditya Institute of Technology and Management (AITAM), Tekkali, Srikakulam. His research interests include Data Mining, Data Analytics, Operating System. He has published numerous conference proceedings as well as papers in international journals.

Dr. G. Appa Rao completed his Ph.D. and is currently working as a Professor in the Department of CSE, GITAM Institute of Technology, GITAM University, Visakhapatnam. His research interests include Software Engineering and Wireless Sensor Networks. He has written two books, Microelectronics, Electromagnetics and Telecommunications and Communications in Computer and Information Science. He has published numerous conference proceedings as well as papers in national and international journals.

Dr. S. Anuradha received the Ph.D. degree from JNT University, Kakinada, Andhra Pradesh, India. She is currently working as an Assistant Professor in the Department of CSE, GITAM Institute of Technology, GITAM University, Visakhapatnam. Her research interests include Image Processing and Data Mining. She has written a book, Recent Developments in Intelligent Computing, Communication and Devices, published by Springer. She has published numerous conference proceedings as well as papers in national and international journals.