are not very successful in predicting the reliability of the software. Moreover, such proposed models are not user friendly, and many of them are probabilistic approaches. In addition, failure data are not available in the early phases of the software development life cycle, although some information is available in the form of expert knowledge through certain software metrics. This paper is organized as follows: this section introduces software defect prediction.
Feature subset selection is the process of choosing a subset of good features with respect to the target concept. A clustering-based feature subset selection algorithm has been applied to software defect prediction data sets. The software defect prediction domain was chosen because of the growing importance of maintaining high reliability and high quality in any software being developed. A software quality prediction model is built using software metrics and defect data collected from a previously developed system release or from similar software projects. Once validated, such a model can be used to predict the fault-proneness of program modules that are currently under development. The proposed clustering-based algorithm for feature selection uses a minimum spanning tree (MST) based method to cluster features. The algorithm is then applied to four different data sets and its impact analyzed.
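The MST-based clustering idea described above can be sketched as follows: treat each feature as a graph node, use 1 − |correlation| as the edge distance, build a minimum spanning tree, and cut the long edges to obtain feature clusters. This is a minimal sketch only; the threshold value, helper name, and toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mst_feature_clusters(X, threshold=0.3):
    """Cluster features via a minimum spanning tree over 1 - |correlation|
    distances: build the MST (Prim's algorithm), drop edges longer than
    `threshold`, and read off the connected components."""
    d = X.shape[1]
    dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))

    # Prim's algorithm over the complete feature graph
    in_tree = np.zeros(d, dtype=bool)
    in_tree[0] = True
    best = dist[0].copy()                 # cheapest known edge into each node
    parent = np.zeros(d, dtype=int)
    edges = []
    for _ in range(d - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[j]), j, float(best[j])))
        in_tree[j] = True
        closer = dist[j] < best
        parent[closer] = j
        best = np.where(closer, dist[j], best)

    # cut long edges, then union-find the remaining components
    root = list(range(d))
    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i
    for a, b, w in edges:
        if w <= threshold:
            root[find(a)] = find(b)
    groups = {}
    for i in range(d):
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())

# two nearly duplicate metrics plus one independent metric
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
clusters = mst_feature_clusters(X)
```

On this toy data the two near-duplicate metrics fall into one cluster and the independent metric into another; a representative feature per cluster would then be kept for prediction.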
After generating instances with metrics and labels, we can apply pre-processing techniques that are common in machine learning. The stages used in defect prediction studies include feature selection, dimension reduction, classification, prediction, and finally performance analysis. The flow chart below depicts the entire process of software defect prediction. The historical data, including various software metrics captured from software systems, are divided into two groups: the training data set and the test data set. These data are pre-processed before being fed into the subsequent feature selection and classification algorithms. In the second stage, cost-sensitive feature selection algorithms are applied to the training data to find the optimal features, and thus the dimension can be reduced. The next step is to train the cost-sensitive classification models on the training data set with the selected features.
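The pipeline just described (pre-processing, feature selection, classification) can be sketched with a standard machine learning toolkit. Assuming scikit-learn is available, and using synthetic data and an ordinary (not cost-sensitive) selector and classifier as stand-ins:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# synthetic stand-in for historical defect data (metrics + labels)
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=5)),     # feature selection / dimension reduction
    ("clf", LogisticRegression(max_iter=1000)),  # classification stage
])
pipe.fit(X_tr, y_tr)           # train on the training data set
acc = pipe.score(X_te, y_te)   # performance analysis on the test data set
```

A cost-sensitive variant would replace the selector's scoring function and pass class weights to the classifier, but the train/test split and pipeline structure stay the same.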
With real-time systems becoming more complex and unpredictable, partly due to increasingly sophisticated requirements, traditional software development techniques might face difficulties in satisfying these requirements. Furthermore, real-time software systems may need to adapt themselves dynamically based on runtime mission-specific requirements and operating conditions. This involves dynamic code synthesis that generates modules to provide the functionality required to perform the desired operations in real time. Telecontrol/telepresence, robotics, and mission planning systems are some examples that fall in this category. However, this necessitates a real-time assessment technique that classifies these dynamically generated systems as faulty or fault-free. Some of the benefits of dynamic dependability assessment include providing feedback to the operator to modify the mission objectives if dependability is low, the possibility of masking defects at run-time, and the possibility of proactive dependability management. One approach to achieving this is to use software defect prediction techniques that assess the dependability of these systems using defect metrics that can be measured dynamically.
Association rule mining involves identifying frequent item sets and then deriving implication rules among them. It is the task of finding relationships between items in data sets, and a technique for discovering interesting associations or correlations among sets of items or objects in large databases. Essentially, it is concerned with finding rules that predict the occurrence of an item based on the occurrences of other items [13, 14]. 1) Software defect prediction using association mining: In the association rule mining procedure we use defect-type data to predict software defect associations, that is, the relations among various defect types. The defect associations can be used for three purposes. First, to find as many defects related to the detected defects as possible and make more effective corrections to the software. Second, to evaluate reviewers' results during an inspection. Third, to help supervisors improve the software process by analyzing why certain defects frequently occur together. Association rule mining aims at finding the patterns of co-occurrence of the
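The defect-association idea can be sketched with a tiny Apriori-style miner over per-module sets of defect types. The support/confidence thresholds and the defect-type names below are illustrative assumptions, not data from any study:

```python
from itertools import combinations

def defect_association_rules(transactions, min_support=0.4, min_conf=0.7):
    """Tiny Apriori-style miner over per-module sets of defect types.
    Returns (antecedent, consequent, support, confidence) tuples for
    pairwise rules such as 'interface defect -> logic defect'."""
    n = len(transactions)
    counts = {}
    for t in transactions:
        for size in (1, 2):                      # singletons and pairs suffice here
            for combo in combinations(sorted(t), size):
                counts[combo] = counts.get(combo, 0) + 1
    rules = []
    for itemset, c in counts.items():
        if len(itemset) != 2 or c / n < min_support:
            continue
        for a in itemset:                        # try the rule in each direction
            b = itemset[0] if itemset[1] == a else itemset[1]
            conf = c / counts[(a,)]
            if conf >= min_conf:
                rules.append((a, b, c / n, conf))
    return rules

# hypothetical defect-type sets observed per module
modules = [
    {"interface", "logic"},
    {"interface", "logic"},
    {"interface", "logic", "doc"},
    {"doc"},
    {"interface"},
]
rules = defect_association_rules(modules)
```

On this toy data the miner finds, for example, that every module with a logic defect also has an interface defect, which is exactly the kind of co-occurrence pattern that can guide inspections and corrections.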
Despite the fact that a number of approaches have been proposed for effective and accurate prediction of software defects, most of these have not found widespread applicability. Our objective in this communication is to provide a framework that is expected to be more effective and acceptable for predicting defects in multiple phases across the software development lifecycle. The proposed framework is based on the use of neural networks for predicting defects in the software development life cycle. Further, to facilitate easy use of the framework by project managers, a graphical user interface has been developed that allows input data (including effort and defect data) to be fed in easily for predicting defects. The proposed framework provides a probabilistic defect prediction approach where, instead of a definite number, a defect range (minimum, maximum, and mean) is predicted. The claim of efficacy and superiority of the proposed framework is established through the results of a comparative study involving the proposed framework and some well-known models for software defect prediction.
Common software problems appear in a wide variety of applications and environments. Some of these problems arise during software project development and are known as software defects; software bugs arising in the coding implementation are a major example. Software defect prediction has been one of the key areas of exploration in the domain of software quality. The ability of a model to learn from data that does not come from the same project or organization helps organizations that do not have sufficient training data or are about to start work on new projects. The findings of this research are useful not only to the software engineering domain but also to empirical studies that focus on symmetry, as they provide step-by-step solutions for the questions raised in the article. A typical software development process has several stages, each with its own significance and dependencies on the others. Each stage is often complex and generates a wide variety of data. Using data mining techniques, hidden patterns can be uncovered from this data, which measure the impact of each stage on the others and provide useful information to improve the software development process. The insights gained from the extracted knowledge patterns can help software engineers predict, plan, and comprehend the various intricacies of a project, allowing them to optimize future software development activities. As every stage in the development process entails a certain outcome or goal, it becomes crucial to select the best data mining techniques to achieve these goals efficiently.
In this paper, we introduced a highly usable software defect prediction system. The system was assessed using NASA data, a widely used benchmark dataset. In the mining part of the system, a set of dependencies between hard-to-measure features of the dataset and easy-to-measure ones was discovered. Then, we developed a set of fuzzy modeling systems, each of which estimates the value of one of the hard-to-obtain features from its specified determinants. In this part of the system, we followed Wang and Mendel's fuzzy rule learning method. The estimation systems were evaluated by computing the MSE values for all features. The results showed the high approximation ability of the system. Using this system, the user does not have to measure all the required metrics for any of the modules: all of the hard-to-measure features are automatically estimated with high accuracy. As a future
Abstract – Software Defect Prediction (SDP) provides insights that can help software teams allocate their limited resources when developing software systems. It predicts likely defective modules and helps avoid pitfalls associated with such modules. However, these insights may be inaccurate and unreliable if the parameters of SDP models are not taken into consideration. In this study, the effect of parameter tuning on the k-nearest neighbor (k-NN) method in SDP was investigated; more specifically, the impact of varying and selecting the optimal k value, the influence of distance weighting, and the impact of distance functions on k-NN. An experiment was designed to investigate this problem in SDP over 6 software defect datasets. The experimental results revealed that the k value should be greater than 1 (the default), as the average RMSE value of k-NN when k > 1 (0.2727) is less than when k = 1 (0.3296). In addition, the predictive performance of k-NN with distance weighting improved by 8.82% and 1.7% based on AUC and accuracy respectively. In terms of the distance function, k-NN models based on the Dilca distance function performed better than those based on the Euclidean distance function (the default). Hence, we conclude that parameter tuning has a positive effect on the predictive performance of k-NN in SDP.
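The kind of k-NN parameter tuning studied above can be sketched with a grid search over k and the weighting scheme. This is a minimal illustration assuming scikit-learn and synthetic stand-in data; the Dilca distance function is not available in scikit-learn, so only the default Euclidean metric is exercised here:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in; the study itself used 6 software defect datasets
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# tune the k value and the distance-weighting scheme, scored by AUC
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9],
                "weights": ["uniform", "distance"]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_   # selected k and weighting for this data
```

Inspecting `grid.cv_results_` shows the per-configuration AUC scores, which is how the k > 1 and distance-weighting effects reported above would be observed on real defect data.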
For evaluating and comparing the software defect prediction performance of the various ensemble methods discussed above and a single classifier, we employed the Weka tool, which implements all the algorithms needed in our experiments. These algorithms are Bagging, AdaBoostM1 (the most popular version of boosting), NaïveBayes (a popular single classifier, simple and effective), RandomForest, RandomTree, RandomSubspace, Stacking, and Vote. Bagging, AdaBoostM1, and RandomSubspace are all meta-learners, so we assigned NaïveBayes as the base classifier of these algorithms. We selected four base classifiers for Vote, namely NaïveBayes, Logistic, libSVM, and J48, because these are among the most popular algorithms in the software defect prediction research community. The combinationRule of Vote is the average of probabilities. We also let the level-1 classifiers of Stacking be these four algorithms, and the level-0 classifier be NaïveBayes. In all experiments, we performed 10-fold cross validation, repeated so that we obtained 100 accuracy/AUC values for each algorithm on each dataset; the mean of these 100 values is the average accuracy/average AUC for that algorithm on that dataset.
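An equivalent comparison can be sketched outside Weka. The snippet below, assuming scikit-learn and synthetic stand-in data, evaluates NaïveBayes alone, two meta-learners with NaïveBayes as base classifier, and RandomForest under 10-fold cross validation with AUC scoring; the parameter values are illustrative, not those of the Weka experiments:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# synthetic stand-in for a defect dataset
X, y = make_classification(n_samples=300, n_features=15, random_state=1)

# NaïveBayes as the single classifier and as the base learner of the
# meta-learners, mirroring the Weka setup described above
models = {
    "NaiveBayes": GaussianNB(),
    "Bagging(NB)": BaggingClassifier(GaussianNB(), n_estimators=10, random_state=1),
    "AdaBoost(NB)": AdaBoostClassifier(GaussianNB(), n_estimators=10, random_state=1),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=1),
}
scores = {name: cross_val_score(m, X, y, cv=10, scoring="roc_auc").mean()
          for name, m in models.items()}
```

Each entry of `scores` is the mean AUC over the 10 folds, the same aggregation used in the Weka experiments (there, repeated to yield 100 values per algorithm and dataset).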
To fully utilize the valuable cost information, a two-stage cost-sensitive learning (TSCS) method is used for software defect prediction, in which the cost information is used in both the feature selection stage and the classification stage. The cost-sensitive feature selection aims to select features that are associated with the class of interest (i.e., defect-prone modules), and the cost-sensitive classification aims to prevent the SDP classifier from being dominated by the majority class (i.e., not-defect-prone modules). These two stages together address the class imbalance problem in SDP. Fig. 3.1 illustrates the general architecture of the combined TSCS and EAFSC method. As shown in Fig. 3.1, the historical data, including various software metrics captured from software systems, are
Two things should be carefully considered when building ensemble models. First, ensembles should be built from diverse classifiers; they should include classifiers that make different incorrect predictions, because classifiers that make the same prediction errors do not add any information. Second, the outputs from all classifiers should be combined in a way that amplifies correct decisions and ignores incorrect ones. Since different classifiers find different defects, techniques commonly used in defect prediction for combining classifier outputs, such as majority voting, should be reconsidered. Current ensemble models in software defect prediction are not specifically designed to combine prediction outputs in a way that amplifies correct predictions. If several prediction models have each uniquely identified different sets of defects, then majority voting will not be a suitable technique for increasing prediction performance. On the contrary, some of the defects uniquely identified by single classifiers will now be misclassified, downgrading the overall performance of the ensemble models. The decisions of individual classifiers can be combined using techniques other than majority voting. In this study, we use the stacking approach first introduced by Wolpert. Stacking uses a two-layer approach, where the first layer consists of individual classifiers, all trained on the same training data. The second layer, also called the meta layer, uses the output predictions of the individual classifiers from the first layer as its input. This input is fed into the second-layer classifier, which then makes the final predictions. The stacking approach therefore seeks patterns in the predictions made by the first layer, rather than ignoring classifiers that cast minority "votes".
Consequently, if a specific subset of defects is detectable by only one classifier, stacking still has an opportunity to correctly classify such instances. The majority-voting approach would certainly misclassify them, since all but one of the classifiers would predict non-defective.
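The two-layer stacking structure described above can be sketched directly. Assuming scikit-learn and synthetic stand-in data (the base learners and meta learner below are illustrative choices, not the exact ones from the study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for a defect dataset
X, y = make_classification(n_samples=400, n_features=12, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# first layer: diverse individual classifiers trained on the same data;
# second (meta) layer: a classifier fed their cross-validated predictions
stack = StackingClassifier(
    estimators=[("nb", GaussianNB()),
                ("svm", SVC(probability=True, random_state=7)),
                ("tree", DecisionTreeClassifier(random_state=7))],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_tr, y_tr)
acc = stack.score(X_te, y_te)
```

Because the meta layer learns weights over the base predictions instead of counting votes, a defect pattern caught by a single base classifier can still drive the final decision.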
ABSTRACT: An error, bug, flaw, failure, mistake or fault in a computer program or system that generates an inaccurate or unexpected outcome, or prevents the software from behaving as intended, is a software defect. A project team wants to produce a quality software product with zero defects. High-risk components in a software project must be caught early to enhance software quality, as software defects incur costs in terms of quality and time. This article investigates the classification accuracy of the Support Vector Machine (SVM) for Software Defect Prediction (SDP) and proposes a new approach combining optimized MRMR feature selection and SVM with the firefly algorithm.
To overcome the limitation of numeric feature descriptions of software modules in software defect prediction, we propose a novel module description technique that employs classifying features, rather than numerical features, to describe a software module. First, we construct an independent classifier on each software metric. The classification results on each feature are then used to represent every module. We apply two different feature classifier algorithms (based on the mean criterion and the minimum error rate criterion, respectively) to obtain the classifying feature description of software modules. With the proposed description technique, the discriminative power of each metric is distinctly enlarged. Moreover, the classifying feature description is simpler than the numeric description, which accelerates prediction model learning and reduces the storage space of massive data sets. Experimental results on four NASA data sets (CM1, KC1, KC2 and PC1) demonstrate the effectiveness of classifying feature descriptions, and our algorithms can significantly improve the performance of software defect prediction.
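One plausible reading of the mean-criterion per-metric classifier can be sketched as follows: threshold each metric at the midpoint of the two class means and replace the metric's value with the resulting 0/1 prediction. The threshold rule and toy data are illustrative assumptions; the paper's exact criteria may differ.

```python
import numpy as np

def classifying_features(X_train, y_train, X):
    """Mean-criterion sketch: replace each numeric metric with the output of a
    per-metric threshold classifier (threshold = midpoint of the class means)."""
    out = np.zeros(X.shape, dtype=int)
    for j in range(X.shape[1]):
        m0 = X_train[y_train == 0, j].mean()   # mean of non-defective class
        m1 = X_train[y_train == 1, j].mean()   # mean of defective class
        thr = (m0 + m1) / 2.0
        # predict the class whose mean lies on the same side of the threshold
        if m1 >= m0:
            out[:, j] = (X[:, j] >= thr).astype(int)
        else:
            out[:, j] = (X[:, j] < thr).astype(int)
    return out

# toy metrics: two features; modules 0-1 non-defective, 2-3 defective
X_train = np.array([[1.0, 10.0], [2.0, 9.0], [8.0, 1.0], [9.0, 2.0]])
y_train = np.array([0, 0, 1, 1])
Z = classifying_features(X_train, y_train, X_train)
```

The resulting binary representation is much more compact than the raw metrics, which matches the storage and learning-speed argument made above.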
predictive performance of SVM against four NASA datasets with eight machine learning models. Guo et al. utilized an ensemble approach (Random Forest) on NASA software defect datasets to predict defect-prone software modules and also compared its performance with other existing machine learning approaches. Ghouti et al. developed a model for fault prediction using SVM and a Probabilistic Neural Network (PNN) and evaluated it on PROMISE datasets; this work suggested that the predictive performance of PNN is better than that of SVM for any size of dataset. Khoshgoftaar et al. performed an experiment on a large telecommunication dataset and used a Neural Network (NN) to predict whether a module is faulty or not. They compared the performance of the NN with other models and found that the NN performed well compared to other approaches in fault prediction. The use of numerous data mining algorithms, for example clustering, association, regression and classification, in software defect prediction is also discussed by Kaur and Pallavi. Another study used Fuzzy SVM to identify defects in software modules: since the datasets available for defect prediction are imbalanced in nature, that study applied Fuzzy SVM to deal with imbalanced software data. Fenton et al. and Okutan et al. used Bayesian Networks for predicting defects in software [4, 10]. Okutan et al. performed experiments on 9 data sets from the PROMISE repository and found that the most effective software metrics are lines of code, response for class, and lack of coding quality. SVM and Particle Swarm Optimization (P-SVM) models were proposed by Can et al.; P-SVM produced promising results compared to other existing models such as SVM, GA-SVM and Back Propagation NN [3, 21].
Song et al. (2006) proposed the prediction of defect associations and defect correction effort based on association rule mining methods, so that test resources can be allocated more effectively for detecting software defects. The proposed method was applied to more than 200 projects. The experimental results show that high accuracy is achieved for both defect association prediction and defect correction effort prediction. The proposed method was also compared with the PART, C4.5 and Naïve Bayes methods; the comparison shows that its accuracy was higher by at least 23 percent. Lessmann et al. (2008) investigated the performance of classification algorithms. To compare software defect prediction performance, experiments were conducted using 10 public-domain datasets from the NASA Metrics Data repository and 22 classifiers. The general impression is that metric-based classification for predictive accuracy is useful. The results also indicated that the importance attached to particular classification algorithms is not as significant as generally assumed: there was no significant difference among the top 17 classifiers.
Abstract: Software defect prediction (SDP) is a challenging problem in the area of computer science. Software engineering is fertile ground for computer science projects, as it gives computers the ability to plan a job accurately by means of data. Machine learning (ML) has been advanced by research on pattern identification and computational intelligence based on Artificial Intelligence (AI). These ML techniques help in resolving faults that arise from validation as well as from domain-based systems, and in programming difficulties arising from procedure-oriented knowledge in such situations and alterations. A predictive model classifies modules in two ways: as defective or non-defective. These predictive models are built using machine learning techniques, and ML methods have proved helpful in software defect prediction. The existing data sets are collected from NASA and Eclipse through the PROMISE repository, a repository inspired by the UCI repository and developed in 2005. There are many learning approaches for detecting defects in software; here we revisit the ML methods for SDP (software defect prediction).
The evaluation scheme is an important part of the framework for software defect prediction. At this stage, different learning schemes are evaluated, and models are built with them and tested. The first question in evaluating a scheme is how to divide the historical data into training and test sets. As described above, the test data must be independent of the learner's construction; this is a prerequisite for assessing a learner's performance on new data. Cross validation is generally used to estimate how accurately a prediction model will perform in practice. A round of cross validation involves partitioning the data into complementary subsets, performing the analysis on one subset, and validating the analysis on the other subset. To reduce variability, multiple rounds of cross validation are performed using different partitions, and the results are combined across the rounds.
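The cross-validation procedure just described can be sketched explicitly: partition the data into folds, train on the complement of each fold, validate on the fold, and combine the results. Assuming scikit-learn and synthetic stand-in data (the learner and fold count are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the historical defect data
X, y = make_classification(n_samples=200, n_features=8, random_state=3)

kf = KFold(n_splits=10, shuffle=True, random_state=3)
accs = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=3)      # model built per round
    clf.fit(X[train_idx], y[train_idx])               # analysis subset
    accs.append(clf.score(X[test_idx], y[test_idx]))  # validation subset
mean_acc = float(np.mean(accs))                       # combined across rounds
```

Each test fold is disjoint from the data the learner was built on, which is exactly the independence requirement stated above; repeating with different shuffles gives the multiple rounds that reduce variability.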
association rules are suitable for software defect prediction, we compared CBA2 with two other rule-based classification methods, i.e. C4.5 and RIPPER, across twelve public-domain benchmark data sets obtained from the NASA Metrics Data Program (MDP) repository and the PROMISE repository. Comparisons are based on the area under the receiver operating characteristic curve (AUC). As argued later in this paper, the AUC represents the most informative indicator of predictive accuracy within the field of software defect prediction.
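For concreteness, AUC is the probability that a randomly chosen defective module is ranked above a randomly chosen clean one. A minimal hand-worked example, assuming scikit-learn and hypothetical predicted defect probabilities:

```python
from sklearn.metrics import roc_auc_score

# 4 modules, two defective (label 1), with hypothetical predicted
# defect probabilities from some classifier
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (defective, clean) pairs are ranked correctly -> AUC = 0.75
auc = roc_auc_score(y_true, y_score)
```

Because AUC depends only on the ranking of scores, it is insensitive to the classification threshold and to class imbalance, which is why it is favored over plain accuracy in defect prediction comparisons.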
Defects in software systems continue to be a major problem. Defect prediction is an important topic in software quality research and can help in planning, controlling and executing software development activities. Nowadays, computer scientists have shown interest in studying the behaviour of social insects in the neural networks area for solving different prediction problems. Chief among these approaches is the Artificial Bee Colony (ABC) algorithm. This paper investigates the use of the ABC algorithm, which simulates the intelligent foraging behaviour of a honey bee swarm. A Multilayer Perceptron (MLP) trained with the standard back propagation (BP) algorithm normally requires computationally intensive training, and one of the crucial problems with BP is that it can yield networks with suboptimal weights because of the many local optima in the solution space. To overcome this, the ABC algorithm is used in this work to train an MLP to learn the complex behaviour of software defect prediction data, and the performance of MLP-ABC is benchmarked against an MLP trained with standard BP. The experimental results show that MLP-ABC performs better than MLP-BP.
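The ABC optimizer that replaces BP can be sketched in miniature. The version below minimizes a simple quadratic loss standing in for an MLP's training error over its weight vector; the colony size, abandonment limit, and other parameter values are illustrative assumptions, not those of the paper:

```python
import numpy as np

def abc_minimize(f, dim, n_food=10, limit=20, iters=200, bound=5.0, seed=0):
    """Minimal Artificial Bee Colony with employed, onlooker and scout phases.
    Sketch of the optimizer that would tune MLP weights in place of BP."""
    rng = np.random.default_rng(seed)
    foods = rng.uniform(-bound, bound, (n_food, dim))  # candidate weight vectors
    fits = np.array([f(x) for x in foods])
    trials = np.zeros(n_food, dtype=int)

    def try_neighbor(i):
        # perturb one coordinate toward/away from a random other food source
        k = int(rng.integers(n_food - 1))
        if k >= i:
            k += 1
        j = int(rng.integers(dim))
        cand = foods[i].copy()
        cand[j] += rng.uniform(-1, 1) * (foods[i, j] - foods[k, j])
        fc = f(cand)
        if fc < fits[i]:                 # greedy selection
            foods[i], fits[i], trials[i] = cand, fc, 0
        else:
            trials[i] += 1

    for _ in range(iters):
        for i in range(n_food):          # employed bees: one trial per source
            try_neighbor(i)
        qual = 1.0 / (1.0 + fits)        # onlookers favor better sources
        probs = qual / qual.sum()
        for i in rng.choice(n_food, size=n_food, p=probs):
            try_neighbor(int(i))
        for i in range(n_food):          # scouts abandon exhausted sources
            if trials[i] > limit:
                foods[i] = rng.uniform(-bound, bound, dim)
                fits[i] = f(foods[i])
                trials[i] = 0
    b = int(np.argmin(fits))
    return foods[b], float(fits[b])

# quadratic loss standing in for an MLP's training error
best_x, best_f = abc_minimize(lambda w: float(np.sum(w ** 2)), dim=3)
```

In the actual MLP-ABC setting, `f` would evaluate the network's prediction error on the defect training data for a given flattened weight vector, so no gradient is needed and local optima are escaped via the scout phase.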