In order to assess the effectiveness of mutation-aware fault prediction we define 40 mutation metrics (either ‘static’ or ‘dynamic’) and collect them using the popular (and publicly available) tool PITest. Then, we empirically compare the performance of mutation-aware prediction models built using these metrics with respect to those of prediction models built using 39 traditional (mutation-unaware) static source code metrics that have been widely used in previous fault prediction work. We also investigate whether the combined use of mutation metrics and source code metrics improves the accuracy of the resulting prediction model. Moreover, we analyse the extent to which different prediction techniques benefit from mutation awareness by using four different techniques (i.e., Logistic Regression, Random Forest, Naïve Bayes and J48), implemented in WEKA, to build both ‘mutation-aware’ and ‘mutation-unaware’ predictive models. Although we have our own tools for mutation testing and predictive modelling [26, 45, 50], we used publicly available third-party tools to avoid a source of potential experimenter bias, and to ensure that our results will have actionable findings for both mutation testing and predictive modelling.
Research efforts to predict where faults are likely to hide have been substantial. Although a large number of machine learning based software fault prediction approaches have been investigated, none of them has proven to be consistently accurate. As we discussed in the Introduction section, the limitations stem from the lack of a long fault history, failure to use appropriate predictive approaches, and low data quality. In many real-world learning scenarios, acquiring a large amount of labeled training data is expensive and time-consuming. Unlabeled data, however, is a powerful resource and is easy to obtain. The key question is how to extract useful information from unlabeled resources in a wide range of learning environments. In the software fault prediction problem, this arises, for example, when identifying fault-prone modules while no previous subsystems or earlier versions are available for model training. In this dissertation we proposed machine learning solutions to address this problem and discussed the efficacy of the proposed approaches using multiple sources of software fault data. The two machine learning paradigms we studied are semi-supervised learning and active learning.
A fan is a type of rotating machinery whose main working body consists of rotors and other rotating parts. In many large-scale industries, it is a core piece of production equipment. If a fan fault occurs, it will not only hamper normal operation but also cause needless losses. Fault prediction for fans is therefore beneficial for the production and safety of an enterprise.
In this paper we have investigated the applicability of the Box-Cox power transformation to neuro-based software fault prediction. The ANN employed in this paper is an MIMO type of MLP, and can handle grouped data on software fault counts as well as make long-term predictions. To the best of our knowledge, this paper is the first attempt to treat the long-term prediction of software faults with grouped data in an ANN approach. Through a comprehensive comparison with existing SRGMs, it has been shown that our MIMO type of MLP can work well to predict the cumulative number of software faults in the early testing phase. In the future, we will apply the proposed neural network approach to other software fault count data and conduct more comprehensive data analysis to validate our method with data transformation. In particular, a challenging issue is to develop the prediction interval of the cumulative number of software faults. Even if SRGMs are assumed, it is almost impossible to obtain exact predictive intervals of the cumulative number of software faults without applying an approximation method. We will extend our prediction scheme based on the MIMO type of MLP to the interval prediction problem. We will also consider how to select the optimal transformation parameter in the Box-Cox transformation.
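The Box-Cox transformation discussed above can be sketched as follows; this is a minimal illustration using `scipy.stats.boxcox` on synthetic cumulative fault counts (the data below are invented for illustration, not the paper's), with the transformation parameter estimated by maximum likelihood:

```python
import numpy as np
from scipy.stats import boxcox

# Synthetic cumulative fault counts (illustrative only; not the paper's data).
cum_faults = np.array([2, 5, 9, 14, 22, 31, 40, 52, 61, 70], dtype=float)

# Box-Cox requires strictly positive data; with lmbda unspecified, the
# transformation parameter lambda is estimated by maximum likelihood.
transformed, lam = boxcox(cum_faults)
print("estimated lambda:", round(lam, 3))

# Inverse transform, to map neural-network predictions made on the
# transformed scale back to the original fault-count scale.
def inv_boxcox(y, lam):
    return np.exp(y) if lam == 0 else (lam * y + 1.0) ** (1.0 / lam)

recovered = inv_boxcox(transformed, lam)
assert np.allclose(recovered, cum_faults)
```

In a prediction pipeline of this kind, the MLP would be trained on the transformed counts and its outputs mapped back with the inverse transform.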
NASA’s publicly available software metrics data have proved very popular in developing fault prediction models, and have the advantage that researchers are able to replicate and compare results using different approaches based on the same data set. However, although the repository holds many metrics and is publicly available, it is not possible to explore the source code or trace back how the metrics were extracted. It is also not always possible to identify whether any changes have been made to the extraction and computation mechanisms over time. A further concern is that the data may suffer from ‘noise’ []. It is also questionable whether a model that works well on the NASA data will work on a different type of system; as Menzies et al. [] point out, NASA works in a unique niche market developing software which is not typical of the generality of software systems, though Turhan et al. [] have demonstrated that models built on NASA data are useful for predicting faults in washing machines.
Collecting good quality data is very hard. This is partly reflected by the number of studies which failed our assessment by not adequately explaining how they had collected their independent or dependent data. Fault data has previously been shown to be particularly hard to collect, usually because it is either not directly recorded or recorded poorly. Data collection is made more challenging because large data sets are usually necessary for reliable fault prediction. Jiang et al. [] investigate the impact that the size of the training and test data sets has on the accuracy of predictions. Tosun et al. [] present a useful insight into the real challenges associated with every aspect of fault prediction, particularly the difficulties of collecting reliable metrics and fault data. Once collected, data is usually noisy and often needs to be cleaned (e.g. outliers and missing values dealt with). Very few studies report any data cleaning (even among our 36 finally included studies).
Jun Zheng [17] described how a software fault prediction model can be built with the help of the threshold-moving technique. The goal of the software developer is to deliver better quality software on time and within budget. A software fault prediction model classifies modules into two classes: faulty and non-faulty. The author discussed the use of different cost-sensitive boosting algorithms for software fault prediction, whose accuracy is better than that of the other algorithms.
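The threshold-moving idea mentioned above can be illustrated with a minimal sketch: instead of the default 0.5 cutoff on predicted fault probabilities, the threshold is lowered so that more of the costly-to-miss faulty modules are flagged. The probabilities and labels below are invented for illustration:

```python
# Illustrative predicted fault probabilities and true labels (1 = faulty).
probs  = [0.10, 0.35, 0.45, 0.55, 0.80, 0.30, 0.48, 0.95]
labels = [0,    0,    1,    1,    1,    0,    1,    1]

def recall_at(threshold):
    """Fraction of truly faulty modules flagged at the given cutoff."""
    predicted = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    return tp / sum(labels)

print(recall_at(0.5))  # → 0.6: the default threshold misses borderline faulty modules
print(recall_at(0.4))  # → 1.0: moving the threshold down catches all of them
```

In a cost-sensitive setting the threshold would be chosen to minimise expected misclassification cost rather than fixed by hand as here.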
P.S. Bishnu and V. Bhattacherjee applied unsupervised techniques such as clustering to fault prediction in software modules, especially in cases where fault labels are not available. Their paper presents a Quad Tree based K-Means algorithm for predicting faults in software modules or datasets. They used the concept of clustering gain to determine the quality of clusters when evaluating the Quad Tree based initialization algorithm against other initialization techniques.
The main contribution of the study is the following: basic feature filtering strategies are best combined in some way to obtain improved fault prediction performance. In the software fault prediction literature, there are many hybrid strategies that combine feature selection strategies to obtain hybrid methods. To the best of our knowledge, this is the first study that makes use of a voting mechanism to investigate the most relevant features. The remainder of the paper is organized as follows. In Section 2, we briefly discuss related work. The evaluation dataset and related information are given in Section 3. In Section 4, we present ranker-based filters and the machine learning algorithms used in the study. Section 5 gives details about the proposed feature selection algorithm, detailed results of the conducted experiments, and the ANOVA test employed to statistically validate the obtained results. In Section 6, the validity threats of the study are presented. The article ends with the conclusion and a list of references.
Quality is the main concern of every software product. Fault prediction models contribute significantly to the improvement of quality, and also decrease the time, cost, and effort of the organization. There are various challenges that we still face when building an efficient fault prediction model. We therefore proposed a model that considers all these issues and significantly helps managers in managing their resources. The model is not implemented yet; in future work we plan to implement it on various datasets and projects, and the presented model can be applied to commercial projects to generalize the findings of the study. Different techniques can also be used with stacking to get better prediction results.
In this paper, we design experiments to demonstrate the effectiveness of our proposed algorithm. In Fig. 3, six different schemes are used for evaluation (X, X-IG, X-EE, X-KEE, X-RKEE, and X-All), where X is a learning scheme used to train base classifiers. We implement three different base classifier models commonly used in software fault prediction, namely the 1-nearest neighbor rule (1-NN), decision tree (C4.5), and Naïve Bayes (NB). All three algorithms are implemented on top of WEKA 3.5.5, one of the best known libraries of machine learning algorithms.
Nowadays software fault prediction has become crucial for increasing software reliability, and higher reliability yields better software quality. Faults are defects that result in software failure and unnecessarily increase testing costs. Faults or defects in software modules are becoming a major challenge that needs to be resolved, and software quality assurance is a major concern in the modern era. Many software companies accept that software with faults or defects lacks quality. Therefore, companies need a methodology that can remove faults at an early stage of the software development process, reducing both testing and development cost. Various machine learning techniques have been applied to software fault prediction, such as support vector machines, neural networks, genetic algorithms and many more. We have used case-based reasoning as a method for finding the errors or faults in software modules, and this is the novelty of our paper.
Elish et al. investigated the performance of Support Vector Machines (SVMs) and found SVMs to be better than, or at least competitive with, other statistical and machine learning models in the context of four NASA datasets. They compared the performance of SVMs with that of Logistic Regression, Multi-layer Perceptrons, Bayesian Belief Networks, Naive Bayes, Random Forests, and Decision Trees. They used the correlation-based feature selection technique (CFS) to down-select the best predictors from the independent variables in the datasets. Catal et al. investigated the effects of dataset size, metrics set, and feature selection techniques, and found that Random Forests provides the best prediction performance for large datasets while Naive Bayes is the best prediction algorithm for small projects. Tomaszewski compared statistical models with expert estimation for fault prediction and found that statistical techniques were superior at locating software faults, stating that “When it comes to comparing both methods we found that statistical models outperformed expert estimations”.
The goal of our research is to analyze the performance of various classifiers on datasets at various metrics levels for defect prediction. We analyzed the performance of the classifiers using Root Mean Square Error (RMSE). ROC is also used as an alternative metric; the area under the ROC curve (AUC-ROC) is calculated using the trapezoidal method. From the ROC curve it is evident that, for three different metrics-level datasets (the KC1 method level dataset, the KC1 class level dataset and the Eclipse dataset), the bagging classifier gives better performance in terms of classification rate. When AUC is used as the evaluation metric, for the KC1 method level dataset and the Eclipse package level dataset the fault prediction rate is 80% with the bagging ensemble, which is better than the other ensemble classifier methods. For the KC1 class level dataset, the fault prediction rate is 82% with the voting ensemble method using AUC-ROC as the metric. Bagging outperforms the other ensemble methods when performance is evaluated using both RMSE and AUC-ROC, with the exception that for the class level metrics, voting performs better than bagging by a negligible margin. Many researchers may apply machine learning methods to construct models that predict faulty classes. We plan to replicate our study to build models based on other machine learning algorithms, such as ensembles using neural networks and genetic algorithms.
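The trapezoidal computation of AUC-ROC mentioned above can be sketched as follows: sweep the decision threshold over the sorted scores, collect (FPR, TPR) points, and integrate with trapezoids. The scores and labels are synthetic, and ties in scores are not handled specially in this minimal version:

```python
def auc_trapezoid(scores, labels):
    """Area under the ROC curve via the trapezoidal rule (no tie handling)."""
    # Sort by descending score; each step down the ranking lowers the threshold.
    pairs = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]  # (FPR, TPR) points of the ROC curve
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    # Trapezoidal integration over consecutive ROC points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3]
labels = [1,   1,   0,   1,   0,    0,   0]
print(auc_trapezoid(scores, labels))  # → 0.9166...: 11 of 12 pos/neg pairs ranked correctly
```

The result equals the fraction of (faulty, non-faulty) pairs that the scores rank correctly, which is the standard probabilistic interpretation of AUC.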
the three samples are set as the prediction group. The three fault types are: elevator door opens, excessive vibration when the elevator door opens, and elevator door cannot open or close. In this experiment, the parameters of PSO-BP are: population size M = 50, maximum number of iterations G = 200, 8 input nodes and 3 output nodes, giving 8 × 12 + 12 × 3 = 132 weights in total and 15 thresholds, so the search space dimension is N = 132 + 15 = 147, forming an 8-12-3 BP network structure. Three groups of data are set as the training group and three kinds of data as the prediction group to simulate with PSO-BP. The simulation data are shown in Table 2. The fault prediction of the elevator door system by the BP neural network is shown in Table 3.
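The PSO search-space dimension quoted above follows directly from the 8-12-3 network structure: one weight per connection between layers plus one threshold (bias) per hidden and output node. A quick sanity check:

```python
# Dimension of the PSO search space for the 8-12-3 BP network described above.
n_in, n_hidden, n_out = 8, 12, 3

weights    = n_in * n_hidden + n_hidden * n_out  # 8*12 + 12*3 = 132 connection weights
thresholds = n_hidden + n_out                    # 12 + 3 = 15 node thresholds (biases)
dimension  = weights + thresholds                # N = 147, as stated in the text

print(weights, thresholds, dimension)  # → 132 15 147
```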
This paper investigated the use of SCADA data for wind turbine gearbox planet bearing fault prediction. The case study concerned historical data from three operating wind turbines leading up to the same planet bearing failure mode. An insight into the dataset was given for different time steps prior to the component failure. The generator speed and two gearbox temperatures were modelled in normal operating condition. Abnormalities can be detected through the error between the predicted and the actual variables. The results indicate that the relationship between the generator and rotor speed changes in the time period close to the fault. This could be related to the measurement procedure of the generator speed. A classifier using all the measured variables is also presented, and it indicates that the relationship between the SCADA variables changes within one month before the catastrophic bearing failure, as shown by the normal behaviour model. It is furthermore shown that prognosis of the fault can potentially be achieved given enough run-to-failure examples.
These three measures are better than the accuracy measure. A model that is able to detect more fault prone modules is good. When a project has a low budget, flagging a smaller number of fault prone modules with a low false alarm rate is desirable. In such conditions, the flexibility of G-mean, J_coeff and F-measure helps in better performance evaluation. These measures are preferable for evaluating software fault prediction models because the cost of misclassifying a fault prone module as non fault prone is higher than that of misclassifying a non fault prone module as fault prone, which merely implies some wastage of resources on verification activities.
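The three measures above can be computed from a confusion matrix as sketched below. The counts are illustrative, and J_coeff is assumed here to denote Youden's J statistic (sensitivity + specificity − 1), its usual meaning in the fault prediction literature:

```python
# Illustrative confusion-matrix counts; faulty modules are the positive class.
tp, fn, fp, tn = 40, 10, 20, 130

recall      = tp / (tp + fn)  # sensitivity / probability of detection
precision   = tp / (tp + fp)
specificity = tn / (tn + fp)

g_mean    = (recall * specificity) ** 0.5                 # geometric mean of recall and specificity
f_measure = 2 * precision * recall / (precision + recall) # harmonic mean of precision and recall
j_coeff   = recall + specificity - 1                      # Youden's J (assumed meaning of J_coeff)

print(round(g_mean, 3), round(f_measure, 3), round(j_coeff, 3))  # → 0.833 0.727 0.667
```

Unlike plain accuracy, all three measures penalise a model that achieves a high score simply by labelling everything non faulty on an imbalanced dataset.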
Fuzzy based models translate the subjective understanding and expertise of the processes into mathematically expressible figures and rules to generate systems with some degree of uncertainty. The use of fuzzy logic in experimental software engineering to model various aspects of the software evolution process is increasingly gaining recognition in the research community (Ahmed and Muzaffar, 2009; Aljahdali and Sheta, 2011; Bouktif et al., 2010; Chiu, 2011; Engel and Last, 2007; Gray and MacDonell, 1997; Khoshgoftaar and Seliya, 2003; MacDonell, 2003; Meneely et al., 2008; Ozcan et al., 2009; Pandey and Goyal, 2009; So et al., 2002; Verma and Sharma, 2010). The present study discusses the use of a fuzzy inference mechanism to recognize the most notable rules in the development of a fault prediction model using the GUAJE framework (Alonso and Magdalena, 2011a,b; Alonso et al., 2012).
Most results from our experiments testing the impact of training data size on fault prediction performance are not surprising: when derived from a larger data set, model performance improves. The interesting aspect of our results comes from statistical hypothesis testing. In simple terms, the performance margin between models derived from 50% data subsets and those derived from just 10% is not statistically significant. Further, models built from 50% data subsets and 90% data subsets typically belong to the same performance cluster too (but models built from 10% and 90% do not). The implication of this result is, we believe, very positive. Models developed from small data sets, presumably early in the project lifetime, offer fault prediction capability comparable with models that can only be developed much later. Therefore, while updating the fault prediction model is a good idea, it does not have to be practiced often. This conclusion offers a real chance to optimize the cost of fault prediction model development. Fault prediction models, in turn, optimize the cost of verification and validation activities.
Software quality is the degree to which a system meets specified requirements and specifications. As dependency on software increases, there is a need for quality software. Software quality can be affected by factors such as faults, approaches, tools, schedule pressure, time limits, technology, etc. To achieve good software quality, the faults in the software system have to be identified and removed as early as possible. To predict the faults residing in a system, a prediction model is developed. A software fault prediction model is used to predict the faults in software modules. There are various techniques for fault prediction which depend on historical data. Some of these techniques are logistic regression by Basili et al. 1996; classification trees by Gokhale and Lyu 1997, Khoshgoftaar and Seliya 2002, Selby and Porter 1988; neural networks by Khoshgoftaar and Lanning 1995, effective for predicting faults given large input data; and genetic algorithms by Azar et al. 2002. A software quality prediction model is based on the basic components of the system. In this paper, software metrics are the basic components used in fault prediction models to predict faults in object oriented systems. From various studies it has been found that artificial neural networks (ANNs) provide better accuracy than other prediction models. An artificial neural network is a network whose nodes are artificial neurons, a computational model inspired by the human brain. These neurons solve complex problems by working together and being highly interconnected with each other. The parameters of an ANN are inputs, interconnection weights, a summing function, and an activation function. An accurate prediction model is obtained by training the artificial neural network using training algorithms based on adjusting the parameters of the ANN. The key parameter for training the ANN well is the weight associated with the connections between the layers.
In this paper, the weights are optimized to train the ANN.
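The weight adjustment described above can be illustrated with a minimal sketch: gradient-descent updates on a single sigmoid neuron trained on synthetic module metrics. The data, dimensions and learning rate are all invented for illustration and do not reflect the paper's network or training algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 4))                  # 20 modules, 4 synthetic metric inputs
y = (X.sum(axis=1) > 2.0).astype(float)  # synthetic fault labels

w = rng.normal(size=4)  # interconnection weights, the parameters being trained
b = 0.0                 # bias term
lr = 0.5                # learning rate (illustrative choice)

def forward(X, w, b):
    # Summing function followed by a sigmoid activation function.
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

for _ in range(200):
    p = forward(X, w, b)
    grad_w = X.T @ (p - y) / len(y)  # gradient of the cross-entropy loss w.r.t. weights
    grad_b = np.mean(p - y)
    w -= lr * grad_w                 # the weight adjustment that trains the network
    b -= lr * grad_b

acc = np.mean((forward(X, w, b) > 0.5) == (y == 1))
print("training accuracy:", acc)
```

A multi-layer network adds hidden layers and backpropagates the same kind of gradient through them, but the core idea, iteratively adjusting connection weights to reduce prediction error, is the one shown here.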