Comparison and evaluation of data mining techniques with algorithmic models in software cost estimation

(1)

Procedia Technology 1 ( 2012 ) 65 – 71

* Zeynab Abbasi Khalifelu. Tel.: +90-554-404-0942; E-mail address: {tabriz.ce, bonab.farhad}@gmail.com

INSODE-2011

Comparison and evaluation of data mining techniques with

algorithmic models in software cost estimation

Zeynab Abbasi Khalifelu

a

, Farhad Soleimanian Gharehchopogh

b

a_{Islamic Azad University Shabestar Branch, Department of Computer Engineering, Shabestar,Iran} b_{Hacettepe University, Department of Computer Engineering,Beytepe, Ankara,Turkey}

Abstract

Software Cost Estimation (SCE) is one of important topics in producing software in recent decades. Real estimation requires cost and effort factors in producing software by using of algorithmic or Artificial Intelligent (AI) techniques. Boehm developed the Constructive Cost Model (COCOMO) that is one of the algorithmic SCE models. Also, these models contain three increasingly basic, intermediate and detailed forms, i.e. basic COCOMO is suitable for quick, early, rough order of among the estimates of required effort in producing software, but its accuracy is limited due to its loss of factors to account for difference between cost drivers. Intermediate COCOMO assumes these project attributes into account. In addition detailed COCOMO accounts for individual project phases used. The COCOMO algorithmic techniques families have used since 1981. In recent years, some techniques emerged by using intelligent techniques to solve and estimate the effort required in producing software. In this paper, different data mining techniques to estimate software costs are presented and then the results of each technique are evaluated and compared. However, NASA's projects to train and test each of these techniques are applied. Then, data set to train and test the data mining techniques improve the estimation accuracy of the models in many cases. We show the comparison between COCOMO model and data mining techniques here. The results indicate that these methods result in many benefit answers. Also we show the comparison of the estimation accuracy of COCOMO model with data mining techniques. Data mining techniques improve the estimation accuracy of the models in many cases. So the estimated effort more improvement in this models.

Keywords: Software Cost Estimation, Data Mining, COCOMO, Linear Regression, Artificial Neural Network, Support Vector Regression, K-Nearest Neighbors;

1.Introduction

One of the most critical impressions on software quality collateral that defects SCE will be an important software development. In recent years, SCE method used for determining requirement effort has been a significant issue. These are applied based on experts’ experiences.

Variety of available methods can be point to the algorithmic models. One of non-linear methods used for estimating SCE is COCOMO 81 Model in [1, 2, 3, and 4]. They are used since 1981[5]. Nowadays, by confecting COCOMO models and using AI techniques, new approach has been created for approximation and calibration new software in great companies and business applications [5]. This problem has direct relationship with success or failure in performed new projects.

(2)

However, Boehm et al [1] describe new approach to improve SCE accuracy. Pahariya et al [2] proposed new computational intelligence sequential hybrid architectures involving programming and Group Method of Data Handling (GMDH). This includes data mining methods such as Multi-Layer Regression (MLR), Radial Basis Function (RBF) and so on [2]. Other researches in ANN models for predicting SCE are [3, 4, 6, 7, 8, 9, 10 and 11]. Andreou et al [12] used Fuzzy Decision Trees (FDTs) for predicting required effort and software size in cost estimation as if strong evidence about those fuzzy transformations of cost drivers contributed to enhancing the prediction process. More studies in this topic are illustrated in [13, 14, 15, 16, 17, and 18]. Reddy et al [19] improved fuzzy approach for software effort of the COCOMO using Gaussian membership function which performs better than the trapezoidal function to presenting cost drivers.

In this paper, we firstly describe COCOMO algorithmic models. Then explain the roles and application of these models in data mining techniques. In the next sections, we research on four data mining techniques results on training and testing data. For implementing AI models, we utilize data mining techniques such as Linear Regression (LR), ANN, Support Vector Regression (SVR) and K-Nearest Neighbors (KNN) classifiers. However, we compare outcomes of these methods with Intermediate COCOMO results in given actual and estimated efforts.

2.Algorithmic Models

Software cost drivers largely impress on required software effort and cost in the software development life cycle [16]. After estimation, project management will create new project development schedule. There are algorithmic models based on Source Line of Codes (SLOC), function point (FP) [16] and Effort Multipliers (EM) [20, 21, 22]. These models have been explained in [16]. Some of non-linear models are COCOMO I [1, 22], COCOMO II [11, 23] advocate local calibration, Bailey et al [24] improved accuracy prediction. We choose intermediate COCOMO from among COCOMO families [1, 2, 5, and 16]. There are 15 parameters expected for SIZE (total 16 parameters in COCOMO 81) [1]. The non-linear formula for calculating the estimated effort in this method is applied:

(1)

Where A is multi-plicative constant factor that is related to local organization processes, E depends on the type of software and EM are project attributes. It is developed by three types of projects (Organic, Semi-detached and Embedded), Size is the code size of the software measured by SLOC. And EMi is an EM made by combining

process product and development attributes [5, 20, and 21]. More details are discussed in [1, 5, 18, 25, and 23]. COCOMO I and its predecessors are given by Boehm [22, 26] or in the COCOMO model definition manual [20]. More information about Intermediate COCOMO EM are discussed in [1, 21]. Each of EM has special rating in six ranges with more details in [1, 6, 7, and 22].

3.Evaluation Method

Data mining techniques were used for analysis. The data set was divided into two groups: training and testing with following percentage ratios:

x Training set – 80 %

x Testing set – 20 %

Evaluating of cost estimation accuracy in performed by comparing actual effort and estimated effort in order to compute MRE (Magnitude of Relative Error) [5] which can be described as follows:

(2) As shown more, we trained dataset with these models, then evaluated and compared it with MRE criterion.

(3)

4.Data Mining in SCE

In recent years, data mining techniques have been applied in a sundry of domains. By means of data mining techniques and machine learning methods, many researches are done on the basis of SCE. We provide several data mining techniques from a historical data set taken from the former projects which will be applied later [27]. Technology Institute of California, NASA’s Jet Propulsion Laboratory peruses circumstance to consider the accuracy of SCE further in the beginning of project life cycle [28]. However, data mining techniques congenitally train and test the data in order to output superior models.

Generally, our approach to estimate required software effort using data mining techniques are includes six steps as illustrate below:

1.Collecting Pre-processing data: This may consist of deleting, adding, transforming and discrete building variables. These variables should have discrete values (such as actual effort, EM variables and Size of the projects).

2.Creating training data sets and test data sets: We use them for training and testing with NASA’s projects

based on Intermediate COCOMO. Major part of data set has been devoted to training data mining techniques, and minor part of it is applied for testing.

3.Constructing data mining classifiers: It is composed of four methods that include LR, ANN, SVR and KNN.

4.Training: These methods will be trained with Intermediate COCOMO dataset.

5.Testing: The results are applied for testing new instances. Since the required effort and the cost estimated help these methods after training. Then accuracy measures are computed.

6.Repetition of steps 4-5 until all the training and testing sets have been processed.

In the following, four data mining techniques have been proposed for predicting new software cost and effort in PM (estimate effort in Person/Month).

4.1.LR Model

It is our intent to compare the standard regression-based local calibration method generated through the data mining techniques [28]. Among the other advantages of the method is its ability to present domains that evolve over time [29]. LR or Multiple LR (MLR) models can offer a resolution to the problem of SCE. The aim of regression analysis is to express the dependent variable in the form of function variable(s) independent coefficients and error values [30]. Error values are random variables and it does not explain changes in the value of independent variables. Our target is to consider this architecture finding’s error values by adopting new project variables with these parameters. In MLR, presume that a data set has N perceptions where the response is effort and P predicators are in [22] (e.g., Size of projects and EM).

Let bi = (xi1, xi2, xi3, …,xip), i = 1, 2, …, 16, be the vector of P predictors and Yi be the response for ith observation. The architecture for MLR is expressed as:

Yi = β0 + β1bi1 +…+ βpbiP (3)

Where β0, β1, ..., βp are the predicts of coefficients, Yi is the predicts of answerable the ith observation [22]. By minimizing the sum of squared errors, calculated ordinary least squares [30] show that the response estimated from regression line minimizes the sum of squared distances between the regression line and observed response [22]. Know that E (formula 1) is exponential constant usually greater than 1.0 which depends on project types (Organic, Semi-detached, And Embedded) in COCOMO I used for estimating SCE in this model. Formula 1 is non-linear. In order to make it linear, we use formula 3 [22] for prediction SCE:

Log (PM) = β0+ β1 log (Size) + β2 log (EM2) + … + β16 log (EM15) (4)

We use greedy method for selecting attributes in LR algorithm application for SCE. By means of MRE in LR, the results of this method display in figure 1.

(4)

4.2.ANN Model

ANN model is one of architecture that is used more in SCE. For performing it, suggested different ANN such as [3, 5, 8, and 29] are used. But in this paper for calibrating data, we have built our model with new Multi-Layer Perceptron (MLP) in ANN. While pending for the training action, the MLP network consecrates weights to the nodes to reach the best relevance between the training input and output values [8]. The MLP network implemented via the process many times sets out weights to minimize the mean squared error. In this model, we have three layers which consist of input layer, hidden layer and output layer. In input layer, there are 16 inputs (including Size of the project and 15 EM).

We assign five units for hidden layer and one unit for output layer to display output values. The number of training time is 2000, of momentum is 0.2 and of learning rate is 0.1. The results of ANN model is illustrated in Figure1and Figure2. MRE in Intermediate COCOMO model and ANN in data set are evaluated, compared and then presented in Table 1.

4.3. SVR Model

The SVR is implemented as the Support Vector Machine (SVM) for regression. Evgeniou et al [31] proposed a powerful learning algorithm based on recent advances in statistical learning theory proposed. It uses a hypothesis space of linear functions in a high dimensional space, trained with a learning algorithm from optimization theory, and it implements a learning bias derived from statistical learning theory [2].

For predicting software costs, SVR applies a linear model to implement linear class borders. It maps non-linear input vectors (consisting of EM and Size of the projects) into a high dimensional attributes space by means of kernels. In this topic, kernel is composed of poly kernel. Then, the support vectors are applied to invent an optimal linear separating hyper plane (in a case of pattern recognition) or a linear regression function (in the case of regression) in this feature space [2].

Details about the evaluation and comparison of training and testing data for accuracy of cost estimation and MRE of SVR are in the following figures and Table 1.

4.4.KNN Model

This model is a method in AI for cases classification. It can select appropriate value of K based on cross-validation [32]. It can also be distance weighting. Unlike all the previous learning methods in this paper, KNN does not build model from the training data. So, for classifying a test case such as d, define K-Neighborhood P as K nearest neighbors of d. Classification time is linear in training set size for each test case. Cross validation will be used to select the best K value. K= 2 in KNN classifier for prediction required effort.

KNN search algorithm is Linear NN search. In this model, we did not use distance weighting. Additional information is shown in Figure 1 and Table1.

5.Results and Discussion

For evaluating the different SCE model with AI techniques, the most widely accepted evaluation criteria are

MRE among 63 NASA’s projects. These data mining techniques are trained and tested in WEKA data mining tools. The comparison between the results of dataset applied in the LR, ANN, SVR and KNN with Intermediate COCOMO show that AI methods in many cases have a high efficiency in producing the correct prediction of Intermediate COCOMO method. As the comparison with COCOMO was presented in training section (Figure1), MRE data mining techniques compared with the COCOMO model are much less general.

(5)

Figure1.The amounts of MRE in Data Mining techniques and COCOMO in Training Data

Comparison of the results suggests that among AI techniques, SVR model has less MRE than other AI techniques. Perhaps the reason for this result reached from training data, as a result of using regression technique in the SVM Model. For this type of data LR method displays much MRE in several cases. However, the amounts of MRE in ANN and SVR introduce better answers compared within COCOMO model. Chiefly SVR is best models in training data with %95 correct prediction.

The following figure (figure 2) indicates that MRE is values in selected testing data. It is especially suitable in ANN model with %95 correct prediction comparison with COCOMO predictions. On the other hand, it illustrates high proficiency of this method in SCE. Also, these have proven success and applicability in SCE domain. AI techniques extensively applied in prediction have demonstrated their capability in predicting problems chiefly ANN methods.

Figure2. The amounts of MRE in Data Mining techniques and COCOMO in Testing Data

Table 1 depicts accuracy among AI techniques. It consists of LR, ANN, SVR and KNN prediction accuracy. Researches on AI techniques point than the estimated effort of software with AI models have more performance and accuracy of intermediate COCOMO model in many training and testing instances. Error values indicate that AI models in the local data are better answers in comparison with the algorithmic models as Intermediate COCOMO. This leads to the fact that in the local data, estimated effort has effectively been calibrated in data mining techniques. The following table (Table 1) shows the techniques in ANN algorithm with better performance than COCOMO models in prediction of software costs in training and testing data sets. This is possible by comparing MRE of the COCOMO with MRE of the data mining techniques.

Table1. Performance Comparison between Data Mining Techniques

Performance Comparison LR ANN SVR KNN

Correct Prediction on Training Data %74 %87 %95 %68 Correct Prediction on Testing data %60 %95 %80 %60

(6)

In Table 2, the data mining techniques have definitely been collected. Prediction precision implies error values in each of them. In training data, studies evince that LR method has better proficiency and is more applicable comparison with KNN method and COCOMO model. Also, both of LR and KNN have almost the same performance on testing data in correct prediction.

Table2. Prediction Accuracy in Data Mining Techniques

Evaluation Accuracy LR ANN SVR KNN

Correlation Coefficient %99 %99 %97 %1

Mean absolute error 39.17 12.07 36.32 1.12

Root mean squared error 81.52 16.64 41.09 5.44

Relative absolute error 8.09% 2.49% 7.5% 0.23%

Root relative squared error 11.6% 2.37% 20.08% 0.77%

The proposed systems improve its estimation as the number of data set increases. Performance measure used in comparing the techniques is MRE. Estimation by data mining techniques also has an advantage which shows that these methods are very sagacious ones.

6.Conclusion and Future Works

This paper presents several SCE models based on data mining techniques for the choice of suitable AI techniques for essential effort in new projects. The techniques considered are LR, ANN, SVR and K-NN. That trained and tested instances are considered with these methods. Aim of all these works is evaluating and comparing data mining methods with Intermediate COCOMO in prediction accuracy.

Studies conducted on AI techniques show that the estimated cost of the software within these models has more speed and accuracy of algorithmic models such as intermediate COCOMO. Other effective results indicate that AI models in the local data are better answers in comparison with algorithmic models. ANN and SVR techniques afford better answers in comparison with other techniques. Review of results on testing data evince as that in most cases, ANN and SVR have high performance in comparison of COCOMO model. So, ANN and SVR methods are more efficient than the LR and KNN models in training and testing data.

The exploitation of data mining techniques for other applications in the software engineering cost management tendency can also be detected in the future as genetic algorithms, fuzzy decision trees, case based reasoning and other up to date data mining techniques.

References

1. Z. Chen, T. Menzies, D. Port, B. W. Boehm, Feature Subset Selection can Improve Software Cost Estimation Accuracy, ACM SIGSOFT Software Engineering Notes, pp 1-6, 2005.

2. J. S. Pahariya, V. Ravi, M. Carr, M. Vasu, Computational Intelligence Hybrids Applied to Software Cost Estimation, International Journal of Computer Information Systems and Industrial Management Applications (IJCISIM),Vol. 2, pp. 104-112, 2010.

3. A. Lee, C. H. Cheng, J. Balakrishnan, Software Development Cost Estimation: Integrating Neural Network with Cluster Analysis, Information & Management, Vol. 34, pp.1-9, 1998.

4. A. Idri, A. Abran, S. Mbarki, Validating and Understanding Software Cost Estimation Models based on Neural Networks, Software Process and Product Measurement, International Conference on Information and Communication Technologies: From Theory to Applications, pp. 1-6, 2004.

5. F. S. Gharehchopogh, Neural Network Application in Software Cost Estimation, International Symposium on Innovations in Intelligent Systems and Applications (INISTA 2011), Istanbul, TURKEY, pp. 69-73, 2011.

6. A. Idri, A. Zakrani, A. Zahi, Design of Radial Basis Function Neural Networks for Software Effort Estimation, IJCSI International Journal of Computer Science Issues 7(3): 11-17, 2010.

(7)

Conference on Computer Engineering and Technology, pp. 487-491, 2010.

8. G. R. Weckman, H. W. Paschold, J. D. Dowler, H. S. Whiting, W. A. Young, Using Neural Networks with Limited Data to Estimate Manufacturing Cost, Journal of Industrial and Systems Engineering 3(4): 257-274, 2010.

9. H. M. Gunaydin, S. Z. Dogan, A Neural Network Approach for Early Cost Estimation of Structural Systems of Buildings, International Journal of Project Management 22(7): 595–602, 2004.

10. A. Idri, T. M. Khoshgoftaar, A. Abran, Can Neural Networks be Easily Interpreted in Software Cost Estimation, 2002 Word Congress on Computational Intelligence, Honolulu, Hawaii, pp. 1-8, 2002.

11. N. Tadayon, Neural Network Approach for Software Cost Estimation, Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC’05), 2005.

12. A. S. Andreou, E. Papatheocharous, Software Cost Estimation using Fuzzy Decision Trees, 23rd IEEE/ACM International Conference on Automated Software Engineering, pp. 371 - 374, 2008.

13. W. Yu, Y. Lee, Mining of Conceptual Cost Estimation Knowledge with a Neuro Fuzzy System, Proceedings of ISARC 2004, Session SA 03-05, Sep. 21-25, Jeju, Korea, pp. 118-124, 2004.

14. I. R. Khan, A. Alam, H. Anwar, Efficient Software Cost Estimation using Neuro-Fuzzy Technique, National Conference on Recent Developments In Computing and Its Applications, pp. 376-381, 2009.

15. M. Y. Cheng, H. C. Tsai, E. Sudjono, Evolutionary Fuzzy Hybrid Neural Network for Conceptual Cost Estimates in Construction Projects, Information and Computational Technology, 26th International Symposium on Automation and Robotics in Construction (ISARC 2009), pp. 512-519, 2009.

16. S. J. Huang, C. Y. Lin, N. H. Chiu, Fuzzy Decision Tree Approach for Embedding Risk Assessment Information in to Software Cost Estimation Model , Journal of Information Science and Engineering, No. 22, pp. 297-313, 2006.

17. I. Attarzadeh, S. Hockow, Improving the Accuracy of Software Cost Estimation Model Based on a New Fuzzy Logic Model, World Applied Sciences Journal 8(2): 177-184, 2010.

18. X. Huang, D. Ho, J. Ren, L. F. Capretz, Improving the COCOMO Model using a Neuro-Fuzzy Approach, Applied Soft Computing, No. 7, pp. 29–40, 2007.

19. C. S. Reddy, K. Raju, An Improved Fuzzy Approach for COCOMO’s Effort Estimation using Gaussian Membership Function, Journal of Software, VOL. 4, NO. 5, pp. 452-459, 2009.

20. C. S. Reddy, P. S. Rao, KVSVN Raju, V. V. Kumari, A New Approach for Estimating Software Effort Using RBFN Network, IJCSNS International Journal of Computer Science and Network Secruity 8(7): 237-241. 2008.

21. T. Menzies, D. Port, Z. Chen, J. Hihn, S. Stukes, Validation Methods for Calibrating Software Effort Models", in Proc. ICSE, pp. 587-595, 2005.

22.V. Nguyen, B. Steece, B. Boehm, A Constrained Regression Technique for COCOMO Calibration, Empirical Software Engineering and Measurement (ESEM) , pp. 213-222, 2008.

23. I. Attarzadeh, S. H. Ow, A Novel Algorithmic Cost Estimation Model Based on Soft Computing Technique, Journal of Computer Science 6 (2): 117-125, 2010.

24. J. W. Bailey, V. R. Basili, A Meta-Model for Source Development Resource Expenditures, in Proceedings of 5th International Conference on Software Engineering, pp. 107-116. 1981.

25. M. Ayyildiz, O. Kalipsiz, S. Yavuz, A Metric-Set and Model Suggestion for Better Software Project Cost Estimation, World Academy of Science Engineering and Technology, Vol.23, pp. 167-172, 2006.

26. D. Ferens , D. Christensen, Calibrating Software Cost Models to Department of Defense Database: A Review of Ten Studies, Journal of Parametrics 18(1): 55-74. 1998.

27. J. V. Balsera, F. O. Fernandez, V. R. Montequin, R. C. Suarez, Effort Estimation in Information Systems Projects using Data Mining Techniques, Proceedings of the 13th WSEAS International Conference on COMPUTERS, pp. 652- 657, 2009.

28. K. T. Lum, D. R. Baker, J. M. Hihn, The Effects of Data Mining Techniques on Software Cost Estimation, IEEE International Engineering Management Conference, IEMC Europe 2008, pp. 1-5, 2008.

29. R. Bhatnagar, V. Bhattacharjee, M. K. Ghose, Software Development Effort Estimation - Neural Network Vs. Regression Modeling Approach, International Journal of Engineering Science and Technology Vol. 2, pp. 2950-2956, 2010.

30. R. Sonmez, Conceptual Cost Estimation of Building Projects with Regression Analysis and Neural Networks, Canadian Journal of Engineering 31(4): 677-683, 2004.

31. T. Evgeniou, M. Pontil,T. Poggio, Statistical Learning Theory: A Primer, International Journal of Computer Vision 38(1): 9–13, 2000. 32. D.W. Aha, D. Kibler, M. K. Albert, Instance-based Learning Algorithms, Machine Learning, Published in Journal Machine Learning

Archive, Vol. 6, Iss. 1, pp. 37-66, Jan. 1991, Kluwer Academic Publishers Hingham, MA, USA ,doi>10.1023/A:1022689900470, (1991).