Experience in Predicting Fault-Prone Software Modules Using Complexity Metrics

(1)

Vol. 9, No. 4, pp. 421-433, 2012

QT

Q

TQ

QM

M

©ICAQM 2012

Exper

ience

in

Pred

ict

ing

Fau

lt-Prone

Software

Modu

les

Us

ing

Comp

lex

ity

Metr

ics

LiguoYu1_and_A_lok_M_ishra2

1_Computer_Sc_ience_and_Informat_ics_,_Ind_iana_Un_ivers_ity_South_Bend_,_IN_,_USA 2_Department_of_Computer_&_Software_Eng_ineer_ing_,_At_i_l_im_Un_ivers_ity_,_Ankara_,_Turkey

(Received November 2011, acceptedMay 2012)

______________________________________________________________________

Abstract:Complexitymetricshavebeenintensivelystudiedinpredictingfault-pronesoftwaremodules. However,littleworkisdoneinstudyinghowtoeffectivelyusethecomplexitymetricsandtheprediction modelsunderrealisticconditions.Inthispaper,wepresentastudyshowinghowtoutilizetheprediction modelsgeneratedfromexistingprojectstoimprovethefaultdetectiononotherprojects.Thebinarylogistic regressionmethodisusedinstudyingpubliclyavailabledataoffivecommercialproducts.Ourstudyshows (1)modelsgeneratedusingmoredatasetscanimprovethepredictionaccuracybutnottherecallrate;(2) loweringthecut-offvaluecanimprovetherecallrate,butthenumberoffalsepositiveswillbeincreased, whichwillresultinhighermaintenanceeffort.We furthersuggestthatinordertoimprovemodel predictionefficiency,theselectionofsourcedatasetsandthedeterminationofcut-offvaluesshouldbe basedonspecificpropertiesofaproject.Sofar,therearenogeneralrulesthathavebeenfoundandreported tofollow.

Keywords:Binarylogisticregression,complexitymetrics,fault-pronesoftwaremodule.

______________________________________________________________________

1. Introduct

ion

oftwarequalityanalysisandpredictionfocusesondetectinghigh-riskfaultprone programmodules,allowingpractitionerstoallocateprojectresourcesstrategically[9, 34].Throughallocatingmoretestingresourcesonfault-pronemodules,wecaneffectivelyand extensivelytesttheproduct.Therefore,predictingfault-pronesoftwaremodules isan importanttechniqueusedinreducingtesttimeandtesteffortandisbecominganimportant researchtopicinrecentyears[7, 15-16, 20, 28-31, 36].

Inpractice,softwareengineersemployvariousmethodstoidentifyandrevisehigh-risk orlow-qualityprogrammodules[6,9].Effectivenessofdetectingsuchmodulesisaffectedby thesoftwaremeasurementsused.Inthesepredictivemodels,sourcecodemetricscouldbe usedasinputvariables[32,34].Oneofthemostcommonlyusedmethodsinthisareais binarylogisticregression[2-3,12-13, 19, 23, 37],whichisproventobeapowerfultechnique insoftwaredevelopmentandmaintenance.

However,mostofthepreviousstudiesusingbinaryregressionanalysisfocusedon identifyingcandidatecomplexitymetricsandbuildingrelationmodelsthatarecapableof identifyingfault-pronesoftwaremodules. Feweffortsarespentoninvestigatingthe applicabilityandusabilityofpredictionmodels.More specifically,theaccuracyofthe reportedmodelsisusuallyevaluatedusinggoodnessoffitofthemodelpredictionstothe samefaultdatageneratingthemodel.Fewstudiesareperformedtobuildcross-project predictionmodelsthatareevaluatedondifferentprojectswithasimilardevelopment environment.

(2)

Theideaofbuildingacross-projectpredictionmodelmightseempromising.Ontheone hand,asthefundamentalstructures,management,andmeasurementoftwoprojectscouldbe different,itmightnot befeasibletobuildthesemodels.Ontheotherhand,softwareprojects alwayssharesomecommonproperties:theyarebuildingsimilarproducts,targetingsimilar problems,usingsimilarmethods,orfollowingsimilardevelopmentprocesses.Itisreasonable tobelievethatsuchcross-projectmodels couldbebuiltundercertaincircumstances, especiallyifthetwoprojectsarewithinthesameapplicationdomain.Theobjectiveofthis studyistobuildcross-projectpredictionmodels andevaluatetheirpredictability.The findingsofthisstudywillbehelpfultounderstandifauniversalcrossprojectmodecouldbe built.

Inthispaper,wepresentacasestudyofusingcomplexitymetricstobuildbinary regressionmodelstopredictfault-pronesoftwaremodules.Comparingwithpreviousstudies, thispapermakesthefollowingcontributions:(1)webuildfault-pronenesspredictionmodels basedonavailableprojectsandassessthemondifferentprojects;and(2)weinvestigatehow toefficientlyadjustpredictionmodelstodetectmorefault-pronesoftwaremodules.The studyisperformedusingpubliclyavailablesoftwarefaultdata,whichmakesthestudyeasyto bereproducedandimproved.

The rest ofthis paperis organized asfollows. Section 2 reviews related work of using software complexity metrics to build binary logistic regression models in order to predict fault-prone software modules. Section 3 describes the data source and the mathematical method. Section 4 presents the results and the analysis of this study. Conclusions are presentedin Section 5.

2. Re

lated

Work

Binarylogisticregressionanalysishasbeenintensivelystudiedinsoftwareengineering field,especiallyinpredictingfault-pronemodules.Forexample,usingMozilla project, Gyimothyetal.[8] studiedhowtobuildapredictionmodelforerror-pronesoftwareclassesin open-sourceprojects.Vokac[35]foundsomedesignpatternscanbeusedtoidentify fault-proneclasses.SubramanyamandKrishnan[25]performedastudythatsupportsusing object-orienteddesigncomplexitymetrics (ChidamberandKemerer’sCKmetrics) in determiningsoftwaredefects.

MostrecentstudiesinthisareaarereportedbyOlague’sgroupandZhou’sgroup.For example,Olague etal.[21-22] foundthatobject-orientedcomplexitymetrics,suchas ChidamberandKemerer’sCKmetrics,Michura’sstandarddeviationmethodcomplexity metrics,BansiyaandDavis’qualitymetrics,andEtzkorn’saveragemethodcomplexityare goodcandidatepredictorsoffault-proneclasses. ZhouandLeung[38] investigatedthe usefulnessofobject-orienteddesignmetricsinpredictingfault-proneness.Theirstudyfound thatthepredictioncapabilitiesoftheinvestigatedmetricsgreatlydependontheseverityof faults.Inanotherstudy,Zhouetal.[39] describedhowtocorrectlyinterprettheperformance ofapredictionmodelbasedonoddsratio.Inparticular,theyfoundthatoddsratioassociated withonestandarddeviationincreaseshouldbeusedtorepresentmodeleffectiveness.

Thestudiesclosetoourworkarethoseintendedtoimprovetheapplicabilityofthe predictionmodels.Forexample,basedoncost-effectivenessanalysis,ArisholmandBriand [1] proposedasimpleandpragmaticmethodforassessingtheeffortofusingtheprediction models.ShatnawiandLi[26] appliedpredictionmodelstotheEclipseprojectandfoundthat althoughsomecomplexitymetrics canpredictfault-proneclasses,theaccuracyofthe predictiondecreasedfromreleasetorelease,whichpreventedthemfrombuildingamodelto

(3)

identifyevolvingerror-proneclasses.Briandetal.[2] studiedtheapplicabilityofusing predictionmodelsgeneratedfromoneJavaprojecttopredictfault-proneclassesinanother Javaproject.Theirstudyshowedthatamodelbuiltononesystemcanbe accuratelyusedto evaluateclassesinanothersystemaccordingtotheirfault-proneness.Menzies etal.[17] studiedtheperformanceceilingofpredictionmodels,i.e.,someinherentupperboundonthe amountofinformationofferedbycomplexitymetricsinidentifyingfault-pronemodules. Theyproposedusingcase-basedreasoningwhenapplyingpredictionmodels.

Asdescribedbefore,noneofthereportedresearchhasfocusedonevaluatingpredicting modelsacrossprojects.However,inreality,asoftwarepractitionerisinterestedinknowing, amongagivensetofsoftwaremetrics,whichonestouseandhowtogetoptimaldefect predictionresults.Thisistheobjectiveofthepresentstudy.

3. Data

Descr

ipt

ion

and

Mathemat

ica

l

Method

ThedatausedinthisstudyisretrievedfromonlinepublicrepositoryPROMISE[4].The originaldataisdonatedbySoftwareResearchLaboratoryofBogaziciUniversity,Istanbul, Turkey[27].Fivedatasetsareutilized(AR1,AR3,AR4,AR5,andAR6).Theseare embeddedsoftwareproductsimplementedinC.Eachdatasetcontainsthemeasurementsof 29staticcodeattributes(complexitymetrics)and1defectinformation (false/true)oftensto hundredsofmodules.ModuleattributeswerecollectedusingPrestMetricsExtractionand AnalysisTool[24,33].

Someofthe29metrics,suchasnumberoflinesofblankcodeandnumberoflinesof commentcode,havenoapparentrelationwithmodulefaults.Theyarenotincludedinthis study.Accordingly,24metricsareinitiallychosentobeusedinthisstudy.Theyaredescribed inTable1.Thedetaileddescriptionsofthesemetricscanbefoundin[18].

Table1.Softwaremoduleattributes(complexitymetrics)usedinthisstudy[18]. LineofCode

(LOC)Metrics HaMelstrteadics McCabeMetrics MisceMellaneoustrics TotalLOC Executable LOC Length Volume Level Difficulty

Programmingeffort Errorestimate Programmingtime

Cyclomatic complexity Cyclomaticdensity Decisiondensity Designcomplexity Designdensity Normalized

cyclomaticcomplexity

Uniqueoperands Uniqueoperators Totaloperands Totaloperators Branchcount Callpairs Conditioncount

Multipleconditioncount Formalparameters

Logisticregressionanalysisisamethodtopredictingacategoricalvariablefromasetof predictorvariables.Inthisstudy,itisusedtoanalyzetherelationsbetweensoftwaremodule attributes(complexitymetrics) andmodule defectinformation.More specifically,the predicteddependentvariablehasabinaryvalue,trueorfalse,wheretrueindicatesfaultsand falseindicatesnofaults.Theindependentvariablesaremoduleattributes(complexitymetrics) whichhavecontinuousnumericalvalues,suchasthoselistedinTable1.

Assume isthedependentvariableandvalue1representsthecorrespondingmodule

hasfaultandvalue0representingthecorrespondingmodulehasnofault.Alsoassume areindependentvariablesand

Y

1, , ,2 n

(4)

probabilitythatY 1when X x X x X x1 1, 2 2 , , n n.Accordingly,binarylogistic

regressionanalysiscangeneratethefollowingmodel[39].

1 1 1 1 1 2 ˆ Pr( 1 , , , ) 1 n n n n a bx bx n a bx bx e Y Y x x x e ˆ 1 Y . (1)

TherightsideexpressionofEquation(1)willgeneratearealnumber.Ifitsvalueis greaterthanorequalto0.5, ;Ifitsvalueislessthan0.5,Yˆ 0

ˆ Y ˆ Y 1 2 , , , , ab b b_n

.Here,0.5iscalledthe cut-offvalue.Usingdifferentcut-offvaluescangiveusdifferent predictionresults.Itcanbe seenthatlowingthecut-offvaluecanresultinhighvalueof ,increasingthecut-offvaluecan resultinlowvalueof .Ithasbeenacommonpracticetoadjustcut-offvaluesinorderto achievedifferentpredictionresults[14].

InEquation(1), and arecoefficientsoftheindependentvariables.In

statistics,theoddsareexpressedastheratiooftheprobabilityaneventthatwillhappenover theprobabilitytheeventwon’thappen.Therefore,wecanderivethefollowingoddsmodel.

1 1 1 2 ˆ ( 1 , , , ) _ˆ 1 n n a bx bx n Y ODDS Y x x x e Y . (2) Ourregressionmodelwillbepredictingthelogit,thatis,thenaturallogoftheoddsa modulehavingfaultsornot.Thatis,

1 1

ln(ODDS a) b x b x_{n n}

1 1

ln(ODDS a) b x

. (3)

4. Ana

lys

is

and

Resu

lts

Twenty-fourmoduleattributes(complexitymetrics)areusedinthisstudy.Notallof themarenecessarytoyieldacceptablepredictability.Therefore,ouranalysisisdividedinto twosteps.Thefirststepistodeterminethemoduleattributes(complexitymetrics)thatcan yieldacceptablepredictability.Thesecondstepisusingtheselectedmodule attributes (complexitymetrics)tobuildpredictionmodels.Thesetwostepsarepresentedinthe followingtwosubsections.

4.1. SelectingIndependentVariables

Foreachdataset(RA1,RA3,RA4,RA5,andRA6),weperformedscoreteststo determinewhethereachofthe24attributeswouldbesignificantpredictorsinthebinary regressionmodel.Inthesetests,asingleattributemodel(whichincludesinterceptofcourse), ,iscomparedwithanullmodel,ln(ODDS a,) whichhasnopredictors butjusttheintercept.Theresultofthescoretestscantellifaddingapredictor(attribute, independentvariable)toanullmodelcansignificantlychangethefitnessoftheprediction model.TheresultsofthescoretestsaresummarizedinTable2,whereitshowsthe significanceofthepredicatorinasingleattributemodelineachdataset.

InTable2,thesignificancevaluesthatareatthe0.01levelarehighlighted.Inthese24 attributes, 4 attributes (branch_count,condition_count,multiple_condition_count,and cyclomatic_complexity)aresignificantpredicatorsinallfivedatasets.Theyarefromtwo categoriesinTable1(McCabeMetricsandMiscellaneousMetrics)andaccordinglychosen astheindependentvariables.Tomakethedependentvariablesrepresentdifferentproperties ofasoftwareproduct,twootherattributesfromcategoriesLineofCodeMetricsandHalstead Metrics aredeterminedtobechosenastheindependentvariables.Physicallineofcode (executable_loc)isoneofthemostcommonlyusedmetricstomeasuresoftwarecomplexity, andestimatednumberofdeliveredbugs(halstead_error)reflectsthecombinedeffectsofother Halsteadmetrics.Accordingly,theyarealsochosenastheadditionalindependentvariables.

(5)

Therefore,inthisstudy,6moduleattributes(complexitymetrics)areusedastheindependent variables,X X 1, , 2 , X .6 TheyareboldedinTable2.Function(3)canaccordinglybe

modifiedasthefollowing,wherex x1, , 2 ,andx6arethemeasurementsoftheattributes

ofsixmetrics: X X 1, , 2 , X6 (branch_count,condition_count,multiple_condition_count,

cyclomatic_complexity,executable_loc,andhalstead_error).

1 1 2 2 6 6

ln(ODDS a bx bx ) bx (4)

Equation(4)isthespecificbinarylogisticmodelwearegoingtobuildinthisstudy.It showstherelationbetweentheoddsamodulehavingfaultsandthesixmeasurementsofa module’sproperties.Inthefollowingofthissection,wewilluseexistingdatatobuildthe modelsrepresentedinEquation(4),i.e.,todeterminethemodelcoefficients( , , , ab b _{1 2} , b₆) andevaluatemodelpredictability.

Table2.Summaryofscoretests.

Metrics AR1 AR3 AR4 AR5 AR6

total_loc .034 .000 .000 .000 .059 executable_loc .021 .000 .000 .000 .029 unique_operands .103 .000 .000 .000 .004 unique_operators .000 .001 .000 .000 .013 total_operands .010 .000 .000 .000 .211 total_operators .010 .000 .000 .000 .016 halstead_length .010 .000 .000 .000 .110 halstead_volume .011 .000 .000 .000 .081 halstead_level .028 .066 .002 .022 .001 halstead_difficulty .000 .000 .000 .007 .809 halstead_effort .001 .000 .000 .003 .304 halstead_error .013 .000 .000 .000 .001 halstead_time .001 .000 .000 .003 .305 branch_count .003 .000 .000 .000 .000 call_pairs .191 .459 .001 .119 .018 condition_count .003 .000 .000 .000 .001 multiple_condition_count .002 .000 .001 .000 .000 cyclomatic_complexity .003 .000 .000 .000 .001 cyclomatic_density .388 .854 .714 .185 .000 decision_density .038 .405 .026 .334 .392 design_complexity .191 .459 .001 .119 .207 design_density .959 .475 .638 .161 .219 normalized_cyclomatic_complexity .820 .879 .329 .591 .900 formal_parameters .988 .220 .515 .181 .126 4.2. BuildingandAssessingPredictionModels

Therearetwowaystobuildandevaluatefaultpredictionmodels:self-assessmentand forwardassessment.Therearefivedatasets(AR1,AR3,AR4,AR5,andAR6)inthisstudy. Figure1(a)illustratestheself-assessmentmodel,whereonepredictionmodelisbuiltoneach datasetandisevaluatedonthesamedataset.Forexample,thepredictionmodelbasedon datasetAR1isevaluatedondatasetAR1;thepredictionmodelbasedondatasetAR3is evaluatedondatasetAR3,andsoon.Figure1(b)illustratestheforwardassessmentmodel, wherepredictionmodelsarebuiltbasedononeormoredatasetsandevaluatedonadifferent dataset.Forexample,thepredictionmodelbasedondatasetAR1isevaluatedondataset AR3;thepredictionmodelbasedondatasetsAR1andAR3isevaluatedondatasetAR4,and soforth.

(6)

Figure1.(a)Self-assessmentmodel and(b)forwardassessmentmodel. 4.2.1.Self-Assessment

Inself-assessment,eachdataset(AR1,AR3,AR4,AR5,andAR6)isusedasthesource datatobuildapredictionmodel.Accordingly,fivelogisticmodelsrepresentedinEquation(4) arebuilt.Table3summarizesthemodelparameters,whereparametersab b, , , 1 2 ,and

6

bareconstant,branch_count,condition_count,multiple_condition_count,cyclomatic_complexity, executable_loc,andhalstead_error,respectively.

Table3.Theparametersoflogisticself-assessmentmodels. Modelparameters

Model numbersourceData_a _b₁ _b₂ _b₃ _b₄ _b₅ _b₆ S1 AR1 -4.027 -0.726 0.380 1.162 1.458 -0.182 21.111 S2 AR3 -3.245 0.367 -0.134 -0.643 -0.580 -0.028 6.530 S3 AR4 -3.913 0.557 -1.306 0.091 0.015 0.062 3.892 S4 AR5 -10.602 -2.890 0.255 5.574 5.631 0.027 5.788 S5 AR6 -2.383 -0.268 1.108 -0.902 -0.013 -0.050 0.016

Thesefivemodesarethenevaluatedusingthesamedatasetgeneratingthemodel.Using amodeltoevaluateeachmodule,wecanobtaintwomeasurementsofdependentvariableY: predictedandobserved.Inbinarylogisticregression,dependentvariableY hastwovalues (inbothobservationandprediction):positive(1),whichindicatesdefect,andnegative(0), whichindicatesdefect-free. Thecross-analysisofpredictionsagainstobservationscandivide dataintofourcategories,asshowninTable4[10].

Table4.Evaluationofpredictionagainstobservation[10]. Predicted

Positive (1) Negative(0) Positive(1) ₍_true_postp _i_t_ive₎₍_fa_lse_negafn _t_ive₎ Observed

(7)

Predicationaccuracy,predictionprecision,andrecallratearecommonlyusedmetricsto evaluatethebinarypredictionmodels.TheyaredefinedinEquations(5),(6),and(7)[10]. Predictionaccuracydescribesthegeneralpredictionpowerofamodel;predictionprecisionis usedtoevaluatethecorrectnessofpositivesignalpredictions;andrecallrateisusedtoevaluate thepredictionpowerforpositivesignals.Insoftwareengineeringfield,detectingfaultis consideredmostimportant,therefore,recallrate hasbeenconsideredtobethemostimportant metric.

tp tn Prediction accuracy

tp tn fn fp. (5) tp

Prediction accuracy

tp fp.

(6

)

tp

Recall rate

tp fn. (7) Table5summarizestheresultsofself-assessmentofthepredictionmodels.Itcanbeseen, exceptAR1,theotherfourmodelsyieldnear50%recallrate.However,aswediscussedin Section1,thesemodelsarebuiltandevaluatedusingthesamedataset.Theymightnot be feasibleforrealapplications.Therefore,thereisaneedtoperformforwardassessmentofthe predictionmodels.

Table5.Self-assessmentofpredictionmodels.

Modelnumber S1 S2 S3 S4 S5 Dataset(sourceandevaluating) AR1 AR3 AR4 AR5 AR6 Numberofdatasets 121 63 107 36 101 Predictionaccuracy 93% 95% 88% 94% 90% Predictionprecision 50% 86% 77% 100% 77% Recallrate 11% 75% 50% 75% 47% 4.2.2.ForwardAssessment

FollowingtheschemedescribedinFigure1(b),webuildfourforward-assessment models.Accordingly,fourlogisticmodelsrepresentedinEquation (4)arebuilt.Table6 summarizesthemodelparameters,whereparameters 1 2 and areconstant,

branch_count,condition_count,multiple_condition_count,cyclomatic_complexity,executable_loc, andhalstead_error,respectively.ItshouldbenotedthatModel F1issameasmodelS1, becausetheyaregeneratedfromthesamedataset(AR1).Differentfromself-assessment models(S1toS5),forward-assessmentmodels(F1throughF4)aregeneratedusingaggregate datasets.

, , , ,

ab b b6

Table6.Theparametersoflogisticforward-assessmentmodels. Modelparameters Model

number sourceData a b1 b2 b3 b4 b5 b6

F1 AR1 -4.027 -0.726 0.380 1.162 1.458 -0.18221.111 F2 AR1,AR3 -2.857 0.159 -0.001 -0.376 -0.226 -0.013 1.846 F3 AR1,AR3,AR4 -3.497 -0.154 -0.191 0.449 0.487 0.025 1.418 F4 AR1,AR3,AR4,AR5 -3.687 -0.268 -0.107 0.601 0.642 0.022 1.958

(8)

Thesefourforward-assessmentmodelsarethenevaluatedonDatasetsAR3,AR4,and AR5,andAR6,respectively.Table7summarizestheevaluationoffourpredictionmodels.In Table7,eachlatestmodelissupposedtooutperformthepreviousmodels,becauseofthe increaseofthesizeofsourcedataset.FromTable7,wecanseethatthepredictionaccuracy increasesfromModelF1toModelF4.Howeverthe predictionprecisionandthe recall rateare notinincreasingtrend.

Table7.Forwardassessmentofpredictionmodels(cut-offvalueis0.5). Modelnumber F1(S1) F2 F3 F4

Sourcedataset AR1 AR1,AR3 AR1,AR3,AR4 AR1,AR3,AR4,AR5 Evaluatingdataset AR3 AR4 AR5 AR6

Predictionaccuracy 76% 82% 83% 87% Predictionprecision 32% 100% 75% 100% Recallrate 75% 5% 38% 13%

Itshouldbenotedherethatasmoredataareincludedinthesourcedataset,weshould expectthemodelsyieldbetterpredictions.However,inTable7,onlythepredictionaccuracyis improvedthroughaddingmoredata;thepredictionprecisionandthe recallrateshownoclear trend.Thereasonforthisbehavioroftheseforward-assessmentmodelscouldbe duetothe differencesamongtheseprojects.Thesedifferencescouldbedesigndifference,measurement difference,managementdifference,orsomethingelse.Ontheotherhand,theseprojectsmust sharesomecommonproperties,asthepredictionaccuracyimproveswithmoredatasetsbeing added.

Furthermore,ifwecompareTable6andTable7,wecanseethatifmodelsare self-evaluatedonthesourcedatasetgeneratingthemodel,themodelperformance(prediction accuracy,predictionprecision,andrecallrate)isbetterthanforward-evaluation.Howeverin practice,self-assessmentislessusefulthanforwardassessment.Therefore,ourgoalisto improvetheperformanceofforwardpredictionmodels.Specifically,wewouldliketo improvetherecall rate,whichcanhelpusidentifymoreerror-pronemodules.

As described before,in Equation(1), has a value between 0 and 1 and represents the probability afault can occur. Whenthis modelis used, acut-off value should be assigned. I_{f is less than the cut-off value, it will be considered as 0; if is greater than the cut-off} value, it will be considered as 1. Conventionally, the cut-off value is set to be 0.5, which means,if islessthan 0.5, we will predict nofaultinthe software module andif is greaterthan 0.5, we will predictfaultinthe software module.

ˆ Y ˆ Y Yˆ ˆ Y Yˆ

Ifwewouldliketoidentifymoretruepositivemodules,weneedtousealowercut-off value.Followingthisscheme,cut-offvaluesof0.2and0.1areutilizedtoevaluatemodelsF1 throughF4.TheresultsaresummarizedinTable8andTable9.

Table8.Forwardassessmentofpredictionmodels(cut-offvalueis0.2). Modelnumber F1 F2 F3 F4

Accuracy 73% 83% 81% 87%

Precision 30% 75% 54% 100%

(9)

Table9.Forwardassessmentofpredictionmodels(cut-offvalueis0.1). Modelnumber F1 F2 F3 F4

Predictionaccuracy 65% 84% 72% 78% Predictionprecision 25% 62% 43% 23% Recallrate 88% 40% 88% 20%

FromTable8andTable9,wecanseethatthroughloweringthecut-offvalue,wecan improvetherecallrate.Forexample,inModelF2,therecallrateisimprovedfrom5%to15% and40%,respectivelywhenthecut-offvalueisloweredfrom0.5to0.2and0.1.Itshouldbe notedthatalthoughloweringthecut-offvaluecanimprovetherecallrate,itwillreducethe predictionaccuracyandthepredictionprecision,whichwillincreasethetestingeffort. Howeverinsomefields,suchassafety-criticalapplicationsandmissioncriticalapplications, detectingerrorsismoreimportantthanreducingthetestingcost,itisthereforereasonableto uselowercut-offvaluesinordertodetectallerror-pronemodules.

Basedonpreviousanalysis,wecanseethatbalancingof predictionprecisionandrecall rateisthebalancingofeffortandrisk.Ahigherpredictionprecisionmeanslowertesting effortinsoftwaredevelopmentandahigherrecallratemeanslowerriskoffaultsindelivered products.Therefore,theselectionofcut-offvaluesisnotjustatechnicalissue,buta management issue.Itdependsonthebudget,theavailableresource,andthequality requirementoftheproduct.

Anotherwaytosystematicallyevaluatetheeffectofdifferentcut-offvaluesistodraw receiveroperatingcharacteristiccurves[11].Accordingly,themetricfalloutisdefined.

fp Fall out

fp tn.

(8

)

InEquation(8),fpandtnaredefinedinTable4.Falloutisalsocalledfalsepositive rate(FPR).Correspondingly,recallrateisalsocalledtruepositiverate(TPR).Areceiver operatingcharacteristic(ROC)curve,orsimplyaROCcurve,isagraphicalplotoftrue positiverate(TPR),vs.falsepositiverate(FPR),forabinaryclassifiersystemasitscut-offvalueis varied.

Figure 2 shows the receiver operating characteristic (ROC) curves of Models F1 through F4. It can be seenthatthese curves have a normal behavior:thetrue positiverate has a positive relation with the false positive rate. In other words, increasing of true positive rate could resultintheincreasing of false positiverate. An optimumcut-off value could be obtained by examiningthe precision accuracy, which balancesthetrue positive rate andthefalse positive rate. Table 10 showsthe optimumcut-off values of Models F1through F4.

InTable10,thecut-offvaluesarethosethatcangeneratethelargestprediction accuracies.Inallfourmodels,throughusingdifferentoptimumcut-offvalues,wecan achieveabout88%predictionaccuracies.Also,itisinterestingtoseethattheoptimumcut-off valuesvarydramaticallyfromonemodel toanothermodel. Therefore,inreal-world applications,theoptimumcut-offvaluesshouldbecalibratedforeachdifferentproject.

(10)

Table10.Optimumcut-offvaluesofforwardassessmentmodels.

Modelnumber F1 F2 F3 F4

Optimumcut-offvalue 0.9999 0.1000 0.2500 0.2000 Falsepositiverate 0.0364 0.0574 0.1071 0.0 Truepositiverate 0.3750 0.4000 0.8750 0.1333 Largestpredictionaccuracy 0.8889 0.8411 0.8889 0.8713

(a) (b)

(c) (d)

Figure2.ROCcurvesofforward-assessmentModel(a)F1;(b)F2;(c)F3;and(d)F4. 4.3. Discussions

AsdescribedintheIntroduction,cross-projectpredictionmodelscanimprovesoftware productqualityandsoftwareprocessquality.Ourstudyshowsthatitispossibletoapplya modelgeneratedusingthedatafromoneprojecttopredictfaultsinanotherproject.Thisis reflectedintheincreasingofpredictionaccuracy ofourmodelsifmoredatasetsareemployed. However,ourstudyalsofoundpredictionprecisionandrecallratearenotimprovedunder largerdataset.Although theseobservations might bespecifictothisstudy,the

(11)

non-monotonicityofthepredictionaccuracy,predictionprecision,andrecallrateinthese experimentsseemtoindicateafundamentallackofuniversalityofthecross-project predictionmodels.Tofurtherproveourobservations,morestudiesareneededinthefuture.

5. Conc

lus

ions

Inthispaper,wepresentedourexperienceofusingcomplexitymetricstopredict error-pronesoftwaremodules.Wefound(1)forwardassessmentofpredictionmodelsusually haslowperformancecomparedtoself-assessment,althoughitisamorerealisticmeasureof thepredictionpowerofabinarylogisticregressionmodel;and(2)loweringcut-offvalues couldbeapracticaltechniqueinrevealingmoretruepositivemodules,althoughitmight generateadditionalfalsepositivesignals,whichwillresultinlowerpredictionaccuracy, lowerpredictionprecision,andhighertestingeffort.

Basedonourstudy,wesuggestthatinordertoimprovethemodelpredictionefficiencies, theselectionofsourcedatasetandthedeterminationofcut-offvaluesshouldbebasedon specificpropertiesofaproject.Forexample,ifreducingcost oftesting isthemajorobjective ofaproject,aregularcut-offvalue(0.5)orahighercut-offvaluecouldbeusedtoimprovethe predictionaccuracy;ifuncoveringthefaultymodulesisthemajorobjectiveoftheproject,a lowercut-offvalueshouldbeusedtoimprovetherecallrate.Currently,thereexistsnosucha generalrulethatwecanfollowtoguaranteedetectingmostofthefault-pronemoduleswhile reducingthetesteffort.

Acknow

ledgements

ThisstudyispartiallysupportedbytheFacultyResearchGrantofIndianaUniversity SouthBend.

References

1. Arisholm, E. and Briand, L. C. (2006). Predicting fault-prone components in a Java legacy system. Proceedings of the 5th_ACM_{-IEEE In}_te_rna_t_iona_{l Sympo}_s_{ium on Emp}_i_r_ica_l

Software Engineering, 8-17. Rio de Janeiro, Brazil.

2. Briand, L. C., Wust, J. and Lounis, H. (2001). Replicated case studies for investigating quality factors in object-oriented designs. Empirical Software Engineering, 6(1), 11-58.

3. Basili, V. R., Briand, L. C. and Melo, W. L. (1996). A validation of object-oriented design metrics as qualityindicators. IEEE Transactions on Software Engineering, 22(10), 751–761.

4. Boetticher, G., Menzies, T. and Ostrand, T. (2007). PROMISE Repository of empirical software engineering data, http://promisedata.org/ repository, Department of Computer Science, West Virginia University.

5. Briand, L. C., Melo, W. L. and Wust, J. (2002). Assessing the applicability of fault-proneness models across object-oriented software projects. IEEE Transactions on Software Engineering, 28(7), 706-720.

6. D’Ambros, M., Lanza, M. and Robbes, R. (2011). Evaluating defect prediction approaches: a benchmark and an extensive comparison. Empirical Software Engineering, 17(4-5), 531-577.

7. Graves, T. L., Karr, A. F., Marron, J. S. and Siy, H.(2000). Predictingfaultincidence using software change history. IEEE Transactions on Software Engineering, 26(7), 653-661.

(12)

8. Gyimothy, T., Ferenc, R. and Siket, L.(2005). Empirical validation of object-oriented metrics on open source software for fault prediction. IEEE Transactions on Software Engineering, 31(10), 897-910.

9. Gao, K., Khoshgoftaar, T. M. and Seliya, N. (2011). Predicting high-risk program modules by selectingthe right software measurements. Software Quality Journal, 20(1), 3-42.

10. http://en.wikipedia.org/wiki/Precision_and_recall

11. http://en.wikipedia.org/wiki/Receiver_operating_characteristic

12. Janes, A., Scotto, M., Pedrycz, W., Russo, B., Stefanovic, M. and Succi, G. (2006). Identification of defect-prone classes in telecommunication software systems using design metrics. Information Sciences, 176(24), 3711-3734.

13. Kanmani, S., Uthariaraj, V. R., Sankaranarayanan, V. and Thambidurai, P. (2007). Object oriented software fault prediction using neural networks. Information and Software Technology, 49(5), 483-492.

14. Kleinbaum, D. and Klein, M. (2002). Logistic Regression-A Self-Learning Text, Second Edition, Springer-Verlag New York, Inc.

15. Kastro, Y. and Bener, A.(2008). A defect prediction methodfor software versioning. Software Quality Journal, 16(4), 543-562.

16. Livshits, B. and Zimmermann, T.(2005). DynaMine:finding common error patterns by mining software revision histories. ACM SIGSOFT Software Engineering Notes, 30(5), 296-305.

17. Menzies, T., Turhan, B., Bener, A., Gay, G., Cukic, B. and Jiang, Y. (2008). Implications of ceiling effects in defect predictors. Proceedings ofthe 4th_In_te_rna_t_iona_l

Workshop on Predictor Modelsin Software Engineering, 47-54. Leipzig, Germany.

18. Metrics Data program, NASA IV&V Facility, http://mdp.ivv.nasa.gov/repository.html 19. Nagappan, N., Ball, T. and Zeller, A. (2006). Mining metrics to predict component

failures. Proceedings ofthe 28th_In_te_rna_t_iona_{l Con}_{ference on So}_f_twa_{re Eng}_inee_r_{ing, 452}_-461_.

Shanghai, China.

20. Ostrand, T. J., Weyuker, E. J. and Bell, R. M. (2005). Predicting the location and number of faults in large software systems. IEEE Transactions on Software Engineering, 31(4), 340-355.

21. Olague, H. M., Etzkorn, L. H., Messimer, S. L. and Delugach, H. S. (2008). An empirical validation of object-oriented class complexity metrics and their ability to predict error-prone classesin highlyiterative, or agile, software: a case study. Journal of Software Maintenance and Evolution: Research and Practice, 20(3), 171-197.

22. Olague, H. M., Etzkorn, L. H., Gholston, S. and Quattlebaum, S. (2007). Empirical validation ofthree software metrics suitesto predictfault-proneness of object-oriented classes developed using highlyiterative or agile software development processes. IEEE Transactions on Software Engineering, 33(6), 402-419.

23. Pai, G. J. and Dugan, J. B. (2007). Empirical analysis of software fault content and fault proneness using Bayesian methods. IEEE Transactions on Software Engineering, 33(10), 675-686.

24. Prest Metrics Extraction and Analysis Tool, http://softlab.boun.edu.tr/?q=resources&i=tools. 25. Subramanyam, R. and Krishnan, M.S. (2003). Empirical analysis of CK metrics for

object oriented design complexity:implicationsfor software defects. IEEE Transactions on Software Engineering, 29(4), 297-310.

(13)

error-prone classes in post-release software evolution process. Journal of Systems and Software, 81(11), 1868-1882.

27. Software Research Laboratory, Bogazici University, http://softlab.boun.edu.tr.

28. Turhan, B. and Bener, A. (2009). Analysis of Naive Bayes’ assumptions on software fault data: an empirical study.Data and Knowledge Engineering Journal, 68(2), 278-290. 29. Turhan, B., Bener, A. and Kocak, G. (2009). Data mining source codeforlocating

software bugs: a case study in telecommunication industry. Expert Systems with Applications, 36(6), 9986-9990.

30. Tosun, A., Turhan, B. and Bener, A.(2010). Ensemble of software defect predictors: a case study. Proceedings of the 2nd_In_te_rna_t_iona_{l Sympo}_s_{ium on Emp}_i_r_ica_{l So}_f_twa_re

Engineering and Measurement, 318-320. Bolzano/Bozen, Italy.

31. Turhan, B. and Bener, A. (2007). A multivariate analysis of static code attributes for defect prediction. Proceedings of the 7th_In_te_rna_t_iona_{l Con}_fe_{rence on Qua}_l_i_{ty So}_f_twa_re,

231-237. Portland, Oregon, USA.

32. Turhan, B. (2011). On the dataset shift problem in software engineering prediction models. Empirical Software Engineering, 2011.

33. Tosun, A., Bener, A., Turhan, B. and Menzies, T. (2010). Practical considerations in deploying statistical methods for defect prediction: A case study within the Turkish telecommunicationsindustry.Information and Software Technology, 52(11), 1242-1257. 34. Vivanco, R., Kamei, Y., Monden, A., Matsumoto, K. and Jin, D. (2010). Using

Search-Based Metric Selection and Oversamplingto Predict Fault Prone Modules. The 23rd_Canad_ian_Con_ference_E_lec_t_r_ica_l_and_Compu_te_r_Eng_inee_r_ing_,₁_-6_.

35. Vokac, M. (2004). Defect frequency and design patterns: an empirical study of industrial code.IEEE Transactions on Software Engineering, 30(12), 904-917.

36. Williams, C. C. and Hollingsworth, J. K. (2005). Automatic mining of source code repositories to improve bug finding techniques. IEEE Transactions on Software Engineering, 31(6), 466-480.

37. Yu, P., Systa, T. and Muller, H.(2002). Predictingfault-proneness using OO metrics: anindustrial case study. Proceedings ofthe 6th_Eu_ropean_Con_fe_rence_on_So_f_twa_re_Ma_in_tenance

and Reengineering, 99-107. Budapest, Hungary.

38. Zhou, Y and Leung, H. (2006). Empirical analysis of object-oriented design metrics for predicting high and low severity faults. IEEE Transactions on Software Engineering, 32(10), 771-789.

39. Zhou, Y., Xu, B. and Leung, H.(2010). Onthe ability of complexity metricsto predict fault-prone classes in object-oriented systems. Journal of Systems and Software, 83 (4), 660-674.

Author’sBiographies:

LiguoYureceivedthePhDdegreeincomputersciencefromVanderbiltUniversity. Heisan

AssociateProfessorofcomputerscienceatIndianaUniversitySouthBend.Hisresearch concentratesonsoftwaremeasurement,softwaredependency,andopen-sourcesoftware development.

AlokMishra isaProfessorofDepartmentofComputer&SoftwareEngineering,Atilim University, Turkey. His researchareaincludesSoftwareProcess,SoftwareQuality Assurance,AgileMethodologies,andSoftwareEngineeringEducation.

(14)