
3.5 The F007-plus Strategy

3.5.5 Executing F007-plus

In Table 12, we show the cost ratio Cf : Cnf that we used for the different releases of the UNIX utilities. For example, the value 1:20 (Cf : Cnf) for release 2 of the Flex program shows that we set Cnf to 20 and Cf to 1 on the training set obtained from release 1 to estimate faulty functions in release 2. Similarly, the ratio 1:5 (Cf : Cnf) in the last column for the Flex program shows that we applied this ratio to the training set (footnote 23) obtained from releases 1 to 4 to estimate faulty functions in release 5 of the Flex program.

23. We used SQL queries with the “UNION” keyword when extracting code metrics of multiple releases. This means that, in the training set of releases 1 to n-1, more than one record for a function exists only if the function was changed in one of the releases from 1 to n-1; otherwise, only one record per function is present in the training set.
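The deduplicating effect of UNION described in the footnote can be illustrated with a minimal sketch. The table names, column names, and metric values below are hypothetical and are not taken from our database schema; the sketch only assumes that an unchanged function has identical metric values in both releases.

```python
import sqlite3

# Hypothetical schema: one metrics table per release (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE metrics_r1 (func TEXT, loc INTEGER, complexity INTEGER);
    CREATE TABLE metrics_r2 (func TEXT, loc INTEGER, complexity INTEGER);
    INSERT INTO metrics_r1 VALUES ('parse', 40, 6), ('scan', 25, 3);
    -- 'parse' is unchanged in release 2; 'scan' was modified.
    INSERT INTO metrics_r2 VALUES ('parse', 40, 6), ('scan', 31, 4);
""")

# UNION (unlike UNION ALL) removes rows that are identical in every column,
# so an unchanged function contributes a single record to the training set.
rows = conn.execute("""
    SELECT func, loc, complexity FROM metrics_r1
    UNION
    SELECT func, loc, complexity FROM metrics_r2
    ORDER BY func, loc
""").fetchall()

for row in rows:
    print(row)
```

Here the unchanged function appears once and the changed function appears twice, once per distinct set of metric values.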

Table 12: Misclassification cost ratio “Cf : Cnf” for the following releases of the UNIX utilities, using the training sets of previous releases.

Program   Release 2   Release 3   Release 4   Release 5
Flex      1:20        1:30        1:5         1:5
Grep      1:45        1:10        1:5         NA
Gzip      1:5         1:7         1:3         NA
Sed       1:110       1:10        1:85        1:45

Recall from Section 3.5.2 that the selection of cost ratios depends on the subjective judgment of the user for a particular problem (Witten and Frank, 2005). We selected the cost ratios at which approximately 70% of the “faulty” functions in the training set were correctly classified, as described in Section 3.5.2. That is, for all programs and releases we applied the same criterion: if about 70% of the faulty functions are correctly predicted as faulty in the training set, with only a small proportion of not-faulty functions incorrectly predicted as faulty, then we select that cost ratio for the test set. For example, consider release 3 (R3) of the Flex program in Table 12, where we selected the cost ratio (Cf : Cnf) of 1:30.

The steps of F007-plus on R3 of Flex are described below:

• We selected the cost ratio of 1:30 for R3 of Flex because, in the training set (of R1 and R2), 20 out of 26 functions were correctly classified as “faulty” (i.e., approximately 77% of the faulty functions were correctly predicted in the training set), and 110 out of 186 functions were correctly classified as “not-faulty” at that cost ratio.

• We then assigned weights according to Cnf = 30 to the faulty instances in the training set and according to Cf = 1 to the non-faulty instances in the training set. This resulted in the new cost-sensitive training set.

• We generated the decision tree from this cost-sensitive training set of R1 and R2 to predict suspected “faulty” functions in the test set of release 3 (R3). This decision tree correctly predicted 8 out of 12 functions as “faulty”, and 93 out of 152 functions as “not-faulty”.

• We generated mutants of the 67 functions predicted as “faulty” (i.e., 8 correctly predicted faulty functions and 59 (152-93) not-faulty functions incorrectly predicted as faulty) for release 3 of the Flex program.

• Finally, mutant traces were collected on the mutants of these 67 functions (as described in Section 3.4.2), and a decision tree was generated (as described in Section 3.4.3) from those mutant traces and the failed traces of prior releases R1 and R2. This decision tree then predicted faulty functions in the actual traces of release R3. The accuracy of prediction of faulty functions in actual traces is shown in Section 3.8.
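The counts in the steps above can be checked with a short sketch. The class and function names are hypothetical and this is not our implementation; it only reproduces the arithmetic for R3 of Flex and shows one simple way, assumed here, of attaching the cost-based weights to training instances.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Classification counts for one data set (names are illustrative)."""
    faulty_total: int
    faulty_correct: int
    not_faulty_total: int
    not_faulty_correct: int

    @property
    def recall(self) -> float:
        # Share of truly faulty functions predicted as faulty.
        return self.faulty_correct / self.faulty_total

    @property
    def false_positives(self) -> int:
        # Not-faulty functions incorrectly predicted as faulty.
        return self.not_faulty_total - self.not_faulty_correct

    @property
    def flagged(self) -> int:
        # All functions predicted as faulty, i.e. the mutation candidates.
        return self.faulty_correct + self.false_positives


def instance_weights(labels, c_f=1, c_nf=30):
    """Weight faulty instances by Cnf and not-faulty instances by Cf."""
    return [c_nf if label == "faulty" else c_f for label in labels]


train = Outcome(26, 20, 186, 110)  # training set of R1 and R2
test = Outcome(12, 8, 152, 93)     # test set of R3

print(f"training recall: {train.recall:.2f}")
print(f"functions flagged in R3: {test.flagged}")
print(instance_weights(["faulty", "not-faulty", "faulty"]))
```

The flagged count of 67 is the sum of the 8 correctly predicted faulty functions and the 59 false positives, which is the set of functions for which mutants were generated.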

Following this approach of F007-plus, we also performed experiments on all other releases of all four programs (i.e., Flex, Grep, Gzip, and Sed) using the cost ratios shown in Table 12.
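The criterion used to pick each ratio in Table 12 can be sketched as a search over candidate Cnf values. This is an assumed formalization, not the procedure as implemented: the text fixes only the approximate 70% threshold and asks for a "small proportion" of false positives, so the sketch interprets that as choosing, among the candidates that reach the threshold, the one with the lowest false-positive rate. The `evaluate` callable and all outcome counts except the 1:30 entry are hypothetical.

```python
def select_cost_ratio(candidate_cnf, evaluate, min_recall=0.70):
    """Return the Cnf value (with Cf fixed at 1) that meets the recall
    threshold on the training set with the lowest false-positive rate.

    evaluate(c_nf) is a hypothetical callable returning (tp, fn, fp, tn)
    for a tree trained with that cost ratio; both classes are assumed
    to be present in the training set.
    """
    best = None
    for c_nf in candidate_cnf:
        tp, fn, fp, tn = evaluate(c_nf)
        recall = tp / (tp + fn)
        fp_rate = fp / (fp + tn)
        if recall < min_recall:
            continue  # too few faulty functions recovered
        if best is None or fp_rate < best[1]:
            best = (c_nf, fp_rate)
    return None if best is None else best[0]


# Only the entry for 30 uses counts from the text (Flex, R1 and R2);
# the entries for 10 and 50 are made up for illustration.
outcomes = {
    10: (14, 12, 40, 146),
    30: (20, 6, 76, 110),
    50: (23, 3, 120, 66),
}
chosen = select_cost_ratio([10, 30, 50], lambda c_nf: outcomes[c_nf])
print(f"selected ratio 1:{chosen}")
```

With these inputs the candidate 1:10 is rejected for low recall, and 1:30 is preferred over 1:50 because it misclassifies fewer not-faulty functions.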

An ultimate measure of the performance of a cost-sensitive learning algorithm is the average misclassification cost of the testing examples (or traces in test sets, in our case). A specific threshold does not exist, but a cost-sensitive learning algorithm should have a low average misclassification cost. The average misclassification cost is measured by using Equation 3. In Figure 29, we show the average misclassification cost for different releases of each of the four programs: Flex, Grep, Gzip, and Sed. In Figure 29, the Y-axis shows the average misclassification cost, and the X-axis shows program releases such that earlier releases form the training sets (of code metrics) for F007-plus and succeeding releases form the test sets. Each point on a series represents the average misclassification cost corresponding to the cost ratios in Table 12 for each program. For example, the first point on the “Flex” series in Figure 29 shows that when F007-plus was trained on the code metrics of release 1 and predicted suspected functions in release 2 using the cost ratio 1:20, the average misclassification cost was approximately 0.6. Note that the cost ratios are different for every release of a program, but the criterion for setting those cost ratios is the same (i.e., an approximate 70% threshold).

\[ \text{Average misclassification cost} = \frac{FP \cdot C_f + FN \cdot C_{nf}}{N} \]

Equation 3: Measures the average misclassification cost, where FP is the total number of functions predicted as false positives, Cf is the misclassification cost of predicting a function as faulty, FN is the total number of functions predicted as false negatives, Cnf is the cost of misclassifying a function as not-faulty, and N is the total number of traces.
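Equation 3 can be written as a small function. The input counts in the example are hypothetical and are not read from Figure 29; they are chosen only to show how heavily a false negative weighs when Cnf is large.

```python
def average_misclassification_cost(fp, fn, c_f, c_nf, n):
    """Equation 3: (FP * Cf + FN * Cnf) / N."""
    if n <= 0:
        raise ValueError("n must be positive")
    return (fp * c_f + fn * c_nf) / n


# Hypothetical counts with a 1:20 cost ratio (Cf = 1, Cnf = 20).
base = average_misclassification_cost(fp=10, fn=2, c_f=1, c_nf=20, n=100)

# One additional false negative costs as much as 20 false positives.
one_more_fn = average_misclassification_cost(fp=10, fn=3, c_f=1, c_nf=20, n=100)

print(base, one_more_fn)
```

Because each false negative contributes Cnf to the numerator, a low average cost under a ratio such as 1:20 is only possible when false negatives are rare.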

Figure 29: Average misclassification cost for the UNIX utilities. [Plot not reproduced. Series: Flex, Grep, Gzip, Sed. Y-axis: average misclassification cost (0 to 3). X-axis: R1 to R2, R1-R2 to R3, R1-R3 to R4, R1-R4 to R5.]

It can be observed from Figure 29 that the average misclassification cost mostly decreases as the number of releases increases; in other words, as the number of training instances increases, the average misclassification cost goes down. However, in some cases the average misclassification cost also increases slightly. The reason is that we selected a different cost ratio for each release of a program. In cost-sensitive learning, cost ratios are usually kept the same. In our case, we developed a criterion (the 70% threshold) to select cost ratios, and this criterion remains constant. The reason for developing such a criterion is that every program is different, and setting cost values is not straightforward for maintainers even if they have knowledge of cost-sensitive learning. Thus, we selected the same criterion of a 70% threshold for true positives in the training set with a small proportion of false positives. In short, in our case the cost ratios may not be the same, but the criterion for selecting those cost ratios is the same.

Overall, the average misclassification cost in Figure 29 remains low or (mostly) decreases over the releases. By Equation 3, the average misclassification cost approaches zero only as the numbers of false negatives and false positives approach zero. Moreover, because the Cnf value is high, even a few false negatives (FN) make the average misclassification cost high (see Equation 3). A low cost therefore implies that F007-plus produces few false negatives and false positives, even if the training set contains instances from only the first release. The suspected (to be faulty) functions predicted by F007-plus constituted approximately 10-40% of the total functions. For example, F007-plus predicted: (a) 44-56 faulty functions in the five releases of the Flex program; (b) 12-43 faulty functions in the four releases of the Grep program; (c) 29-45 faulty functions in the four releases of the Gzip program; and (d) 4-28 faulty functions in the five releases of the Sed program. After identifying the suspected functions, we collected mutant traces for those functions and trained the decision trees on those traces. The results showing the accuracy of prediction of faulty functions in actual traces are described in Section 3.8.

3.6 Implementation, Scalability and Runtime Performance