
3.7 Case Studies to Investigate Research Questions: (Q1) Similarity of Traces

3.7.2 Making only the selected functions faulty using mutants to identify

To further validate the above proposition, we trained the decision tree on the mutant traces of only those faulty functions that were also faulty in the actual traces. We made this decision because: (a) it would allow us to identify whether different faults in the same function occur with the same traces; and (b) it would help to further validate that traces of some faulty functions are similar. The decision tree trained on the mutant traces of the selected faulty functions was then used to identify faulty functions in the actual failed traces. The results are shown in Figure 32 by the series “using mutant traces of the same faulty functions as in the actual traces”; this series shows both the best and the worst case, marked by ■ and ●. The best and the worst case efforts, however, overlap and no significant difference is noticeable, except for the first point, which is approximately (1,30) in the worst case and (1,32) in the best case. These results were obtained by using 30 mutant traces per function.

Figure 32: Faulty function prediction accuracy by using failed traces of the same faulty functions on the Space program.

To compare against the accuracy obtained with mutant traces of the selected faulty functions, we also trained the decision trees on 1% of the actual failed traces and tested them on the remaining 99% of the actual failed traces. This 1% of the actual traces has the same faulty functions as the remaining 99% of the traces for the Space program (see footnote 27). This is shown in Figure 32 (for the Space program) by the best case (marked by ▲) and the worst case (marked by ♦) of the series “using only 1% actual traces for training and the actual 99% traces for testing”.

27. We divided the data into 100 equal parts using the Weka API (Witten and Frank, 2005); each part contained an equal proportion of failed traces for every faulty function. We used one part for training and the remaining 99 parts for testing. This is called stratification; without stratification, faulty functions with fewer traces would be missing from some parts, resulting in incorrect classification accuracy.
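The stratified split described in the footnote can be illustrated with a minimal sketch. This is not the Weka implementation used in the study; the function name `stratified_parts`, the round-robin dealing strategy, and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def stratified_parts(labels, n_parts, seed=0):
    """Split trace indices into n_parts so that each part holds a
    roughly equal share of the traces of every faulty function."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    parts = [[] for _ in range(n_parts)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the traces of this function round-robin across the parts.
        for pos, idx in enumerate(indices):
            parts[pos % n_parts].append(idx)
    return parts

# Hypothetical data: three faulty functions with unequal trace counts.
labels = ["f1"] * 600 + ["f2"] * 300 + ["f3"] * 100
parts = stratified_parts(labels, n_parts=100)

train = parts[0]                                   # 1% for training
test = [i for part in parts[1:] for i in part]     # 99% for testing
print(len(train), len(test))
```

In this toy data set every part contains traces of all three functions, including the rarest one (`f3`), which is the property the footnote relies on.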

Again, the difference between the worst case and the best case is barely noticeable. It can be observed from Figure 32 that the gap between the accuracy of identifying faulty functions using actual traces and using mutant traces is quite narrow. For example, using the mutant traces of the same faulty functions as the actual traces for the training set, the faulty functions in approximately 77% of the actual traces can be identified by reviewing 10% or less of the code. Similarly, in Figure 32, using 1% of the actual traces as the training set, the faulty functions in approximately 97% of the traces can be identified by reviewing 3% or less of the code. This shows that the faulty functions in the majority of the actual traces were identified by training the decision trees on different faults (mutants) of the same function.
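Statements of the form "faulty functions in X% of the traces can be identified by reviewing Y% or less of the code" can be read as points on a cumulative curve. The sketch below shows one way such a curve could be computed, assuming that for every failed trace the classifier yields a ranked list of candidate functions and that the rank of the true faulty function is known. The function name `review_effort_curve`, the rank values, and the program size of 100 functions are hypothetical and are not taken from the study.

```python
def review_effort_curve(ranks, n_functions, budgets_pct):
    """For each review budget (percent of functions reviewed), return the
    percent of failed traces whose faulty function lies within the budget."""
    curve = []
    for pct in budgets_pct:
        max_reviewed = n_functions * pct / 100
        found = sum(1 for r in ranks if r <= max_reviewed)
        curve.append((pct, 100 * found / len(ranks)))
    return curve

# Hypothetical ranks of the true faulty function, one per failed trace.
ranks = [1, 1, 2, 3, 5, 8, 13, 20, 40, 90]
curve = review_effort_curve(ranks, n_functions=100, budgets_pct=[1, 3, 10, 100])
for pct, identified in curve:
    print(f"review {pct}% of functions -> {identified}% of traces identified")
```

Under this reading, a best-case and a worst-case series would correspond to two different rank lists for the same traces, for example when several candidate functions are tied and the faulty one is counted first or last among them; that interpretation is also an assumption of this sketch.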

Figure 33: Faulty function prediction accuracy by using failed traces of the same faulty functions on the UNIX utilities.

Following an approach similar to that of Figure 32, the results on the UNIX utilities are shown in Figure 33. For the UNIX utilities we used 10% of the actual traces for training, instead of 1%, because there were fewer failed traces in the UNIX utilities than in the Space program (see Table 11). In the case of the Space program, there were about 72000 failed traces, and the use of 10% of the traces still resulted in approximately 7000 traces for training. On the other hand, all the UNIX utilities had fewer than 500 failed traces. Because of this, we selected 10% of their traces for training. The reason is that the decision tree requires a sufficient number of traces for training; for example, the literature (Witten and Frank, 2005) recommends selecting more than 50% of the data for training when the data set is not large, and the 10% we used is still much less than the recommended 50%.
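The effect of the training proportion on the absolute number of training traces can be checked with a short calculation. The trace counts are the approximate figures quoted above (about 72000 for the Space program, and 500 as an upper bound for any UNIX utility); the helper name `training_size` is introduced here only for illustration.

```python
def training_size(n_failed_traces, train_pct):
    """Number of traces that fall into the training set."""
    return round(n_failed_traces * train_pct / 100)

space_traces = 72000     # approximate count for the Space program
unix_traces_max = 500    # upper bound for any single UNIX utility

sizes = {
    ("Space", 1): training_size(space_traces, 1),
    ("Space", 10): training_size(space_traces, 10),
    ("UNIX utility (upper bound)", 1): training_size(unix_traces_max, 1),
    ("UNIX utility (upper bound)", 10): training_size(unix_traces_max, 10),
}
for (program, pct), n in sizes.items():
    print(f"{program}: {pct}% of traces -> {n} training traces")
```

With 1% of its traces, a UNIX utility would contribute at most 5 training traces in total across all of its faulty functions, which illustrates why a larger proportion was needed there.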

It can be observed from Figure 33 that the accuracy of “using mutant traces of the same functions as in the actual traces” for training and the accuracy of “using 10% of the actual traces” for training are quite close. Note also that the difference between the best and the worst case efforts is hardly noticeable for either the mutant-trace series or the actual-trace series. For the UNIX utilities, in Figure 33, we used 30 mutant traces per function to train the decision tree on the mutant traces (25 mutant traces per function could have been used too, as we discussed in Section 3.7.1).

In both cases (Figure 32 and Figure 33), faulty functions in 90-95% of the failed traces can be identified by reviewing 2% (≈ 3 functions) of the program when a proportion of the actual traces is used for training. On the other hand, by using mutant traces, faulty functions in approximately 60% of the failed traces can be identified by reviewing 3% (≈ 4 functions) of the program in Figure 32 and Figure 33. This implies that: (a) function-call paths triggered by different faults in the same function are not exactly the same but are similar, because the accuracy of mutants and that of a proportion of actual faults (in Figure 32 and Figure 33) are not the same; (b) there are groups of related functions that occur with overlapping function calls when a fault occurs in a function of the same group, because we need to review a few functions before identifying the faulty function; and (c) different groups of functions have few overlapping function calls, otherwise we would have had a very low accuracy in identifying faulty functions using mutants; we can identify faulty functions in approximately 70-80% of the failed traces using mutants by reviewing 10% or less of the code (in Figure 32 and Figure 33). These observations are the same as what we observed in Figure 30 and Figure 31 in Section 3.7.1.