3.7 Case Studies to Investigate Research Questions: (Q1) Similarity of Traces
3.7.1 Making every function faulty using mutants to identify faulty
Recall from Section 3.4.2 that we randomly selected three mutants per function and collected ten failed traces per mutant; that is, a maximum of 30 failed traces for every function of a program. During our investigations, we observed that for some functions the number of failed traces was less than this maximum of 30. This is because, in some cases, the randomly selected mutants did not compile, only a few test cases failed on the mutant, or no test cases failed on the mutant at all. In short, there were 30 or fewer failed mutant traces per function.
Nonetheless, in order to investigate how many failed mutant traces per function are enough to identify faulty functions in the actual failed traces, we experimented with a maximum of 5, 10, 15, 20, 25 and 30 mutant traces per function. In other words, we trained the decision tree on at most 5, 10, 15, 20, 25 or 30 failed traces per function to identify faulty functions in the actual traces. Figure 30 shows the results obtained for 5, 10, 15, 20, 25 and 30 mutant traces per function for the Space program. In Figure 30, the X-axis represents the percentage of the program to be examined in discovering faulty functions. It is measured by the percentage of functions[24] reviewed up to the discovery of the faulty function in a program, as shown in Equation 4.
\[
\%\ \text{of program to review} = \frac{\text{Functions reviewed up to the faulty function}}{\text{Total functions}} \times 100
\]
Equation 4: Estimating program review effort in functions.
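
As a minimal sketch of Equation 4, the score can be computed as below. The function name and the example value of 14 reviewed functions are hypothetical and serve only as an illustration; 136 is the total number of functions reported for the Space program later in this section.

```python
def review_effort_percent(functions_reviewed: int, total_functions: int) -> float:
    """Equation 4: percentage of the program reviewed up to the faulty function."""
    if total_functions <= 0:
        raise ValueError("total_functions must be positive")
    if not 0 <= functions_reviewed <= total_functions:
        raise ValueError("functions_reviewed must lie between 0 and total_functions")
    return functions_reviewed / total_functions * 100


# Hypothetical example: the faulty function is the 14th function reviewed
# in a program with 136 functions.
score = review_effort_percent(14, 136)
print(f"{score:.1f}% of the program reviewed")  # prints "10.3% of the program reviewed"
```

A lower score means that fewer functions had to be reviewed before the faulty function was reached.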
Using Equation 4, we compute a score for each failed trace as the percentage of the program (i.e., functions or statements) that needs to be reviewed to find the faulty function. The horizontal axis (X-axis) represents the percentage of the program that needs to be examined and is divided into segments. Each segment spans 10 percentage points, except for the first 10 segments, which span 1 percentage point each; i.e., the 1-10% range is divided into segments of 1 percentage point, and the remaining segments up to 100% span 10 percentage points each. The vertical axis (Y-axis) measures the cumulative percentage of failed traces that achieve a score within a segment[25]. For example, in part ‘a’ of Figure 30, the point (10, 60) on the series “using 25 traces per function” (marked by “▬”) shows that the faulty functions in approximately 60% of the actual failed traces were discovered by reviewing 10% or less of the code (functions) of the Space program. This identification in the actual traces was done by training the decision tree algorithm on 25 or fewer failed mutant traces for every function of the Space program. Note that a straight line at the end of a series, running up to 100% of the traces with no more visible points, means that F007-plus does not produce any further predictions of faulty functions in traces and that a developer identifies the faulty functions in the remaining traces by random guessing. For example, in the case of the series “using 25 traces per function” in part ‘a’ of Figure 30, 84% of the failed traces were resolved correctly by reviewing 50% of the program using F007-plus, after which a developer randomly guesses the faulty functions in the remaining 16% of the traces.
[24] We used functions, not statements, because the programmer would review the functions in the function-call trace to discover the faulty functions.
[25] We have taken this approach from the similar graphical convention used for the evaluation of the developer’s effort by Jones and Harrold (2005), Wong et al. (2007) and Di Fatta et al. (2006).
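
The cumulative curve described above can be sketched as follows. This is only an illustration under stated assumptions: the segment bounds follow the description of the X-axis (1-point segments up to 10%, then 10-point segments up to 100%), the function name is hypothetical, and the scores are made-up values, not results from the experiments.

```python
from bisect import bisect_right

# Upper bounds of the X-axis segments: 1..10 in steps of 1, then 20..100 in steps of 10.
SEGMENT_BOUNDS = list(range(1, 11)) + list(range(20, 101, 10))


def cumulative_curve(scores):
    """Return (bound, cumulative % of traces with score <= bound) for every segment bound."""
    if not scores:
        raise ValueError("at least one score is required")
    ordered = sorted(scores)
    total = len(ordered)
    # bisect_right counts how many sorted scores are <= bound.
    return [(bound, 100 * bisect_right(ordered, bound) / total) for bound in SEGMENT_BOUNDS]


# Made-up Equation 4 scores for five failed traces (illustration only).
example_scores = [0.7, 2.9, 8.0, 10.0, 35.0]
curve = dict(cumulative_curve(example_scores))
print(curve[10])   # 80.0: four of the five traces need a review of 10% or less
print(curve[100])  # 100.0
```

Each point (x, y) on a series in Figure 30 is read the same way: y percent of the failed traces have a score of x percent or less.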
Figure 30: Faulty function prediction accuracy for the Space program on its failed traces of actual faults using the failed traces of mutants of all functions.
Recall from Section 3.4.3 that F007-plus generates a ranked list of suspected functions for a potential failed trace. It is possible for F007-plus to list two or more functions at the same rank. In that case, the best-case effort assumes that the faulty function is the first function examined at that rank, and the worst-case effort assumes that it is the last. For example, suppose there is one function listed at rank 1 and five functions listed at rank 2, one of which is faulty. In the best case, the faulty function is the second function to be examined (i.e., one at rank 1 and one at rank 2), whereas in the worst case it is the sixth to be examined.
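
The best-case and worst-case effort under tied ranks can be sketched as below. The data structure (a mapping from function name to rank) and all names are assumptions made for illustration; they are not the representation used by F007-plus.

```python
def effort_bounds(ranks, faulty):
    """Return (best, worst) number of functions examined until `faulty` is reached.

    `ranks` maps each suspected function to its rank (1 = most suspicious);
    several functions may share a rank.
    """
    if faulty not in ranks:
        raise KeyError(f"{faulty!r} is not in the ranked list")
    faulty_rank = ranks[faulty]
    ranked_higher = sum(1 for r in ranks.values() if r < faulty_rank)
    tied = sum(1 for r in ranks.values() if r == faulty_rank)  # includes the faulty function
    # Best case: faulty function is examined first among its ties; worst case: last.
    return ranked_higher + 1, ranked_higher + tied


# The example from the text: one function at rank 1, five functions at rank 2,
# and the faulty function is one of the five (function names are hypothetical).
ranks = {"f1": 1, "g1": 2, "g2": 2, "g3": 2, "g4": 2, "g5": 2}
print(effort_bounds(ranks, "g3"))  # (2, 6)
```

Dividing either bound by the total number of functions, as in Equation 4, gives the best-case and worst-case review percentages.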
Part ‘a’ of Figure 30 shows the best-case effort of the programmer when using F007-plus with 5-30 mutant traces per function of the Space program, and part ‘b’ shows the worst-case effort with the same numbers of mutant traces. The same series is represented by the same colour and symbol in both parts; for example, the “using 25 traces per function” series is represented by the pink colour and the symbol “▬” in both the best and the worst case. It can be observed from parts ‘a’ and ‘b’ of Figure 30 that when F007-plus uses fewer mutant traces per function, the best-case accuracy is higher than with a larger number of mutant traces per function, whereas the worst-case accuracy is lower than or similar to it. For example, when five mutant traces per function were used, F007-plus identified faulty functions in 90% of the traces on the review of 3% or less of the program in the best case (see part ‘a’), whereas in the worst case (see part ‘b’) it identified faulty functions in only 20% of the failed traces. If the difference between the worst case and the best case is too large for a series, most of the functions are listed at the same rank, and the number of mutant traces per function represented by that series is ineffective. This also means that, with fewer mutant traces, the decision tree did not get sufficient information to predict faulty functions in the actual traces, and most of the suspected faulty functions were predicted with the same probability.
In Figure 30, the series for 25 and 30 traces per function each have a small gap between their worst and best cases, implying that fewer functions are listed at the same rank when 25 or 30 mutant traces per function are used. With 25 mutant traces per function, both the worst-case effort and the best-case effort are better than with 30 mutant traces per function. However, the gap between the worst and the best case is smaller for 30 mutant traces per function than for 25. Thus, we can say that F007-plus with 25-30 mutant traces per function is able to identify faulty functions in 50-60% of the actual failed traces on the review of 20% of the code for the Space program.
The test suite of the Space program was more extensive than those usually produced in practice: approximately 13,000 test cases for approximately 6000 lines of code. This means that almost all of the functions and control flow paths were exercised by the test cases. The UNIX utilities are almost the same size as the Space program, but their test suites are not as extensive; they mimic the real-world scenario closely. On the other hand, recall from Section 3.4.2 that a test suite that does not exercise all the paths is weak at detecting faults and will eventually leave many mutants live or equivalent.
Nonetheless, we show separately in Figure 31 the accuracy of identification of faulty functions on the four UNIX utilities (i.e., Flex, Grep, Gzip and Sed) using 5, 10, 15, 20, 25 and 30 failed mutant traces. For this experiment, we randomly chose release 1 (R1) (see Table 11) from the several releases of the UNIX utilities. We generated mutants of release 1 (R1) of each of the Flex, Grep, Gzip and Sed programs. We again randomly selected three mutants per function of a program and collected traces by running the test cases of the respective programs on their mutants. In Figure 31, each series shows the accuracy of the identification of faulty functions on release 1 (R1) of all four UNIX utilities taken together. The percentage of failed traces for each series is measured by first summing the number of actual failed traces of all four UNIX programs that achieve a (program-review) score within each segment (on the X-axis) and then dividing the sum by the total number of failed traces of all four programs. The results are then shown as the cumulative percentage of failed traces on the Y-axis. For example, in part ‘a’ of Figure 31, the point (30, 50) on the series “using 30 (mutant) traces per function” shows that only 50% of the failed traces were resolved correctly in the Flex, Grep, Gzip and Sed programs by reviewing 30% or less of the code.
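
The pooling over the four programs can be sketched as follows. The function name is hypothetical and the per-program scores are made-up values chosen only to reproduce a point of the form (30, 50); they are not the experimental data.

```python
def pooled_cumulative_percent(scores_by_program, bound):
    """Percentage of all failed traces, pooled over programs, with score <= bound."""
    total = sum(len(scores) for scores in scores_by_program.values())
    if total == 0:
        raise ValueError("no failed traces given")
    within = sum(
        1 for scores in scores_by_program.values() for score in scores if score <= bound
    )
    return 100 * within / total


# Made-up Equation 4 scores per program (illustration only).
example = {
    "Flex": [5, 25, 60],
    "Grep": [12, 45],
    "Gzip": [28, 90, 8],
    "Sed": [35, 75],
}
print(pooled_cumulative_percent(example, 30))  # 50.0
```

Because the counts are summed before dividing, a program with more failed traces contributes more to the pooled percentage than a program with fewer.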
As in Figure 30 (for the Space program), we also show the best-case and the worst-case accuracy for the UNIX utilities in Figure 31. It can again be observed from Figure 31 that the use of fewer mutant traces per function results in a larger difference between the worst and the best case. Also, the use of 25 mutant traces per function results in almost identical best-case and worst-case effort for the UNIX utilities in Figure 31. Similarly, the use of 30 mutant traces per function results in identical best-case and worst-case effort. This means that there was rarely more than one function listed at the same rank for 25 and 30 mutant traces per function. Thus, we can state that F007-plus with 25-30 mutant traces per function is able to identify faulty functions in 50% of the actual failed traces on the review of 30% of the code for the UNIX utilities.
Figure 31: Accuracy of identification of faulty functions in the actual traces using mutant traces on the UNIX utilities.
Overall, the accuracy of identification of the faulty functions on the UNIX utilities in Figure 31 is low compared to the Space program in Figure 30. In the case of the UNIX utilities, the test suites were not as exhaustive as in the case of the Space program. This left many mutants live or equivalent in the UNIX utilities (see Section 3.4.2), or resulted in few failed test cases on the mutants of faulty functions. For example, in the case of the Sed program (release 1), we were able to collect failed traces for only 87 functions out of a total of 183 functions. Similarly, we collected mutant traces for 79 functions out of a total of 89 for the Gzip program (release 1), 68 out of 142 for the Grep program (release 1), and 130 out of 151 for the Flex program (release 1). In the case of the Space program, which has a very large test suite, mutant traces were collected for 116 functions[26] out of a total of 136 functions.
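
To put the counts above on a common scale, the share of functions with mutant traces can be computed per program. The sketch below uses only the counts quoted in the preceding paragraph; the dictionary layout and function name are illustrative assumptions.

```python
# (functions with collected mutant traces, total functions), as quoted in the text.
mutant_trace_counts = {
    "Sed (R1)": (87, 183),
    "Gzip (R1)": (79, 89),
    "Grep (R1)": (68, 142),
    "Flex (R1)": (130, 151),
    "Space": (116, 136),
}


def coverage_percent(covered: int, total: int) -> float:
    """Percentage of a program's functions for which mutant traces were collected."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100 * covered / total


for program, (covered, total) in mutant_trace_counts.items():
    print(f"{program}: {covered}/{total} = {coverage_percent(covered, total):.1f}%")
```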
This implies that extensive test suites would result in better accuracy (as in Figure 30) than smaller test suites (as in Figure 31). The reason is that extensive test suites, in our investigation, resulted in more failing traces, which covered many flow paths for the faults in the same functions and gave the decision tree more knowledge with which to identify the faulty functions. Further, in both Figure 30 and Figure 31, the series for “25 mutant traces per function” and “30 mutant traces per function” show better results than the series for smaller numbers of mutant traces per function.
The results in Figure 30 and Figure 31 show that, by using the failed traces of the mutants of every function, the faulty functions in the actual traces are not entirely distinguishable, particularly for the smaller test suites. This is because accuracy at or close to 100% was not obtained on the review of 1% (or slightly more) of the code. This also implies that different faults in the same function do not occur with a similar sequence of function calls, but instead overlap with the function calls of faults in some other functions.
In fact, Figure 30 and Figure 31 show that there are M groups of closely related functions, and the functions in each group call each other or regularly call the same functions. When a fault occurs in one of the functions of a group (e.g., Mi), the function calls overlap. When a fault occurs in a function of another group Mk, there are few function calls that overlap with the function calls of faults in groups other than Mk. The reason is that if the function calls of all the functions had overlapped, we would have had to review about 100% (or close to 100%) of the program to identify the faulty functions in any trace. We could still find faulty functions in 60% of the failed traces of the Space program (see Figure 30) by looking at 20% of the program (functions) when using 25 mutant traces per function. Similarly, we could find faulty functions in 50% of the traces of the UNIX utilities by reviewing 30% of the code (see Figure 31) when using 25-30 mutant traces per function.
[26] Some of the functions in the Space program have no statements in the body; so, no valid mutants and no actual faults were possible in them.
Thus, from these results we can state that: “A group Mi of related functions has similar function-call traces when a fault occurs in the functions of that group Mi; but the function-call traces of Mi are different from the function-call traces of another group of functions Mk if a fault occurs in the functions of group Mk, where i, k ∈ {1, ..., n}, i ≠ k, Mi ⊂ N, Mk ⊂ N and N = {functions | functions ∈ program}.” This answers the first research question (Q1): traces of some different faulty functions are similar (those within the same group), while traces of other faulty functions are different. Also, we found that faulty functions in 50-60% of the failed traces can be identified by making every function faulty using mutants. This identification requires the review of 20-30% of the code. This answers the second research question (Q2).