In this section, we answer our final research question: what do our data suggest about the limits on test case selection?
3.5.1 RQ4: How Often Do Tests Expose Faults?
The log parsing technique described in the previous section allowed us to count the number of tests that pass and fail on each build to establish the percentage of tests that failed in failing builds. Figure 3.5 shows the proportion of test failures from 40 projects across 586 Pass → Fail → Pass test suite executions from which we could parse the results. From this set, the average test failure rate was 0.38%. It is important to note that this is not the global test case failure rate, but just the test case failure rate within the builds that had at least one test failure.
Figure 3.5: Proportion of test cases that fail for the 40 projects we were able to parse individual-test results from across the failing build of 586 tuples. Three data points (16.9%, 8.7% and 4.1%) have been elided for clarity. The average project failure rate was 0.38%.
This number is helpful for understanding the potential effectiveness of test selection approaches as it establishes the absolute minimum fraction of tests a selection approach could execute. That is, if a test selection approach only executed tests from builds that would fail, and then only executed the failing tests, the approach would have to execute an average of 0.38% of the test suite. Assuming that these failures are spread out equally between the tuple types in Table 3.4 would mean that 25.9% of these test failures did not find faults, bringing the percentage of test case executions that do find faults down to 0.28%.
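The arithmetic behind the 0.38% and 0.28% figures can be sketched as follows. This is a minimal illustration rather than the analysis script used in the study: the function names and the sample build counts are hypothetical, and the sketch assumes, as the text does, that failures which did not find faults are spread evenly across the failing test executions.

```python
def failure_rate(failed: int, executed: int) -> float:
    """Fraction of executed tests that failed in a single failing build."""
    if executed <= 0:
        raise ValueError("a build must execute at least one test")
    return failed / executed


def fault_finding_rate(avg_failure_rate: float, non_fault_share: float) -> float:
    """Discount a failure rate by the share of failures that did not find faults."""
    return avg_failure_rate * (1.0 - non_fault_share)


# Hypothetical failing build: 3 failures out of 800 executed tests.
example_rate = failure_rate(3, 800)

# Figures reported in the text: 0.38% average failure rate,
# 25.9% of failures assumed not to have found a fault.
adjusted = fault_finding_rate(0.0038, 0.259)
print(f"{adjusted:.2%}")  # prints 0.28%
```

Under these assumptions, 0.38% × (1 − 0.259) ≈ 0.28%, which matches the figure reported above.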
Interestingly, we found that 64% of failed builds contain more than one failed test. These failures usually share a root cause, so not all of the failed tests were necessary to find the fault. Test selection strategies typically use coverage overlap to determine whether a test is redundant, but this alone can be inaccurate [60]. The false-positive rate could be reduced by considering a test case redundant only if all of its past failures co-occurred with failures of other tests. Conversely, a test that has failed alone is evidently not redundant, even if it has complete coverage overlap with another test.
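One way to express this failure-history check is sketched below. It is an illustrative assumption rather than an implementation evaluated in this study: the function names, the boolean coverage-overlap input, and the test names in the example history are all hypothetical. Each failing build is represented as the set of tests that failed in it.

```python
from typing import Iterable, Set


def failed_alone(test: str, failing_builds: Iterable[Set[str]]) -> bool:
    """True if `test` was the only failing test in at least one build."""
    return any(failed == {test} for failed in failing_builds)


def is_redundant(test: str, coverage_overlaps: bool,
                 failing_builds: Iterable[Set[str]]) -> bool:
    """Flag a test as redundant only if coverage overlap suggests it
    AND the test has never failed on its own.

    Note: a test with no recorded failures is never a solo failure,
    so the decision for it rests on coverage overlap alone.
    """
    return coverage_overlaps and not failed_alone(test, failing_builds)


# Hypothetical failure history: one set of failed test names per failing build.
history = [
    {"test_send_plain", "test_send_attachment"},
    {"test_send_attachment"},
    {"test_send_plain", "test_send_html"},
]

print(is_redundant("test_send_plain", True, history))       # True
print(is_redundant("test_send_attachment", True, history))  # False
```

In this example, test_send_attachment is kept despite its coverage overlap because it once failed alone, whereas test_send_plain has only ever failed alongside other tests and remains a candidate for removal.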
It is also worth considering that multiple failing tests could provide additional insight into the underlying cause of a failure. For example, suppose a number of tests related to sending an email fail. These tests might differ in whether they include an attachment, use special formatting, and so on. When they all fail together, the developer may realize that the mail server is down and thus the root cause of the problem is different from that suggested by any one test.
Answer 3.4. In the failed builds under study, only 0.28% of test case executions failed and found a fault.
3.6 Threats to Validity
The dataset presented in Section 3.2 has a number of limitations, most notably to external validity. While it presents a diverse set of 61 projects, they were mostly Java-based projects that used the Travis continuous integration platform. In the process of trying to find representative projects, we also filtered out projects that we believed to be unusual, but whose test-inducing behaviour may have been interesting. That said, the set of projects was diverse and contained test suites that were being actively maintained and executed. The projects may have used different development processes, which would increase project diversity but make comparisons between projects less meaningful.
One may also wonder whether developers run some tests locally before committing to the repository, skewing our results. This is a common threat in studies that mine software repositories: one may be exploring development as it was recorded, rather than as it happened. We do not believe developers ran tests locally, for three reasons. First, we eliminated the projects with the fewest build failures; i.e., those that mirror stable repositories. Second, of the remaining projects, the large number of error and failure builds suggests that developers frequently break the build; that is, they do not worry about ensuring the code can be compiled or tested before committing. Third, it is unlikely that developers would set up CI infrastructure and then run tests manually. We feel that, just as developers who set up Ant or Maven are unlikely to run javac on the command line, developers who set up Travis are unlikely to replicate its functionality locally.
Our goal in Section 3.4 was to measure how often a test suite failure indicated that test suite maintenance was necessary. Though it seems reasonable to assume that, if only the test code was changed to fix a test failure, the developers were maintaining the test, it is possible that we misclassified some of the test failures. We note that we did not see any such misclassifications during our manual inspection of tests. However, examining each change manually would have been infeasible, and we feel our approach is an adequate approximation.
Additionally, in the code fixes category, some of the Code+Test changes could have involved test maintenance that was done before the tests failed because the developer knew that their code changes would cause test failures. Thus, in terms of internal validity, these analyses may be overly conservative as we have excluded some test changes that could have been classified as maintenance.
Our results for the flaky analysis were also limited: while we found that 12.8% of the tuples we examined were flaky, we examined only builds that failed when the developer originally executed them. It is possible for flaky tests to pass by chance, and thus some of the passing builds we did not re-execute could have been flaky as well.
Finally, though we compared the number of times tests were maintained and the number of times they detected faults, we did not attempt to do any cost estimation. A test that frequently requires maintenance but detects one critical fault may be more valuable than one that rarely requires maintenance but only detects minor faults. However, on average, more maintenance means a more expensive test suite while more faults detected means the suite is providing more benefit.