
Relevance to Thesis

Recall that our thesis statement is as follows:

Thesis Statement. Robust development processes are necessary to minimize the number of faults introduced when evolving complex software systems. These processes should be based on empirical research findings. Data science techniques allow software engineering researchers to develop research insights that may be difficult or impossible to obtain with other research methodologies. These research insights support the creation of development processes. Thus, data science techniques support the creation of empirically-based development processes.

App development processes depend on the work done by software engineering researchers who study mobile apps. Until now, there have not been any empirical guidelines suggesting the number of apps researchers should use in these studies. Some studies use as few as ten apps (e.g., [73]). This has made it difficult to judge the generalizability of any particular study and thus whether the results are useful to app developers. Our guidelines can help improve app research practices, which in turn will improve app development processes.

Mobile malware detection tools likewise rely on research findings to determine, for instance, which classifier type will perform the best and which input features should be used. If prior work had been based on an incorrect assumption, namely, that API level is irrelevant, its results would be invalid and thus of little use to developers creating malware detection tools. As it happens, API level does not need to be controlled to obtain correct results. This information gives us more confidence in the findings of previous studies and thus in the application of their results to the development of practical tools for detecting malicious apps.

The use of data science was critical to obtaining our results. Studying 1.3 million apps and 11 TB of callgraphs would not have been possible without computing clusters, big data storage and handling techniques, statistics, machine learning, and various other tools that belong to the data science toolkit. Thus, this work provides an example of how data science techniques can be applied to develop new research insights that in turn assist software developers.

Chapter 3

Measuring Test Suite Costs and Benefits

“But what’s the harm in over-testing, Phil, don’t you want your code to be safe? If we catch just one bug from entering production, isn’t it worth it?” [...] This line of argument is how we got the TSA. [36]

Buggy software costs its developers money. Among other things, bugs can discourage new customers from adopting a product and can drive away existing customers. Consequently, many techniques exist for avoiding the introduction of bugs and for quickly identifying and fixing bugs when they are introduced anyway. However, these techniques come with costs of their own, so developers must carefully assess their cost-effectiveness before deciding whether and to what extent to adopt them.

Automated regression testing [33] is one technique for fault detection that has seen wide adoption [23, 41]. However, it is a costly technique: developers must write, maintain, and regularly execute the test suite to see the benefits of regression testing. Kasurinen et al. [56] conducted a survey of industrial developers that, among other findings, identified development expenses and maintenance costs as the main obstacles to adopting automated testing. One of their participants stated:

Developing that kind of test automation system is almost as huge an effort as building the actual project.

Moreover, one company that experimented with automated testing eventually removed the test suite due to the cost of maintaining it.

As this company’s experience indicates, it is important to understand both the costs and benefits of regression testing when deciding whether and how to adopt it. Costs, in this case, include the cost of initially writing the tests, maintaining the test suite, executing the suite, and examining the test output. Benefits include the number of faults found as a result of investing in the test suite. Previous studies have considered the cost-effectiveness of automated regression testing, as we will discuss further in Section 3.1. However, these studies share a common limitation: they were conducted by mining historical data from the test suite repositories. While tracking the evolution of the test suites can show how the suites were developed and what their likely maintenance costs might have been, it does not permit measuring possible benefits in terms of detected faults.

To address this limitation, we studied 61 projects that use Travis CI¹, a cloud-based continuous integration tool. Travis builds a project and executes its tests every time a developer pushes a change or opens a GitHub pull request, meaning that developers push their commits to the repository before testing them, frequently introducing faults that cause one or more tests to fail. As Travis integrates with project version control systems, when test failures occur, it is possible to precisely determine whether changes to the tests, the system under test, or both were required to make the test suite pass again. Once flaky tests² are accounted for, if a test failure is resolved by changing the source code, we know that the test provided a benefit by detecting a fault. On the other hand, if the failure is resolved by changing the test itself, we can conclude that the test was buggy or obsolete; in other words, we can conclude that the change represents a maintenance cost. Examining the Travis results therefore allows us to measure both the costs and benefits of regression testing, in contrast to previous work. Specifically, we considered the following research questions:

Research Question 3.1. What proportion of test-suite executions are flaky?

Research Question 3.2. Once these flaky test-suite executions are accounted for, what proportion of test-suite failures represent a maintenance cost and what proportion represent a benefit?

Research Question 3.3. Why do tests usually require maintenance, and can maintenance costs be reduced?

¹ https://travis-ci.org/
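The classification rule described before the research questions can be illustrated with a minimal sketch. This is not the tooling used in the study. The names `FailureResolution`, `Outcome`, and `classify` are hypothetical, and the handling of failures fixed by changing both tests and source (`MIXED`) or neither (`UNKNOWN`) is an assumption made only so that the example is complete.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    FLAKY = "flaky"              # non-deterministic failure, set aside first
    BENEFIT = "benefit"          # fixed in the system under test: a fault was detected
    MAINTENANCE = "maintenance"  # fixed in the test: the test was buggy or obsolete
    MIXED = "mixed"              # assumption: both changed, not attributed to either side
    UNKNOWN = "unknown"          # assumption: no relevant change could be identified


@dataclass(frozen=True)
class FailureResolution:
    """Hypothetical record of how one failing test-suite execution was resolved."""
    is_flaky: bool
    changed_source: bool
    changed_tests: bool


def classify(resolution: FailureResolution) -> Outcome:
    """Apply the cost/benefit rule to a single resolved test-suite failure."""
    if resolution.is_flaky:
        return Outcome.FLAKY
    if resolution.changed_source and resolution.changed_tests:
        return Outcome.MIXED
    if resolution.changed_source:
        return Outcome.BENEFIT
    if resolution.changed_tests:
        return Outcome.MAINTENANCE
    return Outcome.UNKNOWN


if __name__ == "__main__":
    example = FailureResolution(is_flaky=False, changed_source=True, changed_tests=False)
    print(classify(example).value)
```

Flaky failures are filtered out before anything else, mirroring the order in which Research Questions 3.1 and 3.2 are posed.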

By studying a dataset of 106,738 Travis builds, which we describe in Section 3.2, we found that the benefits of regression testing are lower than one might expect relative to the costs. Briefly, we found that 18% of test suite executions fail and that 12.8% of these test suite failures are due to flaky (non-deterministic) tests. Of the non-flaky test suite failures, only 74.1% are caused by a defect in the system under test; the remaining 25.9% are due to tests that are either incorrect or obsolete. The causes of test maintenance vary widely, and some of the causes are avoidable. Section 3.4 provides full answers to the three research questions given above.
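To see how these percentages relate to one another, the following sketch composes them into shares of all test suite executions. The function name `failure_breakdown` is hypothetical, and the derived shares are simple arithmetic on the rounded figures quoted above, so they are approximate illustrations rather than results reported by the study.

```python
def failure_breakdown(fail_rate: float, flaky_share: float, defect_share: float) -> dict:
    """Split the overall failure rate into flaky, defect, and maintenance shares.

    fail_rate    -- fraction of all test suite executions that fail
    flaky_share  -- fraction of failures attributed to flaky tests
    defect_share -- fraction of non-flaky failures caused by a defect
                    in the system under test
    """
    flaky = fail_rate * flaky_share
    non_flaky = fail_rate - flaky
    defect = non_flaky * defect_share
    maintenance = non_flaky - defect
    return {"flaky": flaky, "defect": defect, "maintenance": maintenance}


if __name__ == "__main__":
    # Figures quoted in the text: 18% fail, 12.8% of failures are flaky,
    # 74.1% of non-flaky failures are caused by a defect.
    shares = failure_breakdown(fail_rate=0.18, flaky_share=0.128, defect_share=0.741)
    for name, value in shares.items():
        print(f"{name}: {value:.1%} of all executions")
```

Under these assumptions, roughly 11.6% of all test suite executions fail because of a defect in the system under test, about 4.1% fail because a test needs maintenance, and about 2.3% fail because of flakiness.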

Although we feel these findings provide valuable information to developers in themselves, they can also be used to inform test case selection techniques. The goal of test case selection is to reduce the cost of running the regression suite by running only a subset of tests each time. Ideally, one would only select and execute the tests that are going to fail and detect a fault; of course, in practice, one does not know a priori which tests will do so. Our dataset of Travis builds allowed us to address an additional research question related to test selection:

Research Question 3.4. What proportion of test executions expose defects?

In brief, we found that, in failing builds, an individual test execution has only a 0.28% chance of failing and exposing a real defect for any given code change. Section 3.5 provides a full answer to this research question.

Following the presentation of results, Section 3.6 describes threats to validity and Section 3.7 describes our data replication package. Section 3.8 explains how the work in this chapter supports our thesis statement.