
3.3 An Experimental Software Engineering process
3.4 The experimental process case study
3.5 Related work
3.6 Conclusions

Background: Experimental Software Engineering (ESE) is concerned with designing and performing experiments to support the validation of Software Engineering claims. The main challenges of ESE include facilitating the replicability of such experiments and the meta-analysis of the obtained results.

Objective: Our goal is to motivate and present a process model for ESE, which we will use throughout the remainder of this dissertation, aimed at solving the above-mentioned challenges.

Method: We use UML diagrams for describing the process model. The dynamic part of the model is specified through activity diagrams, while a taxonomy of relevant concepts is modeled with class diagrams.

Results: The process model conforms to current attempts at experimental reporting guidelines. These results are confirmed not only through our own experience in following the model, but also with a case study conducted with graduate students.

Limitations: The case study in this chapter assesses the outcome of the work of graduate students who followed the process. Further validation with seasoned experimenters is desirable, to confirm the merits of this process model.

Conclusion: The process model presented in this chapter, and used in the remainder of this dissertation, can be successfully used by both seasoned and novice experimenters.

3.1 The scientific method

In many research areas, the scientific method is used as a cornerstone set of techniques to collect observable, measurable information that can be used as evidence in the process of understanding a given phenomenon. Based on their perception of the phenomenon under scrutiny, researchers propose hypotheses that attempt to explain the phenomenon and create experiments to test those hypotheses. The results of those experiments are used to test the hypotheses and, often, are fed back into the process, thus leading to the formulation of more refined hypotheses. A possible description of the method¹, from the formulation of the research question to the publication of research results, is outlined in Figure 3.1.

Figure 3.1: The scientific method

The scientific method is designed to reduce as much as possible any potential bias that might be otherwise introduced by the researcher. The whole experimental process is expected to be extensively and unambiguously documented, to facilitate its scrutiny and replication by peers. The level of confidence of the scientific community in the results obtained in this process depends on the level of independent validation the results go through. Knowledge acquired through the scientific method is intrinsically subject to further independent validation, based on experiment replication. Hypotheses and theories are always subject to refinements and even disproof, if new validation efforts point to alternative explanations of the observed phenomena.

¹ There are several alternative descriptions of the scientific method available in the literature (e.g. [Koning 94, Wolfs 96, Wudka 98]). They all involve observing a phenomenon, creating hypotheses for explaining it, conducting experiments for assessing those hypotheses, and using the results of those experiments to either support or refute the hypotheses. Experimental results are also fed back to the hypotheses formulation step, so that more refined hypotheses can be formulated.

This generic description of the scientific method can be easily mapped to the state of practice in most mature sciences, such as physics, biology, or chemistry. Consider the example of clinical trials for the introduction of new drugs in the pharmaceutical market [Vogelson 01]: new drugs have to undergo 4 phases of trials, starting with 2 single-site phases and ending with 2 phases in which between 30 and 40 sites are involved in the trials. The new drugs are often tested against placebos, or other existing drugs, during these trials (particularly in the 3rd phase), and can only be introduced in the market after the first 3 phases. The 4th phase tests the efficacy of the drug for different medical conditions. To remove potential biases, these trials are double- or sometimes triple-blinded. The term blinding refers to how much information the patients, researchers, and study monitors have about which particular treatment course a patient is going through. In a single-blind trial, the subjects are not aware of information that could bias the results of the trial. In practice, this means subjects do not know whether or not they are part of the control group. In a double-blind trial, the experimenters are also unaware of information that could bias the results, so neither the subjects nor the experimenters know who belongs to the control group and who does not. A triple-blind trial is similar to a double-blind trial, but the statistician interpreting the results is also unaware of the treatments administered in the trial. Naturally, the process includes safeguards, so that the blinding can be broken by the researchers in case of an emergency.
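The three blinding levels described above differ only in which roles are kept unaware of the treatment assignment. The following Python fragment is a minimal sketch of that idea, not a model of any real trial-management system: the names `Role`, `BLINDED_ROLES` and `may_see_assignment`, and the `emergency_unblinding` flag, are hypothetical and introduced here only for illustration.

```python
from enum import Enum


class Role(Enum):
    """Roles mentioned in the text (identifiers are illustrative)."""
    SUBJECT = "subject"
    EXPERIMENTER = "experimenter"
    STATISTICIAN = "statistician"


# For each blinding level, the roles that are kept unaware of
# which treatment a given patient is receiving.
BLINDED_ROLES = {
    "single": frozenset({Role.SUBJECT}),
    "double": frozenset({Role.SUBJECT, Role.EXPERIMENTER}),
    "triple": frozenset({Role.SUBJECT, Role.EXPERIMENTER, Role.STATISTICIAN}),
}


def may_see_assignment(role, level, emergency_unblinding=False):
    """Return True if `role` may know the treatment assignment.

    The emergency flag mirrors the safeguard mentioned in the text:
    researchers can break the blinding in case of an emergency.
    (Modeling it as a boolean flag is an assumption of this sketch.)
    """
    if emergency_unblinding and role is Role.EXPERIMENTER:
        return True
    return role not in BLINDED_ROLES[level]
```

Under this sketch, each level blinds a superset of the roles blinded by the previous one, which is the only structural property the description above relies on.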

We can contrast this state of practice with that of computer science, and Software Engineering. A recent survey by Sjøberg et al. on controlled experiments in Software Engineering reported that, out of 5453 scientific articles published in 12 leading Software Engineering journals and conferences from 1993 to 2002, only 103 (1.9%) of them reported controlled experiments where individuals performed Software Engineering tasks [Sjøberg 05]. The authors of the survey define a controlled experiment in software engineering as “a randomized experiment or quasi-experiment in which individuals or teams (the experimental units) conduct one or more Software Engineering tasks for the sake of comparing different populations, processes, methods, techniques, languages or tools (the treatments).”²

Sjøberg et al. counted 14 series of experiment replications. Only 6 of these series included replications performed independently (not by the original authors). 5 out of the 14 series included replications that, at least partially, rejected the findings of the original experiment. Only one of the replications rejecting the findings in the original experiment was conducted by the original team.
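The proportions behind these survey figures can be checked with a few lines of arithmetic. The snippet below is only an illustrative sketch: the counts are the ones quoted above from [Sjøberg 05], while the helper `percentage` and the variable names are introduced here for convenience. The shares of replication series are derived values, not figures reported in the survey.

```python
# Counts quoted in the text from the survey by Sjøberg et al.
ARTICLES_SURVEYED = 5453
CONTROLLED_EXPERIMENT_ARTICLES = 103
REPLICATION_SERIES = 14
SERIES_WITH_INDEPENDENT_REPLICATIONS = 6
SERIES_WITH_REJECTING_REPLICATIONS = 5


def percentage(part, whole):
    """Return `part` as a percentage of `whole`, rounded to one decimal."""
    return round(100 * part / whole, 1)


# Share of surveyed articles reporting controlled experiments.
experiment_share = percentage(CONTROLLED_EXPERIMENT_ARTICLES, ARTICLES_SURVEYED)

# Share of replication series with at least one independent replication.
independent_share = percentage(SERIES_WITH_INDEPENDENT_REPLICATIONS, REPLICATION_SERIES)

# Share of replication series in which a replication (at least partially)
# rejected the original findings.
rejecting_share = percentage(SERIES_WITH_REJECTING_REPLICATIONS, REPLICATION_SERIES)

print(experiment_share)   # 1.9
print(independent_share)  # 42.9
print(rejecting_share)    # 35.7
```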

The low percentage of experiment-based validation in papers (less than 10 %)³, when compared to other research methods, is noticeable, both in the context of Software Engineering [Glass 02] and Computer Science [Ramesh 04]. These observations are more compelling when this state of practice is compared with that of other sciences. According to [Tichy 95], the percentage of published papers that make claims requiring experimental validation support, but provide none, is significantly higher in computer science (around 40 % for computer science papers, 50 % for papers on Software Engineering) than in other scientific areas, such as Optical Engineering, Physics, Psychology or Anthropology (in these areas, only around 15 % of the papers present no experimental validation). These results are consistent with the findings of [Zelkowitz 97].

² Note that other types of empirical studies, such as studies that are based on observations on existing data, are excluded by this definition. Nevertheless, the insufficient empirical validation of claims is consistently observed in other surveys (e.g. [Zelkowitz 97, Glass 02]).

³ Unlike in Sjøberg et al.’s survey, this 10 % value includes not only controlled experiments, but also

This does not necessarily imply that the Computer Science and Software Engineering communities are producing bad solutions. In some cases, a theoretical validation may be more adequate than an empirical one. But it does limit our ability to assess new solutions, when compared to previous ones. These findings point to an opportunity for significantly improving the state of practice when some sort of empirical evidence is desirable.

Tichy has argued against what he considers the most typical justifications for not performing experimentation, including, among several others, the cost, uselessness, and difficulty of conducting experiments [Tichy 98]. In many situations, neglecting experimental evidence on claims leaves other researchers and practitioners with only experts’ qualitative opinions on those claims. Valuable as such opinions may be, they are based on personal experience and intuition, and thus potentially biased by the expert’s background. Researchers and practitioners are either convinced by the arguments presented by experts, or they are not. But this is a subjective decision, rather than a more rational one based on verifiable evidence, and it is more vulnerable to hype or fads. When it comes to selecting appropriate tools, languages, processes and techniques, it is desirable to have reliable quantitative facts to support decisions, rather than qualitative opinions alone.