
3.3 An Experimental Software Engineering process

3.3.5 Results packaging

Once the experimental work per se is finished, it is essential to package the results so that they can be used either within the context of the organization that sponsors the experiment, or by the community. This involves documenting the whole experimental process, as discussed in the previous sections, and including a discussion on the results achieved with the experiment. This discussion should focus on aspects such as the interpretation of the results, the limitations of the study, the inferencing that can be made with respect to the extent to which the study’s results are expected to hold in the population, and the identification of the learned lessons (figure 3.18).

Figure 3.18: Experiment results packaging activity

Results interpretation

This activity concerns the analysis of the outcome of the tests, anchored on the theory that is being assessed through them. When the tests do not confirm the theoretical assumptions, identifying the causes that lead to that failure can be used as a stepping stone to allow the refinement of the theory, or even the construction of a new theoretical framework to explain the object or process under test.

Threats identification

Throughout the experimental process description, we briefly outlined the impact that decisions concerning activities such as sampling, or experimental design selection have on the validity of the results obtained in the experimental process. While packaging the experimental results, one should look back to the whole process and clearly identify the potential threats to the validity of those results. Furthermore, one should discuss the measures that were in place to address each of those threats.

The objective of threats identification is to document potential weak spots of the experimental work being described. Rather than a depreciation of the value of the experimental work, this analysis can be viewed as an active and systematic approach to identifying opportunities for further complementary studies that, as a family of related studies, can contribute to the Software Engineering body of knowledge.

In [Wohlin 99], Wohlin et al. identify four kinds of threats to validity and discuss how these threats can be dealt with:

• Internal validity is concerned with the validity of the study itself, with respect to the causal effect being studied.

• External validity refers to the experimenter’s ability to generalize the results from the experiment to industrial practice.

• Construct validity concerns the generalization of the results of the experiment to the theory behind it.

• Conclusion validity is related to our ability to draw the correct conclusion about the relations between our treatment and the experiment’s outcome.

Regarding the internal validity of the study, we consider three sorts of validity threats: single group threats, multiple groups threats, and social threats. Single group threats can arise when the experiment has no control group:

• History. This threat concerns uncontrolled events that are irrelevant for the theory being tested, but may nevertheless introduce a confounding effect on the outcome of tests performed after they occur. For instance, consider the bias that can be introduced in an experiment concerning the productivity of code developers, following an event that might break their concentration on the coding task, such as a very exciting sports event, or breaking news on a catastrophe.

• Maturation. This threat occurs when subjects react differently as time progresses. Depending on the type of maturation, the bias may be positive or negative. A positive bias may result from a learning process, for instance. A negative one may result from the saturation of subjects.

• Testing. If a test is repeated several times, a non-intentional side-effect, such as learning, can be introduced in the experiment.

• Instrumentation. If the measurement instruments are not working as precisely as they should, errors in the measurements may occur, either systematically or randomly. In both cases, they may jeopardize the quality of the data collected in the experiment and, as a consequence, the interpretation of the results. Another possible instrumentation threat comes from changing the measurement instruments during the experiment (e.g., changing the software metrics collection tool).

• Statistical regression. This threat can occur when the subjects of a study are selected because they obtained extremely high, or extremely low, results in a previous test. When tested again, both kinds of subjects are likely to obtain a result closer to the population mean than in the previous test.

• Selection. When sampling from a population, there is a risk that the subjects are not representative of the whole population (see discussion on subjects’ selection, in section 3.3.2).

• Mortality. When the subjects dropping out of an experiment are representative of the population, or one of its subgroups, this has an effect on the overall conclusions that can be drawn from the experiment.

• Ambiguity about direction of causal influence. The fact that two variables are highly correlated does not imply that one of them has a direct influence on the other. They can both be influenced by a third variable. When planning an experiment, effort must be put into correctly identifying causes and effects.
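The statistical regression threat described above can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not part of the original study: it assumes each subject has a stable true ability and that each test score adds independent random noise to it. The function name, the distribution parameters and the 10% selection cutoff are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_regression_to_mean(n_subjects=10_000, top_fraction=0.1, seed=42):
    """Select top scorers on a first test and compare their second-test mean.

    Assumption (illustrative only): score = true ability + independent noise,
    with ability ~ N(50, 10) and noise ~ N(0, 10) on each test.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(50, 10) for _ in range(n_subjects)]
    test1 = [a + rng.gauss(0, 10) for a in ability]
    test2 = [a + rng.gauss(0, 10) for a in ability]

    # Subjects are selected *because* of an extreme result on the first test.
    cutoff = sorted(test1)[int(n_subjects * (1 - top_fraction))]
    selected = [i for i in range(n_subjects) if test1[i] >= cutoff]

    mean_first = statistics.mean([test1[i] for i in selected])
    mean_second = statistics.mean([test2[i] for i in selected])
    population_mean = statistics.mean(test1)
    return mean_first, mean_second, population_mean


if __name__ == "__main__":
    first, second, population = simulate_regression_to_mean()
    print(f"selected group, first test:  {first:.1f}")
    print(f"selected group, second test: {second:.1f}")
    print(f"whole sample, first test:    {population:.1f}")
```

Under these assumptions, the selected group still scores above the whole-sample mean on the second test, but noticeably lower than on the first one, even though no treatment was applied. An experiment that selects subjects this way and lacks a control group could mistake that shift for a treatment effect.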

Multiple groups threats are the result of the multiple groups being exposed differently to single group threats. This may reduce the comparability of results obtained in different groups.

Social threats to internal validity can stem from the usage of differentiated treatments within our sample, if that differentiation causes a change in the behavior of the subjects:

• Diffusion or imitation of treatments. If the members of the control group imitate the behavior of the group being tested, the differences between the groups become smaller than expected, which biases the results.

• Compensatory equalization of treatments. When different groups receive different treatments, the control group may receive some form of compensation for not using the treatment being tested. If that compensation has an effect on the performance of the control group in the experiment, this may jeopardize the conclusions of the overall test.

• Compensatory rivalry. This can occur if subjects from a group not receiving a new treatment feel they are being penalized and work harder than they normally would to counter that effect.

• Resentful demoralization. This can occur if subjects from a group not receiving a new treatment feel they are being penalized by this situation and become less involved in the experimental work than their counterparts. This is the opposite of compensatory rivalry.

External validity refers to one’s ability to generalize results beyond the scope of the experiment. We consider three potential sources of threat:

• Selection. This problem occurs if the sampling used does not provide a representative sample of the population. It may hamper the experimenter’s ability to generalize the results of the experiment beyond the sample used.

• Setting. An experiment can be jeopardized by using an unrealistic experimental environment. For instance, if an outdated development environment is used in an experiment concerning a particular aspect of software development, the same experiment carried out in a modern development environment might yield completely different results, assuming the tasks being tested have more sophisticated support in the modern development environment.

• History. Refer to the discussion on history threats as single group threats, earlier in this section. With respect to external validity, a special event biasing the results may damage our ability to extrapolate from them to the most general situation, where that event is irrelevant.

Construct validity threats can assume one of two forms: social and design threats. Social threats result from problems related to the behavior of the subjects and experimenters, if they somehow act differently than they would otherwise, due to the experiment. This behavioral change is a product of the subjects’ and experimenters’ awareness of the experiment, although it may be unintentional:

• Hypothesis guessing. As subjects are aware of being observed in the context of an experiment, they may behave differently to convey a specific impression to the observers. This sometimes leads to hypothesis guessing, where subjects try to figure out which hypothesis is under study, so that they can perform in the test according to their preferences concerning the hypothesis they think is being assessed.

• Evaluation apprehension. If subjects are not comfortable with being assessed in the context of an experiment, a frequent human trait, they may try to make a good impression on the experimenters, thus changing their normal behavior.

• Experimenter’s expectancies. Conversely, the experimenter usually has an interest in the outcome of an experiment. This may bias the conduct of the experiment toward confirming the theory underlying the experiment, or refuting it.

Construct validity design threats result from difficulties in the rigorous definition of the causes and effects being tested. Such difficulties may lead to a poor choice of measurements and treatments, as the theoretical concepts under test are poorly understood. For instance, a subjective concept such as design quality is open to several conflicting definitions. Construct validity design threats include:

• Inadequate preoperational explication of constructs. This threat occurs when the constructs involved in an experiment are poorly defined. If the theory underpinning the experiment is not clear, analyzing experimental results becomes more difficult.

• Mono-operation bias. Considering a single independent variable, cause, or treatment in a study may introduce the mono-operation bias, because a single independent variable (or cause, or treatment) is always flawed with respect to the construct upon which it is based. The countermeasure is, of course, to use multiple independent variables (or causes, or treatments, respectively).

• Mono-method bias. Using a single kind of measure, or observation, is a threat, in the sense that it may introduce a bias. Using several alternative measures or observations, one can minimize the effect of such bias, by focusing on the commonalities observed with the alternative measurements and observations of the same concept.

• Confounding constructs and level of constructs. Sometimes, rather than assessing a construct with respect to its presence, one should focus on its level. Consider the example where the participants in a code inspection experiment are classified as having experience with a given programming language, or not. Among the participants who do have experience with the language, different levels of expertise may have a stronger relation with the observed effect than a simple binary assessment of such previous experience.

• Interaction of different treatments. When subjects are administered different treatments, it is possible that those treatments interact. That interaction may bias the results of each treatment. It is useful to understand the combined effect of several treatments, to avoid using combinations of treatments that cancel out their benefits. Failing to consider their interaction may lead to erroneous conclusions with respect to each treatment’s effect. In short, when several treatments are involved, it may not be possible to distinguish which effects are attributable to each of the treatments and which are a result of the combination of all treatments.

• Interaction of testing and treatment. A fundamental part of testing is the application of treatments. Subjects undergoing testing activities may act differently from how they normally would, thus biasing the outcome of the tests. In the context of Software Engineering, this threat is stronger in experiments involving human participants, as we have seen while discussing the social threats to construct validity, earlier in this section.

• Restricted generalizability across constructs. Although a treatment may have the desired effect on a construct we are concerned with, it may also have undesired side effects on other constructs that should also be relevant in the analysis of the outcome of the treatment. If side effects on those other constructs are not monitored, there is a risk of drawing conclusions that are not generalizable to those constructs. Consider, for example, the introduction of a new component technology that helps improve development productivity (the monitored construct) but also leads to a lower maintainability of the code (the other construct that should have been monitored, but was not).

Conclusion validity threats are threats that are inherent to the usage of statistical tests:

• Statistical power. When sample sizes are too small, the value of α is low, or an inadequate statistical test is chosen, a type II error can occur, due to the lack of statistical power. Conversely, a type I error can occur if α is set too high.

• Violated assumptions of statistical tests. Each statistical test prescribes a set of pre-conditions that are to be verified before using the test. Failure to comply with those assumptions can endanger the validity of conclusions drawn from such tests.

• Fishing and the error rate. When too many tests are performed, there is a chance that some of them will reveal spurious relations between variables, purely by chance.

• Reliability of measures. If measures have a low reliability (e.g., they are not stable), they can contribute to an inflation of the error terms, introducing noise in the statistical test.

• Reliability of treatment implementation. A lack of standardization in the treatment implementation may lead to a confounding factor, if the treatment is inconsistently administered in different groups. These variations are more likely to happen when different people administer the treatment, although they can also occur with the same person.

• Random irrelevances in experimental setting. The experimental setting may contain features that interfere with the outcome of the tests being performed, by providing sources of variation which are not relevant for the tests. These sources of variation will increase the error variance.

• Random heterogeneity of subjects. Heterogeneity of subjects can increase the error variance, as subjects may react differently to treatments.
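The "fishing and the error rate" threat listed above can be quantified with a short calculation. The sketch below is illustrative only and rests on a simplifying assumption that the source does not state: the tests are independent and each uses the same significance level α. Under that assumption, the probability of at least one spurious significant result in m tests is 1 − (1 − α)^m. The function names are hypothetical, and the Bonferroni adjustment (α/m per test) is shown as one commonly used countermeasure, not as the approach prescribed by the text.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one type I error across n_tests tests.

    Assumes the tests are independent and each uses significance level alpha.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the overall rate at most alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    alpha = 0.05
    for n_tests in (1, 5, 20):
        unadjusted = familywise_error_rate(alpha, n_tests)
        adjusted = familywise_error_rate(bonferroni_alpha(alpha, n_tests), n_tests)
        print(f"{n_tests:2d} tests: unadjusted {unadjusted:.3f}, "
              f"with Bonferroni {adjusted:.3f}")
```

With α = 0.05 and 20 independent tests, the chance of at least one spurious "significant" relation is roughly 0.64, which is why the number of tests performed should be reported alongside the results.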

Inferencing

Following the testing phase of the experimental work, conclusions should be drawn from the results of the statistical tests, through inference. Considering all the threats previously identified, the researchers have to estimate how the results obtained in the experiment are expected to hold beyond the experiment’s sample (i.e., in the population).

Identification of learned lessons

During the whole experimental process, the practical details of conducting that process, including potential gaps or impractical design decisions in the experimental protocol, should be registered, along with the approach followed to circumvent these problems. This information is particularly valuable to other researchers and practitioners who wish to replicate the experiment, as it mitigates to some extent the tacit knowledge problem.