Experiment planning - An Experimental Software Engineering process

3.3 An Experimental Software Engineering process

3.3.2 Experiment planning

While the experiment definition was about why a particular experiment is performed, experiment planning is about how it will be performed. Before starting the experiment, decisions have to be made concerning the context of the experiment (revisited here in more detail), the hypotheses under study, the set of independent and dependent variables that will be used to evaluate the hypotheses, the selection of subjects participating in the experiment, the experiment design and instrumentation, and an evaluation of the experiment’s validity. Only after all these details are sorted out should the experiment be performed. The outcome of planning is the experimental design, which should include enough detail to be replicable by independent teams. Figure 3.6 describes the activities related to the definition of the experiment design.

Figure 3.6: Experiment design planning

Context parameters’ definition

Throughout the experiment, there are a number of context parameters that remain stable (see figure 3.7). Their value is the same for all the subjects in the experiment during the whole process. Therefore, we can safely assume that differences observed in the results cannot be attributed to these parameters. While the actual parameters to be reported may vary, Wohlin et al. have identified a core set of context parameters [Wohlin 99].

Figure 3.7: Detailed experiment context parameters

Concerning their integration within the development process, experiments can be conducted either on line or off line. The former, carried out as part of the software process in a professional environment, involves an element of risk, since experiments may become intrusive in the underlying development activity. This intrusiveness may even manifest itself through resource and time overheads on a real project. A common alternative is to carry out the experiment off line.

An orthogonal classification of context concerns the people involved in the experiment. One may choose between performing the experiment with professional practitioners or with surrogates for those practitioners (typically, students). The first option leads to results that are more easily comparable to others obtained in a professional context, but care must be taken to reduce potential overheads to practitioners’ activities (see [Benestad 05] for a detailed discussion on strategies to mitigate some of these risks and thus recruit professional practitioners for participating in experiments).

Using students as surrogates for professional practitioners is less expensive, but makes the experimental results harder to extrapolate to a professional community. To reduce the gap to practitioners, the researcher should prefer using graduate students, whose expertise is closer to that of novice practitioners. A discussion by Höst et al. on using students vs. practitioners as subjects, and on the circumstances under which students may be used instead of professionals, may be found in [Höst 00]. Höst et al. carried out an experiment where they assessed the differences between the performance of students and practitioners while performing a non-trivial Software Engineering task. Their overall conclusion was that the differences between students and professionals were only minor and that students could be used as surrogates for practitioners.

The comparability of results obtained by students and professionals is far from being a thoroughly studied issue. Sjøberg et al.’s review [Sjøberg 05] found only 3 papers that compared the performance of students with that of practitioners. In some of the tasks the results were similar, while in others practitioners performed better. Regardless of the problems that still need to be addressed concerning the comparability between these two groups, performing experiments with students is a valid option for low-cost testing of hypotheses and for educational purposes.

Figure 3.8: Sample characteristics

[...] problems. There are at least two issues that motivate the usage of toy problems: the resources available for the experiment and the risks concerned with the outcome of the experiment. The former results from the often very limited time subjects can devote to the experiment. The latter relates to the potential harm caused by the outcome of the experiment (e.g. while experimenting with different testing techniques on a real problem, a less effective technique being tested could lead to a lower final product quality being delivered to a customer). The question, here, is whether or not the results obtained with a toy problem will scale up to real problems. Toy problems are often used in early experiments, as their usage is less expensive. If the results of experiments conducted with toy examples are satisfactory, the risk of scaling up the problem to a real one may be mitigated to a certain extent, although it will not be completely eradicated.

Experiments can also range from specific to general, in the sense that their results are applicable to a niche or to a wider population. For instance, when experimenting with the maintainability of object-oriented software, one can design experiments that are language-specific, or experiments that yield results applicable to object-oriented software in general.

Other relevant parameters can be added to this core set. Kitchenham et al. argue that context information such as the domain of the software being developed, or organizational constraints such as the development process used by the subjects performing the experiment, should also be made clear [Kitchenham 02].

Hypothesis formulation

The hypothesis should be stated as clearly as possible, and presented in the context of the theoretical background it is derived from. This theoretical context makes the hypothesis’ implications more apparent, and is important to facilitate the inclusion of the experiment’s outcome in the body of knowledge of Software Engineering [Kitchenham 02].

Two hypotheses are formulated: a null hypothesis, denoted by H0ij, and its alternative hypothesis, H1ij. In both cases, i stands for the experiment goal identifier, whereas j corresponds to a hypothesis counter and should be used when more than one hypothesis is being tested for the same goal.

The null hypothesis states that there is no observable pattern in the experiment setting, so any variations found are coincidental. This is the hypothesis the researcher is trying to reject. The alternative is that the variations observed are not coincidental. When the null hypothesis is rejected, we can conclude, at the chosen level of significance, that it is false and accept its alternative. However, if we cannot reject the null hypothesis, we can only say that there is no statistical evidence to reject it; in that case, we cannot accept the alternative.

Hypothesis testing always assumes a given level of significance, denoted by α. α represents a fixed probability of wrongly rejecting the null hypothesis H0ij, if it is in fact true. The probability value (p-value) of a statistical hypothesis test is the probability of getting, by chance alone, a value of the test statistic as extreme as or more extreme than the one observed, if the null hypothesis H0ij is true.
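As a minimal sketch of the p-value definition above, the following Python fragment computes an exact two-sided p-value with a permutation test on the difference between two group means. The choice of a permutation test, the function name permutation_p_value, and the sample values are assumptions made for illustration only; they are not taken from the experiments reported in this dissertation.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in means.

    Under the null hypothesis the group labels are interchangeable, so every
    way of splitting the pooled observations is equally likely.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    extreme = 0
    splits = 0
    for indexes in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indexes)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Count splits at least as extreme as the observed one
        # (small tolerance for floating-point rounding).
        if difference >= observed - 1e-12:
            extreme += 1
        splits += 1
    return extreme / splits


# Hypothetical task completion times (minutes) for two treatments.
p_value = permutation_p_value([1, 2, 3], [4, 5, 6])
alpha = 0.05
reject_null = p_value <= alpha
```

With these made-up values, only 2 of the 20 possible splits are as extreme as the observed one, so the p-value is 0.1 and the null hypothesis cannot be rejected at α = 0.05.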

This leads to two types of error that can be made when testing the hypotheses. One can reject the null hypothesis although it was in fact true (type I error). The probability of making that error is, as stated above, α. One can also fail to reject the null hypothesis although it was in fact false (type II error). The probability of making this error, β, is often unknown. Type II errors are frequently associated with samples that are too small.

The power of the test is the probability of not committing a type II error, that is, 1 - β, and should be as close as possible to 1.
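The relationship between α, β, power, and sample size can be illustrated with a worked computation. The sketch below assumes a two-sided, two-sample z-test on normally distributed data with a known standard deviation shared by both groups; this test, the function name z_test_power, and the numeric scenario are illustrative assumptions, not the analysis used in this dissertation.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n_per_group, alpha=0.05):
    """Power (1 - beta) of a two-sided, two-sample z-test.

    Assumes normally distributed data with a known standard deviation
    `sigma`, shared by both groups, and `n_per_group` subjects in each group.
    """
    standard_normal = NormalDist()
    z_critical = standard_normal.inv_cdf(1 - alpha / 2)
    # Standard error of the difference between the two group means.
    standard_error = sigma * (2 / n_per_group) ** 0.5
    shift = effect / standard_error
    # Probability that the test statistic lands in either rejection region.
    return (standard_normal.cdf(-z_critical + shift)
            + standard_normal.cdf(-z_critical - shift))


# Hypothetical scenario: a true difference of half a standard deviation.
power_small = z_test_power(effect=0.5, sigma=1.0, n_per_group=10)
power_large = z_test_power(effect=0.5, sigma=1.0, n_per_group=64)
beta_small = 1 - power_small  # probability of a type II error
```

When the effect is zero, the function returns α, the type I error rate. For a non-zero effect, the larger sample yields the higher power, which is why type II errors are associated with samples that are too small.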

Figure 3.9 presents the relationships between the main concepts involved in hypotheses definitions, starting from the overall objectives of the research, through the specific goals of the experiment, and the questions that will allow assessing the achievement of the goals. The hypotheses are then assessed using metrics. The basic concepts concerning variables selection are also included in figure 3.9, and discussed in the next section.

Figure 3.9: Hypothesis specification and variables selection

Variables selection

The process of selecting appropriate variables should be guided by a goal-driven approach that ties collected information to the research goals that information is intended to help achieve. This way, it is possible to prevent the collection of data that, for the sake of the experiment, is useless, thus saving the resources that would otherwise be employed in such data collection.

In the context of experimental software engineering, the Goal-Question-Metric approach (GQM) [Basili 94] is generally accepted as the standard approach to achieve this objective. The GQM approach starts with the definition of a goal, including the purpose of measurement, the object to be measured, the issue to be measured and the point of view from which the measure is taken. The goal is refined into questions which, in turn, are refined into metrics that attempt to help answer them.

The Software Engineering experimenter selects both dependent and independent variables. Dependent variables should be explicitly tied to the research goals of the experiment. They should be chosen for their relevance with respect to those goals. When it is not feasible to collect direct measures of the level of achievement of the research goals, surrogates can be used, although such replacement is to be avoided, when possible, and clearly justified, when not. Similarly, independent variables are chosen for their relevance to the research goals.
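The refinement of a goal into questions and of questions into metrics, as prescribed by GQM, can be sketched as a small data structure. The class names, field names, and the sample goal below are hypothetical and serve only to make the structure concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Metric:
    name: str
    unit: str


@dataclass
class Question:
    text: str
    metrics: list = field(default_factory=list)


@dataclass
class Goal:
    purpose: str            # why the measurement is made
    measured_object: str    # what is measured
    issue: str              # which aspect of the object is of interest
    viewpoint: str          # from whose point of view
    questions: list = field(default_factory=list)

    def metrics(self):
        """All metrics reachable from this goal, in question order."""
        return [m for q in self.questions for m in q.metrics]


# Hypothetical goal, question, and metrics, for illustration only.
goal = Goal(
    purpose="evaluate",
    measured_object="code inspection technique",
    issue="effectiveness",
    viewpoint="quality manager",
    questions=[
        Question(
            text="How many defects does an inspection find?",
            metrics=[Metric("defects found", "defects"),
                     Metric("inspection effort", "person-hours")],
        )
    ],
)
```

Keeping every metric attached to a question, and every question to a goal, makes it easy to spot data that would be collected without serving any research goal.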

Kitchenham et al. suggest that, for observational studies and experiments, it may be useful to record additional performance measures that are not directly related to the main research goals of the study, but may nevertheless be affected by the treatments under scrutiny. These extra variables may provide insights concerning possible side effects of the treatments that can be assessed later and motivate further research work [Kitchenham 02].

To facilitate the replicability of experimental and observational work, the variables should be measurable and, if possible, defined using standard measures. Each measure should be defined as clearly and unambiguously as possible, to prevent different interpretations of its definition. This includes specifying the entity from which the measurement is taken, the attribute being measured, the counting rule that is applied, and the unit of measurement. We will revisit this subject in detail in chapter 4, when discussing metrics definition techniques.
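The four elements of a measure definition listed above (entity, attribute, counting rule, and unit) can be made explicit in code. The following is a minimal sketch; the class MeasureDefinition, the counting rule count_non_blank_lines, and the sample source text are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class MeasureDefinition:
    entity: str                           # what the measurement is taken from
    attribute: str                        # the property being measured
    counting_rule: Callable[[str], int]   # how a value is obtained
    unit: str                             # unit of measurement

    def measure(self, artifact: str) -> int:
        return self.counting_rule(artifact)


def count_non_blank_lines(source: str) -> int:
    """Counting rule: every line with at least one non-whitespace character."""
    return sum(1 for line in source.splitlines() if line.strip())


# Hypothetical size measure for a source file.
size = MeasureDefinition(
    entity="source file",
    attribute="size",
    counting_rule=count_non_blank_lines,
    unit="non-blank lines",
)
value = size.measure("int a;\n\nint b;\n")
```

Stating the counting rule as executable code leaves little room for different interpretations, for instance on whether blank lines count towards size.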

Subjects selection

The target population has to be defined as clearly as possible. Selecting subjects is not necessarily a trivial task, but it is essential so that:

• the applicability of the results obtained in the experiment is well understood;

• a suitable strategy for selecting subjects can be devised;

• the representativeness of the subjects that are selected to represent the population can be assessed (inference ability).

Note that these subjects need not be people. Artifacts, such as software components, can also be used as subjects.

The process of clearly defining the population, in itself, sheds some light on the applicability boundaries of the knowledge that will be collected with the experiment, with respect to the theoretical framework the experiment is trying to address. Therefore, the population’s characteristics, including its invariants, have to be clearly stated.

If it is not feasible to identify all the population’s members, it is common to use a frame of the population. In contrast with the full population, all members of the chosen population frame are identified. For example, rather than considering all the software components available from any repository for reuse, one can use a frame that considers only the software components available in a known set of component repositories as the population.

Often, it is not possible to perform the experiment using the whole framed population as experiment subjects. Instead, a sample of that framed population is chosen, with the objective of being as representative of the framed population as possible, considering the resources available to the experimenter. So, while planning an experiment, the sampling technique has to be chosen. Figure 3.10 presents a taxonomy of sampling techniques that are applicable to the scope of Experimental Software Engineering.

Figure 3.10: Classification of sampling techniques

With respect to the organization, sampling can be:

• simple - all elements are treated equally;

• stratified - the elements are separated into different categories in such a way that the variations within each category are minimized, while the variations among different categories are maximized;

• clustered - the elements are grouped into clusters;

• quota - the elements are grouped into different categories, as in the stratified sample, but then chosen in a non-random way, to ensure a pre-specified proportion among the different quotas.

With respect to the sampling method, it can be:

• random - equal probability of choosing any element;

• systematic - a rule is chosen, such as selecting every i-th element in the sample;

• convenience - elements are chosen based on their easier availability.
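Some of the combinations in this taxonomy can be sketched in a few lines of Python. The functions below illustrate the random and systematic methods on a simple organization, and the random method on a stratified organization. The function names, the fixed seed, and the component frame are hypothetical; quota, clustered, and convenience sampling are not covered by this sketch.

```python
import random
from collections import defaultdict


def simple_random_sample(frame, size, rng):
    """Random method: every element has the same chance of being chosen."""
    return rng.sample(frame, size)


def systematic_sample(frame, step, start=0):
    """Systematic method: take every `step`-th element, from `start`."""
    return frame[start::step]


def stratified_random_sample(frame, stratum_of, per_stratum, rng):
    """Stratified organization: sample randomly within each category."""
    strata = defaultdict(list)
    for element in frame:
        strata[stratum_of(element)].append(element)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}


# Hypothetical population frame: 12 components from two repositories.
frame = [("repo-a" if i < 6 else "repo-b", i) for i in range(12)]
rng = random.Random(42)  # fixed seed so the selection can be replicated

random_picks = simple_random_sample(frame, 4, rng)
systematic_picks = systematic_sample(frame, step=3)
stratified_picks = stratified_random_sample(frame, lambda c: c[0], 2, rng)
```

Recording the seed together with the frame is one way of letting an independent team reproduce the same selection of subjects.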

In the context of Experimental Software Engineering, the most common sampling is a combination of the simple organization with convenience sampling. The experiments reported in this dissertation use this kind of sampling. The implications of this choice will be discussed in each of the experiment reports (chapters 6 and 7) included in this dissertation.

Experiment design

The previous choices on hypotheses and variables restrict the available experiment designs. The choice of experiment design is crucial, in that it conditions the valid statistical approaches that can be followed to analyze the data collected in the experiment. Wohlin et al. refer to three general design principles that are used when choosing an experiment design: randomization, blocking, and balancing [Wohlin 99]. Randomization is about averaging out a factor that might otherwise influence the outcome of the test, by ensuring that observations are being made on independent random variables. When the experimenter is aware of a particular factor that may have an influence on the test but is not the factor under test, that effect may be blocked by creating several groups within the sample. Within each group, that factor is approximately constant, so that it has no influence on the test being performed. Balancing is about ensuring that each treatment is administered to a similar number of subjects, to ensure a fair test. This is desirable for improving the soundness of the statistical analysis performed during the experiment.
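The three principles can be combined in a single assignment procedure, sketched below under simple assumptions: subjects are first grouped by the blocking factor, then shuffled within each block, and finally assigned treatments in rotation so that group sizes stay balanced. The function name, the subjects, and the blocking factor (experience level) are hypothetical.

```python
import random
from collections import Counter, defaultdict


def randomized_block_assignment(subjects, block_of, treatments, rng):
    """Assign treatments at random within blocks, keeping groups balanced.

    Returns a dict mapping each subject to a treatment.
    """
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of(subject)].append(subject)       # blocking

    assignment = {}
    for members in blocks.values():
        shuffled = members[:]
        rng.shuffle(shuffled)                           # randomization
        for position, subject in enumerate(shuffled):
            # Cycling through the treatments keeps group sizes balanced.
            assignment[subject] = treatments[position % len(treatments)]
    return assignment


# Hypothetical subjects, blocked by experience level.
subjects = [("s1", "novice"), ("s2", "novice"), ("s3", "novice"),
            ("s4", "novice"), ("s5", "expert"), ("s6", "expert")]
assignment = randomized_block_assignment(
    subjects, block_of=lambda s: s[1], treatments=["A", "B"],
    rng=random.Random(7))

per_block = Counter((subject[1], treatment)
                    for subject, treatment in assignment.items())
```

In this example each treatment receives two novices and one expert, so experience level cannot explain a difference between the treatments.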

There is no shortage of available experimental design lists, both in the context of Software Engineering (e.g. [Basili 96a, Zelkowitz 96, Juristo 98, Wohlin 99, Juristo 01]) and in that of other sciences (e.g. [Cook 76, Creswell 03, Trochim 06]). The recommendations on experimental software practices point to the preferential usage of simple, well-known experiment designs, as they are well documented and can be more easily replicated and understood. In contrast, the usage of custom designs may require the help of a statistician, so that the design’s implications are well understood [Kitchenham 02].

The criteria for describing taxonomies of designs vary significantly depending on the concerns of the proponents of the experimental design lists mentioned earlier. Furthermore, the plethora of available experimental designs is too vast for inclusion in this dissertation. Rather than providing yet another list of experimental designs, we focus on the basic experimental design building blocks. An experiment design prescribes the division of our sample into a set of groups, according to some strategy. Each of those groups receives a set of interventions, which may be either observations or treatments. The sequencing and synchronization of such interventions, their nature, and the group definition policy define the experimental design (figure 3.11).

Figure 3.11: Experimental design concepts overview

In experimentation references such as [Cook 76, Creswell 03, Trochim 06], each experiment design is presented as a sequence, or set of parallel sequences, of symbols that represent the main constructs of the design: observations (O) and treatments (X), following a notation proposed by Campbell and Stanley [Campbell 05]. These symbols are decorated with indexes when different variables or different treatments are used, respectively. Random assignment of subjects to groups is represented by the symbol R. Trochim [Trochim 06] uses two extra symbols for representing non-equivalent groups (N) and cut-off groups (C), while the original notation used a dashed line to represent non-equivalent groups (including cut-offs).

For example, consider an experimental design with two groups, where one group will receive a treatment and the other a placebo (no treatment). Suppose that subjects are randomly assigned to groups. Group A is observed before and after receiving the treatment (these observations are often referred to as pre- and post-tests). Group B is observed at the same moments as group A, but does not receive the treatment. This design can be described as:

Group A    R    O    X    O
Group B    R    O         O

Note that timing and synchronization issues are represented in this notation by the vertical alignment of the symbols. We can integrate these notions in our process model by refining the action Experimental Design Selection, referred to in figure 3.6. Figure 3.12 presents a first overview of the experimental design selection, where a decision is made concerning the number of groups of subjects participating in the design. With single-group designs, which correspond to designs with a single line in Campbell and Stanley’s notation, the group assignment activity can be skipped, as subjects are all assigned to the same group. There are a number of threats to internal validity associated with single-group designs, as we shall discuss in section 3.3.4.
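The role of vertical alignment in the notation can be made concrete with a small rendering routine. The sketch below represents a design as one list of symbols per group, with one entry per time slot, and prints it in aligned columns. The function render_design and the dictionary layout are assumptions for illustration, and the sketch only handles single-character symbols (no indexes).

```python
def render_design(groups):
    """Render a design in Campbell and Stanley's notation.

    `groups` maps a group name to one symbol per time slot; an empty
    string means that the group receives no intervention in that slot.
    """
    name_width = max(len(name) for name in groups)
    lines = []
    for name, slots in groups.items():
        # One character per slot keeps simultaneous interventions aligned.
        cells = "  ".join(symbol.ljust(1) for symbol in slots)
        lines.append(f"{name.ljust(name_width)}  {cells}".rstrip())
    return "\n".join(lines)


# The pre-test/post-test design with a control group described above.
design = {
    "Group A": ["R", "O", "X", "O"],
    "Group B": ["R", "O", "", "O"],
}
rendered = render_design(design)
is_single_group = len(design) == 1
```

Because each list position is a time slot, the empty entry for Group B keeps its post-test in the same column as that of Group A. A design with a single entry in the dictionary corresponds to a single-group design.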

Figure 3.12: Experiment design selection overview

The group assignment is detailed in figure 3.13, where the researcher can decide which of the group division strategies best fits the goals of the experiment.

Random assignment implies that every subject has an equal probability of being assigned to each of the experimental groups. Random assignment is strong against single-group internal validity threats, as well as against most multiple-group internal validity threats. The latter characteristic stems from the probabilistic equivalence of the groups.

Non-equivalent groups are used very often in quasi-experiments. A common example is when an experiment is carried out in an academic context and the groups correspond to different classrooms. This implies that the probabilistic equivalence of the groups is lost. It is still often possible to consider the groups to be comparable, and desirable to form groups as similar as possible. Because the groups are non-equivalent, researchers must consider the additional internal validity threat of selection. If the groups are different in a way that affects the outcome of the experiment, this may become a confounding effect in the analysis of the results.

Cut-off groups are used in situations where the experimenter wishes to use a quantifiable property of the subjects as a discriminator of those subjects. The cut-off point between two groups is used as a limit between those groups, so that subjects with a property value below the cut-off are assigned to one group, while subjects with a property value above the cut-off are assigned to another group. This approach is particularly useful if discontinuities are expected between the different groups.
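Cut-off assignment can be sketched as a simple partition of the subjects around a threshold. The text does not say where a subject scoring exactly the cut-off value belongs; the fragment below assumes such subjects join the upper group. The function name, the subjects, and their scores are hypothetical.

```python
def cutoff_assignment(scores, cutoff):
    """Split subjects into two groups around a cut-off value.

    `scores` maps each subject to the value of the quantifiable property.
    Assumption: subjects scoring exactly the cut-off join the upper group.
    """
    below = sorted(s for s, value in scores.items() if value < cutoff)
    at_or_above = sorted(s for s, value in scores.items() if value >= cutoff)
    return below, at_or_above


# Hypothetical pre-test scores on a 0-100 scale.
scores = {"ana": 35, "rui": 50, "eva": 72, "tom": 48}
below, at_or_above = cutoff_assignment(scores, cutoff=50)
```

Unlike random assignment, this procedure is fully determined by the measured property, so the two groups differ systematically on that property by construction.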

Figure 3.13: Group assignment

Each group is then subject to a sequence of observations and, possibly, treatments,

In document Component-based software engineering: a quantitative approach (Page 100-111)