Due to financial constraints, each condition in the experiment has a sample size of
approximately 20 participants. A concern with this sample size is that the inferential
statistics, for example, non-parametric tests, will lack sufficient power to detect a
difference between the conditions where one exists. The opposite concern also
applies: a small sample size may lead to an apparent difference by chance alone;
for example, simply due to the assignment of participants to groups.
To gain a better handle on these issues, the analysis approach adopted makes use
of three concepts. Firstly, using a standard approach from psychology, the size of the
effect of the independent variable on the dependent variable is standardised. This
statistic is termed an effect size. It indicates whether the effect of the independent
variable on the dependent variable is trivial (useful when assessing the
practical significance of a result) or substantial (useful for non-significant results
due to high variability and low sample size). Because an effect size is standardised,
measures with different scales can be compared within and across studies
(section 6.2.1).
The second approach to tackling the sample size issue is to make use of computer
intensive techniques for statistical inference. Here bootstrapping, i.e. resampling
with replacement, is used as it is often cited as providing greater statistical power
than classical parametric and non-parametric techniques for inference (Wilcox, 2005)
(section 6.2.2).
Lastly, as bootstrapping may be unfamiliar to some Operational Researchers,
traditional non-parametric test (Mann-Whitney) results are provided alongside the
effect sizes.
6.2.1 Standardising the Effect of an Independent Variable
Psychologists have long advocated the use of a statistic called an effect size to
quantify the practical significance/influence of an independent variable on a
dependent variable (Cohen, 1990; Field, 2009). This position has been adopted as
they recognise that the result of a significance test is a function of sample size,
variability and the specified probability of a Type I error. All samples give
different results, but larger samples are more likely to give statistically
significant results that have little practical significance (Field, 2009). The
opposite is also true: practically significant results may be statistically
non-significant due to a small sample size. Thus, in addition to significance
results, all psychologists who wish to publish their results must also quote an
effect size to substantiate their claims (Field, 2009).
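
The dependence of a significance test on sample size can be illustrated with a
short sketch. It assumes two equal-sized groups with unit standard deviation and
uses a normal (z) approximation rather than any test used in this study; the
function name and the chosen group sizes are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_group):
    """Approximate two-sided p-value for a standardised mean difference d.

    Assumes two independent groups of equal size with unit standard
    deviation, so the standard error of the difference is sqrt(2 / n).
    """
    z = d * sqrt(n_per_group / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# The same trivial effect (d = 0.1) evaluated at increasing sample sizes.
for n in (20, 200, 2000):
    print(n, round(two_sided_p(0.1, n), 4))
```

Under these assumptions the effect is identical in every row, yet it only
crosses the conventional .05 threshold at the largest sample size, which is why
a significance result alone says little about practical importance.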
One approach for effect sizes is to use confidence intervals for mean differences.
However, the bespoke scales and measures used in psychology research require a
method that allows comparison within and across studies. This has led to the use
of standardised effect sizes. Often this takes the form of the correlation between
independent and dependent variables (Field, 2009).
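
As a minimal sketch of this idea, the correlation between a two-level
independent variable and a dependent variable can be computed directly. The data
below are invented for illustration and the helper name pearson_r is
hypothetical; this is not the procedure of Appendix B.1.1.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: condition is the level of the independent variable
# (1 or 2) and score is the dependent variable for each participant.
condition = [1, 1, 1, 1, 2, 2, 2, 2]
score = [3.0, 4.0, 5.0, 4.0, 7.0, 8.0, 6.0, 7.0]

r = pearson_r(condition, score)

# Rescaling the dependent variable leaves r unchanged, which is what
# makes it usable across measures with different scales.
r_rescaled = pearson_r(condition, [10 * s for s in score])
print(round(r, 3), round(r_rescaled, 3))
```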
As an example, consider Figure 6.1, which illustrates two fictitious relationships
between independent and dependent variables. In each chart the horizontal axis
represents the two levels that the independent variable can take. In Figure 6.1a it
appears that the independent variable is positively correlated with the dependent
variable. The Pearson correlation, r, serves as the measure of standardised effect
size here (r = .86). Thus we would conclude that when the independent variable is at
level two there is a large positive effect on the dependent variable relative to
level one. The size of this effect can then be compared to the impact on dependent
variable two (Figure 6.1b), which is smaller (r = .17), although it could also be
significant given a large enough sample size. If, for example, Figure 6.1b came from
a study working in a similar area then the researcher(s) could interpret the effect
size they found in the context of already existing results (Cohen, 1990); perhaps
concluding that their independent variable has a small effect on attitude change
relative to previous findings.
(a) Example Experiment 1 (b) Example Experiment 2
Figure 6.1: Interpreting Effect Sizes
To date, no study of learning from simulation has been found that uses effect sizes
in this way. When no relevant existing studies with effect sizes can be found,
psychologists use rules of thumb for interpreting effect sizes (Cohen, 1990).
Table 6.2 provides details of these, taken from Field (2009).
Table 6.2: Standard Interpretation of Effect Sizes
Effect Size    Pearson r
Large          r ≥ .50
Medium         r ≥ .30
Small          r ≥ .10
above form. Appendix B.1.1 details the procedure used.
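
The rule of thumb in Table 6.2 can be expressed as a simple lookup. This is a
sketch only: the function name, the use of the absolute value of r, and the
label for values under .10 are assumptions made for illustration, not part of
the procedure in Appendix B.1.1.

```python
def interpret_r(r):
    """Label an effect size r using the thresholds in Table 6.2.

    Assumption: the sign of r is ignored, so that negative correlations
    are labelled by magnitude. The "below small" label is hypothetical.
    """
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "medium"
    if size >= 0.10:
        return "small"
    return "below small"


# The two fictitious effect sizes from Figure 6.1.
print(interpret_r(0.86), interpret_r(0.17))
```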
6.2.2 Bootstrap Inference
A rationale for the bootstrap technique starts by considering how a researcher would
gain access to the real world distribution of sample means for a given variable
(Lunneborg, 2000). This would either be through repeated random sampling from a
population or, more conveniently, given a suitable sample size, by use of the Central
Limit Theorem to estimate the standard error of the mean. When asymptotic
normality assumptions do not hold, the researcher does not have to resample
repeatedly, and expensively, from the population; they can instead resample from
their best estimate of it - the original sample taken (Lunneborg, 2000).
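
The rationale above can be sketched in a few lines: the original sample is
resampled with replacement many times, and the spread of the resampled means
approximates the standard error that the Central Limit Theorem would otherwise
supply. The data, the number of resamples and the seed are all hypothetical
choices for illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Means of n_resamples bootstrap resamples drawn from sample.

    Each resample has the same size as the original sample and is drawn
    with replacement.
    """
    rng = random.Random(seed)
    n = len(sample)
    return [mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


# Hypothetical sample of 20 observations.
sample = [12, 15, 9, 22, 17, 14, 30, 11, 13, 16,
          19, 8, 25, 14, 12, 18, 21, 10, 15, 13]

boot = bootstrap_means(sample)
boot_se = stdev(boot)                        # bootstrap standard error
clt_se = stdev(sample) / sqrt(len(sample))   # classical estimate

print(round(boot_se, 3), round(clt_se, 3))
```

For a well-behaved sample such as this one the two estimates should be close;
the bootstrap becomes more attractive when the statistic of interest has no
convenient analytical standard error.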
A typical critique of bootstrapping states that the resampling may be taking
values from a biased distribution. Hence no amount of resampling will help. This
critique is usually answered in two ways by proponents of the bootstrap technique.
The first argument is that if the sample is unrepresentative of the population of
interest then all of the inference procedures, classical or computer intensive, are
invalid. The second argument is that a large number of simulation studies have
shown the bootstrap to have greater statistical power than classical parametric and
non-parametric tests of inference - especially when the sample size is small (Wilcox,
2005).
This extra power is useful specifically because the analysis should also take
into account that 18 comparisons are made across the conditions. If care is not
taken in the analysis then the probability of incorrectly detecting that an effect is
present (difference between conditions) is inflated with each comparison made (Field,
2009). In other words inference procedures must be stricter. The extra power from
bootstrapping means that multiple comparison control is now practical. An
introduction to the problem and procedures to correct for it can be found in
Appendix B.1.4. All bootstrapped tests were subject to multiple comparison control.
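
To illustrate why 18 comparisons call for stricter inference, the sketch below
computes the familywise error rate under the simplifying assumption of
independent tests, and applies Holm's step-down correction. Holm's method is
used here purely as an example of multiple comparison control; the procedure
actually applied in this study is the one described in Appendix B.1.4 and may
differ. The function names and p-values are hypothetical.

```python
def familywise_error(alpha, n_comparisons):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_comparisons


def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure (illustrative assumption only).

    Returns a list of booleans, in the original order of p_values,
    indicating which null hypotheses are rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the smallest p-value with alpha / m, the next with
        # alpha / (m - 1), and so on; stop at the first failure.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject


print(round(familywise_error(0.05, 18), 3))
print(holm_reject([0.001, 0.04, 0.03, 0.2]))
```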
It is emphasised that the bootstrap technique should not be confused with simply
increasing the sample size by repeatedly resampling with replacement. The point of
the bootstrap is to construct the sampling distribution of a test statistic, for
example the difference between the means of two populations, so that inferences
can be made about that difference. Appendix B.1.2 details the full
procedure.
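
A minimal sketch of this idea, assuming a simple percentile interval for the
difference between two group means, is given below. It is not the procedure of
Appendix B.1.2; the function name, data, number of resamples and seed are
hypothetical.

```python
import random
from statistics import mean


def bootstrap_mean_difference(group_a, group_b, n_resamples=5000,
                              level=0.95, seed=1):
    """Observed mean difference (b - a) with a percentile bootstrap interval.

    Each group is resampled with replacement at its original size, so the
    sample size never grows; only the sampling distribution of the
    difference is being approximated.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(mean(resample_b) - mean(resample_a))
    diffs.sort()
    tail = (1 - level) / 2
    lower = diffs[int(tail * n_resamples)]
    upper = diffs[int((1 - tail) * n_resamples) - 1]
    return mean(group_b) - mean(group_a), lower, upper


# Hypothetical scores for two conditions of 20 participants each.
group_a = [10, 12, 11, 13, 9, 14, 12, 10, 11, 13,
           12, 11, 10, 14, 13, 12, 11, 10, 12, 13]
group_b = [15, 17, 16, 18, 14, 19, 17, 15, 16, 18,
           17, 16, 15, 19, 18, 17, 16, 15, 17, 18]

observed, lower, upper = bootstrap_mean_difference(group_a, group_b)
print(observed, lower, upper)
```

If the interval excludes zero, the data are inconsistent with no difference
between the conditions at the chosen level, before any multiple comparison
control is applied.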
6.2.3 Distribution of Participant Ability
An early version of this research was published at the 2009 Winter Simulation
Conference (see Monks et al., 2009). Feedback from one reviewer suggested that the
relative statistical ability of each participant across the conditions should be
considered. To an extent this was handled by the randomisation approach for
allocation to conditions, and it was too late to incorporate this point into the main
experimental design; however, the feedback was taken seriously and a retrospective
analysis was undertaken using exam marks for a business school module that it was
believed many of the participants had taken. Participant permission to perform the
analysis was also obtained retrospectively. The full analysis can be seen in Appendix
B.2.
The sample size obtained for the retrospective analysis is extremely small, as it
was not possible to obtain suitable exam marks for all participants, and all results
should be considered with this limitation in mind. Nevertheless, the analysis found
no substantial evidence for differences in exam performance between the conditions.
When combined with the randomisation approach this should give some confidence