Analysis Considerations - Comparing model reuse with model building : an empirical study of lea

Due to financial constraints, each condition in the experiment has a sample size of

approximately 20 participants. A concern with this sample size is that the inferential

statistics, for example, non-parametric tests, will lack sufficient power to detect a

difference between the conditions where one exists. The opposite of this argument

is also possible: a small sample size may lead to an apparent difference by chance alone;

for example, simply due to the assignment of participants to groups.

To gain a better handle on these issues the analysis approach adopted makes use

of three concepts. Firstly, using a standard approach from psychology, the size of the

effect of the independent variable on the dependent variable is standardised. The

term for this statistic is an effect size. It indicates whether the effect of the

independent variable on the dependent variable is trivial (useful when assessing the

practical significance of a result) or substantial (useful for non-significant results

due to high variability and low sample size). An effect size is also standardised,

allowing comparison of measures with different scales within and across studies

(section 6.2.1).

The second approach to tackling the sample size issue is to make use of computer

intensive techniques for statistical inference. Here bootstrapping, i.e. resampling

with replacement, is used as it is often cited as providing greater statistical power

than classical parametric and non-parametric techniques for inference (Wilcox, 2005)

(section 6.2.2).

Lastly, as bootstrapping may be unfamiliar to some Operational Researchers,

traditional non-parametric test (Mann-Whitney) results are provided alongside the

bootstrap results and effect sizes.
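
As a point of reference for the traditional test mentioned above, the Mann-Whitney U statistic can be sketched as follows. This is a minimal illustration only: the function name and the scores are hypothetical, and the p-value uses a normal approximation with no tie or continuity correction, so it may differ from the output of a statistics package.

```python
import math


def mann_whitney_u(group_a, group_b):
    """Return (U for group_a, z, two-sided p) using a normal approximation.

    Assumption: no tie correction and no continuity correction are applied.
    """
    n_a, n_b = len(group_a), len(group_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both groups need at least one observation")

    # U counts the pairs in which the group_a score beats the group_b score;
    # a tied pair contributes one half.
    u_a = 0.0
    for a in group_a:
        for b in group_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5

    # Mean and standard deviation of U when the two groups do not differ.
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, z, p_two_sided


# Hypothetical scores for two conditions, invented for illustration only.
condition_one = [1, 2, 3]
condition_two = [4, 5, 6]
u, z, p = mann_whitney_u(condition_one, condition_two)
```

With samples this small an exact test would normally be preferred over the normal approximation; the approximation is used here only to keep the sketch short.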

6.2.1 Standardising the Effect of an Independent Variable

Psychologists have long advocated the use of a statistic called an effect size to

quantify the practical significance/influence of an independent variable on a

dependent variable (Cohen, 1990; Field, 2009). This position has been adopted as they

recognise that the result of a significance test is a function of sample size,

variability and the specified probability of a Type I error. All samples give different

results, but larger samples are more likely to give statistically significant results

that have little practical significance (Field, 2009). The opposite is also true:

practically significant results may be statistically non-significant due to a small

sample size. Thus, in addition to significance results, all psychologists who wish to

publish their results must also quote an effect size to substantiate their claims

(Field, 2009).

One approach for effect sizes is to use confidence intervals for mean differences.

However, the bespoke scales and measures used in psychology research require a

method that allows comparison within and across studies. This has led to the use

of standardised effect sizes. Often this takes the form of the correlation between

independent and dependent variables (Field, 2009).
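
To make the idea of a correlation-based effect size concrete, the sketch below codes membership of a two-level independent variable as 0 or 1 and correlates that code with the dependent variable (a point-biserial correlation). The function names and data are hypothetical, and this coding is an assumption made for illustration; it is not necessarily the procedure detailed in the appendix.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


def effect_size_r(level_one, level_two):
    """Correlate a 0/1 code for the level with the dependent variable."""
    codes = [0] * len(level_one) + [1] * len(level_two)
    scores = list(level_one) + list(level_two)
    return pearson_r(codes, scores)


# Hypothetical scores, invented for illustration only.
level_one = [2.0, 3.0, 4.0]
level_two = [6.0, 7.0, 8.0]
r = effect_size_r(level_one, level_two)
```

Because r is unit-free, the same calculation applies whatever scale the dependent variable was measured on, which is what permits comparison within and across studies.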

As an example, consider Figure 6.1, which illustrates two fictitious relationships

between independent and dependent variables. In each chart the horizontal axis

represents the two levels that the independent variable can take. In Figure 6.1a it

appears that the independent variable is positively correlated with the dependent

variable. The Pearson correlation, r, serves as the measure of standardised effect

size here (r = .86). Thus we would conclude that when the independent variable is at

level two there is a large positive effect on the dependent variable relative to level

one. The size of this effect can then be compared to the impact on dependent variable

two (Figure 6.1b), where the effect is smaller (r = .17), although it could also be

significant given a large enough sample

size. If for example, Figure 6.1b came from a study working in a similar area then

the researcher(s) could interpret the effect size they found in the context of already

existing results (Cohen, 1990); perhaps concluding that their independent variable

has a small effect on attitude change relative to previous findings.

(a) Example Experiment 1 (b) Example Experiment 2

Figure 6.1: Interpreting Effect Sizes

To date, no study of learning from simulation that could be found has used effect

sizes in this way. When no relevant existing studies with effect sizes can be found,

psychologists use a rule of thumb for effect sizes (Cohen, 1990). Table 6.2 provides

details of these taken from Field (2009).

Table 6.2: Standard Interpretation of Effect Sizes

Effect Size    Pearson r
Large          r ≥ .50
Medium         r ≥ .30
Small          r ≥ .10

above form. Appendix B.1.1 details the procedure used.
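
The rule of thumb in Table 6.2 can be written as a small lookup. This is a sketch under two assumptions that are not stated in the table: the thresholds are applied to the magnitude of r, so negative correlations are classified by size alone, and values below .10 are given the hypothetical label "below small".

```python
def interpret_effect_size(r):
    """Classify a Pearson r using the thresholds from Table 6.2."""
    magnitude = abs(r)  # assumption: direction of the effect is ignored
    if magnitude > 1:
        raise ValueError("a correlation must lie between -1 and 1")
    if magnitude >= 0.50:
        return "large"
    if magnitude >= 0.30:
        return "medium"
    if magnitude >= 0.10:
        return "small"
    return "below small"  # hypothetical label for values under .10


# The two correlations quoted for Figure 6.1.
label_a = interpret_effect_size(0.86)
label_b = interpret_effect_size(0.17)
```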

6.2.2 Bootstrap Inference

A rationale for the bootstrap technique starts by considering how a researcher would

gain access to the real world distribution of sample means for a given variable

(Lunneborg, 2000). This would either be through repeated random sampling from a

population or, more conveniently, given a suitable sample size, by use of the Central

Limit Theorem to estimate the standard error of the mean. When asymptotic

normality assumptions do not hold, the researcher need not repeatedly, and

expensively, resample from the population; instead they can resample from their best

estimate of it: the original sample taken (Lunneborg, 2000).
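
The two routes to the standard error of the mean described above can be compared in a short sketch. The scores, function names, number of resamples and seed are all hypothetical choices made for illustration; the first function uses the familiar formula s / sqrt(n), and the second resamples the original sample with replacement.

```python
import math
import random
import statistics


def clt_standard_error(sample):
    """Standard error of the mean from the usual formula s / sqrt(n)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))


def bootstrap_standard_error(sample, n_resamples=2000, seed=1):
    """Standard deviation of the means of resamples drawn with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(sample) for _ in range(n)]
        means.append(statistics.fmean(resample))
    return statistics.stdev(means)


# Hypothetical scores for one condition of 20 participants.
sample_scores = [12, 15, 9, 14, 11, 18, 13, 10, 16, 12,
                 14, 17, 8, 13, 15, 11, 12, 19, 10, 14]
clt_se = clt_standard_error(sample_scores)
boot_se = bootstrap_standard_error(sample_scores)
```

For a well-behaved sample such as this one the two estimates should be close; the bootstrap route becomes more useful for statistics whose standard error has no convenient formula.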

A typical critique of bootstrapping states that the resampling may be taking

values from a biased distribution. Hence no amount of resampling will help. This

critique is usually answered in two ways by proponents of the bootstrap technique.

The first argument is that if the sample is unrepresentative of the population of

interest then all of the inference procedures, classical or computer intensive, are

invalid. The second argument is that a large number of simulation studies have

shown the bootstrap to have greater statistical power than classical parametric and

non-parametric tests of inference - especially when the sample size is small (Wilcox,

2005).

This extra power is useful specifically because the analysis should also take

into account that 18 comparisons are made across the conditions. If care is not

taken in the analysis then the probability of incorrectly detecting that an effect is

present (difference between conditions) is inflated with each comparison made (Field,

2009). In other words inference procedures must be stricter. The extra power from

bootstrapping means that multiple comparison control is now practical. An introduction

to the multiple comparison problem and procedures to correct for it can be found in

Appendix B.1.4. All bootstrapped

tests were subject to multiple comparison control.
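
The inflation of the error rate across 18 comparisons can be illustrated with a short calculation. The sketch assumes a per-comparison significance level of .05 and independent comparisons, neither of which is stated in the text. It also shows Holm's step-down procedure purely as one example of multiple comparison control; it is not claimed to be the procedure described in the appendix.

```python
def familywise_error_rate(alpha, n_comparisons):
    """Chance of at least one false positive across independent comparisons."""
    return 1 - (1 - alpha) ** n_comparisons


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure: True where the null hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, index in enumerate(order):
        # The smallest p-value is tested against alpha / m, the next against
        # alpha / (m - 1), and so on; testing stops at the first failure.
        if p_values[index] <= alpha / (m - rank):
            reject[index] = True
        else:
            break
    return reject


# Assumed significance level of .05 with the 18 comparisons from the text.
uncorrected_rate = familywise_error_rate(0.05, 18)

# Hypothetical p-values, invented for illustration only.
decisions = holm_reject([0.001, 0.04, 0.03])
```

Under these assumptions the chance of at least one false positive is roughly 60%, which is why the inference procedures must be stricter.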

It is emphasised that the bootstrap technique should not be confused with simply

increasing the sample size by repeatedly resampling with replacement. The point of

the bootstrap is to construct the sampling distribution of

a test statistic, for example the difference between the means of two populations, so

that inferences can be made about that difference. Appendix B.1.2 details the full

procedure.
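
The idea of constructing the distribution of a test statistic can be sketched for the difference between two means. This is a simplified percentile interval, offered as an illustration rather than as the full procedure in the appendix; the function name, data, number of resamples and seed are hypothetical.

```python
import random
import statistics


def bootstrap_mean_difference(group_a, group_b, n_resamples=5000,
                              confidence=0.95, seed=1):
    """Percentile bootstrap interval for mean(group_b) - mean(group_a)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    differences = []
    for _ in range(n_resamples):
        # Resample each condition separately, with replacement, at its
        # original size, then record the statistic of interest.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        differences.append(statistics.fmean(resample_b)
                           - statistics.fmean(resample_a))
    differences.sort()
    tail = (1 - confidence) / 2
    lower = differences[round(tail * n_resamples)]
    upper = differences[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical scores for two conditions, invented for illustration only.
condition_a = [10, 12, 11, 13, 12, 11, 10, 12]
condition_b = [15, 17, 16, 18, 16, 17, 15, 16]
lower, upper = bootstrap_mean_difference(condition_a, condition_b)
```

If the interval excludes zero, the data are taken as evidence of a difference between the conditions. Note that the resamples are used only to build the distribution of the difference; the sample size of each condition never changes.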

6.2.3 Distribution of Participant Ability

An early version of this research was published at the 2009 Winter Simulation

Conference (see Monks et al., 2009). Feedback from one reviewer suggested that the

relative statistical ability of each participant across the conditions should be

considered. To an extent this was handled by the randomisation approach for allocation

to conditions, and it was too late to incorporate this point into the main

experimental design; however, the feedback was taken seriously and a retrospective

analysis was undertaken using exam marks for a business school module that it was

believed many of the participants had taken. Participant permission to perform the

analysis was also obtained retrospectively. The full analysis can be seen in Appendix

B.2.

The sample size obtained for the retrospective analysis is extremely small as it

was not possible to obtain suitable exam marks for all participants, and all results

should be considered with this limitation in mind. Nevertheless, the analysis found

no substantial evidence for differences in exam performance between the conditions.

When combined with the randomisation approach this should give some confidence
