Effect Size and Power - OReilly Statistics in a Nutshell A Desktop Quick Reference Aug 2008 pdf

An important question in studies where sample sizes are small and/or limited is how many experimental units are required to observe an experimental effect. Recall the crash test example above. The experimenter wants to minimize the number of vehicles destroyed, since each vehicle costs a lot of money. On the other hand, the experimenter must be sure that enough cars have been tested to ensure public safety. Similarly, in animal or human experimentation, it is unethical to apply a treatment to more participants than necessary to see a particular effect.

A statistically significant difference between the mean of one sample and an expected mean, or between two means, does not in itself indicate whether the difference is important. The importance of an observed effect must be determined by the knowledge domain and/or industry standards that are relevant to the problem. For example, in the crash testing scenario, prior experience and/or observation from real crashes may indicate three different thresholds for a crash impact effect at different velocities: one beyond which death will occur, one beyond which death will not occur but injury will occur, and one beyond which neither death nor significant injury will occur. These thresholds might be used to determine a “star rating,” for example, so that consumers can make informed choices about car purchasing based on safety. These are examples of important differences, and the differences between them should also be statistically significant. But not all statistically significant differences will be important.

The distance between each of the thresholds in this example corresponds to an effect size, or the magnitude of the difference between them. The effect size for any test of mean comparisons is given by:

$$\text{Effect size} = \frac{\mu_1 - \mu_2}{\sigma}$$

Population means can be replaced by sample means in any specific experiment, and the standard deviation should be the same for both samples (assuming that the homogeneity of variance assumption for t-tests is met), or a pooled estimate can be used. As a concrete example, imagine that at 80 mph the threshold for crash impact beyond which death will occur is 2.5 yards, and the threshold beyond which serious injury but no death will occur is 1.5 yards. Assuming equal variance and a standard deviation of 0.2, the effect size is given by:

$$\frac{\mu_1 - \mu_2}{\sigma} = \frac{2.5 - 1.5}{0.2} = 5.0$$

This is a very large effect size, and so the difference can be considered important. However, if the threshold beyond which no injury will occur is only 1.4 yards, the effect size is much smaller:

$$\frac{\mu_1 - \mu_2}{\sigma} = \frac{1.5 - 1.4}{0.2} = 0.5$$
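The two effect sizes in the crash example can be checked with a few lines of code. The sketch below is in Python, which is a choice made here for illustration rather than a tool the text uses, and the function name `effect_size` is hypothetical.

```python
def effect_size(mean_1, mean_2, sd):
    """Standardized difference between two means: (mean_1 - mean_2) / sd."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (mean_1 - mean_2) / sd


# Thresholds (in yards) and standard deviation from the crash example
death_vs_injury = effect_size(2.5, 1.5, 0.2)
injury_vs_none = effect_size(1.5, 1.4, 0.2)

# Round to hide floating-point noise in the second result
print(round(death_vs_injury, 2), round(injury_vs_none, 2))  # 5.0 0.5
```

Dividing by the standard deviation puts both differences on the same scale, which is why a gap of 0.1 yards counts as a much smaller effect than a gap of 1.0 yards.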


Here, the difference between the populations is going to be very small indeed, since they differ by 0.5 standard deviations on average. In statistical terms, it’s always going to be easier to measure large differences than small differences, when the standard deviation and n are equal.

If you know in advance what the expected difference between two means is before you experiment, based on past experience or observation, and you have a reasonable estimate of the standard deviation, you can compute an effect size prior to experimentation. After selecting an appropriate α (e.g., α = 0.01), you can compute the number of experimental units required to observe a specific level of power.

Calculation of statistical power before you run an experiment is an important step in determining its scope, especially in terms of the likelihood of committing a Type II error. So far, you have learned a lot about Type I errors, but the impact of Type II errors can be quite insidious; imagine signing off on crash tests for a vehicle showing that the sample did not differ from the mean, when in fact the crash performance for the population mean was significantly worse than the acceptable level. Thus, statistical power is best understood as the ability of a test (in this case a t-test) to discriminate between two means when they are in fact different. Power is formally defined as 1 – β, where β is the probability of committing a Type II error.

Following the crash testing example, if you have an effect size of 4.0 and α = 0.05, then to achieve power of 0.90 (i.e., β = 0.1), n should be at least 4. However, if you have an effect size of 0.5 and α = 0.05, then to achieve the same power of 0.90, n should be at least 106. That is a very large difference in the n required to see an experimental effect, and it illustrates why effect sizes are critical to understanding the importance of statistically significant results. In practice, because of the conservative nature of scientific hypothesis testing, priority is usually given to conservative α levels (e.g., α < 0.01), while power is typically accepted at 0.80 (i.e., β = 0.20) in many fields, especially where a lot of repetition occurs in experimentation. Effect size and power are further discussed in Chapter 18.
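To see how strongly the required n depends on effect size, a rough calculation can help. The Python sketch below uses a textbook normal approximation for a one-sided, two-sample comparison of means; both the design and the formula are assumptions made here, not the method behind the figures quoted in the text, so the absolute values will not match those figures. The function name `approx_n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def approx_n_per_group(effect_size, alpha=0.05, power=0.90):
    """Approximate n per group for a one-sided, two-sample test of means.

    Normal approximation: n = 2 * ((z_alpha + z_power) / d) ** 2.
    It ignores the extra uncertainty of the t distribution, so it tends
    to understate the n needed when samples are small.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()                   # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha)     # one-sided critical value
    z_power = z.inv_cdf(power)         # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_large_effect = approx_n_per_group(4.0)
n_small_effect = approx_n_per_group(0.5)
print(n_large_effect, n_small_effect)
```

Because n scales with 1/d², shrinking the effect size from 4.0 to 0.5 (a factor of 8) raises the approximate n by a factor of about 64 before rounding.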

Exercises

While you can use a statistical package like Minitab, SPSS, STATA, SAS, or even Excel to compute t-tests and their significance, working through some examples yourself will make the underlying concepts easier to understand (especially the difference, say, between standard error and standard deviation). Also, if you consider scenarios from work or school that involve small samples, you may begin to develop a sense of how to approach them inferentially using t-tests. If you have understood all of the permutations of t-testing as computed by hand, then using a statistical package will be much easier for you. Also, the output generated by many statistical packages is confusing if you don’t understand what you should be looking for; e.g., most statistical tests are accompanied by various adjustments and corrections that are usually calculated but may not be relevant to your research question, unless one or more of the assumptions underlying the test have been violated (e.g., homogeneity of variance).

Question

A boutique brewery company is trying to determine the optimal period of fermentation for a new organic ale called Old Sarum, which is free of additives and preservatives that may have been hindering the fermentation process, according to the marketing director. However, given that the organic ingredients have never been used before, the brewer needs to know whether the new recipe will require a different fermentation period from the existing recipe. The average fermentation time for existing brews is 48 hours, so the best estimate for the population mean is µ = 48. The master brewer, skeptical that organic ingredients will make any difference at all to the fermentation process, decides to test the null hypothesis that there is no difference between the population mean and the sample mean of Old Sarum.

However, the pressure from the marketing department means that there is only a limited time available for quality control before the new product is launched, so the brewer is only allowed 20 kegs of beer to be brewed and tested. Since there are 120 kegs, a computer program is used to randomly select 20 from the population.

Answer

The brewer finds that the average brewing time is 43 hours and s² = 3.5 for the sample. The number of degrees of freedom is df = 19, since n = 20. At the 0.05 level of significance, the null hypothesis can be rejected if t ≥ 1.729, and at the 0.01 level if t ≥ 2.539.

Since s² = 3.5, the sample standard deviation is s = √3.5 ≈ 1.87. Taking the difference between the means as a magnitude, the value of t can be estimated as follows:

$$t = \frac{|\bar{y} - \mu|}{s / \sqrt{n}} = \frac{48 - 43}{1.87 / \sqrt{20}} \approx 12.0$$

Thus, the null hypothesis can be rejected at both the 0.05 and the 0.01 levels. The brewer concludes that the chance of committing a Type I error is less than 1 in 100, and thus believes that brewing times for Old Sarum are significantly shorter than for the existing brews.

Question

The finance department is very unhappy with the brewer, since 20 kegs is a lot of beer to waste on a test. The finance manager decides to conduct a power analysis to determine how many kegs should have been used, taking into account that a difference of only two hours would already have resulted in a cost saving in terms of fermentation.


Answer

The manager begins by computing the effect size, using s = √3.5 ≈ 1.87 as the estimate of σ:

$$\text{Effect size} = \frac{\mu_1 - \mu_2}{\sigma} = \frac{48 - 46}{1.87} \approx 1.07$$

If you have an effect size of about 1.07 and α = 0.05, then to achieve power of 0.90 (i.e., β = 0.1), n should be at least 15. Thus, the finance manager decides to deduct the cost of the five wasted kegs from the brewing department’s accounts.

Question

After the success of Old Sarum in reducing the costly fermentation process, the brewers are under pressure to make sure that it tastes better than other ales. To this end, the marketing department engages a consultant to undertake an expert panel evaluation of the flavor of Old Sarum versus the original ale. The consultant will employ a panel of expert judges, who are expensive to hire, so only 10 will be empanelled to make taste judgments.

Answer

The results from the experiments are shown in Table 8-4.

Table 8-4. Taste test results for Old Sarum

| Existing brew (out of 10) | Old Sarum (out of 10) | Difference | (Difference)² |
|---|---|---|---|
| 6 | 8 | –2 | 4 |
| 7 | 8 | –1 | 1 |
| 8 | 9 | –1 | 1 |
| 7 | 8 | –1 | 1 |
| 7 | 10 | –3 | 9 |
| 8 | 9 | –1 | 1 |
| 6 | 8 | –2 | 4 |
| 6 | 9 | –3 | 9 |
| 7 | 8 | –1 | 1 |
| 7 | 7 | 0 | 0 |

The null hypothesis in this experiment is that µd = 0. Defining yd as the Old Sarum rating minus the existing brew rating (the negative of the Difference column), the sums and the mean of yd are:

$$\sum y_d = 15, \qquad \sum y_d^2 = 31, \qquad \bar{y}_d = \frac{\sum y_d}{n} = \frac{15}{10} = 1.5$$

The variance is then calculated as:

$$s_d^2 = \frac{\sum y_d^2 - \frac{\left(\sum y_d\right)^2}{n}}{n - 1} = \frac{31 - \frac{15^2}{10}}{9} = 0.94$$

The standard deviation is therefore s_d = √0.94 ≈ 0.97, and the value of t is then given by:

$$t = \frac{\bar{y}_d - \mu_d}{s_d / \sqrt{n}} = \frac{1.5 - 0}{0.97 / \sqrt{10}} \approx 4.9$$

The number of degrees of freedom is df = 9, since n = 10. At the 0.05 level of significance, the null hypothesis can be rejected if t ≥ 1.833, and at the 0.01 level if t ≥ 2.821. Thus, you would reject the null hypothesis in this experiment at both the 0.01 and 0.05 levels of significance and conclude that the panel’s judgments clearly favored Old Sarum.

Of course, since the judges were not randomly selected from the population of judges, they may not reflect broader opinion within the expert community, nor could any inference be made about the wider beer-consuming population, who were not judges. The marketing department would be wise to follow up these studies with a much broader set of tests using randomization.

Note that the flavor ratings here should be interpreted as interval data: if the ratings were to be interpreted as ordinal data, then a nonparametric test for differences between groups, such as the Mann-Whitney U test, would be more appropriate.
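The paired calculation in this answer can be reproduced from the ratings in Table 8-4. The sketch below is in Python using only the standard library, which is a choice made here for illustration, and the variable names are hypothetical. It takes the differences as Old Sarum minus existing brew and divides by the standard deviation (the square root of the variance), which gives a t of roughly 4.9.

```python
import math
from statistics import mean, variance

# Ratings out of 10 from Table 8-4, one pair per judge
existing = [6, 7, 8, 7, 7, 8, 6, 6, 7, 7]
old_sarum = [8, 8, 9, 8, 10, 9, 8, 9, 8, 7]

# Differences taken as Old Sarum minus existing brew
diffs = [new - old for new, old in zip(old_sarum, existing)]
n = len(diffs)

mean_d = mean(diffs)        # sum of differences divided by n
var_d = variance(diffs)     # sample variance, with n - 1 in the denominator
sd_d = math.sqrt(var_d)     # standard deviation of the differences

# Paired t statistic under the null hypothesis mu_d = 0
t_stat = (mean_d - 0) / (sd_d / math.sqrt(n))

print(sum(diffs), sum(d * d for d in diffs), mean_d)
print(round(var_d, 2), round(t_stat, 2))
```

The statistic is then compared with the critical values for df = n – 1 = 9 given in the answer.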
