The major implication for design is that balanced data sets are usually a good idea. Balanced data are less susceptible to the effects of nonnormality and nonconstant variance. Furthermore, when there is nonconstant variance, we can usually determine the direction in which we err for balanced data. When we know that our measurements will be subject to temporal or spatial correlation, we should take care to block and randomize carefully. We can, in principle, use the correlation in our design and analysis to increase precision, but these methods are beyond this text.
6.7 Further Reading and Extensions
Statisticians started worrying about what would happen to their t-tests and F-tests on real data almost immediately after they started using the tests. See, for example, Pearson (1931). Scheffé (1959) provides a more mathematical introduction to the effects of violated assumptions than we have given here. Ito (1980) also reviews the subject.
Transformations have long been used in Analysis of Variance. Tukey (1957a) puts the power transformations together as a family, and Box and Cox (1964) introduce the scaling required to make the SSE’s comparable.
Atkinson (1985) and Hoaglin, Mosteller, and Tukey (1983) give more extensive treatments of transformations for several goals, including symmetry and equalization of spread.
The Type I error rates for nonnormal data were computed using the methods of Gayen (1950). Gayen assumed that the data followed an Edgeworth distribution, which is specified by its first four moments, and then computed the distribution of the F-ratio (after several pages of awe-inspiring calculus). Our Table 6.5 is computed with his formula (2.30), though note that there are typos in his paper.
Box and Andersen (1955) approached the same problem from a different tack. They computed the mean and variance of a transformation of the F-ratio under the permutation distribution when the data come from nonnormal distributions. From these moments they compute adjusted degrees of freedom for the F-ratio. They concluded that multiplying the numerator and denominator degrees of freedom by $(1 + \gamma_2/N)$ gave p-values that more closely matched the permutation distribution.
There are two enormous, parallel areas of literature that deal with outliers. One direction is outlier identification, which deals with finding outliers, and to some extent with estimating and testing after outliers are found and removed. Major references include Hawkins (1980), Beckman and Cook (1983), and Barnett and Lewis (1994). The second direction is robustness, which deals with procedures that are valid and efficient for nonnormal data (particularly outlier-prone data). Major references include Andrews et al. (1972), Huber (1981), and Hampel et al. (1986). Hoaglin, Mosteller, and Tukey (1983) and Rey (1983) provide gentler introductions.
Rank-based, nonparametric methods are a classical alternative to linear methods for nonnormal data. In the simplest situation, the numerical values of the responses are replaced by their ranks, and we then do randomization analysis on the ranks. This is feasible because the randomization distribution of a rank test can often be computed analytically. Rank-based methods have sometimes been advertised as assumption-free; this is not true. Rank methods have their own strengths and weaknesses. For example, the power of two-sample rank tests for equality of medians can be very low when the two samples have different spreads. Conover (1980) is a standard introduction to nonparametric statistics.
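The rank-then-randomize idea can be sketched for two samples as follows. This is a minimal illustration rather than a procedure from any of the references: the function names and the data are hypothetical, tied values are given the average of their ranks, and the randomization distribution is found by brute-force enumeration instead of the analytic formulas mentioned above.

```python
from itertools import combinations


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of ranks start+1 .. end+1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def rank_randomization_pvalue(x, y):
    """Two-sided randomization p-value for the rank sum of the first sample."""
    ranks = midranks(list(x) + list(y))
    n_x, n_total = len(x), len(x) + len(y)
    expected = n_x * (n_total + 1) / 2  # mean rank sum under the null
    observed_dev = abs(sum(ranks[:n_x]) - expected)
    count = total = 0
    # Enumerate every way of assigning n_x of the ranks to the first sample.
    for subset in combinations(range(n_total), n_x):
        total += 1
        deviation = abs(sum(ranks[i] for i in subset) - expected)
        if deviation >= observed_dev - 1e-12:
            count += 1
    return count / total


# Hypothetical data: every value in the first sample is below the second.
p_value = rank_randomization_pvalue([1.2, 2.3, 3.1], [4.8, 5.0, 6.4])
print(p_value)
```

With three units per sample there are only 20 possible assignments, and the two most extreme ones give the smallest achievable two-sided p-value, 2/20. Enumeration grows quickly with sample size, which is why analytic or tabulated distributions are used in practice.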
We have been modifying the data to make them fit the assumptions of our linear analysis. Where possible, a better approach is to use an analysis that is appropriate for the data. Generalized Linear Models (GLM’s) permit the kinds of mean structures we have been using to be combined with a variety of error structures, including Poisson, binomial, gamma, and other distributions. GLM’s allow direct modeling of many forms of nonnormality and nonconstant variance. On the down side, GLM’s are more difficult to compute, and most of their inference is asymptotic. McCullagh and Nelder (1989) is the standard reference for GLM’s.
We computed approximate test sizes for F under nonconstant variance using a method given in Box (1954). When our distributional assumptions and the null hypothesis are true, then our observed F-statistic $F_{obs}$ is distributed as F with $g - 1$ and $N - g$ degrees of freedom, and

$$P(F_{obs} > F_{\mathcal{E},g-1,N-g}) = \mathcal{E}.$$

If the null is true but we have different variances in the different groups, then $F_{obs}/b$ is distributed approximately as $F(\nu_1, \nu_2)$, where

$$b = \frac{N-g}{N(g-1)} \, \frac{\sum_i (N-n_i)\sigma_i^2}{\sum_i (n_i-1)\sigma_i^2},$$

$$\nu_1 = \frac{\left[\sum_i (N-n_i)\sigma_i^2\right]^2}{\left[\sum_i n_i\sigma_i^2\right]^2 + N \sum_i (N-2n_i)\sigma_i^4},$$

$$\nu_2 = \frac{\left[\sum_i (n_i-1)\sigma_i^2\right]^2}{\sum_i (n_i-1)\sigma_i^4}.$$

Thus the actual Type I error rate of the usual F test under nonconstant variance is approximately the probability that an F with $\nu_1$ and $\nu_2$ degrees of freedom is greater than $F_{\mathcal{E},g-1,N-g}/b$.
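The three quantities in Box's approximation are simple sums, so they are easy to compute directly. The sketch below is only an illustration: the function name `box_adjustment` and the group sizes and variances are hypothetical, and it stops short of the final tail probability, which needs an F distribution routine (for example `scipy.stats.f.sf`, if SciPy is available).

```python
def box_adjustment(sizes, variances):
    """Return (b, nu1, nu2) for the one-way F-test under unequal variances.

    sizes     -- group sample sizes n_i
    variances -- group error variances sigma_i^2
    """
    g = len(sizes)
    N = sum(sizes)
    pairs = list(zip(sizes, variances))
    s_between = sum((N - n) * v for n, v in pairs)  # sum (N - n_i) sigma_i^2
    s_within = sum((n - 1) * v for n, v in pairs)   # sum (n_i - 1) sigma_i^2
    s_n = sum(n * v for n, v in pairs)              # sum n_i sigma_i^2

    b = (N - g) / (N * (g - 1)) * s_between / s_within
    nu1 = s_between ** 2 / (
        s_n ** 2 + N * sum((N - 2 * n) * v ** 2 for n, v in pairs)
    )
    nu2 = s_within ** 2 / sum((n - 1) * v ** 2 for n, v in pairs)
    return b, nu1, nu2


# Equal variances and equal sample sizes: the usual F-test is recovered.
print(box_adjustment([5, 5, 5], [2.0, 2.0, 2.0]))

# Hypothetical unbalanced case where the smallest group has the largest variance.
print(box_adjustment([3, 6, 9], [9.0, 4.0, 1.0]))
```

A useful check on the formulas is the first call: with equal variances and balanced data, $b = 1$, $\nu_1 = g - 1$, and $\nu_2 = N - g$, so the approximation reduces to the standard F distribution.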
The Durbin-Watson statistic was developed in a series of papers (Durbin and Watson 1950, Durbin and Watson 1951, and Durbin and Watson 1971). The distribution of DW is complicated in even simple situations. Ali (1984) gives a (relatively) simple approximation to the distribution of DW.
There are many more methods to test for serial correlation. Several fairly simple related tests are called runs tests. These tests are based on the idea that if the residuals are arranged in time order, then positive serial correlation will lead to “runs” in the residuals. Different procedures measure runs differently. For example, Geary’s test is the total number of consecutive pairs of residuals that have the same sign (Geary 1970). Other runs statistics include the maximum number of consecutive residuals of the same sign, the number of runs up (residuals increasing) and down (residuals decreasing), and so on.
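Two of these runs statistics can be sketched in a few lines. The function names and the residuals below are hypothetical, and treating a residual of exactly zero as having no sign is an assumption of this sketch, not part of the cited tests. The code computes only the statistics; judging whether a value is unusually large requires the null distribution from the references.

```python
def same_sign_pairs(residuals):
    """Count consecutive pairs of time-ordered residuals with the same sign."""
    count = 0
    for previous, current in zip(residuals, residuals[1:]):
        if previous * current > 0:  # both positive or both negative
            count += 1
    return count


def longest_same_sign_run(residuals):
    """Length of the longest stretch of consecutive residuals with one sign."""
    longest = current = 0
    previous_sign = 0
    for r in residuals:
        sign = (r > 0) - (r < 0)  # +1, -1, or 0
        if sign != 0 and sign == previous_sign:
            current += 1
        else:
            current = 1 if sign != 0 else 0
        longest = max(longest, current)
        previous_sign = sign
    return longest


# Hypothetical residuals, listed in time order.
resids = [0.5, 1.2, -0.3, -0.8, -0.1, 0.4]
print(same_sign_pairs(resids), longest_same_sign_run(resids))
```

Positive serial correlation tends to push both statistics upward, because neighboring residuals are more likely to fall on the same side of zero.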
In some instances we might believe that we know the correlation structure of the errors. For example, in some genetics studies we might believe that correlation can be deduced from pedigree information. If the correlation is known, it can be handled simply and directly by using generalized least squares (Weisberg 1985).
We usually have to use advanced methods from time series or spatial statistics to deal with correlation. Anderson (1954), Durbin (1960), Pierce (1971), and Tsay (1984) all deal with the problem of regression when the residuals are temporally correlated. Kriging is a class of methods for dealing with spatially correlated data that has become widely used, particularly in geology and environmental sciences. Cressie (1991) is a standard reference for spatial statistics. Grondona and Cressie (1991) describe using spatial statistics in the analysis of designed experiments.
6.8 Problems
Exercise 6.1
As part of a larger experiment, 32 male hamsters were assigned to four treatments in a completely randomized fashion, eight hamsters per treatment. The treatments were 0, 1, 10, and 100 nmole of melatonin daily, 1 hour prior to lights out for 12 weeks. The response was paired testes weight (in mg). Below are the means and standard deviations for each treatment group (data from Rollag 1982). What is the problem with these data and what needs to be done to fix it?
Melatonin    Mean    SD
0 nmole      3296     90
1 nmole      2574    153
10 nmole     1466    207
100 nmole     692    332
Exercise 6.2
Bacteria in solution are often counted by a method known as serial dilution plating. Petri dishes with a nutrient agar are inoculated with a measured amount of solution. After 3 days of growth, an individual bacterium will have grown into a small colony that can be seen with the naked eye. Counting original bacteria in the inoculum is then done by counting the colonies on the plate. Trouble arises because we don’t know how much solution to add. If we get too many bacteria in the inoculum, the petri dish will be covered with a lawn of bacterial growth and we won’t be able to identify the colonies. If we get too few bacteria in the inoculum, there may be no colonies to count. The resolution is to make several dilutions of the original solution (1:1, 10:1, 100:1, and so on) and make a plate for each of these dilutions. One of the dilutions should produce a plate with 10 to 100 colonies on it, and that is the one we use. The count in the original sample is obtained by multiplying by the dilution factor.
Suppose that we are trying to compare three different Pasteurization treatments for milk. Fifteen samples of milk are randomly assigned to the three treatments, and we determine the bacterial load in each sample after treatment via serial dilution plating. The following table gives the counts.
Treatment 1    26 × 10^2    29 × 10^2    20 × 10^2    22 × 10^2    32 × 10^2
Treatment 2    35 × 10^3    23 × 10^3    20 × 10^3    30 × 10^3    27 × 10^3
Treatment 3    29 × 10^5    23 × 10^5    17 × 10^5    29 × 10^5    20 × 10^5

Test the null hypothesis that the three treatments have the same effect on bacterial concentration.
Exercise 6.3
In order to determine the efficacy and lethal dosage of cardiac relaxants, anesthetized guinea pigs are infused with a drug (the treatment) till death occurs. The total dosage required for death is the response; smaller lethal doses are considered more effective. There are four drugs, and ten guinea pigs are chosen at random for each drug. Lethal dosages follow.
Drug    Lethal dosages
1       18.2  16.4  10.0  13.5  13.5   6.7  12.2  18.2  13.5  16.4
2        5.5  12.2  11.0   6.7  16.4   8.2   7.4  12.2   6.7  11.0
3        5.5   5.0   8.2   9.0  10.0   6.0   7.4   5.5  12.2   8.2
4        6.0   7.4  12.2  11.0   5.0   7.4   7.4   5.5   6.7   5.5

Determine which drugs are equivalent, which are more effective, and which less effective.
Exercise 6.4
Four overnight delivery services are tested for “gentleness” by shipping fragile items. The breakage rates observed are given below:

A    17  20  15  21  28
B     7  11  15  10  10
C    11   9   5  12   6

You immediately realize that the variance is not stable. Find an approximate 95% confidence interval for the transformation power using the Box-Cox method.
Exercise 6.5
Consider the following four plots. Describe what each plot tells you about the assumptions of normality, independence, and constant variance. (Some plots may tell you nothing about assumptions.)
a) [Plot: studentized residuals versus Yhat]
b) [Plot: y versus rankits]
c) [Plot: residuals versus time order]
d) [Plot: studentized residuals versus Yhat]
Exercise 6.6
An instrument called a “Visiplume” measures ultraviolet light. By comparing absorption in clear air and absorption in polluted air, the concentration of SO$_2$ in the polluted air can be estimated. The EPA has a standard method for measuring SO$_2$, and we wish to compare the two methods across a range of air samples. The recorded response is the ratio of the Visiplume reading to the EPA standard reading. The four experimental conditions are: measurements of SO$_2$ in an inflated bag (n = 9), measurements of a smoke generator with SO$_2$ injected (n = 11), measurements at two coal-fired plants (n = 5 and [...] relative to the standard method across all experimental conditions, between the coal-fired plants, and between the generated smoke and the real coal-fired smoke. The data follow (McElhoe and Conner 1986):
Bag            1.055  1.272   .824  1.019  1.069   .983  1.025  1.076  1.100
Smoke          1.131  1.236  1.161  1.219  1.169  1.238  1.197  1.252  1.435   .827  3.188
Plant no. 1     .798   .971   .923  1.079  1.065
Plant no. 2     .950   .978   .762   .733   .823  1.011
Problem 6.1
We wish to study the competition of grass species: in particular, big bluestem (from the tall grass prairie) versus quack grass (a weed). We set up an experimental garden with 24 plots. These plots were randomly allocated to the six treatments: nitrogen level 1 (200 mg N/kg soil) and no irrigation; nitrogen level 1 and 1 cm/week irrigation; nitrogen level 2 (400 mg N/kg soil) and no irrigation; nitrogen level 3 (600 mg N/kg soil) and no irrigation; nitrogen level 4 (800 mg N/kg soil) and no irrigation; and nitrogen level 4 and 1 cm/week irrigation. Big bluestem was seeded in these plots and allowed to establish itself. After one year, we added a measured amount of quack grass seed to each plot. After another year, we harvest the grass and measure the fraction of living material in each plot that is big bluestem. We wish to determine the effects (if any) of nitrogen and/or irrigation on the ability of quack grass to invade big bluestem. (Based on Wedin 1990.)
N level        1    1    2    3    4    4
Irrigation     N    Y    N    N    N    Y
              97   83   85   64   52   48
              96   87   84   72   56   58
              92   78   78   63   44   49
              95   81   79   74   50   53
(a) Do the data need a transformation? If so, which transformation?
(b) Provide an Analysis of Variance for these data. Are all the treatments equivalent?
(c) Are there significant quadratic effects of nitrogen under nonirrigated conditions?
(d) Is there a significant effect of irrigation?
(e) Under which conditions is big bluestem best able to prevent the invasion by quack grass? Is the response at this set of conditions significantly different from the other conditions?
Question 6.1
What happens to the t-statistic as one of the values becomes extremely large? Look at the data set consisting of the five numbers 0, 0, 0, 0, K, and compute the t-test for testing the null hypothesis that these numbers come from a population with mean 0. What happens to the t-statistic as K goes to infinity?
Question 6.2
Why would we expect the log transformation to be the variance-stabilizing
Chapter 7
Power and Sample Size
The last four chapters have dealt with analyzing experimental results. In this chapter we return to design and consider the issues of choosing and assessing sample sizes. As we know, an experimental design is determined by the units, the treatments, and the assignment mechanism. Once we have chosen a pool of experimental units, decided which treatments to use, and settled on a completely randomized design, the major thing left to decide is the sample sizes for the various treatments. Choice of sample size is important because we want our experiment to be as small as possible to save time and money, but big enough to get the job done. What we need is a way to figure out how large an experiment needs to be to meet our goals; a bigger experiment would be wasteful, and a smaller experiment won’t meet our needs.
7.1 Approaches to Sample Size Selection
There are two approaches to specifying our needs from an experiment, and both require that we know something about the system under test to do effective sample size planning. First, we can require that confidence intervals for means or contrasts should be no wider than a specified length. For example, we might require that a confidence interval for the difference in average weight loss under two diets should be no wider than 1 kg. The width of a confidence interval depends on the desired coverage, the error variance, and the sample size, so we must know the error variance at least roughly before we can compute the required sample size. If we have no idea about the size of the error variance, then we cannot say how wide our intervals will be, and we cannot plan an appropriate sample size.
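As a rough illustration of the width criterion, the sketch below finds the smallest per-diet sample size for the two-diet example. Several things here are assumptions rather than part of the text: the error standard deviation of 1.5 kg is a made-up planning value, the function names are hypothetical, and a normal quantile stands in for the t quantile, so the answer is slightly optimistic for small samples.

```python
from math import ceil, sqrt
from statistics import NormalDist


def ci_width(n_per_group, sigma, coverage=0.95):
    """Approximate full width of a CI for the difference of two group means."""
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    # Standard error of a difference of two means, each based on n units.
    return 2 * z * sqrt(2 * sigma ** 2 / n_per_group)


def n_for_width(max_width, sigma, coverage=0.95):
    """Smallest per-group n whose approximate CI width is at most max_width."""
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    # Solve 2 * z * sqrt(2 * sigma^2 / n) <= max_width for n.
    return max(2, ceil(8 * (z * sigma / max_width) ** 2))


# Assumed planning value: error standard deviation of 1.5 kg.
n = n_for_width(max_width=1.0, sigma=1.5)
print(n, ci_width(n, sigma=1.5))
```

Because the required n is proportional to the error variance, doubling the assumed standard deviation quadruples the sample size, which is why a rough idea of the error variance is essential before planning.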
The second approach to sample size selection involves error rates for the fixed level ANOVA F-test. While we prefer to use p-values for analysis, fixed level testing turns out to be a convenient framework for choosing sample size. In a fixed level test, we either reject the null hypothesis or we fail to reject the null hypothesis. If we reject a true null hypothesis, we have made a Type I error, and if we fail to reject a false null hypothesis, we have made a Type II error. The probability of making a Type I error is $\mathcal{E}_I$; $\mathcal{E}_I$ is under our control. We choose a Type I error rate $\mathcal{E}_I$ (5%, 1%, etc.), and reject $H_0$ if the p-value is less than $\mathcal{E}_I$. The probability of making a Type II error is $\mathcal{E}_{II}$; the probability of rejecting $H_0$ when $H_0$ is false is $1 - \mathcal{E}_{II}$ and is called power. The Type II error rate $\mathcal{E}_{II}$ depends on virtually everything: $\mathcal{E}_I$, $g$, $\sigma^2$, and the $\alpha_i$’s and $n_i$’s. Most books use the symbols $\alpha$ and $\beta$ for the Type I and II error rates. We use $\mathcal{E}$ for error rates, and use subscripts here to distinguish types of errors.
It is more or less true that we can fix all but one of the interrelated parameters and solve for the missing one. For example, we may choose $\mathcal{E}_I$, $g$, $\sigma^2$, and the $\alpha_i$’s and $n_i$’s and then solve for $1 - \mathcal{E}_{II}$. This is called a power analysis, because we are determining the power of the experiment for the alternative specified by the particular $\alpha_i$’s. We may also choose $\mathcal{E}_I$, $g$, $1 - \mathcal{E}_{II}$, $\sigma^2$, and the $\alpha_i$’s and then solve for the sample sizes. This, of course, is called a sample size analysis, because we have specified a required power and now find a sample size that achieves that power. For example, consider a situation with three diets, and $\mathcal{E}_I$ is .05. How large should $N$ be (assuming equal $n_i$’s) to have a 90% chance of rejecting $H_0$ when $\sigma^2$ is 9 and the treatment mean responses are -7, -5, 3 ($\alpha_i$’s are -4, -2, and 6)?
The use of power or sample size analysis begins by deciding on interesting values of the treatment effects and likely ranges for the error variance. “Interesting” values of treatment effects could be anticipated effects, or they could be effects that are of a size to be scientifically significant; in either case, we want to be able to detect interesting effects. For each combination of treatment effects, error variance, sample sizes, and Type I error rate, we may compute the power of the experiment. Sample size computation amounts to repeating this exercise again and again until we find the smallest sample sizes that give us at least as much power as required. Thus what we do is set up a set of circumstances that we would like to detect with a given probability, and then design for those circumstances.
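The repeat-until-enough-power loop can be sketched with a small simulation for the three-diet setting described above (treatment means -7, -5, 3 and error variance 9). This is a swapped-in technique, not an exact calculation: power is estimated by Monte Carlo with normal errors, the critical value is itself an empirical quantile from simulated null data, and all function names, the number of replications, and the seed are arbitrary assumptions. The estimates therefore carry simulation error.

```python
import random


def f_statistic(groups):
    """One-way ANOVA F-statistic for a list of groups of responses."""
    g = len(groups)
    N = sum(len(grp) for grp in groups)
    grand_mean = sum(sum(grp) for grp in groups) / N
    means = [sum(grp) / len(grp) for grp in groups]
    ss_trt = sum(len(grp) * (m - grand_mean) ** 2 for grp, m in zip(groups, means))
    ss_err = sum((y - m) ** 2 for grp, m in zip(groups, means) for y in grp)
    return (ss_trt / (g - 1)) / (ss_err / (N - g))


def simulate_f(means, sigma, n, rng):
    """Simulate one balanced experiment and return its F-statistic."""
    groups = [[rng.gauss(mu, sigma) for _ in range(n)] for mu in means]
    return f_statistic(groups)


def estimated_power(means, sigma, n, level=0.05, reps=4000, seed=1):
    """Monte Carlo estimate of the power of the fixed level F-test."""
    rng = random.Random(seed)
    # Empirical critical value: upper `level` quantile of F under the null.
    null_f = sorted(simulate_f([0.0] * len(means), sigma, n, rng) for _ in range(reps))
    critical = null_f[int((1 - level) * reps)]
    rejections = sum(simulate_f(means, sigma, n, rng) > critical for _ in range(reps))
    return rejections / reps


def smallest_n(means, sigma, target_power, level=0.05, max_n=50):
    """Smallest per-group n (from 2 up) whose estimated power meets the target."""
    for n in range(2, max_n + 1):
        if estimated_power(means, sigma, n, level) >= target_power:
            return n
    return None  # target not reached within max_n


diet_means = [-7.0, -5.0, 3.0]  # treatment means from the three-diet example
n_needed = smallest_n(diet_means, sigma=3.0, target_power=0.90)
print(n_needed, 3 * n_needed)   # per-group n and total N
```

The search stops at the first n that reaches the target, so the result is the smallest balanced design meeting the requirement under these assumed means and variance. Changing either assumption changes the answer, which is the reason for exploring a range of plausible error variances.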
Example 7.1 VOR in ataxia patients
Spinocerebellar ataxias (SCA’s) are inherited, degenerative, neurological diseases. Clinical evidence suggests that eye movements and posture are affected by SCA. There are several distinct types of SCA’s, and we would like