Implications for Design

The major implication for design is that balanced data sets are usually a good idea. Balanced data are less susceptible to the effects of nonnormality and nonconstant variance. Furthermore, when there is nonconstant variance, we can usually determine the direction in which we err for balanced data. When we know that our measurements will be subject to temporal or spatial correlation, we should take care to block and randomize carefully. We can, in principle, use the correlation in our design and analysis to increase precision, but these methods are beyond this text.

Margin note: Use balanced designs

6.7 Further Reading and Extensions

Statisticians started worrying about what would happen to their t-tests and F-tests on real data almost immediately after they started using the tests. See, for example, Pearson (1931). Scheffé (1959) provides a more mathematical introduction to the effects of violated assumptions than we have given here. Ito (1980) also reviews the subject.

Transformations have long been used in Analysis of Variance. Tukey (1957a) puts the power transformations together as a family, and Box and Cox (1964) introduce the scaling required to make the SSE’s comparable.

Atkinson (1985) and Hoaglin, Mosteller, and Tukey (1983) give more extensive treatments of transformations for several goals, including symmetry and equalization of spread.

The Type I error rates for nonnormal data were computed using the methods of Gayen (1950). Gayen assumed that the data followed an Edgeworth distribution, which is specified by its first four moments, and then computed the distribution of the F-ratio (after several pages of awe-inspiring calculus). Our Table 6.5 is computed with his formula (2.30), though note that there are typos in his paper.

Box and Andersen (1955) approached the same problem from a different tack. They computed the mean and variance of a transformation of the F-ratio under the permutation distribution when the data come from nonnormal distributions. From these moments they computed adjusted degrees of freedom for the F-ratio. They concluded that multiplying the numerator and denominator degrees of freedom by $(1 + \gamma_2/N)$ gave p-values that more closely matched the permutation distribution.
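The degrees-of-freedom adjustment can be sketched in a few lines of Python. This is only an illustration of the multiplication described above: the function name and the example values are ours, and the sketch assumes that $\gamma_2$ (taken here to be the kurtosis) is supplied as a known input rather than estimated.

```python
def box_andersen_df(g, n_total, gamma2):
    """Adjusted (numerator, denominator) degrees of freedom for the F-ratio.

    g       : number of treatment groups
    n_total : total sample size N
    gamma2  : assumed kurtosis of the errors (hypothetical input)
    """
    factor = 1.0 + gamma2 / n_total        # the multiplier (1 + gamma2 / N)
    return factor * (g - 1), factor * (n_total - g)


# Hypothetical example: 4 groups, N = 20, kurtosis 2.
adjusted = box_andersen_df(4, 20, 2.0)
```

With zero kurtosis the factor is 1 and the usual $g - 1$ and $N - g$ degrees of freedom are returned unchanged.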

There are two enormous, parallel areas of literature that deal with outliers. One direction is outlier identification, which deals with finding outliers, and to some extent with estimating and testing after outliers are found and removed. Major references include Hawkins (1980), Beckman and Cook (1983), and Barnett and Lewis (1994). The second direction is robustness, which deals with procedures that are valid and efficient for nonnormal data (particularly outlier-prone data). Major references include Andrews et al. (1972), Huber (1981), and Hampel et al. (1986). Hoaglin, Mosteller, and Tukey (1983) and Rey (1983) provide gentler introductions.

Rank-based, nonparametric methods are a classical alternative to linear methods for nonnormal data. In the simplest situation, the numerical values of the responses are replaced by their ranks, and we then do randomization analysis on the ranks. This is feasible because the randomization distribution of a rank test can often be computed analytically. Rank-based methods have sometimes been advertised as assumption-free; this is not true. Rank methods have their own strengths and weaknesses. For example, the power of two-sample rank tests for equality of medians can be very low when the two samples have different spreads. Conover (1980) is a standard introduction to nonparametric statistics.
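The first step of a rank-based analysis, replacing the responses by their ranks, might look like the following sketch. The function name is ours, and giving tied values the average of their ranks (midranks) is an assumption of this sketch; the text above does not say how ties should be handled.

```python
def midranks(values):
    """Replace each value by its rank (1 = smallest); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Find the end of the block of tied values beginning at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


# Pool the responses from all treatment groups before ranking.
pooled_ranks = midranks([10, 20, 20, 30])
```

The ranks then take the place of the original responses in the randomization analysis.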

We have been modifying the data to make them fit the assumptions of our linear analysis. Where possible, a better approach is to use an analysis that is appropriate for the data. Generalized Linear Models (GLM’s) permit the kinds of mean structures we have been using to be combined with a variety of error structures, including Poisson, binomial, gamma, and other distributions. GLM’s allow direct modeling of many forms of nonnormality and nonconstant variance. On the down side, GLM’s are more difficult to compute, and most of their inference is asymptotic. McCullagh and Nelder (1989) is the standard reference for GLM’s.

We computed approximate test sizes for F under nonconstant variance using a method given in Box (1954). When our distributional assumptions and the null hypothesis are true, then our observed F-statistic $F_{obs}$ is distributed as F with $g - 1$ and $N - g$ degrees of freedom, and

$$P(F_{obs} > F_{\mathcal{E},g-1,N-g}) = \mathcal{E}.$$

If the null is true but we have different variances in the different groups, then $F_{obs}/b$ is distributed approximately as $F(\nu_1, \nu_2)$, where

$$b = \frac{N - g}{N(g - 1)} \, \frac{\sum_i (N - n_i)\sigma_i^2}{\sum_i (n_i - 1)\sigma_i^2},$$

$$\nu_1 = \frac{\left[\sum_i (N - n_i)\sigma_i^2\right]^2}{\left[\sum_i n_i\sigma_i^2\right]^2 + N \sum_i (N - 2n_i)\sigma_i^4},$$

$$\nu_2 = \frac{\left[\sum_i (n_i - 1)\sigma_i^2\right]^2}{\sum_i (n_i - 1)\sigma_i^4}.$$

Thus the actual Type I error rate of the usual F test under nonconstant variance is approximately the probability that an F with $\nu_1$ and $\nu_2$ degrees of freedom is greater than $F_{\mathcal{E},g-1,N-g}/b$.
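The three quantities in Box's approximation are simple sums, so they are easy to compute directly. The following Python sketch evaluates $b$, $\nu_1$, and $\nu_2$ from the group sizes and group variances; the function name and argument names are ours, and the group variances are assumed to be known.

```python
def box_approximation(ns, variances):
    """Return (b, nu1, nu2) for Box's (1954) approximation.

    ns        : list of group sample sizes n_i
    variances : list of group error variances sigma_i^2 (assumed known)
    """
    g = len(ns)
    big_n = sum(ns)
    pairs = list(zip(ns, variances))

    sum_between = sum((big_n - n) * v for n, v in pairs)   # sum (N - n_i) sigma_i^2
    sum_within = sum((n - 1) * v for n, v in pairs)        # sum (n_i - 1) sigma_i^2

    b = (big_n - g) / (big_n * (g - 1)) * sum_between / sum_within
    nu1 = sum_between ** 2 / (
        sum(n * v for n, v in pairs) ** 2
        + big_n * sum((big_n - 2 * n) * v ** 2 for n, v in pairs)
    )
    nu2 = sum_within ** 2 / sum((n - 1) * v ** 2 for n, v in pairs)
    return b, nu1, nu2


# Sanity check: with equal variances we should recover b = 1 and the usual df.
b, nu1, nu2 = box_approximation([5, 5, 5], [2.0, 2.0, 2.0])
```

With equal variances the function returns $b = 1$, $\nu_1 = g - 1$, and $\nu_2 = N - g$, as it should. To finish the calculation we need the upper tail of an F distribution with fractional degrees of freedom; one option, if SciPy is available, is `scipy.stats.f.sf(x, nu1, nu2)` evaluated at $x = F_{\mathcal{E},g-1,N-g}/b$.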

The Durbin-Watson statistic was developed in a series of papers (Durbin and Watson 1950, Durbin and Watson 1951, and Durbin and Watson 1971). The distribution of DW is complicated in even simple situations. Ali (1984) gives a (relatively) simple approximation to the distribution of DW.

There are many more methods to test for serial correlation. Several fairly simple related tests are called runs tests. These tests are based on the idea that if the residuals are arranged in time order, then positive serial correlation will lead to “runs” in the residuals. Different procedures measure runs differently. For example, the statistic for Geary’s test is the total number of consecutive pairs of residuals that have the same sign (Geary 1970). Other runs tests use the maximum number of consecutive residuals of the same sign, the number of runs up (residuals increasing) and down (residuals decreasing), and so on.
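The count used in Geary’s test is easy to compute once the residuals are in time order. Here is a minimal Python sketch; the function name is ours, and treating a pair that includes an exactly zero residual as not having the same sign is an assumption of the sketch. It computes only the count, not its reference distribution.

```python
def same_sign_pairs(residuals):
    """Count consecutive pairs of time-ordered residuals that share a sign."""
    count = 0
    for previous, current in zip(residuals, residuals[1:]):
        # The product is positive only when both residuals have the same sign.
        if previous * current > 0:
            count += 1
    return count


# Hypothetical residuals in time order: the pairs (1.0, 2.0) and (-1.0, -3.0) match.
n_pairs = same_sign_pairs([1.0, 2.0, -1.0, -3.0, 4.0])
```

Positive serial correlation tends to make this count large, because neighboring residuals tend to fall on the same side of zero.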

In some instances we might believe that we know the correlation structure of the errors. For example, in some genetics studies we might believe that correlation can be deduced from pedigree information. If the correlation is known, it can be handled simply and directly by using generalized least squares (Weisberg 1985).

We usually have to use advanced methods from time series or spatial statistics to deal with correlation. Anderson (1954), Durbin (1960), Pierce (1971), and Tsay (1984) all deal with the problem of regression when the residuals are temporally correlated. Kriging is a class of methods for dealing with spatially correlated data that has become widely used, particularly in geology and environmental sciences. Cressie (1991) is a standard reference for spatial statistics. Grondona and Cressie (1991) describe using spatial statistics in the analysis of designed experiments.

6.8 Problems

Exercise 6.1. As part of a larger experiment, 32 male hamsters were assigned to four treatments in a completely randomized fashion, eight hamsters per treatment. The treatments were 0, 1, 10, and 100 nmole of melatonin daily, 1 hour prior to lights out for 12 weeks. The response was paired testes weight (in mg). Below are the means and standard deviations for each treatment group (data from Rollag 1982). What is the problem with these data and what needs to be done to fix it?

Melatonin     Mean     SD
0 nmole       3296     90
1 nmole       2574    153
10 nmole      1466    207
100 nmole      692    332

Exercise 6.2. Bacteria in solution are often counted by a method known as serial dilution plating. Petri dishes with a nutrient agar are inoculated with a measured amount of solution. After 3 days of growth, an individual bacterium will have grown into a small colony that can be seen with the naked eye. Counting original bacteria in the inoculum is then done by counting the colonies on the plate. Trouble arises because we don’t know how much solution to add. If we get too many bacteria in the inoculum, the petri dish will be covered with a lawn of bacterial growth and we won’t be able to identify the colonies. If we get too few bacteria in the inoculum, there may be no colonies to count. The resolution is to make several dilutions of the original solution (1:1, 10:1, 100:1, and so on) and make a plate for each of these dilutions. One of the dilutions should produce a plate with 10 to 100 colonies on it, and that is the one we use. The count in the original sample is obtained by multiplying by the dilution factor.

Suppose that we are trying to compare three different Pasteurization treatments for milk. Fifteen samples of milk are randomly assigned to the three treatments, and we determine the bacterial load in each sample after treatment via serial dilution plating. The following table gives the counts.

Treatment 1    26 × 10^2    29 × 10^2    20 × 10^2    22 × 10^2    32 × 10^2
Treatment 2    35 × 10^3    23 × 10^3    20 × 10^3    30 × 10^3    27 × 10^3
Treatment 3    29 × 10^5    23 × 10^5    17 × 10^5    29 × 10^5    20 × 10^5

Test the null hypothesis that the three treatments have the same effect on bacterial concentration.

Exercise 6.3. In order to determine the efficacy and lethal dosage of cardiac relaxants, anesthetized guinea pigs are infused with a drug (the treatment) till death occurs. The total dosage required for death is the response; smaller lethal doses are considered more effective. There are four drugs, and ten guinea pigs are chosen at random for each drug. Lethal dosages follow.

Drug    Lethal dosages
1       18.2  16.4  10.0  13.5  13.5   6.7  12.2  18.2  13.5  16.4
2        5.5  12.2  11.0   6.7  16.4   8.2   7.4  12.2   6.7  11.0
3        5.5   5.0   8.2   9.0  10.0   6.0   7.4   5.5  12.2   8.2
4        6.0   7.4  12.2  11.0   5.0   7.4   7.4   5.5   6.7   5.5

Determine which drugs are equivalent, which are more effective, and which less effective.

Exercise 6.4. Four overnight delivery services are tested for “gentleness” by shipping fragile items. The breakage rates observed are given below:

A    17  20  15  21  28
B     7  11  15  10  10
C    11   9   5  12   6

You immediately realize that the variance is not stable. Find an approximate 95% confidence interval for the transformation power using the Box-Cox method.

Exercise 6.5. Consider the following four plots. Describe what each plot tells you about the assumptions of normality, independence, and constant variance. (Some plots may tell you nothing about assumptions.)

[Figure: four diagnostic plots. (a) Studentized residuals versus Yhat. (b) y versus rankits. (c) Residuals versus time order. (d) Studentized residuals versus Yhat.]

Exercise 6.6. An instrument called a “Visiplume” measures ultraviolet light. By comparing absorption in clear air and absorption in polluted air, the concentration of SO$_2$ in the polluted air can be estimated. The EPA has a standard method for measuring SO$_2$, and we wish to compare the two methods across a range of air samples. The recorded response is the ratio of the Visiplume reading to the EPA standard reading. The four experimental conditions are: measurements of SO$_2$ in an inflated bag (n = 9), measurements of a smoke generator with SO$_2$ injected (n = 11), measurements at two coal-fired plants (n = 5 and [...] relative to the standard method across all experimental conditions, between the coal-fired plants, and between the generated smoke and the real coal-fired smoke. The data follow (McElhoe and Conner 1986):

Bag            1.055  1.272  .824  1.019  1.069  .983  1.025  1.076  1.100
Smoke          1.131  1.236  1.161  1.219  1.169  1.238  1.197  1.252  1.435  .827  3.188
Plant no. 1    .798  .971  .923  1.079  1.065
Plant no. 2    .950  .978  .762  .733  .823  1.011

Problem 6.1. We wish to study the competition of grass species: in particular, big bluestem (from the tall grass prairie) versus quack grass (a weed). We set up an experimental garden with 24 plots. These plots were randomly allocated to the six treatments: nitrogen level 1 (200 mg N/kg soil) and no irrigation; nitrogen level 1 and 1 cm/week irrigation; nitrogen level 2 (400 mg N/kg soil) and no irrigation; nitrogen level 3 (600 mg N/kg soil) and no irrigation; nitrogen level 4 (800 mg N/kg soil) and no irrigation; and nitrogen level 4 and 1 cm/week irrigation. Big bluestem was seeded in these plots and allowed to establish itself. After one year, we added a measured amount of quack grass seed to each plot. After another year, we harvest the grass and measure the fraction of living material in each plot that is big bluestem. We wish to determine the effects (if any) of nitrogen and/or irrigation on the ability of quack grass to invade big bluestem. (Based on Wedin 1990.)

N level        1    1    2    3    4    4
Irrigation     N    Y    N    N    N    Y
              97   83   85   64   52   48
              96   87   84   72   56   58
              92   78   78   63   44   49
              95   81   79   74   50   53

(a) Do the data need a transformation? If so, which transformation?

(b) Provide an Analysis of Variance for these data. Are all the treatments equivalent?

(d) Is there a significant effect of irrigation?

(e) Under which conditions is big bluestem best able to prevent the invasion by quack grass? Is the response at this set of conditions significantly different from the other conditions?

Question 6.1. What happens to the t-statistic as one of the values becomes extremely large? Look at the data set consisting of the five numbers 0, 0, 0, 0, K, and compute the t-test for testing the null hypothesis that these numbers come from a population with mean 0. What happens to the t-statistic as K goes to infinity?

Question 6.2. Why would we expect the log transformation to be the variance-stabilizing [...]

Chapter 7

Power and Sample Size

The last four chapters have dealt with analyzing experimental results. In this chapter we return to design and consider the issues of choosing and assessing sample sizes. As we know, an experimental design is determined by the units, the treatments, and the assignment mechanism. Once we have chosen a pool of experimental units, decided which treatments to use, and settled on a completely randomized design, the major thing left to decide is the sample sizes for the various treatments. Choice of sample size is important because we want our experiment to be as small as possible to save time and money, but big enough to get the job done. What we need is a way to figure out how large an experiment needs to be to meet our goals; a bigger experiment would be wasteful, and a smaller experiment won’t meet our needs.

Margin note: Decide how large an experiment is needed

7.1 Approaches to Sample Size Selection

There are two approaches to specifying our needs from an experiment, and both require that we know something about the system under test to do effective sample size planning. First, we can require that confidence intervals for means or contrasts should be no wider than a specified length. For example, we might require that a confidence interval for the difference in average weight loss under two diets should be no wider than 1 kg. The width of a confidence interval depends on the desired coverage, the error variance, and the sample size, so we must know the error variance at least roughly before we can compute the required sample size. If we have no idea about the size of the error variance, then we cannot say how wide our intervals will be, and we cannot plan an appropriate sample size.

Margin note: Specify maximum CI width
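To see how coverage, error variance, and sample size interact, consider a rough Python sketch for the difference of two treatment means with $n$ units per treatment. The interval is the difference in means plus or minus a multiplier times $\sqrt{2\sigma^2/n}$. As a simplifying assumption, this sketch uses a normal quantile $z$ in place of the t quantile, so it will slightly understate the required $n$ when $n$ is small. Setting the full width $2z\sqrt{2\sigma^2/n}$ no larger than $W$ gives $n \ge 8z^2\sigma^2/W^2$. The function name and the error variance in the usage line are hypothetical.

```python
import math
from statistics import NormalDist


def n_for_ci_width(sigma2, max_width, coverage=0.95):
    """Approximate per-treatment n so that the CI for a difference of two
    means is no wider than max_width (normal approximation to the t quantile)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - coverage) / 2.0)
    return math.ceil(8.0 * z ** 2 * sigma2 / max_width ** 2)


# Hypothetical planning value: error variance of 4 kg^2, maximum width 1 kg.
n_per_diet = n_for_ci_width(sigma2=4.0, max_width=1.0)
```

Because the required $n$ is proportional to $\sigma^2$, doubling our guess at the error variance doubles the planned sample size, which is why even a rough value for the variance is needed.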

The second approach to sample size selection involves error rates for the fixed level ANOVA F-test. While we prefer to use p-values for analysis, fixed level testing turns out to be a convenient framework for choosing sample size. In a fixed level test, we either reject the null hypothesis or we fail to reject the null hypothesis. If we reject a true null hypothesis, we have made a Type I error, and if we fail to reject a false null hypothesis, we have made a Type II error. The probability of making a Type I error is $\mathcal{E}_I$; $\mathcal{E}_I$ is under our control. We choose a Type I error rate $\mathcal{E}_I$ (5%, 1%, etc.), and reject $H_0$ if the p-value is less than $\mathcal{E}_I$. The probability of making a Type II error is $\mathcal{E}_{II}$; the probability of rejecting $H_0$ when $H_0$ is false is $1 - \mathcal{E}_{II}$ and is called power. The Type II error rate $\mathcal{E}_{II}$ depends on virtually everything: $\mathcal{E}_I$, $g$, $\sigma^2$, and the $\alpha_i$’s and $n_i$’s. Most books use the symbols $\alpha$ and $\beta$ for the Type I and II error rates. We use $\mathcal{E}$ for error rates, and use subscripts here to distinguish types of errors.

Margin note: Power is probability of rejecting a false null hypothesis

It is more or less true that we can fix all but one of the interrelated parameters and solve for the missing one. For example, we may choose $\mathcal{E}_I$, $g$, $\sigma^2$, and the $\alpha_i$’s and $n_i$’s and then solve for $1 - \mathcal{E}_{II}$. This is called a power analysis, because we are determining the power of the experiment for the alternative specified by the particular $\alpha_i$’s. We may also choose $\mathcal{E}_I$, $g$, $1 - \mathcal{E}_{II}$, $\sigma^2$, and the $\alpha_i$’s and then solve for the sample sizes. This, of course, is called a sample size analysis, because we have specified a required power and now find a sample size that achieves that power. For example, consider a situation with three diets, and $\mathcal{E}_I$ is .05. How large should $N$ be (assuming equal $n_i$’s) to have a 90% chance of rejecting $H_0$ when $\sigma^2$ is 9 and the treatment mean responses are $-7$, $-5$, 3 ($\alpha_i$’s are $-4$, $-2$, and 6)?

Margin note: Find minimum sample size that gives desired power

The use of power or sample size analysis begins by deciding on interesting values of the treatment effects and likely ranges for the error variance. “Interesting” values of treatment effects could be anticipated effects, or they could be effects that are of a size to be scientifically significant; in either case, we want to be able to detect interesting effects. For each combination of treatment effects, error variance, sample sizes, and Type I error rate, we may compute the power of the experiment. Sample size computation amounts to repeating this exercise again and again until we find the smallest sample sizes that give us at least as much power as required. Thus what we do is set up a set of circumstances that we would like to detect with a given probability, and then design for those circumstances.

Margin note: Use prior knowledge of system
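The search just described can be sketched by simulation for the three-diet situation posed above (means $-7$, $-5$, 3, $\sigma^2 = 9$, Type I error rate .05). The Python sketch below is an illustration under stated assumptions, not an exact calculation: it estimates power by Monte Carlo with normal errors, and it uses a closed-form critical value that is valid only when the numerator degrees of freedom equal 2 (that is, $g = 3$). All function names, the number of replications, and the seed are ours.

```python
import random


def f_crit_df2(alpha, df2):
    """Upper-alpha critical value of F(2, df2).
    Closed form that holds only for numerator df = 2."""
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)


def f_statistic(groups):
    """One-way ANOVA F-ratio for a list of groups of responses."""
    g = len(groups)
    big_n = sum(len(y) for y in groups)
    grand_mean = sum(sum(y) for y in groups) / big_n
    means = [sum(y) / len(y) for y in groups]
    ss_trt = sum(len(y) * (m - grand_mean) ** 2 for y, m in zip(groups, means))
    ss_err = sum((v - m) ** 2 for y, m in zip(groups, means) for v in y)
    return (ss_trt / (g - 1)) / (ss_err / (big_n - g))


def simulated_power(means, sigma, n, alpha=0.05, reps=4000, seed=1):
    """Monte Carlo estimate of the chance of rejecting H0 with n units
    in each of three treatments (normal errors assumed)."""
    if len(means) != 3:
        raise ValueError("the critical value used here requires g = 3")
    rng = random.Random(seed)
    crit = f_crit_df2(alpha, 3 * n - 3)
    rejections = 0
    for _ in range(reps):
        groups = [[rng.gauss(mu, sigma) for _ in range(n)] for mu in means]
        if f_statistic(groups) > crit:
            rejections += 1
    return rejections / reps


def smallest_n(means, sigma, target=0.90, n_max=50):
    """Repeat the power computation until the target power is reached."""
    for n in range(2, n_max + 1):
        if simulated_power(means, sigma, n) >= target:
            return n
    return None


# Three diets: means -7, -5, 3 and sigma = 3 (sigma^2 = 9).
n_per_diet = smallest_n([-7.0, -5.0, 3.0], 3.0)
```

The total sample size is then $N = 3n$. Because the power is estimated by simulation, values of $n$ whose true power is close to the target may be classified differently with a different seed; increasing `reps` reduces that noise.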

Example 7.1 VOR in ataxia patients

Spinocerebellar ataxias (SCA’s) are inherited, degenerative, neurological diseases. Clinical evidence suggests that eye movements and posture are affected by SCA. There are several distinct types of SCA’s, and we would like
