0 10+1 2 5 + (−1)2 5 = .01 .4 = .025 .
The between groups sum of squares is
\[ 10(6.3 - 6.375)^2 + 5(6.4 - 6.375)^2 + 5(6.5 - 6.375)^2 = .1375 \]
which equals .1125 + .025.
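The decomposition above can be checked numerically. The following is a minimal sketch, assuming group means 6.3, 6.4, and 6.5 with sample sizes 10, 5, and 5 (the values appearing in the between groups sum), a control-versus-new contrast with coefficients (1, -.5, -.5), and a new-versus-new contrast with coefficients (0, 1, -1). The helper name `contrast_ss` is hypothetical.

```python
def contrast_ss(w, means, n):
    """Sum of squares for a contrast with coefficients w."""
    estimate = sum(wi * yi for wi, yi in zip(w, means))
    scale = sum(wi ** 2 / ni for wi, ni in zip(w, n))
    return estimate ** 2 / scale

means = [6.3, 6.4, 6.5]   # treatment means (control, new 1, new 2)
n = [10, 5, 5]            # sample sizes
grand_mean = sum(ni * yi for ni, yi in zip(n, means)) / sum(n)

# Between groups sum of squares computed directly.
ss_between = sum(ni * (yi - grand_mean) ** 2 for ni, yi in zip(n, means))

# Sums of squares for the two contrasts (coefficients are assumptions).
ss_control_vs_new = contrast_ss([1, -0.5, -0.5], means, n)
ss_new_vs_new = contrast_ss([0, 1, -1], means, n)

print(round(ss_between, 4), round(ss_control_vs_new, 4), round(ss_new_vs_new, 4))
```

Under these assumptions the two contrast sums of squares add up to the between groups sum of squares, and their ratio is the factor of 4.5 discussed next.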
We can see from Example 4.2 one of the advantages of contrasts over the full between groups sum of squares. The control-versus-new contrast has a sum of squares which is 4.5 times larger than the sum of squares for the difference of the new treatments. This indicates that the responses from the new treatments are substantially farther from the control responses than they are from each other. Such indications are not possible using the between groups sum of squares.
The actual contrasts one uses in an analysis arise from the context of the problem. Here we had new versus old and the difference between the two new treatments. In a study on the composition of ice cream, we might compare artificial flavorings with natural flavorings, or expensive flavorings with inexpensive flavorings. It is often difficult to construct a complete set of meaningful orthogonal contrasts, but that should not deter you from using an incomplete set of orthogonal contrasts, or from using contrasts that are nonorthogonal.
Use contrasts that address the questions you are trying to answer.
4.4 Polynomial Contrasts
Section 3.10 introduced the idea of polynomial modeling of a response when the treatments had a quantitative dose structure. We selected a polynomial model by looking at the improvement sums of squares obtained by adding each polynomial term to the model in sequence. Each of these additional terms in the polynomial has a single degree of freedom, just like a contrast. In fact, each of these improvement sums of squares can be obtained as a contrast sum of squares. We call the contrast that gives us the sum of squares for the linear term the linear contrast, the contrast that gives us the improvement sum of squares for the quadratic term the quadratic contrast, and so on.
74 Looking for Specific Differences—Contrasts
When the doses are equally spaced and the sample sizes are equal, then the contrast coefficients for polynomial terms are fairly simple and can be found, for example, in Appendix Table D.6; these contrasts are orthogonal and have been scaled to be simple integer values. Equally spaced doses means that the gaps between successive doses are the same, as in 1, 4, 7, 10. Using these tabulated contrast coefficients, we may compute the linear, quadratic, and higher order sums of squares as contrasts without fitting a separate polynomial model. Doses such as 1, 10, 100, 1000 are equally spaced on a logarithmic scale, so we can again use the simple polynomial contrast coefficients, provided we interpret the polynomial as a polynomial in the logarithm of dose.
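As a small illustration, the sketch below checks two of the statements above. It assumes the standard integer polynomial contrast coefficients for four equally spaced doses with equal sample sizes; these values are supplied here as an assumption and are not copied from Appendix Table D.6. The helper names `dot` and `gaps` are hypothetical.

```python
import math

# Assumed standard integer coefficients for four equally spaced doses.
linear = [-3, -1, 1, 3]
quadratic = [1, -1, -1, 1]
cubic = [-1, 3, -3, 1]
contrasts = [linear, quadratic, cubic]

def dot(u, v):
    """Inner product of two coefficient vectors."""
    return sum(a * b for a, b in zip(u, v))

def gaps(doses):
    """Differences between successive doses."""
    return [b - a for a, b in zip(doses, doses[1:])]

# Each set of coefficients sums to zero, so each is a contrast.
sums_to_zero = all(sum(c) == 0 for c in contrasts)

# With equal sample sizes, orthogonality reduces to zero inner products.
orthogonal = all(
    dot(contrasts[i], contrasts[j]) == 0
    for i in range(3) for j in range(i + 1, 3)
)

doses = [1, 10, 100, 1000]
raw_gaps = gaps(doses)                               # unequal on the raw scale
log_gaps = gaps([math.log10(d) for d in doses])      # equal on the log scale

print(sums_to_zero, orthogonal, raw_gaps, log_gaps)
```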
When the doses are not equally spaced or the sample sizes are not equal, then contrasts for polynomial terms exist, but are rather complicated to derive. In this situation, it is more trouble to derive the coefficients for the polynomial contrasts than it is to fit a polynomial model.
Example 4.3 Leaflet angles
Exercise 3.5 introduced the leaflet angles of plants at 30, 45, and 60 minutes after exposure to red light. Summary information for this experiment is given here:

Delay time (min)           30      45      60
$\bar{y}_{i\bullet}$    139.6   133.6   122.4
$n_i$                       5       5       5

$MS_E = 58.13$
With three equally spaced groups, the linear and quadratic contrasts are (-1, 0, 1) and (1, -2, 1).
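The sum of squares for linear is
\[ \frac{\bigl((-1)139.6 + (0)133.6 + (1)122.4\bigr)^2}{\frac{(-1)^2}{5} + \frac{0^2}{5} + \frac{1^2}{5}} = 739.6 , \]
and that for quadratic is
\[ \frac{\bigl((1)139.6 + (-2)133.6 + (1)122.4\bigr)^2}{\frac{1^2}{5} + \frac{(-2)^2}{5} + \frac{1^2}{5}} = 22.53 . \]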
Thus the F-tests for linear and quadratic are 739.6/58.13 = 12.7 and 22.53/58.13 = .39, both with 1 and 12 degrees of freedom; there is a strong
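The arithmetic in Example 4.3 can be reproduced with a few lines of code. This is a minimal sketch using only the summary values given in the example; the helper name `contrast_ss` is hypothetical, and the error degrees of freedom are computed as the total sample size minus the number of groups.

```python
def contrast_ss(w, means, n):
    """Sum of squares for a contrast with coefficients w."""
    estimate = sum(wi * yi for wi, yi in zip(w, means))
    scale = sum(wi ** 2 / ni for wi, ni in zip(w, n))
    return estimate ** 2 / scale

means = [139.6, 133.6, 122.4]   # treatment means at 30, 45, 60 minutes
n = [5, 5, 5]                   # sample sizes
mse = 58.13                     # mean square for error

ss_linear = contrast_ss([-1, 0, 1], means, n)
ss_quadratic = contrast_ss([1, -2, 1], means, n)

# Each contrast has 1 degree of freedom, so F = SS / MSE.
f_linear = ss_linear / mse
f_quadratic = ss_quadratic / mse
df_error = sum(n) - len(n)

print(round(ss_linear, 2), round(ss_quadratic, 2))
print(round(f_linear, 2), round(f_quadratic, 2), df_error)
```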
4.5 Further Reading and Extensions
Contrasts are a special case of estimable functions, which are described in some detail in Appendix Section A.6. Treatment means and averages of treatment means are other estimable functions. Estimable functions are those features of the data that do not depend on how we choose to restrict the treatment effects.
4.6 Problems
Exercise 4.1
An experimenter randomly allocated 125 male turkeys to five treatment groups: 0 mg, 20 mg, 40 mg, 60 mg, and 80 mg of estradiol. There were 25 birds in each group, and the mean results were 2.16, 2.45, 2.91, 3.00, and 2.71 respectively. The sum of squares for experimental error was 153.4. Test the null hypothesis that the five group means are the same against the alternative that they are not all the same. Find the linear, quadratic, cubic, and quartic sums of squares (you may lump the cubic and quartic together into a “higher than quadratic” if you like). Test the null hypothesis that the quadratic effect is zero. Be sure to report a p-value.
Exercise 4.2
Use the data from Exercise 3.3. Compute a 99% confidence interval for the difference in response between the average of the three treatment groups (acid, pulp, and salt) and the control group.
Exercise 4.3
Refer to the data in Problem 3.1. Workers 1 and 2 were experienced, whereas workers 3 and 4 were novices. Find a contrast to compare the experienced and novice workers and test the null hypothesis that experienced and novice workers produce the same average shear strength.
Exercise 4.4
Consider an experiment taste-testing six types of chocolate chip cookies: 1 (brand A, chewy, expensive), 2 (brand A, crispy, expensive), 3 (brand B, chewy, inexpensive), 4 (brand B, crispy, inexpensive), 5 (brand C, chewy, expensive), and 6 (brand D, crispy, inexpensive). We will use twenty different raters randomly assigned to each type (120 total raters).
(a) Design contrasts to compare chewy with crispy, and expensive with inexpensive.
(b) Are your contrasts in part (a) orthogonal? Why or why not?
Problem 4.1
A consumer testing agency obtains four cars from each of six makes: Ford, Chevrolet, Nissan, Lincoln, Cadillac, and Mercedes. Makes 3 and 6 are imported while the others are domestic; makes 4, 5, and 6 are expensive while 1, 2, and 3 are less expensive; 1 and 4 are Ford products, while 2 and 5 are GM products. We wish to compare the six makes on their oil use per 100,000 miles driven. The mean responses by make of car were 4.6, 4.3, 4.4, 4.7, 4.8, and 6.2, and the sum of squares for error was 2.25.
(a) Compute the Analysis of Variance table for this experiment. What would you conclude?
(b) Design a set of contrasts that seem meaningful. For each contrast, outline its purpose and compute a 95% confidence interval.
Problem 4.2
Consider the data in Problem 3.2. Design a set of contrasts that seem meaningful. For each contrast, outline its purpose and test the null hypothesis that the contrast has expected value zero.
Problem 4.3
Consider the data in Problem 3.5. Use polynomial contrasts to choose a quantitative model to describe the effect of fiber proportion on the response.
Question 4.1
Show that orthogonal contrasts in the observed treatment means are un-
Chapter 5
Multiple Comparisons
When we make several related tests or interval estimates at the same time, we need to make multiple comparisons or do simultaneous inference. The issue of multiple comparisons is one of error rates. Each of the individual tests or confidence intervals has a Type I error rate $E_i$ that can be controlled by the experimenter. If we consider the tests together as a family, then we can also compute a combined Type I error rate for the family of tests or intervals. When a family contains more and more true null hypotheses, the probability that one or more of these true null hypotheses is rejected increases, and the probability of any Type I errors in the family can become quite large. Multiple comparisons procedures deal with Type I error rates for families of tests.
Example 5.1 Carcinogenic mixtures
We are considering a new cleaning solvent that is a mixture of 100 chemicals. Suppose that regulations state that a mixture is safe if all of its constituents are safe (pretending we can ignore chemical interaction). We test the 100 chemicals for causing cancer, running each test at the 5% level. This is the individual error rate that we can control.
What happens if all 100 chemicals are harmless and safe? Because we are testing at the 5% level, we expect 5% of the nulls to be rejected even when all the nulls are true. Thus, on average, 5 of the 100 chemicals will be declared to be carcinogenic, even when all are safe. Moreover, if the tests are independent, then one or more of the chemicals will be declared unsafe in 99.4% of all sets of experiments we run, even if all the chemicals are safe. This 99.4% is a combined Type I error rate; clearly we have a problem.
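The two numbers quoted in Example 5.1 follow from short calculations. The sketch below assumes, as the example does, that all 100 null hypotheses are true and that the tests are independent; the variable names are illustrative only.

```python
K = 100        # number of chemicals tested
alpha = 0.05   # level of each individual test

# Expected number of chemicals falsely declared carcinogenic.
expected_false_rejections = K * alpha

# Probability that at least one of K independent tests rejects
# when every null hypothesis is true.
p_at_least_one = 1 - (1 - alpha) ** K

print(expected_false_rejections, round(p_at_least_one, 3))
```

The second quantity grows quickly with the size of the family: the same calculation with `K = 10` gives roughly .40, which is already far above the 5% level of any single test.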
5.1 Error Rates
When we have more than one test or interval to consider, there are several ways to define a combined Type I error rate for the family of tests. This variety of combined Type I error rates is the source of much confusion in the use of multiple comparisons, as different error rates lead to different procedures. People sometimes ask “Which procedure should I use?” when the real question is “Which error rate do I want to control?”. As data analyst, you need to decide which error rate is appropriate for your situation and then choose a method of analysis appropriate for that error rate. This choice of error rate is not so much a statistical decision as a scientific decision in the particular area under consideration.
Data snooping is a practice related to having many tests. Data snooping occurs when we first look over the data and then choose the null hypotheses to be tested based on “interesting” features in the data. What we tend to do is consider many potential features of the data and discard those with uninteresting or null behavior. When we data snoop and then perform a test, we tend to see the smallest p-value from the ill-defined family of tests that we considered when we were snooping; we have not really performed just one test. Some multiple comparisons procedures can actually control for data snooping.
Simultaneous inference is deciding which error rate we wish to control, and
then using a procedure that controls the desired error rate.
Let’s set up some notation for our problem. We have a set of $K$ null hypotheses $H_{01}, H_{02}, \ldots, H_{0K}$. We also have the “combined,” “overall,” or “intersection” null hypothesis $H_0$, which is true if all of the $H_{0i}$ are true. In formula,
\[ H_0 = H_{01} \cap H_{02} \cap \cdots \cap H_{0K} . \]
The collection $H_{01}, H_{02}, \ldots, H_{0K}$ is sometimes called a family of null hypotheses. We reject $H_0$ if any of the null hypotheses $H_{0i}$ is rejected. In Example 5.1, $K = 100$, $H_{0i}$ is the null hypothesis that chemical $i$ is safe, and $H_0$ is the null hypothesis that all chemicals are safe so that the mixture is safe.
We now define five combined Type I error rates. The definitions of these error rates depend on numbers or fractions of falsely rejected null hypotheses $H_{0i}$, which will never be known in practice. We set up the error rates here and later give procedures that can be shown mathematically to control the error rates.
The per comparison error rate or comparisonwise error rate is the probability of rejecting a particular $H_{0i}$ in a single test when that $H_{0i}$ is true. Controlling the per comparison error rate at $E$ means that the expected fraction of individual tests that reject $H_{0i}$ when $H_0$ is true is $E$. This is just the usual error rate for a t-test or F-test; it makes no correction for multiple comparisons. The tests in Example 5.1 controlled the per comparison error rate at 5%.
The per experiment error rate or experimentwise error rate or familywise error rate is the probability of rejecting one or more of the $H_{0i}$ (and thus rejecting $H_0$) in a series of tests when all of the $H_{0i}$ are true. Controlling the experimentwise error rate at $E$ means that the expected fraction of experiments in which we would reject one or more of the $H_{0i}$ when $H_0$ is true is $E$. In Example 5.1, the per experiment error rate is the fraction of times we would declare one or more of the chemicals unsafe when in fact all were safe. Controlling the experimentwise error rate at $E$ necessarily controls the comparisonwise error rate at no more than $E$. The experimentwise error rate considers all individual null hypotheses that were rejected; if any one of them was correctly rejected, then there is no penalty for any false rejections that may have occurred.
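The difference between the comparisonwise and experimentwise rates can be seen in a small simulation. The following is a rough sketch under stated assumptions: all $K$ null hypotheses are true, the tests are independent, and each p-value is uniform on (0, 1), which holds for a continuous test statistic under a true null. The function name `simulate_error_rates` and its parameters are hypothetical.

```python
import random

def simulate_error_rates(K=100, E=0.05, experiments=2000, seed=1):
    """Estimate comparisonwise and experimentwise error rates by simulation."""
    rng = random.Random(seed)
    total_rejections = 0
    experiments_with_rejection = 0
    for _ in range(experiments):
        # Each test rejects when its (uniform) p-value falls below E.
        rejections = sum(rng.random() < E for _ in range(K))
        total_rejections += rejections
        if rejections > 0:
            experiments_with_rejection += 1
    # Fraction of individual tests that falsely reject.
    comparisonwise = total_rejections / (K * experiments)
    # Fraction of experiments with one or more false rejections.
    experimentwise = experiments_with_rejection / experiments
    return comparisonwise, experimentwise

comparisonwise, experimentwise = simulate_error_rates()
print(comparisonwise, experimentwise)
```

With the settings of Example 5.1, the estimated comparisonwise rate should land near .05 while the estimated experimentwise rate should land near .994, up to simulation error.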
A statistical discovery is the rejection of an $H_{0i}$. The false discovery fraction is 0 if there are no rejections; otherwise it is the number of false discoveries (Type I errors) divided by the total number of discoveries. The false discovery rate (FDR) is the expected value of the false discovery fraction. If $H_0$ is true, then all discoveries are false and the FDR is just the experimentwise error rate. Thus controlling the FDR at $E$ also controls the experimentwise error at $E$. However, the FDR also controls at $E$ the average fraction of rejections that are Type I errors when some $H_{0i}$ are true and some are false, a control that the experimentwise error rate does not provide. With the FDR, we are allowed more incorrect rejections as the number of true rejections increases, but the ratio is limited. For example, with FDR at .05, we are allowed just one incorrect rejection with 19 correct rejections.
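The definition of the false discovery fraction translates directly into code. This is a minimal sketch; the function name `false_discovery_fraction` and the two boolean lists are hypothetical, and in practice the list recording which nulls are true is never known. Note that the function returns the fraction for a single set of tests, whereas the FDR is the expected value of this fraction over repeated experiments.

```python
def false_discovery_fraction(null_is_true, rejected):
    """False discoveries divided by total discoveries (0 if no discoveries)."""
    discoveries = sum(rejected)
    if discoveries == 0:
        return 0.0
    false_discoveries = sum(
        1 for is_true, was_rejected in zip(null_is_true, rejected)
        if is_true and was_rejected
    )
    return false_discoveries / discoveries

# 19 correct rejections (null false) plus 1 incorrect rejection (null true).
null_is_true = [False] * 19 + [True]
rejected = [True] * 20
fdf = false_discovery_fraction(null_is_true, rejected)

# No rejections at all: the fraction is defined to be 0.
fdf_none = false_discovery_fraction([True, True], [False, False])

print(fdf, fdf_none)
```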
The strong familywise error rate is the probability of making any false discoveries, that is, the probability that the false discovery fraction is greater than zero. Controlling the strong familywise error rate at $E$ means that the probability of making any false rejections is $E$ or less, regardless of how many correct rejections are made. Thus one true rejection cannot make any false rejections more likely. Controlling the strong familywise error rate at $E$ controls the FDR at no more than $E$. In Example 5.1, a strong familywise error rate of $E$ would imply that in a situation where 2 of the chemicals were carcinogenic, the probability of declaring one of the other 98 to be carcinogenic would be no more than $E$.
Finally, suppose that each null hypothesis relates to some parameter (for example, a mean), and we put confidence intervals on all these parameters. An error occurs when one of our confidence intervals fails to cover the true parameter value. If this true parameter value is also the null hypothesis value, then an error is a false rejection. The simultaneous confidence intervals criterion states that all of our confidence intervals must cover their true parameters simultaneously with confidence $1 - E$. Simultaneous $1 - E$ confidence intervals also control the strong familywise error rate at no more than $E$. (In effect, the strong familywise criterion only requires simultaneous intervals for the null parameters.) In Example 5.1, we could construct simultaneous confidence intervals for the cancer rates of each of the 100 chemicals. Note that a single confidence interval in a collection of intervals with simultaneous coverage $1 - E$ will have coverage greater than $1 - E$.
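To get a feel for how much greater the individual coverage must be, consider the special case of independent intervals. This is an assumption made here for illustration only; the text does not require independence. If each of $K$ independent intervals has coverage $c$, the chance that all of them cover is $c^K$, so simultaneous coverage $1 - E$ needs $c = (1 - E)^{1/K}$. The function name `individual_coverage` is hypothetical.

```python
def individual_coverage(simultaneous_coverage, K):
    """Per-interval coverage needed for K independent intervals (assumed)."""
    return simultaneous_coverage ** (1.0 / K)

# 100 intervals, as for the 100 chemicals in Example 5.1,
# with simultaneous coverage .95.
c = individual_coverage(0.95, 100)

print(round(c, 6))
```

Under this independence assumption each interval needs coverage of about .9995, so each interval must be considerably wider than an ordinary 95% interval.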
There is a trade-off between Type I error and Type II error (failing to reject a null when it is false). As we go to more and more stringent Type I error rates, we become more confident in the rejections that we do make, but it also becomes more difficult to make rejections. Thus, when using the more stringent Type I error controls, we are more likely to fail to reject some null hypotheses that should be rejected than when using the less stringent rates. In simultaneous inference, controlling stronger error rates leads to less powerful tests.
Example 5.2 Functional magnetic resonance imaging
Many functional Magnetic Resonance Imaging (fMRI) studies are interested in determining which areas of the brain are “activated” when a subject is engaged in some task. Any one image slice of the brain may contain 5000 voxels (individual locations to be studied), and one analysis method produces a t-test for each of the 5000 voxels. Null hypothesis $H_{0i}$ is that voxel $i$ is not activated. Which error rate should we use?
If we are studying a small, narrowly defined brain region and are unconcerned with other brain regions, then we would want to test individually the voxels in the brain regions of interest. The fact that there are 4999 other voxels is unimportant, so we would use a per comparison method.
Suppose instead that we are interested in determining if there are any activations in the image. We recognize that by making many tests we are likely to find one that is “significant”, even when all nulls are true; we want to protect ourselves against that possibility, but otherwise need no stronger control. Here we would use a per experiment error rate.
Suppose that we believe that there will be many activations, so that $H_0$ is not true. We don’t want some correct discoveries to open the flood gates for many false discoveries, but we are willing to live with some false discoveries