overparameterization and estimable functions

Consider a data set consisting of p groups or treatments (i = 1 to p) and n replicates (j = 1 to n) within each group (Figure 8.4).

The linear effects model is: $y_{ij} = \mu + \alpha_i + \varepsilon_{ij}$

In this model:

$y_{ij}$ is the jth replicate observation of the response variable from the ith group of factor A;

$\mu$ is the overall population mean of the response variable (also termed the constant because it is constant for all observations);

if the factor is fixed, $\alpha_i$ is the effect of the ith group (the difference between each group mean and the overall mean, $\mu_i - \mu$);

if the factor is random, $\alpha_i$ represents a random variable with a mean of zero and a variance of $\sigma_\alpha^2$, measuring the variance in mean values of the response variable across all the possible levels of the factor that could have been used;

$\varepsilon_{ij}$ is random or unexplained error associated with the jth replicate observation from the ith group. These error terms are assumed to be normally distributed at each factor level, with a mean of zero ($E(\varepsilon_{ij}) = 0$) and a variance of $\sigma_\varepsilon^2$.

This model is structurally similar to the simple linear regression model described in Chapter 5. The overall mean replaces the intercept as the constant and the treatment or group effect replaces the slope as a measure of the effect of the predictor variable on the response variable. Like the regression model, model 8.1 has two components: the model ($\mu + \alpha_i$) and the error ($\varepsilon_{ij}$).

We can fit a linear model to data where the predictor variable is categorical in a form that is basically a multiple linear regression model with an intercept. The factor levels (groups) are converted to dummy variables (Chapter 6) and a multiple regression model is fitted of the form:

$y_{ij} = \mu + \beta_1(\text{dummy}_1)_{ij} + \beta_2(\text{dummy}_2)_{ij} + \beta_3(\text{dummy}_3)_{ij} + \dots + \beta_{p-1}(\text{dummy}_{p-1})_{ij} + \varepsilon_{ij}$

Fitting this type of model is sometimes called dummy coding in statistical software. The basic results from estimation and hypothesis testing will be the same as when fitting the usual ANOVA models (effects or means models), except that estimates of group effects will often be coded to compare with a reference category, so only p − 1 effects will be presented in output from statistical software. You should always check which category your preferred software uses as its reference group when fitting a model of this type.
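As a rough illustration of dummy coding, the sketch below builds the dummy variables for a small made-up data set and checks the closed-form OLS solution against the normal equations. The data, the group labels and the choice of "A" as reference category are all assumptions for illustration; real software may pick a different reference group.

```python
from statistics import mean

# Hypothetical data: p = 3 groups, n = 4 replicates each (not from the source).
groups = {
    "A": [3.0, 4.0, 5.0, 4.0],
    "B": [8.0, 9.0, 10.0, 9.0],
    "C": [4.0, 5.0, 6.0, 5.0],
}
reference = "A"  # assumed reference category

# Every level except the reference gets its own dummy variable (p - 1 in total).
labels = [g for g in groups if g != reference]

# Design matrix rows are (intercept, dummy_B, dummy_C); y holds the responses.
X, y = [], []
for g, values in groups.items():
    for v in values:
        X.append([1.0] + [1.0 if g == lab else 0.0 for lab in labels])
        y.append(v)

# With this coding the OLS solution has a closed form: the intercept is the
# reference group mean and each slope is (group mean - reference mean).
group_means = {g: mean(v) for g, v in groups.items()}
coef = [group_means[reference]] + [
    group_means[lab] - group_means[reference] for lab in labels
]

# Check the normal equations: residuals are orthogonal to every column of X.
residuals = [yi - sum(b * x for b, x in zip(coef, row)) for row, yi in zip(X, y)]
normal_eq = [sum(row[k] * r for row, r in zip(X, residuals)) for k in range(len(coef))]

print(coef)  # [4.0, 5.0, 1.0]
print(normal_eq)
```

Only p − 1 = 2 slopes appear, each a difference from the reference group, which is why the reported "effects" depend on which category the software treats as the reference.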

The linear effects model is what statisticians call “overparameterized” (Searle 1993) because the number of group means (p) is less than the number of parameters to be estimated ($\mu, \alpha_1, \dots, \alpha_p$). Not all parameters in the effects model can be estimated by OLS unless we impose some constraints because there is no unique solution to the set of normal equations (Searle 1993). The usual constraint, sometimes called a sum-to-zero constraint (Yandell 1997), a $\Sigma$-restriction (Searle 1993), or a side condition (Maxwell & Delaney 1990), is that the sum of the group effects equals zero, i.e. $\sum_{i=1}^{p} \alpha_i = 0$. This constraint is not particularly problematical for single factor designs, although similar constraints for some multifactor designs are controversial (Chapter 9). The sum-to-zero constraint is not the only way of allowing estimation of the overall mean and each of the $\alpha_i$. We can also set one of the parameters, either $\mu$ or one of the $\alpha_i$, to zero (set-to-zero constraint; Yandell 1997), although this approach is only really useful when one group is clearly a control or reference group (see also effects coding for linear models in Chapter 5).
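The two constraints can be compared on a small example. This is a minimal sketch with invented data and equal sample sizes; the group labels and the choice of "A" as reference group are assumptions, not taken from the text.

```python
from statistics import mean

# Hypothetical data: p = 3 groups, n = 4 replicates each (not from the source).
groups = {
    "A": [3.0, 4.0, 5.0, 4.0],
    "B": [8.0, 9.0, 10.0, 9.0],
    "C": [4.0, 5.0, 6.0, 5.0],
}
group_means = {g: mean(v) for g, v in groups.items()}  # A: 4.0, B: 9.0, C: 5.0

# Sum-to-zero constraint: the effects must sum to zero, so (with equal n)
# the overall mean is estimated by the mean of the group means.
mu_sum = mean(list(group_means.values()))
alpha_sum = {g: m - mu_sum for g, m in group_means.items()}

# Set-to-zero constraint: the effect of a reference group (here "A", an
# arbitrary choice) is fixed at zero, so the constant is that group's mean.
mu_ref = group_means["A"]
alpha_ref = {g: m - mu_ref for g, m in group_means.items()}

print(mu_sum, alpha_sum)  # 6.0 {'A': -2.0, 'B': 3.0, 'C': -1.0}
print(mu_ref, alpha_ref)  # 4.0 {'A': 0.0, 'B': 5.0, 'C': 1.0}
```

The constant and the individual effects differ between the two constraints, but in both cases the constant plus a group's effect gives back that group's sample mean.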

An alternative single factor ANOVA model is the cell means model. It simply replaces $\mu + \alpha_i$ with $\mu_i$ and therefore uses group means instead of group effects (differences between group means and overall mean) for the model component:

$y_{ij} = \mu_i + \varepsilon_{ij}$

The cell means model is no longer overparameterized because the number of parameters in the model component is the same as the number of group means. While fitting such a model makes little difference in the single factor case, and the basic ANOVA table and hypothesis tests will not change, the cell means model has some advantages in more complex designs with unequal sample sizes or completely missing cells (Milliken & Johnson 1984, Searle 1993; Chapter 9).

Some linear models statisticians (Hocking 1996, Searle 1993) regard the sum-to-zero constraint as an unnecessary complication that limits the practical and pedagogical use of the effects model and can cause much confusion in multifactor designs (Nelder & Lane 1995). The alternative approach is to focus on parameters or functions of parameters that are estimable. Estimable functions are “those functions of parameters which do not depend on the particular solution of the normal equations” (Yandell 1997, p. 111). Although all of the $\alpha_i$ are not estimable (at least, not without constraints), ($\mu + \alpha_i$) is estimable for each group. If we equate the effects model with the cell means model:

$y_{ij} = \mu + \alpha_i + \varepsilon_{ij} = \mu_i + \varepsilon_{ij}$

we can see that each estimable function ($\mu + \alpha_i$) is equivalent to the appropriate cell mean ($\mu_i$), hence the emphasis that many statisticians place on the cell means model. In practice, it makes no difference for hypothesis testing whether we fit the cell means or effects model. The F-ratio statistic for testing the $H_0$ that $\mu_1 = \mu_2 = \dots = \mu_i = \dots = \mu$ is identical to that for testing the $H_0$ that all $\alpha_i$ equal zero.
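The idea of an estimable function can be made concrete with a short sketch. It uses invented data and three arbitrary values for the constant; each gives a different solution of the normal equations, yet the constant plus each group effect, and the residual sum of squares, do not change. The helper names `solution` and `residual_ss` are hypothetical.

```python
from statistics import mean

# Hypothetical data: p = 3 groups, n = 4 replicates each (not from the source).
groups = {
    "A": [3.0, 4.0, 5.0, 4.0],
    "B": [8.0, 9.0, 10.0, 9.0],
    "C": [4.0, 5.0, 6.0, 5.0],
}
group_means = {g: mean(v) for g, v in groups.items()}


def solution(mu):
    """Group effects that solve the normal equations for a chosen constant mu."""
    return {g: m - mu for g, m in group_means.items()}


def residual_ss(mu, alpha):
    """Residual sum of squares of the effects model for the given parameters."""
    return sum((y - (mu + alpha[g])) ** 2 for g, v in groups.items() for y in v)


results = []
for mu in (0.0, 6.0, 100.0):  # arbitrary choices of the constant
    alpha = solution(mu)
    estimable = {g: mu + a for g, a in alpha.items()}  # mu + alpha_i
    results.append((alpha, estimable, residual_ss(mu, alpha)))
    print(mu, alpha, estimable, results[-1][2])
```

The individual effects change with every choice of the constant, whereas the sums reproduce the cell means each time, which is the sense in which they "do not depend on the particular solution".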

We prefer the effects model for most analyses of experimental designs because, given the sum-to-zero constraints, it allows estimation of the effects of factors and their interactions (Chapter 9), allows combinations of continuous and categorical variables (e.g. analyses of covariance, Chapter 12) and is similar in structure to the multiple linear regression model. The basic features of the effects model for a single factor ANOVA are similar to those described for the linear regression model in Chapter 5. In particular, we must make certain assumptions about the error terms ($\varepsilon_{ij}$) from the model and these assumptions equally apply to the response variable.

1. For each group (factor level, i) used in the design, there is a population of Y-values ($y_{ij}$) and error terms ($\varepsilon_{ij}$) with a probability distribution. For interval estimation and hypothesis testing, we assume that the population of $y_{ij}$ and therefore $\varepsilon_{ij}$ at each factor level (i) has a normal distribution.

2. These populations of $y_{ij}$ and therefore $\varepsilon_{ij}$ at each factor level are assumed to have the same variance ($\sigma_\varepsilon^2$, sometimes simplified to $\sigma^2$ when there is no ambiguity). This is termed the homogeneity of variance assumption and can be formally expressed as $\sigma_1^2 = \sigma_2^2 = \dots = \sigma_i^2 = \dots = \sigma_\varepsilon^2$.

3. The $y_{ij}$ and the $\varepsilon_{ij}$ are independent of, and therefore uncorrelated with, each other within each factor level and across factor levels if the factor is fixed or, if the factor is random, once the factor levels have been chosen (Neter et al. 1996).

These assumptions and their implications are examined in more detail in Section 8.3.

There are three parameters to be estimated when fitting model 8.1: $\mu$, $\alpha_i$ and $\sigma_\varepsilon^2$, the latter being the variance of the error terms, assumed to be constant across factor levels. Estimation of these parameters can be based on either OLS or ML and when certain assumptions hold (see Section 8.3), the estimates for $\mu$ and $\alpha_i$ are the same whereas the ML estimate of $\sigma_\varepsilon^2$ is slightly biased (see also Chapter 2). We will focus on OLS estimation, although ML is important for estimation of some parameters when sample sizes differ between groups (Section 8.2).
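The difference between the two estimates of the error variance can be sketched numerically. The example assumes the standard results for a normal linear model (residual sum of squares divided by N − p for the unbiased OLS-based estimate, and by N for the ML estimate) and uses invented data.

```python
from statistics import mean

# Hypothetical data: p = 3 groups, n = 4 replicates each (not from the source).
groups = {
    "A": [3.0, 4.0, 5.0, 4.0],
    "B": [8.0, 9.0, 10.0, 9.0],
    "C": [4.0, 5.0, 6.0, 5.0],
}
N = sum(len(v) for v in groups.values())  # total number of observations
p = len(groups)                           # number of groups

# Residual sum of squares: squared deviations of observations from their group mean.
ss_residual = sum((y - mean(v)) ** 2 for v in groups.values() for y in v)

var_ols = ss_residual / (N - p)  # unbiased estimate (the residual mean square)
var_ml = ss_residual / N         # ML estimate, smaller by a factor (N - p) / N

print(ss_residual, round(var_ols, 4), var_ml)  # 6.0 0.6667 0.5
```

The ML estimate is always the smaller of the two, and the gap shrinks as the total sample size grows relative to the number of groups.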

The OLS estimates of $\mu$, $\mu_i$ and $\alpha_i$ are presented in Table 8.1. Note the estimate of $\alpha_i$ is simply the difference between the estimates of $\mu_i$ and $\mu$. Therefore, the predicted or fitted values of the response variable from our model are:

$\hat{y}_{ij} = \bar{y} + (\bar{y}_i - \bar{y}) = \bar{y}_i$

So any predicted Y-value is simply predicted by the sample mean for that factor level.
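A short sketch of these estimates and fitted values follows. It uses invented data with equal sample sizes, so the grand mean of all observations serves as the estimate of the overall mean; the variable names are hypothetical.

```python
from statistics import mean

# Hypothetical data: p = 3 groups, n = 4 replicates each (not from the source).
groups = {
    "A": [3.0, 4.0, 5.0, 4.0],
    "B": [8.0, 9.0, 10.0, 9.0],
    "C": [4.0, 5.0, 6.0, 5.0],
}

all_values = [y for v in groups.values() for y in v]
grand_mean = mean(all_values)                                  # estimate of the overall mean
group_means = {g: mean(v) for g, v in groups.items()}          # estimates of the group means
effects = {g: m - grand_mean for g, m in group_means.items()}  # estimates of the group effects

# Fitted value for every observation in a group: grand mean + group effect.
fitted = {g: grand_mean + effects[g] for g in groups}
residuals = {g: [y - fitted[g] for y in v] for g, v in groups.items()}

print(effects)  # {'A': -2.0, 'B': 3.0, 'C': -1.0}
print(fitted)   # {'A': 4.0, 'B': 9.0, 'C': 5.0}
```

The fitted values coincide with the group sample means, and the residuals within each group sum to zero.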

In practice, we tend not to worry too much about the estimates of $\mu$ and $\alpha_i$ because we usually focus on estimates of group means and of differences or contrasts between group means for fixed factors (Section 8.6) and components of variance for random factors (Section 8.2). Standard errors for these group means are:

$s_{\bar{y}_i} = \sqrt{MS_{\text{Residual}} / n_i}$

where $n_i$ is the number of replicates in group i.
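As a hedged illustration, the standard error of each group mean can be computed as below. The sketch assumes the usual form of this standard error, the square root of the residual mean square divided by the group's sample size, and uses invented data.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: p = 3 groups, n = 4 replicates each (not from the source).
groups = {
    "A": [3.0, 4.0, 5.0, 4.0],
    "B": [8.0, 9.0, 10.0, 9.0],
    "C": [4.0, 5.0, 6.0, 5.0],
}
N = sum(len(v) for v in groups.values())
p = len(groups)

# Residual mean square: pooled within-group variation on N - p degrees of freedom.
ss_residual = sum((y - mean(v)) ** 2 for v in groups.values() for y in v)
ms_residual = ss_residual / (N - p)

# Standard error of each group mean uses the pooled residual mean square
# and that group's own sample size.
se_means = {g: sqrt(ms_residual / len(v)) for g, v in groups.items()}

print({g: round(se, 4) for g, se in se_means.items()})
# {'A': 0.4082, 'B': 0.4082, 'C': 0.4082}
```

With equal sample sizes every group mean has the same standard error, because the variance estimate is pooled across groups.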

In document Experimental Design and Data Analysis for Biologists - Quinn & Keough - Cambridge 2002 (Page 198-200)