

8.5. Log-Linear and Logit Models

The development of statistical methods for categorical data has long lagged behind the development of techniques for continuous data. When faced with multivariate data sets consisting of continuous data, researchers could choose from a variety of tools, including regression, principal components and factor analysis, discriminant analysis, cluster analysis, and canonical correlation. When faced with multivariate categorical data, most researchers could do little but collapse over all but two variables and use the usual test of independence on these remaining two variables.

The end result would be a set of tests of independence for all pairs of variables.

This methodology is inadequate for many reasons.

Most important is the problem that the overall relationship between two variables, ignoring (i.e., collapsing over) other variables, can be very different from the relationship between those two variables at each level of other variables (i.e., conditional on the others). By now, most researchers have seen examples of this in the form of Simpson’s paradox. For instance, in a random sample of people, there is a strong relationship between whether they get medical treatment and whether they die: Those getting medical treatment are more likely to die. Of course, we have collapsed over an important variable: Were these people seriously ill? If we look at those who are seriously ill, the relationship is the reverse of the overall relationship: Those who are treated are less likely to die.
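The reversal described above can be reproduced with a small numerical sketch. The counts below are hypothetical, invented only to illustrate the pattern; they do not come from any study cited in this chapter.

```python
# Hypothetical counts, invented for illustration only (not from the text).
# Each entry maps (severity, treatment) -> (deaths, total people).
counts = {
    ("seriously ill", "treated"): (300, 1000),
    ("seriously ill", "untreated"): (50, 100),
    ("not seriously ill", "treated"): (1, 100),
    ("not seriously ill", "untreated"): (20, 1000),
}


def death_rate(treatment, severity=None):
    """Death rate for a treatment group, optionally within one severity level.

    With severity=None the table is collapsed over severity.
    """
    deaths = total = 0
    for (sev, trt), (d, n) in counts.items():
        if trt == treatment and (severity is None or sev == severity):
            deaths += d
            total += n
    return deaths / total


# Collapsed over severity: treated people die at a higher rate.
print("collapsed:", death_rate("treated"), death_rate("untreated"))

# Conditional on severity: treated people die at a lower rate in each group.
for sev in ("seriously ill", "not seriously ill"):
    print(sev, death_rate("treated", sev), death_rate("untreated", sev))
```

In these made-up data the reversal arises because most treated people are seriously ill, so the collapsed comparison mostly reflects severity rather than treatment.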

8.5.1. Log-Linear Models

To deal with the problem of analyzing multivariate categorical data, we needed new approaches; the solution came with the development of log-linear models.

From one viewpoint, there is a strong analogy between log-linear models and analysis of variance (ANOVA). The main emphasis in ANOVA is testing hypotheses about main effects and interactions. The same is true of log-linear models, but the dependent variable for log-linear models is the logarithm of the cell frequency.

For example, consider a situation with three categorical variables A, B, and C, with levels denoted by subscripts i, j, and k, respectively. A log-linear model with only main effects would be represented as

$$\ln(F_{ijk}) = \mu + a_i + b_j + c_k,$$

where $F_{ijk}$ is the expected cell frequency for A = i, B = j, and C = k, and ln(.) means the natural logarithm. Except for the logarithm, the form is identical to an ANOVA model. Because this model contains no interaction terms, which would allow relationships among variables, this is the model for complete independence among the three variables. The model is usually specified by notation such as [A] [B] [C], {A, B, C}, or simply A, B, C to indicate which terms are included.
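For the complete-independence model, the fitted frequencies are the product of the three one-way marginal totals divided by the squared sample size; taking logarithms of that product gives the additive form shown above. The sketch below is one way to compute them. The function name and the small table are illustrative assumptions, not part of the original text.

```python
from itertools import product


def independence_expected(observed):
    """Expected frequencies under [A][B][C] for a table observed[i][j][k].

    Uses E_ijk = (n_i.. * n_.j. * n_..k) / N**2, the closed-form fit
    for the complete-independence model.
    """
    n_a, n_b, n_c = len(observed), len(observed[0]), len(observed[0][0])
    cells = list(product(range(n_a), range(n_b), range(n_c)))
    total = sum(observed[i][j][k] for i, j, k in cells)
    # One-way marginal totals for each variable.
    a_tot = [sum(observed[i][j][k] for j in range(n_b) for k in range(n_c))
             for i in range(n_a)]
    b_tot = [sum(observed[i][j][k] for i in range(n_a) for k in range(n_c))
             for j in range(n_b)]
    c_tot = [sum(observed[i][j][k] for i in range(n_a) for j in range(n_b))
             for k in range(n_c)]
    return [[[a_tot[i] * b_tot[j] * c_tot[k] / total ** 2
              for k in range(n_c)]
             for j in range(n_b)]
            for i in range(n_a)]


# Hypothetical 2 x 2 x 2 table constructed so that independence holds exactly.
table = [[[2, 3], [6, 9]],
         [[4, 6], [12, 18]]]
expected = independence_expected(table)
print(expected)
```

Because the hypothetical table was built as a product of marginal weights, the expected frequencies should reproduce it; with real data they generally will not.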

Just as with the model for independence in two-way tables, expected frequencies can be calculated for this model. These can then be used to assess whether the model is consistent with the data by comparing the expected with the observed frequencies. This can be done using the usual Pearson goodness-of-fit statistic,

$$X^2 = \sum_t \frac{(O_t - E_t)^2}{E_t},$$

where t has been used to index the cells of the cross-tabulated data, O represents the observed frequency, and E represents the expected frequency in a cell. The symbol $\sum_t$ means to sum over all the cells of the table. (The use of a single subscript t makes it possible to use this formula to easily represent tables of any dimension and also data sets that are not rectangular.) Conceptually, for each cell of the table, a number is calculated that measures how close the observed cell frequency is to the value that would be expected if the model were true. These numbers are then summed to give $X^2$. If the model is true, we would anticipate the value of $X^2$ to be small, but if the model is not true, we would anticipate a large value of $X^2$.

As discussed in the section on partitioning chi-square, an alternative fit statistic is the likelihood-ratio statistic,

$$G^2 = 2 \sum_t O_t \ln(O_t / E_t).$$

Although it is not so obvious why this is a reasonable measure of fit of a model to the data, notice what would happen if the model fit the data perfectly: Each observed frequency would equal the expected frequency, so $O_t/E_t$ would equal 1 for each cell. Because the logarithm of 1 is 0, the value of $G^2$ would be zero, indicating perfect fit.
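Both fit statistics are simple sums over cells, so they can be computed from two flat lists of observed and expected frequencies. The following is a minimal sketch; the function names and the two-cell example are assumptions made for illustration.

```python
import math


def pearson_x2(observed, expected):
    """Pearson statistic: sum over cells of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def likelihood_ratio_g2(observed, expected):
    """Likelihood-ratio statistic: 2 * sum over cells of O * ln(O / E).

    Cells with O = 0 are skipped, since O * ln(O) tends to 0 as O -> 0.
    """
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)


# Perfect fit: both statistics are zero.
print(pearson_x2([10, 20], [10, 20]), likelihood_ratio_g2([10, 20], [10, 20]))

# Hypothetical two-cell example with an imperfect fit.
print(pearson_x2([10, 20], [15, 15]), likelihood_ratio_g2([10, 20], [15, 15]))
```

A single cell index is used, as in the formulas, so the same functions apply to a table of any dimension once its cells are listed in a consistent order.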

How large a value of $X^2$ or $G^2$ is necessary to reject a model as being inadequate to account for the observed pattern of frequencies? As with any statistic that follows a chi-square distribution, the number of degrees of freedom must be counted to find the critical value in a table. The total number of degrees of freedom in the data is the number of cells in the cross-tabulated table. The number of parameters in the model is subtracted from this total to give the number of degrees of freedom for the goodness-of-fit statistic.

Finding the number of parameters in the model is easy, because it is the same as in ANOVA models. There is 1 degree of freedom for the constant (intercept). For each main effect, the number of degrees of freedom is 1 less than the number of levels of that variable. For interactions (discussed further below), multiply the degrees of freedom for each variable involved in the interaction.

For example, consider a table with two variables, and suppose that one variable has three levels, the other four. The table thus has 3 × 4 = 12 cells.

The log-linear model corresponding to the usual test of independence would have an intercept, 3 − 1 = 2 parameters for one main effect, and 4 − 1 = 3 parameters for the other main effect. In all, six parameters are estimated, so the goodness-of-fit test has 12 − 6 = 6 degrees of freedom. (Notice that the usual rule for testing independence would also give (2)(3) = 6 degrees of freedom.)

As another example, consider the independence model for the three-way table described above, where there are three, four, and five levels of variables A, B, and C, respectively. Then there would be 1 + 2 + 3 + 4 = 10 parameters in the model, and 3 × 4 × 5 = 60 cells in the table. The goodness-of-fit test would have 60 − 10 = 50 degrees of freedom. (Notice that trying to extend the usual rule would fail here: (2)(3)(4) = 24, which is incorrect.)
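The counting rule can be written as a short function. This sketch assumes single-letter variable names and assumes that each listed term brings with it all lower order terms among the same variables, which is the usual convention for hierarchical models; the function names are illustrative.

```python
from itertools import combinations
from math import prod


def count_parameters(levels, generators):
    """Number of parameters in a hierarchical log-linear model.

    levels     -- dict mapping variable name to its number of levels
    generators -- list of strings such as ["A", "B"] or ["BC", "BU"]
    """
    terms = set()
    for gen in generators:
        # Every non-empty subset of a generator is a term in the model.
        for size in range(1, len(gen) + 1):
            terms.update(combinations(sorted(gen), size))
    # 1 for the intercept, plus the product of (levels - 1) for each term.
    return 1 + sum(prod(levels[v] - 1 for v in term) for term in terms)


def fit_df(levels, generators):
    """Degrees of freedom for the fit test: cells minus parameters."""
    return prod(levels.values()) - count_parameters(levels, generators)


# The two examples worked out in the text.
print(fit_df({"A": 3, "B": 4}, ["A", "B"]))
print(fit_df({"A": 3, "B": 4, "C": 5}, ["A", "B", "C"]))
```

For the two examples in the text the function gives 6 and 50 degrees of freedom, matching the hand counts.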

Of course, models of complete independence are not only too simple to explain most multivariate data, but researchers would be devastated if they did fit; after all, no one examines variables because they think that all of them will be unrelated to each other. Instead, we expect that there will be relationships, and we want to find the simplest model that accounts for these relationships.

To do this, we start adding to the model what would be called interactions in the context of ANOVA.

To illustrate the general procedure, we will reanalyze a famous data set on ulcer and blood type, originally reported in Woolf (1955) and reproduced in Table 8.3. This data set has three variables: city (London, Manchester, and Newcastle), blood type (only O and A are included here), and ulcer (whether or not the person has an ulcer). The table has 3 × 2 × 2 = 12 cells. Table 8.4 contains the fit of several log-linear models for this data set. A common abbreviated notation is used: If an interaction is listed, then all lower order interactions and main effects of those variables are also in the model. For example, if an AB term is in the model (indicating that an A × B interaction is included), then A and B main effects are also assumed to be present. This is called the hierarchy principle; most applications of log-linear models follow this principle.

Table 8.3 Relationship Between Ulcer and Blood Type

                                Ulcer?
City          Blood Type      Yes       No      % Ulcer
London        O               911     4,578      16.6
              A               579     4,219      12.1
Manchester    O               361     4,532       7.4
              A               246     3,775       6.1
Newcastle     O               396     6,598       5.7
              A               219     5,261       4.0

Table 8.4 Fit of Log-Linear Models to Ulcer and Blood Type Data

Model            G2       df      p
U, B, C        754.47      7    .000
BU, C          700.97      6    .000
CU, B           83.59      5    .000
BC, U          737.74      5    .000
BU, BC         684.25      4    .000
BU, CU          30.10      4    .000
BC, CU          66.87      3    .000
BC, BU, CU       2.96      2    .227
BCU              0.0       0    1

NOTE: U = ulcer; B = blood type; C = city.

As can be seen in Table 8.4, no simple model fits the data. The last model, called the saturated model, has no degrees of freedom left to test the model: It fits the data exactly because it uses all of the information in each cell. Because this model represents no simplification over the frequencies themselves, one would hope that other models would fit the data. In this case, the model [BC] [BU] [CU], with all main effects and three two-way relationships (BC, BU, and CU) but no three-way relationship, fits well. So blood type is related to city, blood type is related to ulcers, and city is related to ulcers, but the relationship between any two variables is the same at each level of the third variable (no three-way relationship).
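The text does not say how the models in Table 8.4 were fitted. One standard approach for hierarchical log-linear models is iterative proportional fitting, which repeatedly rescales the fitted table to match each observed margin named in the model. The sketch below applies it to the counts of Table 8.3; the function names, tolerance, and iteration limit are assumptions, and the output should be close to, though not necessarily digit-for-digit identical with, the published values.

```python
import math

# Observed counts from Table 8.3, keyed by (city, blood type, ulcer).
observed = {
    ("London", "O", "yes"): 911, ("London", "O", "no"): 4578,
    ("London", "A", "yes"): 579, ("London", "A", "no"): 4219,
    ("Manchester", "O", "yes"): 361, ("Manchester", "O", "no"): 4532,
    ("Manchester", "A", "yes"): 246, ("Manchester", "A", "no"): 3775,
    ("Newcastle", "O", "yes"): 396, ("Newcastle", "O", "no"): 6598,
    ("Newcastle", "A", "yes"): 219, ("Newcastle", "A", "no"): 5261,
}
AXIS = {"C": 0, "B": 1, "U": 2}  # position of each variable in a cell key


def margin(table, axes):
    """Collapse a table over every variable not listed in axes."""
    out = {}
    for cell, value in table.items():
        key = tuple(cell[a] for a in axes)
        out[key] = out.get(key, 0.0) + value
    return out


def fit_model(generators, tol=1e-8, max_iter=500):
    """Fitted frequencies for a hierarchical model, e.g. ["BC", "BU", "CU"]."""
    margins = [tuple(AXIS[v] for v in gen) for gen in generators]
    fitted = {cell: 1.0 for cell in observed}
    for _ in range(max_iter):
        biggest_change = 0.0
        for axes in margins:
            target = margin(observed, axes)
            current = margin(fitted, axes)
            for cell in list(fitted):
                key = tuple(cell[a] for a in axes)
                new_value = fitted[cell] * target[key] / current[key]
                biggest_change = max(biggest_change,
                                     abs(new_value - fitted[cell]))
                fitted[cell] = new_value
        if biggest_change < tol:
            break
    return fitted


def g2(fitted):
    """Likelihood-ratio statistic comparing observed and fitted counts."""
    return 2.0 * sum(o * math.log(o / fitted[cell])
                     for cell, o in observed.items() if o > 0)


for model in (["U", "B", "C"], ["BU", "CU"], ["BC", "BU", "CU"], ["BCU"]):
    print(model, round(g2(fit_model(model)), 2))
```

Models whose fitted values have a closed form, such as complete independence, converge after one cycle; the model with all three two-way terms has no closed form and needs several cycles.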

Researchers often have one or more ordered variables (e.g., no symptoms, mild symptoms, severe symptoms). The most frequently used strategy in the past has been to treat ordered variables as if they were continuous. Now there are many methods for more adequately analyzing such data; these methods are discussed by Johnson and Albert (Chapter 9, this volume).

8.5.2. Logit Models

Frequently, a researcher considers one variable to be an outcome variable, and the others to be control or predictor variables. For this data set, ulcer (U) might be considered an outcome, blood type (B) a predictor, and city (C) a control or possible moderator of the effect of blood type on ulcer. In the usual ANOVA terminology, we would be interested in the main effect of blood type (on likelihood of ulcer), the main effect of city, and the interaction between blood type and city. The most obvious approach would be to model the probability of ulcer as a function of blood type and city. But using probability as an outcome is problematic: It can only vary between 0 and 1, whereas in ANOVA models, the outcome variable can have any value. The solution is to use the logit of the probability as the outcome, where the logit is defined as the logarithm of the odds:

$$\mathrm{logit}(p) = \ln\{p/(1 - p)\}.$$

The logit can take on any real value and is therefore appropriate as an outcome variable.
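A short sketch of the transformation and its inverse may make the range argument concrete. The function names are illustrative, and the proportions passed in are the ulcer percentages for London from Table 8.3.

```python
import math


def logit(p):
    """Log odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Map any real number back to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Proportion with ulcer in London: blood type O, then blood type A.
for p in (0.166, 0.121):
    print(p, logit(p), inverse_logit(logit(p)))
```

Probabilities below .5 give negative logits, a probability of .5 gives 0, and probabilities near 0 or 1 give logits that are large in absolute value.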

Although logit models can be represented in different ways, one useful approach is to note a correspondence between logit models and log-linear models: Each logit model is equivalent to a log-linear model (but not all log-linear models are logit models). To understand the equivalence, consider each logit model as if it were a regression model. In regression models, no constraint is placed on relationships among the predictors; the predictors might be independent, but more likely they are related. Similarly, the log-linear version of a logit model contains (i.e., allows) all possible relationships among predictor variables. This is done because we are not concerned with relationships among predictors but with relationships between predictors and the outcome variable.

In the ulcer data, this means that any logit model would include a BC term (and, because of the hierarchy principle, B and C terms by implication). Furthermore, all logit models include a term involving the dependent variable U. Any log-linear model that includes these components is a logit model also.

For example, the log-linear model [BC] [BU] can be interpreted as a logit model in which blood type is related to ulcers (BU), but city is not (no CU term).

Note that BU is an interaction in a log-linear model but a main effect (of B on U) in the corresponding logit model. Furthermore, there is no interaction (BCU term), so the effect of blood type on ulcers is the same in each city.

The log-linear model that actually fit the data was [BC] [BU] [CU]. This can be interpreted as a logit model in which blood type affects ulcers and city affects ulcers, but there is no interaction between the blood type and city effects on ulcers. (As an example of a log-linear model that is not a logit model, consider the independence model: [B] [C] [U]. Because there is no BC term, this is not a logit model.)
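In logit terms, the effect of blood type on ulcer within a city is the difference between the logit for type O and the logit for type A, which equals the log odds ratio of that city's 2 × 2 table. The sketch below computes this quantity from the counts in Table 8.3 as an informal check on the no-interaction model; it is an illustration, not a procedure described in the text.

```python
import math

# Counts from Table 8.3: city -> blood type -> (ulcer yes, ulcer no).
counts = {
    "London": {"O": (911, 4578), "A": (579, 4219)},
    "Manchester": {"O": (361, 4532), "A": (246, 3775)},
    "Newcastle": {"O": (396, 6598), "A": (219, 5261)},
}


def log_odds(yes, no):
    """Observed logit of having an ulcer: ln(yes / no)."""
    return math.log(yes / no)


def blood_type_effect(city):
    """Difference in ulcer logits, type O minus type A, within one city."""
    return log_odds(*counts[city]["O"]) - log_odds(*counts[city]["A"])


effects = {city: blood_type_effect(city) for city in counts}
for city, effect in effects.items():
    print(city, round(effect, 3), "odds ratio", round(math.exp(effect), 2))
```

The three observed effects are all positive and of broadly similar size, which agrees with the finding that a model without the BCU term fits these data.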

8.5.3. Logistic Regression

In some cases, the outcome variable is dichotomous, but one or more predictors are continuous. In this case, an analysis is desired that is similar to multiple regression, but that takes into account the categorical nature of the dependent variable. Logistic regression is such a procedure; the outcome is the logit of the probability of the outcome event occurring. That is, the model is the same as a logit model, except that one or more predictors are continuous.

Because one or more predictors are continuous, the data are not easily summarized in a contingency table; such a table would have a large number of cells. Many of the cells would be empty, and few would contain more than one observation. This means that the $G^2$ and $X^2$ statistics are not good approximations to the chi-square distribution and cannot be used to assess the goodness of fit of the model. The utility of the predictor variables must be assessed by examining either the ratio of parameters to their standard errors (commonly denoted as t or z in computer output), or the difference in $G^2$ statistics for models with and without a set of parameters. The first method is similar to what is done in testing individual parameters in a regression model; the second method is comparable to testing the increase in $R^2$ when a set of predictors is added to a regression model.
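The second method, differencing $G^2$ for nested models, works the same way whatever software produced the two statistics. As a sketch, the values below are taken from Table 8.4 for the ulcer data rather than from a logistic regression with continuous predictors: the models [BC] [CU] and [BC] [BU] [CU] differ only in the BU term, so their difference tests the effect of blood type on ulcer. The helper for the chi-square tail probability covers only the 1 degree of freedom case, and its name is an assumption.

```python
import math


def chi2_upper_tail_1df(x):
    """P(chi-square with 1 df > x), via the normal-distribution identity."""
    return math.erfc(math.sqrt(x / 2.0))


# G2 and df for two nested models, copied from Table 8.4.
g2_without, df_without = 66.87, 3   # [BC] [CU]: no blood type effect
g2_with, df_with = 2.96, 2          # [BC] [BU] [CU]: blood type effect added

g2_difference = g2_without - g2_with
df_difference = df_without - df_with
p_value = chi2_upper_tail_1df(g2_difference)

print(round(g2_difference, 2), df_difference, p_value)
```

The difference is large relative to a chi-square distribution with 1 degree of freedom, so the blood type term would be retained.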

As with log-linear models, extensions of logit and logistic regression models allow polytomous (more than two-category) dependent variables. Some polytomous variables are unordered (e.g., race), whereas others are ordered; both types of situations are handled by more complex versions of the models discussed previously.

8.6. Nonstandard Log-Linear and Logit Models

Partitioning of chi-square is a simple technique to learn and use, and it can go a long way toward testing hypotheses that are important to researchers. But because it cannot test all important hypotheses, a more general method is needed.

To provide a context, consider the data on admissions to graduate school at UC Berkeley that have become well publicized (see, e.g., Freedman, Pisani, & Purves, 1978, pp. 12–15). Table 8.5 shows the proportion of each gender admitted in each of six major areas of study.

The three variables will be called Major, Gender, and Admission (or M, G, and A for brevity).

We would presume that there might be a relationship between gender and major (a G × M effect) because males and females might tend to apply to different major areas at different rates. We would also presume that there might be a relationship between major area and admissions (an M × A effect) because some major areas get many more applicants per opening than others do.

If there is no bias in admissions, however, we would hope to find no relationship between gender and admission in any major area. (We oversimplify here and ignore the possibility of other confounding variables, such as prior achievement or aptitude.) The usual log-linear model described by this situation would be specified as [GM] [MA], to show inclusion of both G × M and M × A effects in the model.

If there is bias, the additional term (GA) would be added to the model to show that gender is related to admissions.

If there is bias, and if that bias differs across major areas, then there would be a three-way Gender × Major × Admission interaction (GMA) in the model. This is the saturated log-linear model, with zero degrees of freedom; it will fit the data exactly but provides no simple interpretation of the data.

In fact, this occurs for the Berkeley data: The model of no three-way interaction does not fit the data and is rejected, leaving the conclusion that there is bias, and it differs across major areas. For those who limit themselves to the usual hierarchical log-linear models, there is not much else to say here, but inspection of the data in Table 8.5 shows something very interesting.

For major area A, it appears that males are admitted at a lower rate than females. For each of the other major areas, there is no apparent difference in rates of admission.

Table 8.5 Graduate Admissions Data at UC Berkeley

Major Area Gender % Admitted

A M 62

This description does not correspond to any standard log-linear model; therefore, a nonstandard log-linear model is needed. (In this instance, we could use partitioning chi-square, but that will not be possible for all nonstandard models.) A model of no bias in major areas B through F, but possible bias in major area A, has a likelihood ratio chi-square of 2.33 with 5 degrees of freedom, and thus fits the data quite well. The simplest way to represent the model is as a logit model, with admission (A) as the dependent variable. The model matrix (sometimes called a design matrix) for the logit form of the model is presented as follows:

The rows of this matrix correspond to the 12 groups in the study, as shown in Table 8.5 (i.e., six major areas by two genders). Those who are used to looking at such matrices will notice that the first column represents the intercept term, and the next five columns represent the main effect of major area. There is no column for the main effect of gender, but there is one column for the interaction of gender and major. Normally, there would be five such columns; they would be the product of the gender effect with each of the five major area effects. Here, however, we are including such an interaction only for major area A. The omitted main effect for gender and the four omitted interaction terms produce the 5 degrees of freedom mentioned above for testing the model. Of course, the hypothesis tested here is post hoc, and the results must be considered tentative.
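A matrix with the structure just described can be generated as follows. This is only a sketch: the coding scheme is an assumption (0/1 indicator coding with major area F as the reference category and males in major area A carrying the extra column), and the original matrix may well use a different but equivalent coding, such as effect coding, or a different row order.

```python
MAJORS = ["A", "B", "C", "D", "E", "F"]
GENDERS = ["M", "F"]


def model_matrix_row(major, gender):
    """One row of the assumed model matrix for a (major, gender) group."""
    intercept = [1]
    # Five indicator columns for major area; F is the assumed reference.
    major_columns = [1 if major == m else 0 for m in MAJORS[:5]]
    # Single gender-by-major column, nonzero only within major area A.
    gender_in_major_a = [1 if (major == "A" and gender == "M") else 0]
    return intercept + major_columns + gender_in_major_a


rows = [model_matrix_row(m, g) for m in MAJORS for g in GENDERS]
for (m, g), row in zip(((m, g) for m in MAJORS for g in GENDERS), rows):
    print(m, g, row)

# 12 groups minus 7 columns leaves the degrees of freedom for the fit test.
degrees_of_freedom = len(rows) - len(rows[0])
print(degrees_of_freedom)
```

Under this coding the two gender rows are identical in every major area except A, which is how the model expresses equal admission logits for males and females in areas B through F.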

Nonstandard log-linear models illustrate one of the main trends in applied statistics discussed previously in the section on partitioning chi-square, the testing of context-dependent models. Nonstandard models also illustrate another trend, the increasing generality of statistical models. They provide a framework that includes as special cases many situations that were previously dealt with separately by other researchers. Most obviously, the usual hierarchical log-linear models and partitioning chi-square can be put in this framework. In addition, the nonstandard log-linear approach includes models for data with structural zeros, incomplete designs, models for symmetry and quasi-symmetry, models with linear restrictions on parameters, polynomial models, and many of Goodman’s models for association with ordered variables (details can be found in Rindskopf, 1990).

One general framework and one computer program can deal with this wide variety of problems.

8.7. Methods for Rates