AND INTERNATIONAL RELATIONS
Posc/Uapp 816

MORE ON LOGISTIC REGRESSION

I. AGENDA:

A. Logistic regression

1. Multiple independent variables

2. Example: The Bell Curve

3. Evaluation of fit

4. Inference

B. Reading: Agresti and Finlay, Statistical Methods for the Social Sciences, 3rd edition, pages 576 to 585.

II. MULTIPLE VARIABLE LOGISTIC REGRESSION:

A. As noted last time, we can approach model building, that is, explaining “variation” in the log odds or odds, in the same way we did with multiple regression.

1. We can add variables of any of the types discussed under that topic.

2. Hence, we can add more continuous Xs;

3. Dummy indicators for categorical variables.

4. Interaction terms.

B. Here once again are the results for the California congressional delegation.

1. The estimated parameters are partial regression coefficients that show the effects of a variable on the logit when the other variables in the model have been held constant or controlled.

2. Here are the results:

Logistic Regression Table

                                                Odds      95% CI
Predictor       Coef     StDev      Z       P   Ratio   Lower   Upper
Constant      -2.711     1.562  -1.74   0.083
Rural       -0.01497   0.06533  -0.23   0.819    0.99    0.87    1.12
ADA96        0.14762   0.07474   1.98   0.048    1.16    1.00    1.34

Log-Likelihood = -5.582

Test that all slopes are zero: G = 33.869, DF = 2, P-Value = 0.000

C. We’ll consider the “significance” of the terms in a moment.

1. But you might anticipate from what has gone before that there really won’t be an improvement.

i. For one thing, there is a relatively strong negative correlation between ADA and percent rural, -.461.

2. The estimated equation for the log odds is:

$$\hat{\Omega} = -2.711 - .01497\,\text{Rural} + .14762\,\text{ADA}$$

¹ Richard J. Herrnstein and Charles Murray, The Bell Curve (New York: The Free Press, 1994).

3. To see what the numbers mean just substitute some meaningful values for X₁ and X₂, such as 0 and 0.

i. Actually this combination would not make sense in American politics because it’s unlikely that an extreme conservative (ADA = 0) would represent a totally urbanized district (rural = 0).

ii. Anyway, the estimated log odds would be -2.711 and the estimated odds would be e^-2.711 = .0665 to 1.

iii. What would the log odds, odds and predicted probability be for a representative from a district with a rural population of 30 and an ADA score of 70?
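The question in item iii can be worked through numerically. What follows is a minimal Python sketch, not part of the original notes: the function name `predict` is hypothetical, and the coefficients are simply copied from the MINITAB table for the California delegation.

```python
import math

# Coefficients copied from the logistic regression table for the
# California delegation (constant, percent rural, ADA score).
CONSTANT = -2.711
B_RURAL = -0.01497
B_ADA = 0.14762


def predict(rural, ada):
    """Return (log odds, odds, probability) for given predictor values."""
    log_odds = CONSTANT + B_RURAL * rural + B_ADA * ada
    odds = math.exp(log_odds)
    probability = odds / (1.0 + odds)
    return log_odds, odds, probability


# The (unrealistic) baseline case: rural = 0 and ADA = 0.
baseline = predict(0, 0)

# The case asked about in the notes: rural = 30 and ADA = 70.
example = predict(30, 70)

for label, (lo, od, pr) in [("rural=0, ADA=0", baseline),
                            ("rural=30, ADA=70", example)]:
    print(f"{label}: log odds = {lo:.4f}, odds = {od:.4f}, probability = {pr:.5f}")
```

The same three steps (linear predictor, exponentiate, divide by one plus the odds) apply to any combination of values one wants to try.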

D. A more realistic example:

1. Let’s consider Richard Herrnstein and Charles Murray’s The Bell Curve,¹ an important book that claims IQ (native intelligence) accounts for variation in achievement much more than social background does.

i. At one point the authors want to explain “being below the poverty line.” That is, they are not concerned with the rate or number of poor people; rather they want to know, given that a person has such and such an IQ, is of such and such age, and comes from such and such family background, what is the probability of the person’s living below the poverty line.

1) In our previous notation, they want a “model” for π, where π is the probability that Y = 1 (i.e., being in poverty).

2. Major arguments:

i. They claim, among other things, that IQ has an independent effect on poverty status, irrespective of other variables such as socioeconomic background.

ii. Hence, whatever the family background, the higher one’s IQ, the lower one’s chance of living in poverty (a negative correlation).

iii. They use their analysis of poverty to make a much larger point:

1) “Our thesis is that the twentieth century has continued the transformation, so that the twenty-first will open on a world in which cognitive ability [which, they claim, is driven mostly by genes, not environment] is the decisive dividing force....Social class remains the vehicle of life, but intelligence now pulls the train.”²

² Herrnstein and Murray, The Bell Curve, page 25.

3. I have some comments on their analysis at the end of the notes, plus a couple of suggested readings. But let’s simply use their results as an example of logistic regression.

E. The logit

1. As we saw in Class 22, social scientists prefer not to model probabilities directly (such models are called linear probability models). Instead, they use a transformation of the probability, π.

2. One of the commonest, and the one used by Herrnstein and Murray, is the logit, defined as:

$$\Omega = \text{logit}(\pi) = \log\left(\frac{\pi}{1 - \pi}\right)$$

where π is the probability of an event of interest occurring (e.g., being in a state of poverty). (This is the log of the odds.)

3. A linear multiple variable model for the log odds is:

$$\Omega = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_K X_K + \varepsilon$$

4. Recall some of the properties of log odds and models for them.

i. They can take on any value from minus to plus infinity.

ii. Hence, if we think of the logit as a dependent variable, we model it in much the same way as with regression analysis.

iii. The errors in log odds models (hopefully) meet the statistical assumptions so that we can obtain unbiased and efficient estimators of parameters.

1) They have constant variance and are uncorrelated with each other and with the explanatory variables.

F. A logit is not a “natural” variable to many of us, so in order to understand the substantive significance of these models we can convert it to a probability:

$$\pi = \frac{e^{\text{logit}}}{1 + e^{\text{logit}}}$$
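The two conversions, from probability to logit and back again, can be sketched in a few lines of Python. This is only an illustration; the function names `logit` and `inverse_logit` are assumptions of the sketch, not anything used in the notes or in MINITAB.

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(x):
    """Probability corresponding to a log odds value x."""
    return math.exp(x) / (1.0 + math.exp(x))


# Round trip: probability -> logit -> probability.
for p in (0.1, 0.5, 0.9):
    x = logit(p)
    print(f"p = {p:.2f}  logit = {x:+.4f}  back to p = {inverse_logit(x):.2f}")
```

Note that a probability of .5 corresponds to a logit of 0, probabilities below .5 give negative logits, and probabilities above .5 give positive ones.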

G. Probabilities of this sort constitute the subject of The Bell Curve’s models.

1. Much of their data come from the National Longitudinal Survey [of Labor Market Experience] of Youth.

i. It is a “panel” study in which a sample of people aged 14 to 22, first interviewed in 1979, is repeatedly re-interviewed.

ii. The original sample consisted of 12,686 respondents.

2. In one analysis their main explanatory variables are:

i. Armed Forces Qualification Test (AFQT) scores, which are used as a measure of cognitive ability and that reflect native intelligence (“g scores”).

ii. Socioeconomic status (SES) of the respondent’s family.

iii. Age

3. All of the variables are standardized to have mean 0 and standard deviation 1.0.

i. Recall our discussion of standardized data and the hopes placed on them.

H. The estimated parameters for their various models appear in Appendix 4 of The Bell Curve.

1. For example, the following table (based on the one on page 594) gives these estimates of the regression parameters for logits pertaining to falling below the poverty level.

Variable              Estimate
Constant/intercept    -2.6487288
IQ (AFQT)             -.8376338
SES                   -.3300720
Age                   -.0238375

2. The equation version of these results, which shows their relation to predicted values of the logit, is

$$\hat{\Omega} = -2.64873 - .83763(\text{AFQT}) - .33007(\text{SES}) - .02384(\text{Age})$$

3. Once we find predicted logits, we can use the formula given under F above to convert them to probabilities.

I. Examples:

1. Suppose we let the three independent variables have their mean values, which are 0.

i. As noted, all of the variables in this analysis have been standardized. And recall that the mean of a standardized variable is 0 and the standard deviation is 1.

ii. The use of standardized scores, moreover, means that the coefficients are standardized regression coefficients, whose magnitudes can presumably be directly compared.

1) We know that this assumption is problematic, however.

J. Based on a comparison of the standardized coefficients, Herrnstein and Murray claim that IQ is a “more important” explanation of this form of achievement than is social class background.

1. The standardized coefficient is twice as large as the one for social background, so doesn’t this mean that intelligence is twice as important in explaining poverty status?

i. The drift of their argument is that people who are poor (or don’t achieve) have only their biological endowment to blame. They haven’t been disadvantaged by their “environment.”

K. Interpretation of coefficient:

1. For now we need to put aside any discussion of this claim and look at the numbers’ meanings.

2. So if we let age = IQ = SES = 0, which is the same as looking at someone who is at the mean or average of these factors, we can predict this person’s log odds of being below the poverty line:

$$\hat{\Omega} = -2.6487 - .8376(0) - .3301(0) - .0238(0) = -2.6487$$

i. Note carefully: in this context a score of 0 represents the mean. It does not mean, for instance, literally age equals zero.

ii. The odds and probability that correspond to this logit are:

$$\hat{O} = e^{-2.64870} = .07074 \quad \text{and} \quad \hat{\pi} = \frac{e^{-2.64870}}{1 + e^{-2.64870}} = .06607$$

iii. These numbers mean, first, that the odds of someone with average age, SES standing, and intelligence being below the poverty line are .07 to 1.

iv. The corresponding probability of such a person being in poverty is .06607.

3. Now suppose a person is exceptionally bright. That is, although the SES and age remain at the mean (0), the individual’s IQ is one full standard deviation above the average (that is, IQ = 1).

i. Again, recall that the data are in standard deviation form.

ii. If IQ is normally distributed this would mean that the person is above roughly 84 percent of the sample.

4. The estimated log odds are now:

$$\hat{\Omega} = -2.6487 - .8376(1) = -2.6487 - .8376 = -3.4864$$

i. The log odds have decreased slightly.

ii. The standardized partial regression coefficient for IQ is -.8376. Since this is added to the constant, which is negative, we see that the logit (and odds) will decrease.

5. The odds and probability that the person falls into poverty are

$$\hat{O} = e^{-2.64870 - .8376338(1)} = e^{-3.48637} = .03061 \quad \text{and} \quad \hat{\pi} = \frac{e^{-3.48637}}{1 + e^{-3.48637}} = .02970$$

i. Being one standard deviation above the mean of the IQ distribution (roughly the top 16 percent, if IQ is normally distributed) thus lowers (compared to being average) the odds and probability of being below the poverty level, after social class and age have been controlled.

6. Now let’s see what the estimated log odds are of someone whose IQ falls 1 standard deviation below average:

$$\hat{\Omega} = -2.6487 - .8376(-1) - .3301(0) - .0238(0) = -2.6487 + .8376 = -1.8111$$

i. The log odds have gone up a bit and the estimated odds and probability are:

$$\hat{O} = e^{-2.64870 - .8376338(-1)} = e^{-1.81110} = .16348 \quad \text{and} \quad \hat{\pi} = \frac{e^{-1.81110}}{1 + e^{-1.81110}} = .1405$$

ii. The chances for someone near the bottom of the IQ ladder (roughly the bottom 16 percent) being below the poverty level have increased quite a bit.

7. We can substitute in other values in order to see what effect they have on the logits and (as shown below) the probabilities. For example, consider a Herrnstein-Murray loser, someone two standard deviations below the mean. The log odds of being in poverty are -.97346, the odds are .37778 to 1, and the probability is .27419.

i. This perhaps discouraging result shows that a person of substantially below average intelligence has a more than one in four chance of being poor, even after controlling for age and social background.
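The substitutions worked through in this section can be reproduced with a short Python sketch. It is offered only as a check on the arithmetic: the function name `poverty_prediction` is hypothetical, and the coefficients are those from the Appendix 4 table quoted above.

```python
import math

# Standardized coefficients from the poverty model (Appendix 4 table).
CONSTANT = -2.6487288
B_IQ = -0.8376338
B_SES = -0.3300720
B_AGE = -0.0238375


def poverty_prediction(iq=0.0, ses=0.0, age=0.0):
    """Return (logit, odds, probability) for standardized predictor scores."""
    logit = CONSTANT + B_IQ * iq + B_SES * ses + B_AGE * age
    odds = math.exp(logit)
    return logit, odds, odds / (1.0 + odds)


# Vary IQ in standard deviation units, holding SES and age at their means (0).
results = {iq: poverty_prediction(iq=iq) for iq in (1, 0, -1, -2)}

print("  IQ     logit     odds  probability")
for iq, (logit, odds, prob) in results.items():
    print(f"{iq:4d}  {logit:8.5f}  {odds:7.5f}  {prob:7.5f}")
```

Changing the `ses` or `age` arguments shows, in the same way, how the predicted probability moves when those variables depart from their means.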

L. I’ll try to present some graphs, similar to the ones in Chapter 5, that show how the probability of being in poverty changes with changes in an independent variable, with the other variables held constant.

1. More comments later.

III. EVALUATING LOGISTIC REGRESSION MODELS:

A. The notes in this section simply repeat the ones for Class 22.

B. We can test the significance of the estimated parameters using the same ideas as we employed in regular regression. In particular, we can

1. Compute statistics roughly comparable to R² as a measure of how well the data fit the model.

2. An overall or global test of the regression parameters.

3. Tests for individual parameters.

4. Confidence intervals for estimated parameters.

5. Confidence intervals for predicted probabilities.

C. Note and warning:

1. The statistical results for logistic regression usually assume that the sample is relatively large.

2. If, for example, a statistic such as the estimator of the regression coefficient is said to be normally distributed with a standard deviation of σ_β, the statement applies, strictly speaking, to estimators based on large N.

i. How big does N have to be? A rule of thumb: roughly 60 or more cases.

D. The R² analogue.

1. There really isn’t a completely satisfactory version of R² available to measure the “explained variation” in Y similar to the common multiple R², so we will use a different measure, the correct classification rate (CCR).

2. MINITAB effectively constructs a cross-classification table of predicted and observed results that takes this form.

Observed/Predicted    Ŷ = 0               Ŷ = 1
Y = 0                 Number correct      Number incorrect
Y = 1                 Number incorrect    Number correct

i. The table cross-classifies predicted Y’s by observed Y’s.

3. If the model does a good job, then presumably the total number of correct predictions--the frequencies in the main diagonal--should greatly outweigh the incorrect guesses.

i. For instance, suppose a model led to this pattern of correct and incorrect predictions.

Observed/Predicted    Ŷ = 0    Ŷ = 1
Y = 0                 47       4
Y = 1                 3        29

ii. Since there are a total of 83 observations in the table and 76 of them have been correctly predicted, the CCR is 76/83 × 100 = 91.57%.
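The CCR calculation is simple enough to do by hand, but a small Python sketch makes the steps explicit. The function name `correct_classification_rate` is an assumption of this illustration; the counts are the ones in the example table.

```python
# Cross-classification of observed by predicted outcomes from the example:
# rows are observed Y (0, 1); columns are predicted Y (0, 1).
table = [
    [47, 4],   # observed Y = 0: 47 predicted correctly, 4 incorrectly
    [3, 29],   # observed Y = 1: 3 predicted incorrectly, 29 correctly
]


def correct_classification_rate(table):
    """Percentage of cases on the main diagonal of a square table."""
    correct = sum(table[i][i] for i in range(len(table)))
    total = sum(sum(row) for row in table)
    return 100.0 * correct / total


ccr = correct_classification_rate(table)
print(f"CCR = {ccr:.2f}%")
```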

4. Some software reports this number or it can be easily calculated from reported data.

5. MINITAB, however, reports “measures of association” for the table.

i. These measures are bounded between -1.0 and 1.0 and attain maximum values (1.0) when there are no errors.

ii. So a measure equal to .9 indicates that most of the Y’s have been correctly predicted and the model fits reasonably well.
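To give a feel for where such measures come from, here is a rough Python sketch. It assumes the usual pair-counting definitions: every pair made up of one observed Y = 1 and one observed Y = 0 is called concordant if the Y = 1 case has the higher predicted probability, discordant if it has the lower, and tied otherwise. Whether MINITAB computes its measures in exactly this way is an assumption here, and the six data points are invented purely for illustration.

```python
from itertools import combinations


def association_measures(y, p):
    """Somers' D, Goodman-Kruskal gamma and Kendall's tau-a from
    observed 0/1 outcomes y and predicted probabilities p."""
    concordant = discordant = tied = 0
    for (y_i, p_i), (y_j, p_j) in combinations(zip(y, p), 2):
        if y_i == y_j:
            continue  # only pairs with different observed outcomes count
        # Identify which member of the pair is the observed "1".
        p_one, p_zero = (p_i, p_j) if y_i == 1 else (p_j, p_i)
        if p_one > p_zero:
            concordant += 1
        elif p_one < p_zero:
            discordant += 1
        else:
            tied += 1
    n = len(y)
    diff = concordant - discordant
    somers_d = diff / (concordant + discordant + tied)
    gamma = diff / (concordant + discordant)
    tau_a = diff / (n * (n - 1) / 2)
    return somers_d, gamma, tau_a


# Hypothetical data: six cases, invented purely for illustration.
y = [0, 0, 0, 1, 1, 1]
p = [0.10, 0.30, 0.60, 0.40, 0.80, 0.90]
somers_d, gamma, tau_a = association_measures(y, p)
print(f"Somers' D = {somers_d:.3f}, gamma = {gamma:.3f}, tau-a = {tau_a:.3f}")
```

Under these definitions tau-a divides by all possible pairs of cases, including pairs with the same observed outcome, so it is smaller than the other two measures even when the predictions are nearly perfect.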

6. The measures for the percent rural and the ADA models are:

Percent Rural

Somers' D                0.51
Goodman-Kruskal Gamma    0.57
Kendall's Tau-a          0.22

ADA

Somers' D                0.94
Goodman-Kruskal Gamma    0.95
Kendall's Tau-a          0.40

i. The measures of association for the independent variable rural are about .5--half way between 0 for no correlation and 1.0 for perfect correlation--so the data fit at best moderately well.

1) Note I prefer using Somers’ measure.

ii. For the ADA variable, however, the value of the measure is nearly 1, which suggests a quite good fit.

iii. One would, based on these considerations, conclude that ADA scores better explain and predict votes on assault weapons than percent rural does. Needless to say, this conclusion undercuts the original hypothesis.

IV. INFERENCE FOR LOGISTIC REGRESSION:

A. A “global” test of the hypothesis that β₁ = β₂ = β₃ = ... = β_K = 0 is usually done by comparing the “likelihood,” L_β, for the model to the likelihood (L₀) for a model for the data containing only a constant.

1. Sorry, we can’t take time to explain “likelihood,” although the concept is not difficult.

2. Think of it as very, very roughly akin to residual sum of squares.

B. A bit more formally one obtains an observed statistic

$$\text{LLR} = -2\log\left(\frac{L_0}{L_\beta}\right) = -2(\log L_0 - \log L_\beta)$$

1. LLR, called the log of the likelihood ratio, is a simple chi square statistic with degrees of freedom equal to the number of variables in the model.

2. The sample size has to be reasonably large, say more than 60 cases.
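As a rough illustration, the global test for the California model can be reproduced in Python from the two numbers MINITAB prints (the model log-likelihood and G). The constant-only log-likelihood is not given in the notes, so the sketch backs it out from G; that step, the function names, and the use of the closed-form chi-square tail area (valid only for 2 degrees of freedom) are all assumptions of this sketch.

```python
import math

# Values reported in the MINITAB output for the California model.
log_l_model = -5.582   # log-likelihood of the fitted model (two slopes)
g_reported = 33.869    # reported test statistic G
df = 2                 # number of slope parameters tested

# Assumption: back out the constant-only log-likelihood from
# G = -2 * (log L0 - log L_model); it is not printed in the notes.
log_l_null = log_l_model - g_reported / 2.0


def llr_statistic(log_l_null, log_l_model):
    """Likelihood ratio statistic -2(log L0 - log L_model)."""
    return -2.0 * (log_l_null - log_l_model)


def chi_square_sf_df2(x):
    """Upper-tail probability of a chi-square variable with 2 degrees of
    freedom; this closed form, exp(-x/2), holds only for df = 2."""
    return math.exp(-x / 2.0)


g = llr_statistic(log_l_null, log_l_model)
p_value = chi_square_sf_df2(g)
print(f"log L0 = {log_l_null:.4f}, G = {g:.3f}, p-value = {p_value:.2e}")
```

The tiny p-value agrees with the "P-Value = 0.000" that MINITAB reports for the test that all slopes are zero.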

C. More generally, one can test the significance of a set of parameters by comparing a model that includes them--call it the “complete” model--with one that does not have those parameters--call it the “reduced” model.

1. This strategy parallels in form the one used in multiple regression.

2. That is, suppose the complete model has K variables while the reduced model contains K - q, where q < K.

3. Use a program to obtain the likelihood for the full model (L_complete) and the likelihood for the reduced model (L_reduced).

4. The test statistic:

$$\text{LLR} = -2\log\left(\frac{L_{\text{reduced}}}{L_{\text{complete}}}\right)$$

is distributed as χ² with q degrees of freedom.

D. Maximum likelihood estimation provides (asymptotic or large sample) standard errors of the coefficients. These can be used to test hypotheses about individual parameters and construct (simultaneous) confidence intervals.

E. The test statistic for a (partial) regression parameter resembles the form of the statistic for regular regression parameters: it is the estimated coefficient divided by its standard error.

1. This statistic, called Wald’s Z, is:

$$Z = \frac{\hat{\beta} - \beta}{\hat{\sigma}_{\hat{\beta}}}$$

2. It is distributed approximately as a standard normal variable, so one uses the z table to find a critical value and test the hypothesis, which usually is that β = 0.

i. This is a z statistic, not t.

ii. Report attained level of significance when possible.

3. As noted above, 60 cases should be sufficiently large in most situations to obtain reasonably valid results.

i. I have also seen the rule of thumb: the ratio of the sample size to the number of variables in the model should be 20 to 1 or greater.³

³ Robert D. Retherford and Minja Kim Choe, Statistical Models for Causal Analysis (Wiley, 1993) page 137.

F. I’ll discuss examples in class.
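As one such example, the Wald statistics, attained significance levels, and odds-ratio intervals in the California table can be recomputed from the coefficients and standard errors alone. The Python sketch below is a hedged illustration: the function name `wald_test` is hypothetical, the null value is taken to be β = 0, and 1.96 is used as the critical value for a 95% interval.

```python
import math

# (coefficient, standard error) pairs from the California table.
estimates = {
    "Constant": (-2.711, 1.562),
    "Rural": (-0.01497, 0.06533),
    "ADA96": (0.14762, 0.07474),
}


def wald_test(coef, se, z_crit=1.96):
    """Return Wald z, two-sided p-value, odds ratio and its 95% interval."""
    z = coef / se
    # Two-sided tail area of the standard normal: P(|Z| > |z|).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return z, p_value, odds_ratio, lower, upper


results = {name: wald_test(b, se) for name, (b, se) in estimates.items()}
for name, (z, p, ratio, lo, hi) in results.items():
    print(f"{name:8s} z = {z:6.2f}  p = {p:.3f}  "
          f"odds ratio = {ratio:.2f}  95% CI = ({lo:.2f}, {hi:.2f})")
```

The interval for an odds ratio is obtained by exponentiating the endpoints of the interval for the coefficient, which is why it is not symmetric around the odds ratio itself. (MINITAB does not print an odds ratio for the constant; the sketch computes one only because it treats every row alike.)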

V. A COUPLE OF ADDITIONAL REMARKS ON THE BELL CURVE:

A. The importance of explanatory factors: a critique

1. First, as noted several times, Herrnstein and Murray analyze variables with different scales (e.g., AFQT has a different scale than parental socioeconomic background).

2. To overcome this problem, which is actually not necessarily a problem, the authors standardize each variable so that the means are 0 and standard deviations are 1.

i. Consequently, instead of talking about, say, age in years, they refer to it in standard deviation units. A person, for instance, isn’t 22 years old, but has a score of .3 or 3/10 of a standard deviation. Another individual might have a score on age of, say, -.21 instead of 19.

3. Standardizing presumably permits one to compare the magnitudes of different β’s because they are all based on the same scale.

4. The authors then interpret the numerical size of the coefficients as indicators of importance.

5. As noted countless times before, the β’s based on standard scores are called standardized regression coefficients instead of just regression coefficients.

6. One can think of them as resulting from the following manipulation:

$$\beta^*_1 = \beta_1 \frac{\hat{\sigma}_X}{\hat{\sigma}_Y}$$

i. Here β₁* is the standardized coefficient, the one Herrnstein and Murray report, β₁ is the unstandardized regression coefficient, and the sigmas are the sample standard deviations.

7. In view of this relationship note the following. All else equal, if the variation in X doubled, the size of the standardized coefficient would also double.

8. The lesson is thus that the magnitude of a standardized coefficient is a function not only of the strength of the relationship but also the amount of variation in the independent variable.

9. So if we were comparing two groups (with standardized coefficients) our conclusions about the importance of variables could be affected by the variation in each group.
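The dependence of a standardized coefficient on the spread of X is easy to see with made-up numbers. In the Python sketch below the unstandardized coefficient and the standard deviations are hypothetical values chosen only to illustrate the arithmetic, and the function name `standardized_coefficient` is likewise an assumption.

```python
def standardized_coefficient(beta, sd_x, sd_y):
    """Standardized coefficient: beta * (sd of X) / (sd of Y)."""
    return beta * sd_x / sd_y


# Hypothetical values, chosen only to illustrate the arithmetic.
beta = 0.5
sd_y = 2.0

narrow = standardized_coefficient(beta, sd_x=1.0, sd_y=sd_y)
wide = standardized_coefficient(beta, sd_x=2.0, sd_y=sd_y)

print(f"sd of X = 1.0: standardized coefficient = {narrow:.3f}")
print(f"sd of X = 2.0: standardized coefficient = {wide:.3f}")
print(f"ratio = {wide / narrow:.1f}")
```

The underlying relationship (the unstandardized coefficient) is identical in the two cases; only the variation in X differs, yet the standardized coefficient doubles.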

B. Theoretical importance:

[Figure 1: Importance of Variables?]

1. Look at Figure 1 above.

i. Suppose X₁ and X₂ both affect Y. Can we say that X₁ is more important than X₂, even though the first’s coefficient is larger?

ii. Suppose both are necessary for the occurrence or variation of Y.

iii. This seems to be a theoretical issue, not one of statistics.

C. A huge amount has been written about The Bell Curve:

1. For a statistical analysis see Arthur S. Goldberger and Charles F. Manski, “Review Article: The Bell Curve” in the Journal of Economic Literature, volume 33 (1995) pages 762 to 776.

2. An excellent but mostly verbal collection of essays that critique the book is The Bell Curve Wars: Race, Intelligence, and the Future of America, edited by Steven Fraser (Basic Books, 1995).

3. For a more “balanced” assessment see Bernie Devlin and others, Intelligence, Genes, and Success (Springer-Verlag, 1997).

4. Also, Claude S. Fischer and others, Inequality By Design: Cracking the Bell Curve Myth (Princeton University Press, 1996).

5. By far the most popular of Herrnstein and Murray’s critics is Stephen Jay Gould. As much as I respect and admire his work, I think it is fair to say many biologists, philosophers, and sociologists find great faults in his analysis of sociobiology. But most would agree that The Bell Curve is a flawed study of genes and intelligence.

VI. NEXT TIME:

A. Summary of data analysis