
Overview Classes

12-3 Logistic regression (5)

19-3 Building and applying logistic regression (6)

26-3 Generalizations of logistic regression (7)

2-4 Loglinear models (8)

5-4 15-17 hrs; 5B02 Building and applying loglinear models (9.1-9.3, 9.8)

23-4 Association (9.4-9.6)

3-5 15-17 hrs: 5A37 Matched pairs (10)

7-5 Repeated measurements (11/12)

14-5 Mixture models (13)


Logistic Regression

Today’s topics:

1. Introduction

2. Parameter interpretation

3. Inference

4. Categorical predictors

5. Multiple predictors

6. Software: SPSS

7. Software: ℓEM


Introduction: Logistic Regression

The response variable (Y) is dichotomous. There may be one or more predictor variables, continuous or categorical.

For the moment, let's consider a single predictor variable X. Denote π(x) = P(Y = 1 | X = x). The logistic regression model is

$$\pi(x) = \frac{\exp(\alpha + \beta x)}{1 + \exp(\alpha + \beta x)}$$

or equivalently

$$\operatorname{logit}[\pi(x)] = \log\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta x$$

The logit link equates the log odds to the linear predictor α + βx.
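The two forms of the model can be checked numerically. The sketch below is only an illustration: the values of `alpha` and `beta` are hypothetical and not taken from any data set.

```python
import math

def logistic(alpha, beta, x):
    """P(Y = 1 | X = x) under the logistic regression model."""
    eta = alpha + beta * x  # linear predictor
    return math.exp(eta) / (1.0 + math.exp(eta))

def logit(p):
    """Log odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

alpha, beta = -2.0, 0.5  # hypothetical parameter values

for x in (0, 2, 4, 6, 8):
    p = logistic(alpha, beta, x)
    # The logit of pi(x) reproduces the linear predictor alpha + beta * x
    print(f"x={x}: pi(x)={p:.3f}  logit={logit(p):.3f}  alpha+beta*x={alpha + beta * x:.3f}")
```

Applying `logit` to the output of `logistic` returns the linear predictor, which is what "equivalently" means in the model statement.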


Interpretation

How to interpret β?

1. The sign determines whether the probability goes up or down with an increase in X.

2. The larger the absolute value of β, the steeper the curve. When β = 0 the curve is flat and X and Y are independent.

3. The relationship between the predictor and the probability follows the logistic curve.


Interpretation

[Figure: P(Y=1|x) plotted as a function of x.]


Interpretation

How to interpret β?

1. The odds are multiplied by e^β for every unit increase in X.

2. e^β is an odds ratio: the odds at X = x + 1 divided by the odds at X = x.

3. Use the quartiles of X to get a better understanding, for example by comparing the estimated probabilities at the lower and upper quartile.

4. Via a linearization argument: the line tangent to the curve at x has slope βπ(x)[1 − π(x)]. This is approximately the change in probability for an increase of 1 in the predictor.

5. From this it follows that near the x where π(x) = .5 (i.e., x = −α/β), where the slope is β/4, 1/β approximates the distance between the x-value at which π(x) = .5 and the x-values at which π(x) = .25 or π(x) = .75.
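These interpretations can be verified with a few lines of arithmetic. The following is a minimal sketch with hypothetical values for `alpha` and `beta`; it compares the odds ratio with e^β and the 1/β rule of thumb with the exact distance, which is log(3)/β.

```python
import math

def pi(alpha, beta, x):
    """P(Y = 1 | X = x) under the logistic model."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def odds(p):
    return p / (1.0 - p)

alpha, beta = -2.0, 0.5  # hypothetical parameter values

# Points 1 and 2: the odds ratio for a unit increase equals exp(beta) at any x
x = 1.0
odds_ratio = odds(pi(alpha, beta, x + 1.0)) / odds(pi(alpha, beta, x))

# Point 4: tangent slope beta * pi * (1 - pi), evaluated where pi(x) = .5
x_half = -alpha / beta
p_half = pi(alpha, beta, x_half)
slope_at_half = beta * p_half * (1.0 - p_half)  # equals beta / 4

# Point 5: exact distance between pi = .5 and pi = .75 versus the 1/beta rule
x_75 = (math.log(3.0) - alpha) / beta  # logit(.75) = log(3)
exact_distance = x_75 - x_half
approx_distance = 1.0 / beta

print(f"odds ratio = {odds_ratio:.4f}, exp(beta) = {math.exp(beta):.4f}")
print(f"slope at pi = .5: {slope_at_half:.4f}")
print(f"distance .5 -> .75: exact {exact_distance:.3f}, approximation {approx_distance:.3f}")
```

The approximation 1/β is slightly smaller than the exact distance because the curve flattens as it moves away from π(x) = .5.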


Inference

Significance tests usually test H_0: β = 0. Possible tests (see class 1):

1. Wald statistic: $z = \hat{\beta}/\mathrm{SE}$. Under H_0, z² is approximately χ² distributed with df = 1.

2. Likelihood ratio statistic: twice the difference between the maximized log-likelihood at $\hat{\beta}$ and the maximized log-likelihood under β = 0. It is also approximately chi-square distributed with df = 1.

The likelihood ratio statistic is preferred over the Wald statistic. It uses more information and has more power.

Confidence intervals for β are usually more informative than tests. They are obtained by inverting a test: the interval consists of all values β_0 for which H_0: β = β_0 is not rejected.
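To make the two tests concrete, here is a small self-contained sketch. It is not the course's software: the data are made up, the function `fit_logistic` is a hypothetical helper that fits the one-predictor model by Newton-Raphson, and the p-values use the fact that a χ² variable with df = 1 is a squared standard normal.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Newton-Raphson ML fit of logit[pi(x)] = alpha + beta * x.

    Returns (alpha_hat, beta_hat, covariance matrix, maximized log-likelihood).
    """
    a, b = 0.0, 0.0
    for step in range(n_iter + 1):
        p = [1.0 / (1.0 + math.exp(-(a + b * xi))) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Information matrix and its inverse (the estimated covariance matrix)
        i00 = sum(w)
        i01 = sum(wi * xi for wi, xi in zip(w, x))
        i11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = i00 * i11 - i01 * i01
        cov = [[i11 / det, -i01 / det], [-i01 / det, i00 / det]]
        if step == n_iter:
            break  # p and cov are now evaluated at the final estimates
        # Score vector, then one Newton-Raphson update
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        a += cov[0][0] * g0 + cov[0][1] * g1
        b += cov[1][0] * g0 + cov[1][1] * g1
    loglik = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
                 for yi, pi in zip(y, p))
    return a, b, cov, loglik

# Made-up illustration data
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a_hat, b_hat, cov, loglik_1 = fit_logistic(x, y)

# Null model (beta = 0): the fitted probability is the sample proportion
p0 = sum(y) / len(y)
loglik_0 = sum(yi * math.log(p0) + (1 - yi) * math.log(1.0 - p0) for yi in y)

z = b_hat / math.sqrt(cov[1][1])   # Wald z
wald = z ** 2                      # Wald chi-square, df = 1
lr = 2.0 * (loglik_1 - loglik_0)   # likelihood ratio statistic, df = 1

# P(chi2_1 > s) = P(|Z| > sqrt(s)) = erfc(sqrt(s / 2))
p_wald = math.erfc(math.sqrt(wald / 2.0))
p_lr = math.erfc(math.sqrt(lr / 2.0))

print(f"beta_hat = {b_hat:.3f}, SE = {math.sqrt(cov[1][1]):.3f}")
print(f"Wald = {wald:.3f} (p = {p_wald:.4f}), LR = {lr:.3f} (p = {p_lr:.4f})")
```

With only ten observations the two statistics can differ noticeably; in large samples they tend to agree.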


Inference

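Often we would also like a confidence interval for the predicted probability $\hat{\pi}(x)$.

For a fixed value x = x_0, $\operatorname{logit}[\hat{\pi}(x_0)] = \hat{\alpha} + \hat{\beta} x_0$ has a large-sample standard error (SE) given by the square root of

$$\operatorname{var}(\hat{\alpha} + \hat{\beta} x_0) = \operatorname{var}(\hat{\alpha}) + x_0^2 \operatorname{var}(\hat{\beta}) + 2 x_0 \operatorname{cov}(\hat{\alpha}, \hat{\beta})$$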

The variances and covariances of the regression weights can be obtained from formula (5.20).

A 95% confidence interval for the logit is obtained by adding 1.96 × SE to, and subtracting it from, the estimated logit.

Transforming the endpoints of this interval gives a confidence interval for the probability:

$$\pi(x_0) = \frac{\exp(\text{logit})}{1 + \exp(\text{logit})}$$
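The steps above can be sketched as a short function. The estimates and the variances and covariance passed to it below are hypothetical numbers chosen for illustration; in practice they would come from the fitted model.

```python
import math

def prob_confidence_interval(alpha_hat, beta_hat, var_a, var_b, cov_ab, x0, z=1.96):
    """Confidence interval for pi(x0), built on the logit scale and back-transformed.

    Returns (lower, estimate, upper) on the probability scale.
    """
    logit_hat = alpha_hat + beta_hat * x0
    # Large-sample variance of alpha_hat + beta_hat * x0
    se = math.sqrt(var_a + x0 ** 2 * var_b + 2.0 * x0 * cov_ab)

    def expit(t):
        return math.exp(t) / (1.0 + math.exp(t))

    return expit(logit_hat - z * se), expit(logit_hat), expit(logit_hat + z * se)

# Hypothetical estimates and (co)variances
lower, estimate, upper = prob_confidence_interval(
    alpha_hat=-2.0, beta_hat=0.5, var_a=0.64, var_b=0.01, cov_ab=-0.07, x0=4.0)

print(f"pi_hat(x0) = {estimate:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

Because the interval is built on the logit scale, its endpoints always lie between 0 and 1, and it is generally not symmetric around the estimated probability unless that estimate is .5.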


Inference: Goodness-of-fit stats

In practice there is no guarantee that the model fits the data well.

But if no more complex model improves the fit, this is some evidence that the chosen model is reasonable.

Lack of fit is detected by searching for any way in which the model fails. The X² and G² statistics are used for this. The data must be grouped, so continuous variables have to be categorized.

An example is the Hosmer and Lemeshow statistic: partition the data into g groups of (approximately) equal size based on the predicted probabilities. Then form a contingency table of the groups against the two response categories and compare fitted and observed frequencies.

Such tests indicate lack of fit but give no insight into its nature.
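The grouping idea behind the Hosmer and Lemeshow statistic can be sketched as follows. This is a simplified illustration, not the SPSS implementation: the function name, the tie handling and the tiny data set are assumptions, and the reference distribution with df = g − 2 is the usual convention from the literature rather than something stated on these slides.

```python
def hosmer_lemeshow(y, p_hat, g=10):
    """Pearson-type statistic comparing observed and fitted counts in g groups
    formed by sorting the cases on their predicted probability.

    Returns (statistic, df) with df = g - 2.
    """
    pairs = sorted(zip(p_hat, y))
    n = len(pairs)
    stat = 0.0
    for k in range(g):
        group = pairs[k * n // g:(k + 1) * n // g]  # k-th group of near-equal size
        n_k = len(group)
        obs1 = sum(yi for _, yi in group)   # observed number of Y = 1
        exp1 = sum(pi for pi, _ in group)   # fitted number of Y = 1
        obs0, exp0 = n_k - obs1, n_k - exp1
        stat += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return stat, g - 2

# Made-up responses and predicted probabilities, for illustration only
y = [0, 0, 0, 1, 0, 1, 1, 1]
p_hat = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

stat, df = hosmer_lemeshow(y, p_hat, g=4)
print(f"Hosmer-Lemeshow statistic = {stat:.4f} with df = {df}")
```

A large value relative to the χ² distribution with the returned df signals lack of fit, but, as noted above, it does not say where the model fails.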


Categorical predictors

Categorical variables are often called factors. For a factor with categories i = 1, ..., I the model is

$$\log\frac{\pi_i}{1 - \pi_i} = \alpha + \beta_i$$

One must constrain the β_i, for example β_1 = 0 or Σ_i β_i = 0.

This is like the ANOVA model.


Categorical predictors

The same model can be written using dummy variables. A factor with I levels needs I − 1 dummy variables, as in multiple regression with dummy variables.

Example of dummy variables for a three-category factor:

Category    Effect coding    Dummy coding
            x_1    x_2       x_1    x_2
1            1      0         1      0
2            0      1         0      1
3           -1     -1         0      0

$$\log\frac{\pi_i}{1 - \pi_i} = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots$$

In effect coding the β_i represent deviations from a ‘mean’. In dummy coding the β_i denote deviations from the baseline group, for which the parameter is set to 0.


Categorical predictors

Effect coding corresponds to the constraint Σ_i β_i = 0 in the ANOVA set-up, whereas dummy coding corresponds to β_I = 0.

Depending on the dummies chosen, the interpretation of β_i changes. However, the model fit does not change.

Whatever constraint is chosen, $\hat{\alpha} + \hat{\beta}_i$ does not change, and so the probabilities remain the same.

The differences $\hat{\beta}_a - \hat{\beta}_b$ for any pair (a, b) represent estimated log odds ratios.
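The invariance claims can be illustrated for a single three-category factor. The sketch below assumes hypothetical category probabilities and uses the fact that, with one factor, α + β_i equals the logit of category i under either constraint.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

# Hypothetical success probabilities for the three categories
probs = [0.2, 0.5, 0.8]
logits = [logit(p) for p in probs]

# Dummy coding with the last category as baseline (beta_I = 0)
alpha_dummy = logits[-1]
beta_dummy = [l - alpha_dummy for l in logits]

# Effect coding (sum of the beta_i equals 0)
alpha_effect = sum(logits) / len(logits)
beta_effect = [l - alpha_effect for l in logits]

for i in range(3):
    print(f"category {i + 1}: "
          f"dummy alpha+beta = {alpha_dummy + beta_dummy[i]:.4f}, "
          f"effect alpha+beta = {alpha_effect + beta_effect[i]:.4f}")

# Differences between categories are the same under both codings
diff_dummy = beta_dummy[0] - beta_dummy[1]
diff_effect = beta_effect[0] - beta_effect[1]
log_odds_ratio = math.log((0.2 / 0.8) / (0.5 / 0.5))
print(f"beta_1 - beta_2: dummy {diff_dummy:.4f}, effect {diff_effect:.4f}, "
      f"log odds ratio {log_odds_ratio:.4f}")
```

The individual β_i differ between the two codings, but the sums α + β_i and the pairwise differences do not.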


Ordered Categorical predictors

If there are ordered categorical predictors for which we can find sensible scores (x_1, x_2, ..., x_I), these scores can be used and the predictor is treated as if it were of interval level.

Advantage: power increases if most of the relationship between predictor and logit is linear, because only one degree of freedom is used.

Disadvantage: when the relationship between the predictor and the logit is nonlinear, we lose valuable information.


Multiple predictors

As in ordinary regression, logistic regression extends to cases with multiple predictors. Let π(x) = P(Y = 1 | X_1 = x_1, X_2 = x_2, ..., X_p = x_p), then

$$\pi(\mathbf{x}) = \frac{\exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}{1 + \exp(\alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p)}$$

The parameter β_i refers to the effect of x_i on the log odds that Y = 1, controlling for the other x_j (i.e., keeping the other x_j fixed).
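What "controlling for the other x_j" means can be shown with a small calculation. The coefficients below are hypothetical; the sketch raises x_1 by one unit while keeping x_2 fixed and compares the resulting odds ratio with exp(β_1).

```python
import math

def pi_multi(alpha, betas, xs):
    """P(Y = 1 | x_1, ..., x_p) under the multiple logistic regression model."""
    eta = alpha + sum(b * x for b, x in zip(betas, xs))
    return math.exp(eta) / (1.0 + math.exp(eta))

def odds(p):
    return p / (1.0 - p)

alpha, betas = -1.0, [0.8, -0.3]  # hypothetical parameter values

# Increase x_1 by one unit while keeping x_2 fixed at 2.0
p_a = pi_multi(alpha, betas, [1.0, 2.0])
p_b = pi_multi(alpha, betas, [2.0, 2.0])
conditional_odds_ratio = odds(p_b) / odds(p_a)

print(f"conditional odds ratio = {conditional_odds_ratio:.4f}, "
      f"exp(beta_1) = {math.exp(betas[0]):.4f}")
```

The same odds ratio is obtained at any fixed value of x_2, because the model has no interaction term.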

The predictor variables can, of course, be categorical (dummy) or continuous. When all predictors are categorical, the data can be represented in contingency table format (the data have ‘grouped’ format).

With factors X and Z, the ANOVA-type model is written as

$$\log\frac{\pi_{ik}}{1 - \pi_{ik}} = \alpha + \beta_i^X + \beta_k^Z$$


Multiple predictors

Are predictors important?

1. Use the Wald statistic $\hat{\beta}^2/\mathrm{SE}^2$.

2. Use the likelihood ratio test. Compare two nested models, M_0 and M_1, with maximized log-likelihood values L_0 and L_1, respectively. Denote

$$G^2(M_0 \mid M_1) = -2(L_0 - L_1)$$

Assuming that model M_1 holds, G²(M_0 | M_1) tests M_0 against M_1. Under M_0 it has approximately a chi-squared distribution with df equal to the difference in the number of (independent!) parameters of the two models.
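The comparison of two nested models can be sketched in a few lines. The log-likelihood values and parameter counts below are hypothetical, and `chi2_sf` is a helper written for this illustration: it evaluates the chi-squared upper-tail probability for integer df with the standard closed-form series, so no statistics library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2.0) / r
            total += term
        return math.exp(-x / 2.0) * total
    # Odd df: normal tail plus a finite correction series
    chi = math.sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)  # chi^(2r-1) / (1 * 3 * ... * (2r-1))
        total += term
    phi = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * phi * total

def likelihood_ratio_test(loglik_0, loglik_1, n_params_0, n_params_1):
    """G^2(M0 | M1) = -2 (L0 - L1), its df, and the chi-squared p-value."""
    g2 = -2.0 * (loglik_0 - loglik_1)
    df = n_params_1 - n_params_0
    return g2, df, chi2_sf(g2, df)

# Hypothetical maximized log-likelihoods: M0 has 2 parameters, M1 has 4
g2, df, p_value = likelihood_ratio_test(-52.3, -48.1, 2, 4)
print(f"G2 = {g2:.2f}, df = {df}, p = {p_value:.4f}")
```

A small p-value indicates that the extra parameters of M_1 improve the fit by more than chance would explain, so the simpler model M_0 is rejected.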


SPSS

SPSS has a logistic regression program under

Analyze → Regression → Binary Logistic...

It provides many statistics, such as

1. many residuals

2. the Hosmer and Lemeshow statistic

3. influence diagnostics (to be discussed next week)

4. etc.


ℓEM

Program for categorical data analysis (free!). It can be found at:

http://www.uvt.nl/faculteiten/fsw/organisatie/departementen/mto/software2.html

This program is especially useful for the analysis of contingency tables, but it can do much more (see ‘examples’).
