Examining a Fitted Logistic Model

Deviance Test for Lack of Fit

The data below describe the male birth fraction (male births / total births) over the years 1931 to 1990.

A simple logistic model was fit as follows:

> glfit <- glm( cbind(mc,fc) ~ year, family=binomial)

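> summary(glfit)
. . . . .
    Null deviance: 80.252  on 59  degrees of freedom
Residual deviance: 78.385  on 58  degrees of freedom

> 1-pchisq(glfit$dev,glfit$df.resid)
[1] 0.03853689

The residual deviance is large relative to its 58 degrees of freedom (p = 0.039), which is evidence of lack of fit. The same test is obtained by comparing the fit with the saturated model, which has one parameter per year: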

> glfitSat <- glm( cbind(mc,fc) ~ factor(year), family=binomial)

> anova(glfit,glfitSat,test="Chi")
Analysis of Deviance Table

Model 1: cbind(mc, fc) ~ year
Model 2: cbind(mc, fc) ~ factor(year)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        58     78.385
2         0      0.000 58   78.385   0.03854 *
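The equivalence of the two tests above can be checked on simulated data. The following is a minimal sketch only: the yearly totals and the trend are hypothetical (they are not the birth data from the lecture), and the object names `fit`, `fitSat`, `p_dev` and `p_aod` are introduced here for illustration.

```r
# Simulated grouped binomial data (NOT the birth data from the lecture).
set.seed(536)
year <- 1931:1990
m    <- rep(5000, length(year))                 # hypothetical yearly totals
p    <- plogis(0.05 - 0.0005 * (year - 1960))   # hypothetical true trend
mc   <- rbinom(length(year), size = m, prob = p)
fc   <- m - mc

fit    <- glm(cbind(mc, fc) ~ year, family = binomial)
fitSat <- glm(cbind(mc, fc) ~ factor(year), family = binomial)

# Lack-of-fit test: residual deviance referred to a chi-square
# distribution on the residual degrees of freedom.
p_dev <- pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)

# The same test expressed as a comparison with the saturated model.
aod   <- anova(fit, fitSat, test = "Chisq")
p_aod <- aod[2, ncol(aod)]        # the p-value is the last column

c(deviance_test = p_dev, anova_test = p_aod)
```

Because the saturated model reproduces the data exactly, its deviance is zero and the two p-values agree.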

STAT 536 Lecture 16

Examination of Residuals

Here is a plot of the Pearson residuals, which are defined from the fitted values $\hat{y}_i = m_i \hat{\pi}_i$ (where $m_i$ is the total number of births in year $i$) as

$$ \frac{y_i - \hat{y}_i}{\sqrt{\mathrm{Var}(\hat{y}_i)}} \qquad (1) $$

with the variance estimated by $m_i \hat{\pi}_i (1 - \hat{\pi}_i)$.

Another commonly used form of residual is the deviance residual, defined as

$$ \pm\sqrt{2}\,\sqrt{\, y_i \log\frac{y_i}{\hat{y}_i} + (m_i - y_i)\log\frac{m_i - y_i}{m_i - \hat{y}_i} \,} \qquad (2) $$

taking the sign of $y_i - \hat{y}_i$. The plot below illustrates the convergence of the two definitions for large values of $m_i$.
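As a check on the two residual definitions, they can be computed by hand and compared with R's built-in `residuals()`. This is a sketch on simulated data: the group sizes `m` and the covariate `x` are hypothetical, and the code assumes every count satisfies 0 < y < m so that the logarithms are finite.

```r
set.seed(1)
m   <- c(20, 50, 200, 1000, 5000)          # hypothetical group sizes
x   <- seq(-1, 1, length.out = length(m))  # hypothetical covariate
y   <- rbinom(length(m), size = m, prob = plogis(0.2 + 0.3 * x))
fit <- glm(cbind(y, m - y) ~ x, family = binomial)

pihat <- fitted(fit)     # fitted probabilities
yhat  <- m * pihat       # fitted counts

# Pearson residual: (y - yhat) / sqrt(estimated binomial variance)
pearson_hand <- (y - yhat) / sqrt(m * pihat * (1 - pihat))

# Deviance residual: signed square root of each term of the deviance
# (pmax guards against tiny negative values from rounding).
d2 <- 2 * (y * log(y / yhat) + (m - y) * log((m - y) / (m - yhat)))
deviance_hand <- sign(y - yhat) * sqrt(pmax(d2, 0))

cbind(m,
      pearson  = pearson_hand,
      deviance = deviance_hand)
```

The hand-computed values should match `residuals(fit, type = "pearson")` and `residuals(fit, type = "deviance")`.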

The plot hints at the possible existence of non-linearity as a source of the lack of fit. A fifth-order polynomial fit yields the following plot.

The test of residual deviance yields the following:

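> 1-pchisq(glfitp$dev,glfit$df.resid)
[1] 0.1483860

Note that the call as printed uses the residual degrees of freedom of the linear fit (glfit$df.resid, 58). Assuming glfitp is the fifth-order polynomial fit to the same 60 years, it has four more parameters and therefore 54 residual degrees of freedom, so the p-value is properly computed with glfitp$df.resid.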

Using polynomials to model non-linearity can be problematic because of their non-robustness: the fitted curve can be strongly influenced by a few observations, particularly near the ends of the range. The use of splines is generally recommended instead.
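One way to follow this recommendation is to replace the polynomial by a natural cubic spline basis. The sketch below uses simulated data with a hypothetical curved trend (not the birth data), and the choice of 4 degrees of freedom for `ns()` is an assumption made for illustration. Each residual deviance is referred to the model's own residual degrees of freedom.

```r
library(splines)   # ships with R; provides ns()

set.seed(16)
year <- 1931:1990
m    <- rep(5000, length(year))                        # hypothetical totals
p    <- plogis(0.05 + 0.03 * sin((year - 1931) / 10))  # hypothetical curved trend
mc   <- rbinom(length(year), size = m, prob = p)
fc   <- m - mc

fitLin <- glm(cbind(mc, fc) ~ year, family = binomial)
fitNs  <- glm(cbind(mc, fc) ~ ns(year, df = 4), family = binomial)

# Residual deviance tests, each with the matching residual df.
pLin <- pchisq(deviance(fitLin), df.residual(fitLin), lower.tail = FALSE)
pNs  <- pchisq(deviance(fitNs),  df.residual(fitNs),  lower.tail = FALSE)

rbind(linear = c(df = df.residual(fitLin), deviance = deviance(fitLin), p = pLin),
      spline = c(df = df.residual(fitNs),  deviance = deviance(fitNs),  p = pNs))
```

The natural spline basis contains the straight line as a special case, so the spline fit can never have a larger deviance than the linear fit.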

Goodness of Fit tests in the absence of replication

A number of tests for lack of fit are available in CRAN packages, including library(MKmisc) and library(Design). The latter library (from Frank Harrell, author of Regression Modeling Strategies) provides its own functions for logistic regression.

We return to the ICU mortality (APACHE score) data from the previous lecture.

The most widely used (but not the most powerful) test is the Hosmer-Lemeshow test. The test statistic is calculated by first partitioning the observations by deciles of the fitted values, $\hat{\pi}_i$. Within each decile $j$, one calculates $O_j = \sum y_i$ and $E_j = \sum \hat{y}_i$ (summing over the observations in that group) and, letting $n_j$ represent the number in that group (which will be roughly $n/10$), we calculate

$$ H = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{n_j \bar{\pi}_j (1 - \bar{\pi}_j)} $$

where $\bar{\pi}_j = E_j / n_j$.

> attach(tdf)

> library(Design)

> dd <- datadist(tdf)

> options(datadist="dd")

> lrmFit <- lrm( discharge ~ reason*apache, x=TRUE,y=TRUE)

> library(MKmisc)

> HLgof.test( predict(lrmFit,type="fitted"), as.integer(lrmFit$y=="D"))

Hosmer-Lemeshow C statistic
X-squared = 1.9453, df = 8, p-value = 0.9826

Hosmer-Lemeshow H statistic
X-squared = 7.7566, df = 8, p-value = 0.4576

> residuals(lrmFit,type="gof")
Sum of squared errors     Expected value|H0                    SD
            7.8114008             7.6517094             0.2591992
                    Z                     P
            0.6160952             0.5378317
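The Hosmer-Lemeshow statistic defined above can also be computed directly from its formula. The following is a minimal sketch on simulated binary data (the ICU data are not reproduced here, and the variable names are hypothetical). It groups by deciles of the fitted values and refers the statistic to a chi-square distribution on 10 - 2 = 8 degrees of freedom, the same df reported in the HLgof.test output.

```r
set.seed(2024)
n <- 500
x <- rnorm(n)                                     # hypothetical covariate
y <- rbinom(n, size = 1, prob = plogis(-0.5 + x)) # hypothetical binary outcome
fit   <- glm(y ~ x, family = binomial)
pihat <- fitted(fit)

# Partition the observations into 10 groups by deciles of the fitted values.
breaks <- quantile(pihat, probs = seq(0, 1, by = 0.1))
grp    <- cut(pihat, breaks = breaks, include.lowest = TRUE)

O     <- tapply(y, grp, sum)          # observed events per group
E     <- tapply(pihat, grp, sum)      # expected events per group
nj    <- tapply(pihat, grp, length)   # group sizes, roughly n/10
pibar <- E / nj                       # mean fitted probability per group

H  <- sum((O - E)^2 / (nj * pibar * (1 - pibar)))
pH <- pchisq(H, df = 10 - 2, lower.tail = FALSE)

c(H = H, p.value = pH)
```

Since the data here are simulated from the fitted model form, a large p-value is the expected outcome in most runs.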

Assessing the Strength of Relationships in Logistic Regression

If one treats $y$ as representing a diagnostic result (1 = positive) and the fitted linear predictors $\hat{\eta}_i$ as a continuous diagnostic indicator, we can use the area under the ROC curve (AUC) to capture the strength of the relationship.

> rocPlot(lrmFit$y,predict(lrmFit))
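The AUC itself can be computed without a plotting function, since it equals the probability that a randomly chosen positive case has a larger linear predictor than a randomly chosen negative case (ties counting one half). The sketch below uses simulated data and hypothetical names; it computes the AUC from ranks and checks it against a brute-force comparison of all positive/negative pairs.

```r
set.seed(7)
n <- 400
x <- rnorm(n)                                  # hypothetical covariate
y <- rbinom(n, 1, plogis(-0.3 + 1.2 * x))      # hypothetical binary outcome
fit <- glm(y ~ x, family = binomial)
eta <- predict(fit)       # linear predictor (link scale is the default)

n1 <- sum(y == 1)
n0 <- sum(y == 0)

# Rank (Wilcoxon / Mann-Whitney) form of the AUC.
auc_rank <- (sum(rank(eta)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Brute-force form: compare every positive with every negative.
d <- outer(eta[y == 1], eta[y == 0], "-")
auc_pairs <- mean((d > 0) + 0.5 * (d == 0))

c(rank = auc_rank, pairs = auc_pairs)
```

An AUC of 0.5 corresponds to no discrimination and 1 to perfect separation of the two outcome groups.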

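Harrell’s library(Design) provides automatic re-scaling of explanatory variables to aid in interpreting the magnitude of logistic regression coefficients and odds ratios. In the output below, the effect of apache is reported for a change from Low (10) to High (24), and the Odds Ratio row is the exponential of the Effect row: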

> summary(lrmFit)
 Factor      Low High Diff. Effect S.E. 95% LL 95% UL
 apache      10  24   14    2.03   1.72 -1.33    5.40
  Odds Ratio 10  24   14    7.63     NA  0.26  221.17
[ output truncated ]
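The re-scaling in this table can be imitated with base R. The sketch below is an illustration under stated assumptions: the data are simulated, the model has a single predictor (unlike the reason*apache model above), and the Low and High values are taken to be the lower and upper quartiles. The effect is the coefficient multiplied by the Low-to-High difference, and the odds ratio and its limits are obtained by exponentiating.

```r
set.seed(99)
n      <- 300
apache <- round(runif(n, 0, 40))                  # hypothetical scores
y      <- rbinom(n, 1, plogis(-3 + 0.12 * apache))
fit    <- glm(y ~ apache, family = binomial)

q     <- unname(quantile(apache, c(0.25, 0.75)))  # assumed Low and High values
delta <- q[2] - q[1]                              # the "Diff." column
b     <- unname(coef(fit)["apache"])
se    <- sqrt(vcov(fit)["apache", "apache"])

effect <- delta * b                               # log odds ratio, Low -> High
ci     <- effect + c(-1, 1) * qnorm(0.975) * delta * se
or     <- exp(effect)                             # odds ratio
or_ci  <- exp(ci)                                 # Wald limits on the odds-ratio scale

round(c(Low = q[1], High = q[2], Diff = delta, Effect = effect,
        OR = or, OR.LL = or_ci[1], OR.UL = or_ci[2]), 3)
```

Reporting the effect over a realistic range of the predictor, rather than per unit, makes coefficients of variables on different scales easier to compare.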

Model Building Strategies

The key assumptions of the logistic regression model are

• independence of the y_i’s

• correct specification of the relationship between π_i and the explanatory variables

The latter depends on the validity of the link specification and on the appropriateness of the linear predictor. One particular issue that must be addressed is the potential utility of transforming continuous variables to improve the quality of the fit. Residual plots are often used for this purpose.

The data analysed below describe the occurrence of bleeding in patients enrolled in a clinical trial testing the efficacy of two protocols for treating blood clots (thromboses) using heparin, an anti-clotting drug. Bleeding is often a side-effect of heparin therapy. Physicians had noticed that older women seemed to be susceptible to bleeds. Weight is also a factor, as is the patient's innate clotting tendency, measured by the activated partial thromboplastin time (aPTT for short), which is the time taken for clots to form in a laboratory blood sample test. Patients with longer aPTT values are more susceptible to bleeding.

Here are deviance residual plots after fitting a model with age, sex, weight and aPTT.

The dichotomous nature of logistic regression residuals makes it almost impossible to discern any pattern in such plots. Generalized Additive Modeling, developed by Hastie and Tibshirani, is an alternative approach to examining functional form. Iterative non-parametric fits are performed using scatter-plot smoothing to estimate the additive components. The algorithm produces both estimates and an assessment of the statistical significance of deviation from linearity.

Call: gam(formula = any.bld ~ gender + s(weight) + s(age) + s(aptt0), family = binomial)

[ output truncated ]

            Df Npar Df Npar Chisq P(Chi)
(Intercept)  1
gender       1
s(weight)    1       3     4.6494 0.1994
s(age)       1       3     2.4205 0.4898
s(aptt0)     1       3     6.1242 0.1057
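The Npar columns above test each smooth term for deviation from linearity on 3 degrees of freedom. A rough analogue can be sketched with ordinary regression splines. This is a swapped-in technique, not the backfitting algorithm of the gam fit shown above: each linear term is replaced in turn by a 4-df natural spline (1 df for the linear part plus 3 for curvature) and the change in deviance is tested. The data are simulated and all variable names are hypothetical.

```r
library(splines)   # ships with R; provides ns()

set.seed(11)
n <- 600
d <- data.frame(age = runif(n, 30, 85), weight = runif(n, 45, 110))
# Hypothetical truth: curved in age, linear in weight.
d$bleed <- rbinom(n, 1, plogis(-4 + 0.0012 * (d$age - 30)^2 + 0.005 * d$weight))

fitLin <- glm(bleed ~ age + weight, family = binomial, data = d)
fitAge <- glm(bleed ~ ns(age, df = 4) + weight, family = binomial, data = d)
fitWt  <- glm(bleed ~ age + ns(weight, df = 4), family = binomial, data = d)

# Deviance test of the extra (non-linear) spline terms.
nonlin_test <- function(small, big) {
  chisq <- deviance(small) - deviance(big)
  df    <- df.residual(small) - df.residual(big)
  c(Df = df, Chisq = chisq, P = pchisq(chisq, df, lower.tail = FALSE))
}

tab <- rbind(age    = nonlin_test(fitLin, fitAge),
             weight = nonlin_test(fitLin, fitWt))
tab
```

As in the gam output, a large p-value for a term gives no evidence against entering that variable linearly.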