Examining a Fitted Logistic Model
Deviance Test for Lack of Fit
The data below describe the male birth fraction (male births / total births) over the years 1931 to 1990.
A simple logistic model was fit as follows:
> glfit <- glm( cbind(mc,fc) ~ year, family=binomial)
> summary(glfit)
. . . . .
Null deviance: 80.252 on 59 degrees of freedom
Residual deviance: 78.385 on 58 degrees of freedom
> 1-pchisq(glfit$dev,glfit$df.resid)
[1] 0.03853689
> glfitSat <- glm( cbind(mc,fc) ~ factor(year), family=binomial)
> anova(glfit,glfitSat,test="Chi")
Analysis of Deviance Table
Model 1: cbind(mc, fc) ~ year
Model 2: cbind(mc, fc) ~ factor(year)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        58     78.385
2         0      0.000 58   78.385   0.03854 *
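Both calls give the same p-value because the saturated model reproduces the data exactly (residual deviance 0 on 0 degrees of freedom), so the deviance difference equals the residual deviance of glfit. As a quick check, the tail probability can be recomputed from the rounded values printed above. The variable names are illustrative only, and the result agrees with the output only up to rounding of the deviance.

```r
# Values copied from the summary(glfit) output above (rounded as printed)
resid_dev <- 78.385   # residual deviance of the linear-in-year model
resid_df  <- 58       # 60 years minus 2 fitted coefficients

# Upper-tail chi-square probability; lower.tail = FALSE avoids forming 1 - p
p_lack_of_fit <- pchisq(resid_dev, df = resid_df, lower.tail = FALSE)
print(p_lack_of_fit)
```

A small p-value here suggests that a straight line in year on the logit scale leaves more variation than binomial sampling alone would explain.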
STAT 536 Lecture 16
Examination of Residuals
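Here is a plot of the Pearson residuals, which are defined in terms of the fitted values $\hat{y}_i = n_i\hat{\pi}_i$ (where $n_i$ is the total number of births in year $i$) as
$$ r_i = \frac{y_i - \hat{y}_i}{\sqrt{\widehat{\mathrm{Var}}(y_i)}}, \qquad \widehat{\mathrm{Var}}(y_i) = n_i\hat{\pi}_i(1 - \hat{\pi}_i). \tag{1} $$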
Another commonly used form of residual is the deviance residual, defined as
$$ d_i = \pm\sqrt{2}\,\sqrt{\, y_i \log\frac{y_i}{\hat{y}_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i - \hat{y}_i} \,} \tag{2} $$
taking the sign of $y_i - \hat{y}_i$. The plot below illustrates the convergence of the two definitions for large values of $n_i$.
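The two definitions can be checked numerically against R's own residuals() method. The sketch below uses simulated counts, since the birth data are not reproduced here: the yearly totals of 5000 and the trend coefficients are assumptions made purely for illustration.

```r
set.seed(16)
year <- 1931:1990
n <- rep(5000, length(year))                  # hypothetical yearly totals
y <- rbinom(length(year), n, plogis(0.05 - 0.0005 * (year - 1960)))

fit    <- glm(cbind(y, n - y) ~ year, family = binomial)
pi_hat <- fitted(fit)        # fitted probabilities
y_hat  <- n * pi_hat         # fitted counts

# Equation (1): Pearson residuals
r_pearson <- (y - y_hat) / sqrt(n * pi_hat * (1 - pi_hat))

# Equation (2): deviance residuals, signed by y - y_hat
r_dev <- sign(y - y_hat) *
  sqrt(2 * (y * log(y / y_hat) + (n - y) * log((n - y) / (n - y_hat))))

# Compare with R's built-in versions, and with each other
all.equal(unname(r_pearson), unname(residuals(fit, type = "pearson")))
all.equal(unname(r_dev),     unname(residuals(fit, type = "deviance")))
max(abs(r_pearson - r_dev))  # small when the n_i are large
```

Rerunning with much smaller totals (say n of 20) should make the gap between the two residual types visibly larger.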
The plot hints at possible non-linearity as a source of the lack of fit. A fifth-order polynomial fit yields the following plot.
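The test of residual deviance yields the following:
> 1-pchisq(glfitp$dev,glfit$df.resid)
[1] 0.1483860
Note that this call takes its degrees of freedom from glfit (58) rather than from glfitp. A fifth-order polynomial uses four more parameters than the linear fit, leaving 54 residual degrees of freedom, so the p-value computed with glfitp$df.resid would be somewhat smaller than the one shown.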
Using polynomials to model non-linearity can be problematic because polynomial fits are not robust.
The use of splines is generally recommended instead.
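As a sketch of the spline alternative, a natural cubic spline in year can take the place of the polynomial term. The data below are simulated, and the choice of 4 degrees of freedom for ns() is an assumption made for illustration, not a recommendation from the lecture. The residual degrees of freedom are taken from the spline fit itself.

```r
library(splines)   # ships with R; provides ns()

set.seed(536)
year <- 1931:1990
n <- rep(5000, length(year))                  # hypothetical yearly totals
# Hypothetical smooth, non-linear trend on the logit scale
eta <- 0.05 + 0.02 * sin((year - 1931) / 10)
y <- rbinom(length(year), n, plogis(eta))

fit_spline <- glm(cbind(y, n - y) ~ ns(year, df = 4), family = binomial)

# Residual deviance test using the spline fit's own deviance and df
p_spline <- pchisq(deviance(fit_spline), df.residual(fit_spline),
                   lower.tail = FALSE)
c(df = df.residual(fit_spline), p = p_spline)
```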
Goodness of Fit tests in the absence of replication
A number of tests for lack of fit are available in CRAN packages, including library(MKmisc) and library(Design). The latter library (from Frank Harrell, author of Regression Modeling Strategies) provides its own functions for logistic regression.
We return to the ICU mortality (APACHE score) data from the last lecture.
The most widely used (but not the most powerful) test is the Hosmer-Lemeshow test. The test statistic is calculated by first partitioning the observations by deciles of the fitted values, $\hat{\pi}_i$. Within each decile $j$, one calculates $O_j = \sum y_i$ and $E_j = \sum \hat{y}_i$ and, letting $n_j$ represent the number of observations in that group (which will be roughly $n/10$), we calculate
$$ H = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{n_j\bar{\pi}_j(1 - \bar{\pi}_j)} $$
where $\bar{\pi}_j = E_j/n_j$.
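The steps above can be sketched in base R. The data are simulated stand-ins for the ICU data, all variable names are hypothetical, and the reference distribution (chi-square on 10 - 2 = 8 degrees of freedom) is the conventional Hosmer-Lemeshow choice rather than something derived in these notes.

```r
set.seed(536)
n_obs <- 500
x <- rnorm(n_obs)                                  # hypothetical covariate
y <- rbinom(n_obs, 1, plogis(-0.5 + x))            # binary outcome

fit    <- glm(y ~ x, family = binomial)
pi_hat <- fitted(fit)                              # equals y_hat for binary data

# Partition observations by deciles of the fitted probabilities
breaks <- quantile(pi_hat, probs = seq(0, 1, by = 0.1))
grp    <- cut(pi_hat, breaks = breaks, include.lowest = TRUE)

O_j    <- tapply(y, grp, sum)                      # observed events per group
E_j    <- tapply(pi_hat, grp, sum)                 # expected events per group
n_j    <- tapply(y, grp, length)                   # group sizes, roughly n/10
pi_bar <- E_j / n_j

H    <- sum((O_j - E_j)^2 / (n_j * pi_bar * (1 - pi_bar)))
p_HL <- pchisq(H, df = length(O_j) - 2, lower.tail = FALSE)
c(H = H, df = length(O_j) - 2, p = p_HL)
```

With tied fitted values (for example, a model with only categorical predictors) the quantile breaks may not be unique, and the grouping step would need adjusting.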
> attach(tdf)
> library(Design)
> dd <- datadist(tdf)
> options(datadist="dd")
> lrmFit <- lrm( discharge ~ reason*apache, x=TRUE,y=TRUE)
> library(MKmisc)
> HLgof.test( predict(lrmFit,type="fitted"), as.integer(lrmFit$y=="D"))
Hosmer-Lemeshow C statistic
X-squared = 1.9453, df = 8, p-value = 0.9826
Hosmer-Lemeshow H statistic
X-squared = 7.7566, df = 8, p-value = 0.4576
> residuals(lrmFit,type="gof")
Sum of squared errors     Expected value|H0                    SD
            7.8114008             7.6517094             0.2591992
                    Z                     P
            0.6160952             0.5378317
Assessing the Strength of Relationships in Logistic Regression
If one treats $y$ as representing a diagnostic result (1 = Positive) and the fitted η’s (i.e. the $\hat{\eta}_i$’s) as a continuous diagnostic indicator, we can use the area under the ROC curve (AUC) to capture the strength of the relationship.
> rocPlot(lrmFit$y,predict(lrmFit))
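One way to compute the AUC without a plotting package is through its equivalence with the Wilcoxon-Mann-Whitney statistic: the AUC is the proportion of (positive, negative) pairs in which the positive case has the larger score, with ties counted as one half. The function name auc_rank below is hypothetical, and the sketch assumes y is coded 0/1.

```r
# AUC from ranks: y is a 0/1 outcome, score is the linear predictor (or any
# monotone transformation of it, since only the ordering matters)
auc_rank <- function(y, score) {
  r  <- rank(score)                 # average ranks handle ties
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_rank(c(0, 0, 1, 1), c(1, 2, 3, 4))   # perfect separation
auc_rank(c(0, 1, 0, 1), c(1, 2, 3, 4))   # 3 of 4 pairs correctly ordered
```

Because only ranks enter, the result is the same whether the linear predictor or the fitted probability is used as the score.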
Harrell’s library(Design) provides automatic re-scaling of explanatory variables to aid in interpreting the magnitude of logistic regression coefficients and odds ratios:
> summary(lrmFit)
Factor        Low  High  Diff.  Effect  S.E.  95% LL  95% UL
apache         10    24     14    2.03  1.72   -1.33    5.40
 Odds Ratio    10    24     14    7.63    NA    0.26  221.17
[ output truncated ]
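The rows of this table are linked by simple arithmetic: Effect is the change in log odds when apache moves from the Low to the High value, the odds ratio is exp(Effect), and the confidence limits are Effect plus or minus a normal quantile times S.E., transformed the same way. The check below uses the rounded values printed above, so agreement with the table is only approximate.

```r
effect <- 2.03   # rounded log-odds effect copied from the apache row
se     <- 1.72   # rounded standard error copied from the apache row
z      <- qnorm(0.975)                      # about 1.96 for a 95% interval

ci_log_odds <- effect + c(-1, 1) * z * se   # compare with -1.33 and 5.40
odds_ratio  <- exp(effect)                  # compare with 7.63
ci_or       <- exp(ci_log_odds)             # compare with 0.26 and 221.17

round(c(ci_log_odds, odds_ratio, ci_or), 2)
```

The interval for the odds ratio spans 1, which matches the interval for the log-odds effect spanning 0.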
Model Building Strategies
The key assumptions of the logistic regression model are
• independence of the $y_i$’s
• correct specification of the relationship between $\pi_i$ and the explanatory variables
The latter depends on the validity of the link specification and on the appropriateness of the linear predictor. One particular issue that must be addressed is the potential utility of transforming continuous variables to improve the quality of the fit. Residual plots are often used for this purpose.
The data analysed below describe the occurrence of bleeding in patients enrolled in a clinical trial testing the efficacy of two protocols for treating blood clots (thromboses) using heparin, an anti-clotting drug. Bleeding is often a side-effect of heparin therapy. Physicians had noticed that older women seemed to be susceptible to bleeds. Weight is also a factor, as is the patient’s innate clotting tendency, measured by activated partial thromboplastin time (aPTT for short), which is the time taken for clots to form in a laboratory blood sample test. Patients with longer aPTT values are more susceptible to bleeding.
Here are deviance residual plots after fitting a model with age, sex, weight and aPTT.
The dichotomous nature of logistic regression residuals makes it almost impossible to discern any pattern in such plots. Generalized Additive Modeling is an alternative approach to examining functional form, developed by Hastie and Tibshirani. Iterative non-parametric fits are performed using scatter-plot smoothing to estimate the additive components. The algorithm produces both estimates and an assessment of the statistical significance of deviation from linearity.
Call: gam(formula = any.bld ~ gender + s(weight) + s(age) + s(aptt0), family = binomial)
[ output truncated ]
            Df Npar Df Npar Chisq P(Chi)
(Intercept)  1
gender       1
s(weight)    1       3     4.6494 0.1994
s(age)       1       3     2.4205 0.4898
s(aptt0)     1       3     6.1242 0.1057
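The Npar Chisq column tests, for each smooth term, whether the non-parametric part adds anything beyond a straight line. A rough parametric analogue, which is not the gam backfitting algorithm itself, compares a linear term with a natural cubic spline by a deviance test. The data and variable names below are simulated stand-ins for the trial data, and ns(age, df = 4) is an assumed choice that leaves 3 degrees of freedom for non-linearity.

```r
library(splines)   # ships with R; provides ns()

set.seed(81)
n_pat <- 400
age   <- runif(n_pat, 30, 85)                        # hypothetical covariate
bleed <- rbinom(n_pat, 1, plogis(-3 + 0.03 * age))   # truly linear in age

fit_lin <- glm(bleed ~ age, family = binomial)
fit_ns  <- glm(bleed ~ ns(age, df = 4), family = binomial)

# Deviance test of the spline fit against the nested linear fit
nonlin_test <- anova(fit_lin, fit_ns, test = "Chisq")
nonlin_test
```

A large p-value in the second row, as in the gam output above, gives no evidence that the term needs more than a linear effect on the logit scale.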