Examining a Fitted Logistic Model

Academic year: 2021

Deviance Test for Lack of Fit

The data below describe the male birth fraction (male births / total births) for the years 1931 to 1990.

A simple logistic model was fit as follows:

> glfit <- glm( cbind(mc,fc) ~ year, family=binomial)

> summary(glfit)
. . . . .
Null deviance: 80.252 on 59 degrees of freedom
Residual deviance: 78.385 on 58 degrees of freedom

> 1-pchisq(glfit$dev,glfit$df.resid)
[1] 0.03853689

> glfitSat <- glm( cbind(mc,fc) ~ factor(year), family=binomial)

> anova(glfit,glfitSat,test="Chi")
Analysis of Deviance Table

Model 1: cbind(mc, fc) ~ year
Model 2: cbind(mc, fc) ~ factor(year)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1        58     78.385
2         0      0.000 58   78.385   0.03854 *
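The two computations above give the same p-value because the saturated model has zero residual deviance, so comparing the linear model against it reproduces the residual deviance test. The following is a minimal sketch of that equivalence. The lecture's data are not reproduced here, so `mc`, `fc` and `year` are simulated under an assumed, purely hypothetical curved trend.

```r
# Hypothetical stand-in for the lecture data: simulated yearly counts of
# male (mc) and female (fc) births with a mildly curved trend in year.
set.seed(536)
year  <- 1931:1990
total <- rep(50000, length(year))
p     <- plogis(0.05 + 0.002 * ((year - 1960) / 10)^2)
mc    <- rbinom(length(year), total, p)
fc    <- total - mc

# Linear-in-year logistic model and the saturated model (one parameter per year)
glfit    <- glm(cbind(mc, fc) ~ year, family = binomial)
glfitSat <- glm(cbind(mc, fc) ~ factor(year), family = binomial)

# Route 1: refer the residual deviance to a chi-square distribution
p_direct <- pchisq(deviance(glfit), df.residual(glfit), lower.tail = FALSE)

# Route 2: analysis of deviance against the saturated model
tab     <- anova(glfit, glfitSat, test = "Chisq")
p_anova <- tab[2, ncol(tab)]   # the p-value is the last column of the table

c(direct = p_direct, anova = p_anova)
```

Using `lower.tail = FALSE` avoids the loss of precision that `1 - pchisq(...)` can suffer when the p-value is very small.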


STAT 536 Lecture 16

Examination of Residuals

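Here is a plot of the Pearson residuals. Writing $n_i$ for the total number of births in year $i$ and $\hat{y}_i = n_i\hat{\pi}_i$ for the fitted values, the Pearson residual is defined as

$$\frac{y_i - \hat{y}_i}{\sqrt{\widehat{\mathrm{Var}}(y_i)}}, \qquad \widehat{\mathrm{Var}}(y_i) = n_i\hat{\pi}_i(1 - \hat{\pi}_i). \tag{1}$$

Another commonly used form of residual is the deviance residual, defined as

$$\pm\sqrt{2}\,\sqrt{\,y_i \log\frac{y_i}{\hat{y}_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i - \hat{y}_i}\,}, \tag{2}$$

taking the sign of $y_i - \hat{y}_i$. The plot below illustrates the convergence of the two definitions for large values of $n_i$.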
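Both residual definitions can be checked numerically against R's built-in `residuals()` method. This is a sketch on simulated grouped binomial data; the group sizes, coefficients and the helper `xlogy` are assumptions made for illustration only. The success probability is deliberately kept away from 0.5 so that the two residual types differ visibly when the group size is small.

```r
# Hypothetical data: 30 small groups (size 20) and 30 large groups (size 20000)
set.seed(4)
n_i <- rep(c(20, 20000), each = 30)
x   <- rep(seq(-1, 1, length.out = 30), times = 2)
y   <- rbinom(length(n_i), n_i, plogis(-2 + 0.2 * x))

fit   <- glm(cbind(y, n_i - y) ~ x, family = binomial)
pihat <- unname(fitted(fit))
yhat  <- n_i * pihat

# Pearson residual: (y - yhat) / sqrt(n * pihat * (1 - pihat))
pearson <- (y - yhat) / sqrt(n_i * pihat * (1 - pihat))

# Deviance residual, using the convention 0 * log(0) = 0
xlogy <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
d2    <- 2 * (xlogy(y, yhat) + xlogy(n_i - y, n_i - yhat))
dev   <- sign(y - yhat) * sqrt(pmax(d2, 0))  # pmax guards against rounding below 0

# Agreement with R's own residuals
all.equal(pearson, unname(residuals(fit, type = "pearson")))
all.equal(dev,     unname(residuals(fit, type = "deviance")))

# Average absolute gap between the two definitions, by group size
tapply(abs(pearson - dev), n_i, mean)
```

The last line summarizes how far apart the two residuals are for the small and the large groups, which is the comparison the plot is meant to convey.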

The plot hints at non-linearity as a possible source of the lack of fit. A fifth-order polynomial fit yields the following plot.

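The test of residual deviance yields the following:

> 1-pchisq(glfitp$dev,glfit$df.resid)
[1] 0.1483860

Note that this call pairs the deviance of the polynomial fit glfitp with the residual degrees of freedom of the linear fit glfit (58). The polynomial fit has fewer residual degrees of freedom, so referring the same deviance to glfitp$df.resid would give a smaller p-value than the one printed.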

Using polynomials to model non-linearity can be problematic because polynomial fits are not robust. The use of splines is generally recommended instead.
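As an illustration of the spline alternative, the sketch below fits a natural cubic spline with the same number of parameters as a fifth-order polynomial and applies the residual deviance test to it. The data are simulated (hypothetical), and the choice of `ns()` from the `splines` package with `df = 5` is an assumption, not the lecture's own fit.

```r
library(splines)

# Hypothetical stand-in for the birth data, as the original is not available here
set.seed(16)
year  <- 1931:1990
total <- rep(50000, length(year))
mc    <- rbinom(length(year), total,
                plogis(0.05 + 0.002 * ((year - 1960) / 10)^2))
fc    <- total - mc

# Fifth-order polynomial versus a natural cubic spline with 5 basis functions
fit_poly <- glm(cbind(mc, fc) ~ poly(year, 5),    family = binomial)
fit_ns   <- glm(cbind(mc, fc) ~ ns(year, df = 5), family = binomial)

# Residual deviance test for each fit, each on its own residual df
gof <- function(fit) pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)
c(poly = gof(fit_poly), spline = gof(fit_ns))
```

Both fits use six parameters, so they leave the same residual degrees of freedom and their deviance tests are directly comparable.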

Goodness of Fit tests in the absence of replication

A number of tests for lack of fit are available in CRAN packages, including library(MKmisc) and library(Design). The latter library (from Frank Harrell, author of Regression Modeling Strategies, and since superseded on CRAN by rms) provides its own functions for logistic regression.

We return to the ICU mortality (APACHE score) data from the previous lecture.

The most widely used (but not the most powerful) test is the Hosmer-Lemeshow test. The test statistic is calculated by first partitioning the observations by deciles of the fitted values $\hat{\pi}_i$. Within each decile $j$, one calculates $O_j = \sum y_i$ and $E_j = \sum \hat{y}_i$, the sums running over the observations in that decile. Letting $n_j$ denote the number of observations in the group (roughly $n/10$), we calculate

$$H = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{n_j\,\bar{\pi}_j\,(1 - \bar{\pi}_j)}, \qquad \text{where } \bar{\pi}_j = E_j / n_j.$$
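The statistic can be computed directly from its definition. The following is a minimal sketch in base R on simulated binary data (the ICU data are not reproduced here, so `x` and `y` are hypothetical). It refers $H$ to a chi-square distribution on 10 - 2 = 8 degrees of freedom, the conventional reference distribution.

```r
# Hypothetical binary-response data
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))

fit   <- glm(y ~ x, family = binomial)
pihat <- fitted(fit)          # for binary data, yhat_i = pihat_i

# Partition the observations by deciles of the fitted values
breaks <- quantile(pihat, probs = seq(0, 1, by = 0.1))
grp    <- cut(pihat, breaks = breaks, include.lowest = TRUE)

O     <- tapply(y,     grp, sum)     # observed events per decile
E     <- tapply(pihat, grp, sum)     # expected events per decile
nj    <- tapply(y,     grp, length)  # group sizes, roughly n / 10
pibar <- E / nj

H       <- sum((O - E)^2 / (nj * pibar * (1 - pibar)))
p_value <- pchisq(H, df = length(nj) - 2, lower.tail = FALSE)
c(H = H, p.value = p_value)
```

Packaged implementations may handle ties and group boundaries differently, so small discrepancies from this hand computation are to be expected.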

> attach(tdf)
> library(Design)
> dd <- datadist(tdf)
> options(datadist="dd")
> lrmFit <- lrm( discharge ~ reason*apache, x=TRUE,y=TRUE)
> library(MKmisc)
> HLgof.test( predict(lrmFit,type="fitted"), as.integer(lrmFit$y=="D"))
        Hosmer-Lemeshow C statistic
X-squared = 1.9453, df = 8, p-value = 0.9826
        Hosmer-Lemeshow H statistic
X-squared = 7.7566, df = 8, p-value = 0.4576

> residuals(lrmFit,type="gof")
Sum of squared errors     Expected value|H0                    SD
            7.8114008             7.6517094             0.2591992
                    Z                     P
            0.6160952             0.5378317

None of the three statistics gives evidence of lack of fit (all p-values exceed 0.45).

Assessing the Strength of Relationships in Logistic Regression

If one treats y as representing a diagnostic result (1 = Positive) and the fitted linear predictors $\hat{\eta}_i$ as a continuous diagnostic indicator, we can use the area under the ROC curve (AUC) to capture the strength of the relationship.

> rocPlot(lrmFit$y,predict(lrmFit))
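The AUC can also be obtained without plotting, using its equivalence with the Mann-Whitney statistic: it is the probability that a randomly chosen positive case has a larger score than a randomly chosen negative case, with ties counted as one half. The sketch below uses base R only. The helper `auc_rank` and the simulated data are hypothetical and not part of any package mentioned above.

```r
# AUC via the rank-sum (Mann-Whitney) identity; ties get average ranks
auc_rank <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)   # number of positives
  n0 <- sum(y == 0)   # number of negatives
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical binary-response data and a fitted logistic model
set.seed(2)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(x))
fit <- glm(y ~ x, family = binomial)

eta_hat <- predict(fit)       # linear predictor, the diagnostic score
auc     <- auc_rank(y, eta_hat)
auc
```

Because the AUC depends only on the ranking of the scores, using the fitted probabilities instead of the linear predictor gives the same value.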

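Harrell’s library(Design) provides automatic re-scaling of explanatory variables to aid in interpreting the magnitude of logistic regression coefficients and odds ratios. In the summary below, the effect of apache is reported for a change from the Low value (10) to the High value (24) rather than per unit.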

> summary(lrmFit)
 Factor      Low High Diff. Effect S.E. 95% LL 95% UL
 apache      10  24   14    2.03   1.72 -1.33    5.40
  Odds Ratio 10  24   14    7.63   NA    0.26  221.17

[ output truncated ]
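The Odds Ratio row is the exponential of the Effect row, which is on the log-odds scale. A short check of that arithmetic, using only the rounded values printed above (so agreement with the table holds only up to rounding):

```r
# Values copied from the printed summary (log-odds scale, apache 10 -> 24)
effect <- 2.03
lower  <- -1.33
upper  <- 5.40
diff   <- 14

# Odds ratio and its confidence limits for the 14-point change
or <- exp(c(estimate = effect, lower = lower, upper = upper))
round(or, 2)

# Implied odds ratio per single APACHE point, assuming linearity in apache
or_per_unit <- exp(effect / diff)
or_per_unit
```

The per-unit figure is a derived quantity for interpretation only; the fitted model includes a reason by apache interaction, so it applies at the reference setting used by the summary.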

Model Building Strategies

The key assumptions of the logistic regression model are

• independence of the $y_i$’s

• correct specification of the relationship between $\pi_i$ and the explanatory variables

The latter depends on the validity of the link specification and on the appropriateness of the linear predictor. One particular issue that must be addressed is whether transforming continuous variables would improve the quality of the fit. Residual plots are often used for this purpose.

The data analysed below describe the occurrence of bleeding in patients enrolled in a clinical trial testing the efficacy of two protocols for treating blood clots (thromboses) using heparin, an anti-clotting drug. Bleeding is often a side-effect of heparin therapy. Physicians had noticed that older women seemed to be susceptible to bleeds. Weight is also a factor, as is the patient’s innate clotting tendency, measured by the activated partial thromboplastin time (aPTT for short), which is the time taken for clots to form in a laboratory blood sample test. Patients with longer aPTT values are more susceptible to bleeding.

Here are deviance residual plots after fitting a model with age, sex, weight and aPTT.

The dichotomous nature of logistic regression residuals makes it almost impossible to discern any pattern in such plots. Generalized Additive Modeling, developed by Hastie and Tibshirani, is an alternative approach to examining functional form. Iterative non-parametric fits are performed using scatter-plot smoothing to estimate the additive components. The algorithm produces both estimates and an assessment of the statistical significance of deviation from linearity.
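A sketch of this kind of analysis follows, with two caveats. First, it uses the `mgcv` package, which fits smooth terms by penalized regression splines rather than by the iterative backfitting described above, so it is a substitute technique and its output is laid out differently. Second, the trial data are not available here, so `bleed`, `age` and `weight` are simulated, with a curved age effect built in by assumption.

```r
library(mgcv)

# Hypothetical data: bleeding risk is curved in age and linear in weight
set.seed(3)
n      <- 600
age    <- runif(n, 30, 85)
weight <- rnorm(n, 75, 12)
eta    <- -3 + 0.0015 * (age - 55)^2 + 0.01 * (weight - 75)
bleed  <- rbinom(n, 1, plogis(eta))

# Linear logistic model versus additive model with smooth terms
fit_lin <- gam(bleed ~ age + weight,       family = binomial, method = "REML")
fit_gam <- gam(bleed ~ s(age) + s(weight), family = binomial, method = "REML")

# Effective degrees of freedom and approximate tests for each smooth term
smooth_tab <- summary(fit_gam)$s.table
smooth_tab

# Approximate analysis-of-deviance comparison of the two fits
anova(fit_lin, fit_gam, test = "Chisq")
```

An effective degrees of freedom close to 1 for a smooth term indicates that the data support little more than a straight line for that variable.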

Call: gam(formula = any.bld ~ gender + s(weight) + s(age) + s(aptt0), family = binomial)

[ output truncated ]

            Df Npar Df Npar Chisq P(Chi)
(Intercept)  1
gender       1
s(weight)    1       3     4.6494 0.1994
s(age)       1       3     2.4205 0.4898
s(aptt0)     1       3     6.1242 0.1057
