Testing the link - Handling apparent overdispersion

Poisson regression

7.2 Handling apparent overdispersion


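------------------------------------------------------------------------------
             |                 OIM
        pysq |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |    .366116   .0012164   300.98   0.000     .3637319    .3685001
          x2 |   -.660156   .0012297  -536.84   0.000    -.6625662   -.6577458
          x3 |   .2255974   .0012219   184.63   0.000     .2232026    .2279922
       _cons |   2.284213   .0015569  1467.20   0.000     2.281161    2.287264
------------------------------------------------------------------------------

. save odtest /// save simulated datasets in one file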

Parameter estimates now differ greatly from the true values. The dispersion statistics are both extremely high (Pearson dispersion = 1,850), as are the AIC and BIC statistics. The model is highly overdispersed. Note the difference created by not taking into account the quadratic nature of x1. Squaring x1, of course, results in the correct model.

7.2.6 Testing the link

The final correction we address in this chapter relates to a misspecification of link. Link specification tests are typically applied to members of the binomial family, e.g. evaluating if the logit, probit, complementary loglog, or loglog is the most appropriate link to use with a Bernoulli or binomial family GLM.

Misspecification for these sets of models is examined at length in Hilbe (2009). The test is valid, however, for any single-equation model. Termed a Tukey–Pregibon test in Hilbe (2009), the test is constructed by fitting the model, calculating the hat diagonal statistic, and running the model again with hat and hat-squared as the sole predictors. If hat-squared is significant at p < 0.05, the model is misspecified, and another link may be more appropriate.

Three links have been used with Poisson and negative binomial count models. They are listed in Table 7.3.

We shall delay specification tests for negative binomial models until later; for now we shall observe how data that are appropriate for an identity-linked Poisson are affected when estimated using a standard log-linked Poisson. Recall that the natural log-link is the Poisson canonical link, deriving directly from the Poisson PDF.
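That the log link is canonical can be seen by writing the Poisson probability function in exponential-family form. The short derivation below is a sketch in standard GLM notation; the symbols θ (canonical parameter) and b(θ) (cumulant) are introduced here for illustration and are not used elsewhere in this section.

```latex
\begin{aligned}
f(y;\mu) &= \frac{e^{-\mu}\,\mu^{y}}{y!}
          = \exp\{\, y\ln\mu \;-\; \mu \;-\; \ln y! \,\} \\
\theta   &= \ln\mu, \qquad b(\theta) = e^{\theta} = \mu \\
\eta     &= \theta = \ln\mu \quad\text{(canonical link)}
\end{aligned}
```

The term multiplying y in the exponent is ln(µ), which is why the log link arises directly from the Poisson PDF, whereas the identity link η = µ is a non-canonical choice.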

Table 7.3 GLM count model links

Link            Form                  Models
log             ln(µ)                 Poisson; NB2
canonical NB    −ln(1/(αµ) + 1)       NB-C
identity        µ                     Poisson; NB-I

The coefficients of an identity-linked Poisson model estimate, without exponentiation, the incidence rate difference between two contiguous levels of a predictor with respect to the count response. The notion is based on the relative risk difference that we discussed with respect to a binary response in Section 2.4. The logic is the same, except that it is now applied to counts. Rate difference models are rarely discussed in the statistical literature, but they are nevertheless viable models. When estimated using GLM software, the model setup is identical to that of a standard Poisson regression, except that the link is η = xb = µ, so that the inverse link is simply the linear predictor. The algorithm may generally be simplified owing to the identity of the linear predictor and fitted value. An abbreviated estimating algorithm for the identity Poisson model is displayed in Table 7.4.

Table 7.4 Identity Poisson GLM algorithm

dev = 0
µ = (y + mean(y))/2
η = µ
WHILE (abs(Δdev) > tolerance) {
    w = µ
    z = (y-µ)/µ
    β = (X’wX)⁻¹X’wz
    µ = X’β
    oldDev = dev
    dev = 2Σ{y ln(y/µ) - (y-µ)}
    Δdev = dev - oldDev
}
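The iteration summarized in Table 7.4 can be sketched in R. The sketch is not a transcription of the table: it assumes the textbook IRLS quantities for an identity link, namely weights w = 1/µ and working response z = y, and it starts from µ = mean(y) so that the first step is an ordinary least-squares fit. The function name irls_identity_poisson, the seed, and the simulated uniform predictors are hypothetical.

```r
# Sketch of IRLS for an identity-linked Poisson model.
# Assumptions (not taken from Table 7.4): w = 1/mu and z = y, which follow
# from V(mu) = mu and d(eta)/d(mu) = 1; start values mu = mean(y).
irls_identity_poisson <- function(X, y, tol = 1e-8, max_iter = 50) {
  mu  <- rep(mean(y), length(y))
  dev <- 0
  for (iter in seq_len(max_iter)) {
    w    <- 1 / mu                       # working weights
    z    <- y                            # working response (eta = mu)
    XtW  <- t(X * w)                     # X'W without forming diag(w)
    beta <- solve(XtW %*% X, XtW %*% z)  # weighted least-squares step
    mu   <- as.vector(X %*% beta)        # identity link: mu = eta = X beta
    if (any(mu <= 0)) stop("fitted mean left the positive range")
    old_dev <- dev
    # Poisson deviance; y * log(y / mu) is taken as 0 when y == 0
    dev <- 2 * sum(ifelse(y > 0, y * log(y / mu), 0) - (y - mu))
    if (abs(dev - old_dev) < tol) break
  }
  list(coefficients = drop(beta), deviance = dev, iterations = iter)
}

# Simulated data in the spirit of the section (hypothetical seed and size)
set.seed(4321)
nobs <- 5000
x1 <- runif(nobs); x2 <- runif(nobs); x3 <- runif(nobs)
xb  <- 1 + 0.5 * x1 - 0.75 * x2 + 0.25 * x3   # lies in [0.25, 1.75]
piy <- rpois(nobs, xb)

X   <- cbind(`(Intercept)` = 1, x1 = x1, x2 = x2, x3 = x3)
fit <- irls_identity_poisson(X, piy)

# Reference fit; start values are supplied so the first fitted means are valid
ref <- glm(piy ~ x1 + x2 + x3, family = poisson(link = "identity"),
           start = c(1, 0, 0, 0))
cbind(irls = fit$coefficients, glm = coef(ref))
```

If the sketch is sound, the two columns should agree closely, since glm() applies the same Fisher scoring scheme to the identity link.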

Using the same data as in the previous analyses, including the same value of xb, the linear predictor, an identity Poisson model may be constructed by using xb as the parameter to the rpoisson() function. The randomly generated counts, piy, will reflect the identity Poisson structure of the data. The variable piy is then used as the response, with x1, x2, and x3 as predictors, as before.

xb <- 1 + 0.5*x1 - 0.75*x2 + 0.25*x3
piy <- rpois(nobs, xb)
poy10 <- glm(piy ~ x1+x2+x3, family=poisson(link=identity))
summary(poy10)
confint(poy10)

. gen piy = rpoisson(xb)

. glm piy x1 x2 x3, fam(poi) link(iden)

Generalized linear models                  No. of obs      =      50000
Optimization     : ML                      Residual df     =      49996
                                           Scale parameter =          1
Deviance         =  56186.38884            (1/df) Deviance =   1.123818
Pearson          =  49682.03785            (1/df) Pearson  =   .9937203
. . .
                                           AIC             =   2.567087
Log likelihood   = -64173.16593            BIC             =  -484759.2
------------------------------------------------------------------------------
             |                 OIM
         piy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5123143   .0148118    34.59   0.000     .4832837    .5413449
          x2 |  -.7595172   .0148764   -51.06   0.000    -.7886744     -.73036
          x3 |   .2483579   .0148463    16.73   0.000     .2192597    .2774561
       _cons |   1.000588   .0137555    72.74   0.000     .9736275    1.027548
------------------------------------------------------------------------------

The parameter estimates are close to those we specified. The distribution of counts for the identity Poisson model is displayed in Table 7.5.

This sharp decrease in the frequency of counts is characteristic of the identity linked Poisson model. Note that the dispersion statistic approximates 1.0, as we would expect.
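The dispersion statistics reported in the Stata output can be recovered from a fitted glm object in R. The following is a minimal sketch: dispersion_stats is a hypothetical helper name, the data are simulated only so that the block runs on its own, and the start values passed to glm() are an assumption made to keep the first fitted means positive under the identity link.

```r
# Pearson and deviance dispersion statistics, each divided by residual df.
dispersion_stats <- function(model) {
  df <- df.residual(model)
  c(pearson  = sum(residuals(model, type = "pearson")^2) / df,
    deviance = deviance(model) / df)
}

# Simulated identity-Poisson data (hypothetical seed and sample size)
set.seed(2024)
nobs <- 5000
x1 <- runif(nobs); x2 <- runif(nobs); x3 <- runif(nobs)
xb  <- 1 + 0.5 * x1 - 0.75 * x2 + 0.25 * x3
piy <- rpois(nobs, xb)

fit <- glm(piy ~ x1 + x2 + x3, family = poisson(link = "identity"),
           start = c(1, 0, 0, 0))
d <- dispersion_stats(fit)
d
```

For data that are truly Poisson given the predictors, the Pearson value should lie near 1.0; values well above 1.0 are the indicator of overdispersion discussed throughout this chapter.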

Table 7.5 R: Table with Count, Freq, Prop, cumProp

piy <- poy10$y   # response vector stored in the fitted glm object
myTable <- function(x) {
  myDF <- data.frame(table(x))
  myDF$Prop <- prop.table(myDF$Freq)
  myDF$CumProp <- cumsum(myDF$Prop)
  myDF
}
myTable(piy)

. tab piy

        piy |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |     18,997       37.99       37.99
          1 |     17,784       35.57       73.56
          2 |      8,848       17.70       91.26
          3 |      3,197        6.39       97.65
          4 |        917        1.83       99.49
          5 |        199        0.40       99.88
          6 |         49        0.10       99.98
          7 |          8        0.02      100.00
          8 |          1        0.00      100.00
------------+-----------------------------------
      Total |     50,000      100.00

Now, suppose that we were given data with the above distribution of counts, together with the adjusters x1, x2, and x3. It is likely that we would first attempt a standard Poisson model. Fitting it appears as follows:

poy11 <- glm(piy ~ x1 + x2 + x3, family=poisson)
summary(poy11)
confint(poy11)

. glm piy x1 x2 x3, fam(poi)

Generalized linear models                  No. of obs      =      50000
Optimization     : ML                      Residual df     =      49996
                                           Scale parameter =          1
Deviance         =  56261.22688            (1/df) Deviance =   1.125315
Pearson          =   49706.1183            (1/df) Pearson  =   .9942019
. . .
                                           AIC             =   2.568583
Log likelihood   = -64210.58495            BIC             =  -484684.4
------------------------------------------------------------------------------
             |                 OIM
         piy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          x1 |   .5214119   .0155631    33.50   0.000     .4909088     .551915
          x2 |  -.7706432   .0156961   -49.10   0.000    -.8014069   -.7398794
          x3 |   .2563454   .0155219    16.52   0.000     .2259231    .2867677
       _cons |  -.0410198   .0142529    -2.88   0.004    -.0689549   -.0130847
------------------------------------------------------------------------------

The dispersion statistic is still close to 1.0, and the parameter estimates are somewhat close to those produced given the correct model. The difference in intercept values is, however, significant.

We may test the link using what is termed the Tukey–Pregibon link test (Hilbe, 2009). The test is performed by first modeling the data (e.g. as a Poisson model), then calculating the hat matrix diagonal statistic. The hat statistic is squared, and the response term is modeled on hat and hat-squared. If hat-squared is statistically significant, the link of the original model is not well specified. That is, the model is well specified if hat-squared is not significant.

The test is automated in Stata as the linktest command. Employed on the log-linked Poisson with identity-linked data, we clearly observe a misspecified link.

Table 7.6 R: Tukey-Pregibon link test

poitp <- glm(piy ~ x1+x2+x3, family=poisson)
hat <- hatvalues(poitp)
hat2 <- hat*hat
poy12 <- glm(piy ~ hat + hat2, family=poisson)
summary(poy12)
confint(poy12)

. qui poisson piy x1 x2 x3
. linktest

Poisson regression                         Number of obs   =      50000
                                           LR chi2(2)      =    3911.35
                                           Prob > chi2     =     0.0000
Log likelihood = -64173.639                Pseudo R2       =     0.0296
------------------------------------------------------------------------------
         piy |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        _hat |   1.015117   .0167228    60.70   0.000     .9823409    1.047893
      _hatsq |  -.4355458   .0512416    -8.50   0.000    -.5359776   -.3351141
       _cons |   .0323105   .0058467     5.53   0.000     .0208511    .0437699
------------------------------------------------------------------------------

With the dispersion statistic approximating 1.0, and significant predictors, we would normally believe that the model would appropriately fit the data. It does not. But we only know this by using a link specification test. Here the remedy is to employ an alternative link.
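Stata's linktest documentation describes _hat as the linear prediction from the fitted model and _hatsq as its square. The R sketch below follows that construction, on the assumption that it is the appropriate analogue of the Stata output shown in this section. The helper name link_test, the seed, and the simulated data are hypothetical, and the start values simply restart the refit at the original model's linear predictor.

```r
# Link test sketch: refit the response on the linear predictor and its square.
link_test <- function(model) {
  eta <- predict(model, type = "link")
  dat <- data.frame(y = model$y, hat = eta, hatsq = eta^2)
  refit <- glm(y ~ hat + hatsq, family = family(model), data = dat,
               start = c(0, 1, 0))   # eta = hat reproduces the original fit
  coef(summary(refit))
}

# Identity-Poisson data fitted with the (misspecified) log link
set.seed(7206)
nobs <- 50000
x1 <- runif(nobs); x2 <- runif(nobs); x3 <- runif(nobs)
xb  <- 1 + 0.5 * x1 - 0.75 * x2 + 0.25 * x3
piy <- rpois(nobs, xb)

poitp <- glm(piy ~ x1 + x2 + x3, family = poisson)
res   <- link_test(poitp)
res["hatsq", ]
```

A small p-value on the hatsq row points to a misspecified link, which is the reading applied to the _hatsq row of the linktest output.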

These examples show how apparent overdispersion may be corrected. The caveat here is that one should never employ another model designed for overdispersed count data until the model is evaluated for apparent overdispersion. A model may in fact be a well-fitted Poisson or negative binomial model once appropriate transformations have taken place. This is not always an easy task, but necessary when faced with indicators of overdispersion. Moreover, until overdispersion has been accommodated either by dealing with the model as above, or by applying alternative models, one may not simply accept seemingly significant p-values.

Although it has not been apparent from the examples we have used, overdispersion does often change the significance with which predictors are thought to contribute to the model. Standard errors may be biased either upwards or downwards. A model may appear well fitted yet be incorrectly specified. All of these checks for apparent overdispersion should be traversed prior to declaring a model well fitted, or needing adjustment because of real overdispersion.

In document Negative Binomial Regression (Page 172-177)