Poisson regression

6.4 Predicted counts

Expected or predicted counts (e.g. number of days of stay) can be calculated for a user-defined set of predictor values. We may predict on the basis of the model that a non-white HMO patient entering the hospital as an urgent admission has an expected length of stay of 12 days.

This is particularly easy to calculate since all predictors are binary. Thus

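intercept   hmo         urgent     linear predictor

β0        + β1(1)     + β2(1)    = xβ

2.3329    − 0.07155   + 0.2217   = 2.48305

Here β1 = −0.07155 is the hmo coefficient and β2 = 0.2217 is the urgent-admission coefficient; the white term vanishes because white = 0.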

which is the value of the linear predictor (xβ). Next we apply the inverse link to determine the predicted count, µ; in this case exp(2.48305)= 11.977741 or 12 days.
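The arithmetic above can be checked in a few lines of R. This is only a minimal sketch: the coefficient values are copied from the text, while the object names `b`, `x`, `xb`, and `mu` are illustrative and do not come from the original.

```r
# Coefficients quoted in the text (intercept, hmo, urgent admission)
b <- c(intercept = 2.3329, hmo = -0.07155, urgent = 0.2217)

# Covariate pattern: HMO patient, urgent admission. The leading 1 multiplies
# the intercept. white = 0, so its term drops out and is omitted here.
x <- c(1, 1, 1)

xb <- sum(b * x)   # linear predictor
mu <- exp(xb)      # inverse of the log link gives the expected count

print(xb)
print(mu)
```

Rounding `mu` to the nearest whole day gives the 12-day figure quoted above.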

Unfortunately, the model does not fit the data well, as we shall see. The observed values of los for patients meeting the above criteria are:

head(medpar, n=10)

. l los hmo white type* if hmo==1 & white==0 & type2==1, noob

los  hmo  white  type1  type2  type3
  1    1      0      0      1      0
 14    1      0      0      1      0
  3    1      0      0      1      0
 19    1      0      0      1      0

The four observed values of los are 1, 3, 14, and 19. The mean of these counts is (1 + 3 + 14 + 19)/4 = 9.25, which is substantially lower than the predicted value of 11.98, or 12.
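The comparison of the observed mean with the predicted count can be reproduced as follows. This is a small sketch using only the four values listed above; `obs_los` and `pred` are illustrative names.

```r
obs_los <- c(1, 14, 3, 19)   # the four observed stays listed above
pred    <- exp(2.48305)      # predicted count from the linear predictor

print(mean(obs_los))         # observed mean
print(mean(obs_los) - pred)  # observed minus predicted (negative here)
```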

Of interest to many researchers is the relationship of observed and expected or predicted counts. Various programs exist to automate the calculation and graphing of the relationship. I show a method described in Long and Freese (2006). The authors use the poisson command, which saves specific model statistics that can be subsequently used to create specialized tables and graphs. The command ereturn list displays the various saved statistics following Stata estimations. The glm command produces saved statistics as well, but many of the same statistics are saved using different names. In any case, using the Medicare model as described above, a researcher first models the data using the poisson command. We may create the model without displaying it on screen by using the quietly, abbreviated qui, command prefix:

glm(los ~ hmo + white + type2 + type3, family=poisson, data=medpar)

. qui poisson los hmo white type2 type3

Next we may develop a table of mean predicted vs observed counts for each day of los. For the medpar data, the number of days after 24 drops off sharply, with each los having only a few instances. We shall therefore develop a table of observed versus predicted proportions for days 0 through 25, followed by a graph of the relationship. With a mean value of the count, los, at 9.85, we expect a fairly normal-appearing predicted distribution.

For the table of observed versus predicted days, we shall multiply the proportions we obtain by 100, thereby rescaling them to percentages. We also calculate the difference between the two values to more easily observe the discrepancy of observed to predicted days. The code for developing the table of observed versus predicted LOS, their difference and Figure 6.2 is contained in the Stata do-file, medpar_obspred.do. It can easily be amended to create the same type of table and graph for other data situations. Note that a standalone command named prcounts, authored by Long and Freese (2006), can be used to develop a similar table and graph. One may use the following command to effect similar results:

. prcounts psn, plot max(25)

Readers may prefer using it to employing the code shown here for developing both the table and graph. However, I find the do-file easier to use and amend for the examples in the text.

Table 6.15 R: Observed vs predicted counts

library(COUNT)   # assumed source of medpar and poi.obs.pred()
rm(list=ls())
data(medpar)
attach(medpar)
mdpar <- glm(los ~ hmo+white+type2+type3, family=poisson, data=medpar)
poi.obs.pred(len=25, model=mdpar)

* Code to create table Observed v Predicted LOS
* medpar_obspred.do : also includes code for Fig 6.2
. predict mu
. local i 0
. local newvar "pr`i'"
*: Predicted probability at each los
. while `i' <= 25 {
      local newvar "pr`i'"
      qui gen `newvar' = exp(-mu)*(mu^`i')/exp(lnfactorial(`i'))
      local i = `i' + 1
  }
. quietly gen cnt = .
. quietly gen obpr = .
. quietly gen prpr = .
. local i 0
*: Observed and predicted los
. while `i' <= 25 {
      local obs = `i' + 1
      replace cnt = `i' in `obs'
      tempvar obser
      gen `obser' = (los==`i') /* generic = `e(depvar)' */
      sum `obser'
      replace obpr = r(mean) in `obs'
      sum pr`i'
      replace prpr = r(mean) in `obs'
      local i = `i' + 1
  }
*: Preparation for table
. gen obsprop = obpr*100 /* outcomes equal to # */
. gen preprop = prpr*100 /* average predicted prob */
. gen byte count = cnt
. gen diffprop = obsprop - preprop
. format obsprop preprop diffprop %8.3f
. l count obsprop preprop diffprop in 1/21

      count   obsprop   preprop   diffprop
  1.      0     0.000     0.012     -0.012
  2.      1     8.428     0.108      8.320
  3.      2     4.749     0.471      4.278
  4.      3     5.017     1.381      3.636
  5.      4     6.957     3.047      3.910
  6.      5     8.227     5.402      2.825
  7.      6     6.488     8.024     -1.536
  8.      7     7.759    10.280     -2.521
  9.      8     6.154    11.612     -5.458
 10.      9     4.950    11.766     -6.816
 11.     10     5.953    10.852     -4.899
 12.     11     4.682     9.231     -4.549
 13.     12     4.682     7.337     -2.654
 14.     13     2.876     5.523     -2.647
 15.     14     3.278     4.001     -0.723
 16.     15     2.742     2.842     -0.099
 17.     16     2.876     2.020      0.856
 18.     17     1.940     1.465      0.475
 19.     18     1.538     1.095      0.443
 20.     19     1.605     0.845      0.761
 21.     20     1.271     0.664      0.607

Rather severe underprediction occurs until day 6, at which time overprediction occurs until day 15. An interesting test may be given to determine if the difference between the observed and predicted probabilities by count, weighted by the number of observations per count, is statistically significant.
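The logic of the do-file can be sketched in base R without the medpar data. In this hedged example the data are simulated, an assumption made only so that the block runs on its own; the names `los_sim`, `x`, `mu_hat`, and `obs_pred` are illustrative, and the counts are drawn with `rnbinom` purely to mimic the kind of overdispersed outcome discussed here. It is not the author's code and will not reproduce the table above.

```r
set.seed(1)
n <- 1500
x <- rbinom(n, 1, 0.3)                      # one hypothetical binary predictor
# Simulated overdispersed counts (negative binomial), NOT the medpar data
los_sim <- rnbinom(n, size = 2, mu = exp(2.2 + 0.2 * x))

fit    <- glm(los_sim ~ x, family = poisson)
mu_hat <- fitted(fit)                       # predicted mean for each observation

counts <- 0:25
# Average Poisson probability of each count across all observations
pred_prop <- sapply(counts, function(k) mean(dpois(k, mu_hat)))
# Observed proportion of each count
obs_prop  <- sapply(counts, function(k) mean(los_sim == k))

obs_pred <- data.frame(count    = counts,
                       obsprop  = 100 * obs_prop,
                       preprop  = 100 * pred_prop,
                       diffprop = 100 * (obs_prop - pred_prop))
print(round(head(obs_pred, 21), 3))
```

Averaging `dpois(k, mu_hat)` over observations mirrors the `sum pr`i'` step of the do-file, and `mean(los_sim == k)` mirrors the indicator variable `obser`.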

A formatted graph of this same relationship can be given using the following code.

R: Graph of predicted versus observed LOS

plot(0:25, avgp, type="b", xlim=c(0,25),
     main = "Observed vs Predicted Days",
     xlab = "Days in Hospital", ylab = "Probability of LOS")
lines(0:25, c(0,propObsv), type = "b", pch = 2)
legend("topright", legend = c("Predicted Days","Observed Days"),
       lty = c(1,1), pch = c(1,2))

. label var prpr "Predicted days"
. label var obpr "Observed days"
. label var count "Days in Hospital"
. twoway scatter prpr obpr count, c(l l) ms(T d) ///
      title(Observed vs Predicted days) ytitle(Probability of LOS)

Figure 6.2 Observed versus predicted hospital days. [Plot titled "Observed vs Predicted days": Probability of LOS (y-axis, 0 to 0.15) against Days in Hospital (x-axis, 0 to 25), with one line for Predicted days and one for Observed days.]

There is a symbol at each day on the two lines in the graph. Again, note the substantial disparity of the days until day 15 is reached. The graph clearly displays the fact that the model underpredicts days until the sixth day, when it begins to overpredict for the next nine days. Predicted days greater than 15 fit well with the observed values from the data. This lack of fit can be expected when there is statistical evidence of overdispersion.

It is evident that the predicted Poisson days appear to approximate a normal or Gaussian distribution. We mentioned that this would be expected given the mean of los nearing 10. Poisson distributions with means from 4 to 10 and above appear Gaussian, with a right skew. A mean of 10 has only a slight skew (1/√10 ≈ 0.32) and a kurtosis only slightly higher than that of a Gaussian distribution (excess kurtosis 1/10).
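The Gaussian comparison can be checked numerically. As a sketch (the helper `pois_shape` is a hypothetical name, not from the text), the block below computes skewness and excess kurtosis directly from the Poisson pmf. For a Poisson distribution these equal 1/sqrt(mu) and 1/mu respectively, so both shrink toward the Gaussian value of 0 as the mean grows.

```r
# Shape statistics computed directly from the Poisson pmf over a wide support
pois_shape <- function(mu, support = 0:200) {
  p  <- dpois(support, mu)
  m  <- sum(support * p)                    # mean
  v  <- sum((support - m)^2 * p)            # variance
  sk <- sum((support - m)^3 * p) / v^1.5    # skewness
  ek <- sum((support - m)^4 * p) / v^2 - 3  # excess kurtosis (Gaussian = 0)
  c(mean = m, skewness = sk, excess_kurtosis = ek)
}

shape <- sapply(c(0.5, 1, 4, 10), pois_shape)
colnames(shape) <- c("mu=0.5", "mu=1", "mu=4", "mu=10")
print(round(shape, 3))
```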

A graphical representation should help visualize the various distributional shapes.

Table 6.16 R: Poisson distributions

m <- c(0.5, 1, 3, 5, 7, 9)   # Poisson means
y <- 0:19                    # Observed counts
layout(1)
for (i in 1:length(m)) {
  p <- dpois(y, m[i])        # Poisson pmf
  if (i == 1) {
    plot(y, p, col=i, type='l', lty=i)
  } else {
    lines(y, p, col=i, lty=i)
  }
}

. clear
. set obs 20
. gen y = _n-1
. gen mu = .
. gen mu0_5 = (exp(-.5)*.5^y)/exp(lngamma(y+1))
. forvalues i = 1(2)9 {
      gen mu`i' = (exp(-`i')*`i'^y)/exp(lngamma(y+1))
  }
. graph twoway connected mu0_5 mu1 mu3 mu5 mu7 mu9 y, ///
      title("Poisson Distributions") /* Figure 6.3 */

Figure 6.3 Poisson distributions with means at 0.5, 1, 3, 5, 7, 9. [Plot titled "Poisson Distributions": probability (y-axis, 0 to 0.6) against count y (x-axis, 0 to 20), with lines mu0_5, mu1, mu3, mu5, mu7, and mu9.]

A Poisson mean of under 1 produces a negative exponential slope, with a high probability (∼0.6 for a mean of 0.5) of a count being 0. Other means produce a distribution with its mean value near the point of maximum probability. The code used to determine the various probabilities entails the use of the Poisson PDF. The same relationship holds for the medpar study data. For a given observation:

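Linear predictor

$$x_i\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_n x_{in} \qquad (6.41)$$

Expected mean

$$\mu_i = \exp(x_i\beta) \qquad (6.42)$$

Probability

$$p_i = \frac{\exp(-\mu_i)\,\mu_i^{y_i}}{\exp\left(\ln\Gamma(y_i + 1)\right)} \qquad (6.43)$$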
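The probability in equation 6.43 can be written out by hand and compared with R's built-in Poisson pmf. This sketch assumes the denominator is exp(ln Γ(y + 1)) = y!, as in the lngamma(y+1) calls of the Stata code above; `pois_prob`, `means`, and `modes` are illustrative names, and the mean used for the check is the one from the worked example at the start of the section.

```r
# Poisson probability written out as in equation 6.43
pois_prob <- function(y, mu) exp(-mu) * mu^y / exp(lgamma(y + 1))

y  <- 0:19
mu <- exp(2.48305)             # expected mean from the worked example

p_formula <- pois_prob(y, mu)
p_builtin <- dpois(y, mu)      # R's built-in Poisson pmf
print(all.equal(p_formula, p_builtin))

# Probability of a zero count when the mean is 0.5
print(pois_prob(0, 0.5))

# Count with the highest probability for each of the plotted means
# (for integer means, counts mu - 1 and mu tie, so either may be reported)
means <- c(0.5, 1, 3, 5, 7, 9)
modes <- sapply(means, function(m) y[which.max(pois_prob(y, m))])
print(rbind(mean = means, mode = modes))
```

Working with `lgamma` rather than `factorial` keeps the denominator on the log scale until the final step, which is the same device the Stata listings use.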

In document Negative Binomial Regression (Page 136-142)