Other Models - Basic Count Regression - Regression Analysis of Count Data

Basic Count Regression

3.7 Other Models

In this section we consider whether least-squares methods might be usefully applied to count data y. Three variations of least squares are considered. The first is linear regression of y on x, making no allowance for the count nature of the data aside from using heteroskedasticity-robust standard errors. The second is linear regression of a nonlinear transformation of y on x, for which the transformation leads to a dependent variable that is close to homoskedastic and symmetric. Third, we consider nonlinear least squares regression with conditional mean of y specified to be $\exp(x'\beta)$.

The section finishes with a discussion of estimation using duration data, rather than count data, if the data are generated by a Poisson process.

3.7.1 OLS without Transformation

The OLS estimator is clearly inappropriate as it specifies a conditional mean function $x'\gamma$ that may take negative values and a variance function that is homoskedastic. If the conditional mean function is in fact $\exp(x'\beta)$, the OLS estimator is inconsistent for β and the computed OLS output gives the wrong asymptotic variance matrix.

Nonetheless, OLS estimates in practice give results qualitatively similar to those for Poisson and other estimators using the exponential mean. The ratio of OLS slope coefficients is often similar to the ratio of Poisson slope coefficients, with the OLS slope coefficients approximately $\bar{y}$ times the Poisson slope coefficients, and the most highly statistically significant regressors from OLS regression, using usual OLS output t statistics, are in practice the most highly significant using Poisson regression. This is similar to comparing different models for binary data such as logit, probit, and OLS. In all cases the conditional mean is restricted to be of form $g(x'\beta)$, which is a monotonic transformation of a linear combination of the regressors. The only difference across models is the choice of function g, which leads to a different scaling of the parameters β.

A first-order Taylor series expansion of the exponential mean $\exp(x'\beta)$ around the sample mean $\bar{y}$, that is, around $x'\beta = \ln \bar{y}$, yields $\exp(x'\beta) \simeq \bar{y} + \bar{y}(x'\beta - \ln \bar{y})$. For models with intercept, this can be rewritten as $\exp(\beta_1 + x_2'\beta_2) \simeq \gamma_1 + x_2'\gamma_2$, where $\gamma_1 = \bar{y} + \beta_1\bar{y} - \bar{y}\ln \bar{y}$ and $\gamma_2 = \beta_2\bar{y}$. So linear mean slope coefficients are approximately $\bar{y}$ times exponential slope coefficients. This approximation will be more reasonable the less dispersed the predicted values $\exp(x_i'\hat{\beta})$ are about $\bar{y}$.
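The scaling of slope coefficients can be checked on simulated data. The following is a minimal sketch, not an analysis from the text: the sample size, seed, true coefficients (0.5 and 0.3), the single normally distributed regressor, and the helper name `fit_poisson` are all illustrative assumptions. It fits OLS and a Poisson regression (by Newton-Raphson) to the same data and compares the ratio of the two slopes with the sample mean of y.

```python
import numpy as np

def fit_poisson(X, y, iters=50, tol=1e-10):
    """Poisson regression with exponential mean, fitted by Newton-Raphson.

    Assumes (for the starting value only) that column 0 of X is the constant.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())              # intercept-only fit as starting value
    for _ in range(iters):
        mu = np.exp(X @ beta)
        score = X.T @ (y - mu)              # Poisson first-order conditions
        neg_hess = X.T @ (X * mu[:, None])  # minus the Hessian
        step = np.linalg.solve(neg_hess, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data (hypothetical design, chosen only for illustration)
rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(0.5 + 0.3 * x)).astype(float)

gamma_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # linear-mean coefficients
beta_pois = fit_poisson(X, y)                      # exponential-mean coefficients
slope_ratio = gamma_ols[1] / beta_pois[1]

print(f"ybar = {y.mean():.3f}")
print(f"OLS slope / Poisson slope = {slope_ratio:.3f}")
```

In this design the ratio of slopes should land close to the sample mean of y; with more dispersed fitted means the agreement would be expected to deteriorate, in line with the caveat above.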

The OLS estimator can be quite useful for preliminary data analysis, such as determining key variables, in simple count models. Dealing with more complicated count models for which no off-the-shelf software is readily available is easier if one first ignores the count aspect of the data and does the corresponding adjustment to OLS. For example, if the complication is endogeneity, then do linear two-stage least squares as a potential guide to the impact of endogeneity. But experience is sufficiently limited that one cannot advocate this approach.

3.7.2 OLS with Transformation

For skewed continuous data such as that on individual income or on housing prices a standard transformation is the log transformation. For example, if y is log-normal-distributed then ln y is by definition exactly normally distributed, so the log transformation induces constant variance and eliminates skewness.

The log transformation may also be used for count data, which are often skewed. Because ln 0 is not defined, a standard solution is to add a constant term, such as 0.5, and to model $\ln(y + 0.5)$ by OLS. This model has been criticized by King (1989b) as performing poorly.

An alternative transformation is the square-root transformation. Following McCullagh and Nelder (1989, p. 236), let $y = \mu(1 + \varepsilon)$. Then a fourth-order Taylor series expansion around ε = 0 yields

\[
y^{1/2} \simeq \mu^{1/2}\left(1 + \tfrac{1}{2}\varepsilon - \tfrac{1}{8}\varepsilon^{2} + \tfrac{1}{16}\varepsilon^{3} - \tfrac{5}{128}\varepsilon^{4}\right).
\]

For the Poisson, $\varepsilon = (y - \mu)/\mu$ has first four moments 0, $1/\mu$, $1/\mu^{2}$, and $(3/\mu^{2} + 1/\mu^{3})$. It follows that

\[
E[\sqrt{y}] \simeq \sqrt{\mu}\left(1 - \frac{1}{8\mu} + O(1/\mu^{2})\right),
\]
\[
V[\sqrt{y}] \simeq \frac{1}{4}\left(1 + \frac{3}{8\mu} + O(1/\mu^{2})\right),
\]
\[
E\left[(\sqrt{y} - E[\sqrt{y}])^{3}\right] \simeq -\frac{1}{16\sqrt{\mu}}\left(1 + O(1/\mu)\right).
\]

Thus if y is Poisson then $\sqrt{y}$ is close to homoskedastic and is close to symmetric. The skewness index is the third central moment divided by variance raised to the power 1.5. Here it is approximately $-(1/(16\sqrt{\mu}))/(1/4)^{1.5} = -1/(2\sqrt{\mu})$. By comparison, for the Poisson y is heteroskedastic with variance µ and asymmetric with skewness index $1/\sqrt{\mu}$. The square-root transformation works quite well for large µ.
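The moment approximations for $\sqrt{y}$ can be compared with simulated values. This is a minimal sketch under assumed settings (the means 20, 50, and 200, the number of draws, and the seed are arbitrary choices, not values from the text). It contrasts the variance of y, which grows with µ, with the variance of $\sqrt{y}$, which stays near 1/4.

```python
import numpy as np

rng = np.random.default_rng(0)
results = {}
for mu in (20, 50, 200):                      # illustrative Poisson means
    y = rng.poisson(mu, size=200_000)
    r = np.sqrt(y)
    results[mu] = {
        "var_y": y.var(),                                      # roughly mu
        "mean_sqrt": r.mean(),
        "mean_sqrt_approx": np.sqrt(mu) * (1 - 1 / (8 * mu)),  # leading terms of E[sqrt(y)]
        "var_sqrt": r.var(),
        "var_sqrt_approx": 0.25 * (1 + 3 / (8 * mu)),          # leading terms of V[sqrt(y)]
    }
    print(mu, {k: round(float(v), 4) for k, v in results[mu].items()})
```

The approximations drop the higher-order terms, so small discrepancies are expected, more so at the smaller means.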

One therefore models $\sqrt{y}$ by OLS, regressing $\sqrt{y_i}$ on $x_i$. The usual OLS t statistics can be used for statistical inference. More problematic is the interpretation of coefficients. These give the impact of a one-unit change in $x_j$ on $E[\sqrt{y}]$ rather than $E[y]$, and by Jensen's inequality $E[y] \neq (E[\sqrt{y}])^{2}$. A similar problem arises in prediction, although the method of Duan (1983) can be used to predict $E[y_i]$, given the estimated model for $\sqrt{y_i}$.
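The retransformation problem can be illustrated with a smearing-type correction in the spirit of Duan (1983). The sketch below rests on assumptions that are not from the text: simulated Poisson data with a hypothetical design, and the smearing predictor written for the square-root model as the average of $(\hat{z}_i + \hat{u}_j)^2$ over all OLS residuals $\hat{u}_j$. It compares that predictor with the naive prediction obtained by squaring the fitted value.

```python
import numpy as np

# Simulated data (hypothetical design, for illustration only)
rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = rng.poisson(np.exp(1.0 + 0.3 * x)).astype(float)

# OLS regression of sqrt(y) on x
z = np.sqrt(y)
coef = np.linalg.lstsq(X, z, rcond=None)[0]
fitted = X @ coef
resid = z - fitted

# Naive prediction of E[y]: square the fitted value (ignores Jensen's inequality)
naive = fitted ** 2

# Smearing-type prediction: mean over j of (fitted_i + resid_j)^2,
# expanded so that no n-by-n array is needed
smeared = fitted ** 2 + 2 * fitted * resid.mean() + np.mean(resid ** 2)

print(f"mean of y           = {y.mean():.3f}")
print(f"mean naive forecast = {naive.mean():.3f}")
print(f"mean smeared        = {smeared.mean():.3f}")
```

Because the OLS residuals are orthogonal to the fitted values when an intercept is included, the smeared predictions average exactly to the sample mean of y, whereas the naive predictions average below it.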

3.7.3 Nonlinear Least Squares

The nonlinear least squares (NLS) estimator with exponential mean minimizes the sum of squared residuals $\sum_i (y_i - \exp(x_i'\beta))^{2}$. The estimator $\hat{\beta}_{NLS}$ is the solution to the first-order conditions

\[
\sum_{i=1}^{n} x_i \left(y_i - \exp(x_i'\beta)\right) \exp(x_i'\beta) = 0. \tag{3.63}
\]

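This estimator is consistent if the conditional mean of $y_i$ is $\exp(x_i'\beta)$. It is inefficient, however, as the errors are certainly not homoskedastic, and the usual reported NLS standard errors are inconsistent. $\hat{\beta}_{NLS}$ is asymptotically normal with variance

\[
V[\hat{\beta}_{NLS}] = \left(\sum_{i=1}^{n} \mu_i^{2} x_i x_i'\right)^{-1} \left(\sum_{i=1}^{n} \omega_i \mu_i^{2} x_i x_i'\right) \left(\sum_{i=1}^{n} \mu_i^{2} x_i x_i'\right)^{-1}, \tag{3.64}
\]

where $\mu_i = \exp(x_i'\beta)$ and $\omega_i = V[y_i \mid x_i]$. The robust sandwich estimate of $V[\hat{\beta}_{NLS}]$ is (3.64), with $\mu_i$ and $\omega_i$ replaced by $\hat{\mu}_i$ and $(y_i - \hat{\mu}_i)^{2}$.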

The NLS estimator can therefore be used, but more efficient estimates can be obtained using the estimators given in sections 3.2 and 3.3.
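As an illustration of (3.63) and (3.64), the following sketch computes the NLS estimator with exponential mean and its robust sandwich variance on simulated data. Several things here are assumptions rather than part of the text: the Gauss-Newton iteration with step halving is just one convenient way to solve the first-order conditions, and the function names `nls_exp` and `nls_sandwich`, the data design, and the seed are hypothetical.

```python
import numpy as np

def nls_exp(X, y, iters=100, tol=1e-10):
    """Minimize sum (y - exp(X beta))^2 by Gauss-Newton with step halving.

    Assumes (for the starting value only) that column 0 of X is the constant.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(iters):
        mu = np.exp(X @ beta)
        J = X * mu[:, None]                      # d mu_i / d beta = mu_i x_i
        step = np.linalg.solve(J.T @ J, J.T @ (y - mu))
        ssr, t = np.sum((y - mu) ** 2), 1.0
        with np.errstate(over="ignore"):
            # halve the step until the sum of squared residuals does not increase
            while np.sum((y - np.exp(X @ (beta + t * step))) ** 2) > ssr and t > 1e-8:
                t /= 2
        beta = beta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return beta

def nls_sandwich(X, y, beta):
    """Estimate of (3.64) with mu_i and omega_i replaced by mu_hat_i and (y_i - mu_hat_i)^2."""
    mu = np.exp(X @ beta)
    A = X.T @ (X * (mu ** 2)[:, None])
    B = X.T @ (X * ((y - mu) ** 2 * mu ** 2)[:, None])
    A_inv = np.linalg.inv(A)
    return A_inv @ B @ A_inv

# Simulated data (hypothetical design, for illustration only)
rng = np.random.default_rng(4)
n = 20_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta_hat = nls_exp(X, y)
V_hat = nls_sandwich(X, y, beta_hat)
se = np.sqrt(np.diag(V_hat))
print("NLS estimates:", beta_hat.round(3))
print("robust s.e.  :", se.round(4))
```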

Example: Doctor Visits (Continued)

Coefficient estimates of binary Poisson, ordered probit, OLS, OLS of transformations of y (both $\ln(y + 0.1)$ and $\sqrt{y}$), Poisson PMLE, and NLS with exponential mean are presented in Table 3.6. The associated t statistics reported are based on robust sandwich (RS) standard errors, except for binary Poisson and ordered probit. The skewness and kurtosis measures given are for model residuals $z_i - \hat{z}_i$, where $z_i$ is the dependent variable, for example, $z_i = \sqrt{y_i}$, and are estimates of, respectively, the third central moment divided by $s^{3}$ and the fourth central moment divided by $s^{4}$, where $s^{2}$ is the estimated variance. For the standard normal distribution the kurtosis measure is 3.
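The two residual diagnostics are simple to compute. The sketch below is illustrative only: the function name `skew_kurt` is hypothetical, the variance is estimated with divisor n (an assumption, as the text does not state the divisor), and the inputs are simulated draws rather than residuals from the doctor visits data.

```python
import numpy as np

def skew_kurt(resid):
    """Third and fourth central moments divided by s^3 and s^4, respectively."""
    d = resid - resid.mean()
    s2 = np.mean(d ** 2)                 # estimated variance (divisor n assumed)
    skewness = np.mean(d ** 3) / s2 ** 1.5
    kurtosis = np.mean(d ** 4) / s2 ** 2
    return skewness, kurtosis

rng = np.random.default_rng(2)

# Benchmark: normal draws should give skewness near 0 and kurtosis near 3
normal_skew, normal_kurt = skew_kurt(rng.normal(size=200_000))

# A low-mean count variable (mean 0.3 is an arbitrary illustrative choice)
pois_skew, pois_kurt = skew_kurt(rng.poisson(0.3, size=200_000).astype(float))

print(f"normal : skewness {normal_skew:.2f}, kurtosis {normal_kurt:.2f}")
print(f"Poisson: skewness {pois_skew:.2f}, kurtosis {pois_kurt:.2f}")
```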

We begin with estimation of a binary choice model for the recoded variable d = 0 if y = 0 and d = 1 if y ≥ 1. To allow direct comparison with Poisson estimates, we estimate the nonstandard binary Poisson model introduced in section 3.6.1. Compared with Poisson estimates in the Poiss column, the BP results for health status measures are similar, although for the statistically insignificant socioeconomic variables AGE, AGESQ, and INCOME there are sign changes. Similar sign changes for AGE and AGESQ occur in Table 3.4 and are discussed there. The log-likelihood for BP exceeds that for Poisson, but this comparison is meaningless due to the different dependent variable. Logit and probit, not reported, lead to similar log-likelihood and qualitatively similar estimates to those from binary Poisson, so differences between binary Poisson and Poisson can be attributed to aggregating all positive counts into one value.

The ordered probit model normalizes the error variance to 1. To enable comparison with OLS estimates we multiply the ordered probit estimates by s = .714, the estimated standard deviation of the residual from OLS regression. Also, as only one observation took the value 9, this was combined into a category of 8 or more. The rescaled threshold parameter estimates are .67, 1.08, 1.22, 1.39, 1.49, 1.67, and 1.99, with t statistics all in excess of 18 and all at least two standard errors apart. Despite the rescaling there is still considerable difference from the OLS estimates. It is meaningful to compare the ordered-probit log-likelihood with that of other count data models; the change of one observation from 9 to 8 or more in the ordered probit should have little effect. The log-likelihood is higher for this model than for NB2, because −3138.1 > −3198.7, although six more parameters are estimated.

The log transformation $\ln(y + 0.1)$ was chosen on grounds of smaller skewness and kurtosis than $\ln(y + 0.2)$ or $\ln(y + 0.4)$. The skewness and kurtosis are somewhat smaller for ln y than $\sqrt{y}$. Both transformations appear quite successful in moving towards normality, especially compared with residuals from OLS even before inclusion of regressors, as inclusion of regressors reduces skewness and kurtosis by about 20% in this example. All models give similar results regarding the statistical significance of regressors, although interpretation of the magnitude of the effect of regressors is more difficult if the dependent variable is $\ln(y + 0.1)$ or $\sqrt{y}$.

Table 3.6. Doctor visits: alternative estimates and t ratios

Estimators and t statistics (t statistics in parentheses)

               Discrete choice        OLS of transformations    Exponential mean
Variable          BP   OrdProb         y      ln y        √y     Poiss       NLS
ONE            −.905     −.980      .028    −2.115      .070    −2.224    −2.234
              (6.66)    (9.29)     (.38)   (21.43)    (1.55)    (8.74)    (6.14)
SEX             .136      .094      .034      .081      .034      .157     −.057
              (3.39)    (3.03)    (1.47)    (2.73)    (2.48)    (1.98)     (.42)
AGE           −1.356     −.381      .203     −.566     −.161     1.056     3.626
              (1.76)     (.46)     (.46)     (.97)     (.60)     (.77)    (1.82)
AGESQ          1.842      .611     −.062      .877      .292     −.849    −3.676
              (2.15)     (.65)     (.12)    (1.31)     (.94)     (.58)    (1.70)
INCOME          .007     −.044     −.057     −.019     −.168     −.205     −.394
               (.12)     (.95)    (1.65)     (.43)     (.80)    (1.59)    (2.02)
LEVYPLUS        .136      .098      .035      .080      .337      .123      .214
              (2.80)    (2.45)    (1.62)    (2.58)    (2.41)    (1.29)    (1.48)
FREEPOOR       −.265     −.245     −.103     −.182     −.081     −.440     −.232
              (2.55)    (2.75)    (2.17)    (3.17)    (3.00)    (1.52)     (.54)
FREEREPA        .223      .127      .033      .139      .054      .080     −.003
              (3.16)    (2.37)     (.77)    (2.45)    (2.06)     (.63)     (.02)
ILLNESS         .148      .107      .060      .110      .048      .187      .140
              (9.12)    (9.23)    (6.04)    (8.53)    (8.12)    (7.81)    (3.63)
ACTDAYS         .117      .072      .103      .106      .054      .127      .121
             (14.47)   (18.35)   (10.61)   (13.57)   (13.06)   (16.33)   (14.21)
HSCORE          .034      .023      .017      .029      .013      .030      .023
              (3.64)    (3.54)    (2.37)    (3.31)    (3.17)    (2.11)    (1.03)
CHCOND1         .042      .044      .004      .022      .009      .114      .079
               (.94)    (1.23)     (.20)     (.70)     (.61)    (1.25)     (.55)
CHCOND2         .141      .096      .042      .102      .043      .141     −.055
              (2.11)    (2.06)     (.90)    (1.81)    (1.62)    (1.15)     (.31)

−ln L: 2246.9 (BP), 3138.1 (OrdProb)
Skewness: 3.6 (y), 1.2 (ln y), 1.4 (√y), 3.1 (exponential mean)
Kurtosis: 26.4 (y), 4.0 (ln y), 5.5 (√y), 26.0 (exponential mean)

Note: BP, MLE for binary Poisson; OrdProb, MLE for rescaled ordered probit; y, OLS for y; ln y, OLS for ln(y + 0.1); √y, OLS for √y; Poiss, Poisson PMLE; NLS, NLS with exponential mean. The t statistics are robust sandwich for all but BP and OrdProb. Skewness and kurtosis are for model residuals.

The NLS estimates for exponential mean lead to similar conclusions as Poisson for the health-status variables, but quite different conclusions for socioeconomic variables, with considerably larger coefficients and t statistics for AGE, AGESQ, and INCOME and a sign change for SEX.

3.7.4 Exponential Duration Model

For a Poisson point process the number of events in a given interval of time is Poisson distributed. The duration of a spell, the time from one occurrence to the next, is exponentially distributed. Here we consider modeling durations rather than counts.

Suppose that for each individual in a sample of n individuals we observe the duration of one complete spell, generated by a Poisson point process with rate parameter $\gamma_i$. Then $t_i$ has exponential density $f(t_i) = \gamma_i \exp(-\gamma_i t_i)$ with mean $E[t_i] = 1/\gamma_i$. For regression analysis it is customary to specify $\gamma_i = \exp(x_i'\beta)$.

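The exponential MLE, $\hat{\beta}_E$, maximizes the log-likelihood function

\[
\ln L = \sum_{i=1}^{n} \left[ x_i'\beta - \exp(x_i'\beta)\, t_i \right]. \tag{3.65}
\]

The first-order conditions can be expressed as

\[
\sum_{i=1}^{n} \left(1 - \exp(x_i'\beta)\, t_i\right) x_i = 0, \tag{3.66}
\]

and application of the usual maximum likelihood theory yields

\[
V_{ML}[\hat{\beta}_E] = \left(\sum_{i=1}^{n} x_i x_i'\right)^{-1}. \tag{3.67}
\]

If instead we modeled the number of events from a Poisson point process with rate parameter $\gamma_i = \exp(x_i'\beta)$ we obtain

\[
V_{ML}[\hat{\beta}_P] = \left(\sum_{i=1}^{n} \gamma_i x_i x_i'\right)^{-1}.
\]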

The two variance matrices coincide if $\gamma_i = 1$. Thus if we choose intervals for each individual so that individuals on average experience one event such as a doctor visit, the count data have the same information content, in terms of precision of estimation of β, as observing for each individual one completed spell such as time between successive visits to the doctor. More simply, one count conveys the same information as the length of one complete spell.
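The comparison of the two variance matrices can be sketched numerically. The code below is an illustration under stated assumptions: the data are simulated, the helper names `fit_exponential` and `poisson_variance` are hypothetical, and Fisher scoring (which uses the expected information $\sum_i x_i x_i'$) is chosen here as a convenient way to solve the first-order conditions (3.66), not as a method prescribed by the text.

```python
import numpy as np

def fit_exponential(X, t, iters=100, tol=1e-10):
    """Exponential duration MLE with rate exp(x'beta), fitted by Fisher scoring.

    Assumes (for the starting value only) that column 0 of X is the constant.
    Returns the estimate and the variance matrix in (3.67).
    """
    beta = np.zeros(X.shape[1])
    beta[0] = -np.log(t.mean())
    info = X.T @ X                                   # expected information
    for _ in range(iters):
        score = X.T @ (1.0 - np.exp(X @ beta) * t)   # left side of (3.66)
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, np.linalg.inv(info)

def poisson_variance(X, gamma):
    """Variance matrix for the count-data model: inverse of sum gamma_i x_i x_i'."""
    return np.linalg.inv(X.T @ (X * gamma[:, None]))

# Simulated data (hypothetical design, for illustration only)
rng = np.random.default_rng(3)
n = 5_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([0.2, 0.5])
gamma = np.exp(X @ beta_true)
t = rng.exponential(scale=1.0 / gamma)               # one complete spell each

beta_hat, V_exp = fit_exponential(X, t)
V_pois = poisson_variance(X, gamma)                  # rates as simulated
V_pois_unit = poisson_variance(X, np.ones(n))        # rates all equal to one

print("exponential MLE:", beta_hat.round(3))
print("s.e. from durations      :", np.sqrt(np.diag(V_exp)).round(4))
print("s.e. from counts         :", np.sqrt(np.diag(V_pois)).round(4))
print("s.e. from counts, gamma=1:", np.sqrt(np.diag(V_pois_unit)).round(4))
```

With all rates set to one, the count-data variance matrix is identical to the duration-data one, which is the equivalence stated above; with the simulated rates the two differ.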
