Overview of count response models
3.1 Varieties of count response model
Poisson regression is traditionally conceived of as the basic count model upon which a variety of other count models are based.¹ The Poisson distribution may be characterized as
$$ f(y; \lambda) = \frac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots, n_i; \quad \lambda > 0 \tag{3.1} $$
¹ The statistical analysis of frequency tables of counts, and statistical inference based on such tables, appears to have first been developed by Abu al-Kindi (801–873 CE), an Arab mathematician who lived in present-day Iraq. He was the first to use frequency analysis for cryptanalysis, and can be regarded as the father of modeling count data, as well as perhaps the father of statistics. He was also primarily responsible for bringing “Arabic” numerals to the attention of scholars in the West.
where the random variable y is the count response and parameter λ is the mean. Often, λ is also called the rate or intensity parameter. Unlike most other distributions, the Poisson does not have a distinct scale parameter. Rather, the scale is assumed equal to 1.
In the statistical literature, λ is also expressed as µ when referring to Poisson and traditional negative binomial (NB2) models. Moreover, µ is the standard manner in which the mean parameter is expressed in generalized linear models (GLM). Since we will be using the glm command or function for estimating many Poisson and negative binomial examples in this text, we will henceforth employ µ in place of λ for expressing the mean of a GLM. We will use λ later for certain non-GLM count models.
The Poisson and negative binomial distributions may also include an exposure variable associated with µ. The variable t is considered to be the area in, or length of time during, which events or counts occur. This is typically called the exposure. If t = 1, then the Poisson probability distribution reduces to the standard form. If t is a constant other than 1, or varies between observations, then the distribution can be parameterized as
$$ f(y; \mu) = \frac{e^{-t_i \mu_i}\,(t_i \mu_i)^{y_i}}{y_i!} \tag{3.2} $$
When included in the data, the natural log of t is entered as an offset into the model estimation. Playing an important role in estimating both Poisson and negative binomial models, offsets are discussed at greater length in Chapter 6.
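As a minimal sketch of how an exposure enters estimation, the following R code simulates counts whose mean is proportional to an exposure and fits a Poisson GLM with the log of the exposure as an offset. The simulated data, the variable names (exposure, x, y), and the coefficient values are assumptions made purely for illustration.

```r
# Simulated data: all names and parameter values are illustrative assumptions
set.seed(1)
n        <- 5000
x        <- rnorm(n)                        # a single predictor
exposure <- runif(n, min = 0.5, max = 5)    # t in (3.2), e.g. time at risk
mu       <- exp(0.5 + 0.3 * x)              # rate per unit of exposure
y        <- rpois(n, lambda = exposure * mu)  # counts with mean t * mu

# log(exposure) enters the linear predictor with its coefficient fixed at 1
fit <- glm(y ~ x + offset(log(exposure)), family = poisson)
coef(fit)   # estimates should lie close to the assumed 0.5 and 0.3
```

Because the offset has no coefficient to estimate, the fitted parameters describe the rate per unit of exposure rather than the raw count.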
A unique feature of the Poisson distribution is the relationship of its mean to the variance – they are equal. This relationship is termed equidispersion. The fact that it is rarely found in real data has driven the development of more general count models, which do not assume such a relationship.
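Equidispersion can be checked numerically. The sketch below computes the mean and variance directly from the probability functions, using a mean of 2 and, for contrast, an NB2 distribution with heterogeneity parameter 1.5 (values chosen only for illustration). It relies on the standard NB2 variance, µ + αµ², and on R's dnbinom() with size = 1/α.

```r
mu    <- 2
alpha <- 1.5                       # NB2 heterogeneity parameter (illustrative)
y     <- 0:1000                    # support cut off far out in the tail

# Poisson: mean and variance computed from the probability function
p      <- dpois(y, lambda = mu)
m_pois <- sum(y * p)
v_pois <- sum((y - m_pois)^2 * p)

# NB2: same mean, but variance mu + alpha * mu^2
pn    <- dnbinom(y, size = 1 / alpha, mu = mu)
m_nb2 <- sum(y * pn)
v_nb2 <- sum((y - m_nb2)^2 * pn)

rbind(poisson = c(mean = m_pois, variance = v_pois),
      nb2     = c(mean = m_nb2,  variance = v_nb2))
```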
Poisson counts, y, are examples of a Poisson process. The events in a given time period or area that is subject to the counting process are independent of one another; they enter into each period or area according to a uniform distribution. The fact that there are y events in period A has no bearing on how many events are counted in period B. The rate at which events occur in a period or area is µ, so that the expected number of occurrences is µ times the length of the period or size of the area.
The Poisson regression model derives from the Poisson distribution. The relationship between µ, β, and x (the fitted mean of the model, parameters, and model covariates or predictors, respectively) is parameterized such that µ = exp(xβ). Here xβ is the linear predictor, which is also symbolized as η within the context of GLM. Exponentiating xβ guarantees that µ is positive for all values of η and for all parameter estimates. By attaching the subscript i to µ, y, and x, the parameterization can be extended to all observations in the model. The subscript can also be used when modeling non-iid (independent and identically distributed) observations.
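The parameterization µ = exp(xβ) can be sketched in R as follows. The data are simulated, and the coefficient values and variable names are hypothetical; the point is only that the linear predictor may take any real value while the fitted mean remains positive.

```r
# Simulated data; beta values are hypothetical
set.seed(2)
n    <- 2000
x    <- rnorm(n)
beta <- c(-1, 0.7)                  # intercept and slope
eta  <- beta[1] + beta[2] * x       # linear predictor, any real value
mu   <- exp(eta)                    # mean, positive for every eta
y    <- rpois(n, lambda = mu)

fit <- glm(y ~ x, family = poisson(link = "log"))

all(mu > 0)                                    # TRUE
# fitted means are the exponentiated linear predictor
all.equal(unname(fitted(fit)),
          unname(exp(predict(fit, type = "link"))))
coef(fit)                                      # estimates of beta
```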
It should be explicitly understood that the response, or dependent variable, of a Poisson or negative binomial regression model, y, is a random variable specifying the count of some identified event. The explanatory predictors, or independent variables, X, are given as nonrandom sets of observations.
Observations within predictors are assumed independent of one another, and predictors are assumed to have minimal correlation with one another.
As shall be described in greater detail later in this book, the Poisson model carries with it various assumptions in addition to those given above. Violations of Poisson assumptions usually result in overdispersion, where the variance of the model exceeds the value of the mean. Violations of equidispersion indicate correlation in the data, which affects standard errors of the parameter estimates. Model fit is also affected. Chapter 7 is devoted to this discussion.
A simple example of how distributional assumptions may be violated will likely be instructive at this point. We begin with the base count model, the Poisson. The Poisson distribution defines a probability distribution function for non-negative counts or outcomes. For example, given a Poisson distribution having a mean of 2, about 13.5% of the outcomes are predicted to be zero. If, in fact, we are given an otherwise Poisson distribution having a mean of 2, but with 50% zeros, it is clear that the Poisson distribution may not adequately describe the data at hand. When such a situation arises, modifications are made to the Poisson model to account for discrepancies in the goodness of fit of the underlying distribution. Models such as zero-inflated Poisson and zero-truncated Poisson directly address such problems.
The above discussion regarding distributional assumptions applies equally to the negative binomial. A traditional negative binomial distribution having a mean of 2 and an ancillary parameter of 1.5 yields a probability of approximately 40% for an outcome of zero. When the observed number of zeros substantially differs from the theoretically imposed number of zeros, the base negative binomial model can be adjusted in a manner similar to the adjustments mentioned for the Poisson. It is easy to construct a graph to observe the values of the predicted counts for a specified mean. The code can be given as in Table 3.1.
Note the differences in predicted zero count for Poisson compared with the negative binomial with heterogeneity parameter, α, having a value of 1.5. If α were 0.5 instead of 1.5, the predicted zeros would be 25%. For higher means, both predicted zero counts get progressively smaller. A mean of 5 results in approximately 1% predicted zeros for the Poisson distribution and 24% for the negative binomial with alpha of 1.5.
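The predicted zero probabilities quoted above can be reproduced with R's built-in density functions. This is a small sketch; the helper names p0_pois and p0_nb2 are hypothetical, and the NB2 distribution with heterogeneity parameter α is obtained from dnbinom() by setting size = 1/α.

```r
# Probability of a zero count under each distribution
p0_pois <- function(mu) dpois(0, lambda = mu)
p0_nb2  <- function(mu, alpha) dnbinom(0, size = 1 / alpha, mu = mu)

# Mean of 2: Poisson, NB2 with alpha = 1.5, NB2 with alpha = 0.5
round(c(p0_pois(2), p0_nb2(2, 1.5), p0_nb2(2, 0.5)), 3)

# Mean of 5: Poisson and NB2 with alpha = 1.5
round(c(p0_pois(5), p0_nb2(5, 1.5)), 3)

# The NB2 zero probability also has the closed form (1 + alpha*mu)^(-1/alpha)
all.equal(p0_nb2(2, 1.5), (1 + 1.5 * 2)^(-1 / 1.5))
```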
Table 3.1 R: Poisson vs. negative binomial PDF
obs <- 11
mu <- 2
y <- 0:10
yp2 <- (exp(-mu)*mu^y)/exp(log(gamma(y+1)))
alpha <- 1.5
amu <- mu*alpha
ynb2 = exp(
    y*log(amu/(1+amu))
    - (1/alpha)*log(1+amu)
    + log(gamma(y +1/alpha))
    - log(gamma(y+1))
    - log(gamma(1/alpha))
)
plot(y, ynb2, col="red", pch=5,
     main="Poisson vs Negative Binomial PDFs")
lines(y, ynb2, col="red")
points(y, yp2, col="blue", pch=2)
lines(y, yp2, col="blue")
legend(4.3,.40,
    c("Negative Binomial: mean=2, a=1.5",
      "Poisson: mean=2"),
    col=(c("red","blue")),
    pch=(c(5,2)),
    lty=1)
# FOR NICER GRAPHIC
zt <- 0:10                 #zt is Zero to Ten
x <- c(zt,zt)              #two zt's stacked for use with ggplot2
newY <- c(yp2, ynb2)       #Now stacking these two vars
Distribution <- gl(n=2,k=11,length=22,
    labels=c("Poisson","Negative Binomial") )
NBPlines <- data.frame(x,newY,Distribution)
library("ggplot2")
ggplot(NBPlines, aes(x,newY,shape=Distribution, col=Distribution)) +
    geom_line() + geom_point()
. set obs 11
. gen byte mu = 2                 // mean = 2
. gen byte y = _n-1
* POISSON
. gen yp2 = (exp(-mu)*mu^y)/exp(lngamma(y+1))
* NEGATIVE BINOMIAL
. gen alpha = 1.5                 // NB2 alpha=1.5
. gen amu = mu*alpha
. gen ynb2 = exp(y*ln(amu/(1+amu)) - (1/alpha)*ln(1+amu) + lngamma(y +1/alpha) - lngamma(y+1) - lngamma(1/alpha))
. label var yp2 "Poisson: mean=2"
. label var ynb2 "Negative Binomial: mean=2, a=1.5"
. graph twoway connected ynb2 yp2 y, ms(T d) ///
    title(Poisson vs Negative Binomial PDFs)
Figure 3.1 Poisson versus negative binomial PDF at mean = 2 (plot titled "Poisson vs Negative Binomial PDFs": probability, from 0 to 0.4, against count y, from 0 to 10, for the series "Negative Binomial: mean=2, a=1.5" and "Poisson: mean=2")
Early on, researchers developed enhancements to the Poisson model, which involved adjusting the standard errors in such a manner that the presumed overdispersion would be dampened. Scaling of the standard errors was the first method developed to deal with overdispersion from within the GLM framework. It is a particularly easy tactic to take when the Poisson model is estimated as a generalized linear model. We shall describe scaling in more detail in Section 7.3.1. Nonetheless, the majority of count models require more sophisticated adjustments than simple scaling.
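One common form of scaling multiplies the Poisson standard errors by the square root of the Pearson dispersion statistic. The sketch below assumes that form, which may differ in detail from the treatment in Section 7.3.1, and uses simulated overdispersed data with hypothetical parameter values. R's quasipoisson family applies the same adjustment, which gives a convenient check.

```r
# Simulated overdispersed counts; all parameter values are assumptions
set.seed(3)
n  <- 1000
x  <- rnorm(n)
mu <- exp(1 + 0.5 * x)
y  <- rnbinom(n, size = 1 / 1.5, mu = mu)    # NB2 counts with alpha = 1.5

fit_p <- glm(y ~ x, family = poisson)

# Pearson dispersion: Pearson chi-square over residual degrees of freedom
phi <- sum(residuals(fit_p, type = "pearson")^2) / df.residual(fit_p)

se_p      <- sqrt(diag(vcov(fit_p)))         # model-based Poisson SEs
se_scaled <- se_p * sqrt(phi)                # scaled SEs

fit_q    <- glm(y ~ x, family = quasipoisson)
se_quasi <- sqrt(diag(vcov(fit_q)))

phi
cbind(se_p, se_scaled, se_quasi)
```

The coefficient estimates are unchanged by scaling; only the standard errors, and hence the test statistics and confidence intervals, are affected.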
Again, the negative binomial is normally used to model overdispersed Poisson data, which spawns our notion of the negative binomial as an extension of the Poisson. However, distributional problems affect both models, and negative binomial models themselves may be overdispersed. Both models can be extended in similar manners to accommodate any extra correlation or dispersion in the data that results in a violation of the distributional properties of each respective distribution (Table 3.2). The enhanced or advanced Poisson or negative binomial model can be regarded as a solution to a violation of the distributional assumptions of the primary model.
Table 3.2 enumerates the types of extensions that are made to both Poisson and negative binomial regression. Thereafter, we provide a bit more detail as to the nature of the assumption being violated and how it is addressed by each type of extension. Later chapters are devoted to a more detailed examination of each of these model types.
Earlier in this chapter we described violations of Poisson and negative binomial distributions as related to excessive zero counts. Each distribution has an expected number of counts for each value of the mean parameter; we saw how, for a given mean, an excess (or deficiency) of zero counts results in overdispersion. However, it must be understood that the negative binomial has an additional ancillary or heterogeneity parameter, which, in concert with the value of the mean parameter, defines (in a probabilistic sense) specific expected values of counts. Substantial discrepancies between the observed frequencies of counts (how many zeros, how many 1s, how many 2s, and so forth) and the expected frequencies defined by the given mean and ancillary parameter (NB model) result in correlated data and hence overdispersion. The first two items in Table 3.2 directly address this problem.
Table 3.2 Violations of distributional assumptions
1 No zeros in data
2 Excess zeros in data
3 Data separable into two or more distributions
4 Censored observations
5 Truncated data
6 Data structured as panels: clustered and longitudinal data
7 Some responses occur based on the value of another variable
8 Endogenous variables in model
Violation 1: The Poisson and negative binomial distributions assume that zero counts are a possibility. When the data to be modeled originate from a generating mechanism that structurally excludes zero counts, then the Poisson or negative binomial distribution must be adjusted to account for the missing zeros. Such model adjustment is not used when the data can have zero counts, but simply do not. Rather, an adjustment is made only when the data must be such that it is not possible to have zero counts. Hospital length of stay is a good example. When a patient enters the hospital, a count of 1 is given. There are no lengths of stay recorded as zero days. The possible values for data begin with a count of 1. Zero-truncated Poisson and zero-truncated negative binomial models are normally used for such situations.
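A zero-truncated distribution can be sketched by rescaling the probabilities of the positive counts so that they sum to one. The function name dztpois and the mean of 2 below are hypothetical choices for illustration, not part of any package.

```r
# Zero-truncated Poisson: Poisson probabilities for y >= 1, divided by
# the probability of a non-zero count
dztpois <- function(y, mu) {
  ifelse(y >= 1, dpois(y, lambda = mu) / (1 - dpois(0, lambda = mu)), 0)
}

mu <- 2
y  <- 1:100
sum(dztpois(y, mu))          # probabilities over 1, 2, ... sum to 1
sum(y * dztpois(y, mu))      # truncated mean, mu / (1 - exp(-mu))
dztpois(0, mu)               # a zero count has probability 0
```

The mean of the truncated distribution exceeds µ, which is why fitting an ordinary Poisson model to structurally zero-free data gives a distorted estimate of µ.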
Violation 2: The Poisson and negative binomial distributions define an expected number of zero counts for a given value of the mean. The greater the mean, the fewer zero counts are expected. Some data, however, come with a high percentage of zero counts, far more than are accounted for by the Poisson or negative binomial distribution. For such data, statisticians have developed regression models called zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB). The data are assumed to come from a mixture of two distributions where the structural zeros from a binary distribution are mixed with the non-negative integer outcomes (including zeros) from a count distribution. Logistic or probit regression is typically used to model the structural zeros, and Poisson or negative binomial regression is used for the count outcomes. If we were to apply a count model to the data without explicitly addressing the mixture, it would be strongly affected by the presence of the excess zeros. This inflation of the probability of a zero outcome is the genesis of the zero-inflated name.
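The mixture can be written out directly. In this sketch pi_zero denotes the probability of a structural zero from the binary component; the function name dzip and the parameter name are hypothetical. The example returns to the earlier case of a count distribution with mean 2 and 50% observed zeros.

```r
# Zero-inflated Poisson: structural zeros mixed with Poisson counts
dzip <- function(y, mu, pi_zero) {
  pi_zero * (y == 0) + (1 - pi_zero) * dpois(y, lambda = mu)
}

mu <- 2
# Mixing proportion that yields 50% zeros when the count part has mean 2
pi_zero <- (0.5 - dpois(0, mu)) / (1 - dpois(0, mu))

pi_zero
dzip(0, mu, pi_zero)             # 0.5 by construction
sum(dzip(0:100, mu, pi_zero))    # the mixture is a proper distribution
```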
Violation 3: When the zero counts of a Poisson or negative binomial model do not appear to be generated from their respective distributions, one may separate the model into two parts, somewhat like the ZIP and ZINB models above. However, in the case of hurdle models, the assumption is that a threshold must be crossed from zero counts to actually entering the counting process. For example, when modeling insurance claims, clients may have a year without claims – zero counts. But when one or more accidents occur, counts of claims follow a count distribution (e.g. Poisson or negative binomial). The logic of the severability in hurdle models differs from that of zero-inflated models. Hurdle models are sometimes called zero-altered models, giving us model acronyms of ZAP and ZANB.
Like zero-inflated models, hurdle or zero-altered algorithms separate the data into zero versus positive counts. However, the binary component of a hurdle model is separate from the count component. The binary component consists of two values, 0 for 0 counts and 1 for positive counts; the count component is a zero-truncated count model with integer values greater than 0.
For zero-inflated models, the binary component overlaps with the count component. The binary component model estimates the count of 0, with the count component estimating the full range of counts. The count of 0 is mixed into both components. Care must be taken when interpreting the binary component of zero-inflated compared with hurdle models.
The binary component of zero-inflated models is typically a logit or probit model, but complementary log-log models are also used. The binary component of hurdle models is usually one of the above binomial models, but can be a censored-at-one Poisson, geometric or negative binomial as well. Zero-inflated likelihood functions therefore differ considerably from the likelihood functions of similar hurdle models. We shall address these differences in more detail in Chapter 11.
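The difference between the two constructions shows up in the probability of a zero. The following sketch writes both probability functions for a Poisson count component; the function and parameter names (dhurdle, dzip, p_zero, pi_zero) are hypothetical, and the binary probability is passed in directly rather than modeled with a logit or probit.

```r
# Hurdle Poisson: zeros come only from the binary part; positive counts
# follow a zero-truncated Poisson
dhurdle <- function(y, mu, p_zero) {
  ifelse(y == 0, p_zero,
         (1 - p_zero) * dpois(y, lambda = mu) / (1 - dpois(0, lambda = mu)))
}

# Zero-inflated Poisson: zeros come from both the binary and count parts
dzip <- function(y, mu, pi_zero) {
  pi_zero * (y == 0) + (1 - pi_zero) * dpois(y, lambda = mu)
}

mu <- 2
dhurdle(0, mu, p_zero = 0.5)     # exactly the binary probability
dzip(0, mu, pi_zero = 0.5)       # larger: the count part adds its own zeros
sum(dhurdle(0:100, mu, 0.5))     # both are proper distributions
sum(dzip(0:100, mu, 0.5))
```

With the same binary probability the two models therefore imply different zero frequencies, which is one reason the binary coefficients of the two model types are not interchangeable.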
Violation 4: At times certain observations are censored from the rest of the model. With respect to count response models, censoring takes two forms. In either case a censored observation is one that contributes to the model, but for which exact information is missing.
The traditional form, which I call the econometric or cut parameterization, revalues censored observations as the value of the lower or upper valued non-censored observation. Left-censored data take the value of the lowest non-censored count; right-censored data take the value of the highest non-censored count. Another parameterization, which can be referred to as the survival parameterization, considers censoring in the same manner as is employed with survival models. That is, an observation is left-censored up to the point at which events are known to enter into the data; it is right-censored when events are lost to the data due to withdrawal from the study, loss of information, and so forth. The log-likelihood functions of the two parameterizations differ, but the parameter estimates calculated are usually not too different.
Violation 5: Truncated observations consist of those that are entirely excluded from the model, from either the lower (left) or the upper (right) side of the distribution of counts. Unlike the econometric parameterization of censoring described in Violation 4, truncated data are excluded from the model, not revalued.
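The contrast between revaluing and excluding can be illustrated with a Poisson distribution and an upper cut point. This is a sketch of one standard way of writing the resulting probabilities, not the log-likelihoods used later in the text; the cut point of 4 and the mean of 2 are hypothetical.

```r
mu  <- 2
cut <- 4     # hypothetical upper cut point

# Right censoring: counts at or above the cut are recorded as `cut`,
# so the recorded value `cut` carries the whole upper tail P(Y >= cut)
p_cens <- c(dpois(0:(cut - 1), lambda = mu),
            ppois(cut - 1, lambda = mu, lower.tail = FALSE))

# Right truncation: counts above the cut are absent from the data,
# so the remaining probabilities are rescaled by P(Y <= cut)
p_trunc <- dpois(0:cut, lambda = mu) / ppois(cut, lambda = mu)

out <- rbind(censored = p_cens, truncated = p_trunc)
colnames(out) <- 0:cut
round(out, 4)
```

Under censoring only the probability at the cut point changes, whereas truncation rescales every remaining probability.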
Violation 6: Longitudinal data come in the form of panels. For example, in health studies, patients given a drug may be followed for a period of time to ascertain effects occurring during the duration of taking the drug. Each patient may have one or more follow-up tests. Each set of patient observations is considered to be a panel. The data consist of a number of panels. However, observations within each panel cannot be considered independent, which violates a central assumption of maximum likelihood theory. Within-panel correlation results in overdispersed data. Clustered data result in similar difficulties. In either case, methods have been developed to accommodate extra correlation in the data due to the within-panel correlation of observations. Such models, however, do require that the panels themselves are independent of one another, even though the observations within the panels are not. Generalized estimating equations (GEE), fixed-effects models, random-effects models, and mixed-effects and multilevel models have been widely used for such data.
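A short simulation shows how within-panel correlation produces overdispersion. The sketch assumes a normally distributed panel-level effect on the log scale, which is only one of many ways such correlation can arise; the panel sizes and parameter values are hypothetical.

```r
# Simulated panel data; all settings are illustrative assumptions
set.seed(4)
n_panels <- 500
n_per    <- 5
u  <- rnorm(n_panels, sd = 0.7)                 # shared panel-level effect
id <- rep(seq_len(n_panels), each = n_per)      # panel identifier
y  <- rpois(n_panels * n_per, lambda = exp(1 + u[id]))

# Observations in the same panel share u, so they are correlated,
# and the marginal variance exceeds the marginal mean
c(mean = mean(y), variance = var(y))
```

Each observation is Poisson conditional on its panel effect, yet the pooled counts are clearly not equidispersed.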
Violation 7: Data sometimes come to us in such a manner that an event does not begin until another variable has reached a certain threshold. One may use a selection model to estimate parameters of this type of data. Greene (1994) summarizes problems related to selection models.
Violation 8: At times there are variables not in the model which affect the values of predictors that are in the model. These influence the model, but only indirectly. They also result in model error, that is, variability in the model that is not accounted for by variables in the model. A standard model incorporates this extra variation into the error term. Models have been designed, however, to define endogenous variables that identify extra variation in the data that is not incorporated into existing predictors. Endogeneity can arise from a variety of sources: errors in measuring model predictors; variables that have been omitted from the model and not collected for inclusion in the data; predictors that have been categorized from continuous predictors, with information lost in the process; and sample selection errors in gathering the data. The problem of endogeneity is of considerable importance in econometrics, and should be in other disciplines as well. Endogeneity is central to the discussion of Chapter 13, but is also inherent in the models addressed in Chapters 12 and 14.
Table 3.3 provides a schema of the major types of negative binomial regression models. A similar schema may also be presented characterizing varieties of the Poisson model. Some exceptions exist, however. Little development work has been committed to the exact statistical estimation of negative binomial parameters and standard errors. However, substantial work has been done on Poisson models of this type, particularly by Cytel Corp, manufacturers of LogXact software, and Stata Corporation. Additionally, models such as heterogeneous negative binomial and NB-P have no comparable Poisson model.
Table 3.3 Varieties of negative binomial model
1 Negative binomial (NB)
    NB2
    NB1
    NB-C (canonical)
    NB-H (Heterogeneous negative binomial)
    NB-P (variety of generalized NB)
2 Zero-adjusting models
    Zero-truncated NB
    Zero-inflated NB
    NB with endogenous stratification (G)
    Hurdle NB models
        NB-logit hurdle // geometric-logit hurdle
        NB-probit hurdle // geometric-probit hurdle
        NB-cloglog hurdle // geometric-cloglog hurdle
3 Censored and truncated NB
    Censored NB-E: econometric parameterization
    Censored NB-S: survival parameterization
    Truncated NB-E: econometric parameterization
4 Sample selection NB models
5 Models that handle endogeneity
    Latent class NB models
    Two-stage instrumental variables approach
    Generalized method of moments (GMM)
    NB2 with an endogenous multinomial treatment variable
    Endogeneity resulting from measurement error
6 Panel NB models
    Unconditional fixed-effects NB
    Conditional fixed-effects NB
    Random-effects NB with beta effect
    Generalized estimating equations
    Linear mixed NB models
        Random-intercept NB
        Random-parameter NB
7 Response adjustment
    Finite mixture NB models
    Quantile count models
    Bivariate count models
8 Exact NB model
9 Bayesian NB with (a) gamma prior, (b) beta prior

3.2 Estimation
There are two basic approaches to estimating models of count data. The first is by full maximum likelihood estimation (MLE or FMLE), and the second is by an iteratively re-weighted least squares (IRLS) algorithm, which is based on a simplification of the full maximum likelihood method. IRLS is intrinsic