Generalized Estimating Equations for Zero-Inflated Spatial Count Data


doi:10.1016/j.proenv.2011.07.049

Procedia Environmental Sciences 7 (2011) 281–286

1st Conference on Spatial Statistics 2011: Mapping Global Change


Anthea Monod*

École Polytechnique Fédérale de Lausanne, Department of Mathematics, Station 8, CH-1015 Lausanne, Switzerland

Abstract

This paper consolidates the zero-inflated Poisson model for count data with excess zeros proposed by Lambert (1992) and the two-component model approach for serial correlation among repeated observations proposed by Dobbie and Welsh (2001) for spatial count data; not only does this address the problem of overdispersion, but additionally provides for greater flexibility within the zero components, allowing for the distinction between zeros that arise due to random sampling and those that arise due to an inherent characteristic that may induce zero observations. A likelihood and corresponding score equations are derived for the zero-inflated Poisson model; spatial correlation may be incorporated via any spatial correlation structure following Diggle et al. (2009). A Matérn (1960) correlation is implemented as an illustrative example.

Keywords: Generalized estimating equation (GEE); marginal model; Matérn correlation; spatial count data; zero-inflated counts; zero-inflated Poisson model.

1. Introduction

Let $y_{it}$ denote the number of occurrences of an event observed at $t = 1, \ldots, T_i$ time points for each subject $i$, $i = 1, \ldots, n$, and let $x_{it} \in \mathbb{R}^q$ be a vector of measured covariates. Such data is often modeled in the context of generalized linear models to allow for greater flexibility, specifying the expectation to be $E[Y_{it}] = \lambda_{it} = g^{-1}(x_{it}^T \beta)$, with $\beta$ a $q \times 1$ vector of unknown parameters, and the link function $g(\cdot)$ commonly taken to be the log function. For a Poisson probability specification, the variance is set to be $\mathrm{Var}[Y_{it}] = \lambda_{it}$, which in practice may be too restrictive; often the data exhibit $E[Y_{it}] < \mathrm{Var}[Y_{it}]$. This is known as overdispersion, which, if left uncorrected, may lead to inaccurate inference as a result of underestimated standard errors.

* Correspondence: [email protected], tel. +41 21 693 29 02

Selection and peer-review under responsibility of Spatial Statistics 2011. Open access under CC BY-NC-ND license.

Lambert (1992) presents the technique of zero-inflated Poisson (ZIP) regression, giving rise to a new class of regression models for count data with an abundance of zero observations. In a ZIP model, the non-negative integer-valued response variable $Y$ is assumed to be distributed as a mixture of a Poisson distribution with a parameter $\lambda$ and a distribution with point mass of one at the value zero, with mixing probability $\alpha$. In this ZIP model, the non-zero count responses and a portion of the zero count responses are modeled by the familiar Poisson specification.

Dobbie and Welsh (2001) adapt the generalized estimating equations approach of Liang and Zeger (1986) to zero-inflated spatial count data, and address the issue of dependence by incorporating a correlation matrix. The abundance of zeros is modeled in their method via a two-component approach, proposed by Welsh et al. (1996), among others: the zero observations are modeled separately from the non-zero observations. This method considers first absence versus presence (zero versus non-zero) via a logistic model, and then conditional on presence, models the non-zero counts by a truncated discrete model; Dobbie and Welsh (2001) use a truncated Poisson distribution.

We work in the context of a Poisson generalized linear model, and consolidate the approaches of Lambert (1992) and Dobbie and Welsh (2001) to construct a generalized estimating equation for the zero-inflated Poisson spatial model, and incorporate spatial dependence. Attributing some of the zeros to the Poisson distribution avoids conditioning on the responses, and provides a more intuitive approach to occurrence of zeros in the data. Dobbie and Welsh (2001) use their model to describe weekly counts of Noisy Friarbirds (Philemon corniculatus) recorded by observers for the Canberra Garden Bird Survey; attributing a probability weight of zero observations to a point mass distribution and its complement to a Poisson distribution allows for the distinction between zero counts arising due to an inherent characteristic that may induce zero observations (e.g. inadequacy of the region where measurements were taken for the survival or reproduction of Noisy Friarbirds) and zero counts arising at random.

The remainder of this paper is structured as follows: we provide an overview of the zero-inflated Poisson technique proposed by Lambert (1992), and outline the two-component approach of Dobbie and Welsh (2001). We then develop the Poisson generalized linear model that borrows ideas from the two methods, give its likelihood, and compute the corresponding score equations for the zero-inflated Poisson model. We then follow Diggle et al. (2009) to incorporate dependence into the score equations. As an illustrative example, we give the estimating equations for the non-zero component where the spatial correlation is defined by the celebrated Matérn (1960) class of spatial correlations.

2. Overview

2.1 The Zero-Inflated Poisson Model

Definition 2.1.1: Let $Y_{it}$ be a non-negative, integer-valued random variable describing a discrete number of occurrences for a cross-sectional unit $i$ at a time period $t$, and let $y_{it}$ be the observed event count. $Y_{it}$ is said to be Poisson-distributed, or follows a Poisson distribution, with parameter $\lambda_{it} \in (0, \infty)$ if the probability that there are exactly $y_{it}$ occurrences is given by the probability mass function

$$\Pr(Y_{it} = y_{it}; \lambda_{it}) = f_{Y_{it}}(y_{it}; \lambda_{it}) = \frac{e^{-\lambda_{it}} \lambda_{it}^{y_{it}}}{y_{it}!}.$$

Here, as a generalization, we allow the parameter $\lambda_{it}$ to depend on subject $i$ and time period $t$; furthermore, we allow the parameter to depend on some given information $z_{it} \in \mathbb{R}^s$ (which may even be provided by the same measured covariates $x_{it}$, so it may be that $s = q$), so that $\lambda_{it} = \lambda(z_{it})$.

As mentioned previously, the zero-inflated Poisson model is a mixture of a Poisson distribution, and a point mass of one at the value zero, with a certain mixing probability.

Definition 2.1.2: Let $Y_{it}$ be a random variable as above in Definition 2.1.1; it is said to follow a zero-inflated Poisson distribution with parameter $\lambda_{it} \in (0, \infty)$ and mixing probability $\alpha \in (0, 1)$ if

$$Y_{it} \begin{cases} = 0 & \text{with probability } \alpha, \\ \sim \mathrm{Poisson}(\lambda_{it}) & \text{with probability } (1 - \alpha). \end{cases}$$

It follows that $E[Y_{it}] = (1 - \alpha)\lambda_{it}$ and $\mathrm{Var}[Y_{it}] = (1 - \alpha)\lambda_{it}(1 + \alpha\lambda_{it})$.

Indeed $E[Y_{it}] < \mathrm{Var}[Y_{it}]$, since $\alpha, \lambda_{it} > 0$. The zero-inflated Poisson distribution relaxes the equal expectation and variance constraint, and exhibits overdispersion.
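The moment formulas above can be checked numerically. The following is a minimal sketch in Python (standard library only) that computes the mean and variance of a zero-inflated Poisson variable directly from its probability mass function and compares them with the closed forms $(1-\alpha)\lambda$ and $(1-\alpha)\lambda(1+\alpha\lambda)$. The function names `zip_pmf` and `zip_moments`, the truncation point `y_max`, and the parameter values are illustrative assumptions, not part of the original paper.

```python
import math

def zip_pmf(y, lam, alpha):
    """P(Y = y) for a zero-inflated Poisson with rate lam and mixing probability alpha."""
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return alpha * (1.0 if y == 0 else 0.0) + (1.0 - alpha) * poisson

def zip_moments(lam, alpha, y_max=200):
    """Mean and variance from a truncated sum of the pmf over 0..y_max."""
    probs = [zip_pmf(y, lam, alpha) for y in range(y_max + 1)]
    mean = sum(y * p for y, p in enumerate(probs))
    second = sum(y * y * p for y, p in enumerate(probs))
    return mean, second - mean ** 2

lam, alpha = 3.0, 0.25            # arbitrary illustrative values
mean, var = zip_moments(lam, alpha)
closed_mean = (1 - alpha) * lam
closed_var = (1 - alpha) * lam * (1 + alpha * lam)
print(mean, closed_mean)          # numerical vs closed-form mean
print(var, closed_var)            # numerical vs closed-form variance
```

For any $\alpha > 0$ the computed variance exceeds the mean, which is the overdispersion described in the definition.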

2.2 Modeling Correlated Zero-Inflated Count Data: The Two-Component Approach

The two-component approach presented in Dobbie and Welsh (2001) consists in modeling the zero observations separately from the non-zero observations; it is assumed that given covariates $x_{it} \in \mathbb{R}^q$ and $z_{it} \in \mathbb{R}^r$, $Y_{it} = 0$ with probability $\alpha_{it}$ and $Y_{it} \sim \mathrm{Poisson}_{\mathrm{Truncated}}(\lambda_{it}, n_0)$ with probability $1 - \alpha_{it}$, where $\mathrm{Poisson}_{\mathrm{Truncated}}(\lambda, n_0)$ denotes the zero-truncated Poisson distribution, truncated at $n_0$, with parameter $\lambda$. The probability $\alpha_{it}$ and the parameter $\lambda_{it}$ are allowed to depend respectively on auxiliary information $x_{it}$ and $z_{it}$, which need not be different. This gives

$$\Pr(Y_{it} = 0 \mid x_{it}) = \alpha_{it}, \qquad \Pr(Y_{it} = y_{it} \mid x_{it}, z_{it}) = (1 - \alpha_{it}) \frac{e^{-\lambda_{it}} \lambda_{it}^{y_{it}}}{y_{it}!\,(1 - e^{-\lambda_{it}})} \quad \text{for } y_{it} = 1, 2, \ldots$$

For a two-component generalized linear model, the link functions are

$$g_1(1 - \alpha_{it}) = \mathrm{logit}(1 - \alpha_{it}) = \log\left(\frac{1 - \alpha_{it}}{\alpha_{it}}\right) = x_{it}^T \beta_1, \qquad g_2(\lambda_{it}) = \log(\lambda_{it}) = z_{it}^T \beta_2$$

for the first (presence versus absence) and second (non-zero, conditional on presence) components, respectively.

Given this setting that assumes independence, $\beta_1$ and $\beta_2$ are orthogonal and the log-likelihood for the two-component model is the sum of the logistic log-likelihood, denoted by $L_1$, and the truncated Poisson log-likelihood, denoted by $L_2$; we have $L_1 + L_2$, where

$$L_1(\beta_1) = \sum_{y_{it} = 0} \log\left(\frac{1}{1 + e^{x_{it}^T \beta_1}}\right) + \sum_{y_{it} > 0} \log\left(\frac{e^{x_{it}^T \beta_1}}{1 + e^{x_{it}^T \beta_1}}\right),$$

$$L_2(\beta_2) = \sum_{y_{it} > 0} \left( y_{it} z_{it}^T \beta_2 - e^{z_{it}^T \beta_2} - \log\left(1 - \exp(-e^{z_{it}^T \beta_2})\right) - \log(y_{it}!) \right),$$

from which estimating (score) equations can be derived for each component; details may be found in Liang and Zeger (1986).
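As an illustration of how the two log-likelihood components separate, the following Python sketch (standard library only) evaluates a logistic presence/absence term and a zero-truncated Poisson term on toy data. It assumes the convention that the logit of the presence probability equals $x_{it}^T\beta_1$ and that $\log \lambda_{it} = z_{it}^T\beta_2$; the function names, data, and parameter values are made up for illustration and are not taken from Dobbie and Welsh (2001).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def logistic_loglik(beta1, xs, ys):
    """L1: presence/absence part, assuming logit(P(Y > 0)) = x' beta1."""
    total = 0.0
    for x, y in zip(xs, ys):
        eta = dot(x, beta1)
        log_norm = math.log1p(math.exp(eta))      # log(1 + e^eta)
        total += (eta - log_norm) if y > 0 else -log_norm
    return total

def truncated_poisson_loglik(beta2, zs, ys):
    """L2: zero-truncated Poisson part, using only the positive counts."""
    total = 0.0
    for z, y in zip(zs, ys):
        if y == 0:
            continue
        eta = dot(z, beta2)
        lam = math.exp(eta)
        total += y * eta - lam - math.log(1.0 - math.exp(-lam)) - math.lgamma(y + 1)
    return total

# Toy data (made up for illustration): intercept plus one covariate.
xs = [(1.0, 0.2), (1.0, -0.5), (1.0, 1.3), (1.0, 0.0)]
zs = xs                           # the two covariate sets need not differ
ys = [0, 2, 5, 0]
beta1, beta2 = (0.1, 0.4), (0.3, 0.6)
L1 = logistic_loglik(beta1, xs, ys)
L2 = truncated_poisson_loglik(beta2, zs, ys)
print(L1 + L2)                    # log-likelihood of the two-component model
```

Because `L1` involves only `beta1` and `L2` only `beta2`, each component can be maximized on its own under the independence assumption.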

To incorporate dependence, Dobbie and Welsh (2001) consider derived responses, which comprise an indicator variable for the absence/presence component and a count for the non-zero component, conditional on presence. They proceed to model the dependence of the observed $y_{it}$, and determine what this implies about the dependence between the derived responses. For their two-component model, they consider a single variance-covariance matrix that takes the structure of a $2 \times 2$ block matrix: the variance behavior of each component lies along the diagonal, while the off-diagonal blocks describe the covariance behavior between the two components (the presence/absence component and the positive count component). They then show that in assuming a dependence structure of an autoregressive process of order 1, $\mathrm{AR}(1)$, the unconditional variance does not depend on the covariance behavior between the two components, and thus set the off-diagonal blocks to zero. Furthermore, the consistency of the estimators depends only on the correct specification of the mean functions of the derived responses, and not on the correct choice of correlation matrices. New estimating equations are then re-derived, again following the method of Liang and Zeger (1986), and the corresponding robust estimators for the variances of the parameter estimators are given.
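The block structure described above can be sketched as follows. This is only an illustration in Python of a $2 \times 2$ block matrix whose diagonal blocks are AR(1) correlation matrices and whose off-diagonal blocks are zero; the function names and the values of `T`, `rho_presence`, and `rho_count` are assumptions chosen for the example, and variances are omitted for brevity.

```python
def ar1_correlation(T, rho):
    """T x T correlation matrix with entries rho^|s - t|."""
    return [[rho ** abs(s - t) for t in range(T)] for s in range(T)]

def block_diagonal(upper, lower):
    """2 x 2 block matrix with the given diagonal blocks and zero off-diagonal blocks."""
    T1, T2 = len(upper), len(lower)
    top = [row + [0.0] * T2 for row in upper]
    bottom = [[0.0] * T1 + row for row in lower]
    return top + bottom

T = 3
rho_presence, rho_count = 0.5, 0.2     # hypothetical AR(1) parameters
V = block_diagonal(ar1_correlation(T, rho_presence), ar1_correlation(T, rho_count))
for row in V:
    print(row)
```

The resulting matrix is $2T \times 2T$, with the first $T$ rows and columns referring to the presence/absence component and the remaining ones to the positive count component.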

3. Method

We implement the zero-inflated Poisson model of Lambert (1992) to directly address the matter of overdispersion from excess zeros, and proceed to obtain a likelihood and score equations, which, following Dobbie and Welsh (2001), turn out to be generalized estimating equations in the style of Liang and Zeger (1986), and finally incorporate spatial dependence following Diggle et al. (2009) into our model.

3.1 The Zero-Inflated Poisson Generalized Linear Model

In the context of generalized linear models outlined in the introduction, we specify the commonly-used log-linear link function so that $\log \lambda_{it} = x_{it}^T \beta \iff \lambda_{it} = e^{x_{it}^T \beta}$.

In the generalized Definition 2.1.1, we allowed the parameter of the Poisson distribution to depend on auxiliary information, $\lambda_{it} = \lambda(z_{it})$, which does not necessarily differ from the measured covariates; notice that in using the log-linear link function, we indeed have $\lambda_{it} = \lambda(x_{it})$.

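For the zero-inflated Poisson model, with mixing probabilities $\alpha_{it}$, which for simplicity are also assumed to depend on $x_{it}$, the observations are generated by

$$\Pr(Y_{it} = y_{it} \mid x_{it}) = \alpha_{it}\, I(y_{it} = 0) + (1 - \alpha_{it}) \frac{\exp\left(y_{it} x_{it}^T \beta - e^{x_{it}^T \beta}\right)}{y_{it}!};$$

where $I(y_{it} = 0)$ is an indicator variable. The probability of observing a zero is

$$\Pr(Y_{it} = 0 \mid x_{it}) = \alpha_{it} + (1 - \alpha_{it})\, f(y_{it} = 0 \mid x_{it}) = \alpha_{it} + (1 - \alpha_{it}) \exp\left(-e^{x_{it}^T \beta}\right) = \alpha_{it} + (1 - \alpha_{it})\, e^{-\lambda_{it}},$$

where $f(\cdot \mid x_{it})$ denotes the Poisson probability mass function with parameter $\lambda_{it} = e^{x_{it}^T \beta}$.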

3.2 Likelihood and Score Equations

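The log-likelihood for the zero-inflated Poisson model is

$$\ell(\beta, \alpha) = \sum_{i,t:\, y_{it} = 0} \log\left(\alpha_{it} + (1 - \alpha_{it}) \exp\left(-e^{x_{it}^T \beta}\right)\right) + \sum_{i,t:\, y_{it} > 0} \left(\log(1 - \alpha_{it}) + y_{it} x_{it}^T \beta - e^{x_{it}^T \beta} - \log(y_{it}!)\right).$$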

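Applying the techniques of Liang and Zeger (1986), we notice that the independence estimating equation for the Poisson component of the ZIP model is given by

$$\sum_{i=1}^{n} x_i^T \left(y_i - \exp(x_i \beta)\right) = 0,$$

where $x_i = (x_{i1}, \ldots, x_{iT_i})^T$, $y_i = (y_{i1}, \ldots, y_{iT_i})^T$, and the exponential is applied elementwise, so that $\exp(x_i \beta) = (\lambda_{i1}, \ldots, \lambda_{iT_i})^T$. The solution to this equation gives a consistent estimator of $\beta$ for this portion of the model.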

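Proceeding by classical likelihood analysis, the resulting score equation for the ZIP model with regard to $\beta$ is

$$\frac{\partial \ell}{\partial \beta} = -\sum_{i,t:\, y_{it} = 0} \frac{\Pr(Y_{it} = 0) - \alpha_{it}}{\Pr(Y_{it} = 0)}\, \lambda_{it}\, x_{it} + \sum_{i,t:\, y_{it} > 0} (y_{it} - \lambda_{it})\, x_{it} = 0. \qquad (1)$$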

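Modeling the mixing probability $\alpha_{it}$ as any function of another parameter $\gamma$, $\alpha_{it}(\gamma)$, we may also compute the resulting score equation for the ZIP model with regard to $\gamma$:

$$\frac{\partial \ell}{\partial \gamma} = \sum_{i,t:\, y_{it} = 0} \frac{\Pr(Y_{it} > 0)}{\Pr(Y_{it} = 0)}\, \frac{1}{1 - \alpha_{it}}\, \frac{\partial \alpha_{it}}{\partial \gamma} - \sum_{i,t:\, y_{it} > 0} \frac{1}{1 - \alpha_{it}}\, \frac{\partial \alpha_{it}}{\partial \gamma} = 0. \qquad (2)$$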

The ratios of probabilities provide an intuitive odds-ratio interpretation of the weighting between the two probability components of the ZIP model; however, their dependence on $\beta$ in (2), as well as the dependence of (1) on $\gamma$ through $\alpha_{it}$, requires the two score equations to be solved simultaneously.
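To make the coupling between the two score equations concrete, the sketch below evaluates the ZIP log-likelihood and its derivatives with respect to $\beta$ and $\gamma$ in Python (standard library only), and checks the analytic scores against central finite differences. It assumes, purely for illustration, a single scalar $\gamma$ with $\alpha_{it} = 1/(1 + e^{-\gamma})$ for all $i, t$; the paper allows any function $\alpha_{it}(\gamma)$. Function names and data are hypothetical.

```python
import math

def expit(g):
    return 1.0 / (1.0 + math.exp(-g))

def zip_loglik(beta, gamma, xs, ys):
    """ZIP log-likelihood with log link and a constant mixing probability expit(gamma)."""
    alpha = expit(gamma)
    total = 0.0
    for x, y in zip(xs, ys):
        eta = sum(a * b for a, b in zip(x, beta))
        lam = math.exp(eta)
        if y == 0:
            total += math.log(alpha + (1 - alpha) * math.exp(-lam))
        else:
            total += math.log(1 - alpha) + y * eta - lam - math.lgamma(y + 1)
    return total

def zip_scores(beta, gamma, xs, ys):
    """Analytic derivatives of the log-likelihood with respect to beta and gamma."""
    alpha = expit(gamma)
    dalpha = alpha * (1 - alpha)              # d alpha / d gamma for the logistic choice
    s_beta, s_gamma = [0.0] * len(beta), 0.0
    for x, y in zip(xs, ys):
        lam = math.exp(sum(a * b for a, b in zip(x, beta)))
        p0 = alpha + (1 - alpha) * math.exp(-lam)   # Pr(Y = 0)
        if y == 0:
            w = -(p0 - alpha) / p0 * lam
            s_gamma += (1 - p0) / p0 * dalpha / (1 - alpha)
        else:
            w = y - lam
            s_gamma -= dalpha / (1 - alpha)
        for j, xj in enumerate(x):
            s_beta[j] += w * xj
    return s_beta, s_gamma

# Toy data (made up): intercept plus one covariate.
xs = [(1.0, 0.2), (1.0, -0.5), (1.0, 1.3), (1.0, 0.0), (1.0, 0.7)]
ys = [0, 2, 5, 0, 1]
beta, gamma, h = [0.3, 0.6], -0.4, 1e-6
s_beta, s_gamma = zip_scores(beta, gamma, xs, ys)
num_gamma = (zip_loglik(beta, gamma + h, xs, ys) - zip_loglik(beta, gamma - h, xs, ys)) / (2 * h)
num_beta0 = (zip_loglik([beta[0] + h, beta[1]], gamma, xs, ys)
             - zip_loglik([beta[0] - h, beta[1]], gamma, xs, ys)) / (2 * h)
print(s_beta[0], num_beta0)     # analytic vs numerical derivative in beta_0
print(s_gamma, num_gamma)       # analytic vs numerical derivative in gamma
```

Both scores depend on `beta` and `gamma` through `p0`, which is why a joint root-finding or optimization routine is needed rather than two separate fits.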

3.3 Introducing Dependence

Following Dobbie and Welsh (2001) and Diggle et al. (2009) in the setting of marginal models, we may introduce dependence into the score equations (1) and (2), extending them to incorporate a single $2T_i \times 2T_i$ spatial variance-covariance matrix describing the variance for both components of the model, which takes the structure of a $2 \times 2$ block matrix, as described earlier. Diggle et al. (2009) show that for marginal models under appropriate parameterizations, the score equations assume a form of a generalized estimating equation (Liang and Zeger, 1986),

$$\left(\frac{\partial \mu}{\partial \beta}\right)^T \mathrm{Var}[Y]^{-1} (Y - \mu) = 0 \qquad (3)$$

where $\mu$ denotes a mean function of the parameter $\beta$. As (1) assumes this form, we may introduce spatial correlation via the variance-covariance matrix $\mathrm{Var}[Y]$. The solution to this equation gives a consistent estimator of $\beta$. Corresponding robust estimators of its variance are provided by Liang and Zeger (1986), and detailed in Dobbie and Welsh (2001).
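One way an equation of the form (3) might be solved numerically is by Fisher scoring. The Python sketch below (standard library only) does this for a plain Poisson mean $\mu = \exp(X\beta)$ with a working covariance built from Poisson variances and a user-supplied correlation matrix `R`. It is a simplified illustration under stated assumptions: it ignores the zero-inflation component, treats all observations as one correlated cluster, assumes the first column of `X` is an intercept, and uses hypothetical names and toy data.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))   # partial pivoting
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def gee_score_and_step(beta, X, y, R):
    """Return the estimating function D' V^{-1} (y - mu) and the updated beta."""
    n, q = len(y), len(beta)
    mu = [math.exp(sum(xj * bj for xj, bj in zip(row, beta))) for row in X]
    # Working covariance: Poisson variances mu on the diagonal, correlation R.
    V = [[math.sqrt(mu[s] * mu[t]) * R[s][t] for t in range(n)] for s in range(n)]
    D = [[mu[s] * X[s][j] for j in range(q)] for s in range(n)]   # d mu / d beta
    Vinv_r = solve(V, [y[s] - mu[s] for s in range(n)])
    Vinv_D = [solve(V, [D[s][k] for s in range(n)]) for k in range(q)]
    score = [sum(D[s][j] * Vinv_r[s] for s in range(n)) for j in range(q)]
    info = [[sum(D[s][j] * Vinv_D[k][s] for s in range(n)) for k in range(q)]
            for j in range(q)]
    step = solve(info, score)
    return score, [b + d for b, d in zip(beta, step)]

def fit_gee(X, y, R, iterations=100):
    # Start from an intercept-only fit (first column of X assumed to be the intercept).
    beta = [math.log(sum(y) / len(y))] + [0.0] * (len(X[0]) - 1)
    for _ in range(iterations):
        _, beta = gee_score_and_step(beta, X, y, R)
    return beta

# Toy data (made up): intercept plus one covariate at six sites on a line.
X = [(1.0, -1.0), (1.0, -0.5), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 1.5)]
y = [1, 0, 2, 3, 4, 7]
n = len(y)
R_indep = [[1.0 if s == t else 0.0 for t in range(n)] for s in range(n)]
R_ar1 = [[0.3 ** abs(s - t) for t in range(n)] for s in range(n)]
beta_indep = fit_gee(X, y, R_indep)
beta_ar1 = fit_gee(X, y, R_ar1)
print(beta_indep, beta_ar1)
```

With the identity matrix as `R`, the estimating equation reduces to the independence equation of Liang and Zeger (1986); any other valid correlation matrix, such as a spatial one, can be passed in its place.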

3.4 An Example

For the score equation (2) concerning the mixing probability of the zero-component of the ZIP model, we follow Dobbie and Welsh (2001) and specify an $\mathrm{AR}(1)$ dependence structure. The celebrated Matérn (1960) class of spatial correlations gives general and versatile forms for isotropic spatial processes, encompassing two other important models commonly used in geostatistical applications. In such an application, the matrix $\mathrm{Var}[Y]$ in (3) would take the form

$$\begin{pmatrix} \gamma(0) & \gamma(s_1 - s_2) & \cdots & \gamma(s_1 - s_n) \\ \gamma(s_2 - s_1) & \gamma(0) & \cdots & \vdots \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(s_n - s_1) & \cdots & \cdots & \gamma(0) \end{pmatrix}$$

where $\gamma(\cdot)$ is the spatial semivariogram function, given by the difference of spatial covariograms $\gamma(h) = C(0) - C(h)$ in the isotropic case, for some separation vector $h$. The Matérn (1960) covariogram is given by

$$C(h) = \frac{\sigma^2}{2^{\nu - 1}\, \Gamma(\nu)} \left(\frac{\|h\|}{\vartheta}\right)^{\nu} K_\nu\left(\frac{\|h\|}{\vartheta}\right)$$

where $\|\cdot\|$ denotes the usual Euclidean norm, $\sigma^2$ is the variance of the spatial process, $\Gamma(\cdot)$ is the gamma function, the parameter $\nu$ governs the smoothness of the process, the parameter $\vartheta$ is related to the range of the spatial process (i.e. the lag $h$ beyond which the spatial process values are no longer correlated), and $K_\nu(\cdot)$ is the modified Bessel function of the second kind of order $\nu$.
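A Matérn covariogram can be evaluated with a few lines of code. The sketch below, in Python with the standard library only, assumes the common parameterization $C(h) = \sigma^2\, 2^{1-\nu}\, \Gamma(\nu)^{-1} (\|h\|/\vartheta)^{\nu} K_\nu(\|h\|/\vartheta)$ and computes $K_\nu$ from the integral representation $K_\nu(x) = \int_0^\infty e^{-x \cosh t} \cosh(\nu t)\, dt$ by the trapezoidal rule. The function names, the quadrature settings, the site coordinates, and the parameter values are illustrative assumptions; in practice a library routine for $K_\nu$ would normally be preferred.

```python
import math

def bessel_k(nu, x, steps=4000):
    """Modified Bessel function of the second kind, K_nu(x), for x > 0,
    by trapezoidal quadrature of exp(-x cosh t) cosh(nu t) on [0, t_max]."""
    t_max = math.acosh(1.0 + 50.0 / x)        # integrand is negligible beyond this point
    h = t_max / steps
    total = 0.5 * (math.exp(-x) + math.exp(-x * math.cosh(t_max)) * math.cosh(nu * t_max))
    for k in range(1, steps):
        t = k * h
        total += math.exp(-x * math.cosh(t)) * math.cosh(nu * t)
    return total * h

def matern_covariogram(h_norm, sigma2, nu, vartheta):
    """Matern covariance at lag distance h_norm under the assumed parameterization."""
    if h_norm == 0.0:
        return sigma2                         # limit of the formula as the lag tends to 0
    u = h_norm / vartheta
    return sigma2 * 2.0 ** (1.0 - nu) / math.gamma(nu) * u ** nu * bessel_k(nu, u)

def semivariogram(h_norm, sigma2, nu, vartheta):
    """gamma(h) = C(0) - C(h)."""
    return sigma2 - matern_covariogram(h_norm, sigma2, nu, vartheta)

def matern_matrix(sites, sigma2, nu, vartheta):
    """Matrix of covariances C(s_i - s_j) for 2-D site coordinates."""
    return [[matern_covariogram(math.hypot(s[0] - t[0], s[1] - t[1]), sigma2, nu, vartheta)
             for t in sites] for s in sites]

sites = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]  # made-up coordinates
sigma2, nu, vartheta = 2.0, 0.5, 0.7          # arbitrary illustrative values
C = matern_matrix(sites, sigma2, nu, vartheta)
for row in C:
    print(row)
```

With $\nu = 1/2$ this parameterization reduces to the exponential covariance $\sigma^2 e^{-\|h\|/\vartheta}$, which gives a simple check on the quadrature.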

4. Conclusion

In this paper, we propose a model for spatial count data comprising excess zeros, a problem which creates overdispersion leading to underestimated standard errors and thus inaccurate inference. We combine methods proposed by Lambert (1992) and Dobbie and Welsh (2001) to construct a zero-inflated Poisson model in which zeros are generated by a point-mass probability at the value zero as well as by a usual Poisson distribution, which provides a more perceptive insight into the interpretation of the structural versus sampling zeros for data where zeros are abundant. We give the likelihood for this model and derive its corresponding score equations, thus obtaining generalized estimating equations in the style of Liang and Zeger (1986) and Diggle et al. (2009) that may incorporate any spatial dependence among observations. We illustrate with an autoregressive dependence structure of order 1 for the component corresponding to the zero observations following Dobbie and Welsh (2001), and a Matérn (1960) spatial correlation model for the component corresponding to the Poisson distribution in our ZIP model.

Acknowledgements

I am deeply indebted to Prof. Stephan Morgenthaler for his guidance, support, and scientific input. Research supported in part by the Swiss National Science Foundation, Grant No. FN 200021-116146.

References

[1] Lambert D. Zero-Inflated Poisson Regression, With an Application to Defects in Manufacturing. Technometrics 1992;34(1):1–14.

[2] Dobbie M, Welsh AH. Modelling Correlated Zero-Inflated Count Data. Aust. N. Z. Stat. 2001;43(4):431–444.

[3] Welsh AH, Cunningham RB, Donnelly CF, Lindenmayer DB. Modelling the Abundance of Rare Species: Statistical Models for Counts with Extra Zeros. Ecological Modelling 1996;88:297–308.

[4] Diggle PJ, Heagerty P, Liang K-Y, Zeger SL. Analysis of Longitudinal Data. 2nd ed. Oxford University Press; 2009.

[5] Liang K-Y, Zeger SL. Longitudinal Data Analysis Using Generalized Linear Models. Biometrika 1986;73(1):13–22.

[6] Matérn B. Spatial Variation. 2nd ed. Springer Series in Statistics No. 36. Springer-Verlag; 1960.
