Generalized Estimating Equations - Using generalized estimating equations with regression splin

As stated previously, we cannot rely on using GLMs, as some of the assumptions are not valid. Another method is required, which will allow for correlation in the data.

Longitudinal data of this kind, which follows a site (or patient) over time (i.e., repeated measures), are very common, especially in biomedical sciences, and new methods are being developed to deal with them. Section 3.2 expands on the topic of longitudinal data. In order to develop an index of abundance for site i at time t, a new set of models is necessary, since classical linear models and GLMs are unsatisfactory. GEEs are extensions of GLMs which allow for:

• Longitudinal type data, following a site over time,

• Correlation among observations on the same unit (site)

Hardin and Hilbe [2003] introduce the Generalized Estimating Equation (GEE) with regards to panel data, i.e., where data are clustered (e.g., litters of animals)

or repeated measures (e.g., a patient followed through time). GEEs are split into subject-specific (SS) models or population averaged (PA) models (also known as marginal models). The former models individual panels (sites) and attempts to explain the source of covariance, the latter’s regression coefficients describe the average population response and only describe the covariance among repeated observations (do not attempt to explain it). Since we only want to allow for dependence and are not primarily interested in it scientifically, we will use a PA model. In the case of normal data, there is a simple relationship between SS and PA models, but for other distributional forms, this becomes more complicated. Although attempts were made to solve the problem of correlated data for sev- eral individual GLM models, GEE is a broad framework which unifies these methods.

One of the primary advantages of using GEEs is that no distributional form is assumed, so there is no danger of specifying a wrong distribution, and avoids the need to specify a multivariate distribution. GEEs are essentially a general, unified form of using quasi-likelihood in that there are no parametric assumptions.

As in the case of GLMs, we have the following scenario: We have a set of observations from K sites, and for each site i, (i= 1, ..., K), we have a vector of observed counts yit, for Butterfly Monitoring Survey days t = 1, ...,250 with corresponding vector of covariates Xit = (x0i1, ..., x0i250), where some covariates

are site-specific, and some change over time, for the 26 weeks in the Butter- fly Monitoring Season, allowing for missing values. In general, we assume that components ofyitare correlated, but that individual sites are independent from one another - this issue is further discussed in Chapter 4, Section 4.3.7. To model the relationship between the count and the covariates, with the ultimate aim of deriving an annual, site-level index of abundance, a regression method similar to that of the GLM is formed:

and

µit=E(Yit|Xit), and

β = (β1, ..., βp)

is a vector of punknown regression coefficients to be estimated andg(.) is the link function (usually taken as the canonical link function of a distribution). The main difference between GLMs and GEEs is that the GEE allows R, a working correlation matrix, to be specified. Some general forms of this working correlation matrix are the independence structure, the exchangeable (or compound symmetry) form, or the auto-regressive form. There is much discussion in the literature about the consistency and efficiency of estimators ofβ under the different forms assumed, and the underlying truth of the correlation structure (e.g., McDonald [1993] and Fitzmaurice [1995]). Choosing the “correct” correlation structure, and model selection in general is discussed later in this chapter.

The form the variance of observationsYittakes is as follows:

V(Yi) =φA 1 2 iRA 1 2 i, (3.18)

where φis the dispersion parameter, andA12

i is diagonal matrix with elements

V(Yit) andRis the working correlation matrix.

GEE theory is based on the assumption that there are no missing data, or if missing data exists, they must be missing completely at random (MCAR). When fitting GEEs, the following items need to be considered:

• a model for the mean

• a model for the variance

It was Liang and Zeger [1986] who first introduced the unified GEE approach, and estimation is described in the following section.

3.8.1 GEE Estimation

Focus here is on PA-GEE estimation, which models the average count across all panels (sites), so that we develop a marginal regional model, which can predict counts at individual sites, using site-specific covariates. This is the brand of model introduced in Liang and Zeger [1986].

We want to estimateβ, the vector of parameter estimates andV, the associated variance-covariance matrix.

We want to solve equation 3.13, but instead of V being a diagonal matrix with diagonal elements V(Yit), V becomes Vi = (A

1 2

iRiA

1 2

i)φ, where the A is the diagonal matrix with elements V(Yit) and R is the correlation matrix (to be specified), and φis the dispersion parameter.

Beginning with the assumption of independence between all observations,Rbe- comes the identity matrix, and V(Yit) is as before in Equation 3.11. However, we can specify R to have other forms, and the estimation algorithm estimates simultaneously the marginal model parameters as well as the correlation parameters. The method alternates between updating the estimates for β and updating the correlation matrix R.

3.8.2 Specifying the Correlation Matrix

An important aspect of the GEE is specifying the form of the correlation matrix,

R. According to Liang and Zeger [1986], the GEE approach yields a consistent estimator of β even whenR is misspecified. For this reason, an independence model is often used when the choice of R is not obvious. The most commonly used are:

• Independence model

• Compound symmetry (or exchangeable), where

corr(yij, yik) =        1 forj=k ρ forj6=k

• First order Auto-regressive (AR-1), where

corr(yij, yik) =ρ|j−k|

• Unstructured - where every element of the correlation matrix is estimated separately.

Zeger et al. [1988] also demonstrated that ˆβ obtained under the independence model is relatively efficient. This (independence) is a dangerous assumption to rely on, however, as McDonald [1993] showed that when the correlation between responses is large, ˆβ becomes inefficient. Fitzmaurice [1995] discusses this further and concludes that for time-varying or cluster-specific covariates (both of which we have in the BMS data set), estimates of ˆβ under the independence assumption may be very inefficient, often as low as 60%. Efficiency here is measured as Asymptotic Relative Efficiency (ARE), where the ARE of an element of ˆβis measured as the ratio of elements of the asymptotic covariance matrix of a model with correctly specified correlation and the the asymptotic covariance matrix given by the incorrectly specified “working” correlation matrix. Zeger et al. [1988] warns of the danger of ignoring correlation - in some studies, it leads to incorrect interpretations of data and model inferences. There is strong sug- gestion that there will be autocorrelation within site for the BMS data set, and it is important that this should be accommodated for, if found to be significant.

In document Using generalized estimating equations with regression splines to improve analysis of butterfly transect data (Page 47-51)