Increasing the number of clusters - Generalized linear mixed models for count data

Following the same idea, we now fix the cluster size and let the number of clusters increase. We obtain the following results.

Table 4.5: H-likelihood (cluster number)

H-likelihood  10*10                                 50*10
              Estimate    Std. error  p-value       Estimate    Std. error  p-value
β̂1            0.97705     0.09818     5.12e-11      1.05681     0.06743     <2e-16
β̂2            -0.26176    0.45347     0.568         -0.37794    0.27389     0.168
ŝ             0.5167      NA          NA            0.9558      NA          NA

Table 4.6: Laplace (cluster number)

Laplace       10*10                                 50*10
              Estimate    Std. error  p-value       Estimate    Std. error  p-value
β̂1            0.9772      0.08823     1.646e-28     1.0556      0.06633     4.988e-57
β̂2            -0.2640     0.40842     0.5180        -0.3799     0.26940     0.1585
σ̂             0.6405      0.1499      NA            0.9615      0.09789     NA

Table 4.7: Gauss-Hermite (cluster number)

Gauss-Hermite 10*10                                 50*10
              Estimate    Std. error  p-value       Estimate    Std. error  p-value
β̂1            0.9771      0.08824     1.691e-28     1.0556      0.06634     5.165e-57
β̂2            -0.2637     0.40848     0.5186        -0.3799     0.26944     0.1585
σ̂             0.6406      0.1499      NA            0.9615      0.09792     NA

Table 4.8: EM (cluster number)

EM            10*10                                 50*10
              Estimate    Std. error  p-value       Estimate    Std. error  p-value
β̂1            1.0190927   0.06735784  0.0000000     1.1799494   NA          NA
β̂2            -0.1599988  0.49920463  0.7485838     -0.5512227  NA          NA
σ̂             0.4382116   0.1992431   0.0139254     0.9765001   0.2159201   3.055585e-06

4.3 Discussion

From Section 4.1, we have seen that as the cluster size increases, all the methods tend to give better estimates, especially for the fixed effects. However, although β̂2 is very close to the true parameter, its standard error is very large, so β2 appears insignificant according to its p-value. This can be a problem: we may not be able to decide whether a parameter belongs in the model simply by looking at its p-value, which is certainly worth further study. Another interesting question is what causes the difference between the standard errors of β̂1 and β̂2. The dispersion parameter σ of the random effect is poorly estimated by all of these methods. This is not surprising: by keeping the number of clusters constant, we do not obtain an increasing number of random effects from which to estimate σ. It is worth noting, however, that all of these methods tend to underestimate σ.

According to Section 4.2, if we fix the cluster size and increase the number of clusters, the estimate of the dispersion parameter σ improves markedly for each method, as we should expect. The estimates of the fixed effects do not seem to improve; however, their standard errors tend to decrease.

One way to interpret this is that as the cluster size increases, the peak of the likelihood function (for the fixed effects) tends to converge to the true parameter, while the shape of the likelihood function is unlikely to change. On the other hand, if we increase the number of clusters, the likelihood function tends to become narrower, which decreases the standard errors of the estimates. However, the peak of the likelihood function does not seem to converge to the true parameter. Therefore, if we are only interested in the value of β̂, we should prefer a large cluster size to a large number of clusters. If we are more interested in the dispersion parameter of the random effects, which is seldom the case, we may prefer a large number of clusters. If we want to make inference on the parameter estimates, we should want both the cluster size and the number of clusters to be relatively large.

Surprisingly, we also find that Gauss-Hermite quadrature may not noticeably improve the estimates compared with the standard Laplace approximation, in contrast to what we would expect from Section 2. However, Bianconcini [5] gave a more complicated example where applying Gauss-Hermite quadrature does improve the estimates.

Chapter 5 Application

In this chapter, we present an analysis of data arising from a clinical trial of 58 epilepsy patients [39]. Each of the patients was assigned to a control group or a treatment group. The experiment recorded the number of seizures experienced by each patient over four two-week periods. The experiment also recorded a baseline count of the number of seizures the patients had experienced during the previous eight weeks.

In the analysis, we use the following covariates:

(i) Base: the logarithm of (baseline/4);

(ii) Age: the patient's age in years;

(iii) Treat: an indicator variable for the treatment group;

(iv) Visiti: an indicator variable for the ith time period, i = 1, 2, 3, 4.

5.1 Model fitting

We first construct our base model as a Poisson regression model, i.e. a GLM where the response variable is assumed to be Poisson distributed. Hence the base model can be written as

$$\log \nu_{ij} = \beta_0 + \beta_1 \text{Base} + \beta_2 \text{Age} + \beta_3 \text{Treat} + \sum_{k=4}^{7} \beta_k \text{Visit}_{k-3} + (\text{interaction terms}).$$

Before we fit the model, let us first form some reasonable expectations. First, it seems natural to expect that νij is directly proportional to (baseline/4), so we might hope that β0 = 0 and β1 = 1. Moreover, if the drug is effective, then β3 < 0. Finally, 'Visiti' measures the time effect: a positive coefficient of Visiti suggests that the patients' conditions generally become worse as time increases, and a negative coefficient suggests the opposite.
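The first expectation can be made explicit with a short calculation. This is only a sketch: the proportionality constant c is a symbol introduced here for illustration and does not appear in the thesis, the other covariates are held at zero, and Base is the covariate log(baseline/4) defined above.

```latex
\begin{align*}
\nu_{ij} &= c \cdot \frac{\text{baseline}_i}{4}, \qquad c > 0, \\
\log \nu_{ij} &= \log c + 1 \cdot \log\!\left(\frac{\text{baseline}_i}{4}\right)
  = \underbrace{\log c}_{\beta_0} + \underbrace{1}_{\beta_1} \cdot \text{Base}_i .
\end{align*}
```

Proportionality alone therefore gives β1 = 1, while β0 = 0 corresponds to the special case c = 1, in which the expected count in a two-week period equals the baseline count per two weeks.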

None of the above is rigorously justified, but these expectations will help us to interpret the results that follow. Fitting the base model using R, we obtain

$$\log \nu_{ij} = \beta_0 + \beta_1 \text{Base} + \beta_2 \text{Age} + \beta_3 \text{Treat} + \beta_7 \text{Visit}_4,$$

where we have the estimates

Base model  β̂0         β̂1        β̂2         β̂3        β̂7         Overdispersion
Estimate    -0.514441  0.983111  -0.243602  0.023663  -0.147920  4.067673
Std. error  0.161559   0.037291  0.052225   0.003943  0.059149   NA

One can see that this model fits the data poorly: β̂3 is small but extremely significant, and there is a strong sign of overdispersion. We should therefore try alternative models. Following the natural grouping, we treat each individual as a cluster and fit GLMMs with a single additive random effect for each cluster. The model becomes

$$\log \mu_{ij} = \beta_0 + \beta_1 \text{Base} + \beta_2 \text{Age} + \beta_3 \text{Treat} + \beta_7 \text{Visit}_4 + u_i.$$

Then in order to obtain the parameter estimates, we are going to apply the estimation methods presented in Chapter 2. First, using h-likelihood, we have

H-likelihood  β̂1       β̂3        ŝ
Estimate      1.00325  -0.30916  0.2153
Std. error    0.05023  0.14388   NA

Although β7 turns out to be not very significant (p-value = 0.101), we also fit the augmented model including β7, so that the results are comparable with those of the following methods. This gives

H-likelihood*  β̂1       β̂3        β̂7        ŝ
Estimate       1.01709  -0.30259  -0.14211  0.2179
Std. error     0.05109  0.14439   0.08773   NA

where ŝ represents the sample variance of the estimates of the random effects.

For the other methods, we have

Laplace     β̂1       β̂3       β̂7       σ̂
Estimate    1.0155   -0.3303  -0.1449  0.5109
Std. error  0.05161  0.14125  0.05906  0.05946

G-H         β̂1       β̂3       β̂7       σ̂
Estimate    1.0154   -0.3302  -0.1449  0.5121
Std. error  0.05173  0.14154  0.05906  0.05972

and

EM          β̂1          β̂3          β̂7          σ̂
Estimate    1.0044731   -0.3150676  -0.1447314  0.2611887
Std. error  0.08142006  0.16163689  0.05915885  0.06129364

where σ̂ is the dispersion parameter estimate of the (pre-specified) distribution of the random effects.

5.2 Fitting overdispersion

In GLMs, the usual approach to testing for overdispersion in a Poisson regression model is to compare the squared residuals (yij − µij)² with the fitted values µij by fitting a linear model with no intercept. If there is no overdispersion, we should expect the slope to be 1.
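The regression check described above can be sketched as follows. This is a minimal illustration and not the R code used in the thesis: the function name `overdispersion_slope` and the toy numbers are made up for the example, and the fitted values are assumed to come from an already fitted Poisson GLM.

```python
def overdispersion_slope(y, mu):
    """Least-squares slope of (y - mu)^2 on mu for a line through the origin.

    For a no-intercept regression of r on x the slope is sum(x*r) / sum(x*x).
    Under a Poisson model var(y) = mu, so the slope should be close to 1;
    values well above 1 point to overdispersion.
    """
    if len(y) != len(mu):
        raise ValueError("y and mu must have the same length")
    sq_resid = [(yi - mi) ** 2 for yi, mi in zip(y, mu)]
    num = sum(mi * ri for mi, ri in zip(mu, sq_resid))
    den = sum(mi * mi for mi in mu)
    return num / den


# Toy data (hypothetical counts and fitted means, for illustration only).
y = [3, 1, 6]
mu = [2.0, 2.0, 4.0]
slope = overdispersion_slope(y, mu)
print(round(slope, 4))
```

With real data one would also want a standard error for the slope before drawing conclusions; the sketch only computes the point estimate.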

However, it is much harder to test for overdispersion in GLMMs, because fitted values and residuals are not clearly defined: µij depends on the unknown random effect ui. This is where the h-likelihood becomes useful. The general idea is to test for overdispersion using the estimates obtained with the h-likelihood; if the model is clearly overdispersed, then we may convince ourselves that the models obtained with the other estimation methods will also be overdispersed. Since h-likelihood estimation also estimates the random effects, we can substitute these estimates into formula (2.11) to obtain the estimate ν̂ij of νij as our fitted value, and the residual can then be defined as yij|û − ν̂ij. Therefore we can examine the relationship between the squared residuals and the fitted values by plotting the fourth roots of the squared residuals against the fourth roots of the fitted values.

One can see from Figure 5.1 that the relationship between the squared residuals and the fitted values is unlikely to be linear and looks quadratic instead. Inspired by the derivation of the negative binomial distribution (3.14), we look for an expression of the form

$$\operatorname{var}(y_{ij} \mid u) = \nu_{ij} + \frac{\nu_{ij}^2}{r}, \tag{5.1}$$

for which we obtain a highly significant estimate of the dispersion parameter r, equal to 0.22719 with standard error 0.03015. This convinces us that overdispersion is present.
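One simple way to obtain a moment-type estimate of r in (5.1) is a no-intercept least-squares fit of the excess squared residual on the squared fitted value. This is an assumption made for illustration: the thesis does not state how its estimate of r was computed, and the function names below are hypothetical.

```python
def nb_conditional_variance(nu, r):
    """Variance function of the form (5.1): nu + nu^2 / r."""
    return nu + nu ** 2 / r


def estimate_r(sq_resid, nu):
    """Moment-type estimate of r from squared residuals and fitted values.

    Rearranging (5.1) gives sq_resid - nu ~ (1 / r) * nu^2, a line through
    the origin in nu^2. Its least-squares slope estimates 1 / r.
    """
    x = [v ** 2 for v in nu]
    z = [e - v for e, v in zip(sq_resid, nu)]
    slope = sum(xi * zi for xi, zi in zip(x, z)) / sum(xi * xi for xi in x)
    if slope <= 0:
        raise ValueError("no overdispersion detected: slope is not positive")
    return 1.0 / slope


# Synthetic check: squared residuals that follow (5.1) exactly with r = 0.5.
nu = [1.0, 2.0, 3.0]
sq_resid = [nb_conditional_variance(v, 0.5) for v in nu]
print(estimate_r(sq_resid, nu))
```

Because r sits in the denominator of (5.1), a smaller estimate of r corresponds to a larger quadratic term and hence stronger overdispersion.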

With the existing R packages, we can only fit a negative binomial mixed model using the EM algorithm, which gives

Negbinom    β̂1          β̂3          r̂         σ̂
Estimate    1.0104905   -0.3504361  0.151891  0.227846
Std. error  0.07562611  0.18510534  0.493470  0.0723605

Motivated by the simulation results in Chapter 4 and the example in the previous section, we might want to include β7, even though it is not significant. Then we have

Negbinom*   β̂1          β̂3          β̂7          r̂         σ̂
Estimate    1.0205100   -0.3326191  -0.1029897  0.148820  0.2199149
Std. error  0.04528806  0.10420430  0.08989300  0.428109  0.04323876

We can see that the h-likelihood tends to overestimate the dispersion parameter r. The conditional variance (5.1) will then tend to be underestimated (since r appears in the denominator). This should not be surprising: as discussed in Section 2.2, the h-likelihood tends to overfit.

The methods give similar estimates for the fixed effects: β̂1 ≈ 1, β̂3 ≈ −0.31 and β̂7 ≈ −0.14.

However, the estimates of the dispersion parameter of the random effects differ more: σ̂ ≈ 0.22 according to the h-likelihood and the EM algorithm, but σ̂ ≈ 0.51 from the Laplace approximation and Gauss-Hermite quadrature. We cannot tell exactly which method gives the best estimate of σ; however, we should expect the estimates from the EM algorithm to be the most accurate, given a sufficiently large number of Monte-Carlo iterations.

Chapter 6 Summary

In this thesis, we have explored the estimation methods of GLMMs for count data and the methods for modeling overdispersion for Poisson mixed models. In this final chapter, we summarize and compare the features of each method.

6.1 Estimation methods

Hierarchical-likelihood

The hierarchical likelihood, or h-likelihood, is a direct generalization of the estimation methods for GLMs. In the h-likelihood approach, we do not distinguish between the fixed and random effects; instead, we treat both as fixed effects and estimate them using (2.11) and (2.12) to get β̂ and û. Equivalently, we are fitting a GLM with parameters (β, u), which may cause overfitting by introducing too many parameters. We cannot obtain estimates of the variance components of the random effects by maximizing the h-likelihood with respect to the covariance matrix of the random effects. However, we can approximate the variance components by the sample variance (covariance) of the random effects estimates. The h-likelihood is computationally fast, allows for an arbitrary number of random effects, and does not require the distribution of the random effects to be normal.

Laplace approximation

The Laplace approximation is a simple numerical approximation method based on the Taylor series expansion. To approximate a given integral, in this case the ordinary likelihood function (2.4):

$$L(y, \beta) = \int \exp[h(y, \beta, u)] \, du, \tag{6.1}$$

we first expand the logarithm of the integrand, h(y, β, u), around its maximizer û, which is exactly the estimate of the random effects in the h-likelihood. However, unlike the h-likelihood, we do not estimate β by simply maximizing h(y, β, u) with respect to β. Instead, we first approximate the integral L(y, β) to sufficiently high order (2.27), and then maximize the approximated likelihood function Lla(y, β) with respect to β to get β̂la.
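For a single random effect, the first-order version of this approximation can be sketched in a few lines. The sketch is an illustration under stated assumptions rather than the implementation used in the thesis: the function name `laplace_integral` is hypothetical, h is assumed to be smooth and strictly concave in u so that Newton's method finds the maximizer û, and only the standard first-order formula exp[h(û)]·sqrt(2π/(−h''(û))) is implemented, not the higher-order expansion (2.27). The Poisson numbers are made up.

```python
import math


def laplace_integral(h, dh, d2h, u0=0.0, tol=1e-12, max_iter=100):
    """First-order Laplace approximation of the integral of exp(h(u)) du.

    h, dh, d2h are the log-integrand and its first two derivatives in u.
    Returns (approximate integral, maximizer u_hat).
    """
    u = u0
    for _ in range(max_iter):
        step = dh(u) / d2h(u)      # Newton step towards the root of dh
        u -= step
        if abs(step) < tol:
            break
    curvature = -d2h(u)            # positive at a maximum
    return math.exp(h(u)) * math.sqrt(2.0 * math.pi / curvature), u


# Sanity check: for a Gaussian log-integrand the approximation is exact.
approx, u_hat = laplace_integral(
    h=lambda u: -0.5 * (u - 1.0) ** 2,
    dh=lambda u: -(u - 1.0),
    d2h=lambda u: -1.0,
)
print(u_hat, approx, math.sqrt(2.0 * math.pi))

# One Poisson cluster with a normal random intercept (hypothetical numbers):
# h(u) = s*u - m*exp(u) - u^2 / (2*sigma^2), constants in u dropped, where
# s is the cluster total and m the sum of exp(x'beta) over the cluster.
s, m, sigma = 7.0, 5.0, 0.5
approx_pois, u_pois = laplace_integral(
    h=lambda u: s * u - m * math.exp(u) - u ** 2 / (2 * sigma ** 2),
    dh=lambda u: s - m * math.exp(u) - u / sigma ** 2,
    d2h=lambda u: -m * math.exp(u) - 1.0 / sigma ** 2,
)
print(u_pois, approx_pois)
```

In a full fit this approximation would be evaluated for every cluster at each trial value of β, and the product of the cluster contributions maximized over β.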

Again, we cannot estimate the variance components of the random effects using the Laplace approximation (2.14); a subsidiary method is needed. We know from Theorem 2.5 that Gauss-Hermite quadrature with 1 quadrature point is equivalent to the standard Laplace approximation and, more generally [5], that Gauss-Hermite quadrature with 2n + 1 quadrature points shares the asymptotic properties of the Laplace estimator of order o(N^{−n}). Therefore we can estimate the variance components of the random effects using Gauss-Hermite quadrature. The Laplace approximation is slower than the h-likelihood but reduces the risk of overfitting.

Gauss-Hermite quadrature

Gauss-Hermite quadrature, as a special case of Gaussian quadrature, is another well-known numerical approximation method. Gauss-Hermite quadrature approximates the integral by a weighted sum of function values at specified quadrature points, which are the roots of Hermite polynomials (2.41):

$$\int_{-\infty}^{+\infty} e^{-x^2} f(x) \, dx \approx \sum_{i=1}^{n} w_i f(x_i). \tag{6.2}$$

We proved in Theorem 2.3 that the n-point Gauss-Hermite quadrature is exact for polynomials of degree up to 2n − 1. Moreover, in Theorem 2.5 we proved the equivalence of 1-point Gauss-Hermite quadrature and the Laplace approximation. Therefore, Gauss-Hermite quadrature can be considered an alternative version of the high-order Laplace approximation. Theoretically, Gauss-Hermite quadrature should be more accurate than the standard Laplace approximation. However, according to the simulation results in Chapter 4, Gauss-Hermite quadrature may not noticeably improve on the accuracy of the standard Laplace approximation, especially for simple models.
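The exactness property can be checked numerically. The sketch below hard-codes the standard 1-, 2- and 3-point Gauss-Hermite rules for the weight function e^{−x²}, with nodes and weights as found in standard quadrature tables; the names `GH_RULES` and `gauss_hermite` are introduced here for illustration only.

```python
import math

SQRT_PI = math.sqrt(math.pi)

# Nodes x_i and weights w_i for the weight function exp(-x^2).
GH_RULES = {
    1: ([0.0], [SQRT_PI]),
    2: ([-math.sqrt(0.5), math.sqrt(0.5)], [SQRT_PI / 2, SQRT_PI / 2]),
    3: (
        [-math.sqrt(1.5), 0.0, math.sqrt(1.5)],
        [SQRT_PI / 6, 2 * SQRT_PI / 3, SQRT_PI / 6],
    ),
}


def gauss_hermite(f, n):
    """Approximate the integral of exp(-x^2) * f(x) over the real line."""
    nodes, weights = GH_RULES[n]
    return sum(w * f(x) for x, w in zip(nodes, weights))


# The 3-point rule is exact up to degree 2*3 - 1 = 5 ...
print(gauss_hermite(lambda x: x ** 4, 3), 3 * SQRT_PI / 4)
# ... but not for degree 6, whose exact value is 15 * sqrt(pi) / 8.
print(gauss_hermite(lambda x: x ** 6, 3), 15 * SQRT_PI / 8)
```

For more quadrature points one would normally take the nodes and weights from a numerical library rather than typing them in by hand.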

Using Gauss-Hermite quadrature, we can estimate the variance components of the random effects (2.52). However, the conditions for applying Gauss-Hermite quadrature are restrictive: we have to assume that the random effects are normally distributed and that the number of random effects is small. In this thesis, we focus on the case of a single random effect. As the number of random effects increases, the approximation formula becomes massively complicated, with relatively poor accuracy [49]. Gauss-Hermite quadrature with more than one quadrature point is slower than the standard Laplace approximation.

EM algorithm

The EM algorithm is an iterative method that produces the MLE without explicitly computing the ordinary likelihood function. Starting from a pre-specified initial value ζ^(0), we maximize Q(ζ | ζ^(r)) as in (2.56) iteratively. We have shown that under regularity conditions [47], the EM algorithm converges to a stationary point. Moreover, if the likelihood is unimodal, the EM algorithm converges to the MLE.

However, the function Q can be intractable in general. In the Monte-Carlo EM algorithm, we apply Monte-Carlo integration [17] to approximate Q. The EM algorithm can also estimate the variance components of the random effects [24], using a similar idea.

Theoretically, the EM algorithm is the most accurate method that we have discussed so far. With the Laplace approximation and Gauss-Hermite quadrature, we have to worry about the coefficients of the error terms, as discussed at the end of Section 2.3; the EM algorithm, by contrast, can be as accurate as we want given a large enough number of Monte-Carlo iterations. Even better, the EM algorithm places no restriction on the distribution or the number of the random effects. However, the EM algorithm is very slow, which makes it impractical for a very large number of Monte-Carlo iterations or for complicated models.
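The claim that accuracy improves with the number of Monte-Carlo iterations can be illustrated with plain Monte-Carlo integration. This sketch is deliberately simpler than the Monte-Carlo EM algorithm: it averages over draws from the marginal N(0, σ²) distribution of a random effect, whereas the E-step needs draws from the conditional distribution of u given y. The function name, the seed and the value σ = 0.5 are illustrative assumptions.

```python
import math
import random


def mc_expectation(g, sigma, n_draws, seed=0):
    """Monte-Carlo estimate of E[g(u)] for u ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        total += g(rng.gauss(0.0, sigma))
    return total / n_draws


sigma = 0.5
exact = math.exp(sigma ** 2 / 2)   # E[exp(u)] for a normal random effect
for n_draws in (100, 10_000, 200_000):
    estimate = mc_expectation(math.exp, sigma, n_draws)
    print(n_draws, estimate, abs(estimate - exact))
```

The Monte-Carlo error typically shrinks only at the rate 1/sqrt(M) in the number of draws M, which is why high accuracy is computationally expensive.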

Conditional likelihood

Instead of approximating the ordinary likelihood function, we can define substitute likelihood functions that can be maximized more easily to produce the parameter estimates. In this thesis, we generalized the idea of conditional likelihood to Poisson mixed models. By maximizing the conditional likelihood, conditioned on the minimal sufficient statistics of the random effects (2.75):

$$\sum_{i=1}^{c} \sum_{j=1}^{n_i} \frac{\partial \log[f(y_{ij} \mid s_i)]}{\partial \beta} = 0, \tag{6.3}$$

we can obtain the conditional maximum likelihood estimator for the fixed effects. The most appealing feature of conditional maximum likelihood estimation is that we do not need to assume any distribution for the random effects. However, it is not generally available: we can only obtain the conditional maximum likelihood estimator in simple cases with a single additive random effect. The estimating equation depends on the conditional distribution of the response variables. The conditional maximum likelihood estimator is not efficient, due to the loss of information (2.73) caused by conditioning.
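To see why no distribution for the random effects is needed, consider the Poisson model with a single additive random intercept. The following is the standard argument, written under assumptions made here for illustration: s_i denotes the cluster total, x_ij the covariate vector, and the conditional mean is exp(x_ij'β + u_i). The derivation leading to (2.75) in the thesis may use different notation.

```latex
\begin{align*}
y_{ij} \mid u_i &\sim \text{Poisson}\big(\exp(x_{ij}^\top \beta + u_i)\big)
  \quad \text{independently for } j = 1, \dots, n_i, \\
s_i = \sum_{j=1}^{n_i} y_{ij} \,\Big|\, u_i
  &\sim \text{Poisson}\Big(e^{u_i} \sum_{k=1}^{n_i} \exp(x_{ik}^\top \beta)\Big), \\
(y_{i1}, \dots, y_{in_i}) \mid s_i
  &\sim \text{Multinomial}\big(s_i;\, p_{i1}, \dots, p_{in_i}\big),
  \qquad
  p_{ij} = \frac{\exp(x_{ij}^\top \beta)}{\sum_{k=1}^{n_i} \exp(x_{ik}^\top \beta)} .
\end{align*}
```

The factor e^{u_i} cancels in p_ij, so the conditional likelihood depends on β only. Any intercept, or any covariate that is constant within a cluster, cancels in the same way and cannot be estimated from the conditional likelihood.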

Quasi-likelihood

As another example of a likelihood substitute, following Foulley and Im [15], we introduced quasi-likelihood estimation for Poisson mixed models. We first compute the marginal expectation (2.97) and marginal variance (2.98) by integrating out the random effects. Then we can apply the general formula of quasi-likelihood estimation for GLMs [42]. By maximizing the quasi-likelihood function with respect to β (2.95):

$$\sum_{i=1}^{c} \sum_{j=1}^{n_i} \frac{y_{ij} - \mu_{ij}}{\operatorname{var}(y_{ij})} \frac{\partial \mu_{ij}}{\partial \beta} = 0, \tag{6.4}$$

we can obtain the maximum quasi-likelihood estimator for the fixed effects in a Poisson mixed model.
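For a normally distributed random intercept, the marginal moments that enter (6.4) have a closed form, which the sketch below evaluates. The formulas are the standard Poisson-lognormal moments and are stated here as an assumption: the thesis's (2.97) and (2.98) may be parameterized differently, and the function names are hypothetical.

```python
import math


def marginal_mean(eta, sigma):
    """E[y] when y | u ~ Poisson(exp(eta + u)) and u ~ N(0, sigma^2)."""
    return math.exp(eta + sigma ** 2 / 2)


def marginal_variance(eta, sigma):
    """var(y) = mu + mu^2 * (exp(sigma^2) - 1), with mu the marginal mean."""
    mu = marginal_mean(eta, sigma)
    return mu + mu ** 2 * (math.exp(sigma ** 2) - 1.0)


def quasi_score_term(y, eta, sigma, x):
    """One summand of (6.4) for a scalar covariate x, where eta = x * beta.

    Since mu = exp(x * beta + sigma^2 / 2), d mu / d beta = mu * x.
    """
    mu = marginal_mean(eta, sigma)
    return (y - mu) / marginal_variance(eta, sigma) * mu * x


print(marginal_mean(0.0, 0.0), marginal_variance(0.0, 0.0))
print(marginal_mean(1.0, 0.5), marginal_variance(1.0, 0.5))
```

With σ = 0 the moments reduce to the Poisson case, mean equal to variance; for σ > 0 the marginal variance exceeds the marginal mean, so the random intercept by itself induces marginal overdispersion.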

Neither the conditional likelihood nor the quasi-likelihood can estimate the variance components of the random effects. We should expect both of them to be fast. However, further work needs to be done for applications to other kinds of GLMMs. In general this can be hard: since the estimating equation depends on the conditional distribution of the response variables, we need to derive the estimating equation case by case for different conditional distributions.

Simulation study

According to the simulation results in Chapter 4, we can conclude that as the cluster size increases, the peak of the likelihood function tends to converge to the true parameter, giving better estimates. Moreover, if we increase the number of clusters, the likelihood function tends to become narrower, giving smaller standard errors. Therefore, if we are only interested in the value of β̂, we may prefer a large cluster size. If we are more interested in the variance components of the random effects, which is seldom the case, we may prefer a large number of clusters. If we intend to make inference on the estimates, we should want both the cluster size and the number of clusters to be large.

There are theoretical results indicating that, under regularity conditions [48], these parameter estimators are asymptotically normally distributed, with mean equal to the true parameter and variance determined by the Fisher information matrix. Lee and Nelder [25] proved this for the h-likelihood. Tierney and Kadane [40] proved asymptotic normality for the Laplace approximation, and Liu and Pierce [26] proved it for Gauss-Hermite quadrature. Anderson [1] proved the result for the conditional likelihood, and Bollerslev and Wooldridge [7] proved it for quasi-likelihood estimators.

6.2 Modeling overdispersion
