2.2.1 “Rough” strategies
2.4 Model estimation
Σ 0 . . . 0 0 Σ . . . 0 ... ... ... ... 0 0 . . . Σ
.
2.4 Model estimation
Inference for the linear multilevel model is conducted on the effects, both fixed and random, and the variance components. It can be based either on least squares approach (Goldstein, 2010), maximum likelihood meth-ods (Searle et al., 1992; Laird and Ware, 1982; Pinheiro and Bates, 2000;
Raudenbush and Bryk, 1986), or on Bayesian methodology (Seltzer et al., 1986).
2.4.1 Maximum likelihood estimation
Maximum likelihood is the most used estimation method in multilevel mod-elling. It produces, in fact, estimators that are asymptotically efficient, consistent and, for large sample size, robust against violations of the non-normality assumption for the errors.
Two different likelihood functions can be optimized, each corresponding to a specific approach: the Full Maximum Likelihood (FML) and the Restricted Maximum Likelihood (REML). The substantial difference between them is that, in the latter method, the likelihood function does not include the
re-26
gression coefficients.
Full maximum likelihood
Call θ the vector of variance components. Under the assumptions of nor-mality and independence for uj and j, the group-response vectors, yj, are independent and normally distributed with mean Xjβ and covariance matrix Ωj = ZjΣZTj + σ2Inj. Therefore, the full likelihood function associated with the response vector y of the model (2.6) is:
L(β, θ; y) =
J
Y
j=1
f (yj; β, θ)
=
J
Y
j=1
|Ωj|−1/2exp{−12(yj − Xjβ)TΩ−1j (yj− Xjβ)}
(2π)nj2
.
(2.8)
Maximum likelihood estimates for the fixed effects and the variance compo-nents are the values maximizing the likelihood function (2.8), or equivalently, the logarithm of the likelihood:
`(β, θ; y) = ln L(β, θ; y)
= −n
2ln(2π) − 1 2
J
X
j=1
ln |Ωj| − 1 2
J
X
j=1
(yj − Xjβ)TΩ−1j (yj − Xjβ)
(2.9) The derivatives of the log-likelihood with respect to the parameters are as follows:
∂`(β, θ; y)
∂β =
J
X
j=1
XTjΩ−1j (yj− Xjβ)
(2.10)
∂`(β, θ; y)
∂θ = −1
2
J
X
j=1
tr
Ω−1j ∂Ωj
∂θ
+ 1 2
J
X
j=1
(yj − Xjβ)TΩ−1j ∂Ωj
∂θ Ω−1j (yj− Xjβ)
,
(2.11)
obtained by using the following properties of matrices derivatives (Magnus and Neudecker, 2007):
(i) ∂ ln |A|∂A = tr(A−1)
(ii) ∂(xT∂AA−1x) = −A−1(xxT)A−1, where A is a square matrix.
Maximizing the log-likelihood function (2.9) consists in equating the partial derivatives (2.10) and (2.11) to zero and solving the resulting equations for β and θ. However, the equation for θ is nonlinear in the variance vector, so that it is not possible to achieve closed form expressions for θ. Therefore, solutions have to be obtained through iterative algorithms.
One approach to obtain iteratively maximum likelihood estimates consists in computing a GLS estimate for the fixed-effect vector β, that is
β = (Xˆ TΩ−1X)−1XTΩ−1y. (2.12)
This is the maximum likelihood estimator of the regression coefficient vector, that is the same obtained by equating to zero the score function with respect to β (2.10). Then, the variance components are obtained by optimizing the profiled likelihood, that is the likelihood function achieved by substituting the estimate of β from the previous step. This optimization usually is accom-plished through the Expectation-Maximization algorithm (Dempster et al., 1981; Laird et al., 1987). This is an iterative procedure to compute maximum likelihood estimates in the case of incomplete-data problem. Instead of max-imizing the marginal likelihood associate to the data vector y, Eq. 2.8, the EM algorithm maximizes the joint likelihood of the complete dataset, (y, u), that is that composed by the observed data and the unobserved (or missing) data. Starting with some initial values for the parameters, the EM algorithm consists of iterating two steps: the Expectation step and the Maximization step.
1. In the E-STEP, the current vector of estimated parameters, ˆθ(h), is used to evaluate the distribution of the unobserved data conditional
28
to the observed data, u|y. Hence, the expected value of the joint log-likelihood, conditional to the data vector y, is obtained.
2. The M-STEP consists in computing a new value of ˆθ(h+1) by maxi-mizing the expected value of the likelihood function evaluated in the E-step.
E-STEP and M-STEP are iterated until convergence. A more formal expla-nation of the EM algorithm is provided in section 4.2.
Restricted maximum likelihood
The problem with the maximum likelihood estimates of the variance compo-nents is that the procedure does not take into account the degree of freedom lost in estimating β. That is, in the variance estimators, the vector of true parameters β is replaced by its estimator ˆβ. However, in this way, the un-certainty about the regression coefficients is not taken into account. As a result, the variance components are estimated with bias in a downward di-rection (Harville, 1977; Patterson and Thompson, 1971).2
In order to produce unbiased variance estimates, Harville (1977) and Pat-terson and Thompson (1971) proposed a variant of the FML method, the Restricted maximum likelihood.
The REML method obtains the estimate of the variance components by op-timizing the likelihood of θ not corresponding to y but to a complete set of error contrasts for the responses, that is on a set of n − k linearly indepen-dent error contrasts, aTy, where k is the number of fixed-effects, such that the function does not contain any fixed effects.3 More specifically, the vector a is chosen such that aTXβ = 0, and, therefore, aTy = aTXβ + aTZu does not contain any terms in β.
2 Raudenbush and Bryk (2002) found out that the maximum likelihood estimator of the between-group variance in a random-intercept model has approximately a bias factor of (J − k)/J , where J is the number of groups and k is the number of fixed effects.
3It is easy to show that the log-likelihoods corresponding to different sets of error contrasts contain the same information and, as a consequence, the same maxima in that they differ only by a constant (Longford, 1993).
In practice, the difference between FML and REML estimates becomes negligible as the number of groups increases. Moreover, the Restricted likeli-hood function cannot be used to construct a chi-square test to compare two nested models, and it is computationally less easy than the FML.
The REML theory can rely also on Bayesian arguments.
2.4.2 Iterative GLS
The most common method among those based on least squares is the Iterative Generalized Least Squares (Goldstein, 2010, 1986). The GLS estimate for β is given by:
β = (Xˆ TΩ−1X)−1XTΩ−1y
that is the best linear unbiased estimator (BLUE). However, since the variance-covariance matrix, Ω, is typically unknown, this formula is not computable.
For this reason, the GLS procedure is conducted iteratively. Starting from some initial values for the fixed parameters (generally the OLS estimates), the “raw” residuals are constructed as:
= y − X ˆβ.
By noting that
E(T) = Var(y) = Ω,
the procedure applies the GLS method on the following model:
vec(T) = f (θ) + r
where vec(eeT) constitutes the dependent variable, r is the residual vector, and f (θ) is a vector of functions containing coefficients to be estimated, that are the variance components, and, as covariates, vectors of 0 and 1 depending on the structure of Ω. Therefore, the variance components in θ, estimated through GLS, are then used to obtain new estimates of the fixed effects. The procedure goes on until convergence.
Under the assumption of normality, the IGLS procedure produces maximum 30
likelihood estimates. Moreover, since the IGLS estimates of variance com-ponents are biased, as the ML estimates, a Restricted version exists, the Restricted Iterative Generalized Least Squares (RIGLS).