Model Selection, Checking and Interpretation

2.1.3 Model Selection, Checking and Interpretation

This subsection presents some suggested steps to follow when selecting a multilevel model. This process involves the selection of significant fixed effects and significant random effects. There are no rules of thumb for the selection of a multilevel model, but some authors (Bryk and Raudenbush, 1992; Snijders and Bosker, 1999; Hox, 2000) provided some suggestions.

A good starting point is an empty random intercept model. This allows the variability attributable to each level to be investigated and the intra-cluster correlation ρ to be estimated (Snijders and Bosker, 1999). The next step is to include the set of level one explanatory variables. These can include main effects and level one interaction terms; when selecting the significant interaction terms the hierarchical principle should be employed, i.e. the main effects involved in a significant interaction term should also be retained in the model. After a level one model is selected, two alternative steps can be followed: either the inclusion of level two variables or the inclusion of random slopes. Snijders and Bosker (1999) suggested that it is good practice to perform these steps separately and that the significant effects from each of the two steps can afterwards be tested together in the final model.

Snijders and Bosker (1999) recognized the difficulty in testing for significant random slopes. For this reason, it is advised that these should be tested only for covariates that show a strong fixed effect or for those that are substantively expected to vary between clusters. Random slopes should not be tested for level one interactions, and they should be tested one at a time. Care must be taken when testing for the inclusion of random slopes as their variances are likely to be close to zero. It should be kept in mind that the omission of an important random effect will affect the hypothesis tests for the fixed part of the model and that, because the estimation methods involve numerical iteration, the inclusion of many random slopes may lead to convergence problems. If a random slope is found to be significant it means that there are still unexplained cluster differences. Contextual variables can then be included in the model to try to explain more of the unexplained cluster variability. In addition, the inclusion of level two variables is highly advisable when the random intercept is thought to be correlated with some of the covariates. It is also advisable to include in the model cross-level interactions between the variable with the random slope and the level two variables when the random slope seems to be correlated with a level two variable (Snijders and Bosker, 1999).

Model selection is a dynamic process that should involve both theoretical and empirical considerations. Overall, Snijders and Bosker (1999) advised refraining from including non-significant effects. It is worth reinforcing that different selection procedures can result in different selected models. As in the classical linear regression case the objective of the multilevel model selection is still to find the most parsimonious model that best represents the relationship between outcome variables and explanatory variables. Different tests of hypotheses can be used to assist in the model selection. These tests are usually applicable in the comparison of nested models from the same sample of data, and are listed below.

The Wald test is the general single parameter test that can be employed to test whether a fixed effect, say $\beta_k$, is significantly different from zero or not. It tests the hypotheses:

$$H_0 : \beta_k = 0$$

$$H_1 : \beta_k \neq 0 .$$

Under the null hypothesis, the Wald test statistic is:

$$T_{Wald}(\hat{\beta}_k) = \left( \frac{\hat{\beta}_k}{S.E.(\hat{\beta}_k)} \right)^2 ,$$

where $S.E.(\hat{\beta}_k)$ is the standard error of the estimate of the fixed effect $\hat{\beta}_k$ being tested. For large samples and under the null hypothesis, $T_{Wald}(\hat{\beta}_k) \sim \chi^2_1$, and the test procedure leads to the rejection of $H_0$ if $T_{Wald}(\hat{\beta}_k) > \chi^2_{1,(1-\alpha)}$ for significance level $\alpha$.
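The single parameter Wald test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `wald_test` and the numerical values are hypothetical, and the p-value uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$ so that only the standard library is needed.

```python
import math

def wald_test(beta_hat, std_err):
    """Single-parameter Wald test of H0: beta_k = 0.

    Returns the statistic (beta_hat / std_err)^2 and its large-sample
    p-value from the chi-square distribution with 1 degree of freedom.
    """
    statistic = (beta_hat / std_err) ** 2
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical estimate and standard error, for illustration only.
statistic, p_value = wald_test(beta_hat=0.30, std_err=0.10)
# Comparing the p-value with alpha is equivalent to comparing the
# statistic with the chi-square critical value.
reject_h0 = p_value < 0.05
```

With these illustrative inputs the statistic is about 9 and the null hypothesis would be rejected at the 5% level.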

When multiple parameters need to be tested simultaneously, as in the case of a categorical covariate with several categories, the multivariate Wald test can be used. The hypotheses for the multivariate Wald test are:

$$H_0 : C\beta = 0$$

$$H_1 : C\beta \neq 0 ,$$

where $C$ is a matrix of linear combinations, or the contrast matrix. Each row of $C$ is a sequence of 0s and 1s, with a 1 in the position of the parameter being tested within the vector of regression parameters $\beta$. Rewriting $C\beta$ as $\beta^*$, representing a sub-vector of $\beta$, the hypotheses for the multivariate Wald test can now be written as:

$$H_0 : \beta^* = 0$$

$$H_1 : \beta^* \neq 0 .$$

The multivariate Wald test statistic is

$$T_{Wald}(\hat{\beta}^*) = \hat{\beta}^{*T} \hat{\Sigma}^{-1}_{\hat{\beta}^*} \hat{\beta}^* ,$$

where $\hat{\Sigma}_{\hat{\beta}^*}$ is the estimated covariance matrix of $\hat{\beta}^*$. Under the null hypothesis and for large samples the statistic follows a $\chi^2_{DF}$ distribution, where DF is the number of rows of $C$ and therefore the number of parameters being tested. The null hypothesis is rejected for large values of $T_{Wald}(\hat{\beta}^*)$, i.e. when $T_{Wald}(\hat{\beta}^*) > \chi^2_{DF,(1-\alpha)}$.
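As a rough illustration of the quadratic form above, the following sketch handles the special case of two parameters tested jointly (DF = 2), for instance the two dummy variables of a three-category covariate. The function name and all numbers are hypothetical; the 2x2 inverse is written out by hand and the p-value uses $P(\chi^2_2 > x) = e^{-x/2}$, so no external library is assumed.

```python
import math

def wald_test_two_params(beta_star, cov):
    """Multivariate Wald test for a pair of fixed effects (DF = 2).

    beta_star: [b1, b2], the estimates being tested.
    cov: 2x2 estimated covariance matrix of those estimates.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out to avoid external dependencies.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    b1, b2 = beta_star
    # Quadratic form beta*' Sigma^{-1} beta*.
    statistic = (b1 * (inv[0][0] * b1 + inv[0][1] * b2)
                 + b2 * (inv[1][0] * b1 + inv[1][1] * b2))
    # For 2 degrees of freedom, P(chi2_2 > x) = exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    return statistic, p_value

# Hypothetical estimates and covariance matrix, for illustration only.
statistic, p_value = wald_test_two_params(
    beta_star=[0.5, -0.2],
    cov=[[0.04, 0.01], [0.01, 0.09]],
)
```

For more than two parameters the same quadratic form applies, but a general matrix inverse and chi-square tail function would be needed.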

It is worth mentioning that the multivariate Wald test is applicable to test fixed effects only. For testing multiple parameters including random effects an alternative is to use the likelihood ratio test (LRT). The LRT compares the log-likelihoods ($l$) of two nested models, a reduced model $M_{red}$ and a model with the parameters being tested, $M_{full}$. The hypotheses being tested for the LRT are:

$$H_0 : M_{red}$$

$$H_1 : M_{full} .$$

The test statistic of the LRT is

$$L^2 = -2 \times (l_{red} - l_{full}) , \qquad (2.10)$$

which under the null hypothesis follows a $\chi^2_{DF}$ distribution, where DF is determined by the difference between the number of parameters in the full and in the reduced model. The reduced model is rejected for large values of the likelihood-ratio test statistic $L^2$ (i.e. $L^2 > \chi^2_{DF,(1-\alpha)}$).
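The LRT of equation 2.10 can be sketched as follows. The helper `chi2_sf`, the function names and the log-likelihood values are all assumptions made for illustration; the tail probability is computed for integer degrees of freedom from the closed forms for 1 and 2 degrees of freedom plus a standard recurrence for the incomplete gamma function, rather than with a statistics library.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(chi2_df > x) for a positive integer df."""
    if x <= 0:
        return 1.0
    # Closed forms for df = 1 and df = 2, then the recurrence
    # Q(k/2, z) = Q(k/2 - 1, z) + z**(k/2 - 1) * exp(-z) / Gamma(k/2).
    z = x / 2.0
    if df % 2 == 1:
        k, tail = 1, math.erfc(math.sqrt(z))
    else:
        k, tail = 2, math.exp(-z)
    while k < df:
        k += 2
        tail += z ** (k / 2.0 - 1.0) * math.exp(-z) / math.gamma(k / 2.0)
    return tail

def likelihood_ratio_test(loglik_red, loglik_full, n_par_red, n_par_full):
    """LRT statistic of equation 2.10 with its chi-square p-value."""
    statistic = -2.0 * (loglik_red - loglik_full)
    df = n_par_full - n_par_red
    return statistic, df, chi2_sf(statistic, df)

# Hypothetical log-likelihoods from two nested models fitted to the same data.
statistic, df, p_value = likelihood_ratio_test(-1520.4, -1512.1, 5, 8)
```

Here the reduced model has 5 parameters and the full model 8, so the statistic is referred to a chi-square distribution with 3 degrees of freedom.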

For the random part, however, the applicability of the LRT is questionable. This is because the LRT tests whether the variances equal zero, which is a value on the boundary of the parameter space $[0, \infty)$ for the variances. Therefore, this test will tend to accept the null hypothesis “more often than it should”, as stated in Frees (2004, chapter 5). Snijders and Bosker (1999) still advocate the use of the LRT to test random effects, bearing in mind that the test is one-sided. For example, the hypotheses for testing the variance of the random intercepts are:

$$H_0 : \sigma^2_{u0} = 0$$

$$H_1 : \sigma^2_{u0} > 0 .$$

Therefore, the test can still be used, with the p-value calculated from the $\chi^2_1$ distribution divided by two. This is because the null distribution of the test statistic is no longer $\chi^2_1$ but a mixture of a point mass at 0 and a $\chi^2_1$ distribution (Snijders and Bosker, 1999). Care must be taken, however, in applying the LRT to models estimated via REML. As mentioned before, REML is used to estimate the random part of the model, and the deviance from REML estimation describes only the random part of the model (Singer and Willett, 2003). If the reduced model and the full model were both estimated via REML and their fixed parts are exactly the same, the LRT can be applied to test the extra random effects in the full model. Once again, the reduced model must be nested within the full model.
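The halving of the p-value for a single variance component can be illustrated with a short sketch. The function name and the log-likelihoods are hypothetical, and the two models are assumed to have been fitted by the same estimation method with identical fixed parts, as required above.

```python
import math

def variance_lrt_p_value(loglik_red, loglik_full):
    """One-sided LRT p-value for a single variance component.

    H0: sigma^2_u0 = 0 against H1: sigma^2_u0 > 0. The chi-square (1 df)
    p-value is divided by two because the null value lies on the boundary.
    """
    # Guard against tiny negative values caused by numerical error.
    statistic = max(0.0, -2.0 * (loglik_red - loglik_full))
    naive_p = math.erfc(math.sqrt(statistic / 2.0))  # P(chi2_1 > statistic)
    return statistic, naive_p / 2.0

# Hypothetical log-likelihoods: single-level model vs. random intercept model.
statistic, p_value = variance_lrt_p_value(loglik_red=-2301.7, loglik_full=-2299.9)
```

In this illustration the unadjusted p-value is just above 0.05 while the halved p-value is below it, which shows how ignoring the boundary problem makes the test conservative.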

If two non-nested models need to be compared, two alternative goodness-of-fit criteria can be used instead of the LRT. Both criteria add a penalty for the number of parameters to the deviance of the fitted model, and differ only by a scale factor:

$$IC = L^2 + 2 \times (\text{scale factor}) \times (\text{number of parameters in the model}) .$$

The Akaike information criterion (AIC) has a scale factor equal to one, and the Bayesian information criterion (BIC) has a scale factor equal to half of the log of the sample size (Singer and Willett, 2003), so:

$$AIC = L^2 + 2\varphi$$

and

$$BIC = L^2 + \log(m)\varphi ,$$

where $L^2$ here denotes the deviance of the fitted model, i.e. $-2l$ (the quantity that is differenced in equation 2.10), $\varphi$ is the total number of parameters in the model (for both fixed and random parts) and $m$ is the total sample size. Singer and Willett (2003) commented on the ambiguity of using BIC for the case of a longitudinal multilevel model, as it is not clear whether the sample size $m$ should be the number of individuals in the data or the effective sample size that accounts for the repeated observations within individuals. It is worth mentioning that the models compared using AIC and BIC need not be nested but they should be fitted to the same sample, and the model with the smaller value of AIC or BIC is preferred.
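A minimal sketch of the two criteria is given below, assuming that the deviance is taken as minus twice the log-likelihood of each fitted model. The function name, the log-likelihoods, the parameter counts and the sample size are hypothetical values chosen for illustration.

```python
import math

def information_criteria(loglik, n_params, sample_size):
    """AIC and BIC from a model's log-likelihood.

    Uses deviance = -2 * loglik, a scale factor of 1 for AIC and
    log(sample_size) / 2 for BIC.
    """
    deviance = -2.0 * loglik
    aic = deviance + 2.0 * n_params
    bic = deviance + math.log(sample_size) * n_params
    return aic, bic

# Hypothetical fits of two non-nested models to the same sample of 1200 units.
aic_a, bic_a = information_criteria(loglik=-1512.1, n_params=8, sample_size=1200)
aic_b, bic_b = information_criteria(loglik=-1515.0, n_params=6, sample_size=1200)
preferred_by_aic = "A" if aic_a < aic_b else "B"
preferred_by_bic = "A" if bic_a < bic_b else "B"
```

With these illustrative numbers AIC favours the larger model A while BIC, which penalises parameters more heavily for this sample size, favours model B, so the two criteria need not agree.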

Model selection in the multilevel modelling framework also involves model checking. Section 2.1.1 presented the assumptions which the multilevel models are based upon. As for the classical regression models, the assumptions made for the multilevel model need to be checked after the fitting of the model. The failure of these assumptions compromises the interpretation of the estimated parameters. In addition, the conclusions regarding the relationship between the outcome and the covariates can be misleading. When the assumptions are not valid the hypothesis tests are invalid as well (Snijders and Bosker, 1999).

The model checking process of a multilevel model is very similar to that for the classical linear regression. The difference is that, because of the multiple levels and the multiple residual terms, each error component requires checking. A general graphical inspection is usually performed to assess the assumptions of linearity, homoscedasticity and normality. The linearity assumption is made for the relationship between outcome and explanatory variables. This assumption can be checked by directly plotting the outcome against the explanatory variables. For departures from this assumption additional terms for the explanatory variables, such as squared or cubic terms, can be included in the model. As stated before, the residuals are assumed to be normally distributed. This assumption can be checked by inspecting a normal probability plot. Last, but not least, the assumption of constant variance of the level one raw residuals must be checked. This assumption can be checked, for example, by plotting the residuals against the fitted values. Note that, when random slopes are added to the model, the composite residuals are no longer assumed to have constant variance.

The residuals $u_j$, or the cluster specific effects, are random variables rather than parameters of the multilevel model (Snijders and Bosker, 1999). For a random slope model, such as the model in equation 2.6, each cluster has its own predicted line. If only the fixed part is considered these lines are all the same. The cluster specific effects need to be considered so that each cluster has its own fitted line. These cluster specific effects need to be predicted (Frees, 2004; Skrondal and Rabe-Hesketh, 2004; Longford, 1993) from the model in order to check their assumptions. This prediction is usually performed through Empirical Bayes (EB) (Efron and Morris, 1975) estimation. This method combines information from the cluster of interest with the other clusters, accounting for the cluster size and the covariance matrix of the observations (Snijders and Bosker, 1999). If only the random intercept is considered, the EB estimate for the random intercepts is given by

$$\hat{u}_{0j} = \frac{n_j \hat{\sigma}^2_{u0}}{n_j \hat{\sigma}^2_{u0} + \hat{\sigma}^2_e} \, \tilde{y}_j = Sh \times \tilde{y}_j ,$$

where $\tilde{y}_j$ is the cluster mean of the raw residuals $y_{ij} - x_{ij}^T \hat{\beta}$. The EB residuals are also called the shrinkage estimates because of the shrinkage factor

$$Sh = \frac{n_j \hat{\sigma}^2_{u0}}{n_j \hat{\sigma}^2_{u0} + \hat{\sigma}^2_e} .$$

This factor pushes the mean of the raw residuals for cluster j towards the general mean. In other words, the EB residuals bring the estimates of the random intercepts and random slopes closer to the mean. As the cluster size increases Sh approaches one, and the EB residuals will be approximately the same as the mean of the raw residuals. The cluster specific effects can also be compared by means of a caterpillar plot (Goldstein, 2003). This plots, for each cluster, the predicted random effect with its confidence interval, with clusters ordered by the magnitude of the predicted effect. The comparison is performed by assessing which confidence intervals do or do not overlap with the others. This type of graph can also assist in grouping clusters according to their performance.
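The shrinkage behaviour described above can be sketched for the random intercept case. The function names, the variance estimates and the raw residuals are hypothetical, and the variance components are treated as already estimated.

```python
def shrinkage_factor(n_j, var_u0, var_e):
    """Shrinkage factor Sh = n_j * var_u0 / (n_j * var_u0 + var_e)."""
    return n_j * var_u0 / (n_j * var_u0 + var_e)

def eb_random_intercept(raw_residuals, var_u0, var_e):
    """Empirical Bayes prediction of u_0j for one cluster.

    raw_residuals: y_ij - x_ij' beta_hat for the units in cluster j.
    """
    n_j = len(raw_residuals)
    mean_residual = sum(raw_residuals) / n_j
    return shrinkage_factor(n_j, var_u0, var_e) * mean_residual

# Hypothetical variance estimates and raw residuals for a cluster of size 4.
var_u0, var_e = 0.5, 2.0
u0_hat = eb_random_intercept([1.2, 0.8, 1.0, 1.0], var_u0, var_e)

# The factor approaches one as the cluster size grows.
factors = [shrinkage_factor(n, var_u0, var_e) for n in (2, 20, 200)]
```

In this illustration a cluster of four units has its mean raw residual halved, whereas a cluster of 200 units would keep about 98% of it.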

After the model has been selected and the assumptions checked the interpretation of the parameter estimates can proceed. In the analysis of multilevel linear models the interpretation of the fixed effects is the same as for the standard linear regression analysis. In other words, it can be said that for a unit increase in $x_k$, $y$ would have an expected change of $\hat{\beta}_k$, keeping all other variables constant. If the model includes squared terms of some of the explanatory variables or interaction terms between any of them, these effects should be interpreted together. If a categorical variable is also included in the model, the interpretation of its effect compares the effect of each of the categories with the omitted category, the baseline.

Models presented in the subsequent chapters, however, are fitted to a log-transformed outcome variable. In this case the interpretation of the parameters differs from that given above. Instead of an expected change of $\hat{\beta}_k$ in the outcome for a unit change in $x_k$, the expected change in $y$ is an $e^{\hat{\beta}_k}$-fold increase or decrease, depending on the sign of $\hat{\beta}_k$. In other words, there will be an expected

$$b_k\% = 100 \times (e^{\hat{\beta}_k} - 1)\% \qquad (2.11)$$

change in $y$ for a unit increase in $x_k$ (Tufte, 1974). However, when $x_k$ also enters the model with a log-transformation this is no longer the interpretation of $\hat{\beta}_k$. In this case $\hat{\beta}_k$ represents the elasticity of the outcome with respect to $x_k$. Dougherty (2002) defined elasticity as the proportional change in $y$ for a given proportional increase in $x_k$. For cluster level variables the interpretation follows that for the level one variables. In addition, the coefficients of the contextual variables that represent proportions can be multiplied by any constant $a$, and the formula in equation 2.11 can be modified to

$$b_k\% = 100 \times (e^{\hat{\beta}_k \times a} - 1)\% , \qquad (2.12)$$

where, for example, $a$ can be equal to 0.1. This gives the interpretation that there will be an expected $b_k\%$ change in $y$ for a 10 percentage point increase in contextual variable $k$.
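Equations 2.11 and 2.12 amount to a one-line calculation, sketched here with hypothetical coefficient values. The function name `percent_change` is an assumption made for illustration.

```python
import math

def percent_change(beta_hat, a=1.0):
    """Expected percentage change in y for an increase of `a` in x_k
    when the outcome is log-transformed (equations 2.11 and 2.12)."""
    return 100.0 * (math.exp(beta_hat * a) - 1.0)

# Hypothetical coefficients, for illustration only.
unit_effect = percent_change(0.05)             # unit increase in x_k
context_effect = percent_change(-0.40, a=0.1)  # 10 percentage point increase
```

With these values a coefficient of 0.05 corresponds to an increase of roughly 5.1% in $y$, and a coefficient of -0.40 on a proportion corresponds to a decrease of roughly 3.9% for a 10 percentage point increase.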

The random part of the model can also be interpreted. To assist in the interpretation of the random part of the model, a plot with the cluster specific regression lines can be constructed by using the EB residuals. The presence of the random intercepts in the model means that each cluster has its own intercept $u_j$ that varies randomly across clusters. Therefore, a graphical representation of this model would show parallel regression lines, one for each cluster, showing how the outcome cluster mean varies randomly across clusters. If the number of clusters is relatively large, this plot can be constructed for a subset of clusters. The same plot for the random coefficients model will show non-parallel cluster specific lines, as each line will also depend on the values of $z_{ij}$.

Interpretation can also be given for the variance components and the intra-cluster correlation can be calculated. The covariance term between the random intercepts and random slope, $\sigma_{u01}$, can also be interpreted. This parameter shows the relationship between the random slope and intercept, and it can be used to assess for example whether clusters with above average intercept have above average or below average slopes.
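As a small illustration of these quantities, the sketch below computes the intra-cluster correlation and the correlation implied by the intercept-slope covariance. It assumes the usual random intercept definition of the intra-cluster correlation, $\sigma^2_{u0} / (\sigma^2_{u0} + \sigma^2_e)$, which is not restated in this subsection; the function names and all variance component values are hypothetical.

```python
import math

def intra_cluster_correlation(var_u0, var_e):
    """ICC for a random intercept model: var_u0 / (var_u0 + var_e)."""
    return var_u0 / (var_u0 + var_e)

def intercept_slope_correlation(cov_u01, var_u0, var_u1):
    """Correlation implied by the covariance between intercepts and slopes."""
    return cov_u01 / math.sqrt(var_u0 * var_u1)

# Hypothetical variance component estimates.
icc = intra_cluster_correlation(var_u0=0.5, var_e=2.0)
corr = intercept_slope_correlation(cov_u01=-0.06, var_u0=0.5, var_u1=0.02)
```

A negative correlation such as the one in this illustration would suggest that clusters with above average intercepts tend to have below average slopes.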

One very important point raised by all the authors cited so far is the need to centre the covariates around their means in order to improve the interpretation of the random effects. Centring is highly advisable for those explanatory variables where the value 0 (zero) has no substantive meaning, and it is good practice to centre or re-scale these variables. In a multilevel model (or in a longitudinal multilevel model) Singer and Willett (2003) discussed whether the centring should be around either the total mean or the group mean and advised the use of group centring only if it can be justified substantively.

In document Methods for analysing complex panel data using multilevel models with an application to the Brazilian labour force survey (Page 32-39)