
The Multilevel Generalized Linear Model for Dichotomous Data and Proportions

6.4 EXAMPLE: ANALYZING PROPORTIONS

The second example uses data from a meta-analysis of studies that compared face-to-face, telephone, and mail surveys on various indicators of data quality (de Leeuw, 1992; for a more thorough analysis see Hox & de Leeuw, 1994). One of these indicators is the response rate: the number of completed interviews divided by the total number of eligible sample units. Overall, the response rates differ between the three data collection methods. In addition, the response rates also differ across studies. This makes it interesting to analyze what study characteristics account for these differences.


The data of this meta-analysis have a multilevel structure. The lowest level is the ‘condition level’, and the higher level is the ‘study level’. There are three variables at the condition level: the proportion of completed interviews in that specific condition, the number of potential respondents who are approached in that condition, and a categorical variable indicating the data collection method used. The categorical data collection variable has three categories: ‘face-to-face’, ‘telephone’, and ‘mail’ survey. To use it in the regression equation, it is recoded into two dummy variables: a ‘telephone dummy’ and a ‘mail dummy’. In the mail survey condition, the mail dummy equals one, and in the other two conditions it equals zero. In the telephone survey condition, the telephone dummy equals one, and in the other two conditions it equals zero. The face-to-face survey condition is the reference category, indicated by a zero for both the telephone and the mail dummy. There are three variables at the study level: the year of publication (0 = 1947, the oldest study), the saliency of the questionnaire topic (0 = not salient, 2 = highly salient), and the way the response has been calculated. If the response is calculated by dividing the number of completed interviews by the total sample size, we have the completion rate. If it is calculated by dividing by the sample size corrected for sampling frame errors, we have the response rate. Most studies compared only two of the three data collection methods; a few compared all three. Omitting missing values, there are 47 studies, in which a total of 105 data collection methods are compared. The data set is described in Appendix A.
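The dummy coding described above can be sketched in a few lines of Python. This is only an illustration: the function name `dummy_code` and the category labels are assumptions made here, not names from the data file in Appendix A.

```python
# Hypothetical helper: dummy-code the data collection method,
# with face-to-face as the reference category (both dummies zero).
METHODS = ("face-to-face", "telephone", "mail")


def dummy_code(method):
    """Return the pair (telephone dummy, mail dummy) for one condition."""
    if method not in METHODS:
        raise ValueError(f"unknown data collection method: {method!r}")
    tel = 1 if method == "telephone" else 0
    mail = 1 if method == "mail" else 0
    return tel, mail


for m in METHODS:
    tel, mail = dummy_code(m)
    print(f"{m:13s} tel={tel} mail={mail}")
```

With this coding, the regression coefficients of the two dummies are contrasts of the telephone and mail conditions against the face-to-face condition.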

The dependent variable is the response. This variable is a proportion: the number of completed interviews divided by the number of potential respondents. If we had the original data sets at the individual respondents’ level, we would analyze them as dichotomous data, using full maximum likelihood analysis with numerical integration.

However, the studies in the meta-analysis report the aggregated results, and we have only proportions to work with. If we were to model these proportions directly by normal regression methods, we would encounter two critical problems. The first problem is that proportions do not have a normal distribution, but a binomial distribution, which (especially with extreme proportions and/or small samples) invalidates several assumptions of the normal regression method. The second problem is that a normal regression equation might easily predict values larger than 1 or smaller than 0 for the response, which are impossible values for proportions. Using the generalized linear (regression) model for the proportion p of potential respondents that are responding to a survey solves both problems, which makes it an appropriate model for these data.

The hierarchical generalized linear model for our response data can be described as follows. In each condition i of study j we have a number of individuals who may or may not respond. Each condition i of study j is viewed as a draw from a specific binomial distribution. So, for each individual r in each condition i of study j the probability of responding is the same, and the proportion of respondents in condition i of study j is πij. Note that we could have a model where each individual’s probability of responding varies, with individual-level covariates to model this variation. Then, we would model this as a three-level model, with binary outcomes at the lowest (individual) level. Since in this meta-analysis example we do not have access to the individual data, the lowest level is the condition level, with conditions (data collection methods) nested within studies.

Let p_ij be the observed proportion of respondents in condition i of study j. At the lowest level, we use a linear regression equation to predict logit (πij). The simplest model, corresponding to the intercept-only model in ordinary multilevel regression analysis, is given by:

πij= logistic (β0j), (6.6)

which is sometimes written as:

logit (πij)= β0j. (6.7)

Equation 6.7 is a bit misleading because it suggests that we are using an empirical logit transformation on the proportions, which is precisely what the generalized linear model avoids doing, so equation 6.6 is a better representation of our model.

Note again that the usual lowest level error term e_ij is not included in equation 6.6. In the binomial distribution the variance of the observed proportion depends only on the population proportion πij. As a consequence, in the model described by equation 6.6, the lowest-level variance is determined completely by the estimated value for πij, and therefore it does not enter the model as a separate term. In most current software, the variance of π is modeled by:

VAR(πij)= σ² (πij (1− πij)) / n_ij. (6.8)

In equation 6.8, σ² is not a variance, but a scale factor used to model under- or overdispersion. Choosing the binomial distribution fixes σ² to a default value of 1.00. This means that the binomial model is assumed to hold precisely, and the value 1.00 reported for σ² need not be interpreted. Given the specification of the variance in equation 6.8, we have the option to estimate the scale factor σ², which allows us to model under- or overdispersion.
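As a small illustration of equation 6.8, the sketch below computes the variance of a proportion for a given πij, n_ij, and scale factor. The function name `proportion_variance` and the input values are assumptions chosen for illustration; they are not taken from the meta-analysis data.

```python
def proportion_variance(pi, n, scale=1.0):
    """Variance of a proportion as in equation 6.8: scale * pi * (1 - pi) / n.

    scale = 1.0 corresponds to the pure binomial model; scale > 1 indicates
    overdispersion and scale < 1 underdispersion.
    """
    if not 0.0 <= pi <= 1.0:
        raise ValueError("pi must lie between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return scale * pi * (1.0 - pi) / n


# Illustrative values only: the variance shrinks toward the extremes of pi
# and grows in proportion to the scale factor.
for pi in (0.5, 0.8, 0.95):
    binomial = proportion_variance(pi, n=200)
    inflated = proportion_variance(pi, n=200, scale=2.0)
    print(f"pi={pi:.2f}  binomial={binomial:.6f}  scale 2.0={inflated:.6f}")
```

Once πij and n_ij are given, the only free quantity in the variance is the scale factor, which is why no separate lowest-level error term appears in equation 6.6.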

The model in equation 6.6 can be extended with an explanatory variable X_ij at the condition level (e.g., a variable describing the condition as a mail or as a face-to-face survey):

πij= logistic (β0j+ β1jX_ij). (6.9)


The regression coefficients β are assumed to vary across studies, and this variation is modeled by the study-level variable Z_j in the usual second-level regression equations:

β0j= γ00+ γ01 Z_j+ u0j (6.10)

β1j= γ10+ γ11 Z_j+ u1j (6.11)

Substituting 6.10 and 6.11 into 6.9, we get the multilevel model:

πij= logistic (γ00+ γ10X_ij+ γ01Z_j+ γ11X_ij Z_j+ u0j+ u1jX_ij). (6.12)

Again, the interpretation of the regression parameters in 6.12 is not in terms of the response proportions we want to analyze, but in terms of the underlying variate defined by the logit transformation logit(p) = ln (p/(1 − p)). The logit link function transforms the proportions (between .00 and 1.00 by definition) into values on a logit scale ranging from −∞ to +∞. The logit link is nonlinear, and in effect it assumes that it becomes more difficult to produce a change in the outcome variable (the proportion) near the extremes of .00 and 1.00, as is illustrated in Figure 6.1. For a quick examination of the analysis results, we can simply inspect the regression parameters calculated by the program. To understand the implications of the regression coefficients for the proportions we are modeling, the predicted logit values must be transformed back to the proportion scale.
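The nonlinearity of the logit link can be made concrete with a short Python sketch. The starting values on the logit scale are arbitrary choices for illustration; the point is only that the same one-unit step on the logit scale produces a smaller change in the proportion the closer we get to 1.00.

```python
import math


def logit(p):
    """Logit transformation: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))


def logistic(x):
    """Inverse of the logit: e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


# Take the same step of 1.0 on the logit scale from different starting points
# and record the resulting change on the proportion scale.
changes = []
for start in (0.0, 2.0, 4.0):
    p_before = logistic(start)
    p_after = logistic(start + 1.0)
    changes.append(p_after - p_before)
    print(f"logit {start:.1f} -> {start + 1.0:.1f}: "
          f"proportion {p_before:.3f} -> {p_after:.3f}")
```

By symmetry of the logistic function, the same shrinking of the steps occurs when moving toward .00.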

In our meta-analysis, we analyze survey response rates when available, and if these are not available, the completion rate is used. Therefore the appropriate null model for our example data is not the ‘intercept-only’ model, but a model with a dummy variable indicating whether the response proportion is a response rate (1) or a completion rate (0). The lowest-level regression model is therefore:

πij= logistic (β0j+ β1j resptype), (6.13)

where the random intercept coefficient β0j is modeled by:

β0j= γ00+ u0j (6.14)

and the slope for the variable resptype by:

β1j= γ10, (6.15)

which leads by substitution to:

πij= logistic (γ00+ γ10 resptype+ u0j). (6.16)


Since the accurate estimation of the variance terms is an important goal in meta-analysis, the estimation method uses the restricted maximum likelihood method with second order PQL approximation. For the purpose of comparison, Table 6.4 presents for the model given by equation 6.16 the first order MQL parameter estimates in addition to the preferred second order PQL approximation. Estimates were made using MLwiN.

The MQL 1 method estimates the expected response rate (RR) as (0.45 + 0.68 =) 1.13, and the PQL 2 method as (0.59 + 0.71 =) 1.30. As noted before, this refers to the underlying distribution established by the logit link function, and not to the proportions themselves. To determine the expected proportion, we must use the inverse transformation, the logistic function, given by g(x) = e^x/(1 + e^x). Using this transformation, we find an expected response rate of 0.79 for PQL 2 estimation and 0.76 for MQL 1 estimation. This is not exactly equal to the value of 0.78 that we obtain when we calculate the mean of the response rates, weighted by sample size. However, this is as it should be, for we are using a nonlinear link function, and the value of the intercept refers to the intercept of the underlying variate. Transforming that value back to a proportion is not the same as computing the intercept for the proportions themselves. Nevertheless, the difference is usually rather small when the proportions are not very close to 1 or 0.
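The back-transformation of the Table 6.4 estimates can be checked with a few lines of Python. The sketch uses only the rounded coefficients printed in the table; the variable names are chosen here for illustration.

```python
import math


def logistic(x):
    """Inverse logit: g(x) = e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


# (intercept, slope for resptype) on the logit scale, from Table 6.4.
estimates = {"MQL 1": (0.45, 0.68), "PQL 2": (0.59, 0.71)}

expected_rr = {}
for method, (intercept, resptype_slope) in estimates.items():
    logit_rr = intercept + resptype_slope  # resptype = 1: response rate
    expected_rr[method] = logistic(logit_rr)
    print(f"{method}: logit = {logit_rr:.2f}, "
          f"expected response rate = {expected_rr[method]:.2f}")
```

Rounded to two decimals, this gives the values of 0.76 (MQL 1) and 0.79 (PQL 2) reported in the text.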

Table 6.4 Null model for response rates

                    MQL 1            PQL 2
Fixed part          Coeff. (s.e.)    Coeff. (s.e.)
Intercept           0.45 (.16)       0.59 (.15)
Resptype is RR      0.68 (.18)       0.71 (.06)
Random part
σu0²                0.67 (.14)       0.93 (.19)

In the binomial distribution (and also in the Poisson and gamma distributions), the lowest-level variance is completely determined when the mean (which in the binomial case is the proportion) is known. Therefore, σ²e has no useful interpretation in these models; it is used to define the scale of the underlying continuous variate. By default, σ²e is fixed at 1.00, which is equivalent to the assumption that the binomial distribution (or other distributions such as Normal, Poisson, gamma) holds exactly. In some applications, the variance of the error distribution turns out to be much larger than expected; there is overdispersion (see Aitkin et al., 1989; McCullagh & Nelder, 1989). If we estimate the overdispersion parameter for our meta-analysis data, we find a value of 21.9 with a corresponding standard error of 2.12. This value for the scale factor is simply enormous. It indicates a gross misspecification of the model. This is indeed the case, since the model does not contain important explanatory variables, such as the data collection method. Taking into account the small samples at the condition level (on average 2.2 conditions per study), the large estimate for the extrabinomial variation is probably not very accurate (Wright, 1997). For the moment, we ignore the extrabinomial variation.

It is tempting to use the value of 1.00 as a variance estimate to calculate the intraclass correlation for the null model in Table 6.4. However, the value of 1.00 is just a scale factor. The variance of a logistic distribution with scale factor 1 is π²/3 ≈ 3.29 (with π ≈ 3.14, see Evans, Hastings, & Peacock, 2000). So the intraclass correlation for the null model, using the PQL 2 estimate of the study-level variance, is ρ = 0.93/(0.93 + 3.29) = .22.
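The intraclass correlation computation is simple enough to verify directly. The sketch below uses the PQL 2 estimate of the study-level variance (0.93) and the variance of the standard logistic distribution; the variable names are illustrative.

```python
import math

sigma_u0_sq = 0.93                    # study-level variance, PQL 2 estimate
logistic_variance = math.pi ** 2 / 3  # variance of the standard logistic distribution

icc = sigma_u0_sq / (sigma_u0_sq + logistic_variance)
print(f"lowest-level variance = {logistic_variance:.2f}")
print(f"intraclass correlation = {icc:.2f}")
```

Using the scale factor of 1.00 in place of π²/3 would instead give 0.93/1.93 ≈ .48, which shows how much the choice of the lowest-level variance matters here.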

The next model adds the condition-level dummy variables for the telephone and the mail condition, assuming fixed regression slopes. The equation at the lowest (condition) level is:

πij= logistic (β0j+ β1j resptype_ij+ β2j tel_ij+ β3j mail_ij), (6.17)

and at the study level:

β0j= γ00+ u0j (6.18)

β1j= γ10 (6.19)

β2j= γ20 (6.20)

β3j= γ30. (6.21)

Substituting equations 6.18–6.21 into 6.17 we obtain:

πij= logistic (γ00+ γ10 resptype_ij+ γ20 tel_ij+ γ30 mail_ij+ u0j). (6.22)

Until now, the two dummy variables are treated as fixed. One could even argue that it does not make sense to model them as random, since the dummy variables are simple dichotomies that code for our three experimental conditions. The experimental conditions are under control of the investigator, and there is no reason to expect their effect to vary from one experiment to another. But some more thought leads to the conclusion that the situation is more complex. If we conduct a series of experiments, we would expect identical results only if the research subjects were all sampled from exactly the same population, and if the operations defining the experimental conditions were all carried out in exactly the same way. In the present case, both assumptions are questionable. In fact, some studies have sampled from the general population, while others sample from special populations such as college students. Similarly, although most articles give only a short description of the procedures that were actually used to implement the data collection methods, it is highly likely that they were not all identical. Even if we do not know all the details about the populations sampled and the procedures used, we may expect a lot of variation between the conditions as they were actually implemented. This would result in varying regression coefficients in our model. Thus, we analyze a model in which the slope coefficients of the dummy variables for the telephone and the mail condition are assumed to vary across studies.

This model is given by:

β0j= γ00+ u0j

β1j= γ10

β2j= γ20+ u2j

β3j= γ30+ u3j

which gives:

πij= logistic (γ00+ γ10 resptype_ij+ γ20 tel_ij+ γ30 mail_ij

+ u0j+ u2j tel_ij+ u3j mail_ij). (6.23)

Results for the models specified by equations 6.22 and 6.23, estimated by second order PQL, are given in Table 6.5.

Table 6.5 Models for response rates in different conditions

                    Conditions fixed    Conditions random
Fixed part          Coeff. (s.e.)       Coeff. (s.e.)
Intercept           0.90 (.14)          1.16 (.21)
Resptype is RR      0.53 (.06)          0.20 (.23)
Telephone           −0.16 (.02)         −0.20 (.09)
Mail                −0.49 (.03)         −0.57 (.15)
Random part
σu0²                0.86 (.18)          0.87 (.19)
σ²u(tel)                                0.26 (.07)
σ²u(mail)                               0.57 (.19)


The intercept represents the condition in which all explanatory variables are zero. When the telephone dummy and the mail dummy are both zero, we have the face-to-face condition. Thus, the values for the intercept in Table 6.5 estimate the expected completion rate (CR) in the face-to-face condition on the logit scale: 0.90 in the fixed model and 1.16 in the random slopes model.

The variable resptype indicates whether the response is a completion rate (resptype = 0) or a response rate (resptype = 1). The intercept plus the slope for resptype equals (1.16 + 0.20 =) 1.36 in the random slopes model. The values 1.16 and 1.36 on the logit scale translate to an expected completion rate of 0.76 and an expected response rate of 0.80 for the average face-to-face survey. The negative values of the slope coefficients for the telephone and mail dummy variables indicate that the expected response is lower in these conditions. To find out how much lower, we must use the regression equation to predict the response in the three conditions, and transform these values (which refer to the underlying logit variate) back to proportions. For the telephone condition, the predicted logit of the response rate is (1.36 − 0.20 =) 1.16, and for the mail condition it is (1.36 − 0.57 =) 0.79. These values on the logit scale translate to an expected response rate of 0.76 for the telephone survey and 0.69 for the mail survey.
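The predictions for the three conditions can be reproduced from the rounded coefficients of the random slopes model in Table 6.5. This is a minimal sketch: the dictionary keys and function names are assumptions made for illustration, and the study-level residuals u are set to zero, so the predictions refer to the average study.

```python
import math


def logistic(x):
    """Inverse logit: e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


# Rounded fixed coefficients, 'Conditions random' column of Table 6.5.
coef = {"intercept": 1.16, "resptype": 0.20, "tel": -0.20, "mail": -0.57}


def predicted_logit(resptype, tel, mail):
    """Fixed part of equation 6.23 (all u-terms set to zero)."""
    return (coef["intercept"] + coef["resptype"] * resptype
            + coef["tel"] * tel + coef["mail"] * mail)


cases = {
    "face-to-face, completion rate": (0, 0, 0),
    "face-to-face, response rate": (1, 0, 0),
    "telephone, response rate": (1, 1, 0),
    "mail, response rate": (1, 0, 1),
}
predicted = {}
for label, (resptype, tel, mail) in cases.items():
    eta = predicted_logit(resptype, tel, mail)
    predicted[label] = logistic(eta)
    print(f"{label}: logit = {eta:.2f}, proportion = {predicted[label]:.2f}")
```

Rounded to two decimals, the proportions agree with the values of 0.76, 0.80, 0.76, and 0.69 given in the text.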

Since we use a quasi-likelihood approach, the deviance difference test is not available for testing the significance of the study-level variances. However, using the Wald test based on the standard errors, the variances of the intercept and the conditions are obviously significant, and we may attempt to explain these using the known differences between the studies. In the example data, we have two study-level explanatory variables: year of publication, and the salience of the questionnaire topic. Since not all studies compare all three data collection methods, it is quite possible that study-level variables also explain between-condition variance. For instance, if older studies tend to have a higher response rate, and the telephone method is included only in the more recent studies (telephone interviewing is, after all, a relatively new method), the telephone condition may seem characterized by low response rates. After correcting for the year of publication, in that case the telephone response rates should look better. We cannot inspect the condition-level variance to investigate whether the higher-level variables explain condition-level variability. In the logistic regression model used here, without overdispersion, the lowest-level (condition-level) variance term is automatically constrained to π²/3 ≈ 3.29 in each model, and it remains the same in all analyses. Therefore, it is also not reported in the tables.
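As a rough illustration of the Wald test mentioned above, the sketch below divides each study-level variance estimate in the random slopes model of Table 6.5 by its standard error and refers the ratio to the standard normal distribution. Reporting a one-sided p-value is a choice made here, on the grounds that a variance cannot be negative; it is not prescribed by the text, and Wald tests for variances are only approximate.

```python
import math

# (estimate, standard error) from the 'Conditions random' column of Table 6.5.
variances = {
    "intercept": (0.87, 0.19),
    "telephone slope": (0.26, 0.07),
    "mail slope": (0.57, 0.19),
}


def wald_z(estimate, se):
    """Wald statistic: the estimate divided by its standard error."""
    return estimate / se


def one_sided_p(z):
    """Upper-tail probability of a standard normal variate."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


z_values = {name: wald_z(est, se) for name, (est, se) in variances.items()}
for name, z in z_values.items():
    print(f"{name}: Z = {z:.2f}, one-sided p = {one_sided_p(z):.4f}")
```

All three ratios are 3.0 or larger, well above the conventional critical value of 1.96, which is consistent with the statement that the variances are obviously significant.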

Both study-level variables make a significant contribution to the regression equation, but only the year of publication interacts with the two conditions. Thus, the final model for these data is given by:

πij= logistic (β0j+ β1j resptype_ij+ β2j tel_ij+ β3j mail_ij)

at the lowest (condition) level, and at the study level:


β0j= γ00+ γ01 year_j+ γ02 saliency_j+ u0j

β1j= γ10

β2j= γ20+ γ21 year_j+ u2j

β3j= γ30+ γ31 year_j+ u3j,

which produces the combined equation:

πij= logistic (γ00+ γ10 resptype_ij+ γ20 tel_ij+ γ30 mail_ij+ γ01 year_j+ γ02 saliency_j + γ21 tel_ij year_j+ γ31 mail_ij year_j+ u0j+ u2j tel_ij+ u3j mail_ij). (6.24)

Results for the model specified by equation 6.24 are given in Table 6.6. Since interactions are involved, the explanatory variable year has been centered on its overall mean value of 29.74.

Table 6.6 Models for response rates in different conditions, with random slopes and cross-level interactions

                    No interactions    With interactions
Fixed part          Coeff. (s.e.)      Coeff. (s.e.)
Intercept           0.33 (.25)         0.36 (.25)
Resptype            0.32 (.20)         0.28 (.20)
Telephone           −0.17 (.09)        −0.21 (.09)
Mail                −0.58 (.14)        −0.54 (.13)
Year                −0.02 (.01)        −0.03 (.01)
Saliency            0.69 (.17)         0.69 (.16)
Tel × Year                             0.02 (.01)
Mail × Year                            0.03 (.01)
Random part
σu0²                0.57 (.13)         0.57 (.14)
σu2²                0.25 (.07)         0.22 (.07)
σu3²                0.53 (.17)         0.39 (.15)

Compared to the earlier results, the regression coefficients are about the same in the model without the interactions. The variances for the telephone and mail slopes in the interaction model are lower than in the model without interactions, so the cross-level interactions explain some of the slope variation. The regression coefficients in Table 6.6 must be interpreted in terms of the underlying logit scale. Moreover, the logit transformation implies that raising the response becomes more difficult as we approach the limit of 1.00. To show what this means, the predicted response for the three methods is presented in Table 6.7 as logits (in parentheses) and proportions, both for a very salient (saliency = 2) and a non-salient (saliency = 0) questionnaire topic. To compute these numbers we must construct the regression equation implied by the last column of Table 6.6, and then use the inverse logistic transformation given earlier to transform the predicted logits back to proportions. The year 1947 was coded in the data file as zero; after centering, the year zero refers to 1977. As the expected response rates in Table 6.7 show, in 1977 the expected differences between the three data collection modes are small, while the effect of the saliency of the topic is much larger. To calculate the results in Table 6.7, the rounded values for the regression coefficients given in Table 6.6 were used.

To gain a better understanding of the development of the response rates over the years, it is useful to predict the response rates from the model and to plot these predictions over the years. This is done by filling in the regression equation implied by the final model for the three survey conditions, for the year varying from 1947 to 1998 (followed by centering on the overall mean of 29.74), with saliency fixed at the intermediate value of 1.
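The procedure of filling in the regression equation can be sketched as follows, using the rounded coefficients from the last column of Table 6.6. Several choices are assumptions made for this illustration: the function and key names, setting resptype to 1 (a response rate), setting all study-level residuals to zero, and the particular years printed. Because rounded coefficients are used, the results can differ in the second decimal from values computed with the unrounded estimates.

```python
import math


def logistic(x):
    """Inverse logit: e^x / (1 + e^x)."""
    return 1.0 / (1.0 + math.exp(-x))


# Rounded coefficients, 'With interactions' column of Table 6.6.
COEF = {
    "intercept": 0.36, "resptype": 0.28, "tel": -0.21, "mail": -0.54,
    "year": -0.03, "saliency": 0.69, "tel_x_year": 0.02, "mail_x_year": 0.03,
}
YEAR_MEAN = 29.74  # overall mean of year, where year is coded 0 = 1947
METHODS = ("face-to-face", "telephone", "mail")


def predicted_response(calendar_year, method, saliency=1, resptype=1):
    """Predicted proportion from the fixed part of equation 6.24."""
    if method not in METHODS:
        raise ValueError(f"unknown method: {method!r}")
    year_c = (calendar_year - 1947) - YEAR_MEAN  # centered year
    tel = 1 if method == "telephone" else 0
    mail = 1 if method == "mail" else 0
    eta = (COEF["intercept"] + COEF["resptype"] * resptype
           + COEF["tel"] * tel + COEF["mail"] * mail
           + COEF["year"] * year_c + COEF["saliency"] * saliency
           + COEF["tel_x_year"] * tel * year_c
           + COEF["mail_x_year"] * mail * year_c)
    return logistic(eta)


def spread(calendar_year):
    """Largest minus smallest predicted response across the three methods."""
    values = [predicted_response(calendar_year, m) for m in METHODS]
    return max(values) - min(values)


for year in (1947, 1962, 1977, 1992):
    row = "  ".join(f"{m}: {predicted_response(year, m):.2f}" for m in METHODS)
    print(f"{year}  {row}")
```

In this sketch the year slope is −0.03 for face-to-face surveys, (−0.03 + 0.02 =) −0.01 for telephone surveys, and (−0.03 + 0.03 =) 0.00 for mail surveys, so the face-to-face predictions decline fastest and the mail predictions stay constant.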

Figure 6.2 presents the predictions for the response rates, based on the cross-level interaction model, with the saliency variable set to intermediate. The oldest study was published in 1947, the latest in 1992. At the beginning, in 1947, the cross-level model implies that the differences between the three data collection modes were large. After that, the response rates for face-to-face and telephone surveys declined, with face-to-face interviews declining the fastest, while the response rates for mail surveys remained stable. As a result, the response rates for the three data collection modes have become more similar in recent years. Note that the response rate scale has been cut off at 65%, which exaggerates the difference in trends.

Table 6.7 Predicted response rates (logits in parentheses), based on the cross-level interaction model. Year centered on 1977

Topic          Face-to-face    Telephone      Mail
Not salient    0.65 (.63)      0.61 (.44)     0.53 (.11)
Salient        0.88 (2.01)     0.86 (1.82)    0.82 (1.49)


6.5 THE EVER CHANGING LATENT SCALE: COMPARING

In document 2010 Hox (Page 134-144)