Data Generation - Estimation of Average Total Effects in Quasi-Experimental Designs: Nonlinear

Datasets for the Monte Carlo study were generated in accordance with the theory of stochastic causality in order to study how different implementations of generalized analysis of covariance for quasi-experimental designs perform when the covariate Z and the treatment variable X are stochastic regressors.

Overview. Datasets for the simulation study were generated in the following four steps: In the first step, a single univariate covariate Z was created according to the different sample sizes N, as summarized in Table 4.1. A standardized normally distributed covariate Z^∗ with an expectation of zero and a variance of one was generated for each replication of the data generation (see Listing 4.1 for details), i. e., Z^∗ ∼ Norm(0, 1). The simulated N subjects are assumed to be a random sample from an infinite universe (see Schochet, 2009, as well as section 1.1.9). In the second step, the values of the true outcome variable τ0 were generated according to the parameterization of the intercept function, and the values of the individual total effect variable δ10 were generated in line with the parameterization of the effect function. Besides different regression coefficients for the outcome model, different true residual variances were incorporated to vary the amount of between-group residual variance heterogeneity. In the third step, allocation to the treatment conditions X = 0 or X = 1 was simulated based on individual treatment probabilities computed from an assignment model. These probabilities were obtained as a function of the values of the standardized covariate Z^∗ and the additional true parameters of the assignment model. Finally, in the fourth step, the appropriate value of the outcome variable Y was assigned to each simulated subject: depending on the value of the treatment variable X, either τ0 or τ0 + δ10 was used to compute the outcome variable.

Table 4.1: Sample sizes used for data generation in simulations I and II

Simulation Study    Sample Sizes N
I                   100, 250, 400, 1000
II                  20, 30, 50, 75, 100, 150, 200, 250, 500, 1000

4.2.1 Assignment Model

The treatment assignment was randomized conditional on the covariate Z based on a model for (conditional) treatment probabilities, computed as a logistic transformation of the standardized covariate Z^∗. For the simple case of two groups and one covariate considered in simulation studies I and II, this transformation was parameterized with the two parameters α0 and α1. To keep these parameters constant across conditions with different expectations E(Z) of the observed covariate, the covariate entered the assignment model in its standardized form Z^∗, i. e.,

\[
P(X = 0 \mid Z^{*}) = \frac{1}{1 + \exp\left(\alpha_0 + \alpha_1 \cdot Z^{*}\right)}. \tag{4.1}
\]

The model-implied treatment probability P(X = 1 | Z^∗ = z) = 1 − P(X = 0 | Z^∗ = z) was compared to draws from a uniformly distributed random variable for each simulated unit (see line 9 in Listing 4.1). This data generation procedure ensured that the treatment assignment was randomized given the value of Z^∗ (i. e., the treatment assignment was generated to be strongly ignorable given Z^∗). Under this procedure, the correlation between the treatment variable X and the covariate Z^∗ (as well as the correlation between X and the transformed covariate Z) depends on the α1 parameter for a given value of the α0 parameter [see Equation (4.1)]. We describe this dependency by the coefficient of determination R²_{X|Z} for dichotomous variables (Nagelkerke, 1991). Furthermore, the group size P(X = 1) depends on the parameters α0 and α1 used for generating the true treatment probabilities. Table 4.2 presents the selected values of α0 and α1, the resulting correlations between X and Z, and the resulting group sizes P(X = 1).¹

Table 4.2: Data generation (assignment model) used in simulations I and II

Nagelkerke's R²_{X|Z}    Group Size P(X = 1)    Logistic α0    Logistic α1    Pearson Cor(X, Z)
0.75                     0.2                    -4.3           -4.9           -0.65
0.75                     0.5                     0             -4.2           -0.73
0.75                     0.8                     4.3           -4.9           -0.65
0.5                      0.2                    -2.4           -2.3           -0.55
0.5                      0.5                     0             -2             -0.6
0.5                      0.8                     2.4           -2.3           -0.55
0.25                     0.2                    -1.8           -1.2           -0.39
0.25                     0.5                     0             -1.1           -0.44
0.25                     0.8                     1.8           -1.2           -0.39
0.1                      0.2                    -1.4           -0.7           -0.27
0.1                      0.5                     0             -0.6           -0.28
0.1                      0.8                     1.4           -0.7           -0.27

Note: The coefficient of determination is Nagelkerke's R²_{X|Z}. The correlation between X and Z is given as Pearson correlation and was estimated over 3000 replications with a sample size of 1000.
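As a rough plausibility check of a single row of Table 4.2, the assignment step can be simulated directly. The following R sketch is illustrative only: the helper name simulate_assignment(), the seed, and the use of one large sample (instead of 3000 replications with N = 1000) are assumptions, and Nagelkerke's R² is computed here from the log-likelihoods of two glm() fits with the standard formula, which is not necessarily the routine used in the original study.

```r
# Hypothetical helper (not part of the original study code):
# simulates the covariate and the treatment assignment for one condition.
simulate_assignment <- function(n, alpha0, alpha1) {
  z_star <- rnorm(n, mean = 0, sd = 1)  # standardized covariate Z*
  # P(X = 1 | Z*) = 1 - P(X = 0 | Z*), with P(X = 0 | Z*) from Equation (4.1)
  pscore <- 1 - 1 / (1 + exp(alpha0 + alpha1 * z_star))
  x <- as.numeric(runif(n) <= pscore)   # randomized assignment given Z*
  data.frame(z_star = z_star, x = x)
}

set.seed(1)   # arbitrary seed, assumed for reproducibility
n <- 50000    # one large sample; an assumption made for this sketch
d <- simulate_assignment(n, alpha0 = -2.4, alpha1 = -2.3)

group_size <- mean(d$x)          # estimate of P(X = 1)
cor_xz <- cor(d$x, d$z_star)     # Pearson correlation of X and Z*

# Nagelkerke's R^2: Cox-Snell R^2 rescaled by its maximum attainable value
fit1 <- glm(x ~ z_star, family = binomial, data = d)
fit0 <- glm(x ~ 1, family = binomial, data = d)
ll1 <- as.numeric(logLik(fit1))
ll0 <- as.numeric(logLik(fit0))
r2_cox_snell <- 1 - exp(2 * (ll0 - ll1) / n)
r2_nagelkerke <- r2_cox_snell / (1 - exp(2 * ll0 / n))

print(round(c(R2 = r2_nagelkerke, P_X1 = group_size, Cor = cor_xz), 2))
```

With α0 = −2.4 and α1 = −2.3, the three quantities should land near the tabled values (0.5, 0.2, and −0.55), up to the rounding of the parameters and sampling error. Because Z is a positive linear transformation of Z^∗, the correlation with Z is the same as the correlation with Z^∗.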

In total, twelve distinct combinations of the parameters α0 and α1 were selected in order to generate datasets which cover a wide range of different dependencies of X and Z. Furthermore, the following three group size conditions were chosen: equal group sizes [P(X = 1) = 0.5], unequal group sizes with the treatment group larger than the control group [P(X = 1) = 0.8], and unequal group sizes with the treatment group smaller than the control group [P(X = 1) = 0.2].

4.2.2 Outcome Model

While the treatment assignment was generated with respect to the standardized covariate Z^∗ (expectation of zero, unit variance), transformed covariates with expectations different from zero were used for the outcome model:

\[
Z = \mu_Z + \sigma_Z \cdot Z^{*}.
\]

Two values for the expectation E(Z) = µZ were selected, one for each part of the simulation study.² The variance of the covariate, i. e., Var(Z) = σ²_Z, was fixed at the value of one for all conditions.

¹ The interdependence of Nagelkerke's R²_{X|Z}, the Pearson correlation coefficient Cor(X, Z), and the group size P(X = 1) as functions of α0 and α1 is also visualized in the additional Figure 1 on page 12 of the digital appendix.

² In simulation study I, a value of µZ = 10 was used for data generation, and in study II datasets were generated with µZ = 5. Within each part of the Monte Carlo study, the expectation of the covariate E(Z) was not manipulated as an additional factor of the simulation design.

 1  # Covariate
 2
 3  z_star <- rnorm(n, 0, 1)
 4  z <- mean.tau_z + sqrt(var.tau_z) * z_star
 5
 6  # Assignment Model
 7
 8  pscore <- 1 - 1 / (1 + exp(alpha0 + alpha1 * z_star))
 9  x <- 0.0 + (runif(n) <= pscore)
10
11  # Outcome Model
12
13  eps_tau_0 <- rnorm(n, mean=0, sd=sqrt(var.eps_tau_0))
14  eps_delta_10 <- rnorm(n, mean=0, sd=sqrt(var.eps_delta_10))
15  zeta <- rnorm(n, mean=0, sd=sqrt(var.zeta))
16
17  tau_0 <- ga00 + ga01 * z + eps_tau_0
18  delta_10 <- ga10 + ga11 * z + eps_delta_10
19
20  y <- tau_0 + delta_10 * x + zeta

Listing 4.1: R syntax for the data generation

The datasets were generated in such a way that the covariate-treatment regression E(Y | X, Z) was always Z-conditionally unbiased. The following linear parameterization of the regression of the outcome variable Y on the treatment variable X (i. e., of the covariate-treatment regression) with Z as univariate numerical covariate was selected as the functional form:

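\[
\begin{aligned}
E(Y \mid X, Z) &= E(\tau_0 \mid Z) + E(\delta_{10} \mid Z) \cdot X \\
&= \left(\gamma_{00} + \gamma_{01} \cdot Z\right) + \left(\gamma_{10} + \gamma_{11} \cdot Z\right) \cdot X.
\end{aligned} \tag{4.2}
\]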

As discussed in section 3.1 as one of the implications of the theory of stochastic causality, we can distinguish ζ ≡ Y − E(Y | X, Z), ε_{X=0} ≡ τ0 − E(τ0 | Z), and ε_{δ10} ≡ δ10 − E(δ10 | Z) as residual terms for a dichotomous treatment variable X:

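\[
\begin{aligned}
Y &= E(Y \mid X, Z) + \zeta \\
&= \left(E(\tau_0 \mid Z) + \varepsilon_{X=0}\right) \cdot I_{X=0} + \left(E(\tau_1 \mid Z) + \varepsilon_{X=1}\right) \cdot I_{X=1} + \zeta \\
&= \left(E(\tau_0 \mid Z) + \varepsilon_{X=0}\right) + \left(E(\delta_{10} \mid Z) + (\varepsilon_{X=1} - \varepsilon_{X=0})\right) \cdot X + \zeta \\
&= \left(E(\tau_0 \mid Z) + \varepsilon_{X=0}\right) + \left(E(\delta_{10} \mid Z) + \varepsilon_{\delta_{10}}\right) \cdot X + \zeta.
\end{aligned} \tag{4.3}
\]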

The three residuals ε_{X=0}, ε_{δ10}, and ζ (or, more precisely, their variances) cannot be identified in empirical applications with the methods discussed in this thesis. Nevertheless, the individual total effect δ10 = τ1 − τ0 can alternatively be expressed as a regression on the covariate, i. e., for the linear parameterization presented in Equation (4.2) as

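\[
\begin{aligned}
E(\delta_{10} \mid Z) &= E(\tau_1 \mid Z) - E(\tau_0 \mid Z) \\
&= \gamma_{10} + \gamma_{11} \cdot Z, \quad \text{with} \\
\varepsilon_{\delta_{10}} &\equiv \delta_{10} - E(\delta_{10} \mid Z), \quad \text{and} \\
\operatorname{Var}(\delta_{10}) &= \operatorname{Var}\left(\gamma_{10} + \gamma_{11} \cdot Z + \varepsilon_{\delta_{10}}\right) \\
&= \gamma_{11}^{2} \operatorname{Var}(Z) + \operatorname{Var}(\varepsilon_{\delta_{10}}).
\end{aligned} \tag{4.4}
\]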
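The variance decomposition in Equation (4.4) can be checked numerically. The R sketch below is a minimal illustration under assumed values: µZ = 10 and γ11 = 2.5 are taken from the conditions of simulation study I, whereas γ10 = −25, the residual variance Var(ε_{δ10}) = 4, the sample size, and the seed are hypothetical choices made only for this example.

```r
set.seed(2)                    # arbitrary seed (assumption)
n <- 200000                    # large sample so that sampling error is small

mu_z <- 10                     # E(Z), as in simulation study I
sigma_z <- 1                   # Var(Z) fixed at one
ga10 <- -25                    # hypothetical intercept of the effect function
ga11 <- 2.5                    # slope of the effect function
var_eps_delta_10 <- 4          # hypothetical residual variance of delta_10

z <- mu_z + sigma_z * rnorm(n)
eps_delta_10 <- rnorm(n, mean = 0, sd = sqrt(var_eps_delta_10))
delta_10 <- ga10 + ga11 * z + eps_delta_10

var_empirical <- var(delta_10)
# Equation (4.4): Var(delta_10) = ga11^2 * Var(Z) + Var(eps_delta_10)
var_implied <- ga11^2 * sigma_z^2 + var_eps_delta_10

print(c(empirical = var_empirical, implied = var_implied))
```

The decomposition relies on Z and ε_{δ10} being generated independently, so that no covariance term appears; the constant γ10 does not contribute to the variance.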

According to the parameterization described above, the complete generation for the outcome variable Y can be summarized as:

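\[
Y = \left(\gamma_{00} + \gamma_{01} \cdot Z + \varepsilon_{X=0}\right) + \left(\gamma_{10} + \gamma_{11} \cdot Z + \varepsilon_{\delta_{10}}\right) \cdot X + \zeta. \tag{4.5}
\]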

Technically, the outcome variable was generated by splitting Equation (4.5) into small parts: In line 4 of Listing 4.1, the covariate Z is generated. In lines 13, 14, and 15, the values of the three residual variables are drawn from normal distributions with an expectation of zero and the respective residual variances. The values of the true outcome variable in the control condition τ0 and the values of the individual total effect variable δ10 are generated in lines 17 and 18. Finally, the outcome model is completed in line 20. In other words, the covariate-treatment regression E(Y | X, Z) was generated as a linear moderated regression with heteroskedastic errors.³
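To make the steps of Listing 4.1 easier to try out, they can be wrapped in a function and combined with a simple check that a moderated regression recovers the generating coefficients. This is only a sketch: the function name generate_data(), its argument names, the residual variances of one, γ00 = 0, the sample size, and the seed are assumptions introduced for illustration. The values α0 = 0, α1 = −1.1, µZ = 10, γ01 = 1, and γ11 = 2.5 correspond to conditions reported for simulation study I, with γ10 = −γ11 · E(Z) so that the true average total effect is zero.

```r
# Hypothetical wrapper around the steps of Listing 4.1.
generate_data <- function(n, alpha0, alpha1, mu_z, var_z,
                          ga00, ga01, ga10, ga11,
                          var_eps_tau_0, var_eps_delta_10, var_zeta) {
  # Covariate: standardized Z* and transformed Z
  z_star <- rnorm(n, 0, 1)
  z <- mu_z + sqrt(var_z) * z_star
  # Assignment model: P(X = 1 | Z*) compared to uniform draws
  pscore <- 1 - 1 / (1 + exp(alpha0 + alpha1 * z_star))
  x <- as.numeric(runif(n) <= pscore)
  # Outcome model: three independent residuals with the given variances
  eps_tau_0 <- rnorm(n, mean = 0, sd = sqrt(var_eps_tau_0))
  eps_delta_10 <- rnorm(n, mean = 0, sd = sqrt(var_eps_delta_10))
  zeta <- rnorm(n, mean = 0, sd = sqrt(var_zeta))
  tau_0 <- ga00 + ga01 * z + eps_tau_0
  delta_10 <- ga10 + ga11 * z + eps_delta_10
  y <- tau_0 + delta_10 * x + zeta
  data.frame(y = y, x = x, z = z)
}

set.seed(3)  # arbitrary seed (assumption)
d <- generate_data(n = 50000, alpha0 = 0, alpha1 = -1.1,
                   mu_z = 10, var_z = 1,
                   ga00 = 0, ga01 = 1, ga10 = -25, ga11 = 2.5,
                   var_eps_tau_0 = 1, var_eps_delta_10 = 1, var_zeta = 1)

# Moderated regression E(Y | X, Z) with the linear form of Equation (4.2)
fit <- lm(y ~ x * z, data = d)
coef_hat <- coef(fit)  # (Intercept), x, z, x:z estimate ga00, ga10, ga01, ga11

# Plug-in estimate of the average total effect: ga10 + ga11 * E(Z)
ate_hat <- unname(coef_hat["x"] + coef_hat["x:z"] * mean(d$z))
print(round(c(coef_hat, ATE = ate_hat), 3))
```

Ordinary least squares is used here only to check the generating model; it does not account for the between-group heteroskedasticity or for the stochastic covariate when computing standard errors, which is the issue the simulation study addresses.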

Table 4.3: Regression coefficients and effect sizes used for data generation in simulations I and II

Simulation Study    Parameter    Selected Values for the Data Generation
I                   γ01          1, 5
                    γ11          0.5, 1, 2.5, 5, 7.5, 10
                    d            0
II                  γ01          1, 5
                    γ11          1, 2.5, 5
                    d            0, 0.2, 0.5, 0.8

For all conditions of simulation study I, data were generated with a true population value of zero for the average total effect, i. e., ATE10 = E(γ10 + γ11 · Z) = 0 (d = 0). Therefore, the appropriate regression coefficient γ10 was computed as a function of γ11 and E(Z), i. e., γ10 = −γ11 · E(Z). For simulation study II, the true average total effect was generated with different effect sizes based on Cohen's definition. Four different values for the effect size d were chosen for this part of the simulation study. In terms of d, a

³ The individual residual variances are equal to Var(ζ) + Var(ε_{X=0}) for all individuals assigned to treatment group X = 0 and to Var(ζ) + Var(ε_{X=0}) + Var(ε_{δ10}) for units assigned to X = 1. All residual terms are generated independently and are therefore uncorrelated.

In document Estimation of Average Total Effects in Quasi-Experimental Designs: Nonlinear Contraints in Structural Equation Models (Page 127-132)