Efficient Nonparametric Estimation of Causal Effects in Randomized Trials with Noncompliance

BY

Jing Cheng

Division of Biostatistics, University of Florida College of Medicine,

Gainesville, Florida 32610, U.S.A.

Dylan S. Small

Department of Statistics, University of Pennsylvania,

Philadelphia, Pennsylvania 19104, U.S.A.

Zhiqiang Tan

Department of Statistics, Rutgers University,

Piscataway, New Jersey 08854, U.S.A.

Thomas R. Ten Have

Division of Biostatistics, University of Pennsylvania School of Medicine,

Philadelphia, Pennsylvania 19104, U.S.A.

SUMMARY

Causal approaches based on the potential outcome framework provide a useful tool for addressing noncompliance problems in randomized trials. We propose a new estimator of causal treatment effects in randomized clinical trials with noncompliance. We use the empirical likelihood approach to construct a profile random sieve likelihood and take into account the mixture structure in outcome distributions, so that our estimator is robust to parametric distribution assumptions and provides substantial finite-sample efficiency gains over the standard instrumental variable estimator. Our estimator is asymptotically equivalent to the standard instrumental variable estimator, and it can be applied to outcome variables with a continuous, ordinal or binary scale. We apply our method to data from a randomized trial of an intervention to improve the treatment of depression among depressed elderly patients in primary care practices.

Some key words: Causal effect; Efficient nonparametric estimation; Empirical likelihood;

Noncompliance; Randomized trials.

1. Introduction

When there is noncompliance in randomized trials, there is often interest in estimating the causal effect of actually receiving the treatment compared to receiving the control. Knowledge of this effect is useful for predicting the impact of the treatment in a setting for which compliance patterns might differ from the randomized trial and for scientific understanding of the treatment (Sommer & Zeger, 1991; Sheiner & Rubin, 1995; Small et al., 2006; Cheng & Small, 2006).

Note that intention-to-treat analysis is not suitable for estimating the causal effect of actually receiving the treatment when there is noncompliance because it estimates the effect of assignment to the treatment group. An as-treated analysis seeks to estimate the causal effect of receiving the treatment but is biased if compliers are not comparable to noncompliers. Imbens & Angrist (1994) and Angrist et al. (1996) show that the causal effect of actually receiving the treatment for the subgroup of subjects who would receive the treatment if assigned to the treatment group and would receive the control if assigned to the control group, called the complier average causal effect or the local average treatment effect in the econometrics literature (Imbens & Angrist, 1994), is nonparametrically identified under certain, often plausible, assumptions that do not require compliers and noncompliers to be comparable. These assumptions, henceforth referred to as the instrumental variable assumptions, are discussed in §2. The complier average causal effect can be consistently estimated under the instrumental variable assumptions by the standard two-stage least-squares instrumental variables estimator. Imbens & Rubin (1997a, b) demonstrate that, under the assumptions, the standard instrumental variable estimator is an inefficient estimator of the complier average causal effect because it does not make full use of the mixture structure of the outcome distributions of the four observed groups defined by the cross classification of the randomization and treatment received; see §2.4 for further discussion. Imbens & Rubin (1997b) present three new alternatives to the standard IV estimator. One is based on a normal approximation and two are based on multinomial approximations to the outcome distributions in the four groups. In a simulation study with normally distributed outcomes, Imbens & Rubin (1997b) show that all three alternative estimators are more efficient than the standard IV estimator. However, the estimator that is based on a normal approximation to the outcome distributions can have substantial bias when the outcomes are not normal; this is demonstrated in §4. The estimators based on multinomial approximations to the outcome distributions are in principle nonparametric. However, a systematic approach for choosing the multinomial approximations is needed.

Multinomial approximations to the outcome distributions are a type of sieve. A sieve is a sequence of approximations $\{F_n\}$ to a space $F$ of distributions such that $F_n \to F$ as $n \to \infty$ (Grenander, 1981). Maximizing the likelihood over a sieve rather than the whole parameter space often leads to desirable statistical properties, especially when the underlying parameter space is large (Shen & Wong, 1994). However, the construction of sieves is not an easy task. One approach to constructing sieves is to use a random approximation $\hat F_n$ that depends on the data, a random sieve. The empirical likelihood approach (Owen, 1991) is based on an easily constructed random sieve (Shen et al., 1999). In this paper, we use the empirical likelihood approach to construct an efficient estimator for the complier average causal effect.

2. Notation, Assumptions and Review of Established Estimators

2.1 Notation

We consider a two-arm randomized trial with $N$ subjects, $n_0$ of whom are randomly assigned to the control group. We use letters with and without a star to denote vectors and scalars respectively. Let $R^*$ be the $N$-dimensional vector of randomization assignments for all subjects, with individual element $R_i = r \in \{0, 1\}$ according to whether subject $i$ is assigned active treatment, $R_i = 1$, or control, $R_i = 0$. We let $A_*^{r^*}$ be the $N$-dimensional vector of potential treatments received under the vector of randomization assignments $r^*$, with individual element $A_i^{r^*} = a \in \{0, 1\}$ according to whether subject $i$ would take the control or treatment under randomization assignment $r^*$. We let $Y_*^{r^*,a^*}$ be the vector of potential responses under randomization assignment $r^*$ and treatments received $a^*$, with individual element $Y_i^{r^*,a^*}$ being the potential response for subject $i$ under the vectors of randomization assignments $r^*$ and treatments received $a^*$. The sets $\{Y_i^{r^*,a^*} \mid r^* \in \{0, 1\}^N, a^* \in \{0, 1\}^N\}$ and $\{A_i^{r^*} \mid r^* \in \{0, 1\}^N\}$ are ‘potential’ responses and treatments received in the sense that we can only observe one member of each set. The observed outcome and treatment received variables for subject $i$ are $Y_i^{R^*, A_*^{R^*}} \equiv Y_i$ and $A_i^{R^*} \equiv A_i$ respectively.

2.2 Assumptions

We make similar assumptions to those in Angrist et al. (1996).

Assumption 1: Stable unit treatment value assumption (Rubin, 1980). (i) If $r = r'$, then $A_i^{r^*} = A_i^{r'^*}$ for subject $i$. (ii) If $r = r'$ and $a = a'$, then $Y_i^{r^*,a^*} = Y_i^{r'^*,a'^*}$ for subject $i$. This assumption allows us to write $Y_i^{r^*,a^*}$, $A_i^{r^*}$ as $Y_i^{r,a}$, $A_i^{r}$.

Assumption 2: Random assignment. This assumption implies independence between

assignment and pretreatment variables including potential outcomes and treatment receiveds.

Assumption 3: Random sampling. We assume that the $N$ subjects in the trial are independent and identically distributed draws from a superpopulation; that is, $Y_i^{r,a}$ and $A_i^{r}$, $i = 1, \ldots, N$, are independent and identically distributed with the same distribution as the

Assumption 4: Mean exclusion restriction. We assume that $E(Y^{r,a}) = E(Y^{r',a})$ for all $r, r', a$; that is, the randomization assignment affects the mean of the observed outcome only through its effect on treatment received. Note that the mean exclusion restriction is weaker than the unit-level exclusion restriction of Angrist et al. (1996), who assume that $Y_i^{r,a} = Y_i^{r',a}$ for all $r, r', a$. However, we think that in most applications in which the weaker mean exclusion restriction is plausible, the stronger unit-level exclusion restriction is also plausible and so we primarily use the weaker mean exclusion restriction because it is easier to work with this assumption.

Assumption 5: Nonzero average causal effect of R on A.

Assumption 6: Monotonicity. We assume that $\mathrm{pr}(A^1 \ge A^0) = 1$. This assumption says that there is no one who would receive the opposite treatment of his or her assignment under both assignment to treatment and to control.

2.3 Compliance classes

A subject in a two-arm trial can be classified into one of four compliance classes:

$$C_i = \begin{cases} 0 \ \text{(never-taker)} & \text{if } (A_i^0, A_i^1) = (0, 0), \\ 1 \ \text{(complier)} & \text{if } (A_i^0, A_i^1) = (0, 1), \\ 2 \ \text{(always-taker)} & \text{if } (A_i^0, A_i^1) = (1, 1), \\ 3 \ \text{(defier)} & \text{if } (A_i^0, A_i^1) = (1, 0). \end{cases}$$

In practice, we can observe only one of $A_i^0$ and $A_i^1$, so that a subject’s compliance status is not observed directly in a trial, but it can be partially identified based on treatment assignment and observed treatment received; see Table 1. Note that the monotonicity assumption rules out the existence of defiers. For single consent design trials (Zelen, 1979), which have the property that the control group cannot access the treatment, that is, $\mathrm{pr}(A^0 = 0) = 1$, the presence of always-takers and defiers is ruled out.

2.4 Major established estimators

Under Assumptions 1-6, the compliers are the only subgroup for which a randomized trial provides information about the causal effect of receiving treatment (Angrist et al., 1996). For always-takers and never-takers, assignment to treatment has no effect on treatment received. The complier average causal effect, $E(Y^1 - Y^0 \mid C = 1)$, can be thought of as the causal effect of receiving treatment for the subpopulation of compliers because, for compliers, assignment of treatment agrees with receipt of treatment. Angrist et al. (1996) show that, under Assumptions 1-6, the complier average causal effect is

$$\mathrm{CACE} = \frac{E(Y \mid R = 1) - E(Y \mid R = 0)}{E(A \mid R = 1) - E(A \mid R = 0)}, \tag{1}$$

which is the intention-to-treat effect divided by the proportion of compliers. The standard instrumental variable estimator is the sample analogue of (1),

$$\widehat{\mathrm{CACE}}_S = \frac{\hat E(Y \mid R = 1) - \hat E(Y \mid R = 0)}{\hat E(A \mid R = 1) - \hat E(A \mid R = 0)}, \tag{2}$$

where the $\hat E$’s denote sample means; (2) is sometimes called the Wald estimator.

The standard instrumental variable estimator does not take full advantage of the mixture structure of the outcomes of the four observed groups in Table 1, as we will discuss in §3.1. Imbens & Rubin (1997a, b) present two approaches to using mixture modeling to estimate the complier average causal effect. One approach assumes a parametric distribution, such as normal, for the outcomes for each compliance class group under each randomization assignment. The complier average causal effect is then estimated by maximum likelihood for this model using the EM algorithm. This estimator provides considerable efficiency gains over the standard instrumental variable estimator when the parametric assumptions hold; see Table 4. However, when the parametric assumptions are wrong, this estimator can be inconsistent whereas the standard instrumental variable estimator is consistent; see Table 4 for finite-sample results.

Imbens and Rubin’s other approach to using mixture modeling to estimate the complier average causal effect is to approximate the density of the outcome distribution for each compliance class under each randomization group as a piecewise constant function, and then estimate the complier average causal effect by maximum likelihood. This approach is in principle nonparametric as the number of constant pieces in each density function can be increased with the sample size. However, Imbens & Rubin (1997b) do not provide a systematic approach for choosing the number and locations of the pieces. We develop a systematic, easily implementable approach for doing this using empirical likelihood in the next section.

3. Estimation through Empirical Likelihood Approach

3.1 Motivation and description of empirical likelihood approach

We first motivate and describe our method for single consent design trials, where the presence of always-takers and defiers is ruled out. Table 2 shows the relationship between observed (R, A) groups and latent compliance classes for a single consent design trial. The complier average causal effect can be re-expressed under Assumptions 1-6 as follows:

$$\mathrm{CACE} = \mu^{c1} - \mu^{c0} = \mu^{c1} - \frac{\mu^{R=0} - (1 - \pi_c)\mu^{n}}{\pi_c} = E(Y \mid R = 1, A = 1) - \frac{E(Y \mid R = 0) - \{1 - \mathrm{pr}(A = 1 \mid R = 1)\}E(Y \mid R = 1, A = 0)}{\mathrm{pr}(A = 1 \mid R = 1)}, \tag{3}$$

where $\mu^{c1}$, $\mu^{c0}$, $\mu^{n}$ and $\mu^{R=0}$ denote the mean potential outcomes of the compliers under treatment, compliers under control, never-takers and the whole population of subjects when assigned to the control respectively; and $\pi_c$ denotes the proportion of compliers. The standard instrumental variable estimator estimates the complier average causal effect by substituting the method of moments estimates from the sample for $E(Y \mid R = 1, A = 1)$, $E(Y \mid R = 0)$, $\mathrm{pr}(A = 1 \mid R = 1)$ and $E(Y \mid R = 1, A = 0)$ into (3). However, as noted by Imbens & Rubin (1997b), there are restrictions on the joint density of $(Y, R, A)$ that are not taken into account by the method of moments that can be useful for estimating $E(Y \mid R = 0)$, $\mathrm{pr}(A = 1 \mid R = 1)$ and $E(Y \mid R = 1, A = 0)$. To be specific, Assumptions 1-6 imply the following restrictions.

Restriction 1. The distribution of Y |R = 0 is a mixture of the outcome distribution of

the never-takers under R = 0 and the outcome distribution of the compliers under R = 0.

Restriction 2. The mixing proportion πc for Y |R = 0 equals pr(A = 1|R = 1) as a

consequence of Assumption 2.

Restriction 3. The mean of the never-takers under R = 0 is equal to the mean of the

never-takers under R = 1, which equals E(Y |R = 1, A = 0), as a consequence of Assumption 4.
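The link between Restrictions 1-3 and expression (3) can be written out step by step. The derivation below is a sketch for the single consent setting, where only compliers and never-takers are present; it uses only symbols already defined in the text.

```latex
\begin{align*}
E(Y \mid R = 0) &= \pi_c\,\mu^{c0} + (1 - \pi_c)\,\mu^{n}
  && \text{(Restriction 1, with mixing proportion } \pi_c\text{)} \\
\pi_c &= \mathrm{pr}(A = 1 \mid R = 1)
  && \text{(Restriction 2)} \\
\mu^{n} &= E(Y \mid R = 1, A = 0)
  && \text{(Restriction 3)} \\
\mu^{c0} &= \frac{E(Y \mid R = 0) - \{1 - \mathrm{pr}(A = 1 \mid R = 1)\}\,E(Y \mid R = 1, A = 0)}
                 {\mathrm{pr}(A = 1 \mid R = 1)}
  && \text{(solve the first line for } \mu^{c0}\text{)} \\
\mathrm{CACE} &= E(Y \mid R = 1, A = 1) - \mu^{c0}
  && \text{(since } \mu^{c1} = E(Y \mid R = 1, A = 1)\text{)}
\end{align*}
```

The last line uses the fact that, in a single consent design, every subject with $R = 1$ and $A = 1$ is a complier.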

The sample mean of Y |R = 0 uses only the information in those of Y1, . . . , YN for which

Ri = 0 to estimate E(Y |R = 0), but Restrictions 1-3 imply that there is additional information in those of Y1, . . . , YN for which Ri = 1. Similarly, the sample proportion of

A = 1|R = 1 uses only the information in those of A1, . . . , AN for which Ri = 1 to estimate

pr(A = 1|R = 1) but Restrictions 1-3 imply that there is additional information in those of A1, . . . , AN for which Ri = 0. A body of work has shown that supplementing a sample

from a distribution that is a mixture of two components with samples from one or both of the components alone provides additional information for estimating aspects of the mixture distribution; see for example Hall & Titterington (1984), Lancaster & Imbens (1996) and Qin (1999). Here, the sample of Y1, . . . , YN for which Ri = 1, Ai = 0 provides information

about the never-taker component of the mixture Y |R = 0 and the sample of A1, . . . , AN for

which Ri = 1 provides information about the mixing proportion in the mixture Y |R = 0.

We now illustrate how this information is useful in a setting with a binary outcome in which

πc= 0.5, µn = 0.2, µc1 = 0.8, µc0 = 0.9, N = 40, n0 = 20. (4)

The following is a plausible sample in this setting: #(Yi = 1, Ai = 1, Ri = 1) = 8, #(Yi =

0, Ai = 1, Ri = 1) = 2, #(Yi = 1, Ai = 0, Ri = 1) = 2, #(Yi = 0, Ai = 0, Ri = 1) = 8,

#(Yi = 1, Ai = 0, Ri = 0) = 13 and #(Yi = 0, Ai = 0, Ri = 0) = 7; the p-value for a χ2 test of

whether or not this sample comes from the distribution (4) is 0.37. Note that, for this sample, the method of moments estimates of the quantities in (3), namely $\hat E(Y \mid R = 1, A = 1) = 0.8$, $\hat E(Y \mid R = 0) = 0.65$, $\widehat{\mathrm{pr}}(A = 1 \mid R = 1) = 0.5$ and $\hat E(Y \mid R = 1, A = 0) = 0.2$, violate Restrictions 1-3: substituted into (3), they imply a complier mean under control of $(0.65 - 0.5 \times 0.2)/0.5 = 1.1$, which is impossible for a binary outcome. Figure 1 plots the profile log-likelihood for this sample under the probability model given by Assumptions 1-6 with binary outcomes. The maximum likelihood estimator of the complier average causal effect, which takes into account Restrictions 1-3, has a noticeably higher likelihood than the standard instrumental variable estimator, which ignores some of the restrictions. The maximum likelihood estimator’s property of taking into full account the mixture structure leads to substantially better estimates; in 1000 simulations from model (4), the mean squared error of the maximum likelihood estimator was 0.048 compared to 0.156 for the standard instrumental variable estimator.
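The method of moments quantities for the sample above can be checked directly from the cell counts. The sketch below rebuilds the 40 observations from the counts quoted in the text and substitutes the estimates into (3); the variable names are illustrative assumptions.

```python
import numpy as np

# Cell counts #(Y, A, R) for the sample described in the text.
counts = {(1, 1, 1): 8, (0, 1, 1): 2, (1, 0, 1): 2, (0, 0, 1): 8,
          (1, 0, 0): 13, (0, 0, 0): 7}
rows = np.array([cell for cell, n in counts.items() for _ in range(n)])
y, a, r = rows[:, 0], rows[:, 1], rows[:, 2]

mu_c1_hat = y[(r == 1) & (a == 1)].mean()  # estimate of E(Y | R = 1, A = 1)
mean_y_r0 = y[r == 0].mean()               # estimate of E(Y | R = 0)
pi_c_hat = a[r == 1].mean()                # estimate of pr(A = 1 | R = 1)
mu_n_hat = y[(r == 1) & (a == 0)].mean()   # estimate of E(Y | R = 1, A = 0)

# Complier mean under control implied by (3); a value outside [0, 1]
# cannot be the mean of a binary outcome.
mu_c0_implied = (mean_y_r0 - (1 - pi_c_hat) * mu_n_hat) / pi_c_hat
cace_iv = mu_c1_hat - mu_c0_implied        # standard instrumental variable estimate
```

Here the implied complier mean under control is 1.1 up to floating-point rounding, so the standard instrumental variable estimate of about -0.3 rests on an impossible component mean.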

To take account of the mixture structure of the outcomes given by Restrictions 1-3 for more general distributions of outcomes in a nonparametric way, we use the empirical likelihood approach. The empirical likelihood for a parameter such as the complier average causal effect is the nonparametric profile likelihood for the parameter. Maximum empirical likelihood estimators have good properties for a wide class of semiparametric problems; see Owen (2001) and Qin & Lawless (1994) for discussion.

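Without loss of generality, we arrange the subjects so that $R_1 = \ldots = R_{n_0} = 0$ and $R_{n_0+1} = \ldots = R_N = 1$; thus, $(Y_1, A_1), \ldots, (Y_{n_0}, A_{n_0})$ is a random sample from the population of $(Y^{r=0,a=A^{r=0}}, A^{r=0})$ and $(Y_{n_0+1}, A_{n_0+1}), \ldots, (Y_N, A_N)$ is a random sample from the population of $(Y^{r=1,a=A^{r=1}}, A^{r=1})$. The empirical likelihood $L_E$ of the parameters $(\pi_c, \mu^{n}, \mu^{c1}, \mu^{c0})$ is

$$L_E(\pi_c, \mu^{n}, \mu^{c1}, \mu^{c0}) = \max \left( \prod_{i=1}^{n_0} q_i \right) \left( \prod_{i=n_0+1}^{N} q_i \right), \tag{5}$$

subject to

$$\sum_{i=1}^{n_0} q_i = 1, \quad \sum_{i=n_0+1}^{N} q_i = 1, \quad q_i \ge 0, \ i = 1, \ldots, N, \tag{6}$$

$$\sum_{i=n_0+1}^{N} q_i A_i = \pi_c, \quad \sum_{i=n_0+1}^{N} q_i Y_i A_i = \mu^{c1}\pi_c, \quad \sum_{i=n_0+1}^{N} q_i Y_i (1 - A_i) = \mu^{n}(1 - \pi_c), \tag{7}$$

and the condition that there exist $p_i^{c0}$, $p_i^{n}$, $i = 1, \ldots, n_0$, such that

$$\pi_c p_i^{c0} + (1 - \pi_c) p_i^{n} = q_i, \tag{8}$$

$$\sum_{i=1}^{n_0} p_i^{c0} = \sum_{i=1}^{n_0} p_i^{n} = 1, \quad p_i^{c0}, p_i^{n} \ge 0, \ i = 1, \ldots, n_0, \tag{9}$$

$$\sum_{i=1}^{n_0} p_i^{n}(Y_i - \mu^{n}) = 0, \tag{10}$$

$$\sum_{i=1}^{n_0} p_i^{c0}(Y_i - \mu^{c0}) = 0. \tag{11}$$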

Note that throughout our paper, we will follow Owen (2001, Ch. 2.3) and regard tied data values $Y_i$, $Y_j$ as representing distinct outcomes in the empirical likelihood as this simplifies calculations and does not affect inferences. The $p_i^{c0}$ and $p_i^{n}$ in (8)-(11) represent the population probabilities that a complier assigned to the control and a never-taker assigned to the control have the same outcome as subject $i$ respectively. The conditions (8)-(11) involving the $p_i^{c0}$ and $p_i^{n}$ encode the restrictions on the distribution of $Y \mid R = 0$ that come from it being a mixture of the compliers and never-takers under Assumptions 1-6; see Restrictions 1-3. The maximum empirical likelihood estimate of $(\pi_c, \mu^{n}, \mu^{c1}, \mu^{c0})$ is $\arg\max_{\pi_c, \mu^{n}, \mu^{c1}, \mu^{c0}} L_E(\pi_c, \mu^{n}, \mu^{c1}, \mu^{c0})$. To ease the computational burden of computing the maximum empirical likelihood estimate, we do not maximize over $\mu^{n}$, but instead use the method of moments estimator $\hat\mu^{n} = \sum_{i=1}^{N} Y_i R_i (1 - A_i) / \sum_{i=1}^{N} R_i (1 - A_i)$ and maximize $L_E(\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ over $(\pi_c, \mu^{c1}, \mu^{c0})$. In model (4), this approximate maximum empirical likelihood estimator of the complier average causal effect performed almost as well as the maximum empirical likelihood estimator; its mean squared error was 0.051 compared to 0.048 for the maximum empirical likelihood estimator. We now present an algorithm for finding the approximate maximum empirical likelihood estimate.

3.2 Computation for empirical likelihood approach

To find the approximate maximum empirical likelihood estimate, we conduct a grid search over $\pi_c$, finding $\max_{\mu^{c1}, \mu^{c0}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ over a grid of $\tilde\pi_c$ from 0 to 1. As we will see below, $\arg\max_{\mu^{c1}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ does not depend on $\mu^{c0}$ and $\arg\max_{\mu^{c0}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ does not depend on $\mu^{c1}$, so finding the maximizing $\mu^{c1}$ and $\mu^{c0}$ can be done separately. For finding the maximizing $\mu^{c1}$, we note that $\arg\max_{\mu^{c1}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ equals $\arg\max_{\mu^{c1}} \prod_{i: R_i = A_i = 1} q_i$ subject to (i) $\sum_{i: R_i = A_i = 1} q_i = \tilde\pi_c$, (ii) $q_i \ge 0$, $i = 1, \ldots, N$ and (iii) $\sum_{i: R_i = A_i = 1} q_i Y_i = \tilde\pi_c \mu^{c1}$.

By multiplying the $q_i$’s by $1/\tilde\pi_c$, we see that finding $\arg\max_{\mu^{c1}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ is equivalent to finding the maximum empirical likelihood estimator of the mean of the population of $Y^{1,1} \mid C = 1$ based on the random sample $\{Y_i : A_i = 1, R_i = 1\}$; consequently, $\arg\max_{\mu^{c1}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ is the mean of $\{Y_i : A_i = 1, R_i = 1\}$; see Theorem 2.1 of Owen (2001). Thus, our estimate of $\mu^{c1}$ is $\hat\mu^{c1} = \left( \sum_{i=1}^{N} R_i A_i \right)^{-1} \sum_{i=1}^{N} R_i A_i Y_i$. To find our estimate of $\mu^{c0}$, let $(q_1^*, \ldots, q_{n_0}^*) = \arg\max_{q_1, \ldots, q_{n_0}} \prod_{i=1}^{n_0} q_i$ subject to (6) and (8)-(10) with $\mu^{n} = \hat\mu^{n}$, $\pi_c = \tilde\pi_c$. We have that

$$\arg\max_{\mu^{c0}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0}) = \frac{\sum_{i=1}^{n_0} q_i^* Y_i - (1 - \tilde\pi_c)\hat\mu^{n}}{\tilde\pi_c}, \tag{12}$$

where we use the fact that, for the $\mu^{c0}$ that satisfies $\sum_{i=1}^{n_0} q_i^* Y_i = \tilde\pi_c \mu^{c0} + (1 - \tilde\pi_c)\hat\mu^{n}$, the constraints (8)-(11) are satisfied for $q_1 = q_1^*, \ldots, q_{n_0} = q_{n_0}^*$. Thus, to find $\arg\max_{\mu^{c0}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$, we just need to find $q_1^*, \ldots, q_{n_0}^*$. To do this, we note that we can view $(q_1^*, \ldots, q_{n_0}^*)$ as the maximum likelihood estimate of the category probabilities for the sample $Y_1, \ldots, Y_{n_0}$ from an independent and identically distributed multinomial model with categories $Y_1, \ldots, Y_{n_0}$, corresponding category probabilities $q_1, \ldots, q_{n_0}$ and parameter restrictions given by (6) and (8)-(10) with $\mu^{n} = \hat\mu^{n}$, $\pi_c = \tilde\pi_c$. Finding the maximum likelihood estimate directly is challenging because of the complex parameter restrictions in (8)-(10). However, consider using the EM algorithm, where we regard each subject’s compliance class as ‘missing data.’ We can reexpress the observed data likelihood $\prod_{i=1}^{n_0} q_i$ and the parameter restrictions (6) and (8)-(10) in terms of $p_i^{c0}$ and $p_i^{n}$; see Appendix 1 for details. We can then use the EM algorithm to find the $p_i^{c0}$ and $p_i^{n}$ to maximize the observed data likelihood and then find the corresponding maximizing $q_i$’s by (8). The complete-data likelihood is $\prod_{i: R_i = 0, C_i = 1} p_i^{c0} \prod_{i: R_i = 0, C_i = 0} p_i^{n}$. Since the complete data follows an exponential family distribution, the E-step has a closed form expression. The M-step involves a calculation analogous to finding the empirical likelihood for the mean (Owen, 1988); convex duality enables us to avoid maximizing over $p_i^{c0}$, $p_i^{n}$, $i = 1, \ldots, n_0$, and instead we maximize over a single variable. The tractability of both the E- and M-steps makes the EM algorithm with each subject’s compliance class as missing data easy to use for finding $q_1^*, \ldots, q_{n_0}^*$ and hence finding $\arg\max_{\mu^{c0}} L_E(\tilde\pi_c, \hat\mu^{n}, \mu^{c1}, \mu^{c0})$ by (12).
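The EM iteration for the control arm can be sketched as follows. This is one possible reading of the description above, not the authors' program: the uniform starting values, the fixed iteration counts, the bisection solver for the single Lagrange multiplier in the M-step, and the names `weighted_el_probs` and `em_control_arm` are all assumptions made for illustration. The E-step computes each control subject's posterior probability of being a complier; the M-step renormalizes those weights for the complier probabilities (whose mean is unconstrained) and solves a weighted empirical likelihood problem with mean constraint (10) for the never-taker probabilities.

```python
import numpy as np

def weighted_el_probs(y, v, mu, n_bisect=200):
    """Maximize sum_i v_i*log(p_i) subject to sum_i p_i = 1 and sum_i p_i*(y_i - mu) = 0."""
    d = y - mu
    if not (d.min() < 0.0 < d.max()):
        raise ValueError("mu must lie strictly inside the range of y")
    s = v.sum()
    lo, hi = -s / d.max(), -s / d.min()    # open interval on which every s + lam*d_i > 0
    for _ in range(n_bisect):
        lam = 0.5 * (lo + hi)
        g = np.sum(v * d / (s + lam * d))  # strictly decreasing in lam
        if g > 0.0:
            lo = lam
        else:
            hi = lam
    lam = 0.5 * (lo + hi)
    return v / (s + lam * d)               # stationarity condition of the Lagrangian

def em_control_arm(y, pi_c, mu_n, n_iter=500):
    """EM sketch for q* in the control arm with pi_c and mu_n held fixed."""
    y = np.asarray(y, dtype=float)
    n0 = y.size
    p_c = np.full(n0, 1.0 / n0)            # assumed starting values
    p_n = np.full(n0, 1.0 / n0)
    for _ in range(n_iter):
        q = pi_c * p_c + (1.0 - pi_c) * p_n         # constraint (8)
        w = pi_c * p_c / q                          # E-step: pr(complier | outcome i)
        p_c = w / w.sum()                           # M-step, compliers: mean is free
        p_n = weighted_el_probs(y, 1.0 - w, mu_n)   # M-step, never-takers: constraint (10)
    q = pi_c * p_c + (1.0 - pi_c) * p_n
    mu_c0 = (q @ y - (1.0 - pi_c) * mu_n) / pi_c    # expression (12)
    return q, mu_c0

# Two-point illustration in which the mixture restriction binds.
q_star, mu_c0_hat = em_control_arm([0.0, 1.0], pi_c=0.2, mu_n=0.1)
```

In the two-point illustration, the never-taker probabilities are forced to (0.9, 0.1) by the mean constraint, so the first $q_i$ cannot fall below 0.72; the iteration settles at (0.72, 0.28) rather than the unconstrained (0.5, 0.5).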

Note that, given $q_i$, $i = 1, \ldots, n_0$, there is typically more than one set of $p_i^{c0}$, $p_i^{n}$, $i = 1, \ldots, n_0$, that satisfies the constraints (8)-(10). Numerical experiments, not shown here, verify that although the EM algorithm converges to different values of $p_i^{c0}$ and $p_i^{n}$ for different sets of starting values for the $p_i^{c0}$ and $p_i^{n}$, the corresponding $q_i$’s to which the EM algorithm converges are the same, as Lemma 1 shows more formally.

Lemma 1. Regardless of the starting values for the $p_i^{c0}$, $p_i^{n}$, $i = 1, \ldots, n_0$, the sequence of estimates of $q_i$ from the EM algorithm converges to the global maximum of the likelihood $\prod_{i=1}^{n_0} q_i$ subject to the restrictions (6) and (8)-(10) with $\mu^{n} = \hat\mu^{n}$, $\pi_c = \tilde\pi_c$.

The proof of Lemma 1 is outlined in Appendix 2.

In summary, we estimate $\pi_c$, $\mu^{n}$, $\mu^{c1}$, $\mu^{c0}$ as follows; a program is available from the authors.

Step 1. We obtain $\hat\mu^{n}$ as the sample mean of $Y \mid R = 1, A = 0$.

Step 2. We obtain $\hat\mu^{c1}$ as the sample mean of $Y \mid R = 1, A = 1$.

Step 3. For a grid of $\tilde\pi_c$, we find the maximum empirical likelihood estimate of $\mu^{c0}$ given $\pi_c = \tilde\pi_c$, $\mu^{n} = \hat\mu^{n}$, $\mu^{c1} = \hat\mu^{c1}$ using the EM algorithm described above. Then $\hat\pi_c = \arg\max_{\tilde\pi_c} \max_{\mu^{c0}} L_E(\tilde\pi_c, \hat\mu^{n}, \hat\mu^{c1}, \mu^{c0})$ and $\hat\mu^{c0} = \arg\max_{\mu^{c0}} L_E(\hat\pi_c, \hat\mu^{n}, \hat\mu^{c1}, \mu^{c0})$.

Step 4. Our approximate maximum empirical likelihood estimate of the complier average causal effect is $\widehat{\mathrm{CACE}}_A = \hat\mu^{c1} - \hat\mu^{c0}$.

3.3 Estimation in trials in which the assigned to control group can access the treatment

Our method illustrated in §3.1 can be directly applied to more general trials under Assumptions 1-6 in which the control group can access the treatment. For such trials, we have one more compliance class, the always-takers, in addition to the compliers and never-takers; see Table 3. We denote the proportion of always-takers and the mean of always-takers’ potential outcomes by $\pi_a$ and $\mu^{a}$ respectively. The empirical likelihood $L_E$ of the parameters $(\pi_c, \pi_a, \mu^{n}, \mu^{a}, \mu^{c1}, \mu^{c0})$ is the maximum likelihood for multinomial distributions $(q_1, \ldots, q_{n_0})$ on $(Y_1, A_1), \ldots, (Y_{n_0}, A_{n_0})$ and $(q_{n_0+1}, \ldots, q_N)$ on $(Y_{n_0+1}, A_{n_0+1}), \ldots, (Y_N, A_N)$ that are consistent with $(\pi_c, \pi_a, \mu^{n}, \mu^{a}, \mu^{c1}, \mu^{c0})$ and the restrictions on the parameter space specified by Assumptions 1-6, namely

$$L_E(\pi_c, \pi_a, \mu^{n}, \mu^{a}, \mu^{c1}, \mu^{c0}) = \max \left( \prod_{i=1}^{n_0} q_i \right) \left( \prod_{i=n_0+1}^{N} q_i \right)$$

subject to

(i) $\sum_{i=1}^{n_0} q_i = 1$, $\sum_{i=n_0+1}^{N} q_i = 1$;

(ii) $q_i \ge 0$, $i = 1, \ldots, N$;

(iii) $\sum_{i: R_i = 1, A_i = 0} q_i = 1 - \pi_a - \pi_c$;

(iv) $\sum_{i: R_i = 0, A_i = 1} q_i = \pi_a$;

(v) $\sum_{i: R_i = 1, A_i = 0} q_i Y_i = \mu^{n}(1 - \pi_a - \pi_c)$;

(vi) $\sum_{i: R_i = 0, A_i = 1} q_i Y_i = \mu^{a}\pi_a$;

(vii) there exist $p_i^{c0}$, $p_i^{n}$ for the $i$ with $R_i = 0$, $A_i = 0$ such that (viia) $\{\pi_c/(1 - \pi_a)\} p_i^{c0} + \{(1 - \pi_a - \pi_c)/(1 - \pi_a)\} p_i^{n} = q_i$, (viib) $\sum p_i^{c0} = \sum p_i^{n} = 1$, (viic) $p_i^{c0}, p_i^{n} \ge 0$, (viid) $\sum p_i^{n}(Y_i - \mu^{n}) = 0$ and (viie) $\sum p_i^{c0}(Y_i - \mu^{c0}) = 0$; and

(viii) there exist $p_i^{c1}$, $p_i^{a}$ for the $i$ with $R_i = 1$, $A_i = 1$ such that (viiia) $\{\pi_c/(\pi_c + \pi_a)\} p_i^{c1} + \{\pi_a/(\pi_c + \pi_a)\} p_i^{a} = q_i$, (viiib) $\sum p_i^{c1} = \sum p_i^{a} = 1$, (viiic) $p_i^{c1}, p_i^{a} \ge 0$, (viiid) $\sum p_i^{a}(Y_i - \mu^{a}) = 0$ and (viiie) $\sum p_i^{c1}(Y_i - \mu^{c1}) = 0$.

As with the single consent design, rather than finding the maximum empirical likelihood estimate of $(\pi_c, \pi_a, \mu^{n}, \mu^{a}, \mu^{c1}, \mu^{c0})$, we find the approximate maximum empirical likelihood estimate by setting $\mu^{n}$ equal to the sample mean of $Y \mid R = 1, A = 0$, corresponding to the known never-takers in the sample, and $\mu^{a}$ equal to the sample mean of $Y \mid R = 0, A = 1$, corresponding to the known always-takers in the sample, and then maximizing the empirical likelihood over $(\pi_c, \pi_a, \mu^{c1}, \mu^{c0})$. This can be done by using the EM algorithm for estimating $\mu^{c0}$ in the $Y \mid R = 0, A = 0$ sample as in §3.2, and an analogous EM algorithm for estimating $\mu^{c1}$ in the $Y \mid R = 1, A = 1$ sample. The details are provided in a technical report available from the authors.

4. Simulation Studies

We compare our approximate maximum empirical likelihood estimator with the standard instrumental variable estimator and Imbens and Rubin’s parametric estimator, considering single consent design trials as discussed in §3.1. We set $\pi_c = 0.5$ and compare the three estimators under different outcome distributions and under sample sizes of $N = 100$ and $N = 500$ with $\mathrm{pr}(R = 1) = 0.5$. The outcome distributions we consider are Normal, gamma and lognormal distributions. For each outcome distribution, we set $\mu^{c1} = 2$, $\mu^{c0} = 1$, so that the CACE $= \mu^{c1} - \mu^{c0} = 1$. The variances are fixed at 1.

Before explaining our settings for $\mu^{n}$, we discuss the impact of the distance between $\mu^{n}$ and $\mu^{c0}$ on the efficiency of the approximate maximum empirical likelihood estimator relative to the standard instrumental variable estimator. The distance between $\mu^{n}$ and $\mu^{c0}$ is a measure of the separation between the distributions of the compliers and never-takers under the control. To see the impact of the distance between $\mu^{n}$ and $\mu^{c0}$, we consider under what conditions the approximate maximum empirical likelihood and standard instrumental variable estimators are equal. The standard instrumental variable estimator estimates the complier average causal effect by substituting method of moments estimates into (3). The approximate maximum empirical likelihood estimator estimates the complier average causal effect by substituting maximum empirical likelihood estimates into (3) conditional on $E(Y \mid R = 1, A = 0)$ being set equal to its method of moments estimate. The approximate maximum empirical likelihood estimator equals the standard instrumental variable estimator if the method of moments estimates of $\mathrm{pr}(A = 1 \mid R = 1)$ and $E(Y \mid R = 1, A = 0)$, denoted by $\widehat{\mathrm{pr}}(A = 1 \mid R = 1)$ and $\hat\mu^{n}$ respectively, satisfy (8)-(10) with $q_i = 1/n_0$ for $i = 1, \ldots, n_0$. This will happen if and only if $\hat\mu^{n}$ is between the trimmed mean of $Y \mid R = 0$ over the 0 to $\{1 - \widehat{\mathrm{pr}}(A = 1 \mid R = 1)\}$ quantiles and the trimmed mean of $Y \mid R = 0$ over the $\widehat{\mathrm{pr}}(A = 1 \mid R = 1)$ to 1 quantiles. It is more likely that $\hat\mu^{n}$ will escape these bounds when the distributions of the compliers and the never-takers are more separated. When $\hat\mu^{n}$ does escape these bounds, we expect that the approximate maximum empirical likelihood estimator will provide a better estimate than the standard instrumental variable estimator because the approximate maximum empirical likelihood estimator is taking better account of the mixture structure of outcomes implied by Assumptions 1-6. Thus, we expect that the approximate maximum empirical likelihood estimator will gain more efficiency over the standard instrumental variable estimator when the distance between $\mu^{n}$ and $\mu^{c0}$ is greater, because then the distributions of the compliers and never-takers under the control are more separated.
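The trimmed-mean condition above is easy to check numerically. The sketch below computes the two bounds for a control-arm sample, using fractional weights when the never-taker share of the sample is not a whole number of observations; the function name `trimmed_mean_bounds` and this handling of fractional counts are assumptions made for illustration. It is applied to the control arm of the binary sample from §3.1 (13 ones and 7 zeros).

```python
import numpy as np

def trimmed_mean_bounds(y_control, p_hat):
    """Means of the lowest and highest (1 - p_hat) fraction of the control outcomes."""
    y = np.sort(np.asarray(y_control, dtype=float))
    n0 = y.size
    k = (1.0 - p_hat) * n0                           # never-taker share, in observations
    w_low = np.clip(k - np.arange(n0), 0.0, 1.0)     # weight 1 on the lowest points, fractional at the cut
    lower = np.sum(w_low * y) / k                    # trimmed mean over the 0 to (1 - p_hat) quantiles
    upper = np.sum(w_low[::-1] * y) / k              # trimmed mean over the p_hat to 1 quantiles
    return lower, upper

y_control = np.array([1] * 13 + [0] * 7)   # control arm of the sample in Section 3.1
lower, upper = trimmed_mean_bounds(y_control, p_hat=0.5)
mu_n_hat = 0.2                             # sample mean of Y | R = 1, A = 0 in that sample
estimators_coincide = lower <= mu_n_hat <= upper
```

For that sample the bounds are 0.3 and 1.0, so $\hat\mu^{n} = 0.2$ falls outside them and the two estimators differ, in line with the violation of Restrictions 1-3 noted in §3.1.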

To see the effect of the separation between the compliers and never-takers under the control, we chose two sets of values for $\mu^{c0}$ and $\mu^{n}$ such that the distributions of the compliers and never-takers under the control are well separated under one set of values but are close to each other under another set of values. In setting $N_1$, the distributions of $Y_i^{1,1} \mid C_i = 1$, $Y_i^{0,0} \mid C_i = 1$ and $Y_i^{0,0} = Y_i^{1,0} \mid C_i = 0$ are Normal with (mean, variance) combinations (2, 1), (1, 1) and (3, 1), respectively. In settings $G_1$ and $LN_1$, the distributions are gamma and lognormal, respectively. Settings $N_2$, $G_2$ and $LN_2$ differ only in that the (mean, variance) combination of $Y_i^{0,0} = Y_i^{1,0} \mid C_i = 0$ is (1.5, 1).
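A data-generating sketch for the first Normal setting may help make the design concrete. The code below is an assumption-laden illustration, not the authors' simulation program: it draws the assignment independently with probability 0.5 for each subject (rather than fixing the arm sizes), and the function name and seed are hypothetical.

```python
import numpy as np

def simulate_single_consent_normal(n, pi_c=0.5, mu_c1=2.0, mu_c0=1.0, mu_n=3.0, seed=0):
    """Draw (Y, A, R) for a single consent trial with unit-variance Normal outcomes."""
    rng = np.random.default_rng(seed)
    r = rng.binomial(1, 0.5, size=n)           # pr(R = 1) = 0.5
    complier = rng.binomial(1, pi_c, size=n)   # latent compliance class
    a = r * complier                           # only compliers assigned to treatment receive it
    mean = np.where(complier == 1, np.where(r == 1, mu_c1, mu_c0), mu_n)
    y = rng.normal(loc=mean, scale=1.0)        # never-takers have mean mu_n in both arms
    return y, a, r

y, a, r = simulate_single_consent_normal(200_000)
# Standard instrumental variable estimate on one large simulated sample.
iv_estimate = (y[r == 1].mean() - y[r == 0].mean()) / (a[r == 1].mean() - a[r == 0].mean())
```

With a sample this large the instrumental variable estimate should sit close to the true complier average causal effect of 1; the finite-sample comparisons in Table 4 use much smaller samples, where the estimators differ more.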

For each setting, we present summary results over 1000 replications with sample sizes of 100 and 500. Table 4 shows the bias and mean squared error from the three different estimators for the complier average causal effect for the different settings considered. Table 4 shows the following features.

First, the parametric estimator based on the normality assumption is unbiased and more efficient than the standard instrumental variable and approximate maximum empirical likelihood estimators under the true normal distributions, but shows biases of 23%-40% and is less efficient than the other two estimators under nonnormal distributions.

Secondly, both the approximate maximum empirical likelihood and standard instrumental variable estimators have low bias for all settings considered. The approximate maximum empirical likelihood estimator has bias below 5% when the distributions of the never-takers and the compliers under the control are close to each other. When the distributions of the never-takers and compliers under the control are well separated and the sample size is 100, the approximate maximum empirical likelihood estimator has a bias of about 10% but this bias drops to below 5% when the sample size increases to 500.

Thirdly, the approximate maximum empirical likelihood estimator is more efficient than the standard instrumental variable estimator for all settings considered. The gain in mean squared error is more substantial when the distributions of never-takers and compliers under the control are well separated, as expected from the discussion above. The gain in mean squared error is as large as 56%. The gain is generally smaller with a sample size of 500 rather than 100. In additional simulations not presented in Table 4, we found that there is still a gain in mean squared error with the approximate maximum empirical likelihood estimator over the standard instrumental variable estimator with a sample size of 1000.

We also did a simulation study for the setting of §3.3 in which the assigned to control group can access the treatment. The results are not presented, but are available from the authors. The pattern of results is similar to that for the single consent design trials.

5. Asymptotic Properties

In §4, we showed that the approximate maximum empirical likelihood estimator gains over the standard instrumental variable estimator in a range of finite-sample situations, with larger gains when the compliers' and never-takers' outcome distributions under the control are more separated. The standard instrumental variable estimator is based on estimating the distribution of (Y, A, R) by the empirical distribution of (Y, A, R); the method of moments estimators on which the standard instrumental variable estimator is based are the moments of the empirical distribution. The source of the approximate maximum empirical likelihood estimator's gain over the standard instrumental variable estimator is that the empirical distribution of (Y, A, R) might not satisfy the restrictions given by Assumptions 1-6. The approximate maximum empirical likelihood estimator takes these restrictions into account to provide a better estimate of the distribution of (Y, A, R) than the empirical distribution. However, unless the distribution of (Y, A, R) is ‘at the boundary’ of the restrictions given by Assumptions 1-6, the empirical distribution of (Y, A, R) should satisfy the restrictions with probability converging to 1 as the sample size N → ∞. Consequently, the approximate maximum empirical likelihood estimator will be asymptotically equivalent to the standard instrumental variable estimator. We establish this result in Theorem 1 under condition (13) below. Condition (13) specifies that the distribution of (Y, A, R) is not ‘at the boundary’ of the restriction that the distribution of Y | R = 0 is a mixture of the compliers and never-takers under the control, in the sense that the distributions of the compliers and never-takers under the control overlap at least minimally. In condition (13), we let $F^{c0}$ and $F^{n0}$ denote the cumulative distribution functions of potential outcomes under the control for compliers and never-takers respectively, and we let $G = \pi_c F^{c0} + (1 - \pi_c) F^{n0}$ denote the cumulative distribution function of potential outcomes under the control. The condition is

$$\frac{1}{1-\pi_c}\int_{-\infty}^{G^{-1}(1-\pi_c)} z\,dG(z) < \int_{-\infty}^{\infty} z\,dF^{n0}(z) = \mu^n, \qquad \mu^n = \int_{-\infty}^{\infty} z\,dF^{n0}(z) < \frac{1}{1-\pi_c}\int_{G^{-1}(\pi_c)}^{\infty} z\,dG(z). \tag{13}$$

Condition (13) says that the trimmed mean of the $\pi_n$-smallest part of the mixture of never-takers and compliers is strictly less than the mean of the never-takers and that the trimmed mean of the $\pi_n$-largest part of the mixture of never-takers and compliers is strictly greater than the mean of the never-takers, where $\pi_n = 1 - \pi_c$ is the proportion of never-takers.


Theorem 1. Consider a single consent design. Suppose (i) (13) holds, (ii) $0 < \pi_c < 1$ and (iii) $n_0/N = d$, $0 < d < 1$. Then $\mathrm{pr}(\widehat{\mathrm{CACE}}_A = \widehat{\mathrm{CACE}}_S) \to 1$ as $N \to \infty$.

The proof of Theorem 1 is in Appendix 2.

In spite of the asymptotic equivalence result in Theorem 1, the simulation study in §4 showed that the approximate maximum empirical likelihood estimator can provide substantial gains in practical situations. The gains provided by the approximate maximum empirical likelihood estimator are analogous to the gains provided in estimating a population mean in the knowledge of restrictions on the range of the mean. For example, consider estimating the mean $\mu$ of a normal distribution $N(\mu, \sigma^2)$ based on a random sample $Y_1, \ldots, Y_N$ when it is known that $\mu$ is less than or equal to an upper bound $\mu_U$. If $\mu$ is reasonably close to $\mu_U$, then the maximum likelihood estimate will gain substantially over the sample mean, which is the maximum likelihood estimate if $\mu$ is unrestricted, for many sample sizes. However, as long as $\mu$ is less than $\mu_U$ by any amount, the estimators are equivalent asymptotically because, for large enough $N$, the sample mean is less than $\mu_U$ with high probability.
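The normal-mean analogy can be illustrated numerically. The sketch below uses the fact that, when $\mu \le \mu_U$ is known, the restricted maximum likelihood estimate is $\min(\bar{Y}, \mu_U)$. The parameter values (`mu=0`, `mu_upper=0.1`, `sigma=1`) and the function name `compare_mse` are illustrative assumptions, not values from the paper; sample means are drawn directly from their exact $N(\mu, \sigma^2/N)$ distribution to keep the simulation light.

```python
import numpy as np


def compare_mse(mu, mu_upper, sigma, n, reps=20000, seed=0):
    """Monte Carlo MSE of the sample mean and of the restricted MLE min(ybar, mu_upper)."""
    rng = np.random.default_rng(seed)
    # The sample mean of n iid N(mu, sigma^2) draws is exactly N(mu, sigma^2 / n).
    ybar = rng.normal(mu, sigma / np.sqrt(n), size=reps)
    restricted = np.minimum(ybar, mu_upper)
    mse_mean = float(np.mean((ybar - mu) ** 2))
    mse_restricted = float(np.mean((restricted - mu) ** 2))
    share_equal = float(np.mean(ybar <= mu_upper))  # how often the two estimates coincide
    return mse_mean, mse_restricted, share_equal


for n in (10, 100, 1000, 10000):
    mse_mean, mse_restricted, share_equal = compare_mse(0.0, 0.1, 1.0, n)
    print(n, mse_mean, mse_restricted, share_equal)
```

The restricted estimate can never be further from $\mu$ than the sample mean, so its mean squared error is never larger; as $N$ grows, the share of replications in which the two estimates coincide approaches 1, which mirrors the asymptotic equivalence in Theorem 1.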

6. Application to Depression Study

In this section, we apply our method to analyze a randomized trial of an intervention to improve treatment of depression among depressed elderly patients in primary care practices (Bruce et al., 2004). The encouragement intervention was that a depression care specialist collaborated with the patient’s primary care physician to facilitate adherence to a depression treatment strategy and provide education and assessment to the patient. The control was usual care. The study involved 539 depressed patients in 20 primary care practices at three sites followed for six visits: baseline, 4, 8, 12, 18 and 24 months. Each practice was randomized to either the intervention (treatment) or usual care (control). For illustrative purposes, we ignore the fact that the trial was a group randomized trial and treat it as a completely randomized trial; for analyses that account for the group randomization, see Small et al. (2008). Compliance with the intervention was categorized as a binary variable, whether or not a patient had seen a depression care specialist in the prior four months of follow-up. Patients in practices randomized to the usual-care group did not have access to the depression specialist, so there are only compliers and never-takers in this trial. To see how the estimators behave under different situations, we analyze two outcomes. One is the patients’ Hamilton depression score measured at 4 months, which takes integer values between 0 and 50. A lower value of this outcome means less depression. The other outcome is the composite anti-depression score among males at one site measured at 12 months. This is an integer-valued score from 0 to 4 that indicates how much the patient is being treated for depression. A score of 3 or 4 is considered adequate treatment for depression, while a score of 1 or 2 means the patient is being treated in some way, but not at what is considered an adequate dose.

Table 5 shows the three estimates of the complier average causal effect for the Hamilton and composite anti-depression scores described above. The percentile bootstrap with 1000 resamples was used to compute approximate 95% confidence intervals. We first consider the Hamilton score at 4 months; see the second column of Table 5. The scores were observed for 517 subjects, and 92.7% of those assigned to treatment complied with the treatment. All the complier average causal effect estimates are negative and the 95% confidence intervals do not include zero, indicating that the intervention has a significant beneficial effect on depression compared to usual care. Comparing the three estimation methods, we first note from the histograms of the Hamilton outcome in Fig. 2 (a)-(c) that the Hamilton scores for the never-takers and compliers under the treatment are far from normally distributed, suggesting that the parametric estimator based on the normality assumption is probably a biased estimator. The standard instrumental variable estimator and the approximate maximum empirical likelihood estimator provide very similar point estimates and similar 95% confidence intervals; see below for more explanation of this similarity. We now consider the outcome of the composite anti-depression scores among males at the site at 12 months, given in the third column of Table 5. The scores were observed for 37 subjects, and 75% of those assigned to treatment complied with the treatment. The approximate maximum empirical likelihood and standard instrumental variable complier average causal effect estimates show a significant beneficial effect of the intervention on treating depression, while the parametric normal estimate does not show a significant effect. As for the Hamilton score, the histograms of the composite anti-depression outcomes in Fig. 2 (d)-(f) show that the composite anti-depression scores from the never-takers and compliers under the control are far from normally distributed, suggesting that the parametric estimator based on the normality assumption is a biased estimator. Unlike for the Hamilton score, for the complier average causal effect of the intervention on the composite anti-depression score, the approximate maximum empirical likelihood estimate has a substantially narrower 95% confidence interval than the standard instrumental variable estimate.
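The interval construction described above can be sketched for the standard instrumental variable estimate. The code assumes the usual ratio (Wald) form of that estimator, the difference in mean outcomes between assignment arms divided by the difference in treatment receipt rates, and assumes subjects are resampled with replacement as a single pool; the paper does not state its resampling scheme. The data are synthetic, and `simulate_trial`, `wald_cace` and `percentile_ci` are hypothetical names introduced for illustration.

```python
import numpy as np


def wald_cace(y, a, r):
    """Ratio (Wald) estimate: ITT effect on outcome / ITT effect on treatment receipt."""
    y, a, r = (np.asarray(v, dtype=float) for v in (y, a, r))
    itt_y = y[r == 1].mean() - y[r == 0].mean()
    itt_a = a[r == 1].mean() - a[r == 0].mean()
    return itt_y / itt_a


def percentile_ci(y, a, r, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap interval, resampling subjects with replacement (assumed scheme)."""
    rng = np.random.default_rng(seed)
    y, a, r = (np.asarray(v, dtype=float) for v in (y, a, r))
    n = y.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = wald_cace(y[idx], a[idx], r[idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(stats, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(lo), float(hi)


def simulate_trial(n=400, effect=-2.5, pi_c=0.8, seed=1):
    """Synthetic single consent design: controls cannot receive the treatment."""
    rng = np.random.default_rng(seed)
    r = rng.integers(0, 2, size=n)        # random assignment
    complier = rng.random(n) < pi_c       # latent compliance class
    a = r * complier                      # treatment received only if assigned and complier
    y = 12.0 + effect * a + rng.normal(0.0, 4.0, size=n)
    return y, a, r


y, a, r = simulate_trial()
print(wald_cace(y, a, r), percentile_ci(y, a, r))
```

The same resampling loop could wrap any of the three estimators; only the statistic computed on each resample changes.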

The greater gain in efficiency of the approximate maximum empirical likelihood estimate compared to the standard instrumental variable estimate for the composite anti-depression study rather than the Hamilton study is related to three factors. First, the sample size in the R = 0 group is smaller for the composite anti-depression study, making it more likely that the empirical distribution of (Y, A, R) will deviate from the restrictions implied by Assumptions 1-6. Secondly, the compliance rate among the subjects assigned to treatment is higher for the Hamilton study, 93%, than the composite anti-depression study, 75%, providing less scope in the Hamilton study for the extra information about Assumptions 1-6 used by the approximate maximum empirical likelihood estimator to have an impact. Thirdly, the separation between the never-takers’ and compliers’ outcome distributions in the control group is greater for the composite anti-depression score than for the Hamilton score; if we use the estimates of $\mu^n$ and $\mu^{c0}$ obtained by substituting method of moments estimates into the population expressions for these quantities in (3), the estimated absolute standardized difference between the never-takers’ and compliers’ means in the control group is 2.34 for the composite anti-depression score compared to 0.72 for the Hamilton score. As we have shown in our simulation studies, the approximate maximum empirical likelihood estimator will have a larger gain in efficiency over the standard instrumental variable estimator when the distributions of the never-takers and compliers in the control group are more separated.

7. Discussion

Our method can be extended to observational studies in which a variable R that encourages (R = 1) or does not encourage (R = 0) a subject to take the treatment is not randomly assigned but is ‘as good as randomly assigned’, that is, ignorable, conditional on some covariates; such studies are discussed in Abadie (2003) and examples are given in Table 1 of Angrist & Krueger (2001). Suppose we replace Assumption 2 with Assumption 2′ that the encouragement variable R is independent of $Y^{1,1}, Y^{1,0}, Y^{0,1}, Y^{0,0}, A^0, A^1$ conditional on a subject’s covariate vector X and that the encouragement variables of different subjects are independent. Also, suppose we expand Assumption 3 to Assumption 3′ that $X_i, Y_i^{1,1}, Y_i^{1,0}, Y_i^{0,1}, Y_i^{0,0}, A_i^0, A_i^1$ are independent and identically distributed draws from a superpopulation and expand Assumption 4 to condition on covariates, i.e., let Assumption 4′ be that $E(Y^{r,a} \mid X) = E(Y^{r',a} \mid X)$ for all $r, r', a, X$. Furthermore, for a single consent design, suppose we consider linear models for the expected potential outcomes in a compliance class given the covariates and a logistic model for compliance given the covariates, i.e.,

$$E(Y^{1,1} \mid C = 1, X) = X'\beta^{c1}, \quad E(Y^{0,0} \mid C = 1, X) = X'\beta^{c0}, \quad E(Y^{1,0} \mid C = 0, X) = E(Y^{0,0} \mid C = 0, X) = X'\beta^{n}$$

and $\mathrm{pr}(C = 1 \mid X) = \mathrm{expit}(X'\alpha)$, where $\mathrm{expit}(z) = e^z/(1 + e^z)$. We include an intercept in the covariate vector X and let p denote the dimension of X. Under this model, the complier average causal effect for compliers with covariate vector X is $X'\beta^{c1} - X'\beta^{c0}$. Under Assumptions 1, 2′, 3′, 4′, 5 and 6 and the above models for the outcomes and compliance probabilities, we have that the empirical likelihood of $\alpha$, $\beta^{c1}$, $\beta^{c0}$ and $\beta^{n}$ is

$$L_E(\alpha, \beta^{n}, \beta^{c1}, \beta^{c0}) = \max_{q_1, \ldots, q_N} \prod_{i=1}^{N} q_i$$

subject to

(i) $\sum_{i=1}^{n_0} q_i = 1$, $\sum_{i=n_0+1}^{N} q_i = 1$;

(ii) $q_i \ge 0$, $i = 1, \ldots, N$;

(iii) $\sum_{i=n_0+1}^{N} q_i X_{ij}\{A_i - \mathrm{expit}(X_i'\alpha)\} = 0$, $j = 1, \ldots, p$;

(iv) $\sum_{i=n_0+1}^{N} q_i A_i X_{ij}(Y_i - X_i'\beta^{c1}) = 0$, $j = 1, \ldots, p$;

(v) $\sum_{i=n_0+1}^{N} q_i (1 - A_i) X_{ij}(Y_i - X_i'\beta^{n}) = 0$, $j = 1, \ldots, p$;

(vi) there exist $t_i^{c0}, t_i^{n}$, $i = 1, \ldots, n_0$, such that (via) $t_i^{c0} + t_i^{n} = q_i$; (vib) $t_i^{c0}, t_i^{n} \ge 0$; (vic) $\sum_{i=1}^{n_0} t_i^{c0} + \sum_{i=1}^{n_0} t_i^{n} = 1$; (vid) $\sum_{i=1}^{n_0} t_i^{c0} X_{ij}\{1 - \mathrm{expit}(X_i'\alpha)\} + \sum_{i=1}^{n_0} t_i^{n} X_{ij}\{-\mathrm{expit}(X_i'\alpha)\} = 0$, $j = 1, \ldots, p$; (vie) $\sum_{i=1}^{n_0} t_i^{n} X_{ij}(Y_i - X_i'\beta^{n}) = 0$, $j = 1, \ldots, p$; and (vif) $\sum_{i=1}^{n_0} t_i^{c0} X_{ij}(Y_i - X_i'\beta^{c0}) = 0$, $j = 1, \ldots, p$.

Here $t_i^{c0}$ and $t_i^{n}$ represent the population probabilities that a subject assigned to the control has the same outcome and covariates as subject i and is a complier or a never-taker, respectively. The above expression for the empirical likelihood builds on Owen’s (2001, Ch. 4) discussion of empirical likelihood for regression models. As in our method of §3, we can compute the approximate maximum empirical likelihood estimate by estimating $\beta^{n}$ using the R = 1, A = 0 sample and maximizing the empirical likelihood over $\alpha$, $\beta^{c1}$ and $\beta^{c0}$ given $\beta^{n} = \hat\beta^{n}$.
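A minimal sketch of the first stage of this computation follows. It assumes that $\beta^{n}$ is estimated by ordinary least squares in the R = 1, A = 0 sample, which is what constraint (v) reduces to when the $q_i$ are equal; the text does not name the estimator, so this is one natural reading rather than the authors' stated procedure. The names `fit_beta_n` and `cace_given_x` are hypothetical.

```python
import numpy as np


def expit(z):
    """Logistic function expit(z) = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(z, dtype=float)))


def fit_beta_n(x, y, a, r):
    """Least-squares estimate of beta^n from subjects with R = 1 and A = 0 (assumed estimator)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mask = (np.asarray(r) == 1) & (np.asarray(a) == 0)
    beta_n, *_ = np.linalg.lstsq(x[mask], y[mask], rcond=None)
    return beta_n


def cace_given_x(x_row, beta_c1, beta_c0):
    """Covariate-specific complier average causal effect X'beta^{c1} - X'beta^{c0}."""
    diff = np.asarray(beta_c1, dtype=float) - np.asarray(beta_c0, dtype=float)
    return float(np.dot(np.asarray(x_row, dtype=float), diff))


# Tiny illustrative data set: first column of x is the intercept.
x = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]])
r = np.array([1, 1, 1, 1, 0, 0])
a = np.array([0, 0, 0, 1, 0, 0])
y = x @ np.array([1.0, 2.0])  # noise-free outcomes, so least squares recovers (1, 2)
print(fit_beta_n(x, y, a, r))
print(cace_given_x([1.0, 2.0], [3.0, 1.0], [1.0, 0.5]))
```

The remaining parameters $\alpha$, $\beta^{c1}$ and $\beta^{c0}$ would then be obtained by maximizing the empirical likelihood with $\beta^{n}$ held at this estimate, a step not sketched here.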

When deriving the approximate maximum empirical likelihood estimator, we have assumed the weak exclusion restriction that the mean outcome of the never-takers, and likewise of the always-takers, is the same under assignment to treatment and control, rather than the strong exclusion restriction that the entire outcome distribution of the never-takers, and likewise of the always-takers, is the same under assignment to treatment and control. In most situations in which the weak exclusion restriction is plausible, we think that the strong exclusion restriction will also be plausible. We are currently adapting our approach to situations in which the strong exclusion restriction is plausible by enabling the empirical likelihood approach to use more equality constraints than just equality of means when relating the never-takers’ and always-takers’ outcome distributions under R = 0 and R = 1.

ACKNOWLEDGEMENT

We thank the associate editor, a referee and Professor D.M. Titterington for their helpful comments and suggestions.

APPENDIX 1

Details of the EM algorithm

Re-expressing the observed data likelihood $\prod_{i=1}^{n_0} q_i$ and the parameter restrictions (6) and (8)-(10) in terms of $p_i^{c0}$, $p_i^{n}$, we have that the observed data likelihood, with $\pi_c = \tilde\pi_c$, $\mu^{n} = \hat\mu^{n}$, is $\prod_{i=1}^{n_0}\{\tilde\pi_c p_i^{c0} + (1 - \tilde\pi_c) p_i^{n}\}$, with parameter restrictions

$$\sum_{i=1}^{n_0} p_i^{c0} = \sum_{i=1}^{n_0} p_i^{n} = 1, \quad p_i^{c0} \ge 0, \ p_i^{n} \ge 0, \ i = 1, \ldots, n_0, \quad \sum_{i=1}^{n_0} p_i^{n}(Y_i - \hat\mu^{n}) = 0, \tag{A1}$$

where $q_i = \tilde\pi_c p_i^{c0} + (1 - \tilde\pi_c) p_i^{n}$ and $\mu^{c0} = \sum_{i=1}^{n_0} p_i^{c0} Y_i$. Note that, if $\hat\mu^{n}$ is such that there is no $p_i^{c0}$, $p_i^{n}$ that satisfies (A1), then our approximate maximum empirical likelihood estimator does not exist; in this case we can modify the approximate maximum empirical likelihood estimator to use as $\hat\mu^{n}$ the closest point to $\sum_{i: R_i = 1, A_i = 0} Y_i / \#\{R_i = 1, A_i = 0\}$, the usual estimate of $\mu^{n}$ for the approximate maximum empirical likelihood estimator, such that there exist $p_i^{c0}$, $p_i^{n}$, $i = 1, \ldots, n_0$, that satisfy (A1). If we view each subject’s compliance class as missing data, the complete data likelihood is $\prod_{i: R_i = 0, C_i = 1} p_i^{c0} \prod_{i: R_i = 0, C_i = 0} p_i^{n}$.

E-step. The expectation of the complete data log-likelihood conditional on the observed data and the parameter estimates $p_i^{c0(k-1)}$ and $p_i^{n(k-1)}$ at the (k − 1)th step is

$$Q^{(k)} = E\Bigl(\sum_{i=1}^{n_0}\bigl[C_i(\log p_i^{c0} + \log\tilde\pi_c) + (1 - C_i)\{\log p_i^{n} + \log(1 - \tilde\pi_c)\}\bigr] \Bigm| Y_1, \ldots, Y_{n_0}, p_i^{c0(k-1)}, p_i^{n(k-1)}\Bigr)$$

$$= \sum_{i=1}^{n_0}\bigl[W_i^{(k)}(\log p_i^{c0} + \log\tilde\pi_c) + (1 - W_i^{(k)})\{\log p_i^{n} + \log(1 - \tilde\pi_c)\}\bigr],$$

where $W_i^{(k)} = \mathrm{pr}^{(k-1)}(C_i = 1 \mid Y_i, R_i = 0, A_i = 0) = \tilde\pi_c p_i^{c0(k-1)} / \{\tilde\pi_c p_i^{c0(k-1)} + (1 - \tilde\pi_c) p_i^{n(k-1)}\}$.
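The E-step weight is a direct application of Bayes' rule and can be sketched in a few lines. The function name `e_step` and the example values are illustrative assumptions; the code also assumes every control-arm subject has positive mixture mass, so the denominator is nonzero.

```python
import numpy as np


def e_step(p_c0, p_n, pi_c):
    """W_i = pi_c * p_c0_i / {pi_c * p_c0_i + (1 - pi_c) * p_n_i} for control-arm subjects."""
    p_c0 = np.asarray(p_c0, dtype=float)
    p_n = np.asarray(p_n, dtype=float)
    complier_mass = pi_c * p_c0
    # Assumes complier_mass + never-taker mass > 0 for every subject.
    return complier_mass / (complier_mass + (1.0 - pi_c) * p_n)


# Illustrative values: four control-arm subjects, complier proportion 0.6.
p_c0 = np.array([0.4, 0.3, 0.2, 0.1])
p_n = np.array([0.1, 0.2, 0.3, 0.4])
print(e_step(p_c0, p_n, 0.6))
```

When the two component distributions put equal mass on a subject, the weight reduces to the prior complier proportion, which is a quick sanity check on the formula.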

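M-step. We wish to maximize $Q^{(k)}$ over $p_i^{c0}$, $p_i^{n}$ subject to (A1) with $\mu^{n} = \hat\mu^{n}$, $\pi_c = \tilde\pi_c$. We do this by conducting a grid search over $\mu^{c0} = \sum_{i=1}^{n_0} p_i^{c0} Y_i$. We now discuss maximizing $Q^{(k)}$ given $\mu^{c0} = \tilde\mu^{c0}$. We will denote the maximizing values of $p_i^{c0}$, $p_i^{n}$ for $\mu^{c0} = \tilde\mu^{c0}$ by $\tilde p_i^{c0}$, $\tilde p_i^{n}$. Note that $\tilde\mu^{c0}$ is a possible value of $\mu^{c0}$ if and only if

$$\Bigl\{p_i^{c0}, \ i = 1, \ldots, n_0 \Bigm| \sum_i p_i^{c0} = 1, \ p_i^{c0} \ge 0, \ \sum_i p_i^{c0}(Y_i - \tilde\mu^{c0}) = 0\Bigr\} \text{ is not empty.} \tag{A2}$$

For such a $\tilde\mu^{c0}$, maximizing $Q^{(k)}$ via Lagrange multipliers subject to (A1) and $\mu^{c0} = \tilde\mu^{c0}$ gives

$$\tilde p_i^{c0} = \frac{W_i^{(k)}}{\bigl(\sum_i W_i^{(k)}\bigr)\{1 + \tilde t^{c}(Y_i - \tilde\mu^{c0})\}}, \qquad \tilde p_i^{n} = \frac{1 - W_i^{(k)}}{\bigl\{\sum_i (1 - W_i^{(k)})\bigr\}\{1 + \tilde t^{n}(Y_i - \hat\mu^{n})\}},$$

where $\tilde t^{c}$ and $\tilde t^{n}$ can be determined in terms of $\tilde\mu^{c0}$ and $\hat\mu^{n}$ by

$$0 = \sum_i \tilde p_i^{c0}(Y_i - \tilde\mu^{c0}) = \sum_i \frac{W_i^{(k)}(Y_i - \tilde\mu^{c0})}{\bigl(\sum_i W_i^{(k)}\bigr)\{1 + \tilde t^{c}(Y_i - \tilde\mu^{c0})\}}, \tag{A3}$$

$$0 = \sum_i \tilde p_i^{n}(Y_i - \hat\mu^{n}) = \sum_i \frac{(1 - W_i^{(k)})(Y_i - \hat\mu^{n})}{\bigl\{\sum_i (1 - W_i^{(k)})\bigr\}\{1 + \tilde t^{n}(Y_i - \hat\mu^{n})\}}. \tag{A4}$$

The right-hand sides of (A3) and (A4) are monotone functions of $\tilde t^{c}$ and $\tilde t^{n}$, respectively, so that a safeguarded zero-finding algorithm, such as Brent’s method, can be used. Starting points for the zero-finding algorithm can be found by noting that, since $0 \le \tilde p_i^{c0}, \tilde p_i^{n} \le 1$,

$$\tilde t^{c} \in \left(\frac{1 - W_i^{(k)}/\sum_i W_i^{(k)}}{\tilde\mu^{c0} - Y_{(n_0)}}, \ \frac{1 - W_i^{(k)}/\sum_i W_i^{(k)}}{\tilde\mu^{c0} - Y_{(1)}}\right), \qquad \tilde t^{n} \in \left(\frac{1 - (1 - W_i^{(k)})/\sum_i (1 - W_i^{(k)})}{\hat\mu^{n} - Y_{(n_0)}}, \ \frac{1 - (1 - W_i^{(k)})/\sum_i (1 - W_i^{(k)})}{\hat\mu^{n} - Y_{(1)}}\right),$$

where $Y_{(n_0)} = \max(Y_i \mid R_i = 0)$ and $Y_{(1)} = \min(Y_i \mid R_i = 0)$. The kth-step parameter estimates $p_i^{c0(k)}$, $p_i^{n(k)}$, $i = 1, \ldots, n_0$, are the $\tilde p_i^{c0}$, $\tilde p_i^{n}$ that correspond to the $\tilde\mu^{c0}$ that maximizes $Q^{(k)}$ over the grid of $\tilde\mu^{c0}$ considered. Note that we can avoid the need to consider the constraint (A2) by replacing the logarithm function with the pseudo-logarithm function of Owen (2001, p. 62) in the definition of $Q^{(k)}$.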
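The inner step of the M-step, solving (A3) or (A4) for the Lagrange multiplier at a fixed target mean, can be sketched with SciPy's `brentq` implementation of Brent's method. The bracket used here is the simpler interval on which every factor $1 + t(Y_i - \mu)$ stays positive, shrunk slightly inward; this is an assumption made for the sketch and differs from the starting interval given in the text. The function name `tilted_weights`, the `shrink` parameter and the example values are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq


def tilted_weights(y, w, mu, shrink=1e-8):
    """Return p_i = w_i / {1 + t (y_i - mu)} with t chosen so that sum_i p_i (y_i - mu) = 0.

    w holds nonnegative E-step weights (W_i or 1 - W_i); it is normalized to sum to 1.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    d = y - mu
    if not (d.min() < 0.0 < d.max()):
        # Mirrors condition (A2): mu must lie strictly inside the range of the outcomes.
        raise ValueError("mu must lie strictly between min(y) and max(y)")
    # Interval on which all 1 + t * d_i > 0, pulled slightly inward (assumed bracket).
    lo = (-1.0 / d.max()) * (1.0 - shrink)
    hi = (-1.0 / d.min()) * (1.0 - shrink)

    def score(t):
        # Right-hand side of (A3)/(A4); strictly decreasing in t on (lo, hi).
        return np.sum(w * d / (1.0 + t * d))

    t = brentq(score, lo, hi)
    return w / (1.0 + t * d), t


# Illustrative use: tilt weights on five outcomes so that their mean equals 1.5.
y = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
w = np.array([0.1, 0.2, 0.3, 0.2, 0.2])
p, t = tilted_weights(y, w, 1.5)
print(p, p.sum(), p @ y, t)
```

In a full M-step this function would be called once with weights $W_i^{(k)}$ for each candidate $\tilde\mu^{c0}$ on the grid and once with weights $1 - W_i^{(k)}$ at $\hat\mu^{n}$, and $Q^{(k)}$ would then be compared across the grid.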

APPENDIX 2

Proofs

Outline proof of Lemma 1. The complete proof is provided in a technical report available from the authors. Here we outline the steps in the proof.

Step 1. We show that maximizing $\prod_{i=1}^{n_0} q_i$ subject to (6) and (8)-(10) with $\mu^{n} = \hat\mu^{n}$, $\pi_c = \tilde\pi_c$ is a convex optimization problem so that there is a unique global maximum.

Step 2. Our problem involves maximization over a constrained parameter space. Nettleton (1999) shows that, under regularity assumptions, the EM algorithm converges to either (a) a stationary point or (b) a boundary point of the constrained parameter space at which the likelihood function can be increased only by moving in a direction outside the parameter space. For an unconstrained parameter space, under regularity assumptions, the EM algorithm converges only to points of type (a) (Wu, 1983). We show that, even though our parameter space is constrained, under regularity assumptions, the EM algorithm converges only to points of type (a) for our problem.

Step 3. We combine the results in Steps 1 and 2 with results about the EM algorithm for unconstrained problems of Wu (1983) and Dempster et al. (1977) to prove the lemma.


Proof of Theorem 1. Let $Z_1, \ldots, Z_{n_0}$ denote the Y | R = 0 sample, and let $\hat\pi_c^{R=1}$ equal the method of moments estimate of $\pi_c$ based on the R = 1 sample, $\hat\pi_c^{R=1} = \#\{R_i = 1, A_i = 1\}/(N - n_0)$. Note that, if there exist $p_i^{c0}$, $p_i^{n}$ that satisfy (i) $\hat\pi_c^{R=1} p_i^{c0} + (1 - \hat\pi_c^{R=1}) p_i^{n} = 1/n_0$, (ii) $\sum_{i=1}^{n_0} p_i^{n}(Z_i - \hat\mu^{n}) = 0$, (iii) $\sum_{i=1}^{n_0} p_i^{c0} = \sum_{i=1}^{n_0} p_i^{n} = 1$ and (iv) $p_i^{c0}, p_i^{n} \ge 0$, then the approximate maximum empirical likelihood estimator equals the standard instrumental variable estimator and the maximizing values of $q_i$ are $q_i = 1/n_0$, $i = 1, \ldots, n_0$. By considering the minimum and maximum values of $\sum_{i=1}^{n_0} p_i^{n} Z_i$ subject to (i), (iii) and (iv) above, we have that there exist $p_i^{c0}$, $p_i^{n}$ that satisfy (i)-(iv) if and only if $\hat\mu^{n} \in [\mu_l(N), \mu_u(N)]$, where

$$\mu_l(N) = \sum_{i=1}^{\lfloor k_{n_0} \rfloor} Z_{(i)} \frac{1}{k_{n_0}} + Z_{(\lfloor k_{n_0} \rfloor + 1)} \frac{k_{n_0} - \lfloor k_{n_0} \rfloor}{k_{n_0}}, \qquad \mu_u(N) = \sum_{i=n_0 - \lfloor k_{n_0} \rfloor + 1}^{n_0} Z_{(i)} \frac{1}{k_{n_0}} + Z_{(n_0 - \lfloor k_{n_0} \rfloor)} \frac{k_{n_0} - \lfloor k_{n_0} \rfloor}{k_{n_0}},$$

$k_{n_0} = n_0(1 - \hat\pi_c^{R=1})$ and $\lfloor k \rfloor$ is the greatest integer less than or equal to k. Let $\tilde\mu_l(N)$ and $\tilde\mu_u(N)$ be the trimmed sample means of $Z_1, \ldots, Z_{n_0}$ trimmed to the $[0, 1 - \pi_c]$ quantiles and $[\pi_c, 1]$ quantiles respectively; that is,

$$\tilde\mu_l(N) = \sum_{i=1}^{\lfloor n_0(1 - \pi_c) \rfloor} Z_{(i)} \frac{1}{n_0(1 - \pi_c)} + Z_{(\lfloor n_0(1 - \pi_c) \rfloor + 1)} \frac{n_0(1 - \pi_c) - \lfloor n_0(1 - \pi_c) \rfloor}{n_0(1 - \pi_c)},$$

with $\tilde\mu_u(N)$ defined analogously from the largest order statistics. Then, letting G denote the cumulative distribution function of the potential outcomes under the control, we have that, as N → ∞, in probability,

$$\tilde\mu_l(N) \to \frac{1}{1 - \pi_c}\int_{-\infty}^{G^{-1}(1 - \pi_c)} z\,dG(z) = \mu_l^{\infty}, \qquad \tilde\mu_u(N) \to \frac{1}{1 - \pi_c}\int_{G^{-1}(\pi_c)}^{\infty} z\,dG(z) = \mu_u^{\infty},$$

by the properties of trimmed means (Shao, 2003, Ch. 5). Now we show that $\mu_l(N) \to \mu_l^{\infty}$ in probability and $\mu_u(N) \to \mu_u^{\infty}$ in probability by showing that $|\mu_l(N) - \tilde\mu_l(N)| \to 0$ in probability and $|\mu_u(N) - \tilde\mu_u(N)| \to 0$ in probability. We have

$$|\mu_l(N) - \tilde\mu_l(N)| \le |s| \max\bigl(|Z_{(\lceil n_0(1 - \pi_c) + n_0 s + 1 \rceil)}|, \ |Z_{(\lfloor n_0(1 - \pi_c) - n_0 s - 1 \rfloor)}|\bigr), \tag{A5}$$

where $s = |(1 - \hat\pi_c^{R=1})^{-1} - (1 - \pi_c)^{-1}|$ and $\lceil k \rceil$ is the least integer greater than or equal to k. The first term on the right-hand side of (A5) converges in probability to 0 as N → ∞ and the second term converges in probability to a number less than or equal to $\max(|G^{-1}(1 - \pi_c + a)|, |G^{-1}(1 - \pi_c - a)|)$ for any number a > 0; for this, note that $n_0 = dN \to \infty$ as N → ∞ since d > 0. This shows that the right-hand side, and hence the left-hand side, of (A5) converges in probability to 0 as N → ∞. Similarly,

$$|\mu_u(N) - \tilde\mu_u(N)| \le |s| \max\bigl(|Z_{(\lceil n_0 \pi_c + n_0 s + 1 \rceil)}|, \ |Z_{(\lfloor n_0 \pi_c - n_0 s - 1 \rfloor)}|\bigr) \to 0 \text{ in probability.}$$

Thus, we conclude that $\mu_l(N) \to \mu_l^{\infty}$ in probability and $\mu_u(N) \to \mu_u^{\infty}$ in probability. By assumption (13) that the distributions of compliers and never-takers overlap, we have that $\mu_l^{\infty} < \mu^{n}$ and $\mu_u^{\infty} > \mu^{n}$. Combining the facts that $\mu_l^{\infty} < \mu^{n} < \mu_u^{\infty}$, $\mu_l(N) \to \mu_l^{\infty}$ in probability and $\mu_u(N) \to \mu_u^{\infty}$ in probability with the fact that $\hat\mu^{n} \to \mu^{n}$ in probability, by the law of large numbers, because $N - n_0 = (1 - d)N \to \infty$ as N → ∞, we conclude that $\mathrm{pr}\{\mu_l(N) < \hat\mu^{n} < \mu_u(N)\} \to 1$ as N → ∞. Thus, $\mathrm{pr}(\widehat{\mathrm{CACE}}_A = \widehat{\mathrm{CACE}}_S) \to 1$ as N → ∞.
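The interval $[\mu_l(N), \mu_u(N)]$ used in the proof is straightforward to compute from the control-arm outcomes and the estimated complier proportion. The sketch below follows the displayed formulas for $\mu_l(N)$ and $\mu_u(N)$; the function name `never_taker_mean_bounds` and the example data are illustrative assumptions, and the code presumes $0 < \hat\pi_c^{R=1} < 1$.

```python
import numpy as np


def never_taker_mean_bounds(z, pi_c_hat):
    """Smallest and largest values of sum_i p_i^n Z_i compatible with q_i = 1/n0.

    z: outcomes in the R = 0 arm; pi_c_hat: estimated complier proportion in (0, 1).
    """
    z = np.sort(np.asarray(z, dtype=float))
    n0 = z.size
    k = n0 * (1.0 - pi_c_hat)   # k_{n0}
    m = int(np.floor(k))        # floor of k_{n0}
    frac = k - m
    lower = z[:m].sum() / k          # mass 1/k on each of the m smallest outcomes
    upper = z[n0 - m:].sum() / k     # mass 1/k on each of the m largest outcomes
    if frac > 0.0:
        lower += z[m] * frac / k             # remaining mass on Z_(m+1)
        upper += z[n0 - m - 1] * frac / k    # remaining mass on Z_(n0-m)
    return float(lower), float(upper)


# Illustrative check: with outcomes 1..10 and pi_c_hat = 0.5 the bounds are the
# means of the lower and upper halves of the sample.
z = np.arange(1.0, 11.0)
lower, upper = never_taker_mean_bounds(z, 0.5)
mu_n_hat = 4.2  # hypothetical never-taker mean estimate from the R = 1, A = 0 sample
print(lower, upper, lower <= mu_n_hat <= upper)
```

When the never-taker mean estimate falls inside the interval, the two estimators coincide by the argument above; when it falls outside, the restrictions bind and the estimates can differ.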

REFERENCES

ABADIE, A. (2003). Semiparametric instrumental variable estimation of treatment response models. J. Economet. 113, 231-63.

ANGRIST, J.D., IMBENS, G.W. & RUBIN, D.B. (1996). Identification of causal effects using instrumental variables. J. Am. Statist. Assoc. 91, 444-55.

ANGRIST, J.D. & KRUEGER, A.B. (2001). Instrumental variables and the search for identification. J. Econ. Persp. 15, 1-17.

BOYD, S. & VANDENBERGHE, L. (2004). Convex Optimization. Cambridge: Cambridge University Press.

BRUCE, M., TEN HAVE, T., REYNOLDS, C., KATZ, I., SCHULBERG, H., MULSANT, B., BROWN, G., MCAVAY, G., PEARSON, J. & ALEXOPOULOS, G. (2004). Reducing suicidal ideation and depressive symptoms in depressed older primary care patients: a randomized controlled trial. J. Am. Med. Assoc. 291, 1081-91.

CHENG, J. & SMALL, D. (2006). Bounds on causal effects in three-arm trials with noncompliance. J. R. Statist. Soc. B 68, 815-36.

DEMPSTER, A.P., LAIRD, N.M. & RUBIN, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with Discussion). J. R. Statist. Soc. B 39, 1-38.

GRENANDER, U. (1981). Abstract Inference. New York: Wiley.

HALL, P. & TITTERINGTON, D.M. (1984). Efficient nonparametric estimation of mixture proportions. J. R. Statist. Soc. B 46, 465-73.

HOLLAND, P.W. (1986). Statistics and causal inference. J. Am. Statist. Assoc. 81, 945-60.

IMBENS, G.W. & ANGRIST, J.D. (1994). Identification and estimation of local average treatment effects. Econometrica 62, 467-76.

IMBENS, G.W. & RUBIN, D.B. (1997a). Bayesian inference for causal effects in randomized experiments with noncompliance. Ann. Statist. 25, 305-27.

IMBENS, G.W. & RUBIN, D.B. (1997b). Estimating outcome distributions for compliers in instrumental variables models. Rev. Econ. Stud. 64, 555-74.

LANCASTER, T. & IMBENS, G.W. (1996). Case control studies with contaminated controls. J. Economet. 71, 145-60.

NETTLETON, D. (1999). Convergence properties of the EM algorithm in constrained parameter spaces. Can. J. Statist. 27, 639-48.

OWEN, A. (1988). Empirical likelihood ratio confidence intervals for a single functional. Biometrika 75, 237-49.

OWEN, A.B. (2001). Empirical Likelihood. Boca Raton, FL: Chapman & Hall/CRC.

QIN, J. (1999). Empirical likelihood based confidence intervals for mixture proportions. Ann. Statist. 27, 1368-84.

QIN, J. & LAWLESS, J. (1994). Empirical likelihood and general estimating equations.

RUBIN, D.B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educ. Psychol. 66, 688-701.

RUBIN, D.B. (1980). Comment on a paper by D. Basu. J. Am. Statist. Assoc. 75, 591-3.

SHAO, J. (2003). Mathematical Statistics, 2nd ed. New York: Springer.

SHEINER, L.B. & RUBIN, D.B. (1995). Intention-to-treat analysis and the goals of clinical trials. Clin. Pharmacol. & Therap. 57, 6-15.

SHEN, X. & WONG, W.H. (1994). Convergence rate of sieve estimates. Ann. Statist. 22, 580-615.

SHEN, X., SHI, J. & WONG, W.H. (1999). Random sieve likelihood and general regression models. J. Am. Statist. Assoc. 94, 835-46.

SMALL, D.S., TEN HAVE, T.R., JOFFE, M.M. & CHENG, J. (2006). Random effects logistic models for analyzing efficacy of a longitudinal randomized treatment with non-adherence. Statist. Med. 25, 1981-2007.

SMALL, D.S., TEN HAVE, T.R. & ROSENBAUM, P.R. (2008). Randomization inference in a group-randomized trial of treatments for depression: covariate adjustment, noncompliance and quantile effects. J. Am. Statist. Assoc. 103, 271-9.

SOMMER, A. & ZEGER, S.L. (1991). On estimating efficacy from clinical trials. Statist. Med. 10, 45-52.

TANNER, M. & WONG, W. (1987). The calculation of posterior distributions by data augmentation (with Discussion). J. Am. Statist. Assoc. 82, 528-50.

WU, C.F.J. (1983). On the convergence properties of the EM algorithm. Ann. Statist. 11, 95-103.

ZELEN, M. (1979). A new design for randomized clinical trials. N. Engl. J. Med. 300, 1242-5.

Table 1: The relationship between observed groups and latent compliance classes

Ri   Ai   Ci
1    1    1 (Complier) or 2 (Always-taker)
1    0    0 (Never-taker) or 3 (Defier)
0    0    0 (Never-taker) or 1 (Complier)
0    1    2 (Always-taker) or 3 (Defier)

Table 2: The relationship between observed groups and latent compliance classes in single consent design trials

Ri   Ai   Ci
1    1    1 (Complier)
1    0    0 (Never-taker)
0    0    1 (Complier) or 0 (Never-taker)

Table 3: The relationship between observed groups and latent compliance classes under Assumptions 1-6

Ri   Ai   Ci
1    1    1 (Complier) or 2 (Always-taker)
1    0    0 (Never-taker)
0    0    1 (Complier) or 0 (Never-taker)
0    1    2 (Always-taker)


Table 4: Estimates of the CACE with true value 1 in single-consent treatment trials

                     Bias                               Mean squared error
Distn.   N      Std. IV    AMELE      Parametric    Std. IV   AMELE     Parametric
N1       100     0.0178    −0.1141    −0.0240       0.3482    0.2003    0.1649
N1       500     0.0202    −0.0016    −0.0054       0.0679    0.0515    0.0294
N2       100     0.0150     0.0105    −0.0053       0.1682    0.1604    0.1311
N2       500     0.0019     0.0020    −0.0062       0.0214    0.0211    0.0186
G1       100     0.0429    −0.0981     0.2851       0.3697    0.1945    0.2424
G1       500    −0.0060    −0.0212     0.3963       0.0637    0.0529    0.1907
G2       100     0.0088    −0.0048     0.3390       0.1957    0.1726    0.2311
G2       500     0.0235     0.0232     0.3765       0.0454    0.0450    0.1561
LN1      100     0.0173    −0.1364     0.2299       0.2277    0.1008    0.1897
LN1      500     0.0177    −0.0266     0.3666       0.0411    0.0235    0.1568
LN2      100    −0.0007    −0.0137     0.2813       0.0670    0.0563    0.1593
LN2      500     0.0126     0.0129     0.2627       0.0120    0.0117    0.0814

Distn., distributions; Std. IV, standard instrumental variable estimate; AMELE, approximate maximum empirical likelihood estimate

Table 5: Results from the depression study

Estimator     Hamilton score             Composite anti-depression score
              estimate (95% CI)          estimate (95% CI)
Std. IV       −2.55 (−4.13, −0.97)       1.86 (0.76, 3.14)
AMELE         −2.54 (−4.12, −0.97)       1.60 (0.73, 2.40)
Parametric    −2.82 (−4.39, −1.16)       1.41 (−0.66, 2.47)

CI, confidence interval; Std. IV, standard instrumental variable estimate; AMELE, approximate maximum empirical likelihood estimate


[Figure 1 plot. Horizontal axis: complier average causal effect; vertical axis: log-likelihood; marked estimates: Std IV and MLE.]

Figure 1: Profile log-likelihood for the maximum likelihood estimator and standard instrumental variable estimator of the complier average causal effect for the sample described in

[Figure 2 plots. Six histogram panels (a)-(f) with vertical axis density; horizontal axis Hamilton score in panels (a)-(c) and composite anti-depression score in panels (d)-(f).]

Figure 2: Depression study. Histograms of (a)-(c) the Hamilton score and (d)-(f) the composite anti-depression score for (a), (d) the R = 1, A = 1 group; (b), (e) the R = 1, A = 0 group; (c), (f) the R = 0 group.
