Locally efficient estimation in generalized partially linear model with measurement error in nonlinear function

3.1 Introduction

Generalized partially linear models have been widely used in statistics. Such models enrich the classic generalized linear models by allowing a covariate to enter the link function through a nonparametric form. This is useful when the dependence of the response on some covariates, even after transformation through a suitable link function, is still nonlinear and difficult to specify. At the same time, the model retains the classic generalized linear dependence on the other covariates. Many works in the literature address estimation and inference for generalized partially linear models; see, for example, Carroll et al. (1995), Liang et al. (2009) and Yu & Ruppert (2012).

When one of the covariates involved in the generalized partially linear model cannot be measured precisely, the problem becomes much more difficult. In fact, most of the works handling measurement error in the generalized partially linear model consider only the case where the measurement error occurs in a covariate involved in the linear component (Ma & Carroll 2006, Liu et al. 2017, Liang & Ren 2005, Liu 2007, Liang & Thurston 2008). When the model degenerates to the generalized linear model, an even larger literature handles the measurement error issues (Stefanski & Carroll 1985, 1987, Huang & Wang 2001, Ma & Tsiatis 2006, Carroll & Crainiceanu 2006, Buonaccorsi 2010, Xu et al. 2015). When handled properly, it can be shown that the parameters can be estimated at the root-$n$ convergence rate despite the presence of the measurement error and the possible presence of the nonparametric function in the model. However, it is a different story when the covariate inside the nonparametric function itself is measured with error. We conjecture that this is because, as soon as the covariate inside an unknown function is subject to error, the problem falls into the general framework of nonparametric measurement error models, where the standard practice for estimation and inference is deconvolution. Deconvolution is widely used in handling latent components and has been used to show that nonparametric regression with errors in covariates can have a very slow convergence rate. Possibly because of these inherent difficulties, generalized partially linear models with errors in the covariate inside the nonparametric function have not been studied systematically.

We tackle this difficult problem, where the error occurs in the covariate inside the nonparametric component of the generalized partially linear model, through a novel approach that avoids the deconvolution treatment completely. Two key ideas lead to our success in this endeavor. The first is to use a B-spline expansion to approximate the nonparametric function of the latent covariate. The B-spline form allows us to write out the approximation without having to perform the estimation simultaneously. This is different from nonparametric estimation via the kernel method, where the approximation and the estimation are integrated and inseparable. The second idea is the recognition that, after the B-spline approximation, the error-free model is effectively a parametric model, or at least operates as one, hence the only nonparametric component in the measurement error model is the distribution of the latent covariate. This implies that the semiparametric approach in Tsiatis & Ma (2004) can be adopted here to help establish the estimation procedure. The encouraging discovery is that we not only can bypass the estimation difficulties caused by a nonparametric function of a covariate measured with error, but can also prove that the procedure retains the root-$n$ convergence rate of the parameter estimation in the original model.

The structure of this chapter is as follows. We describe the model and the estimation methodology in Section 3.2, and then establish the large sample properties of the parameter estimation in Section 3.3. Two simulation studies are conducted in Section 3.4, and we analyze the AIDS Clinical Trials Group (ACTG) study in Section 3.5. We finish the chapter with some discussion in Section 3.6. All the technical details and proofs are provided in an Appendix.

3.2 Main results

3.2.1 The model

The generalized partially linear model we study is

$$f_{Y\mid X,Z}(y, x, z, \alpha, \beta, g) = f\{y, z^{\mathrm T}\beta + g(x), \alpha\}, \tag{3.1}$$

where $f$ is a known link function up to the unknown parameters $\alpha$, $\beta$ and the unknown function $g(\cdot)$. For example, $f(\cdot)$ can be the inverse logit link function $f(\cdot) = 1 - 1/\{\exp(\cdot) + 1\}$. The response variable $Y$ is observable and $X$, $Z$ are covariates. Here $Z$ is observable, while $X$ is a random variable measured with error and thus is not directly observable. Instead of observing $X$, we observe $W$, where

$$W = X + U, \tag{3.2}$$

and $U$ is a normal random error independent of $X$, $Z$ with mean zero and variance $\sigma_U^2$. To ease the presentation of the main methodology, we assume $\sigma_U^2$ is known. When $\sigma_U^2$ is unknown, we can estimate $\sigma_U^2$ first and then plug in. The observed data are $(W_i, Z_i, Y_i)$, $i = 1, \dots, n$, which are independent and identically distributed (iid). Our goal is to estimate $\alpha$, $\beta$ and $g(\cdot)$, and hence to understand the dependence of $Y$ on the covariates $(X, Z)$.

3.2.2 Efficient Score Derivation

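In preparation, we first approximate $g(x)$ with a B-spline representation, i.e. $g(x) \approx B(x)^{\mathrm T}\gamma$. Under this approximation, model (3.1) becomes

$$f_{Y\mid X,Z}(y, x, z, \alpha, \beta, g) \approx f_{Y\mid X,Z}(y, x, z, \theta) \equiv f\{y, z^{\mathrm T}\beta + B(x)^{\mathrm T}\gamma, \alpha\},$$

which is a completely parametric model with unknown parameters $\theta \equiv (\alpha^{\mathrm T}, \beta^{\mathrm T}, \gamma^{\mathrm T})^{\mathrm T}$.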
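To make the approximation $g(x) \approx B(x)^{\mathrm T}\gamma$ concrete, here is a minimal Python sketch that evaluates a B-spline basis on $[0,1]$ with the Cox-de Boor recursion. The function name `bspline_basis`, the equally spaced interior knots in the usage lines, and the choice of NumPy are illustrative assumptions, not part of the original procedure. The knot vector repeats the boundary knots `order` times, so the number of basis functions is the number of interior knots plus the spline order.

```python
import numpy as np

def bspline_basis(x, interior_knots, order):
    """Evaluate all B-spline basis functions of a given order on [0, 1].

    Returns an array of shape (len(x), N + order), where N is the number
    of interior knots. Hypothetical helper written for illustration.
    """
    x = np.asarray(x, dtype=float)
    interior_knots = np.asarray(interior_knots, dtype=float)
    # Boundary knots 0 and 1 are repeated `order` times each.
    t = np.concatenate([np.zeros(order), interior_knots, np.ones(order)])

    # Order-1 (piecewise constant) basis: indicators of [t_j, t_{j+1}).
    B = np.zeros((x.size, len(t) - 1))
    for j in range(len(t) - 1):
        B[:, j] = (t[j] <= x) & (x < t[j + 1])
    # Assign the right endpoint x = 1 to the last non-empty interval.
    last = order + interior_knots.size - 1
    B[x == 1.0, last] = 1.0

    # Cox-de Boor recursion, raising the order one step at a time.
    for k in range(1, order):
        B_next = np.zeros((x.size, len(t) - 1 - k))
        for j in range(len(t) - 1 - k):
            d1 = t[j + k] - t[j]
            d2 = t[j + k + 1] - t[j + 1]
            left = (x - t[j]) / d1 * B[:, j] if d1 > 0 else 0.0
            right = (t[j + k + 1] - x) / d2 * B[:, j + 1] if d2 > 0 else 0.0
            B_next[:, j] = left + right
        B = B_next
    return B

# Usage: quadratic splines (order 3) with 7 equally spaced interior knots.
x = np.linspace(0.0, 1.0, 11)
knots = np.linspace(0.0, 1.0, 9)[1:-1]
B = bspline_basis(x, knots, order=3)
gamma = np.ones(B.shape[1])      # arbitrary coefficient vector
g_approx = B @ gamma             # B(x)^T gamma evaluated at each x
```

Because the basis depends only on the knots and the evaluation points, the approximation form can be written down before any estimation takes place, which is the property the text relies on.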

This model falls in the general framework of Tsiatis & Ma (2004), hence the estimation procedure there can be adopted here. Specifically, the joint distribution of the observed variables conditional on $Z$ is

$$f_{W,Y\mid Z}(y, w, z, \theta) = \int f\{y, z^{\mathrm T}\beta + B(x)^{\mathrm T}\gamma, \alpha\}\, f_{W\mid X}(w, x)\, f_{X\mid Z}(x, z)\, d\mu(x),$$

with the conditional distribution function $f_{X\mid Z}(x, z)$ being a nuisance parameter. The nuisance tangent space $\Lambda$ and its orthogonal complement $\Lambda^{\perp}$ can be written as

$$\Lambda = [E\{a(X, Z) \mid Y, W, Z\} : E\{a(X, Z) \mid Z\} = 0],$$

$$\Lambda^{\perp} = [h(Y, W, Z) : E\{h(Y, W, Z) \mid X, Z\} = 0 \text{ almost everywhere}].$$

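The efficient score for $\theta$ is the residual of its score vector $S_\theta(y, w, z, \theta)$ after projecting it onto the nuisance tangent space $\Lambda$, denoted by

$$S_{\mathrm{res}}(y, w, z, \theta) \equiv S_\theta(y, w, z, \theta) - \Pi\{S_\theta(Y, W, Z, \theta) \mid \Lambda\},$$

where

$$S_\theta(y, w, z, \theta) \equiv \frac{\partial \log f_{W,Y\mid Z}(y, w, z, \theta)}{\partial \theta}.$$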

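Here “res” stands for residual. The detailed form of $S_{\mathrm{res}}(y, w, z, \theta)$ is given as

$$S_{\mathrm{res}}(y, w, z, \theta) = S_\theta(y, w, z, \theta) - E\{a(X, Z, \theta) \mid y, w, z\},$$

where $a(X, Z, \theta)$ satisfies

$$E\{S_\theta(Y, W, Z, \theta) \mid X, Z\} = E[E\{a(X, Z, \theta) \mid Y, W, Z\} \mid X, Z]. \tag{3.4}$$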

Now, noting that the above derivation is obtained from the approximate model (3.3), we perform some further analysis. Separating the components corresponding to $\alpha$, $\beta$ and $\gamma$ in $\theta$, we can write $S_\theta(y, w, z, \theta) \equiv \{S_{\alpha,\beta}(y, w, z, \theta)^{\mathrm T}, S_\gamma(y, w, z, \theta)^{\mathrm T}\}^{\mathrm T}$, which leads to the corresponding relation $S_{\mathrm{res}}(y, w, z, \theta) \equiv \{S_{\mathrm{res}1}(y, w, z, \theta)^{\mathrm T}, S_{\mathrm{res}2}(y, w, z, \theta)^{\mathrm T}\}^{\mathrm T}$. The estimating equation of the approximate model can be written as

$$\sum_{i=1}^{n} S_{\mathrm{res}}(Y_i, W_i, Z_i, \theta) \equiv \sum_{i=1}^{n} \{S_{\mathrm{res}1}(Y_i, W_i, Z_i, \theta)^{\mathrm T}, S_{\mathrm{res}2}(Y_i, W_i, Z_i, \theta)^{\mathrm T}\}^{\mathrm T} = 0. \tag{3.5}$$

Remember that our original model contains an unknown function $g(x)$. Thus, for the estimation of $\alpha$, $\beta$, it is beneficial to first treat $g$ as a nuisance parameter as well, and estimate $\alpha$, $\beta$ using profiling. We then plug in the estimated values of $\alpha$ and $\beta$ and estimate $g$ via the B-spline approximation. Of course, in addition to $g$, the distribution of the unobservable covariate $X$ conditional on the observable covariate $Z$ is also a nuisance component and still has to be taken into account.

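Let $\delta \equiv (\alpha^{\mathrm T}, \beta^{\mathrm T})^{\mathrm T}$ be a $p$-dimensional parameter. We propose to solve for $\gamma$ from $\sum_{i=1}^{n} S_{\mathrm{res}2}(Y_i, W_i, Z_i, \theta) = 0$ to obtain $\hat\gamma(\delta)$ first. Now from

$$f_{W,Y\mid Z}(w, z, y, \delta, g, f_X) = \int f\{y, z^{\mathrm T}\beta + g(x), \alpha\}\, f_{W\mid X}(w, x)\, f_{X\mid Z}(x, z)\, d\mu(x),$$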

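we can construct the nuisance tangent space as $\Lambda = \Lambda_{f_X} + \Lambda_g$, where

$$\Lambda_{f_X} = [E\{a(X, Z) \mid Y, W, Z\} : E\{a(X, Z) \mid Z\} = 0],$$

$$\Lambda_g = \big(E[s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\}\, b(X) \mid Y, W, Z] : \forall\, b(X)\big),$$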

each other. We can further verify that

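$$\Lambda_{f_X}^{\perp} = [h(Y, W, Z) : E\{h(Y, W, Z) \mid X, Z\} = 0 \text{ almost everywhere}],$$

$$\Lambda_g^{\perp} = \big(h(Y, W, Z) : E[h(Y, W, Z)\, s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\} \mid X, Z] = 0 \text{ almost everywhere}\big).$$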

The efficient score for $\delta$ is now the residual of the score vector $S_\delta$ after projecting it onto the nuisance tangent space $\Lambda$, denoted by

$$S_{\mathrm{eff}}(Y, W, Z, \delta, g) = S_\delta(Y, W, Z, \delta, g) - \Pi\{S_\delta(Y, W, Z, \delta, g) \mid \Lambda\}. \tag{3.6}$$

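Its explicit form is given as

$$S_{\mathrm{eff}}(Y, W, Z, \delta, g) = S_\delta(Y, W, Z, \delta, g) - E\{a(X, Z) \mid Y, W, Z\} - E[s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\}\, b(X) \mid Y, W, Z],$$

where $a(X, Z)$ and $b(X)$ satisfy

$$
\begin{aligned}
E\{S_\delta(Y, W, Z, \delta, g) \mid X, Z\}
&= E[E\{a(X, Z) \mid Y, W, Z\} \mid X, Z] \\
&\quad + E\big(E[s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\}\, b(X) \mid Y, W, Z] \mid X, Z\big) \quad \text{and} \\
E[S_\delta(Y, W, Z, \delta, g)\, s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\} \mid X, Z]
&= E[E\{a(X, Z) \mid Y, W, Z\}\, s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\} \mid X, Z] \\
&\quad + E\big(E[s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\}\, b(X) \mid Y, W, Z]\, s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\} \mid X, Z\big).
\end{aligned}
\tag{3.7}
$$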

We can then form the estimating equation $\sum_{i=1}^{n} S_{\mathrm{eff}}\{Y_i, W_i, Z_i, \delta, \hat\gamma(\delta)\} = 0$ and solve for $\hat\delta$ as the estimator, where $a(X, Z)$, $b(X)$ are the solutions to the integral equations in (3.7).

3.2.3 Estimation under working model

The above derivations are based on the efficient score calculation and hence will yield the efficient estimator. However, a close look reveals that the procedure is not practical because its implementation relies on the unknown function $f_{X\mid Z}(x, z)$. Thus, our estimator needs to be calculated under a posited working model $f^*_{X\mid Z}(x, z)$. The procedure is described below, where we use $^*$ to denote a quantity whose calculation is carried out using $f^*_{X\mid Z}(x, z)$ instead of $f_{X\mid Z}(x, z)$.

1. Posit a working model $f^*_{X\mid Z}(x, z)$.

2. Solve for $\gamma$ from $\sum_{i=1}^{n} S^*_{\mathrm{res}2}(Y_i, W_i, Z_i, \theta) = 0$ to obtain $\hat\gamma(\delta)$.

3. Calculate the score function $S^*_\delta(Y, W, Z, \delta, g)$ under the working model $f^*_{X\mid Z}(x, z)$.

4. Solve the integral equation (3.7) to get $a(X, Z)$ and $b(X)$.

5. Calculate the approximate efficient score function $S^*_{\mathrm{eff}}(Y, W, Z, \delta, \hat g)$ following (3.6), where $\hat g(\cdot) = B(\cdot)^{\mathrm T}\hat\gamma(\delta)$.

6. Solve the estimating equation $\sum_{i=1}^{n} S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta, \hat g) = 0$ to obtain $\hat\delta$.
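The nested structure of Steps 2 and 6 (an inner solve for $\hat\gamma(\delta)$ at each candidate $\delta$, wrapped in an outer solve for $\hat\delta$) can be sketched in Python as follows. This is only a schematic of the profiling idea: the callables `score_gamma` and `score_delta` are hypothetical stand-ins for the summed scores $\sum_i S^*_{\mathrm{res}2}$ and $\sum_i S^*_{\mathrm{eff}}$, and the small Newton solver with a finite-difference Jacobian is an assumed, generic root finder rather than the solver used in the original work.

```python
import numpy as np

def solve_root(fun, x0, tol=1e-10, max_iter=50, eps=1e-6):
    """Newton's method with a forward-difference Jacobian (illustrative)."""
    x = np.atleast_1d(np.asarray(x0, dtype=float)).copy()
    for _ in range(max_iter):
        fx = np.atleast_1d(fun(x))
        if np.max(np.abs(fx)) < tol:
            break
        jac = np.empty((fx.size, x.size))
        for k in range(x.size):
            step = np.zeros_like(x)
            step[k] = eps
            jac[:, k] = (np.atleast_1d(fun(x + step)) - fx) / eps
        x = x - np.linalg.solve(jac, fx)
    return x

def profile_estimate(score_gamma, score_delta, delta0, gamma0):
    """Solve score_delta(delta, gamma_hat(delta)) = 0, where gamma_hat(delta)
    solves score_gamma(delta, gamma) = 0 for each fixed delta."""
    def gamma_hat(delta):
        # Inner solve: profile out gamma for the current delta.
        return solve_root(lambda gam: score_gamma(delta, gam), gamma0)

    def outer(delta):
        # Outer equation evaluated at the profiled gamma.
        return score_delta(delta, gamma_hat(delta))

    delta_hat = solve_root(outer, delta0)
    return delta_hat, gamma_hat(delta_hat)

# Toy scores (purely illustrative): gamma_hat(delta) = 2 * delta, and the
# outer equation delta + gamma - 3 = 0 then gives delta = 1, gamma = 2.
toy_score_gamma = lambda delta, gam: gam - 2.0 * delta
toy_score_delta = lambda delta, gam: delta + gam - 3.0
delta_hat, gamma_hat_val = profile_estimate(
    toy_score_gamma, toy_score_delta, delta0=[0.0], gamma0=[0.0]
)
```

In a real implementation the inner solve must be accurate enough that its numerical error does not contaminate the finite-difference derivatives of the outer equation.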

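When we calculate $a(X, Z)$ at each observed $z$ value and calculate $b(X)$, we discretize the distribution of $X$ on $m$ equally spaced points on the support of $f_{X\mid Z}(x, z)$ and calculate the probability mass function $\pi_j(Z)$ at each of the $m$ points. We normalize the $\pi_j(Z)$ to ensure $\sum_{j=1}^{m} \pi_j(Z) = 1$. Note that using the discretization, $f^*_{X,Y,W\mid Z}(x_j, y, w, z) \approx f\{y, z^{\mathrm T}\beta + g(x_j), \alpha\}\, f_{W\mid X = x_j}(w, x_j)\, \pi_j(Z)$. Further, the following quantities can be approximated by

$$S^*_\delta(Y, W, Z, \delta, g) \approx \frac{\partial \log\big[\sum_{i=1}^{m} f\{Y, Z^{\mathrm T}\beta + g(x_i), \alpha\}\, f_{W\mid X}(W, x_i)\, \pi_i(Z)\big]}{\partial \delta},$$

$$E^*\{a(X, Z) \mid Y, W, Z\} \approx \frac{\sum_{i=1}^{m} a(x_i, Z)\, f^*_{X,Y,W\mid Z}(x_i, Y, W, Z)}{\sum_{i=1}^{m} f^*_{X,Y,W\mid Z}(x_i, Y, W, Z)},$$

$$E^*[s\{Y, Z^{\mathrm T}\beta + g(X), \alpha\}\, b(X) \mid Y, W, Z] \approx \frac{\sum_{i=1}^{m} s\{Y, Z^{\mathrm T}\beta + g(x_i), \alpha\}\, b(x_i)\, f^*_{X,Y,W\mid Z}(x_i, Y, W, Z)}{\sum_{i=1}^{m} f^*_{X,Y,W\mid Z}(x_i, Y, W, Z)}.$$

Let $A(X, Z) \equiv \{a(x_1, Z), a(x_2, Z), \dots, a(x_m, Z)\}^{\mathrm T}$ and $B(X) \equiv \{b(x_1), b(x_2), \dots, b(x_m)\}^{\mathrm T}$. Let $M_1(X, Z) \equiv \{m_1(x_1, Z), m_1(x_2, Z), \dots, m_1(x_m, Z)\}^{\mathrm T}$ be an $m \times p_\delta$ matrix, where $p_\delta$ is the length of $\delta$ and $m_1(x_i, Z) \equiv E\{S^*_\delta(Y, W, Z, \delta, g) \mid x_i, Z\}$. Further, let $M_2(X, Z) \equiv \{m_2(x_1, Z), m_2(x_2, Z), \dots, m_2(x_m, Z)\}^{\mathrm T}$ be an $m \times p_\delta$ matrix, where $m_2(x_i, Z) \equiv E[S^*_\delta(Y, W, Z, \delta, g)\, s\{Y, Z^{\mathrm T}\beta + g(x_i), \alpha\} \mid x_i, Z]$. Finally, let $C(X, Z)$ be an $m \times m$ matrix with the $(i, j)$ block equal to

$$E\left\{ \frac{f^*_{X,Y,W\mid Z}(x_j, Y, W, Z)}{\sum_{k=1}^{m} f^*_{X,Y,W\mid Z}(x_k, Y, W, Z)} \,\Big|\, x_i, Z \right\},$$

let $D(X, Z)$ be an $m \times m$ matrix with the $(i, j)$ block equal to

$$E\left[ \frac{s\{Y, Z^{\mathrm T}\beta + g(x_j), \alpha\}\, f^*_{X,Y,W\mid Z}(x_j, Y, W, Z)}{\sum_{k=1}^{m} f^*_{X,Y,W\mid Z}(x_k, Y, W, Z)} \,\Big|\, x_i, Z \right],$$

let $F(X, Z)$ be an $m \times m$ matrix with the $(i, j)$ block equal to

$$E\left[ \frac{f^*_{X,Y,W\mid Z}(x_j, Y, W, Z)\, s\{Y, Z^{\mathrm T}\beta + g(x_i), \alpha\}}{\sum_{k=1}^{m} f^*_{X,Y,W\mid Z}(x_k, Y, W, Z)} \,\Big|\, x_i, Z \right],$$

and let $G(X, Z)$ be an $m \times m$ matrix with the $(i, j)$ block equal to

$$E\left[ \frac{s\{Y, Z^{\mathrm T}\beta + g(x_j), \alpha\}\, f^*_{X,Y,W\mid Z}(x_j, Y, W, Z)\, s\{Y, Z^{\mathrm T}\beta + g(x_i), \alpha\}}{\sum_{k=1}^{m} f^*_{X,Y,W\mid Z}(x_k, Y, W, Z)} \,\Big|\, x_i, Z \right].$$

We can get $a(x_i, Z)$ and $b(x_i)$ by solving

$$\begin{pmatrix} C(X, Z) & D(X, Z) \\ F(X, Z) & G(X, Z) \end{pmatrix} \begin{pmatrix} A(X, Z) \\ B(X) \end{pmatrix} = \begin{pmatrix} M_1(X, Z) \\ M_2(X, Z) \end{pmatrix}.$$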
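The discretization and the final linear solve can be sketched in Python under some stated assumptions. The function names (`working_pmf`, `posterior_weights`, `solve_ab`) are hypothetical, the working model is taken to be a normal density restricted to $[0,1]$ that does not depend on $Z$, the response model is taken to be logistic, and each $b(x_j)$ is treated as a vector of length $p_\delta$ so that $B(X)$ is $m \times p_\delta$. The matrices $C$, $D$, $F$, $G$, $M_1$, $M_2$ are assumed to have been computed already and the block matrix is assumed to be nonsingular.

```python
import numpy as np

def working_pmf(m, mean=0.5, sd=1.0 / 6.0):
    """Normalized masses pi_j on m equally spaced points in [0, 1],
    from an assumed normal working density restricted to [0, 1]."""
    grid = np.linspace(0.0, 1.0, m)
    dens = np.exp(-0.5 * ((grid - mean) / sd) ** 2)
    return grid, dens / dens.sum()

def posterior_weights(y, w, z, grid, pi, beta, g, sigma2_u):
    """Weights f*(x_j, y, w, z) / sum_k f*(x_k, y, w, z) for a logistic
    response model and normal measurement error with known variance."""
    eta = z @ beta + g(grid)
    prob = 1.0 / (1.0 + np.exp(-eta))
    f_y = prob if y == 1 else 1.0 - prob
    # Normal error density up to a constant that cancels in the ratio.
    f_w = np.exp(-0.5 * (w - grid) ** 2 / sigma2_u)
    joint = f_y * f_w * pi
    return joint / joint.sum()

def solve_ab(C, D, F, G, M1, M2):
    """Solve [[C, D], [F, G]] [A; B] = [M1; M2] for A and B."""
    m = C.shape[0]
    lhs = np.block([[C, D], [F, G]])
    rhs = np.vstack([M1, M2])
    sol = np.linalg.solve(lhs, rhs)
    return sol[:m], sol[m:]

# Usage with arbitrary illustrative inputs (not values from the study).
grid, pi = working_pmf(15)
wts = posterior_weights(
    y=1, w=0.4, z=np.array([0.1, -0.2]), grid=grid, pi=pi,
    beta=np.array([0.5, 1.0]), g=lambda x: -x, sigma2_u=0.03,
)
rng = np.random.default_rng(0)
m, p = 15, 2
C, D, F, G = (rng.normal(size=(m, m)) for _ in range(4))
C = C + 10 * m * np.eye(m)   # make the toy system well conditioned
G = G + 10 * m * np.eye(m)
M1, M2 = rng.normal(size=(m, p)), rng.normal(size=(m, p))
A, B = solve_ab(C, D, F, G, M1, M2)
```

The same weights appear in all three approximations above, so in practice they would be computed once per observation and reused.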

3.3 Asymptotic properties

Let $S_{\mathrm{res}2}(Y_i, W_i, Z_i, \alpha, \beta, g)$ be $S_{\mathrm{res}2}(Y_i, W_i, Z_i, \alpha, \beta, \gamma)$ with all appearances of $B(X)^{\mathrm T}\gamma$ in it replaced by $g(X)$.

We first list the set of regularity conditions required for establishing the large sample properties of our estimator.

(C1) The true density fX(x) is bounded with compact support. Without loss of generality, we assume the support of fX(x) is [0,1].

(C2) The function $g(x) \in C^q([0,1])$, $q > 1$, is bounded with compact support.

(C3) The spline order $r \ge q$.

(C4) Define the knots $t_{-r+1} = \cdots = t_0 = 0 < t_1 < \cdots < t_N < 1 = t_{N+1} = \cdots = t_{N+r}$, where $N$ is the number of interior knots that satisfies $N \to \infty$, $N^{-1} n (\log n)^{-1} \to \infty$ and $N n^{-1/(2q)} \to \infty$ as $n \to \infty$. Denote the number of spline bases $d_\gamma$, i.e. $d_\gamma = N + r$.

(C5) Let $h_j$ be the distance between the $j$th and $(j-1)$th interior knots. Let $h_b = \max_{1 \le j \le N} h_j$ and $h_s = \min_{1 \le j \le N} h_j$. There exists a constant $c_h \in (0, \infty)$ such that $h_b / h_s < c_h$. Hence, $h_b = O_p(N^{-1})$ and $h_s = O_p(N^{-1})$.

(C6) $\gamma_0$ is a $d_\gamma$-dimensional spline coefficient vector such that $\sup_{x \in [0,1]} |B(x)^{\mathrm T}\gamma_0 - g(x)| = O_p(h_b^q)$.

(C7) The equation set

$$E\{S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta, \gamma)\} = 0,$$

$$E\{S^*_{\mathrm{res}2}(Y_i, W_i, Z_i, \delta, \gamma)\} = 0$$

has a unique root for $\theta$ in the neighborhood of $\theta_0$. Recall that $\theta = (\alpha^{\mathrm T}, \beta^{\mathrm T}, \gamma^{\mathrm T})^{\mathrm T}$.

θ, with its singular values bounded and bounded away from 0. Let the unique root be θ∗. Note that θ0 and θ∗ are functions of N, that is, for any sufficiently

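large $N$, there is a unique root $\theta^*$ in the neighborhood of $\theta_0$.

(C8) The maximum absolute row sum of the matrix $\partial S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta_0, \gamma_0) / \partial \gamma_0^{\mathrm T}$, i.e. $\|\partial S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta_0, \gamma_0) / \partial \gamma_0^{\mathrm T}\|_\infty$, is integrable.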

The conditions listed above are all standard boundedness and smoothness conditions on functions, together with some classical conditions imposed on the spline order and the number of knots. These conditions are commonly used in the spline approximation and semiparametric regression literature. We now establish the consistency of $\hat\delta_n$ and $\hat\gamma_n$ as well as the asymptotic distribution of $\hat\delta_n$.

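Theorem 3. Assume Conditions (C1)-(C7) hold. Let $\hat\theta_n$ satisfy

$$\frac{1}{n} \sum_{i=1}^{n} S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \hat\delta_n, \hat\gamma_n) = 0,$$

$$\frac{1}{n} \sum_{i=1}^{n} S^*_{\mathrm{res}2}(Y_i, W_i, Z_i, \hat\delta_n, \hat\gamma_n) = 0.$$

Then $\hat\theta_n - \theta_0 = o_p(1)$ element-wise.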

The result in Theorem 3 is used to further establish the asymptotic properties of the estimator of the parameters of interest, $\hat\delta_n$, and of the estimator of the function of interest, $B(\cdot)^{\mathrm T}\hat\gamma_n$.

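Theorem 4. Assume Conditions (C1)-(C8) hold and let

$$Q \equiv E\left\{ \frac{\partial S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta_0, \gamma)}{\partial \delta_0^{\mathrm T}} \bigg|_{B(\cdot)^{\mathrm T}\gamma = g(\cdot)} \right\}.$$

Then

$$\sqrt{n}(\hat\delta_n - \delta_0) = -Q^{-1} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta_0, g) + o_p(1).$$

Consequently, $\sqrt{n}(\hat\delta_n - \delta_0) \to N(0, V)$ in distribution when $n \to \infty$, where

$$V = Q^{-1}\, \mathrm{var}\{S^*_{\mathrm{eff}}(Y_i, W_i, Z_i, \delta_0, g)\}\, (Q^{-1})^{\mathrm T}.$$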

Theorem 4 indicates that $\delta$ is estimated at the root-$n$ rate. The proofs of Theorems 3 and 4 are given in the Appendix. Because the B-spline estimation of $g(\cdot)$ converges at a slower rate than root-$n$, the estimation of $\delta$ does not have any impact on the first-order asymptotic properties of $\hat g$. Thus, for the analysis of the asymptotic properties of $\hat g$, we can treat $\delta$ as known. Then, the proof of Theorem 2 in Jiang & Ma (2017) can be directly used. We skip the details of the proof and provide the specific convergence property of the estimation of $g$ in Theorem 5.

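Theorem 5. Assume Conditions (C1)-(C8) hold and let

$$P \equiv E\left\{ \frac{\partial S^*_{\mathrm{res}2}(Y_i, W_i, Z_i, \delta_0, \gamma)}{\partial \gamma^{\mathrm T}} \bigg|_{B(\cdot)^{\mathrm T}\gamma = g(\cdot)} \right\}.$$

Then $\|\hat\gamma_n - \gamma_0\|_2 = O_p\{(n h_b)^{-1/2}\}$. Further,

$$\hat\gamma_n - \gamma_0 = -P^{-1} n^{-1} \sum_{i=1}^{n} S^*_{\mathrm{res}2}(Y_i, W_i, Z_i, \delta_0, \gamma)\{1 + o_p(1)\}.$$

This implies that $\hat g(x)$, which equals $B(x)^{\mathrm T}\hat\gamma_n$, satisfies $\sup_{x \in [0,1]} |\hat g(x) - g(x)| = O_p\{(n h_b)^{-1/2}\}$. Specifically, $\mathrm{bias}\{\hat g(x)\} = E\{\hat g(x) - g(x)\} = O(h_b^{q - 1/2})$ and

$$\sqrt{n h_b}\,[\hat g(x) - g(x) - \mathrm{bias}\{\hat g(x)\}] = \sqrt{n h_b}\, B(x)^{\mathrm T} \left\{ -P^{-1} n^{-1} \sum_{i=1}^{n} S^*_{\mathrm{res}2}(Y_i, W_i, Z_i, \delta_0, g) \right\} + o_p(1).$$

3.4 Numerical Study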

In our first simulation, we generated the observations $(W_i, Z_i, Y_i)$ from the model

$$\mathrm{pr}(Y_i = 1 \mid X_i = x_i, Z_i = z_i) = H\{g(x_i) + \beta_1 z_{1i} + \beta_2 z_{2i} + \beta_3 z_{3i} + \beta_4 z_{4i}\}, \tag{3.8}$$

where $W = X + U$ and $U \sim \mathrm{normal}(0, 0.03)$. The true function is $g(x) = -5\exp\{-0.8(x - 2.5)^2\}$ and $H(t)$ is the inverse logit link function. We set $\beta_1 = \ldots$, $\beta_2 = 0.5$, $\beta_3 = 1$ and $\beta_4 = -0.3$. The sample size is 1000 and we ran 1000 simulations. $X_i$ is generated from a truncated normal distribution with mean 0.5 and variance 1/36 on $[0,1]$, independently of $Z_i$. We implemented our method using a normal working model, corresponding to a correct working model case. To investigate the performance of our method under a misspecified working model, we also performed another study, in which $X_i$ is generated from a truncated Student-$t$ distribution with 5 degrees of freedom. Covariates $Z_{1i}$, $Z_{2i}$ and $Z_{4i}$ are generated from the standard normal distribution. The covariate $Z_{3i}$ is generated from a uniform distribution on $[-1, 1]$. In both studies, we estimated both the parameters $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ and the function $g(x)$.
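The data-generating mechanism of the first simulation can be sketched in Python as below. Several points are assumptions rather than facts from the text: the second argument of normal(0, 0.03) is read as the error variance, the mean 0.5 and variance 1/36 are read as the parameters of the normal distribution before truncation to $[0,1]$, truncation is done by simple rejection sampling, and the value used for $\beta_1$ in the usage line is an arbitrary placeholder because it is not recoverable from the text. The function names are hypothetical.

```python
import numpy as np

def g_true(x):
    """True nonparametric function of the first simulation."""
    return -5.0 * np.exp(-0.8 * (x - 2.5) ** 2)

def truncated_normal(rng, n, mean=0.5, sd=1.0 / 6.0, low=0.0, high=1.0):
    """Draw n values from a normal distribution truncated to [low, high]
    by rejection sampling (assumed scheme, for illustration)."""
    out = np.empty(0)
    while out.size < n:
        draw = rng.normal(mean, sd, size=2 * n)
        out = np.concatenate([out, draw[(draw >= low) & (draw <= high)]])
    return out[:n]

def simulate(n, beta, sigma2_u=0.03, seed=0):
    """Generate (W_i, Z_i, Y_i), i = 1..n, following model (3.8)."""
    rng = np.random.default_rng(seed)
    x = truncated_normal(rng, n)                  # latent covariate X
    z = np.column_stack([
        rng.normal(size=n),                       # Z1: standard normal
        rng.normal(size=n),                       # Z2: standard normal
        rng.uniform(-1.0, 1.0, size=n),           # Z3: uniform on [-1, 1]
        rng.normal(size=n),                       # Z4: standard normal
    ])
    eta = g_true(x) + z @ beta
    prob = 1.0 / (1.0 + np.exp(-eta))             # inverse logit link H
    y = rng.binomial(1, prob)
    w = x + rng.normal(0.0, np.sqrt(sigma2_u), size=n)  # W = X + U
    return w, z, y

# beta[0] = 1.0 is a placeholder only; the source value of beta_1 is unknown.
w, z, y = simulate(1000, beta=np.array([1.0, 0.5, 1.0, -0.3]))
```

Only `w`, `z` and `y` are returned, mirroring the fact that the latent covariate is never observed by the analyst.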

In the second simulation, we set the true $g$ function to be $g(x) = -5\exp(-0.2x^2) + 5$, while all other settings remain the same. As in the first simulation, we compared the performance of a correct working model and a misspecified working model in terms of estimating both $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$ and $g(x)$.

In both simulations 1 and 2, we discretized the distribution of $X$ on $[0,1]$ into $m = 15$ equal segments and used the truncated normal distribution discussed earlier as our working model. We used quadratic splines with 7 knots to estimate $g(x)$. The simulation results are shown in Tables D.1, D.2 and Figures D.1, D.2. The results in Tables D.1 and D.2 show little bias for the $\beta$ estimation, regardless of whether a correct or a misspecified working model is used. Figures D.1 and D.2 show that the estimators of $g(x)$ have somewhat large bias near the boundary in both cases, which is expected given the boundary effect. The performance of the $g(x)$ estimation is satisfactory in the interior of the function domain. The simulation results show no big difference between the performance under the correct working model of $f_X(x)$ and under a misspecified one, confirming our theory on consistency in both cases.

3.5 Data Analysis

The data set we analyzed is from an AIDS Clinical Trials Group (ACTG) study. The goal of this study is to compare four different treatments, ‘ZDV’, ‘ZDV+ddI’, ‘ZDV+ddC’ and ‘ddC’, in HIV-infected adults whose CD4 cell counts were from 200 to 500 per cubic millimeter. We labelled those treatments as treatment 1, treatment 2, treatment 3 and treatment 4. We used treatment 1 as the base treatment because it is a standard treatment. There were 1036 patients enrolled in the study, and they had no antiretroviral therapy at enrollment. The criterion that we used to compare the four treatments is whether a patient has his or her CD4 count drop below 50%, which is an important indicator for HIV-infected patients to develop AIDS or die. We have $Y = 1$ if a patient has his or her CD4 count drop below 50%, and $Y = 0$ otherwise.

Our model has the form:

pr(Yi = 1|Xi =xi, Zi =zi) =H{g(xi) +β1z1i+β2z2i+β3z3i}, (3.9)

where $W = X + U$ and $U \sim \mathrm{normal}(0, \sigma_U^2)$. The covariates $Z_1$, $Z_2$, and $Z_3$ are

dichotomous variables. Z1i = Z2i = Z3i = 0 indicates the ith individual receives
