CHAPTER 3 : NONPARAMETRIC ESTIMATION FOR TIME-VARYING MISSING COVARIATES IN
3.2 Proposed Nonparametric Maximum Estimated Likelihood
Let $Y$ be a continuous outcome with $n$ repeated measures, and let $X$ and $Z$ be time-varying or time-independent variables. Define $X$ and $Z$ as matrices with $n$ rows and $q_X$ and $q_Z$ columns, respectively. We assume that $Z$ is measured for all subjects but $X$ is only available for a random subsample.
Those subjects for whom $X$ is measured make up the validation set $V$. Subjects who are missing $X$ make up the nonvalidation set $\bar{V}$. Note that for every subject, $X$ is either fully observed or not observed at all. That is, for subjects in the validation set, every component of $X$ is observed if $q_X > 1$, and $X$ is measured at every observation time if it is time-varying.
In addition, we assume $Z$ can be decomposed into $(Z^*, A)$, where $Z^*$ comprises the components of $Z$ that are independent of the missing covariate and $A$ is an auxiliary variable that contains some information about $X$. Thus, the observed data consist of $(Y_i, X_i, Z_i^*, A_i)$ for $i \in V$ and $(Y_j, Z_j^*, A_j)$ for $j \in \bar{V}$.
Now that we have defined $A$, we can explain what we mean when we say that $X$ is available for a random subsample. We assume that the missingness mechanism is independent of the auxiliary variable, but not necessarily independent of the other covariates. This assumption is therefore less restrictive than the missing completely at random (MCAR) assumption.
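We define a linear mixed-effects model for $Y_i$ as
$$Y_i = X_i\beta_X + Z_i^*\beta_{Z^*} + \gamma_i b_i + \epsilon_i, \tag{3.1}$$
where $\beta_X$ and $\beta_{Z^*}$ are vectors of fixed-effects coefficients, $\gamma_i$ is the $n_i \times q_\gamma$ design matrix for the random effects $b_i$, $\epsilon_i$ is an $n_i \times 1$ vector of random errors, and $n_i$ is the number of observations for subject $i$. In addition, we assume as usual that the $\epsilon_i$'s are independent and follow an $n_i$-variate normal distribution with mean 0 and variance $\sigma^2\Lambda_i(\nu)$, where $\nu$ denotes the parameters of $\Lambda_i$, and that the $b_i$'s are iid, independent of $\epsilon_i$, and follow a $q_\gamma$-variate normal distribution with mean 0 and variance $D$. Then $P_\beta(Y_i|X_i, Z_i)$ follows a multivariate normal distribution with mean $\mu_i = X_i\beta_X + Z_i^*\beta_{Z^*}$ and variance $\Sigma_i = \gamma_i D \gamma_i^T + \sigma^2\Lambda_i(\nu)$.
Note that $A_i$ is not used in the model, to prevent problems due to collinearity between $A$ and $X$.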
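As an illustration of the marginal distribution implied by Eq. 3.1, the following minimal sketch computes $\mu_i$ and $\Sigma_i$ and evaluates the multivariate normal log density. The function names, the argument `G_i` (standing in for $\gamma_i$), and the toy values are hypothetical and chosen only for illustration; they are not taken from the original analysis.

```python
import numpy as np

def marginal_moments(X_i, Z_i, beta_X, beta_Z, G_i, D, sigma2, Lambda_i):
    """Return (mu_i, Sigma_i) of the marginal distribution of Y_i.

    G_i plays the role of gamma_i, the random-effects design matrix
    (hypothetical name used only in this sketch).
    """
    mu_i = X_i @ beta_X + Z_i @ beta_Z
    Sigma_i = G_i @ D @ G_i.T + sigma2 * Lambda_i
    return mu_i, Sigma_i

def mvn_logpdf(y, mu, Sigma):
    """Log density of N(mu, Sigma) at y, using a Cholesky factorisation."""
    n = y.shape[0]
    L = np.linalg.cholesky(Sigma)        # Sigma = L L^T
    z = np.linalg.solve(L, y - mu)       # whitened residual
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (n * np.log(2.0 * np.pi) + log_det + z @ z)

# Toy subject (illustrative values): 3 visits, one X column,
# one Z* column, and a random intercept.
X_i = np.array([[1.2], [1.5], [1.3]])
Z_i = np.array([[0.0], [1.0], [2.0]])
G_i = np.ones((3, 1))
D = np.array([[0.5]])
mu_i, Sigma_i = marginal_moments(
    X_i, Z_i, np.array([2.0]), np.array([0.1]), G_i, D, 1.0, np.eye(3)
)
```

With a random intercept and $\Lambda_i$ equal to the identity, $\Sigma_i$ has a compound-symmetric form, which makes the toy example easy to check by hand.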
The full likelihood for the data in the validation and nonvalidation sets can be expressed, as in Pepe and Fleming (1991), as
$$L = \prod_{i \in V} P_\beta(Y_i|X_i, Z_i) \prod_{j \in \bar{V}} P_\beta(Y_j|Z_j). \tag{3.2}$$
If the distribution $P(X|A)$ were known, then $P_\beta(Y|Z)$ could be calculated as $\int P_\beta(Y|x, Z)\,P(x|A)\,dx$. However, $P(X|A)$ is not known, and even if it were, the calculation of $P_\beta(Y|Z)$ would likely require some form of numerical integration. Instead, following Pepe and Fleming (1991) and Carroll and Wand (1991), we obtain unbiased, nonparametric estimates of $P_\beta(Y|Z)$ using empirical estimates of $P(X|A)$ based on the random subsample that makes up the validation set. Since $P(X|A) = P(X, A)/P(A)$, we need estimates for $P(A)$. The empirical estimate for the distribution of discrete $A$ is $\hat{f}_A(A_j) = \frac{1}{n_v}\sum_{i \in V} I(A_i = A_j)$, where $n_v$ is the size of the validation set. For continuous $A$, kernel density estimates are used, so that $\hat{f}_A(A_j) = \frac{1}{n_v h}\sum_{i \in V} \Phi\left(\frac{A_i - A_j}{h}\right)$, where $\Phi$ is a symmetric density function and $h$ is the bandwidth. Using these empirical estimates of $P(A)$, we can obtain unbiased estimates of $P(Y_j|Z_j)$, as defined in Eq. 3.3.
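The two empirical estimates of $P(A)$ can be sketched as follows. This is a minimal illustration that assumes a standard Gaussian density for $\Phi$ (the text only requires a symmetric density); the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density; an assumed choice for the symmetric density Phi."""
    u = np.asarray(u, dtype=float)
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def f_hat_discrete(a_val, a_j):
    """Proportion of validation-set auxiliary values equal to a_j."""
    a_val = np.asarray(a_val)
    return float(np.mean(a_val == a_j))

def f_hat_continuous(a_val, a_j, h):
    """Kernel density estimate of f_A at a_j with bandwidth h."""
    a_val = np.asarray(a_val, dtype=float)
    return float(np.sum(gaussian_kernel((a_val - a_j) / h)) / (a_val.size * h))
```

For example, `f_hat_discrete([0, 1, 1, 2], 1)` is the fraction of validation subjects whose auxiliary value equals 1.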
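For brevity of notation, let $w_D = \frac{1}{n_v}$ and $w_C = \frac{1}{n_v h}$. Then, if $X$ is time-independent, an unbiased estimate of $P(Y_j|Z_j)$ for subject $j$ from the nonvalidation set can be written
$$\hat{P}(Y_j|Z_j) = \frac{w_k \sum_{i \in V} P(Y_j|X_i, Z_j)\,K_k(A)}{w_k \sum_{i \in V} K_k(A)} = \frac{\sum_{i \in V} P(Y_j|X_i, Z_j)\,K_k(A)}{\sum_{i \in V} K_k(A)}, \tag{3.3}$$
where $k = D, C$ for discrete or continuous $A$, respectively, $K_D(A) = I(A_i = A_j)$, and $K_C(A) = \Phi\left(\frac{A_i - A_j}{h}\right)$. Note that $\hat{f}_A(A_j) = w_k \sum_{i \in V} K_k(A)$.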
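Eq. 3.3 is a kernel-weighted average of the conditional densities $P(Y_j|X_i, Z_j)$ over the validation set. A minimal sketch of that ratio is given below; it assumes the conditional densities and kernel values have already been computed, and the function name is hypothetical.

```python
import numpy as np

def estimated_density(cond_dens, kernel_vals):
    """Kernel-weighted average of P(Y_j | X_i, Z_j) over validation subjects i.

    cond_dens[i]   : P(Y_j | X_i, Z_j) for validation subject i
    kernel_vals[i] : K_k(A) for the pair (i, j), e.g. I(A_i == A_j)
    """
    cond_dens = np.asarray(cond_dens, dtype=float)
    kernel_vals = np.asarray(kernel_vals, dtype=float)
    denom = kernel_vals.sum()
    if denom <= 0.0:
        # No validation subject carries weight for this A_j.
        raise ValueError("sum of kernel values is zero")
    return float(np.sum(cond_dens * kernel_vals) / denom)

# Discrete A: validation values (0, 1, 1) and A_j = 1 give weights (0, 1, 1).
p_hat = estimated_density([0.9, 0.2, 0.4], [0, 1, 1])
```

Because $w_k$ cancels between numerator and denominator, the sketch works directly with the unnormalised kernel values.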
For time-varying covariates, we introduce the following notation. Let $M_i$ be an $n_i \times q$ matrix representing a generic time-varying quantity for subject $i$ (such as $X_i$ or $A_i$), and let $t_i$ be the vector of observation times for subject $i$, which we assume is discrete. Then $M_i[t_j]$ is an $n_j \times q$ matrix containing the rows of $M_i$ that correspond to the positions where the elements of $t_j$ are equal to the elements of $t_i$. For example, if $X_i = (1.2, 1.5, 1.3)'$, $t_i = (0, 1, 2)'$, and $t_j = (0, 2)'$, then $X_i[t_j] = (1.2, 1.3)'$. Note that $M_i[t_j]$ will have $n_j$ rows only if $t_j$ is a subset of $t_i$. In other words, a subject $i$ from the validation set can only contribute to the estimation of $\hat{P}(Y_j|Z_j)$ for a subject $j$ from the nonvalidation set if $t_j \subseteq t_i$. We incorporate this condition into $K_k(A, t)$, which is the time-dependent version of $K_k(A)$ from Eq. 3.3.
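The row-selection operation $M_i[t_j]$ can be sketched as below. The function name is hypothetical, and the sketch assumes the observation times within $t_i$ are unique; it returns `None` when $t_j \not\subseteq t_i$, in which case subject $i$ cannot contribute.

```python
import numpy as np

def subset_rows(M_i, t_i, t_j):
    """Return the rows of M_i observed at the times in t_j, or None if
    t_j is not a subset of t_i. Assumes the times in t_i are unique."""
    M_i, t_i, t_j = np.asarray(M_i), np.asarray(t_i), np.asarray(t_j)
    if not np.all(np.isin(t_j, t_i)):
        return None
    # Position within t_i of each observation time in t_j.
    pos = [int(np.flatnonzero(t_i == t)[0]) for t in t_j]
    return M_i[pos]

# The example from the text: X_i = (1.2, 1.5, 1.3)', t_i = (0, 1, 2)', t_j = (0, 2)'.
X_sub = subset_rows([1.2, 1.5, 1.3], [0, 1, 2], [0, 2])
```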
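Now we can define the estimate $\hat{P}(Y_j|Z_j)$ for a time-varying $X$ as
$$\hat{P}(Y_j|Z_j) = \frac{\sum_{i \in V} P(Y_j|X_i[t_j], Z_j)\,K_k(A, t)}{\sum_{i \in V} K_k(A, t)}. \tag{3.4}$$
If $A$ is time-independent, then $K_D(A, t) = I(A_i = A_j,\ t_j \subseteq t_i)$ and $K_C(A, t) = \Phi\left(\frac{A_i - A_j}{h}\right) I(t_j \subseteq t_i)$. If $A$ is time-varying, then $K_D(A, t) = I(A_i[t_j] = A_j,\ t_j \subseteq t_i)$ and $K_C(A, t) = \Phi\left(\frac{A_i[t_j] - A_j}{h}\right) I(t_j \subseteq t_i)$, where
$$\Phi\left(\frac{A_i[t_j] - A_j}{h}\right) = \Phi\left(\frac{A_i[t_{j1}] - A_j[t_{j1}]}{h}\right) \times \Phi\left(\frac{A_i[t_{j2}] - A_j[t_{j2}]}{h}\right) \times \cdots \times \Phi\left(\frac{A_i[t_{jn_j}] - A_j[t_{jn_j}]}{h}\right).$$
Then the estimated likelihood can be written as
$$\hat{L} = \prod_{i \in V} P_\beta(Y_i|X_i, Z_i) \prod_{j \in \bar{V}} \hat{P}_\beta(Y_j|Z_j). \tag{3.5}$$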
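For a time-varying continuous auxiliary variable, the kernel $K_C(A, t)$ is a product of kernel values over the observation times of subject $j$, multiplied by the indicator $I(t_j \subseteq t_i)$. The following minimal sketch assumes a Gaussian density for $\Phi$ and unique observation times; the function names are hypothetical. It also shows how the log of the estimated likelihood in Eq. 3.5 could be assembled from per-subject contributions that have already been computed.

```python
import numpy as np

def gaussian_kernel(u):
    """Standard normal density; an assumed choice for Phi."""
    u = np.asarray(u, dtype=float)
    return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

def kernel_tv_continuous(A_i, t_i, A_j, t_j, h):
    """K_C(A, t) for time-varying A: product of kernels over the times in t_j,
    or 0 when t_j is not a subset of t_i."""
    A_i, t_i, A_j, t_j = (np.asarray(v, dtype=float) for v in (A_i, t_i, A_j, t_j))
    if not np.all(np.isin(t_j, t_i)):
        return 0.0
    pos = [int(np.flatnonzero(t_i == t)[0]) for t in t_j]
    return float(np.prod(gaussian_kernel((A_i[pos] - A_j) / h)))

def log_estimated_likelihood(val_dens, nonval_est):
    """Log of Eq. 3.5, given P(Y_i | X_i, Z_i) for validation subjects and
    the estimates of P(Y_j | Z_j) for nonvalidation subjects."""
    val_dens = np.asarray(val_dens, dtype=float)
    nonval_est = np.asarray(nonval_est, dtype=float)
    return float(np.sum(np.log(val_dens)) + np.sum(np.log(nonval_est)))
```

Working on the log scale is a design choice made here for numerical stability; the text itself states the likelihood as a product.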
We maximize this estimated likelihood using a pseudo Newton-Raphson algorithm and show that doing so yields consistent and asymptotically normal estimates for the unknown parameters.
3.2.1. Practical Considerations for Continuous Auxiliary Variables
Use of the kernel density estimator for continuous auxiliary variables introduces two important factors for consideration. The first is the choice of bandwidth. Similar to Carroll and Wand (1991), we use an ad hoc method to select the bandwidth based on the validation data. Specifically, we calculate the bandwidth from the validation-set auxiliary variable using the method of Sheather and Jones (1991), which is implemented as bw.SJ in R (R Core Team, 2018).
A second consideration for continuous auxiliary variables is how to handle nonvalidation data at or beyond the edge of the validation data. Consider the denominator of $\hat{P}(Y_j|Z_j)$ in Eq. 3.3 and Eq. 3.4. If $A_j$ is outside or near the edge of the range of $A$ in the validation set, then $\Phi\left(\frac{A_i - A_j}{h}\right)$ will be small for all $i \in V$, and $\sum_{i \in V} K_C(A)$ or $\sum_{i \in V} K_C(A, t)$ will be close to zero. This can introduce bias and numerical instability into the estimate. Therefore, it is necessary to restrict the nonvalidation set to those subjects whose auxiliary values are interior to the auxiliary values in the validation set. The definition of the ‘interior’ nonvalidation set involves the usual trade-off between bias and variance. More restrictive thresholds on the nonvalidation auxiliary variable result in smaller bias but reduce the size of the ‘interior’ nonvalidation set, thereby increasing the variance. For our simulations in Section 3.4, we use the second and third quartiles of the validation set auxiliary values as thresholds for inclusion of subjects in the ‘interior’ nonvalidation set.
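One way to construct an ‘interior’ nonvalidation set is to keep only those nonvalidation subjects whose auxiliary value lies between two quantiles of the validation-set auxiliary values. The sketch below leaves the two quantile levels as explicit arguments rather than fixing them; the function name and the toy values are hypothetical.

```python
import numpy as np

def interior_mask(a_val, a_nonval, lower_q, upper_q):
    """Boolean mask of nonvalidation subjects whose auxiliary value lies
    between the lower_q and upper_q quantiles of the validation values."""
    lo, hi = np.quantile(np.asarray(a_val, dtype=float), [lower_q, upper_q])
    a_nonval = np.asarray(a_nonval, dtype=float)
    return (a_nonval >= lo) & (a_nonval <= hi)

# Toy values: validation quantiles at levels 0.25 and 0.75 are 1.0 and 3.0.
mask = interior_mask([0, 1, 2, 3, 4], [0.5, 1.0, 2.5, 3.5], 0.25, 0.75)
```

Narrowing the interval between the two levels shrinks the retained set, which illustrates the trade-off described above: fewer near-zero denominators, but fewer nonvalidation subjects contributing to the likelihood.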