CHAPTER 5: EFFICIENT SECONDARY ANALYSIS OF
5.3 Numerical Studies
5.3.1 Derivation of an IPW estimator for Secondary Analysis
In this subsection, we derive an inverse probability weighting (IPW) estimator as a competitor to $\hat{\xi}_P$. Although the general idea of a probability-weighted estimator is long-standing (Horvitz and Thompson, 1952), we develop it explicitly in the context of secondary analysis in two-phase studies. Later, in the simulation studies, we compare our proposed estimator $\hat{\xi}_P$ to the IPW estimator.
The IPW estimator $\hat{\beta}_{IPW}$ is a weighted least squares estimator in which each subject's weight is the inverse of its probability of selection into the validation sample. That is, we aim to minimize
$$\sum_{i \in V} \frac{1}{\pi_i} (Y_i - \beta_0 - \beta_1 X_i)^2 = \sum_{i=1}^{N} \frac{R_i}{\pi_i} (Y_i - \beta_0 - \beta_1 X_i)^2,$$
where $\pi_i$ is the selection probability for the $i$th subject, $V$ is the validation sample, and $R_i$ indicates whether subject $i$ belongs to $V$.
In the case-cohort design, $\pi_i = 1$ for cases, because all cases are included in the validation sample. For non-cases, $\pi_i = n_0/N$, the probability of being selected into the SRS portion of the validation sample.
In the two-phase SODS design, we use the observed selection probability for $\pi_i$. That is, for each subject who failed (each case), $\pi_i$ is the observed proportion of subjects sampled within that subject's stratum $A_k$. For non-cases, $\pi_i = n_0/N$.
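The weighted least squares criterion above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in this chapter; the function names `ipw_least_squares` and `case_cohort_pi` and their argument layout are assumptions made for the example.

```python
import numpy as np

def case_cohort_pi(delta, n0):
    """Selection probabilities under a case-cohort design:
    pi_i = 1 for cases, n0 / N for non-cases."""
    delta = np.asarray(delta, dtype=bool)
    return np.where(delta, 1.0, n0 / delta.size)

def ipw_least_squares(y, x, pi, in_validation):
    """Minimize sum_i R_i / pi_i * (y_i - b0 - b1 * x_i)^2 over (b0, b1).

    Only validation-sample rows (R_i = 1) enter the fit, so x may hold
    arbitrary placeholder values outside the validation sample.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    r = np.asarray(in_validation, dtype=bool)
    w = 1.0 / np.asarray(pi, dtype=float)[r]        # inverse selection probabilities
    design = np.column_stack([np.ones(r.sum()), x[r]])
    dtw = design.T * w                              # D' W without forming W explicitly
    # Weighted normal equations: (D' W D) beta = D' W y
    return np.linalg.solve(dtw @ design, dtw @ y[r])
```

Solving the weighted normal equations directly keeps the sketch dependency-free; a regression library that accepts observation weights would give the same point estimates.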
5.3.2 Simulation Studies
We carried out simulation studies to compare our proposed estimator $\hat{\xi}_P$ with competing estimators. First, we evaluated its finite-sample performance when the underlying assumptions are met. Then, we investigated its performance when the model is misspecified, for instance when the error term in linear model (5.2) is not normally distributed.
Three estimators are compared: (i) $\hat{\xi}_{SRS}$, obtained by fitting a linear regression of $Y$ given $X$ using only the SRS portion of the data; (ii) $\hat{\xi}_{IPW}$, the IPW estimator described in Subsection 5.3.1; and (iii) $\hat{\xi}_P$, our proposed restricted maximum likelihood estimator.
Our proposed estimator works for general two-phase studies with a time-to-event outcome. In this section, however, we focus on secondary analysis of data from the two-phase SODS design. The full cohort size is $N = 2500$, and the data were generated according to the following model:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon,$$
where $X_1 \sim N(0,1)$, $X_2 \sim \mathrm{Bernoulli}(0.6)$, and $\epsilon \sim N(0,1)$. We set $\beta_0 = 1$ and $\beta_2 = -0.5$, and vary $\beta_1$ to be either 0 or 0.5. The event time $\tilde{T}$ follows an exponential distribution with hazard function
$$\lambda(t) = \lambda_0(t) \exp\{\gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 Y\},$$
with $\lambda_0(t) = 1$, $\gamma_1 = \log(1.2)$, $\gamma_2 = -0.5$, and $\gamma_3 = 0.5$. The censoring time $C$ follows a uniform distribution on the interval $(0, c)$. The censoring percentages are approximately 60% and 80% when $c$ is 1 and 0.4, respectively.
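The data-generating mechanism described above can be sketched as follows. This is an illustrative sketch only: the function name `simulate_cohort`, the seed handling, and the returned dictionary layout are assumptions, and the code is not the simulation program used for Table 5.1.

```python
import numpy as np

def simulate_cohort(n=2500, beta1=0.5, c=1.0, seed=0):
    """Generate one full cohort from the linear model and the hazard model."""
    rng = np.random.default_rng(seed)
    beta0, beta2 = 1.0, -0.5
    gamma1, gamma2, gamma3 = np.log(1.2), -0.5, 0.5

    x1 = rng.normal(size=n)
    x2 = rng.binomial(1, 0.6, size=n)
    eps = rng.normal(size=n)
    y = beta0 + beta1 * x1 + beta2 * x2 + eps

    # With lambda_0(t) = 1 the hazard is constant in t, so the event time
    # is exponential with rate exp(gamma1*X1 + gamma2*X2 + gamma3*Y).
    rate = np.exp(gamma1 * x1 + gamma2 * x2 + gamma3 * y)
    event_time = rng.exponential(scale=1.0 / rate)

    censor_time = rng.uniform(0.0, c, size=n)
    t_obs = np.minimum(event_time, censor_time)        # observed time T
    delta = (event_time <= censor_time).astype(int)    # failure indicator
    return {"x1": x1, "x2": x2, "y": y, "t_obs": t_obs, "delta": delta}

cohort = simulate_cohort(c=1.0)
failure_rate = cohort["delta"].mean()   # empirical failure rate in this replicate
```

Changing `c` from 1 to 0.4 shortens the censoring window and therefore lowers the failure rate, which is how the two censoring levels in the text are obtained.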
The two-phase SODS design is often implemented as follows. First, a simple random sample of size $n_0$ is selected. Then all cases are partitioned into three strata, separated by the 0.3 and 0.7 quantiles of the failure times among the cases. Supplemental samples of sizes $n_1$ and $n_3$ are selected from the lower and upper strata, respectively. The validation sample has size $n_V = n_0 + n_1 + n_3$. We consider two settings: (i) $(n_0, n_1, n_3) = (400, 100, 100)$ and (ii) $(n_0, n_1, n_3) = (500, 50, 50)$.
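The sampling steps above, together with the observed selection probabilities of Subsection 5.3.1, can be sketched as below. Two points are assumptions of this sketch rather than statements from the text: the supplemental samples are drawn from cases not already in the SRS, and the function name `sods_sample` and its return values are hypothetical.

```python
import numpy as np

def sods_sample(t_obs, delta, n0, n1, n3, seed=0):
    """Draw a two-phase SODS validation sample and observed selection probabilities."""
    rng = np.random.default_rng(seed)
    t_obs = np.asarray(t_obs, dtype=float)
    case = np.asarray(delta).astype(bool)
    n = t_obs.size

    # Step 1: simple random sample of size n0 from the full cohort.
    selected = np.zeros(n, dtype=bool)
    selected[rng.choice(n, size=n0, replace=False)] = True

    # Step 2: partition cases by the 0.3 and 0.7 quantiles of their failure times.
    lo, hi = np.quantile(t_obs[case], [0.3, 0.7])
    stratum = np.zeros(n, dtype=int)            # 0 = non-case
    stratum[case] = 2                           # middle stratum
    stratum[case & (t_obs <= lo)] = 1           # lower stratum
    stratum[case & (t_obs > hi)] = 3            # upper stratum

    # Step 3: supplemental samples from the lower and upper strata
    # (assumption: drawn from cases not already selected by the SRS).
    for k, size in ((1, n1), (3, n3)):
        pool = np.flatnonzero((stratum == k) & ~selected)
        selected[rng.choice(pool, size=min(size, pool.size), replace=False)] = True

    # Observed selection probabilities: n0 / N for non-cases, and the
    # observed sampled fraction within each case stratum.
    pi = np.full(n, n0 / n)
    for k in (1, 2, 3):
        members = stratum == k
        pi[members] = selected[members].mean()
    return selected, stratum, pi
```

The returned `selected` and `pi` arrays play the roles of $R_i$ and $\pi_i$ in the weighted criterion of Subsection 5.3.1.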
Simulation results based on 1000 replications are presented in Table 5.1. We report the mean of the parameter estimates (Mean), the empirical standard deviation of the estimates (SD), the mean of the estimated standard deviations ($\widehat{SD}$), and the coverage of the 95% confidence interval (CI).
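The four reported quantities can be computed from the per-replication output as sketched below. The function name `summarize_replications` is hypothetical, and the use of Wald-type intervals (estimate plus or minus 1.96 estimated standard deviations) is an assumption of this sketch.

```python
import numpy as np

def summarize_replications(estimates, std_errors, true_value, z=1.959964):
    """Summarize one parameter across simulation replications."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    # A replication "covers" when the true value lies inside est +/- z * se.
    covered = np.abs(est - true_value) <= z * se
    return {
        "mean": est.mean(),          # Mean column
        "sd": est.std(ddof=1),       # empirical SD of the estimates
        "mean_se": se.mean(),        # mean of the estimated SDs
        "coverage": covered.mean(),  # 95% CI coverage
    }
```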
Table 5.1: Simulation results, secondary analysis for two-phase SODS design. The full cohort size is $N = 2500$. The error term is normally distributed.

| $(n_0, n_1, n_3)$ | True $\beta_1$ | Failure rate | Method | $\beta_1$ Mean | $\beta_1$ SD | $\beta_1$ $\widehat{SD}$ | $\beta_1$ CI | $\beta_2$ Mean | $\beta_2$ SD | $\beta_2$ $\widehat{SD}$ | $\beta_2$ CI |
|---|---|---|---|---|---|---|---|---|---|---|---|
| (400, 100, 100) | 0.0 | 20% | $\hat{\xi}_{SRS}$ | −0.001 | 0.050 | 0.050 | 0.954 | −0.499 | 0.101 | 0.102 | 0.951 |
| | | | $\hat{\xi}_{IPW}$ | −0.001 | 0.047 | 0.047 | 0.950 | −0.499 | 0.094 | 0.096 | 0.946 |
| | | | $\hat{\xi}_P$ | −0.004 | 0.040 | 0.040 | 0.945 | −0.477 | 0.078 | 0.076 | 0.946 |
| | | 40% | $\hat{\xi}_{SRS}$ | −0.003 | 0.052 | 0.050 | 0.938 | −0.500 | 0.105 | 0.102 | 0.936 |
| | | | $\hat{\xi}_{IPW}$ | −0.003 | 0.047 | 0.045 | 0.934 | −0.499 | 0.093 | 0.092 | 0.948 |
| | | | $\hat{\xi}_P$ | −0.005 | 0.042 | 0.040 | 0.938 | −0.485 | 0.078 | 0.075 | 0.939 |
| | 0.5 | 20% | $\hat{\xi}_{SRS}$ | 0.499 | 0.051 | 0.050 | 0.943 | −0.502 | 0.106 | 0.103 | 0.942 |
| | | | $\hat{\xi}_{IPW}$ | 0.499 | 0.047 | 0.047 | 0.943 | −0.502 | 0.099 | 0.096 | 0.939 |
| | | | $\hat{\xi}_P$ | 0.480 | 0.036 | 0.035 | 0.915 | −0.467 | 0.082 | 0.079 | 0.925 |
| | | 40% | $\hat{\xi}_{SRS}$ | 0.502 | 0.050 | 0.050 | 0.957 | −0.501 | 0.107 | 0.102 | 0.931 |
| | | | $\hat{\xi}_{IPW}$ | 0.502 | 0.045 | 0.045 | 0.954 | −0.500 | 0.096 | 0.092 | 0.936 |
| | | | $\hat{\xi}_P$ | 0.489 | 0.035 | 0.034 | 0.932 | −0.482 | 0.081 | 0.079 | 0.941 |
| (500, 50, 50) | 0.0 | 20% | $\hat{\xi}_{SRS}$ | 0.000 | 0.044 | 0.045 | 0.959 | −0.497 | 0.089 | 0.091 | 0.960 |
| | | | $\hat{\xi}_{IPW}$ | 0.000 | 0.042 | 0.043 | 0.961 | −0.498 | 0.085 | 0.087 | 0.958 |
| | | | $\hat{\xi}_P$ | 0.000 | 0.040 | 0.040 | 0.960 | −0.495 | 0.077 | 0.076 | 0.946 |
| | | 40% | $\hat{\xi}_{SRS}$ | 0.000 | 0.044 | 0.045 | 0.947 | −0.501 | 0.094 | 0.092 | 0.944 |
| | | | $\hat{\xi}_{IPW}$ | 0.000 | 0.042 | 0.042 | 0.948 | −0.502 | 0.088 | 0.086 | 0.939 |
| | | | $\hat{\xi}_P$ | −0.001 | 0.040 | 0.040 | 0.945 | −0.498 | 0.081 | 0.075 | 0.923 |
| | 0.5 | 20% | $\hat{\xi}_{SRS}$ | 0.501 | 0.045 | 0.045 | 0.955 | −0.505 | 0.093 | 0.091 | 0.950 |
| | | | $\hat{\xi}_{IPW}$ | 0.501 | 0.043 | 0.043 | 0.950 | −0.504 | 0.088 | 0.087 | 0.957 |
| | | | $\hat{\xi}_P$ | 0.494 | 0.035 | 0.034 | 0.947 | −0.494 | 0.081 | 0.079 | 0.943 |
| | | 40% | $\hat{\xi}_{SRS}$ | 0.500 | 0.046 | 0.045 | 0.944 | −0.496 | 0.090 | 0.091 | 0.957 |
| | | | $\hat{\xi}_{IPW}$ | 0.501 | 0.044 | 0.042 | 0.943 | −0.497 | 0.085 | 0.085 | 0.948 |
| | | | $\hat{\xi}_P$ | 0.497 | 0.037 | 0.034 | 0.935 | −0.493 | 0.078 | 0.079 | 0.948 |

The results are based on the models $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon$ and $\lambda(t) = \lambda_0(t) \exp\{\gamma_1 X_1 + \gamma_2 X_2 + \gamma_3 Y\}$, where $\epsilon \sim N(0,1)$, $X_1 \sim N(0,1)$, $X_2 \sim \mathrm{Bernoulli}(0.6)$; the true parameters are $\beta_0 = 1$, $\beta_2 = -0.5$, $\gamma_1 = \log(1.2)$, $\gamma_2 = -0.5$, $\gamma_3 = 0.5$, $\lambda_0(t) = 1$. $\hat{\xi}_{SRS}$, $\hat{\xi}_{IPW}$, $\hat{\xi}_P$ are defined in Section 5.3.2. SD, standard deviation; $\widehat{SD}$, mean of the estimated standard deviation; CI, 95% confidence interval coverage.

From Table 5.1, we draw the following conclusions. (i) All three estimators are virtually unbiased. The largest bias for our proposed estimator is about 5% of the true effect size. (ii) Whether estimating $\beta_1$ or $\beta_2$, our proposed estimator $\hat{\xi}_P$ is the most efficient in all settings. For example, when $(n_0, n_1, n_3) = (500, 50, 50)$, $\beta_1 = 0.5$, and the failure rate is 20% (80% censoring), the empirical standard deviation for estimating $\beta_1$ is 0.035 for $\hat{\xi}_P$, which is smaller than 0.043 for $\hat{\xi}_{IPW}$ and 0.045 for $\hat{\xi}_{SRS}$. The efficiency gain arises because our proposed estimator takes a likelihood approach that incorporates the available information $(T, \Delta, Y)$ in the full cohort. (iii) For all estimators, the average of the estimated standard deviations is quite close to the empirical standard deviation (i.e., $\widehat{SD}$ is close to SD). (iv) The 95% confidence interval coverage is approximately 0.95, which suggests that the asymptotic normal approximation works well in these settings. (v) When the SRS sample size is larger ($n_0 = 500$ versus $n_0 = 400$), the empirical bias of $\hat{\xi}_P$ is smaller. The reason is that we used the SRS sample to estimate the nuisance function $\Lambda_0(t)$; the smaller the SRS sample, the more likely this step is to introduce bias. (vi) The efficiency gain of $\hat{\xi}_P$ over $\hat{\xi}_{SRS}$ is smaller when the SRS takes up a larger proportion of the validation sample (i.e., when $n_0/n_V$ is larger).
We further assess the performance of our proposed estimator under different allocations of the SRS and supplemental samples. The simulation setup is similar to that of Table 5.1, except that we now fix $\beta_1 = 0.5$ and the failure rate at 40%, and vary the size of the simple random sample.
Figure 5.1 shows the sample relative efficiency (SRE) of $\hat{\xi}_P$ relative to $\hat{\xi}_{SRS}$, and of $\hat{\xi}_P$ relative to $\hat{\xi}_{IPW}$, for estimating $\beta_1$. The sample relative efficiency is defined as $SRE_{P:SRS} = \mathrm{var}(\hat{\xi}_{SRS})/\mathrm{var}(\hat{\xi}_P)$, with $SRE_{P:IPW}$ defined analogously. The plot confirms that the efficiency gain of $\hat{\xi}_P$ over $\hat{\xi}_{SRS}$ and $\hat{\xi}_{IPW}$ is smaller when the SRS sample takes up a larger portion of the validation sample. When the SRS proportion exceeds 90%, the efficiency gain is fairly small. This indicates that when $n_0/n_V$ is very large, we can simply use the naive SRS analysis or an IPW-type estimator without losing much statistical efficiency.
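As a worked illustration of the SRE definition, the ratio can be computed either from replicated estimates or from reported empirical standard deviations. The function names below are hypothetical. The numerical example uses the empirical SDs for $\beta_1$ reported in Table 5.1 for $(n_0, n_1, n_3) = (500, 50, 50)$, $\beta_1 = 0.5$, and a 20% failure rate; this is not one of the settings plotted in Figure 5.1.

```python
import numpy as np

def sample_relative_efficiency(est_reference, est_proposed):
    """SRE = var(reference estimator) / var(proposed estimator), across replications."""
    return np.var(est_reference, ddof=1) / np.var(est_proposed, ddof=1)

def sre_from_sds(sd_reference, sd_proposed):
    """Same ratio, starting from empirical standard deviations."""
    return (sd_reference / sd_proposed) ** 2

# Empirical SDs for beta_1 taken from Table 5.1 (see the lead-in for the setting).
sre_p_srs = sre_from_sds(0.045, 0.035)   # about 1.65
sre_p_ipw = sre_from_sds(0.043, 0.035)   # about 1.51
```

A value above 1 means the proposed estimator has the smaller variance in that setting.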
Finally, we investigate scenarios in which the error term in (5.2) is not normally distributed. The simulation setup is the same as before, except for the error term. We assume that $\epsilon$ follows a gamma distribution with shape parameter 2 or 5 and rate parameter 1, normalized to have mean 0 and variance 1. The resulting error term is right-skewed.
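The normalized gamma error can be generated as sketched below, using the fact that a gamma variable with shape $k$ and rate $r$ has mean $k/r$ and variance $k/r^2$. The function name `standardized_gamma_errors` is an assumption of this sketch.

```python
import numpy as np

def standardized_gamma_errors(shape, size, rate=1.0, seed=0):
    """Gamma(shape, rate) draws, centered and scaled to mean 0 and variance 1."""
    rng = np.random.default_rng(seed)
    g = rng.gamma(shape, scale=1.0 / rate, size=size)
    mean = shape / rate            # E[G] = k / r
    sd = np.sqrt(shape) / rate     # sd(G) = sqrt(k) / r
    return (g - mean) / sd

eps_shape2 = standardized_gamma_errors(2, size=2500)
eps_shape5 = standardized_gamma_errors(5, size=2500)
```

Standardizing does not change the skewness, which for a gamma distribution equals $2/\sqrt{k}$: roughly 1.41 for shape 2 and 0.89 for shape 5, so the shape-5 error is the closer of the two to a normal distribution.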
[Figure 5.1: SREs comparing $\hat{\xi}_P$ to $\hat{\xi}_{SRS}$ and $\hat{\xi}_{IPW}$ in terms of estimating $\beta_1$, under various ... Horizontal axis: proportion of SRS sample in the validation sample. Vertical axis: SRE. Curves: $SRE_{P:SRS}$ and $SRE_{P:IPW}$.]
(i) When the true $\beta_1 = 0$, $\hat{\xi}_P$ performs quite well in terms of estimating $\beta_1$: it is the most efficient estimator, and its CI coverage rate is close to 0.95. (ii) When the true $\beta_1 \neq 0$, $\hat{\xi}_P$ has larger bias, while $\hat{\xi}_{SRS}$ and $\hat{\xi}_{IPW}$ perform relatively well. This is because our proposed estimator relies more heavily on the normality assumption, whereas the least squares based approaches (SRS and IPW) do not explicitly use the normality assumption on the error term. (iii) The empirical bias of $\hat{\xi}_P$ is smaller when the failure rate is larger. The smaller bias results in an improved coverage rate of the 95% confidence interval. (iv) When the shape parameter is changed from 2 to 5, the empirical bias of $\hat{\xi}_P$ is smaller. This is expected, because the normalized gamma distribution with shape parameter 5 is closer to the normal distribution.
In practice, we could first use the SRS portion of the data to fit a linear regression of $Y$ given $X$ and check the normality assumption. If the evidence suggests that the normality assumption is violated, a variable transformation should be applied before using our proposed method.
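One informal way to carry out this check is to fit the regression on the SRS portion and inspect the moments of the residuals. The sketch below is an assumption about how such a check might look, not a procedure prescribed in this chapter; the function name `srs_residual_diagnostics` is hypothetical, and a formal normality test or a Q-Q plot could be used instead.

```python
import numpy as np

def srs_residual_diagnostics(y, X):
    """Fit Y on X by ordinary least squares and summarize the residual shape.

    y and X are assumed to contain only the SRS portion of the data.
    """
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(y.size), np.asarray(X, dtype=float)])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    z = (resid - resid.mean()) / resid.std()
    return {
        "coef": coef,
        "skewness": np.mean(z ** 3),               # near 0 under normality
        "excess_kurtosis": np.mean(z ** 4) - 3.0,  # near 0 under normality
    }
```

Residual skewness well away from zero, as would arise with the gamma errors considered above, is a signal to consider transforming $Y$ before applying the proposed method.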