Let there be ๐๐ subjects observed during time 0 to ๐๐. ๐๐๐๐๐๐ is the outcome measured for the ๐๐-th subject at time ๐ก๐ก๐๐๐๐ (0 โค ๐๐ โค ๐๐). ๐๐๐๐๐๐ is the covariate for the ith subject at time ๐ก๐ก
๐๐๐๐ (0 โค ๐๐ โค ๐๐). Because the outcome is measured less frequently than the covariate, for each subject, when k=j both outcome and covariate are measured, and when there are no matching j for k only covariate is measured. The parametric or nonparametric mixed model cannot be applied to this type of data
directly because of the inconsistent measurement. We introduce a three-step estimation method to explore this type of data, which enables the use of all data with no loss of information by discarding or modifying the extra covariate measurements.
3.2.1 Step one โ insert pseudo data
At time ๐ก๐ก๐๐๐๐ when only the covariate are measured, pseudo data points of the outcome will be inserted for every subject to create a new dataset. At times ๐ก๐ก๐๐๐ฃ๐ฃ and ๐ก๐ก๐๐๐๐ (0 โค ๐ฃ๐ฃ < ๐ข๐ข โค ๐๐)both the outcome and the covariate are measured, and between these two time points ๐ก๐ก๐๐๐๐ (๐ฃ๐ฃ < ๐๐ < ๐ข๐ข), only covariate are measured. We assume that the change of outcome from time ๐ก๐ก๐๐๐ฃ๐ฃ to ๐ก๐ก๐๐๐๐ is linear. By solving for ai and bi in function below we will get the straight line from ๐๐๐๐๐ฃ๐ฃ to ๐๐๐๐๐๐ for
subject ๐๐.
๏ฟฝ๐๐๐๐๐ฃ๐ฃ = ๐๐๐๐ ๐ก๐ก๐๐๐ฃ๐ฃ+ ๐๐๐๐
๐๐๐๐๐๐ = ๐๐๐๐ ๐ก๐ก๐๐๐๐+ ๐๐๐๐ (3.2.1a) ๐๐๐๐๐๐ = ๐๐๐๐๐ก๐ก๐๐๐๐ + ๐๐๐๐ (3.2.1b) Then by substituting ๐ก๐ก๐๐๐๐ (๐ฃ๐ฃ < ๐๐ < ๐ข๐ข) into the function above we get pseudo data ๐๐๐๐๐๐ that matches real measurements ๐๐๐๐๐๐ between times ๐ก๐ก๐๐๐ฃ๐ฃ and (๐ก๐ก๐๐๐๐0 โค ๐ฃ๐ฃ < ๐ข๐ข โค ๐๐) for subject ๐๐. This procedure is done repeatedly in the same way between all of the adjacent time points tij for each subject to get all of the pseudo data of Y at the time points when only the covariate is recorded. After this step, a new dataset (๐ก๐ก๐๐๐๐,๐๐๐๐๐๐, ๐๐๐๐๐๐) is created.
Missing data are common in longitudinal studies. If data are missing for subject ๐๐ at time ๐ก๐ก๐๐๐ฃ๐ฃ or ๐ก๐ก๐๐๐๐ (0 โค ๐ฃ๐ฃ < ๐ข๐ข โค ๐๐), there is no way to insert data for Y between these two time points using functions (3.4.1a) and (3.4.1b). Missing data will be inserted as pseudo data for Y in this situation.
3.2.2 Step two โ raw estimates
For subjects at each time ๐ก๐ก๐๐๐๐ (0 โค ๐๐ โค ๐๐), a standard linear model (3.2.2) is fitted to get the raw estimates and standard errors (๐ต๐ต๐๐๐๐) of ฮฒ
Y(๐ก๐ก๐๐) = (X(๐ก๐ก๐๐) ฮฒ(๐ก๐ก๐๐) + e(๐ก๐ก๐๐), (3.2.2) where e(๐ก๐ก๐๐) is the error term. Here, sample sizes of the local linear regressions will not be the same, due to missing data.
3.2.3 Step three โ smooth raw estimates
Local polynomial smoothing will be used to smooth the raw estimates from step two. This step is a modification of local polynomial kernel smoothing. Rather than using kernel functions to assign weights during smoothing, only the analytical weights that indicate the importance of the raw estimates will be used.
Let ๐๐ be the degree of the polynomial being fit. At time point ๐ก๐ก๐๐(0 โค ๐๐ โค ๐๐), the smoothed estimate ฮฒsmooth(k) are obtained by smoothing raw estimates in the local neighborhood
๐ผ๐ผโ (๐ก๐ก๐๐) = [๐ก๐ก๐๐โ โ, ๐ก๐ก๐๐+ โ]. By using only analytical weight ๐๐, the smoothed estimate ฮฒsmooth(k) at
๐ก๐ก๐๐ is the value of estimated ๐ฝ๐ฝฬ0, where ๐ท๐ท๏ฟฝ = (๐ฝ๐ฝฬ0, ยทยทยท,๐ฝ๐ฝฬ๐๐) minimizes
โ๐๐๐๐=0๏ฟฝ๐ฝ๐ฝ๐๐๐๐๐๐(๐๐)โ ๐ฝ๐ฝ0โ ๐ฝ๐ฝ1(๐ก๐ก๐๐โ ๐ก๐ก) โ ยทยทยท โ ๐ฝ๐ฝ๐๐(๐ก๐ก๐๐โ ๐ก๐ก)๐๐๏ฟฝ2๐๐. Weighted least squares theory leads to the solution
๐ฝ๐ฝฬ = (๐ก๐กฬ๐๐ ๐๐ ๐ก๐กฬ )โ1๐ก๐กฬ๐๐ ๐๐ ๐ฝ๐ฝ๏ฟฝ ๐๐๐๐๐๐ where ๐ท๐ท๏ฟฝ๐๐๐๐๐๐ is the vector of raw estimates from step two, and
๐๐๏ฟฝ = ๏ฟฝ 1โฎ 1 t1โ ๐ก๐ก๐๐ โฎ tnโ ๐ก๐ก๐๐ โฆ โฑ โฆ (t1โ ๐ก๐ก๐๐)p โฎ (tnโ ๐ก๐ก๐๐)p ๏ฟฝ
is an ๐๐ ร (๐๐ + 1) design matrix. ๐พ๐พ is an ๐๐ ร ๐๐ diagonal matrix of analytical weights given by ๐พ๐พ = ๐๐๐๐๐๐๐๐{๐ค๐ค1 , โฆ , ๐ค๐ค๐๐}.
Because pseudo data were inserted in step one, the raw estimates ๐ฝ๐ฝ๐๐๐๐๐๐ (๐๐) do not have the same accuracy. The most accurate raw estimates will the ones that were estimated by fitting local linear regressions at ๐ก๐ก๐๐ (๐๐ = ๐๐) where ๐๐๐๐๐๐ and ๐๐๐๐๐๐ were both measured, so the highest analytical weights are given to those raw estimates during smoothing. The raw estimates from the local linear regressions that used pseudo data but are close to the time of the real measurement are also given higher analytical weights than the raw estimates that are far from the time of real measurement. The measure of the time distance between a raw estimate at time tk using pseudo data and the adjacent raw estimate using data of real measurement will be defined as
๐ท๐ท๐๐ = ๐๐๐๐๐๐ (|๐ก๐ก๐๐โ ๐ก๐ก๐๐๐๐๐๐๐ก๐ก ๐๐๐๐๐๐๐๐ ๐๐๐๐๐๐๐๐๐๐๐๐๐๐| + 1, |๐ก๐ก๐๐โ tnext real measure| + 1).
For example, at times ๐ก๐ก๐๐๐ฃ๐ฃ and ๐ก๐ก๐๐๐๐ (0 โค v < u โค T)both outcome and covariate are measured and between these two time points ๐ก๐ก๐๐๐๐ (v < k < u), only the covariate is measured. ๐ท๐ท๐ฃ๐ฃ and ๐ท๐ท๐๐ will equal 1 and ๐ท๐ท๐๐ will be calculated as
๐ท๐ท๐๐= ๐๐๐๐๐๐ (|tkโ tv| + 1, |๐ก๐ก๐๐โ ๐ก๐ก๐๐| + 1).
We define the analytical weights in four different ways using ๐ท๐ท๐๐ and the standard error to reflect the importance of the raw estimate.
Type 1: ๐ค๐ค๐๐= 1 ๏ฟฝ๐ท๐ท๐๐ Type 2: ๐ค๐ค๐๐= 1 ๐ท๐ท๐๐ Type 3: ๐ค๐ค๐๐= 1 ๏ฟฝ๐ท๐ท๐๐ + 1 ๏ฟฝ๐๐๐๐๐๐ 23
Type 4: ๐ค๐ค๐๐= 1
๐ท๐ท๐๐ +
1 ๐ต๐ต๐๐๐๐
The performance of local polynomial smoothing also depends on the values chosen for bandwidth โ and order of polynomial ๐๐, but how to choose the best โ and ๐๐ for the proposed model is not the focus of this paper. We will use the value ๐๐=3 as recommended by Wand and Jones (1995) as this has been shown to be adequate. For data that are not equally spaced choosing certain bandwidth will result in less data that fall into the window where the data are sparser. In this situation, we will use a proportion of the data to which we will fit a local polynomial smoother like LOWESS. Because useful values of the smoothing parameter typically lie in the range 0.25 to 0.5 for most LOWESS applications, we will use 0.3.