This may be the author’s version of a work that was submitted/accepted for publication in the following source:
Fu, Liya, Yang, Zhuoran, Cai, Fengjing, &
Wang, You Gan(2021)
Efficient and doubly-robust methods for variable selection and parameter estimation in longitudinal data analysis.
Computational Statistics, 36(2), 781–804.
This file was downloaded from:
https://eprints.qut.edu.au/206696/
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
This work is covered by copyright. Unless the document is being made available under a Creative Commons Licence, you must assume that re-use is limited to personal use and that permission from the copyright owner must be obtained for all other uses. If the document is available under a Creative Commons License (or other specified license) then refer to the Licence for details of permitted re-use. It is a condition of access that users recognise and abide by the legal requirements associated with these rights. If you believe that this work infringes copyright please provide details by email to [email protected]
License: Creative Commons: Attribution-Noncommercial 4.0
Notice: Please note that this document may not be the Version of Record
(i.e. published version) of the work. Author manuscript versions (as Submitted for peer review or as Accepted for publication after peer review) can be identified by an absence of publisher branding and/or typeset appearance. If there is any doubt, please refer to the published source.
https://doi.org/10.1007/s00180-020-01038-3
Computational Statistics
Efficient and doubly-robust methods for variable selection and parameter estimation in longitudinal data analysis
--Manuscript Draft--
Manuscript Number: COST-D-20-00043R1
Full Title: Efficient and doubly-robust methods for variable selection and parameter estimation in longitudinal data analysis
Article Type: Original Paper
Keywords: Correlated data; Outliers; Rank-based method; Variable selection.
Manuscript Classifications: 1.02460: Longitudinal Data; 1.03900: Robust Estimation; 1.04920: Variable Selection
Corresponding Author: You-Gan Wang, D.Phil
Queensland University of Technology, Brisbane, QLD, AUSTRALIA
Corresponding Author Secondary Information:
Corresponding Author's Institution: Queensland University of Technology
Corresponding Author's Secondary Institution:
First Author: Liya Fu
First Author Secondary Information:
Order of Authors: Liya Fu; Zhuoran Yang; Fengjing Cai; You-Gan Wang, D.Phil
Order of Authors Secondary Information:
Funding Information:
the National Science Foundation of China (11871390): Dr. Liya Fu
Natural Science Foundation of Shaanxi Province (2018JQ1006): Dr. Liya Fu
the Fundamental Research Funds for the Central Universities (xjj2017180): Dr. Liya Fu
the Australian Research Council Discovery Project (DP160104292): Professor You-Gan Wang
Zhejiang Science Grant (KZS1905002): Dr Fengjing Cai
Springer Journals Editorial Office Computational Statistics
Dear Editor,
Efficient and doubly-robust methods for variable selection and parameter estimation in longitudinal data analysis (COST-D-20-00043)
We are pleased to receive your decision letter regarding the above manuscript. We appreciate that the two Reviewers are very positive about our paper. We have made a major revision with significant extra effort, which, we believe, takes into account the comments and suggestions from the Associate Editor and the reviewers.
Enclosed please find our point-by-point responses to the comments and suggestions from the reviewers.
We hope that the revised version is suitable for publication in Computational Statistics.
Thank you again for your consideration. We look forward to hearing your decision soon.
Sincerely,
Liya Fu & Zhuoran Yang & Fengjing Cai & You-Gan Wang
Authors' Response to Reviewers' Comments
Response to the Associate Editor’s comments (COST-D-20-00043)
The manuscript has been reviewed by two experts. They are both positive about your work, but I agree with the second reviewer that the existing literature should be properly cited and a thorough comparison with alternative rank based methods is needed.
We are thankful for the effort you have devoted to our work. We now provide point-by-point responses below, summarizing how we revised the manuscript.
Response to the Reviewer #1’s comments (COST-D-20-00043)
This paper developed efficient and robust methods for variable selection and parameter estimation in longitudinal data analysis. Asymptotic properties of the proposed method are studied and the algorithm can be conveniently implemented through existing functions in the R software.
Simulation studies demonstrate the good performance against outliers. Overall, this paper is well written.
We are grateful for your constructive suggestion and appreciation of the potential value of our work.
A minor comment is listed in the following:
1. Why is the proposed method doubly-robust? It is suggested to give some illustration.
Thanks for pointing this out. The method is robust against outliers in the response variable and also robust against outliers in the covariates. We have now given some illustration on page 3, paragraph 2.
Response to the Reviewer 2’s comments (COST-D-20-00043)
This paper introduces a weighted rank-based variable selection method by combining with the adaptive lasso. Asymptotic properties such as well-known oracle property are also established.
It provides an interesting methodology which tackles modeling the longitudinal data with rank regression. Here are detailed comments.
We are grateful for your constructive suggestions and appreciation of the potential value of our work.
1. In page 4, the weight $w_i$ is taken as $w_i = 1/(1 + (n_i - 1)\rho)$. Authors only give a reference and I think detailed clarifications should be added to explain the reason. Are there other approaches to select the weight $w_i$? In addition, on line 43 of this page, authors also report an estimation of $\rho$ (defined as $\hat\rho$). It is a natural question to know whether $\hat\rho$ is consistent. Authors should give some explanations.
Thanks for pointing these out. The weight function $w_i = 1/(1 + (n_i - 1)\hat\rho)$ was proposed by Wang and Zhao (2008, Biometrics). The estimator $\hat\rho$ of the correlation parameter is consistent and was also proposed by Wang and Carey (2003, Biometrika) in the GEE setting. Simulation results of Wang and Zhao (2008, Biometrics) indicate that the weighted rank method based on $w_i = 1/(1 + (n_i - 1)\hat\rho)$ is efficient and robust, and hence we propose using this weight and this estimate of $\rho$. We have now added more explanations for choosing the weight $w_i$ and the estimate of $\rho$.
2. There are some existing references to study rank regression for longitudinal data analysis.
For example, Jung et al. (2003), Wang and Zhao (2008), Fu et al. (2010), Fu and Wang (2012) and Fu and Wang (2018). Thus, authors should state the difference and relation of the manuscript and above existing references in the introduction. Moreover, authors should compare the proposed method with existing approaches in simulations and real data analysis.
Fu and Wang (2018) also considered variable selection problem in longitudinal data analysis by using rank regression. At least, it is necessary to consider the methods of Fu et al. (2010), Fu and Wang (2012) and Fu and Wang (2018) in numerical studies.
Thanks for pointing these out. We have now stated the difference and relationship between the manuscript and the above existing references in the introduction. The methods of Fu et al.
(2010) and Fu and Wang (2012) only deal with parameter estimation in rank regression and is
irrelevant variables are added in real data analysis to further demonstrate the variable selection performance of the methods. However, the number of additional irrelevant variables is too small. Moreover, it is more reasonable to repeat the process at least 500 times to evaluate the performance of variable selection, as the additional irrelevant variables are randomly drawn from the standard normal distribution.
Thanks for pointing this out. We have now corrected this error and removed $\mathrm{MSE}_{cv}$ from Table 5. Sorry about this confusion. Furthermore, we have now added 20 irrelevant variables randomly sampled from the standard normal distribution and repeated the analysis 500 times to evaluate the performance of the proposed method. The results are now presented in the lower panel of Table 5.
Computational Statistics manuscript No.
(will be inserted by the editor)
Efficient and doubly-robust methods for variable selection and parameter estimation in longitudinal data analysis
Liya Fu · Zhuoran Yang · Fengjing Cai · You-Gan Wang
Received: date / Accepted: date
Abstract New technologies have produced increasingly complex and massive datasets, such as next generation sequencing and microarray data in biology, dynamic treatment regimes in clinical trials and long-term wide-scale studies in the social sciences. Each study exhibits its unique data structure within individuals, clusters and possibly across time and space. In order to draw valid conclusions from such large dimensional data, we must account for intracluster correlations, varying cluster sizes, and outliers in response and/or covariate domains to achieve valid and efficient inferences. A weighted rank-based method is proposed for selecting variables and estimating parameters simultaneously. The main contribution of the proposed method is fourfold: (i) variable selection using adaptive lasso is extended to robust rank regression so that protection against outliers in both response and predictor variables is obtained; (ii) within-subject correlations are incorporated so that efficiency of parameter estimation is improved; (iii) the computation is convenient via the existing function in statistical software R; (iv) the proposed method is proved to have desirable asymptotic properties for a fixed number of covariates (p). Simulation studies are carried out to evaluate the proposed method for a number of scenarios including the cases when p equals the number of subjects. The simulation results indicate that the proposed method is efficient and robust. A hormone dataset is analyzed for illustration. By adding additional redundant variables as covariates, the penalty approach and weighting schemes are shown to be effective.
L. Fu
School of Mathematics and Statistics, Xi’an Jiaotong University, China
Tel.: +86-29-82663004
E-mail: [email protected]
Z. Yang
School of Mathematics and Statistics, Xi’an Jiaotong University, China
E-mail: [email protected]
Corresponding author: F. Cai
College of Mathematics, Wenzhou University, China
E-mail: [email protected]
Corresponding author: Y-G. Wang
School of Mathematical Science, Queensland University of Technology, Australia
E-mail: [email protected]
Keywords Correlated data · Outliers · Rank-based method · Variable selection
1 Introduction
Longitudinal data are commonly used in economics, medical studies, and environmental research. A large number of covariates are often collected in longitudinal studies. The inclusion of redundant variables can reduce the accuracy and efficiency of parameter estimation. Therefore, it is important to select the appropriate covariates in analyzing longitudinal data. However, it is a challenge to select significant variables in longitudinal data due to underlying correlations and unavailable likelihood. Fan and Li (2004) provided a penalized weighted least-squares approach for variable selection in a semiparametric model in longitudinal data analysis. Ni et al. (2010) proposed a double-penalized Gaussian likelihood approach for simultaneous model selection and parameter estimation in a semiparametric mixed model for longitudinal data. Wang et al. (2012) and Cho and Qu (2013) considered the penalized generalized estimating equations (GEE) (Liang and Zeger, 1986) and the penalized quadratic inference functions through smoothly clipped absolute deviation (SCAD) (Fan and Li, 2001) with high-dimensional covariates. All the methods mentioned above are essentially based on weighted least squares (WLS); thus, they are sensitive to outliers.
In longitudinal studies, the collected data often deviate from normality, and the response variable and/or covariates may contain potential outliers, which often results in serious problems for variable selection and parameter estimation. Therefore, robust methods have attracted much attention in recent years. However, the literature on variable selection that is robust against outliers in the response and/or covariates for longitudinal data is quite limited. Fan et al. (2012) and Guo et al. (2014) constructed a penalized robust GEE approach by applying Huber’s score function to the standardized residuals in linear regression models and the semiparametric mean-covariance regression models for longitudinal data, respectively. Lv et al. (2015) utilized a bounded exponential score function (Wang et al., 2013) in the GEE framework to choose variables and estimate parameters. However, both Huber’s score function and the exponential score function require specifying a tuning parameter to control the level of robustness at the cost of efficiency loss.
The well-known rank-based method has many beneficial properties; for example, it is robust and distribution free. Jung and Ying (2003) and Fu et al.
(2010) studied rank regression with longitudinal data under an independence assumption. Wang and Zhao (2008) considered a weighted rank method to take account of correlations and varying cluster sizes. Fu and Wang (2012) constructed generalized estimating equations based on the rank method under an exchangeable correlation structure assumption. The above methods cannot be used to select variables and can be sensitive to outliers in covariates. As far as we know, studies on variable selection based on ranks are rather limited due to mathematical challenges and computational complexity.
Wang and Li (2009) and Yang et al. (2015) proposed weighted rank-based methods penalized by SCAD for automatic variable selection and parameter estimation, which are applicable to independent data only. Xu et al. (2010) utilized the rank-based method to choose variables in an accelerated failure time (AFT) model. Fu and Wang (2018) considered the rank-based method based on the independence assumption for variable selection, and the method is robust only to outliers in the response variable. In this paper, we extend the rank-based method to longitudinal data and propose a weighted rank-based method for selecting covariates and estimating parameters based on the Wilcoxon dispersion function penalized by SCAD. The new method is robust against outliers in the response and heavy-tailed distributions. It is also robust against leverage points in covariates. Furthermore, the proposed method is effective because it incorporates intracluster correlations and varying cluster sizes, and it has the oracle properties. The penalized weighted dispersion function can be minimized conveniently via an existing function in the statistical software R.
The rest of the paper is organized as follows: the proposed method is presented in Section 2. Simulation studies are carried out to evaluate the performance of the proposed method in Section 3. The data from the longitudinal hormone study is used to illustrate the proposed method in Section 4. Finally, our conclusions are drawn. The proof of the oracle properties is shown in the Appendix.
2 Methodology
Suppose $(Y_{ik}, X_{ik})$ are the observed response and predictors at the $k$-th time point from the $i$-th subject or cluster, where $k = 1, \cdots, n_i$ and $i = 1, \cdots, N$. Assume that observations from different subjects or clusters are independent and observations from the same subject or cluster are correlated. Here $n_i$ is often referred to as the cluster size. Consider the following linear regression model:
\[ Y_{ik} = \beta_0 + X_{ik}^T \beta + \epsilon_{ik}, \tag{1} \]
in which $\beta_0$ is the intercept, $\beta = (\beta_1, \cdots, \beta_p)^T$ is a $p$-dimensional unknown parameter vector, and $\epsilon_{ik}$ is an error term. Suppose that the median of $\epsilon_{ik} - \epsilon_{jl}$ is zero when $i \neq j$, and $\epsilon_{i1}, \cdots, \epsilon_{in_i}$ are correlated. We partition $\beta$ as $(\beta_{10}^T, \beta_{20}^T)^T$ with $\beta_{10} \in R^d$ and $\beta_{20} \in R^{p-d}$. Suppose that the true parameter values in model (1) are $\beta^{*} = (\beta_{10}^{*T}, 0_{20}^T)^T$. We aim to consistently identify the covariates with zero coefficients $\beta_{20}$, and meanwhile estimate the other nonzero coefficients.
2.1 Weighting approach for efficiency and robustness
To seek doubly-robust and efficient parameter estimates and simultaneously select important covariates, we propose minimizing the following penalized dispersion function with two given weights $b_{ikjl}$ and $w_{ij}$:
\[ Q_W(\beta) = M^{-2} \sum_{i=1}^{N} \sum_{j<i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} |\epsilon_{ik} - \epsilon_{jl}| + \sum_{s=1}^{p} P_\lambda(|\beta_s|), \tag{2} \]
where $M = \sum_{i=1}^{N} n_i$ is the total number of observations, $P_\lambda(\cdot)$ is a penalty function encouraging sparsity in $\beta$, and $\lambda > 0$ is a tuning parameter controlling the complexity of the model. The weight $w_{ij} = w_i w_j$ is used to capture the effects of the within-subject correlations and varying cluster sizes, and the weight $b_{ikjl}$ is used to control for effects of possible outliers in the covariates.
Moreover, the first term in $Q_W(\beta)$ is based on the Wilcoxon-type dispersion function and hence yields robust estimates when the response variable contains outliers. When $b_{ikjl} = 1$ and $w_{ij} = 1$ for $i \neq j$, the first term in $Q_W(\beta)$ yields the well-known rank-based Wilcoxon estimate, and the proposed penalized function $Q_W(\beta)$ reduces to the objective function of Fu and Wang (2018).
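To make the first term of (2) concrete, the following minimal Python sketch evaluates the weighted Wilcoxon-type dispersion for a given $\beta$. The function name `weighted_dispersion`, the nested-list data layout, and the optional callable `b` are illustrative assumptions; the paper itself carries out its computations in R.

```python
def weighted_dispersion(y, x, beta, w, b=None):
    """First term of (2): M^-2 * sum over between-cluster pairs of
    b_ikjl * w_i * w_j * |e_ik - e_jl|.

    y[i][k] is a response, x[i][k] a list of p covariates, w[i] a cluster
    weight, and b an optional function (i, k, j, l) -> b_ikjl (default 1).
    All names and the data layout are assumptions made for illustration.
    """
    # Residuals e_ik = Y_ik - X_ik^T beta (the intercept cancels in differences).
    resid = [
        [y_ik - sum(c * bt for c, bt in zip(x_ik, beta)) for y_ik, x_ik in zip(y_i, x_i)]
        for y_i, x_i in zip(y, x)
    ]
    m_total = sum(len(r) for r in resid)  # M, the total number of observations
    total = 0.0
    for i in range(len(resid)):
        for j in range(i):  # only pairs of different clusters with j < i
            for k, e_ik in enumerate(resid[i]):
                for l, e_jl in enumerate(resid[j]):
                    b_ikjl = 1.0 if b is None else b(i, k, j, l)
                    total += b_ikjl * w[i] * w[j] * abs(e_ik - e_jl)
    return total / m_total ** 2
```

Because only residual differences enter, shifting every response by the same constant leaves the value unchanged, which is the location invariance noted later for the intercept.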
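For $b_{ikjl}$, we consider the generalized rank (GR) weight proposed by Naranjo et al. (1994) and the high-breakdown rank (HBR) weight proposed by Chang et al. (1999). For the GR weight, $b_{ikjl} = h_{ik} h_{jl}$, in which
\[ h_{ik} = \min\left\{ 1, \left[ \frac{c}{d_i^2(X_{ik})} \right]^{\kappa/2} \right\}, \quad \text{and} \quad d_i^2(X_{ik}) = (X_{ik} - \hat\mu_x)^T S_x^{-1} (X_{ik} - \hat\mu_x), \]
where $c$ and $\kappa$ are tuning constants and $\hat\mu_x$ and $S_x$ are the robust estimates of the location and covariance of $X_{ik}$ (Rousseeuw and Zomeren, 1990; Terpstra and McKean, 2005). For the tuning parameters $\kappa$ and $c$, we use $\kappa = 2$ and $c = \chi^2_{0.95}(p)$, which is the 0.95 quantile of a $\chi^2(p)$ distribution. For the HBR weight,
\[ b_{ikjl} = \psi\left( \frac{c_2}{a_{ik} a_{jl}} \right), \quad \text{in which} \quad a_{ik} = \frac{\hat\epsilon_{ik}}{\hat\sigma \, \psi\left( \chi^2_{0.95}(p) / d_i^2(X_{ik}) \right)}, \]
where $\psi(t) = 1$, $t$ or $-1$ according to whether $t \geq 1$, $-1 < t < 1$ or $t \leq -1$, the tuning constant is $c_2 = [\mathrm{med}(a_{ik}) + 3\,\mathrm{MAD}(a_{ik})]^2$, $\hat\sigma = 1.483\,\mathrm{med}|\hat\epsilon_{ik}(\hat\beta^0) - \mathrm{med}\{\hat\epsilon_{ik}(\hat\beta^0)\}|$, and $\hat\beta^0 = (\hat\beta_1^0, \cdots, \hat\beta_p^0)^T$ is a consistent estimate of $\beta$.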
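As a small illustration of how the GR weight downweights leverage points, the sketch below computes $h_{ik}$ and $b_{ikjl} = h_{ik} h_{jl}$ from squared robust distances that are assumed to be already available; estimating $\hat\mu_x$ and $S_x$ robustly is not shown. It also writes out the clipping function $\psi$ used by the HBR weight. The function names `gr_factor`, `gr_weight`, and `psi` are hypothetical.

```python
def psi(t):
    """psi(t) = 1, t, or -1 according to t >= 1, -1 < t < 1, or t <= -1."""
    return max(-1.0, min(1.0, t))


def gr_factor(d2, c, kappa=2.0):
    """h_ik = min{1, (c / d2)^(kappa / 2)} for a squared robust distance d2.

    d2 is assumed to be precomputed from robust location/covariance estimates;
    c would be the 0.95 chi-square quantile with p degrees of freedom.
    """
    if d2 <= 0.0:
        return 1.0  # a point at the robust centre receives full weight
    return min(1.0, (c / d2) ** (kappa / 2.0))


def gr_weight(d2_ik, d2_jl, c, kappa=2.0):
    """GR weight b_ikjl = h_ik * h_jl for one pair of observations."""
    return gr_factor(d2_ik, c, kappa) * gr_factor(d2_jl, c, kappa)
```

With $\kappa = 2$, an observation whose squared distance is twice the cutoff $c$ receives $h_{ik} = 0.5$, while any observation inside the cutoff keeps weight 1.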
For $w_i$, we consider $w_i = 1/(1 + (n_i - 1)\rho)$ proposed by Wang and Zhao (2008), where $\rho$ is the average correlation coefficient. This weight incorporates correlations and different cluster sizes in a simple way and the resulting estimator is efficient and robust (Wang and Zhao, 2008). The parameter $\rho$ is estimated via the moment method and is given by the following formula (3), which was proposed by Wang and Carey (2003) and Wang and Zhao (2008):
\[ \hat\rho = \frac{\sum_{i=1}^{N} \sum_{k=1}^{n_i} \sum_{l \neq k}^{n_i} (r_{ik} - \bar r)(r_{il} - \bar r)}{\sum_{i=1}^{N} (n_i - 1) \sum_{k=1}^{n_i} (r_{ik} - \bar r)^2}, \tag{3} \]
where $r_{ik} = \sum_{j=1}^{N} \sum_{l=1}^{n_j} I(\hat\epsilon_{jl} \leq \hat\epsilon_{ik})$, and $\bar r$ is the average of the ranks of all the residuals. We use the widely used SCAD penalty function for the penalty function $P_\lambda(\cdot)$ (Fan and Li, 2001). To reduce the computational burden, Zou and Li (2008) proposed a local linear approximation to the SCAD penalty, which retains the same asymptotic properties. Therefore, the SCAD estimates are derived from the following objective function:
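\[ Q_W(\beta) = L_W(\beta) + \sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^0|) |\beta_s|, \]
where
\[ L_W(\beta) = M^{-2} \sum_{i=1}^{N} \sum_{j<i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} |\epsilon_{ik} - \epsilon_{jl}|, \]
and
\[ P'_\lambda(\theta) = \lambda \left\{ I(\theta \leq \lambda) + \frac{(a\lambda - \theta)_+}{(a - 1)\lambda} I(\theta > \lambda) \right\}, \]
for $a > 2$ and $\theta > 0$. Fan and Li (2001) indicated that $a = 3.7$ performs well for a variety of cases; hence, we will use $a = 3.7$ throughout this paper.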
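The SCAD derivative above is simple to evaluate, and under the local linear approximation it acts as a coefficient-specific weight on $|\beta_s|$. The sketch below is an illustrative transcription of that formula; the function names `scad_derivative` and `lla_weights` are assumptions, not names from the paper.

```python
def scad_derivative(theta, lam, a=3.7):
    """P'_lambda(theta) for theta >= 0: equal to lambda on [0, lambda],
    then (a*lambda - theta)_+ / (a - 1) for theta > lambda."""
    if theta <= lam:
        return lam
    return max(a * lam - theta, 0.0) / (a - 1.0)


def lla_weights(beta_init, lam, a=3.7):
    """Penalty weights P'_lambda(|beta_s^0|) from an initial estimate beta^0."""
    return [scad_derivative(abs(b), lam, a) for b in beta_init]
```

Coefficients whose initial estimates exceed $a\lambda$ in absolute value receive weight zero and are therefore not shrunk, whereas initial estimates near zero receive the full weight $\lambda$.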
Besides robustness and effectiveness, another appealing feature of the rank-based method with the SCAD penalty is that its computation can be conveniently carried out using the statistical software R, since the penalty term can be easily merged with the first weighted term. The procedure is as follows. The objective function $Q_W(\beta)$ can be written as:
\[
\begin{aligned}
Q_W(\beta) &= M^{-2} \sum_{i=1}^{N} \sum_{j<i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} |\epsilon_{ik} - \epsilon_{jl}| + \sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^0|) |\beta_s| \\
&= M^{-2} \sum_{i=1}^{N} \sum_{j<i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} |(Y_{ik} - Y_{jl}) - (X_{ik} - X_{jl})^T \beta| + \sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^0|) |\beta_s| \\
&= M^{-2} \sum_{r=1}^{M(M-1)/2+p} |\tilde Y_r - \tilde X_r^T \beta|,
\end{aligned}
\]
where $(\tilde Y_r, \tilde X_r)$ are pseudo-observations. The first $M(M-1)/2$ pseudo-observations are $(b_{ikjl} w_{ij} (Y_{ik} - Y_{jl}), b_{ikjl} w_{ij} (X_{ik} - X_{jl}))$ for $1 \leq k \leq n_i$, $1 \leq l \leq n_j$ and $1 \leq j < i \leq N$. The last $p$ pseudo-observations are $(0, M^2 P'_\lambda(|\hat\beta_s^0|) E_s)$, $s = 1, \cdots, p$, where $E_s$ is a $p$-dimensional vector with the $s$th element being 1 and all the other elements being zero. Therefore, $Q_W(\beta)$ can be treated as the $L_1$ loss function for the pseudo-data $(\tilde Y_r, \tilde X_r)$ for $r = 1, \cdots, M(M-1)/2 + p$.
The penalized estimate can be obtained using the rq function in the quantreg package in the statistical software R (Koenker, 2005). Furthermore, the initial estimate $\hat\beta^0$ can be obtained by minimizing $L_W(\beta)$. Because $Q_W(\beta)$ is invariant to location, the intercept $\beta_0$ cannot be simultaneously estimated with $\beta$.
We estimate it by the median of $\{\epsilon_{ik}(\hat\beta), k = 1, \cdots, n_i; i = 1, \cdots, N\}$.
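The pseudo-observation construction described above can be sketched as follows. This is an illustrative Python transcription under assumed names (`pseudo_data`, `l1_loss`) and an assumed nested-list layout; it builds only the between-cluster pairs with $j < i$ plus the $p$ penalty rows, and it does not perform the minimization, which the paper delegates to a median-regression routine.

```python
def pseudo_data(y, x, w, penalty_weights, b=None):
    """Build pseudo-observations (Y~_r, X~_r) whose L1 loss equals M^2 * Q_W.

    y[i][k], x[i][k] (list of p covariates) and w[i] follow an assumed layout;
    penalty_weights[s] plays the role of P'_lambda(|beta_s^0|);
    b is an optional function (i, k, j, l) -> b_ikjl (default 1).
    """
    m_total = sum(len(y_i) for y_i in y)
    p = len(penalty_weights)
    y_tilde, x_tilde = [], []
    for i in range(len(y)):
        for j in range(i):  # between-cluster pairs with j < i
            for k in range(len(y[i])):
                for l in range(len(y[j])):
                    scale = w[i] * w[j] * (1.0 if b is None else b(i, k, j, l))
                    y_tilde.append(scale * (y[i][k] - y[j][l]))
                    x_tilde.append([scale * (u - v) for u, v in zip(x[i][k], x[j][l])])
    for s in range(p):  # one penalty row per coefficient
        row = [0.0] * p
        row[s] = m_total ** 2 * penalty_weights[s]
        y_tilde.append(0.0)
        x_tilde.append(row)
    return y_tilde, x_tilde


def l1_loss(y_tilde, x_tilde, beta):
    """Sum of absolute residuals of the pseudo-data at beta (no intercept)."""
    return sum(
        abs(y_r - sum(c * bt for c, bt in zip(x_r, beta)))
        for y_r, x_r in zip(y_tilde, x_tilde)
    )
```

Dividing `l1_loss` by $M^2$ recovers $Q_W(\beta)$, so any least-absolute-deviation solver applied to the pseudo-data without an intercept minimizes the penalized objective.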
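For the tuning parameter $\lambda$, we select it by a data-driven method via minimizing the following objective function over a given interval,
\[ \mathrm{BIC}_\lambda = \log\left\{ M^{-2} \sum_{i=1}^{N} \sum_{j<i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} |Y_{ik} - Y_{jl} - (X_{ik} - X_{jl})^T \hat\beta_\lambda| \right\} + df_\lambda \frac{\log N}{N}, \]
where $\hat\beta_\lambda$ is the penalized estimator with a tuning parameter $\lambda$, and $df_\lambda$ is the number of nonzero components in $\hat\beta_\lambda$.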
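A grid search over $\lambda$ using this criterion might look like the sketch below. It assumes the dispersion value and the penalized estimate have already been computed for each candidate $\lambda$; the names `bic_value` and `select_lambda` and the numerical tolerance `tol` for deciding that a coefficient is zero are illustrative assumptions.

```python
import math


def bic_value(dispersion, beta_hat, n_subjects, tol=1e-8):
    """BIC_lambda = log(dispersion) + df * log(N) / N, where df counts the
    coefficients whose absolute value exceeds tol (an assumed threshold)."""
    df = sum(1 for b in beta_hat if abs(b) > tol)
    return math.log(dispersion) + df * math.log(n_subjects) / n_subjects


def select_lambda(fits, n_subjects):
    """fits: list of (lam, dispersion, beta_hat) tuples on a grid of lambdas.
    Returns the lam with the smallest BIC value."""
    best = min(fits, key=lambda f: bic_value(f[1], f[2], n_subjects))
    return best[0]
```

The criterion trades a slightly larger dispersion for fewer nonzero coefficients, so a sparser fit is chosen whenever its loss in fit is smaller than the $\log N / N$ charge per retained coefficient.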
2.2 Asymptotic properties
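Let $f_{ik}$ and $F_{ik}$ be the density and cumulative distribution functions of $\epsilon_{ik}$, respectively. Define
\[ D = M^{-2} \sum_{i=1}^{N} \sum_{j<i} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} (X_{ik} - X_{jl})(X_{ik} - X_{jl})^T \int f_{ik} \, dF_{jl}. \]
Let $D_{11}$ be the first $d \times d$ submatrix of $D$. Denote $\Sigma_{11} = \mathrm{diag}(P''_\lambda(|\beta_1^*|)\mathrm{sign}(\beta_1^*), \cdots, P''_\lambda(|\beta_d^*|)\mathrm{sign}(\beta_d^*))$ and
$P'_\lambda(|\beta_{10}^*|)\mathrm{sign}(\beta_{10}^*) = (P'_\lambda(|\beta_1^*|)\mathrm{sign}(\beta_1^*), \cdots, P'_\lambda(|\beta_d^*|)\mathrm{sign}(\beta_d^*))^T$. The theorem below indicates that the proposed estimators have the oracle properties, and the proof of the theorem is given in the Appendix.
Theorem 1 Under some regularity conditions given in the Appendix, the proposed estimator $\hat\beta = (\hat\beta_{10}^T, \hat\beta_{20}^T)^T$ has the following properties, as $\lambda \to 0$ and $\sqrt{M}\lambda \to \infty$:
(a) Sparsity: $P(\hat\beta_{20} = 0) \to 1$.
(b) Asymptotic normality:
\[ \sqrt{M} \{ \hat\beta_{10} - \beta_{10}^* - (D_{11} + \Sigma_{11})^{-1} P'_\lambda(|\beta_{10}^*|)\mathrm{sign}(\beta_{10}^*) \} \to N(0, B_{11}), \]
where $B_{11} = (D_{11} + \Sigma_{11})^{-1} V_{11} (D_{11} + \Sigma_{11})^{-1}$, in which $V_{11}$ is the first $d \times d$ block matrix of $V$, as given in the Appendix.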
3 Simulation studies
In this section, we carry out simulation studies to demonstrate the robustness and efficiency of the proposed method. The data are generated from the following model:
\[ Y_{ik} = X_{ik}^T \beta + \epsilon_{ik}, \quad k = 1, \cdots, n_i; \; i = 1, \cdots, 50, \]
where $\beta = (3, 1.5, 2, 0_{p-3})^T$, and $X_{ik1}$ and $X_{ik2}$ are subject-level covariates; that is, they do not change within each subject or cluster, but may differ among subjects/clusters. Covariate $X_{ik3}$ is a within-cluster covariate; it changes within each subject or cluster.
We generate $X_{ik1}$ and $X_{ik2}$ independently from the standard normal distribution and generate $X_{ik3}$ from a uniform distribution $U(-1, 1)$. Covariates $(X_{ik4}, \cdots, X_{ikp})$ are drawn from a multivariate normal distribution with a covariance matrix, $R_i(0.5^{|k-l|})$. Four different cases are considered for error terms $\epsilon_i = (\epsilon_{i1}, \cdots, \epsilon_{in_i})^T$ and covariates, respectively.
Case (1): The error terms $\epsilon_i$ follow a multivariate normal distribution, $N(0, \sigma^2 R_i(\alpha))$.
Case (2): The error terms $\epsilon_i$ follow a multivariate t distribution with two degrees of freedom, $T_2(0, \sigma^2 R_i(\alpha))$, which may contain some underlying outliers.
Case (3): To investigate the effect of outliers in the covariate direction, we randomly replace 3% of the $X_{ik}$ by $X_{ik} + 5$ in Case (1).
Case (4): In Case (1), we randomly replace 3% of the $X_{ik}$ by $X_{ik} + 5$. The $Y_{ik}$ are contaminated by adding an outlier equal to 6 or $-6$ with a probability of 0.05.
For these four cases, we randomly generated $n_i$ from the integer values between 3 and 10 with equal probability. We set $p = 8$, and $\sigma^2 = 1, 9$. For the correlation matrix $R_i(\alpha)$, we used an exchangeable correlation structure with $\alpha = 0.5$ and 0.8. We conducted a simulation study with 100 independent realizations for all cases. The multivariate normal and the multivariate Student’s t random numbers are generated using the rmvnorm and rmvt functions in the mvtnorm package. We also carried out the simulation studies for $p = N = 50$ for each case. The results, given as supplementary material, have the same pattern as those for $p = 8$. Therefore, we only summarize the results for $p = 8$ in Tables 1-4.
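The Case (1) error structure and the moment estimate $\hat\rho$ in (3) can be illustrated together with the following sketch. It is not the paper's R code: instead of a multivariate normal sampler it swaps in a shared random effect to induce the exchangeable correlation $\alpha$, and the names `simulate_errors`, `rho_hat`, and `cluster_weights` are hypothetical.

```python
import math
import random


def simulate_errors(n_clusters, alpha, sigma2, rng):
    """Exchangeable normal errors: e_ik = sigma*(sqrt(alpha)*u_i + sqrt(1-alpha)*z_ik),
    so Var(e_ik) = sigma2 and Corr(e_ik, e_il) = alpha within a cluster.
    Cluster sizes are drawn uniformly from the integers 3, ..., 10."""
    sigma = math.sqrt(sigma2)
    errors = []
    for _ in range(n_clusters):
        n_i = rng.randint(3, 10)  # both end points are included
        shared = rng.gauss(0.0, 1.0)
        errors.append([
            sigma * (math.sqrt(alpha) * shared + math.sqrt(1.0 - alpha) * rng.gauss(0.0, 1.0))
            for _ in range(n_i)
        ])
    return errors


def rho_hat(resid):
    """Moment estimate (3) computed from the ranks of all residuals."""
    flat = sorted(e for cluster in resid for e in cluster)
    rank = {e: pos + 1 for pos, e in enumerate(flat)}  # r_ik = #{residuals <= e_ik}
    r_bar = sum(rank[e] for e in flat) / len(flat)
    num = den = 0.0
    for cluster in resid:
        c = [rank[e] - r_bar for e in cluster]
        s1, s2 = sum(c), sum(v * v for v in c)
        num += s1 * s1 - s2          # sum over k != l of c_k * c_l
        den += (len(cluster) - 1) * s2
    return num / den


def cluster_weights(resid, rho):
    """w_i = 1 / (1 + (n_i - 1) * rho) for each cluster."""
    return [1.0 / (1.0 + (len(c) - 1) * rho) for c in resid]
```

For example, `rho_hat(simulate_errors(50, 0.5, 1.0, random.Random(2020)))` gives a rank-based correlation estimate for one simulated data set, and `cluster_weights` then assigns smaller weights to larger clusters when that estimate is positive.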
We compare the proposed method with the oracle procedure, which sets the zero coefficients to zero and estimates the nonzero coefficients by excluding the covariates with zero coefficients through the GEE method with the true correlation structure (denoted by gee.orcal). In Tables 1-4, IND denotes the penalized objective function $Q_W(\beta)$ with weights $b_{ikjl} = 1$ and $w_i = 1$, which corresponds to the method of Fu and Wang (2018); WIL denotes $Q_W(\beta)$ with weights $b_{ikjl} = 1$ and $w_i = [1 + (n_i - 1)\hat\rho]^{-1}$. GR and HBR correspond to $Q_W(\beta)$ with GR and HBR weights (for $b_{ikjl}$) and $w_i = [1 + (n_i - 1)\hat\rho]^{-1}$, respectively. We evaluate the performance of the proposed method in terms of the model errors proposed by Fan and Li (2001). We report biases, the relative efficiencies (Eff) of WIL, GR, HBR and gee.orcal to IND for the first three parameters, the average number of the $p - 3$ true zero coefficients that are properly estimated to be zero (CN), and the average number of the three nonzero coefficients improperly estimated to be zero (IC). We also present the mean of the model errors (MME) and the percentage of replications in which the true model is correctly identified (CP).
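The selection summaries CN, IC, and CP defined above can be computed from a collection of fitted coefficient vectors as in the sketch below. The function name `selection_summary`, the convention that the truly nonzero coefficients come first, and the tolerance `tol` for treating an estimate as zero are assumptions for illustration; CP is returned as a proportion rather than a percentage.

```python
def selection_summary(estimates, n_nonzero=3, tol=1e-8):
    """Average CN and IC, and the proportion CP of exactly correct models.

    estimates: list of coefficient vectors, one per replication, with the
    n_nonzero truly nonzero coefficients listed first (an assumed ordering).
    """
    n_rep = len(estimates)
    cn = ic = cp = 0
    for beta_hat in estimates:
        is_zero = [abs(b) <= tol for b in beta_hat]
        wrong_zero = sum(is_zero[:n_nonzero])   # nonzero coefficients set to zero
        right_zero = sum(is_zero[n_nonzero:])   # zero coefficients set to zero
        ic += wrong_zero
        cn += right_zero
        if wrong_zero == 0 and right_zero == len(beta_hat) - n_nonzero:
            cp += 1
    return {"CN": cn / n_rep, "IC": ic / n_rep, "CP": cp / n_rep}
```

With $p = 8$ and three nonzero coefficients, a method that always recovers the true model would report CN = 5, IC = 0, and CP = 1.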
From Table 1 (multivariate normal distribution), we can see that the results of WIL, GR, and HBR are similar. The CNs of the weighted methods, WIL, GR, and HBR, are close to five. All the methods obtain unbiased estimates.
Moreover, IC increases as σ2 increases. When σ2 = 1, WIL, GR, and HBR perform better than IND in terms of MME, CP, and CN. The efficiency of WIL, GR, and HBR is much higher than that of IND for the subject-level covariates but is slightly lower than that of IND for the within-subject covariate. The GEE estimates are the most efficient. When σ2= 9, CPs and CNs of IND are close to those of WIL, GR, and HBR, but ICs and MME of IND are larger than those of WIL, GR, and HBR.
When the joint distribution of the error terms is heavy-tailed (Table 2), the rank-based methods IND, WIL, GR, and HBR show the same pattern as in Case 1 (Table 1) for $\sigma^2 = 1$, but they are much better than the GEE method in terms of efficiency and model error. When $\sigma^2 = 9$, HBR outperforms the other methods in terms of CN, CP, IC, and efficiency. When covariates are contaminated by outliers (Table 3), the estimates obtained from all the methods are biased, and the GEE method has much larger model errors. When $\sigma^2 = 1$, GR and HBR have much higher efficiency and much smaller model errors than IND, WIL, and gee.orcal. When $\sigma^2 = 9$, GR has significantly higher CP and efficiency, and has much smaller IC and model errors than all of the others.
4 Hormone study
In this section, we illustrate the proposed methods by analyzing the longitudinal progesterone data (Sowers et al., 1998), which has been analyzed in the literature (Fan et al., 2012; Fung et al., 2002; Zhang et al., 1998). In this study, a total of 492 urine samples were collected from 34 women (with menstrual cycles) aged between 27 and 45 years, and urinary progesterone was assayed on alternate days. Each woman contributed between 11 and 28 observations over a period of time; hence the data are unbalanced. One purpose of the study was to test the effects of age and body mass index (BMI) on women’s progesterone levels after an appropriate adjustment for their menstrual cycles.
Let $Y$ be the log-transformed progesterone level. Figure 1 indicates that some outliers exist, which coincides with the findings of Fung et al. (2002). We checked the data and found that one woman’s BMI exceeds 38 (her ID number is 21208, and she has 20 observations) and that 11.8% of the women had a BMI over 30. In our model, covariates include age, BMI, and time effects. We also considered their interaction effects. The model is given as follows:
\[
\begin{aligned}
Y_{ik} = {} & \beta_0 + \beta_1 \mathrm{Age}_i + \beta_2 \mathrm{BMI}_i + \beta_3 \mathrm{Time}_{ik} + \beta_4 \mathrm{Age}_i * \mathrm{BMI}_i + \beta_5 \mathrm{Age}_i * \mathrm{Time}_{ik} \\
& + \beta_6 \mathrm{BMI}_i * \mathrm{Time}_{ik} + \beta_7 \mathrm{Time}_{ik}^2 + \beta_8 \mathrm{Time}_{ik}^3 + \beta_9 \mathrm{Time}_{ik}^4 + \beta_{10} \mathrm{Time}_{ik}^5 + \epsilon_{ik}.
\end{aligned}
\]
In the model, $\mathrm{Age}_i$, $\mathrm{BMI}_i$, and $\mathrm{Age}_i * \mathrm{BMI}_i$ are subject-level covariates. The terms that include the time effect are within-cluster covariates. All of the covariates are standardized in the computation. The results presented in the upper panel of Table 5 indicate that WIL, GR, and HBR select the intercept, Time, Time$^3$, and Time$^5$, which is consistent with the findings of Zhang et al. (1998) and Fung et al. (2002), where age and BMI were found to have no significant effect on progesterone level. IND selected eight covariates, including age and BMI; only Time$^2$ and Time$^4$ were excluded. The left panel in Figure 2 indicates that the predictions of IND, WIL, and HBR are better than that of GR. The right panel in Figure 2 indicates that the predictions of WIL and HBR are good, and that IND and GR performed poorly after some outliers were added to age.
To further demonstrate the performance of the methods based on different weights, we also consider the regression that contains 20 additional irrelevant variables randomly generated from the standard normal distribution. The lower panel in Table 5 presents the selection frequencies for the 31 variables, which consist of the 11 real variables and the 20 irrelevant variables, based on 500 replications. As we can see, the intercept, Time, Time$^3$ and Time$^5$ have very high selection frequencies for all the methods after adding the 20 irrelevant variables, which indicates that these variables may be very significant in this regression model. In addition, IND chooses the 20 irrelevant variables more frequently than the GR, WIL and HBR methods, which indicates that IND is clearly affected by the added irrelevant variables and tends to overfit. Based on the above analysis, we prefer the WIL and HBR methods for analyzing this hormone data. The dataset for this hormone study is available from the corresponding author on reasonable request.
5 Conclusion
In this paper, we have provided efficient and doubly robust methods for variable selection and parameter estimation in longitudinal data analysis. The objective functions are based on the ranks of the pairwise residuals, and the weight $w_{ij}$ captures the effects of the within-subject correlations and varying cluster sizes; hence, the proposed methods are efficient and robust when the responses deviate from the Gaussian distribution or contain outliers, or when a strong within-subject correlation exists. Moreover, the GR and HBR weights automatically downweight outliers in the covariates. Therefore, the proposed methods are doubly robust. Furthermore, the proposed methods can be easily implemented in the statistical software R. According to the simulation results and the real data analysis, for the weight $b_{ikjl}$ we propose using the HBR weight when the response distribution is heavy-tailed or contaminated by outliers and using the GR weight when the covariates contain outliers. If there is no evidence that outliers exist in the response or covariates, the Wilcoxon weight is preferable. The GR weight may depend on the selection of the tuning parameters $c$ and $\kappa$; cross-validation can be used to choose them, but this remains unexplored in our paper. It is worth noting that the performance of the proposed methods depends on the covariate type. The weight $w_{ij}$ only improves the efficiency of parameter estimates for cluster-level covariates. We will seek a more efficient method for within-cluster covariates in future work.
Acknowledgements The authors thank the Associate Editor and referees for their constructive comments. We would like to express gratitude to Professor Xihong Lin for providing the hormone data. This research was supported by the National Science Foundation of China (No. 11871390), the Fundamental Research Funds for the Central Universities (No. xjj2017180), the Natural Science Foundation of Shaanxi Province (No. 2018JQ1006), Zhejiang Science Grant (Project KZS1905002), and the Australian Research Council Discovery Project (DP160104292).
Conflict of interest
The authors declare that they have no conflict of interest.
References
Chang W. H., McKean J. W., Naranjo J. D. and Sheather S. J. (1999). High-breakdown rank regression. Journal of the American Statistical Association 94, 205–219.
Cho H-K and Qu A. (2013). Model selection for correlated data with diverging number of parameters. Statistica Sinica 23, 901–927.
Fan J. and Li R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association 96, 1348–1360.
Fan J. and Li R. (2004). New estimation and model selection procedures for semiparametric modeling in longitudinal data analysis. Journal of the American Statistical Association 99, 710–723.
Fan Y., Qin G. and Zhu Z. (2012). Variable selection in robust regression models for longitudinal data. Journal of Multivariate Analysis 109, 156–167.
Fu L. Y., Wang Y-G. and Bai Z. (2010). Rank regression for analysis of clustered data: A natural induced smoothing approach. Computational Statistics and Data Analysis 54, 1036–1050.
Fu L. Y. and Wang Y-G. (2012). Efficient estimation for rank-based regression with clustered data. Biometrics 68, 1074–1082.
Fu L. Y. and Wang Y-G. (2018). Variable selection in rank regression for analyzing longitudinal data. Statistical Methods in Medical Research 27(8), 2447–2458.
Fung K-W, Zhu Z. Y., Wei B. C. and He X. M. (2002). Influence diagnostics and outlier tests for semiparametric mixed models. Journal of the Royal Statistical Society, Series B 64, 565–579.
Guo C. H., Yang H. and Lv J. (2014). Robust variable selection in semiparametric mean-covariance regression for longitudinal data analysis. Applied Mathematics and Computation 245, 343–356.
Jaeckel L. A. (1972). Estimating regression coefficients by minimizing the dispersion of the residuals. The Annals of Mathematical Statistics 43, 1449–1458.
Jung S. H. and Ying Z. (2003). Rank-based regression with repeated measurement data. Biometrika 90, 732–740.
Koenker R. (2005). Quantile regression, Cambridge University Press.
Liang K. Y. and Zeger S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika 73, 13–22.
Lv J., Yang H. and Guo C. H. (2015). An efficient and robust variable selection method for longitudinal generalized linear models. Computational Statistics and Data Analysis 82, 74–88.
Naranjo J., McKean J. W., Sheather S. J. and Hettmansperger T. P. (1994). The use and interpretation of rank-based residuals. Nonparametric Statistics 3, 323–341.
Ni X., Zhang D. and Zhang H. H. (2010). Variable selection for semiparametric mixed models in longitudinal studies. Biometrics 66, 79–88.
Rousseeuw P. J. and Zomeren B. C. V. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association 85, 633–639.
Sowers M. F., Crutchfield M., Randolph J. F., Shapiro B., Zhang B., Pietra M. L. and Schork M. A. (1998). Urinary ovarian and gonadotrophin hormone levels in premenopausal women with low bone mass. Journal of Bone and Mineral Research 13, 1191–1202.
Terpstra J. T. and McKean J. W. (2005). Rank-based analyses of linear models using R. Journal of Statistical Software 14, 1–26.
Wang H., Li G. and Jiang G. (2007). Robust regression shrinkage and consistent variable selection via the LAD-LASSO. Journal of Business & Economic Statistics 25, 347–355.
Wang L. and Li R. (2009). Weighted Wilcoxon-type smoothly clipped absolute deviation method. Biometrics 65, 564–571.
Wang L., Zhou J. and Qu A. (2012). Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics 68, 353–360.
Wang X. Q., Jiang Y. L., Huang M. and Zhang H. P. (2013). Robust variable selection with exponential squared loss. Journal of the American Statistical Association 108, 632–643.
Wang Y-G. and Carey V. (2003). Working correlation structure misspecification, estimation and covariate design: Implications for generalised estimating equations performance. Biometrika 90, 29–41.
Wang Y-G. and Zhao Y. D. (2008). Weighted rank regression for clustered data analysis. Biometrics 64, 39–45.
Xu J. F., Leng C. L. and Ying Z. (2010). Rank-based variable selection with censored data. Statistics and Computing 20, 165–176.
Yang H., Guo C. H. and Lv J. (2015). SCAD penalized rank regression with a diverging number of parameters. Journal of Multivariate Analysis 133, 321–333.
Zhang D., Lin X. H., Raz J. and Sowers M. F. (1998). Semiparametric stochastic mixed models for longitudinal data. Journal of the American Statistical Association 93, 710–719.
Zou H. and Li R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics 36, 1509–1566.
Appendix
C1. The cluster size $n_i$ is bounded.
C2. $\epsilon_{ik}$ is continuous, and the median of any pairwise difference $\epsilon_{ik} - \epsilon_{jl}$ is zero.
C3. The density function $f_{ik}$ is absolutely continuous.
C4. The matrix $D$ is positive definite.
C5. $\liminf_{N \to +\infty} \liminf_{\beta \to 0^+} P'_\lambda(\beta)/\lambda > 0$, and $\max_{1 \le s \le p}\{P''_\lambda(|\beta_s|) : \beta_s \ne 0\} \to 0$.
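Condition C5 can be checked numerically for a concrete penalty. The sketch below assumes the SCAD penalty of Fan and Li (2001) with the conventional value $a = 3.7$; the appendix does not restate the penalty, so this choice and the function names are assumptions made for illustration.

```python
import numpy as np

def scad_derivative(theta, lam, a=3.7):
    """First derivative P'_lambda(theta) of the SCAD penalty for theta >= 0
    (assumed penalty; a = 3.7 is the conventional default)."""
    theta = np.asarray(theta, dtype=float)
    linear_part = (theta <= lam).astype(float)
    taper_part = np.maximum(a * lam - theta, 0.0) / ((a - 1.0) * lam) * (theta > lam)
    return lam * (linear_part + taper_part)

def scad_second_derivative(theta, lam, a=3.7):
    """Second derivative P''_lambda(theta), defined away from the two kinks."""
    theta = np.asarray(theta, dtype=float)
    inside = (theta > lam) & (theta < a * lam)
    return np.where(inside, -1.0 / (a - 1.0), 0.0)

lam = 0.1
# First part of C5: the ratio P'_lambda(beta) / lambda near beta = 0+.
print(scad_derivative(1e-8, lam) / lam)
# Second part of C5: P'' at nonzero coefficients that exceed a * lambda.
beta_nonzero = np.array([0.8, 1.5, 2.0])
print(scad_second_derivative(beta_nonzero, lam))
```

Under this assumed penalty the ratio $P'_\lambda(\beta)/\lambda$ equals 1 for $0 < \beta \le \lambda$, and $P''_\lambda(|\beta_s|)$ vanishes as soon as $a\lambda$ falls below the smallest nonzero $|\beta_s|$, so both parts of C5 hold as $\lambda \to 0$.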
Let $f_{ik}$ and $F_{ik}$ be the density and cumulative distribution functions of $\epsilon_{ik}$, respectively. Assume that $b_{ikjl}$ and $w_{ij}$ are given. Define
$$D = M^{-2} \sum_{i=1}^{N} \sum_{j<i}^{N} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} (X_{ik} - X_{jl})(X_{ik} - X_{jl})^T \int f_{ik}\, dF_{jl},$$
and
$$U(\beta) = M^{-2} \sum_{i=1}^{N} \sum_{j<i}^{N} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} (X_{ik} - X_{jl})\, \mathrm{sign}(\epsilon_{ik} - \epsilon_{jl}).$$
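To make the definition of $U(\beta)$ concrete, the following sketch evaluates it by brute force on a small synthetic data set and compares it with a finite-difference gradient of the weighted pairwise dispersion $M^{-2} \sum b_{ikjl} w_{ij} |e_{ik} - e_{jl}|$, where $e_{ik} = y_{ik} - X_{ik}^T \beta$. That dispersion form is inferred from the relation $\partial L_W / \partial \beta_s = -U_s$ used in the proof of Lemma 4. All weights are set to 1 and $M$ is taken to be the total number of observations; both are assumptions of this illustration, as are the function names.

```python
import numpy as np

def pairwise_terms(X, y, cluster, beta):
    """Yield (x_diff, e_diff) for every pair of observations (i,k), (j,l)
    taken from different clusters with j < i."""
    resid = y - X @ beta
    n = len(y)
    for a in range(n):
        for b in range(n):
            if cluster[b] < cluster[a]:
                yield X[a] - X[b], resid[a] - resid[b]

def U(X, y, cluster, beta):
    M = len(y)  # assumption: M is the total number of observations
    total = np.zeros(X.shape[1])
    for x_diff, e_diff in pairwise_terms(X, y, cluster, beta):
        total += x_diff * np.sign(e_diff)  # b_ikjl = w_ij = 1 in this sketch
    return total / M**2

def dispersion(X, y, cluster, beta):
    M = len(y)
    return sum(abs(e) for _, e in pairwise_terms(X, y, cluster, beta)) / M**2

rng = np.random.default_rng(0)
cluster = np.repeat(np.arange(4), 3)  # 4 clusters with 3 observations each
X = rng.standard_normal((12, 2))
y = X @ np.array([1.0, -0.5]) + rng.standard_normal(12)

beta = np.array([0.3, 0.2])
h = 1e-7
grad_fd = np.array([
    (dispersion(X, y, cluster, beta + h * e)
     - dispersion(X, y, cluster, beta - h * e)) / (2 * h)
    for e in np.eye(2)
])
# The dispersion is piecewise linear, so away from kinks its gradient is -U.
print(np.allclose(grad_fd, -U(X, y, cluster, beta), atol=1e-6))
```

The double loop costs $O(M^2)$ operations, which is acceptable for a check of this kind but would need vectorization for data sets of realistic size.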
Before proving the theorem, we give the following lemmas.
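Lemma 1. Under conditions C1-C3, $\sqrt{M}\, U(\beta_T)$ converges in distribution to $N(0, V)$, where $V = \lim_{N \to +\infty} M^{-3} \sum_{i=1}^{N} \zeta_i \zeta_i^T$, and $\zeta_i = \sum_{j<i}^{N} \sum_{k,l} b_{ikjl} w_{ij} (X_{ik} - X_{jl})(F_{jl}(\epsilon_{ik}) - 1/2)$.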
Lemma 2. Under conditions C1-C4, $U(\beta)$ is asymptotically linear, that is,
$$\sup_{\|b - \beta\| \le M^{-1/2}\eta} \left\| \sqrt{M}\{U(b) - U(\beta)\} - M^{1/2} D (b - \beta) \right\| = o_p(1 + \sqrt{M}\,\|b - \beta\|).$$
For the proofs of Lemmas 1 and 2, see Jung and Ying (2003) and Wang and Zhao (2008).
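Lemma 3. Under conditions C1-C4, if $\lambda \to 0$, then the estimator $\hat\beta$ obtained from $Q_W(\beta)$ satisfies $\|\hat\beta - \beta_T\| = O_p(M^{-1/2})$, where $\beta_T$ is the true value of $\beta$.
Proof. We will prove that, for any $\epsilon > 0$, there exists a large constant $C$ such that
$$P\left\{ \inf_{\|u\| = C} Q_W(\beta_T + M^{-1/2}u) > Q_W(\beta_T) \right\} \ge 1 - \epsilon, \qquad (4)$$
where $u = (u_1, \cdots, u_p)^T$. Because $Q_W(\beta)$ is convex in $\beta$, (4) implies that, with probability at least $1 - \epsilon$, the estimator $\hat\beta$ lies in the ball $\{\beta_T + M^{-1/2}u : \|u\| \le C\}$. According to Sievers (1983) and Lemma 2,
$$L_W(\beta_T + M^{-1/2}u) - L_W(\beta_T) = -u^T M^{-1/2} U(\beta_T) + \frac{1}{2} M^{-1} u^T D(\beta_T) u + o_p(1).$$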
Therefore,
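$$\begin{aligned}
Q_W(\beta_T + M^{-1/2}u) - Q_W(\beta_T) &= L_W(\beta_T + M^{-1/2}u) - L_W(\beta_T) + \sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^{0}|)\{|\beta_s + M^{-1/2}u_s| - |\beta_s|\} \\
&\ge -u^T M^{-1/2} U(\beta_T) + \frac{1}{2} M^{-1} u^T D(\beta_T) u - M^{-1/2} \sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^{0}|)\,|u_s| + o_p(1).
\end{aligned}$$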
According to Lemma 1, $M^{-1/2}U(\beta_T) = O_p(M^{-1})$. Because $D(\beta_T)$ is a positive definite matrix (C4), $u^T D(\beta_T) u > 0$. Furthermore, $P'_\lambda(|\hat\beta_s^{0}|) \to P'_\lambda(|\beta_s|)$ in probability, and $P'_\lambda(|\beta_s|) = P'_\lambda(|\beta_s|) I(|\beta_s| \le a\lambda)$; hence, for any $\epsilon > 0$, $P(P'_\lambda(|\beta_s|) > \epsilon) \le P(|\beta_s| \le a\lambda) \to 0$ as $\lambda \to 0$. Therefore, by choosing a sufficiently large $C$, the sign of $Q_W(\beta_T + M^{-1/2}u) - Q_W(\beta_T)$ is dominated by the second term on the right-hand side, and (4) holds.
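The lower bound on the penalty term used in the proof of Lemma 3 follows from the reverse triangle inequality applied term by term. Spelled out, and assuming $P'_\lambda(\cdot) \ge 0$ (which holds for SCAD-type penalties but is not restated in this appendix), the step is:

```latex
\bigl|\beta_s + M^{-1/2}u_s\bigr| - |\beta_s| \;\ge\; -M^{-1/2}|u_s|
\quad\Longrightarrow\quad
\sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^{0}|)
  \bigl\{\bigl|\beta_s + M^{-1/2}u_s\bigr| - |\beta_s|\bigr\}
\;\ge\; -M^{-1/2}\sum_{s=1}^{p} P'_\lambda(|\hat\beta_s^{0}|)\,|u_s| .
```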
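Lemma 4. If $\lambda \to 0$ and $\sqrt{N}\lambda \to +\infty$ as $N \to +\infty$, then for any $\beta_d$ satisfying $\|\beta_{10} - \beta_{10}^*\| = O_p(M^{-1/2})$ and any constant $C$,
$$Q_W\{(\beta_{10}^T, 0^T)^T\} = \min_{\|\beta_{20}\| \le CM^{-1/2}} Q_W\{(\beta_{10}^T, \beta_{20}^T)^T\}.$$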
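Proof of Lemma 4. Because $Q_W(\beta)$ is a convex, piecewise linear function of $\beta$, it is sufficient to show that, with probability tending to 1 as $N \to \infty$, for any $\beta_d$ satisfying $\|\beta_{10} - \beta_{10}^*\| = O_p(M^{-1/2})$, for any small $\epsilon_M = CM^{-1/2}$, and for $s = d + 1, \cdots, p$,
$$\frac{\partial Q(\beta)}{\partial \beta_s} \begin{cases} > 0 & \text{for } 0 < \beta_s < \epsilon_M, \\ < 0 & \text{for } -\epsilon_M < \beta_s < 0. \end{cases}$$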
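Note that
$$\begin{aligned}
\frac{\partial Q_W(\beta)}{\partial \beta_s} &= \frac{\partial L_W(\beta)}{\partial \beta_s} + P'_\lambda(|\hat\beta_s^{0}|)\,\mathrm{sign}(\beta_s) \\
&= -M^{-2} \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{k=1}^{n_i} \sum_{l=1}^{n_j} b_{ikjl} w_{ij} (X_{iks} - X_{jls})\,\mathrm{sign}(\epsilon_{ikjl}) + P'_\lambda(|\hat\beta_s^{0}|)\,\mathrm{sign}(\beta_s) \\
&= -U_s(\beta) + P'_\lambda(|\hat\beta_s^{0}|)\,\mathrm{sign}(\beta_s),
\end{aligned}$$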
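where $U_s(\beta)$ is the $s$th element of $U(\beta)$, and $\epsilon_{ikjl} = \epsilon_{ik} - \epsilon_{jl}$. According to Lemma 2, we have $\sqrt{M}(U_s(\beta) - U_s(\beta_T)) = \sqrt{M} \sum_{l=1}^{d} D_{sl}(\beta_l - \beta_l^*) + o_P(1 + M^{1/2}|\beta_s - \beta_s^*|)$, where $D_{sl}$ is the $(s, l)$th element of $D$, and $\beta_s^*$ is the $s$th element of the true value of $\beta$. According to Lemma 1, $U_s(\beta_T) = O_p(M^{-1/2})$. Thus, for any $\beta_l$ satisfying $\|\beta_l - \beta_l^*\| = O_p(M^{-1/2})$, we have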
$$\frac{\partial Q_W(\beta)}{\partial \beta_s} = P'_\lambda(|\hat\beta_s^{0}|)\,\mathrm{sign}(\beta_s) + O_p((d + 1)M^{-1/2}/\lambda) + o_P(M^{-1/2}/\lambda).$$
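Because $\hat\beta_s^{0}$ is a consistent estimate of $\beta_s$, $P'_\lambda(|\hat\beta_s^{0}|) = \lambda P'_\lambda(|\beta_s|)/\lambda + o_p(1)$. Under condition C5, $\liminf_{N \to +\infty} \liminf_{\beta \to 0^+} P'_\lambda(\beta_s)/\lambda > 0$. Therefore, as $\lambda \to 0$ and $\sqrt{M}\lambda \to +\infty$, the sign of the derivative is completely determined by that of $\beta_s$. This completes the proof of Lemma 4.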
Proof of Theorem. Part (1) of the theorem follows from Lemma 4. We now prove part (2). As in Lemmas 2 and 3, it can be shown that there exists a $\hat\beta_{10}$, a $\sqrt{M}$-consistent local minimizer of $Q\{(\beta_{10}^T, 0^T)^T\}$, which satisfies the equations
$$\left. \frac{\partial Q(\beta)}{\partial \beta_s} \right|_{\beta = (\hat\beta_{10}^T, 0^T)^T} = 0 \quad \text{for } s = 1, \cdots, d.$$
Similar to the proof of Lemma 4,
$$\frac{\partial Q(\beta)}{\partial \beta_s} = -U_s(\beta) + P'_\lambda(|\hat\beta_s^{0}|)\,\mathrm{sign}(\beta_s), \quad \text{for } s = 1, \cdots, d.$$
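Hence, $\sqrt{M}\, U_s(\hat\beta) - \sqrt{M}\, P'_\lambda(|\hat\beta_s^{0}|)\,\mathrm{sign}(\hat\beta_s) = 0$, for $s = 1, \cdots, d$. According to Lemma 2, we have
$$\begin{aligned}
\sqrt{M}\, U_s(\hat\beta) &= -\sqrt{M}\, U_s(\beta) + \sqrt{M} \sum_{l=1}^{d} D_{sl}(\hat\beta_l - \beta_l^*) + o_P(1 + M^{1/2}|\hat\beta_s - \beta_s^*|) \\
&= \sqrt{M}\, P'_\lambda(|\beta_s|)\,\mathrm{sign}(\beta_s) + \sqrt{M}\, P''_\lambda(|\beta_s|)\,\mathrm{sign}(\beta_s)(\hat\beta_s - \beta_s^*).
\end{aligned}$$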
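Therefore,
$$\sqrt{M}\{\hat\beta_{10} - \beta_{10} - (D_{11} + \Sigma_{11})^{-1} P'_\lambda(|\beta_{10}|)\,\mathrm{sign}(\beta_{10})\} = -(D_{11} + \Sigma_{11})^{-1} \sqrt{M}\, U_R(\beta) + o_p(1),$$
where $D_{11}$ and $U_R(\beta)$ correspond to the first $d \times d$ submatrix of $D$ and the first $d$ elements of $U(\beta)$ with $\beta_T = (\beta_{10}, 0_{p-d})$, and $P'_\lambda(|\beta_{10}|)\,\mathrm{sign}(\beta_{10}) = (P'_\lambda(|\beta_1|)\,\mathrm{sign}(\beta_1), \cdots, P'_\lambda(|\beta_d|)\,\mathrm{sign}(\beta_d))^T$. According to condition C4 and Lemma 1,
$$\sqrt{M}(D_{11} + \Sigma_{11})\{\hat\beta_{10} - \beta_{10} - (D_{11} + \Sigma_{11})^{-1} P'_\lambda(|\beta_{10}|)\,\mathrm{sign}(\beta_{10})\} \to N(0, V_{11}),$$
where $V_{11}$ is the first $d \times d$ submatrix of $V$.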
References
Sievers G. L. (1983). A weighted dispersion function for estimation in linear models. Communications in Statistics – Theory and Methods 12, 1161–1179.