Variable Selection and Penalized Regression

2.1 Introduction

2.1.1 Variable Selection and Penalized Regression

The literature on statistical methods for variable selection has developed considerably over the last twenty years. Breiman (1996) pointed out that the traditional method of best subset selection was unstable and that this instability could lead to poor performance regarding prediction. While ridge regression (Hoerl and Kennard, 1970) imposes some shrinkage which leads to more stable models, it does not set any coefficients to zero and therefore does not “select” key variables. Tibshirani (1996) proposed a “least absolute

shrinkage and selection operator”, widely known by the acronym “LASSO”. The LASSO attempts to maintain the advantages of both subset selection and ridge regression by shrinking some coefficients and setting other coefficients to zero through the addition of a particular penalty function to the log-likelihood. Several other penalty functions have been developed and studied over the last decade to cope with high dimensional predictor spaces, including the smoothly-clipped absolute deviation (SCAD) (Fan and Li, 2001; Zou and Li, 2008), the adaptive LASSO (Zou, 2006), the elastic net (Zou and Hastie, 2005), the grouped LASSO (Yuan and Lin, 2005), the fused LASSO (Tibshirani et al., 2005) and the minimax concave penalty (MCP) (Zhang, 2010).

While much of the work on variable selection techniques was initially carried out in the context of continuous responses (Tibshirani, 1996; Fan and Li, 2001; Zou, 2006), advances have been made to deal with binary responses (Park and Hastie, 2007; Friedman et al., 2010) and time to event responses (Tibshirani, 1997; Fan and Li, 2002; Zhang and Lu, 2007). For the latter, the penalty term is typically applied to the partial likelihood (Cox, 1975) arising from a semiparametric Cox regression model (Cox, 1972) when data are right-censored.

Witten and Tibshirani (2009) give an excellent overview of the challenges arising with particularly high dimensional covariate data in settings with censored outcomes and provide an extensive discussion of the specific objectives one might have in particular scientific contexts; another useful account can be found in Li and Ma (2013). The inherent difficulty in obtaining robust and generalizable findings from samples with censored responses and high dimensional covariates is evident from the inconsistency of findings across seemingly similar patient populations and the modest gains that have been made despite considerable advances in biotechnology and statistical methods (McShane et al., 2005a). The limitations analysts face due to inadequate sample size of individual studies (Polley et al., 2013) and

the inconsistency of findings across studies has led to an increased interest in synthesizing findings over multiple studies. Assimilating information from several sources can be helpful, but it is important to clearly understand the differences between the frameworks and goals of the studies contributing to this synthesis. Guidelines have been developed for reporting findings from biomarker studies with this in mind, which advocate clear statements of study objectives, study design, methods of processing samples, and the approach to statistical analysis (McShane et al., 2005b; Altman et al., 2012).

Many prospective studies, however, have the added complication that the event times of interest are subject to interval-censoring. In clinical trials involving cancer patients at risk of metastases, for example, new lesions are only detectable when imaging assessments are carried out (Hortobagyi et al., 1996), and the precise time from randomization to the development of a new lesion is unavailable. In patients infected with cytomegalovirus, the time from infection to viral shedding in the blood is only known to lie between the last negative and first positive serum sample (Betensky and Finkelstein, 1999; Cook et al., 2008). Vertebral fractures in patients with osteoporosis are often asymptomatic, and their occurrence is only detected upon a radiographic examination yielding evidence of a new fracture (Riggs et al., 1990). Sun (2006) gives an excellent account of statistical methods for parametric and semiparametric analysis of interval-censored failure time data.

We consider the problem of variable selection in the context of interval-censored time to event data. We adopt a flexible piecewise exponential model (Friedman, 1982) for the event of interest and penalize the complete data likelihood constructed by treating the interval-censored failure times as known. An expectation-maximization (EM) algorithm (Dempster et al., 1977) is then used for variable selection through optimization of the penalized observed data likelihood. The LASSO, adaptive LASSO and SCAD penalty functions are considered.

The remainder of this chapter is organized as follows. In Section 2.1.2 we describe the motivating study with the goal of identifying key human leukocyte antigens associated with the development of arthritis mutilans in a cohort of individuals with psoriatic arthritis. In Section 2.2 we describe a penalized expectation-maximization algorithm based on a piecewise exponential response model, for which existing techniques for variable selection can be exploited to handle interval-censored event times. This is the primary contribution of this chapter. Simulation studies involving multivariate normal covariates vectors are reported in Section 2.3, which demonstrate superior performance of the proposed method over analyses based on mid-point imputation (Lindsey and Ryan, 1998). Additional simulation studies for correlated binary covariates are described in Appendix 2.A and studies of different criteria for selection of tuning parameters (Bradic et al., 2011) are given in Ap- pendix 2.B. The data from the psoriatic arthritis clinic are analyzed in Section 2.4 using a variety of penalty functions, and concluding remarks are given in Section 2.5. An extension of this algorithm is described and evaluated to deal with truncated data in Section 2.6.

In document Modeling and Prediction of Disease Processes Subject to Intermittent Observation (Page 31-34)