ABSTRACT
SHEN, KAI. Contributions to Statistical Methods for Functional Data Analysis and Generalized Additive Model. (Under the direction of Dr. Yichao Wu.)
Thanks to progress in modern data-recording devices, we are able to collect data that vary continuously or intermittently over a continuum. Such data are called “functional data” and can be seen everywhere in modern society. In most cases the continuum over which the function is defined is time, but functional data can also be defined over other continua, for example wavelength or spatial location.
Functional data differ from traditional multivariate data in their inherent infinite dimensionality, which is the most notable feature of functional data. Thanks to this feature, functional data contain far more information than multivariate data. However, the infinite dimensionality of functional data also poses challenges to both theoretical and computational aspects of data analysis. As an active area in statistics research, functional data analysis (FDA) has been developed in recent years and includes the statistical methodologies aimed at functional data. The central methodologies of FDA include functional principal component analysis (FPCA), functional linear regression, etc. The classical statistical problems such as classification and variable selection in the context of functional data have also become a hot topic in FDA.
In Chapter 2, we propose a binary classification method for functional data. The method is developed under the framework of the continuously additive model (CAM), which was recently proposed as a new nonlinear functional regression technique and lends great flexibility to the study of functional data. To achieve binary classification for functional data, we propose to couple the CAM with the support vector machine, a large-margin classifier. The support vector machine is a popular binary classification method that has enjoyed great success but has not yet been widely used for functional data classification. Moreover, the support vector machine is known to be sensitive to outliers because it is based on the unbounded hinge loss. To work around this issue, we propose to couple the CAM with the robust support vector machine, which uses the truncated hinge loss. We illustrate the performance of our methods with simulation examples and two real data sets that involve the classification of spectral data. The proposed approach is compared with classification based on the functional linear model.
response is scalar and the functional variables enter the regression model in a nonlinear fashion. To achieve variable selection in this case, we propose an approach based on penalized least squares within the CAM framework. The proposed approach simultaneously controls the sparsity of the model through a group-level smoothly clipped absolute deviation (SCAD) penalty and the smoothness of the additive surface through a smoothness penalty. The performance of the approach is investigated numerically by Monte Carlo simulations. Some asymptotic properties of the proposed approach and their technical proofs are also given.
As a class of nonparametric models, generalized additive models, proposed by Hastie and Tibshirani (1990), combine properties of generalized linear models with those of additive models. Their flexibility is obtained by replacing the linear predictor in the generalized linear model with a sum of smooth functions of the individual predictor variables. Because of this flexibility, generalized additive models have a wide scope of application and are very popular in nonparametric modeling. Thus the problem of variable selection within generalized additive models deserves our consideration.
© Copyright 2017 by Kai Shen
Contributions to Statistical Methods for Functional Data Analysis and Generalized Additive Model
by Kai Shen
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Statistics
Raleigh, North Carolina 2017
APPROVED BY:
Dr. Len Stefanski Dr. Wenbin Lu
Dr. Zhilin Li Dr. Yichao Wu
DEDICATION
BIOGRAPHY
ACKNOWLEDGEMENTS
First and foremost, I would like to express my sincere gratitude to my advisor Dr. Yichao Wu. I really appreciate his continuous support of my Ph.D. study and research. With his patience, motivation, enthusiasm, and immense knowledge, I have been able to make progress in my Ph.D. research. His insightful guidance helped me all the time during my research and writing of this dissertation. I can hardly imagine completing this Ph.D. without his help and encouragement.
In addition to my advisor, I am also grateful to the rest of my committee members: Dr. Len Stefanski, Dr. Wenbin Lu, and Dr. Zhilin Li. They offered their precious time, support and commitment. My research work is greatly enriched with their insightful comments, valuable suggestions, and expert knowledge.
I would also like to thank the Department of Statistics at North Carolina State University, not only for those faculty members and graduate students who collectively create an excellent academic environment, but also for giving me a Graduate Industrial Trainee opportunity with United Therapeutics Corp., through which I had a chance to learn what it is really like to work in a biotechnology company as a statistician.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
Chapter 2 Binary Classification of Functional Data via Continuously Additive Modeling
2.1 Introduction
2.2 Preliminaries
2.2.1 Binary Classification via Functional Linear Modeling
2.2.2 Binary Classification via Functional Linear Discriminant Analysis
2.3 Proposed Model
2.4 Estimation
2.5 Simulation Studies
2.5.1 Simulation Results
2.6 Data Illustrations
2.6.1 Yeast Gene Expression Data
2.6.2 Orange Juice NIR Spectroscopy Data
2.7 Conclusion
Chapter 3 Variable Selection in Functional Nonlinear Models via Continuously Additive Modeling
3.1 Introduction
3.2 Build Regularization Framework via CAM and Group SCAD Penalty
3.2.1 Model
3.2.2 Zero-Degree Spline Approximation
3.2.3 Group SCAD Penalty and Smoothness Penalty
3.3 Estimation Algorithm and Tuning Parameter Selection
3.4 Simulation Study
3.4.1 Data Generation
3.4.2 Choosing Tuning Parameters
3.4.3 Simulation Results Summary
3.5 Asymptotic Properties
3.6 Conclusion
Chapter 4 Automatic Structure Recovery for Generalized Additive Models
4.1 Introduction
4.2.1 Generalized Additive Models
4.2.2 Local Scoring Algorithm
4.3 Variable Selection via Weighted Local Constant Smoothing
4.3.1 Variable selection
4.3.2 Modified coordinate descent algorithm
4.3.3 Tuning
4.4 Higher-Degree Weighted Local Polynomial Smoothing
4.5 Automatic Structure Recovery
4.6 Simulation Studies
4.7 Real Data Examples
4.7.1 Pima Indians Diabetes Data
4.7.2 M&A Data
4.8 Conclusion
LIST OF TABLES
Table 2.5.1 Average classification error and corresponding standard error (×10^4)
Table 2.5.2 Average classification error and corresponding standard error for g1 with larger sample size (×10^4)
Table 2.6.1 Average classification errors (×10^4) for brewer’s yeast gene expression data
Table 2.6.2 Average classification errors (×10^4) for orange juice NIR spectroscopy data
Table 3.4.1 Nonlinear functional variable selection performance with ntrain = 200, ntest = 1000. Standard errors are given in the parentheses.
LIST OF FIGURES
Figure 2.6.1 Plot of standardized gene expression trajectories for G1 phase regulation in brewer’s yeast.
Figure 2.6.2 Plot of standardized orange juice spectra
Figure 3.2.1 SCAD Penalty, SCAD Penalty Derivative, and Thresholding Rule.
Figure 4.6.1 Solution paths to regularized minimization problem (4.10) for one simulated data of Poisson regression model, plotted over 0 ≤ τ0 ≤ 150.
Figure 4.7.1 Beyond quadratic fit for body mass index in the Pima Indians Diabetes
Chapter 1
Introduction
With the progress in modern technology for data collection and storage, data in the form of a curve, a surface, or anything else that varies over a continuum have become more and more readily available. Collectively, we call such data “functional data” (Ramsay, 1982). The most notable feature of functional data is their intrinsic infinite dimensionality, which poses great challenges to both methodology and computation. Yet infinite dimensionality also brings an important benefit, namely a huge volume of potentially useful information. For this very reason, functional data analysis (FDA), the branch of statistics devoted to methodologies for functional data, has attracted increasing attention in the statistics research community. There are various monographs (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006; Hsing and Eubank, 2015) and review articles (Rice, 2004; Wang et al., 2016; Morris, 2015) on this area.
alternative approach to perform functional nonlinear regression.
Additive modeling has been very successful as a flexible extension of linear regression (Mammen and Park, 2006; Yu et al., 2008), but has only more recently been applied to functional data analysis (Zhang et al., 2013). The continuously additive model (CAM) proposed by Müller et al. (2013) is a flexible class of nonlinear regression models coupling functional predictors and scalar responses. It includes the functional linear model as a special case. A closely related method was proposed by McLean et al. (2014). Recently CAM was extended to the case of functional responses (Ma and Zhu, 2016). In contrast to the frequency-additive models proposed earlier (Müller and Yao, 2008), which are additive in the functional principal component scores of the predictor processes, CAM is time-additive and is obtained as a limit of regression models that are additive in grids of discretized time, as described in detail in Müller et al. (2013). Given the great flexibility of CAM, we investigate its potential application in functional data analysis beyond regression. Specifically, we apply CAM to the problems of functional data classification and functional variable selection.
With a wide range of applications, in addition to functional regression, classification of functional data is an important and challenging problem within functional data analysis. In the traditional setting with multivariate predictors, the classification problem has been widely studied with many different classifiers available. Well known examples are logistic regression, Fisher’s linear discriminant analysis, and support vector machine (SVM; Boser et al., 1992), among many others. For a comprehensive review, see Cristianini and Shawe-Taylor (2000) or Friedman et al. (2001). To classify functional data, James and Hastie (2001), Leng and Müller (2006), and Rossi and Villa (2006) have proposed extensions of linear discriminant analysis, logistic regression, and SVM, respectively, while Zhu et al. (2012) proposed to use functional mixed models to achieve robustness while classifying functional data. Wu and Liu (2013) studied multicategory classification of functional data. Important applications of the classification of functional data are featured in Alonso et al. (2012); Chiou (2012); Francisco-Fernández et al. (2012); Song et al. (2008); Wang and Qu (2014).
unbounded. To overcome this drawback, we will also couple the CAM technique with the robust support vector machine (Wu and Liu, 2013), which is built upon a truncated version of hinge loss. The empirical performance of the proposed methods is evaluated with finite-sample simulation examples and two real data applications. Compared with the classification methods based on the functional linear model, our methods based on CAM show superior performance.
The advent of new data acquisition and storage technologies has also created the demand for analyzing big datasets, which feature a large number of predictors and possibly a large number of observations as well. In order to analyze such datasets both efficiently and accurately, it is increasingly important to use a variable selection method to subset the predictors before performing any further analysis. This leads to another research area of functional data analysis, namely functional variable selection. In the traditional multiple regression setting, selecting a set of necessary variables is a very important problem, and various methods have been proposed to address it. Traditional methods include stepwise regression and best subset selection. The trouble with these methods is that they neglect the stochastic errors associated with the stages of variable selection. Moreover, in some cases their performance can be very unstable: a small change in the data may yield a very different model, as demonstrated in Breiman (1995). To overcome the drawbacks of traditional methods, several shrinkage approaches have been developed in recent years. Tibshirani (1996) proposed to add an L1 penalty to the least squares loss, resulting in the so-called Lasso regularization problem. The Lasso is able to shrink parameters and select variables simultaneously because it shrinks some parameters exactly to zero. However, one drawback of the Lasso is that its variable selection is not consistent unless the irrepresentable condition is imposed (Zhao and Yu, 2006). To overcome this limitation, Fan and Li (2001) developed a new penalty called the smoothly clipped absolute deviation (SCAD) penalty, a non-concave penalization approach that produces sparse solutions by thresholding small estimates to zero and generates unbiased estimates of large coefficients. Furthermore, the estimates obtained through SCAD regularization enjoy the oracle properties, namely consistency in variable selection and estimation performance as good as if the true model were given (Fan and Li, 2001).
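For concreteness, the SCAD penalty can be sketched in a few lines of Python. The closed form below is the standard one from Fan and Li (2001) with their suggested default a = 3.7; the function name `scad_penalty` is an illustrative assumption and is not taken from this dissertation.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001) for a scalar coefficient (lam > 0, a > 2)."""
    t = abs(theta)
    if t <= lam:
        # Lasso-like region: linear growth, which drives small estimates to zero.
        return lam * t
    if t <= a * lam:
        # Quadratic transition that smoothly flattens the penalty.
        return (2 * a * lam * t - t * t - lam * lam) / (2 * (a - 1))
    # Constant region: large coefficients incur no additional shrinkage.
    return lam * lam * (a + 1) / 2
```

The penalty is continuous at both knots and constant beyond a·lam, which is what leaves large coefficients essentially unshrunk.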
selection method nor the penalized approach can be directly applied. To handle this, a common method is to first apply a basis expansion to the functional variable and the coefficient function, and then perform a group-wise selection of the expansion coefficients of the coefficient function through a group-level penalty. Candidates such as the group Lasso (Yuan and Lin, 2006) or the group SCAD (Wang et al., 2007) can be used for this purpose. Based on this rationale, several functional variable selection methods have been developed in the functional linear and generalized functional linear contexts, for example, Matsui and Konishi (2011) and Gertheiss et al. (2013). However, the underlying linearity in the functional variable can be a limiting factor of these methods.
In Chapter 3, we propose a nonlinear functional variable selection approach, which is also built under the CAM framework. Due to the great flexibility of CAM, the approach differs from the aforementioned functional variable selection methods in its capability of accommodating nonlinearity in the regression relationship. Specifically, to achieve functional variable selection we couple the CAM with the group SCAD penalty (Fan and Li, 2001; Wang et al., 2007) and the smoothness penalty (Müller et al., 2013). The coupling poses a penalized least squares problem whose solution simultaneously controls the sparseness of the nonlinear functional predictors and the smoothness of the additive surfaces which characterize the continuously additive model. The finite-sample performance of the approach is numerically investigated by Monte Carlo simulations. The asymptotic properties associated with this approach are developed, and their technical proofs are also provided.
papers (e.g., Avalos et al. (2007), Cantoni et al. (2011), Wu and Stefanski (2015)). For generalized additive models, the aforementioned traditional variable selection methods such as stepwise regression and best subset selection may also be applied. However, they can dramatically increase the computational burden because a GAM must be fitted for each submodel.
Chapter 2
Binary Classification of Functional Data via Continuously Additive Modeling
2.1 Introduction
To take advantage of the great flexibility of the continuously additive model, we propose here to couple it with the SVM to perform nonlinear classification of functional data. Robustness is an important issue in classification, as pointed out in Zhu et al. (2012) and Wu and Liu (2013). Because the SVM is known not to be robust to outliers, as it is based on the unbounded hinge loss function (Wu and Liu, 2007), we also couple the continuously additive model with the truncated-hinge-loss SVM, which has been shown to be robust to outliers (Wu and Liu, 2007). One challenge posed by the truncated-hinge-loss SVM is that the associated optimization problem is non-convex. Noting that the truncated hinge loss can be rewritten as the difference of two convex functions, we employ the difference convex algorithm (DCA, An and Tao, 1997) to handle this challenge. In our setting, combining the loss functions with the continuously additive model leads to a large-scale multivariate optimization problem. To tackle it we employ the modeling language AMPL (Fourer et al., 2003), which is widely used in optimization.
The rest of the chapter is organized as follows. Section 2.2 introduces some
reviews the continuously additive model (CAM). In Section 2.4 we discuss how to couple the CAM with the SVM and the truncated-hinge-loss SVM, and provide details on estimation. Section 2.5 includes a set of simulation studies to illustrate the performance of the proposed classification methods for functional data. Section 2.6 is devoted to two real data examples. We conclude with a brief discussion in Section 2.7.
2.2 Preliminaries
2.2.1 Binary Classification via Functional Linear Modeling
As mentioned in the introduction chapter, functional linear regression is an important methodology in functional data analysis. The functional linear model can also be used for the classification of functional data by replacing the squared loss with an appropriate loss function for classification. For binary classification with a functional predictor $X(\cdot)$ defined over a compact domain $\mathcal{T}$, we assume that $X(\cdot) \in L^2(\mathcal{T})$ and denote the binary response by $Y \in \{-1, 1\}$. Here $L^2(\mathcal{T})$ denotes the set of square-integrable functions over $\mathcal{T}$. Using the functional linear modeling technique, we seek a decision function
\[ f(X) = b_0 + \int_{\mathcal{T}} (X(t) - \mu(t))\,\beta(t)\,dt, \]
whose sign can be used to predict the binary class label. Here $\mu(t) = E(X(t))$ denotes the mean predictor function.
Towards model estimation, the Karhunen–Loève representation of $X(t)$ can be used as in Yao et al. (2005). Namely, the predictor $X(\cdot)$ is represented as
\[ X(t) = \mu(t) + \sum_{m=1}^{\infty} \xi_m \phi_m(t), \]
where $\{\phi_m(\cdot),\ m = 1, 2, \ldots\}$ are the eigenfunctions of the covariance operator of $X(\cdot)$, and $\{\xi_m = \int_{\mathcal{T}} (X(t) - \mu(t))\,\phi_m(t)\,dt,\ m = 1, 2, \ldots\}$ are the corresponding functional principal component scores. Since the eigenfunctions $\{\phi_m(\cdot),\ m = 1, 2, \ldots\}$ form an orthonormal basis for $L^2(\mathcal{T})$, the slope parameter function $\beta(t)$ can be decomposed accordingly as $\beta(t) = \sum_{m=1}^{\infty} \beta_m \phi_m(t)$. See the functional linear regression paper Yao et al. (2005) for more details.
Given the aforementioned decompositions of the functions $X(t)$ and $\beta(t)$, the decision function $f(X)$ can be written as
\[ f(X) = b_0 + \int_{\mathcal{T}} (X(t) - \mu(t))\,\beta(t)\,dt = b_0 + \sum_{m=1}^{\infty} \xi_m \beta_m \tag{2.1} \]
by noting that the eigenfunctions are orthonormal. The right-hand side of (2.1) involves a summation of infinitely many terms. Yao et al. (2005) proposed to truncate the summation at some large value $M$, which can potentially diverge to infinity slowly as the sample size increases.
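The orthonormality step behind (2.1) can be spelled out as follows. This is a routine expansion added for illustration, assuming that the sums and the integral may be interchanged; the index $l$ is introduced here only to distinguish the two sums.

```latex
\begin{align*}
\int_{\mathcal{T}} (X(t)-\mu(t))\,\beta(t)\,dt
 &= \int_{\mathcal{T}} \Big(\sum_{m=1}^{\infty}\xi_m\phi_m(t)\Big)
    \Big(\sum_{l=1}^{\infty}\beta_l\phi_l(t)\Big)\,dt \\
 &= \sum_{m=1}^{\infty}\sum_{l=1}^{\infty}\xi_m\beta_l
    \int_{\mathcal{T}}\phi_m(t)\,\phi_l(t)\,dt \\
 &= \sum_{m=1}^{\infty}\xi_m\beta_m ,
\end{align*}
```

since $\int_{\mathcal{T}}\phi_m(t)\,\phi_l(t)\,dt$ equals 1 when $m = l$ and 0 otherwise.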
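Denote the observed sample by $\{(X_i, Y_i),\ i = 1, \ldots, n\}$. For each individual trajectory $X_i(t)$, the functional principal component scores $\xi_{im}$, $m = 1, \ldots, M$, can be estimated by using the PACE package (Yao et al., 2005). Given the estimated functional principal component scores $\hat{\xi}_{im}$, the expression on the right-hand side of (2.1) can be approximated by
\[ f(X_i) \approx b_0 + \hat{\boldsymbol{\xi}}_i^T \boldsymbol{\beta} \]
with $\hat{\boldsymbol{\xi}}_i = (\hat{\xi}_{i1}, \hat{\xi}_{i2}, \ldots, \hat{\xi}_{iM})^T$ and $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_M)^T$. In functional logistic regression, the coefficients $\boldsymbol{\beta}$ are estimated by minimizing the negative of the binomial log-likelihood
\[ \min_{\boldsymbol{\beta}} \sum_{i=1}^{n} \log\Big[1 + \exp\big\{-Y_i(b_0 + \hat{\boldsymbol{\xi}}_i^T \boldsymbol{\beta})\big\}\Big], \]
where $Y_i \in \{-1, 1\}$. For the functional support vector machine, the coefficients $\boldsymbol{\beta}$ can be estimated by solving
\[ \min_{\boldsymbol{\beta}} \sum_{i=1}^{n} H\big(Y_i(b_0 + \hat{\boldsymbol{\xi}}_i^T \boldsymbol{\beta})\big) + \lambda \|\boldsymbol{\beta}\|^2, \tag{2.2} \]
where $H(u) = \max(1 - u, 0)$ denotes the hinge loss, $\|\boldsymbol{\beta}\| = \sqrt{\sum_{m=1}^{M} \beta_m^2}$, and $\lambda > 0$ is a regularization parameter. Denote the optimizer by $\hat{\boldsymbol{\beta}}$ and $\hat{b}_0$. Then the estimated classification rule is given by $\hat{Y}_i = \mathrm{sign}(\hat{b}_0 + \hat{\boldsymbol{\xi}}_i^T \hat{\boldsymbol{\beta}})$. To achieve robustness to outliers, one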
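As a rough illustration of objective (2.2) and the resulting classification rule, the following Python sketch evaluates the penalized hinge loss for given scores and coefficients. It does not perform the minimization. The function names, the toy scores, and the convention of assigning the label +1 when the decision value is exactly zero are assumptions made for illustration.

```python
def decision_value(b0, beta, scores):
    """Linear decision value b0 + xi^T beta for one vector of FPC scores."""
    return b0 + sum(x * b for x, b in zip(scores, beta))


def fsvm_objective(b0, beta, score_rows, labels, lam):
    """Hinge loss summed over the sample plus lam * ||beta||^2, as in (2.2)."""
    loss = 0.0
    for scores, y in zip(score_rows, labels):
        margin = y * decision_value(b0, beta, scores)
        loss += max(1.0 - margin, 0.0)  # hinge loss H(u) = max(1 - u, 0)
    return loss + lam * sum(b * b for b in beta)


def predict_label(b0, beta, scores):
    """Sign of the decision value; ties at zero go to +1 (an assumption)."""
    return 1 if decision_value(b0, beta, scores) >= 0 else -1


# Toy example with M = 2 hypothetical estimated scores per curve.
score_rows = [[1.0, 0.0], [-1.0, 0.0]]
labels = [1, -1]
print(fsvm_objective(0.0, [2.0, 0.0], score_rows, labels, lam=0.5))
```

In the toy call both margins exceed one, so only the ridge term contributes to the objective.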
2.2.2 Binary Classification via Functional Linear Discriminant Analysis
Besides the functional classification approach based upon regression, there is another popular approach derived from the classical linear discriminant analysis (LDA) method. Given a new data object, the idea behind this approach is to classify according to the largest class conditional probability by applying the Bayes rule. Some relevant approaches based on this include a functional data-analytic approach to signal discrimination (Hall et al., 2001), functional linear discriminant analysis to classify curves (James and Hastie, 2001) and kernel functional classification methods for nonparametric curve discrimination (Ferraty and Vieu, 2003; Chang et al., 2014; Zhu et al., 2012).
Functional linear discriminant analysis (FLDA, James and Hastie, 2001) is an extension of the classical Fisher’s linear discriminant analysis to the classification of functional data. It can be applied to infinite-dimensional functional data through fine discretization of the continuous time interval that forms the domain of the functional data. The FLDA method derived in James and Hastie (2001) (as we will see next) also works with sparse and irregular functional data.
In the context of binary classification, the FLDA assumes a functional data model of the form
\[ \mathbf{Y}_{ij} = \big(Y_{ij}(t_{ij1}), \ldots, Y_{ij}(t_{ijn_{ij}})\big)^T + \boldsymbol{\epsilon}_{ij}, \quad i = 1, 2,\ j = 1, \ldots, m_i, \tag{2.3} \]
where $i$ denotes the class label and $m_i$ denotes the number of individual curves in class $i$. Here $\mathbf{Y}_{ij}$ is the observation vector for the $j$th curve of class $i$. It contains $n_{ij}$ observations measured at discrete time points $t_{ij1}, \ldots, t_{ijn_{ij}}$. In addition, $\boldsymbol{\epsilon}_{ij}$ is the corresponding measurement error vector, and the components of $\boldsymbol{\epsilon}_{ij}$ are assumed to be independent of each other with zero mean and constant variance $\sigma^2$. By expanding the function $Y_{ij}(t)$ in terms of natural cubic spline basis functions, model (2.3) can be rewritten as
\[ \mathbf{Y}_{ij} = \mathbf{S}_{ij}\boldsymbol{\eta}_{ij} + \boldsymbol{\epsilon}_{ij}, \quad i = 1, 2,\ j = 1, \ldots, m_i. \tag{2.4} \]
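Let $\mathbf{s}(t)$ denote a spline basis with dimension $d$; then $\mathbf{S}_{ij} = \big(\mathbf{s}(t_{ij1}), \ldots, \mathbf{s}(t_{ijn_{ij}})\big)^T$ in (2.4) is the corresponding spline basis matrix of size $n_{ij} \times d$ and $\boldsymbol{\eta}_{ij}$ is the $d \times 1$ vector of spline coefficients, which is assumed to follow a normal distribution as in classical LDA, namely
\[ \boldsymbol{\eta}_{ij} = \boldsymbol{\mu}_i + \boldsymbol{\gamma}_{ij}, \quad \boldsymbol{\gamma}_{ij} \overset{\text{i.i.d.}}{\sim} N(\mathbf{0}, \boldsymbol{\Gamma}) \tag{2.5} \]
for some $d \times d$ covariance matrix $\boldsymbol{\Gamma}$. Substituting (2.5) for $\boldsymbol{\eta}_{ij}$ in (2.4) yields the FLDA model:
\[ \mathbf{Y}_{ij} = \mathbf{S}_{ij}(\boldsymbol{\mu}_i + \boldsymbol{\gamma}_{ij}) + \boldsymbol{\epsilon}_{ij}, \quad i = 1, 2,\ j = 1, \ldots, m_i, \tag{2.6} \]
\[ \boldsymbol{\gamma}_{ij} \sim N(\mathbf{0}, \boldsymbol{\Gamma}), \quad \boldsymbol{\epsilon}_{ij} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}). \]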
The above FLDA model can be fitted by employing rank constraints and the EM algorithm. Interested readers can refer to James and Hastie (2001) and references therein for detailed fitting procedures. With the fitted model, binary classification can be achieved by using Bayes’ rule and comparing posterior class probabilities.
2.3 Proposed Model
To increase model flexibility, Müller et al. (2013) studied functional nonlinear regression and proposed the continuously additive model (CAM). For regression with a scalar response $Y$ and a functional predictor $X(\cdot)$, the continuously additive model assumes that
\[ E(Y \mid X) = E(Y) + \int_{\mathcal{T}} g\{t, X(t)\}\,dt, \tag{2.7} \]
where the bivariate function $g(t, x)$ is an unknown parameter function to be estimated and $\mathcal{T}$ is the compact domain of $X(\cdot)$. It includes functional linear regression as a special case that is obtained by setting $g(t, x) = \beta(t)x$ for some unknown slope function $\beta(t)$, $t \in \mathcal{T}$.
To conduct model estimation, Müller et al. (2013) proposed to approximate the unknown parameter function by a simple spline, implemented as a spline of order 0, i.e., as a step function. Coefficients associated with the approximation are estimated by solving a penalized least squares problem, where a smoothness penalty is applied to regularize the approximated step function.
using logistic regression. Yet it is well known that logistic regression is an example of the generalized linear model (McCullagh and Nelder, 1991) and is based on a very restrictive parametric assumption on the class conditional probabilities. On the other hand, the SVM (Cortes and Vapnik, 1995) is a flexible nonparametric classification method that targets only the classification boundary, where the class conditional probability is the same for both classes.
As is customary for the SVM, we code the binary response as $-1$ or $1$, that is, $Y \in \{-1, 1\}$. Our data are denoted by $\{(X_i, Y_i),\ i = 1, 2, \ldots, n\}$ with binary response $Y_i \in \{-1, 1\}$ and functional predictor $X_i(\cdot)$ over a compact domain $\mathcal{T}$. Based on the observed data, we estimate a decision function $f(X) = \int_{t \in \mathcal{T}} g\{t, X(t)\}\,dt$ for some unknown parameter function $g(t, x)$, obtaining an estimator $\hat{f}(X)$ by borrowing the idea of the continuously additive model. The sign of the decision function is used as the classification rule, so that $\hat{Y} = \mathrm{sign}(\hat{f}(X))$ gives the predicted class label.
We propose to couple the SVM with the continuously additive modeling framework by estimating the decision function through solving the following optimization problem
\[ \min_{g} \sum_{i=1}^{n} H\Big(Y_i \int_{t \in \mathcal{T}} g\{t, X_i(t)\}\,dt\Big) + \lambda P_s(g), \tag{2.8} \]
where $H(u)$ is the aforementioned hinge loss with $u = Y \int_{t \in \mathcal{T}} g\{t, X(t)\}\,dt$ being the functional margin. Here $P_s(g)$ denotes some smoothness penalty on the parameter function $g(\cdot, \cdot)$, and $\lambda > 0$ is a regularization parameter, which controls the balance between data fitting measured by the hinge loss and model complexity measured by the smoothness penalty. The optimization is solved with respect to $g$ over some function space.
Denote the class conditional probability by $p(x) = P(Y = 1 \mid X = x)$. Based on the Fisher consistency of the hinge loss established by Lin (2002), $\{x : \int_{t \in \mathcal{T}} \hat{g}\{t, x(t)\}\,dt = 0\}$ targets the Bayes classification boundary $\{x : p(x) = 1/2\}$. Here $\hat{g}(\cdot, \cdot)$ denotes the optimizer of (2.8).
2.4 Estimation
generality that $\mathcal{T} = [0, 1]$ and that the range of $X(\cdot)$ is also $[0, 1]$. We approximate $g(t, x)$ by a step function $g_{p,q}(t, x) = \sum_{j=1}^{p} \sum_{k=1}^{q} \gamma_{jk} 1_{\{(t,x) \in B_{jk}\}}$, where the coefficients are $\gamma_{jk} = g(t_j, x_k)$ and the cells $B_{jk} = [t_j - 1/(2p),\ t_j + 1/(2p)] \times [x_k - 1/(2q),\ x_k + 1/(2q)]$, for $j = 1, 2, \ldots, p$ and $k = 1, 2, \ldots, q$, form an equidistant partition of the whole domain $[0, 1] \times [0, 1]$. Here $t_j = (2j - 1)/(2p)$ and $x_k = (2k - 1)/(2q)$ form uniform grids over the $t$ domain and the $x$ domain, respectively. Then the smoothness penalty can be approximated by
\[ P_s(\gamma) = \sum_{j=2,k=1}^{p-1,q} p^2 (\gamma_{j-1,k} - 2\gamma_{j,k} + \gamma_{j+1,k})^2 + \sum_{j=1,k=2}^{p,q-1} q^2 (\gamma_{j,k-1} - 2\gamma_{j,k} + \gamma_{j,k+1})^2. \tag{2.9} \]
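The discretized penalty (2.9) is a sum of squared second differences along each axis. A minimal Python sketch, assuming the coefficients are stored as a nested list `gamma[j][k]` with 0-based indices (the function name and storage layout are illustrative choices, not the author's implementation):

```python
def smoothness_penalty(gamma):
    """Evaluate (2.9) for a p-by-q grid of step-function coefficients."""
    p, q = len(gamma), len(gamma[0])
    penalty = 0.0
    # Second differences in the t direction (interior rows only).
    for j in range(1, p - 1):
        for k in range(q):
            d = gamma[j - 1][k] - 2.0 * gamma[j][k] + gamma[j + 1][k]
            penalty += p ** 2 * d ** 2
    # Second differences in the x direction (interior columns only).
    for j in range(p):
        for k in range(1, q - 1):
            d = gamma[j][k - 1] - 2.0 * gamma[j][k] + gamma[j][k + 1]
            penalty += q ** 2 * d ** 2
    return penalty


# A surface that is linear in both indices has zero second differences.
p, q = 4, 3
plane = [[2.0 * j + 3.0 * k for k in range(q)] for j in range(p)]
# A surface that is quadratic in the t index is penalized.
bowl = [[float(j * j) for _ in range(q)] for j in range(p)]
```

Planar surfaces are left unpenalized, so the penalty shrinks the fitted step function towards a surface that is linear in $t$ and $x$ rather than towards zero.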
To provide an approximation to $\int_{t \in \mathcal{T}} g\{t, X_i(t)\}\,dt$, for a predictor $X(\cdot)$ we define
\[ I_{jk} = \{t \in [0, 1] : \{t, X(t)\} \in B_{jk}\} \quad \text{and} \quad Z_{jk} = \int 1_{I_{jk}}(t)\,dt \tag{2.10} \]
for $1 \le j \le p$ and $1 \le k \le q$. We define $Z_{ijk}$, $j = 1, 2, \ldots, p$, $k = 1, 2, \ldots, q$, analogously for the $i$th predictor $X_i(\cdot)$, $i = 1, 2, \ldots, n$. The resulting approximation for $\int_{t \in \mathcal{T}} g\{t, X_i(t)\}\,dt$ has the form
\[ \sum_{j,k=1}^{p,q} \gamma_{jk} Z_{ijk}. \tag{2.11} \]
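The quantities $Z_{jk}$ in (2.10) are occupation times: the length of time the graph of the curve spends in cell $B_{jk}$. The Python sketch below approximates them with a midpoint Riemann sum and then evaluates (2.11). The grid size `n_grid`, the example curve, and the function names are assumptions for illustration only.

```python
import math


def occupation_times(curve, p, q, n_grid=2000):
    """Approximate Z_jk in (2.10) by a midpoint Riemann sum over t in [0, 1].

    `curve` maps t in [0, 1] to X(t) in [0, 1]; cell indices are 0-based here,
    so z[j][k] corresponds to Z_{j+1,k+1} in the text.
    """
    z = [[0.0] * q for _ in range(p)]
    dt = 1.0 / n_grid
    for i in range(n_grid):
        t = (i + 0.5) * dt
        j = min(int(t * p), p - 1)                 # t-bin containing t
        k = min(max(int(curve(t) * q), 0), q - 1)  # x-bin containing X(t)
        z[j][k] += dt
    return z


def cam_integral(gamma, z):
    """Step-function approximation (2.11): sum over j, k of gamma_jk * Z_jk."""
    return sum(g * w for g_row, z_row in zip(gamma, z) for g, w in zip(g_row, z_row))


# Hypothetical curve with values in [0.1, 0.9].
curve = lambda t: 0.5 + 0.4 * math.sin(2.0 * math.pi * t)
p, q = 4, 40
z = occupation_times(curve, p, q)

# Linear special case g(t, x) = x, i.e. gamma_jk = x_k = (2k - 1) / (2q) with 1-based k.
gamma_linear = [[(2 * (k + 1) - 1) / (2 * q) for k in range(q)] for _ in range(p)]
approx = cam_integral(gamma_linear, z)  # approximates the integral of X(t), which is 0.5
```

Because each curve enters the fit only through its $p \times q$ occupation times, the functional problem reduces to a finite-dimensional one in the coefficients $\gamma_{jk}$.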
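For a given sample $(X_i, Y_i)$, $i = 1, \ldots, n$, by combining the above approximations, we aim at the solution of
\[ \min_{\gamma_{jk}} \sum_{i=1}^{n} H\Big(Y_i \sum_{j,k=1}^{p,q} \gamma_{jk} Z_{ijk}\Big) + \lambda \Big\{ \sum_{j=2,k=1}^{p-1,q} p^2 (\gamma_{j-1,k} - 2\gamma_{j,k} + \gamma_{j+1,k})^2 + \sum_{j=1,k=2}^{p,q-1} q^2 (\gamma_{j,k-1} - 2\gamma_{j,k} + \gamma_{j,k+1})^2 \Big\} \tag{2.12} \]
and denote the minimizer by $\hat{\gamma}_{jk}$. Then our estimate of the bivariate parameter function $g(t, x)$ is given by $\hat{g}(t, x) = \sum_{j=1}^{p} \sum_{k=1}^{q} \hat{\gamma}_{jk} 1_{\{(t,x) \in B_{jk}\}}$. For a predictor $X(\cdot)$, the corresponding class label is predicted by $\hat{Y} = \mathrm{sign}\big(\sum_{j=1}^{p} \sum_{k=1}^{q} \hat{\gamma}_{jk} Z_{jk}\big)$, where $Z_{jk}$ is defined in (2.10).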
SVM classifier is its sensitivity to outliers in the training data, especially outliers that are far from their own classes (Wu and Liu, 2007). For SVM, it can be shown that only training observations with functional margin less than one have impact on the estimation of the decision boundary. The hinge loss increases to infinity when the functional margin goes to negative infinity. Thus outliers with big negative functional margin can strongly affect the SVM classification boundary.
To improve the SVM classifier’s robustness to outliers, Wu and Liu (2007) proposed to use the truncated hinge loss
\[ T_s(u) = H(u) - H_s(u), \tag{2.13} \]
where $H_s(u) = \max(s - u, 0)$ and $s$ ($\le 0$) denotes the location of truncation. When $u > s$, the truncated hinge loss is the same as the hinge loss, while for $u \le s$ it takes the constant value $H(s)$. Therefore it is resistant to the effects of outliers on the estimated classification boundary, whereas the unbounded hinge loss is not. Wu and Liu (2007) showed that the truncated hinge loss is Fisher-consistent for binary classification as long as the truncation location satisfies $-1 \le s \le 0$.
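The definition (2.13) translates directly into code. In the following Python sketch the function names are illustrative, and the default $s = -1$ is simply one admissible truncation location:

```python
def hinge(u):
    """Hinge loss H(u) = max(1 - u, 0)."""
    return max(1.0 - u, 0.0)


def shifted_hinge(u, s):
    """H_s(u) = max(s - u, 0)."""
    return max(s - u, 0.0)


def truncated_hinge(u, s=-1.0):
    """Truncated hinge loss T_s(u) = H(u) - H_s(u), as in (2.13)."""
    return hinge(u) - shifted_hinge(u, s)


# For margins u > s the two losses agree; for u <= s the truncated loss is capped at H(s).
for u in (2.0, 0.5, -1.0, -5.0, -100.0):
    print(u, hinge(u), truncated_hinge(u, s=-1.0))
```

An observation with a very negative margin therefore contributes at most $H(s) = 1 - s$ to the objective, no matter how far it lies on the wrong side of the boundary.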
While the truncation achieves robustness in the resulting classifier, it has the downside that it makes the truncated hinge loss non-convex. The non-convexity poses great challenges for solving the corresponding optimization problem. Noting that the truncated hinge loss can be written as the difference of two convex functions as in (2.13), Wu and Liu (2007) proposed to use the difference convex algorithm (DCA, An and Tao, 1997).
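When employing the truncated hinge loss, we need to solve the optimization problem
\[ \min_{\gamma_{jk}} \sum_{i=1}^{n} T_s\Big(Y_i \sum_{j,k=1}^{p,q} \gamma_{jk} Z_{ijk}\Big) + \lambda \Big\{ \sum_{j=2,k=1}^{p-1,q} p^2 (\gamma_{j-1,k} - 2\gamma_{j,k} + \gamma_{j+1,k})^2 + \sum_{j=1,k=2}^{p,q-1} q^2 (\gamma_{j,k-1} - 2\gamma_{j,k} + \gamma_{j,k+1})^2 \Big\} \tag{2.14} \]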
the non-convexity, the DCA algorithm replaces the non-convex term
\[ -\sum_{i=1}^{n} H_s(u_i) = -\sum_{i=1}^{n} H_s\Big(Y_i \sum_{j,k=1}^{p,q} \gamma_{jk} Z_{ijk}\Big) \tag{2.15} \]
by its linear approximation at the current solution. Denote the current solution by $\tilde{\gamma}_{jk}$. Then the linear approximation is given by
\[ \sum_{i=1}^{n} Y_i \Big(\sum_{j,k=1}^{p,q} \gamma_{jk} Z_{ijk} - \sum_{j,k=1}^{p,q} \tilde{\gamma}_{jk} Z_{ijk}\Big) I_{\{Y_i \sum_{j,k=1}^{p,q} \tilde{\gamma}_{jk} Z_{ijk} \le s\}}, \tag{2.16} \]
where $I_A = 1$ if $A$ is true and 0 otherwise.
Upon substituting this linear approximation, the objective function becomes convex and can be easily minimized by a quadratic programming solver. This step is iterated until convergence. Note that these linear approximations are equivalent up to a constant if $I_{\{Y_i \sum_{j,k=1}^{p,q} \tilde{\gamma}_{jk} Z_{ijk} \le s\}}$ is the same at two consecutive iterations for all $i$. Whenever this happens, the corresponding optimization problems after substituting the linear approximation are equivalent, as their objective functions differ only by a constant and consequently have the same optimizer. In this case, the algorithm has converged.
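The linearization step and the stopping rule can be illustrated on scalar margins. The Python sketch below does not solve the convex subproblem, which in the text is handed to a quadratic programming solver. It only computes the indicators in (2.16), builds the convex surrogate obtained by replacing $-H_s$ with its tangent at the current margin, and checks the indicator-based convergence criterion. All function names are hypothetical.

```python
def hinge(u):
    return max(1.0 - u, 0.0)


def shifted_hinge(u, s):
    return max(s - u, 0.0)


def truncated_hinge(u, s):
    return hinge(u) - shifted_hinge(u, s)


def dca_indicators(current_margins, s):
    """Indicators I{u_tilde_i <= s} from (2.16), one per observation."""
    return [1 if u <= s else 0 for u in current_margins]


def convex_surrogate(u, u_tilde, s):
    """H(u) plus the linearization of -H_s at the current margin u_tilde.

    The constant -H_s(u_tilde) is kept so that the surrogate touches T_s at u_tilde.
    """
    slope = 1.0 if u_tilde <= s else 0.0
    return hinge(u) - shifted_hinge(u_tilde, s) + slope * (u - u_tilde)


def dca_converged(previous_indicators, new_indicators):
    """Stop when the indicators agree for all i at two consecutive iterations."""
    return previous_indicators == new_indicators
```

Since $H_s$ is convex, its tangent lies below it, so the surrogate lies above $T_s$ everywhere and coincides with it at the current margin. This is why each convex subproblem cannot increase the original objective.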
2.5 Simulation Studies
In this section we compare the performance of the proposed methods with some existing methods in simulations. The comparison methods include functional linear discriminant analysis (FLDA) and functional logistic regression based on either functional linear modeling (FLM) or continuously additive modeling (CAM) techniques. Namely, in addition to the FLDA, we include six classification methods obtained by combining different modeling techniques (FLM or CAM) with different loss functions (hinge loss (SVM), truncated hinge loss (RSVM), or logistic loss (Logistic)). See Section 2.2 for some brief background on these existing methods.
We generated random predictor curves by setting $X(t) = \sum_{k=1}^{4} \xi_k \phi_k(t)$ for $t \in [0, T]$ with $T = 10$, $\xi_1 = \cos(U_1)$, $\xi_2 = \sin(U_1)$, $\xi_3 = \cos(U_2)$, and $\xi_4 = \sin(U_2)$. Here $U_1$ and $U_2$ are independently and identically distributed as Uniform$[0, 2\pi]$ and we choose $\phi_1(t) =$
response is generated in two steps. We first generate $\tilde{Y} = \operatorname{sign}\big(\int_0^T g(t, X(t))\,dt - M\big)$, where $M = E\big[\int_0^T g(t, X(t))\,dt\big]$. Here $M$ is not readily available due to the nonlinearity inherent in the function $g(\cdot,\cdot)$. We approximate it with a Monte Carlo approximation based on a sample of size 2000.
To study the robustness of the RSVM classifier, we want to bring in outliers, i.e., points far from their own classes. For that purpose we introduce random flipping errors to the simulation data. Specifically, we apply random flipping to $\tilde{Y}$ to generate our final binary response. Denote the flipping percentage by $p$. The final binary response is given by $Y = \tilde{Y}$ with probability $1 - p$ and $Y = -\tilde{Y}$ with probability $p$.
We consider the following data generating functions: $g_1(t, x) = (t - x - 5)^2$, $g_2(t, x) = tx + 5$, and $g_3(t, x) = \cos(t + x + 5)$. We use training, tuning, and test data sets of sizes 200, 200, and 2000, respectively. To study the effect of sample size on the classification performance, we choose the function $g_1$ and redo the simulation with larger training and tuning data sets of size 400. For each simulation setting, the test and tuning data sets are independently generated once and used for all repetitions, while the training data are independently generated for each repetition.
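The data-generating steps above can be sketched as follows. This is a minimal illustration, not the code used for the reported results. The basis functions are an assumption (the Fourier basis stated in Section 3.4.1 is borrowed here), the integral is approximated by a trapezoid rule on 201 grid points, and the names `basis`, `draw_score`, and `simulate` are hypothetical.

```python
import math
import random

T = 10.0
GRID = [T * i / 200 for i in range(201)]  # trapezoid grid on [0, T]


def basis(t):
    # ASSUMPTION: Fourier basis borrowed from Section 3.4.1.
    w = 2 * math.pi * t / T
    return (math.sin(w), math.cos(w), math.sin(2 * w), math.cos(2 * w))


def g1(t, x):
    return (t - x - 5) ** 2


def draw_score(rng, g):
    """Draw one curve; return its coefficients xi and the integral of g(t, X(t))."""
    u1, u2 = rng.uniform(0, 2 * math.pi), rng.uniform(0, 2 * math.pi)
    xi = (math.cos(u1), math.sin(u1), math.cos(u2), math.sin(u2))
    vals = [g(t, sum(a * b for a, b in zip(xi, basis(t)))) for t in GRID]
    h = GRID[1] - GRID[0]
    return xi, h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


def simulate(n, flip_p, g=g1, seed=0, mc_size=2000):
    """Return a list of (xi, y_clean, y); y is y_clean flipped with probability flip_p."""
    rng = random.Random(seed)
    # Monte Carlo estimate of M = E[int_0^T g(t, X(t)) dt].
    m_hat = sum(draw_score(rng, g)[1] for _ in range(mc_size)) / mc_size
    data = []
    for _ in range(n):
        xi, score = draw_score(rng, g)
        y_clean = 1 if score >= m_hat else -1
        y = -y_clean if rng.random() < flip_p else y_clean
        data.append((xi, y_clean, y))
    return data
```

For example, `simulate(200, 0.10)` would produce a training set of size 200 with a 10% flipping rate, so the Bayes error of the flipped labels equals the flipping rate.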
When implementing the CAM framework, we choose $p = 40$ and $q = 40$ for the domains of $t$ and $x$, respectively, following the suggestion of Müller et al. (2013). For the truncated-hinge-loss SVM in both the CAM and FLM frameworks, following the suggestion of Wu and Liu (2007), we choose the truncation location parameter $s$ to be $-1$, corresponding to the least truncation that guarantees Fisher consistency. The regularization parameter $\lambda$ is tuned by minimizing the classification error over the independent tuning set. The performance of the properly tuned classifier is evaluated in terms of classification error on the independent test set. We report the average classification error over 100 repetitions for each simulation setting together with the corresponding Bayes error, which is the same as the random flipping percentage used in the data generation.
2.5.1
Simulation Results
Table 2.5.1: Average classification error and corresponding standard error (×10^4)
     RSVM                SVM                 Logistic            FLDA      Bayes
     CAM       FLR       CAM       FLR       CAM       FLR                 Error
g1   37(3)     136(10)   37(3)     136(10)   122(8)    197(12)   797(4)    0%
g1   567(7)    639(9)    680(10)   800(10)   679(11)   853(12)   1204(5)   5%
g1   1081(10)  1150(10)  1182(11)  1395(15)  1196(11)  1445(14)  1739(7)   10%
g1   1593(9)   1732(12)  1680(10)  1987(14)  1685(15)  1986(15)  2276(24)  15%
g1   2138(12)  2159(10)  2209(11)  2464(12)  2219(14)  2441(15)  3025(46)  20%
g1   3118(8)   3246(12)  3159(4)   3392(9)   3193(11)  3385(13)  4328(37)  30%
g2   34(2)     136(10)   33(2)     136(10)   120(7)    195(12)   797(4)    0%
g2   567(8)    639(8)    673(12)   813(10)   649(10)   848(11)   1244(5)   5%
g2   1080(12)  1155(10)  1173(11)  1342(12)  1185(11)  1376(13)  1645(8)   10%
g2   1587(10)  1682(9)   1671(10)  1895(12)  1691(12)  1906(13)  2183(20)  15%
g2   2136(15)  2238(15)  2202(13)  2452(12)  2226(16)  2417(13)  3039(49)  20%
g2   3198(19)  3264(12)  3204(11)  3394(9)   3206(16)  3347(14)  4239(35)  30%
g3   347(8)    1275(10)  347(8)    1292(9)   356(10)   1265(8)   2275(22)  0%
g3   923(12)   1642(11)  961(11)   1691(9)   974(11)   1656(8)   2567(23)  5%
g3   1468(14)  2006(12)  1508(14)  2097(12)  1549(17)  2040(10)  2865(23)  10%
g3   1988(17)  2442(10)  2008(14)  2549(10)  2036(14)  2483(9)   3123(20)  15%
g3   2543(15)  2842(11)  2561(12)  2968(12)  2607(15)  2902(11)  3395(18)  20%
g3   3589(17)  3667(12)  3560(14)  3761(16)  3602(15)  3704(13)  3992(19)  30%
SVM: support vector machine with the hinge loss; RSVM: robust SVM with the truncated hinge loss; FLDA: functional linear discriminant analysis.
Bivariate functions: $g_1(t, x) = (t - x - 5)^2$, $g_2(t, x) = tx + 5$, $g_3(t, x) = \cos(t + x + 5)$.
Standard errors (×10^4) are given in brackets.
Training and tuning sets of size 200; test set of size 2000; results over 100 replications.
Table 2.5.2: Average classification error and corresponding standard error for g1 with larger sample size (×10^4)
     RSVM                SVM                 Logistic            FLDA      Bayes
     CAM       FLR       CAM       FLR       CAM       FLR                 Error
g1   25(1)     68(5)     25(1)     68(5)     98(6)     135(8)    788(3)    0%
g1   529(3)    572(5)    636(7)    740(10)   625(7)    791(11)   1187(4)   5%
g1   1035(4)   1072(6)   1126(8)   1338(11)  1126(8)   1356(13)  1711(4)   10%
g1   1563(5)   1609(8)   1626(8)   1928(9)   1644(9)   1917(11)  2212(8)   15%
g1   2079(6)   2142(7)   2176(9)   2426(8)   2189(11)  2378(11)  2913(40)  20%
g1   3124(10)  3216(9)   3212(10)  3361(7)   3184(12)  3316(12)  4309(32)  30%
Abbreviations are explained in the legend of Table 2.5.1. Standard errors (×10^4) are given in brackets.
research findings of Liu et al. (2011), where it was found that hard classifiers generally perform better than soft classifiers when the underlying class conditional probability is relatively non-smooth. Here the true class conditional probability function is a step function due to the random flipping used in the data generation. The SVM and RSVM are hard classifiers while logistic regression is a typical example of soft classifier.
The robust SVM delivers a slightly better performance than the SVM. This is also due to the random flipping which generates outliers. The SVM is sensitive to outliers while the truncation helps to achieve robustness to outliers as demonstrated in Wu and Liu (2007). The FLM-based classification methods and the FLDA delivered a slightly worse performance. This is within our expectation since both of these are inherently linear models while our simulated data are generated from nonlinear models.
Table 2.5.2 gives the simulation results for g1 function with size of the training and tuning sets increased to 400. Due to the larger sample size, the classification error is smaller than that of Table 2.5.1. Meanwhile a similar performance pattern of different classifiers is also observed in Table 2.5.2.
2.6
Data Illustrations
We next illustrate the proposed method with two real data sets: the yeast gene expression data and the orange juice NIR spectra data.
2.6.1
Yeast Gene Expression Data
The yeast data set was first reported in Spellman et al. (1998). It has been studied as an example of classifying functional curves by Song et al. (2008) and Müller et al. (2013), among others. During each gene expression time course, gene expression measurements are recorded every 7 minutes and there are 18 measurements in total. Our goal is to classify those genes based on whether or not they are related to the G1 phase regulation of the yeast cell cycle.
We process the data as in Müller et al. (2013), removing one outlier and pre-smoothing the functional curves. Here we also perform a pointwise standardization procedure. That is, at each measurement time point we subtract the trajectory's average value and divide by its standard deviation. A plot of the standardized trajectories is provided in Fig. 2.6.1.
Figure 2.6.1: Plot of standardized gene expression trajectories for G1 phase regulation in brewer’s yeast.
To compare classification performance, we randomly split the total 91 gene expression curves into a training set of size 75 and a test set of size 16 for each repetition. As with the simulation examples, under the CAM framework we use grid numbers of p=q = 40
and truncation location s =−1. The regularization parameters are tuned by a five-fold
training data as in the simulation examples. Note that random flipping is only applied to the training data but not the test data. Four flipping percentages 0%, 10%, 20% and 30% are considered. We repeat the process 20 times and report the average classification
errors in Table 2.6.1 with associated standard errors in the parentheses.
Table 2.6.1: Average classification errors (×10^4) for brewer's yeast gene expression data
RSVM                    SVM                     Logistic                Flipping
CAM         FLR         CAM         FLR         CAM         FLR         Error
1219(166)   1344(194)   1313(197)   1562(229)   1437(193)   1656(237)   0%
1500(233)   1750(263)   1562(229)   1750(216)   1656(237)   1719(231)   10%
1750(211)   1906(210)   1781(258)   2062(261)   1844(229)   2250(306)   20%
2938(294)   3594(314)   3219(377)   3594(307)   3250(350)   3563(276)   30%
Abbreviations are explained in the legend of Table 2.5.1. Standard errors (×10^4) are given in brackets.
The results in Table 2.6.1 reveal a pattern similar to the one observed in the simulation study. Under the CAM framework, the truncated-hinge-loss SVM delivers the best classification performance due to the presence of outliers generated by random flipping, followed by the SVM, and then logistic regression. Moreover, we also observe for this data set that the functional linear regression models are inferior to CAM, most likely because of the inherent limitations of a linear approach. Also note that when the flipping errors are 0%, 10% and 20%, the differences between classification errors under different flipping error settings are not very big. One possible explanation is that there may be some outliers with respect to the classification boundary in the original data set, in which case additional small flipping errors may not have enough influence on the classification boundary.
2.6.2
Orange Juice NIR Spectroscopy Data
Infrared spectroscopy is widely used to analyze diverse materials such as food, drink, and pharmaceutical products. The NIR (Near-Infrared) spectrum of a sample is a continuous curve measured by modern scanning facilities at hundreds of equally spaced wavelengths. This curve contains information that can be used to predict the chemical composition of the sample.
model to estimate the level of saccharose of an orange juice sample from its observed NIR spectroscopy, see also Benoudjit et al. (2004). After removing an obvious outlier, the data set contains 149 spectra lines. Each functional data element corresponds to an orange juice sample measured at 700 equally spaced wavelengths ranging from 1100nm to 2500nm. The associated response variable is the corresponding level of saccharose for each orange juice sample.
Figure 2.6.2: Plot of standardized orange juice spectra
another group. A plot of standardized spectra is given in Fig. 2.6.2. Here we randomly split the data into a training set of size 130 and a test set of size 19 for each repetition. Regularization parameters are also tuned by a five-fold cross validation. Two flipping percentages 20% and 30% are considered. We repeat the process 20 times and report
the average classification errors in Table 2.6.2 with associated standard errors in the corresponding parentheses.
Table 2.6.2: Average classification errors (×10^4) for orange juice NIR spectroscopy data
RSVM                    SVM                     Logistic                Flipping
CAM         FLR         CAM         FLR         CAM         FLR         Error
2026(196)   2895(237)   2026(184)   2868(224)   1947(187)   3421(155)   0%
2368(224)   3211(259)   2474(220)   3026(222)   2500(222)   3500(281)   10%
2789(273)   3368(271)   2868(299)   3474(260)   3000(202)   3526(209)   20%
3000(236)   3658(304)   3026(293)   3763(301)   3263(269)   3763(236)   30%
Abbreviations are explained in the legend of Table 2.5.1. Standard errors (×10^4) are given in brackets.
The results in Table 2.6.2 show a similar pattern as we observed in the yeast gene expression data, i.e., in the presence of flipping errors, CAM outperforms functional linear regression, and when using CAM, the truncated-hinge-loss SVM outperforms the other two classifiers.
2.7
Conclusion
Chapter 3
Variable Selection in Functional
Nonlinear Models via Continuously
Additive Modeling
3.1
Introduction
regularization we manage to prove the consistency of functional variable selection. The rest of the chapter is organized as follows. Section 3.2 shows how to couple CAM with the group SCAD penalty and a smoothness penalty to form the regularization framework. Section 3.3 presents the detailed procedures for fitting the corresponding regularization problem and selecting the nonlinear functional variables. Section 3.4 illustrates the performance of the proposed method with finite-sample simulation examples under different signal level settings. Section 3.5 gives asymptotic properties of the proposed penalized least squares estimator, together with their technical proofs. Section 3.6 concludes the chapter.
3.2
Build Regularization Framework via CAM and
Group SCAD Penalty
Due to CAM's great model flexibility, we propose an extension to perform variable selection for functional nonlinear regression with multiple functional predictors. Specifically, we assume each functional predictor's contribution to be additive and model each predictor's contribution using the aforementioned CAM and zero-degree B-spline approximation techniques. Variable selection is achieved by applying a group SCAD penalized estimation procedure. Here the coefficients associated with the B-spline approximation for each predictor are treated as a group. This regularization method is similar in spirit to the group Lasso studied by Yuan and Lin (2006). However, we propose to use the group SCAD penalty instead of the group Lasso penalty due to the desirable theoretical properties of estimates under the group SCAD penalty framework.
3.2.1
Model
As already introduced in chapter 2, the CAM assumes the form
\[
E(Y \mid X) = E(Y) + \int_{T} g\{t, X(t)\}\,dt.
\]
Here we consider a multivariable CAM with $J$ functional predictors, $X_1, X_2, \ldots, X_J$, and one scalar response $Y$. We assume that
\[
E(Y \mid X_1, X_2, \ldots, X_J) = E(Y) + \sum_{j=1}^{J} \int_{T_j} g_j\{t_j, X_j(t_j)\}\,dt_j, \tag{3.2}
\]
where $g_j(\cdot,\cdot)$ ($j = 1, 2, \ldots, J$) are unknown parameter functions to be estimated. Each functional predictor is assumed to be square-integrable, namely we assume $X_j \in L^2(T_j)$ for $j = 1, 2, \ldots, J$. A random sample $\{(X_{i1}, \ldots, X_{iJ}, Y_i) : i = 1, 2, \ldots, n\}$ of i.i.d. observations from model (3.2) is used to estimate the unknown parameter functions.
3.2.2
Zero-Degree Spline Approximation
The unknown parameter function $g_j(\cdot,\cdot)$ is modeled nonparametrically. Towards estimation, we borrow the idea of using a zero-degree B-spline approximation in Müller et al. (2013). The approximation procedure is similar to that described in chapter 2, except that we now have $J$ functional variables. By rescaling if necessary, we assume without loss of generality that $T_j = T = [0, 1]$ and the range of $X_j(\cdot)$ is also $[0, 1]$ for $j = 1, \ldots, J$. We then approximate $g_j(t, x)$ by a step function $g_j^{(p,q)}(t, x) = \sum_{l=1}^{p} \sum_{k=1}^{q} \gamma_{jlk} 1_{\{(t,x) \in B_{lk}\}}$, where the bins $B_{lk} = [t_l - 1/(2p), t_l + 1/(2p)] \times [x_k - 1/(2q), x_k + 1/(2q)]$ for $l = 1, 2, \ldots, p$ and $k = 1, 2, \ldots, q$ form an equidistant partition of the whole domain $[0, 1] \times [0, 1]$. Here $t_l = (2l - 1)/(2p)$ and $x_k = (2k - 1)/(2q)$ are the midpoints of the bins $B_{lk}$ over the $t$ domain and the $X_j(\cdot)$ range, respectively, and the coefficients are $\gamma_{jlk} = g_j(t_l, x_k)$. Note that the same equidistant partition is used here for all functional predictors for simplicity. If necessary, different partitions can be adopted.
We further need to provide an approximation to $\int_{t \in T} g_j\{t, X_j(t)\}\,dt$ accordingly. For predictor $X_j(t)$, define
\[
I_{jlk} = \{t \in [0, 1] : (t, X_j(t)) \in B_{lk}\} \quad \text{and} \quad Z_{jlk} = \int 1_{I_{jlk}}(t)\,dt, \tag{3.3}
\]
for $1 \le l \le p$ and $1 \le k \le q$. Here $1_{I_{jlk}}(t) = 1$ if $t \in I_{jlk}$ and 0 otherwise. Analogous
for $\int_{t \in T} g_j\{t, X_{ij}(t)\}\,dt$ has the form
\[
\sum_{l,k=1}^{p,q} \gamma_{jlk} Z_{ijlk}. \tag{3.4}
\]
Putting the above approximations together, for the $i$th observation the continuously additive model with $J$ functional predictors takes the approximate form
\[
E(Y_i \mid X_{i1}, X_{i2}, \ldots, X_{iJ}) \approx \mu + \sum_{j=1}^{J} \Big\{ \sum_{l,k=1}^{p,q} \gamma_{jlk} Z_{ijlk} \Big\}, \tag{3.5}
\]
where $\mu = E(Y)$. Up to this point we have $J$ groups of parameters, with each group containing $pq$ parameters. We denote the $j$th group by a length-$pq$ coefficient vector $\boldsymbol{\gamma}_j = (\gamma_{j11}, \ldots, \gamma_{j1q}, \ldots, \gamma_{jp1}, \ldots, \gamma_{jpq})^T$. Thus to implement nonlinear functional variable selection we just need to find those groups with non-zero coefficient vectors, namely $\boldsymbol{\gamma}_j \neq \mathbf{0}_{pq \times 1}$, where $\mathbf{0}_{pq \times 1}$ is a $(pq) \times 1$ vector of zeros. This type of group variable selection problem was also considered in Yuan and Lin (2006).
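To make the construction of the design entries $Z_{jlk}$ concrete, the following sketch approximates the occupation times by a midpoint rule. It is an illustration under assumptions: the curve is available as a Python callable with values in $[0, 1]$, and the function names `occupation_times` and `cam_approximation` are hypothetical.

```python
def occupation_times(curve, p, q, n_grid=1000):
    """Approximate Z_lk = length of {t in [0, 1] : (t, X(t)) in B_lk}.

    `curve` maps t in [0, 1] to X(t) in [0, 1]; the integral of the
    indicator is approximated by a midpoint rule with n_grid points.
    """
    z = [[0.0] * q for _ in range(p)]
    h = 1.0 / n_grid
    for i in range(n_grid):
        t = (i + 0.5) * h
        l = min(int(t * p), p - 1)          # 0-based bin index along t
        k = min(int(curve(t) * q), q - 1)   # 0-based bin index along x
        z[l][k] += h
    return z


def cam_approximation(z, gamma):
    """Evaluate sum_{l,k} gamma_lk * Z_lk for one curve."""
    return sum(g * zz for zrow, grow in zip(z, gamma) for zz, g in zip(zrow, grow))


# Toy example: the identity curve X(t) = t only visits the diagonal bins.
z_identity = occupation_times(lambda t: t, p=4, q=4)
```

Because the bins partition the domain, the entries of each curve's $Z$ array sum to the length of $T$, which is 1 after rescaling. This offers a quick sanity check on any implementation.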
3.2.3
Group SCAD Penalty and Smoothness Penalty
According to the above discussion, selecting the important functional predictors in model (3.2) boils down to the selection of non-zero groups of coefficient vectors γj for
j = 1, . . . , J. Towards this goal, we couple the least squares loss with the group level SCAD penalty (Fan and Li, 2001; Wang, Chen, and Li, 2007), which controls the sparsity of coefficients at group level, and the smoothness penalty, which controls the smoothness of the estimated bivariate function gj(·,·).
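Specifically, let $\mathbf{Y} = (Y_1, \ldots, Y_n)^T$ and define $\mathbf{Z}_j = (\mathbf{Z}_{1j}, \ldots, \mathbf{Z}_{nj})^T$ as the associated design matrix of size $n \times (pq)$ for the $j$th functional predictor after the aforementioned approximation, where $\mathbf{Z}_{ij} = \mathrm{vec}\{Z_{ijlk}\} = (Z_{ij11}, \ldots, Z_{ij1q}, \ldots, Z_{ijp1}, \ldots, Z_{ijpq})^T$ for fixed $i$ and $j$ with the $Z_{ijlk}$'s defined above. In order to select non-zero groups of coefficients we minimize the following penalized least squares criterion
\[
l(\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_J, m) = \frac{1}{2n} \Big\| \mathbf{Y} - m \mathbf{1}_{n \times 1} - \sum_{j=1}^{J} \mathbf{Z}_j \boldsymbol{\gamma}_j \Big\|^2 + \sum_{j=1}^{J} p_{1,\lambda_1}(\|\boldsymbol{\gamma}_j\|) + \sum_{j=1}^{J} p_{2,\lambda_2}(\boldsymbol{\gamma}_j), \tag{3.6}
\]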
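length-vector of 1’s, $p_{1,\lambda_1}(\cdot)$ is the SCAD penalty with tuning parameter $\lambda_1$ and $p_{2,\lambda_2}(\cdot)$ is the smoothness penalty with tuning parameter $\lambda_2$. The SCAD penalty is defined as (Fan and Li, 2001)
\[
p_{1,\lambda_1}(\theta) =
\begin{cases}
\lambda_1 |\theta|, & \text{if } 0 \le |\theta| < \lambda_1, \\
-\dfrac{\theta^2 - 2\alpha\lambda_1 |\theta| + \lambda_1^2}{2(\alpha - 1)}, & \text{if } \lambda_1 \le |\theta| < \alpha\lambda_1, \\
\dfrac{(\alpha + 1)\lambda_1^2}{2}, & \text{otherwise},
\end{cases}
\tag{3.7}
\]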
where $\alpha > 2$ is a constant. Based on a Bayesian argument, Fan and Li (2001) suggested fixing $\alpha = 3.7$. We adopt this suggestion. Figure 3.2.1 shows the plots of the SCAD penalty, the derivative of the SCAD penalty, and the corresponding thresholding rule, respectively. The SCAD penalty corresponds to a quadratic spline function with knots at $\lambda_1$ and $\alpha\lambda_1$. It is continuous and differentiable on $(-\infty, 0) \cup (0, \infty)$, but becomes singular at 0. Its derivative is zero outside the range $[-\alpha\lambda_1, \alpha\lambda_1]$. The SCAD penalty sets small (in absolute value) coefficients to zero, shrinks moderately sized coefficients toward zero, and leaves large coefficients unchanged. In this way, SCAD can produce a sparse solution with approximately unbiased estimates of the large coefficients. In our context, $\|\boldsymbol{\gamma}_j\|$ is used as the argument of the SCAD penalty, so that the penalty induces sparsity at the group level.
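As a small numerical companion to the SCAD definition of Fan and Li (2001), the sketch below evaluates the penalty and its derivative. The function names and the argument name `a` (standing in for $\alpha$) are illustrative choices, not taken from the text.

```python
def scad_penalty(theta, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001) evaluated at theta."""
    t = abs(theta)
    if t < lam:
        return lam * t
    if t < a * lam:
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    return (a + 1) * lam * lam / 2


def scad_derivative(theta, lam, a=3.7):
    """Derivative of the SCAD penalty with respect to |theta|.

    At theta = 0 this returns the right-hand derivative, lam.
    """
    t = abs(theta)
    if t <= lam:
        return lam
    return max(a * lam - t, 0.0) / (a - 1)
```

The three branches agree at the knots $\lambda_1$ and $\alpha\lambda_1$, and the derivative vanishes beyond $\alpha\lambda_1$, which is what leaves large coefficients unshrunk.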
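The smoothness penalty $p_{2,\lambda_2}(\cdot)$ penalizes discretized second-order differences of $g_j(t, x_j)$ (Müller et al., 2013). It is given by
\[
p_{2,\lambda_2}(\boldsymbol{\gamma}_j) = \lambda_2 \Big\{ \sum_{l=2,k=1}^{p-1,\,q} p^2 (\gamma_{j,l-1,k} - 2\gamma_{j,l,k} + \gamma_{j,l+1,k})^2 + \sum_{l=1,k=2}^{p,\,q-1} q^2 (\gamma_{j,l,k-1} - 2\gamma_{j,l,k} + \gamma_{j,l,k+1})^2 \Big\}, \tag{3.8}
\]
or in quadratic matrix form
\[
p_{2,\lambda_2}(\boldsymbol{\gamma}_j) = \lambda_2 \boldsymbol{\gamma}_j^T \mathbf{P}_s \boldsymbol{\gamma}_j, \tag{3.9}
\]
where $\mathbf{P}_s$ is the corresponding positive semidefinite $(pq) \times (pq)$ penalty matrix that is the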
Figure 3.2.1: SCAD Penalty, SCAD Penalty Derivative, and Thresholding Rule. Panels: (a) SCAD penalty for $-8 \le \theta \le 8$; (b) derivative of the SCAD penalty for $-8 \le \theta \le 8$; (c) SCAD penalty thresholding rule for $-10 \le \theta_{OLS} \le 10$.
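The equivalence between the double-sum form (3.8) and the quadratic form (3.9) of the smoothness penalty can be checked numerically. The sketch below builds a matrix whose quadratic form reproduces the bracketed sum in (3.8) (without the factor $\lambda_2$), assuming the coefficient ordering $(l, k) \mapsto (l-1)q + (k-1)$ used for $\boldsymbol{\gamma}_j$. The helper names are hypothetical, and pure-Python lists are used for transparency rather than efficiency.

```python
def smoothness_penalty_matrix(p, q):
    """Build P_s (pq x pq) so that gamma' P_s gamma equals the sum in (3.8)."""
    n = p * q
    P = [[0.0] * n for _ in range(n)]

    def add_second_difference(i0, i1, i2, weight):
        # Adds weight * (g[i0] - 2 g[i1] + g[i2])^2 to the quadratic form.
        idx, coef = (i0, i1, i2), (1.0, -2.0, 1.0)
        for a, ca in zip(idx, coef):
            for b, cb in zip(idx, coef):
                P[a][b] += weight * ca * cb

    for l in range(p):
        for k in range(q):
            c = l * q + k
            if 0 < l < p - 1:      # second difference along t
                add_second_difference(c - q, c, c + q, p ** 2)
            if 0 < k < q - 1:      # second difference along x
                add_second_difference(c - 1, c, c + 1, q ** 2)
    return P


def quadratic_form(P, g):
    return sum(g[a] * P[a][b] * g[b] for a in range(len(g)) for b in range(len(g)))


P_small = smoothness_penalty_matrix(4, 3)
plane = [2.0 * l + 3.0 * k + 1.0 for l in range(4) for k in range(3)]
curved = [float(l * l) for l in range(4) for k in range(3)]
```

A coefficient surface that is linear in both indices incurs zero penalty, which illustrates why the matrix is only positive semidefinite rather than positive definite.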
3.3
Estimation Algorithm and Tuning Parameter
Selection
Note that the regression coefficient function in the CAM is not identifiable. A mean-zero identifiability condition $E \int_T g\{t, X(t)\}\,dt = 0$ can be enforced. Its finite sample
penalty terms to achieve identifiability as detailed next.
Because the group SCAD penalty is not differentiable at 0, we cannot apply the commonly used gradient method to optimize the above penalized least squares criterion. To get around this issue we implement an iterative algorithm based on a local quadratic approximation (LQA) of the non-concave group SCAD penalty $p_{1,\lambda_1}(\|\boldsymbol{\gamma}_j\|)$ as in Fan and Li (2001). Specifically, for a given group $j$, in a neighborhood of some initial value $\|\boldsymbol{\gamma}_j^0\| > 0$ the LQA approximates the group SCAD penalty at $\|\boldsymbol{\gamma}_j\|$ by the quadratic form
\[
p_{1,\lambda_1}(\|\boldsymbol{\gamma}_j\|) \approx p_{1,\lambda_1}(\|\boldsymbol{\gamma}_j^0\|) + \frac{1}{2} \big\{ p'_{1,\lambda_1}(\|\boldsymbol{\gamma}_j^0\|) / \|\boldsymbol{\gamma}_j^0\| \big\} \big\{ \|\boldsymbol{\gamma}_j\|^2 - \|\boldsymbol{\gamma}_j^0\|^2 \big\}. \tag{3.10}
\]
The algorithm is therefore outlined as follows:
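Step 1 We initialize with a ridge regression solution. Namely, we substitute the ridge penalty for the group SCAD penalty and solve the following penalized least squares problem
\[
\min_{\boldsymbol{\gamma}_1, \ldots, \boldsymbol{\gamma}_J, m} \; \frac{1}{2n} \Big\| \mathbf{Y} - m \mathbf{1}_{n \times 1} - \sum_{j=1}^{J} \mathbf{Z}_j \boldsymbol{\gamma}_j \Big\|^2 + \sum_{j=1}^{J} \lambda_1 \|\boldsymbol{\gamma}_j\|^2 + \sum_{j=1}^{J} \lambda_2 \boldsymbol{\gamma}_j^T \mathbf{P}_s \boldsymbol{\gamma}_j. \tag{3.11}
\]
Since the penalty in (3.11) is group separable, the associated minimization problem can be solved by the block coordinate descent algorithm (Tseng, 2001). We denote the solutions by $\hat{\boldsymbol{\gamma}}_1^{ridge}, \ldots, \hat{\boldsymbol{\gamma}}_J^{ridge}$ and $\hat{m}^{ridge}$.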
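Step 2 Set the initial values $\hat{\boldsymbol{\gamma}}_1^{(0)}, \ldots, \hat{\boldsymbol{\gamma}}_J^{(0)}$ and $\hat{m}^{(0)}$ to be $\hat{\boldsymbol{\gamma}}_1^{ridge}, \ldots, \hat{\boldsymbol{\gamma}}_J^{ridge}$ and $\hat{m}^{ridge}$, respectively. We index the outer loop of the block coordinate descent algorithm by $r$. Let $\hat{\boldsymbol{\gamma}}_1^{(r)}, \ldots, \hat{\boldsymbol{\gamma}}_J^{(r)}$ and $\hat{m}^{(r)}$ denote the estimates at the end of loop $r$. Set $r = 1$.
Step 3 At the $r$th loop of the block coordinate descent algorithm: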
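(a) To update the estimate of $\boldsymbol{\gamma}_j$ corresponding to the $j$th predictor, we define the adjusted response vector $\hat{\mathbf{Y}}_{-j}^{(r)} = \mathbf{Y} - \sum_{j'=1}^{j-1} \mathbf{Z}_{j'} \hat{\boldsymbol{\gamma}}_{j'}^{(r)} - \sum_{j'=j+1}^{J} \mathbf{Z}_{j'} \hat{\boldsymbol{\gamma}}_{j'}^{(r-1)}$, obtained by removing the fitted contributions of the predictors other than the $j$th one. We use the LQA to minimize (3.6) as follows and index its loop by $s$. Denote the estimates at the end of the $s$th loop of the LQA by $\hat{\boldsymbol{\gamma}}_j^{\{s\}}$ and $\hat{m}^{\{s\}}$.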
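(ii) At loop $s$, solve
\[
\min_{\boldsymbol{\gamma}_j, m} \; \frac{1}{2n} \big\| \hat{\mathbf{Y}}_{-j}^{(r)} - \mathbf{Z}_j \boldsymbol{\gamma}_j - m \mathbf{1}_{n \times 1} \big\|^2 + \frac{p'_{1,\lambda_1}(\|\hat{\boldsymbol{\gamma}}_j^{\{s-1\}}\|)}{2 \|\hat{\boldsymbol{\gamma}}_j^{\{s-1\}}\|} \boldsymbol{\gamma}_j^T \boldsymbol{\gamma}_j + \lambda_2 \boldsymbol{\gamma}_j^T \mathbf{P}_s \boldsymbol{\gamma}_j.
\]
Denote the minimizer by $\hat{\boldsymbol{\gamma}}_j^{\{s\}}$ and $\hat{m}^{\{s\}}$ and set $s = s + 1$.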
(iii) Repeat step (ii) until the solution difference between two consecutive iterations becomes smaller than a threshold value. Then update the estimate $\hat{\boldsymbol{\gamma}}_j^{(r)} = \hat{\boldsymbol{\gamma}}_j^{\{s\}}$. Note that if the $L_2$ norm of $\hat{\boldsymbol{\gamma}}_j^{(r)}$ is less than a threshold value, we treat group $j$ as unimportant and remove it from fitting.
(b) Repeat step (a) for $j = 1, \ldots, J$ to complete one iteration of the block coordinate descent algorithm. Update by setting $r = r + 1$.
Step 4 Repeat the iteration of the block coordinate descent algorithm at Step 3 until the solution difference between two consecutive iterations becomes smaller than a threshold value. The coefficient groups still left correspond to the estimated important functional predictors.
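The inner LQA loop of Step 3(a) for a single group can be sketched as below. This is a minimal illustration assuming NumPy is available, not the implementation used for the reported results. The intercept is profiled out by centering, so that each LQA step reduces to a ridge-type linear system. The function name `lqa_group_update`, the tolerances, and the use of a least-squares solve for a possibly singular system are all assumptions.

```python
import numpy as np


def scad_derivative(t, lam, a=3.7):
    """SCAD derivative at t >= 0 (Fan and Li, 2001)."""
    return lam if t <= lam else max(a * lam - t, 0.0) / (a - 1)


def lqa_group_update(y_adj, Z, P, gamma0, lam1, lam2,
                     tol=1e-8, max_iter=200, drop_tol=1e-6):
    """Iterate the LQA solves for one coefficient group.

    Returns (gamma, m); gamma is set to zero when its norm falls
    below drop_tol, i.e. the group is treated as unimportant.
    """
    n, d = Z.shape
    Zc = Z - Z.mean(axis=0)            # centering profiles out the intercept m
    yc = y_adj - y_adj.mean()
    gamma = np.asarray(gamma0, dtype=float).copy()
    for _ in range(max_iter):
        norm = np.linalg.norm(gamma)
        if norm < drop_tol:
            break
        w = scad_derivative(norm, lam1) / norm
        # Normal equations of the quadratic surrogate at the current iterate.
        A = Zc.T @ Zc + n * w * np.eye(d) + 2 * n * lam2 * P
        new = np.linalg.lstsq(A, Zc.T @ yc, rcond=None)[0]
        done = np.linalg.norm(new - gamma) < tol
        gamma = new
        if done:
            break
    if np.linalg.norm(gamma) < drop_tol:
        gamma = np.zeros(d)
    m = y_adj.mean() - Z.mean(axis=0) @ gamma
    return gamma, m
```

In a full fit, this update would be called once per group inside the outer block coordinate descent loop, with `y_adj` set to the adjusted response of Step 3(a).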
Here we still need to choose the tuning parameter pair $(\lambda_1, \lambda_2)$ to implement the above algorithm. Usually these two tuning parameters can be selected simultaneously by generalized cross-validation (GCV). However, Wang, Li, and Tsai (2007) point out that in the setting of the SCAD penalty the commonly used GCV suffers from an overfitting problem and thus cannot select tuning parameters satisfactorily. Instead, they propose to use the Bayesian information criterion (BIC) tuning parameter selector and further prove its consistency. Therefore we apply BIC here to select the tuning parameter pair. Under the assumption that the model errors are independent and identically distributed according to a normal distribution, the BIC assumes the form
\[
\mathrm{BIC}_{(\lambda_1, \lambda_2)} = \log \hat{\sigma}^2_{\lambda_1, \lambda_2} + \mathrm{DF}_{(\lambda_1, \lambda_2)} \log(n) / n, \tag{3.12}
\]
where $\hat{\sigma}^2_{\lambda_1, \lambda_2} = \|\mathbf{Y} -$
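total degrees of freedom $\mathrm{DF}_{(\lambda_1, \lambda_2)} = \sum_{i=1}^{M} \mathrm{tr}\{\mathbf{S}_i(\lambda_1, \lambda_2)\} + 1$, where the smoother matrix $\mathbf{S}_i(\lambda_1, \lambda_2)$ takes the form
\[
\mathbf{S}_i(\lambda_1, \lambda_2) = \mathbf{Z}_i \Big( \mathbf{Z}_i^T \mathbf{Z}_i + n \frac{p'_{1,\lambda_1}(\|\hat{\boldsymbol{\gamma}}_i\|)}{\|\hat{\boldsymbol{\gamma}}_i\|} \mathbf{I}_{pq} + 2n\lambda_2 \mathbf{P}_s \Big)^{-1} \mathbf{Z}_i^T, \tag{3.13}
\]
where $\mathbf{I}_{pq}$ is an identity matrix of size $pq$ and $\hat{\boldsymbol{\gamma}}_i$ ($i = 1, \ldots, M$) is the estimated coefficient vector corresponding to the $i$th estimated important functional predictor from the block coordinate descent algorithm described above. If the matrix $\mathbf{Z}_i^T \mathbf{Z}_i + n \frac{p'_{1,\lambda_1}(\|\hat{\boldsymbol{\gamma}}_i\|)}{\|\hat{\boldsymbol{\gamma}}_i\|} \mathbf{I}_{pq} + 2n\lambda_2 \mathbf{P}_s$ does not have full rank, a generalized inverse can be used in (3.13).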
3.4
Simulation Study
3.4.1
Data Generation
• Functional predictor: Generate $X_j(t) = \sum_{k=1}^{4} \xi_{jk} \phi_k(t)$ for $t \in [0, T]$ with $T = 10$ and $j = 1, \ldots, 10$, where $\xi_{j1} = \cos(U_{2j-1})$, $\xi_{j2} = \sin(U_{2j-1})$, $\xi_{j3} = \cos(U_{2j})$, and $\xi_{j4} = \sin(U_{2j})$. Here $U_1, \ldots, U_{20}$ are independent and identically distributed uniform random variables with support $[0, 2\pi]$. Also let $\phi_1(t) = \sin(2\pi t/T)$, $\phi_2(t) = \cos(2\pi t/T)$, $\phi_3(t) = \sin(4\pi t/T)$ and $\phi_4(t) = \cos(4\pi t/T)$.
For the simulation study we use grid numbers $p = q = 40$, training data set size $n_{train} = 200$ and test data set size $n_{test} = 1000$.
• Response: Set the first three functional predictors as important and generate $Y = \sum_{j=1}^{3} \int_0^{10} g_j\{t, X_j(t)\}\,dt + \epsilon$, where $\epsilon \sim N(0, \sigma^2)$ is Gaussian error with mean 0 and variance $\sigma^2 = 1$. For the three bivariate smooth functions $g_1(\cdot,\cdot)$, $g_2(\cdot,\cdot)$ and $g_3(\cdot,\cdot)$, we consider four scenarios with different coefficient of determination $R^2$ values, namely (for simplicity, here we use $g_1$, $g_2$ and $g_3$ to represent $g_1(t, x)$, $g_2(t, x)$ and $g_3(t, x)$, respectively)
$R^2 = 0.90$: $g_1 = 0.070(t - x - 5)^2$, $g_2 = 0.140(tx + 5)$, $g_3 = 0.750\cos(t + x + 5)$;
$R^2 = 0.85$: $g_1 = 0.050(t - x - 5)^2$, $g_2 = 0.110(tx + 5)$, $g_3 = 0.510\cos(t + x + 5)$;
$R^2 = 0.80$: $g_1 = 0.045(t - x - 5)^2$, $g_2 = 0.106(tx + 5)$, $g_3 = 0.435\cos(t + x + 5)$;
$R^2 = 0.75$: g
3.4.2
Choosing Tuning Parameters
We choose the tuning parameter pair over a two-dimensional grid $\lambda_1 \times \lambda_2 = 2^{N_1} \times 2^{N_2}$, where $N_1 = \{-5, -4, -3, \ldots, 3\}$ and $N_2 = \{-20, -19, -18, \ldots, -9\}$. The tuning parameters are chosen in a zigzag fashion. Namely, first fix the second tuning parameter at $2^{-14}$, loop over the first tuning parameter, and choose the optimal $\lambda_1$ by minimizing the BIC. Then fix the first tuning parameter at this optimal $\lambda_1$, loop over the second tuning parameter, again choose the optimal $\lambda_2$ by minimizing the BIC, and denote it by $\lambda_2^{optimal}$. At the final step fix the second tuning parameter at $\lambda_2^{optimal}$ and loop one more time over the first tuning parameter, denoting the chosen value by $\lambda_1^{optimal}$. Upon completion of the three search loops, we obtain the optimal tuning parameter pair $(\lambda_1^{optimal}, \lambda_2^{optimal})$.
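The three-pass zigzag search can be sketched generically as follows. The criterion is passed in as a function, since evaluating the BIC requires a full model fit. The name `zigzag_search` and the toy criterion are hypothetical and serve only to illustrate the order of the passes.

```python
import math


def zigzag_search(criterion, grid1, grid2, start2):
    """Three one-dimensional passes: lambda1, then lambda2, then lambda1 again."""
    lam1 = min(grid1, key=lambda a: criterion(a, start2))
    lam2 = min(grid2, key=lambda b: criterion(lam1, b))
    lam1 = min(grid1, key=lambda a: criterion(a, lam2))
    return lam1, lam2


grid1 = [2.0 ** e for e in range(-5, 4)]     # 2^-5, ..., 2^3
grid2 = [2.0 ** e for e in range(-20, -8)]   # 2^-20, ..., 2^-9


def toy_criterion(lam1, lam2):
    # Stand-in for BIC(lambda1, lambda2); minimized at (2^1, 2^-12).
    return (math.log2(lam1) - 1) ** 2 + (math.log2(lam2) + 12) ** 2


best = zigzag_search(toy_criterion, grid1, grid2, start2=2.0 ** -14)
```

The search evaluates the criterion 9 + 12 + 9 = 30 times instead of the 108 evaluations a full grid search would need, at the cost of possibly missing the joint minimizer when the two parameters interact strongly.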
3.4.3
Simulation Results Summary
Table 3.4.1 gives the functional variable selection performance of the proposed method under different $R^2$ values. The results are based on 100 replications. The table has four columns. The first column gives the $R^2$ values, and the second column is the percentage of correct model selection (selecting $X_1$, $X_2$ and $X_3$) among the 100 replications. The third column is the average ratio of the mean squared prediction error (MSPE) to the variance of the random error. In our case $\sigma^2 = 1$, therefore the average ratio is just the average MSPE over 100 replications. Here $\mathrm{MSPE} = \sum_{i=1}^{n_{test}} (Y_i - \hat{Y}_i)^2 / n_{test}$. The fourth column is the average of the relative mean squared prediction error (rMSPE), where $\mathrm{rMSPE} = \mathrm{MSPE} \big/ \big\{ \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} (Y_i - \bar{Y})^2 \big\}$ and $\bar{Y} = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} Y_i$.
Table 3.4.1: Nonlinear functional variable selection performance with n_train = 200, n_test = 1000. Standard errors are given in parentheses.
R^2    Correct Model Selection    Average MSPE       Average rMSPE
0.90   99%                        1.4651 (0.0232)    0.1412 (0.0024)
0.85   98%                        1.3812 (0.0264)    0.2392 (0.0049)
0.80   87%                        1.5000 (0.0646)    0.3206 (0.0141)
0.75   55%                        2.1307 (0.1009)    0.5402 (0.0255)
Based on the results in Table 3.4.1, we see that the variable selection performance improves as the $R^2$ value increases.
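For reference, the two prediction error summaries reported in Table 3.4.1 can be computed as in the sketch below. The function names are illustrative, and plain Python lists of test responses and predictions are assumed.

```python
def mspe(y, y_hat):
    """Mean squared prediction error over the test set."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def relative_mspe(y, y_hat):
    """MSPE divided by the test-set variance of the responses (divisor n_test)."""
    y_bar = sum(y) / len(y)
    return mspe(y, y_hat) / (sum((a - y_bar) ** 2 for a in y) / len(y))


# Toy check with four test responses and predictions.
y_test = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
```

Since the rMSPE normalizes by the total variance of the response, a perfectly specified model with noise variance $\sigma^2$ would give an rMSPE of roughly $1 - R^2$.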
3.5
Asymptotic Properties
Assume the sample size $n$ and the number of parameters in each coefficient group $d_n = p_n q_n$ are both diverging, and assume the grid numbers $p_n$ and $q_n$ diverge at the same order. Given $J$ functional predictors, without loss of generality suppose the first $M$ of them are important and the remaining $(J - M)$ are unimportant. The total number of parameters is $D_n = d_n \times J$. Denote the response vector by $\mathbf{Y}$. Under standardization, assume the true model has the form
\[
\mathbf{Y} = \sum_{j=1}^{M} \int_T g_j\{t, \mathbf{X}_j(t)\}\,dt + \boldsymbol{\epsilon}. \tag{3.14}
\]
Without loss of generality $\mathbf{Y}$ is an $n \times 1$ centered vector of responses, $\mathbf{X}_j = (X_{1j}, \ldots, X_{nj})^T$ is the $j$th functional predictor vector for $n$ subjects, and the vector $\int_T g_j\{t, \mathbf{X}_j(t)\}\,dt = \big[ \int_T g_j\{t, X_{1j}(t)\}\,dt, \ldots, \int_T g_j\{t, X_{nj}(t)\}\,dt \big]^T$. Also, $\boldsymbol{\epsilon} = (\epsilon_1, \ldots, \epsilon_n)^T$ are independent and identically distributed random errors with mean 0 and finite variance $\sigma^2$. To ensure identifiability of the integral components, without loss of generality the integral component $\int_T g_j\{t, X_j(t)\}\,dt$ is centered for $j = 1, \ldots, M$.
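corresponding approximation for $E(\mathbf{Y} \mid \mathbf{X}_1, \ldots, \mathbf{X}_M) = \sum_{j=1}^{M} \int_T g_j\{t, \mathbf{X}_j(t)\}\,dt$ of the form (note that the values of the elements in $\mathbf{Z}_j$ and $\boldsymbol{\gamma}_j$ change as $n$ diverges, but for simplicity we omit the dependence on $n$ in the notation)
\[
E(\mathbf{Y} \mid \mathbf{X}_1, \ldots, \mathbf{X}_M) \approx \sum_{j=1}^{M} \mathbf{Z}_j \boldsymbol{\gamma}_j. \tag{3.15}
\]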
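For notational simplicity let $\mathbf{X}_{(I)} \equiv (\mathbf{X}_1, \ldots, \mathbf{X}_M)$ represent the set of $M$ important functional predictors for $n$ subjects, and let $\boldsymbol{\theta} = E(\mathbf{Y} \mid \mathbf{X}_{(I)}) = E(\mathbf{Y} \mid \mathbf{X}_1, \ldots, \mathbf{X}_M)$ represent the expected value of the responses conditional on $\mathbf{X}_{(I)}$. Müller et al. (2013) show that
\[
\Big\| \boldsymbol{\theta} - \sum_{j=1}^{M} \mathbf{Z}_j \boldsymbol{\gamma}_j \Big\|_{\infty} = O(1/\sqrt{d_n}), \tag{3.16}
\]
where $\mathbf{Z}_j = (\mathbf{Z}_{1j}, \ldots, \mathbf{Z}_{nj})^T$ is an $n \times d_n$ design matrix for functional predictor $\mathbf{X}_j$, and $\mathbf{Z}_{ij} = \mathrm{vec}\{Z_{ijlk}\} = (Z_{ij11}, \ldots, Z_{ij1q_n}, \ldots, Z_{ijp_n1}, \ldots, Z_{ijp_nq_n})^T$ for fixed $i$ and $j$. Under standardization, without loss of generality $\mathbf{Z}_{ij}$ is centered for each $i$ and $j$. Also, $\boldsymbol{\gamma}_j = (\gamma_{j11}, \ldots, \gamma_{j1q_n}, \ldots, \gamma_{jp_n1}, \ldots, \gamma_{jp_nq_n})^T$ is a $d_n \times 1$ vector of coefficients for the $j$th group, with $\gamma_{jlk} = g_j(t_l, x_k)$ for $l = 1, \ldots, p_n$ and $k = 1, \ldots, q_n$, where $t_l$ and $x_k$ are the midpoints of the bins.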
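In the context of functional variable selection, we are given $J$ functional variables and, without loss of generality, assume the first $M$ of them are important. The penalized least squares function is formulated as
\[
\begin{aligned}
l_n(\boldsymbol{\gamma}) &= \frac{1}{2} \|\mathbf{Y} - \mathbf{Z}\boldsymbol{\gamma}\|^2 + n \sum_{j=1}^{J} p_{1,\lambda_{1n}}(\|\boldsymbol{\gamma}_j\|) + n \sum_{j=1}^{J} p_{2,\lambda_{2n}}(\boldsymbol{\gamma}_j) \\
&= \frac{1}{2} \|\mathbf{Y} - \boldsymbol{\theta} + \boldsymbol{\theta} - \mathbf{Z}\boldsymbol{\gamma}\|^2 + n \sum_{j=1}^{J} p_{1,\lambda_{1n}}(\|\boldsymbol{\gamma}_j\|) + n \sum_{j=1}^{J} p_{2,\lambda_{2n}}(\boldsymbol{\gamma}_j),
\end{aligned}
\tag{3.17}
\]
where $\mathbf{Z} = (\mathbf{Z}_1, \ldots, \mathbf{Z}_J)$ is the model design matrix of size $n \times D_n$ and $\boldsymbol{\gamma} = (\boldsymbol{\gamma}_1^T, \ldots, \boldsymbol{\gamma}_J^T)^T$.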
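Let the vector $\boldsymbol{\gamma}^* = (\boldsymbol{\gamma}_{(I)}^{*T}, \boldsymbol{\gamma}_{(U)}^{*T})^T$ denote the “true” value of the unknown coefficient vector. That is to say, $\boldsymbol{\gamma}_{(I)}^* = (\boldsymbol{\gamma}_1^{*T}, \ldots, \boldsymbol{\gamma}_M^{*T})^T$ is a nonzero sub-vector of length $M d_n$, with each coefficient in the $j$th group given by $\gamma_{jlk}^* = g_j(t_l, x_k)$ for $j = 1, \ldots, M$, $l = 1, \ldots, p_n$ and $k = 1, \ldots, q_n$. Thus $\boldsymbol{\gamma}_{(I)}^*$
of length $(J - M) d_n$, namely $\boldsymbol{\gamma}_{(U)}^* = \mathbf{0}_{(J-M)d_n \times 1}$.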
Note that only the functional predictors $X_1, \ldots, X_M$ are important. Therefore we can decompose $\mathbf{Z} = (\mathbf{Z}_{(I)}, \mathbf{Z}_{(U)})$, where $\mathbf{Z}_{(I)}$ is the submatrix of size $n \times (M d_n)$ corresponding to the first $M$ groups and $\mathbf{Z}_{(U)}$ is the submatrix of size $n \times (J - M) d_n$ corresponding to the remaining $(J - M)$ groups. Also let $\mathbf{P}_s$ be the smoothness penalty matrix of dimension $d_n \times d_n$ for each group of coefficients. Here we first give some conditions for developing asymptotic properties of the estimated coefficient vectors.
(A0) Without loss of generality assume the domains of $t$ and $X_j$ ($j = 1, \ldots, J$) are both standardized to $[0, 1]$. We assume the nonlinear function $g_j : [0, 1]^2 \to \mathbb{R}$ ($j = 1, \ldots, M$) is Lipschitz continuous. Also assume that for all $t \in [0, 1]$, the random variable $X_j(t)$ ($j = 1, \ldots, J$) has positive density on $[0, 1]$ and $X_j(\cdot)$ is continuous in $t$.
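(A1) The design vectors $\mathbf{V}_i = (\mathbf{Z}_{i1}^T, \ldots, \mathbf{Z}_{iJ}^T)^T$, $i = 1, \ldots, n$, are i.i.d. and independent of the random error $\boldsymbol{\epsilon}$.
(A2) As $n \to \infty$, $p_n \to \infty$ and $q_n \to \infty$ at the same order, which is larger than or equal to $n^{1/4}$, and thus $d_n \to \infty$ at an order larger than or equal to $\sqrt{n}$, where $d_n = p_n q_n$. At the same time $d_n / n \to 0$.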
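(A3) Let $\Sigma_n$ denote the covariance matrix of $\mathbf{V}_i$. Here we assume $\Sigma_n$ is positive definite and its eigenvalues are bounded, namely $0 < \rho_{\min} < \rho_{\max} < \infty$, where $\rho_{\min} = \rho_{\min}(n)$ and $\rho_{\max} = \rho_{\max}(n)$ are the minimum and maximum eigenvalues of $\Sigma_n$. Note that the sample covariance matrix $n^{-1} \mathbf{Z}^T \mathbf{Z} \xrightarrow{a.s.} \Sigma_n$ by the strong law of large numbers. The assumption that $\Sigma_n$ is positive definite guarantees that $\mathbf{Z}$ is of full rank when $n \to \infty$, and thus the model is identifiable. In a similar way, we have $n^{-1} \mathbf{Z}_{(I)}^T \mathbf{Z}_{(I)} \xrightarrow{a.s.} \Sigma_{n1}$ and $n^{-1} \mathbf{Z}_{(U)}^T \mathbf{Z}_{(U)} \xrightarrow{a.s.} \Sigma_{n2}$, where $\Sigma_{n1}$ is the covariance matrix of $\mathbf{V}_{i(I)} = (\mathbf{Z}_{i1}^T, \ldots, \mathbf{Z}_{iM}^T)^T$ and $\Sigma_{n2}$ is the covariance matrix of $\mathbf{V}_{i(U)} = (\mathbf{Z}_{i(M+1)}^T, \ldots, \mathbf{Z}_{iJ}^T)^T$. Note that under the assumption that $\Sigma_n$ is positive definite, $\Sigma_{n1}$ and $\Sigma_{n2}$ are also positive definite, and their eigenvalues satisfy $0 < \phi_{\min} < \phi_{\max} < \infty$ and $0 < \omega_{\min} < \omega_{\max} < \infty$, where $\phi_{\min} = \phi_{\min}(n)$, $\phi_{\max} = \phi_{\max}(n)$, $\omega_{\min} = \omega_{\min}(n)$ and $\omega_{\max} = \omega_{\max}(n)$ are the minimum and maximum eigenvalues of $\Sigma_{n1}$ and $\Sigma_{n2}$, respectively.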
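(A4) The smoothness penalty matrix $\mathbf{P}_s$ is positive semidefinite, its eigenvalues are
$\kappa_{\min}(n)$ and $\kappa_{\max} = \kappa_{\max}(n)$ are the minimum and maximum eigenvalues of $\mathbf{P}_s$.
(A5) There exists a positive constant $K$ such that $\min_{j=1,\ldots,M} \|\boldsymbol{\gamma}_j^*\| \ge K$ as $n \to \infty$.
(A6) The infinity norm of the approximation error satisfies $\|\boldsymbol{\theta} - \mathbf{Z}\boldsymbol{\gamma}^*\|_{\infty} = O(1/\sqrt{d_n})$.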
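Theorem 3.5.1 (Estimation consistency). If $\lambda_{1n} \to 0$, $d_n / n \to 0$ and $\sqrt{n} \lambda_{2n} = O(1)$ as $n \to \infty$, then there exists a local minimizer $\hat{\boldsymbol{\gamma}}$ of $l_n(\boldsymbol{\gamma})$ such that $\|\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}^*\| = O_p(d_n^{1/2} n^{-1/2})$.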
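Proof. It is equivalent to prove that for any $\epsilon > 0$, there exists a finite $C > 0$ such that $\mathrm{Prob}\big( \sqrt{n/d_n}\, \|\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}^*\| < C \big) \ge 1 - \epsilon$. Consider the ball $B = \{ \boldsymbol{\gamma} : \|\boldsymbol{\gamma} - \boldsymbol{\gamma}^*\| \le d_n^{1/2} n^{-1/2} C \}$. Since $B$ is a compact set and $l_n(\boldsymbol{\gamma})$ is a continuous function on $B$, there exists a minimum of $l_n(\boldsymbol{\gamma})$ on $B$. If $l_n(\boldsymbol{\gamma}) > l_n(\boldsymbol{\gamma}^*)$ for every point $\boldsymbol{\gamma}$ on the boundary of $B$, then the minimum of $l_n(\boldsymbol{\gamma})$ on $B$ is achieved at some point $\boldsymbol{\gamma}$ inside the ball $B$, and such a $\boldsymbol{\gamma}$ is a local minimizer of $l_n(\boldsymbol{\gamma})$. Let $\hat{\boldsymbol{\gamma}}$ denote the local minimizer; then we have $\|\hat{\boldsymbol{\gamma}} - \boldsymbol{\gamma}^*\| < d_n^{1/2} n^{-1/2} C$.
From the above discussion, it is sufficient to prove that for any given $\epsilon > 0$, there exists a finite $C > 0$ such that for large enough $n$, we have
\[
\mathrm{Prob}\Big\{ \inf_{\|\mathbf{u}\| = C} l_n(\boldsymbol{\gamma}^* + d_n^{1/2} n^{-1/2} \mathbf{u}) > l_n(\boldsymbol{\gamma}^*) \Big\} \ge 1 - \epsilon, \tag{3.18}
\]
where $\mathbf{u} = (\mathbf{u}_1^T, \ldots, \mathbf{u}_J^T)^T$ is a vector of dimension $D_n \times 1$ with $\|\mathbf{u}\| = C$, and $\mathbf{u}_j$ is a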