Some Regression Models and Algorithms for Functional Data.

(1)

ABSTRACT

USSET, JOSEPH LOUIS. Some Regression Models and Algorithms for Functional Data. (Under the direction of Armin Schwartzman and Arnab Maity.)

This dissertation consists of three projects in the area of Functional Data Analyses (FDA) that share a common methodological framework based on extensions of generalized additive

models fit with penalized regression splines. The first project proposes methods for estimation

and prediction in a functional regression model with a scalar response and multiple functional predictors that accommodates two-way interactions in addition to their main effects. The second

project presents methods for hypothesis testing for a two-way interaction in the functional linear

model. The final project develops a method for identifying a smooth function of inflection points that lie within a time series of curves.

Chapter 2 describes the first project. It considers a functional regression model with a

scalar response and multiple functional predictors that accommodates two-way interactions in addition to their main effects. We develop an estimation procedure where the main effects are

modeled using penalized regression splines, and the interaction effect by a tensor product basis.

For comparison we present a low-dimensional estimation approach based on an extension of functional principal components regression. Extensions to generalized linear models and data

observed on sparse grids or with error are also presented. Our proposed method can be easily

implemented through existing software. Through numerical studies we investigate the effects of model mis-specification in functional linear models. We find that fitting an additive model in the

presence of interaction leads to both poor estimation performance and loss of prediction power,

while fitting an interaction model where there is in fact no interaction leads to negligible losses. We illustrate our methodology by analyzing a brain tractography data set and the AneuRisk65

data.

Chapter 3 describes the second project. It deals with the issue of hypothesis testing for a two-way interaction effect in a functional linear model with multiple functional predictors.

We develop three hypothesis testing procedures of a two-way interaction effect between two

functional predictors in a functional linear model with a scalar response. First, we propose a novel application of a recently developed Wald statistic for testing in generalized additive models

fit with penalized regression splines. Next, we consider two testing procedures that rely on low-dimensional approximations to the functional linear interaction model where the coefficients for

interaction are treated as either fixed or random effects. We provide code examples for each

procedure, and extensions to generalized linear models and data observed on sparse grids or with error are also presented. All the methods are compared through extensive simulation studies

(2)

power of the proposed tests, while the low-dimensional approach that incorporates interaction

coefficients as random effects performs second best. We illustrate our methodology on the brain tractography and AneuRisk65 data.

Chapter 4 describes the third project. It develops a statistical algorithm to identify a smooth

path of inflection points within a time series of curves. The scientific motivation is to build an automated way to quantify the retreat of mountain glaciers from a sequence of 1-D intensity

profiles extracted from 2-D Landsat images. The guiding assumption we make is that the glacier

terminus lies near the point of highest inflection in the image profiles. The framework we propose has three main steps. First, the intensity profiles are smoothed with penalized regression splines,

and first derivatives of the smoothed profiles are extracted. Second, the first derivative profiles

are input into a pilot path algorithm which outputs an initial estimate of the terminus location. Third, the pilot path is smoothed to identify long-term trends in the terminus movement. The

viability of our method is shown through simulation studies and applications to real data from

(3)

(4)

Some Regression Models and Algorithms for Functional Data

by

Joseph Louis Usset

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Statistics

Raleigh, North Carolina

2014

APPROVED BY:

Ana-Maria Staicu David Dickey

Armin Schwartzman Co-chair of Advisory Committee

Arnab Maity

(5)

DEDICATION

(6)

BIOGRAPHY

Joseph Usset was born in St. Paul, MN to Mary and Edward Usset. He graduated from Henry Sibley High School in Mendota Heights, MN in 2004 and attended St. Olaf College in Northfield,

MN where he graduated in 2008 with a B.A. in Mathematics. He moved to Raleigh, NC to join the Department of Statistics at North Carolina State University and completed a Master’s

(7)

ACKNOWLEDGEMENTS

First, I would like to particularly thank my research advisors Arnab Maity, Armin Schwartzman, and Ana-Maria Staicu. This dissertation would not have been possible without each of your time

and patience. Thank you for your encouragement, ideas and debates, and professional guidance. I also want to thank my committee member David Dickey for a willingness to share time and

perspective; and the many other amazing faculty, staff, and fellow students who have made my

time at NC State worthwhile.

This dissertation started on a hill in Minnesota, and so I want to thank my faculty advisors

from St. Olaf College. Thank you Kristina Garrett for helping me see creativity and excitement

in mathematical research, and Paul Roback for introducing me to statistics and encouraging me to apply to graduate school.

Finally, I want to thank my parents for dropping me off at Mendota Elementary many years

ago. Mom, thank you for worrying for me, and always pushing me to do my best. Dad, thank for your example, and teaching me everything from chess to baseball. I could not have asked

for better support, and I am in huge debt to both of you for the wonderful opportunities it has

(8)

TABLE OF CONTENTS

LIST OF TABLES . . . vii

LIST OF FIGURES . . . viii

Chapter 1 Introduction . . . 1

Chapter 2 Estimation of Interaction in Functional Linear Regression . . . 4

2.1 Introduction . . . 4

2.2 Methodology . . . 6

2.2.1 Fitting . . . 9

2.2.2 Standard Error Estimation . . . 10

2.2.3 Smoothing parameter selection . . . 11

2.3 Extensions . . . 12

2.3.1 Generalized Functional Interaction Models . . . 12

2.3.2 Functional Principal Components Analysis . . . 14

2.4 Functional Principal Components Regression . . . 15

2.5 Simulation . . . 16

2.5.1 Design and Assessment . . . 17

2.5.2 Simulation Results . . . 18

2.6 Applications . . . 19

2.6.1 DTI data . . . 19

2.6.2 Application to AneuRisk study . . . 21

2.7 Implementation . . . 23

2.8 Discussion . . . 24

Chapter 3 Testing for Interaction in Functional Linear Regression . . . 26

3.2.1 Wald Test . . . 28

3.2.2 Low-Dimensional Framework . . . 30

3.3 Simulations . . . 32

3.3.1 Simulation Design . . . 32

3.3.2 Implementation . . . 32

3.4 Data Analyses . . . 34

3.4.1 DTI . . . 34

3.4.2 Aneurisk . . . 34

Chapter 4 Glacier Terminus Estimation from LandSat Images . . . 36

(9)

4.2.2 The Model . . . 40

4.3 Spatial Smoothing . . . 42

4.3.1 Spline Smoothing . . . 42

4.3.2 Basis and Penalty Specification . . . 43

4.4 Pilot Path Algorithm . . . 44

4.5 Temporal Smoothing . . . 45

4.5.1 Weights . . . 45

4.5.2 Estimation . . . 47

4.6 Simulation Study . . . 48

4.6.1 Data Generation . . . 48

4.7 Real Data Analyses . . . 51

4.8.1 Alternative Directions . . . 52

References. . . 56

Appendices . . . 65

Appendix A Supplementary Material for Estimation of Interaction in Functional Linear Regression . . . 66

A.1 DTI Results . . . 66

A.2 Aneurisk Results . . . 68

A.3 Simulation Results . . . 71

Appendix B Supplementary Material for Testing for Interaction in Functional Linear Regression . . . 78

B.1 Simulation Settings . . . 78

B.2 Simulation Results . . . 79

Appendix C Supplementary Material for Glacier Terminus Estimation from Land-Sat Images . . . 84

C.1 Data Analyses . . . 84

C.1.1 Gorner . . . 85

C.1.2 Franz Josef . . . 86

(10)

LIST OF TABLES

Table 4.1 Simulation results for g1(t) . . . 50

Table 4.2 Simulation results for g2(t) . . . 50

Table A.1 Mis-classification rates for PFR interaction model in Aneurisk Data. . . 69

Table A.2 Mis-classification rates for FPCR interaction model in Aneurisk Data. . . 69

Table A.3 Simulation results when predictors are observed without error. . . 75

(11)

LIST OF FIGURES

Figure 4.1 Nigardsbreen results. . . 38

Figure 4.2 Motivation for model assumptions. . . 41

Figure 4.3 Motivation for estimation criterion. . . 42

Figure 4.4 Visual explanation of pilot path algorithm. . . 44

Figure 4.5 Simulation settings. . . 48

Figure 4.6 Jackknife confidence intervals. . . 53

Figure A.1 DTI data. . . 66

Figure A.2 PFR estimates for DTI study. . . 67

Figure A.3 FPCR estimates for DTI study. . . 67

Figure A.4 Aneurisk Data. . . 68

Figure A.5 PFR estimates for Aneurisk65 data. . . 68

Figure A.6 FPCR estimates for Aneurisk65 data. . . 69

Figure A.7 PFR estimated probabilities of group membership for Aneurisk65 data. . . . 70

Figure A.8 FPCR estimated probabilities of group membership for Aneurisk65 data. . . 70

Figure A.9 Results for situation where Gaussian data was generated from the interaction model at sample size 100. . . 71

Figure A.12 Results for situation where Gaussian data was generated from the interaction model at sample size 200, and interaction model is mis-specified. . . 74

Figure B.1 Settings for testing simulation. . . 78

Figure B.2 Quantile-Quantile plots for empirical null distribution of P-values compared to a uniform(0,1) distribution for Gaussian data simulation with n= 100. . . 79

Figure B.3 Power results for simulation from Gaussian data simulation withn= 100. . . 80

Figure B.4 Quantile-Quantile plots for empirical null distribution of P-values compared to a uniform(0,1) distribution for Gaussian data simulation with n= 250. . 80

Figure B.5 Power results for Gaussian data simulation withn= 250. . . 81

Figure B.6 Quantile-Quantile plots for empirical null distribution of P-values compared to a uniform(0,1) distribution from Binomial data simulation withn= 250. . 81

Figure B.7 Power results for the Binomial data simulation with n= 250. . . 82

Figure B.8 Quantile-Quantile plots for empirical null distribution of P-values compared to a uniform(0,1) from Binomial data simulation withn= 500. . . 82

Figure B.9 Power results for the Binomial data simulation with n= 500. . . 83

Figure C.1 Gorner results. . . 85

Figure C.2 Franz Josef results. . . 86

(12)

Chapter 1

Introduction

Functional Data Analyses (FDA) has attracted a lot of attention recently. This interest is due to modern technology that generates unprecedented amounts of high-dimensional data

across scientific disciplines. Often, these high-dimensional data arise from smooth and unknown

underlying processes; and are considered functional data. FDA are extensions of multivariate data analyses to situations where one or more of the variables of interest are functions, as

opposed to vector valued. Broadly, FDA methods leverage smoothness in the functional data

to obtain more effective dimension reduction and information extraction.

This dissertation consists of three FDA projects. Each project is motivated by data for which

standard statistical tools are not well suited. Chapters 2 and 3 consider a study that seeks to

relate the presence or absence of aneurysms on the internal carotid artery (ICA) to the ICA geometry. For each individual, the ICA geometry is described by curvature and radius profiles

evaluated along the artery centerline. Chapters 2 and 3 also include a neurological study that

relates cognitive exam scores, to brain tractography profiles which are indicators of multiple sclerosis. Chapter 4 seeks to quantify glacier recession from satellite imagery, and the data

are 1-D image profiles extracted from a 2-D Landsat images over time. The curvature, radius,

tractography, and image profiles are all functional data, and the analyses of these data sets require FDA tools.

A common theme is that proposed methods are based on extensions of tools developed for

generalized additive models (GAMs). GAMs are generalized linear models in which at least one component of the linear predictors is explained by a smooth function of the predictor variables.

An extended definition of GAMs includes situations where the unknown functions relate to the

response through a pre-specified linear functional; and this broader definition of GAMs includes regression models where the predictors are thought of as functions. Both GAMs and FDA have

(13)

The models we consider take the general form

Yi =α+

X

j

Lijfj+i, (i= 1, ..., n), (1.1)

whereY1, ..., Ynare independent responses and thefj are smooth unknown functions of interest

of one or more variables. The Lij represent bounded linear functionals which are described by

Wahba (1990). These models provide a general framework that encompasses both functional

linear regression (focus of Chapters 2 and 3) and non-parametric regression (used in Chapter

4). Each project also shares a common basis for estimation from model (1.1). The unknownfj

are expanded in terms of known bases functions

fj =

X

k

φjk(·)βjk

where theβjk are unknown coefficients of interest. The smoothness for each function is controlled

by an extra roughness penaltyβ_jTP˜jβj, where βj is the corresponding vector of coefficients and

˜

Pj is a pre-specified penalty matrix. The model can be written asYi =Xiθ+i, where X is a

design matrix whose ith row contains the evaluated Lijφjk, and θ is vector of coefficients that

contains α and the sub-vectors βj. For penalty matrices Pj whose only non-zero elements are

blocks ˜Pj, such thatβjTPjβj =βjTP˜jβj,θis estimated by

ˆ

θ= arg max_θ

 



l(θ)−

J

X

j=1

λjβjTPjβj

 



, (1.2)

where l(θ) is the log-likelihood and the λj are roughness penalties that control the tradeoff

between the smoothness of the function estimates and the fit to the data. The roughness

penal-ties drive model fitting, and are selected by prediction based criteria such as cross-validation,

or from a mixed model representation of the fitting procedure where the smoothing parame-ters enter as variance components. From the estimated coefficients ˆθ, the individual functional

estimates are then obtained by ˆfj =P_kφjk(·) ˆβjk.

A recurring challenge is producing reliable inferential procedures for the function estimates, both in terms of point-wise intervals and hypothesis tests. This difficulty is the cost of flexibility

provided by GAM estimation with penalized splines. While GAMs provide a rich framework for estimation, the roughness penalties induce bias into the fitting procedure. This bias disrupts

standard inferential tools for linear models and generalized linear models (GLMs), and if left

un-accounted for, leads to under-estimation of the uncertainty in parameter estimates. Therefore, in each project we follow a Bayesian approach to obtain interval estimates and develop hypothesis

(14)

Chapter 2 focuses on estimation in a model with a scalar response and two functional

predictors that accommodates two-way interactions in addition to their main effects. The main effects of the functional predictors on the mean response is quantified by the inner product

of the individual predictors with unknown main effect functions β1 and β2; while the effect

of interaction is modeled as the inner product of the point-wise product of the functional predictors with an unknown bi-variate function γ. Extensions to generalized linear models and

data observed on sparse grids or with error are also presented. For comparison purposes, we

consider a low-dimensional approach based on an extension to Functional Principal Components Regression (FPCR). We evaluate our methodology through extensive simulation study, and also

illustrate the procedure on a brain tractography data set and the AneuRisk65 study data.

Chapter 3 considers the issue of hypothesis testing for a two-way interaction effect in between two functional predictors in a functional linear model. We develop three procedures for testing

hypotheses a two-way interaction effect between two functional predictors in a functional linear

model with a scalar response. The first approach we propose is an natural extension of the penalized spline based estimation procedure described above, where a test statistic is developed

that is based on a Bayesian approach to uncertainty estimation. For comparison we also develop

testing procedures based on a low-dimensional representation of the functional linear model, where the coefficients for interaction enter as either fixed or random effects. The test procedures

are extended to the generalized linear models settings, and to scenarios where the functional covariates are measured with error or on a sparse grid.

Chapter 4 develops a method to detect a smooth path of change points from a time series

of continuous curves. This goal of this project is to develop an automated way to track glacial recession from satellite imagery. The proposed method simplifies the problem from 2-D to

1-D by extracting spatial image intensity profiles along the inlets in the glacier, and tracking

the glacier terminus over time. For each intensity profile we apply non-parametric regression and inflection points in the regressor functions are seen as potential candidates for the terminus

location. A pilot algorithm is used to identify the best candidate for the glacier terminus location

over time. Finally, nonparametric regression is applied to the temporal sequence of estimated terminus locations in order to quantify the long-term trend in the glacier’s retreat. Our method

is shown to be effective through both simulation study, and real data analyses of the Gorner,

(15)

Chapter 2

Estimation of Interaction in

Functional Linear Regression

2.1 Introduction

Functional linear regression models with scalar response and functional covariates have received

a significant amount of attention in literature since its introduction by Ramsay and Dalzell (1991). A typical functional linear model with a single functional predictor quantifies the effect

of the predictor as an inner product between the functional predictor and an unknown

coeffi-cient function. Estimation of the coefficoeffi-cient is done using basis expansions using pre-specified basis functions, e.g., spline or Fourier bases, or empirical eigenbasis functions. Estimation and

inference on this model is well studied, see for example, Frank and Friedman (1993), Cardot

et al. (2003b) and Hall and Horowitz (2007). There have been several extensions to the func-tional linear models, including nonparametric dependence for the predictors (Ferraty and Vieu,

2002); parametric models with quadratic dependence (Yao and M¨uller, 2010), additive

regres-sion models accounting for linear main effects of multiple predictors (James, 2002; Goldsmith et al., 2011; Gertheiss et al., 2013) as well as nonlinear additive models (Ferraty and Vieu, 2009;

Chen et al., 2011; Fan and James, 2012). However, all the above mentioned literature consider only main effects, whether linear or nonlinear, of the functional predictors, and do not account

for a possible interaction effect between two different functional covariates. The main focus of

this chapter is a functional linear regression model that accounts for a two-way interactions in addition to the main effects of the functional variables.

The model we consider is defined as follows. Suppose for i= 1, . . . , n, we observe a scalar

(16)

X1i(·) andX2i(·) observed without noise, on dense grids. We propose the model

E[Yi|X1i, X2i] =α+

R

X1i(s)β1(s)ds+

R

X2i(t)β2(t)dt+

R R

X1i(s)X2i(t)γ(s, t)dsdt, (2.1)

where α is the overall mean, β1(·) and β2(·) are real-valued functions defined on τ1 and τ2 respectively, and γ(·,·) is a real valued bi-variate function defined on τ1×τ2. The unknown functions β1 andβ2 capture the main effects of the functional covariates, while γ captures the interaction effect. To gain some insight, consider the particular case β1(·) ≡β01, β2(·) ≡ β02, γ(·,·) ≡γ0, for scalars β01, β02,and γ0. This case reduces to the common two-way interaction model, with covariates ¯Xji =

R

Xji(s)ds, which act as a sufficient summaries, Xji, j = 1,2.

Thus the proposed model is an extension of the common two-way interaction model from scalar

covariates to functional covariates. The denseness of the sampling design and the noise free assumption are made for simplicity and will be relaxed in later sections.

Recently, Yang et al. (2013) introduced a class of functional polynomial regression models

of which model (2.1) is a special case. They showed that accounting for a functional interac-tion effect between of depth spectrograms and temperature time series improved predicinterac-tion of

sturgeon spawning rates in the Lower Missouri river. The proposed methodology relies on an

orthonormal basis decomposition of the functional covariates and parameter functions, com-bined with stochastic search variable selection in a fully Bayesian framework. This approach

requires full prior specification of several parameters, along with implementation of an MCMC

algorithm for model fitting.

Due to the developments of Yang et al. (2013), the contribution of this chapter is a novel

approach to estimation, inference and prediction in a functional linear model that includes a two-way interaction. We consider a frequentist view and model the parameter functions using pre-determined spline bases and achieve dimension reduction with shrinkage as opposed to

se-lection. The proposed method is close in spirit to Goldsmith et al. (2011), who consider only

additive effects of the functional covariates. The inclusion of an interaction term between the functional predictors involves additional computational and modeling challenges, which we

dis-cuss throughout the chapter. The main advantage of our approach is that it can be implemented

with readily available software, that accomodates 1) responses from any exponential family, 2) functional covariates observed with error, or on a sparse or dense grid, and 3) produces

p-values for individual model components, which include the interaction term. We also include a

numerical comparison between the additive and interaction functional models involving scalar response. Our findings can be summarized as follows. When the true model contains an

interac-tion between the funcinterac-tional covariates, as specified in (2.1), then fitting a simpler additive model (Goldsmith et al., 2011) leads to biased estimates and low prediction performance compared

(17)

then with sufficient sample size, fitting the more complex functional interaction model does not

harm the estimation, inference or prediction performance.

The remainder of this chapter is as follows. In Chapter 2.2, we motivate and develop the

es-timation framework for model (2.1). Chapter 2.3 extends the methodology to handle outcomes

from any exponential family and situations where predictors are measured sparsely or with er-ror. In Chapter 2.4 we briefly present an alternative low-dimensional framework for estimation,

that is used for comparison in data analyses. Chapter 2.5 consists of a simulation study that

focuses on the effects of model mis-specification. Chapter 2.6, we apply the interaction model to a publicly available tractography data set, and the AneuRisk65 data. Details about

imple-mentation are provided in Chapter 2.7. We end with a discussion that emphasizes directions

for future research in Chapter 2.8.

2.2 Methodology

We first discuss the case when the response variableY is continuous and the covariates,X1(·) and X2(·), are observed on a dense design and without noise. In chapter 2.3, we generalize our procedure to accommodate noisy and/or sparely observed predictors as well as generalized

response variables. The central idea behind our approach is to model the parameter functions using pre-specified spline bases and then use a quadratically penalized estimation procedure to

control smoothness of the estimates.

We consider basis function decompositions of the parameter functions using known spline bases. Specifically, let{ψ1k(s)}Kk=1 and {ψ2l(t)}Ll=1 be spline bases in L2(τ1) andL2(τ2) respec-tively, and furthermore let {φkl(s, t) = ψ1k(s)ψ2l(t)}1≤k≤K,1≤l≤L be the corresponding tensor

product basis in L2₍_τ1_×_τ2_{). We assume the representations:}

β1(s) =

K

X

k=1

ψ1k(s)η1k; β2(t) =

L

X

l=1

ψ2l(t)η2l, γ(s, t) = K

X

k=1

L

X

l=1

φkl(s, t)νk,l;

where η1k’s,η2l’s, and νk,l’s are the corresponding coefficients, which are unknown. Thus

esti-mation of the parameter functions is reduced to estiesti-mation of the unknown coefficients. Using

the basis function expansions we write

R

X1i(s)β1(s)ds=PK_k₌₁η1k

R

X1i(s)ψ1k(s)ds≈PK_k₌₁η1ka1k,i

where a1k,i ≈

R

X1i(s)ψ1k(s)ds is calculated by numerical integration techniques; see for

ex-ample Goldsmith et al. (2011) who employ a similar technique. Similarly, we have RX2i(t)

β2(t)dt ≈ PL

l=1η2la2l,i and

R

X1i(s)X2i(t)γ(s, t)dsdt ≈ PK_k₌₁PL_l₌₁νk, la3k,l,i, where a2l,i ≈

R

X2i(t)ψ2k(t)dt and a3k,l,i ≈ {

R R

(18)

nu-merically. The assumption that the functional covariates are observed on dense grids of points

ensures that the integrals are approximated accurately.

To control the smoothness of the parameter functions, we take the established approach

of considering rich spline bases to model the parameter functions and drive fitting with added

“roughness” penalties to the least squares fitting criterion (Eilers and Marx, 1996; Ruppert, 2002; Cardot et al., 2003b; Eilers and Marx, 2005; Reiss and Ogden, 2010). The key assumption

in this approach is that the bases are flexible enough to capture the true parameter functions,

and that the penalties are large enough to avoid over-fitting. Letη1= (η11, . . . , η1L)T; similarly

defineη2and ν. Then the parametersα, η1, η2 andν are estimated by minimizing the penalized

criterion:

Pn

i=1(Yi−α−a1T,iη1−aT2,iη2−aT3,iν)2+P1(λ1, η1) +P2(λ2, η2) +P3(λ3, λ4, ν), (2.2)

whereYiis the observed response for individuali,a1,iis theK-dimensional vector ofa1k,i,a2,iis

theL-dimensional vector ofa2l,i, anda3,iis theKL-dimensional vector ofak,l,i;P1(λ1, η1), P2(λ2, η2),

andP3(λ3, λ4, ν) are penalty terms, andλ1, λ2, λ3, andλ4are corresponding smoothing

parame-ters. As is common in functional data we control smoothness by penalizing the derivatives of the

parameter functions. Generally speaking, we use penalties based on integratedpthorder deriva-tives, that is, Pj(λj, ηj) =λj

R {∂p_β

j(s)/∂sp}2ds,j = 1,2 are the penalty terms corresponding

to the main effects of the functional covariates, and

P3(λ3, λ4, ν) =λ3

Z Z {∂

p

∂spγ(s, t)}

2_dsdt₊_λ 4

Z {∂

p

∂tpγ(s, t)}

2_dsdt _(2.3)

is the tensor product penalty corresponding to the interaction term. Defineψ(p)(·) =dpψ(·)/dtp for some generic functionψ(·). Then it is easily seen thatP1(λ1, η1) =λ1η1TP1pη1andP2(λ2, η2) = λ2ηT₂P2pη2; where P1p =

R

ψ₁(p)(s){ψ₁(p)(s)}T_ds_and _P2 p =

R

ψ₂(p)(t){ψ₂(p)(t)}T_dt_{; with}_ψ(p)

1 (s) = (ψ₁₁(p)(s), ..., ψ(₁p_K)(s))T _and _ψ(p)

2 (t) = (ψ (p)

21(t), ..., ψ (p)

2L(t))T. For the tensor product penalty for

interaction, it can be shown (see for example Wood (2006a) Chapter 4.1.8) thatP3(λ3, λ4, ν) =

νT{λ3P1p⊗IK+λ4IL⊗P2p}ν .

Many authors have chosen to penalize integrated squared second derivatives, i.e. p= 2, for

fitting (2.2); see for example Ramsay (2005). However, in this chapter we favor penalties on the

integrated squared first derivatives, i.e. p = 1; see also Gertheiss et al. (2013) who considered this idea. Our main argument for this choice is that the first derivative penalty directly penalizes

deviations from a simpler model where the functional predictors are summarized in terms of their

average values. Infinite penalties enforce constant parameters, say β01, β02 and γ0, and revert the model back toYi=α+ ¯X1iβ01+ ¯X2iβ02+ ¯X1iX¯2iγ0+i - a standard two-way interaction

(19)

penalizing the first derivatives shows preference for the simpler interaction model. Moreover, we

have found in simulation, that with an interaction term in the model, penalties on the second derivatives tend to produce under-smoothed estimates.

Tensor Product or Thin Plate Regression Spline for Interaction?

In this chapter, we propose a tensor product basis and penalty combination to model the

interaction contour. This approach has precedence in the bi-variate functional linear regression (Marx and Eilers, 2005). Tensor product bases are attractive because they are computationally

efficient, can be constructed from any univariate bases, and easily extend to higher dimensions

(de Boor, 1978; Wood, 2006a). However, their key characteristic is separate smoothing ofγ(s, t) in each direction s and t. A drawback is that separate smoothing requires an extra penalty

parameter, which adds complexity to the model.

A computationally viable alternative to tensor products is to incorporate interaction with a thin plate regression spline. This approach has also been applied successfully in the bi-variate

functional linear regression (Reiss and Ogden, 2010). Thin plate regression splines (Wood,

2003) are reduced rank approximations to thin plate spline bases (Duchon, 1977); that are more computationally efficient. Thin plate regression splines have several appealing qualities

that make them attractive for incorporating interaction. First, due to their derivation from

thin-plates splines they avoid the problem of knot placement. Second, they require the use of only one smoothing parameter. This stems from the main assumption of thin-plate regression

splines - γ(s, t) is equally smooth in each direction s and t. For penalizing the second order

roughness functional the penalty takes the form

P(γ, λ) =λ

Z Z {[∂

2

∂s2γ(s, t)]

2_{+ 2[} ∂2

∂s∂tγ(s, t)]

2_{+ [}∂2

∂t2γ(s, t)] 2_}_dsdt

One limitation of the thin plate regression splines is that for technical reasons the derivative penalty should be greater than one when penalizing a bi-variate surface. Therefore penalization

of integrated first derivatives, that penalize deviations from simpler models where the functional

predictors are summarized as continuous covariates, are not allowed.

Whether the interaction contour requires separate smoothing in each direction sand tis a

question for the data modeler. As a rule of thumb, if the functional predictorsX1(s) andX2(t)

have different levels of wiggliness, or are observed on different scales, it is advisable to use the tensor product approach. However, the primary reason we have focused on the tensor product

approach in this chapter is a practical one. The tensor product approach was the first that we

(20)

splines can be used to fit the model, we do not know how to extract estimated parameters.

In the data applications we have attempted both approaches has given highly similar results. Moreover, model selection criteria such as the generalized cross validation (GCV), Aikaike’s

criterion, etc. can be used to aid decisions. For the rest of this chapter assume we use the tensor

product basis and penalty.

Limitations of Model Fitting

It is clear we can estimate β1(·), β2(·) and γ(·,·) as far as they lie in the span of the pre-specified bases, but another important limitation exists for the model fitting. The unknown

parameter functions can be identified uniquely only up to the projections onto the respective spaces that generate the X1i, X2i, and X1iX2i. For example, the true β1(·) may not be re-covered completely; instead only it’s projection on the space defined by the curves X1i(·) will

be estimated. To see this, imagine a case where all X1i(·) lie in a finite dimensional space, say

X1i(s) =Pq_`₌₁ξ1i`Φ`(s) for some orthogonal basis inL2(τ1),{Φ`(·)}`. Ifβ1(s) =β10(s)+ζΨq0(s)

such that<Ψq0,Φ_` >_L2= 0 for all 1≤`≤q, then we have RX₁_i(s)β₁(s)ds=RX₁_i(s)β₁0(s)ds.

The situation is similar for the other two smooth effects, β2 and γ. While this is a conceptual limitation it relates to a practical issue in model fitting. Over regions where X1i(·),X2i(·), and

X1i(·)X2i(·) tend to be flat, it is difficult to pick up anything but constant signals. Whereas

regions over which the covariates are highly variable have the potential to identify richer signals.

2.2.1 Fitting

The criterion in (2.2) has an available analytical solution. Stack the column vectors defined

from (2.2) into individual design matrices A1 = [a11|...|a1n]T, A2 = [a21|...|a2n]T, and A3 =

[a31|...|a3n]T. Then combine these into an overall model design matrix A = [1|A1|A2|A3], and defineSλ to be a block diagonal matrix with blocks [0,λ1P1,λ2P2,{λ3P1⊗IK+λ4IL⊗P2}]. By the standard ridge regression formula we obtain parameter estimates ˆθ = ( ˆα,η1,ˆ η2,ˆ νˆ) = (ATA+Sλ)−1ATY, and by extracting ˆη1,ηˆ2, and ˆν we obtain

ˆ β1(s) =

K

X

k=1

ψ1k(s)ˆη1k; βˆ2(t) =

L

X

l=1

ψ2l(t)ˆη2l; ˆγ(s, t) = K

X

k=1

L

X

l=1

φkl(s, t)ˆνk,l. (2.4)

Predicted values for the response are obtained by

ˆ

Y =A(ATA+Sλ)−1ATY =HλY. (2.5)

Here Hλ represents the hat or influence matrix, which is important in Chapter 3.1 when

(21)

choice of the smoothness parameters. We discuss this issue in Section 2.2.3.

2.2.2 Standard Error Estimation

Estimation of confidence bands using penalized splines is a delicate issue (see Ruppert et al. (2003), Chapter 6). A straightforward approach is to construct approximate point-wise errors

bands on the frequentist covariance matrix Cov(ˆθ) = (ATA+Sλ)−1ATA(ATA+Sλ)−1σ2.This

is the approach presented by Ramsay (2005) (Chapter 15) in their treatment of the simple

functional linear model, and has also been used in the context of non-parametric regression

(Hastie and Tibshirani, 1990; Eilers and Marx, 1996; Goldsmith et al., 2011). However, we find in the simulation study of section 2.5 that confidence bands based on this covariance

often provide point-wise under-coverage. This problem has been noticed previously for

non-parametric additive models (Wood, 2006a), and for functional linear models (Goldsmith et al., 2011; McLean et al., 2012). Such under-coverage can be attributed to several important factors.

First, the penalized fitting procedure provides biased estimates of θ whenever θ 6= 0. Second, the fitting is conditional on the smoothing parameters whose uncertainty is not taken into account. Third, the level of bias induced by the penalty parameters can vary over the domain

of the functional parameters. One possible alternative to account for bias is to use the Bayesian

standard errors first proposed for smoothing splines by Wahba (1983) and cubic splines in Silverman et al. (1985). By specifying an improper prior, fθ(θ) ∝ e−θ

T_S

λθ_{, it can be shown} that θ|Y, λ ∼N(ˆθ,(ATA+Sλ)−1σ2) (see Wood (2006a), Section 4.8). The matrix CovB(ˆθ) =

(ATA+Sλ)−1σ2 is known as the Bayesian covariance matrix. This matrix can be decomposed:

CovB(ˆθ) =



    

Σαˆ Σα,ˆηˆ1 Σα,ˆηˆ2 Σα,ˆˆν

Σα,ˆηˆ1 Σηˆ1 Σηˆ1,ηˆ2 Σηˆ1,ˆν

Σα,ˆηˆ2 Σηˆ1,ηˆ2 Σηˆ2 Σηˆ2,ˆν

Σα,ˆνˆ Σηˆ1,νˆ Σηˆ2,νˆ Σˆν



    

= (ATA+Sλ)−1σ2, (2.6)

to obtain standard errors for the parameter estimates. The Bayesian formulation to the co-variance for the estimates of ˆθ is important because we use it to obtain confidence intervals.

For example, if we consider φ(s, t) = [φ1(s, t), ..., φKL(s, t)] we can obtain the covariance for

interaction Σγˆ(s,t)=φ(s, t)TΣνˆφ(s, t). Intervals are found from ˆγ(s, t) ∼N(E[ˆγ(s, t)],Σγˆ(s, t)) by standard linear models tools.

Wahba (1990) (Chapter 5) presents Bayesian confidence intervals in a general framework that contains the functional linear models; but theoretical and numerical studies of the finite

sample properties of these intervals have focused on non-parametric regression. Nychka (1988)

(22)

gaus-sian non-parametric setting. Marra and Wood (2012) studied coverage properties in the multiple

component case, where intervals are allowed to have variable width, and discussed the effects of identifiability issues. Gu and Wahba (1993) is the only work that studies coverage properties for

an interaction term, but again this focuses on the non-parametric regression setting (functional

ANOVA). Broadly, these studies conclude the main issue for proper interval coverage, in the across-the-function sense, is that the bias must only represent a modest fraction of the

over-all mean squared error. The finite sample properties of Bayesian intervals in functional linear

models is open research. We evaluate and compare the performance of both the Frequentist and Bayesian intervals in Section 2.5.

2.2.3 Smoothing parameter selection

Smoothing parameter selection is pivotal to model fitting. The smoothing parameters deter-mine the trade-off between the model’s fidelity to the data and the smoothness of the estimated

parameter functions. There are three main classes of approaches to select the smoothing

pa-rameters available within the mgcv software implementation described in Section 2.7. In this section we describe each approach and discuss which we prefer.

The first general approach is based off minimizing the expected mean square error of the

model. The un-biased risk estimation (UBRE) and equivalently Mallows’Cp criterion fall into

this class of approaches. As discussed in Wood (2006a) (Chapter 4.5.1) the goal is to minimize

E(M) =E(kµ−HλYk2)/n=E(kY −HλYk2)/n−σ2+ 2tr(Hλ)σ2/n,

whereM represents the expected mean squared error andHλ represents the hat matrix shown

in (2.5). The challenge is that the above expression contains the residual variance, σ2, which is

unknown. This criterion performs poorly when σ2 is unknown, which is often the case.

A second class of approaches attempt to minimize the mean square prediction error P =

M+σ2 (Wood (2006a) Chapters 4.5.2). To estimateP the standard is to use cross validation.

The most straightforward form of this is the leave-one-out (LOO) cross-validation procedure

Vo(λ) =

1 n

n

X

i=1

(Yi−HλY(−i))2,

where Y₍−i) represents the response vector with the ith subject removed. The LOO cross-validation criteria can be rewritten

Vo(λ) =

1 n

n

X

i=1

(Yi−HλYi)2

(1−Hλ,ii)2

(23)

In the functional regression literature this approach was first used by Marx and Eilers (1999)

for simpler model with only one smoothing parameter. The main problem is that minimizing Vo for model (2.2) requires a computationally intensive grid search over multiple smoothing

parameters. A second weakness of ordinary cross-validation is a lack of invariance to orthogonal

rotations of the fitting procedure (Golub et al. (1979), Craven and Wahba (1978), Wood (2006a) Chapter 4.3). These issues motivate generalized cross validation (GCV):

Vg(λ) =

nkY −HλYk2

[n−tr(Hλ)]2

. (2.7)

For functional linear regression GCV was proposed by Cardot et al. (2003b). GCV can be

seen as ordinary cross-validation on an nice orthogonal rotation of the fitting problem that

gives equal leverage to each observed point in the data (Wahba, 1990; Wood, 2006a). The primary advantage of GCV is that (2.7) can be optimized numerically over multiple smoothing

parameters. This optimization is done in mgcv using a combination Newton’s method backed by steepest descent (for details see Wood (2004), Wood (2006a) Chapters 4.6-4.7).

The final approach to smoothing parameter selection is based on transforming the

penal-ized criterion in (2.2) into an analogous mixed model, where the smoothing parameters enter

as variance components. The variance parameters are then estimated by restricted maximum likelihood (REML) (Ruppert et al., 2003). While asymptotically prediction error criteria

pro-vide superior predictor error performance than likelihood based methods (Wahba, 1985), it is

generally known that GCV can often lead to severe under smoothing in finite samples (Wahba, 1985; Gu, 2002; Ruppert et al., 2003). Reiss and Ogden (2007) found REML yielded superior

results to GCV in simulation studies in their study of penalized functional principal components

regression, and this motivated Reiss and Ogden (2009) to analytically compare the GCV and REML criteria for finite samples. The comparison showed that relative to GCV; REML

esti-mates were less prone to multiple minima, had reduced smoothing parameter variability, and

offered overall greater resistance to over-fit. We use REML to select smoothness parameters for the Gaussian data in our simulation in Section 2.5.

2.3 Extensions

2.3.1 Generalized Functional Interaction Models

Consider the case when the outcomeYi is generated from an exponential family EF(ϑi, %) with

dispersion parameter % such that E{Y|X1i(·), X2i(·)} = g−1(ϑi), where the linear predictor

ϑi = α+

R

X1i(s)β1(s)ds+

R

X2i(t)β2(t)dt+

R R

X1i(s)X2i(t)γ(s, t)dsdt and g(·) is a known

(24)

used for the unknown parameter functionsβ1, β2, andγ. The linear predictor can be simplified

to ϑi = α+PK_k₌₁η1ka1k,i+PL_l₌₁η2la2l,i+PK_k₌₁PL_l₌₁νk,la3k,l,i, where K and L are chosen

sufficiently large to capture the variability in the parameter functions. We estimate the model

components by minimizing (2.2) with the understanding that the sum of squares criteria is

replaced by the model deviance. For given smoothing parameters there is a unique solution which can be obtained by a penalized version of iteratively re-weighted least squares (Wood,

2006a).

Inference for the non-Gaussian case can proceed from large sample results, that again can take either a frequentist or Bayesian viewpoint. Asymptotic normality of these estimators based

on the frequentist view can be obtained (Cox and Hinkley, 1979). However, this approach

suffers from the bias in the estimation procedure discussed in Section 2.2.2. The ‘Bayesian’ standard errors of Wahba (1983) and Silverman et al. (1985) have been extended to

non-Gaussian settings in non-parametric regression by (Gu, 1992; Gu and Wahba, 1993; Ruppert

et al., 2003; Wood, 2006c; Gu, 2013). Wood (2006c) shows that in the asymptotic limit θ|Y ∼

N(ˆθ,(ATW A+Sλ)−1σ2), where ˆθ is the minimizer of an analog to (2.2) and the matrix W

has diagonal entries Wii =g0(ϑi)2V(ϑi))−1: where V is the function such that V(ϑi)σ2 is the

variance ofY, whereσ2is the scale parameter. In the context of generalized additive models fit by penalized regression splines, the ‘across-the-function’ coverage properties of these Bayesian

intervals was studied by Marra and Wood (2012). These intervals have not been studied in the context of any functional linear regression. As in the gaussian case, these intervals are

conditional on the estimated smoothing parameters.

Smoothing Parameter Selection for GLMs

Recall in Section 2.2.3 we preferred REML for smoothing parameter selection in Gaussian

models. In generalized linear models (GLMs) the analog to this is PQL, where the REML (or ML) criteria is applied iteratively to working linear approximations of the likelihood. However,

a problem with this approach is that iterations fail to converge in many applications (Wood, 2011). This non-convergence issue is especially bad for binary data (Breslow and Clayton, 1993;

Wood, 2004).

A viable alternative is to base smoothness parameter selection on GCV. For GLMs, the analog to (2.7) is based on

Vg(λ) =

nD(ˆθ) [n−τ]2,

(25)

least squares at convergence. Wood (2008) provides technical optimization details and shows

through simulation and practical examples that it performs superior to PQL.

Recently, Wood (2011) proposed an efficient and stable methodology to select the smoothing

parameters for GLMs based on REML. The main idea is to avoid convergence issues of PQL

by first applying a Laplace approximation to the model likelihood, and then select smoothing parameters with REML; as motivated by Reiss and Ogden (2009). This approach was shown

to have huge advantages over PQL; and modest advantages over GCV of Wood (2008); in

simulation studies. We apply this method to determine smoothness for the logistic regressions performed in the simulation and data analyses in Sections 2.5 and 2.6.

2.3.2 Functional Principal Components Analysis

In real world situations, the true functional predictors, X1i(s) and X2i(t), are unknown and

observed discretely. Often these discretely observed curves lack smoothness due to either

mea-surement error; or because they have been evaluated at sparse and/or irregular design points

such that the set of all evaluated points is on a dense grid. Following the established frame-work of penalized function regression (Goldsmith et al., 2011); in this situation estimation is

a two-stage procedure in which the first stage is to predict the true functional predictors with

Functional Principal Component Analysis (FPCA). The basic aim of FPCA is to decompose the space of the functional predictors into basis functions which explain principal directions of

their variation. These derived basis functions can be used to predict the true unobserved

pre-dictors. There are multiple approaches to FPCA in the literature (Staniswalis and Lee, 1998; Yao et al., 2003; Di et al., 2009; Goldsmith et al., 2013)). For completeness we describe the

general framework for FPCA, and the specific approach we prefer.

For j= 1,2, given predictorsXji(s) define the covariance operator ΣjX(s, s0) = Cov[Xji(s),

Xji(s0)]. By Mercer’s theorem this covariance operator can be expressed by the spectral

de-composition ΣX_j (s, s0) =P∞

k=1λjkϕjk(s)ϕjk(s0), where {λjk, j ≥1} are decreasing eigenvalues

withP∞

j=1λjk <∞and ϕjk(s) are the eigen-functions that represent the principal component

bases of interest. By Karhunen-Loeve (KL) we can expandXji(·) =Pk∞=1ξikϕjik(·), where the

corresponding functional principal component scores ξjik =

R

Xji(s)ϕjk(s)dt have mean zero

and are uncorrelated with variance λjk.

FPCA approaches are diverse due to various scenarios over which the functions are observed;

such as with measurement error or on sparse grids at the subject level. For example, measure-ment error makes the estimate of ΣX

j (s, s0) too rough, and so a common first step is to smooth

the estimated covariance matrix. Moreover, in practice, the KL expansion is truncated and

pre-dictors are approximated by Xji(·) ≈ PK_k₌₁j ξikϕjik(·); a common criterion for the truncation

(26)

ξjik =

R

Xji(s)ϕjk(s)dt are approximated over the evaluated grid points. For data observed on

dense grids Riemann sum approximations work well, but for sparse data at the subject level a better alternative is to predict the scores as random effects in a mixed model framework

(Crainiceanu et al., 2008). We implement FPCA with the fpca.sc function in refund; this approach smooths the covariance matrices with bi-variate penalized splines, chooses by default the truncation lag according to 99% variance explained, and approximates principal component

scores as best linear unbiased predictors in a mixed model framework (Di et al., 2009; Goldsmith

et al., 2013). For a fuller review of FPCA see both Ramsay (2005) and Di et al. (2009).

2.4 Functional Principal Components Regression

Functional principal components regression (FPCR) is an alternative approach to our method that can be extended to incorporate interaction into the functional linear model. Traditionally

FPCR uses a low-dimensional principal component bases to model both the predictors and

main effect coefficient functions. To incorporate interaction, we use the tensor product of the principal component bases derived fromX1(s) andX2(t). We describe how to extend FPCR to

incorporate interaction more formally here.

For i = 1, ..., n we assume that X1i(s) = PK_k₌₁ξikϕ1k(s) and X2i(t) = PL_l₌₁ξikϕ2l(s).

Here for k = 1, ..., K and l = 1, ..., L the functions ϕ1k(s) and ϕ1l(t) represent the principal

component bases andξikandξiktheir corresponding scores that are estimated by FPCA. Instead

of spline modeling, the principal component basesϕ1k(s) andϕ2l(t) are used to directly model

the parameter functions

β1(s) =

K

X

k=1

ϕ1k(s)η1; β2(t) =

L

X

l=1

ϕ2l(t)η2; γ(s, t) =

K X k=1 L X l=1

ϕ1k(s)ϕ2l(t)γkl;

where theηjk’s andγjk’s are unknown coefficients of interest. Since the the principal component

bases are orthonormal, one can show the original regression model from (2.1) can be re-expressed

E[Yi|X1i, X2i] =α+ K

X

k=1

ξikη1k+ L

X

l=1

ξilη2l+ L X l=1 K X k=1

ξilξikγkl. (2.8)

From this simplified linear model representation, the standard tools for estimation, inference,

and prediction apply. In these applications it is recognized that fitting is conditional on the

estimated principal component bases and scores. FPCR is an un-penalized and low dimensional approach to functional linear regression. Instead of extra roughness penalties, the truncation

lags K and L tune the trade-off between bias and variance in the model. The truncation lags

(27)

2000). FPCR can also be easily extended to the generalized linear model setting, and to handle

extra covariates in the model. Finally, using the covariance matrix for estimated coefficients, and the fixed principal component bases ϕ1k(s) and ϕ2l(t), point-wise confidence intervals can

be estimated for the parameter functions using the methods described in 2.2.2.

To the best of our knowledge, FPCR has not been extended to incorporate interaction in the functional linear model, where estimation follows a frequentist framework. However, the

recent work of Yang et al. (2013) proposed methodology that combines FPCR with stochastic

search variable selection in a fully Bayesian framework. Also, Sangalli et al. (2009) propose the use of a classification procedure that incorporates FPC scores and their point-wise products.

It is also worthwhile to note that Reiss and Ogden (2007, 2010) proposed a penalized

version of FPCR for functional linear regression. This hybrid methodology starts with the fitting framework in 2.2, but then performs on PCA on the design matrix from 2.2. The principal

components rotation is then used to rotate the pre-specified quadratic penalties. A drawback

of this approach is that both the roughness penalties and the number of FPCs are tuning parameters in the model, that need to be chosen in combination. When the model contains

multiple functional predictors, and an interaction term is included, the computational burden

of tuning parameter selection makes this approach unappealing.

2.5 Simulation

In this section we perform a numerical study of our method. The primary objective of this simulation is to evaluate methodology described in 2.2, in terms of both parameter estimation

and predictive performance. The functional parameter estimates are evaluated in terms of the

1) bias, 2) consistency, and 3) confidence interval coverage. Prediction is assessed in terms of estimates of the residual variance and mis-classification rates. A secondary objective of this

study is to demonstrate the effects of model mis-specification. The results show that fitting a

purely additive model when interaction is present may lead to biased estimates but fitting our approach when the true model is in fact additive does not result in significant loss of accuracy

in estimation.

A couple of notes before we go into details about this simulation. A motivation for this

simulation study is the difficulty obtaining of theoretical results for the model in 2.1. Also, this

simulation has limitations. To be cautious inferences should only be made broadly to the data generative settings described next. Regardless, this numerical study serves as a rough tool for

(28)

2.5.1 Design and Assessment

The functional covariates Xji(s) =φTj(s)ξji, j = 1,2, are generated so thatξ1i ∼M V N(0,Σ)

andξ2i ∼M V N(0,Σ) with Σ = diag(8,4,4,2,2,1,1), andφ1(s) = [1,sin(πs),cos(πs),sin(3πs),

cos(3πs),sin(4πs),cos(4πs)] andφ2(t) = [1,sin(πt),cos(πt),sin(2πt),cos (2πt),sin(4πt),cos(4πt)]. We generate the observed functional covariates both with and without independent

measure-ment error, according to the model W1i(s) =X1i(s) +δ1i(s) and W2i(t) =X2i(t) +δ2i(t), such

that forj = 1,2,δji is a white noise process with σ_δ2= 0, 1, or 4. For the parameter functions,

the main effects are defined as β1(s) = 2cos(3πs), a truly functional signal, and β2(t) = 0.5,

constant and non-dependent ont. We consider two interaction parameters: γ1(s, t) = 0,

corre-sponding to an additive model, and γ2(s, t) = sin(πs)sin(πt), a non-trivial interaction effect. All functions are evaluated at H= 100 equally spaced points overs, t∈[0,1]. We used Rie-mann sums to approximateµji =

R

Xji(s)βj(s)ds,j= 1,2, andµ3i =

R

X1i(s)X2i(t)γ(s, t)dsdt.

We consider two cases:

(A)Yi ∼ N(α+µ1i+µ2i+µ3i,1), and,

(B)Yi ∼ Bern{(eα+µ1i+µ2i+µ3i)/(1 +eα+µ1i+µ2i+µ3i)}.

We use sample sizes n = 100,200, and 500 for (A); and n = 300 and 500 for (B). For each generated sample, we observe{Yi, W1i(s), W2i(t)}ni=1. In all our simulations, we chose Ψ1(s) and Ψ2(t) to be cubic B-spline basis functions with 10 equally spaced internal knots, and penalize

integrated squared first derivatives. The penalty parameters were estimated using REML, or with the Laplace approximation to REML for Gaussian and Bernoulli data, respectively. For

comparison purposes, we also fit the additive functional linear model with the same model

specifications for bases, penalty, and roughness penalty selection procedure.

We ran 1000 Monte Carlo simulations for each setting described above. Performance was

assessed on the aggregate over all Monte Carlo runs, and the entire grids s, t∈[0,1], for each functional parameter. We evaluated estimates in terms of mean integrated squared error:

M ISE( ˆβ1) =P1000j=1

PH

h=1{βˆ1j(sh)−β1(sh)}2/(1000·H),

where ˆβ1j is the estimated parameter for thej

th _{simulated dataset. Also reported are}

‘across-the-function’ (1−α)100% confidence interval coverages:

M CI( ˆβ1) =P1000i=1

PH

h=1I

h

β1i(sh)∈ {βˆ1i(sh)±zα/2SEˆ ( ˆβ1i(sh))}

i

(29)

Predictive performance for the Gaussian data is evaluated by average prediction error (APE):

APE =P1000

j=1

Pn

i=1(yi−yˆi) 2

/(1000·n).

The optimal APE equals the residual variance of 1, APEs below 1 indicate over-fitting of the

model to the data, and APEs above 1 suggest under-fitting of the model. For the Bernoulli data

we focus on the mis-classification (MC) rate:

MC =P1000

j=1

Pn

i=1I(yi = ˆyi)/(1000·n),

where ˆyi = 0 if ˆπi ≤.5 and ˆyi = 1 otherwise.

2.5.2 Simulation Results

The full simulation results are presented in Tables A.3-A.5 and Figures A.9-A.10 in the appendix

to the chapter.

We focus first on the situation where Gaussian data is generated with the interaction termγ2 (non-trivial interaction effect). When the correct model with interaction is used, the parameter

function estimates have monotonically decreasing MISEs with increasing sample size. The APEs are all below 1 which suggests over-fitting on the average, however this over-fitting is only

moderate and decreases with sample size.

In contrast, when the incorrect additive model is used, the estimates are affected adversely for all metrics of evaluation. There is a marked increase in the MISEs for estimation ofβ1 and

β2, and a large loss of prediction power even for increasing sample size. We compare these results

of mis-specification to the situation where data is generated with γ1 (an additive model). At sample size n= 100, fitting an interaction model resulted in moderately increased MISEs and

lower APEs, due to more over-fitting. Nevertheless, application of the additive and interaction

model gave highly similar results for sample sizes of 200 and 500. The key is that with sufficient sample size to empower selection of the smoothing parameters, the model chooses the additive

fit on its own.

The frequentist confidence intervals tend to provide under-coverage, while the Bayesian intervals tend to give over-coverage, at the 95% nominal level. This challenging issue is not

specific to the interaction model however; it persists when there is no interaction and an additive

model is correctly fit. Further investigation indicates that on average, the empirical Monte Carlo standard errors of the parameter estimates are sandwiched between the average estimated

frequentist and Bayesian standard errors. The over-coverage of the Bayesian intervals is a result

of an over-correction for the bias caused by the penalized regression procedure.

(30)

A.5). This makes sense. At this magnitude the measurement error is close to as large as the

variation in the scores generating the functional covariates. The measurement error adversely affected all estimates in terms of MISE, bias, and interval coverage.

The reduced information in the Bernoulli responses led to less efficient estimation of all

parameters. One difference from the results of the Gaussian data, is that there is noticeable bias in the estimation ofγ2, and poor confidence interval coverage for interaction. However, the

effects of mis-specification tell a similar story. When γ2 is the truth and the additive model is

fit, we have inflated biases, almost non-existent confidence interval coverage, and larger mis-classification rates. In contrast, if the data is generated from γ1 and the interaction model is

fit, the results are highly similar to those found when the additive model is applied.

2.6 Applications

2.6.1 DTI data

The first illustration of our method is to a neurological data set of the white matter tracts in the

brains of multiple sclerosis (MS) patients using magnetic resonance imaging (MRI) techniques; the study has been previously described in Goldsmith et al. (2012); Swihart et al. (2013).

This is the data set is publicly available and can be found in the refund package R. MS is a brain disease that affects the central nervous system and in particular damages white matter tracts in the brain through myelin loss, lesions, and axonal damage. Diffusion tensor imaging

(DTI) has been successfully used for examining white matter tracts through modalities that

measure water diffusivity along the tracts; see for example, Basser et al. (1994); Mori and Barker (1999); Le Bihan et al. (2001). From the DTI images, continuous summaries of the white matter

structures called tract profiles are produced.

The full data consists of a longitudinal study 100 MS subjects measured over to 2 to 8 visits; but for coherency we focus on a subset of data collected for these MS subjects at their first

visit. At this initial visit, each individual was administered a Paced Auditory Serial Addition Test (PASAT) which is a commonly used examination of the cognitive function of the brain

and takes integer values between 0 and 60. Also, tract profiles were extracted along the corpus

callosum (CCA, connecting the left and right hemispheres of the brain) and the right intracranial corticospinal tract (RCST, connecting the right side of the brain and the spinal cord), yielding

two functional predictors measurements per subject. The CCA and RCST tract profiles are

observed on a regular dense grid of points, but involve some missingness, as a few subjects have incomplete tract profiles. Also, the predictors are observed with measurement error. Figure A.1

shows the data, and depicts the CCA and rCST profiles for two subjects on their first visit.

(31)

the PASAT score in MS subjects.

We use the methodology described in Section 2.3.2 to reconstruct the smooth and de-meaned trajectories; FPCA using 99.9% explained variance, as carried out with the fpca.sc function in the R package refund (Crainiceanu et al., 2012). Then the two-way interaction functional linear model with normal response is employed; the parameter functions are modeled via cubic B-spline bases with 7 equally spaced knots (K = L = 9) and the penalty criterion is based

on the norm of the first order derivative. For comparison an additive model, as proposed by

Goldsmith et al. (2011) is considered as well. The additive model fit for two functional covariates is analogous to that fit in the refund package in Rcan be used to fit the additive model with two functional covariates; this procedure uses p-splines, and a penalization based on a second

order difference penalty. For consistency reasons we fit the additive model using the same basis functions (K = L = 9) and employed a penalization based on the first order derivatives as

used by the functional interaction model, so that the difference between the results reflects the

difference in the two modeling approaches solely.

For further comparison we apply the FPCR approach to both the additive and interaction

model described in 2.4. To obtain the necessary principal component bases and corresponding

principal component scores we perform FPCA with the fpca.sc in R, and carry out the re-gression withglm. Deciding by cross-validation, we truncated the KL expansions for CCA and RCST at the top two eigen-bases which explained a total of 99.1% and 97.3% respectively.

Figure A.2 displays the parameter estimates for the proposed interaction model fit from the

PFR framework. The CCA main effect estimates from the additive and interaction model are

highly similar. The additive model parameter fit has been shrunk to a negative constant. The interaction model estimated main effect for CCA is also negative, but has a slight upward slope,

and larger confidence bands that contain 0 across the domain of the tract. For the RCST, the

estimated main effects from the additive and interaction model are almost exactly the same; they have both been penalized to a constant positive value, whose magnitude is small such

that that the Bayesian confidence bands clearly contain 0. The interaction contour has been

completely flattened in the CCA direction, and slopes gently from positive to negative in the direction of the RCST. Nonetheless, all the points on the contour are not significant when

measured point-wise using the Bayesian confidence bands.

Figure A.3 displays the FPCR parameter estimates for the additive and interaction model. Noticeably, the FPCR estimates are more wiggly than those estimates from PFR; but overall the

results are qualitatively similar. The main effect estimates for CCA are both negative across the

tract domain, while the main effect estimates for RCST are positive near the beginning of the tract and then level to zero. The estimated interaction contour is highly similar qualitatively to

that fit with PFR, but more wiggly. The interaction contour is nearly flat in the CCA direction,

(32)

no coordinates on the contour are significant when measured point-wise using the Bayesian

confidence bands.

Based on these fits, is it advantageous to include an interaction effect in this model? The

Bayesian intervals for the estimated interaction contours suggest not. However to further aid

explanation, we use learning from our simulation experiment (see Section 2.5), and consider the average prediction error (APE) corresponding to both fitting approaches. Calculations give

average prediction error (APE) for the functional interaction model fit of 150.3 and APE =

154.2 for the additive model for FPCR the APEs were 150.2 and 155.0 for the interaction and additive models. Close estimates of the main effects, and similar APEs for the additive and

interaction models show support in favor of the lack of an interaction effect between the CCA

and RCST. Point-wise confidence bands, similarity of the main effect estimates, and prediction error are ad-hoc indicators of whether interaction is present in the model. The motivates our

work in Chapter 3, where we pursue formal hypothesis tests for the functional interaction effect.

2.6.2 Application to AneuRisk study

The second application of our method is to the AneuRisk65 data described in Sangalli et al.

(2009). The goal of this study is to identify the relationship between the geometry of the

internal carotid artery (ICA) and the presence or absence of an aneurysm on the ICA. The study contains a collection of 3D angiographic images taken from 65 subjects thought to be

affected by a cerebral aneurysm. Of these 65 subjects, 33 have an aneurysm located on the

internal carotid artery (ICA), 25 have an aneurysm not located on the ICA, and 7 have no aneurysm. Since the presence or absence of an aneurysm on the ICA is of primary of interest,

subjects in the latter two groups are combined. For each subject, the images are summarized

to describe the geometry of the ICA. Piccinelli et al. (2007) approximate the centerline of the artery in 3D space and estimate the corresponding width of the artery along this centerline in

terms maximum inscribed sphere radius (MISR). Sangalli et al. (2007) provide a measure of

curvature of the artery in 3D space along the artery centerline. The curvature and MISR profiles observed along the ICA centerline serve as our functional predictors. In this situation, the 3D

geometries of the arteries are more thoroughly described by the combination the curvature and

MISR values taken along the ICA centerline, and therefore it makes sense to include a two-way interaction term in the model. Our interest is to infer whether a including a two-way interaction

term between the curvature and MISR profiles helps better explain the presence or absence of an aneurysm on the ICA.

Before applying the proposed procedure we use the registration method described in Staicu