
2.3 Principal components analysis for functional data

2.3.2 PCA for functional data

Functional principal components analysis (fPCA) was first developed by C. Radhakrishna Rao in 1958 [Rao, 1958]. It is used to analyze the geometry of the functions, capture the principal modes of variation, and reduce the dimension of the data. Let $X_1(t), \dots, X_n(t)$ denote independent and identically distributed random functions on a compact interval $T$ such that each function $X_i(t)$ belongs to $L^2[T]$, the space of all real-valued square-integrable functions defined on $T$, with true mean function $\mu(t) = E[X_i(t)]$ and covariance function $\Sigma(s, t) = \mathrm{Cov}\{X_i(s), X_i(t)\}$. For simplicity, let us assume that the functions are observed fully on $T$ and without noise.

Similar to PCA, fPCA is based on finding principal component scores of maximum variance that highlight features of the smooth underlying curves. Specifically, to find the first functional principal component, we find the principal component weight function (eigenfunction) $\psi_1(t)$ for which the set of values

$$\xi_{1i} = \int_T \psi_1(t) X_i(t)\,dt = \langle \psi_1, X_i \rangle, \quad i = 1, \dots, n$$

has the largest variance, subject to the constraint $\langle \psi_1, \psi_1 \rangle = 1$. The second functional principal component finds the eigenfunction $\psi_2(t)$ for which the set of values

$$\xi_{2i} = \int_T \psi_2(t) X_i(t)\,dt = \langle \psi_2, X_i \rangle, \quad i = 1, \dots, n$$

has the largest variance, subject to the constraints $\langle \psi_2, \psi_2 \rangle = 1$ and $\langle \psi_1, \psi_2 \rangle = 0$. Continuing in this fashion, the $k$th functional principal component finds the eigenfunction $\psi_k(t)$ for which the set of values

$$\xi_{ki} = \int_T \psi_k(t) X_i(t)\,dt = \langle \psi_k, X_i \rangle, \quad i = 1, \dots, n \tag{2.1}$$

has the largest variance, subject to the constraints $\langle \psi_k, \psi_k \rangle = 1$ and $\langle \psi_k, \psi_j \rangle = 0$ for all $j < k$.
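The sequential maximization above can be sketched numerically. The following is a minimal illustration, not the procedure used in this work: it assumes the curves are sampled on a common grid, approximates the integrals with trapezoidal weights, and obtains the eigenfunctions from the eigen-decomposition of the sample covariance. The helper names `trapezoid_weights` and `fpca` and the simulated curves are hypothetical. The scores are computed after subtracting the sample mean, which leaves their variance unchanged.

```python
import numpy as np

def trapezoid_weights(t):
    """Quadrature weights w such that sum(w * f(t)) approximates the integral of f over T."""
    w = np.zeros_like(t, dtype=float)
    dt = np.diff(t)
    w[:-1] += dt / 2
    w[1:] += dt / 2
    return w

def fpca(X, t, n_components):
    """Discretized fPCA sketch. X is an (n, m) array of curves sampled on the grid t."""
    w = trapezoid_weights(t)
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / (X.shape[0] - 1)          # sample covariance on the grid
    # Symmetrized eigenproblem: W^{1/2} cov W^{1/2} u = lambda u, with psi = W^{-1/2} u,
    # so that the psi_k are orthonormal under the weighted inner product.
    sw = np.sqrt(w)
    evals, evecs = np.linalg.eigh(sw[:, None] * cov * sw[None, :])
    order = np.argsort(evals)[::-1][:n_components]
    lam = evals[order]
    psi = (evecs[:, order] / sw[:, None]).T     # row k holds psi_k evaluated on the grid
    scores = (Xc * w) @ psi.T                   # integral of psi_k(t) * (X_i(t) - mean(t)) dt
    return mu, lam, psi, scores

# Hypothetical toy curves: two smooth modes of variation plus a small rough component.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
n = 50
X = (rng.normal(0, 2.0, (n, 1)) * np.sin(2 * np.pi * t)
     + rng.normal(0, 1.0, (n, 1)) * np.cos(2 * np.pi * t)
     + rng.normal(0, 0.1, (n, t.size)))

mu, lam, psi, scores = fpca(X, t, n_components=2)
```

In this discretization the sample variance of the $k$th column of `scores` equals the $k$th eigenvalue in `lam`, and the rows of `psi` satisfy the unit-norm and orthogonality constraints under the weighted inner product.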

In order to calculate the eigenfunctions $\{\psi_k(t), k = 1, \dots, n\}$ and represent any given function $X(t)$ in terms of these eigenfunctions, we rely on Mercer's theorem and the Karhunen-Loève expansion [Happ and Greven, 2015]. Mercer's theorem allows for the eigen-decomposition of a covariance function $\Sigma(s, t)$ into eigenvalues $\lambda_k$ and eigenfunctions $\psi_k(t)$.

When $\Sigma(s, t)$ is continuous on the compact interval $T$ and square integrable, Mercer's theorem states that there exists an orthonormal sequence $\{\psi_k\}$ of continuous functions in $L^2[T]$ and a non-increasing sequence of positive numbers $\lambda_1 \ge \lambda_2 \ge \dots > 0$ such that

$$\Sigma(s, t) = \sum_{k=1}^{\infty} \lambda_k \psi_k(s) \psi_k(t), \quad s, t \in T,$$

with the eigenvalues and eigenfunctions being solutions to

$$\int_T \Sigma(s, t) \psi_k(s)\,ds = \lambda_k \psi_k(t).$$
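As a numerical illustration of this eigen-equation, the sketch below discretizes a covariance function whose eigen-decomposition is known in closed form. The choice of example is an assumption made here for illustration and does not come from the text: the Brownian motion covariance $\Sigma(s, t) = \min(s, t)$ on $T = [0, 1]$ has eigenvalues $\lambda_k = 1/\{(k - 1/2)^2 \pi^2\}$ and eigenfunctions $\psi_k(t) = \sqrt{2} \sin\{(k - 1/2) \pi t\}$. The integral is replaced by a midpoint-rule sum.

```python
import numpy as np

m = 500
h = 1.0 / m
t = (np.arange(m) + 0.5) * h            # midpoint grid on [0, 1], weight h per point
Sigma = np.minimum.outer(t, t)          # Sigma(s, t) = min(s, t) evaluated on the grid

# Discretized eigen-equation: h * Sigma @ psi_k = lambda_k * psi_k
evals, evecs = np.linalg.eigh(h * Sigma)
evals, evecs = evals[::-1], evecs[:, ::-1]   # sort into non-increasing order
psi = evecs / np.sqrt(h)                     # rescale so that h * sum(psi_k ** 2) = 1

# Closed-form eigenvalues of the min(s, t) kernel for k = 1, ..., 5
k = np.arange(1, 6)
lam_exact = 1.0 / ((k - 0.5) ** 2 * np.pi ** 2)

# Mercer expansion on the grid: Sigma(s, t) = sum_k lambda_k psi_k(s) psi_k(t)
Sigma_rebuilt = (psi * evals) @ psi.T
```

The leading entries of `evals` should be close to `lam_exact`, and `Sigma_rebuilt` reproduces `Sigma` up to floating-point error because all `m` discrete eigenpairs are retained in the sum.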

A complete proof of this theorem can be found in [Bosq, 2000]. When Mercer's theorem holds, the Karhunen-Loève theorem states that, using the basis functions determined by the eigenfunctions of the covariance function, the curves $X_i$ have the following representation

$$X_i(t) = \mu(t) + \sum_{k=1}^{\infty} \xi_{ik} \psi_k(t),$$

where the basis coefficients are the principal component scores $\xi_{ik}$, defined similarly to equation 2.1:

$$\xi_{ik} = \int_T \psi_k(t) \{X_i(t) - \mu(t)\}\,dt \tag{2.2}$$

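such that the scores have mean zero and variance $\lambda_k$ (with $\xi_{ik} \sim N(0, \lambda_k)$ when the underlying process is Gaussian) and are uncorrelated for different $k$. Recall that we are interested in finding the set of $K$ orthogonal functions $\{\psi_1, \dots, \psi_K\}$ such that, when each curve $X_i$ is approximated by its expansion $\hat{X}_i$ in terms of these basis functions, the mean integrated squared error (MISE) criterion

$$\mathrm{MISE} = \sum_{i=1}^{n} \|X_i - \hat{X}_i\|^2 = \sum_{i=1}^{n} \int_T \{X_i(t) - \hat{X}_i(t)\}^2\,dt \tag{2.3}$$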

is minimized. [Ramsay and Silverman, 2006] show that the set of basis functions that minimizes equation 2.3 has the additional property that it maximizes the amount of variation explained in the random functions $X_i(t)$. Hence, the collection of the first $K$ eigenfunctions in the sample of curves $\{X_i(t), i = 1, \dots, n\}$ forms a set of basis functions that minimizes the above MISE criterion. Since these basis functions are derived directly from the functional data instead of being chosen in advance, like the Fourier or B-spline basis, they can be considered empirical basis functions.
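The optimality of the empirical basis can be checked on simulated data. The sketch below is illustrative only and rests on several assumptions not in the text: curves are sampled on a midpoint grid with equal quadrature weights, the eigenfunctions are taken from the singular value decomposition of the centered data matrix, and the comparison basis is a cosine basis of the same size $K$. The toy curves and the helper `mise` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, K = 200, 101, 3
h = 1.0 / m
t = (np.arange(m) + 0.5) * h            # midpoint grid on [0, 1], weight h per point

# Hypothetical toy curves: three smooth modes of variation plus a small rough component.
X = (rng.normal(0, 2.0, (n, 1)) * np.sin(2 * np.pi * t)
     + rng.normal(0, 1.0, (n, 1)) * np.cos(2 * np.pi * t)
     + rng.normal(0, 0.5, (n, 1)) * t
     + rng.normal(0, 0.1, (n, m)))
Xc = X - X.mean(axis=0)

def mise(Xc, basis, h):
    """Criterion (2.3) for centered curves projected onto the rows of `basis`,
    assumed orthonormal under the inner product h * sum(f * g)."""
    coef = h * Xc @ basis.T              # basis coefficients (scores)
    resid = Xc - coef @ basis            # X_i minus its K-term approximation
    return h * np.sum(resid ** 2)

# Empirical basis: right singular vectors of the centered data, rescaled to unit L2 norm.
_, s, Vt = np.linalg.svd(Xc, full_matrices=False)
psi = Vt / np.sqrt(h)
lam = h * s ** 2 / (n - 1)               # estimated eigenvalues, non-increasing

# Fixed cosine basis of the same size, orthonormal on the midpoint grid.
cosine = np.vstack([np.ones(m)]
                   + [np.sqrt(2) * np.cos(k * np.pi * t) for k in range(1, K)])

mise_fpca = mise(Xc, psi[:K], h)
mise_cos = mise(Xc, cosine, h)
explained = lam[:K].sum() / lam.sum()    # share of variation captured by K components
```

In this discretized setting `mise_fpca` equals $(n - 1)$ times the sum of the discarded eigenvalues and cannot exceed `mise_cos`, which mirrors the equivalence between minimizing equation 2.3 and maximizing the variation explained.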

Application of functional data: the NHANES data analysis

3.1 Introduction

The National Health and Nutrition Examination Survey (NHANES) is a cross-sectional, nationally representative survey designed to evaluate the health and nutritional status of adults and children in the United States [CDC, 2016]. The survey samples around 5000 non-institutionalized civilians annually to represent the US population. In particular, NHANES oversamples underrepresented groups, including elderly people 60+ years old, African Americans, Asians, and Hispanics. The survey uses a 4-stage process to sample participants, which means that the sample is not a simple random sample from the US population. To make the sample representative of the US population, each individual sampled in NHANES has a survey weight, defined as the number of individuals in the US population represented by that individual. These survey weights need to be incorporated in any analysis to ensure that results are generalizable to the US population. The survey collects demographic, socioeconomic, dietary, and health-related information through home interviews, and medical, dental, and physiological measurements through physical examinations in mobile centers [CDC, 2016]. Moreover, NHANES monitored participants' physical activity using an accelerometer during a 1-week study for its 2003-2004 and 2005-2006 cohorts. The National Center for Health Statistics also provides a mechanism for linking NHANES cohorts with death certificate records from the National Death Index (NDI) [NCHS, 2015]. This allows us to investigate the associations between participants' activity, other non-activity-related characteristics, and future mortality.

For our research, we are interested in: 1) exploring the associations between participants' physical activity, demographic, and health-related characteristics and 5-year all-cause mortality; 2) ranking the predictors by predictive ability and identifying their relative effects on mortality; 3) comparing derived measures of physical activity (PA) to established predictors of 5-year all-cause mortality.
