Dimension Reduction - Statistical Techniques for Functional Data Analysis

Chapter 3 Statistical Techniques for Functional Data Analysis

3.3 Dimension Reduction

As mentioned earlier, functional data analysis can be considered as extending multivariate techniques to a functional domain [64]. Aside from the obvious size constraints, it is possible that we are encoding redundant, unrelated or even misleading information in a high-dimensional dataset. It is therefore to a modeller’s benefit to extract features or modes of variation that are informative and less prone to corrupted information. Especially in the case of two- or three-dimensional data, the visualization of a complex dataset is by definition harder than that of a simple dataset, and for that reason one would strive to have a more succinct dataset to display; moreover, even higher-dimensional datasets might have an adequate representation, in terms of variation explained, in two or three dimensions, thus allowing a visualization that was previously impossible. Finally, exactly because of the redundancy of information we are expecting, a reduced-dimension representation could be used as a surrogate for the original high-dimensional dataset: not only would it present an obvious “space-complexity” advantage, but it could potentially “filter” unstructured information out of the original dataset.

Dimension reduction is based on the notion that we can produce a compact low-dimensional encoding of a given high-dimensional dataset. The current work utilizes one main methodology to achieve this task: Functional Principal Components Analysis (FPCA) [121]. FPCA is inherently linear and unsupervised [99], and it has been used in FDA across a number of different application fields. By linear one means that the dataset at hand lies close to a linear subspace, so that an accurate approximation of the data can be obtained by using a coordinate system that spans that linear subspace alone [17]. As such, in the case of FPCA the original zero-meaned dataset $Y$ of $N$ observations is assumed to be approximated through projections of the form:

$$\alpha_{\nu,n} = \int_0^T \phi_\nu(t)\, y_n(t)\, dt \tag{3.36}$$

where $\phi_\nu(t)$ is the functional principal component of the $\nu$-th order and $\alpha_{\nu,n}$ is the corresponding FPC score where, as in Eq. 3.5, $\mathrm{Var}(\alpha_{\nu,n}) = \lambda_\nu$. These scores are the projections of the dataset $Y$ onto the coordinate system defined by their respective components or, in layman’s terms, the mixing coefficients dictating “how much” of each component is used to reconstruct the $n$-th instance of sample $Y$.

In contrast with linear methods, the archetypal non-linear (but still unsupervised) dimension reduction algorithm is kernel PCA [284]; a number of other popular non-linear algorithms (e.g. locally linear embedding (LLE) [275] and semi-definite embedding (SDE) [323]) can also be cast as kernel PCA [99]. In brief, in kernel PCA each point $Y_i$ of the original data $Y$ is projected onto a point $\psi(Y_i)$ by employing a non-linear transform $\psi(\cdot)$. Then “standard” PCA is performed in that possibly high-dimensional domain. While we will not explore this in any detail, we need to stress that the whole “trick” behind kernel PCA is that one does not need to compute $\psi(Y_i)$ explicitly, but rather computes the inner products $\psi(Y_i)^T \psi(Y_j)$ directly through a valid kernel $K(\cdot,\cdot)$ such that $K(Y_i, Y_j) = \psi(Y_i)^T \psi(Y_j)$.

As mentioned, FPCA is unsupervised; by that one means that there is no prediction variable $\hat{Y}$ (as there would be, for instance, in the case of linear regression) that can be used as a “supervisor” indicating the goodness of the solution. On the contrary, if for example one had access to some notion of class information in forming the projection, then that information would be beneficial because it would allow informed within-class covariance estimates that would themselves inform the across-class sample covariance. Fisher’s linear discriminant analysis is a typical example of a supervised linear dimensionality reduction algorithm [17].

As a final note, a problem we have reiterated through the text is that of selecting the number of dimensions to retain. This is still an open problem, but it is effectively a model selection problem addressed by multiple researchers [141; 208]. The basic solutions stem from reformulating the dimensionality determination task as the optimization of an equivalent information criterion¹⁰; these information criteria materialize even in simple truncation-based heuristics (e.g. the broken stick model). More formally, though, the work of Tipping & Bishop on Probabilistic PCA serves as the backbone framework for this dimension determination task in some cases [309].
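To make the truncation heuristic mentioned above concrete, the following is a minimal sketch of the broken stick rule, under the usual formulation in which the $k$-th of $p$ components is expected to carry a share $b_k = \frac{1}{p}\sum_{i=k}^{p} \frac{1}{i}$ of the total variance. The function names and the example eigenvalues are hypothetical and are not taken from the text; leading components are retained for as long as their observed share exceeds the expected one.

```python
import numpy as np

def broken_stick(p):
    """Expected variance share of each of p components under the broken stick model."""
    inv = 1.0 / np.arange(1, p + 1)
    # b_k = (1/p) * sum_{i=k}^{p} 1/i, computed as a reversed cumulative sum
    return np.cumsum(inv[::-1])[::-1] / p

def n_components_broken_stick(eigenvalues):
    """Number of leading components whose variance share exceeds the expectation."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    share = lam / lam.sum()
    keep = share > broken_stick(lam.size)
    if keep.all():
        return int(lam.size)
    # index of the first component that fails the comparison
    return int(np.argmin(keep))

# Hypothetical eigenvalue spectrum, for illustration only
eigenvalues = np.array([5.0, 2.0, 0.5, 0.3, 0.2])
n_keep = n_components_broken_stick(eigenvalues)
```

The rule is deliberately conservative: a component is kept only if it explains more variance than a randomly broken stick would allocate to it, which is one simple stand-in for the more formal criteria cited above.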

3.3.1 Functional Principal Component Analysis

The work of Castro et al. [51] is one of the first to formalize the concept of dimensionality reduction via covariance function eigendecomposition for functional data, as it was first presented in Eq. 3.4. This, as with standard PCA, provides not only a convenient transformation for dimensionality reduction but also a way to build characterizations of the sample’s trajectories around an overall mean trend function [337]. The functional principal components act as the building blocks of our sample. Given a vector process $Y = (y_1, y_2, \ldots, y_p)^T$, where $y_1, y_2, \ldots, y_p$ are scalar random variables, an expression of the form:

$$\hat{Y} = M + \sum_{\nu=1}^{m} \alpha_\nu Z_\nu, \tag{3.37}$$

is called an $m$-dimensional model of $Y$, where $M$ denotes the mean vector of the process ($M = E\{Y\}$), $Z_1, Z_2, \ldots, Z_m$ are fixed unit-length $p$-vectors and $\alpha_1, \alpha_2, \ldots, \alpha_m$ are scalar variates dependent on $Y$. When the vectors $Z_\nu$ minimize the mean squared error $S_m^2 = E\{\|Y - \hat{Y}\|^2\}$, $\hat{Y}$ is called the best $m$-dimensional linear model for $Y$.
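As a short worked illustration of this optimality criterion, one can recall the standard multivariate PCA result (stated here as a reminder, not taken from the text). Assume $\Sigma = \mathrm{Cov}(Y)$ has eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p$ with orthonormal eigenvectors $Z_1, \ldots, Z_p$, and take the scores to be $\alpha_\nu = Z_\nu^T (Y - M)$; the symbol $\Sigma$ is introduced here only for the illustration. Then:

```latex
\begin{aligned}
Y - \hat{Y} &= \sum_{\nu=m+1}^{p} \alpha_\nu Z_\nu, \\
S_m^2 = E\{\|Y - \hat{Y}\|^2\}
  &= \sum_{\nu=m+1}^{p} E\{\alpha_\nu^2\}
   = \sum_{\nu=m+1}^{p} Z_\nu^T \Sigma Z_\nu
   = \sum_{\nu=m+1}^{p} \lambda_\nu .
\end{aligned}
```

Under these assumptions the error of the best $m$-dimensional model is simply the variance carried by the discarded components, which is why retaining the leading eigenvectors is the natural choice.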

If a process $Y(t)$ is observed at $p$ distinct times $t_1, t_2, \ldots, t_p$, it yields the analogous random vector $Y = (y(t_1), y(t_2), \ldots, y(t_p))^T$ describing the stochastic process, which fits naturally with the theoretical notion of longitudinal data as a variation of repeated measurements. Returning to the original notion of a stochastic process $Y(t)$, the $m$-dimensional linear model for such a process is:

$$y_j(t) = \mu(t) + \sum_{\nu=1}^{m} \alpha_{\nu,j}\, \phi_\nu(t), \tag{3.38}$$

where the $\alpha_{\nu,j}$ are once more uncorrelated random variables with zero mean, each being the $\nu$-th principal component score of the $j$-th subject, and the $\phi_\nu$ are linearly independent basis functions of the random trajectories $Y_j$. This expansion (Eq. 3.38) is referred to as the Karhunen-Loève or FPC expansion of the stochastic process $Y$ [121], where now $\phi_\nu(t)$ refers to continuous, pairwise orthogonal, real-valued functions in $L^2[0, T]$ and, as before, $\mu(t) = E\{y(t)\}$, $t \in [0, T]$. Similarly, the mean squared error is restated here as the integrated squared error

$$\|y_j - \hat{y}_j\|^2 = \int_0^T [y_j(t) - \hat{y}_j(t)]^2 \, dt,$$

with the choice of optimal $\phi$’s encoding the best $m$-dimensional model for $Y$.

¹⁰ Information Criteria will be discussed in detail in the related Model selection section.

Empirically, finding these unit-norm $\phi$ requires first the definition of the sample covariance function $\hat{C}_Y(s, t)$, in a way similar to Eq. 3.3:

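$$\hat{C}_Y(s, t) = \frac{1}{N} \sum_{i=1}^{N} (Y_i(s) - \hat{\mu}(s))(Y_i(t) - \hat{\mu}(t)) \tag{3.39}$$

where $\hat{\mu}(t) = \frac{1}{N} \sum_{i=1}^{N} Y_i(t)$, and then the subsequent eigendecomposition of $\hat{C}_Y(s, t)$ for the zero-meaned sample $Y$ as:

$$\hat{C}_Y(s, t) = \sum_{\nu=1}^{N} \hat{\lambda}_\nu\, \hat{\phi}_\nu(s)\, \hat{\phi}_\nu(t) \tag{3.40}$$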

or equivalently, in matrix notation, $\hat{C}_Y = \Phi \Lambda \Phi^T$, the latter being also known as the principal axis theorem [156]. Ultimately, exactly because of the optimality of the FPCs in terms of variance explained, these modes of variation will be the ones explaining the maximum amount of variance in the original sample.
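The empirical procedure described here (sample mean, sample covariance, eigendecomposition, projection scores) can be sketched for curves sampled on a common equispaced grid as follows. This is a minimal illustration under stated assumptions and not the implementation used in this work: integrals are approximated by simple Riemann sums with the grid spacing as the quadrature weight, and the function name `fpca` and the synthetic data are hypothetical.

```python
import numpy as np

def fpca(Y, t, m):
    """FPCA of N curves sampled on a common equispaced grid.

    Y : (N, p) array, one curve per row; t : (p,) equispaced grid;
    m : number of components to keep.
    """
    Y = np.asarray(Y, dtype=float)
    dt = t[1] - t[0]                        # grid spacing, used as quadrature weight
    mu = Y.mean(axis=0)                     # sample mean curve
    Yc = Y - mu                             # zero-meaned sample
    C = Yc.T @ Yc / Y.shape[0]              # sample covariance surface, cf. Eq. 3.39
    # Discretised eigenproblem: integral of C(s,t) phi(t) dt = lambda phi(s)
    evals, evecs = np.linalg.eigh(C * dt)
    order = np.argsort(evals)[::-1][:m]     # largest eigenvalues first
    lam = evals[order]
    phi = evecs[:, order].T / np.sqrt(dt)   # rescale so that sum(phi**2) * dt == 1
    scores = Yc @ phi.T * dt                # Riemann-sum version of Eq. 3.36
    return mu, lam, phi, scores

# Synthetic example (hypothetical data): two known modes of variation
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 101)
N = 200
a1 = rng.normal(0.0, 2.0, N)
a2 = rng.normal(0.0, 1.0, N)
Y = (np.sin(2 * np.pi * t)
     + np.outer(a1, np.cos(2 * np.pi * t))
     + np.outer(a2, np.sin(4 * np.pi * t)))

mu, lam, phi, scores = fpca(Y, t, m=2)
Y_hat = mu + scores @ phi                   # truncated expansion, cf. Eq. 3.38
```

With this normalisation the empirical variance of each column of `scores` equals the corresponding entry of `lam`, mirroring $\mathrm{Var}(\alpha_{\nu,n}) = \lambda_\nu$. A trapezoidal or other quadrature rule could replace the Riemann sum if the endpoints matter.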

It must be noted here that, as Rice and Silverman emphasized, the mean curve and the first few eigenfunctions are smooth and the eigenvalues $\lambda_\nu$ tend to zero rapidly, so that the variability is predominantly of large scale [267]. A further important qualitative view of the FPCs is as representing a rotation of the original dataset in order to diagonalize the covariance matrix of the data, thus making the new coordinates of the dataset uncorrelated [28]. This functionality of PCA even allows it to be reformulated within a phylogenetic framework [266].
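The rotation view can be checked numerically in the plain multivariate setting. The sketch below uses hypothetical three-dimensional Gaussian data, rotates the centred sample onto the eigenvectors of its covariance matrix, and recomputes the covariance in the new coordinates; the covariance matrix chosen for the simulation is arbitrary and serves only as an illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical correlated data: 500 observations in 3 dimensions
true_cov = np.array([[4.0, 2.0, 0.0],
                     [2.0, 3.0, 1.0],
                     [0.0, 1.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), true_cov, size=500)

Xc = X - X.mean(axis=0)                  # centre the sample
C = Xc.T @ Xc / len(Xc)                  # sample covariance matrix
lam, Phi = np.linalg.eigh(C)             # C = Phi diag(lam) Phi^T

scores = Xc @ Phi                        # rotate data onto the principal axes
C_scores = scores.T @ scores / len(scores)
off_diagonal = C_scores - np.diag(np.diag(C_scores))
```

Here `Phi` is orthogonal, so the transformation is a rotation (possibly combined with a reflection), and `off_diagonal` vanishes up to floating-point error: the new coordinates are uncorrelated and their variances are the eigenvalues in `lam`.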

In physical terms, smoothness of the data is critical so that the discrete sample data can be considered functional [253]. For example, as seen in the work of Chen & Müller [55] in the case of two-dimensional functional data, the discretisation and the subsequent interpolation can have significant implications for one’s results (the authors advocate a two-way FPCA to counter these issues).

As noted in the previous section, a number of smoothing techniques have been proposed over the years concerning FPCA; basis function methods such as wavelet or regression spline bases, or smoothing by local weighting using local polynomial smoothing or kernel smoothing, are some of the most frequently encountered. Kernel presmoothing, considered to be the optimal choice in the case of local weighting [83], is the one applied in all the cases of this work due to its simplicity and computational ease, yielding smooth sample curves. Finally, we draw attention to the fact that we work with a discretised version of a functional sample $Y$, and that a core requirement for FPCA to be applied directly is that every curve $Y_i$ has the same number of equi-spaced readings (see [337] for a case where one can apply FPCA to sparse and irregularly sampled data by employing a conditional expectation procedure). This requirement is easily fulfilled by the smoothing and concurrent interpolation of the sample.
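As an illustration of how smoothing and interpolation onto a common equispaced grid can be done in one step, the following sketch uses a Nadaraya-Watson estimator with a Gaussian kernel. The text does not specify which kernel estimator, kernel or bandwidth is used in this work, so all three are assumptions made only for the example, as are the function name and the simulated curve.

```python
import numpy as np

def kernel_smooth(t_obs, y_obs, t_grid, bandwidth):
    """Nadaraya-Watson estimate of a curve on t_grid using a Gaussian kernel."""
    t_obs = np.asarray(t_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    t_grid = np.asarray(t_grid, dtype=float)
    # Scaled distances between every grid point and every observation time
    u = (t_grid[:, None] - t_obs[None, :]) / bandwidth
    w = np.exp(-0.5 * u**2)                 # Gaussian kernel weights
    return (w @ y_obs) / w.sum(axis=1)      # locally weighted average

# Hypothetical noisy curve observed at irregular times
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 1.0, 60))
y_obs = np.sin(2 * np.pi * t_obs) + rng.normal(0.0, 0.1, 60)

t_grid = np.linspace(0.0, 1.0, 50)          # common equispaced grid
y_smooth = kernel_smooth(t_obs, y_obs, t_grid, bandwidth=0.05)
```

Applying the same `t_grid` to every curve in a sample gives each curve the same number of equispaced readings. The bandwidth controls the amount of smoothing and would normally be chosen by a data-driven rule such as cross-validation rather than fixed by hand.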

Interestingly, a number of regularized or smoothed functional PCA approaches have appeared over the years. In such cases smoothness is imposed in one of two ways: either by penalizing the roughness of the candidate $\phi$ by means of their integrated squared second derivative [254], or by projecting the original sample down to a lower-dimensional domain where the data appear smoother, probably by taking advantage of a periodic basis such as Fourier polynomials, and carrying out standard FPCA in that domain. The basic qualitative difference between the two approaches is that in the first case smoothing occurs directly in the FPCA step, while in the latter we smooth the data directly. In our primary work with $F_0$ we smooth and interpolate the data beforehand, but we do not impose secondary smoothing techniques such as the ones mentioned above; an initial smoothing is adequate and additional smoothing would only draw attention away from our true sample dynamics. On the contrary, when working with spectrograms (chapter 6), exactly because we do not smooth the data originally, we do smooth the spectrogram’s readings after the initial interpolation (section 6.2.1).
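The roughness measure used by the first family of approaches, the integrated squared second derivative $\int [\phi''(t)]^2 dt$, can be approximated on an equispaced grid with second differences. The sketch below only evaluates this penalty for a given curve; it is not the penalized estimator of [254], and the function name and test curves are hypothetical.

```python
import numpy as np

def roughness(phi, t):
    """Approximate the integrated squared second derivative of phi on an equispaced grid."""
    phi = np.asarray(phi, dtype=float)
    dt = t[1] - t[0]
    d2 = np.diff(phi, n=2) / dt**2          # second differences at interior grid points
    return np.sum(d2**2) * dt               # Riemann-sum approximation of the integral

t = np.linspace(0.0, 1.0, 1001)
r_slow = roughness(np.sin(2 * np.pi * t), t)   # slowly varying candidate
r_fast = roughness(np.sin(6 * np.pi * t), t)   # rapidly varying candidate
r_line = roughness(2.0 * t + 1.0, t)           # straight line, no curvature
```

For $\sin(2\pi t)$ on $[0, 1]$ the exact value is $(2\pi)^4 / 2$, which the approximation should reproduce closely, while a straight line has zero roughness. A penalized approach would trade this quantity off against the variance explained by the candidate $\phi$.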

In document Functional data analysis in phonetics (Page 61-65)