Model Selection - Modelling Functional Principal Component scores

Chapter 3 Statistical Techniques for Functional Data Analysis

3.4 Modelling Functional Principal Component scores

3.4.3 Model Selection

“All models are wrong but some are useful”, attributed to George E. P. Box, is probably one of the most worn statistical quotes. It does, however, highlight the intuition that a (statistical) model is a simplification of reality that allows the modeller to infer the dynamics behind the model’s components. When presented with multiple candidate models, it is therefore essential to be able to estimate the performance of each “in order to choose the best one” [129]; this procedure is commonly referred to as model selection.

Given a sample y from an unknown parametric model m(x; θ), and estimates from an associated predictive model ŷ = m̂(x; θ), an obvious test for the goodness of our estimation is how well our estimate ŷ can predict y in terms of mean squared error [62]. For example, assuming a squared loss function C = (ŷ − y)², the expected value of C equals:

\begin{align}
E[C] &= E[(\hat{y} - y)^2] \tag{3.54} \\
&= E[(\hat{y} - y - E[\hat{y}] + E[\hat{y}])^2] \tag{3.55} \\
&= E[(\hat{y} - E[\hat{y}])^2] + E[(y - E[\hat{y}])^2] + 2E[(E[\hat{y}] - y)(\hat{y} - E[\hat{y}])] \tag{3.56}
\end{align}

where, treating y as fixed, the final term equates to zero (since E[ŷ − E[ŷ]] = 0) and we get:

\begin{align}
E[C] &= E[(\hat{y} - E[\hat{y}])^2] + E[(y - E[\hat{y}])^2] \tag{3.57} \\
&= \mathrm{var}(\hat{y}) + \mathrm{bias}^2(\hat{y}) \tag{3.58}
\end{align}

We see that the more we decrease the bias of our predictor, the more we increase its variance; the more we overfit our data, the closer our estimates get to the actual sample points. This results in a model m that has poor predictive power for unseen data and poor explanatory power for the population dynamics from which the sample was taken. A direct maximum likelihood approach does give us the model that maximizes the likelihood function p(D|m), where D is our observed dataset and m a model from our model space M. Unfortunately, direct maximization of the likelihood function p(D|m) results in choosing increasingly larger models. To alleviate this limitation of direct maximum likelihood estimation we use two different approaches: one data-driven and one based on analytical results. The data-driven approach is based on cross-validation and resampling principles, while the analytical approach works by making meaningful approximations between our estimated distribution and the “true” distribution of the data.
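The decomposition in (3.57)-(3.58) can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original analysis: it assumes a fixed target y and a deliberately biased estimator ŷ (a sample mean multiplied by a hypothetical shrinkage factor `shrink`). All function names and numerical settings are illustrative assumptions.

```python
import random
import statistics


def simulate_estimates(y_true, noise_sd, n_obs, shrink, n_rep, seed=0):
    """Return n_rep estimates y_hat = shrink * mean(sample) of a fixed target."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        sample = [rng.gauss(y_true, noise_sd) for _ in range(n_obs)]
        estimates.append(shrink * statistics.fmean(sample))
    return estimates


y = 2.0  # fixed target value (assumption for this illustration)
y_hat = simulate_estimates(y_true=y, noise_sd=1.0, n_obs=10, shrink=0.8, n_rep=5000)

# Empirical counterparts of the three quantities in (3.54)-(3.58).
mse = statistics.fmean((e - y) ** 2 for e in y_hat)   # E[(y_hat - y)^2]
variance = statistics.pvariance(y_hat)                # E[(y_hat - E[y_hat])^2]
bias_sq = (statistics.fmean(y_hat) - y) ** 2          # (E[y_hat] - y)^2

print(f"MSE = {mse:.4f}, var + bias^2 = {variance + bias_sq:.4f}")
```

With the expectations replaced by averages over the simulated estimates, the empirical mean squared error and the sum of empirical variance and squared bias agree up to floating-point error, which is the sample analogue of (3.58).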

As mentioned in section 3.1, cross-validation is based on the idea of excluding a portion of the data as a validation set [28]. In the case of k-fold cross-validation one randomly partitions the dataset into k partitions (usually of equal size), uses k−1 of the available partitions to train the model, and then evaluates the model’s fit on the excluded k-th partition. We thus use a fraction (k−1)/k of our available data for training each time. This procedure is executed k times, so that each partition serves once as the validation set, and at the end of it (usually) the performance scores from the k runs are averaged in order to get the final estimate of the model’s performance. We then compare the performances of the different models and choose the best one. A similar approach based on resampling the data is jackknifing. During jackknifing, instead of using a validation and a test set, we generate k sub-samples y_jack by resampling our original sample y,

runs to give the final performance estimate.
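The k-fold cross-validation procedure described above can be sketched as follows. This is an illustrative example under stated assumptions, not the procedure used in the study: the two candidate models (an intercept-only model and a simple least-squares line), the simulated data, and all function names are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of (near) equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Intercept-only model: predicts the training mean everywhere."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple least-squares line y = a + b * x."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def cv_mse(fit, xs, ys, k=5, seed=0):
    """Average validation mean squared error over the k folds."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        fold_mse = sum((model(xs[i]) - ys[i]) ** 2 for i in held_out) / len(held_out)
        scores.append(fold_mse)
    return sum(scores) / k


# Simulated data with a linear trend (assumption for this illustration).
rng = random.Random(1)
xs = [i / 10 for i in range(50)]
ys = [1.0 + 2.0 * x + rng.gauss(0, 0.5) for x in xs]

scores = {"mean": cv_mse(fit_mean, xs, ys), "line": cv_mse(fit_line, xs, ys)}
best = min(scores, key=scores.get)  # model with the smallest averaged CV error
print(scores, best)
```

Both models are scored on the same folds (same `seed`), so the comparison of the averaged errors is not confounded by different random partitions.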

A second data-driven approach would be to use bootstrapping [129]. Focusing on the parametric bootstrap, we first fit to our data the parametric model whose performance we want to assess. We then resample from that fitted model in order to produce “bootstrapped samples” y_boot of size N, N being our original sample size. Repeating this procedure k times, we re-fit our model each time using the newly produced y_boot. Similarly to cross-validation, we then average the performance scores from the k runs in order to get the final estimate of the model’s performance.¹⁴
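A minimal sketch of the parametric bootstrap loop described above is given below. It assumes, purely for illustration, a Gaussian model fitted by maximum likelihood and uses the mean squared error of the re-fitted mean on the original sample as the performance score; the text does not specify either choice, and all names are hypothetical.

```python
import random
import statistics


def fit_gaussian(sample):
    """Maximum likelihood fit of a Gaussian: returns (mean, standard deviation)."""
    return statistics.fmean(sample), statistics.pstdev(sample)


def parametric_bootstrap(sample, k, seed=0):
    """Fit once, resample k times from the fitted model, re-fit and score each time."""
    rng = random.Random(seed)
    mu_hat, sd_hat = fit_gaussian(sample)
    n = len(sample)
    scores = []
    for _ in range(k):
        # Bootstrapped sample of the original size N, drawn from the fitted model.
        y_boot = [rng.gauss(mu_hat, sd_hat) for _ in range(n)]
        mu_boot, _ = fit_gaussian(y_boot)
        # Illustrative performance score: squared error of the re-fitted mean
        # when predicting the original observations.
        scores.append(statistics.fmean((y - mu_boot) ** 2 for y in sample))
    return statistics.fmean(scores), scores


data_rng = random.Random(42)
sample = [data_rng.gauss(5.0, 2.0) for _ in range(40)]  # simulated data (assumption)
avg_score, scores = parametric_bootstrap(sample, k=200)
print(f"bootstrap performance estimate: {avg_score:.3f}")
```

Because every re-fit requires a fresh sample and a fresh estimation, the cost grows linearly with k, which is the computational concern raised for resampling-based approaches.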

Without looking into theoretical problems stemming from resampling, a common problem encountered by all resampling-based approaches is that of computational cost. Both in terms of memory and CPU time, resampling and/or refitting a large number of models is an expensive procedure. Even a simple ordinary least squares model usually requires the Cholesky decomposition of the XᵀX matrix or the QR decomposition of the design matrix X; these procedures are of approximate asymptotic order (1/3)N³ and (4/3)N³ respectively in the matrix dimension N [101] (the obvious time trade-off between the two being the computational time of the matrix-matrix multiplication XᵀX). Repeating this millions of times can become extremely time-consuming. Finally, stating almost the obvious, this inferential procedure is based on random sampling, so its results are not strictly deterministic: another sample gives slightly different values.

An optimal solution would be to find a procedure that needs to be used only once to assess the “goodness of the model”. This is achieved by a series of approximations; the intuition for these approximations comes from two directions. First, we want an approximation that tells us how well we do based on “population estimates”; this is why we used resampling after all. Second, we recognize that a problem with using a maximum likelihood approach stems from failing to penalize unnecessarily complex models; a problem relating directly to the parsimony principle of Occam’s razor [62]: “it is vain to do with more what can be done with fewer”. Occam’s razor is the driving force behind a number of information criteria (IC). The current study relies almost exclusively on Akaike’s Information Criterion (AIC) [5], which is established as the “standard” among ICs. A second, almost equally popular IC is the Bayesian or Schwarz Information Criterion (BIC or SIC respectively). These information-based model selection criteria essentially aim to balance model complexity and predictive power, providing a way to rationally penalize each parameter added to the model with respect to the “explanatory power” it provides.

¹⁴ Resampling can also be formulated in a Bayesian context; there, sampling is done from the posterior distribution of the estimated parameters. For each parameter a highest posterior density (HPD) interval at some level q% can be created from the empirical cumulative distribution function of the sample as the shortest interval for which the difference in the empirical cumulative distribution function values of the endpoints equals q; i.e. they “minimize the volume among

The theoretical machinery behind AIC is the Kullback-Leibler (K-L) divergence.¹⁵ K-L divergence is a (non-symmetric) measure of distance between an unknown distribution t(x) and an approximating distribution q(x), expressed as the additional amount of information one needs to specify x as a result of using q(x) instead of t(x) [28]. Thus K-L divergence is given by:

\begin{equation}
KL(t \| q) = -\int t(x) \log q(x)\,dx - \left(-\int t(x) \log t(x)\,dx\right) = -\int t(x) \log \frac{q(x)}{t(x)}\,dx \tag{3.59}
\end{equation}

It needs to be stressed that the K-L divergence concept is akin to a likelihood ratio statistic. Exactly because AIC reflects “additional” information, the smaller it is the better [62]. Clearly, the main issue for the application of AIC is that one does not know t(x) beforehand. Akaike’s solution was to estimate it; the AIC score is an asymptotically unbiased estimate of the cross-entropy risk. In other words, as the sample size n → ∞, the model with the minimum AIC score will possess the smallest Kullback-Leibler divergence. Interestingly, despite its rather involved theoretical justification, AIC is computed as:

\begin{equation}
AIC = -2L(\theta) + 2p \tag{3.60}
\end{equation}

where p is the number of parameters in the model and L(θ) is the log-likelihood of the model used with respect to θ. A general comment is that when the number of parameters p in a model is not significantly smaller than the number of available samples n (n/p ≤ 40), then using a version of AIC corrected for small samples, AICc, is desirable [44]. AICc is defined as:

\begin{equation}
AICc = -2L(\theta) + 2p + \frac{2p(p+1)}{n-p-1} \tag{3.61}
\end{equation}

where evidently as n/p → ∞ one gets back the original AIC (n being the number of available samples).
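As a small worked illustration of (3.60) and (3.61), the sketch below computes both criteria from a given log-likelihood value. The function names and the numerical inputs are hypothetical and chosen only to show how the small-sample correction behaves.

```python
def aic(log_lik, p):
    """AIC as in (3.60): log_lik is the log-likelihood, p the number of parameters."""
    return -2.0 * log_lik + 2.0 * p


def aicc(log_lik, p, n):
    """Small-sample corrected AIC as in (3.61); requires n > p + 1."""
    if n - p - 1 <= 0:
        raise ValueError("AICc is undefined unless n > p + 1")
    return aic(log_lik, p) + 2.0 * p * (p + 1) / (n - p - 1)


# Hypothetical fitted model: log-likelihood -120 with p = 5 parameters.
small_n = aicc(-120.0, p=5, n=30)    # n / p = 6, correction is noticeable
large_n = aicc(-120.0, p=5, n=3000)  # n / p = 600, correction is negligible
print(aic(-120.0, 5), small_n, large_n)
```

For n = 30 the correction term is 2·5·6/24 = 2.5, whereas for n = 3000 it falls to roughly 0.02, illustrating how AICc approaches AIC as n/p grows.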

The AIC (and AICc) approach takes a fully frequentist view of model selection and is based on the asymptotic properties of the estimator used (K-L divergence); a Bayesian approach was proposed by Schwarz [287], leading to the formulation of BIC. In brief, one assumes that all candidate models are equiprobable (essentially having an uninformative prior) and that the “true” model is among the candidate models; then by finding the model that gives the higher

one finds the “most-probable” model that generated the data (from the subset of examined models of course). Two caveats immediately arise: 1. we are fairly certain that some models are more probable than others and 2. we have no reason to believe that the “true” model is among our candidate models. Nevertheless in a Bayesian approach there is no need to explicitly penalize model complexity as that is incorporated by the integral over the posterior parameter distribution. Given these initial assumptions BIC is calculated as:

\begin{equation}
BIC = -2L(\theta) + p \log(n) \tag{3.64}
\end{equation}

An important note is that while BIC selection is consistent, AIC is not [62]; consistency here means that the probability of the “true” model being selected tends to 1 as n → ∞. As Davison [62] shows, if a model m0 is close to mtrue, where the respective numbers of parameters in the two models are p and q, and p − q is small (< 10), then it is not improbable that one chooses m0 instead of mtrue.
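The different penalties of (3.60) and (3.64) can lead the two criteria to disagree. The following sketch uses two hypothetical models with made-up log-likelihood values (they are assumptions for illustration, not results from the study) to show a case where AIC favours the larger model while BIC favours the smaller one.

```python
import math


def aic(log_lik, p):
    """AIC as in (3.60)."""
    return -2.0 * log_lik + 2.0 * p


def bic(log_lik, p, n):
    """BIC as in (3.64): each parameter is penalized by log(n) instead of 2."""
    return -2.0 * log_lik + p * math.log(n)


n = 200  # hypothetical sample size
# Hypothetical candidates: name -> (log-likelihood, number of parameters).
models = {"small": (-310.0, 3), "large": (-306.5, 5)}

aic_scores = {name: aic(ll, p) for name, (ll, p) in models.items()}
bic_scores = {name: bic(ll, p, n) for name, (ll, p) in models.items()}

best_by_aic = min(aic_scores, key=aic_scores.get)
best_by_bic = min(bic_scores, key=bic_scores.get)
print(aic_scores, bic_scores, best_by_aic, best_by_bic)
```

Since log(n) exceeds 2 for any n of at least 8, BIC penalizes each additional parameter more heavily than AIC in all but the smallest samples, which is why it tends to select more parsimonious models.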

Returning to our original LME model case, we already identified that “generic” model estimation should occur within a REML framework. Nevertheless, exactly because the matrix K reformulates the response vector a into Kᵀa in a model-specific way, the residuals associated with two LME models with a different number of fixed effects will not be directly comparable. In practice that can be seen as K changing the “residual” term of the likelihood (ΩᵀΨ⁻¹Ω). This becomes even more obvious if we consider the alternative formulation of AIC as (n/2) log(RSS) + p [123], where

\begin{equation}
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2.
\end{equation}

Both ICs can be inconclusive; as a general rule of thumb, when the absolute difference between the IC values of two models is less than or equal to two (2), there is no obvious reason to select one model over the other. In relation to that, Burnham and Anderson note:

“A substantial advantage in using information-theoretic criteria is that they are valid for non-nested models. Of course, traditional likelihood ratio tests are defined only for nested models, and this represents another substantial limitation in the use of hypothesis testing in model selection.”[44].

In practical terms, constructing and finding the best model relating to a process of interest is somewhat heuristic. Two main methodologies are usually employed: forwards and backwards model selection. In the case of forwards model selection one starts with the smallest (or least complex) relevant model for the relationship between independent and dependent variables and, through consecutive comparison among the candidate variables, adds the variable that most substantially “betters” the model’s fit. The process is iterated until convergence, i.e. until no variable can be added that improves the model (based on some information criterion or LR test). Backwards model selection is effectively the opposite. One defines the largest (most complex) relevant model for the dataset at hand and then removes the “least helpful” variable based on the chosen definition of helpfulness. An important point to be made here is that we always need to remember that the interpretability of the model is of interest; forgetting that and employing a stepwise selection technique such as forwards or backwards selection will ultimately result in data dredging, essentially discovering causally irrelevant associations between otherwise disassociated physical terms.
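The forwards selection loop described above can be sketched as follows. This is a simplified illustration for ordinary least squares rather than the LME models of this chapter: it assumes Gaussian errors, scores each candidate with the standard Gaussian form of AIC up to an additive constant, n·log(RSS/n) + 2p, and uses simulated data. All names are hypothetical.

```python
import math
import random


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][j] * x[j] for j in range(r + 1, n))) / m[r][r]
    return x


def rss(cols, y):
    """Residual sum of squares of the least-squares fit of y on an intercept plus cols."""
    design = [[1.0] * len(y)] + cols  # design matrix stored as a list of columns
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in design] for ci in design]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, design)) for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def gaussian_aic(cols, y):
    """Gaussian AIC up to an additive constant: n * log(RSS / n) + 2p."""
    n, p = len(y), len(cols) + 1
    return n * math.log(rss(cols, y) / n) + 2 * p


def forward_select(candidates, y):
    """Greedily add the variable that lowers AIC the most; stop when none does."""
    selected, best = [], gaussian_aic([], y)
    while True:
        trials = {name: gaussian_aic([candidates[s] for s in selected] + [col], y)
                  for name, col in candidates.items() if name not in selected}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        selected.append(name)
        best = trials[name]
    return selected, best


# Simulated data: y depends on x1 and x2 only; x3 is pure noise (assumption).
rng = random.Random(3)
n_obs = 200
cols = {name: [rng.gauss(0, 1) for _ in range(n_obs)] for name in ("x1", "x2", "x3")}
y = [1.0 + 3.0 * a - 1.0 * b + rng.gauss(0, 1) for a, b in zip(cols["x1"], cols["x2"])]

selected, best_aic = forward_select(cols, y)
print(selected, round(best_aic, 2))
```

Note that a purely AIC-driven search may occasionally admit an irrelevant variable such as `x3`; this is one concrete form of the data dredging risk mentioned above, and it is why the selected model should still be checked for interpretability.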

It is worth mentioning that a completely different approach to model selection is to conceptually merge the estimation and the selection procedure, as exhibited in the case of SCAD or LASSO [146]. In these cases one effectively builds the complexity penalization procedure into the model estimation step. Within a linear mixed effects modelling framework, Lan [180] presents the penalization of β, while Bondell et al. [36] present an even more generalized approach where β and γ are penalized in an iterative manner. Finally, it is notable that following the increasing popularity of ensemble approaches in statistical learning [41; 88], ensemble approaches for variable selection have also started to appear [342; 332].

An obvious matter that is often either ignored or left unattended is data quality. Bad data will give bad results irrespective of how successful the results might look in explaining the original research question. Missing or corrupted data are a fact of an analyst’s research life and that should never be forgotten. Nevertheless, one might argue that model under-fitting is more damaging than model over-fitting when it comes to model-based inference given a set of fixed-quality data [44]. Ultimately the usage of AIC, BIC, MDL or any other model selection method guarantees that, under certain assumptions, the best candidate model will be chosen from a set of candidate models. If a “good” model is not part of the set of candidate models, it will not be discovered by model selection algorithms.

In document Functional data analysis in phonetics (Page 71-76)