Phylogenetic Gaussian process regression

Chapter 6 Phylogenetic analysis of Romance languages

6.2 Methods & Implementations

6.2.3 Phylogenetic Gaussian process regression

As already noted, FPCA returns the estimated mixing coefficients at the tip taxa, $\hat{Q}$, as well as the basis $\hat{\Phi}$. The next step in our linguistic phylogenetic study is to perform PGPR [157] separately on each 5-member mixing coefficient set associated with a basis function $\phi_j(u, f)$. We assume knowledge of the phylogeny $T$ (as constructed in the previous step), in order to obtain posterior distributions for all mixing coefficients throughout the tree $T$. This means that we get estimates for the internal linguistic taxa in $T$ as well as for the leaves themselves.

[Figure 6.4 placeholder: "Romance Language Heuristic Phylogeny". Tips: Italian, American Spanish, Iberian Spanish, Portuguese, French. Internal nodes: ω0, ω1, ω2, ω3. Horizontal axis running from 1.4 to 0.]

Figure 6.4: Median branch length consensus tree for the Romance language dataset, with gray $\omega_i$ circles corresponding to protolanguages. Italian emerges as clearly the modern language closest to a "universal" Romance protolanguage, indexed as $\omega_0$. All 10 digits were used to construct this tree.

Gaussian process regression (GPR) [263] is a flexible Bayesian technique in which prior distributions are placed on continuous functions. Its range of priors includes the Brownian motion and Ornstein-Uhlenbeck (O-U) processes, which are by far the most commonly used models of character evolution [125; 46] (the Gaussian and Matérn kernels being popular in the spatial statistics literature [68]). Its implementation is particularly straightforward since the posterior distributions are also Gaussian processes and have closed forms. Using notation standard in the Machine Learning literature (see, for example, [263]), a Gaussian process may be specified by its mean surface and its covariance function $K(\theta)$, where $\theta$ is a vector of parameters. Since the components of $\theta$ parameterize the prior distribution, they are referred to as hyperparameters. The Gaussian process prior distribution is denoted:

$$ f \sim N(0, K(\theta)). $$

If $x_0$ is a set of unobserved coordinates and $x$ is a set of observed coordinates, the posterior distribution of the vector $f(x_0)$ given the observations $f(x)$ is:

$$ f(x_0) \mid f(x) \sim N(A, B), $$

where

$$ A = K(x_0, x, \theta)\, K(x, x, \theta)^{-1} f(x), \tag{6.20} $$

$$ B = K(x_0, x_0, \theta) - K(x_0, x, \theta)\, K(x, x, \theta)^{-1} K(x_0, x, \theta)^{T}, \tag{6.21} $$

and $K(x_0, x, \theta)$ denotes the $|x_0| \times |x|$ matrix of the covariance function $K$ evaluated at all pairs $x_{0,i} \in x_0$, $x_j \in x$. Equations 6.20 and 6.21 convey that the posterior mean estimate is a linear combination of the given data, and that the posterior variance equals the prior variance minus the amount that can be explained by the data. The interpretation of these in the context of a phylogenetic tree is twofold: first, that all ancestral states can be expressed as linear combinations of the observed leaf states and, second, that the covariance among these data is due only to phylogenetic associations. If phylogenetic associations are not present, a phylogenetic model will return a simple arithmetic mean as its estimate $A$ (Eq. 6.20), and the covariances in $B$ will be zero, as the non-phylogenetic fluctuations are considered independent of each other in our generative model (Eq. 6.23). Additionally, the log-likelihood of the sample $f(x)$ is

$$ \log p(f(x) \mid \theta) = -\frac{1}{2} f(x)^{T} K(x, x, \theta)^{-1} f(x) - \frac{1}{2} \log\left(\det K(x, x, \theta)\right) - \frac{|x|}{2} \log 2\pi. \tag{6.22} $$

It can be seen from Eq. 6.22 that the maximum likelihood estimate is subject both to the fit it delivers (the first term) and to the model complexity (the second term). Obviously this does not constitute a full model selection procedure, as with the cases examined in Sect. 3.4.3. AIC scores are occasionally used, but given that one fixes the number of parameters of the model a priori (as will be shown immediately afterwards, we fix that number to 2), AIC score changes are exclusively due to changes in the model's likelihood/parameters $\theta$. Gaussian process regression is non-parametric in the sense that no assumption is made about the structure of the model: the more data gathered, the longer the vector $f(x)$, and the more intricate the posterior model for $f(x_0)$. This view of GPR as non-parametric stems from the Machine Learning literature [290]. A counter-argument could be that GPR is an extremely parametric procedure that just "fits $k$ numbers to a dataset", $k$ being the number of hyperparameters utilized in $\theta$. Complementary to this approach, where the results of a two-dimensional FPCA are utilized, is the work of Shi et al. [292] on Gaussian Process Functional regression; there, the mean function is modelled by a functional regression model and the covariance structure by a Gaussian process. To that extent the current methodology is simpler and does not explicitly model the mean function separately. On the other hand, the methodology of Shi et al., by using B-splines to model the mean structure, assumes that the underlying structure is piecewise polynomial, while the proposed methodology based on FPCA offers an empirical alternative to that assumption.

Footnote 7 (to Eq. 6.22): While we do not focus on computational matters explicitly, we draw attention to the fact that one does not need to compute the inverse covariance matrix $K(x, x, \theta)^{-1}$ nor the determinant $\det(K(x, x, \theta))$ directly. For the purposes of evaluating $\log p(f(x) \mid \theta)$, in a manner similar to section 5.3.8, one utilizes the Cholesky decomposition of the matrix.
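The posterior moments of Eqs. 6.20 and 6.21 and the log-likelihood of Eq. 6.22 can be sketched numerically, using the Cholesky-based evaluation noted above. This is only a minimal illustration, assuming NumPy; the exponential kernel `exp_kernel`, the one-dimensional coordinates and the observed values are hypothetical and are not taken from the analysis in this chapter.

```python
import numpy as np

def exp_kernel(a, b, sigma_f=1.0, length_scale=1.0):
    """Illustrative exponential (O-U type) covariance on 1-D coordinates."""
    d = np.abs(a[:, None] - b[None, :])
    return sigma_f**2 * np.exp(-d / length_scale)

def gp_posterior(K_nn, K_no, K_oo, f_obs):
    """Posterior mean A (Eq. 6.20) and covariance B (Eq. 6.21)."""
    L = np.linalg.cholesky(K_oo)        # K(x, x) = L L^T
    w = np.linalg.solve(L, f_obs)       # L^{-1} f(x)
    V = np.linalg.solve(L, K_no.T)      # L^{-1} K(x0, x)^T
    A = V.T @ w                         # K(x0, x) K(x, x)^{-1} f(x)
    B = K_nn - V.T @ V                  # prior variance minus explained part
    return A, B

def gp_log_likelihood(K_oo, f_obs):
    """Eq. 6.22 without forming the inverse or the determinant explicitly."""
    n = f_obs.shape[0]
    L = np.linalg.cholesky(K_oo)
    w = np.linalg.solve(L, f_obs)       # w.w equals f^T K^{-1} f
    half_log_det = np.sum(np.log(np.diag(L)))   # 0.5 * log det K
    return -0.5 * w @ w - half_log_det - 0.5 * n * np.log(2.0 * np.pi)

# Hypothetical observed and unobserved coordinates.
x_obs = np.array([0.0, 1.0, 2.5])
f_obs = np.array([0.3, -0.1, 0.8])
x_new = np.array([1.0, 4.0])

K_oo = exp_kernel(x_obs, x_obs)
K_no = exp_kernel(x_new, x_obs)
K_nn = exp_kernel(x_new, x_new)

A, B = gp_posterior(K_nn, K_no, K_oo, f_obs)
log_lik = gp_log_likelihood(K_oo, f_obs)
```

Because the first new coordinate coincides with an observed one and this toy kernel has no noise term, the posterior mean there reproduces the observation and the posterior variance collapses to (numerically) zero, which is a convenient sanity check on the two formulas.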

Phylogenetic Gaussian process regression (PGPR) extends the applicability of GPR to evolved function-valued traits such as spectrograms. A phylogenetic Gaussian process (PGP) is a Gaussian process indexed by a phylogeny $T$, where the function-valued traits at each pair of taxa are conditionally independent given the function-valued traits of their common ancestors. When the evolutionary process has the same covariance function along any branch of $T$ beginning at its root (called the marginal covariance function), these assumptions are sufficient to uniquely specify the covariance function of the PGP, $K_T$. As we assume that $T$ is known in our inverse problem, based on the tree-estimation step presented above, the only remaining modelling choice is therefore the marginal covariance function. As can be seen from Eq. 6.23, $K$ is a function of patristic distances on the tree rather than of Euclidean distances, as is standard in spatial GPR.

In phylogenetic comparative studies, where one has observations at the leaves of $T$, the covariance function $K_T$ may be used to construct a Gaussian process prior for the function-valued traits, allowing functional regression. In the model that we use, this is equivalent to specifying a Gaussian prior distribution for the set of mixing coefficients used. This may be done by regarding those coefficients as observations of a univariate PGP. As noted in [157], if we assume that the evolutionary process is Markovian and stationary, then the modelling choice vanishes and the marginal covariance function is specified uniquely: it is the stationary O-U covariance function. If we also add explicit modelling of non-phylogenetically related variation at the tip taxa, the univariate prior covariance function has the unique functional form presented in Eq. 6.23. We do not, however, assume knowledge of the parameters of Eq. 6.23. To estimate them we use the consensus tree generated in section 6.2.2, shown in Fig. 6.4, and the two-dimensional basis functions generated in section 6.2.1, shown in Fig. 6.3. This fixes the experimental design for our simulation and inference.

Commenting on the specific parameters chosen for the phylogenetic O-U processes, as in [124] we refer to the strength of selection parameter $\alpha$ and the random genetic drift $\sigma_n$: we add superscripts $j$ to these parameters to distinguish between the bases. The mixing coefficients for a specific basis have the following covariance function:

$$ K_T^{j}(t_i, t_g) = (\sigma_f^{j})^2 \exp\left(-2\alpha^{j} P_T(t_i, t_g)\right) + (\sigma_n^{j})^2\, \delta^{e}_{t_i, t_g} \tag{6.23} $$

where $\sigma_f^{j} = \sqrt{(\sigma^{j})^2 / (2\alpha^{j})}$, $P_T(t_i, t_g)$ denotes the phylogenetic or patristic distance (that is, the distance in $T$) between the $i$th and $g$th tip taxa, $\sigma_n$ is defined as above, and

$$ \delta^{e}_{t_i, t_g} = \begin{cases} 1 & \text{iff } t_i = t_g \text{ and } t_i \text{ is a tip taxon}, \\ 0 & \text{otherwise} \end{cases} $$

adds non-phylogenetic variation to extant taxa as discussed above, i.e. $\delta^{e}$ evaluates to 1 only for extant taxa; thus $\sigma_n$ quantifies within-species genetic or environmental effects and measurement error in the $i$-th mixing coefficient. As a direct consequence, the patristic distance, which is effectively the sum of the evolutionary times between the $i$th and $g$th tip taxa and their common ancestor, offers the space upon which evolutionary differences are defined. This is an important modelling assumption: estimates for latent ancestral states will account only for phylogenetic variation between the taxa. All non-phylogenetic variation has to be accounted for at the extant taxa level. Therefore, we see from Eq. 6.23 that the proportion of variation in the mixing coefficients attributable to the phylogeny is $\frac{(\sigma_f^{j})^2}{(\sigma_f^{j})^2 + (\sigma_n^{j})^2}$. Clearly, if this ratio tends to 0, non-phylogenetic variation dominates our sample and phylogenetic inference is impossible. In the Gaussian process regression literature in Machine Learning, $\frac{1}{2\alpha}$ is equivalent to $\ell$, the characteristic length-scale [263] of decay in the correlation function, and in the following we work with the latter.
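As an illustration of Eq. 6.23, the covariance matrix can be assembled from a matrix of patristic distances. The sketch below assumes NumPy and writes the decay rate as $1/\ell$ in place of $2\alpha$; the function names and the three-node toy tree (two tips and their parent) are hypothetical and chosen only to make the structure visible.

```python
import numpy as np

def pgp_covariance(P, is_tip, sigma_f, sigma_n, length_scale):
    """Covariance of Eq. 6.23 with 2*alpha written as 1/length_scale.

    P       : (n, n) matrix of patristic distances between nodes of the tree
    is_tip  : length-n boolean sequence, True for extant (tip) taxa
    """
    P = np.asarray(P, dtype=float)
    K = sigma_f**2 * np.exp(-P / length_scale)
    # delta^e term: non-phylogenetic variance enters only on the
    # diagonal entries that belong to tip taxa.
    K = K + sigma_n**2 * np.diag(np.asarray(is_tip, dtype=float))
    return K

def phylogenetic_proportion(sigma_f, sigma_n):
    """Share of variance attributable to the phylogeny."""
    return sigma_f**2 / (sigma_f**2 + sigma_n**2)

# Hypothetical toy tree: nodes 0 and 1 are tips, node 2 is their parent,
# each tip sitting at distance 1 from the parent.
P = [[0.0, 2.0, 1.0],
     [2.0, 0.0, 1.0],
     [1.0, 1.0, 0.0]]
is_tip = [True, True, False]

K = pgp_covariance(P, is_tip, sigma_f=2.0, sigma_n=1.0, length_scale=1.0)
share = phylogenetic_proportion(363.358, 370.084)  # FPC2 estimates from Table 6.2
```

With the FPC2 estimates reported in Table 6.2 the proportion comes out just under one half, in line with a ratio $\sigma_f / \sigma_n$ close to unity.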

Aiming to provide the best possible basis in terms of an RSS reconstruction criterion along with the minimal amount of prior assumptions, we use the FPCA-generated basis. In general, there is no reason for our inference procedure to be sensitive to the particular shape of the basis functions; indeed other bases, e.g. ICA-based ones [145], could easily be employed. Concerning inference for a specific digit $d$ (e.g. one), the four simple two-dimensional orthogonal functions shown in Fig. 6.3 were therefore chosen as examples. For computational purposes each basis function was stored numerically as a matrix of dimensions 81 by 100, so that the basis matrix $\phi^{d}$ was in this case of size $4 \times 8100$, each row storing a different basis function. This is in accordance with standard methodology used in spectrogram and face recognition analysis, where an image is represented as a concatenated vector [241; 19]. As we will discuss in the final section, provided that one is willing to make certain assumptions about the applicable noise structure, a variety of different models is also available [19; 197; 142].

The mixing coefficients generated by FPCA are stored in $Q^{d}$. Our modelling assumption is that the mixing coefficients for distinct basis functions $\phi^{d}$ are statistically independent of each other, as they are produced using standard FPCA. It is therefore sufficient to describe the stochastic process generating the mixing coefficient for each basis independently, using the phylogenetic model proposed above (Eq. 6.23). We emphasize again at this point that we focus on one digit $d$, where $d = 1$ in this case. The only instance where all 10 digits were combined was in the previous subsection, for the construction of the MBL tree.

The "extant" function-valued trait at tip taxon $i$ is thus $\sum_{j=1}^{4} Q_{i,j}\phi_j$ (a vector of length 8100), while the ancestral function-valued trait at internal taxon $g$ is $\sum_{j=1}^{4} H_{g,j}\phi_j$, with $H$ storing the values of the mixing coefficients in the ancestral (historical) states. As commented above, the ancestral function-valued traits exhibit only phylogenetic variation, while the extant function-valued traits exhibit both phylogenetic and non-phylogenetic variation. Of course, it is not possible to reconstruct non-phylogenetic variation using phylogenetic methods. Non-phylogenetic variation is nevertheless a "fact of life" concerning the data at the extant taxa and we need to account for it explicitly. As Hadjipantelis et al. [119] have demonstrated, though, this noise does not prevent the reconstruction of the phylogenetic part of variation for ancestral taxa.
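The two sums above are plain matrix products between the coefficient matrices and the $4 \times 8100$ basis matrix. The following sketch, assuming NumPy, uses random placeholders for the basis and the coefficients (the real ones come from FPCA and PGPR); the array names and the count of four internal nodes, read off Fig. 6.4, are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tips, n_internal, n_basis = 5, 4, 4
n_rows, n_cols = 81, 100            # each basis function is an 81 x 100 matrix

# Placeholder values standing in for the FPCA basis and the coefficients.
Phi = rng.standard_normal((n_basis, n_rows * n_cols))   # 4 x 8100, one basis per row
Q = rng.standard_normal((n_tips, n_basis))              # tip mixing coefficients
H = rng.standard_normal((n_internal, n_basis))          # ancestral mixing coefficients

extant = Q @ Phi        # row i is sum_j Q[i, j] * Phi[j]
ancestral = H @ Phi     # row g is sum_j H[g, j] * Phi[j]

# Undo the concatenation to view one trait as a two-dimensional surface again.
first_tip_surface = extant[0].reshape(n_rows, n_cols)
```

Storing each basis function as one row keeps the reconstruction of all taxa to a single product, and reshaping a row recovers the original 81 by 100 layout.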

Commenting further on the role of parameters in the phylogenetic O-U process described above in Eq. 6.23, exceptionally small characteristic length-scales $\ell$, relative to the tree patristic distances, practically suggest taxa-specific phylogenetic variation, i.e. non-phylogenetic variation. This also holds in reverse: exceptionally large characteristic length-scales suggest a stable, non-decaying variation across the examined taxa that is indifferent to their patristic distances, again suggesting the absence of phylogenetic variance among the nodes.
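Both limiting cases can be read directly off Eq. 6.23. The short derivation below is a sketch restricted to the tip taxa, assuming $\ell = 1/(2\alpha)$ and strictly positive, finite patristic distances between distinct tips; $I$ is the identity matrix and $\mathbf{1}$ a vector of ones, and the superscript $j$ is dropped for readability.

```latex
\begin{align*}
\ell \to 0:
  &\quad \exp\!\left(-P_T(t_i,t_g)/\ell\right) \to 0 \ \text{ for } t_i \neq t_g
  &&\Rightarrow\quad K_T \to \left(\sigma_f^2 + \sigma_n^2\right) I, \\
\ell \to \infty:
  &\quad \exp\!\left(-P_T(t_i,t_g)/\ell\right) \to 1 \ \text{ for all } t_i, t_g
  &&\Rightarrow\quad K_T \to \sigma_f^2\, \mathbf{1}\mathbf{1}^{T} + \sigma_n^2 I .
\end{align*}
```

In the first case the tips are mutually uncorrelated, so $\sigma_f$ and $\sigma_n$ only enter through their sum; in the second every pair of tips shares the same covariance regardless of distance, so the tree topology no longer informs the fit.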

Since the posterior distributions returned by PGPR depend on the hyperparameter vector $\theta$, we must estimate $\theta$ in order to reconstruct ancestral function-valued traits, with the estimation procedure correcting for the dependence due to the phylogeny. Maximum likelihood estimation (MLE) of the phylogenetic variation, non-phylogenetic variation and characteristic length-scale hyperparameters, $\sigma_f^{j}$, $\sigma_n^{j}$ and $\ell^{j}$ respectively, may be attempted numerically using the explicit prior likelihood function (Eq. 6.22).
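A numerical maximum likelihood search of this kind can be sketched as follows. This is a hedged illustration only, assuming NumPy: it uses a coarse grid search rather than any particular optimizer, holds the length-scale at a hypothetical fixed value for simplicity, and runs on made-up patristic distances and coefficients for five tips. None of the numbers or function names come from the analysis itself.

```python
import numpy as np

def neg_log_likelihood(q, P, sigma_f, sigma_n, length_scale):
    """Negative of Eq. 6.22 for tip coefficients q under Eq. 6.23 (tips only)."""
    n = q.shape[0]
    K = sigma_f**2 * np.exp(-P / length_scale) + sigma_n**2 * np.eye(n)
    L = np.linalg.cholesky(K)               # K = L L^T
    z = np.linalg.solve(L, q)               # z.z equals q^T K^{-1} q
    return 0.5 * z @ z + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2.0 * np.pi)

def grid_mle(q, P, length_scale, grid):
    """Coarse grid search over (sigma_f, sigma_n) with length_scale held fixed."""
    best = None
    for sf in grid:
        for sn in grid:
            nll = neg_log_likelihood(q, P, sf, sn, length_scale)
            if best is None or nll < best[0]:
                best = (nll, sf, sn)
    return best

# Hypothetical ultrametric patristic distances between 5 tips.
P = np.array([[0.0, 0.4, 1.6, 1.6, 2.8],
              [0.4, 0.0, 1.6, 1.6, 2.8],
              [1.6, 1.6, 0.0, 0.6, 2.8],
              [1.6, 1.6, 0.6, 0.0, 2.8],
              [2.8, 2.8, 2.8, 2.8, 0.0]])
q = np.array([1.2, 0.9, -0.4, -0.7, 0.1])   # made-up mixing coefficients

grid = np.linspace(0.1, 3.0, 30)
nll, sf_hat, sn_hat = grid_mle(q, P, length_scale=1.0, grid=grid)
```

A grid search is used here only because it is transparent; with so few tips the likelihood surface can be flat or multimodal, so any optimizer's output deserves to be checked against the surface itself.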

Estimating hyperparameters is commonly hindered by problems of non-identifiability [263; 160] and, as a direct consequence, concurrent estimation of all components of $\theta^{j} = (\sigma_f^{j}, \sigma_n^{j}, \ell^{j})$ is problematic. As commented by Beaulieu et al. [23], the influence of sample size on the bias and precision of $\alpha$ is particularly pronounced; in our setting this problem is even more evident. In particular, given that we have only 5 languages with which to estimate 3 hyperparameters, our estimation procedure is bound to suffer. Thus we propose fixing the length-scale $\ell$. This does not mean that we enforce phylogenetic variation, but rather that we fix the distance over which the covariance can meaningfully occur. If there is "nothing but non-phylogenetic variation", that will be reflected by $(\sigma_f)^2 / (\sigma_n)^2 \to 0$. Hadjipantelis et al. [119] have shown that overall $\theta$ estimates may be further improved if one knows a priori the value of the ratio $(\sigma_n)^2 / (\sigma_f)^2$, which is closely related to Pagel's $\lambda$ [228]. We do not examine this possibility here, though, as we have little prior knowledge of the sample's phylogenetic dynamics in a linguistic application.

The final estimated parameters $\hat{\theta}$ are shown in Table 6.2. It is immediately seen that only one FPC, the second, encapsulates plausible phylogenetic associations. This is not surprising given our small sample: the tree itself might not provide "enough structure" for plausible associations to come forward as significant effects. In particular, looking at the hyperparameter estimates for the first FPC of digit one, $\theta^{1}$, it is striking that all variation is considered to be non-phylogenetic; we expect that because, as mentioned, FPC1 appears to encapsulate mostly the presence of the initial vowel in each word. Given that all words in the sample start with variants of "u", there are not enough differences to be accounted for within a phylogeny. To a lesser extent the opposite holds for FPC3. It encodes a highly specialized pattern of counter-balanced variation between high-frequency early-timed and low-frequency later-timed vocal excitations within the same word; however, being so "specialized", there is not enough structure for it to come across as phylogenetically relevant (if indeed it is). Considering both these FPCs, our decision to fix $\ell$ does not seem unreasonable, given the gross absence of any phylogenetic signal. On the contrary, examining FPC2 we witness a noisy but plausible phylogenetic variation. The length-scale $\ell$ here might not be optimal, but it does not preclude the detection of phonetic associations due to a phylogeny. In a way we expected this; as mentioned in the earlier section, FPC2 mostly models the possible interplay between the second and the first vowel of a word (and if there is no second vowel it comes out close to zero). This is a strong association which is not as specific as in the case of FPC3. We cannot say that FPC2 is certainly encapsulating phylogenetic variations; if anything, the ratio $\sigma_f^{j} / \sigma_n^{j}$ being close to unity signifies the significant presence of non-phylogenetic variation. It nevertheless seems to offer plausible insights.

The final FPC analysed, FPC4, gives peculiar hyperparameter choices. One could naively even say that it is encapsulating only phylogenetic signal. This is clearly not the case, as we very well understand that it is practically impossible for a phylogenetic trait (assuming that one is encoded by FPC4) to have retained absolutely no "non-phylogenetic" variability across a phylogeny. What is more, if we investigate the actual number (700.814), we see that it is significantly higher than the standard deviation of the sample coefficients it tried to model originally (279.430). In effect it "amplifies" the phylogenetic variation to such a level that it acts as "non-phylogenetic" variation with a strong, practically constant variational amplitude across all nodes. The estimates in $\theta^{4}$ are just artefacts of the numerical optimization procedure used. We do not expect any phylogenetic variation in FPC4 and clearly this choice of $\theta$'s reflects just that. It does draw attention, though, to the fact that if numerical optimization methods are employed, one always has to question the significance of their results not only technically but also conceptually. As mentioned above, sample size is an issue that has a significant impact on the power of this analysis; we revisit this point in section 6.4.

θ^i #    σ_f^i           σ_n^i           σ_f^i / σ_n^i
1        2.802×10^-4     2095.101        1.337×10^-7
2        363.358         370.084         0.982
3        1.074×10^-5     473.240         2.270×10^-8
4        700.814         7.988×10^-8     1.383×10^9

Table 6.2: The MLE estimates for the hyperparameters in Eq. 6.23 for digit one. Each row corresponds to a given estimate of the vector $\theta^{i}$. These estimates provide the maximum likelihood value for Eq. 6.22. When $\ell$ is denoted as non-applicable, it is because there is no phylogenetic variation in the sample.

In document Functional data analysis in phonetics (Page 144-151)
