Gaussian Processes & Phylogenetic Regression

Chapter 3 Statistical Techniques for Functional Data Analysis

3.5 Stochastic Processes and Phylogenetics

3.5.1 Gaussian Processes & Phylogenetic Regression

A Gaussian Process (GP) is defined as a probability distribution over functions $Y(s)$ such that the set of values of $Y(s)$ evaluated at an arbitrary set of points $s_1, \ldots, s_N$ jointly have a Gaussian distribution. Importantly, this means that if one chooses to work with a zero-mean Gaussian process, the GP is completely specified by its second-order statistics [28]. Drawing an analogy with spatial statistics methodology and kriging, this inference would be referred to as simple kriging. To formulate this function-space view of Gaussian processes we can write a GP as a function $f(x)$ such that:

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big)$$

where, as stated beforehand, if $m(x) = 0$ the covariance function $k(x, x')$ can be seen as:

$$\begin{aligned} K(x, x') &= E\{(f(x) - m(x))(f(x') - m(x'))\} && (3.70) \\ &= E\{f(x)\, f(x')\} \quad \text{if } m(x) = 0 && (3.71) \end{aligned}$$

where $f(x)$ is the value of the function $f$ at point $x$ [263]. With this basic formulation in place, it is worth looking more closely at the covariance functions themselves. Covariance functions not only encode the covariance among the observed sample points but also offer insight into the dynamics of the whole process.

In addition, the realization of the covariance function $K$ as the covariance matrix $K$ between all pairs of points $x$ and $x'$ specifies a distribution on functions and is known as the Gram matrix. Importantly, because every valid covariance function is a scalar product of vectors, by construction the matrix $K$ is non-negative definite. Equivalently, the covariance function $K$ is a non-negative definite function in the sense that for every finite set of points $x_1, \ldots, x_N$ and every real vector $a$, $\sum_{i,j} a_i a_j K(x_i, x_j) \geq 0$; such a $K$ is also called positive semi-definite. Importantly, the non-negative definiteness of $K$ enables its spectral decomposition using the Karhunen-Loève expansion. Basic aspects that can be defined through the covariance function are the process' stationarity, isotropy and smoothness [17].
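As a quick numerical illustration of non-negative definiteness, the sketch below builds the Gram matrix of an exponential (O-U type) covariance function on a handful of arbitrary points and checks that its eigenvalues and one random quadratic form are non-negative. The helper name `gram_matrix`, the input points and the length scale of 1.5 are assumptions made purely for the example.

```python
import numpy as np

def gram_matrix(x, kernel):
    """Evaluate kernel(x_i, x_j) for all pairs of 1-D inputs."""
    x = np.asarray(x, dtype=float)
    return kernel(x[:, None], x[None, :])

# Exponential (O-U type) covariance with an assumed length scale of 1.5.
ou = lambda a, b: np.exp(-np.abs(a - b) / 1.5)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=8)     # arbitrary input points
K = gram_matrix(x, ou)

eigenvalues = np.linalg.eigvalsh(K)    # K is symmetric, so eigvalsh applies
a = rng.normal(size=x.size)            # arbitrary coefficient vector
quadratic_form = a @ K @ a             # a^T K a, should be >= 0

print(eigenvalues.min() >= -1e-10, quadratic_form >= 0.0)
```

The small negative tolerance on the eigenvalues only allows for floating-point round-off.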

Stationarity refers to the process' behaviour regarding the separation of any two points $x$ and $x'$. If the process is stationary, its covariance depends only on their separation, $x - x'$, while if it is non-stationary the covariance depends on the actual positions of the points $x$ and $x'$; an example of a stationary process is the Ornstein-Uhlenbeck (O-U) process. If the covariance depends only on $|x - x'|$, the Euclidean distance (not the direction) between $x$ and $x'$, then the process is considered isotropic. A process that is concurrently stationary and isotropic is considered to be homogeneous [110]; in practice these properties reflect the differences (or rather the lack of them) in the behaviour of the process given the location of the observer.
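The distinction can be checked numerically. The following minimal sketch compares an O-U covariance, which depends on the inputs only through $|x - x'|$, with a linear covariance $k(x, x') = x x'$, which is used here only as an assumed example of a non-stationary covariance. The input values and the shift are arbitrary.

```python
import numpy as np

def ou_kernel(x, y, lam=1.0):
    """Stationary: depends on x and y only through |x - y|."""
    return np.exp(-np.abs(x - y) / lam)

def linear_kernel(x, y):
    """Non-stationary: depends on the actual positions of x and y."""
    return x * y

x, y, shift = 0.5, 2.0, 3.0

# Shifting both inputs by the same amount leaves a stationary kernel unchanged.
ou_before, ou_after = ou_kernel(x, y), ou_kernel(x + shift, y + shift)
lin_before, lin_after = linear_kernel(x, y), linear_kernel(x + shift, y + shift)

print(np.isclose(ou_before, ou_after))    # O-U kernel is shift-invariant
print(np.isclose(lin_before, lin_after))  # linear kernel is not
```

Because the O-U kernel uses the absolute value of the separation, swapping the two inputs also leaves it unchanged, which is the one-dimensional version of isotropy.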

Ultimately, Gaussian processes amount to placing priors on functions, and the smoothness of these priors can be induced by the covariance function [17]. If we expect that for “near-by” input points $x$ and $x'$ the corresponding output points $y$ and $y'$ are also “near-by”, then the assumption of smoothness is present. If we wish to allow for significant displacement then we might choose a rougher covariance function. Extreme examples of this behaviour are the Ornstein-Uhlenbeck covariance function and the squared exponential: processes with the former are nowhere differentiable, while processes with the latter are infinitely differentiable.
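One way to see the contrast in smoothness without simulating sample paths is through the mean-square finite-difference slope. For a zero-mean stationary process, $E[((f(x+h) - f(x))/h)^2] = 2(k(0) - k(h))/h^2$. The sketch below evaluates this quantity for shrinking $h$. The parametrisation $\exp(-h^2 / 2\lambda^2)$ of the squared exponential and the common length scale $\lambda = 1$ are assumptions made for illustration.

```python
import numpy as np

lam = 1.0                                   # assumed common length scale
h = np.array([1e-1, 1e-2, 1e-3, 1e-4])      # shrinking separations

k_ou = lambda r: np.exp(-np.abs(r) / lam)           # Ornstein-Uhlenbeck kernel
k_se = lambda r: np.exp(-r**2 / (2.0 * lam**2))     # squared exponential kernel

# E[((f(x+h) - f(x)) / h)^2] = 2 * (k(0) - k(h)) / h^2 for a stationary process.
slope_var_ou = 2.0 * (k_ou(0.0) - k_ou(h)) / h**2
slope_var_se = 2.0 * (k_se(0.0) - k_se(h)) / h**2

for step, ou, se in zip(h, slope_var_ou, slope_var_se):
    print(f"h={step:.0e}  O-U: {ou:12.2f}  SE: {se:.6f}")
```

For the O-U kernel the slope variance grows roughly like $2/(\lambda h)$, so no mean-square derivative exists. For the squared exponential it settles near $1/\lambda^2$.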

process. Formally, this is achieved by mapping the input $x$ to a two-dimensional vector $u(x) = (\cos(x), \sin(x))$. As outlined earlier, a stochastic process of great biological interest is the O-U process. This is because what we ultimately want is a Gauss-Markov process: a stochastic process that satisfies the requirements of both a Gaussian process (in terms of changes) and a Markovian process (in terms of finite memory). With this in mind, using a standard noisy-measurement O-U kernel in the context of a phylogenetic GP, $f(L) \sim \mathcal{N}(0, K(L, L, \theta))$, results in the following covariance structure:

$$K(l_i, l_j) = s_f^2 \exp\left(-\frac{|l_i - l_j|}{\lambda}\right) + s_n^2\, \delta_{l_i, l_j} \qquad (3.72)$$

where, for a given trait $f(L)$ on a finite set of “leaf” co-ordinates $L$, $K(L, L, \theta)$ is the matrix of covariances of pairs $(l_i, l_j)$ with hyperparameters $\theta$; in this case $\theta$ is composed of three components:

• $s_f^2$: intensity of random fluctuations in evolution due to balance between the restraining forces / amplitude of function variation;

• $\lambda$: phylogenetic horizon (how many generations back a trait is influenced by) / characteristic length scale (“roughly the distance you have to move in input space before the function value can change significantly” [263]); and

• $s_n^2$: interspecies variation, changes unaccountable from the relations conveyed by the phylogeny / Gaussian noise.
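To make the roles of the three hyperparameters concrete, here is a minimal sketch of the covariance structure (3.72) evaluated on a matrix of pairwise leaf distances. The function name `phylo_ou_kernel`, the three-leaf distance matrix and the hyperparameter values are hypothetical and chosen only for illustration. The Kronecker delta is implemented as an identity matrix, which assumes that all leaves are distinct.

```python
import numpy as np

def phylo_ou_kernel(dist, s_f2, lam, s_n2):
    """Equation (3.72): s_f^2 * exp(-d_ij / lambda) + s_n^2 * delta_ij.

    dist : (n, n) array of pairwise distances |l_i - l_j| between leaves.
    """
    dist = np.asarray(dist, dtype=float)
    signal = s_f2 * np.exp(-dist / lam)
    noise = s_n2 * np.eye(dist.shape[0])   # delta is 1 only on the diagonal
    return signal + noise

# Hypothetical pairwise distances for three leaves (illustrative values only).
D = np.array([[0.0, 2.0, 6.0],
              [2.0, 0.0, 6.0],
              [6.0, 6.0, 0.0]])

K = phylo_ou_kernel(D, s_f2=1.0, lam=3.0, s_n2=0.1)
print(np.round(K, 3))
```

The diagonal carries both the signal and the noise variance, while off-diagonal entries decay with distance at a rate set by $\lambda$.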

With this at hand, the final estimation for the predictive distribution is carried out under a standard maximum likelihood framework, where one maximizes the phylogenetic GP's log-likelihood:

$$\log p(f(L) \mid \theta) = -\frac{1}{2} f(L)^T K(L, L, \theta)^{-1} f(L) - \frac{1}{2} \log |K(L, L, \theta)| - \frac{|L|}{2} \log(2\pi) \qquad (3.73)$$

in order to find the optimal values of $\theta$, $\theta_{opt}$.
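The log-likelihood (3.73) can be evaluated stably through a Cholesky factorisation of $K$. The sketch below does this and then runs a coarse grid search over $\lambda$ with the two variances held fixed. The distances, trait values, fixed variances and the grid itself are hypothetical. The grid search is a stand-in for whatever optimiser one prefers and is not the method used in the text.

```python
import numpy as np

def gp_log_likelihood(f, K):
    """Equation (3.73) evaluated through a Cholesky factor of K."""
    f = np.asarray(f, dtype=float)
    chol = np.linalg.cholesky(K)                                 # K = chol @ chol.T
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, f))    # K^{-1} f
    log_det = 2.0 * np.sum(np.log(np.diag(chol)))                # log|K|
    return -0.5 * f @ alpha - 0.5 * log_det - 0.5 * f.size * np.log(2.0 * np.pi)

def ou_kernel(dist, s_f2, lam, s_n2):
    """Equation (3.72) for a matrix of pairwise leaf distances."""
    return s_f2 * np.exp(-dist / lam) + s_n2 * np.eye(dist.shape[0])

# Hypothetical distances and trait values for four leaves.
D = np.array([[0.0, 2.0, 6.0, 6.0],
              [2.0, 0.0, 6.0, 6.0],
              [6.0, 6.0, 0.0, 4.0],
              [6.0, 6.0, 4.0, 0.0]])
f_L = np.array([0.9, 1.1, -0.8, -1.2])

# Coarse grid search over lambda with the two variances held fixed (assumed values).
lams = np.linspace(0.5, 20.0, 40)
scores = [gp_log_likelihood(f_L, ou_kernel(D, 1.0, lam, 0.1)) for lam in lams]
lam_opt = lams[int(np.argmax(scores))]
print(lam_opt, max(scores))
```

In practice all three components of $\theta$ would be optimised jointly, typically with a gradient-based routine.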

Through $\theta_{opt}$ one is immediately able to answer questions regarding the evolutionary properties of the sample at hand. For example, if $s_f^2 \ll s_n^2$ then it is almost obvious that the phylogeny at hand accounts for only a very small proportion of the observed variance and is thus probably not useful. The same insight is conveyed when $\lambda \to 0$, which effectively means that each node is in practice agnostic of all other nodes in the phylogeny and no “information transferral” takes place. In any case, if one fixes the values of $\theta$, the posterior distribution for ancestral states $A$ is immediately available through the posterior of the Gaussian distribution that the (univariate) traits describe. Namely:

$$f(A) \mid f(L) \sim \mathcal{N}\big(K(A, L) K(L, L)^{-1} f(L),\; K(A, A) - K(A, L) K(L, L)^{-1} K(A, L)^T\big). \qquad (3.74)$$

While phylogenetic GPR will be revisited in chapter 6, we need to highlight immediately that we not only get a posterior mean estimate for $f(A)$, but are also able to quantify our uncertainty about that estimate through the variance attributed to that point in the phylogeny, which is independent of the actual observed values $f(L)$ [157].
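The posterior (3.74) amounts to a few lines of linear algebra. The sketch below computes the posterior mean and variance of a single ancestral node from two leaves and confirms numerically that the posterior variance does not change when the observed trait values change. The tree (two leaves, each at distance 1 from their common ancestor), the hyperparameter values and the function names are hypothetical. Placing the noise term only on the leaf-to-leaf block is an assumption of this example.

```python
import numpy as np

def ou_cov(dist, s_f2, lam):
    """Noise-free O-U covariance for a matrix of pairwise distances."""
    return s_f2 * np.exp(-np.asarray(dist, dtype=float) / lam)

def ancestral_posterior(K_AL, K_LL, K_AA, f_L):
    """Mean and covariance of f(A) | f(L) as in equation (3.74)."""
    mean = K_AL @ np.linalg.solve(K_LL, f_L)
    cov = K_AA - K_AL @ np.linalg.solve(K_LL, K_AL.T)
    return mean, cov

# Hypothetical tree: two leaves, each at distance 1 from their common ancestor.
s_f2, lam, s_n2 = 1.0, 3.0, 0.1
D_LL = np.array([[0.0, 2.0], [2.0, 0.0]])    # leaf-to-leaf distances
D_AL = np.array([[1.0, 1.0]])                # ancestor-to-leaf distances
D_AA = np.array([[0.0]])

K_LL = ou_cov(D_LL, s_f2, lam) + s_n2 * np.eye(2)   # noisy observations at the leaves
K_AL = ou_cov(D_AL, s_f2, lam)
K_AA = ou_cov(D_AA, s_f2, lam)

mean_1, cov_1 = ancestral_posterior(K_AL, K_LL, K_AA, np.array([0.9, 1.1]))
mean_2, cov_2 = ancestral_posterior(K_AL, K_LL, K_AA, np.array([-3.0, 5.0]))
print(mean_1, cov_1)
print(np.allclose(cov_1, cov_2))   # posterior variance does not depend on f(L)
```

Using `np.linalg.solve` avoids forming $K(L, L)^{-1}$ explicitly, which is usually preferable numerically.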

Combining this notation with the previously presented concept of an O-U process translates the covariance structure $K$ into a reflection of the perturbations due to selective demands from unconsidered selective factors. In the case of a language, these are due to semantic correlations of the sounds produced, voicing correlations between the biomechanics of phonation, environmental fluctuations, and obviously random “mutations” that materialize as “corruption” of the initial sounds. Expanding on this, phylogenetic time is the concept that serves as the continuum over which data are observed (in the case of observed leaf nodes) or assumed to exist (unobserved ancestral nodes). To that extent $X_{opt}$ cannot be defined as having a single physical notion but rather (under simplifying assumptions) as a conceptual optimal state in which a language perfectly conveys, using speech, all the information required by its speakers.

In document Functional data analysis in phonetics (Page 79-82)
