
Tree Estimation

In document Functional data analysis in phonetics (Page 142-144)

Chapter 6 Phylogenetic analysis of Romance languages

6.2 Methods & Implementations

6.2.2 Tree Estimation

As first mentioned in section 2.4, we begin with an unrooted linguistic phylogenetic tree T which has arbitrary branch lengths and in which only the branching events are set. As previous work has commented [119], branch length distributions are surprisingly consistent across organisms [318]; with that in mind, we make the assumption that the same effect is prominent in a linguistic phylogeny. Utilizing the scalar mixing coefficients associated with each FPC in φ̂(u, f), one treats these coefficients as “the data at the tips”. One then constructs a maximum likelihood consensus tree. In particular, based on the work of Hansen [124] that was later popularized by the work of Butler and King [46], the evolutionary model assumed is that of an Ornstein-Uhlenbeck stochastic model. We find the ML tree associated with each coefficient by doing a random search. To generate candidate branch lengths we assumed that the distribution of branch lengths approximates a log-normal, log(b) ∼ N(µ, σ²); this assumption is supported by empirical investigation of the trees contained in Tree-fam [192]. We do this because, as we assume the notions of glottoclock in Linguistics and molecular clock in Biology to share the same intrinsic meanings in their respective fields, we consider that the observed diffusion patterns will also be similar on a qualitative level. For the sake of generality we do not assume that the tree at hand is ultrametric (in an ultrametric tree all the extant taxa are at the same time-depth in the tree; this is formally expressed as d(t_i, t_j) ≤ d(t_i, t_k) = d(t_j, t_k) for every triplet i, j, k of extant taxa nodes).

After finding the ML-optimal trees for each of the k projections utilized, we construct the consensus tree for the languages at hand. The consensus tree is constructed by applying the median branch length (MBL) rationale [82]: given that we have k candidate trees with the same branching topology, each edge of the consensus tree is assigned the median of the branch lengths that the k candidate “ML-optimal” trees associate with that edge. One in effect computes the “median tree”. A number of complementary methodologies have also been proposed, with variants of the majority-rule consensus tree being the most popular [186; 137]. We do not advocate a majority-rule consensus tree on the grounds that our sample is quite small and therefore bootstrapping techniques (which are extensively used for the generation of majority-rule consensus trees) are not reliable. Additionally, we do not examine a possible clustering of correlation effects in the phylogenies and a subsequent clustering that they might induce [82].

Finally, we do not explicitly examine the possibility of a multifurcating tree, i.e. a trifurcation or higher-degree branching events. While such events might have some gravity in the case of small-population linguistic phylogenetic studies, where one can assume rapid branching of different groups of people [38], we do not find them plausible for widely spoken languages such as the ones found in the Romance language family. We do, though, allow for arbitrarily small edge lengths, so we can in effect accommodate this possibility, as a trifurcation would be associated with a zero branch length for an internal edge.
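The median branch length rule described above can be illustrated with a minimal sketch. It assumes that every candidate tree is stored as a mapping from an edge label to a branch length; the edge labels and the toy values below are hypothetical and are not taken from the analysis itself.

```python
from statistics import median


def median_branch_length_tree(candidate_trees):
    """Edge-wise median of branch lengths over trees sharing one topology.

    Each tree is a dict {edge_label: branch_length}. The labels are
    illustrative; they only need to agree across the candidate trees.
    """
    if not candidate_trees:
        raise ValueError("need at least one candidate tree")
    edges = set(candidate_trees[0])
    for tree in candidate_trees:
        if set(tree) != edges:
            raise ValueError("all candidate trees must share the same edges")
    # For every edge, take the median of its k candidate lengths.
    return {
        edge: median(tree[edge] for tree in candidate_trees)
        for edge in sorted(edges)
    }


# Three toy "ML-optimal" trees on the same (made-up) set of edges.
trees = [
    {"root-A": 0.10, "root-B": 0.30, "A-es": 0.02},
    {"root-A": 0.20, "root-B": 0.25, "A-es": 0.00},
    {"root-A": 0.15, "root-B": 0.40, "A-es": 0.05},
]
consensus = median_branch_length_tree(trees)
```

Because the median is taken edge by edge, an internal edge whose length is zero in most candidate trees also receives a zero length in the consensus, which is how an effective trifurcation could surface under this representation.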

Implementing these assumptions, we begin with the unrooted linguistic phylogenetic tree T with 5 leaves, shown in Fig. 2.6. This tree is based on [106]; American Spanish has been added, though, as a distinct language. We make the assumption that American Spanish shares a common ancestor with Iberian Spanish, with that “Spanish protolanguage bifurcation” occurring more recently than any other linguistic bifurcation event in the examined Romance language phylogeny. Having fixed the branching structure of the tree, we assign the FPC scores at its leaves. Each 5-language FPC score grouping is considered independent, not only of the other scores associated with the same digit but also of the scores from the other digits. As we are using digits one to ten, having generated 4 FPC surfaces for each digit, we test 40 different sets of “data at the tips”. Using the O-U model we tested against 5120 candidate trees and reported as the optimal tree the tree with the maximum likelihood for that given set of FPC scores. To conduct this testing step the function fitContinuous() from the R package geiger [128] was utilized; for each candidate tree branch sample “fitting”, 700 random initializations of the routine were tested.

As mentioned, while the branching events are treated as fixed, the branch lengths are not. Candidate branch lengths b were sampled from a log-normal such that log(b) ∼ N(−2.29, 1.66²); the actual values of µ and σ shown here were estimated by using the trees publicly available in Tree-fam. Tree-fam ver. 8 [192] contains 16604 trees in total; for this task, though, trees with fewer than 5 or more than 20 nodes were excluded from the analysis because we assumed that they do not present plausible exemplars for a Romance languages linguistic phylogeny. The reasons behind this heuristic rule are three. First, smaller trees may often convey domain-specific relations even within a biological setting. Second, larger trees are also less plausible as linguistic exemplars because they often aggregate different families of organisms with well understood distinctions in a way that is irrelevant for linguistics. Third, based on existing literature [221; 224; 106], a Romance language phylogeny is not assumed to incorporate more than approximately 20 leaves. Based on these points, this cut-off resulted in a sub-sample of 3593 trees that was ultimately used to estimate µ̂ and σ̂.

Having estimated the 40 “ML-optimal” trees, the “median tree” (shown in Fig. 6.4) was constructed and assumed to be the tree that most accurately reflects our modelling assumptions as well as the universal linguistic phylogenetic association between the languages examined.
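The sampling and search steps above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the analysis: the actual likelihood evaluation was done with fitContinuous() from the R package geiger, whereas here the likelihood is a placeholder callable supplied by the user, and the function names, the (node count, branch lengths) tree representation and the toy data are all hypothetical. The only values taken from the text are µ = −2.29, σ = 1.66, the 5 to 20 node cut-off and the 5120 candidates; the 7 edges follow from the fact that an unrooted binary tree with 5 leaves has 2·5 − 3 = 7 edges.

```python
import math
import random
from statistics import mean, stdev


def pooled_branch_lengths(trees, min_nodes=5, max_nodes=20):
    """Pool branch lengths from trees with min_nodes <= node count <= max_nodes.

    `trees` is a list of (n_nodes, branch_lengths) pairs (assumed layout).
    """
    pooled = []
    for n_nodes, lengths in trees:
        if min_nodes <= n_nodes <= max_nodes:
            pooled.extend(lengths)
    return pooled


def fit_lognormal(branch_lengths):
    """Estimate (mu, sigma) of log(b) ~ N(mu, sigma^2) from positive lengths."""
    logs = [math.log(b) for b in branch_lengths if b > 0]
    return mean(logs), stdev(logs)


def random_search(n_edges, mu, sigma, log_likelihood, n_candidates=5120, seed=0):
    """Draw log-normal branch-length vectors and keep the highest-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, -math.inf
    for _ in range(n_candidates):
        candidate = [rng.lognormvariate(mu, sigma) for _ in range(n_edges)]
        score = log_likelihood(candidate)  # placeholder for an O-U model fit
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy reference trees: the 4-node and 25-node trees fall outside the cut-off.
toy_trees = [(4, [0.5, 0.7]), (6, [0.1, 0.2, 0.05]), (25, [3.0]), (10, [0.08, 0.12])]
pooled = pooled_branch_lengths(toy_trees)
mu_hat, sigma_hat = fit_lognormal(pooled)


def toy_score(lengths):
    """Stand-in score with no phylogenetic meaning; highest near 0.1 per edge."""
    return -sum((b - 0.1) ** 2 for b in lengths)


best, score = random_search(
    n_edges=7, mu=-2.29, sigma=1.66, log_likelihood=toy_score, n_candidates=200
)
```

Repeating such a search once per set of FPC scores would give one best-scoring branch-length vector per set, which is the form of input that the median tree construction requires.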
