Received 30 October 2013; reviews returned 11 July 2014; accepted 14 July 2014 Associate Editor: Tanja Stadler
Abstract.—This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can coexist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree–species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree–species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution. [Algorithm; amalgamation; Bayesian inference; birth–death model; coalescent; dynamic programming; gene duplication; gene loss; gene transfer; gene tree; hybridization; maximum likelihood; phylogenetics; species tree.]
For the simulation of large data sets, we employed a pipeline that follows Schrempf et al. (2016). First, ten species trees with 100 leaves were randomly generated under the Yule birth model (Yule 1925). Each of the ten species trees is referred to as one of ten replicates. The height of the species trees, measured in number of generations, was 6 times the effective population size, which is assumed constant. The Yule birth rate was set such that the expected number of species for the given height is 100. Second, for each replicate, 1,000 gene trees were simulated under the multispecies coalescent model. SimPhy (Mallo et al. 2016) was used for these steps. Finally, for each gene tree, DNA sequences with 1,000 sites were generated with Seq-Gen (Rambaut and Grassly 1997) under the HKY mutation model (Hasegawa et al. 1985). The transition to transversion ratio was κ = 6.25, the stationary nucleotide frequencies were π_A = 0.3, π_C = 0.2, π_G = 0.2, and
Towards a general solution for the species tree/gene tree problem
As I have noted above, reticulation can be important for the evolutionary process and has implications for the performance of species tree inference methods: it should not simply be ignored. Ideally, we would want to apply coalescence-based methods that take advantage of the information contained within gene tree differences, including differences in branch lengths (i.e. the relative timing of gene tree divergences), to infer both the sequence and timing of speciation events. We would also want to recover and represent reticulate evolutionary processes accurately and precisely, thus avoiding distortions to species trees caused by gene flow, but without having to know a priori anything about either the species tree or the degree to which it might in fact be a species network. Currently, there is no single method available with which we can achieve this.
We simulated the evolution of the gene trees within the model species tree using our C++ implementation of the duplication-loss model of . We applied LGT events on the evolved gene trees, using the standard subtree transfer model of LGT. One LGT event causes the subtree rooted at a vertex c to be pruned and regrafted at an edge (a, b), where a and b together are not in the path from the root (of the tree) to c. We used a gene duplication and loss (D/L) rate of 0.002 events/gene per tyrs and an LGT rate of 0 to 2 events per gene tree. Note that the gene tree simulations without LGT follow a molecular clock model (equal rates of molecular evolution along all branches of the gene tree), but the simulations with LGT violate the molecular clock. We generated gene trees based on four evolutionary scenarios: i) no duplications, losses, or LGT (called none), ii) D/L rate 0.002 and no LGT (called dl), iii) no duplication or loss, and LGT rate 2 (called lgt), and iv) D/L rate 0.002 and LGT rate 2 (called both). The parameter values (evolutionary scenario and model tree size) for each simulation are called the model condition; 20 model species trees were generated for each model condition. We deleted 0 to 25% of leaves (selected at random) from each gene tree to represent missing or unsampled data, which is common in almost all phylogenomic studies. For each gene tree, we used Seq-Gen  to simulate a DNA sequence alignment of length 500 based on the GTR+Gamma+I model. The parameters of the model were chosen with equal probability from the parameter sets estimated in  on three biological data sets, following . We estimated maximum likelihood trees from each simulated sequence alignment using RAxML , performing searches from 5
The results (Figure 4a and 4b) suggest that as the number of genes increases, the proportion of trials in which MP-EST successfully recovers the true species tree increases to 1, and the MSE of the branch lengths (in coalescent units) appears to decrease to 0, indicating that MP-EST is statistically consistent in estimating the species trees (topology and branch lengths in coalescent units) generated in the simulation. The results for STAR and RT show the same pattern: as the number of genes increases, the proportions of trials yielding the true species tree for both methods increase to 1. Overall, STAR performs slightly better than the other two methods, while MP-EST and RT have similar performance in recovering the true species tree (Figure 4a). A low proportion in Figure 4a does not necessarily imply a large topological difference between the true species tree and the tree estimated by a species tree reconstruction method. For example, the proportion of the MP-EST trees matching the true species tree is 0.47 for the case of 10 genes (Figure 4a), but across all replicates the average Robinson and Foulds (RF) topological distance between the MP-EST tree and the true species tree is 1.35, indicating that on average only one or two internodes (usually with short branches leading to their ancestral nodes) are not successfully recovered.
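To make the topological comparison above concrete, the following is a minimal sketch of a Robinson and Foulds style distance. It assumes rooted trees written as nested Python tuples of leaf names and counts the clades found in one tree but not the other. The function names `clades` and `rf_distance` are illustrative only, and the simulation study may well have used an unrooted or rescaled variant of the distance.

```python
def clades(tree):
    """Return (leaf set, set of clades) for a nested-tuple rooted tree.

    Leaves are strings; internal nodes are tuples of subtrees.
    Assumption for this sketch: trees are rooted and leaf names are unique.
    """
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves = leaves | child_leaves
        found |= child_clades
    found.add(leaves)  # the clade defined by this internal node
    return leaves, found


def rf_distance(tree1, tree2):
    """Number of non-trivial clades present in exactly one of the two trees."""
    leaves1, clades1 = clades(tree1)
    leaves2, clades2 = clades(tree2)
    if leaves1 != leaves2:
        raise ValueError("trees must have the same leaf set")
    # The root clade (all leaves) is shared by construction, so drop it.
    clades1.discard(leaves1)
    clades2.discard(leaves2)
    return len(clades1 ^ clades2)


true_tree = ((("A", "B"), "C"), "D")
estimate = ((("A", "C"), "B"), "D")
print(rf_distance(true_tree, estimate))
```

In this toy case the two trees differ only in the innermost grouping, which illustrates how an estimate can fail to match the true tree exactly while still being topologically close to it.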
anomaly zone has also been useful for designing simulation studies to test species tree inference methods in challenging regions of parameter space (Kubatko and Degnan, 2007; Liu and Edwards, 2009; Liu et al., 2009c; DeGiorgio and Degnan, 2010; Shekhar et al., 2018). Although the theoretical possibility of anomalous gene trees has motivated many methods, the extent to which they arise in practice is less clear. We address this question by estimating how often anomalous gene trees occur under the widely used birth-death models of speciation. We consider three types of anomaly zones, each corresponding to a different type of gene tree: unrooted, unranked, and ranked gene trees. The study of various types of anomaly zones can lead to the discovery of cases in which such zones do not overlap with each other. Because the number of possible tree topologies grows faster than exponentially with the number of species, it is necessary to propose reasonable heuristic approaches to infer whether larger species trees (i.e., more than eight taxa) are in anomaly zones.
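The growth in the number of topologies can be illustrated with a short sketch. It uses the standard count of rooted, binary, leaf-labelled topologies, (2n - 3)!!, which is only one of the tree types discussed here; ranked and unrooted trees have different counts. The function name is illustrative.

```python
def num_rooted_binary_topologies(n_taxa):
    """Count rooted binary leaf-labelled topologies: (2n - 3)!! for n >= 2."""
    if n_taxa < 1:
        raise ValueError("n_taxa must be at least 1")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


for n in (4, 8, 12, 16):
    print(n, num_rooted_binary_topologies(n))
```

With eight taxa the count is already 135,135, which is why exhaustive enumeration quickly stops being practical and heuristics are needed for larger trees.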
Our contribution. In this work, we address the problem of locus tree inference when populational effects are negligible. This allows addressing the locus tree inference problem in a parsimony framework, and adopting a more general approach than presented in . We assume that incongruence between gene and species trees can be caused by locus acquisition events of any kind, including duplications and horizontal gene transfers. We propose to solve the locus tree inference problem by decomposing a binary gene tree into a forest of subtrees that can be embedded into a possibly polytomic species tree, in a way that minimizes the weighted sum of the forest size and the number of loss events. We propose two variants of the problem: the Locus Tree Inference, LTI, in which forest elements are subtrees of the species tree, and the Conditional Locus Tree Inference, CLTI, where each forest element is a subtree of some binarization (full refinement) of the species tree. We show a dynamic programming algorithm that solves LTI in O(|G||S|m) time and O(|G||S|) space, where m is the maximal degree of a node from the species tree. To solve CLTI, we propose a new mapping, called the highest separating rank. Based on the mapping, we show an O(d|G| + |S|) time and O(|G| + |S|) space algorithm, where d is the height of S, for inferring required and conditional duplications in gene trees, which improves an O(|G|(d + m) + |S|) time solution from . Finally, we propose an efficient heuristic to solve CLTI, and present a comparative study on simulated and empirical data.
Median tree problems provide a powerful tool to synthesize large-scale species tree estimates from collections of discordant gene trees [Bininda-Emonds (2004)]. Given a collection of gene trees, median tree problems (also referred to as supertree problems [Bininda-Emonds (2004)]) seek a tree, called a median tree, that minimizes the overall distance to the input trees using a problem-specific measure. Median tree problems that are typically used in practice are NP-hard, and thus have been addressed by standard local search heuristics, which have produced some credible estimates of species trees [Maddison and Knowles (2006); Than and Nakhleh (2009a)]. However, such heuristics are challenged to find a globally optimal species tree in a highly complex solution landscape whose size is double factorial in the size of the searched tree. In addition, this landscape typically has numerous local optima that can trap heuristic approaches [Bansal and Eulenstein (2013)].
gene tree to differ from the species tree (Degnan and Rosenberg 2006).
The fact that trees built from different genes can vary has been recognized for some time. The first method for constructing species trees from incongruent gene trees was published nearly 30 years ago (gene tree parsimony; Goodman et al. 1979). Inference of species trees from gene trees then received relatively little attention except for important developments in the gene tree parsimony approach reconciling gene duplications and losses (Page and Charleston 1997; Page 1998). Reasons for this may include a dearth of data sets with large numbers of genes and also a lack of computational power to implement complex high-dimensional models of gene duplication and coalescent stochasticity. The focus shifted to the development of sophisticated models of nucleotide substitution and application of these models to likelihood and Bayesian inference of individual data sets. Also important was the hope that concatenation of many genes would solve the problem; that with enough sequence data, a predominant signal would emerge and this signal would be equal to the species tree (the “supermatrix” approach; de Queiroz and Gatesy 2007). The landmark phylogenomics paper of Rokas et al. (2003) described the inference of a fully resolved and perfectly supported phylogeny of yeast using a supermatrix approach, despite incongruence between the 106 individual gene trees. This was followed by many studies of the yeast alignment and other data sets, showing that the simple concatenation of genome-scale data could mislead the inference of species phylogenies due to the effects of so-called “nonphylogenetic” signal (Phillips et al. 2004; Jeffroy et al. 2006) or the presence of incomplete lineage sorting (Kubatko and Degnan 2007).
The results for simulated data with a varying amount of noise show a substantial dependence of the accuracy of the reconstructed species trees on the noise model. The results are most resilient against noise model (ii), i.e., overprediction of orthology (see the second column in Figures 27 and 28). Even with 25% orthologous noise, 72% of the species trees could be reconstructed correctly and 93% are reconstructed almost correctly, i.e., with a TT distance less than 0.1. For this data set, a TT distance less than 0.1 corresponds to more than 93% of all triples being correct. However, missing edges in ˜R o , as present in noise models (i) and (iii), have a larger impact. This behavior can be explained by the observation that many false orthologs (overpredicting orthology) lead to an orthology graph whose components are more clique-like and hence yield few informative triples. Incorrect species triples are thus reduced, while missing species triples can often be supplemented through other gene families. On the other hand, if there are many false paralogs (underpredicting orthology), more false species triples are introduced, resulting in inaccurate trees. Xenologous noise (model (iv)), simulated by changing gene/species associations with probability p while retaining the original gene tree, amounts to an extreme model for HGT. The ParaPhylo approach, in particular in the weighted version, is quite robust to xenologous noise of 5% to 10%. Although some incorrect triples are introduced, they are usually dominated by correct alternatives observed from multiple gene families, and are thus excluded during computation of the maximal consistent triple set. Only large-scale concerted HGT, which may occur in long-term endosymbiotic associations (Keeling and Palmer, 2008), poses a serious problem.
The complete results for the 2,000 simulated data sets of ten species and 100 or 1,000 gene families, with a varying amount of noise, are depicted in Figure 27 (first simulation method) and Figure 28 (ALF simulations).
Third, observations of the extent of concordance or discordance of species trees on a chromosome can provide information about past episodes of natural selection. Wiuf et al. (2004) have shown that balancing selection will substantially increase the length of the chromosomal region over which gene trees are concordant or discordant with the species tree. Although the problem has not received formal analysis, it is clear that a selective sweep occurring in the species represented by the internal branch of the species tree (Figure 1, species S4) would have a similar effect. If an allele with selective advantage s is substituted, then loci within a recombination distance of roughly r < s will be likely to coalesce at approximately the same time (Maynard Smith and Haigh 1974; Kaplan et al. 1989), thus ensuring concordance of the gene tree with the species tree. This hitchhiking effect is different from the effect of balancing selection in that it should lead only to concordance of the gene tree with the species tree on a large genomic scale and not to discordance over a comparably large genomic scale, as can balancing selection. With increased availability of genomic data and the rapidly increasing number of species for which whole-genome sequences are available, it will be possible to examine variation in gene trees across genomes. The fine-scale variation in gene trees can reveal aspects of evolutionary history that are not accessible by other means.
Species play an important role in numerous aspects of biology. Representing their evolutionary relationships in species trees as in Fig. 1.1 is crucial for numerous analytical purposes, for instance to raise awareness of human-induced mass extinction. For over 100 years, scientists have estimated species trees based on so-called apomorphies. They considered external characteristics that groups of species share or that are unique to a single species. In 1895, Ernst Haeckel estimated species trees for primates among other species. He distinguished between, for example, species with claws and species with fingernails, and considered how the nostrils are arranged. Fig. 1.1a shows one of his species trees.
Species networks generalize the notion of species trees to allow for hybridization or other lateral gene transfer. Under the network multispecies coalescent model, individual gene trees arising from a network can have any topology, but arise with frequencies dependent on the network structure and numerical parameters. We propose a new algorithm for statistical inference of a level-1 species network under this model, from data consisting of gene tree topologies, and provide the theoretical justification for it. The algorithm is based on an analysis of quartets displayed on gene trees, combining several statistical hypothesis tests with combinatorial ideas such as a quartet-based intertaxon distance appropriate to networks, the NeighborNet algorithm for circular split systems, and the Circular Network algorithm for constructing a splits graph.
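As a rough illustration of what tallying quartets displayed on gene trees involves, the sketch below counts, for one set of four taxa, how often each quartet topology appears across a collection of gene trees. It is not the inference algorithm described above: it covers only the counting step, assumes rooted binary gene trees written as nested tuples that each contain all four taxa, and uses illustrative function names.

```python
from collections import Counter


def leaf_sets(tree, out):
    """Append the leaf set of every internal node of a nested-tuple tree to out."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def displayed_quartet(tree, taxa):
    """Return the split {pair, pair} that the tree displays on four taxa.

    Returns None if no clade separates the taxa two-versus-two, which can
    happen for unresolved trees. Assumes all four taxa occur in the tree.
    """
    quartet = frozenset(taxa)
    clades = []
    leaf_sets(tree, clades)
    for clade in clades:
        inside = clade & quartet
        if len(inside) == 2:
            return frozenset([inside, quartet - inside])
    return None


def quartet_counts(gene_trees, taxa):
    """Tally the displayed quartet topology over a collection of gene trees."""
    return Counter(displayed_quartet(tree, taxa) for tree in gene_trees)


gene_trees = [
    ((("a", "b"), "c"), ("d", "e")),
    (("a", "b"), (("c", "d"), "e")),
    (("a", "c"), (("b", "d"), "e")),
]
counts = quartet_counts(gene_trees, ("a", "b", "c", "d"))
for split, n in counts.items():
    print(sorted(sorted(side) for side in split), n)
```

Tables of such counts, one per four-taxon subset, are the kind of summary to which statistical hypothesis tests on quartet frequencies can then be applied.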
study, we had two objectives. Firstly, to investigate the cause of decline and death of Acacia species trees around Windhoek. This is in order to develop management strategies to reduce the impact of the disease. Secondly, to survey for fungal pathogens
8. Merkle D, Middendorf M: Reconstruction of the cophylogenetic history of related phylogenetic trees with divergence timing information. Theory Biosci 2005, 123(4):277-299.
9. Doyon JP, Scornavacca C, Gorbunov KY, Szöllősi GJ, Ranwez V, Berry V: An efficient algorithm for gene/species trees parsimonious reconciliation with losses, duplications and transfers. In Research in Computational Molecular Biology: Proceedings of the 14th International Conference on Research in Computational Molecular Biology (RECOMB). Volume 6398. LNCS, Springer, Berlin/Heidelberg, Germany; 2010:93-108. Software downloadable at http://www.atgc-montpellier.fr/Mowgli/.
Cryptococcus uzbekistanensis is a non-capsulated yeast that was first isolated from a desert soil sample from near Bukhara, Uzbekistan, in 1999 by Chernov et al. Afterwards, Fonseca et al. reported that this species forms glossy, smooth, cream to pinkish-cream colonies on yeast mold agar, with a soft and butyrous texture. A review of veterinary and medical papers reveals that C. uzbekistanensis had never been isolated from an infection in humans or animals until Powel et al. reported the first case of cryptococcosis due to C. uzbekistanensis from the bone marrow of an immunocompromised patient with pancytopenia. Further, the isolation of C. uzbekistanensis from dust in US military samples has been reported in the Middle East. We also isolated one case of C. uzbekistanensis from pigeon droppings in pet shops. We also reported an isolate of Filobasidium uniguttulatum from pet shops. Filobasidium uniguttulatum is a teleomorphic fungus that was first isolated in 1934 from an infected human nail and was then identified as Eutorulopsis uniguttulata. On the basis of physiological and morphological similarities with C. neoformans, Eutorulopsis uniguttulata was renamed Cryptococcus neoformans var. uniguttulatus, but with less capsule formation.
Abstract: New Zealand forests grow under highly oceanic climates on an isolated southern archipelago. They experience a combination of historical and environmental factors matched nowhere else. This paper explores whether the New Zealand tree flora also differs systematically from those found in other temperate and island areas. A compilation of traits and distributions from standard floras is used to compare the New Zealand tree flora with those of Europe, North America, Chile, southern Australia, Fiji and Hawaii. New Zealand has a large number of trees (215 species ≥6 m in height). It is more tree-rich than temperate North America and Europe, having up to 50% more species at a quadrat scale of 2.5° latitude × 2.5° longitude. However, this richness is due to a greater abundance of small trees (≤15 m in height) and we argue that it is a legacy of allopatric speciation and radiation during the late Neogene (2.5–10 million yrs ago) when the New Zealand landmass was repeatedly split into smaller island groups and mountain building occurred. The leaves of New Zealand trees, along with those of southeast Australia, are smaller and narrower than those of the temperate northern hemisphere. Dominance of the canopy by small-leaved evergreen conifers and angiosperms may have facilitated the persistence of small tree species in the lower canopy. The proportion of tree species with a deciduous or divaricating habit, and toothed-margin leaves, increases with latitude, suggesting a link with lower winter temperatures in the south. Tree species richness decreases with increasing latitude and, in conformity with Rapoport’s Rule, latitudinal range width increases. Wide-range trees are mainly bird-dispersed, fast-growing seral small trees, or long-lived, tall podocarps. Wide-range trees appear to have no greater tolerance of climate extremes than narrow-range trees, and their persistence at high latitudes derives from their enhanced colonization ability.
Since the Masoretic annotation is supposed to mark the structure of every verse unambiguously, we expect to parse every verse successfully with exactly one tree assigned to it, given that (1) the annotation is perfectly correct and (2) the CFG grammars correctly encode the annotation rules. The actual results were close to our expectation: all 23213 verses were successfully parsed, of which 23099 received exactly one complete tree. The success rate is 99.5 percent. The 114 verses that received multiple parse trees all have words that carry more than one cantillation mark. This can of course create boundary ambiguities and result in multiple parse trees. We have good reasons to believe that the grammars we used are correct. We would have failed to parse some verses if the grammars had been incomplete, and we would have gotten multiple trees for a much greater number of verses if the grammars had been ambiguous.
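The "exactly one tree" criterion can be checked mechanically by counting parse trees rather than merely testing whether a parse exists. The sketch below does this with a CYK-style chart for a grammar in Chomsky normal form. The two toy grammars are invented for illustration and have nothing to do with the cantillation grammars used in the study; `count_parses` is a hypothetical helper name.

```python
from collections import defaultdict


def count_parses(tokens, lexical, binary, start="S"):
    """Count parse trees of tokens under a grammar in Chomsky normal form.

    lexical maps a token to the nonterminals that can produce it.
    binary maps a pair (left, right) to the nonterminals with that right-hand side.
    """
    n = len(tokens)
    if n == 0:
        return 0
    # chart[i][j][X] = number of distinct trees rooted at X spanning tokens[i:j]
    chart = [[defaultdict(int) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, token in enumerate(tokens):
        for nonterminal in lexical.get(token, ()):
            chart[i][i + 1][nonterminal] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for left, left_count in chart[i][k].items():
                    for right, right_count in chart[k][j].items():
                        for parent in binary.get((left, right), ()):
                            chart[i][j][parent] += left_count * right_count
    return chart[0][n].get(start, 0)


# Toy ambiguous grammar: S -> S S | 'a'
ambiguous = count_parses(["a", "a", "a"], {"a": ["S"]}, {("S", "S"): ["S"]})

# Toy unambiguous grammar: S -> A S | 'a', A -> 'a'
unambiguous = count_parses(["a", "a", "a"], {"a": ["A", "S"]}, {("A", "S"): ["S"]})

print(ambiguous, unambiguous)
```

A count of zero corresponds to a failed parse (an incomplete grammar or an annotation error), a count of one to the expected unambiguous case, and a count above one to the kind of boundary ambiguity reported for verses whose words carry more than one mark.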