We apply boosting, a machine learning approach, to train regression tree for the inference of de-mographic history of IM model. There are five main parameters in IM model to infer: population splitting time T , two population size parameter θ1 and θ2, migration rate in two directions m1 and m2. We use simulated multi-locus SNP data with known parameters to collect a list of summary statistics that can describe two populations, and then train five boosted regression trees to infer population splitting time T , two population size parameter θ1 and θ2, migration rate in two direc-tions m1 and m2. We find that the estimate of T , θ1 and θ2 can help the inference accuracy of migration rates. In the results, we present five basic trained boosted regression trees for inference of those parameters. Meanwhile, there is a lot to explore and improve for future work.
Speaking of large genomic data, it is believed that it can help to increase the inference accuracy as we have more information. Here we use multi-locus SNP data, the inference of demographic parameters can be improved by analyzing them together. As shown in Methods, the boosted re-gression tree gives an estimate for each locus, then a simple way to improve accuracy would be taking the average over multiple loci.
Now we only focus on the basic two-population IM model. In real data, it is often rare that two
Figure 4.4: Training and Testing accuracy for migration rates with various training parameters:
learning rate and number of rounds. For various learning rate, number of rounds are fixed to be 100. Then we use the best learning rate to optimize the number of rounds.
Figure 4.5: Boxplot for migration rate M1 and M2.
populations are completely isolated with other populations. Thus, an extension to multiple popula-tion based on k-populapopula-tion (k > 2) IM model is necessary. Thus, we aim to develop a framework that can deal with multiple population multi-locus genetic data. Then extensive experiments on both simulated data and real data need to be conducted, including comparing to existing method IM a2 and other tools.
Chapter 5
Conclusion
Inference of population history is the central problem in population genetics. In this thesis, we focus on three different inference problems in population genetics and aim at developing accurate method and fast algorithm to resolve problems.
The first problem we focus on is a problem of ancestry inference. The inference of ancestry often forms the basis of many standard population genetic analyses. However, the commonly used methods for admixture inference do not allow estimation of ancestry components separately for the two parents. We develop a method for inference of admixture proportions of recent ancestors such as parents and grandparents. To the best of our knowledge, there are no other methods for inferring admixture proportions for grandparents or great grandparents. Our method uses a two-stage model for maximum likelihood estimation (MLE) based on a Hidden Markov Model (HMM). The two-stage model includes perfect pedigree, which is a reasonable model for recent genealogical history of a single individual. The key idea of method is using the distribution of admixture tracts which is influenced by the ancestral admixture proportions. Admixture tracts capture the important linkage disequilibrium information. We can show that treating SNPs as independent sites is insufficient for inferring ancestral admixture proportions in general. Inference accuracy is affected by various parameters in population genetic process. Among them, it can be significantly influenced by re-combination rate and more ancient admixture. Our HMM-based method can be extended to more
distant ancestors. However, as the number of generations increases, the amount of information for each ancestor loses and the computational burden increases.
We then focus on a problem of species delimitation. Species are considered to have fundamen-tal importance in ecological and evolutionary studies. Existing methods for species delimitation work less well for “harder” cases when the genetic isolation timescale is near the evolutionary threshold and when migration event involves in the evolutionary process. We propose a classi-fication based method, called CLADES (which stands for CLAssiclassi-fication based DElimitation of Species), for species delimitation. It is designed to answer two questions: (i) Can one design more accurate and efficient methods that give more accurate species delimitation results than existing methods at these harder cases? (ii) Can such method also reliably find the underlying genetic isolation with gene flow? CLADES works well when the gene flow between two populations is moderate. It is efficient and can be applied to genomic-level data. It is also flexible to be extended and designed for specific cases.
Then we explore the possibility to use machine learning approach to infer demographic history under Isolation-with-migration (IM) model. We propose a machine learning based method to infer the divergence of two population and their migration pattern based on IM model. We apply boosting to train regression trees for estimate of five parameters with continuous values. The testing accuracy of boosted regression trees seem reasonable. A more complete framework needs to be constructed for multiple closely related populations. And this will be our future work.
Bibliography
Alexander, D. H., Novembre, J., and Lange, K. (2009). Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9):1655–1664.
Anderson, C. A., Pettersson, F. H., Clarke, G. M., Cardon, L. R., Morris, A. P., and Zondervan, K. T. (2010). Data quality control in genetic case-control association studies. Nature protocols, 5(9):1564.
Beaumont, M. A., Cornuet, J.-M., Marin, J.-M., and Robert, C. P. (2009). Adaptive approximate bayesian computation. Biometrika, 96(4):983–990.
Beaumont, M. A., Zhang, W., and Balding, D. J. (2002). Approximate bayesian computation in population genetics. Genetics, 162(4):2025–2035.
Becquet, C. and Przeworski, M. (2007). A new approach to estimate parameters of speciation models with application to apes. Genome research, 17(10):1505–1519.
Beerli, P. and Felsenstein, J. (2001). Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. Proceedings of the National Academy of Sciences, 98(8):4563–4568.
Birky Jr, C. W. (2013). Species detection and identification in sexual organisms using population genetic theory and dna sequences. PLoS One, 8(1):e52544.
Carstens, B. C., Pelletier, T. A., Reid, N. M., and Satler, J. D. (2013). How to fail at species delimitation. Molecular ecology, 22(17):4369–4383.
Chang, C. and Lin, C. (2011). Acm transactions on intelligent systems and technology. ACM Trans Intell Syst Technol, 2(3):27.
Chen, G. K., Marjoram, P., and Wall, J. D. (2009). Fast and flexible simulation of dna sequence data. Genome research, 19(1):136–142.
Chu, C., Li, X., and Wu, Y. (2015). Splicejumper: a classification-based approach for calling splicing junctions from rna-seq data. BMC bioinformatics, 16(17):S10.
Chu, C., Li, X., and Wu, Y. (2017). Gappadder: A sensitive approach for closing gaps on draft genomes with short sequence reads. In Computational Advances in Bio and Medical Sciences (ICCABS), 2017 IEEE 7th International Conference on, pages 1–1. IEEE.
Chu, C., Nielsen, R., and Wu, Y. (2016). Repdenovo: inferring de novo repeat motifs from short sequence reads. PloS one, 11(3):e0150719.
Chu, C., Wu, Y., et al. (2014a). An svm-based approach for discovering splicing junctions with rna-seq. In 2014 IEEE 4th International Conference on Computational Advances in Bio and Medical Sciences (ICCABS), pages 1–1. IEEE.
Chu, C., Zhang, J., and Wu, Y. (2014b). Gindel: accurate genotype calling of insertions and deletions from low coverage population sequence reads. PloS one, 9(11):e113324.
Consortium, . G. P. et al. (2015). A global reference for human genetic variation. Nature, 526(7571):68.
Cortes, C. and Vapnik, V. (1995). Support vector machine. Machine learning, 20(3):273–297.
Ence, D. D. and Carstens, B. C. (2011). Spedestem: a rapid and accurate method for species delimitation. Molecular Ecology Resources, 11(3):473–480.
Ewing, G. and Hermisson, J. (2010). Msms: a coalescent simulation program including recombina-tion, demographic structure and selection at a single locus. Bioinformatics, 26(16):2064–2065.
Excoffier, L., Dupanloup, I., Huerta-S´anchez, E., Sousa, V. C., and Foll, M. (2013). Robust demo-graphic inference from genomic and snp data. PLoS genetics, 9(10):e1003905.
Excoffier, L., Estoup, A., and Cornuet, J.-M. (2005). Bayesian analysis of an admixture model with mutations and arbitrarily linked markers. Genetics, 169(3):1727–1738.
Fei, L., Hu, S., Ye, C., Huang, Y., et al. (2009). Fauna sinica. amphibia vol. 2 anura.
Frost, D. R. et al. (2011). Amphibian species of the world: an online reference. Version, 5(31):01.
Gillespie, J. H. (2010). Population genetics: a concise guide. JHU Press.
Gravel, S. (2012). Population genetics models of local ancestry. Genetics, 191(2):607–619.
Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H., and Bustamante, C. D. (2009). Inferring the joint demographic history of multiple populations from multidimensional snp frequency data. PLoS genetics, 5(10):e1000695.
Hartigan, J. A. and Wong, M. A. (1979). Algorithm as 136: A k-means clustering algorithm.
Journal of the Royal Statistical Society. Series C (Applied Statistics), 28(1):100–108.
Hartl, D. L., Clark, A. G., and Clark, A. G. (1997). Principles of population genetics, volume 116.
Sinauer associates Sunderland.
Hausdorf, B. (2011). Progress toward a general species concept. Evolution, 65(4):923–931.
Hellenthal, G. and Stephens, M. (2007). mshot: modifying hudson’s ms simulator to incorporate crossover and gene conversion hotspots. Bioinformatics, 23(4).
Hey, J. (2009). Isolation with migration models for more than two populations. Molecular biology and evolution, 27(4):905–920.
Hey, J. and Nielsen, R. (2007). Integration within the felsenstein equation for improved markov chain monte carlo methods in population genetics. Proceedings of the National Academy of Sciences, 104(8):2785–2790.
Hudson, R. R. (2002a). Generating samples under a wright–fisher neutral model of genetic varia-tion. Bioinformatics, 18(2):337–338.
Hudson, R. R. (2002b). Generating samples under a wright–fisher neutral model of genetic varia-tion. Bioinformatics, 18(2):337–338.
Idury, R. M. and Elston, R. C. (1997). A faster and more general hidden markov model algorithm for multipoint likelihood calculations. Human heredity, 47(4):197–202.
Jackson, N. D., Carstens, B. C., Morales, A. E., and O’Meara, B. C. (2017). Species delimitation with gene flow. Systematic Biology, 66(5):799–812.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An introduction to statistical learning, volume 112. Springer.
Jones, G., Aydin, Z., and Oxelman, B. (2014). Dissect: an assignment-free bayesian discovery method for species delimitation under the multispecies coalescent. Bioinformatics, 31(7):991–
998.
Kruglyak, L. and Lander, E. S. (1998). Faster multipoint linkage analysis using fourier transforms.
Journal of Computational Biology, 5(1):1–7.
Kuhner, M. K., Beerli, P., Yamato, J., and Felsenstein, J. (2000). Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics, 156(1):439–447.
Lander, E. S. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans.
Proceedings of the National Academy of Sciences, 84(8):2363–2367.
Leache, A., Fujita, M. K., Minin, V., and Bouckaert, R. R. (2014). Species delimitation using genome-wide snp data. Syst. Biol., 63:534–542.
Leuenberger, C. and Wegmann, D. (2009). Bayesian computation and model selection without likelihoods. Genetics.
Li, J. Z., Absher, D. M., Tang, H., Southwick, A. M., Casto, A. M., Ramachandran, S., Cann, H. M., Barsh, G. S., Feldman, M., Cavalli-Sforza, L. L., et al. (2008). Worldwide human relationships inferred from genome-wide patterns of variation. science, 319(5866):1100–1104.
Li, X., Chu, C., Pei, J., Mandoiu, I., and Wu, Y. (2017). Circmarker: A fast and accurate algorithm for circular rna detection. bioRxiv, page 134411.
Liang, M. and Nielsen, R. (2014). The lengths of admixture tracts. Genetics, 197(3):953–967.
Lin, K., Futschik, A., and Li, H. (2013). A fast estimate for the population recombination rate based on regression. Genetics, 194(2):473–484.
Lu, B., Bi, K., and Fu, J. (2014). A phylogeographic evaluation of the amolops mantzorum species group: cryptic species and plateau uplift. Molecular phylogenetics and evolution, 73:40–52.
Maples, B. K., Gravel, S., Kenny, E. E., and Bustamante, C. D. (2013). Rfmix: a discriminative modeling approach for rapid and robust local-ancestry inference. The American Journal of Human Genetics, 93(2):278–288.
Margalit, Y., Baran, Y., and Halperin, E. (2015). Multiple-ancestor localization for recently ad-mixed individuals. In International Workshop on Algorithms in Bioinformatics, pages 121–135.
Springer.
Mayr, E. (1976). Species concepts and definitions. In Topics in the Philosophy of Biology, pages 353–371. Springer.
Mirzaei, S. and Wu, Y. (2016). Rent+: an improved method for inferring local genealogical trees from haplotypes with recombination. Bioinformatics, 33(7):1021–1030.
Naduvilezhath, L., Rose, L. E., and Metzler, D. (2011). Jaatha: a fast composite-likelihood ap-proach to estimate demographic parameters. Molecular Ecology, 20(13):2709–2723.
Pas¸aniuc, B., Sankararaman, S., Kimmel, G., and Halperin, E. (2009). Inference of locus-specific ancestry in closely related populations. Bioinformatics, 25(12):i213–i221.
Pei, J., Chu, C., Li, X., Lu, B., and Wu, Y. (2018a). Clades: A classification-based machine learn-ing method for species delimitation from population genetic data. Molecular ecology resources.
Pei, J., Nielsen, R., and Wu, Y. (2018b). Inferring the ancestry of parents and grandparents from genetic data. bioRxiv, page 308494.
Pei, J., Wang, J., et al. (2017). A fast optimization algorithm for k-coverage problem. In Interna-tional Conference on Intelligent Computing, pages 703–714. Springer.
Pei, J. and Wu, Y. (2017). Stells2: fast and accurate coalescent-based maximum likelihood infer-ence of species trees from gene tree topologies. Bioinformatics, 33(12):1789–1797.
Pons, J., Barraclough, T. G., Gomez-Zurita, J., Cardoso, A., Duran, D. P., Hazell, S., ..., and Vogler, A. P. (2006). Sequence-based species delimitation for the dna taxonomy of undescribed insects.
Systematic biology, 55(4):595–609.
Pool, J. E., Hellmann, I., Jensen, J. D., and Nielsen, R. (2010). Population genetic inference from genomic sequence variation. Genome research, 20(3):291–300.
Prado-Martinez, J., Sudmant, P. H., Kidd, J. M., Li, H., Kelley, J. L., Lorente-Galdos, B., ..., and Marques-Bonet, T. (2013). Great ape genetic diversity and population history. Nature, 499(7459):471.
Price, A. L., Tandon, A., Patterson, N., Barnes, K. C., Rafaels, N., Ruczinski, I., Beaty, T. H., Mathias, R., Reich, D., and Myers, S. (2009). Sensitive detection of chromosomal segments of distinct ancestry in admixed populations. PLoS genetics, 5(6):e1000519.
Pritchard, J. K., Stephens, M., and Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155(2):945–959.
Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., Maller, J., Sklar, P., De Bakker, P. I., Daly, M. J., et al. (2007). Plink: a tool set for whole-genome association and population-based linkage analyses. The American Journal of Human Genetics, 81(3):559–575.
Rannala, B. (2015). The art and science of species delimitation. Current Zoology, 61(5):846–853.
Rannala, B. and Yang, Z. (2003). Bayes estimation of species divergence times and ancestral population sizes using dna sequences from multiple loci. Genetics, 164(4):1645–1656.
Rannala, B. and Yang, Z. (2013). Improved reversible jump algorithms for bayesian species de-limitation. Genetics, 194(1):245–253.
Rona, L. D., Carvalho-Pinto, C. J., Mazzoni, C. J., and Peixoto, A. A. (2010). Estimation of divergence time between two sibling species of the anopheles (kerteszia) cruzii complex using a multilocus approach. BMC evolutionary biology, 10(1):91.
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. (2002a). Genetic structure of human populations. science, 298(5602):2381–
2385.
Rosenberg, N. A., Pritchard, J. K., Weber, J. L., Cann, H. M., Kidd, K. K., Zhivotovsky, L. A., and Feldman, M. W. (2002b). Genetic structure of human populations. science, 298(5602):2381–
2385.
Royal, C. D., Novembre, J., Fullerton, S. M., Goldstein, D. B., Long, J. C., Bamshad, M. J., and Clark, A. G. (2010). Inferring genetic ancestry: opportunities, challenges, and implications. The American Journal of Human Genetics, 86(5):661–673.
Sankararaman, S., Sridhar, S., Kimmel, G., and Halperin, E. (2008). Estimating local ancestry in admixed populations. The American Journal of Human Genetics, 82(2):290–303.
Sheehan, S. and Song, Y. S. (2016). Deep learning for population genetic inference. PLoS compu-tational biology, 12(3):e1004845.
Stephens, M. and Donnelly, P. (2000). Inference in molecular population genetics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(4):605–635.
Sukumaran, J. and Knowles, L. L. (2017). Multispecies coalescent delimits structure, not species.
Proceedings of the National Academy of Sciences, 114(7):1607–1612.
Takahata, N., Satta, Y., and Klein, J. (1995). Divergence time and population size in the lineage leading to modern humans. Theoretical population biology, 48(2):198–221.
Tang, H., Coram, M., Wang, P., Zhu, X., and Risch, N. (2006). Reconstructing genetic ancestry blocks in admixed individuals. The American Journal of Human Genetics, 79(1):1–12.
Tang, H., Peng, J., Wang, P., and Risch, N. J. (2005). Estimation of individual admixture: analytical and study design considerations. Genetic epidemiology, 28(4):289–301.
Wakeley, J., King, L., Low, B. S., and Ramachandran, S. (2012). Gene genealogies within a fixed pedigree, and the robustness of kingman’s coalescent. Genetics, 190(4):1433–1445.
Wegmann, D., Leuenberger, C., and Excoffier, L. (2009). Efficient approximate bayesian compu-tation coupled with markov chain monte carlo without likelihood. Genetics, 182(4):1207–1218.
Wiley, E. O. (1978). The evolutionary species concept reconsidered. Syst. Biol., 27:17–26.
Wilkins, J. S. (2009). Species: a history of the idea, volume 1. Univ of California Press.
Yang, W.-Y., Platt, A., Chiang, C. W.-K., Eskin, E., Novembre, J., and Pasaniuc, B. (2014). Spa-tial localization of recent ancestors for admixed individuals. G3: Genes, Genomes, Genetics, 4(12):2505–2518.
Yang, Z. and Rannala, B. (2010). Bayesian species delimitation using multilocus sequence data.
Proceedings of the National Academy of Sciences, 107(20):9264–9269.
Zhang, C., Zhang, D.-X., Zhu, T., and Yang, Z. (2011). Evaluation of a bayesian coalescent method of species delimitation. Systematic biology, 60(6):747–761.
Zhang, D.-X. and Hewitt, G. M. (2003). Nuclear dna analyses in genetic studies of populations:
practice, problems and prospects. Molecular ecology, 12(3):563–584.
Zhang, X., Chen, H., Zhang, R., Pei, J., Wang, Y., Zhao, Z., Huang, Y., and Wang, J. (2017). De-tecting complex indels with wide length-spectrum from the third generation sequencing data. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 1980–
1987. IEEE.
Zou, J. Y., Halperin, E., Burchard, E., and Sankararaman, S. (2015). Inferring parental genomic ancestries using pooled semi-markov processes. Bioinformatics, 31(12):i190–i196.