
Estimating Ancestry and Genetic Diversity in Admixed Populations.




University of New Mexico

UNM Digital Repository

Anthropology ETDs Electronic Theses and Dissertations

5-1-2016

Estimating Ancestry and Genetic Diversity in Admixed Populations.

Anthony Koehl

Follow this and additional works at: https://digitalrepository.unm.edu/anth_etds

Part of the Anthropology Commons

This Dissertation is brought to you for free and open access by the Electronic Theses and Dissertations at UNM Digital Repository. It has been accepted for inclusion in Anthropology ETDs by an authorized administrator of UNM Digital Repository. For more information, please contact disc@unm.edu.

Recommended Citation

Koehl, Anthony. "Estimating Ancestry and Genetic Diversity in Admixed Populations." (2016). https://digitalrepository.unm.edu/anth_etds/39


Anthony Joseph Koehl

Candidate

Anthropology

Department

This dissertation is approved, and it is acceptable in quality and form for publication:

Approved by the Dissertation Committee:

Jeffrey Long PhD, Chair

Keith Hunley PhD, Member

Osbjorn Pearson PhD, Member

Lindsay Smith PhD, Member


Estimating Ancestry and Genetic Diversity

in Admixed Populations

BY

Anthony Joseph Koehl

B.S. Anthropology, Northern Kentucky University, 2003

M.S. Human Biology, University of Indianapolis, 2009

M.A. Anthropology, University of New Mexico, 2013

Dissertation

Submitted in Partial Fulfillment of the Requirements of the Degree of

Doctor of Philosophy

Anthropology

The University of New Mexico

Albuquerque, New Mexico


ACKNOWLEDGMENTS

I wish to wholeheartedly acknowledge Dr. Jeffrey Long, my doctoral advisor and dissertation chair, who worked tirelessly to advance me through this stage of my career. Dr. Long’s commitment in the classroom provided me with the knowledge to undertake this research. His door was always open, which allowed me to advance my research and my understanding of genetics, and for that I am eternally grateful. He is the greatest teacher I have had and the best student I have ever seen. Dr. Long has been committed to me through this process as my mentor, and I hope that as I advance my career he will maintain that commitment as a colleague and as a friend.

In addition, I wish to thank my committee members, Dr. Keith Hunley, Dr. Ozzie Pearson, Dr. Mark Shriver, and Dr. Lindsay Smith, for their insight in improving this work and for advancing me as a professional in the field of anthropology. I hope to have the opportunity to collaborate with all of them in the future.

Finally, to my friends and family: without you I could not have succeeded in this endeavor. I am forever thankful to you for your support and camaraderie.


Estimating Ancestry and Genetic Diversity in Admixed Populations

by

Anthony Joseph Koehl

B.S. Anthropology, Northern Kentucky University, 2003

M.S. Human Biology, University of Indianapolis, 2009

M.A. Anthropology, University of New Mexico, 2013

Ph.D. Anthropology, University of New Mexico

ABSTRACT

Admixture is a form of gene flow that occurs when long-separated populations come into contact and exchange mates. Admixture has been a primary mechanism in the formation of many modern human populations. The genetic characteristics of an admixed population are intermediate to, yet distinct from, those of its ancestors. In this dissertation, I investigate biological and statistical factors that enter into the analysis of admixed populations using genetic marker data. In chapters two and three, I use genotype data from published sources that contain 618 microsatellite loci. In chapter four, I simulate genotypes of 500 microsatellite loci.

In chapter two, I present an analysis of genetic diversity within and among 17 populations in the Americas that were formed by admixture among continental Indigenous Americans, Africans, and Europeans. This is the first application of a new method to partition the genetic distance between pairs of populations into components related to ancestry and genetic drift. I show that the genetic relationships among the continental sources and genetic drift occurring after population formation strongly influence the genetic structure of these populations.

In chapter three, I investigate a new strategy to find modern populations to serve as models for ancestors in admixture events that occurred in the past. This is a long-standing challenge to admixture studies. This chapter focuses on the Cape Coloured people of South Africa, a population that formed by mixture of indigenous Africans, Europeans, and Asians. I propose a series of models for their ancestry and use the Akaike Information Criterion to choose the best model. This method from information theory identifies a simple model that proposes only African and Asian ancestors. I interpret this result in terms of both the principle of parsimony and the evolutionarily recent common ancestor of the human species.

In chapter four, I use computer simulations to assess bias in ancestry fractions estimated by using maximum likelihood. These novel simulations were designed to produce data sets that mimic actual patterns of variation in human populations. I have found sampling strategies that produce reasonably unbiased results, despite the potential for maximum likelihood to produce biased estimates.


Contents

1 Introduction

2 The Contributions of Admixture and Genetic Drift to Diversity Among Post-Contact Populations in the Americas

2.1 Overview

2.2 Introduction

2.3 Population Genetic Model

2.4 Materials and Methods

2.5 Results

2.6 Discussion

3 Identifying the Number of Source Populations and Their Identities in Genetic Ancestry Analyses

3.1 Overview

3.2 Introduction

3.3 Founding of the Cape Coloured People

3.4 Materials and Methods

3.5 Results

3.6 Discussion

4 Using Contemporary Populations as Pseudo-Ancestors to Estimate Ancestry Fractions

4.1 Overview

4.2 Introduction

4.3 Materials and Methods

4.4 Results

4.5 Discussion


List of Figures

2.1 Schematic showing the independent contributions of admixture and genetic drift to genetic distance.

2.2 Inferred average continental ancestry of 49 populations.

2.3 Principal coordinates one and two of Nei’s minimum genetic distances among the 49 populations in our analysis.

2.4 (Top) Positions of the 17 post-contact populations along the principal eigenvector of the ancestry component of the genetic distance matrix. (Bottom) The positions of the 17 post-contact populations along the ten principal eigenvectors of the drift component of the genetic distance matrix.

2.5 Three sets of pairwise comparisons which display their overall genetic distance and the percent of that distance due to drift.

3.1 Timeline of the major historical events in the Cape Colony.

3.2 Twenty-six models, which serve as hypotheses in testing ancestry among the Cape Coloured population of South Africa.

3.3 Scatter plots and their R² values for three of the 26 potential models of ancestry for the Cape Coloured of South Africa.

3.4 The distribution of ancestry fraction estimates among the source regions, across all models.

4.1 A population tree that serves as a reference for our simulations.

4.2 Model one (left) simulation to estimate ancestry from the direct source descendants. Model two (right) estimates ancestry fractions from closely related pseudo-ancestral sources.

4.3 Model three (left) estimates ancestry from the most distantly related pseudo-ancestor in each region. Model four (right) estimates ancestry from multiple pseudo-ancestors in each region.

4.4 Model five (left) estimates ancestry in a simulated Latin American population from pseudo-ancestors who are descended from the true ancestral populations. Model six (right) estimates the ancestry of a simulated Latin American population from pseudo-ancestors that are closely related to the true ancestors.

4.5 Model seven (left) estimates the ancestry of a simulated Latin American population from a distantly related pseudo-ancestor from each continental region. Model eight (right) estimates the ancestry of a simulated Latin American population from multiple pseudo-ancestors per continental region.

4.6 Results from model one, which estimates ancestry in a simulated African-American population from the pseudo-ancestors that are the descendants of the ancestral sources.

4.7 Results from model two, which estimates ancestry in a simulated African-American population from ancestral proxies that are contemporary populations closely related to the actual ancestral sources.

4.8 Results from model three, which estimates ancestry in a simulated African-American population from pseudo-ancestors that are distantly related to the actual ancestral sources in their continental regions.

4.9 Results from model four, which estimates ancestry in a simulated African-American population using multiple related pseudo-ancestors from their continental regions.

4.10 Results from model five, which estimates ancestry in a simulated Latin American population from the contemporary descendants of the ancestral sources.

4.11 Results from model six, which estimates ancestry in a simulated Latin American population from contemporary samples that serve as pseudo-ancestors and are closely related to the true ancestors.

4.12 Results from model seven, which estimates ancestry in a simulated Latin American population from pseudo-ancestors that are distantly related to the actual ancestral sources.

4.13 Results from model eight, which estimates ancestry in a simulated Latin American population from multiple pseudo-ancestors from each continental region.


List of Tables

2.1 Sampled contemporary populations that serve as ancestral proxies in our analyses, along with their associated sample sizes, global locations, and primary references.

2.2 Sampled admixed populations used in our analyses, along with their associated sample sizes, global locations, and primary references.

2.3 The post-contact populations included in our analyses with their associated sample sizes, inferred average continental ancestry, FST, and log likelihood estimates.

2.4 Nei’s minimum genetic distance for all the admixed populations included in our analyses.

2.5 Ancestry partition of Nei’s minimum genetic distance for the admixed populations included in our analyses.

2.6 Drift partition of Nei’s minimum genetic distance for the admixed populations included in our analyses.

3.1 Populations used in our analyses, samples obtained from Pemberton et al. (2013).

3.2 Model rankings for the putative ancestry for the Cape Coloured people.

3.3 The correlation values of the observed allele frequencies among the

Chapter 1

Introduction

In this dissertation, I use a common statistical approach for my admixture analyses. Tang et al. (2005) developed the statistical method, which uses maximum likelihood to estimate ancestry in admixed populations from genotype data obtained from contemporary populations. The log likelihood is

$$\ln L(\theta) = \sum_{s=1}^{S} \sum_{i=1}^{N_s} \sum_{l=1}^{L} \sum_{j=1}^{J_l} g_{silj} \ln(y_{silj}) \qquad (1.1)$$

where

$$y_{silj} = \sum_{k=1}^{K} p_{jlk} \, m_{ik}$$

is the predicted frequency of the $j$th allele at the $l$th locus in the $i$th individual in the $s$th sample. The genotype data $g_{silj}$ are the counts of the $j$th allele ($j = 1 \ldots J_l$) observed at the $l$th locus ($l = 1 \ldots L$) from the $i$th person ($i = 1 \ldots N_s$) belonging to the $s$th sample ($s = 1 \ldots S$). The parameters ($\theta = [p, m]$) are $p_{jlk}$, the frequency of the $j$th allele from the $l$th locus from the $k$th source population ($k = 1 \ldots K$), and $m_{ik}$, the fraction of ancestry from the $k$th source population in the $i$th individual.
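As a rough illustration of Eq. 1.1, the following Python sketch evaluates the log likelihood for a single sample, so the sum over samples is dropped. The nested-list layout (counts[i][l][j], p[l][j][k], m[i][k]), the function name, and the toy numbers are assumptions made for this example; they are not taken from Tang et al. (2005) or from the software used in this dissertation.

```python
import math


def log_likelihood(counts, p, m):
    """Log likelihood of Eq. 1.1 for one sample (illustrative layout).

    counts[i][l][j]: copies of allele j at locus l carried by individual i
    p[l][j][k]:      frequency of allele j at locus l in source population k
    m[i][k]:         fraction of individual i's ancestry from source k
    """
    total = 0.0
    for i, person in enumerate(counts):
        for l, locus in enumerate(person):
            for j, g in enumerate(locus):
                if g == 0:
                    continue  # alleles that were not observed add nothing
                # Predicted allele frequency: y = sum_k p_jlk * m_ik
                y = sum(p[l][j][k] * m[i][k] for k in range(len(m[i])))
                total += g * math.log(y)
    return total


# Hypothetical toy data: one individual, one locus, two alleles, two sources.
toy_p = [[[0.9, 0.2], [0.1, 0.8]]]
toy_m = [[0.5, 0.5]]
toy_counts = [[[1, 1]]]  # a heterozygote
toy_lnl = log_likelihood(toy_counts, toy_p, toy_m)
```

In a full analysis the allele frequencies and ancestry fractions would be the values that maximize this function; the sketch only evaluates it at fixed parameter values.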


Researchers previously collected the genotype data from many contemporary populations found throughout the world (Cann et al., 2002; Rosenberg et al., 2002, 2006; Wang et al., 2007, 2008; Tishkoff et al., 2009). These genotypes consist of microsatellite loci. Pemberton et al. (2013) worked to centralize the data collected from other researchers into a single data set by calibrating the loci of more than 2,500 individuals from 248 worldwide populations. My research uses a subset of these samples and loci for all individuals included in my analyses.

There are several assumptions associated with maximum likelihood, which relate to biological and statistical factors in admixture analyses. First, each allele in a genotype of an individual represents an independent draw from one of the source populations. Second, contemporary individuals derive from populations that are in Hardy-Weinberg equilibrium, when conditioned on ancestry fractions. Third, marker loci are in linkage equilibrium, when conditioned on ancestry fractions. Fourth, ancestry is estimated from the true ancestral source populations. Finally, gene flow in the form of admixture is the only evolutionary process operating in this system.

I use maximum likelihood to address and overcome challenges in admixture analyses. The challenges involve proper identification of ancestral source populations that contributed to the admixture event. There are three primary challenges confronting the proper identification of ancestral source populations. (1) Admixture events that formed many contemporary populations began or occurred entirely in the past. (2) Ancestral source populations may no longer exist, or they have evolved since the time of the admixture event. (3) A sparse historical record prevents us from fully knowing the source populations. These challenges are ubiquitous in my research as well as in all other admixture analyses. I present ways to overcome these challenges, and address particular model assumptions in each of my dissertation chapters.

In chapter two, I analyze the genetic diversity within and among 13 Latin American and four African-American populations in the Americas. The admixture of continental Indigenous Americans, Africans, and Europeans formed the mixed populations. My analysis uses genotype data at 618 microsatellite loci from 949 individuals from 49 genetically sampled populations to estimate the ancestry and ancestral allele frequencies that contributed to the formation of these admixed groups. This analysis partitions Nei’s minimum genetic distance into components of ancestry and genetic drift among the admixed populations (Nei, 1973, 1987). I partition genetic distance through a series of matrix calculations. First, I calculate Nei’s minimum genetic distance. Then I calculate the expected minimum genetic distances from the expected allele frequencies of the sampled admixed populations and the contemporary population samples that serve as pseudo-ancestors, which are obtained using the likelihood method of Tang and colleagues (2005). The expected minimum distance values are the partitioned ancestry distances of the admixed populations. I obtain the genetic drift partition by matrix subtraction: it is simply Nei’s minimum genetic distance minus the ancestry partition. Recall an assumption of the likelihood model: admixture is the only evolutionary process operating in the model. I show, from this research, that genetic drift plays a prominent role in shaping the genetic diversity of the admixed populations, which provides a fuller perspective of the effect of the evolutionary process in these populations.

In chapter three, I investigate how to choose contemporary populations to serve as ancestral sources in admixture analyses of ancestry. This work addresses the challenge of using what Tang and colleagues (2005) call pseudo-ancestors. Pseudo-ancestors are populations that are closely related to the ancestral sources that contributed to the formation of an admixed population, but that did not aid directly in the formation of the admixed population (Tang et al., 2005). The use of pseudo-ancestors is necessary because of the challenges in admixture analyses. First, source populations may no longer exist or have evolved since the time of the admixture event. Second, a sparse historical record prevents us from fully knowing who the true ancestors of an admixed population were. I construct 26 models of proposed ancestry for the Cape Coloured population of South Africa, who serve as a focal admixed population to test this method. I use the Akaike Information Criterion (AIC) to choose the best model from the 26 that estimate ancestry for the Cape Coloured population (Akaike, 1973, 1974). These models contain between two and five ancestral source populations, which include the Khoesan, Bantu speakers, European, South Asian, and East Asian. Each ancestral source population comprises genotype data containing 618 microsatellite loci of individuals from two contemporary population samples.
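The model choice described above can be sketched in a few lines of Python using the standard definition AIC = 2k − 2 ln L, where k is the number of estimated parameters and the model with the smallest AIC is preferred. How parameters are counted for each ancestry model is not specified in this passage, so n_params is left as an input, and the model names and log-likelihood values below are placeholders for illustration, not results.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood


def rank_models(models):
    """Rank models from best (lowest AIC) to worst.

    models maps a model name to a (log_likelihood, n_params) pair.
    """
    scored = [(name, aic(lnl, k)) for name, (lnl, k) in models.items()]
    return sorted(scored, key=lambda item: item[1])


# Placeholder values only: a simpler model with a slightly lower likelihood
# versus a richer model with a slightly higher likelihood.
toy_models = {
    "two_sources": (-1000.0, 10),
    "three_sources": (-998.0, 15),
}
ranking = rank_models(toy_models)
best_model = ranking[0][0]
```

In this toy case the richer model gains two log-likelihood units but pays for five extra parameters, so the simpler model ranks first.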

In chapter four, I investigate the concept of pseudo-ancestors further. I use coalescent simulations to examine ancestry proportion estimates of an admixed population (Excoffier and Foll, 2011). Knowing the relationships of the pseudo-ancestors to the true ancestors, I determine whether using genotype data from pseudo-ancestral sources in lieu of the true ancestors biases estimates of ancestry. This research addresses several challenges inherent in admixture analyses. Primarily, these challenges are that admixture events that formed many contemporary populations began or occurred entirely in the past, and that ancestral source populations may no longer exist or have evolved since the time of the admixture event. I begin by constructing a simulated consensus tree of pseudo-ancestors to serve as ancestral sources used to estimate ancestry proportions in an admixed population. The pseudo-ancestors in the tree mimic observed levels of genetic diversity from contemporary samples of actual African, European, and Indigenous American populations.

I then simulate the formation of an admixed population from a single admixture event between two of the pseudo-ancestral populations. I construct a series of eight models whereby ancestry proportions are estimated from varying pseudo-ancestors in the tree. The first four models estimate ancestry from the continental sources of Africa and Europe in the formation of an African-American population. The last four models estimate ancestry from European and American continental sources in the formation of a Latin American population. I estimate ancestry proportions from simulated genotype data, which contain 500 loci from the individuals of each pseudo-ancestral population, as well as the focal admixed population. For each model, I vary the sample sizes for the ancestral source populations, as well as for the admixed population. The first sampling scenario samples 100 individuals from every population: each of the two ancestral sources and the admixed population. The second sampling scenario samples 100 individuals from the admixed population, and 20 individuals from each of the ancestral source populations. The third sampling scenario estimates ancestry from a sample of 20 individuals from the admixed population, and 100 individuals from each of the ancestral source populations.


Chapter 2

The Contributions of Admixture and Genetic Drift to Diversity Among Post-Contact Populations in the Americas

2.1 Overview

Objective: We present a partition of Nei’s minimum genetic distance in admixed populations into components of admixture and genetic drift. We applied this technique to 17 admixed populations in the Americas to examine how admixture and drift have contributed to the patterns of genetic diversity.

Materials and Methods: We analyzed 618 short tandem repeat loci in 949 individuals from 49 population samples. Thirty-two samples serve as proxies for continental ancestors. Seventeen samples represent admixed populations: four African-American and 13 Latin American. We estimate ancestry fractions and allele frequencies for all populations. We partition genetic distance and calculate fixation indices and principal coordinates to interpret our results.

Results: The partition of genetic distance shows that both admixture and genetic drift contribute to patterns of genetic diversity. The admixture component of genetic distance provides evidence for two distinct axes of continental ancestry. However, the genetic distances show that ancestry contributes to only one axis of genetic differentiation. The drift component of genetic distance indicates that modest founder effects accompanied admixture in the formation of these populations.

Discussion: Our results show that the genetic structure of admixed populations in the Americas reflects more than admixture. We show that the evolution of the source populations influenced the genetic structure of the admixed populations. Notably, the history of serial founder effects constrains the impact of admixture on allele frequencies to a single dimension. Founder effects in the admixed populations imposed a new level of genetic structure onto that created by admixture.

2.2 Introduction

European colonization of the Americas beginning in the late 15th century had a major impact on the human species by bringing people living in Europe and Africa to the Americas. Populations on these continents had been isolated from each other for thousands of years before this. The result of re-contact was the formation of new genetically mixed populations that trace their recent ancestry to two or more continental regions. Many mixed populations formed and each one constituted a unique gene pool. The mixed populations resided in geographic locations dispersed throughout the Americas, and to varying degrees, the populations were isolated from each other. Each newly formed population had its genetic diversity structured by factors such as the composition of African, European, and Indigenous American ancestors, and their specific degree of isolation. Many population genetic studies have compared the fractions of continental ancestry across admixed populations in the Americas (Shriver et al., 2003; Bonilla et al., 2004, 2005; Galanter et al., 2012). Recent efforts have linked admixed populations to subpopulations of the continental ancestral groups (Wang et al., 2008; Moreno-Estrada et al., 2014). However, ancestry fractions, no matter how fine-grained, do not fully account for the patterns of genetic diversity in the admixed populations. A full account requires consideration of other evolutionary processes, notably genetic drift. The Indigenous American population experienced dramatic decline and rebound in the relatively short timeframe since European contact (Livi-Bacci, 2006; O’Fallon and Fehren-Schmitz, 2011). Many of the admixed populations formed during the time of Indigenous American decline, and necessarily grew from small to large size in their early generations. In essence, the foundation of these populations would have encompassed both mixing of continental ancestry and founder effects. No study heretofore has connected the genetic diversity within and among the mixed populations to the combination of continental ancestry and genetic drift. To this end, we developed a model and methods of analysis to partition genetic distance into ancestry and drift components. We analyzed a diverse set of admixed populations in North and South America. We found that demographic forces that go beyond admixture, which are related to genetic drift, have played a prominent role in shaping patterns of diversity within and among admixed populations in the Americas.

2.3 Population Genetic Model

Figure 2.1 presents the basics of our model and its main parameters. For the purpose of explanation, we consider a pair of recently formed populations that we will label A and B. Both A and B have ancestry from two continental sources that we will label 1 and 2. We assume that the continental source populations have been isolated from each other for enough time to allow their allele frequencies to diverge by genetic drift.

Nei’s minimum genetic distance is a principal quantity in the formulation of our model (Nei, 1973, 1987). For a single genetic locus, this genetic distance is a function of allele frequencies computed according to the formula

$$D_{hi}^2 = \frac{1}{2} \sum_{j} (p_{hj} - p_{ij})^2 \qquad (2.1)$$

where $p_{hj}$ and $p_{ij}$ represent the frequency of the $j$th allele in the $h$th and $i$th populations, respectively. The summation is taken over all alleles at the locus. Minimum genetic distances are typically reported as averages over many genetic loci.
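A minimal Python sketch of Eq. 2.1 follows, together with the average over loci mentioned above. The function names and the list-of-lists layout (one list of allele frequencies per locus) are assumptions made for this illustration, and the frequencies are toy values.

```python
def min_distance_locus(p_h, p_i):
    """Nei's minimum genetic distance at one locus (Eq. 2.1).

    p_h and p_i list the frequencies of the same alleles, in the same order,
    in populations h and i.
    """
    return 0.5 * sum((a - b) ** 2 for a, b in zip(p_h, p_i))


def min_distance(freqs_h, freqs_i):
    """Average of the single-locus distances over all loci."""
    per_locus = [min_distance_locus(a, b) for a, b in zip(freqs_h, freqs_i)]
    return sum(per_locus) / len(per_locus)


# Toy example: two loci, two alleles each.
pop_h = [[0.7, 0.3], [0.5, 0.5]]
pop_i = [[0.5, 0.5], [0.5, 0.5]]
d_first_locus = min_distance_locus(pop_h[0], pop_i[0])  # 0.5 * (0.04 + 0.04)
d_average = min_distance(pop_h, pop_i)
```

The first locus contributes a distance of 0.04 and the second contributes zero, so the two-locus average is 0.02.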

Figure 2.1: Schematic showing the independent contributions of admixture and genetic drift to genetic distance.

Our goal for the populations A and B is to partition their minimum genetic distance into two components, one representing ancestry and the other representing genetic drift. To accomplish this, we construct allele frequencies in A and B using allele frequencies in the source populations, ancestry fractions, and a contribution by genetic drift (Long, 1991). For the purpose of exposition, we will assume that 1 and 2 are the only ancestral populations of A and B. This restriction can be relaxed and our methods generalize to any number of source populations that contributed ancestors to any number of mixed populations. We construct allele frequencies in the admixed populations according to the formulas

$$p_{Aj} = p_{2j} + m_{A1}(p_{1j} - p_{2j}) + \varepsilon_{Aj}$$

$$p_{Bj} = p_{2j} + m_{B1}(p_{1j} - p_{2j}) + \varepsilon_{Bj} \qquad (2.2)$$

where $m_{A1}$ and $m_{B1}$ are the proportions of ancestry in populations A and B that were contributed by parental population 1. Since all ancestry in A and B must trace back to 1 or 2, we construct $m_{A2} = 1 - m_{A1}$ and $m_{B2} = 1 - m_{B1}$. The frequencies of the $j$th allele in the source populations are represented by $p_{1j}$ and $p_{2j}$, respectively. The final terms, $\varepsilon_{Aj}$ and $\varepsilon_{Bj}$, represent the deviations of the allele frequency from that which a pure admixture process would produce. We assume that these terms represent genetic drift occurring in the mixed populations during, or after, the admixture process.

To obtain the genetic distance between A and B in terms of our admixture and drift model, we substitute the allele frequency formulas from Eqs. 2.2 into Eq. 2.1:

$$D_{AB}^2 = \frac{1}{2} \sum_{j} \left( p_{2j} + m_{A1}(p_{1j} - p_{2j}) + \varepsilon_{Aj} - p_{2j} - m_{B1}(p_{1j} - p_{2j}) - \varepsilon_{Bj} \right)^2 \qquad (2.3)$$

After collecting terms and simplifying, with the cross-product between the admixture and drift terms taken to be zero,

$$D_{AB}^2 = (m_{A1} - m_{B1})^2 D_{12}^2 + \frac{1}{2} \sum_{j} (\varepsilon_{Aj} - \varepsilon_{Bj})^2 = \Delta_{AB} + E_{AB}$$

The component $\Delta_{AB} = (m_{A1} - m_{B1})^2 D_{12}^2$ represents the portion of genetic distance related to admixture, while the component $E_{AB} = \frac{1}{2} \sum_{j} (\varepsilon_{Aj} - \varepsilon_{Bj})^2$ represents the portion of genetic distance related to drift, following admixture. It is clear from the admixture component of genetic distance that the impact of admixture depends on the level of differentiation of the source populations.
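The algebra behind this partition can be checked numerically. The Python sketch below builds allele frequencies for A and B at one biallelic locus from Eq. 2.2, then compares the distance computed directly from Eq. 2.1 (with its factor of one half used throughout) against the admixture component, the drift component, and the cross-product between them. All frequencies, ancestry fractions, and drift deviations are hypothetical values chosen for illustration, and the variable names are not from the original analysis.

```python
def half_sum_sq(x, y):
    """Single-locus minimum distance: 0.5 * sum_j (x_j - y_j)^2."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))


# Hypothetical source allele frequencies at one biallelic locus.
p1 = [0.8, 0.2]
p2 = [0.3, 0.7]
# Hypothetical ancestry fractions from source 1 and drift deviations.
m_a1, m_b1 = 0.6, 0.2
eps_a = [0.02, -0.02]
eps_b = [-0.01, 0.01]

# Eq. 2.2: admixed frequency = source 2 + m * (source 1 - source 2) + drift.
p_a = [q2 + m_a1 * (q1 - q2) + e for q1, q2, e in zip(p1, p2, eps_a)]
p_b = [q2 + m_b1 * (q1 - q2) + e for q1, q2, e in zip(p1, p2, eps_b)]

d_ab = half_sum_sq(p_a, p_b)                         # distance from Eq. 2.1
delta_ab = (m_a1 - m_b1) ** 2 * half_sum_sq(p1, p2)  # admixture component
e_ab = half_sum_sq(eps_a, eps_b)                     # drift component
# Cross-product between the admixture and drift terms.
cross = (m_a1 - m_b1) * sum(
    (q1 - q2) * (ea - eb) for q1, q2, ea, eb in zip(p1, p2, eps_a, eps_b)
)
```

With the cross-product included, the identity d_ab = delta_ab + e_ab + cross holds exactly. At a single locus with arbitrary toy deviations the cross-product is generally not zero; the two-component form in the text omits it, which is consistent with the assumption that admixture and drift have operated independently.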

This model for genetic distance requires two assumptions. First, genetic drift and admixture are the only processes that have influenced allele frequencies in the admixed populations. Second, the effects of genetic drift and admixture have operated independently.

Nei’s minimum genetic distance is one of the simplest measures of genetic distance (Nei, 1987). We have chosen it as our primary metric because it is easy to partition into additive components related to the distinct processes of admixture and genetic drift. Moreover, this distance makes it easy to relate population differentiation to genetic phenomena such as homozygosity and heterozygosity. Some other measures of genetic distance (Shriver et al., 1995; Goldstein et al., 1995; Nei, 1973) utilize the mutation rate to measure divergence times in phylogenetic models. At best, these genetic distance measures provide indirect information about admixture. They are unsuited to the populations in this analysis because admixture produces genetic outcomes that are distinct from the outcomes of population fissions and phylogenetic radiation. We feel mutation is unlikely to influence the results for recently founded populations. In this light, we favor a method that is simple to interpret and likely to produce accurate results.

2.4 Materials and Methods

The focus of our analyses is a set of 17 populations of mixed ancestry in the Americas. This set includes 13 populations labeled in original sources as Mestizo (Wang et al., 2008) and four populations labeled in original sources as African-American (Tishkoff et al., 2009). Various investigators collected these samples in North and South America. To guide our analyses of mixed populations, we include four populations labeled in original sources as European, two populations labeled in original sources as Sub-Saharan African, and 26 populations labeled in original sources as Indigenous American (Cann et al., 2002; Rosenberg et al., 2002; Wang et al., 2007). The original investigators collected the African, European, and Indigenous American samples on their respective continents of origin. The primary sources for our data are Cann et al. (2002), Rosenberg et al. (2002), Wang et al. (2007, 2008), and Tishkoff et al. (2009). Tables 2.1 and 2.2 give the population names, geographic coordinates, sample sizes, and primary references for all 49 populations.

We analyze genotypes at 618 autosomal short tandem repeat (STR) loci. The genotyping service at the Marshfield Clinic performed the laboratory analyses for all of the original studies. The Marshfield Clinic selected these loci for linkage mapping in other studies. The loci are spaced on the genetic map approximately 5 cM to 10 cM apart. We use data from the set that Pemberton et al. (2013) created by calibrating allele sizes and combining across the original studies.

To test the statistical significance of genetic distance estimates between samples, we constructed confidence intervals using the jackknife method (Efron and Tibshirani, 1993). We rejected the null hypothesis of zero genetic distance if the confidence interval for an estimate did not span zero. One-sided confidence intervals are appropriate for these tests because genetic distance cannot be negative.
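One way such a test could look is sketched below in Python. The text does not say which units were left out in the jackknife or which confidence level was used, so two assumptions are made here and should be read as such: loci are the leave-one-out units, and the one-sided bound uses the normal quantile 1.645 (a 95% level). The per-locus distances are toy values.

```python
import math


def jackknife_lower_bound(per_locus, z=1.645):
    """Delete-one-locus jackknife for the mean distance over loci.

    Returns (estimate, standard_error, one_sided_lower_bound).
    z = 1.645 is an assumed one-sided 95% normal quantile.
    """
    n = len(per_locus)
    total = sum(per_locus)
    estimate = total / n
    # Recompute the mean with each locus left out in turn.
    leave_one_out = [(total - x) / (n - 1) for x in per_locus]
    loo_mean = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((t - loo_mean) ** 2 for t in leave_one_out)
    se = math.sqrt(variance)
    return estimate, se, estimate - z * se


# Toy per-locus distances between two samples.
toy_distances = [0.01, 0.02, 0.03, 0.04]
est, se, lower = jackknife_lower_bound(toy_distances)
reject_zero_distance = lower > 0
```

For a simple mean, the jackknife standard error equals the usual standard error of the mean, which gives a convenient check on the implementation.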


Table 2.1: Sampled contemporary populations that serve as ancestral proxies in our analyses, along with their associated sample sizes, global locations, and primary references.

Population Name     Sample Size   GPS Coordinates     Primary Reference
Orcadian            16            59°N −3°E           Cann et al. (2002); Rosenberg et al. (2002)
French              29            46°N 2°E            Cann et al. (2002); Rosenberg et al. (2002)
Italian             13            46°N 10°E           Cann et al. (2002); Rosenberg et al. (2002)
Russian             25            61°N 40°E           Cann et al. (2002); Rosenberg et al. (2002)
Mandenka            24            12°N −12°E          Cann et al. (2002); Rosenberg et al. (2002)
Yoruba              25            8°N 4°E             Cann et al. (2002); Rosenberg et al. (2002)
Yoruba              25            7.9°N 5°E           Tishkoff et al. (2009)
Pima                25            29°N −108°E         Cann et al. (2002); Rosenberg et al. (2002)
Mixtec              19            17°N −97°E          Wang et al. (2007)
Zapotec             17            16°N −97°E          Wang et al. (2007)
Mixe                20            17°N −96°E          Wang et al. (2007)
Maya                25            19°N −91°E          Cann et al. (2002); Rosenberg et al. (2002)
Kaqchikel           12            15°N −91°E          Wang et al. (2007)
Cabecar             20            9.5°N −84°E         Wang et al. (2007)
Guaymi              16            8.5°N −82°E         Wang et al. (2007)
Kogi                16            11°N −74°E          Wang et al. (2007)
Arhuaco             16            11°N −73.8°E        Wang et al. (2007)
Waunana             20            5°N −77°E           Wang et al. (2007)
Embera              11            7°N −76°E           Wang et al. (2007)
Zenu                18            9°N −75°E           Wang et al. (2007)
Inga                16            1°N −77°E           Wang et al. (2007)
Quechua             20            −14°N −74°E         Wang et al. (2007)
Aymara              18            −22°N −70°E         Wang et al. (2007)
Huilliche           19            −41°N −73°E         Wang et al. (2007)
Kaingang            5             −24°N −52.5°E       Wang et al. (2007)
Guarani             10            −23°N −54°E         Wang et al. (2007)
Wayuu               17            11°N −73°E          Wang et al. (2007)
Piapoco-Curripaco   13            3°N −68°E           Cann et al. (2002); Rosenberg et al. (2002)
Ticuna Tarapaca     18            −4°N −70°E          Wang et al. (2007)
Ticuna Arara        15            −4°N −70°E          Wang et al. (2007)
Karitiana           24            −10°N −63°E         Cann et al. (2002); Rosenberg et al. (2002)
Surui               21            −11°N −62°E         Cann et al. (2002); Rosenberg et al. (2002)
Ache                17            −24°N −56°E         Wang et al. (2007)


Table 2.2: Sampled admixed populations used in our analyses, along with their associated sample sizes, global locations, and primary references.

Population Name     Sample Size  GPS Coordinates   Primary Reference
Oriente             19           14.63°N −89.7°E   Wang et al. (2008)
Mexico City         19           19.4°N −99.2°E    Wang et al. (2008)
CVCR                20           11.5°N −84.1°E    Wang et al. (2008)
Quetalmahue         20           −42.4°N −73.5°E   Wang et al. (2008)
Paposo              20           24°N −70°E        Wang et al. (2008)
Catamarca           12           −29.3°N −65.8°E   Wang et al. (2008)
Salta               19           −24.8°N −65.4°E   Wang et al. (2008)
Tucuman             19           −27°N −65.2°E     Wang et al. (2008)
RGS                 20           −31°N −54°E       Wang et al. (2008)
Pasto               19           1°N −78.5°E       Wang et al. (2008)
Peque               20           7.6°N −73°E       Wang et al. (2008)
Medellin            20           5.4°N −74.4°E     Wang et al. (2008)
Cundinamarca        19           3.2°N −74.1°E     Wang et al. (2008)
Chicago             15           42°N −87.9°E      Tishkoff et al. (2009)
Pittsburgh          21           40.5°N −80.2°E    Tishkoff et al. (2009)
Baltimore           44           39.2°N −76.7°E    Tishkoff et al. (2009)
North Carolina      18           35.9°N −78.8°E    Tishkoff et al. (2009)

Fitting our population genetic model requires us to estimate each component of allele frequency given by Eq. 2.2. The following estimation steps underlie our analysis. (1) We identify source populations. (2) We estimate allele frequencies for the sources. (3) We estimate for the mixed populations the fraction of their ancestry attributable to each source population. (4) We estimate expected allele frequencies for each mixed ancestry population. The expected allele frequencies for a mixed population are the averages of source population allele frequencies weighted by the fractions of ancestry in mixed populations that are attributable to the sources. (5) We estimate the drift deviations for each allele frequency, in each mixed population, as the difference between the observed and expected allele frequencies.

We use the maximum likelihood approach of Tang and colleagues to make the estimates described in the previous paragraph (Tang et al., 2005). This method assumes a population model in which the ancestry in a mixed group traces back to a pre-specified number K of ancestral sources. The method assumes that each allele in each genotype of an individual with mixed ancestry represents an independent draw from one of the source populations. This is equivalent to assuming that genotypes in mixed populations are in Hardy-Weinberg equilibrium when conditioned on the ancestry fractions. The method requires us to assume that the STR marker loci are also in linkage equilibrium when conditioned on ancestry fractions.
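The independence assumptions above imply a simple form for the likelihood: the probability of each allele copy is the ancestry-weighted average of the source frequencies for that allele, and the log-likelihood sums over individuals, loci, and the two copies per genotype. The sketch below illustrates this under assumed data layouts; the function name and the nested list and dictionary structures are hypothetical and do not describe the program used in this study.

```python
import math

def log_likelihood(genotypes, m, p):
    """Log-likelihood when every allele copy is an independent draw.

    genotypes[i][l] is a pair of allele labels for individual i at locus l
    (None marks a missing copy); m[i][s] is the ancestry fraction of
    individual i from source s; p[s][l] maps allele labels at locus l to
    their frequency in source s. All three layouts are hypothetical.
    """
    n_sources = len(p)
    total = 0.0
    for i, person in enumerate(genotypes):
        for l, alleles in enumerate(person):
            for a in alleles:
                if a is None:
                    continue
                # Ancestry-weighted frequency of allele a for this individual.
                prob = sum(m[i][s] * p[s][l].get(a, 0.0)
                           for s in range(n_sources))
                total += math.log(prob)
    return total
```

Because loci are treated as independent given ancestry, the contributions of individual loci simply add on the log scale.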

We have written new software for the method to accommodate STR data. We wrote this software using the Bloodshed Development Environment (http://www.bloodshed.net) in the C++ language. Prior implementations of Tang’s method are restricted to single nucleotide polymorphism data (Alexander et al., 2009; Tang et al., 2005). The likelihood function is of extremely high dimension when applied to genomic scale data. Maximizing this function requires estimating thousands of parameters, consisting of allele frequencies and ancestry fractions. Our program uses the EM algorithm described by Tang and colleagues (2005) as a numerical method to obtain asymptotic results from the likelihood equation. Alexander and colleagues (2009) note that a stringent convergence criterion is necessary to obtain precise results.
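To make the EM idea concrete, the following sketch shows one update for a model with multi-allelic loci. It is an illustration of the general technique, not the C++ program described above: the data layouts, the function name, and the optional `fixed` argument (individuals whose ancestry is held constant) are all assumptions. The E-step assigns each allele copy a posterior probability of coming from each source, and the M-step re-estimates ancestry fractions and allele frequencies from those expected assignments.

```python
def em_step(genotypes, m, p, fixed=frozenset()):
    """One EM update of ancestry fractions m and source allele frequencies p.

    genotypes[i][l] is a pair of allele labels (None = missing), m[i][s] is
    an ancestry fraction, p[s][l] maps allele labels to frequencies.
    fixed holds indices of individuals whose ancestry is not re-estimated.
    """
    n_sources = len(p)
    n_loci = len(p[0])
    # Expected allele counts per source, locus, and allele.
    counts = [[{} for _ in range(n_loci)] for _ in range(n_sources)]
    new_m = []
    for i, person in enumerate(genotypes):
        weight = [0.0] * n_sources
        n_copies = 0
        for l, alleles in enumerate(person):
            for a in alleles:
                if a is None:
                    continue
                # E-step: posterior that this copy derives from source s.
                joint = [m[i][s] * p[s][l].get(a, 0.0)
                         for s in range(n_sources)]
                norm = sum(joint)
                if norm == 0.0:
                    continue  # allele impossible under current estimates
                for s in range(n_sources):
                    z = joint[s] / norm
                    weight[s] += z
                    counts[s][l][a] = counts[s][l].get(a, 0.0) + z
                n_copies += 1
        # M-step for ancestry: average posterior over the allele copies.
        if i in fixed or n_copies == 0:
            new_m.append(list(m[i]))
        else:
            new_m.append([w / n_copies for w in weight])
    # M-step for frequencies: normalize expected counts within each locus.
    new_p = []
    for s in range(n_sources):
        loci = []
        for l in range(n_loci):
            tot = sum(counts[s][l].values())
            if tot > 0:
                loci.append({a: c / tot for a, c in counts[s][l].items()})
            else:
                loci.append(dict(p[s][l]))
        new_p.append(loci)
    return new_m, new_p
```

In practice the step would be repeated from random starting values until the log-likelihood stops changing by more than a small tolerance.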

Determining the number of source populations is a special case of determining the number of clusters in a mixture. This is a long-standing problem in statistics and population genetics. Following Tang et al. (2005), we intend that our source populations represent populations that were isolated on different continents in pre-Columbian times, but we investigate the possibility that alternative models with more source populations per continent may fit the data better than a model with one source per continent. To distinguish models, we apply the standard approach of tracking the increase in model likelihood that occurs with increasing the number of source populations, i.e., increasing K. However, we also take some additional steps. We perform multiple runs of the program and check for consistency in the maximized likelihood across runs. Then, we check the individual mixed populations to be certain that they make the same contribution to the overall likelihood across runs. Finally, we check the individual mixed populations to make sure the contributions of the source populations remain constant across replicate runs of the same model. In light of the complexity in identifying actual source populations and estimating their allele frequencies, we follow the precedent of recognizing these putative source populations as pseudo-ancestors (Tang et al., 2005).

We take the following steps to partition unbiased estimates of genetic distance into admixture and drift components. These equations allow any number of K ancestral source populations. First, we calculate Nei’s unbiased estimate of minimum genetic distance between admixed populations A and B. Second, we create expected allele frequencies for each admixed sample according to

$$\hat{p}_{Aj} = \sum_{s=1}^{K} \hat{m}_{As}\,\hat{p}_{sj} \tag{2.5}$$

where $\hat{p}_{Aj}$ is the expected frequency of the jth allele in the Ath admixed population, and $\hat{m}_{As}$ is the estimated contribution of the sth ancestral source population to the Ath admixed population. Third, we compute the admixture portion of the estimated genetic distance between admixed populations A and B as

$$\hat{\Delta}_{AB} = \sum_{j} \left(\hat{p}_{Aj} - \hat{p}_{Bj}\right)^{2} \tag{2.6}$$

Fourth, we compute the drift portion of the estimated genetic distance as

$$\hat{E}_{AB} = \hat{D}_{AB} - \hat{\Delta}_{AB} \tag{2.7}$$

where $\hat{D}_{AB}$ is the estimate of Nei’s minimum genetic distance.
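The partition in Eqs. 2.5 to 2.7 can be sketched in a few lines. This is a single-locus illustration with hypothetical function names and list layouts; it takes the estimate of Nei's minimum distance as a given input rather than computing it, and how values are combined across loci is not addressed here.

```python
def expected_frequencies(m_A, source_freqs):
    """Eq. 2.5: ancestry-weighted average of source allele frequencies.

    m_A[s] is the ancestry fraction from source s; source_freqs[s][j] is
    the frequency of allele j in source s (hypothetical list layout).
    """
    n_alleles = len(source_freqs[0])
    return [sum(m_A[s] * source_freqs[s][j] for s in range(len(m_A)))
            for j in range(n_alleles)]

def admixture_distance(p_A, p_B):
    """Eq. 2.6: sum of squared differences between expected frequencies."""
    return sum((a - b) ** 2 for a, b in zip(p_A, p_B))

def drift_distance(d_AB, delta_AB):
    """Eq. 2.7: total minimum distance minus the admixture portion."""
    return d_AB - delta_AB
```

Two populations with identical ancestry fractions have identical expected frequencies, so their admixture distance is zero and any observed distance is attributed to drift.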

We use original scripts written for the R statistical computing environment to manipulate allele frequency output from our likelihood program, to compute genetic distance matrices and their partitions, and to produce graphs (R Core Team, 2014).


First, we compute the fixation index FST to help assess the extent of genetic drift in admixed populations (Wright, 1951; Nei, 1987; Long, 1991). We use the general formula,

$$\hat{F}_{ST} = \frac{\hat{H}_T - \hat{H}_O}{\hat{H}_T} \tag{2.8}$$

for estimation, where $\hat{H}_T$ is the estimated heterozygosity in a base population and $\hat{H}_O$ is the estimated heterozygosity in an observed sample. For the pseudo-ancestors, we compute $\hat{H}_T$ from the allele frequencies estimated for the specific continental source population, and for the admixed populations we compute $\hat{H}_T$ from the allele frequencies expected from the admixture process. Second, we use principal coordinates to represent distance matrices in lower dimension (Gower, 1966). We use the multidimensional scaling function in R to compute principal coordinates.
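As an illustration of Eq. 2.8, the sketch below computes the fixation index from two sets of allele frequencies. It assumes that heterozygosity is the expected heterozygosity, one minus the sum of squared allele frequencies, and that heterozygosities are averaged over loci before the ratio is taken. Both choices, and the function names, are assumptions rather than details given in the text.

```python
def heterozygosity(freqs):
    """Expected heterozygosity at one locus: 1 - sum of squared frequencies."""
    return 1.0 - sum(f * f for f in freqs)

def fst(base_freqs_by_locus, observed_freqs_by_locus):
    """Eq. 2.8 with heterozygosities averaged over loci (an assumption).

    Each argument is a list with one list of allele frequencies per locus.
    """
    n = len(base_freqs_by_locus)
    h_t = sum(heterozygosity(f) for f in base_freqs_by_locus) / n
    h_o = sum(heterozygosity(f) for f in observed_freqs_by_locus) / n
    return (h_t - h_o) / h_t
```

For example, a base population with two alleles at frequency 0.5 each has heterozygosity 0.5, and a sample with frequencies 0.6 and 0.4 has heterozygosity 0.48, giving a fixation index of 0.04.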

2.5 Results

We estimated genetic ancestry and allele frequencies twice. First, we assumed that K=3 ancestral source populations contributed to the 49 contemporary samples, and second we expanded to K=4 ancestral sources. We constructed these analyses in a partially supervised fashion. We constrained individuals from the four European samples to have 100% ancestry from one source population, and individuals from the two African samples to have 100% ancestry from a second source population. This construction obligated source one to represent European ancestors, and source two to represent African ancestors, and by default, sources three and four represented Indigenous American ancestors. We estimated ancestry in the populations labeled Indigenous American because prior research shows mixed continental ancestry in some of these samples (Hunley and Healy, 2011). All models necessitated estimating 6,333 independent allele frequencies per ancestral source population, and ancestry fractions for 792 individuals. In total, K=3 required estimating 20,583 parameters, and K=4 required estimating 26,916 parameters. To fit models, we used random starting values for all parameters, and iterated the EM procedure until the likelihood changed by less than $10^{-6}$ between successive steps.
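The parameter totals quoted above can be checked with simple arithmetic. In this sketch the 1,584 ancestry terms (792 individuals × 2) are inferred here from the reported totals and are not a figure stated in the text; with that value, K sets of 6,333 allele frequencies reproduce both totals. The function name and argument names are hypothetical.

```python
def n_parameters(k, freqs_per_source=6333, ancestry_terms=1584):
    """Parameter total: k sets of allele frequencies plus ancestry terms.

    ancestry_terms=1584 (792 individuals x 2) is an inferred value that
    matches the totals reported for K=3 and K=4, not a stated figure.
    """
    return k * freqs_per_source + ancestry_terms

for k in (3, 4):
    print(k, n_parameters(k))
```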

With K=3, we were able to replicate the highest likelihood in several runs of the ancestry estimation program using different starting values. Importantly, our ancestry estimates for individuals and populations were consistent across runs, generally not differing by more than 0.001. By contrast, our results for models with K=4 were less successful. Although running the program with K=4 always yielded higher likelihoods than running it with K=3, we could not replicate the best likelihood on independent runs. Moreover, we found with K=4 that parameter estimates could be quite different between runs of the program that produced similar likelihoods. In light of our limited success with K=4, we performed all subsequent analyses of ancestry and drift contributions to genetic distance using maximum likelihood estimates with K=3.

Table 2.3 gives sample size and estimates of continental ancestry for the 17 post-contact populations. The African-American populations have ancestry proportions similar to each other (approximately, 80% African and 20% European). By contrast, the Latin American populations vary widely in their ancestry; average African ancestry varies from 0% to 9%, average European ancestry varies from 33% to 73%, and average Indigenous American ancestry varies from 18% to 64%. The wide variation in ancestry within Latin American populations, and between Latin American and African-American populations, makes our questions about the contribution of variation in ancestry to genetic distance particularly salient.


Table 2.3: The post-contact populations included in our analyses with their associated sample sizes, inferred average continental ancestry, FST, and log likelihood estimates.

Sample          n   African  European  Indigenous American  FST     lnL(i)
Chicago         15  0.788    0.2       0.012                0.0006  -29,968
Pittsburgh      21  0.79     0.196     0.014                0       -43,142
Baltimore       44  0.828    0.158     0.014                0       -89,668
North Carolina  18  0.775    0.204     0.021                0       -36,181
Mexico City     19  0.035    0.621     0.344                0.0001  -35,594
Oriente         19  0.069    0.456     0.474                0.0002  -35,687
CVCR            20  0.044    0.711     0.245                0.0007  -38,385
Peque           20  0.051    0.437     0.512                0.0199  -37,423
Medellin        20  0.093    0.697     0.211                0.0042  -38,857
Cundinamarca    19  0.02     0.529     0.451                0.0051  -35,443
Pasto           19  0.035    0.457     0.508                0.0053  -34,909
Salta           19  0.024    0.332     0.644                0.0054  -33,878
Paposo          20  0.018    0.499     0.483                0.0176  -35,948
Tucuman         19  0.044    0.698     0.258                0       -35,164
Catamarca       12  0.027    0.594     0.379                0.0082  -22,480
RGS             20  0.094    0.731     0.175                0       -38,371
Quetalmahue     20  0.004    0.564     0.432                0.0293  -36,803

Table 2.3 also gives estimates of FST, which measures the drift of allele frequencies in each post-contact population from the expectations set by admixture of intercontinental sources. The four African-American populations independently show minimal influence from genetic drift based on FST. The Latin American populations show varying impact of genetic drift. FST is less than 0.001 for five populations, and greater than 0.01 for three populations. The remaining five Latin American populations show intermediate impact of drift, 0.001 ≤ FST ≤ 0.01. While FST in this intermediate range seems low, it is typical of populations on the European continent.

We calculated the matrix of Nei’s minimum genetic distances among pairs of the 49 populations analyzed (17 post-contact populations and 32 indigenous continental populations). The 17 post-contact populations (4 African-American and 13 Latin American) yield 136 pairs. The genetic distance was statistically significant with p-values below 0.05 for 135 of these pairs. The highest p-value was 0.06 between the African-American populations in Pittsburgh and Baltimore. The p-value was below 0.0001 for 116 of the pairwise comparisons. A p-value of 0.0004 is required for a conservative Bonferroni correction for multiple comparisons (Sokal and Rohlf, 2012). To visualize patterns, we extracted the Eigen vectors produced by multidimensional scaling to summarize patterns of genetic distance among these 49 populations.
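For readers unfamiliar with principal coordinates, the sketch below extracts the leading axis of a distance matrix by Gower's double centering followed by power iteration. It is a pure-Python illustration, not the R routine used in this study. The function names are hypothetical, and squaring the input distances by default follows the usual convention of classical scaling, which is an assumption about how the distances were treated. Power iteration finds the eigenvalue of largest magnitude, which is the leading positive axis only when the matrix has no dominant negative eigenvalue.

```python
import math

def double_center(d2):
    """Gower's centering of a squared-distance matrix: B = -1/2 * J D2 J."""
    n = len(d2)
    row = [sum(r) / n for r in d2]
    grand = sum(row) / n
    return [[-0.5 * (d2[i][j] - row[i] - row[j] + grand) for j in range(n)]
            for i in range(n)]

def first_principal_coordinate(dist, squared=False, iters=1000, tol=1e-12):
    """Leading principal coordinate of a symmetric distance matrix.

    Returns (coordinates, eigenvalue, share of the trace explained).
    If squared is False the entries are squared before centering.
    """
    n = len(dist)
    d2 = [[x if squared else x * x for x in r] for r in dist]
    b = double_center(d2)
    v = [float(i + 1) for i in range(n)]  # deterministic starting vector
    eigenvalue = 0.0
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        w = [x / norm for x in w]
        # Rayleigh quotient of the normalized vector.
        new_eig = sum(w[i] * sum(b[i][j] * w[j] for j in range(n))
                      for i in range(n))
        converged = abs(new_eig - eigenvalue) < tol
        v, eigenvalue = w, new_eig
        if converged:
            break
    trace = sum(b[i][i] for i in range(n))
    coords = [math.sqrt(max(eigenvalue, 0.0)) * x for x in v]
    share = eigenvalue / trace if trace > 0 else float("nan")
    return coords, eigenvalue, share
```

When the points lie on a single line, the first coordinate recovers their spacing exactly (up to sign and a shift of origin) and accounts for all of the dispersion.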

Figures 2.2 and 2.3 show the outcomes of our ancestry and genetic distance analyses. Figure 2.2 displays ancestry estimates for the 49 populations in a triangle plot. As expected, the continental populations are concentrated on the vertices. The 17 post-contact populations occupy intermediate locations. The four African-American samples cluster tightly on the axis between continental Africans and continental Europeans. The 13 Latin American populations disperse along the axis between European and Indigenous American populations.

Figure 2.2: Proportions of continental ancestry fill a two-dimensional space defined by the constraint that ancestry fractions sum to 1.0. Ancestry estimates are presented for 49 populations. African (blue) and European (red) samples were constrained to 100% ancestry from their respective continental sources. The ancestry of contemporary Indigenous Americans (gold) was estimated from a three-way admixture model to account for recently introduced European and African ancestry. Samples from African-American populations are shaded dark green. Samples from Latin American populations are shaded light green.


Figure 2.3 plots the first two principal coordinates of the matrix of genetic distances among the 49 populations. Population positions along the first axis, which accounts for 50% of the dispersion, correlate with continental ancestry. African populations occupy one extreme and Indigenous American populations occupy the other extreme. European populations lie intermediate to the other two continental groups. African-Americans lie between the African and European populations. Latin Americans lie between European and Indigenous American populations. Unexpectedly, the second principal coordinate separates a pair of Indigenous American populations. The next several axes primarily differentiate Indigenous Americans.

Figure 2.3: Only the first principal coordinate of the genetic distance matrix shows the ancestry pattern of continental populations and their post-contact descendants formed by admixture. The second and subsequent coordinates primarily reveal the extreme divergence of Indigenous Americans from a continental gene pool. The color conventions are those established in Figure 2.2. The dots containing crosses represent the putative ancestral populations for the continental sources.


We present Nei’s minimum genetic distances for the 17 post-contact populations in the Americas (Table 2.4). We partitioned the matrix of Nei’s minimum genetic distances among the 17 post-contact populations into a matrix of admixture distances and a matrix of drift distances (Tables 2.5 and 2.6). Then we used the Eigen vectors produced by multidimensional scaling to summarize patterns within the distance matrices (Figure 2.4).


Table 2.4: Nei’s minimum genetic distance for all the admixed populations included in our analyses.

     MC      OR      CR      PQ      MD      CN      PS      SL      PP      TC      CT      RGS     QT      BL      CH      NC      PT
MC   0       0.0041  0.0048  0.0235  0.0066  0.0042  0.0082  0.0156  0.0149  0.0049  0.006   0.0047  0.02    0.0523  0.0509  0.0465  0.0466
OR   0.0041  0       0.0096  0.0175  0.0125  0.0028  0.0056  0.0106  0.0144  0.0085  0.0058  0.013   0.0225  0.0573  0.0532  0.0503  0.0528
CR   0.0048  0.0096  0       0.0223  0.0043  0.007   0.0152  0.0248  0.02    0.0044  0.0112  0.0039  0.0247  0.0463  0.0464  0.0416  0.0416
PQ   0.0235  0.0175  0.0223  0       0.0237  0.0168  0.0212  0.0256  0.0303  0.0241  0.0214  0.0307  0.0353  0.0764  0.0753  0.0691  0.0711
MD   0.0066  0.0125  0.0043  0.0237  0       0.0094  0.0155  0.0281  0.0226  0.0052  0.0093  0.0026  0.0296  0.0424  0.0437  0.0394  0.0402
CN   0.0042  0.0028  0.007   0.0168  0.0094  0       0.0046  0.0116  0.0153  0.0078  0.0067  0.0131  0.021   0.0613  0.059   0.0567  0.0577
PS   0.0082  0.0056  0.0152  0.0212  0.0155  0.0046  0       0.0095  0.0151  0.0124  0.009   0.0183  0.0196  0.0647  0.0623  0.0592  0.0587
SL   0.0156  0.0106  0.0248  0.0256  0.0281  0.0116  0.0095  0       0.0181  0.0209  0.0135  0.0305  0.0256  0.0801  0.0783  0.0757  0.0747
PP   0.0149  0.0144  0.02    0.0303  0.0226  0.0153  0.0151  0.0181  0       0.0195  0.0166  0.0252  0.0261  0.0738  0.0721  0.0699  0.0698
TC   0.0049  0.0085  0.0044  0.0241  0.0052  0.0078  0.0124  0.0209  0.0195  0       0.0059  0.005   0.0246  0.0458  0.0439  0.0413  0.041
CT   0.006   0.0058  0.0112  0.0214  0.0093  0.0067  0.009   0.0135  0.0166  0.0059  0       0.0119  0.0196  0.0564  0.0548  0.0535  0.0521
RGS  0.0047  0.013   0.0039  0.0307  0.0026  0.0131  0.0183  0.0305  0.0252  0.005   0.0119  0       0.0287  0.0387  0.0371  0.0334  0.0354
QT   0.02    0.0225  0.0247  0.0353  0.0296  0.021   0.0196  0.0256  0.0261  0.0246  0.0196  0.0287  0       0.0787  0.0796  0.0761  0.0748
BL   0.0523  0.0573  0.0463  0.0764  0.0424  0.0613  0.0647  0.0801  0.0738  0.0458  0.0564  0.0387  0.0787  0       0.0048  0.003   0.0016
CH   0.0509  0.0532  0.0464  0.0753  0.0437  0.059   0.0623  0.0783  0.0721  0.0439  0.0548  0.0371  0.0796  0.0048  0       0.0043  0.0048
NC   0.0465  0.0503  0.0416  0.0691  0.0394  0.0567  0.0592  0.0757  0.0699  0.0413  0.0535  0.0334  0.0761  0.003   0.0043  0       0.004
PT   0.0466  0.0528  0.0416  0.0711  0.0402  0.0577  0.0587  0.0747  0.0698  0.041   0.0521  0.0354  0.0748  0.0016  0.0048  0.004   0

Note: MC=Mexico City, Mexico; OR=Oriente, Guatemala; CR=Central Valley, Costa Rica; PQ=Peque, Colombia; MD=Medellin, Colombia; CN=Cundinamarca, Colombia; PS=Pasto, Colombia; SL=Salta, Argentina; PP=Paposo, Chile; TC=Tucuman, Argentina; CT=Catamarca, Argentina; RGS=Rio Grande do Sul, Brazil; QT=Quetalmahue, Chile; BL=Baltimore, United States; CH=Chicago, United States; NC=North Carolina, United States; PT=Pittsburgh, United States


Table 2.5: Ancestry partition of Nei’s minimum genetic distance for the admixed populations included in our analyses.

     MC      OR      CR      PQ      MD      CN      PS      SL      PP      TC      CT      RGS     QT      BL      CH      NC      PT
MC   0       0.0021  0.0011  0.0033  0.002   0.0013  0.003   0.01    0.0021  0.0008  0.0001  0.0031  0.0009  0.0516  0.0474  0.0456  0.0475
OR   0.0021  0       0.0061  0.0002  0.0076  0.0003  0.0002  0.0031  0.0002  0.0055  0.0013  0.0098  0.0006  0.0561  0.0523  0.0504  0.0524
CR   0.0011  0.0061  0       0.008   0.0003  0.0046  0.0076  0.0175  0.0061  0       0.0019  0.0006  0.0038  0.0471  0.0427  0.0411  0.0429
PQ   0.0033  0.0002  0.008   0       0.0098  0.0005  0       0.0019  0.0002  0.0072  0.0021  0.0122  0.001   0.061   0.0572  0.0551  0.0572
MD   0.002   0.0076  0.0003  0.0098  0       0.0063  0.0095  0.0203  0.008   0.0004  0.0031  0.0001  0.0054  0.041   0.0368  0.0354  0.037
CN   0.0013  0.0003  0.0046  0.0005  0.0063  0       0.0004  0.0042  0.0001  0.0041  0.0006  0.0082  0.0001  0.0596  0.0555  0.0535  0.0556
PS   0.003   0.0002  0.0076  0       0.0095  0.0004  0       0.002   0.0001  0.0069  0.0019  0.0119  0.0008  0.0624  0.0585  0.0564  0.0585
SL   0.01    0.0031  0.0175  0.0019  0.0203  0.0042  0.002   0       0.0029  0.0164  0.0078  0.0238  0.0052  0.077   0.0733  0.0709  0.0732
PP   0.0021  0.0002  0.0061  0.0002  0.008   0.0001  0.0001  0.0029  0       0.0055  0.0012  0.0102  0.0003  0.0622  0.0581  0.0561  0.0582
TC   0.0008  0.0055  0       0.0072  0.0004  0.0041  0.0069  0.0164  0.0055  0       0.0016  0.0008  0.0033  0.0475  0.0431  0.0415  0.0433
CT   0.0001  0.0013  0.0019  0.0021  0.0031  0.0006  0.0019  0.0078  0.0012  0.0016  0       0.0045  0.0003  0.0543  0.0501  0.0482  0.0502
RGS  0.0031  0.0098  0.0006  0.0122  0.0001  0.0082  0.0119  0.0238  0.0102  0.0008  0.0045  0       0.0072  0.0403  0.0361  0.0347  0.0363
QT   0.0009  0.0006  0.0038  0.001   0.0054  0.0001  0.0008  0.0052  0.0003  0.0033  0.0003  0.0072  0       0.0601  0.0558  0.0538  0.0559
BL   0.0516  0.0561  0.0471  0.061   0.041   0.0596  0.0624  0.077   0.0622  0.0475  0.0543  0.0403  0.0601  0       0.0001  0.0002  0.0001
CH   0.0474  0.0523  0.0427  0.0572  0.0368  0.0555  0.0585  0.0733  0.0581  0.0431  0.0501  0.0361  0.0558  0.0001  0       0       0
NC   0.0456  0.0504  0.0411  0.0551  0.0354  0.0535  0.0564  0.0709  0.0561  0.0415  0.0482  0.0347  0.0538  0.0002  0       0       0
PT   0.0475  0.0524  0.0429  0.0572  0.037   0.0556  0.0585  0.0732  0.0582  0.0433  0.0502  0.0363  0.0559  0.0001  0       0       0

Note: MC=Mexico City, Mexico; OR=Oriente, Guatemala; CR=Central Valley, Costa Rica; PQ=Peque, Colombia; MD=Medellin, Colombia; CN=Cundinamarca, Colombia; PS=Pasto, Colombia; SL=Salta, Argentina; PP=Paposo, Chile; TC=Tucuman, Argentina; CT=Catamarca, Argentina; RGS=Rio Grande do Sul, Brazil; QT=Quetalmahue, Chile; BL=Baltimore, United States; CH=Chicago, United States; NC=North Carolina, United States; PT=Pittsburgh, United States


Table 2.6: Drift partition of Nei’s minimum genetic distance for the admixed populations included in our analyses.

     MC      OR      CR      PQ      MD      CN      PS      SL      PP      TC      CT      RGS     QT      BL      CH      NC      PT
MC   0       0.0019  0.0037  0.0202  0.0046  0.0029  0.0052  0.0057  0.0128  0.0041  0.0058  0.0016  0.0191  0.0007  0.0035  0.0009  0
OR   0.0019  0       0.0035  0.0174  0.0049  0.0025  0.0054  0.0075  0.0142  0.003   0.0045  0.0032  0.0219  0.0012  0.0009  0       0.0004
CR   0.0037  0.0035  0       0.0143  0.004   0.0024  0.0075  0.0072  0.0138  0.0044  0.0093  0.0033  0.021   0       0.0037  0.0005  0
PQ   0.0202  0.0174  0.0143  0       0.0139  0.0162  0.0212  0.0237  0.0301  0.0169  0.0192  0.0185  0.0343  0.0154  0.0181  0.014   0.0139
MD   0.0046  0.0049  0.004   0.0139  0       0.0032  0.006   0.0078  0.0146  0.0048  0.0061  0.0025  0.0242  0.0014  0.0069  0.0041  0.0032
CN   0.0029  0.0025  0.0024  0.0162  0.0032  0       0.0042  0.0074  0.0152  0.0037  0.0062  0.0049  0.021   0.0017  0.0035  0.0032  0.0022
PS   0.0052  0.0054  0.0075  0.0212  0.006   0.0042  0       0.0075  0.015   0.0055  0.0071  0.0064  0.0188  0.0023  0.0039  0.0028  0.0002
SL   0.0057  0.0075  0.0072  0.0237  0.0078  0.0074  0.0075  0       0.0152  0.0045  0.0057  0.0067  0.0204  0.0031  0.005   0.0048  0.0015
PP   0.0128  0.0142  0.0138  0.0301  0.0146  0.0152  0.015   0.0152  0       0.014   0.0154  0.015   0.0258  0.0116  0.014   0.0138  0.0116
TC   0.0041  0.003   0.0044  0.0169  0.0048  0.0037  0.0055  0.0045  0.014   0       0.0044  0.0042  0.0214  0       0.0008  0       0
CT   0.0058  0.0045  0.0093  0.0192  0.0061  0.0062  0.0071  0.0057  0.0154  0.0044  0       0.0074  0.0193  0.0021  0.0047  0.0053  0.0019
RGS  0.0016  0.0032  0.0033  0.0185  0.0025  0.0049  0.0064  0.0067  0.015   0.0042  0.0074  0       0.0215  0       0.001   0       0
QT   0.0191  0.0219  0.021   0.0343  0.0242  0.021   0.0188  0.0204  0.0258  0.0214  0.0193  0.0215  0       0.0186  0.0238  0.0223  0.0189
BL   0.0007  0.0012  0       0.0154  0.0014  0.0017  0.0023  0.0031  0.0116  0       0.0021  0       0.0186  0       0.0046  0.0028  0.0015
CH   0.0035  0.0009  0.0037  0.0181  0.0069  0.0035  0.0039  0.005   0.014   0.0008  0.0047  0.001   0.0238  0.0046  0       0.0043  0.0048
NC   0.0009  0       0.0005  0.014   0.0041  0.0032  0.0028  0.0048  0.0138  0       0.0053  0       0.0223  0.0028  0.0043  0       0.004
PT   0       0.0004  0       0.0139  0.0032  0.0022  0.0002  0.0015  0.0116  0       0.0019  0       0.0189  0.0015  0.0048  0.004   0

Note: MC=Mexico City, Mexico; OR=Oriente, Guatemala; CR=Central Valley, Costa Rica; PQ=Peque, Colombia; MD=Medellin, Colombia; CN=Cundinamarca, Colombia; PS=Pasto, Colombia; SL=Salta, Argentina; PP=Paposo, Chile; TC=Tucuman, Argentina; CT=Catamarca, Argentina; RGS=Rio Grande do Sul, Brazil; QT=Quetalmahue, Chile; BL=Baltimore, United States; CH=Chicago, United States; NC=North Carolina, United States; PT=Pittsburgh, United States


We partitioned the matrix of Nei’s minimum genetic distances among the 17 post-contact populations into a matrix of admixture distances and a matrix of drift distances. Then we used the Eigen vectors produced by multidimensional scaling to summarize patterns within the distance matrices.

The admixture distance matrix produced one Eigen vector that explained 98.6% of the dispersion (Figure 2.4-top). The positions of the 17 post-contact populations on this axis correlate strongly with ancestry fractions: $R^2 = 0.91$ between position and either African ancestry or Indigenous American ancestry. With European ancestry, $R^2 = 0.37$; however, $R^2 = 0.96$ when computed between European ancestry and the absolute value of axis position. Because our model includes three sources of continental ancestry (African, European, and Indigenous American), we expected to find two principal axes of ancestry. However, the genetic diversity among continental sources forms a linear gradient that projects into the mixtures among sources.

Ten Eigen vectors explained the drift distance matrix. However, most of the dispersion was concentrated in the first three Eigen vectors (Figure 2.4-bottom). It is easy to see how these Eigen vectors relate to drift by comparing the positions of populations to their population specific estimates of FST (Table 2.3). The populations with the highest values of FST occupy the terminal positions on the first axis. The second axis draws a contrast between the population with the third highest estimate of FST and the two populations with higher FST. The next three axes contrast populations with middle levels of FST. Axes seven


Figure 2.4: (Top) Positions of the 17 post-contact populations along the principal Eigen vector of the ancestry component of the genetic distance matrix. The shading represents increased European ancestry. (Bottom) Positions of the 17 populations along ten principal Eigen vectors of the drift component of the genetic distance matrix. The shading is proportional to FST.

Finally, we can relate the patterns found in our decomposition of genetic distances back to the patterns evident in the total distance matrix. The positions of populations on the first Eigen vector of the total distance matrix correlate highly with their positions on the principal Eigen vector of the admixture distance matrix ($R^2 = 0.97$). Moreover, the first Eigen vector of the total distance shows little correlation with any of the ten Eigen vectors of the drift distance matrix ($0.00 < R^2 < 0.07$). The positions of populations on the second Eigen vector of the total distance matrix are uncorrelated with the Eigen vector of the admixture distance matrix. However, they show strong correlation with positions of populations on the first Eigen vector of the drift genetic distance matrix ($R^2 = 0.88$) and little correlation with positions on the remaining Eigen vectors of genetic drift ($0.00 < R^2 < 0.03$). In a similar vein, the third Eigen vector of the total distance matrix shows high correlation with the second Eigen vector of the drift distances ($R^2 = 0.74$) and little correlation with positions on the remaining Eigen vectors of genetic drift ($0.00 < R^2 < 0.14$).

Figure 2.5 shows some interesting and unexpected patterns involving the role of genetic drift in the differentiation of post-contact populations in the Americas. Overall, the trend is negative: the greater the genetic distance, the less genetic drift has contributed to it. However, this negative trend is absent in all three groupings of populations when viewed individually. African-American by African-American comparisons show a strong positive relationship ($R^2 = 0.69$), although the number of comparisons is small, and the trend is not statistically significant. When comparing pairs of Latin American populations, there is no relationship between the total genetic distance and the percent that genetic drift accounts for ($R^2 = 0.00$). For example, among Latin American populations showing the least differentiation, between 20% and 100% of the total differentiation owes to drift. Similarly, among Latin American populations showing the most differentiation, between 20% and 100% of the total differentiation owes to drift. Finally, genetic drift can be important to differentiation, even when comparing a Latin American population with an African-American population.
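The quantity plotted on the vertical axis of Figure 2.5 can be illustrated with values read from Tables 2.4 and 2.6. The sketch below assumes the percentage is simply the drift distance divided by the total minimum distance; the function name is hypothetical.

```python
def percent_drift(total_distance, drift_distance):
    """Share of Nei's minimum distance attributed to drift, in percent."""
    if total_distance <= 0:
        raise ValueError("total distance must be positive")
    return 100.0 * drift_distance / total_distance

# Total and drift distances read from Tables 2.4 and 2.6.
pairs = {
    "Baltimore-Pittsburgh": (0.0016, 0.0015),
    "Mexico City-Oriente": (0.0041, 0.0019),
    "Mexico City-Baltimore": (0.0523, 0.0007),
}
for name, (total, drift) in pairs.items():
    print(name, percent_drift(total, drift))
```

These three pairs illustrate the three kinds of comparison: drift dominates the small distance between two African-American samples, accounts for roughly half of the distance between two Latin American samples, and contributes very little to the large distance between a Latin American and an African-American sample.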


Figure 2.5: Percent genetic distance owing to genetic drift plotted against total genetic distance. Points are color-coded to identify three levels of comparison: African-American by African-American (orange), Latin American by Latin American (blue), and Latin American by African-American (green).

2.6 Discussion

Estimating the allele frequencies of source populations has been a significant issue throughout the history of genetic admixture studies (Reed, 1969; Cavalli-Sforza and Bodmer, 1971; Adams and Ward, 1973). There are two significant problems. First, the populations that mixed may be unidentified, or no longer exist (Chakraborty, 1986). Second, the source populations may have evolved since the time of mixing. Modern statistical methods partially ameliorate both these problems. The likelihood method from Tang et al. (2005) apportions allele frequencies from the mixed samples back to the ancestral sources. However, allele frequencies from modern proxies guide the apportionment, and poor choices for the proxies can bias the ancestry estimation. Allele frequency drift in the proxies may also skew the apportionment. Despite this potential problem, we found an admixture model that fits this large data set well. FST is below 0.001 in all four African-American samples, and five Latin American samples. Nonetheless, the three populations that show the most drift enter into about a third of the pairwise distance comparisons.

There are different ways to characterize the genetic structure of admixed populations. One approach is in the space defined by proportions of ancestry from different continental source populations. Another approach is in the space defined by allele frequencies in the admixed populations. These vantage points are connected, both evolutionarily and methodologically. From the perspective of evolution, a population receives its alleles from its ancestors. From the methodological perspective, we estimate ancestry from the alleles contained in samples from populations. The results of this study seem paradoxical in light of the fundamental connection between ancestry and allele frequencies in admixed populations. Notably, the plot of ancestry proportions in Figure 2.2 looks distinct from the plot of genetic distance coordinates in Figure 2.3. The triangle plot of ancestry fractions in Figure 2.2 fills a two-dimensional space, whereas ancestry correlates with only one major axis of allele frequencies summarized as genetic distances. We can graphically resolve the two plots by projecting the apex of the triangle, which represents European ancestry, onto the axis between African and Indigenous American ancestry. We can also resolve the apparent disparity analytically and evolutionarily.

The ancestry component of genetic distance (Fig. 2.1 and Eq. 2.4) is complicated by the fact that differences in ancestry between the admixed populations are modulated by genetic distances between the sources. In other words, the degree to which differences in ancestry contribute to the genetic differentiation of mixed populations depends on the levels of differentiation among the ancestral source populations (Cavalli-Sforza et al., 1994). The single axis of ancestry that we see in the principal coordinate plot reflects the recent evolution of human diversity. Genetic differentiation on the intercontinental scale has been driven by a series of founder effects (Ramachandran et al., 2005; Hunley et al., 2009). The entire species traces back to a population that lived in Africa approximately 200,000 years ago. A founder effect led to the habitation of Eurasia more recently, less than 100,000 years ago. The peopling of the Americas resulted from a subsequent founder effect from a population residing in Eurasia. A consequence of this history is that populations living in Africa have the greatest diversity, in terms of both the kinds of alleles and heterozygosity. Eurasian populations have a subset of the allelic types found in Africans and lower heterozygosity. Indigenous American populations have a subset of the allelic types found in Eurasians and still lower heterozygosity (Li et al., 2008; Long et al., 2009). Ultimately, loss of variation via the founder effects created a single trajectory of genetic distances among populations on different continents. The ancestry of admixed populations will determine their placement on the axis, but it cannot introduce new axes of variation.

A full account of the genetic structure of admixed populations requires us to look at the effects of genetic drift, in addition to admixture. The principal coordinates of the drift distance matrix display two principal findings for the 13 Latin American populations. First, these Latin American populations have drifted independently. There is no evidence for a concerted pattern that a phylogenetic radiation from a single founding event would produce. It is likely that Latin American populations were founded independently by admixture in many locations. The proportions of continental ancestry differed among the populations. In a few populations, such as the Peque, Paposo, and Quetalmahue, high values of FST indicate modest founder effects after, or during, the formation of populations (Table 2.3). These founder effects superimposed a new level of genetic structure on that created by admixture. Second, the drift and ancestry fractions contribute about equally to the pattern of genetic differentiation among the Latin Americans (Fig. 2.5). We do not observe a correlation between genetic distance and the impact of drift. Drift may dominate the distance between either closely related, or distantly related, populations. These two findings lend further support to the position expressed by Tishkoff and Kidd (2004) that anthropologists and geneticists cannot conceive of Latin Americans as a homogeneous genetic population. Our analysis shows that the genetic structure of Latin Americans involves more than varying proportions of continental ancestry.

Drift accounts for over 90% of the genetic distance among the four African-American populations. However, it should be noted that small differences characterize the populations analyzed here. Broader coverage of African-American populations, perhaps including the Gullah of South Carolina (Parra et al., 2001), and African-Americans living on the West Coast (Reed, 1969), could increase genetic distances and show instances where both admixture and drift drive patterns of differentiation.

Genetic drift plays a less dominant role in the genetic differentiation between African-American populations and Latin American populations (Fig. 2.5). A large role for continental ancestry is unsurprising because all of the 13 Latin American populations have below 10% African ancestry, while the African-American populations have above 75% African ancestry. However, it is notable that genetic drift accounts for up to a third of the distances between the most divergent populations.

In conclusion, this research introduces a new method to assess genetic diversity in admixed populations. Specifically, we show how to partition the minimum genetic distance between a pair of admixed populations into two components, one related to differences in continental ancestry and the other related to genetic drift in the admixed population. This partition allows greater precision in identifying how the recent evolutionary process has shaped modern human diversity. Our work paves the way for future investigations of geographic regions such as the Caribbean where many populations were formed by a complex combination of admixture and founder effects.

Chapter 3

Identifying the Number of Source Populations and Their Identities in Genetic Ancestry Analyses

3.1 Overview

Objective: We investigate the ancestry of a mixed population whose ancestry is uncertain. We propose multiple ancestry models that differ in the number of populations that contribute to the mixed population, and use the Akaike Information Criterion to choose the best model. Our focal admixed population is the Cape Coloured of South Africa. The Cape Coloured exemplify the challenges associated with estimating genetic ancestry.

Materials and Methods: We provide a history of South Africa to describe the development of the Cape Coloured. We analyzed the genotypes of 207 individuals from 11 contemporary populations at 618 autosomal microsatellite loci. Using maximum likelihood, we estimate allele frequencies for the ancestral sources, ancestry proportions and expected allele frequencies among the Cape Coloured. We construct 26 models, ranging from two to five ancestral sources, and use AIC to determine the best fitting model.

Results: The ancestry estimates of the Cape Coloured fluctuate based on which ancestral sources are included in each model. AIC indicates the best fitting model consists of two ancestral sources, the San and East Asians, and estimates 9,712 parameters. The fit for each model decreased as the number of parameters increased. All models have high R² values for the observed and predicted Cape Coloured allele frequencies, ranging from 0.930 to 0.951.

Discussion: We demonstrate the utility of AIC in multi-model hypothesis testing for admixture research. Our analyses support the concept of parsimony. The best fitting models have a minimal number of parameters, and contain two ancestral sources.

3.2 Introduction

A goal in admixture analyses is to estimate the contributions of ancestors to admixed individuals and populations. This is typically achieved by constructing allele frequencies in a mixed sample as a linear combination of allele frequencies in populations that contributed ancestors to the mixed sample. Estimating the allele frequencies of the ancestors is a challenging problem because the true sources of ancestry may no longer exist, or may not have been genetically sampled, or are otherwise unavailable for study. Tang et al. (2005) recommend a solution to this problem which consists of constructing pseudo-ancestors, who are descendants from close relatives of the true ancestors. However, for populations such as the Cape Coloured, it can be difficult to choose pseudo-ancestors because the number and identities of the true ancestral sources are unknown.

Here we describe an approach to investigate the ancestry of a contemporary mixed population when there is uncertainty about the sources of ancestry. In this approach, we propose multiple models that differ in the number of populations that contribute ancestry to the mixed population and we use the Akaike Information Criterion (AIC) to choose the best model. The focal admixed population to which we apply AIC is the Cape Coloured of South Africa. Coloured is a nationally recognized ethnic group in South Africa. The Cape Coloured exemplify the challenges in analysis of genetic ancestry in an admixed population. Population geneticists have designated the Cape Coloured as a population of mixed ancestry (Tishkoff et al., 2009), and have also classified them as Afro-Europeans of mixed ancestry (Pemberton et al., 2013). However, Cape Coloured history suggests that such labels are too simple because the Cape Coloured people are likely to have ancestors from as many as five ethno-geographic populations, including non-Africans and non-Europeans. Two recent studies of Cape Coloured ancestry have postulated different ethno-geographic sources of ancestry, and, as should be expected, produced differing results (Patterson et al., 2010; de Wit et al., 2010).
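The model-choice step can be sketched in a few lines of Python. The sketch uses the standard definition AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximized likelihood, and it selects the candidate with the lowest AIC. The model names, log-likelihoods, and parameter counts are hypothetical placeholders chosen for illustration; they are not results from this study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: AIC = 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def best_model(models):
    """Return the name of the model with the lowest AIC.

    `models` maps a model name to a (maximized log-likelihood,
    number of estimated parameters) pair.
    """
    return min(models, key=lambda name: aic(*models[name]))


# Hypothetical candidates (illustrative values only).
candidates = {
    "two sources": (-1000.0, 10),
    "three sources": (-998.5, 15),
}

for name, (log_lik, k) in candidates.items():
    print(name, aic(log_lik, k))
print("selected:", best_model(candidates))
```

In this toy comparison the three-source model has a slightly higher likelihood, but the gain does not offset the penalty for its five extra parameters, so the two-source model is selected. This is the sense in which AIC favors parsimony.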

Research design and statistical methods play an important role in the identification of sources of ancestry. All designs and methods have a similar recognition of the population genetic process. Allele frequencies in an admixed individual, or population, are a linear combination of allele frequencies in the ancestral sources, and the coefficients of the linear combination represent ancestry fractions. Two distinct strategies emerge from this common starting point.
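The linear-combination relationship can be made concrete with a minimal Python sketch. The function `mixed_frequencies` builds admixed allele frequencies from source frequencies and ancestry fractions, and `two_source_ancestry` recovers the ancestry fraction for a two-source case. Note that the recovery step uses ordinary least squares as a simple stand-in, not the maximum likelihood estimation used in this chapter, and that both function names and all frequencies are hypothetical.

```python
def mixed_frequencies(source_freqs, ancestry):
    """Allele frequencies in an admixed population as a linear combination
    of source frequencies, weighted by ancestry fractions that sum to 1."""
    if abs(sum(ancestry) - 1.0) > 1e-9:
        raise ValueError("ancestry fractions must sum to 1")
    n_alleles = len(source_freqs[0])
    return [
        sum(m * src[a] for m, src in zip(ancestry, source_freqs))
        for a in range(n_alleles)
    ]


def two_source_ancestry(p_mix, p1, p2):
    """Least-squares estimate of the fraction m from source 1 in
    p_mix ~= m * p1 + (1 - m) * p2, clipped to the interval [0, 1]."""
    num = sum((pm - b) * (a - b) for pm, a, b in zip(p_mix, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:
        raise ValueError("sources are identical; ancestry is not identifiable")
    return min(1.0, max(0.0, num / den))


# Hypothetical allele frequencies in two sources (illustrative only).
source_1 = [0.8, 0.2]
source_2 = [0.1, 0.9]
p_mix = mixed_frequencies([source_1, source_2], ancestry=[0.3, 0.7])
print(p_mix)
print(two_source_ancestry(p_mix, source_1, source_2))
```

The denominator of the estimator is the squared difference between the source frequencies, which shows why ancestry is hard to estimate when the candidate sources are genetically similar.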

Strategy #1: First, assemble a meta-sample that combines individuals from a focal mixed population with samples from regional populations throughout the world. Second, fit cluster models that treat the meta-sample as a mixture of a predefined number of ancestral source populations. Third, run a sequence of cluster analyses that increase the number of ancestral sources for the meta-sample, until the regional populations appear as having approximately homogeneous ancestry, while the individuals from the focal mixed population have varying degrees of ancestry from the ancestral sources (Pritchard et al., 2000; de Wit et al., 2010).
