Computing $F_{st}$ - Methods for demographic inference from single-nucleotide polymorphism data

Data can be simulated under both the migration and isolation models so that they produce similar $F_{st}$ values between subpopulations. Section 2.1.1 showed ways of estimating a migration rate, through (2.2), and population divergence times, via (2.3). Slatkin (1991, 1993) approximated $F_{st}$ through coalescent times by considering a definition of $F_{st}$ in terms of identity by descent, the probability that two genes reach their most recent common ancestor unaffected by any mutation. That is,

$$F_{st} = \frac{f_0 - \bar{f}}{1 - \bar{f}}, \tag{3.5}$$

where $f_0$ is the probability of identity by descent of two genes sampled from the same subpopulation and $\bar{f}$ is the probability of identity by descent of two genes sampled from the whole population, regardless of subpopulation. Wakeley (2009) writes the probability of identity by descent as an infinite sum:

$$P\{\mathrm{IBD}\} = \sum_{t=1}^{\infty} (1 - \mu)^{2t} \left(1 - \frac{1}{N}\right)^{t-1} \frac{1}{N} = \sum_{t=1}^{\infty} (1 - \mu)^{2t} P(t), \tag{3.6}$$

where $\mu$ is the mutation rate per generation, $P(t)$ is the probability that the two lineages reach their most recent common ancestor in generation $t$, and $(1 - \mu)^{2t}$ is the probability that neither lineage is affected by a mutation up to and including generation $t$. For a small mutation rate, Slatkin (1993) approximated (3.6) by

$$\bar{f} \approx \sum_{t=1}^{\infty} (1 - 2t\mu) P(t) = 1 - 2\mu\bar{t},$$

and showed from (3.5) that

$$F_{st} = \frac{f_0 - \bar{f}}{1 - \bar{f}} \approx \frac{(1 - 2\mu\bar{t}_0) - (1 - 2\mu\bar{t})}{1 - (1 - 2\mu\bar{t})} = \frac{\bar{t} - \bar{t}_0}{\bar{t}}, \tag{3.7}$$

where $\bar{t}_0$ is the average coalescent time between two genes from the same subpopulation and $\bar{t}$ is the average coalescent time between two genes sampled from the whole population.

CHAPTER 3. DATA SIMULATION
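The small-mutation-rate step from (3.6) to $1 - 2\mu\bar{t}$ can be checked numerically. The following is a minimal sketch, assuming a single population in which $P(t) = (1 - 1/N)^{t-1}/N$, so that the mean coalescent time is $N$ generations. The function names and the values of `mu` and `N` are illustrative assumptions, not taken from the text.

```python
def ibd_exact(mu: float, N: int, terms: int = 200_000) -> float:
    """Truncated version of the sum in (3.6) with P(t) = (1 - 1/N)**(t - 1) / N."""
    survive = (1.0 - mu) ** 2   # neither lineage mutates in one generation
    no_coal = 1.0 - 1.0 / N     # the pair does not coalesce in one generation
    term = survive / N          # t = 1 term of the sum
    total = 0.0
    for _ in range(terms):
        total += term
        term *= survive * no_coal   # move from generation t to t + 1
    return total


def ibd_approx(mu: float, t_bar: float) -> float:
    """Small-mutation-rate approximation 1 - 2 * mu * t_bar."""
    return 1.0 - 2.0 * mu * t_bar


def fst_from_times(t0_bar: float, t_bar: float) -> float:
    """F_st from mean coalescent times, as in (3.7)."""
    return (t_bar - t0_bar) / t_bar


if __name__ == "__main__":
    mu, N = 1e-6, 500   # illustrative values only
    # The mean of P(t) is N generations in this single-population setting.
    print(ibd_exact(mu, N), ibd_approx(mu, N))
    print(fst_from_times(1.0, 1.125))
```

The two printed identity-by-descent values should agree closely when $\mu N$ is small, which is the regime in which (3.7) is intended to hold.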

3.5.1 Migration model

An in-depth analysis of the migration model with $D$ subpopulations was given by Slatkin (1991). Suppose each subpopulation is of haploid size $N$. In this model the migration and coalescent processes occur independently. Slatkin (1991) presents the result that if subpopulations exchange migrants at a constant rate $m$, then $\bar{t}_0 = N_T = DN$, that is, the total population size. Using this result, the probability that two genes from the same subpopulation coalesce in the previous generation is $\frac{1}{DN}$ and so, measuring time in units of $DN$ generations,

$$\bar{t}_0 = 1.$$

In the case of two genes that are from different subpopulations, Slatkin considered the probability that the two genes were in the same subpopulation in the previous generation.

Let $m$ be the probability that a gene migrates to any of the other $D - 1$ subpopulations, so that, given that it migrates, the probability that it moves to a specific subpopulation is $\frac{1}{D - 1}$. There are three possible ways in which two genes, now in different subpopulations, could have been in the same subpopulation in the previous generation. The first two scenarios are the cases of one gene migrating, backwards in time, into the subpopulation occupied by the other gene. The last is the case in which both genes migrate from different subpopulations in the current generation to the same subpopulation in the previous generation. When $m \to 0$, the last scenario has negligible probability. Therefore,

$$\Pr\{\text{two genes in the same subpopulation in the previous generation}\} = 2m(1 - m)\frac{1}{D - 1} \approx \frac{2m}{D - 1}$$

for small $m$. Hence,

$$\begin{aligned}
\bar{t}_1 &= \text{the time until the two genes are in the same subpopulation} \\
&\quad + \text{the average time until two genes in the same subpopulation coalesce} \\
&= \frac{D - 1}{2m} + DN.
\end{aligned}$$

Measuring time in DN generations leads to

$$\bar{t}_1 = 1 + \frac{D - 1}{2DNm}.$$
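To see how little the neglected third scenario matters, the per-generation meeting probability can be computed with and without it. This is a sketch under two assumptions that go slightly beyond the text: a migrant is taken to choose uniformly among the other $D - 1$ subpopulations, so both genes land in the same third subpopulation with probability $(D - 2)\,(m/(D - 1))^2$, and the waiting time until the genes meet is treated as geometric with mean $1/p$. All function names and parameter values are illustrative.

```python
def p_meet_approx(m: float, D: int) -> float:
    """Small-m approximation 2m / (D - 1) used in the text."""
    return 2.0 * m / (D - 1)


def p_meet_exact(m: float, D: int) -> float:
    """Meeting probability including the third scenario (assumed uniform migration)."""
    per_target = m / (D - 1)                     # migrate to one specific subpopulation
    one_moves = 2.0 * per_target * (1.0 - m)     # scenarios 1 and 2
    both_move = (D - 2) * per_target ** 2        # scenario 3: same third subpopulation
    return one_moves + both_move


def t1_scaled(m: float, D: int, N: int, exact: bool = False) -> float:
    """Mean coalescent time for genes in different subpopulations, in units of DN generations."""
    p = p_meet_exact(m, D) if exact else p_meet_approx(m, D)
    generations = 1.0 / p + D * N                # wait to meet, then wait to coalesce
    return generations / (D * N)


if __name__ == "__main__":
    m, D, N = 0.001, 5, 500                      # illustrative values only
    print(t1_scaled(m, D, N), t1_scaled(m, D, N, exact=True))
    print(1.0 + (D - 1) / (2.0 * D * N * m))     # closed form for comparison
```

For small $m$ the exact and approximate values of $\bar{t}_1$ differ only slightly, which is the justification for dropping the third scenario.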

Lastly, to find $\bar{t}$, Slatkin applied the law of total expectation, namely

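$$\begin{aligned}
\bar{t} &= \Pr\{\text{two genes from the same subpopulation}\}\,\bar{t}_0 + \Pr\{\text{two genes from different subpopulations}\}\,\bar{t}_1 \\
&= \frac{1}{D}\bar{t}_0 + \left(1 - \frac{1}{D}\right)\bar{t}_1 \\
&= 1 + \frac{(D - 1)^2}{2D^2 Nm}.
\end{aligned}$$

As a result, by applying (3.7),

$$F_{st} = \left(1 + \frac{2NmD^2}{(D - 1)^2}\right)^{-1}. \tag{3.8}$$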
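The chain from $\bar{t}_0$ and $\bar{t}_1$ to (3.8) can be followed step by step in code. The sketch below simply evaluates the expressions given above and compares the result with the closed form; the function names and the example parameter values are assumptions made for illustration.

```python
def migration_fst(m: float, N: int, D: int) -> float:
    """F_st under the migration model, built from the mean coalescent times."""
    t0 = 1.0                                     # same subpopulation, units of DN generations
    t1 = 1.0 + (D - 1) / (2.0 * D * N * m)       # different subpopulations
    t_bar = t0 / D + (1.0 - 1.0 / D) * t1        # weighted average over the two cases
    return (t_bar - t0) / t_bar                  # equation (3.7)


def migration_fst_closed(m: float, N: int, D: int) -> float:
    """Closed form (3.8)."""
    return 1.0 / (1.0 + 2.0 * N * m * D ** 2 / (D - 1) ** 2)


if __name__ == "__main__":
    for D in (2, 5, 10):                         # illustrative values only
        print(D, migration_fst(0.001, 500, D), migration_fst_closed(0.001, 500, D))
```

With $D = 2$ the closed form reduces to $1/(1 + 8Nm)$, which is the expression that is later inverted to obtain a migration rate for a target $F_{st}$.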

3.5.2 Isolation model

In the isolation model, to find $\bar{t}_0$ consider two genes belonging to the same subpopulation.

Within each isolated subpopulation, it is assumed that the lineages coalesce according to a neutral Wright-Fisher model with population size $N$, as illustrated in figure 3.7.

Measuring time in generations, the probability that the two genes coalesce in the previous generation is $\frac{1}{N}$. Since the ancestral population size is also assumed to be $N$, the average time until the two genes coalesce is $N$ generations, whether they coalesce before or after the split. Measuring time in units of $DN$ generations,

$$\bar{t}_0 = \frac{1}{D}.$$

Two genes from different subpopulations can only coalesce, looking backwards in time, after the population split time, $\tau$. Therefore, the expected coalescent time of two genes that belong to different subpopulations, $\bar{t}_1$, equals the population split time $\tau$ plus the expected coalescent time of two genes from the same subpopulation,

$$\bar{t}_1 = \tau + \frac{1}{D}.$$

To find the average coalescent time of two genes from the entire sample, the two genes are taken to belong to the same subpopulation or to different subpopulations with equal probability. Therefore,

$$\bar{t} = \tfrac{1}{2}(\bar{t}_1 + \bar{t}_0),$$

and so

$$\bar{t} = \frac{1}{D} + \frac{\tau}{2}.$$

From (3.7),

$$F_{st} = \frac{\tau}{2D^{-1} + \tau}. \tag{3.9}$$

Figure 3.7: Isolation model with two subpopulations.
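The isolation-model result (3.9) follows from the same recipe as the migration model, and can be sketched in a few lines. The code below evaluates $\bar{t}_0$, $\bar{t}_1$ and $\bar{t}$ as given in the text, with $\tau$ measured in units of $DN$ generations, and compares the outcome with (3.9). Function names and example values are illustrative assumptions.

```python
def isolation_fst(tau: float, D: int) -> float:
    """F_st under the isolation model, built from the mean coalescent times."""
    t0 = 1.0 / D                     # same subpopulation, units of DN generations
    t1 = tau + 1.0 / D               # different subpopulations: wait for the split first
    t_bar = 0.5 * (t0 + t1)          # equal weighting, as in the text
    return (t_bar - t0) / t_bar      # equation (3.7)


def isolation_fst_closed(tau: float, D: int) -> float:
    """Closed form (3.9)."""
    return tau / (2.0 / D + tau)


if __name__ == "__main__":
    for tau in (0.05, 0.25, 1.0):    # illustrative values only
        print(tau, isolation_fst(tau, 2), isolation_fst_closed(tau, 2))
```

For $D = 2$ the closed form becomes $\tau/(1 + \tau)$, so $F_{st}$ rises from 0 towards 1 as the split time grows.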

As a result, it is possible to choose the parameters $\tau$ and $m$ so that SNP data simulated from $D$ subpopulations under the two models produce similar pairwise $F_{st}$ values. In the case of two subpopulations, for a given $F_{st}$, the appropriate values are, from (3.9) and (3.8) respectively,

$$\tau = \frac{F_{st}}{1 - F_{st}}, \tag{3.10}$$

$$m = \frac{1 - F_{st}}{8N F_{st}}. \tag{3.11}$$
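Equations (3.10) and (3.11) can be used to tabulate matched parameter pairs for a grid of target $F_{st}$ values. The following sketch does this for two subpopulations and confirms that substituting the parameters back into the $D = 2$ forms of (3.9) and (3.8) returns the target. The function names and the grid are assumptions; $N = 500$ matches the subpopulation size quoted for the simulations.

```python
def tau_from_fst(fst: float) -> float:
    """Split time (units of 2N generations) giving the target F_st, equation (3.10)."""
    return fst / (1.0 - fst)


def m_from_fst(fst: float, N: int) -> float:
    """Per-generation migration rate giving the target F_st, equation (3.11)."""
    return (1.0 - fst) / (8.0 * N * fst)


def matched_parameters(targets, N: int):
    """Return (target, tau, m, F_st from tau, F_st from m) for each target value."""
    rows = []
    for fst in targets:
        tau = tau_from_fst(fst)
        m = m_from_fst(fst, N)
        back_iso = tau / (1.0 + tau)             # (3.9) with D = 2
        back_mig = 1.0 / (1.0 + 8.0 * N * m)     # (3.8) with D = 2
        rows.append((fst, tau, m, back_iso, back_mig))
    return rows


if __name__ == "__main__":
    for row in matched_parameters([0.05, 0.10, 0.15, 0.20, 0.25, 0.30], N=500):
        print("F_st=%.2f  tau=%.4f  m=%.6f  check_iso=%.4f  check_mig=%.4f" % row)
```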

Simulation results are shown in figure 3.8. In each case, $D = 2$, $N_1 = N_2 = 500$ and $n_1 = n_2 = 50$. For a range of $F_{st}$ values, the migration rate $m$ and population split time $\tau$ were calculated using (3.11) and (3.10), data were simulated under both models, and $F_{st}$ was estimated using (3.7). Under the isolation model, the estimated $F_{st}$ values appear almost identical to the predetermined $F_{st}$ values, as expected, whereas under the migration model there is a slight underestimation for larger $F_{st}$ values.

Figure 3.8: Estimates of $F_{st}$ under migration and isolation models with two subpopulations, with $\tau$ and $m$ estimated using (3.10) and (3.11). [Plot of estimated $\hat{F}_{st}$ against $F_{st}$, both axes running from 0 to 0.30, with one series for the migration model and one for the isolation model.]
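The matching of the two models can also be illustrated by simulation, although the sketch below is much simpler than the SNP simulations described above and is not the procedure used in the text. It samples pairwise coalescent times directly for $D = 2$, with time in units of $2N$ generations, and plugs the sample means into (3.7). It assumes a continuous-time approximation in which a pair in the same subpopulation coalesces at rate 2 and each lineage changes subpopulation at rate $2Nm$. All names, the seed and the number of replicates are illustrative.

```python
import random


def pair_time_isolation(different: bool, tau: float, rng: random.Random) -> float:
    """Coalescent time of one pair under the isolation model (D = 2)."""
    wait = rng.expovariate(2.0)              # coalescence in a population of size N
    return tau + wait if different else wait


def pair_time_migration(different: bool, N: int, m: float, rng: random.Random) -> float:
    """Coalescent time of one pair under the two-subpopulation migration model."""
    mig = 4.0 * N * m                        # rate at which either lineage switches
    t = 0.0
    while True:
        if different:
            t += rng.expovariate(mig)        # wait until the lineages share a subpopulation
            different = False
        else:
            t += rng.expovariate(2.0 + mig)  # next event: coalescence or migration
            if rng.random() < 2.0 / (2.0 + mig):
                return t                     # the pair coalesced
            different = True                 # a lineage migrated away first


def fst_hat(sampler, reps: int, rng: random.Random) -> float:
    """Estimate F_st from mean sampled coalescent times via (3.7)."""
    t0 = sum(sampler(False, rng) for _ in range(reps)) / reps
    t1 = sum(sampler(True, rng) for _ in range(reps)) / reps
    t_bar = 0.5 * (t0 + t1)                  # equal weighting of the two cases
    return (t_bar - t0) / t_bar


def compare_models(target: float, N: int, reps: int, seed: int):
    """Return (isolation estimate, migration estimate) for one target F_st."""
    rng = random.Random(seed)
    tau = target / (1.0 - target)            # equation (3.10)
    m = (1.0 - target) / (8.0 * N * target)  # equation (3.11)
    iso = fst_hat(lambda d, r: pair_time_isolation(d, tau, r), reps, rng)
    mig = fst_hat(lambda d, r: pair_time_migration(d, N, m, r), reps, rng)
    return iso, mig


if __name__ == "__main__":
    print(compare_models(target=0.1, N=500, reps=50_000, seed=1))
```

Both estimates should fall close to the target value, up to Monte Carlo error. Because the sketch works with coalescent times rather than sampled SNPs, it is not expected to reproduce the slight underestimation reported for the migration model.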

Distinguishing models

For a given data set, it is often possible to fit both an isolation model and a migration model. In order to infer demographic history, it is important to be able to distinguish between these models. Consider the two models shown in figure 4.1. Figure 4.1(a) shows a migration model with two subpopulations, which exchange migrants at rate $m$ per generation. Figure 4.1(b) shows the isolation model with two subpopulations that diverged at time $\tau$ in the past. In both cases, assume each of the two subpopulations is of haploid size $N$ and time is measured in units of $2N$ generations.

4.1 Methods of distinguishing migration from isolation

This section introduces and briefly describes established ways of distinguishing these models.

4.1.1 Pairwise differences

Figure 4.1: (a) An example of the migration model with two subpopulations that exchange migrants at rate $m$. (b) An example of the isolation model with two subpopulations that diverged at time $\tau$.

Wakeley (1996) showed that the isolation model can be identified by a hypothesis test involving the variance of pairwise differences within and between subpopulations. Suppose DNA sequence data are available from two populations with sample sizes $n_1$ and $n_2$, and let $k_{jj'}$ be the number of differences between sequences $j$ and $j'$. Wakeley defined the average pairwise differences between sequences within subpopulation $i$ as

d_i = 1 …

and between the two subpopulations as

d₁₂ = 1 …

together with the variance of the average pairwise differences (within and between subpopulations):

s²_i = 1 …

respectively.
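As an illustration of the quantities just described, the sketch below computes the mean and variance of pairwise differences within a sample and between two samples of aligned sequences. It is not Wakeley's exact definition: the normalisation of the variance is not recoverable from the text here, so the code assumes a plain average of squared deviations over all pairs. The function names and the toy sequences are hypothetical.

```python
from itertools import combinations, product


def differences(a: str, b: str) -> int:
    """Number of sites at which two aligned, equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def mean_and_variance(values):
    """Mean and variance, dividing by the number of pairs (an assumed normalisation)."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


def within_stats(seqs):
    """Mean and variance of pairwise differences within one sample."""
    return mean_and_variance([differences(a, b) for a, b in combinations(seqs, 2)])


def between_stats(seqs1, seqs2):
    """Mean and variance of pairwise differences between two samples."""
    return mean_and_variance([differences(a, b) for a, b in product(seqs1, seqs2)])


if __name__ == "__main__":
    pop1 = ["AAAA", "AAAT", "AATT"]   # toy data for illustration only
    pop2 = ["TTTT", "TTTA"]
    print(within_stats(pop1), within_stats(pop2), between_stats(pop1, pop2))
```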

Wakeley considered many functions of the intra-population statistics $s_1^2/d_1$ and $s_2^2/d_2$ that may be used to distinguish the two models. From the set of functions considered, he found that when the migration rate was high, or the population divergence time was low,

Ψ = n1(n1 − 1)s1 + n2(n2 − 1)s2 + 2n1n2 s12

was most successful in distinguishing the two models. He compared the expectation of Ψ under the two models over a range of migration rates. As the migration rate increased, the expectation under both models converged to the same value, whereas for smaller migration rates, and so more ancient population divergence times, the expectation under the migration model was higher, and so the isolation model is rejected for ‘large’ values of Ψ.
