6.4 Discussion
7.1.3 Estimating τ
Test III relies on adequately estimating τ and currently applies
ˆ
τ =
Fst
1−Fst, if 1−FFst
st > 0.013;
0, otherwise,
with Fstestimated by (5.1). The value 0.013 was shown to improve the estimates produced in figure 6.12. Applying the same estimator to ascertained data leads to figure 7.5, which shows confidence bands for τ . 1000 SNPs were simulated from two subpopulations of sample size 10 and na = 2 and ˆτ computed. Repeating this 100 times, the endpoints of the confidence bands are taken as the lower and upper 2.5th percentiles. Evidently, ascertained data fail to lead to good estimates of τ , perhaps not surprisingly.
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
0.00.20.40.60.81.01.2
τ
τ^
Figure 7.5: Confidence bands for τ from ascertained data.
Given the ascertainment scheme, the Markov-chain Monte-Carlo ABC algorithm used in
CHAPTER 7. EXTENSIONS TO THE HYPOTHESIS TEST 144
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
220 260 300 340
ππ
0204060
210 270 330 390
ππB
0204060
200 240 260 300
ππW
04080
−0.7 −0.1 0.5
Dt
04080
120 220 320 420
ηη1
120 220 320 420
ηηmax
04080
Figure 7.6: Histograms of summary statistics from data simulated under the isolation model without ascertainment (green bars), with ascertainment (red bars) and using the maximum-likelihood allele frequencies (orange bars).
section 5.3.1 can be used to estimate τ . Given an initial draw τ0, this method iteratively simulates from a transition distribution, chosen to be Normal with mean τi−1and variance σ2, and accepts a draw if the observed Fst is close to the simulated Fst. This algorithm can directly accommodate ascertainment by simulating from the isolation model with as-certainment. The statistic chosen in this algorithm, Fst, is calculated from the ascertained data and so it may be beneficial either to correct for ascertainment and calculate Fst using (5.2) or to use (7.4), not accounting for ascertainment. Figure 7.7 shows confidence bands for ˆτ calculated by taking the lower and upper 2.5 percentiles of the estimated densities
p(τ |Fst) from the ABC MCMC algorithm correcting and not correcting for ascertainment.
Simply using the ascertainment data, confidence bands, which contain the true values of τ , are produced which are narrower compared to the corrected data. Therefore, this algorithm works well, directly accounting for ascertainment without any allele frequency corrections.
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
0.00.20.40.60.8
ττ
ττ^
(a)
0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7
0.00.20.40.60.8
ττ
ττ^
(b)
Figure 7.7: Confidence bands for ˆτ from ascertained data (a) correcting for ascertainment and (b) not correcting for ascertainment.
7.1.4 Hypothesis test
This section aims to incorporate ascertainment directly into Test III. In its current state, ascertained data may be tested using Test III directly by correcting for ascertainment.
Figure 7.6 demonstrated that, in most cases, the distributions of the statistics from non-ascertained data and corrected data are roughly similar. Excluding πW and πB from the set of statistics, the type I error rate is shown in figure 7.8. Most of the confidence intervals lie around 0.5, therefore, by correcting for ascertainment, Test III does not control the type I error rate.
Figure 7.1 compared the distributions of the summary statistics under the migration and
CHAPTER 7. EXTENSIONS TO THE HYPOTHESIS TEST 146
ττ Prob of rejecting H0 00.250.50.751
0.01 0.03 0.05 0.07 0.1 0.3 0.5 0.7
Figure 7.8: Type I error rate of Test III correcting for ascertainment.
isolation models with ascertainment and shows that most of the statistics are still able to distinguish the models excluding πW and ηmax. Figure 7.7 shows that the ABC MCMC algorithm can be used to estimate the posterior distribution of τ given a summary of observed data without correcting for ascertainment. As a result, it appears practical to construct a hypothesis test in the presence of ascertainment in the following way:
Algorithm 5 (Test IV).
1. Calculate the set of observed summary statistics Sobs and Fst = Fst1 as described in the text.
2. Estimate the posterior distribution of τ given Fst1, p(τ |Fst1) by ABC MCMC.
3. For k = 1, . . . , Nsim :
4. Take the kth simulated draws of τ from p(τ |Fst1) and simulate data under the isolation model under the same ascertainment scheme as the observed data.
5. Estimate the set of summary statistics Sk= {Sk1, . . . , Skm}.
6. For each statistic separately, test the hypothesis H0i : Sobsi = ¯Si H1i : Sobsi 6= ¯Si, where ¯Si =PNsim
k=1 Ski for i = 1, . . . , m.
7. Test the global null hypothesis, H0 =Tm i=1H0i.
1000 SNPs were simulated from the isolation model with two subpopulations, each of sample size 10 which diverged at time τ . The models were tested using Test IV for a range of τ values. This was repeated 100 times and the results are presented in figure 7.9(a). As in the case of non-ascertained data, the type I error is controlled. Figure 7.9(b) shows the power of Test IV, shown by the blue bars, for a range of migration rates compared to the power of Test III with non-ascertained data, shown by the green bars. As the migration rate decreases, this test shows high power in distinguishing these models, whereas as the migration rate increases, testing ascertained data through Test IV is more powerful that testing non-ascertained data through Test III. This apparent increase in power may be consequence of the choices of h and σ2 in the ABC MCMC algorithm. Recall that, at iteration i, a value τ∗ ∼ N (τi−1, σ2) and corresponding Fst = Fsim are accepted if
|Fsim− Fobs| /h < 1/2. As in any MCMC algorithm, the value of σ2 ensures that one is neither in the situation where each proposed value of τ is accepted or each proposed value is rejected. Likewise, the value of h controls the acceptance rate of Fst values.
Given a proposed value, h provides a balance between accepting Fsim too often or too infrequently. Fixing h (or similarly σ2) too high causes difficulty in rejecting t∗ when Fobs is much smaller than h. For instance, suppose h = 0.1 and Fst = 0.01. As the migration rate increases, Fst decreases and so the choice of in parameters becomes more crucial. The green bars in figure 7.9(b) show the power of Test III using data with no ascertainment.
Data was tested from the migration model with m = 0.05 and no ascertainment using the ABC MCMC algorithm to estimate τ . Repeating this 50 times, then 25 out of the 50 were rejected giving an interval for the probability of rejecting the null hypothesis of (0.36, 0.64) which seems more consistent was the ascertainment result in figure 7.9(b) (albeit this interval still lies slightly below the blue line when m = 0.05).
Another explanation of the increase in power may be due to the ascertainment process selecting SNPs with an intermediate allele frequency, which may be more informative in distinguishing between these two models than, for example, loci with a small minor allele
CHAPTER 7. EXTENSIONS TO THE HYPOTHESIS TEST 148
ττ Prob of rejecting H0 00.250.50.751
0.005 0.007 0.01 0.03 0.05 0.07 0.1 0.3 0.5 0.7
(a)
m Prob of rejecting H0 00.250.50.751
5e−06 5e−05 5e−04 0.005 0.05 0.1
ascertained data non ascertained data
(b)
Figure 7.9: Type I error rate (a) and power (b) of Test IV.
frequency. This can be explained by observing that SNPs with a low allele frequency are more likely to have a mutation on a terminal branch whereas mutations that occur on internal branches may be more revealing about the evolutionary history of the sample.