5.3 Bayesian approaches to parameter estimation
5.3.1 Approximate Bayesian Computation
Approximate Bayesian Computation (ABC) has been developed over the last 10 years in population genetics in order to estimate parameters of interest when the likelihood cannot be estimated or is computationally intractable. Given a sample of SNP data with an unknown genealogy then the likelihood function requires integrating over all possible genealogies. Figure 1.3 shows that in a sample of size 6 from a neutral model, there are 2700 possible branching structures. For parameter φ to be estimate, this method computes a set of summary statistics denoted by Sobsfrom observed data and then, loosely speaking, simulates data under the proposed model for a range of values of φ computing Ssim and accepts values of φ when η(Sobs, Ssim) ≈ 0 for some distance measure η(·, ·) and so replacing the full likelihood function p(x|φ) by p(S(x)|φ) for observed data x.
CHAPTER 5. ESTIMATING POPULATION PARAMETERS 86 More precisely, Robert et al. (2011) provide the underlining ABC procedure. Given ob-served data x ∈ X , summary statistics S and distance measure η, then samples {φ1, . . . , φm} are simulated from
f (φ, S(z)|x) ∝ π(φ)f (S(z)|φ)IAη,x(z),
where
Aη,x = {z ∈ X |η{S(z), S(x)} ≤ },
and
IAη,x(z) =
1, if z ∈ Aη,x ; 0, otherwise,
is an indicator function and chosen to be ’close’ to zero. The performance of this method depends on the choice of η and S. Stephens (2007) suggests that using statistics that are directly affected by the parameter of interest will increase the efficiency of this method.
However, it is useful to consider not only how informative a statistic is in parameter estim-ation but also how informative the statistic is given the set of statistics already included in the analysis. Joyce and Marjoram (2008) assess the question: given statistics S1, . . . , Sk−1, then how beneficial will an additional statistic Sk be? They compare the log likelihood function l{S1, . . . , Sk−1|φ} to l{S1, . . . , Sk−1, Sk|φ}, the difference being the latter function contains the additional term l{Sk|S1, . . . , Sk−1, φ}. They attached a score δk to each new statistic Sk given S1, . . . , Sk−1 and φ. Once a score falls below a pre-specified threshold then the corresponding statistic is not included in the inference. Their methodology was to find a set of approximately sufficient statistics. For example, if S1, . . . , Sk−1 are sufficient statistics then l(Sk|S1, . . . , Sk−1, φ) = l(Sk|S1, . . . , Sk−1) and so Sk adds no additional in-formation about φ. Barnes et al. (2011) take an inin-formation theory approach to finding the set of statistics that minimise the consequent loss of information that results from in-corporating summary statistics into any inference as opposed to using the full data. Given
a larger set S of statistics they devise an algorithm to find the minimal set U ⊂ S such that S contains no more information about φ given U . In order to assess the performance of the set of statistics S, Nunes and Balding (2010) implemented the square root of the sum of squared errors, RSSE;
RSSE = 1 na
ns
X
i=1
Ii|φi− φobs|2
1/2
,
where nsis the number of simulations in the ABC algorithm, nais the number of accepted draws and
Ii =
1, if φi is an accepted draw;
0, otherwise.
For each subset U ⊆ S, the ABC algorithm is repeated no times and RSSE computed with the intention to find the set U∗ which minimises
MRSSE =
no
X
j=1
RSSEj.
With attention given to methods of selecting the optimal set of statistics, the focus now moves to describing the evolution of ABC algorithms. Beaumont et al. (2002) proposed a method of inference based on summary statistics as an extension to a method described by Tavar´e et al. (1997) who were interested in estimating the time to the most recent common ancestor of a sample conditioning on the observed number of segregating sites Sn. They specified priors πN and πµ on the population size N and mutation rate µ per site per generation. In a genealogy of length L, Sn∼ P oi(12Lθ). The algorithm used was:
1. Simulate N and µ from πN and πµ.
2. Simulate data under the standard coalescent model to find coalescent waiting times {W1, . . . , Wn}.
3. Find the time to the most recent common ancestor TM RCA and the total length of the genealogy Ttotal using {W1, . . . , Wn}.
CHAPTER 5. ESTIMATING POPULATION PARAMETERS 88 4. Keep TM RCA, N and µ with probability
u ∝ P r(Sn= k|Ttotal= L)
= 1
k!e−Lθ2 Lθ 2
k
.
This process simulates samples {TM RCA1, . . . , TM RCAnsim} from distribution of TM RCA, given N and µ|Sn= k and an estimate of TM RCAis ˆTM RCA= n1
sim
Pnsim
i=1 TM RCAi. Pritchard et al. (1999) studied an ancestral population of size NAthat underwent a period of exponential expansion at rate r that began at time t0. They used three summary statistics, namely the average heterozygosity ¯H, the number of distinct haplotypes, n, and the mean of the variance in repeat numbers (using microsatellite data), ¯V . The rejection algorithm implemented was:
1. Calculate observed statistics, s = {n, ¯H, ¯V }.
2. Simulate φ0 from the prior distribution πφwith φ0= {NA0, t00 and r0}.
3. Given φ0, simulate a genealogy under the required model and then microsatellite data.
4. Computed the statistics s0 = {n0, ¯H0, ¯V0} on the simulated data.
5. If |s0i− si| < δ for all i = 1, 2 and 3, then accept φ0.
Again, this process produces draws from the joint distribution of the parameters of interest φ given the observed summary statistics s. Beaumont et al. (2002) extended this method by following the first four steps to produce draws (φi, si) for i = 1, · · · , ns and weighting draw φi by |si− s|. They assume that φi can be modelled by
φi = α + (si− s)Tβ + ei for i = 1, · · · , ns.
for some coefficients α and β to be estimated and independent errors ei Since E(φ|S = s) = α, the authors write
φ∗ = φ − (si− s) ˆβ
as a random sample from p(φ|S = s). Although the relationship between {φi} and {si} may not be linear, the authors assume that the relationship is linear locally to s. They estimate (α, β) by minimising
ns
X
i=1
(φi− α − (si− s)β)2Kδ(|si− s|)
for band–width δ (synonymous with the last step of the above algorithm) and kernel function Kδ.
Fearnhead and Prangle (2012) describe ABC algorithms based on rejection and MCMC sampling methods. If interest lies in estimating a parameter φ given observed data xobs, the Metropolis-Hastings algorithm, detailed by Gelman et al. (2004), aims to make draws φ1, . . . , φNsim from the target distribution p(φ|xobs). Given a prior distribution π(φ), the algorithm begins by drawing φ0 from π(φ) and, for i = 1, . . . , Nsim, a draw φ0 is made from a proposal distribution q(φ0|φi−1) and is accepted with probability min(α, 1) where
α = p(φ0|xobs)/q(φ0|φi−1) p(φi−1|xobs)/q(φi−1|φ0).
The algorithm described by Fearnhead and Prangle requires specification of statistics S(x) with sobs = S(xobs) and a density K(x). It is initiated by simulating φ0 from π(φ) and calculating sobs, and for i = 1, . . . , nsim,
1. Simulate φ0 from proposal distribution q(φ0|φi−1).
2. Simulate data xsim, from the required model, and calculate S(xsim) = ssim.
CHAPTER 5. ESTIMATING POPULATION PARAMETERS 90 3. With probability
α = min
1,K{(ssim− sobs)/h}
K{(si−1− sobs)/h}
π(φ0)g(φi−1|φ0) π(φi−1)g(φ0|φi−1)
set φi = φ0, otherwise set φi = φi−1, for some pre-specified bandwidth h > 0.
This algorithm may be used to estimate the population divergence time τ by setting
• S = Fst(=F say),
• a N (τi−1, σ2) proposal distribution,
• a U ni(0.00001, 0.7) prior distribution for τ , and,
• K(x) =
1, if |x| ≤ 0.5;
0, otherwise.
Therefore,
α = min
1, K{(Fsim− Fobs)/h}π(τ )g(τi−1|τ ) K{(Fi−1− Fobs)/h}π(τi−1)g(τ |τi−1)
and
K{(Fsim− Fobs)/h} =
1, if |(Fsim− Fobs)/h| < 12; 0, otherwise.
Necessarily, K{(Fi−1− Fobs)/h} = 1 and
π(τ ) π(τi−1) =
1, if τ ∈ (0.00001, 0.7);
0, otherwise.
Since the proposal distribution is normal, g(τi−1|τ ) = g(τ |τi−1) and α = 1 or 0. The ABC MCMC algorithm details how to estimate τ .
Algorithm 1 (ABC MCMC).
For i = 1, . . . , Nsim :
1. Simulate τ from N (τi−1, σ2).
2. Simulate data using τ and calculate Fst = Fsim.
3. If α = 1 then set τi= τ and Fi= Fsim. If α = 0 set τi = τi−1 and Fi= Fi−1.
This algorithm produces a set of draws τ1, τ2, . . . , τNsim. Figure 5.8 shows density estimates of simulated τ ’s for a range of true τ values using this algorithm. On each plot, the red dot shows the true value of τ . For most values of τ , the density peaks around the true values with a few exceptions. By taking the lower and upper 2.5% of each distribution, figure 5.9 shows 95% credible bands for the range of τ values. This algorithm estimates τ well; the bands contain of the line τ = ˆτ and are narrower than those produced by the Fst-based estimator (figure 5.4).