5.3 Bayesian approaches to parameter estimation
5.3.2 ABC packages
There is a battery of summary statistics that are extensively used in population genetics.
Several packages use ABC to infer aspects of demographic history, for example, DIY ABC (Cornuet et al. (2008)), PopABC (Lopes et al. (2009)) and ABCtoolbox (Wegmann et al. (2010)). The statistics employed in these packages are discussed in chapter 6.
Although each program aims to make inferences about population parameters via summary statistics, the methods adopted in each are different.
Cornuet et al. (2008) model the demographic history of a sample by first specifying the population size, population divergence times (backwards in time) and population admixture (backwards in time, a population splits into two other populations in the sample).
CHAPTER 5. ESTIMATING POPULATION PARAMETERS 92
Figure 5.8: Density plots of simulated τ's for a range of true τ values (red dot) using the ABC MCMC algorithm.
Data are then simulated under this pre-specified history and a set of summary statistics computed. Simulated data set i is compared to the observed data using the distance measure
\[ d_i = \sqrt{\sum_{j=1}^{m} \frac{(s_{ij} - s_{\mathrm{obs},j})^2}{\mathrm{Var}_j}}, \]
where $m$ is the number of statistics, $\mathrm{Var}_j$ is the variance of the jth statistic across simulated data sets, $s_{ij}$ is the value of statistic j in simulation i and $s_{\mathrm{obs},j}$ is the observed value of statistic j.
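As a minimal sketch of this kind of comparison, the following Python function computes a distance for each simulated data set, assuming the measure is the Euclidean distance with each statistic scaled by its variance across simulations. The function name and the numerical values are hypothetical and are not taken from DIY ABC.

```python
import math
import statistics

def normalised_distance(sim_stats, obs_stats):
    """Return d_i for every simulated data set i.

    sim_stats: list of simulations, each a list of m summary statistics s_ij.
    obs_stats: list of the m observed summary statistics s_obs_j.
    Assumption: squared differences are scaled by Var_j, the sample
    variance of statistic j across the simulated data sets.
    """
    m = len(obs_stats)
    variances = [statistics.variance([sim[j] for sim in sim_stats])
                 for j in range(m)]
    distances = []
    for sim in sim_stats:
        total = sum((sim[j] - obs_stats[j]) ** 2 / variances[j]
                    for j in range(m))
        distances.append(math.sqrt(total))
    return distances

# Hypothetical values: three simulations of two statistics.
sims = [[0.10, 2.0], [0.20, 4.0], [0.30, 6.0]]
obs = [0.20, 4.0]
d = normalised_distance(sims, obs)
```

Scaling by the variance puts statistics measured on very different scales on a comparable footing, so that no single statistic dominates the distance.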
This program then uses the algorithm given by Beaumont et al. (2002) to estimate the parameters.
Figure 5.9: 95% credible bands for τ, and the line of equality. (Axes: true τ; credible band for τ.)
Lopes et al. (2009) fit the isolation with migration model presented originally by Nielsen and Wakeley (2001) and described in section 4.2. Their program aims to estimate the tree topology (treated as a categorical variable with several possible topologies), population size, population split times, and migration, mutation and recombination rates, by implementing a rejection-based algorithm.
Once a model (to simulate from) and a set of summary statistics have been specified, Wegmann et al. (2010) use partial least squares (PLS) to form linear combinations of the summary statistics in order to find an optimal set of statistics. This method was motivated by Joyce and Marjoram (2008), who showed that although it may be beneficial to include as many summary statistics as are thought to be informative about the parameters of interest, adding too many contributes more noise. A further discussion of the matter is given in section 6.2.4.
PLS regression has two main steps, as described by Boulesteix and Strimmer (2007). The first is a dimension reduction step. It is assumed that there are q continuous response variables $Y_1, \ldots, Y_q$ and p continuous explanatory variables $X_1, \ldots, X_p$, with observed data $y_i = \{y_{i1}, \ldots, y_{iq}\}$ and $x_i = \{x_{i1}, \ldots, x_{ip}\}$ for $i = 1, \ldots, n_{sim}$. Wegmann et al. (2010) consider the summary statistics as the explanatory variables and the parameters of interest as the response variables. The general underlying model of PLS is to write
\[ X = T P^{T} + E \quad \text{and} \quad Y = T Q^{T} + F, \]
where $T$ is a matrix of latent components, $P$ and $Q$ are matrices of dimension p × c and q × c, respectively, and $E$ and $F$ are error matrices. As with principal components analysis, linear combinations of the columns of the matrix $X$, of dimension n × p, can be found that are independent and contain most of the variability in the data. For example, in the notation of Boulesteix and Strimmer (2007),
\[ T_j = w_{1j} X_1 + \ldots + w_{pj} X_p, \quad \text{for } j = 1, \ldots, c, \]
where $T_1, \ldots, T_c$ are the components, c is the chosen number of components that are thought to explain most of the variation in the data, and the columns of the matrix $W = \{w_{ij}\}$ of dimension p × c are such that the latent components explain the variation in the explanatory and response data. Therefore,
\[ T = XW, \quad \text{and hence} \quad X = T W^{T}. \]
The second stage is to model the data. As in the case of multiple linear regression, the matrix $Q^{T}$ can be estimated by $\hat{Q}^{T} = (T^{T} T)^{-1} T^{T} Y$. In particular, $Y$ can be modelled by
\[ Y = T Q^{T} + F = X W Q^{T} + F, \]
and so a least squares estimate of the matrix of regression coefficients is $W \hat{Q}^{T} = W (T^{T} T)^{-1} T^{T} Y$. This approach was implemented to find a minimal number of independent statistics, with the authors also suggesting that this procedure recovers an optimal set of summary statistics.
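The two stages can be illustrated with a deliberately small Python sketch. It assumes a single response variable (q = 1) and a single latent component (c = 1), and it assumes the weight vector is taken proportional to $X^{T} y$ on centred data, which is the usual first PLS weight vector. The function name and the data are hypothetical; this is not the ABCtoolbox implementation.

```python
import math

def first_pls_component(X, y):
    """One-component PLS for a single response, on centred data.

    Returns the weight vector w, the latent component t = Xw,
    the estimate q = (t't)^{-1} t'y, the coefficients b = w q,
    and the centred response.
    """
    n, p = len(X), len(X[0])
    x_means = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Dimension reduction: weights proportional to X'y, scaled to unit length.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regression step: least squares of the response on the component.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
    b = [wj * q for wj in w]
    return w, t, q, b, yc

# Hypothetical data: five simulations, two summary statistics, one parameter.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 6.0]]
y = [1.1, 1.9, 3.2, 3.8, 5.1]
w, t, q, b, yc = first_pls_component(X, y)
```

With more than one component or response, the scalar division becomes the matrix expression $(T^{T} T)^{-1} T^{T} Y$, and further weight vectors are typically extracted after deflating $X$.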
5.4 Model selection
The probability of a model given data provides the natural Bayesian tool to assess which model provides the better fit to the data. To illustrate, data were simulated from the isolation model with two subpopulations diverging at time 0.7. The ABC MCMC algorithm was used to estimate p(τ | Fst), as illustrated in figure 5.8, but also p(m | Fst), the posterior distribution of the migration rate between the two subpopulations given Fst under the (misspecified) migration model. Figure 5.10 shows the posterior density estimate of m given Fst with the red dot showing the posterior mean value.
Figure 5.10: Estimate of p(m | F̂st) using the ABC MCMC algorithm. (Axes: m; density.)
Robert et al. (2011) provide an algorithm to calculate the probability of a model given the data, or summaries of the data. In this application, let M1 denote the isolation model and M2 denote the migration model. The algorithm produces a vector m = (m1, . . . , mNsim).
At the ith step, $m^{*}$ is generated from π(M), the prior distribution on the models, for example p(M = M1) = p(M = M2) = 0.5, and, using a draw $\phi_{m^{*}} \sim \pi(\phi_{m^{*}})$, data are simulated under model $m^{*}$ and the summary statistics $S_{sim}$ computed. These steps are repeated until the distance between $S_{obs}$ and $S_{sim}$ is less than a tolerance ε, at which point they set $m_i = m^{*}$. They estimate the probability of model j, for j = 1, 2, given $S_{obs}$ as
\[ \Pr\{M_j \mid S_{obs}\} = \frac{1}{N_{sim}} \sum_{i=1}^{N_{sim}} I_{m_i = j}. \]
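A rejection scheme of this kind might be sketched in Python as follows. The two toy models, their priors, the tolerance and the observed summary are all hypothetical stand-ins chosen to keep the example short; they are not the isolation and migration models, and the code is not taken from Robert et al. (2011).

```python
import random

def abc_model_choice(s_obs, simulators, priors, eps, n_sim, rng):
    """Estimate Pr{M_j | S_obs} as the fraction of accepted model indices.

    simulators: dict mapping a model label to a function that draws a
    parameter from its prior and returns a simulated summary statistic.
    priors: dict mapping a model label to its prior probability.
    """
    models = list(simulators)
    weights = [priors[k] for k in models]
    accepted = []
    while len(accepted) < n_sim:
        m_star = rng.choices(models, weights=weights)[0]  # m* ~ pi(M)
        s_sim = simulators[m_star](rng)                   # simulate under m*
        if abs(s_sim - s_obs) < eps:                      # accept if close
            accepted.append(m_star)
    return {k: accepted.count(k) / n_sim for k in models}

# Hypothetical toy models: a normal summary whose mean has a different prior.
def sim_model_1(rng):
    phi = rng.uniform(0.0, 1.0)
    return rng.gauss(phi, 1.0)

def sim_model_2(rng):
    phi = rng.uniform(3.0, 4.0)
    return rng.gauss(phi, 1.0)

rng = random.Random(0)
probs = abc_model_choice(
    s_obs=0.5,
    simulators={"M1": sim_model_1, "M2": sim_model_2},
    priors={"M1": 0.5, "M2": 0.5},
    eps=0.2,
    n_sim=200,
    rng=rng,
)
```

Because the observed summary sits well inside the range favoured by the first toy model, most accepted indices are expected to be M1. A smaller tolerance gives a closer match to the observed summary at the cost of more rejected simulations.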
Using the algorithm, with $N_{sim} = 1000$ and both models a priori equally likely, the probabilities of the isolation and migration models are estimated to be:
\[ \Pr\{\text{migration} \mid F_{st}\} = 0.24, \qquad \Pr\{\text{isolation} \mid F_{st}\} = 0.76. \]
If M1 and M2 correspond to the isolation and migration models, respectively, then the Bayes factor is
\[ B_{12} = \frac{\Pr\{M_1 \mid S_{obs}\} / p(M = M_1)}{\Pr\{M_2 \mid S_{obs}\} / p(M = M_2)} = 3.2, \]
providing evidence in favour of the isolation model over the migration model.
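The arithmetic can be checked with a short Python sketch that treats the Bayes factor as the posterior odds divided by the prior odds. The function name is hypothetical; the probabilities are the estimates reported above.

```python
def bayes_factor(post_1, post_2, prior_1, prior_2):
    """Bayes factor B12: posterior odds of M1 over M2 divided by prior odds."""
    posterior_odds = post_1 / post_2
    prior_odds = prior_1 / prior_2
    return posterior_odds / prior_odds

# Isolation (M1) against migration (M2) with equal prior probabilities.
b12 = bayes_factor(0.76, 0.24, 0.5, 0.5)
print(round(b12, 1))
```

With equal priors the prior odds equal one, so the Bayes factor reduces to the ratio 0.76 / 0.24. Unequal priors would change the result, which is why the prior odds are divided out explicitly.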
In population genetics, it is often the case that the statistics implemented in ABC (for example, Tajima's D) are not sufficient, which presents problems when estimating the likelihood function p(x|φ), since the observed data x are replaced by a statistic S(x).
In particular, Bayesian model selection methods require the likelihood function to be evaluated as discussed by Robert et al. (2011) and Barnes et al. (2011).
The issue is made explicit by Barnes et al. (2011). The authors define sufficiency in terms of the likelihood function. A statistic S is sufficient if
\[ f(x \mid S(x), \phi) = g(x \mid S(x)), \]
where $f(x \mid S(x), \phi)$ is the likelihood of data x given parameter φ and statistic S(x), and $g(x \mid S(x))$ is the probability of the data given the statistic, independent of φ. For two prospective models, M1 and M2, the posterior probabilities of the models p(Mi | x) for i = 1, 2 given the data are estimated. The model comparison considers the ratio
\[ B_{12} = \frac{p(x \mid M_1)}{p(x \mid M_2)} = \frac{p(M_1 \mid x) / \pi(M_1)}{p(M_2 \mid x) / \pi(M_2)}. \]
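In particular, in the case of k models, the posterior probability of the ith model is
\[ p(M_i \mid x) = \frac{\pi(M_i) \int_{\Theta_{M_i}} p(x \mid \phi_{M_i}) \, \pi(\phi_{M_i}) \, d\phi_{M_i}}{\sum_{j=1}^{k} \pi(M_j) \int_{\Theta_{M_j}} p(x \mid \phi_{M_j}) \, \pi(\phi_{M_j}) \, d\phi_{M_j}}, \]
where π(Mi) is the prior probability that the data are from model Mi with parameters $\phi_{M_i} \in \Theta_{M_i}$. If the observed data x are replaced by simulated data y then, as ε → 0,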
In the context of ABC, where the observed data x are replaced by a summary of the simulated data S(y), Barnes et al. (2011) note that (5.9) becomes
\[ \int_{S(\mathcal{X})} p(S(y) \mid \phi_{M_i}) \, \pi(\phi_{M_i}) \, I_{A_{\eta,x}}(y) \, dS(y). \]
If S is a sufficient statistic, then $p(S(y) \mid \phi_{M_i}) \propto p(y \mid \phi_{M_i})$. However, if S is not sufficient, as is often the case, then Pr(Mi | x) cannot be approximated by
π(Mi) ∫
Therefore, although using non-sufficient statistics to estimate the joint distribution f(φ, z | x) for {z ∈ X | η(S(x), S(z)) < ε} is valid, there are problems in computing Bayes factors. Barnes et al. (2011) argue that the problems with using insufficient statistics in model selection are also reflected in parameter estimation, as evidenced by an example estimating the mean of a N(µ, 1) distribution. They consider four separate statistics to estimate µ, namely the sample mean, the sample variance, the minimum value and the maximum value. They show that using the sample mean produces the most accurate results.
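A comparison in this spirit can be sketched with a simple rejection sampler in Python. The uniform prior on µ, the sample size, the tolerance and the number of accepted draws are all assumptions made for illustration; this is not the setup used by Barnes et al. (2011).

```python
import random
import statistics

def sample_mean(v):
    return sum(v) / len(v)

def sample_variance(v):
    centre = sample_mean(v)
    return sum((a - centre) ** 2 for a in v) / (len(v) - 1)

def abc_rejection(stat, x_obs, eps, n_accept, rng, prior=(-5.0, 5.0)):
    """Rejection ABC for the mean of N(mu, 1) using one summary statistic.

    Assumption: mu has a uniform prior on the interval `prior`.
    """
    s_obs = stat(x_obs)
    accepted = []
    while len(accepted) < n_accept:
        mu = rng.uniform(*prior)
        y = [rng.gauss(mu, 1.0) for _ in x_obs]
        if abs(stat(y) - s_obs) < eps:
            accepted.append(mu)
    return accepted

rng = random.Random(1)
x_obs = [rng.gauss(1.0, 1.0) for _ in range(20)]  # hypothetical observed data
stats = {"mean": sample_mean, "variance": sample_variance,
         "min": min, "max": max}
posterior_mean, posterior_sd = {}, {}
for name, stat in stats.items():
    draws = abc_rejection(stat, x_obs, eps=0.1, n_accept=100, rng=rng)
    posterior_mean[name] = statistics.mean(draws)
    posterior_sd[name] = statistics.stdev(draws)
```

The sample variance carries no information about µ in this model, so the draws accepted with it should remain spread over roughly the whole prior range, whereas those accepted with the sample mean should concentrate near the observed mean.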