2.4 Estimation and inference
2.4.1 Parameter estimation
To estimate the parameters of an HMM, including $J_m$ at each marker, as well as
$\alpha_{ml_n}$ and $\theta_{ml_n}(h)$ at each hidden cluster and each marker, we use the
EM algorithm, also known as the Baum-Welch training algorithm, to find maximum-likelihood
estimates of these model parameters by employing the forward and backward probabilities.
Forward-backward algorithm
First, we specify the forward probability, which is the joint probability of the observed
sequence up to and including $g_{m-1}$ and the underlying state $s_{m-1}$ at marker $m-1$
(shown in the left part of Figure 2.4), as $f_k(m-1) = p(g_1, \ldots, g_{m-1}, s_{m-1} = k)$.
This probability is obtained from the forward algorithm. The backward probability is
defined as $b_l(m) = p(g_{m+1}, \ldots, g_M \mid s_m = l)$ (in the right part of Figure 2.4).
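The posterior probability of a transition from state $k$ at marker $m-1$ to state $l$ at
marker $m$, given the genotype sequence $g$, is:
$$
p(s_{m-1} = k, s_m = l \mid g_1, \ldots, g_M) = \frac{f_k(m-1)\, a_{mkl}\, e_{ml}(g_m)\, b_l(m)}{p(g)} \tag{2.6}
$$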
where $p(g)$ is the full probability of the sequence, which can be computed efficiently by
the forward algorithm. Similarly, the posterior probability of state $l$ at marker $m$
given the genotype sequence $g$ is $f_l(m)\, b_l(m) / p(g)$.
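As a minimal sketch of the quantities above, the following Python code runs the forward and backward recursions for a small generic HMM and evaluates the posterior in equation (2.6). The function names, the list-based layout (`trans[m][k][l]` for $a_{mkl}$, `emit[m][l]` for $e_{ml}(g_m)$), the initial distribution `init` and the toy numbers are all assumptions made for illustration. Markers are indexed from 0, and no scaling is applied, so long sequences would underflow.

```python
def forward(init, trans, emit):
    """Return f with f[m][k] = p(g_0..g_m, s_m = k)."""
    n_states = len(init)
    f = [[init[k] * emit[0][k] for k in range(n_states)]]
    for m in range(1, len(emit)):
        f.append([
            emit[m][l] * sum(f[m - 1][k] * trans[m][k][l] for k in range(n_states))
            for l in range(n_states)
        ])
    return f


def backward(trans, emit):
    """Return b with b[m][k] = p(g_{m+1}..g_{M-1} | s_m = k)."""
    n_markers, n_states = len(emit), len(emit[0])
    b = [[1.0] * n_states for _ in range(n_markers)]
    for m in range(n_markers - 2, -1, -1):
        b[m] = [
            sum(trans[m + 1][k][l] * emit[m + 1][l] * b[m + 1][l] for l in range(n_states))
            for k in range(n_states)
        ]
    return b


def transition_posterior(f, b, trans, emit, m, k, l):
    """Equation (2.6): p(s_{m-1} = k, s_m = l | g)."""
    p_g = sum(f[-1])  # full probability of the sequence
    return f[m - 1][k] * trans[m][k][l] * emit[m][l] * b[m][l] / p_g


def state_posterior(f, b, m, l):
    """p(s_m = l | g) = f_l(m) b_l(m) / p(g)."""
    return f[m][l] * b[m][l] / sum(f[-1])


# Toy example (assumed values): 2 states, 3 markers.
init = [0.5, 0.5]
step = [[0.9, 0.1], [0.2, 0.8]]
trans = [None, step, step]                    # trans[0] is unused
emit = [[0.8, 0.3], [0.4, 0.7], [0.6, 0.1]]   # e_{ml}(g_m) at the observed genotypes

f = forward(init, trans, emit)
b = backward(trans, emit)
pair = [[transition_posterior(f, b, trans, emit, 1, k, l) for l in range(2)]
        for k in range(2)]
```

Summing `pair` over both indices gives 1, and summing it over `k` recovers `state_posterior(f, b, 1, l)`, which is a quick consistency check on the two recursions.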
Expectation and maximisation
Based on the posterior probability decoding method, the expected count of transitions
from clusters $k = \{k_1, \ldots, k_N\}$ to $l = \{l_1, \ldots, l_N\}$ at marker $m$, given
the observed genotype sequences, is obtained by running the forward-backward algorithm
on each sequence and summing the expected counts over all observed sequences:
$$
\sum_i \frac{1}{p(g^i)}\, f_k^i(m-1)\, a_{mkl}\, e_{ml}(g_m^i)\, b_l^i(m) \tag{2.7}
$$
where $i$ is the index of the sequence; $f_k^i(m-1)$ and $b_l^i(m)$ are the forward and
backward probabilities, respectively; $a_{mkl}$ and $e_{ml}(g_m^i)$ are the transition and
emission probabilities of the polyploid HMM, respectively, which can be obtained from
equations (2.2) and (2.4).
Similarly, the expected count of emissions at marker $m$ given $l = \{l_1, \ldots, l_N\}$ is
$$
\sum_i \frac{1}{p(g^i)}\, f_l^i(m)\, b_l^i(m) \tag{2.8}
$$
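Equations (2.7) and (2.8) can be sketched as follows, assuming the forward and backward tables of each sequence have already been computed. The dictionary layout (`f`, `b`, `emit`), the function names and the toy numbers are hypothetical; the two sequences in the example are identical copies so that the totals are easy to check by hand.

```python
def expected_transition_counts(seqs, trans_m, m):
    """Equation (2.7): counts[k][l] at marker m, summed over sequences i."""
    n = len(trans_m)
    counts = [[0.0] * n for _ in range(n)]
    for s in seqs:
        p_g = sum(s["f"][-1])  # p(g^i), read off the last forward column
        for k in range(n):
            for l in range(n):
                counts[k][l] += (s["f"][m - 1][k] * trans_m[k][l]
                                 * s["emit"][m][l] * s["b"][m][l]) / p_g
    return counts


def expected_emission_counts(seqs, m):
    """Equation (2.8): counts[l] at marker m, summed over sequences i."""
    n = len(seqs[0]["f"][m])
    counts = [0.0] * n
    for s in seqs:
        p_g = sum(s["f"][-1])
        for l in range(n):
            counts[l] += s["f"][m][l] * s["b"][m][l] / p_g
    return counts


# Toy tables for a 2-state, 2-marker HMM (assumed values, worked out by hand).
trans_1 = [[0.9, 0.1], [0.2, 0.8]]
one_seq = {
    "f": [[0.4, 0.15], [0.156, 0.112]],
    "b": [[0.43, 0.64], [1.0, 1.0]],
    "emit": [[0.8, 0.3], [0.4, 0.7]],
}
seqs = [one_seq, one_seq]

a_counts = expected_transition_counts(seqs, trans_1, 1)
e_counts = expected_emission_counts(seqs, 0)
```

Each sequence contributes a total of one expected transition and one expected emission per marker, so both tables sum to the number of sequences (here 2).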
The counts of transitions and emissions in a polyploid HMM are then decomposed
into counts for the haploid-based parameters, $C_{J_m}$, $C_{\alpha_{ml_n}}$ and
$C_{\theta_{ml_n}(h)}$, according to the permutation probabilities in equations (2.2) and
(2.4). To avoid probability estimates of 0 (overfitting), a small number (a pseudo count)
is added to each of the counts. The maximum-likelihood estimates for $J_m$, $\alpha_{ml_n}$
and $\theta_{ml_n}(h)$, with pseudo counts acting as priors from Dirichlet distributions
(that is, maximum a posteriori estimates with Dirichlet priors), are given by
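$$
\begin{aligned}
\hat{J}_m &= \frac{C_{J_m} + u_J \cdot 0.5}{(C_{J_m} + u_J \cdot 0.5) + (C_{1-J_m} + u_J \cdot 0.5)} \\
\hat{\alpha}_{ml_n} &= \frac{C_{\alpha_{ml_n}} + u_\alpha \cdot 1/z}{\sum_{l_n=1}^{z} \left(C_{\alpha_{ml_n}} + u_\alpha \cdot 1/z\right)} \\
\hat{\theta}_{ml_n}(h) &= \frac{C_{\theta_{ml_n}(h)} + u_\theta \cdot 1/H}{\sum_{h=1}^{H} \left(C_{\theta_{ml_n}(h)} + u_\theta \cdot 1/H\right)}
\end{aligned}
\tag{2.9}
$$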
where the terms $u_J \cdot 0.5$, $u_\alpha \cdot 1/z$ and $u_\theta \cdot 1/H$ are the pseudo
counts for estimating the parameters $J_m$, $\alpha_{ml_n}$ and $\theta_{ml_n}(h)$,
respectively. For this stage of model fitting, we set $u_J = u_\alpha = u_\theta = 0.01$ for
the first application and $u_J = u_\alpha = u_\theta = 0.1$ for the second and third
applications. The log-likelihood of the observed genotype sequences is calculated
through the forward algorithm using these parameter estimates.
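The update in equation (2.9) amounts to adding a pseudo count to each expected count and renormalising. A minimal sketch, with hypothetical function names and made-up counts:

```python
def estimate_jump(c_jump, c_no_jump, u_j):
    """J_m update: two outcomes, each with pseudo count u_j * 0.5."""
    num = c_jump + u_j * 0.5
    return num / (num + c_no_jump + u_j * 0.5)


def estimate_probs(counts, u):
    """alpha or theta update: each of the n categories gets pseudo count u / n."""
    pseudo = u / len(counts)
    total = sum(c + pseudo for c in counts)
    return [(c + pseudo) / total for c in counts]


# Made-up expected counts for z = 3 clusters; the second cluster was never visited.
alpha_hat = estimate_probs([3.0, 0.0, 1.0], u=0.1)
j_hat = estimate_jump(c_jump=0.2, c_no_jump=9.8, u_j=0.1)
```

The unvisited cluster still receives a small positive probability, which is the purpose of the pseudo count, and the estimates in `alpha_hat` sum to 1.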
Computation of the training process
The training process might converge to a local maximum of the likelihood function,
which is a typical problem for the EM algorithm. A likelihood usually has many
local maxima, and different initial values for the EM algorithm can lead to different
local maxima (and hence different parameter estimates). One way to deal with this
problem is to train the model a fixed number of times, each time with different
initial values. The parameter estimates can then be obtained by one of the following
approaches: (1) selecting the estimates from the repetition that gives the highest
log-likelihood given the observed genotypes; (2) averaging the estimates over all the
repetitions; (3) combining the results across repetitions using a sampling algorithm
(see next section). In this thesis, we employed (1) and (3) with 10 repetitions of the
training algorithm, each with different starting values and 25 iterations. In our
experience, 25 iterations per training run give reasonably good convergence.
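Approach (1) can be sketched generically. Because the polyploid HMM is too large for a short example, the code below swaps in a toy EM for a mixture of two biased coins; only the restart-and-select logic mirrors the text. All names, the data and the seed are assumptions made for illustration.

```python
import math
import random


def coin_mix_loglik(data, n_tosses, w, p, q):
    """Log-likelihood of head counts under a two-coin mixture (binomial constants dropped)."""
    return sum(
        math.log(w * p**h * (1 - p)**(n_tosses - h)
                 + (1 - w) * q**h * (1 - q)**(n_tosses - h))
        for h in data
    )


def em_two_coins(data, n_tosses, n_iter, rng):
    """One EM run from random start values; returns ((w, p, q), log-likelihood)."""
    w, p, q = 0.5, rng.uniform(0.05, 0.95), rng.uniform(0.05, 0.95)
    for _ in range(n_iter):
        # E-step: posterior probability that each observation came from coin 1.
        resp = []
        for h in data:
            a = w * p**h * (1 - p)**(n_tosses - h)
            b = (1 - w) * q**h * (1 - q)**(n_tosses - h)
            resp.append(a / (a + b))
        # M-step: re-estimate the mixing weight and both head probabilities.
        r_sum = sum(resp)
        w = r_sum / len(data)
        p = sum(r * h for r, h in zip(resp, data)) / (r_sum * n_tosses)
        q = sum((1 - r) * h for r, h in zip(resp, data)) / ((len(data) - r_sum) * n_tosses)
    return (w, p, q), coin_mix_loglik(data, n_tosses, w, p, q)


def best_of_restarts(data, n_tosses, n_restarts=10, n_iter=25, seed=0):
    """Run EM n_restarts times and keep the run with the highest log-likelihood."""
    rng = random.Random(seed)
    runs = [em_two_coins(data, n_tosses, n_iter, rng) for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[1]), runs


data = [9, 8, 2, 1, 7, 3, 8, 2, 9, 1]   # heads out of 10 tosses (made up)
(best_params, best_loglik), runs = best_of_restarts(data, n_tosses=10)
```

Keeping the full list `runs` alongside the best run leaves room for approaches (2) and (3), which need the estimates from every repetition rather than only the winner.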
Prior
For each initialisation of the EM algorithm, we use Dirichlet priors on all of our
parameters. Namely, for scalars $u_\theta > 0$ and $u_\alpha > 0$, we let
$\theta_{ml\cdot} \sim \mathrm{Dirichlet}(u_\theta m_\theta)$, where $m_\theta$ is the uniform
vector with each element $1/H$, and $\alpha_{m\cdot} \sim \mathrm{Dirichlet}(u_\alpha m_\alpha)$,
where $m_\alpha$ is the uniform vector with each element $1/z$. The $u$ parameters measure
the strength of the prior information, so that large $u$ implies sampling more tightly
around $m$.
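A draw from $\mathrm{Dirichlet}(u\,m)$ can be obtained by normalising independent Gamma variates, which makes the role of $u$ easy to see. The sketch below uses only the Python standard library; the function names, the seed and the values of $u$ are assumptions chosen for illustration, not the settings used in this thesis.

```python
import random


def sample_dirichlet(u, n, rng):
    """Draw from Dirichlet(u * m), where m is the uniform vector with elements 1/n."""
    while True:
        draws = [rng.gammavariate(u / n, 1.0) for _ in range(n)]
        total = sum(draws)
        if total > 0.0:  # guard against underflow when u / n is tiny
            return [d / total for d in draws]


def mean_abs_deviation(u, n, rng, n_draws=200):
    """Average distance of sampled elements from the uniform value 1/n."""
    dev = 0.0
    for _ in range(n_draws):
        dev += sum(abs(x - 1.0 / n) for x in sample_dirichlet(u, n, rng)) / n
    return dev / n_draws


rng = random.Random(1)
alpha_init = sample_dirichlet(u=10.0, n=4, rng=rng)   # e.g. start values for z = 4 clusters
loose = mean_abs_deviation(u=1.0, n=4, rng=rng)
tight = mean_abs_deviation(u=1000.0, n=4, rng=rng)
```

With a large `u` the sampled vectors should sit close to the uniform vector, so `tight` is expected to be much smaller than `loose`.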
Although $J_m$ is treated as an unknown parameter and is inferred by our training
algorithm, $J_m$ can be expressed as a compound parameter, $1 - e^{-r_m d_m}$, where $d_m$
is the physical distance and $r_m$ is the jump rate per bp between markers $m-1$ and $m$.
In general, $r_m$ is thought of as being related to the recombination rate. However,
Scheet and Stephens (2006) suggested that generally there might be little correlation
between the actual recombination rate and estimates of $r_m$. Here, $J_m$ is considered as
a single parameter in the training algorithm. Nevertheless, for each initialisation of
the EM algorithm, we set $J_m \sim \mathrm{Beta}(u_J(1 - e^{-d_m r}),\, u_J e^{-d_m r})$, where
$u_J > 0$ and $d_m$ (bp) is the distance between markers $m-1$ and $m$. We take
$r = 10^{-8}$ per base pair in the population, reflecting the background probability of