
2.4 Estimation and inference

2.4.1 Parameter estimation

To estimate the parameters of the HMM, namely $J_m$ at each marker, and $\alpha_{ml_n}$ and $\theta_{ml_n}(h)$ at each hidden cluster and each marker, we use the EM algorithm, also known as the Baum-Welch training algorithm, to find maximum-likelihood estimates of these model parameters by employing the forward and backward probabilities.

Forward-backward algorithm

First, we specify the forward probability, which is the joint probability of the observed sequence up to and including $g_{m-1}$ and the underlying state $s_{m-1}$ at marker $m-1$ (shown in the left part of Figure 2.4), as $f_k(m-1) = p(g_1, \ldots, g_{m-1}, s_{m-1} = k)$. This probability is obtained from the forward algorithm. The backward probability is defined as $b_l(m) = p(g_{m+1}, \ldots, g_M \mid s_m = l)$ (shown in the right part of Figure 2.4).

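The posterior probability of a transition from state $k$ at marker $m-1$ to state $l$ at marker $m$ given the genotype sequence $g$ is:

$$
p(s_{m-1} = k, s_m = l \mid g_1, \ldots, g_M) = \frac{f_k(m-1)\, a_{mkl}\, e_{ml}(g_m)\, b_l(m)}{p(g)} \tag{2.6}
$$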

where $p(g)$ is the full probability of the sequence, which can be computed efficiently by the forward algorithm. Similarly, the posterior probability of state $l$ at marker $m$ given the genotype sequence $g$ is $f_l(m)\, b_l(m) / p(g)$.
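The forward and backward recursions and the two posteriors described above can be illustrated with a minimal sketch. It assumes a generic HMM with marker-specific transition and emission tables rather than the polyploid model itself; the function names, the toy parameter values and the unscaled arithmetic (which would underflow on long sequences) are all assumptions made for illustration.

```python
def forward_backward(init, trans, emit, obs):
    """Forward and backward probabilities for one observed sequence.

    init[k]        : probability of state k at the first marker
    trans[m][k][l] : probability of moving from state k at marker m-1
                     to state l at marker m (trans[0] is unused)
    emit[m][l][g]  : probability of observing genotype g in state l at marker m
    obs[m]         : observed genotype at marker m (markers are 0-based here)
    """
    M, K = len(obs), len(init)
    f = [[0.0] * K for _ in range(M)]
    b = [[1.0] * K for _ in range(M)]  # backward probability is 1 at the last marker
    for k in range(K):
        f[0][k] = init[k] * emit[0][k][obs[0]]
    for m in range(1, M):
        for l in range(K):
            f[m][l] = emit[m][l][obs[m]] * sum(
                f[m - 1][k] * trans[m][k][l] for k in range(K))
    for m in range(M - 2, -1, -1):
        for k in range(K):
            b[m][k] = sum(
                trans[m + 1][k][l] * emit[m + 1][l][obs[m + 1]] * b[m + 1][l]
                for l in range(K))
    p_g = sum(f[M - 1])  # full probability of the sequence
    return f, b, p_g


def state_posterior(f, b, p_g, m):
    """Posterior of each state l at marker m: f_l(m) b_l(m) / p(g)."""
    return [f[m][l] * b[m][l] / p_g for l in range(len(f[m]))]


def transition_posterior(f, b, p_g, trans, emit, obs, m):
    """Posterior of each transition k -> l between markers m-1 and m (as in 2.6)."""
    K = len(f[m])
    return [[f[m - 1][k] * trans[m][k][l] * emit[m][l][obs[m]] * b[m][l] / p_g
             for l in range(K)] for k in range(K)]


# Toy example (hypothetical values): 2 states, 3 markers, 2 genotype classes.
init = [0.6, 0.4]
trans = [None,
         [[0.9, 0.1], [0.2, 0.8]],
         [[0.7, 0.3], [0.4, 0.6]]]
emit = [[[0.8, 0.2], [0.3, 0.7]],
        [[0.6, 0.4], [0.1, 0.9]],
        [[0.5, 0.5], [0.25, 0.75]]]
obs = [0, 1, 1]

f, b, p_g = forward_backward(init, trans, emit, obs)
print(state_posterior(f, b, p_g, 1))
print(transition_posterior(f, b, p_g, trans, emit, obs, 1))
```

A useful sanity check is that the transition posterior at marker $m$ sums over $k$ to the state posterior at marker $m$, and over $l$ to the state posterior at marker $m-1$.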

Expectation and maximisation

Based on the posterior probability decoding method, the expected counts of transitions from clusters $k = \{k_1, \ldots, k_N\}$ to $l = \{l_1, \ldots, l_N\}$ at marker $m$ given the observed genotype sequences are obtained by using the forward-backward algorithm and summing the expected counts over all observed sequences:

$$
\sum_i \frac{1}{p(g^i)}\, f^i_k(m-1)\, a_{mkl}\, e_{ml}(g^i_m)\, b^i_l(m) \tag{2.7}
$$

where $i$ is the index of the sequence; $f^i_k(m-1)$ and $b^i_l(m)$ are the forward and backward probabilities, respectively; and $a_{mkl}$ and $e_{ml}(g^i_m)$ are the transition and emission probabilities of the polyploid HMM, respectively, which can be obtained from equations (2.2) and (2.4).

Similarly, the expected count of emissions at marker $m$ given $l = \{l_1, \ldots, l_N\}$ is

$$
\sum_i \frac{1}{p(g^i)}\, f^i_l(m)\, b^i_l(m) \tag{2.8}
$$
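The expected counts in (2.7) and (2.8) can be accumulated over sequences as in the following sketch. It again assumes a generic HMM with marker-specific tables in place of the polyploid model; the function names and toy values are hypothetical, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(init, trans, emit, obs):
    """Forward probabilities f[m][l] for one sequence."""
    K = len(init)
    f = [[init[k] * emit[0][k][obs[0]] for k in range(K)]]
    for m in range(1, len(obs)):
        f.append([emit[m][l][obs[m]] *
                  sum(f[m - 1][k] * trans[m][k][l] for k in range(K))
                  for l in range(K)])
    return f


def backward(trans, emit, obs, K):
    """Backward probabilities b[m][k] for one sequence."""
    M = len(obs)
    b = [[1.0] * K for _ in range(M)]
    for m in range(M - 2, -1, -1):
        b[m] = [sum(trans[m + 1][k][l] * emit[m + 1][l][obs[m + 1]] * b[m + 1][l]
                    for l in range(K)) for k in range(K)]
    return b


def expected_counts(init, trans, emit, sequences):
    """Sum per-sequence posteriors into expected transition and state counts."""
    K, M = len(init), len(sequences[0])
    trans_counts = [[[0.0] * K for _ in range(K)] for _ in range(M)]
    state_counts = [[0.0] * K for _ in range(M)]
    for obs in sequences:
        f = forward(init, trans, emit, obs)
        b = backward(trans, emit, obs, K)
        p_g = sum(f[-1])  # full probability of this sequence
        for m in range(M):
            for l in range(K):
                # contribution to (2.8): f_l(m) b_l(m) / p(g)
                state_counts[m][l] += f[m][l] * b[m][l] / p_g
                if m == 0:
                    continue  # no transition into the first marker
                for k in range(K):
                    # contribution to (2.7)
                    trans_counts[m][k][l] += (
                        f[m - 1][k] * trans[m][k][l]
                        * emit[m][l][obs[m]] * b[m][l] / p_g)
    return trans_counts, state_counts


# Toy example (hypothetical values): 2 states, 3 markers, 3 sequences.
init = [0.6, 0.4]
trans = [None,
         [[0.9, 0.1], [0.2, 0.8]],
         [[0.7, 0.3], [0.4, 0.6]]]
emit = [[[0.8, 0.2], [0.3, 0.7]],
        [[0.6, 0.4], [0.1, 0.9]],
        [[0.5, 0.5], [0.25, 0.75]]]
sequences = [[0, 1, 1], [1, 1, 0], [0, 0, 1]]

trans_counts, state_counts = expected_counts(init, trans, emit, sequences)
print(state_counts)
```

Because each sequence contributes a posterior distribution at every marker, the state counts at a marker sum to the number of sequences, which gives a quick check on an implementation.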

The counts of transitions and emissions in the polyploid HMM are then decomposed into counts for the haploid-based parameters, $C_{J_m}$, $C_{\alpha_{ml_n}}$ and $C_{\theta_{ml_n}(h)}$, according to the permutation probabilities in equations (2.2) and (2.4). To avoid probability estimates of 0 (overfitting), a small number (a pseudo count) is added to each count. The maximum-likelihood estimates for $J_m$, $\alpha_{ml_n}$ and $\theta_{ml_n}(h)$ with pseudo counts acting as Dirichlet priors (that is, maximum a posteriori estimates with Dirichlet priors) are given by

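$$
\begin{aligned}
\hat{J}_m &= \frac{C_{J_m} + u_J \cdot 0.5}{(C_{J_m} + u_J \cdot 0.5) + (C_{1-J_m} + u_J \cdot 0.5)} \\
\hat{\alpha}_{ml_n} &= \frac{C_{\alpha_{ml_n}} + u_\alpha \cdot 1/z}{\sum_{l=1}^{z} \left(C_{\alpha_{ml_n}} + u_\alpha \cdot 1/z\right)} \\
\hat{\theta}_{ml_n}(h) &= \frac{C_{\theta_{ml_n}(h)} + u_\theta \cdot 1/H}{\sum_{h=1}^{H} \left(C_{\theta_{ml_n}(h)} + u_\theta \cdot 1/H\right)}
\end{aligned}
\tag{2.9}
$$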

where the terms $u_J \cdot 0.5$, $u_\alpha \cdot 1/z$ and $u_\theta \cdot 1/H$ are the pseudo counts for estimating the parameters $J_m$, $\alpha_{ml_n}$ and $\theta_{ml_n}(h)$, respectively. For this stage of model fitting, we set $u_J = u_\alpha = u_\theta = 0.01$ for the first application and $u_J = u_\alpha = u_\theta = 0.1$ for the second and third applications. The log likelihood of the observed genotype sequences is calculated through the forward algorithm using these parameter estimates.
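The pseudo-count estimates in (2.9) amount to adding a small constant to each expected count before normalising. The sketch below shows this for a two-outcome jump parameter and for a categorical parameter such as $\alpha$ or $\theta$; the function names and the example counts are hypothetical.

```python
def map_jump(c_jump, c_no_jump, u_j=0.01):
    """Estimate of J_m from expected jump / no-jump counts, pseudo count u_j * 0.5 each."""
    pseudo = u_j * 0.5
    return (c_jump + pseudo) / ((c_jump + pseudo) + (c_no_jump + pseudo))


def map_categorical(counts, u=0.01):
    """Estimate of a categorical parameter (alpha over z clusters, or theta
    over H alleles) with pseudo count u / len(counts) added to each category."""
    pseudo = u / len(counts)
    total = sum(c + pseudo for c in counts)
    return [(c + pseudo) / total for c in counts]


# Hypothetical expected counts: the first category was never observed,
# yet its estimate stays strictly positive because of the pseudo count.
alpha_hat = map_categorical([0.0, 3.0, 7.0], u=0.1)
jump_hat = map_jump(0.0, 10.0, u_j=0.01)
print(alpha_hat, jump_hat)
```

With no data at all, both estimators fall back to the uniform prior mean (0.5 for the jump parameter, $1/z$ or $1/H$ for the categorical parameters).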

Computation of the training process

The training process might converge to a local maximum of the likelihood function, which is a typical problem for the EM algorithm. A likelihood usually has many local maxima, and different initial values for the EM algorithm can lead to different local maxima (and hence to different parameter estimates). One way to deal with this problem is to train on the data for a fixed number of repetitions, each with different initial values. The parameter estimates can then be obtained by one of the following approaches: (1) the estimates are taken from the repetition that gives the highest log likelihood given the observed genotypes; (2) the estimates are obtained by averaging the estimates over all repetitions; (3) the estimates are inferred by combining the results across repetitions using a sampling algorithm (see next section). In this thesis, we employed (1) and (3) with 10 repetitions of the training algorithm, each with different starting values. Each repetition of training has 25 iterations. In our experience, 25 iterations per training run give reasonably good convergence.
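Approach (1) above, keeping the repetition with the highest log likelihood, can be sketched on a deliberately simple stand-in model. The example below runs EM on a mixture of two coins with equal mixing weights, not on the polyploid HMM; the model, the data and all function names are assumptions chosen only to keep the restart logic visible. The defaults of 10 repetitions and 25 iterations mirror the settings quoted in the text.

```python
import math
import random


def component_likelihoods(h, n, p_a, p_b):
    """Likelihood of h heads in n flips under each coin.
    The binomial coefficient is omitted because it does not depend on p."""
    return (p_a ** h * (1 - p_a) ** (n - h),
            p_b ** h * (1 - p_b) ** (n - h))


def log_likelihood(data, p_a, p_b):
    total = 0.0
    for h, n in data:
        la, lb = component_likelihoods(h, n, p_a, p_b)
        total += math.log(0.5 * la + 0.5 * lb)
    return total


def em_step(data, p_a, p_b):
    """One EM iteration: soft-assign each trial to a coin, then re-estimate."""
    heads_a = tails_a = heads_b = tails_b = 0.0
    for h, n in data:
        la, lb = component_likelihoods(h, n, p_a, p_b)
        w = la / (la + lb)  # posterior that coin A produced this trial
        heads_a += w * h
        tails_a += w * (n - h)
        heads_b += (1 - w) * h
        tails_b += (1 - w) * (n - h)
    return heads_a / (heads_a + tails_a), heads_b / (heads_b + tails_b)


def train_with_restarts(data, n_reps=10, n_iter=25, seed=0):
    """Run EM from n_reps random starts and keep the best final log likelihood."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_reps):
        p_a, p_b = rng.uniform(0.05, 0.95), rng.uniform(0.05, 0.95)
        for _ in range(n_iter):
            p_a, p_b = em_step(data, p_a, p_b)
        runs.append((log_likelihood(data, p_a, p_b), (p_a, p_b)))
    best = max(runs, key=lambda run: run[0])
    return best, runs


# Hypothetical data: (heads, flips) for six trials.
data = [(9, 10), (8, 10), (2, 10), (1, 10), (7, 10), (3, 10)]
best, runs = train_with_restarts(data)
print(best)
```

Returning every run alongside the best one keeps the other repetitions available, which is what approaches (2) and (3) would need.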

Prior

For each initialisation of the EM algorithm, we use Dirichlet priors on all of our parameters. Namely, for scalars $u_\theta > 0$ and $u_\alpha > 0$, we let $\theta_{ml\cdot} \sim \mathrm{Dirichlet}(u_\theta m_\theta)$, where $m_\theta$ is the uniform vector with each element $1/H$, and $\alpha_{m\cdot} \sim \mathrm{Dirichlet}(u_\alpha m_\alpha)$, where $m_\alpha$ is the uniform vector with each element $1/z$. The $u$ parameters measure the strength of the prior information, so that a large $u$ implies sampling more tightly around the corresponding mean vector $m$.
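The effect of the strength parameter $u$ on the initial values can be seen by sampling from a Dirichlet distribution with a uniform mean vector. The sketch below draws the samples by normalising independent gamma variates from the Python standard library; the function names and the values of $u$ are hypothetical, and very small concentration values can underflow to zero in this simple form, which the sketch reports rather than handles.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    if total == 0.0:
        # Happens when every concentration is tiny and all draws underflow.
        raise ValueError("all gamma draws underflowed; concentrations too small")
    return [d / total for d in draws]


def init_theta(H, u_theta, rng):
    """Initial theta over H alleles: Dirichlet(u_theta * m) with m uniform (1/H each)."""
    return sample_dirichlet([u_theta / H] * H, rng)


def spread(H, u, rng, n_samples=200):
    """Mean absolute deviation of sampled elements from the uniform mean 1/H."""
    total = 0.0
    for _ in range(n_samples):
        total += sum(abs(x - 1.0 / H) for x in init_theta(H, u, rng)) / H
    return total / n_samples


rng = random.Random(1)
print(init_theta(4, 10.0, rng))
print(spread(4, 1.0, rng), spread(4, 1000.0, rng))
```

The second printed pair compares a weak prior with a strong one: the larger $u$ gives starting values that stay much closer to the uniform vector.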

Although $J_m$ is treated as an unknown parameter and is inferred by our training algorithm, $J_m$ can be expressed as a compound parameter, $1 - e^{-r_m d_m}$, where $d_m$ is the physical distance and $r_m$ is the jump rate per bp between markers $m-1$ and $m$. In general, $r_m$ is thought of as being related to the recombination rate. However, Scheet and Stephens (2006) suggested that generally there might be little correlation between the actual recombination rate and estimates of $r_m$. Here, $J_m$ is treated as a single parameter in the training algorithm. Nevertheless, for each initialisation of the EM algorithm, we set $J_m \sim \mathrm{Beta}(u_J(1 - e^{-d_m r}),\; u_J e^{-d_m r})$, where $u_J > 0$ and $d_m$ (bp) is the distance between markers $m-1$ and $m$. We take $r = 10^{-8}$ per base pair in the population, reflecting the background probability of