Sampling scheme and Conditional distributions

The main steps of the sampling procedure are:

1. P(S|η,Θ,ρ,π,Y,X)
2. P(A|S,η,Θ,ρ,π,Y,X)
3. P(η|Θ,π,A,S,Y,X)
4. P(π|Θ,π,A,S,Y,X)

The details for each of these steps are given below:

• P(S|η,Θ,ρ,Y,X, π)

Sample the nucleosomal states starting from the end of the sequence, using the results of forward algorithm I. Recall that N is the entire length of the sequence. Draw S[N] = 1 with probability

$$\frac{F[N,1]}{F[N,1] + F[N,2]}$$

and S[N] = 2 otherwise. Then, conditioned on S[i+1], draw S[i] = s with probability

$$\frac{F[i,s]\,T[s,S[i+1]]}{F[i,1]\,T[1,S[i+1]] + F[i,2]\,T[2,S[i+1]]}$$

for i = N-1, ..., 1.
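The backward draw of the states can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the forward quantities F and the transition matrix T are taken as already computed, the two states are coded 0 and 1 rather than 1 and 2, and the function name `sample_states_backward` is hypothetical.

```python
import random


def sample_states_backward(F, T, rng=None):
    """Draw a state path S[0..N-1] (states coded 0/1) by backward sampling.

    F[i][s]: forward quantity for state s at position i (assumed precomputed).
    T[s][t]: transition probability from state s to state t.
    """
    rng = rng or random.Random()
    N = len(F)
    S = [0] * N
    # Last position: pick a state in proportion to F[N-1][s].
    w0, w1 = F[N - 1][0], F[N - 1][1]
    S[N - 1] = 0 if rng.random() < w0 / (w0 + w1) else 1
    # Walk backwards, conditioning on the state just drawn at i + 1.
    for i in range(N - 2, -1, -1):
        nxt = S[i + 1]
        w0 = F[i][0] * T[0][nxt]
        w1 = F[i][1] * T[1][nxt]
        S[i] = 0 if rng.random() < w0 / (w0 + w1) else 1
    return S
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing chains.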

• P(A|S, η,Θ,ρ,Y,X, π)

Sample the motifs conditional on the states. To achieve this, we use the results of forward algorithm II.

At a given position i, compute

$$P = \pi \, \phi(Y_{i-w+1:i}, \mu_{a1}, \sigma_1) \, \Theta[X_{i-w+1:i}]$$

The probability that this position is a motif ending position is given by

$$\frac{P}{G[i]}$$

If the motif position is selected, set a[i-w+1:i] = 1; in other words, declare the last w positions a motif region. Otherwise, move back one position.

Set i = i-1.

Stop when i < w.
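A minimal sketch of this backward pass over the motif indicators is given here. It assumes the weights P and the forward quantities G have already been computed for every position and uses zero-based indices. The text only states i = i-1; the sketch additionally assumes that, once a motif is placed, the walk jumps back by the full width w so that motifs do not overlap. The function name `sample_motifs_backward` is hypothetical.

```python
import random


def sample_motifs_backward(P, G, w, rng=None):
    """Return a 0/1 list a, where a[i] = 1 marks a position inside a motif.

    P[i]: weight that a motif of width w ends at position i (precomputed).
    G[i]: forward quantity at position i (precomputed, with G[i] >= P[i]).
    """
    rng = rng or random.Random()
    N = len(P)
    a = [0] * N
    i = N - 1                      # zero-based index of the last position
    while i >= w - 1:              # a motif ending at i needs w positions
        if rng.random() < P[i] / G[i]:
            for k in range(i - w + 1, i + 1):
                a[k] = 1           # declare the last w positions a motif
            i -= w                 # assumption: skip past the whole motif
        else:
            i -= 1                 # no motif ends here; move back one
    return a
```

If overlapping motifs are allowed in the model, the `i -= w` line would become `i -= 1`, which matches the literal reading of the steps.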

• P(η|S,A,Θ,ρ, π)

Conditioned on the motifs and states, we sample the emission parameters as follows:

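$$\sigma_0^2 \sim \mathrm{Gamma}\left(\frac{n_0 - 1}{2},\; \frac{1}{2}\sum_{i;\,S[i]=0} (Y_i - Y_{m0})^2\right)$$

where $n_0$ is the number of positions in the nucleosomal state and

$$Y_{m0} = \frac{1}{n_0}\sum_{i;\,S[i]=0} Y_i$$

is the average of the signal strength in the nucleosomal states. Sample $\mu_0 \sim N(a, b)$, where $a = \sum_{i;\,S[i]=0} Y_i / n_0$ and $b = \sigma_0^2$. Sample

$$(\mu_1, \beta) \sim \mathrm{MultivariateNormal}(\hat{b}, \sigma_1^2)$$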

where $\hat{b}$ is the least squares estimate of the vector of regression coefficients from regressing $Y_1$ on $X_1$. The second column of $X_1$ takes the value 0 when A[i] = 0.

Mathematically,

$$\hat{b} = (Z_1' Z_1)^{-1} Z_1' Y$$

$\hat{Y}_i$ is the predicted value in position i, defined as $Z_1[i]\,\hat{b}$.
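For a two-column design matrix the least squares estimate has a closed form, which the following sketch computes together with the fitted values. It assumes, as an illustration only, that the first column of the design matrix is a column of ones and that the second column is passed in as the list `z`; the function name `least_squares_two_column` is hypothetical.

```python
def least_squares_two_column(z, y):
    """Solve b = (Z'Z)^{-1} Z'y for Z = [1, z]; return (b, fitted values)."""
    n = len(y)
    sz = sum(z)
    szz = sum(v * v for v in z)
    sy = sum(y)
    szy = sum(v * t for v, t in zip(z, y))
    det = n * szz - sz * sz        # determinant of the 2 x 2 matrix Z'Z
    if det == 0:
        raise ValueError("Z'Z is singular: the second column has no variation")
    b0 = (szz * sy - sz * szy) / det   # intercept
    b1 = (n * szy - sz * sy) / det     # coefficient of the second column
    fitted = [b0 + b1 * v for v in z]  # predicted value at each position
    return (b0, b1), fitted
```

The residuals between `y` and `fitted` are the quantities that enter the sum of squares in the draw of the second variance parameter.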

$$\sigma_1^2 \sim \mathrm{Gamma}\left(\frac{n_1 - 1}{2},\; \frac{1}{2}\sum_{i;\,S[i]=1} (Y_i - \hat{Y}_i)^2\right)$$

• P(Θ|A, E)

The conditional posterior distributions are

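$$\Theta \sim \prod_{i,j;\,A[i:j]=1} \left[\Theta[X_{i:j}] \prod_{k=i}^{j} \phi(Y_k, \mu_{a1}, \sigma_1)\right] \qquad (4.5.1)$$

$$\theta_b \sim \prod_{i,j;\,A[i:j]=0} \left[\rho[X_{i:j}]\right] \qquad (4.5.2)$$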

In order to sample Θ and ρ, we transform the 4w probabilities of the position weight matrix. For each such probability $\Theta_{il}$ (i = 1, ..., 4; l = 1, ..., w), the logit function $p' = \log(p/(1-p))$ is used. That is, we define a new 4 by w matrix L and a new 1 by 4 vector R as follows:

$$L_{il} = \log\frac{\Theta_{il}}{1 - \Theta_{il}} \qquad (4.5.3)$$

$$R_i = \log\frac{\rho_i}{1 - \rho_i} \qquad (4.5.4)$$

A Metropolis-Hastings algorithm is then applied to sample the new parameters under the constraints

$$\sum_j \Theta_{jl} = 1 \;\; \forall l \quad \text{and} \quad \sum_j \rho_j = 1$$

where j is the nucleotide index.

The Metropolis Hastings algorithm is implemented in the following steps.

1. Start with l = 1.

2. Generate a 3 x 1 random vector $N_l$ from Multivariate Normal($L_l$, 0.4I), where $L_l$ is the vector $(L_{1l}, L_{2l}, L_{3l})$.

3. Compute $J_{il} = \exp(N_{il}) / (1 + \exp(N_{il}))$ for i = 1, 2, 3.

4. If $\sum_i J_{il} > 1$, go back to step 2.

5. If $\sum_i J_{il} < 1$, calculate $J_{4l} = 1 - \sum_{i=1}^{3} J_{il}$.

6. Recompute $N_l$ by appending $\log(J_{4l} / (1 - J_{4l}))$.

7. By inverse transformation, calculate the new motif matrix $\Theta_n$, where the lth column of $\Theta_n$ is the inverse logit transform of $N_l$ and all other columns remain the same.

8. Compute Ratio = $L(\Theta_n, \rho, \pi, \eta \mid Y, X, S) \,/\, L(\Theta, \rho, \pi, \eta \mid Y, X, S)$.

9. Accept the new matrix $\Theta_n$ with probability min(1, Ratio).

10. Repeat the above steps for l = 2, 3, ..., w.

11. Update ρ in the same way. Note that, in place of w columns, we have to update only one vector.
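The column update can be sketched as follows. This is an illustration, not the implementation used for the results: the likelihood is supplied by the caller as a hypothetical function `log_lik` that returns a log-likelihood, the value 0.4 is assumed to be a variance, the matrix is stored as a list of w columns of length 4, and all function names are hypothetical. Appending the logit of the fourth entry and then inverting the logit returns that entry unchanged, so steps 5 to 7 collapse into a single line.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    # Numerically stable form of exp(x) / (1 + exp(x)).
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def propose_column(col, rng, var=0.4, max_tries=1000):
    """Steps 2-7: propose a new length-4 probability column from `col`."""
    sd = math.sqrt(var)  # assumption: 0.4 is a variance, not a std. deviation
    for _ in range(max_tries):
        n = [rng.gauss(logit(p), sd) for p in col[:3]]  # step 2
        j = [inv_logit(x) for x in n]                   # step 3
        total = sum(j)
        if total < 1.0:                                 # steps 4 and 5
            return j + [1.0 - total]                    # steps 5-7 combined
    return list(col)  # assumption: keep the current column if every draw fails


def mh_update_column(theta, l, log_lik, rng):
    """Steps 8-9 for column l; `theta` is a list of w columns of length 4."""
    proposal = [list(c) for c in theta]
    proposal[l] = propose_column(theta[l], rng)
    log_ratio = log_lik(proposal) - log_lik(theta)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return theta
```

A full sweep would call `mh_update_column` for each column in turn. The sketch requires every entry of the current column to lie strictly between 0 and 1, since the logit is undefined at the boundaries.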

• P(π|A)

We sampled π from the Beta distribution Beta($n_1 - n_{sg} + 1$, $n_1 + 1$), where $n_{sg} = |A_w|$ is the number of motifs sampled in the previous step.

• P(ψ|S)

Since there is no closed form for the kernel of the posterior given by L(log(λ), log(µ)|Y,S), we resort to a Metropolis-Hastings scheme. The jump distribution was taken to be Gaussian, with the variance of the jump denoted as V. The scheme is given below.

1. Generate a sample $\log(\lambda_{new}) \sim N(\log(\lambda_{old}), V)$.

2. Compute the ratio $R = L(\log(\lambda_{new}), \mu, \eta \mid Y) \,/\, L(\log(\lambda_{old}), \mu, \eta \mid Y)$.

3. Accept $\lambda_{new}$ with probability min(1, R).
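One step of this scheme can be sketched as below. The likelihood is supplied by the caller as a hypothetical function `log_lik` that takes the parameter value and returns the log of L with all other quantities held fixed; the function name `mh_step_log_scale` is likewise an assumption made for illustration. Working with the log of the ratio avoids overflow when the likelihood values are extreme.

```python
import math
import random


def mh_step_log_scale(lam_old, log_lik, V, rng):
    """One random-walk Metropolis-Hastings step for a positive parameter."""
    # Step 1: Gaussian jump on the log scale with variance V.
    log_new = rng.gauss(math.log(lam_old), math.sqrt(V))
    lam_new = math.exp(log_new)
    # Step 2: log of the ratio R of new to old likelihood.
    log_R = log_lik(lam_new) - log_lik(lam_old)
    # Step 3: accept with probability min(1, R).
    if rng.random() < math.exp(min(0.0, log_R)):
        return lam_new
    return lam_old
```

Because the jump is made on the log scale, every proposed value is positive, so no boundary check is needed.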

The same steps are repeated for µ.

Note that in the above sampling procedure we have used a combination of forward algorithms I and II (i.e., we have sampled the states unconditionally first, and then the motifs) as an alternative to using forward algorithm III. As initially formulated, forward algorithm III took more computational time than algorithms I and II combined, because a three by N matrix has to be defined and accessed throughout the algorithm. However, this time can be much reduced, because in the recursion equations H[i,1] and H[i,2] occur together as a common summand. This motivated us to implement the same forward algorithm II without defining a new matrix, using F[i,1] and F[i,2] of forward algorithm I instead. That is, the idea is not to use the three-state representation in the computational stage. In computing the likelihood, we use only the two-state representation of whether or not a position is nucleosomal. But while sampling for, say, position i, we do not use F[i,1] or F[i,2]; instead we go back a number of steps, to F[i-1,1], F[i-1,2], F[i-w,1] and F[i-w,2], and use them to compute the three probabilities for the three possible scenarios at position i. Here we are doing a joint sampling, but without having to define a new matrix. This procedure took less computational time (44 seconds) than the combination of I and II (76 seconds).

However, we have seen that forward algorithm III does not typically take more time to converge than the combination of I and II, because in both cases we are doing a block sampling of motifs and states together. With I and II we achieve this by first sampling from P(S|·) and then from P(A|S,·), and with III we sample from P(A,S|·) directly. Both are improvements on executing the full conditionals as in a conventional Gibbs sampler.

In document Bayesian model based approaches in the analysis of chromatin structure and motif discovery (Page 91-95)