4.2.1 Data Structure and Assumptions

The joint model consists of four main components. The measured data are $Y$, the log-intensities, and $X$, the sequence; the unobserved (latent) components are $A$, the indicator variable of all motif start positions, and $S$, the set of latent nucleosomal states for each probe. The sequence data, $X$, can be considered one long string of DNA, each element being a letter from A, C, G or T. The probe intensity data $Y$, however, are at the resolution of individual probes, which typically consist of about 25-50 base pairs in our experimental data. To build a coherent joint model, we require data from both sources to be at the same resolution. Hence we need to choose between two simplifying assumptions: (i) for all the base pairs within a single probe, $Y$ has a constant probe-specific value, or (ii) a value is assigned to each base pair of the probe (e.g. by numerical interpolation). If all probes are of equal length and equi-spaced, the first option may work well, but since we typically have probes with gaps and of varying distance, we choose option (ii), which also allows for variability in the intensity over a probe.
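Option (ii) can be illustrated with a minimal sketch. The text says only "numerical interpolation", so the choice of linear interpolation between probe midpoints is an assumption made here for illustration, as are the function name `expand_probe_intensities` and its arguments.

```python
import numpy as np

def expand_probe_intensities(starts, lengths, y_probe):
    """Assign an interpolated log-intensity to every base pair covered by a probe.

    Assumption (not from the text): each probe value sits at the probe
    midpoint and values are linearly interpolated between midpoints.
    Probes must be sorted by start position.
    """
    starts = np.asarray(starts, dtype=float)
    lengths = np.asarray(lengths, dtype=int)
    y_probe = np.asarray(y_probe, dtype=float)

    # Midpoint coordinate of each probe.
    centers = starts + (lengths - 1) / 2.0

    # Coordinates of all base pairs covered by some probe (gaps are skipped).
    positions = np.concatenate(
        [np.arange(s, s + n) for s, n in zip(starts, lengths)]
    )

    # np.interp holds the end values constant outside the outermost midpoints.
    y_bp = np.interp(positions, centers, y_probe)
    return positions, y_bp

# Two hypothetical probes of 5 bp each, separated by a 5 bp gap.
positions, y_bp = expand_probe_intensities([0, 10], [5, 5], [1.0, 3.0])
```

The returned vectors have one entry per covered base pair, so they line up with the sequence at the same resolution.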

4.2.2 Observed data and latent variables.

For the moment, consider the sequence as a single vector of length $L$, denoted by $X = (X_1, \ldots, X_L)$. Each $X_i$ ($i = 1, \ldots, L$) takes values in the set $\{A, C, G, T\}$. Let $X_{[a:b]}$ denote the subsequence $X_a, \ldots, X_b$. The signal intensity data, at the same resolution, are represented by $Y = (Y_{11}, Y_{12}, \ldots, Y_{1j_1}, Y_{21}, \ldots, Y_{2j_2}, \ldots, Y_{Pj_P})$, where $P$ denotes the total number of probes, $j_p$ ($p = 1, \ldots, P$) is the number of base pairs corresponding to probe $p$ (which is typically equal, but could vary in some cases, depending on experimental limitations), and $j_1 + j_2 + \cdots + j_P = L$.

Along with $X$ and $Y$, we also define a vector $t$ of length $L$, which contains the actual chromosomal location of the nucleotides corresponding to $X$ and $Y$. For instance, if there is a gap of $N$ nucleotides between probes $p_1$ and $p_2$, we will still denote the sequence corresponding to $Y_{p_1, j_{p_1}}$ as, say, some $X_q$, and the sequence corresponding to $Y_{p_2, 1}$ (the next probe measurement) as $X_{q+1}$; however, the value of $t_{q+1} = N$ would contain the information on the distance between the closest measured probes in this scenario.

Next, $S = (S_1, \ldots, S_P)$ denotes the indicator vector for nucleosomal state, with $S_p = 1$ if probe $p$ belongs to a nucleosome-free region (NFR) and 0 if it belongs to a nucleosomal region. We can also write this in expanded form as an $L$-dimensional vector $(S_{11}, \ldots, S_{1j_1}, \ldots, S_{Pj_P})$, where $S_{pj}$ takes the same value for $j = 1, 2, \ldots, j_p$ ($p = 1, \ldots, P$). Finally, let $A = (A_1, \ldots, A_L)$ denote a latent vector of indicator variables with $A_i = 1$ if a motif site originates at position $i$ of the sequence, and 0 otherwise.
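The expanded form of the state vector and the role of the location vector can be sketched in a few lines. The helper names and the toy values below are hypothetical. Storing chromosomal coordinates in `t` and deriving distances by differencing is one possible reading of the description, not necessarily the authors' encoding.

```python
import numpy as np

def expand_to_base_pairs(S_probe, j):
    """Repeat each probe-level state S_p over the j_p base pairs of probe p."""
    S_probe = np.asarray(S_probe)
    j = np.asarray(j)
    if S_probe.shape != j.shape:
        raise ValueError("S_probe and j must have one entry per probe")
    return np.repeat(S_probe, j)

def coordinate_steps(t):
    """Differences between consecutive chromosomal coordinates.

    A step of 1 means adjacent nucleotides; a larger step marks a gap
    between two probes.
    """
    return np.diff(np.asarray(t))

# Toy example: P = 2 probes with j = (3, 2), so L = 5.
j = [3, 2]
S = [1, 0]                      # probe 1 in an NFR, probe 2 nucleosomal
t = [100, 101, 102, 110, 111]   # hypothetical chromosomal coordinates

S_expanded = expand_to_base_pairs(S, j)   # array([1, 1, 1, 0, 0])
steps = coordinate_steps(t)               # array([1, 1, 8, 1])
```

The expanded vector has length $j_1 + \cdots + j_P = L$, matching the lengths of $X$, $Y$ and $A$.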

4.2.3 Model formulation.

For simplicity, assume there is one motif type in the model, with $w$ columns, characterized by a $4 \times w$ matrix of probabilities $\Theta$. Column $l$ ($l = 1, \ldots, w$) of $\Theta$ is a vector $(\theta_{l1}, \ldots, \theta_{l4})'$, where $\theta_{lm}$ denotes the probability of observing the $m$-th base in the set $\{A, C, G, T\}$, and $\sum_{m=1}^{4} \theta_{lm} = 1$. We can generalize this to a model with $K$ motif types, characterized by weight matrices $\{\Theta_1, \ldots, \Theta_K\}$, of motif widths $w_1, \ldots, w_K$. Also, denote the background distribution of nucleotides by the parameter $\rho$. In the simplest model, $\rho$ is a vector of probabilities $(\rho_1, \ldots, \rho_4)$ of the 4 nucleotides; however, in complex [...] nucleotides due to the presence of long repeat sequences.

We assume that each column $l$ of a motif instance ($l = 1, \ldots, w$) is generated by a draw from the multinomial distribution characterized by parameter $\theta_l = (\theta_{l1}, \ldots, \theta_{l4})'$. Then, let $\Theta[X_{[a:a+w-1]}]$ and $\rho[X_{[a:a+w-1]}]$ denote the probability of the segment $X_{[a:a+w-1]}$ being generated from the motif and the background model respectively, that is,

$$\Theta[X_{[a:a+w-1]}] = \prod_{l=1}^{w} \theta_{l X_{a+l-1}}, \qquad \rho[X_{[a:a+w-1]}] = \prod_{l=1}^{w} \rho_{X_{a+l-1}}$$

(for an i.i.d. background). Additionally, $\pi$ denotes the probability of occurrence of a motif at any position (only within the NFR regions).

For the model of transition between the states of nucleosome (0) and NFR (1), we assume a continuous-time Markov process characterized by a matrix of transition probabilities $P(t)$ for an interval $t$, where

$$P_{00}(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\, e^{-(\lambda+\mu)t}, \qquad P_{11}(t) = \frac{\lambda}{\lambda+\mu} + \frac{\mu}{\lambda+\mu}\, e^{-(\lambda+\mu)t}.$$
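The two-state transition probabilities can be checked numerically with a short sketch. The function name `transition_matrix` and the parameter values are illustrative only; the off-diagonal entries are filled in as one minus the diagonal, which is what a transition matrix requires and which implies that $\lambda$ is the rate of leaving the nucleosomal state and $\mu$ the rate of leaving the NFR state.

```python
import numpy as np

def transition_matrix(t, lam, mu):
    """2x2 matrix P(t) for states 0 (nucleosome) and 1 (NFR).

    Diagonal entries follow the closed forms for P00(t) and P11(t);
    off-diagonal entries are their complements so each row sums to one.
    """
    total = lam + mu
    decay = np.exp(-total * t)
    p00 = mu / total + lam / total * decay
    p11 = lam / total + mu / total * decay
    return np.array([[p00, 1.0 - p00],
                     [1.0 - p11, p11]])

# Illustrative rates (not taken from the text).
lam, mu = 0.02, 0.05

P_adjacent = transition_matrix(1.0, lam, mu)   # neighbouring base pairs
P_gap = transition_matrix(200.0, lam, mu)      # across a 200 bp gap
```

Because the interval $t$ enters only through $e^{-(\lambda+\mu)t}$, a long gap between probes pushes each row of $P(t)$ toward the stationary distribution $(\mu/(\lambda+\mu), \lambda/(\lambda+\mu))$, so states on either side of a large gap are nearly independent.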

Next, the log-intensity data for probe $p$ are assumed to be $N(\mu_{ps}, \sigma_s^2)$, where the index $s$ denotes the nucleosomal state, $s \in \{0, 1\}$. For the nucleosomal state, we assume a conjugate hierarchical model for $\mu_{p0}$, such that $\mu_{p0} \sim N(\mu_0, \tau_0 \sigma_s^2)$. For the NFR state, the mean of the normal distribution at position $a$ of the probe, $\mu_{pa}$, is modeled as a baseline value $\mu_1$ linked to a parameter which associates the occurrence of motifs at that position of the probe to the observed intensity value. Specifically, we assume, conditionally on the latent variables $A$ and $S$,

$$\mu_{a1} = \mu_1 + \beta \log\left( \frac{\Theta[X_{[a:a+w-1]}]}{\rho[X_{[a:a+w-1]}]} \right),$$

where $1 \le a \le j_p - w + 1$. Here $\beta$ is an unknown parameter measuring the direct association between signal data and the occurrence of motifs. A positive log-likelihood ratio corresponds to a segment that is more probable under the motif model than under the background; the larger the ratio, the higher the strength of the motif.

The priors for $\mu_0$, $\mu_1$ and $\beta$ are assumed to be uniform (flat), and the priors for $\sigma_s^2$ are also taken to be flat, proportional to $1/\sigma_s^2$ ($s = 0, 1$). For $\Theta$ and $\rho$ we assume product Dirichlet and Dirichlet distributions, respectively, with hyperparameters equal to 1. The only parameter for which we require an informative prior is the motif site prevalence $\pi$, since a non-informative prior for $\pi$ typically leads to non-identifiability in practice (Gelfond et al., 2009). For $\pi$ we thus use a relatively strong Beta$(\delta(1-\gamma), \delta\gamma)$ prior, where $\delta$ is a large “pseudocount” and $\gamma$ ($0 < \gamma < 1$) is a “prior expected value”. Based on preliminary results in a number of other studies, including Gelfond et al. (2009), we propose to choose $\gamma$ over a range of values between $10^{-5}$ and $10^{-3}$, which seem to be reasonable, and to test the sensitivity of the final inference to this choice.

Let us denote by $\eta$ the set of all non-motif related parameters in the model, that is, $\eta = (\mu_0, \mu_1, \sigma_0^2, \sigma_1^2, \beta)$. We denote by $\psi$ all the transition parameters, that is, $\psi = (\log(\lambda), \log(\mu))$. Now, the complete data likelihood can be expressed in the form

$$L(Y, X, S, A \mid \eta, \Theta, \rho, \pi) = P(Y \mid A, S, X, \eta, \Theta, \rho, \pi)\, P(X \mid A, \Theta, \rho, \pi)\, P(A \mid S, \pi)\, P(S \mid \lambda, \mu).$$

The problem of interest is to estimate the latent variables $A$ and $S$, and the unknown parameters. Given the complex form of the likelihood, the estimation procedure we propose is a hybrid MCMC algorithm that incorporates elements of recursive sampling-based data augmentation for efficient inference.
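The link between the sequence and the NFR-state mean, $\mu_1 + \beta \log(\Theta[\cdot]/\rho[\cdot])$, can be made concrete with a small sketch. It assumes a single motif type and an i.i.d. background, as in the simplest model above. The function names, the 0-based indexing, and the toy weight matrix are hypothetical and serve only to illustrate the formula.

```python
import numpy as np

BASES = "ACGT"  # row order of the weight matrix: A, C, G, T

def segment_log_odds(x, a, theta, rho):
    """Log ratio of motif to background probability for the width-w
    segment of x starting at 0-based position a.

    theta: 4 x w matrix, theta[m, l] = P(base m in motif column l).
    rho:   length-4 vector of i.i.d. background probabilities.
    """
    w = theta.shape[1]
    segment = x[a:a + w]
    if len(segment) < w:
        raise ValueError("segment runs past the end of the sequence")
    idx = [BASES.index(b) for b in segment]
    log_motif = sum(np.log(theta[m, l]) for l, m in enumerate(idx))
    log_background = sum(np.log(rho[m]) for m in idx)
    return log_motif - log_background

def nfr_mean(x, a, theta, rho, mu1, beta):
    """NFR-state mean at position a: baseline plus beta times the log odds."""
    return mu1 + beta * segment_log_odds(x, a, theta, rho)

# Toy motif of width 2 that favours "AT"; uniform background.
theta = np.array([[0.7, 0.1],
                  [0.1, 0.1],
                  [0.1, 0.1],
                  [0.1, 0.7]])
rho = np.full(4, 0.25)

strong = nfr_mean("ATGC", 0, theta, rho, mu1=1.0, beta=0.5)  # segment "AT"
weak = nfr_mean("ATGC", 2, theta, rho, mu1=1.0, beta=0.5)    # segment "GC"
```

With a positive `beta`, the segment matching the motif raises the mean above the baseline `mu1`, while the poorly matching segment lowers it, which is the behaviour the log-likelihood ratio term is meant to capture.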