Generating Function Formalism - Statistics for Transcription Factor Binding Sites

5.5 Generating Function Formalism

An important idea of the previous approach is the summing of word probabilities into TF occurrence probabilities. Using this idea, we can easily derive generating functions for the number of TF occurrences. The formalism is based on the introduction to generating functions given in Chapter 4, which follows the exposition in Rahmann (2000). The advantage of such a formalism is that it enables us to use all kinds of statistics derived for word counting based on generating functions (e.g., see Nicodeme et al., 1999; Stefanov et al., 2007). However, we have to restrict the sequence model to one strand, since two occurrences at one position are difficult to deal with. Furthermore, we assume that the word occurrence probabilities are equal.

However, if the assumption of an equi-probable nucleotide distribution is not fulfilled, the formalism still yields a reasonable approximation. For a non-uniform background distribution, the scores in the PSM are strongly influenced by the background, since the PSM contains the likelihood ratios. Hence, positions with low background nucleotide frequencies only have high scores if they have high support from the PFM, in which case there is a strong consensus at this position. Therefore, the set of compatible words usually contains words with similar occurrence probabilities µ(a), which justifies our assumption.

5.5.1 Waiting Time and Stopping Probabilities

We define an occurrence of TF A with length $\ell$ and set of compatible words A by the last position of the hit:

$$Y_i(A) := \sum_{a \in A} Y_i(a),$$

where $Y_i(a)$ is equal to 1 if $i$ is the last position of an occurrence of $a$ and 0 otherwise. Note that here we use the fact that the events $\{Y_i(a)\}_{a \in A}$ are disjoint. Thus, A is not allowed to be a multi-set. Therefore, we have to remove the complementary strand from the analysis.

The waiting time $T_1(A)$ until the first occurrence of A is defined based on the word waiting times $T_1(a)$:

$$T_1(A) := \min_{a \in A} T_1(a),$$

where $T_1(a)$ is the smallest index $i$ for which $Y_i(a) = 1$. We can compute the corresponding probabilities by introducing stopping probabilities for $a \in A$:

$$w_n(a) := P_\mu(T_1(A) = n,\, Y_n(a) = 1) = P_\mu(T_1(A) = n,\, T_1(a) = n).$$

Again, we use the disjoint property of $\{Y_i(a)\}_{a \in A}$. For $n < \ell$, the stopping probabilities are 0 since the occurrence cannot start before the sequence. Obviously, one obtains

$$t_n := P_\mu(T_1(A) = n) = \sum_{a \in A} w_n(a)$$

due to the law of total probability. However, we will see that it is difficult to compute the stopping probabilities without the assumption of an equi-probable nucleotide distribution.
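The occurrence indicator and the waiting time defined above can be illustrated on a concrete sequence. The following is a minimal sketch, not part of the original formalism: the function names `occurrence_indicators` and `first_waiting_time`, the example sequence, and the word set are hypothetical, and positions are assumed to be 1-based so that the first possible hit ends at position ℓ.

```python
def occurrence_indicators(seq, words):
    """Return a list y with y[i] = Y_i(A) for 1-based end positions i.

    `words` is a set of equal-length words (the compatible words of A).
    Index 0 is unused so that list indices match sequence positions.
    """
    ell = len(next(iter(words)))
    y = [0] * (len(seq) + 1)
    for i in range(ell, len(seq) + 1):
        # The window seq[i-ell:i] ends at 1-based position i.
        if seq[i - ell:i] in words:
            y[i] = 1  # at most one word can match, since `words` is a set
    return y


def first_waiting_time(seq, words):
    """Return T_1(A), the smallest i with Y_i(A) = 1, or None if no hit."""
    for i, hit in enumerate(occurrence_indicators(seq, words)):
        if hit:
            return i
    return None


# Hypothetical example: two compatible words of length 3.
example_words = {"TTA", "ACA"}
print(occurrence_indicators("GATTACATTA", example_words))
print(first_waiting_time("GATTACATTA", example_words))
```

Because the word set is a proper set, the indicator never exceeds 1, which mirrors the disjointness argument used in the text.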

Return Probabilities

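First, we need to define return probabilities

$$r_n(a, a') := P_\mu(Y_{i+n}(a') = 1 \mid T_1(a) = i) = P_\mu(Y_{i+n}(a') = 1 \mid Y_i(a) = 1).$$

They are easy to compute since we know the overlap probabilities:

$$r_n(a, a') = \begin{cases} 1 & \text{if } n = 0 \text{ and } a = a', \\ 0 & \text{if } n = 0 \text{ and } a \neq a', \\ \epsilon_n(a, a') \prod_{\kappa = \ell - n + 1}^{\ell} \mu(a'_\kappa) & \text{if } 1 \le n < \ell, \\ \mu(a') & \text{otherwise,} \end{cases}$$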

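where the overlap bit $\epsilon_n(a, a')$ is 1 if the last $\ell - n$ letters of $a$ coincide with the first $\ell - n$ letters of $a'$, i.e., $a_{n+1} = a'_1, \ldots, a_\ell = a'_{\ell - n}$, and 0 otherwise.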
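The case distinction for the return probabilities translates almost literally into code. The sketch below assumes an i.i.d. letter model given as a dictionary `mu` of letter probabilities, and it assumes that the overlap bit compares the last ℓ − n letters of the first word with the first ℓ − n letters of the second word. The function names are hypothetical.

```python
from math import prod


def word_prob(word, mu):
    """Occurrence probability of a word under an i.i.d. letter model."""
    return prod(mu[c] for c in word)


def return_prob(n, a, b, mu):
    """r_n(a, b): probability that b ends n positions after an occurrence of a."""
    ell = len(a)
    if len(b) != ell:
        raise ValueError("words must have equal length")
    if n == 0:
        return 1.0 if a == b else 0.0
    if n >= ell:
        # No overlap: b occurs independently of a.
        return word_prob(b, mu)
    # Overlap bit: suffix of a (length ell - n) must equal prefix of b.
    if a[n:] != b[:ell - n]:
        return 0.0
    # Only the last n letters of b still have to be generated.
    return word_prob(b[ell - n:], mu)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(return_prob(1, "ACA", "CAT", uniform))  # overlap "CA", one new letter
print(return_prob(1, "ACA", "ACA", uniform))  # no valid overlap at shift 1
print(return_prob(2, "ACA", "ACA", uniform))  # overlap "A", two new letters
```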

Stopping Probabilities for Set of Words

Now, we can express the stopping probabilities in terms of return probabilities. We decompose the event of an occurrence by

$$\{Y_n(a) = 1\} \equiv \bigcup_{i=0}^{n} \Big( \bigcup_{a' \in A} \{Y_n(a) = 1,\ T_1(A) = i,\ T_1(a') = i\} \Big).$$

Thus, the event of an occurrence of $a$ at $n$ is partitioned according to the position $i$ of the first occurrence of any word in A and the word $a'$ that occurs there. Of course, $i$ can also be $n$, which means that this occurrence of $a$ is itself the first occurrence. Conditioning on the waiting times directly yields for $a \in A$

$$P_\mu(Y_n(a) = 1) = \sum_{i=0}^{n} \sum_{a' \in A} r_{n-i}(a', a)\, w_i(a'). \tag{5.15}$$

Hence, we obtain an |A|-dimensional system of linear equations. In general, one can show that the system has a unique solution (Rahmann, 2000) if A is a set (in contrast to a multiset). However, the problem for TFs is the large size of A. Therefore, we avoid solving this linear system by using our assumption.
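For a small word set, Eq. (5.15) can be solved directly by forward substitution, because the term with i = n reduces to the stopping probability itself (the return probability at distance 0 is 1 for identical words and 0 otherwise). The sketch below is one possible reading of the equation, not the procedure used in the text. It assumes an i.i.d. letter model, 1-based positions, and hypothetical function names. Its cost grows with the number of word pairs, which is why it is not practical for realistic TF word sets.

```python
from math import prod


def word_prob(word, mu):
    return prod(mu[c] for c in word)


def return_prob(n, a, b, mu):
    """r_n(a, b) under an i.i.d. letter model (see the case distinction)."""
    ell = len(a)
    if n == 0:
        return 1.0 if a == b else 0.0
    if n >= ell:
        return word_prob(b, mu)
    if a[n:] != b[:ell - n]:
        return 0.0
    return word_prob(b[ell - n:], mu)


def stopping_probs(words, mu, n_max):
    """Return w with w[n][a] = w_n(a) for n = 0..n_max.

    Rearranged Eq. (5.15):
        w_n(a) = P(Y_n(a) = 1) - sum_{i < n} sum_{a'} r_{n-i}(a', a) w_i(a'),
    where P(Y_n(a) = 1) = mu(a) for n >= ell and 0 before.
    """
    ell = len(words[0])
    w = [{a: 0.0 for a in words} for _ in range(n_max + 1)]
    for n in range(ell, n_max + 1):
        for a in words:
            earlier = sum(
                return_prob(n - i, b, a, mu) * w[i][b]
                for i in range(ell, n)
                for b in words
            )
            w[n][a] = word_prob(a, mu) - earlier
    return w


# Hypothetical toy example on a two-letter alphabet.
mu = {"A": 0.5, "B": 0.5}
w = stopping_probs(["AB", "BB"], mu, 4)
print(w[3])  # "BB" cannot be the first hit at position 3
```

In this toy example the stopping probabilities of the two words differ, which illustrates why equal stopping probabilities are an additional assumption and not a consequence of equal word probabilities.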


Stopping Probabilities under equi-probable Sequence Model

In this case, the stopping probabilities $w_n(a)$ are equal. For all $a \in A$ and $n \ge 0$, this leads to

$$\bar{w}_n := w_n(a).$$

Under this assumption, the above system of linear equations derived from Eq. (5.15) becomes, after summing over all $a \in A$,

$$\sum_{a \in A} P_\mu(Y_n(a) = 1) = \sum_{i=0}^{n} \bar{w}_i \sum_{a \in A} \sum_{a' \in A} r_{n-i}(a', a). \tag{5.16}$$

The left hand side is obviously the probability of an occurrence of A. The right hand side sums the return probabilities over all possible word pairs. For the sums over the return probabilities, one obtains after changing the index n − i to n for convenience

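$$\sum_{a \in A} \sum_{a' \in A} r_n(a', a) = \sum_{a \in A} \sum_{a' \in A} P_\mu(Y_{i+n}(a) = 1 \mid Y_i(a') = 1) = \sum_{a' \in A} \frac{\sum_{a \in A} P_\mu(Y_{i+n}(a) = 1,\, Y_i(a') = 1)}{P_\mu(Y_i(a') = 1)}.$$

With the above assumption that the word occurrence probabilities are equal (hence, $P_\mu(Y_i(a') = 1) = \mu(A)/|A|$), one obtains

$$\sum_{a \in A} \sum_{a' \in A} r_n(a', a) = |A|\, \mu(A)^{-1} \sum_{a' \in A} \sum_{a \in A} P_\mu(Y_{i+n}(a) = 1,\, Y_i(a') = 1).$$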

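The right-hand term is the joint occurrence probability summed over all word pairs. Hence, we can substitute it by $\gamma_n(A)$ for $0 \le n < \ell$ and by $\mu(A)^2$ otherwise, so that the whole term becomes $|A|\,\bar{\gamma}_n(A)$ with $\bar{\gamma}_n(A) = \gamma_n(A)/\mu(A)$, resp. $|A|\,\mu(A)$. Using the definition

$$r_n := \begin{cases} \bar{\gamma}_n(A) & \text{if } 0 \le n < \ell, \\ \mu(A) & \text{otherwise,} \end{cases}$$

and the fact $t_n = \sum_{a \in A} w_n(a) = |A|\,\bar{w}_n$, Eq. (5.16) yields for $n \ge \ell$

$$\mu(A) = \sum_{i=0}^{n} r_{n-i}\, t_i.$$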

Note that this equation no longer contains sums over the compatible words, since the joint/conditional and occurrence probabilities can be computed by the score convolution. Furthermore, the number of compatible words |A| cancels. The equation is similar to Eq. (4.4) in Chapter 4 for word counting. Hence, with the above assumption one can derive similar fundamental equations to develop further statistics. This equation can be written as a generating function by

$$\frac{\mu(A)\, z^{\ell}}{1 - z} = r(z)\, t(z),$$

where $r(z) = \sum_{n \ge 0} r_n z^n$ and $t(z) = \sum_{n \ge 0} t_n z^n$. Dividing by $r(z)$ directly yields the equation for the waiting time.
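Numerically, dividing by r(z) amounts to solving the convolution equation coefficient by coefficient. The sketch below assumes that the first ℓ return probabilities are already available as a list `r_head` (in the text they come from the score convolution, which is not reproduced here) and that all later return probabilities equal µ(A). The function name and the toy input are hypothetical. The toy input corresponds to the single word "AA" on a uniform two-letter alphabet.

```python
def waiting_time_probs(mu_A, r_head, n_max):
    """Return [t_0, ..., t_n_max] from mu(A) = sum_i r_{n-i} t_i for n >= ell.

    r_head holds r_0, ..., r_{ell-1}; r_n = mu_A is assumed for n >= ell.
    """
    ell = len(r_head)

    def r(n):
        return r_head[n] if n < ell else mu_A

    t = [0.0] * (n_max + 1)  # t_n = 0 for n < ell
    for n in range(ell, n_max + 1):
        earlier = sum(r(n - i) * t[i] for i in range(ell, n))
        t[n] = (mu_A - earlier) / r(0)
    return t


# Toy input: word "AA", letters A/B with probability 0.5 each,
# so mu(A) = 0.25, r_0 = 1 and r_1 = 0.5 (one overlapping letter).
print(waiting_time_probs(0.25, [1.0, 0.5], 5))
```

The recursion is quadratic in the sequence length but independent of the number of compatible words, which is the practical benefit of the equi-probable assumption.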

5.5.2 Number of Counts

With the same assumption, one obtains formulae for the waiting time until the $k$th occurrence, as well as for the number of counts. Nothing changes except that, under our assumption, the return probabilities contain the conditional occurrence probabilities for A. For the probability $v_n$ of the inter-arrival time between two successive occurrences, one obtains the corresponding generating function $v(z) = 1 - [r(z)]^{-1}$. The probability $t_n^{(k)}$ that the $k$th occurrence occurs at the $n$th position is given by

$$t^{(k)}(z) := \sum_{n \ge 0} t_n^{(k)} z^n = t(z)\, [v(z)]^{k-1}.$$

Finally, the probability $f_n(k)$ for $k$ occurrences in a sequence of length $n$ can be computed by

$$f^{(k)}(z) := \sum_{n \ge 0} f_n(k)\, z^n = t^{(k)}(z)\, \frac{1 - v(z)}{1 - z}.$$
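The generating functions for the counts can be evaluated with truncated power series arithmetic. The following sketch uses plain Python lists as coefficient vectors and covers k ≥ 1 only. As before, `r_head` (the first ℓ return probabilities) is an assumed input that the text obtains from the score convolution, and all function names are hypothetical. The toy input is again the single word "AA" on a uniform two-letter alphabet.

```python
def series_mul(p, q, n_max):
    """Product of two power series, truncated after z^n_max."""
    out = [0.0] * (n_max + 1)
    for i, pi in enumerate(p[:n_max + 1]):
        for j, qj in enumerate(q[:n_max + 1 - i]):
            out[i + j] += pi * qj
    return out


def series_inv(p, n_max):
    """Reciprocal 1/p(z) of a power series with p[0] != 0, truncated."""
    inv = [0.0] * (n_max + 1)
    inv[0] = 1.0 / p[0]
    for n in range(1, n_max + 1):
        inv[n] = -sum(p[k] * inv[n - k] for k in range(1, n + 1)) / p[0]
    return inv


def count_probs(mu_A, r_head, k, n_max):
    """Return [f_0(k), ..., f_n_max(k)] for k >= 1 occurrences."""
    if k < 1:
        raise ValueError("this sketch only covers k >= 1")
    ell = len(r_head)
    r = (list(r_head) + [mu_A] * (n_max + 1))[:n_max + 1]
    r_inv = series_inv(r, n_max)              # 1 / r(z) = 1 - v(z)
    v = [-c for c in r_inv]
    v[0] += 1.0                               # v(z) = 1 - 1 / r(z)
    occ = ([0.0] * ell + [mu_A] * (n_max + 1))[:n_max + 1]  # mu z^ell / (1 - z)
    t_k = series_mul(occ, r_inv, n_max)       # t(z)
    for _ in range(k - 1):
        t_k = series_mul(t_k, v, n_max)       # t(z) v(z)^(k-1)
    geometric = [1.0] * (n_max + 1)           # 1 / (1 - z)
    return series_mul(series_mul(t_k, r_inv, n_max), geometric, n_max)


# Toy input: word "AA" with mu(A) = 0.25, r_0 = 1, r_1 = 0.5.
print(count_probs(0.25, [1.0, 0.5], 1, 4))
print(count_probs(0.25, [1.0, 0.5], 2, 4))
```

For this toy word the values can be checked by enumerating all 16 sequences of length 4: five of them contain exactly one (possibly overlapping) occurrence of "AA" and two contain exactly two.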

Hence, based on the score convolutions and under the above assumption, one can compute the corresponding generating functions as easily as for single words.

5.6 An Efficient Algorithm for Computing Overlap Probabilities
