5.5 Generating Function Formalism
An important idea of the previous approach is the summing of word probabilities into TF occurrence probabilities. Using this idea, we can easily derive generating functions for the number of TF occurrences. The formalism is based on the introduction to generating functions given in Chapter 4 following the exposition in Rahmann (2000). The advantage of such a formalism is that it lets us use all kinds of statistics derived for word counting based on generating functions (e.g., see Nicodeme et al., 1999; Stefanov et al., 2007). However, we have to restrict the sequence model to one strand, since two occurrences at one position are difficult to deal with. Furthermore, we assume that the word occurrence probabilities are equal.
However, if the assumption of an equi-probable nucleotide distribution is not fulfilled, the formalism still yields a reasonable approximation. For a non-uniform background distribution, the scores in the PSM are strongly influenced by the background distribution, since the PSM contains the likelihood ratios. Hence, positions with low background nucleotide frequencies only have high scores if they have high support from the PFM. In this case, there is a strong consensus at this position. Therefore, the set of compatible words usually contains words with similar occurrence probabilities $\mu(a)$, which justifies our assumption.
5.5.1 Waiting Time and Stopping Probabilities
We define an occurrence of TF A with length $\ell$ and set of compatible words A by the last position of the hit,
\[ Y_i(A) := \sum_{a \in A} Y_i(a), \]
where $Y_i(a)$ is equal to 1 if $i$ is the last position of an occurrence of $a$ and 0 otherwise. Note that here we use the fact that the events $\{Y_i(a) = 1\}_{a \in A}$ are disjoint. Thus, A is not allowed to be a multiset. Therefore, we have to remove the complementary strand from the analysis.
The waiting time $T_1(A)$ until the first occurrence of A is defined based on the word waiting times $T_1(a)$,
\[ T_1(A) := \min_{a \in A} T_1(a), \]
where $T_1(a)$ is the smallest index $i$ for which $Y_i(a) = 1$. We can compute the corresponding probabilities by introducing stopping probabilities for $a \in A$,
\[ w_n^{(a)} := P_\mu(T_1(A) = n,\ Y_n(a) = 1) = P_\mu(T_1(A) = n,\ T_1(a) = n). \]
Again, we use the disjointness of the events $\{Y_i(a) = 1\}_{a \in A}$. For $n < \ell$, the stopping probabilities are 0, since an occurrence cannot start before the beginning of the sequence. By the law of total probability, one obtains
\[ t_n := P_\mu(T_1(A) = n) = \sum_{a \in A} w_n^{(a)}. \]
However, we will see that it is difficult to compute the stopping probabilities without the assumption of an equi-probable nucleotide distribution.
Return Probabilities
First, we need to define return probabilities
\[ r_n^{(a,a')} := P_\mu(Y_{i+n}(a') = 1 \mid T_1(a) = i) = P_\mu(Y_{i+n}(a') = 1 \mid Y_i(a) = 1). \]
They are easy to compute since we know the overlap probabilities
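\[ r_n^{(a,a')} = \begin{cases} 1 & \text{if } n = 0 \text{ and } a = a', \\ 0 & \text{if } n = 0 \text{ and } a \neq a', \\ \varepsilon_n(a,a') \prod_{\kappa=\ell-n+1}^{\ell} \mu(a'_\kappa) & \text{if } 1 \le n < \ell, \\ \mu(a') & \text{otherwise,} \end{cases} \]
where the overlap bit $\varepsilon_n(a,a')$ is 1 if the last $\ell - n$ letters of $a$ equal the first $\ell - n$ letters of $a'$, i.e., $a_{n+1} = a'_1, \ldots, a_\ell = a'_{\ell-n}$, and 0 otherwise. The product then runs over the remaining $n$ letters of $a'$ that extend beyond $a$.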
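The case distinction for the return probabilities can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: the names `word_prob` and `return_prob` and the dictionary `mu` of background letter probabilities are hypothetical, words are plain strings of equal length, and the overlap test compares the last $\ell - n$ letters of $a$ with the first $\ell - n$ letters of $a'$, which matches the $n$ factors in the product.

```python
from fractions import Fraction


def word_prob(word, mu):
    """Occurrence probability of a word under an i.i.d. background mu."""
    p = Fraction(1)
    for letter in word:
        p *= mu[letter]
    return p


def return_prob(n, a, b, mu):
    """r_n^{(a,b)}: probability that b ends n positions after an occurrence of a.

    Both words are assumed to have the same length ell.
    """
    ell = len(a)
    if len(b) != ell:
        raise ValueError("words must have equal length")
    if n == 0:
        return Fraction(int(a == b))
    if n >= ell:
        # No overlap: b occurs independently of a.
        return word_prob(b, mu)
    # Overlap bit: last ell-n letters of a must equal first ell-n letters of b.
    if a[n:] != b[:ell - n]:
        return Fraction(0)
    # Only the n letters of b that extend beyond a are still random.
    return word_prob(b[ell - n:], mu)


# Toy example with a uniform DNA background (an assumption for illustration).
mu = {c: Fraction(1, 4) for c in "ACGT"}
r1 = return_prob(1, "ACA", "CAG", mu)   # overlap "CA", one free letter
r2 = return_prob(2, "ACA", "ACG", mu)   # overlap "A", two free letters
```

Exact fractions are used so that small cases can be checked by hand; floating-point probabilities would work the same way.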
Stopping Probabilities for Set of Words
Now, we can express the stopping probabilities in terms of return probabilities. We decompose the event of an occurrence by
\[ \{Y_n(a) = 1\} \equiv \bigcup_{i=0}^{n} \Bigl( \bigcup_{a' \in A} \{Y_n(a) = 1,\ T_1(A) = i,\ T_1(a') = i\} \Bigr). \]
Thus, the event of an occurrence of $a$ at $n$ is partitioned according to the position $i$ of the first occurrence of any word in A and the word $a'$ that occurs there. Of course, $i$ can also be $n$, which means that this occurrence of $a$ is itself the first occurrence. Conditioning on the waiting times directly yields for $a \in A$
\[ P_\mu(Y_n(a) = 1) = \sum_{i=0}^{n} \sum_{a' \in A} r_{n-i}^{(a',a)}\, w_i^{(a')}. \tag{5.15} \]
Hence, we obtain an $|A|$-dimensional system of linear equations. In general, one can show that the system has a unique solution (Rahmann, 2000) if A is a set (in contrast to a multiset). However, the problem for TFs is the large size of A. Therefore, we avoid solving this linear system by using our assumption.
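For a very small word set, the system in Eq. (5.15) can be solved directly, which makes its structure easier to see. Since the return probability at shift 0 is 1 for identical words and 0 otherwise, the term $i = n$ of Eq. (5.15) is just the stopping probability of $a$ at $n$, so the system can be solved forward in $n$. The following Python sketch does this and compares the result with a brute-force enumeration of all sequences. All names (`stopping_probs`, `stopping_probs_brute`, the toy background `mu`, and the word set) are assumptions made for this illustration, and the enumeration is only feasible for tiny cases.

```python
from fractions import Fraction
from itertools import product


def word_prob(word, mu):
    """Probability of a letter string under an i.i.d. background mu."""
    p = Fraction(1)
    for letter in word:
        p *= mu[letter]
    return p


def return_prob(n, a, b, mu):
    """r_n^{(a,b)} for two words of equal length."""
    ell = len(a)
    if n == 0:
        return Fraction(int(a == b))
    if n >= ell:
        return word_prob(b, mu)
    if a[n:] != b[:ell - n]:
        return Fraction(0)
    return word_prob(b[ell - n:], mu)


def stopping_probs(words, mu, n_max):
    """w[n][a] for n = 0..n_max, solving Eq. (5.15) forward in n."""
    ell = len(words[0])
    w = [{a: Fraction(0) for a in words} for _ in range(n_max + 1)]
    for n in range(ell, n_max + 1):
        for a in words:
            # Occurrences of a at n that follow an earlier first occurrence.
            earlier = sum(return_prob(n - i, b, a, mu) * w[i][b]
                          for i in range(ell, n) for b in words)
            w[n][a] = word_prob(a, mu) - earlier
    return w


def stopping_probs_brute(words, mu, n_max):
    """Same quantity by enumerating all sequences of length n."""
    ell = len(words[0])
    word_set = set(words)
    w = [{a: Fraction(0) for a in words} for _ in range(n_max + 1)]
    for n in range(ell, n_max + 1):
        for letters in product(sorted(mu), repeat=n):
            s = "".join(letters)
            # First end position (1-based) of any word of the set, if any.
            first = next((j for j in range(ell, n + 1)
                          if s[j - ell:j] in word_set), None)
            if first == n:
                w[n][s[n - ell:]] += word_prob(s, mu)
    return w


# Toy example with a non-uniform background (values chosen for illustration).
mu = {"A": Fraction(1, 2), "C": Fraction(1, 4),
      "G": Fraction(1, 8), "T": Fraction(1, 8)}
words = ["AC", "CA"]
w = stopping_probs(words, mu, 5)
```

The cost of the forward solution grows with the square of the number of words per position, which illustrates why the text avoids this route for realistic sets of compatible words.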
Stopping Probabilities under Equi-probable Sequence Model
In this case, the stopping probabilities $w_n^{(a)}$ are equal for all words. For all $a \in A$ and $n \ge 0$, we can therefore write
\[ \bar{w}_n := w_n^{(a)}. \]
Under this assumption, summing the system of linear equations in Eq. (5.15) over all $a \in A$ gives
\[ \sum_{a \in A} P_\mu(Y_n(a) = 1) = \sum_{i=0}^{n} \bar{w}_i \sum_{a \in A} \sum_{a' \in A} r_{n-i}^{(a',a)}. \tag{5.16} \]
The left hand side is obviously the probability of an occurrence of A. The right hand side sums the return probabilities over all possible word pairs. For the sums over the return probabilities, one obtains after changing the index n − i to n for convenience
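\[ \sum_{a \in A} \sum_{a' \in A} r_n^{(a',a)} = \sum_{a \in A} \sum_{a' \in A} P_\mu(Y_{i+n}(a) = 1 \mid Y_i(a') = 1) = \sum_{a' \in A} \frac{\sum_{a \in A} P_\mu(Y_{i+n}(a) = 1,\ Y_i(a') = 1)}{P_\mu(Y_i(a') = 1)}. \]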
With the above assumption that the word occurrence probabilities are equal (hence, $P_\mu(Y_i(a') = 1) = \mu(A)/|A|$), one obtains
\[ \sum_{a \in A} \sum_{a' \in A} r_n^{(a',a)} = |A|\, \mu(A)^{-1} \sum_{a' \in A} \sum_{a \in A} P_\mu(Y_{i+n}(a) = 1,\ Y_i(a') = 1). \]
The double sum on the right-hand side is the joint occurrence probability summed over all word pairs. Hence, we can substitute it by $\gamma_n(A)$ for $0 \le n < \ell$ and by $\mu(A)^2$ otherwise. The whole term then becomes $|A|\,\bar{\gamma}_n(A)$ with $\bar{\gamma}_n(A) = \gamma_n(A)/\mu(A)$ in the first case and $|A|\,\mu(A)$ in the second.
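Using the definition
\[ r_n := \begin{cases} \bar{\gamma}_n(A) & \text{if } 0 \le n < \ell, \\ \mu(A) & \text{otherwise,} \end{cases} \]
and the fact that $t_n = \sum_{a \in A} \bar{w}_n = |A|\,\bar{w}_n$, Eq. (5.16) becomes, for $n \ge \ell$,
\[ \mu(A) = \sum_{i=0}^{n} r_{n-i}\, t_i. \]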
Note that this equation no longer contains a sum over the compatible words, since the joint/conditional and occurrence probabilities can be computed by the score convolution. Furthermore, the number of compatible words $|A|$ cancels. The equation is similar to Eq. (4.4) in Chapter 4 for word counting. Hence, with the above assumption one can derive similar fundamental equations to develop further statistics. In terms of generating functions, this equation reads
\[ \frac{\mu(A)\, z^{\ell}}{1 - z} = r(z)\, t(z), \]
where $r(z) = \sum_{n \ge 0} r_n z^n$ and $t(z) = \sum_{n \ge 0} t_n z^n$. Dividing by $r(z)$ directly yields the equation for the waiting time.
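Read coefficient by coefficient, the generating function identity is a convolution that can be solved forward for the waiting time probabilities $t_n$, because $r_0 = 1$. The Python sketch below does this for a toy set of two words over a uniform two-letter alphabet, where all words are equi-probable. It is an illustration under assumptions: the function names are hypothetical, and $r_n$ is obtained here by averaging the pairwise return probabilities over all word pairs (which equals the definition of $r_n$ given in the text by the derivation above), whereas the text computes these quantities by score convolution.

```python
from fractions import Fraction


def pair_return_prob(n, a, b, p):
    """r_n^{(a,b)} under a uniform background with letter probability p."""
    ell = len(a)
    if n == 0:
        return Fraction(int(a == b))
    if n >= ell:
        return p ** ell
    return p ** n if a[n:] == b[:ell - n] else Fraction(0)


def averaged_return_prob(n, words, p):
    """r_n = (1/|A|) * sum of r_n^{(a',a)} over all word pairs."""
    total = sum(pair_return_prob(n, a, b, p) for a in words for b in words)
    return total / len(words)


def waiting_time_probs(words, p, n_max):
    """t_0..t_n_max from mu(A) [n >= ell] = sum_{i=0}^{n} r_{n-i} t_i."""
    ell = len(words[0])
    mu_A = len(words) * p ** ell
    r = [averaged_return_prob(n, words, p) for n in range(n_max + 1)]
    t = []
    for n in range(n_max + 1):
        lhs = mu_A if n >= ell else Fraction(0)
        # r[0] = 1, so t_n is lhs minus the contributions of earlier t_i.
        t.append(lhs - sum(r[n - i] * t[i] for i in range(n)))
    return t


# Toy example: first position at which two neighbouring letters differ.
t = waiting_time_probs(["AB", "BA"], Fraction(1, 2), 6)
```

In this toy case an occurrence of the set is simply a change of letter, so the waiting time is geometric, which gives an independent check of the recursion.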
5.5.2 Number of Counts
With the same assumption, one obtains formulae for the waiting time until the $k$th occurrence, as well as for the number of counts. Nothing changes compared with word counting, except that the return probabilities contain the conditional occurrence probabilities for A under our assumption. For the probability $v_n$ of the inter-arrival time between two successive occurrences, one obtains the corresponding generating function $v(z) = 1 - [r(z)]^{-1}$. The probability $t_n^{(k)}$ that the $k$th occurrence occurs at the $n$th position is given by
\[ t^{(k)}(z) := \sum_{n \ge 0} t_n^{(k)} z^n = t(z)\, [v(z)]^{k-1}. \]
Finally, the probability $f_n^{(k)}$ for $k$ occurrences in a sequence of length $n$ can be computed by
\[ f^{(k)}(z) := \sum_{n \ge 0} f_n^{(k)} z^n = t^{(k)}(z)\, \frac{1 - v(z)}{1 - z}. \]
Hence, based on the score convolutions and under the above assumption, one can compute the corresponding generating functions as easily as for single words.
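The generating functions of this section can be evaluated numerically with truncated power series. The following Python sketch computes the coefficients $f_n^{(k)}$ for $k \ge 1$ from a given list of return probabilities $r_0, r_1, \ldots$; it uses $1 - v(z) = [r(z)]^{-1}$. The helper names `mul`, `inverse`, and `count_probs` are hypothetical, and the input values belong to a toy case (the two words AB and BA over a uniform two-letter alphabet, so that $r_0 = 1$, $r_n = 1/2$ for $n \ge 1$, $\mu(A) = 1/2$, and $\ell = 2$) rather than to a real PSM.

```python
from fractions import Fraction


def mul(a, b):
    """Product of two power series, truncated to len(a) coefficients."""
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(len(a))]


def inverse(a):
    """Reciprocal of a power series with a[0] != 0, truncated to len(a)."""
    inv = [Fraction(1) / a[0]]
    for k in range(1, len(a)):
        inv.append(-sum(a[i] * inv[k - i] for i in range(1, k + 1)) / a[0])
    return inv


def count_probs(r, mu_A, ell, k):
    """Coefficients f_n^{(k)}, n = 0..len(r)-1, for k >= 1 occurrences."""
    n = len(r)
    one = [Fraction(1)] + [Fraction(0)] * (n - 1)
    geom = [Fraction(1)] * n                         # 1 / (1 - z)
    r_inv = inverse(r)                               # 1 / r(z) = 1 - v(z)
    numer = [mu_A if i >= ell else Fraction(0) for i in range(n)]
    t = mul(numer, r_inv)                            # mu(A) z^ell / ((1 - z) r(z))
    v = [o - x for o, x in zip(one, r_inv)]          # v(z) = 1 - 1 / r(z)
    tk = t
    for _ in range(k - 1):
        tk = mul(tk, v)                              # t(z) v(z)^(k-1)
    return mul(mul(tk, r_inv), geom)                 # times (1 - v(z)) / (1 - z)


# Toy input: r_0 = 1 and r_n = 1/2 for n >= 1, truncated after 8 coefficients.
r = [Fraction(1)] + [Fraction(1, 2)] * 7
f1 = count_probs(r, Fraction(1, 2), 2, 1)
f2 = count_probs(r, Fraction(1, 2), 2, 2)
```

In the toy case the number of occurrences in a sequence of length $n$ is the number of letter changes, which is binomially distributed with parameters $n - 1$ and $1/2$, so the coefficients can be checked against binomial probabilities.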