Absence Probability - Single Word Count Distribution

4.2 Single Word Count Distribution

4.2.1 Absence Probability

The absence probability $a_n$ is the probability that the word does not occur in a sequence of length $n$:

$$a_n := P_\mu\left(\sum_{i=1}^{n} Y_i = 0\right).$$

We can express the absence probability through the waiting time until the first occurrence: if the first occurrence lies after the $n$th position, the word is absent. Let $T_1$ be the random variable for the position of the first occurrence (more precisely, the position where the first hit ends). Due to the new definition of an occurrence, $T_1$ cannot be less than $\ell$. Defining $t_0 = 0$ and $t_n := P_\mu(T_1 = n)$ for $n > 0$, we obtain for the absence probability

$$a_n = P_\mu(T_1 > n) = 1 - P_\mu(T_1 \le n) = 1 - \sum_{i=0}^{n} t_i. \tag{4.2}$$

Thus, given the probability distribution of the first occurrence, the absence probability follows directly. To derive formulae for the waiting time, we first introduce the return probability in the next paragraph.

Return Probability

The return probability $r_n$ is the probability of observing an occurrence $n$ steps after another occurrence; further occurrences may lie in between. Thus, we can define, independently of $i$,

$$r_n := P_\mu(Y_{i+n} = 1 \mid Y_i = 1) = P_\mu(Y_{i+n} = 1 \mid T_1 = i).$$

Due to the i.i.d. property of the sequence model, the occurrence probabilities are independent of the waiting time if the occurrences do not overlap; thus, $P_\mu(Y_{i+n} = 1 \mid T_1 = i) = P_\mu(Y_{i+n} = 1) = \mu(w)$ for $n \ge \ell$. For overlapping occurrences, we can use the overlap bit $\delta_n$ defined in Eq. (3.6). Since the return probability for $n = 0$ is obviously 1, we obtain for the return probabilities

$$r_n = \begin{cases} 1 & \text{if } n = 0, \\ \delta_n \prod_{\kappa=\ell-n+1}^{\ell} \mu(w_\kappa) & \text{if } 1 \le n < \ell, \\ \mu(w) & \text{otherwise.} \end{cases}$$
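The case distinction for the first $\ell$ return probabilities can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `return_probabilities` and the dictionary `mu` of letter probabilities are hypothetical, and the overlap bit is assumed to be the usual self-overlap indicator (the word shifted by $n$ positions matches itself), since Eq. (3.6) is not reproduced in this section.

```python
from fractions import Fraction


def return_probabilities(word, mu):
    """Return [r_0, ..., r_{l-1}] for `word` under i.i.d. letter probabilities `mu`."""
    ell = len(word)
    r = []
    for n in range(ell):
        # Assumed overlap bit: the suffix of length ell - n equals the
        # prefix of length ell - n (always true for n = 0).
        overlaps = word[n:] == word[:ell - n]
        prob = Fraction(1) if overlaps else Fraction(0)
        if overlaps:
            # Multiply the probabilities of the last n letters of the word;
            # for n = 0 this is the empty product, so r_0 = 1.
            for letter in word[ell - n:]:
                prob *= mu[letter]
        r.append(prob)
    return r


# Illustrative values: two-letter alphabet with mu('A') = 1/3.
p = Fraction(1, 3)
r_head = return_probabilities("ACA", {"A": p, "C": 1 - p})
```

For `"ACA"` the list has a zero at $n = 1$ (no self-overlap at that shift) and $p(1 - p)$ at $n = 2$; for $n \ge \ell$ the return probability is simply $\mu(w)$ and is not stored.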

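Capturing the overlap in the correlation polynomial $c(z) := \sum_{n=0}^{\ell-1} \delta_n z^n$, where $\delta_n$ is the overlap bit, and the first $\ell$ return probabilities in $d(z) := \sum_{n=0}^{\ell-1} r_n z^n$, we obtain for the generating function $r(z) := \sum_{n \ge 0} r_n z^n$ the expression

$$r(z) = d(z) + \frac{\mu(w) z^\ell}{1 - z}. \tag{4.3}$$

Note that the second term is derived by shifting the generating function $\sum_{n \ge 0} \mu(w) z^n = \mu(w)/(1 - z)$ by $\ell$ positions, that is, by multiplying it by $z^\ell$.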


Figure 4.1: Decomposition of the event of an occurrence at position $n$ by the waiting time and the return probability. The upper row shows a waiting time of $\ell$ for the first occurrence and a returned occurrence after $n - \ell$ positions. In the second row, the waiting time is $\ell + 1$ and the return occurs after $n - \ell - 1$ steps. One has to consider all such events until the waiting time is $n$ (last row).

Waiting Time

The probability that the first occurrence lies at one of the first $\ell - 1$ positions is 0, since by the definition of the occurrence indicators $Y_i$ the first occurrence cannot end before position $\ell$. Hence, we obtain $t_0 = t_1 = \dots = t_{\ell-1} = 0$. Based on the results for the return probability, we can compute the remaining waiting time probabilities. We decompose the event $\{Y_n = 1\}$ of an occurrence at position $n$ in terms of the waiting time (see Fig. 4.1 for an illustration): for each $\ell \le i \le n$, the first occurrence is at position $i$ and another occurrence returns after $n - i$ positions. The case $i = n$ covers the possibility that there is no occurrence before $n$. Hence, we obtain $\{Y_n = 1\} \equiv \bigcup_{i=\ell}^{n} \{Y_n = 1, T_1 = i\}$. Conditioning on $T_1 = i$, one obtains for $n \ge \ell$

$$P_\mu(Y_n = 1) = \sum_{i=0}^{n} P_\mu(Y_n = 1 \mid T_1 = i)\, P_\mu(T_1 = i) = \sum_{i=0}^{n} r_{n-i}\, t_i, \tag{4.4}$$

where the terms with $i < \ell$ vanish because $t_i = 0$.

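Hence, we can compute $t_n$ recursively by solving the above equation for the last term of the sum. Since $P_\mu(Y_n = 1) = \mu(w)$ for $n \ge \ell$ and $r_0 = 1$, this gives

$$t_n = \mu(w) - \sum_{i=0}^{n-1} r_{n-i}\, t_i.$$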
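The recursion can be turned into a short program. The following is a minimal sketch under stated assumptions: the function name `waiting_time_probabilities` and its parameters are hypothetical, the first $\ell$ return probabilities are passed in as a list, and $r_n = \mu(w)$ is used for $n \ge \ell$ as stated above. The values for the word 'ACA' with $\mu(\text{'A'}) = 1/2$ are chosen only for illustration.

```python
from fractions import Fraction


def waiting_time_probabilities(r_head, mu_w, n_max):
    """Return [t_0, ..., t_{n_max}] via t_n = mu(w) - sum_{i<n} r_{n-i} t_i."""
    ell = len(r_head)

    def r(n):
        # First ell return probabilities are given; afterwards r_n = mu(w).
        return r_head[n] if n < ell else mu_w

    t = []
    for n in range(n_max + 1):
        if n < ell:
            # No occurrence can end before position ell.
            t.append(Fraction(0))
        else:
            t.append(mu_w - sum(r(n - i) * t[i] for i in range(n)))
    return t


# Illustrative input: word 'ACA' with mu('A') = 1/2, so
# r_0 = 1, r_1 = 0, r_2 = p(1 - p) and mu(w) = p^2 (1 - p).
p = Fraction(1, 2)
t = waiting_time_probabilities([Fraction(1), Fraction(0), p * (1 - p)],
                               p**2 * (1 - p), 5)
```

Exact fractions are used so that the result can be compared with closed-form expressions without rounding error; with floating-point inputs the same recursion applies unchanged.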

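Defining the generating function for the waiting time by $t(z) := \sum_{n \ge 0} t_n z^n$, we can express it by noting that the sum in Eq. (4.4) is the convolution of the coefficients of $r(z)$ and $t(z)$, i.e. the coefficient of $z^n$ in $r(z)\,t(z)$. Furthermore, both sides of Eq. (4.4) equal 0 for $n < \ell$. Hence, the left-hand side has the generating function $(1 - z)^{-1} \mu(w) z^\ell$, and we obtain

$$\frac{\mu(w) z^\ell}{1 - z} = r(z)\, t(z).$$

Solving for $t(z)$, we obtain

$$t(z) = \frac{\mu(w) z^\ell}{(1 - z)\, r(z)}. \tag{4.5}$$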
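As a short worked step using only Eq. (4.3) and the relation $\mu(w) z^\ell/(1 - z) = r(z)\,t(z)$: substituting $r(z) = d(z) + \mu(w) z^\ell/(1 - z)$ and multiplying numerator and denominator by $1 - z$ removes the nested fraction. This form is convenient when expanding a concrete word by hand.

$$t(z) = \frac{\mu(w) z^\ell}{(1 - z)\left(d(z) + \dfrac{\mu(w) z^\ell}{1 - z}\right)} = \frac{\mu(w) z^\ell}{(1 - z)\, d(z) + \mu(w) z^\ell}.$$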

Absence Probability

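In Eq. (4.2), the absence probability is defined in terms of the waiting time. Starting from $a_n = 1 - \sum_{i=0}^{n} t_i$ for a sequence of length $n$, we obtain the generating function $a(z) := \sum_{n \ge 0} a_n z^n$ of the absence probability: the constant sequence 1 has the generating function $(1 - z)^{-1}$, and the partial sums $\sum_{i=0}^{n} t_i$ have the generating function $t(z)(1 - z)^{-1}$. Hence, $a(z)$ is the product of $1 - t(z)$ and $(1 - z)^{-1}$:

$$a(z) = \frac{1 - t(z)}{1 - z}. \tag{4.6}$$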
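Read coefficient by coefficient, Eq. (4.6) says that $(1 - z)\,a(z) = 1 - t(z)$, i.e. $a_0 = 1 - t_0$ and $a_n - a_{n-1} = -t_n$ for $n \ge 1$. The sketch below checks this on a small input; the helper name `absence_from_waiting_times` is hypothetical, and the waiting-time values are those obtained from the recursion for $t_n$ for the word 'ACA' with $\mu(\text{'A'}) = 1/2$, used purely as an illustration.

```python
from fractions import Fraction
from itertools import accumulate


def absence_from_waiting_times(t):
    """Given [t_0, ..., t_N], return [a_0, ..., a_N] with a_n = 1 - sum_{i<=n} t_i."""
    return [1 - partial_sum for partial_sum in accumulate(t)]


# Illustrative waiting-time probabilities t_0..t_5 ('ACA', mu('A') = 1/2).
t = [Fraction(0), Fraction(0), Fraction(0),
     Fraction(1, 8), Fraction(1, 8), Fraction(3, 32)]
a = absence_from_waiting_times(t)

# Coefficientwise form of (1 - z) a(z) = 1 - t(z).
assert a[0] == 1 - t[0]
assert all(a[n] - a[n - 1] == -t[n] for n in range(1, len(t)))
```

The running sum makes the computation linear in the sequence length, which mirrors the factor $(1 - z)^{-1}$ in the generating function.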

Example 4.2. We consider the word $w = \text{'ACA'}$ over a two-letter alphabet with $\mu(\text{'A'}) = p$. Hence, the occurrence probability is $\mu(w) = p^2(1 - p)$. The correlation polynomial is given by $c(z) = 1 + z^2$ since the word overlaps itself at positions 0 and 2. The generating function for the return probability is

$$r(z) = 1 + p(1 - p)z^2 + \frac{p^2(1 - p)z^3}{1 - z}.$$

The generating function for the waiting time is

$$t(z) = \frac{p^2(1 - p)z^3}{(1 - z)\left(1 + p(1 - p)z^2 + \dfrac{p^2(1 - p)z^3}{1 - z}\right)} = \frac{p^2(1 - p)z^3}{1 - z - (p^2 - p)z^2 - (p - 2p^2 + p^3)z^3}.$$

For the generating function of the absence probability, $a(z) = (1 - t(z))/(1 - z)$, one obtains

$$a(z) = \frac{1 + (p - p^2)z^2}{1 - z + (p - p^2)z^2 + (-p + 2p^2 - p^3)z^3}.$$

Extracting the coefficients $[z^n]a(z)$ for $0 \le n < 8$ yields

$$\begin{aligned}
a_0 &= a_1 = a_2 = 1, \\
a_3 &= 1 - p^2 + p^3, \\
a_4 &= 1 - 2p^2 + 2p^3, \\
a_5 &= 1 - 3p^2 + 4p^3 - 2p^4 + p^5, \\
a_6 &= 1 - 4p^2 + 6p^3 - 3p^4 + p^6, \\
a_7 &= 1 - 5p^2 + 8p^3 - 4p^4 + p^7.
\end{aligned}$$

Obviously, the probability of no occurrence in a sequence of length less than 3 is 1, since the word is longer than the sequence. For a sequence of length 3, the absence probability is the complement of the occurrence probability: $a_3 = 1 - \mu(w)$. A sequence of length 4 can contain the word ending either at position 3 or at position 4. These two occurrences would have to overlap with a shift of 1, which the word does not allow; they cannot both occur, so the events are disjoint. Hence, we obtain $a_4 = 1 - 2\mu(w)$.

For longer sequences, it is more difficult to derive the probabilities manually; the generating function $a(z)$, however, contains all of them.
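The coefficients of Example 4.2 can be cross-checked by brute force for short sequences. The sketch below sums the probabilities of all sequences of length $n$ that do not contain the word; the function name `absence_by_enumeration` and the choice $p = 1/3$ are assumptions made for illustration only, and the approach is exponential in $n$, so it serves as a sanity check rather than a practical method.

```python
from fractions import Fraction
from itertools import product


def absence_by_enumeration(word, mu, n):
    """Sum the probabilities of all length-n sequences that do not contain `word`."""
    total = Fraction(0)
    for letters in product(mu, repeat=n):  # iterates over the alphabet (dict keys)
        seq = "".join(letters)
        if word not in seq:
            prob = Fraction(1)
            for letter in seq:
                prob *= mu[letter]  # i.i.d. model: multiply letter probabilities
            total += prob
    return total


# Illustrative parameter choice: mu('A') = p = 1/3, mu('C') = 1 - p.
p = Fraction(1, 3)
mu = {"A": p, "C": 1 - p}

# Closed-form coefficients a_3, ..., a_7 as listed in Example 4.2.
closed_form = {
    3: 1 - p**2 + p**3,
    4: 1 - 2 * p**2 + 2 * p**3,
    5: 1 - 3 * p**2 + 4 * p**3 - 2 * p**4 + p**5,
    6: 1 - 4 * p**2 + 6 * p**3 - 3 * p**4 + p**6,
    7: 1 - 5 * p**2 + 8 * p**3 - 4 * p**4 + p**7,
}
for n, expected in closed_form.items():
    assert absence_by_enumeration("ACA", mu, n) == expected
```

Because the arithmetic uses exact fractions, agreement here is an exact equality for the chosen $p$, not a numerical approximation.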


In document Statistics for Transcription Factor Binding Sites (Page 85-89)