Score Distribution - The Position Frequency Matrix (PFM) Model

2.4 The Position Frequency Matrix (PFM) Model

2.4.2 Score Distribution

A PSM yields a hit on a sequence if the score s reaches the threshold t. The significance of the hit is the probability of this event under a sequence model. Many articles deal with the efficient calculation of this probability (Staden, 1989; Claverie and Audic, 1996; Stormo, 2000; Wu et al., 2000; Rahmann et al., 2003; Beckstette et al., 2006; Touzet and Varré, 2007). Here, we derive the score distribution and give a dynamic programming algorithm following Beckstette et al. (2006).

The score for a word $w = w_1, \ldots, w_\ell$ is computed by summing the position-specific scores corresponding to the nucleotides of the word:

$$s(w) = \sum_{\kappa=1}^{\ell} \Psi_{\kappa, w_\kappa}. \tag{2.2}$$
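As a minimal sketch of Eq. (2.2), the following Python snippet sums the position-specific scores of a word. The matrix `PSM` and its integer scores are made up for illustration and are not the example motif of this chapter; the function name `score` is likewise an assumption.

```python
# Hypothetical 3-position PSM with made-up integer scores.
# Each list entry is one motif position; keys are nucleotides.
PSM = [
    {"A": 2, "C": -1, "G": -1, "T": -3},
    {"A": -2, "C": 3, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -2},
]


def score(psm, word):
    """Sum the position-specific scores Psi[kappa][w_kappa] over the word (Eq. 2.2)."""
    if len(word) != len(psm):
        raise ValueError("word length must equal the PSM length")
    return sum(column[nucleotide] for column, nucleotide in zip(psm, word))


print(score(PSM, "ACG"))  # 2 + 3 + 2 = 7
print(score(PSM, "TTT"))  # -3 - 1 - 2 = -6
```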

Given a sequence model instead of an actual word, the score is a random variable $S$. Similarly to Eq. (2.2), the score $S$ is the sum of the position-specific scores $S(\kappa)$, which are also random variables. Hence, the distribution of $S$, denoted by $\mathcal{L}(S)$, is given by the convolution of the distributions of the $S(\kappa)$:

$$\mathcal{L}(S) = \mathcal{L}(S(1)) * \cdots * \mathcal{L}(S(\ell)).$$

One can use a dynamic programming approach to compute $\mathcal{L}(S)$. The idea is simple: before starting the summation, we have a Dirac score distribution with all its probability weight at 0. In each step, we add the scores of the next position. Thus, the probability of score $s$ is obtained by combining, for each nucleotide $a$, the probability of the position-specific score $\Psi_{\kappa,a}$ with the probability of the score $s - \Psi_{\kappa,a}$ in the previous step. Using the Bernoulli sequence model with probabilities $\mu(a)$, this is

$$Q_0(s) := \begin{cases} 1 & \text{if } s = 0, \\ 0 & \text{else,} \end{cases} \qquad Q_\kappa(s) := \sum_{a \in \mathcal{A}} Q_{\kappa-1}(s - \Psi_{\kappa,a}) \cdot \mu(a).$$
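The recursion can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation of Beckstette et al. (2006): scores are assumed to be integers (real-valued scores would first have to be rounded or scaled), and the names `score_distribution`, `PSM` and `MU` as well as all numbers are hypothetical. The sketch pushes probability mass forward from each reachable score, which is equivalent to summing over $Q_{\kappa-1}(s - \Psi_{\kappa,a})$ for each target score.

```python
from collections import defaultdict


def score_distribution(psm, mu):
    """Return {score: probability} of the PSM score under an i.i.d. background mu."""
    q = {0: 1.0}  # Q_0: Dirac distribution, all probability weight at score 0
    for column in psm:  # one convolution step per motif position
        q_next = defaultdict(float)
        for s_prev, p_prev in q.items():
            for a, psi in column.items():
                # emitting nucleotide a moves the score from s_prev to s_prev + psi
                q_next[s_prev + psi] += p_prev * mu[a]
        q = dict(q_next)
    return q


# Hypothetical 2-position PSM and uniform background, for illustration only.
PSM = [
    {"A": 1, "C": 0, "G": 0, "T": -1},
    {"A": 1, "C": -1, "G": 0, "T": 0},
]
MU = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

q_background = score_distribution(PSM, MU)
for s in sorted(q_background):
    print(s, q_background[s])
```

Storing only reachable scores in a dictionary keeps the table small when few distinct sums occur; a dense array indexed by score would be an alternative when the score range is narrow.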

After the last step, $Q_\ell(s)$ contains the probability of observing score $s$. Hence, we can write $P_\mu(S = s) = Q_\ell(s)$. Replacing $\mu(a)$ in the equation for $Q_\kappa$ by a different nucleotide distribution yields the score distribution of $S$ under another model. Furthermore, we can also use position-dependent nucleotide distributions. In this way, it is straightforward to compute the score distribution under the motif model. The distribution of the nucleotides of the motif is given in the PFM Π. Thus, we can compute the distribution by

$$Q'_0(s) := Q_0(s), \qquad Q'_\kappa(s) := \sum_{a \in \mathcal{A}} Q'_{\kappa-1}(s - \Psi_{\kappa,a}) \cdot \pi_{\kappa,a}.$$

Hence, we have $P_\Pi(S = s) = Q'_\ell(s)$.
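The only change relative to the background case is that each position uses its own nucleotide distribution. A minimal Python sketch, again assuming integer scores and using made-up matrices (`PSM`, `PFM`) and a hypothetical function name:

```python
from collections import defaultdict


def score_distribution_motif(psm, pfm):
    """Return {score: probability} of the PSM score when position kappa
    emits nucleotide a with probability pfm[kappa][a] (motif model)."""
    q = {0: 1.0}  # Q'_0 = Q_0: all probability weight at score 0
    for column, pi in zip(psm, pfm):
        q_next = defaultdict(float)
        for s_prev, p_prev in q.items():
            for a, psi in column.items():
                # position-dependent probability pi[a] replaces mu[a]
                q_next[s_prev + psi] += p_prev * pi[a]
        q = dict(q_next)
    return q


# Hypothetical 2-position PSM and PFM, for illustration only.
PSM = [
    {"A": 1, "C": 0, "G": 0, "T": -1},
    {"A": 1, "C": -1, "G": 0, "T": 0},
]
PFM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]

q_motif = score_distribution_motif(PSM, PFM)
for s in sorted(q_motif):
    print(s, round(q_motif[s], 4))
```

In this toy case most of the probability mass sits at the highest scores, because the PFM favours the nucleotides that the PSM rewards.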

Type-I and II error probabilities

Based on these score distributions, we can compute type-I and type-II error probabilities. A type-I error occurs if the score reaches the threshold without an actual binding site at this position (false positive). Using the sequence model µ as the background model for a sequence without binding sites, we can compute the type-I error probability by

$$\alpha := P_\mu(S \ge t) = \sum_{s \ge t} Q_\ell(s).$$

Thus, α is the p-value or significance of a hit.
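Given a background score distribution as a table of probabilities, α is simply the upper tail sum. The sketch below assumes the distribution is stored as a `{score: probability}` dictionary; the table `Q_BACKGROUND` and the function name are hypothetical.

```python
def type_one_error(q_background, t):
    """alpha = P_mu(S >= t): background probability mass at or above threshold t."""
    return sum(p for s, p in q_background.items() if s >= t)


# Hypothetical background score distribution, for illustration only.
Q_BACKGROUND = {-2: 0.0625, -1: 0.25, 0: 0.375, 1: 0.25, 2: 0.0625}

print(type_one_error(Q_BACKGROUND, 1))  # mass of scores 1 and 2
print(type_one_error(Q_BACKGROUND, 3))  # no score reaches 3
```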

Likewise, we can compute the type-II error probability. Obtaining a score lower than the threshold at a position that is an actual binding site is a type-II error (false negative). Hence, we have to use the sequence model Π instead of µ and get for the type-II error probability

$$\beta := P_\Pi(S < t) = \sum_{s < t} Q'_\ell(s).$$
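Both error probabilities can be tabulated over all thresholds to see how they trade off. The following sketch assumes the two score distributions are available as `{score: probability}` dictionaries; the tables and the choice of the "balanced" threshold (the one minimising |α − β|) are illustrative assumptions, not a recommendation from the text.

```python
def type_one_error(q_background, t):
    """alpha = P_mu(S >= t): background mass at or above threshold t."""
    return sum(p for s, p in q_background.items() if s >= t)


def type_two_error(q_motif, t):
    """beta = P_Pi(S < t): motif-model mass strictly below threshold t."""
    return sum(p for s, p in q_motif.items() if s < t)


# Hypothetical score distributions, for illustration only.
Q_BACKGROUND = {-2: 0.0625, -1: 0.25, 0: 0.375, 1: 0.25, 2: 0.0625}
Q_MOTIF = {-2: 0.01, -1: 0.04, 0: 0.18, 1: 0.28, 2: 0.49}

# Candidate thresholds: every attainable score plus one value above the maximum.
thresholds = range(min(Q_BACKGROUND), max(Q_BACKGROUND) + 2)
for t in thresholds:
    alpha = type_one_error(Q_BACKGROUND, t)
    beta = type_two_error(Q_MOTIF, t)
    print(f"t={t:3d}  alpha={alpha:.4f}  beta={beta:.4f}")

# Threshold at which the two error probabilities are closest to each other.
balanced_t = min(
    thresholds,
    key=lambda t: abs(type_one_error(Q_BACKGROUND, t) - type_two_error(Q_MOTIF, t)),
)
print("balanced threshold:", balanced_t)
```

Raising the threshold lowers α and raises β, so the two curves cross at most once.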

Example 2.7. Here, we reconsider the example DNA motif (see Ex. 2.1). The upper panel of Fig. 2.2 contains the distributions of the PSM score under the Bernoulli sequence model µ as the background model and under the motif model Π. Under the background model, scores are generally lower, and scores higher than zero have very low probabilities. In contrast, under the motif model most scores are higher than zero, with increasing probabilities. The lower panel of Fig. 2.2 visualizes α and β. Both errors are almost equal at a score of 13, where α is 0.03 and β is 0.032.

[Figure 2.2. Upper panel: "Score Distributions", probability (0.0 to 0.4) against score (−300 to 100). Lower panel: cumulative probability (0.0 to 1.0) against score (−300 to 100).]

Figure 2.2: Score distribution for the example DNA motif (see Ex. 2.1) under the background model (blue) and the motif model (red). The lower panel contains the cumulative score distributions, where the background distribution is accumulated in reverse, from high to low scores. Hence, they correspond to the α (blue) and β (red) probabilities.

Rahmann et al. (2003) introduce the concept of power for PFMs based on the two score distributions. If both distributions can be well separated by a threshold, the PFM is said to have high power. Otherwise, the motif is weak, since it is hard to differentiate between the motif model and the background model. Surprisingly, only one fifth of the Transfac PFMs (Matys et al., 2003) are shown to have reasonable power (Rahmann et al., 2003).

In document Statistics for Transcription Factor Binding Sites (Page 33-35)