Threshold Selection - The Position Frequency Matrix (PFM) Model

2.4 The Position Frequency Matrix (PFM) Model

2.4.3 Threshold Selection

Previously, we have seen that the significance of a hit depends on the threshold t. Thus, selection of an appropriate threshold is crucial. Here, we introduce five different threshold selection methods and give examples for some of them.

Type-I Error Probability. Instead of deducing the type-I error probability from the threshold, one can pre-define the type-I error probability $\alpha$ and compute the corresponding threshold (Rahmann et al., 2003; Touzet and Varré, 2007). However, the expected number of false positives on a sequence of length $n$ is $n \cdot \alpha$. Hence, one should account for multiple testing. Therefore, we control the probability $\alpha_n$ of finding at least one false positive on a sequence of length $n$. We obtain $\alpha_n \approx 1 - (1 - \alpha)^n \approx 1 - \exp(-n\alpha)$, where the first approximation is due to the fact that overlaps are ignored. Instead of using the actual sequence length, we always set $n = 500$ heuristically to get a threshold independent of the actual sequence length. We obtain a threshold which has a clear statistical background and also restricts the number of false positives independently of the motif.
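The two expressions for the probability of at least one false positive can be compared numerically. The following is a minimal sketch; the function name and the chosen values of the per-position error probability are illustrative assumptions, not taken from the text.

```python
import math

def prob_at_least_one_fp(alpha, n=500):
    """Return both forms of alpha_n for a per-position type-I error alpha.

    The first value treats the n positions as independent (overlaps ignored),
    the second is the exponential approximation of the first.
    """
    independent = 1.0 - (1.0 - alpha) ** n
    exponential = 1.0 - math.exp(-n * alpha)
    return independent, exponential

# Illustrative values of alpha (assumed, not from the text).
for alpha in (1e-4, 1e-3, 1e-2):
    independent, exponential = prob_at_least_one_fp(alpha)
    print(f"alpha={alpha:g}  1-(1-alpha)^n={independent:.4f}  "
          f"1-exp(-n*alpha)={exponential:.4f}")
```

For small per-position error probabilities the two values are nearly identical, which is why the exponential form is a convenient stand-in.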

Due to the discrete nature of the score, one usually cannot obtain a threshold which exactly corresponds to the pre-defined type-I error. Therefore, one could ‘control’ the type-I error such that the pre-defined type-I error is never exceeded. Unfortunately, for very low type-I errors which cannot be attained by a motif, this leads to a threshold which cannot be reached by any word. Hence, we ‘bound’ the type-I error probability instead: the threshold $t$ is chosen such that the type-I error probability of the next higher threshold $t + 1$ is less than or equal to the pre-defined type-I error probability.
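The difference between ‘controlling’ and ‘bounding’ the type-I error can be sketched for integer scores as follows. The score matrix `TOY_PSM`, the uniform background, and all function names are hypothetical and chosen only for illustration; they are not the matrix of the running example.

```python
import math
from collections import defaultdict

# Hypothetical integer score matrix, one dict per motif position (assumption).
TOY_PSM = [
    {"A": 2, "C": -1, "G": -1, "T": -3},
    {"A": -2, "C": 3, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -2},
]
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def score_pmf(psm, letter_probs):
    """Score distribution when every position is drawn independently."""
    pmf = {0: 1.0}
    for column in psm:
        nxt = defaultdict(float)
        for score, p in pmf.items():
            for letter, s in column.items():
                nxt[score + s] += p * letter_probs[letter]
        pmf = dict(nxt)
    return pmf

def alpha_n(pmf, t, n=500):
    """Probability of at least one score >= t in n positions (overlaps ignored)."""
    alpha = sum(p for s, p in pmf.items() if s >= t)
    return 1.0 - math.exp(-n * alpha)

def control_threshold(pmf, level, n=500):
    """Smallest integer threshold whose alpha_n does not exceed the level."""
    t = min(pmf)
    while alpha_n(pmf, t, n) > level:
        t += 1
    return t

def bound_threshold(pmf, level, n=500):
    """Threshold t for which t + 1 is the first threshold meeting the level."""
    return control_threshold(pmf, level, n) - 1

background = score_pmf(TOY_PSM, UNIFORM)
print("highest attainable score:", max(background))
print("control:", control_threshold(background, 0.1))
print("bound:  ", bound_threshold(background, 0.1))
```

For this short toy motif the ‘control’ rule returns a threshold above the highest attainable score, so no word could ever be a hit, whereas the ‘bound’ rule falls back to the highest score that a word can still reach.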

Type-II Error Probability. In general, one can also pre-define the type-II error probability $\beta$ instead of the type-I error. For motifs with weak power, this might lead to a large number of false positives and insignificant hits. Therefore, this threshold selection method plays only a minor role in practice, especially since it can be nicely combined with the type-I error, as shown in the next method.

Balanced Error. A reasonable threshold restricts the number of false positives as well as the number of false negatives. Therefore, we can combine the type-I and the type-II error by setting $t = t_{bal}$ such that $\alpha_{500} = \beta$ and call it the ‘balanced threshold’ (Rahmann et al., 2003). Again, the discrete nature of the score prevents both probabilities from being exactly equal. Hence, we set the threshold such that $\beta < \alpha_{500}$ and, for the next higher threshold $t + 1$, the inequality $\beta > \alpha_{500}$ holds.
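The balanced rule can be sketched as a search over integer thresholds. The two score distributions below are hypothetical numbers chosen for illustration, and the function names are assumptions; the type-II error is taken as the probability that a motif-generated word scores below the threshold.

```python
import math

# Hypothetical score distributions over integer scores (illustrative only).
BACKGROUND_PMF = {-4: 0.9, -1: 0.09889, 2: 0.001, 5: 0.0001, 8: 0.00001}
MOTIF_PMF = {-4: 0.02, -1: 0.08, 2: 0.20, 5: 0.30, 8: 0.40}

def alpha_n(bg_pmf, t, n=500):
    """Probability of at least one background score >= t in n positions."""
    alpha = sum(p for s, p in bg_pmf.items() if s >= t)
    return 1.0 - math.exp(-n * alpha)

def beta(motif_pmf, t):
    """Probability that a motif-generated word scores below t."""
    return sum(p for s, p in motif_pmf.items() if s < t)

def balanced_threshold(bg_pmf, motif_pmf, n=500):
    """Largest integer t with beta(t) < alpha_n(t).

    beta never decreases and alpha_n never increases as t grows, so the
    inequality flips at most once along the scan.
    """
    lo = min(min(bg_pmf), min(motif_pmf))
    hi = max(max(bg_pmf), max(motif_pmf)) + 1
    best = lo
    for t in range(lo, hi + 1):
        if beta(motif_pmf, t) < alpha_n(bg_pmf, t, n):
            best = t
    return best

t_bal = balanced_threshold(BACKGROUND_PMF, MOTIF_PMF)
print("balanced threshold:", t_bal)
print("at t_bal:    ", beta(MOTIF_PMF, t_bal), alpha_n(BACKGROUND_PMF, t_bal))
print("at t_bal + 1:", beta(MOTIF_PMF, t_bal + 1), alpha_n(BACKGROUND_PMF, t_bal + 1))
```

If both probabilities happen to coincide exactly at some threshold, this sketch returns the threshold just below it; how such ties should be resolved is not specified in the text.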

Type-I Extended Error. The balanced threshold can lead to very high numbers of false positives if the power of the PFM is weak. Therefore, we introduce a new threshold selection method (Pape et al., 2006). The threshold is set to the balanced threshold if a pre-defined type-I error probability is not exceeded. Otherwise, the type-I error is used for threshold selection (as described earlier). Hence, one achieves a good balance between type-I and type-II error while ensuring a small number of false positives.
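The decision rule itself is a simple fallback. A minimal sketch follows, assuming the balanced threshold, the type-I threshold, and the probability of at least one false positive at the balanced threshold have already been computed; the function and parameter names are hypothetical, and the numbers in the usage lines are those reported in Example 2.8.

```python
def type1_extended_threshold(balanced_t, type1_t, alpha_n_at_balanced, level):
    """Keep the balanced threshold unless its alpha_n exceeds the level."""
    if alpha_n_at_balanced <= level:
        return balanced_t
    return type1_t

# Values from Example 2.8: balanced t = 107 with alpha_500 = 0.77,
# type-I threshold t = 115, pre-defined level 0.1.
print(type1_extended_threshold(107, 115, 0.77, 0.1))

# With a hypothetical, more permissive level the balanced threshold is kept.
print(type1_extended_threshold(107, 115, 0.77, 0.8))
```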

Number of Compatible Words. Another new threshold selection method pre-defines the number of compatible words. This can be useful for analyses comparing two PFMs. One can achieve this by using the type-I error selection method (using $\alpha$ instead of $\alpha_{500}$) based on an equi-probable sequence model. Since all words have the same probability, the type-I error probability corresponds to the ratio of the number of compatible words to the number of all possible words.
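For short motifs, the correspondence between compatible words and the type-I error under an equi-probable model can be checked by brute-force enumeration. The sketch below uses a hypothetical toy score matrix and assumed function names; `threshold_for_word_count` is one simple reading of "pre-defining the number of compatible words" and may return more than the requested number of words when scores are tied.

```python
from itertools import product

ALPHABET = "ACGT"

# Hypothetical integer score matrix, one dict per motif position (assumption).
TOY_PSM = [
    {"A": 2, "C": -1, "G": -1, "T": -3},
    {"A": -2, "C": 3, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -2},
]

def word_score(psm, word):
    """Sum of the position-specific scores of a word."""
    return sum(column[letter] for column, letter in zip(psm, word))

def compatible_words(psm, t):
    """All words of motif length whose score reaches the threshold t."""
    return ["".join(w) for w in product(ALPHABET, repeat=len(psm))
            if word_score(psm, w) >= t]

def threshold_for_word_count(psm, k):
    """Highest threshold that admits at least k compatible words."""
    scores = sorted((word_score(psm, w)
                     for w in product(ALPHABET, repeat=len(psm))),
                    reverse=True)
    return scores[k - 1]

words = compatible_words(TOY_PSM, 4)
alpha_uniform = len(words) / len(ALPHABET) ** len(TOY_PSM)
print(words, alpha_uniform)
print(threshold_for_word_count(TOY_PSM, 1))
```

Enumeration grows as 4 to the power of the motif length, so this is only practical for short motifs.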

Example 2.8. Again, considering the example DNA motif (see Ex. 2.1), we can compute the different thresholds. Figure 2.3 contains the trajectories for $\alpha$, $\beta$ and $\alpha_{500}$ for all thresholds. Shown in green is the probability $\alpha_{500}$ of at least one false positive on a sequence of length 500. For low thresholds ($t < 0$), the probability is almost 1. For higher thresholds, the probability drops and finally reaches around 40%.

[Figure 2.3 plot, "Error Probabilities": cumulative probability (0.0 to 1.0) plotted against the threshold (axis ticks from −300 to 100).]

Figure 2.3: Error probabilities $\alpha$ and $\beta$ for the example DNA motif (see Ex. 2.1) using the background model (blue) and the motif model (red). The green points correspond to the probability $\alpha_{500}$ of at least one false positive on a sequence of length 500. The solid line indicates the ‘balanced’ threshold, while the dashed line corresponds to a ‘type-I’ error probability ‘bound’ at 10%.

Figure 2.3 also contains the type-I threshold for a level of 10%. The threshold is at $t = 115$, where we have $\alpha = 0.00098$, $\alpha_{500} = 0.39$ and $\beta = 0.69$. At first glance, it seems that the pre-defined type-I error probability of 0.1 is not really respected since $\alpha_{500} = 0.39$. However, the next higher threshold $t + 1 = 116$ yields the error probabilities $\alpha = 0$, $\alpha_{500} = 0$ and $\beta = 1$. For this threshold, no word would have a score reaching the threshold. Hence, the threshold $t = 115$ is the highest reasonable threshold. This is the reason for setting the type-I error threshold such that the next higher threshold is below the pre-defined error probability. In practice, such extreme examples (a large difference between the pre-defined and the resulting type-I error probability) do not occur since the motifs are longer and the PCM contains more distinct values.

The balanced threshold is also depicted in Fig. 2.3. At $t = 107$, it is slightly lower than the type-I threshold at $t = 115$. For the balanced threshold, we obtain $\alpha = 0.0029$, $\alpha_{500} = 0.77$ and $\beta = 0.28$. Considering the next higher threshold $t + 1 = 108$, the error probabilities are $\alpha = 0.00098$, $\alpha_{500} = 0.39$ and $\beta = 0.69$. Hence, the balanced threshold fulfils $\alpha_{500} > \beta$ and the next higher threshold leads to $\alpha_{500} < \beta$. Applying the type-I extended threshold, one would disregard the balanced threshold since the corresponding type-I error probability exceeds 0.1 and, instead, use the type-I threshold of $t = 115$. Furthermore, the relative number of compatible words is equal to $\alpha$ because we use a GC content of 50%. Since $0.00098 \cdot 4^5 \approx 1$, only one word (‘GCCAA’) is in the set of compatible words. This can be verified by applying the threshold $t = 115$ to the PSM in Ex. 2.4. However, the set of compatible words is usually much larger since real motifs are longer.
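The reported numbers can be re-derived from the rounded per-position error probabilities with a few lines of arithmetic. This is only a consistency check on the values given in the example, using the exponential approximation with a sequence length of 500; the function names are assumptions.

```python
import math

def alpha_500(alpha, n=500):
    """Probability of at least one false positive, 1 - exp(-n * alpha)."""
    return 1.0 - math.exp(-n * alpha)

def expected_compatible_words(alpha, motif_length=5):
    """alpha times the number of possible words, for an equi-probable model."""
    return alpha * 4 ** motif_length

# Rounded alpha values as reported for the two thresholds in the example.
for label, alpha in (("t = 107", 0.0029), ("t = 115", 0.00098)):
    print(label,
          "alpha_500 =", round(alpha_500(alpha), 2),
          "compatible words ~", round(expected_compatible_words(alpha), 2))
```

Because the inputs are rounded, the word counts come out close to, but not exactly at, whole numbers.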

In document Statistics for Transcription Factor Binding Sites (Page 35-38)