5.3.1 Alphabets and strings

An alphabet is a nonempty set $\Sigma$ of symbols or characters, and a string over $\Sigma$ is a finite sequence of elements of $\Sigma$. We write $\Sigma^*$ for the set of all strings over the alphabet $\Sigma$, and $|S|$ for the length of the string $S$.

Given a string $S$ and integers $i, j$ such that $0 < i \le j \le |S|$, we will write $S[i]$ for the $i$th character of $S$, and $S[i, j]$ for the substring consisting of the $i$th to $j$th characters of $S$. Given a second string $T$, the concatenation of $S$ and $T$ is the string $ST$, where

\[
(ST)[i] =
\begin{cases}
S[i] & \text{if } i \le |S|, \\
T[i - |S|] & \text{if } i > |S|.
\end{cases}
\]
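The 1-based indexing convention above differs from the 0-based indexing of most programming languages, so a small sketch may help. The function names `char_at`, `substring` and `concat_char` are illustrative only and do not come from the text.

```python
def char_at(S: str, i: int) -> str:
    """Return S[i] in the text's 1-based notation (valid for 0 < i <= |S|)."""
    if not 0 < i <= len(S):
        raise IndexError("need 0 < i <= |S|")
    return S[i - 1]


def substring(S: str, i: int, j: int) -> str:
    """Return S[i, j]: the i-th to j-th characters of S, both inclusive."""
    if not 0 < i <= j <= len(S):
        raise IndexError("need 0 < i <= j <= |S|")
    return S[i - 1:j]


def concat_char(S: str, T: str, i: int) -> str:
    """Return (ST)[i] via the piecewise definition, without building ST."""
    if i <= len(S):
        return char_at(S, i)
    return char_at(T, i - len(S))


S, T = "GATT", "ACA"
print(substring(S, 2, 3))  # AT
# The piecewise definition agrees with ordinary concatenation at every index.
print(all(concat_char(S, T, i) == char_at(S + T, i)
          for i in range(1, len(S) + len(T) + 1)))  # True
```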

In applications to DNA sequences, $\Sigma$ is typically the set $\{A, G, C, T\}$, and we will use this alphabet in examples. However, our algorithm is not restricted to this case.

5.3.2 The edit distance

In order to compare two strings $X$ and $Y$ it is useful to have some measure of the extent to which they differ. For the purposes of this chapter we will use the edit distance, where the edit operations we permit are the insertion of a single character; the substitution of a single character; or the deletion of a single character.

Given a set of allowed edit operations, such as those listed above, the edit distance from $X$ to $Y$, $d(X, Y)$, is the minimum number of allowable edit operations needed to transform $X$ into $Y$. With the choice of permitted edit operations made above, it is straightforward to verify that $d$ is a metric.
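For the three unit-cost operations listed above, the edit distance can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch of that textbook method, not necessarily the procedure used elsewhere in this chapter; the function name `edit_distance` is an assumption made for illustration.

```python
def edit_distance(X: str, Y: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to transform X into Y."""
    # prev[j] holds the distance between the current prefix of X and Y[:j].
    prev = list(range(len(Y) + 1))  # empty prefix of X: j insertions
    for i, a in enumerate(X, start=1):
        curr = [i]  # empty prefix of Y: i deletions
        for j, b in enumerate(Y, start=1):
            curr.append(min(
                prev[j] + 1,              # delete a from X
                curr[j - 1] + 1,          # insert b into X
                prev[j - 1] + (a != b),   # substitute a by b, or match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))   # 1 (delete C)
print(edit_distance("ACGT", "AGGT"))  # 1 (substitute C by G)
```

The function uses two rows of the table at a time, so memory is proportional to the length of `Y` while time is proportional to the product of the two lengths.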

5.3.3 Tandem repeats and nested tandem repeats

An exact tandem repeat is a string of the form $X^l$ for some $l \ge 2$. Thus, an exact tandem repeat is a string composed of two or more contiguous exact copies of the same substring $X$. This substring is called the motif of the tandem repeat. We obtain an approximate tandem repeat by allowing approximate rather than exact copies of the template motif $X$. More precisely, an approximate tandem repeat is a string of the form $X_1 X_2 \cdots X_l$, where $d(X, X_i) \le k|X|$ for each $i$, for some fixed $k < 1$ and template motif $X$. Where the value of the parameter $k$ is important we may say that we have a $k$-approximate tandem repeat ($k$-TR). For simplicity of notation, we will write $\tilde{X}^l$ to mean an approximate tandem repeat consisting of $l$ approximate copies of $X$.
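As an illustration of the definition, the sketch below checks whether a given list of copies forms a k-approximate tandem repeat of a given motif. It assumes the decomposition into copies is already known (finding one is a separate problem), assumes at least two copies are required as in the exact case, and assumes k is nonnegative. The names `edit_distance` and `is_k_approximate_tandem_repeat` are hypothetical.

```python
def edit_distance(X: str, Y: str) -> int:
    """Unit-cost edit distance (insertion, deletion, substitution)."""
    prev = list(range(len(Y) + 1))
    for i, a in enumerate(X, start=1):
        curr = [i]
        for j, b in enumerate(Y, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]


def is_k_approximate_tandem_repeat(copies, motif: str, k: float) -> bool:
    """Check d(motif, copy) <= k * |motif| for every copy in the list."""
    if not motif or not 0 <= k < 1:
        raise ValueError("need a nonempty motif and 0 <= k < 1")
    if len(copies) < 2:  # assumption: at least two copies, as for exact repeats
        return False
    bound = k * len(motif)
    return all(edit_distance(motif, c) <= bound for c in copies)


copies = ["ACGT", "ACGA", "AGT"]  # each copy is within one edit of ACGT
print(is_k_approximate_tandem_repeat(copies, "ACGT", 0.25))  # True
print(is_k_approximate_tandem_repeat(copies, "ACGT", 0.2))   # False
```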

Given two motifs $X$ and $x$ such that $d(X, x) > 0$, an exact nested tandem repeat is a string of the form

\[
x^{s_0} X^{t_0} x^{s_1} X^{t_1} \cdots x^{s_n} X^{t_n},
\]

where $n > 1$, $s_i \ge 1$ for each $i > 0$, and $t_i \ge 1$ for each $i < n$. We again obtain an approximate nested tandem repeat by allowing the copies of the motifs $X$ and $x$ to be approximate rather than exact. Thus, an approximate nested tandem repeat is a string of the form

\[
\tilde{x}^{s_0} \tilde{X}^{t_0} \tilde{x}^{s_1} \tilde{X}^{t_1} \cdots \tilde{x}^{s_n} \tilde{X}^{t_n},
\]

where $n > 1$, $s_i \ge 1$ for each $i > 0$, and $t_i \ge 1$ for each $i < n$, and such that $\tilde{x}^{s_0} \tilde{x}^{s_1} \cdots \tilde{x}^{s_n}$ is an approximate tandem repeat with motif $x$, and $\tilde{X}^{t_0} \tilde{X}^{t_1} \cdots \tilde{X}^{t_n}$ is an approximate tandem repeat with motif $X$.

Note that the definition of an approximate nested tandem repeat includes exact nested tandem repeats as a special case. “Nested tandem repeat” or “NTR” by itself will always mean an approximate nested tandem repeat, unless explicitly stated otherwise.
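To make the index conditions concrete, the following sketch builds an exact nested tandem repeat from two motifs and the exponent sequences s_0, ..., s_n and t_0, ..., t_n. It assumes the two motifs must at least be distinct, and it allows s_0 and t_n to be zero because the definition constrains only s_i for i > 0 and t_i for i < n. The function name `exact_ntr` is hypothetical.

```python
def exact_ntr(x: str, X: str, s: list, t: list) -> str:
    """Build x^{s_0} X^{t_0} x^{s_1} X^{t_1} ... x^{s_n} X^{t_n}."""
    if not x or not X or x == X:
        raise ValueError("motifs must be nonempty and distinct (assumption)")
    if len(s) != len(t):
        raise ValueError("s and t must both have n + 1 entries")
    n = len(s) - 1
    if n <= 1:
        raise ValueError("need n > 1")
    if s[0] < 0 or any(si < 1 for si in s[1:]):
        raise ValueError("need s_i >= 1 for each i > 0")
    if t[-1] < 0 or any(ti < 1 for ti in t[:-1]):
        raise ValueError("need t_i >= 1 for each i < n")
    # Interleave the runs of x and X in order of increasing i.
    return "".join(x * si + X * ti for si, ti in zip(s, t))


# n = 2, with a final run of X of length zero.
print(exact_ntr("AC", "GGT", [1, 2, 1], [1, 1, 0]))  # ACGGTACACGGTAC
```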

Remark. The definition of an NTR given here is slightly more general than that given in Chapter 4. In Chapter 4, a nested tandem repeat is required to satisfy $t_i \le 1$ for each $i$.

5.3.4 Alignment

Given an alphabet $\Sigma$, let $\bar{\Sigma}$ be the alphabet $\Sigma \cup \{-\}$, where “$-$” (“gap”) is a character that does not belong to $\Sigma$. We define $\phi\colon \bar{\Sigma}^* \to \Sigma^*$ to be the function that deletes all gaps. Given two strings $A, B \in \Sigma^*$, an alignment of $A$ and $B$ is a choice of a pair of strings $(\bar{A}, \bar{B}) \in \bar{\Sigma}^* \times \bar{\Sigma}^*$ satisfying the following conditions:

A1. $\phi(\bar{A}) = A$ and $\phi(\bar{B}) = B$;

A2. $|\bar{A}| = |\bar{B}|$; and

A3. there is no index $i$ for which $\bar{A}[i] = \bar{B}[i] = -$.

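Thus, $\bar{A}$ and $\bar{B}$ are obtained from $A$ and $B$ respectively by inserting gaps in such a way that the resulting strings have the same length, and do not both have a gap in the same position.

\[
\begin{array}{c|rrrrr}
\sigma & - & A & C & G & T \\
\hline
- & -\infty & -2 & -2 & -2 & -2 \\
A & -2 & 1 & -1 & -1 & -1 \\
C & -2 & -1 & 1 & -1 & -1 \\
G & -2 & -1 & -1 & 1 & -1 \\
T & -2 & -1 & -1 & -1 & 1
\end{array}
\]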

Table 5.1: A sample scoring matrix for DNA sequences. This matrix rewards matching characters from $\Sigma$ with a score of $+1$, and penalises mismatching characters from $\Sigma$ with a score of $-1$. The penalty for aligning a gap against a character from $\Sigma$ is $-2$. The value $\sigma(-,-) = -\infty$ reflects condition A3, which prohibits a gap being aligned against a gap.
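Conditions A1 to A3 translate directly into a small validity check. The sketch below assumes the gap character is written as an ASCII hyphen and that the alphabet itself does not contain a hyphen; the names `strip_gaps` (playing the role of the gap-deleting function) and `is_alignment` are illustrative.

```python
GAP = "-"  # assumption: ASCII hyphen stands for the gap character


def strip_gaps(s: str) -> str:
    """The gap-deleting map: remove every gap character from s."""
    return s.replace(GAP, "")


def is_alignment(A_bar: str, B_bar: str, A: str, B: str) -> bool:
    """Return True if (A_bar, B_bar) is an alignment of A and B."""
    a1 = strip_gaps(A_bar) == A and strip_gaps(B_bar) == B   # condition A1
    a2 = len(A_bar) == len(B_bar)                            # condition A2
    a3 = all(not (a == GAP and b == GAP)                     # condition A3
             for a, b in zip(A_bar, B_bar))
    return a1 and a2 and a3


print(is_alignment("ACGT", "A-GT", "ACGT", "AGT"))    # True
print(is_alignment("ACGT-", "A-GT-", "ACGT", "AGT"))  # False (A3 fails)
print(is_alignment("ACGT", "AGT", "ACGT", "AGT"))     # False (A2 fails)
```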

To score an alignment we use a scoring matrix $\sigma$, which specifies the reward or penalty for aligning any two characters of $\bar{\Sigma}$ against each other. See Table 5.1 for an example. We will assume throughout that $\sigma$ penalises gaps (that is, $\sigma(-, \alpha)$ and $\sigma(\alpha, -)$ are both negative for all $\alpha \in \Sigma$), and we set $\sigma(-,-) = -\infty$ to reflect condition A3 above. Given an alignment $(\bar{A}, \bar{B})$ for which $|\bar{A}| = |\bar{B}| = L$, the alignment score of $(\bar{A}, \bar{B})$ is then defined to be

\[
\sigma(\bar{A}, \bar{B}) = \sum_{i=1}^{L} \sigma(\bar{A}[i], \bar{B}[i]).
\]
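As a worked instance of the alignment score, the sketch below encodes the sample scoring matrix of Table 5.1 as a function and sums it column by column. The ASCII hyphen as gap character and the names `sigma` and `alignment_score` are assumptions made for illustration.

```python
import math

GAP = "-"  # assumption: ASCII hyphen stands for the gap character


def sigma(a: str, b: str) -> float:
    """Sample scoring matrix of Table 5.1."""
    if a == GAP and b == GAP:
        return -math.inf       # gap against gap is prohibited (condition A3)
    if a == GAP or b == GAP:
        return -2              # gap against a character
    return 1 if a == b else -1  # match or mismatch


def alignment_score(A_bar: str, B_bar: str) -> float:
    """Sum of sigma over the columns of an alignment of common length L."""
    if len(A_bar) != len(B_bar):
        raise ValueError("aligned strings must have the same length")
    return sum(sigma(a, b) for a, b in zip(A_bar, B_bar))


# Columns: A/A (+1), C/- (-2), G/G (+1), T/T (+1).
print(alignment_score("ACGT", "A-GT"))  # 1
```

A pair of strings that violates condition A3 receives a score of negative infinity, so it can never be chosen when maximising the score.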

An optimal global alignment is an alignment of $A$ and $B$ which maximises the alignment score over all such alignments. See (Navarro, 1999) for a survey of this and other alignment problems.
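The optimal global alignment score can be computed by a standard dynamic-programming recurrence, often attributed to Needleman and Wunsch. The sketch below is one such textbook formulation, not necessarily the one used later in this chapter. It assumes the scores of Table 5.1 (match +1, mismatch -1, gap -2), and the names `MATCH`, `MISMATCH`, `GAP_SCORE` and `optimal_global_score` are hypothetical.

```python
MATCH, MISMATCH, GAP_SCORE = 1, -1, -2  # values taken from Table 5.1


def optimal_global_score(A: str, B: str) -> int:
    """Maximum alignment score over all global alignments of A and B."""
    # prev[j] is the best score for aligning the current prefix of A with B[:j].
    prev = [GAP_SCORE * j for j in range(len(B) + 1)]
    for i, a in enumerate(A, start=1):
        curr = [GAP_SCORE * i]  # prefix of A aligned entirely against gaps
        for j, b in enumerate(B, start=1):
            curr.append(max(
                prev[j - 1] + (MATCH if a == b else MISMATCH),  # a against b
                prev[j] + GAP_SCORE,       # a against a gap
                curr[j - 1] + GAP_SCORE,   # a gap against b
            ))
        prev = curr
    return prev[-1]


print(optimal_global_score("ACGT", "AGT"))   # 1, e.g. ACGT over A-GT
print(optimal_global_score("ACGT", "ACGT"))  # 4
```

Because the recurrence never places a gap against a gap, condition A3 holds automatically for every alignment it considers. Recovering an optimal alignment itself, rather than only its score, would additionally require storing the full table and tracing back through it.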