Other Gap Penalty Functions - Pairwise Sequence Alignment

Pairwise Sequence Alignment

1.7 Other Gap Penalty Functions

In the algorithms presented thus far, the penalty for aligning a character with a gap has been γ. However, in most biological applications, it does not make sense to penalize gaps in this manner. For example, a single insertion (or deletion) in a DNA sequence typically results in inserting (or deleting, respectively) a string of nucleotides, making multiple consecutive gap characters much more likely than isolated gap characters. The penalty function chosen should reﬂect this reality.

Pairwise Sequence Alignment 1-13 For this reason, researchers use gap penalties that do not increase linearly with the number of gap characters. In particular, researchers have studied general gap penalty functions [30], aﬃne gap penalty functions [14, 17], and concave and convex gap penalty functions [31, 13, 15].

In terms of computational cost, alignments based on general gap functions cost the most to compute, and in practice they are hardly ever used. We shall discuss them here as motivation for choosing from among simpler functions. Alignments based on convex and concave gap penalty functions can be calculated in O(nm log(n + m)) time. However, affine gap penalty functions offer enough flexibility and can be calculated almost as quickly as linear gap penalty functions. Therefore, these gap penalty functions are almost always used in practice.

1.7.1 General Gap Penalties

Envision a general gap penalty function ω(i) which is the penalty for inserting a gap of length i. If we allow for this arbitrary gap penalty function, then the runtime increases considerably. An O(n²m + nm²) algorithm can be constructed by modifying the recursive deﬁnition presented for a linear gap penalty function.

Now, instead of considering a constant number of cells when calculating S[i, j], we must consider O(i + j) cells. This is because, when aligning two suﬃxes A_i and B_j, one must consider the possibility of aligning ai with any character bk 1 < k ≤ j. One must also consider aligning all of Ai with the empty string. Extending this idea to both strings we end up with four possibilities.

1. Align a_i with b_k, 1 < k≤ j and Ai−1 with B_k₋₁. 2. Align bj with ak, 1 < k≤ i and Ak−1 with Bj−1. 3. Align Ai with a gap

4. Align Bi with a gap

The equation for S[i, j] becomes

S[i, j] = max











max^j_k=1S[i− 1, k − 1] + δ(ai, bk) + ω(j− k) maxⁱ_k=1S[k− 1, j − 1] + δ(ak, b_j) + ω(i− k) ω(i)

ω(j)

Finally, for global alignment, we initialize the ﬁrst row and column of the table based on the gap penalty function.

S[i, 0] = ω(i) S[0, j] = ω(j)

As O(n + m) possibilities are considered for each cell, and there are O(nm) cells, the total runtime of the algorithm is O(n²m + nm²).

1.7.2 Aﬃne Gap Penalties

The runtime penalty to allow the flexibility of a general gap penalty function is harsh. It would be nice to find a function with a more complex shape that incurs less of a runtime cost. Using affine gap penalty functions allows us to maintain an O(nm) runtime.

1-14 Handbook of Computational Molecular Biology An affine gap penalty function has two values. A gap opening penalty g is the cost of starting a new gap. A gap extension penalty h is the cost of extending a gap. The first gap character in an affine gap has a score of g + h. Each subsequent gap character in the affine gap has a score of h. Affine gap penalties are widely preferred by biologists because consecutive gap characters likely correspond to a single insertion/deletion event, while an equal number of scattered gaps correspond to as many insertion/deletion events, which is much less probable. For this reason h is often much smaller than g.

To solve the problem in quadratic time, we augment table S with two additional tables, GA and GB. GA[i, j] is the best score of aligning Ai with Bj under the restriction that ai

is aligned with a gap. GB[i, j] is the best score of aligning Ai and Bj under the restriction that bj is aligned with a gap. As before, S[i, j] holds the optimal score of aligning Ai and Bj under no restrictions.

There are three possible ways in which the maximum score can arise, with two of the cases consisting of two parts each.

1. ai is aligned with bj and Ai is aligned with Bj.

2. ai is aligned with a gap. In this case, we must consider two sub-cases.

(a) Ai−1 is aligned with Bj such that ai−1 is not aligned with a gap, and we start a gap.

(b) Ai−1is aligned with Bjsuch that ai−1 is aligned with a gap, and we extend the gap.

3. bi is aligned with a gap. In this case we must consider two sub-cases.

(a) Ai is aligned with Bj−1 such that bj−1 is not aligned with a gap, and we start a gap.

(b) Aiis aligned with Bj−1such that bj−1 is aligned with a gap, and we extend the gap.

These possibilities are captured and scored in the following equations:

S[i, j] = max







S[i− 1, j − 1] + δ(ai, bj) GA[i, j]

GB[i, j]

GA[i, j] = max

S[i− 1, j] + g + h GA[i− 1, j] + h

GB[i, j] = max

S[i, j− 1] + g + h GB[i, j− 1] + h

The number of cells considered in each cell calculation is constant for all tables. Because there are O(nm) number of cells per table, the total runtime of the algorithm is O(nm).

Consider global alignment. We initialize the ﬁrst row and column of each table such that S[i, 0] = g + hi, S[0, j] = g + hj, and S[0, 0] = 0. S[n, m] contains the optimal alignment score after the algorithm ﬁnishes. We can construct the alignment by tracing back the path, starting at position S[n, m]. With an extension of the ideas presented for global alignment with linear gap penalties, the traceback can be accomplished without storing any pointers.

The details are omitted.

Pairwise Sequence Alignment 1-15 Semiglobal alignment can be handled in a straightforward way. Initialize the ﬁrst row and column of each table to 0. The maximum value in the last row and column of S is the optimal alignment score. Starting at this position, trace the alignment path back through the table until reaching the ﬁrst row or column.

Assume that matches are scored positive, while mismatches and gaps are scored negative.

Now local alignment with aﬃne gaps is easy, as seen by the observation that every local alignment starts with some character a_kaligned with some b_land ends with some a_kaligned with some bl. The proof is by contradiction. Assume that an optimal local alignment L starts with a character aligned with a gap. Then there exists a new alignment with higher score constructed removing the ﬁrst character from L. Therefore, L is not optimal, a contradiction. The same reasoning holds for a gap character at the end of an alignment.

As a result of this observation, we can handle local alignment by modifying the equation for S[i, j]

The alignment is found by searching for the maximum score in S, corresponding to aligning ak and bl by the observation above. The traceback continues until reaching some 0 in S, corresponding to the initial alignment of a_k and b_l.

In document Handbook of Computational Molecular Biology (Page 33-36)