Multiple Sequence Alignment

3.3.3 Consistency-based Methods

General Concepts

The concept of consistency among a set of pairwise alignments was introduced by Gotoh [29] to look for ‘anchor points,’ i.e., local MSAs plausibly embedded in an optimal global MSA.

Consider three sequences, $\bar{s}_1$, $\bar{s}_2$, and $\bar{s}_3$, and three PSAs between them, $A_2(\bar{s}_1, \bar{s}_2)$, $A_2(\bar{s}_1, \bar{s}_3)$, and $A_2(\bar{s}_2, \bar{s}_3)$. Each PSA can be represented by an undirected bipartite graph $(V, E)$, where a vertex corresponds to a residue and an edge, $e(s_{mi}, s_{nj})$, corresponds to a matched pair ($m \neq n \in \{1, 2, 3\}$, $s_{mi} \in \bar{s}_m$, $s_{nj} \in \bar{s}_n$). The set of edges, $E(\bar{s}_m, \bar{s}_n)$, is called a trace [60].

FIGURE 3.2: Consistency of pairwise alignments between three sequences $\bar{a}$, $\bar{b}$, and $\bar{c}$. (a) Alignments. (b) Traces. (c) Edges of bipartite graphs representing the alignments of (a). (d) Trace edges realized (solid bars) and unrealized (dotted bar) in an MSA. Consistent edges are shown by thick bars. (e) Another formalization of the consistently aligned regions (shaded areas). In (d) and (e), sequence $\bar{a}$ is shown in duplicate to enhance viewing.

In this trace formulation, the degree of each vertex is either 0 or 1, depending on whether the residue is matched with a null or with another residue. In Gotoh’s formulation [29], on the other hand, a vertex corresponds not to a residue itself but to a joint between two residues or to either end of a sequence, and every vertex belongs to one or more edges (Figure 3.2 c). The latter formulation can handle gapped alignments more precisely than the trace formulation. For the sake of cohesion with other methods, however, we consider here that each vertex corresponds to a residue (Figure 3.2 b). If there exist three edges $e(s_{1i}, s_{2j}) \in E(\bar{s}_1, \bar{s}_2)$, $e(s_{2j}, s_{3k}) \in E(\bar{s}_2, \bar{s}_3)$, and $e(s_{1i}, s_{3k}) \in E(\bar{s}_1, \bar{s}_3)$, the three edges and their vertices are said to be consistent (Figure 3.2 d, thick bars). A set of residues belonging to contiguous consistent edges forms a consistently aligned region (Figure 3.2 e). Essentially the same formulation is used for finding weakly conserved regions among a set of local MSAs (blocks) [62]. For $N > 3$, the above argument applies to every combination of three sequences in $\{\bar{s}_n\}$ ($1 \le n \le N$). When consistency holds for every combination of three vertices from $\{s_{ni}\}$, the vertex set is considered to be consistent. Alternatively, $\{s_{ni}\}$ may be considered consistent if every $s_{ni}$ participates in at least one consistent triple of vertices. The computational complexity is $O(N^3 \bar{L})$ with either definition. Vingron and Pevzner [114] considered intermediate cases in which each edge $e(s_{mi}, s_{nj})$ belongs to at least $K$ ($1 \le K \le N - 2$) consistent triplets, although the vertices are derived not from PSAs but from dot-matrix plots between every pair of sequences.
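The consistency test for three traces can be illustrated with a minimal sketch. It assumes each trace is given as a set of 0-based residue index pairs; the function name `consistent_triples` and the toy traces are hypothetical and serve only as an illustration of the definition, not as Gotoh's implementation.

```python
from typing import List, Set, Tuple

Trace = Set[Tuple[int, int]]  # matched residue index pairs between two sequences


def consistent_triples(e12: Trace, e23: Trace, e13: Trace) -> List[Tuple[int, int, int]]:
    """Return every (i, j, k) such that (i, j), (j, k) and (i, k) are all edges."""
    # In a trace each residue has degree 0 or 1, so j determines k uniquely.
    partner_in_s3 = dict(e23)
    triples = []
    for i, j in sorted(e12):
        k = partner_in_s3.get(j)
        if k is not None and (i, k) in e13:
            triples.append((i, j, k))
    return triples


# Toy traces (0-based residue indices), invented for illustration.
e12 = {(0, 0), (1, 1), (2, 2)}
e23 = {(0, 0), (1, 1), (2, 3)}
e13 = {(0, 0), (1, 1), (2, 2)}
print(consistent_triples(e12, e23, e13))  # [(0, 0, 0), (1, 1, 1)]
```

In this toy input the third edge of the first trace is not consistent, because the path through the second sequence leads to residue 3 of the third sequence whereas the direct edge leads to residue 2.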

Maximum Weight Trace Problem

The maximum weight trace problem was formulated by Kececioglu [54] with a motivation similar to that discussed above: given a set of PSAs, find an MSA that is closest to all PSAs in the set. The input to the maximum weight trace problem is a set of edges, which in a special case are obtained from a set of optimal pairwise sequence alignments. The algorithm finds an MSA in which the maximum number of input edges (or, more generally, the maximum sum of the weights associated with the edges) is realized (Figure 3.2 d). Because the MSA must satisfy the two conditions A and B mentioned in subsection 3.2.1, the maximum weight trace problem is not trivial, and has in fact been proven to be NP-hard [54]. Kececioglu proposed a branch-and-bound algorithm somewhat similar to that of the MSA program described above [54]. More recently, Kececioglu et al. [53] reformulated the problem and proposed an ILP algorithm that solves it with a branch-and-cut technique (subsection 3.3.1). Because of the intrinsic difficulty of the problem, the applicability of even this new approach does not seem to exceed that of the MSA program by much.
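The objective function of the maximum weight trace problem can be made concrete with a small sketch that evaluates a given MSA rather than searching for the optimal one (the search is the NP-hard part and is not attempted here). The input format, with rows as gapped strings and edges keyed by (sequence, residue) pairs, and the name `realized_weight` are assumptions made for illustration.

```python
def realized_weight(msa, edges):
    """Sum the weights of the input edges that are realized in `msa`.

    msa   : list of equal-length rows; '-' denotes a null (gap).
    edges : dict mapping ((m, i), (n, j)) -> weight, where (m, i) is
            residue i (0-based, gaps not counted) of sequence m.
    An edge is realized when both residues fall in the same column.
    """
    column_of = {}
    for m, row in enumerate(msa):
        i = 0
        for col, ch in enumerate(row):
            if ch != '-':
                column_of[(m, i)] = col
                i += 1
    return sum(w for (u, v), w in edges.items() if column_of[u] == column_of[v])


# Toy data, invented for illustration.
msa = ["AC-G",
       "A-TG",
       "ACTG"]
edges = {
    ((0, 0), (1, 0)): 1.0,  # A-A, same column: realized
    ((0, 1), (1, 1)): 2.0,  # C-T, different columns: not realized
    ((0, 1), (2, 1)): 1.5,  # C-C, realized
    ((1, 1), (2, 2)): 0.5,  # T-T, realized
}
print(realized_weight(msa, edges))  # 3.0
```

Setting every weight to 1 gives the simpler variant in which the number of realized edges is counted.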

T-Coffee

The T-Coffee algorithm developed by Notredame et al. [78] makes more practical use of the consistency information derived from a set of PSAs than the strategies discussed above. Assume that we are given $N$ sequences $\bar{s}_n$ ($1 \le n \le N$) and $KN(N-1)/2$ PSAs between them, $A_2^k(\bar{s}_m, \bar{s}_n)$ ($1 \le k \le K$), each in the form of a bipartite graph, where $K$ denotes the number of PSAs obtained with different PSA methods (e.g., global and local) for each pair of sequences. The primary library consists of residue pairs $(s_{mi}, s_{nj}) \in E^k(\bar{s}_m, \bar{s}_n)$ with associated weights. The weight assigned to such a residue pair, $w^k(s_{mi}, s_{nj}) = w^k(\bar{s}_m, \bar{s}_n)$, is the percent sequence identity of $A_2^k(\bar{s}_m, \bar{s}_n)$. If $(s_{mi}, s_{nj}) \notin E^k(\bar{s}_m, \bar{s}_n)$, $w^k(s_{mi}, s_{nj})$ is 0. When the same residue pair appears in the set of PSAs more than once, the associated weights are summed to yield the primary weight

$$W_p(s_{mi}, s_{nj}) = \sum_{1 \le k \le K} w^k(s_{mi}, s_{nj}).$$

Note that $W_p(s_{mi}, s_{nj}) = 0$ if $(s_{mi}, s_{nj})$ is absent from every $E^k(\bar{s}_m, \bar{s}_n)$. The primary library is nearly equivalent to the input edges used in the special case of the maximum weight trace problem mentioned above. The edges in a primary library occupy only a sparse subset of all the edges to be examined by a complete MSA procedure. T-Coffee extends the library by adding residue pairs that are ‘indirectly’ matched. For example, if $(s_{mi}, s_{pl}) \in E^k(\bar{s}_m, \bar{s}_p)$ and $(s_{nj}, s_{pl}) \in E^k(\bar{s}_n, \bar{s}_p)$ for a triplet ($m, n, p$; $1 \le m, n, p \le N$), the residue pair $(s_{mi}, s_{nj})$ is said to be indirectly matched, and is added to the ‘extended library’ even if $(s_{mi}, s_{nj}) \notin E^k(\bar{s}_m, \bar{s}_n)$. The weight of an indirectly matched pair $(s_{mi}, s_{nj})$ mediated by residue $s_{pl}$ in a third sequence $\bar{s}_p$ is defined as $w^k(s_{mi}, s_{nj}; s_{pl}) = \min\{w^k(s_{mi}, s_{pl}), w^k(s_{nj}, s_{pl})\}$. Now, the total weight given to a residue pair $(s_{mi}, s_{nj})$ is obtained by

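$$W(s_{mi}, s_{nj}) = W_p(s_{mi}, s_{nj}) + \sum_{p \neq m, n} \; \sum_{s_{pl} \in \bar{s}_p} \; \sum_{1 \le k \le K} w^k(s_{mi}, s_{nj}; s_{pl}).$$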
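The construction of the primary and extended libraries can be sketched as follows. This is a simplified illustration under stated assumptions, not the T-Coffee implementation: the input tuple format and the function names are hypothetical, the total weight is assumed to be the primary weight plus the sum of all indirect weights, and the minimum is taken over the summed primary weights $W_p$, which coincides with the per-method definition when $K = 1$.

```python
from collections import defaultdict


def primary_library(psas):
    """Build W_p from PSAs given as (m, n, percent_identity, matched_pairs).

    matched_pairs lists (i, j): residue i of sequence m matched to residue j
    of sequence n. Keys of the result are ((m, i), (n, j)) in sorted order.
    """
    wp = defaultdict(float)
    for m, n, identity, pairs in psas:
        for i, j in pairs:
            u, v = (m, i), (n, j)
            wp[(min(u, v), max(u, v))] += identity  # repeated pairs are summed
    return dict(wp)


def extend_library(wp):
    """Add min-weights of indirect matches mediated by a third residue."""
    neighbors = defaultdict(dict)
    for (u, v), w in wp.items():
        neighbors[u][v] = w
        neighbors[v][u] = w
    total = defaultdict(float, wp)
    for mediator, nbrs in neighbors.items():      # mediator plays the role of s_pl
        items = sorted(nbrs.items())
        for a in range(len(items)):
            for b in range(a + 1, len(items)):
                (u, wu), (v, wv) = items[a], items[b]
                if u[0] != v[0]:                  # residues of two different sequences
                    total[(u, v)] += min(wu, wv)
    return dict(total)


# Toy input: three sequences, one PSA per pair, one matched pair each.
psas = [(0, 1, 80.0, [(0, 0)]),
        (0, 2, 60.0, [(0, 0)]),
        (1, 2, 70.0, [(0, 0)])]
wp = primary_library(psas)
total = extend_library(wp)
print(total[((0, 0), (1, 0))])  # 140.0 = 80 + min(60, 70)
```

In the toy input each directly matched pair is reinforced by the path through the third sequence, so mutually consistent pairs end up with higher weights than pairs supported by a single PSA.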

T-Coffee uses these weights in place of an ordinary score matrix, such as PAM$n$ or BLOSUM$n$, in the DP-based pairwise alignment of single sequences or pre-aligned groups of sequences, without imposing any gap penalties. Otherwise, T-Coffee adopts the typical progressive alignment strategy. The distance matrix and the guide tree are constructed in the same manner as those of ClustalW. The major advantage of T-Coffee over ClustalW and other ordinary progressive methods is that information about the alignments between all pairs of sequences is condensed in the weights $W(s_{mi}, s_{nj})$, which are utilized from the very beginning of the progressive procedure. Another advantage of T-Coffee is that several different sources of alignment information, e.g., those obtained from global and local PSAs, are combined to compute a residue-pair weight. The computational complexity is $O(N^2 \bar{L}^2)$ for pairwise alignment, $O(N^3 \bar{L})$ for library construction, and $O(N \bar{L}^2)$ for the progressive alignment.
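A DP alignment that uses library weights in place of a score matrix and imposes no gap penalty can be sketched as below. The sketch is restricted to two single sequences (not pre-aligned groups), and the function name `align_with_library` and the toy weights are assumptions for illustration only.

```python
def align_with_library(len_a, len_b, weight):
    """Maximize the summed library weight of matched residue pairs.

    weight(i, j) returns the library weight of residue i of the first
    sequence and residue j of the second (0 if the pair is not in the
    library). Gaps cost nothing, so skipping a residue leaves the score
    unchanged.
    """
    D = [[0.0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            D[i][j] = max(D[i - 1][j - 1] + weight(i - 1, j - 1),
                          D[i - 1][j],
                          D[i][j - 1])
    # Trace back one optimal path to recover the matched pairs.
    pairs, i, j = [], len_a, len_b
    while i > 0 and j > 0:
        if D[i][j] == D[i - 1][j - 1] + weight(i - 1, j - 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i][j] == D[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return D[len_a][len_b], pairs[::-1]


# Toy library weights, invented for illustration; (1, 2) and (2, 1) cross,
# so only one of them can be realized.
lib = {(0, 0): 140.0, (1, 2): 90.0, (2, 1): 50.0}
score, pairs = align_with_library(3, 3, lambda i, j: lib.get((i, j), 0.0))
print(score, pairs)  # 230.0 [(0, 0), (1, 2)]
```

Because the two crossing pairs cannot both appear in one alignment, the recursion keeps the heavier one, which is the sense in which the weights replace both the score matrix and the gap penalties.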

DIALIGN

The notion of ‘consistency’ implied in the DIALIGN algorithm [71, 70] differs from that described above; it simply means that two ungapped segment pairs (diagonals or fragments), $f = A_2(a_i a_{i+1} \cdots a_{i+k-1}, b_j b_{j+1} \cdots b_{j+k-1})$ and $f' = A_2(a_{i'} a_{i'+1} \cdots a_{i'+k'-1}, b_{j'} b_{j'+1} \cdots b_{j'+k'-1})$, are arranged in the order $f \le f'$ or $f' \le f$, where $f \le f'$ holds if $i + k \le i'$ and $j + k \le j'$, and vice versa. To avoid confusion, we will use the term ‘compatible’ instead of ‘consistent’ here to refer to such situations. The idea of the DIALIGN algorithm is somewhat related to that of the maximum weight trace problem, although the units used in the alignment process are fragments with positive weights (significant fragments) rather than individual matched pairs. The objective function is the sum of the weights of the fragments involved in the final alignment, and no penalty is imposed on gaps. The residues that are not included in these fragments remain unaligned. Hence, DIALIGN tends to produce alignments that shift from global to more local as the similarity of the sequences under comparison decreases.
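The compatibility relation between two fragments of the same sequence pair translates directly into code. In this minimal sketch a fragment is represented by its start positions and length; the `Fragment` type and the function names are hypothetical.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    i: int  # start position in the first sequence
    j: int  # start position in the second sequence
    k: int  # length of the ungapped segment pair


def precedes(f: Fragment, g: Fragment) -> bool:
    """True if f ends before g starts in both sequences."""
    return f.i + f.k <= g.i and f.j + f.k <= g.j


def compatible(f: Fragment, g: Fragment) -> bool:
    """True if the two fragments can be ordered one after the other."""
    return precedes(f, g) or precedes(g, f)


print(compatible(Fragment(0, 0, 3), Fragment(3, 5, 2)))  # True: ordered
print(compatible(Fragment(0, 0, 3), Fragment(2, 4, 2)))  # False: overlap in the first sequence
print(compatible(Fragment(0, 5, 2), Fragment(3, 0, 2)))  # False: the fragments cross
```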

For PSA, DIALIGN, like most standard procedures, uses a DP algorithm. At each node $(i, j)$ of the recursion, DIALIGN examines the $\min(i, j)$ possible segment pairs that end at $(i, j)$, which implies $O(\bar{L}^3)$ computational steps overall. By bounding the fragment length by a fixed value $K$ (= 40 by default), the computational complexity is reduced to $O(K \bar{L}^2)$, although it is still considerably greater than that of simpler DP algorithms that leave highly divergent parts unaligned [3, 45]. The most crucial step of the DIALIGN algorithm is the evaluation of the weight $w(f)$ of a fragment $f$. When a fragment of length $k$ has an alignment score of

$$s(f) = \sum_{0 \le l < k} S_2(a_{i+l}, b_{j+l}),$$

$w(f)$ is defined as $w(f) = -\log P(k, s(f))$, where $P(k, s(f))$ denotes the probability of observing by chance a segment pair with an alignment score $\ge s(f)$ among the $(|\bar{a}| - k + 1)(|\bar{b}| - k + 1)$ pairs of segments of length $k$ drawn from random sequences. Since the expected value of $P(k, s(f))$ is close to 1, a majority of the fragments have negative weights and are excluded from the list of candidates.
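The pairwise recursion over segment pairs can be sketched as follows. This is an illustration of the DP structure only: the `weight` argument is a caller-supplied stand-in for $-\log P(k, s(f))$, whose actual computation is not reproduced here, and the name `best_chain_weight`, the inclusive treatment of the length bound `max_len`, and the toy weight function are assumptions.

```python
def best_chain_weight(a, b, weight, max_len=40):
    """Best total weight of a chain of compatible fragments between a and b.

    W[i][j] is the optimum for the prefixes a[:i] and b[:j]. At each node
    (i, j), every segment pair of length k <= min(i, j, max_len) that ends
    at (i, j) is examined; only fragments with positive weight are used.
    """
    n, m = len(a), len(b)
    W = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(W[i - 1][j], W[i][j - 1])  # leave a residue unaligned
            for k in range(1, min(i, j, max_len) + 1):
                w = weight(a[i - k:i], b[j - k:j])
                if w > 0:
                    best = max(best, W[i - k][j - k] + w)
            W[i][j] = best
    return W[n][m]


def toy_weight(x, y):
    """Hypothetical stand-in for -log P: number of identities minus one."""
    return sum(p == q for p, q in zip(x, y)) - 1


print(best_chain_weight("ACGT", "ACGT", toy_weight))             # 3.0
print(best_chain_weight("ACGT", "ACGT", toy_weight, max_len=2))  # 2.0
```

With the length bound set to 2, the single fragment of length 4 is no longer available and the best chain consists of two shorter fragments, which shows how the bound trades alignment weight for a smaller search.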

For MSA, DIALIGN adopts a greedy strategy. First, all pairs of input sequences are aligned as described above to yield the initial list of significant fragments. The weight of each fragment is then recalculated in a fashion similar to that used in the construction of the extended library in the T-Coffee program, except that the supplementary weights are derived from indirectly aligned segment pairs. The fragments in the initial list are examined in decreasing order of their weights, and moved to the second list as long as they are compatible with all the fragments already present in the second list. After all the fragments have been examined, the process is repeated from the first member remaining in the initial list until no member of the initial list is compatible with those in the second list.
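The greedy selection step can be sketched as below. The sketch makes several simplifying assumptions that are not part of DIALIGN itself: all fragments belong to the same pair of sequences, weights are taken as given (no recalculation), and compatibility is a caller-supplied predicate. The function names and the toy fragments are hypothetical.

```python
def greedy_select(weighted_fragments, compatible):
    """Move fragments from the initial list to the accepted list greedily.

    weighted_fragments : list of (weight, fragment) tuples.
    compatible         : predicate taking two fragments.
    Fragments are scanned in decreasing order of weight; passes over the
    remaining list are repeated until a pass adds nothing.
    """
    pending = sorted(weighted_fragments, key=lambda wf: wf[0], reverse=True)
    accepted = []
    while True:
        remaining, added = [], False
        for w, f in pending:
            if all(compatible(f, g) for _, g in accepted):
                accepted.append((w, f))
                added = True
            else:
                remaining.append((w, f))
        pending = remaining
        if not added:
            break
    return accepted


def ordered(f, g):
    """Toy compatibility test on (i, j, k) tuples: one fragment precedes the other."""
    before = lambda x, y: x[0] + x[2] <= y[0] and x[1] + x[2] <= y[1]
    return before(f, g) or before(g, f)


frags = [(5.0, (0, 0, 3)), (4.0, (2, 4, 2)), (3.0, (3, 5, 2)), (1.0, (6, 1, 2))]
chosen = greedy_select(frags, ordered)
print([f for _, f in chosen])  # [(0, 0, 3), (3, 5, 2)]
```

With a fixed compatibility test such as this one, a fragment rejected in one pass is rejected in every later pass, so the outer loop ends after the second pass; it is kept only to mirror the repeated examination described in the text.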

As the above procedure suggests, DIALIGN is most effective for detecting local alignments among distantly related sequences or sequences composed of several domains [63]. Its local nature is favorable for searching for exons or regulatory elements in genomic sequences [105]. On the other hand, DIALIGN is too expensive for aligning many (say, $N > 100$) sequences, because the theoretical computational complexity is $O(N^4 \bar{L}^2)$, which is spent on recalculating weight values. It might be wise to introduce a hierarchical structure so that the DIALIGN algorithm is applied to a set of prealigned groups of sequences [81].

In document Handbook of Computational Molecular Biology (Page 85-88)