Parametric Sequence Alignment

4.2.2 Scoring Schemes

Every sequence alignment method is based on some scheme that assigns numeric scores to alignments. The score of an alignment A is a function of A itself and a user-specified parameter vector λ = (λ_1, ..., λ_d) ∈ R^d, which gives the weights of the various features of an alignment (for example, the cost of mismatches and gaps).

4-4 Handbook of Computational Molecular Biology

DEFINITION 4.3 The optimum alignment problem is to compute

\[ Z(\lambda) = \max_{A} \operatorname{score}(A, \lambda) \tag{4.3} \]

for some fixed λ, where score is the scoring function and A ranges over all alignments of the input sequences. The alignment A that maximizes (4.3) is called the optimum alignment at λ. The function Z is the optimum score function.

Throughout the rest of this chapter the input sequences are denoted by S1 and S2; their lengths are n and m, respectively, with n ≤ m. The value of the maximum score of an alignment between S1 and S2 is sometimes called the similarity score (or simply the similarity) of S1 and S2. Intuitively, the higher the similarity between S1 and S2, the closer the two sequences are. Note that there are cases where the objective is to find an alignment of minimum score. These are sometimes called minimum edit distance problems and the value of the minimum score is called the distance between the sequences [28]. In this chapter, all results are for similarity scoring; however, these results can be translated to minimization problems. Indeed, it is often possible to establish correspondences between similarity- and distance-based scoring schemes [28].

DEFINITION 4.4 An evaluator is a procedure that, given a pair of input strings and a parameter vector λ^(0), computes an optimum alignment A^(0) at λ^(0).

Note that an evaluator also allows us to compute the value of Z(λ^(0)), which equals score(A^(0), λ^(0)).

DEFINITION 4.5 A scoring scheme is linear if, for every alignment A, score(A, λ) is a linear function of the parameter vector λ.

Essentially all scoring schemes used in sequence alignment are linear. For example, in global alphabet-independent scoring, the score is a linear function of four parameters,

\[ \operatorname{score}(A, \alpha, \beta, \gamma, \delta) = \alpha w - \beta x - \gamma y - \delta z, \tag{4.4} \]

where w, x, y, z are the numbers of matches, mismatches, indels, and gaps in A, and α, β, γ, δ are the respective weights. An evaluator for this problem is any optimum global alignment algorithm, such as the standard O(nm) dynamic programming procedure [28].
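As an illustration of what such an evaluator might look like, the following is a minimal sketch of an O(nm) dynamic program that returns the optimum score Z(λ) under Equation (4.4). It is not taken from [28]; it assumes that an indel is a single space, that a gap is a maximal run of consecutive spaces in one sequence, and that a gap of k spaces therefore costs δ + kγ. The function name `optimum_score` and the three-table layout are choices made for this sketch only.

```python
NEG_INF = float("-inf")

def optimum_score(s1, s2, alpha, beta, gamma, delta):
    """Return Z(lambda) for lambda = (alpha, beta, gamma, delta), Equation (4.4).

    Assumption of this sketch: a gap is a maximal run of spaces in one
    sequence, so a gap with k spaces is charged delta + k * gamma.
    """
    n, m = len(s1), len(s2)
    # One table per type of last column, indexed by prefix lengths (i, j):
    #   M: s1[i-1] is aligned with s2[j-1] (match or mismatch)
    #   X: s1[i-1] is aligned with a space
    #   Y: s2[j-1] is aligned with a space
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0  # the empty alignment

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                pair = alpha if s1[i - 1] == s2[j - 1] else -beta
                M[i][j] = pair + max(M[i - 1][j - 1],
                                     X[i - 1][j - 1],
                                     Y[i - 1][j - 1])
            if i > 0:
                # Extend a gap (pay gamma) or open a new one (pay gamma + delta).
                X[i][j] = max(X[i - 1][j] - gamma,
                              max(M[i - 1][j], Y[i - 1][j]) - gamma - delta)
            if j > 0:
                Y[i][j] = max(Y[i][j - 1] - gamma,
                              max(M[i][j - 1], X[i][j - 1]) - gamma - delta)

    return max(M[n][m], X[n][m], Y[n][m])
```

Recovering an optimum alignment A^(0) itself, rather than only its score, would additionally require storing back-pointers and tracing them from cell (n, m); that step is omitted here to keep the sketch short.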

In global alphabet-dependent scoring, we are given a |Σ| × |Σ| matrix M = [µ_ab], where Σ is the alphabet and µ_ab is the score for aligning characters a and b. Then,

\[ \operatorname{score}(A, M, \gamma, \delta) = \sum_{a, b \in \Sigma} \mu_{ab} w_{ab} - \gamma y - \delta z, \tag{4.5} \]

where w_ab denotes the number of times characters a and b are aligned in A, and y, z, γ, δ are as in Equation (4.4). As in the alphabet-independent case, an optimum alignment can be found in O(nm) time for any fixed parameter vector.

The next definition captures some of the characteristics of commonly used scoring schemes.

DEFINITION 4.6 A scoring scheme is feature-based if there exists a (many-to-one) mapping f from alignments to a subset F of Z^d and the score of any alignment A as a function of the parameters depends only on f(A). Set F is the feature set of the problem; p = f(A) is the feature vector of A and each coordinate of p is called a feature. A feature-based scoring scheme is simple if for any alignment A, the score of A can be expressed as

\[ \operatorname{score}(A, \lambda) = p \cdot \lambda, \tag{4.6} \]

where p ∈ Z^d is the feature vector of A and λ ∈ R^d is the parameter vector.

In all applications considered in this chapter, the size of the feature set is bounded as a function of the sequence lengths. This is explored in greater depth in Section 4.3.

The feature vector of an alignment A represents discrete characteristics of A. To illustrate this, observe that the alphabet-independent scheme of Equation (4.4) is feature-based and simple. The feature vector of an alignment in this case is p = (w, x, y, z). Each of p's features is bounded by the total number of characters in the input strings, and the parameter vector λ = (α, β, γ, δ) assigns a weight to each feature. The alphabet-dependent scoring scheme of Equation (4.5) is also feature-based and simple. In this case, there are |Σ|^2 + 2 features, given by the vector

\[ p = (w_{a_1 a_1}, \ldots, w_{a_1 a_{|\Sigma|}}, \ldots, w_{a_{|\Sigma|} a_1}, \ldots, w_{a_{|\Sigma|} a_{|\Sigma|}}, y, z). \tag{4.7} \]

Each feature is at most equal to the total number of characters in the input strings.
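To make the notion of a feature vector concrete, here is a small sketch that extracts p = (w, x, y, z) from an alignment written as two equal-length rows and then scores it as a dot product, as in Equation (4.6). The row representation, the `-` space character, and the function names are assumptions of this sketch. So that p · λ agrees with Equation (4.4), the sketch also assumes the penalty signs are folded into the parameter vector, that is, λ = (α, −β, −γ, −δ).

```python
def feature_vector(row1, row2, space="-"):
    """Count matches, mismatches, indels and gaps of an alignment.

    The alignment is given as two rows of equal length; a gap is taken to be
    a maximal run of spaces within one row (an assumption of this sketch).
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    w = x = y = z = 0
    prev_side = None  # row holding the space in the previous column, if any
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column cannot consist of two spaces")
        if a == space or b == space:
            side = 1 if a == space else 2
            y += 1
            if side != prev_side:  # a new run of spaces starts here
                z += 1
            prev_side = side
        else:
            prev_side = None
            if a == b:
                w += 1
            else:
                x += 1
    return (w, x, y, z)

def simple_score(p, lam):
    """Score of a simple feature-based scheme: the dot product p . lambda."""
    return sum(pi * li for pi, li in zip(p, lam))
```

For instance, `feature_vector("BAAB-BB", "-ABBAAA")` yields two matches, three mismatches, two indels and two gaps, and `simple_score` with λ = (1, −1, −0.25, 0) then weights those counts exactly as Equation (4.4) would with α = 1, β = 1, γ = 0.25, δ = 0.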

The following is an example of an alphabet-dependent scoring scheme that is feature-based but not simple [31]. Let the scoring matrix M = [µab] be fixed, but assign different weights to matches and mismatches as shown below.

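\[ \operatorname{score}(A, \alpha, \beta, \gamma, \delta) = \alpha \sum_{a \in \Sigma} \mu_{aa} w_{aa} + \beta \sum_{a, b \in \Sigma,\, a \neq b} \mu_{ab} w_{ab} - \gamma y - \delta z \tag{4.8} \]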

The feature vector in this case is given by Equation (4.7), but there are now only four parameters (α, β, γ, and δ) instead of |Σ|^2 + 2.
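The following sketch illustrates why this scheme is feature-based but not simple: the score is computed from the feature counts of Equation (4.7), yet it depends on them only through the fixed matrix and four weights. It assumes that the first sum ranges over matched pairs (a, a) and the second over mismatched pairs with a ≠ b, that a gap is a maximal run of spaces in one row, and that the matrix is supplied as a Python dictionary keyed by ordered character pairs. All names and the toy matrix in the usage note are hypothetical.

```python
from collections import Counter

def pair_features(row1, row2, space="-"):
    """Return (pair counts w_ab, indel count y, gap count z) of an alignment."""
    pairs = Counter()
    y = z = 0
    prev_side = None  # row holding the space in the previous column, if any
    for a, b in zip(row1, row2):
        if a == space or b == space:
            side = 1 if a == space else 2
            y += 1
            z += 1 if side != prev_side else 0
            prev_side = side
        else:
            pairs[(a, b)] += 1
            prev_side = None
    return pairs, y, z

def weighted_matrix_score(pairs, y, z, mu, alpha, beta, gamma, delta):
    """Score with one weight for all matches and one for all mismatches.

    mu maps ordered character pairs (a, b) to fixed matrix entries.
    """
    match_part = sum(mu[a, b] * c for (a, b), c in pairs.items() if a == b)
    mismatch_part = sum(mu[a, b] * c for (a, b), c in pairs.items() if a != b)
    return alpha * match_part + beta * mismatch_part - gamma * y - delta * z
```

With a toy matrix such as `mu = {("A", "A"): 2, ("B", "B"): 3, ("A", "B"): -1, ("B", "A"): -1}`, the features of the rows `"BAAB-BB"` and `"-ABBAAA"` can be passed straight to `weighted_matrix_score`; changing α or β rescales all match or all mismatch entries at once, which is what prevents the score from being written as a dot product over the |Σ|^2 + 2 individual features with independent weights.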

Definition 4.6 does not capture all scoring schemes used in practice. For example, the definition does not cover schemes where the cost of a gap is some constant (the gap penalty) times the logarithm of the gap length [28]. Even though this method does not fit within our framework, it nevertheless still exhibits some of the characteristics of feature-based scoring, since, for fixed sequence lengths, the number of gaps and their various lengths can only assume a finite set of integer values. Note also that while the dependence on the gap length is non-linear, the scoring scheme is itself linear in the weight given to gaps.

4.2.3 Parametric Sequence Alignment

Parametric sequence alignment studies the effect of varying the parameter vector on the optimum alignment. Its central object of study is Z, the optimum score function of Definition 4.3. Under linear similarity scoring, Z is the upper envelope of a set of linear functions

\[ H = \{ \operatorname{score}(A, \lambda) : A \text{ is optimal at some } \lambda^{(0)} \in \mathbb{R}^d \}. \]

(For minimum edit distance problems, Z is a lower envelope.) For brevity, we refer to the maximization diagram of H as the maximization diagram of Z. By Propositions 4.1 and 4.2, the maximization diagram of Z decomposes the parameter space, R^d, into convex polyhedral optimality regions. This is illustrated in Figure 4.2, which shows the maximization diagram of the optimum score function for global alphabet-independent alignment between two sequences as the mismatch penalty β and the indel penalty γ are varied across the positive quadrant, while each match gets a (constant) reward of one and the gap penalty is zero. There are exactly four optimality regions, whose corresponding alignments are shown.

[Figure 4.2: plot with axes β and γ, divided into optimality regions labeled I, II, III, and IV.]

FIGURE 4.2: Decomposition of the parameter space induced by sequences S1 = BAABBB and S2 = ABBAAA. The corresponding optimum alignments are A_I = (BAABBB---, --A-BBAAA), A_II = (BAABB--B, --ABBAAA), A_III = (BAAB-BB, -ABBAAA), A_IV = (BAABBB, ABBAAA).

Note that in general, each optimality region may have several co-optimal alignments.

The structure of the maximization diagram of Figure 4.2 is particularly simple: any vertical cross section encounters the same series of alignments when going from bottom to top. A representative slice is shown in Figure 4.3, which displays the optimal score function Z(γ) for the alignment problem of Figure 4.2, when β is fixed at one. The picture shows how, indeed, Z(γ) is the upper envelope of the score functions of various alignments.
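The upper-envelope view can be sketched in a few lines. The example below assumes the four alignments listed in the caption of Figure 4.2, with match, mismatch and indel counts (3, 0, 6), (3, 1, 4), (2, 3, 2) and (0, 6, 0) counted by hand from those alignments. With α = 1, β = 1 and δ = 0, each alignment then contributes a line (w − x) − yγ, stored as an (intercept, slope) pair. The function names and the walking procedure are illustrative choices for this sketch, not the chapter's algorithm.

```python
def envelope_value(lines, t):
    """Z(t): the maximum over all lines of intercept + slope * t."""
    return max(b0 + b1 * t for b0, b1 in lines.values())

def upper_envelope(lines, t0=0.0):
    """Walk the upper envelope for t >= t0.

    lines maps a label to (intercept, slope). Returns a list of
    (start, label) pairs: label is optimal from start until the next start.
    """
    # Line that is highest at t0; ties go to the greater slope, which is
    # the line that stays on top immediately to the right of t0.
    cur = max(lines, key=lambda k: (lines[k][0] + lines[k][1] * t0, lines[k][1]))
    pieces = [(t0, cur)]
    t = t0
    while True:
        c0, c1 = lines[cur]
        candidates = []
        for label, (b0, b1) in lines.items():
            if b1 > c1:  # only a line with greater slope can overtake later
                cross = (c0 - b0) / (b1 - c1)
                if cross >= t:
                    candidates.append((cross, -b1, label))
        if not candidates:
            return pieces
        # Earliest crossing wins; among ties, the greatest slope.
        t, _, cur = min(candidates)
        pieces.append((t, cur))

# Lines for the four alignments of Figure 4.2 at alpha = 1, beta = 1, delta = 0.
lines = {
    "A_I": (3, -6),    # 3 matches, 0 mismatches, 6 indels
    "A_II": (2, -4),   # 3 matches, 1 mismatch, 4 indels
    "A_III": (-1, -2), # 2 matches, 3 mismatches, 2 indels
    "A_IV": (-6, 0),   # 0 matches, 6 mismatches, 0 indels
}
breakpoints = upper_envelope(lines)
```

Under these assumptions the walk visits the four alignments in order as γ grows, with the optimal alignment changing wherever two consecutive lines intersect. Each interval between consecutive breakpoints is a one-dimensional optimality region, the counterpart on this slice of the polyhedral regions in the full parameter space.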

The simplicity of Figure 4.2 is due to the scoring scheme, and is analyzed in more detail in Section 4.3. Other scoring schemes yield more intricate structures, as shown in Figure 4.4, taken from [31]. The figure gives the maximization diagram for the score of the optimum global alignment of immunoglobulins FABVL and FABVH, as a function of indel and gap penalties.

The scoring here is alphabet-dependent, using the PAM250 matrix [15] with a constant of 8 added to each entry; gaps at the end of an alignment are assigned a score of zero.