Lecture 5: Dynamic Programming

(1)

(2)

Inexact Matching and Alignment

• Inexact (approximate) matching means that some errors are allowed

• Alignment generally means lining up the characters of strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces in the other string

(3)

Importance of Alignment or Approximate Matching

• It is central to computational molecular biology

• This is because of active mutational processes

• “Duplication and Modification” is the central part of protein evolution

(4)

Edit Distance Between Two Strings

• The edit distance measures the difference between two strings

• It focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters

• The permitted edit operations are

– Insertion (I) of a character into the first string

– Deletion (D) of a character from the first string

– Substitution (or replacement) (R) of a character in the first string with a character in the second string

• For a match (M), no operation is necessary

(5)

Edit Transcript vs. Edit Distance

Edit Transcript: A string over the alphabet I, D, R, M that describes a transformation of one string into another is called an edit transcript, or transcript for short, of the two strings.

Edit Distance: The minimum number of edit operations (insertions, deletions and substitutions) needed to transform the first string into the second. Also known as Levenshtein distance.

R I M D M D M M I
v   i n t n e r
w r i   t   e r s

(6)

Optimal Transcript

• An optimal transcript is an edit transcript that uses the minimum number of edit operations.

(7)

String Alignment

• A (global) alignment of two strings S₁ and S₂ is obtained by first inserting chosen spaces, either into or at the ends of S₁ and S₂, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

v_intner_
wri_t_ers

(8)

Alignment vs. Edit Transcript

• From a mathematical viewpoint, these are equivalent ways to describe the relationship between two strings

• An alignment can easily be converted to an edit transcript and vice versa

• From a modeling standpoint, they are quite different

– An edit transcript emphasizes the putative mutational events that transform one string into another

– An alignment displays only the relationship between the strings

– So one is the process (edit transcript), the other is the product (alignment)

v_intner_
wri_t_ers
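The conversion between the two views can be sketched in a few lines of Python. The function names here are hypothetical, and the sketch assumes the underscore is used as the space character, as in the example above.

```python
def alignment_from_transcript(s1, s2, transcript):
    """Turn an edit transcript (I, D, R, M) into the two rows of an alignment."""
    row1, row2 = [], []
    i = j = 0  # next unread positions in s1 and s2
    for op in transcript:
        if op == "I":        # character of s2 opposite a space in s1
            row1.append("_")
            row2.append(s2[j])
            j += 1
        elif op == "D":      # character of s1 opposite a space in s2
            row1.append(s1[i])
            row2.append("_")
            i += 1
        else:                # R or M: one character from each string
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
    return "".join(row1), "".join(row2)


def transcript_from_alignment(row1, row2):
    """Read the transcript back off the columns of an alignment."""
    ops = []
    for a, b in zip(row1, row2):
        if a == "_":
            ops.append("I")
        elif b == "_":
            ops.append("D")
        else:
            ops.append("M" if a == b else "R")
    return "".join(ops)


rows = alignment_from_transcript("vintner", "writers", "RIMDMDMMI")
print(rows[0])  # v_intner_
print(rows[1])  # wri_t_ers
print(transcript_from_alignment(*rows))  # RIMDMDMMI
```

Each transcript symbol corresponds to exactly one alignment column, which is why the two descriptions carry the same information.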

(9)

Dynamic Programming Calculation of Edit Distance

• How do we compute the edit distance of two strings, along with the accompanying edit transcript or alignment?

Definition: For two strings S₁ and S₂, D(i, j) is defined to be the edit distance of S₁[1…i] and S₂[1 … j]

(10)

Steps of Dynamic Programming

• Recurrence relation

• Tabular computation

• Traceback

(11)

The Recurrence Relation

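• The recurrence relation establishes a relationship between the value of D(i, j), for positive i and j, and values of D with index pairs smaller than i, j.

• Base conditions are

– D(i, 0) = i, i.e. delete i characters

– D(0, j) = j, i.e. insert j characters

• The recurrence relation is

D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]

where t(i, j) = 1 if S₁(i) ≠ S₂(j) and t(i, j) = 0 if S₁(i) = S₂(j)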

(12)

Tabular Computation: Bottom Up Approach

D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]
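As a minimal sketch, the bottom-up computation can be written in Python as follows. The function name is hypothetical, t(i, j) is assumed to be 0 for a match and 1 for a mismatch, and Python strings are 0-indexed, so S₁(i) is `s1[i - 1]`.

```python
def edit_distance_table(s1, s2):
    """Fill D bottom-up; D[i][j] is the edit distance of s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # base condition: delete i characters
    for j in range(m + 1):
        D[0][j] = j  # base condition: insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + t)  # match or substitution
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5
```

Filling the table row by row guarantees that the three cells each entry depends on are already available; the whole table takes O(nm) time and space.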



(14)

The Traceback

For an optimal edit transcript, follow any traceback path from cell (n, m) to cell (0, 0)

1. A horizontal edge, from (i, j) to (i, j-1), is an insertion (I) of character S₂(j) into S₁

2. A vertical edge, from (i, j) to (i-1, j), is a deletion (D) of S₁(i) from S₁

3. A diagonal edge, from (i, j) to (i-1, j-1), is a match (M) if S₁(i) = S₂(j) and a substitution (R) if S₁(i) ≠ S₂(j)
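The three rules above can be sketched in Python as follows. This is an illustrative version under stated assumptions: the function name is hypothetical, the table is recomputed inside the function to keep the block self-contained, and ties are broken by preferring the diagonal edge, then the vertical, then the horizontal, so only one of the possibly many optimal transcripts is returned.

```python
def edit_transcript(s1, s2):
    """Return one optimal edit transcript (string over I, D, R, M)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        t = 0 if (i > 0 and j > 0 and s1[i - 1] == s2[j - 1]) else 1
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            ops.append("M" if t == 0 else "R")  # diagonal edge
            i -= 1
            j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                     # vertical edge
            i -= 1
        else:
            ops.append("I")                     # horizontal edge
            j -= 1
    return "".join(reversed(ops))


transcript = edit_transcript("vintner", "writers")
print(transcript)
print(sum(op != "M" for op in transcript))  # number of edit operations
```

The operations are collected from cell (n, m) backwards, so the list is reversed at the end to read the transcript from left to right.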

(15)

The Traceback

Alternatively, in terms of alignment

1. A horizontal edge specifies a space inserted into S₁

2. A vertical edge specifies a space inserted into S₂

3. A diagonal edge specifies either a match or a mismatch

S₁ = vintner

S₂ = writers

There are three traceback paths; they are identical from (7, 7) to (3, 3), which gives the shared ending

tner_
t_ers

The three resulting alignments are

vintner_    v_intner_    _vintner_
writ_ers    wri_t_ers    wri_t_ers

(16)

Edit Graphs

• It is often useful to represent dynamic programming solutions of string problems in terms of a weighted edit graph

– If |S₁| = n and |S₂| = m, then the weighted edit graph has (n+1) × (m+1) nodes

– Each edge has a weight

• In the case of the edit distance problem, each edge has weight 1 except the three edges

(17)

Weighted Edit Distance

• An easy but crucial generalization is to associate a weight (cost or score) with every edit operation, as well as with a match

– Let the insertion or deletion weight be d

– Let the substitution weight be r, and

– Let the match weight be e, which is usually very small and often zero

• Equivalently, in terms of an operation-weight alignment

– A mismatch costs r

– A match costs e

– A space costs d

• Two types of weighted edit distance

– Operation weight

– Alphabet weight

(18)

Operation-weight Edit Transcript

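With d = 1, r = 1 and e = 0 we get the same three optimal alignments as in the unweighted problem.

With d = 4, r = 2 and e = 1, the alignment

writ_ers
vintner_

has total weight 17 (three mismatches, three matches and two spaces: 3·2 + 3·1 + 2·4). Under these weights the gap-free alignment of vintner and writers has weight 6·2 + 1·1 = 13, so 17 is not the minimum.

Modified recurrence relations:

D(i, 0) = i·d and D(0, j) = j·d

D(i, j) = min[D(i-1, j) + d, D(i, j-1) + d, D(i-1, j-1) + t(i, j)], where t(i, j) = e if S₁(i) = S₂(j) and t(i, j) = r otherwise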
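A minimal sketch of the operation-weight version follows, assuming the recurrence simply replaces the unit costs with d (space), r (mismatch) and e (match). Both function names are hypothetical. The second helper scores one given alignment, which makes it easy to compare a hand-built alignment against the dynamic programming minimum.

```python
def weighted_edit_distance(s1, s2, d, r, e):
    """Operation-weight edit distance: spaces cost d, mismatches r, matches e."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + t)
    return D[n][m]


def alignment_weight(row1, row2, d, r, e):
    """Total weight of one given alignment, column by column ('_' is a space)."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "_" or b == "_":
            total += d
        elif a == b:
            total += e
        else:
            total += r
    return total


print(alignment_weight("writ_ers", "vintner_", d=4, r=2, e=1))
print(weighted_edit_distance("vintner", "writers", d=4, r=2, e=1))
print(weighted_edit_distance("vintner", "writers", d=1, r=1, e=0))
```

With d = 4, r = 2, e = 1 the helper returns 17 for the alignment writ_ers / vintner_, while the dynamic program returns 13, the weight of the alignment with no spaces. With d = 1, r = 1, e = 0 the result is the ordinary edit distance.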

(19)

Alphabet-weight Edit Distance

• Assign score/weight depending on characters

– For example, it may be more costly to replace an A with a T than with a G

– Or, the weight of a deletion / insertion may depend on exactly which character is deleted / inserted

• Weighted edit distance usually means alphabet-weight version

• Dominant scoring matrices are PAM matrices, and the newer BLOSUM scoring matrices

(20)

String Similarity

• While edit distance minimizes the total weight, string similarity maximizes it

• For string similarity

– Match scores are greater than or equal to zero

(21)

Computing String Similarity

• Let V(i, j) be the value of the optimal alignment of the prefixes S₁[1…i] and S₂[1…j]

(22)

End-space Free Variant

• Any spaces at the beginning or end of the alignment have cost zero

• This encourages one string to align in the interior of the other

• Or a suffix of one string to align with a prefix of the other

• The shotgun sequence assembly problem (see sections 16.14 and 16.15) uses this variant; it can be a project.
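One common way to realize this variant, sketched here under stated assumptions, is to initialize the first row and column of the similarity table to zero (leading spaces are free) and to take the best value anywhere in the last row or last column (trailing spaces are free). The function name and the scores +1 (match), -1 (mismatch) and -1 (interior space) are illustrative choices, not values given in the lecture.

```python
def end_space_free_score(s1, s2, match=1, mismatch=-1, space=-1):
    """Best similarity when spaces at either end of the alignment cost nothing."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: leading end spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
    # Trailing end spaces are free: the alignment may stop anywhere
    # in the last row or the last column.
    return max(max(V[n]), max(row[m] for row in V))


# Suffix of the first string overlaps a prefix of the second (TCG).
print(end_space_free_score("AAATCG", "TCGGGG"))
# Second string sits in the interior of the first.
print(end_space_free_score("TTACGTT", "ACG"))
```

The suffix-prefix case is the situation that arises when overlapping fragments are compared in shotgun assembly.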


(23)

Local vs. global alignment

• Global alignment: entire sequences

• Local alignment: segments of sequences

• Local alignment is often the most relevant

(24)

The Needleman-Wunsch and Smith-Waterman Algorithms

(25)

Global Sequence Alignment

• The Needleman-Wunsch algorithm performs a global alignment of two sequences

• It is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison

• Suitable when the two sequences are of similar length, with a significant degree of similarity throughout

(26)

Three Steps in the Needleman-Wunsch Algorithm

• Initialization

• Scoring

• Trace back (alignment)

• Consider the following two DNA sequences to be globally aligned: ATCG and TCG

(27)

Scoring Scheme

• Match score = +1

• Mismatch score = -1

• Gap penalty = -1

• Substitution matrix

     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1

(28)

Initialization Step

• Create a matrix with X + 1 rows and Y + 1 columns, where X and Y are the lengths of the two sequences

• The first row and the first column of the score matrix are filled with multiples of the gap penalty

          T    C    G
     0   -1   -2   -3
A   -1
T   -2
C   -3
G   -4

(29)

Scoring

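• The score of any cell C(i, j) is the maximum of:

scorediag = C(i-1, j-1) + S(i, j)

scoreup = C(i-1, j) + g

scoreleft = C(i, j-1) + g

where S(i, j) is the substitution score for the characters at row i and column j, and g is the gap penalty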

(30)

Scoring ….

• Example:

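The calculation for the cell C(2, 2), i.e. row A and column T (the top-left corner is C(1, 1)):

scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1

scoreup = C(i-1, j) + g = -1 + (-1) = -2

scoreleft = C(i, j-1) + g = -1 + (-1) = -2

C(2, 2) = max(-1, -2, -2) = -1

          T    C    G
     0   -1   -2   -3
A   -1   -1
T   -2
C   -3
G   -4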

(31)

Scoring ….

• Final scoring matrix

Note: The last cell always holds the score of the optimal global alignment, here 2

          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2
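The initialization and scoring steps can be sketched in Python as below. The function name is hypothetical. The sequences ATCG (rows) and TCG (columns) are an assumption read off the row and column labels and the final alignment of the worked example; the scores are the ones from the scoring scheme (+1, -1, -1). The code uses 0-based indices, so the top-left corner is `C[0][0]` rather than C(1, 1).

```python
def needleman_wunsch_matrix(seq_rows, seq_cols, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment score matrix (initialization + scoring)."""
    x, y = len(seq_rows), len(seq_cols)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    # Initialization: first row and column are multiples of the gap penalty.
    for i in range(1, x + 1):
        C[i][0] = i * gap
    for j in range(1, y + 1):
        C[0][j] = j * gap
    # Scoring: each cell is the best of the diagonal, up and left moves.
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s,  # scorediag
                          C[i - 1][j] + gap,    # scoreup
                          C[i][j - 1] + gap)    # scoreleft
    return C


for row in needleman_wunsch_matrix("ATCG", "TCG"):
    print(row)
```

The bottom-right entry of the returned matrix is the score of the optimal global alignment.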

(32)

Trace back

• The trace back step determines the actual alignment(s) that result in the maximum score

• There are likely to be multiple maximal alignments

• Trace back starts from the last cell, i.e. position (X, Y) in the matrix

(33)

Trace back ….

• There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left

• Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors. This means it looks at the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors

(34)

Trace back ….

• Starting from the last cell, the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

Seq 1: G
       |
Seq 2: G

          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2

(35)

Trace back ….

• Final Trace back

Best alignment:

A T C G
  | | |
_ T C G

          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2
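A minimal sketch of the full procedure, including the trace back from the last cell, is given below. The function name is hypothetical, the sequences ATCG and TCG are an assumption read off the example, and when several predecessors are possible this version prefers the diagonal, then up, then left, so it reports only one of the maximal alignments.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a (rows) and b (columns); returns the two aligned rows."""
    x, y = len(a), len(b)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        C[i][0] = i * gap
    for j in range(1, y + 1):
        C[0][j] = j * gap
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Trace back from the last cell (x, y) to the corner (0, 0).
    top, bottom = [], []
    i, j = x, y
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and C[i][j] == C[i - 1][j - 1] + s:
            top.append(a[i - 1])      # diagonal: match or mismatch
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and C[i][j] == C[i - 1][j] + gap:
            top.append(a[i - 1])      # up: character of a opposite a gap
            bottom.append("_")
            i -= 1
        else:
            top.append("_")           # left: character of b opposite a gap
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


aligned_a, aligned_b = needleman_wunsch("ATCG", "TCG")
print(aligned_a)  # ATCG
print(aligned_b)  # _TCG
```

Collecting every alignment instead of one would require branching at each cell with more than one valid predecessor.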

(36)

Local Sequence Alignment

• The Smith-Waterman algorithm performs a local alignment of two sequences

• It is an example of dynamic programming

• Useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context

(37)

Differences between the Needleman-Wunsch and Smith-Waterman Algorithms

• In the initialization stage, the first row and first column are all filled with 0s

• While filling the matrix, if a score becomes negative, put in 0 instead

• In the traceback, start with the cell that has the highest score and work back until a cell with a score of 0 is reached

(38)

Three Steps in the Smith-Waterman Algorithm

• Initialization

• Scoring

• Trace back (alignment)

• Consider the following two DNA sequences to be locally aligned: ATCG and TCG

(39)

Scoring Scheme

• Match score = +1

• Mismatch score = -1

• Gap penalty = -1

• Substitution matrix

     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1

(40)

Initialization Step

• Create a matrix with X + 1 rows and Y + 1 columns, where X and Y are the lengths of the two sequences

• The first row and the first column of the score matrix are filled with 0s

          T    C    G
     0    0    0    0
A    0
T    0
C    0
G    0

(41)

Scoring

• The score of any cell C(i, j) is the maximum of:

scorediag = C(i-1, j-1) + S(i, j)

scoreup = C(i-1, j) + g

scoreleft = C(i, j-1) + g

and 0

(42)

Scoring ….

• Example:

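The calculation for the cell C(2, 2), i.e. row A and column T (the top-left corner is C(1, 1)):

scorediag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1

scoreup = C(i-1, j) + g = 0 + (-1) = -1

scoreleft = C(i, j-1) + g = 0 + (-1) = -1

C(2, 2) = max(-1, -1, -1, 0) = 0

          T    C    G
     0    0    0    0
A    0    0
T    0
C    0
G    0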

(43)

Scoring ….

• Final scoring matrix

Note: In general the last cell does not have to hold the maximum alignment score!

          T    C    G
     0    0    0    0
A    0    0    0    0
T    0    1    0    0
C    0    0    2    1
G    0    0    1    3
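The local-alignment matrix can be sketched in Python as below. The function name is hypothetical, and the sequences ATCG (rows) and TCG (columns) are an assumption read off the worked example. The only changes from the global version are the zero initialization and the extra 0 in the maximum; the function also returns the position of the largest entry, since that is where trace back begins.

```python
def smith_waterman_matrix(seq_rows, seq_cols, match=1, mismatch=-1, gap=-1):
    """Fill the local alignment matrix; return it with the position of its maximum."""
    x, y = len(seq_rows), len(seq_cols)
    # Initialization: first row and first column are all 0.
    C = [[0] * (y + 1) for _ in range(x + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq_rows[i - 1] == seq_cols[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s,  # scorediag
                          C[i - 1][j] + gap,    # scoreup
                          C[i][j - 1] + gap,    # scoreleft
                          0)                    # never go negative
            if C[i][j] > best:
                best, best_pos = C[i][j], (i, j)
    return C, best_pos


C, (bi, bj) = smith_waterman_matrix("ATCG", "TCG")
for row in C:
    print(row)
print("maximum", C[bi][bj], "at", (bi, bj))
```

For these two short sequences the maximum happens to fall in the bottom-right cell, but the code makes no such assumption.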

(44)

Trace back

• The trace back step determines the actual alignment(s) that result in the maximum score

• There are likely to be multiple maximal alignments

• Trace back starts from the cell with the maximum value in the matrix

(45)

Trace back ….

• There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left

• Trace back takes the current cell and looks at the neighbor cells that could be direct predecessors. This means it looks at the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors

(46)

Trace back ….

• Starting from the cell with the maximum value, the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

Seq 1: G
       |
Seq 2: G

          T    C    G
     0    0    0    0
A    0    0    0    0
T    0    1    0    0
C    0    0    2    1
G    0    0    1    3

(47)

Trace back ….

• Final Trace back

Best alignment:

T C G
| | |
T C G

          T    C    G
     0    0    0    0
A    0    0    0    0
T    0    1    0    0
C    0    0    2    1
G    0    0    1    3
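The local trace back can be sketched as follows. The function name is hypothetical, the sequences are assumed to be ATCG and TCG as in the example, and ties are broken by preferring the diagonal, then up, then left. The trace starts at the first cell holding the maximum value and stops as soon as a cell with score 0 is reached.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment of a (rows) and b (columns); returns (row1, row2, score)."""
    x, y = len(a), len(b)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap, 0)
            if C[i][j] > best:
                best, bi, bj = C[i][j], i, j

    # Trace back from the maximum cell until a 0 is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and C[i][j] > 0:
        s = match if a[i - 1] == b[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + s:
            top.append(a[i - 1])      # diagonal: match or mismatch
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif C[i][j] == C[i - 1][j] + gap:
            top.append(a[i - 1])      # up: character of a opposite a gap
            bottom.append("_")
            i -= 1
        else:
            top.append("_")           # left: character of b opposite a gap
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), best


print(smith_waterman("ATCG", "TCG"))  # ('TCG', 'TCG', 3)
```

Unlike the global version, the leading A of ATCG is left out of the result rather than being aligned with a gap.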

(48)

(49)

Gaps

• A gap is any maximal, consecutive run of spaces in a single string of a given alignment.

c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c

Four gaps and seven spaces

The simplest objective function that includes gaps:

Maximize [W_m(# matches) – W_ms(# mismatches) – k·W_g]

1. W_g is a constant weight charged for each gap

2. k is the number of gaps
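Counting gaps and spaces in a given alignment is straightforward to sketch. The function name is hypothetical, and the underscore is assumed to be the space character, as in the example above (written here without the separating blanks).

```python
import re


def count_gaps_and_spaces(row1, row2):
    """Return (number of gaps, number of spaces) over both rows of an alignment."""
    gaps = spaces = 0
    for row in (row1, row2):
        runs = re.findall(r"_+", row)  # each maximal run of spaces is one gap
        gaps += len(runs)
        spaces += sum(len(run) for run in runs)
    return gaps, spaces


print(count_gaps_and_spaces("ctttaac__a_ac", "c___cacccat_c"))  # (4, 7)
```

The regular expression `_+` matches maximal runs, which is exactly the definition of a gap.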

(50)

Why Gaps?

• The top row shows part of the RNA sequence of one strain of the HIV-1 virus.

• HIV mutates rapidly

• Each of the three bottom rows shows a virus strain that has mutated from the original one.

• Dark regions are the matching portions; white space represents gaps

(51)

cDNA Matching: A Concrete Example

(52)

Choice of Gap Weights

• Constant

– Maximize [W_m(# matches) – W_ms(# mismatches) – W_g(# gaps)]

– Or

• Affine

– Maximize [W_m(# matches) – W_ms(# mismatches) – W_g(# gaps) – W_s(# spaces)]

– W_g is the gap initiation cost and W_s is the gap extension cost

• Convex

• Arbitrary
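To make the constant and affine objectives concrete, the sketch below evaluates them for one given alignment. The function name and the weight values in the usage lines are hypothetical, chosen only for illustration. Setting W_s to 0 gives the constant gap weight model; a positive W_s gives the affine model.

```python
import re


def gap_weighted_score(row1, row2, w_m, w_ms, w_g, w_s=0):
    """W_m(#matches) - W_ms(#mismatches) - W_g(#gaps) - W_s(#spaces) for one alignment."""
    matches = mismatches = 0
    for a, b in zip(row1, row2):
        if a == "_" or b == "_":
            continue  # columns with a space are charged through gaps and spaces
        if a == b:
            matches += 1
        else:
            mismatches += 1
    gaps = spaces = 0
    for row in (row1, row2):
        runs = re.findall(r"_+", row)  # maximal runs of spaces
        gaps += len(runs)
        spaces += sum(len(run) for run in runs)
    return w_m * matches - w_ms * mismatches - w_g * gaps - w_s * spaces


top, bottom = "ctttaac__a_ac", "c___cacccat_c"
# Constant gap weight (hypothetical weights): 5 matches, 1 mismatch, 4 gaps.
print(gap_weighted_score(top, bottom, w_m=1, w_ms=1, w_g=2))         # -4
# Affine gap weight: the 7 spaces are charged as well.
print(gap_weighted_score(top, bottom, w_m=1, w_ms=1, w_g=2, w_s=1))  # -11
```

This only scores a fixed alignment; finding the alignment that maximizes the affine objective needs a modified dynamic program, which is not shown here.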