Lecture5 Dynamic Programming


Inexact Matching and Alignment

• Inexact (approximate) matching means that some errors are allowed between the strings being compared

• Alignment generally means lining up the characters of two strings, allowing mismatches as well as matches, and allowing characters of one string to be placed opposite spaces in the other string

(3)

Importance of Alignment or

Approximate Matching

• It is central in computational molecular biology

• This is because of the active mutational processes that change sequences over time

• “Duplication and modification” is the central part of protein evolution

(4)

Edit Distance Between Two Strings

• Difference between two strings

It focuses on transforming (or editing) one string into the other by a series of edit operations on individual characters

• The permitted edit operations are

– Insertion (I) of a character into the first string
– Deletion (D) of a character from the first string
– Substitution (or replacement) (R) of a character in the first string with a character in the second string

• For a match (M), no operation is necessary

(5)

Edit Transcript vs. Edit Distance

Edit Transcript: A string over the alphabet I, D, R, M that describes a transformation of one string to another is called an edit transcript, or transcript for short, of the two strings.

Edit Distance: The minimum number of edit operations – insertions, deletions and substitutions – needed to transform the first string into the second. Also known as Levenshtein distance.

R I M D M D M M I
v _ i n t n e r _
w r i _ t _ e r s
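The example transcript can be checked mechanically. The following is a minimal Python sketch, not part of the original slides; the function names `apply_transcript` and `transcript_cost` are hypothetical. It walks the transcript over S1 = vintner and S2 = writers and counts the operations other than M.

```python
def apply_transcript(s1: str, s2: str, transcript: str) -> str:
    """Apply an edit transcript over {I, D, R, M} to s1 and return the result."""
    out = []
    i = j = 0  # next unread positions in s1 and s2
    for op in transcript:
        if op == "M":  # match: the character is copied unchanged
            if s1[i] != s2[j]:
                raise ValueError("M used on unequal characters")
            out.append(s1[i])
            i += 1
            j += 1
        elif op == "R":  # substitution: take the character from s2
            out.append(s2[j])
            i += 1
            j += 1
        elif op == "I":  # insertion of s2[j] into the first string
            out.append(s2[j])
            j += 1
        elif op == "D":  # deletion of s1[i] from the first string
            i += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return "".join(out)


def transcript_cost(transcript: str) -> int:
    """Number of edit operations; matches are free."""
    return sum(op != "M" for op in transcript)


print(apply_transcript("vintner", "writers", "RIMDMDMMI"))  # writers
print(transcript_cost("RIMDMDMMI"))  # 5
```

The transcript uses five operations (one R, two I, two D), which is the edit distance of the two words.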

(6)

Optimal Transcript

• An optimal transcript is an edit transcript that uses the minimum number of edit operations.

(7)

String Alignment

A (global) alignment of two strings S1 and S2 is obtained by first inserting chosen spaces, either into or at the ends of S1 and S2, and then placing the two resulting strings one above the other so that every character or space in either string is opposite a unique character or a unique space in the other string.

v _ i n t n e r _
w r i _ t _ e r s

(8)

Alignment vs. Edit Transcript

• From a mathematical viewpoint, these are equivalent ways to describe the relationship between two strings

• An alignment can easily be converted to an edit transcript and vice versa

• From a modeling standpoint, they are quite different

– An edit transcript emphasizes the putative mutational events that transform one string into another
– An alignment displays the relationship only
– So one is the process (edit transcript) and the other is the product (alignment)

v _ i n t n e r _
w r i _ t _ e r s
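The conversion in both directions is a column-by-column rule. The sketch below is illustrative only; the function names and the use of `_` as the space character are assumptions made for this example.

```python
def alignment_to_transcript(row1: str, row2: str, space: str = "_") -> str:
    """Read an alignment column by column and emit I, D, R or M."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    ops = []
    for a, b in zip(row1, row2):
        if a == space and b == space:
            raise ValueError("a column cannot hold two spaces")
        if a == space:        # space in S1: character inserted into S1
            ops.append("I")
        elif b == space:      # space in S2: character deleted from S1
            ops.append("D")
        elif a == b:
            ops.append("M")
        else:
            ops.append("R")
    return "".join(ops)


def transcript_to_alignment(s1: str, s2: str, transcript: str, space: str = "_"):
    """Rebuild the two alignment rows from the strings and a transcript."""
    row1, row2 = [], []
    i = j = 0
    for op in transcript:
        if op == "I":
            row1.append(space)
            row2.append(s2[j])
            j += 1
        elif op == "D":
            row1.append(s1[i])
            row2.append(space)
            i += 1
        else:  # M or R consume one character from each string
            row1.append(s1[i])
            row2.append(s2[j])
            i += 1
            j += 1
    return "".join(row1), "".join(row2)


print(alignment_to_transcript("v_intner_", "wri_t_ers"))  # RIMDMDMMI
print(transcript_to_alignment("vintner", "writers", "RIMDMDMMI"))
```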

(9)

Dynamic Programming Calculation of

Edit Distance

• How do we compute the edit distance of two strings, along with the accompanying edit transcript or alignment?

Definition: For two strings S1 and S2, D(i, j) is defined to be the edit distance of S1[1…i] and S2[1…j]

(10)

Steps of Dynamic Programming

• Recurrence relation

(11)

The Recurrence Relation

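• The recurrence relation establishes a relationship between the value of D(i, j), for positive i and j, and the values of D with index pairs smaller than (i, j)

• Base conditions are

D(i, 0) = i, i.e. delete the first i characters of S1
D(0, j) = j, i.e. insert the first j characters of S2

• The recurrence relation is

D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]

where t(i, j) = 1 if S1(i) ≠ S2(j) and t(i, j) = 0 if S1(i) = S2(j)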

(12)

Tabular Computation: Bottom Up Approach

D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]
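The bottom-up fill can be sketched in a few lines of Python. This is a minimal illustration of the recurrence above, assuming unit costs; the function name `edit_distance_table` is hypothetical, and t(i, j) is taken to be 0 on a match and 1 otherwise.

```python
def edit_distance_table(s1: str, s2: str):
    """Fill the (n+1) x (m+1) table D row by row and return it."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                      # delete i characters
    for j in range(1, m + 1):
        D[0][j] = j                      # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match or substitution
    return D


D = edit_distance_table("vintner", "writers")
print(D[7][7])  # 5
```

Each cell needs only its three neighbors, so the table is filled in O(nm) time.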


(13)

Tabular Computation: Bottom Up Approach

D(i, j) = min[D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j)]


(14)

The Traceback

For an optimal edit transcript, follow any traceback path from cell (n, m) to cell (0, 0)

1. A horizontal edge, from (i, j) to (i, j-1), is an insertion (I) of character S2(j) into S1
2. A vertical edge, from (i, j) to (i-1, j), is a deletion (D) of S1(i) from S1
3. A diagonal edge, from (i, j) to (i-1, j-1), is a match (M) if S1(i) = S2(j) and a substitution (R) if S1(i) ≠ S2(j)

(15)

The Traceback

Alternatively in terms of alignment

1. Horizontal edge specifies a space inserted into S1

2. Vertical edge specifies a space inserted into S2

3. Diagonal edge specifies either a match or a mismatch

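Three traceback paths, identical from (7, 7) to (3, 3), give three optimal alignments of S1 = vintner and S2 = writers:

v i n t n e r _     v _ i n t n e r _     _ v i n t n e r _
w r i t _ e r s     w r i _ t _ e r s     w r i _ t _ e r s

All three share the alignment of tner_ with t_ers and differ only in how vin is aligned with wri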
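The traceback rules can be turned into a short recursive walk that follows every edge consistent with the table. The sketch below is an illustration under unit costs, and the name `optimal_transcripts` is hypothetical. Enumerating all paths can take exponential time on long strings, so this is meant for small examples only.

```python
def optimal_transcripts(s1: str, s2: str):
    """Return every optimal edit transcript of s1 into s2 (unit costs)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    results = []

    def walk(i, j, ops):
        if i == 0 and j == 0:
            results.append("".join(reversed(ops)))
            return
        if j > 0 and D[i][j] == D[i][j - 1] + 1:      # horizontal edge
            walk(i, j - 1, ops + ["I"])
        if i > 0 and D[i][j] == D[i - 1][j] + 1:      # vertical edge
            walk(i - 1, j, ops + ["D"])
        if i > 0 and j > 0:                           # diagonal edge
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                walk(i - 1, j - 1, ops + ["M" if t == 0 else "R"])

    walk(n, m, [])
    return sorted(results)


print(optimal_transcripts("vintner", "writers"))
# ['IRMDMDMMI', 'RIMDMDMMI', 'RRRMDMMI']
```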

(16)

Edit Graphs

• It is often useful to represent dynamic programming solutions of string problems in terms of a weighted edit graph

If |S1| = n and |S2| = m then the weighted edit graph has (n+1) x (m+1) nodes

– Each edge has a weight

• In the case of edit distance

problem, each edge has weight 1 except the three edges

(17)

Weighted Edit Distance

• An easy but crucial generalization is to associate a weight (cost or score) with every edit operation, as well as with a match

– Let the insertion or deletion weight be d
– Let the substitution weight be r
– Let the match weight be e, usually very small and often zero

Equivalently, in terms of an operation-weight alignment

– A mismatch costs r
– A match costs e
– A space costs d

• Two types of weighted edit distance

– Operation weight
– Alphabet weight

(18)

Operation-weight Edit Transcript

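With d = 1, r = 1 and e = 0 we get the three optimal alignments of the unweighted problem.

With d = 4, r = 2 and e = 1, the alignment

w r i t _ e r s
v i n t n e r _

has total weight 3r + 3e + 2d = 6 + 3 + 8 = 17. With these weights the gap-free alignment of writers and vintner costs only 6r + e = 13, so changing the weights changes which alignment is optimal.

Modified recurrence relations:

D(i, 0) = i × d, D(0, j) = j × d
D(i, j) = min[D(i-1, j) + d, D(i, j-1) + d, D(i-1, j-1) + t(i, j)]

where t(i, j) = e if S1(i) = S2(j) and t(i, j) = r otherwise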
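The effect of the weights d, r and e can be explored with a small sketch. The code below is an illustration, not the lecture's own implementation; the function names are hypothetical. One function scores a given alignment column by column, the other fills the table with the operation weights in place of the unit costs.

```python
def alignment_weight(row1: str, row2: str, d: int, r: int, e: int, space: str = "_") -> int:
    """Total operation weight of a given alignment."""
    total = 0
    for a, b in zip(row1, row2):
        if a == space or b == space:
            total += d          # space (insertion or deletion)
        elif a == b:
            total += e          # match
        else:
            total += r          # mismatch (substitution)
    return total


def operation_weight_distance(s1: str, s2: str, d: int = 1, r: int = 1, e: int = 0) -> int:
    """Minimum total weight over all alignments of s1 and s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * d
    for j in range(1, m + 1):
        D[0][j] = j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = e if s1[i - 1] == s2[j - 1] else r
            D[i][j] = min(D[i - 1][j] + d, D[i][j - 1] + d, D[i - 1][j - 1] + t)
    return D[n][m]


print(alignment_weight("writ_ers", "vintner_", d=4, r=2, e=1))   # 17
print(operation_weight_distance("vintner", "writers"))            # 5
print(operation_weight_distance("vintner", "writers", 4, 2, 1))   # 13
```

For d = 4, r = 2, e = 1 the table minimum is 13, the weight of the gap-free alignment of the two words (six mismatches and one match), because each pair of spaces costs 8 while a mismatch costs only 2.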

(19)

Alphabet-weight Edit Distance

• Assign score/weight depending on characters

– For example, it may be more costly to replace an A with a T than with a G

– Or, the weight of a deletion / insertion may depend on exactly which character is deleted / inserted

• Weighted edit distance usually means alphabet-weight version

• Dominant scoring matrices are PAM matrices, and the newer BLOSUM scoring matrices
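A character-dependent cost function slots into the same table computation. The sketch below uses made-up DNA costs chosen only to mirror the A/T versus A/G example; they are not PAM or BLOSUM values, and the names `substitution_cost`, `indel_cost` and `alphabet_weight_distance` are hypothetical.

```python
PURINES = {"A", "G"}  # the remaining DNA letters, C and T, are pyrimidines


def substitution_cost(x: str, y: str) -> int:
    """Hypothetical costs: 0 for a match, 1 within a class, 2 across classes."""
    if x == y:
        return 0
    return 1 if (x in PURINES) == (y in PURINES) else 2


def indel_cost(c: str) -> int:
    """Hypothetical flat cost; it could depend on the character c."""
    return 2


def alphabet_weight_distance(s1: str, s2: str) -> int:
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + indel_cost(s1[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel_cost(s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + indel_cost(s1[i - 1]),
                D[i][j - 1] + indel_cost(s2[j - 1]),
                D[i - 1][j - 1] + substitution_cost(s1[i - 1], s2[j - 1]),
            )
    return D[n][m]


print(alphabet_weight_distance("A", "G"))  # 1
print(alphabet_weight_distance("A", "T"))  # 2
```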

(20)

String Similarity

• While edit distance minimizes the total weight, string similarity maximizes the total score

• For string similarity

– Match scores are greater than or equal to zero

(21)

Computing String Similarity

Let V(i, j) be the value of the optimal alignment of prefixes S1[1…i] and S2[1…j]

(22)

End-space Free Variant

• Any spaces at the beginning and end of the alignment have cost zero

• Encourages one string to align in the interior of the other

• Or the suffix of one string to align with a prefix of the other

The shotgun sequence assembly problem (see sections 16.14 and 16.15) uses this variant; it can be a project.
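One common way to realize free end spaces in the table is to set the first row and column to zero and to read the answer from the best cell in the last row or last column. The sketch below follows that idea; the scoring values (match +1, mismatch -1, space -1) and the function name are assumptions for illustration.

```python
def end_space_free_score(s1: str, s2: str, match: int = 1, mismatch: int = -1, space: int = -1) -> int:
    """Best similarity when leading and trailing spaces cost nothing."""
    n, m = len(s1), len(s2)
    # Row 0 and column 0 stay at 0: leading spaces are free.
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if s1[i - 1] == s2[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
    # Trailing spaces are free: take the best cell in the last row or column.
    return max(max(V[n]), max(row[m] for row in V))


print(end_space_free_score("ATCGA", "TCG"))     # 3, TCG sits inside ATCGA
print(end_space_free_score("AAATCG", "TCGGG"))  # 3, suffix TCG meets prefix TCG
```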

(23)

Local vs. global alignment

• Global alignment: entire sequences

• Local alignment: segments of sequences

• Local alignment is often the most relevant

(24)

The Needleman-Wunsch and the Smith-Waterman Algorithms for Sequence Alignment

(25)

Global Sequence Alignment

The Needleman–Wunsch algorithm performs a global alignment on two sequences

• It is an example of dynamic programming, and was the first application of dynamic programming to biological sequence comparison

• Suitable when the two sequences are of similar length, with a significant degree of similarity throughout

(26)

Three steps in Needleman-Wunsch

Algorithm

• Initialization
• Scoring
• Trace back (Alignment)

Consider the two DNA sequences to be globally aligned: ATCG and TCG

(27)

Scoring Scheme

• Match score = +1
• Mismatch score = -1
• Gap penalty = -1

Substitution Matrix

     A   C   G   T
A    1  -1  -1  -1
C   -1   1  -1  -1
G   -1  -1   1  -1
T   -1  -1  -1   1

(28)

Initialization Step

• Create a matrix with X + 1 rows and Y + 1 columns, where X and Y are the lengths of the two sequences

• The 1st row and the 1st column of the score matrix are filled with multiples of the gap penalty

        T   C   G
    0  -1  -2  -3
A  -1
T  -2
C  -3
G  -4

(29)

Scoring

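• The score of any cell C(i, j) is the maximum of:

scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g

where S(i, j) is the substitution score for the characters at row i and column j, and g is the gap penalty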

(30)

Scoring ….

• Example:

The calculation for the cell C(2, 2) (row A, column T):

scorediag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1
scoreup = C(i-1, j) + g = -1 + -1 = -2
scoreleft = C(i, j-1) + g = -1 + -1 = -2

The maximum of the three is -1, so C(2, 2) = -1

        T   C   G
    0  -1  -2  -3
A  -1  -1
T  -2
C  -3
G  -4

(31)

Scoring ….

Final Scoring Matrix

Note: The last cell always holds the optimal global alignment score; here it is 2

        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2
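The initialization and scoring steps can be sketched as follows. This is a minimal Python illustration using the scoring scheme of the slides (match +1, mismatch -1, gap -1); the function name is hypothetical, and the sequences ATCG (rows) and TCG (columns) are read off the matrices and the final alignment.

```python
def needleman_wunsch_matrix(rows_seq: str, cols_seq: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Initialize and fill the Needleman-Wunsch score matrix."""
    x, y = len(rows_seq), len(cols_seq)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        C[i][0] = i * gap                # first column: multiples of the gap penalty
    for j in range(1, y + 1):
        C[0][j] = j * gap                # first row: multiples of the gap penalty
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s,   # scorediag
                          C[i - 1][j] + gap,     # scoreup
                          C[i][j - 1] + gap)     # scoreleft
    return C


for row in needleman_wunsch_matrix("ATCG", "TCG"):
    print(row)
```

The last row printed is `[-4, -2, 0, 2]`, so the optimal global alignment score of the two sequences is 2.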

(32)

Trace back

• The trace back step determines the actual alignment(s) that result in the maximum score

• There are likely to be multiple maximal alignments

• Trace back starts from the last cell, i.e. position X, Y in the matrix

(33)

Trace back ….

• There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left

• Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors

(34)

Trace back ….

• Starting from the last cell, the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

Seq 1: G
       |
Seq 2: G

        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2

(35)

Trace back ….

• Final Trace back

Best Alignment:

A T C G
  | | |
_ T C G

        T   C   G
    0  -1  -2  -3
A  -1  -1  -2  -3
T  -2   0  -1  -2
C  -3  -1   1   0
G  -4  -2   0   2
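The full procedure, including trace back, can be sketched as below. This is an illustrative Python version under the slides' scoring scheme; the function name and the tie-breaking order (diagonal, then up, then left) are assumptions, and other orders can return a different alignment with the same score.

```python
def needleman_wunsch(seq1: str, seq2: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned seq1, aligned seq2) for one optimal global alignment."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    for i in range(1, x + 1):
        C[i][0] = i * gap
    for j in range(1, y + 1):
        C[0][j] = j * gap
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Trace back from the last cell to the top-left corner.
    a1, a2 = [], []
    i, j = x, y
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + s:     # diagonal predecessor
                a1.append(seq1[i - 1])
                a2.append(seq2[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # predecessor above
            a1.append(seq1[i - 1])
            a2.append("_")
            i -= 1
            continue
        a1.append("_")                              # predecessor to the left
        a2.append(seq2[j - 1])
        j -= 1
    return C[x][y], "".join(reversed(a1)), "".join(reversed(a2))


print(needleman_wunsch("ATCG", "TCG"))  # (2, 'ATCG', '_TCG')
```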

(36)

Local Sequence Alignment

The Smith-Waterman algorithm performs a local alignment on two sequences

• It is an example of dynamic programming

• Useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context

(37)

Differences in Needleman-Wunsch and

Smith-Waterman Algorithms:

• In the initialization stage, the first row and first column are all filled in with 0s

• While filling the matrix, if a score becomes negative, put in 0 instead

• In the traceback, start with the cell that has the maximum value in the matrix and stop when a cell with score 0 is reached

(38)

Three steps in Smith-Waterman

Algorithm

• Initialization
• Scoring
• Trace back (Alignment)

Consider the two DNA sequences to be locally aligned: ATCG and TCG

(39)

Scoring Scheme

• Match score = +1
• Mismatch score = -1
• Gap penalty = -1

Substitution Matrix

     A   C   G   T
A    1  -1  -1  -1
C   -1   1  -1  -1
G   -1  -1   1  -1
T   -1  -1  -1   1

(40)

Initialization Step

• Create a matrix with X + 1 rows and Y + 1 columns, where X and Y are the lengths of the two sequences

• The 1st row and the 1st column of the score matrix are filled with 0s

        T   C   G
    0   0   0   0
A   0
T   0
C   0
G   0

(41)

Scoring

• The score of any cell C(i, j) is the maximum of:

scorediag = C(i-1, j-1) + S(i, j)
scoreup = C(i-1, j) + g
scoreleft = C(i, j-1) + g
and 0

(42)

Scoring ….

• Example:

The calculation for the cell C(2, 2) (row A, column T):

scorediag = C(i-1, j-1) + S(i, j) = 0 + -1 = -1
scoreup = C(i-1, j) + g = 0 + -1 = -1
scoreleft = C(i, j-1) + g = 0 + -1 = -1

All three candidates are negative, so C(2, 2) = 0

        T   C   G
    0   0   0   0
A   0   0
T   0
C   0
G   0

(43)

Scoring ….

Final Scoring Matrix

Note: The maximum alignment score is not necessarily in the last cell, although in this example it is (score 3)

        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3
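The Smith-Waterman fill differs from the global one only in the zero initialization and the extra 0 in the maximum. A minimal Python sketch under the slides' scoring scheme follows; the function name is hypothetical, and the sequences ATCG (rows) and TCG (columns) are read off the matrices.

```python
def smith_waterman_matrix(rows_seq: str, cols_seq: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Fill the Smith-Waterman score matrix; negative scores are replaced by 0."""
    x, y = len(rows_seq), len(cols_seq)
    C = [[0] * (y + 1) for _ in range(x + 1)]   # first row and column stay 0
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if rows_seq[i - 1] == cols_seq[j - 1] else mismatch
            C[i][j] = max(0,
                          C[i - 1][j - 1] + s,   # scorediag
                          C[i - 1][j] + gap,     # scoreup
                          C[i][j - 1] + gap)     # scoreleft
    return C


C = smith_waterman_matrix("ATCG", "TCG")
for row in C:
    print(row)
print(max(max(row) for row in C))  # 3
```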

(44)

Trace back

• The trace back step determines the actual alignment(s) that result in the maximum score

• There are likely to be multiple maximal alignments

• Trace back starts from the cell with the maximum value in the matrix

(45)

Trace back ….

• There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left

• Trace back takes the current cell and looks to the neighbor cells that could be direct predecessors. This means it looks to the neighbor to the left (gap in sequence #2), the diagonal neighbor (match/mismatch), and the neighbor above it (gap in sequence #1). The algorithm for trace back chooses as the next cell in the sequence one of the possible predecessors

(46)

Trace back ….

• Starting from the cell with the maximum value, the only possible predecessor is the diagonal match/mismatch neighbor. If more than one possible predecessor exists, any can be chosen. This gives us a current alignment of

Seq 1: G
       |
Seq 2: G

        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3

(47)

Trace back ….

• Final Trace back

Best Alignment:

T C G
| | |
T C G

        T   C   G
    0   0   0   0
A   0   0   0   0
T   0   1   0   0
C   0   0   2   1
G   0   0   1   3
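The local trace back can be sketched as below. This is an illustrative Python version under the slides' scoring scheme; the function name, the choice of the first maximum cell found, and the tie-breaking order (diagonal, then up, then left) are assumptions.

```python
def smith_waterman(seq1: str, seq2: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned part of seq1, aligned part of seq2) for one best local alignment."""
    x, y = len(seq1), len(seq2)
    C = [[0] * (y + 1) for _ in range(x + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, x + 1):
        for j in range(1, y + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(0, C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)
            if C[i][j] > best:                     # remember the maximum cell
                best, best_pos = C[i][j], (i, j)

    # Trace back from the maximum cell until a cell with score 0 is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and C[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + s:         # diagonal predecessor
            a1.append(seq1[i - 1])
            a2.append(seq2[j - 1])
            i -= 1
            j -= 1
        elif C[i][j] == C[i - 1][j] + gap:         # predecessor above
            a1.append(seq1[i - 1])
            a2.append("_")
            i -= 1
        else:                                      # predecessor to the left
            a1.append("_")
            a2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(smith_waterman("ATCG", "TCG"))  # (3, 'TCG', 'TCG')
```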

(48)
(49)

Gaps

A gap is any maximal, consecutive run of spaces in a single string of a given alignment.

c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c

This alignment has four gaps and seven spaces

The simplest objective function that includes gaps is

Maximize [Wm(# matches) – Wms(# mismatches) – k × Wg]

1. where Wg is a constant weight charged for each gap
2. k is the number of gaps
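Counting gaps and spaces is a matter of finding maximal runs of the space character in each row. A small Python sketch follows; the function name and the use of `_` for a space are assumptions for illustration.

```python
import re


def count_gaps_and_spaces(row: str, space: str = "_"):
    """Return (number of gaps, number of spaces) in one row of an alignment."""
    spaces = row.count(space)
    # Each maximal run of consecutive spaces is one gap.
    gaps = len(re.findall(re.escape(space) + "+", row))
    return gaps, spaces


top = "ctttaac__a_ac"
bottom = "c___cacccat_c"
g1, s1 = count_gaps_and_spaces(top)
g2, s2 = count_gaps_and_spaces(bottom)
print(g1 + g2, s1 + s2)  # 4 7
```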

(50)

Why Gaps?

• The top row shows part of the RNA sequence of one strain of the HIV-1 virus

• The HIV virus mutates rapidly

• Each of the three bottom rows shows a virus strain that has mutated from the original one

• Dark regions are the matching portions; white space represents gaps

(51)

cDNA Matching: A Concrete Example

(52)

Choice of Gap Weights

• Constant

– Maximize [Wm(# matches) – Wms(# mismatches) – Wg(# gaps)]

– Or

• Affine

– Maximize [Wm(# matches) – Wms(# mismatches) – Wg(# gaps) – Ws(# spaces)]

– Wg is the gap initiation cost and Ws is the gap extension cost

• Convex

• Arbitrary
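The constant and affine objectives can be evaluated for a given alignment with one function, since the constant model is the affine model with Ws = 0. The sketch below is illustrative; the function name and the weight values in the usage lines are hypothetical, not values from the lecture.

```python
def gapped_alignment_value(row1: str, row2: str, wm: int, wms: int, wg: int, ws: int = 0, space: str = "_") -> int:
    """Wm(# matches) - Wms(# mismatches) - Wg(# gaps) - Ws(# spaces)."""
    matches = mismatches = gaps = spaces = 0
    prev1 = prev2 = False   # was the previous column a space in row1 / row2?
    for a, b in zip(row1, row2):
        in1, in2 = a == space, b == space
        if in1 or in2:
            spaces += 1
            # A gap starts where a run of spaces begins in that row.
            if in1 and not prev1:
                gaps += 1
            if in2 and not prev2:
                gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        prev1, prev2 = in1, in2
    return wm * matches - wms * mismatches - wg * gaps - ws * spaces


top, bottom = "ctttaac__a_ac", "c___cacccat_c"
print(gapped_alignment_value(top, bottom, wm=1, wms=1, wg=2))         # constant: -4
print(gapped_alignment_value(top, bottom, wm=1, wms=1, wg=2, ws=1))   # affine: -11
```

The example alignment has 5 matches, 1 mismatch, 4 gaps and 7 spaces, which gives 5 - 1 - 8 = -4 under the constant model and -4 - 7 = -11 under the affine model with these weights.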
