Pairwise Sequence Alignment

(1)

Carolin Kosiol

(2)

Outline

Pairwise sequence alignment

¾

global: Needleman-Wunsch-Gotoh algorithm

¾

local: Smith-Waterman algorithm

(3)

¾

Quite simply, the comparison of two or more DNA or protein sequences to each other.

¾

The purpose of alignment is to highlight similarity between sequences.

¾

Alignment is the procedure of writing two (or more) sequences in a way that a maximum of identical or similar characters are placed in the same column by adding gap '-' characters.

(4)

Word Alignment

Species 1: SOMEONE

Species 2: AWESOME

Species 1: - - - SOMEONE

Species 2: AWESOME - - -

(5)

Less trivial

Species 1: ACGTTAGA

Species 2: CGTTGAA

Species 1: - - - ACGTTAGA

Species 2: CGTTGAA - - -

Species 1: ACGTTAGA

Species 2: - CGTT - GAA

(6)

Less trivial

Species 1: - - - ACGTTAGA

Species 2: CGTTGAA - - -

¾

score: -15 (gaps = -1, match = 1)

Species 1: ACGTTAGA

Species 2: - CGTT - GAA
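Scoring an alignment column by column can be sketched as follows. This is a minimal sketch: the match and gap values follow the slide, the mismatch score of 0 is an assumption (the slide does not give one), and the function name `score_alignment` is illustrative. It is applied to the word example, written without spaces.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two aligned rows of equal length.

    match and gap follow the slide; mismatch=0 is an assumption.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # column containing a gap character
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # different characters
    return total


# Four matching columns (S, O, M, E) and six gap columns.
print(score_alignment("---SOMEONE", "AWESOME---"))  # -2
```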

(7)

FASTA Format - Input

¾

Standard input format for alignment programs

¾

Use 'x' for ambiguous characters

¾

Strictly speaking, should not contain gaps

>Name1

ASEQUENCE1

>Name2 comments

SEQU
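Reading this format takes only a few lines. The sketch below assumes that the first whitespace-delimited token after '>' is the name and the rest of the header line is a comment, and that a sequence may span several lines; `parse_fasta` is a hypothetical helper, not part of any alignment program.

```python
def parse_fasta(text):
    """Return a list of (name, comment, sequence) tuples from FASTA text."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if name is not None:
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequence lines are concatenated
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


example = ">Name1\nASEQUENCE1\n>Name2 comments\nSEQU\n"
for record in parse_fasta(example):
    print(record)
```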

(8)

FASTA Format - Output

¾

Increasingly, multiple alignments are returned in FASTA-like format

¾

Gaps represented using '-'

¾

Order of sequences may be different in output to input.

>Name1

ASEQUENCE1

>Name2 comments

-SEQU--CE2

etc....

(9)

Relatedness of residues in same column

¾

Making these alignments is EASY....

¾

As we know where and which evolutionary events

occurred

(10)

Quiz

Which alignment (X, Y or Z) shows only residues related by substitution events in the same column?

(11)

Types of alignment methods

We cannot enumerate all possible alignments.

Approaches are:

¾

Dot matrix

¾

Dynamic Programming

(12)

¾

Given two

(13)

In a dot matrix we can identify:

¾

Existing alignable parts of sequences

¾

Possible indels

¾

Duplicated sequences and repeats

¾

Self-complementarity

¾

Gene-order differences among genomes
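A dot matrix marks every pair of positions where the two sequences carry the same symbol. The sketch below is a minimal version with no window or threshold filtering (real dot-plot tools usually add one to reduce noise); the function names are illustrative.

```python
def dot_matrix(seq1, seq2):
    """Return rows of booleans; cell [i][j] is True when seq1[i] == seq2[j]."""
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2):
    """Draw the matrix as text, with seq2 across the top and seq1 down the side."""
    lines = ["  " + " ".join(seq2)]
    for symbol, row in zip(seq1, dot_matrix(seq1, seq2)):
        lines.append(symbol + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


print(render("SOMEONE", "AWESOME"))
```

In the printed plot the shared segment SOME shows up as a diagonal run of four dots that is shifted off the main diagonal, which is how an alignable part with an offset appears.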

(14)

(15)

a) A continuous main diagonal shows perfect similarity for symbols with the same indices.

b) Parallels to the main diagonal indicate repeated regions in the same reading direction on different parts of the sequences. In this case a region D is found twice in the sequence (D1, D2, so-called 'duplications').

c) Lines perpendicular to the main diagonal indicate palindromic areas. In this case the sequence is completely palindromic in the displayed area.

d) Partially palindromic sequence. (For DNA sequences this refers to a perfect match of the normal strand with its reverse complement, which is frequently found for many transposable elements.)

e) Bold blocks on the main diagonal indicate repetition of the same symbol in both sequences, e.g. (G)50, so-called microsatellite repeats.

f) Parallel lines indicate tandem repeats of a larger motif in both sequences, e.g. (AGCTCTGAC)20, so-called minisatellite patterns. The distance between the diagonals equals the length of the motif.

g) When the diagonal is a discontinuous line, this indicates that the sequences T1 and T2 share a common source. In text analyses we may have to deal with plagiarism; in DNA analyses sequences may be homologous because of a common ancestor. The number of interruptions increases with modifications to the text, or with the time of independent evolution and the mutation rate.

h) Partial deletion in one sequence or insertion in the other, a so-called 'indel'. In protein coding sequences this can be often observed for many different types of domains, which got lost or

(16)

gap = -15

Aligning a pair of sequences

¾

Aim: get from one

corner to other

¾

Moves have a cost

¾

Choose cheapest

way

¾

Fill in table

¾

Trace route

backwards to find

alignment

match = +10,

(17)

Aligning a pair of sequences

(Dynamic Programming)

¾

Aim: get from one

corner to other

¾

Moves have a cost

¾

Choose cheapest

way

¾

Fill in table

¾

Trace route

backwards to find

alignment
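The fill-and-trace-back procedure can be sketched as below. This is a minimal sketch of global alignment with a linear gap penalty, not the Gotoh affine-gap variant; it maximizes a score rather than minimizing a cost, and the match, mismatch and gap values are illustrative assumptions rather than the values used on the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scores are illustrative defaults.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # (n+1) x (m+1) table: row 0 and column 0 represent leading gaps.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        table[i][0] = i * gap
    for j in range(1, m + 1):
        table[0][j] = j * gap

    # Fill in the table: each cell takes the best of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                table[i - 1][j - 1] + sub(i, j),  # align two characters
                table[i - 1][j] + gap,            # gap in b
                table[i][j - 1] + gap,            # gap in a
            )

    # Trace the route backwards from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and table[i][j] == table[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return table[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, row1, row2 = needleman_wunsch("ACGTTAGA", "CGTTGAA")
print(score)
print(row1)
print(row2)
```

When several routes tie, the traceback here prefers the diagonal move, so other equally good alignments may exist.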


(18)

¾

Initialize NxM matrix with the sequences A and B of length N and M

¾

Starting at the top left corner set the intermediate scoring value V =

(19)

Substitution matrices

Some amino acids are more similar than others

¾

Adjust cost according to some similarity matrix

E.g. Blosum62

Leu -> Leu: 4

Leu -> Met: 2

Leu -> Pro: -3

... etc.
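A substitution matrix is a symmetric lookup table. The sketch below holds only the three Blosum62 entries quoted above, keyed by one-letter codes (L = Leu, M = Met, P = Pro); the dictionary and function names are illustrative, and a real program would load the full matrix.

```python
# Only the three entries quoted on the slide; keys are unordered residue pairs.
BLOSUM62_SUBSET = {
    frozenset(["L"]): 4,        # Leu -> Leu
    frozenset(["L", "M"]): 2,   # Leu -> Met
    frozenset(["L", "P"]): -3,  # Leu -> Pro
}


def substitution_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score for aligning residues a and b (order does not matter)."""
    return matrix[frozenset([a, b])]


def score_ungapped(seq1, seq2, matrix=BLOSUM62_SUBSET):
    """Score two equal-length sequences column by column, without gaps."""
    return sum(substitution_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(score_ungapped("LLL", "LMP"))  # 4 + 2 - 3 = 3
```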

(20)

Gap penalties

¾

Gaps tend to occur together, so one penalty per gap position is unrealistic

¾

a gap of length three should not cost three times as much as a gap of length one

¾

Use affine gap cost

¾

Make extending an already existing gap cheaper

¾

Gap opening (G) / gap extension (E)

¾

Total cost for gap length x: G + x × E

¾

E.g. psiblast defaults: G = -…, E = -…
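The effect of the affine cost G + x × E can be seen by comparing it with a linear cost for the same gap lengths. The penalty values below are illustrative assumptions only, not the defaults of any program.

```python
def affine_gap_cost(length, gap_open, gap_extend):
    """Cost of one gap of the given length: G + x * E."""
    return gap_open + length * gap_extend


def linear_gap_cost(length, per_position):
    """Cost when every gap position is charged the same penalty."""
    return length * per_position


G, E = -10, -1  # illustrative values, chosen so that a gap of length 1 costs -11
for x in (1, 2, 3):
    print(x, affine_gap_cost(x, G, E), linear_gap_cost(x, G + E))
# 1 -11 -11
# 2 -12 -22
# 3 -13 -33
```

Both schemes charge the same for a gap of length one, but under the affine scheme a gap of length three costs -13 instead of -33.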

(21)

Global vs Local Alignment

¾

Global: Find the best overall alignment between sequences.

¾

Local: Find short regions of highly conserved sequence.

(22)

Global vs Local

Species 1: SOMEONE

Species 2: AWESOME

Species 1: - - - SOMEONE

Species 2: AWESOME - - -

Species 1: SOME

Species 2: SOME

(23)

Smith-Waterman Algorithm

¾

Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximizes the similarity

¾

For every cell the algorithm considers ALL possible paths leading to it. These paths can be of any length and contain insertions and deletions
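A minimal sketch of local alignment follows. It uses a linear gap penalty and illustrative match, mismatch and gap scores (assumptions, not values from the slides). The two differences from the global case are that no cell may drop below zero and that the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b) for one best-scoring local alignment.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    table = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = max(
                0,                                # start a new local alignment
                table[i - 1][j - 1] + sub(i, j),  # align two characters
                table[i - 1][j] + gap,            # gap in b
                table[i][j - 1] + gap,            # gap in a
            )
            if table[i][j] > best:
                best, best_pos = table[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and table[i][j] > 0:
        if table[i][j] == table[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif table[i][j] == table[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("SOMEONE", "AWESOME"))  # (4, 'SOME', 'SOME')
```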

(24)

Calculating significance

We have calculated the optimal alignment: the alignment with the best score

¾

this doesn't depend on whether the sequences are related or not

¾

call this the maximum segment pair (MSP)

How many MSPs do we expect with at least the same score by chance?

(25)

Calculating significance

¾

We make use of the extreme value distribution (EVD) to calculate the number of alignments between random sequences that we expect to reach our score or better

¾

This is known as the e-value

¾

$E(S) = K m n e^{-\lambda S}$

¾

$K$ and $\lambda$ = scaling parameters calculated based on the search space ($K$) and scoring scheme ($\lambda$)

¾

$m$, $n$ = lengths of the query and the database ($m \times n$ is the size of the search space)

¾

The probability of finding at least one match with our score (the p-value): $1 - e^{-E(S)}$

¾

As both the e-value and the p-value decrease, the
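The two quantities on this slide can be computed directly. In the sketch below the values of K and lambda are placeholders chosen for illustration (real values depend on the scoring scheme), as are the score and the sequence lengths.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments with at least this score.

    Implements E(S) = K * m * n * exp(-lambda * S).
    """
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


K, lam = 0.1, 0.3  # placeholder scaling parameters (assumption)
for database_length in (1_000_000, 10_000_000):
    e = e_value(score=60, m=300, n=database_length, K=K, lam=lam)
    print(database_length, e, p_value(e))
```

With the score held fixed, a database ten times larger gives an e-value ten times larger, and for small e-values the p-value is almost equal to the e-value.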

(26)

¾

Basic Local Alignment Search Tool: used to find local sequence alignments between protein or nucleotide sequences (Altschul et al., 1990, cited over 43,000 times)

¾

Heuristic, so it finds an approximate best match (SW is a guarantee)

¾

calculate the high scoring matches instead of the maximum scoring matches (HSP instead of MSP)

(27)

BLAST

¾

break the sequence into "words" of a fixed length (default is 28, we will look at 4)

GTTCACATCATCCTGC

GTTC

TTCA

TCAC

CACA

ACAT

CATC

ATCA

...
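The word list above is every overlapping substring of length 4. A minimal sketch, with the illustrative function name `words`:

```python
def words(sequence, w):
    """Return all overlapping substrings of length w, in order."""
    return [sequence[i:i + w] for i in range(len(sequence) - w + 1)]


all_words = words("GTTCACATCATCCTGC", 4)
print(len(all_words))  # 13
print(all_words[:7])
# ['GTTC', 'TTCA', 'TCAC', 'CACA', 'ACAT', 'CATC', 'ATCA']
```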

(28)

BLAST

¾

for each of the words look at "likely" mutants (based on scoring matrices)

¾

you could call this the neighborhood

GTTCACATCATCCTGC

GTTC: CTTC,GTTC,GATC...

TTCA: TTCT,TTGA,TTGT...

TCAC: AGAC,CCAC,TCTG...

CACA: ...

ACAT: ...

CATC: ...
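A neighborhood can be sketched by scoring every possible word of the same length against the query word and keeping those above a threshold. In this sketch a simple match/mismatch score stands in for a scoring matrix, and both the scores and the thresholds are assumptions made for illustration, not BLAST's actual parameters.

```python
from itertools import product


def neighborhood(word, threshold, match=1, mismatch=-1, alphabet="ACGT"):
    """Return all words of the same length scoring at least threshold against word."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if a == b else mismatch for a, b in zip(word, candidate))
        if score >= threshold:
            hits.append("".join(candidate))
    return hits


# With these scores a word of length 4 scores 4 - 2 * (number of mismatches),
# so threshold 2 keeps words with at most one mismatch.
close = neighborhood("GTTC", threshold=2)
print(len(close))  # 13
print("CTTC" in close, "GATC" in close)  # True True
```

Lowering the threshold to 0 also admits words with two mismatches, such as AGAC for TCAC, at the price of a much larger neighborhood.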

(29)

BLAST

¾

calculate E values

¾

expectation that you would get that alignment by chance given the database of sequences

¾

return significant results

¾

we already talked about these e-values and p-values with Smith-Waterman significance

(30)

BLAST

¾

Types:

¾

Nucleotide vs. Nucleotide: blastn

¾

Protein vs Protein: blastp

¾

Translated Nucleotide vs Protein: blastx

¾

Protein vs Translated Nucleotide: tblastn

¾

Translated Nucleotide vs translated database: tblastx

(31)

DNA vs protein

¾

Should you use blastn or blastp?

¾

There are four potential nucleotides (A, C, G, T) and therefore four potential states

¾

There are 20 standard amino acids and therefore 20 potential states

¾

blastp should be more sensitive than blastn: the larger state space makes a random hit less likely

¾

If there is the possibility of highly similar sequences, DNA works well

¾

If there aren't translated sequences available, DNA is required

¾

intergenic spacers

(32)

Things to consider

nothing is 90% homologous

¾

things are homologous or they aren't

¾

there isn't a degree of homology

¾

there may be a degree of your belief in homology

statistical significance depends on the size of the alignments and the database

¾

e-value increases as database gets bigger: more chance for a random hit

(33)

Therefore

¾

sequence similarity can suggest homology

¾

a significant alignment over the length of both sequences strongly suggests homology

¾

homologous sequences do not always produce significant alignments!

¾

regions with low complexity (but that are not cleaned out by initial steps in BLAST) can produce significant

(34)

Rules

There are no hard and fast rules

Nucleotides

¾

it has been suggested that sequence identity of more than 70% suggests homology

¾

e-values of 10^-6 or less too bad

Proteins

¾

25% or more sequence identity

¾

e-values of 10^-3 or less nope

(35)

¾

Needleman-Wunsch

Pairwise Sequence Alignment