Outline
Pairwise sequence alignment
• global - Needleman-Wunsch-Gotoh algorithm
• local - Smith-Waterman algorithm
• Quite simply, the comparison of two or more DNA or protein sequences to each other.
• The purpose of alignment is to highlight similarity between sequences.
• Alignment is the procedure of writing two (or more) sequences in a way that a maximum of identical or similar characters are placed in the same column by adding gap '-' characters.
Word Alignment
Species 1: SOMEONE
Species 2: AWESOME
Species 1: - - - SOMEONE
Species 2: AWESOME - - -
Less trivial
Species 1: ACGTTAGA
Species 2: CGTTGAA
Species 1: - - - ACGTTAGA
Species 2: CGTTGAA - - -
• score: -15 (gaps = -1, match = 1)
Species 1: ACGTTAGA
Species 2: - CGTT - GAA
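The column-by-column scoring idea used above can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` is made up for this example, and because the slide does not state a mismatch score, the sketch assumes mismatch = 0.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two gapped rows of equal length.

    mismatch=0 is an assumption; the slide only gives match and gap scores.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # any column containing a gap character
        elif a == b:
            total += match      # identical characters in the same column
        else:
            total += mismatch   # different characters in the same column
    return total


# Word alignment from the earlier slide: 4 matching columns, 6 gap columns.
print(score_alignment("---SOMEONE", "AWESOME---"))
```

Comparing the scores of several candidate alignments of the same pair is what lets us say one alignment is better than another.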
FASTA Format - Input
• Standard input format for alignment programs
• Use 'x' for ambiguous characters
• Strictly speaking, should not contain gaps
>Name1
ASEQUENCE1
>Name2 comments
SEQU
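A minimal sketch of reading this format in Python, assuming the first word after '>' is the name, the rest of the header line is a comment, and sequence lines are joined until the next '>'. The function name `parse_fasta` and the example records are hypothetical, not part of any alignment program.

```python
def parse_fasta(text):
    """Return a list of (name, comment, sequence) tuples from FASTA text."""
    records = []
    name, comment, chunks = None, "", []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if name is not None:          # close the previous record
                records.append((name, comment, "".join(chunks)))
            header = line[1:].split(None, 1)
            name = header[0] if header else ""
            comment = header[1] if len(header) > 1 else ""
            chunks = []
        elif name is not None:
            chunks.append(line)           # sequence may span several lines
    if name is not None:
        records.append((name, comment, "".join(chunks)))
    return records


# Made-up input: two records, the first split over two sequence lines.
example = ">seqA first example\nACGT\nTTAG\n>seqB\nCGTTGAA\n"
for record in parse_fasta(example):
    print(record)
```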
FASTA Format - Output
• Increasingly, multiple alignments are returned in FASTA-like format
• Gaps represented using '-'
• Order of sequences may be different in output to input.
>Name1
ASEQUENCE1
>Name2 comments
-SEQU--CE2
etc....
Relatedness of residues in same column
• Making these alignments is EASY....
• As we know where and which evolutionary events occurred
Quiz
Which alignment (X, Y or Z)
shows only residues related by
substitution events in the same column?
Types of alignment methods
We cannot enumerate all possible alignments.
Approaches are:
• Dot matrix
• Dynamic Programming
• Given two
In a dot matrix we can identify:
• Existing alignable parts of sequences
• Possible indels
• Duplicated sequences and repeats
• Self-complementarity
• Gene-order differences among genomes
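A dot matrix like the one described here can be sketched as follows. This is a bare-bones version assuming a dot wherever two single characters are identical, with no window or threshold filtering; the helper names `dot_matrix` and `print_dot_matrix` are invented for this example.

```python
def dot_matrix(seq1, seq2):
    """Return one string per character of seq1: '*' where it equals seq2[j]."""
    return ["".join("*" if a == b else "." for b in seq2) for a in seq1]


def print_dot_matrix(seq1, seq2):
    """Print the matrix with seq2 along the top and seq1 down the side."""
    print("  " + seq2)
    for a, row in zip(seq1, dot_matrix(seq1, seq2)):
        print(a + " " + row)


# The two sequences from the "Less trivial" slide.
print_dot_matrix("ACGTTAGA", "CGTTGAA")
```

Runs of '*' along a diagonal correspond to alignable parts of the two sequences; a shift from one diagonal to a neighbouring one hints at an indel.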
a) A continuous main diagonal shows perfect similarity for symbols with the same indices.
b) Parallels to the main diagonal indicate repeated regions in the same reading direction on different parts of the sequences. In this case a region D is found twice in the sequence (D1, D2, so-called 'duplications').
c) Lines perpendicular to the main diagonal indicate palindromic areas. In this case the sequence is completely palindromic in the displayed area.
d) Partially palindromic sequence. (For DNA sequences this refers to a perfect match of the normal strand with its reverse complement, which is frequently found for many transposable elements.)
e) Bold blocks on the main diagonal indicate repetition of the same symbol in both sequences, e.g. (G)50, so-called microsatellite repeats.
f) Parallel lines indicate tandem repeats of a larger motif in both sequences, e.g. (AGCTCTGAC)20, so-called minisatellite patterns. The distance between the diagonals equals the length of the motif.
g) When the diagonal is a discontinuous line, this indicates that the sequences T1 and T2 share a common source. In literary analyses we may have to deal with plagiarism, or in DNA analyses sequences may be homologous because of a common ancestor. The number of interruptions increases with modifications to the text, or with the time of independent evolution and the mutation rate.
h) Partial deletion in one sequence or insertion in the other, so-called 'indel'. In protein-coding sequences this can often be observed for many different types of domains, which got lost or
Aligning a pair of sequences
(Dynamic Programming)
Figure scoring: match = +10, gap = -15
• Aim: get from one corner to other
• Moves have a cost
• Choose cheapest way
• Fill in table
• Trace route backwards to find alignment
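The fill-in-table and trace-back steps can be sketched as a small Needleman-Wunsch style routine. This is a sketch under stated assumptions, not the lab implementation: it maximizes a score rather than minimizing a cost (the two views are equivalent up to sign), it uses a simple linear gap penalty, and the default scores (match = 1, mismatch = -1, gap = -1) are illustrative values chosen for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap               # a[:i] against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap               # b[:j] against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill in the table: each cell takes the best of three moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # diagonal: align two residues
                score[i - 1][j] + gap,            # up: gap in b
                score[i][j - 1] + gap,            # left: gap in a
            )

    # Trace the route backwards from the bottom-right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACGTTAGA", "CGTTGAA")
print(best)
print(top)
print(bottom)
```

When several routes tie, the traceback above prefers the diagonal move; other tie-breaking orders give different but equally scoring alignments.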
• Initialize NxM matrix with the sequences A and B of length N and M
• Starting at the top left corner, set the intermediate scoring value V =
Substitution matrices
Some amino acids are more similar than others
• Adjust cost according to some similarity matrix
E.g. BLOSUM62
Leu -> Leu: 4
Leu -> Met: 2
Leu -> Pro: -3
... etc.
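As a rough illustration of how such a matrix is used, the sketch below stores only the three BLOSUM62 entries quoted above and looks scores up symmetrically. The names `BLOSUM62_SUBSET` and `substitution_score` are invented for this example; a real matrix holds a score for every residue pair.

```python
# Only the three entries quoted on the slide (L = Leu, M = Met, P = Pro).
BLOSUM62_SUBSET = {("L", "L"): 4, ("L", "M"): 2, ("L", "P"): -3}


def substitution_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up the score for aligning residue a with residue b.

    The matrix is symmetric, so (a, b) and (b, a) share one stored entry.
    """
    if (a, b) in matrix:
        return matrix[(a, b)]
    if (b, a) in matrix:
        return matrix[(b, a)]
    raise KeyError(f"no score stored for {a}/{b}")


print(substitution_score("L", "L"))   # identical residues
print(substitution_score("M", "L"))   # similar residues
print(substitution_score("L", "P"))   # dissimilar residues
```

In a dynamic programming alignment, this lookup replaces the fixed match and mismatch scores in the diagonal move.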
Gap penalties
• Gaps tend to occur together, so a single per-position penalty is unrealistic
• a gap of length three should not cost three times as much
• Use affine gap cost
• Make extending an already existing gap cheaper
• Gap opening (G) / gap extension (E)
• Total cost for gap length x: G + x × E
• E.g. psiblast defaults: G = -…, E = -…
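The difference between a single per-position penalty and the affine cost G + x × E can be seen in a short sketch. The numbers used (G = -10, E = -1, and -11 per position for the linear case) are illustrative assumptions, not the defaults of any program.

```python
def linear_gap_cost(x, per_position):
    """Every gap position costs the same amount."""
    return x * per_position


def affine_gap_cost(x, gap_open, gap_extend):
    """Cost G + x * E for a gap of length x (0 if there is no gap)."""
    if x <= 0:
        return 0
    return gap_open + x * gap_extend


# Illustrative values only: G = -10, E = -1.
for length in (1, 2, 3):
    print(length,
          linear_gap_cost(length, -11),
          affine_gap_cost(length, -10, -1))
```

Both schemes charge the same for a gap of length one here, but under the affine scheme a gap of length three costs far less than three separate gaps of length one.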
Global vs Local Alignment
• Global: Find the best overall alignment between sequences.
• Local: Find short regions of highly conserved sequence.
Global vs Local
Species 1: SOMEONE
Species 2: AWESOME
Species 1: - - - SOMEONE
Species 2: AWESOME - - -
Species 1: SOME
Species 2: SOME
Smith-Waterman Algorithm
• Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximizes the similarity
• For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and contain insertions and deletions
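A compact sketch of the local variant follows. It assumes a linear gap penalty and illustrative scores (match = 1, mismatch = -1, gap = -1). The two differences from the global procedure are that a cell is never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                0,                              # start a fresh local alignment
                H[i - 1][j - 1] + sub(i, j),    # align two residues
                H[i - 1][j] + gap,              # gap in b
                H[i][j - 1] + gap,              # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score reaches zero.
    i, j = best_pos
    row_a, row_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


# The word example from the Global vs Local slide.
print(smith_waterman("SOMEONE", "AWESOME"))
```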
Calculating significance
We have calculated the optimal alignment: the alignment with the best score
• this doesn't depend on whether the sequences are related or not
• call this the maximum segment pair (MSP)
How many MSPs do we expect with at least the same score by chance?
Calculating significance
• We make use of the extreme value distribution (EVD) to calculate the number of alignments between random sequences that we expect given our score or better
• This is known as the e-value
• $E(S) = K m n \, e^{-\lambda S}$
• $K$ and $\lambda$ = scaling parameters calculated based on the search space ($K$) and scoring scheme ($\lambda$)
• $m$, $n$ = size of the search space
• The probability of finding at least one match with our score (the p-value): $1 - e^{-E(S)}$
• As both the e-value and the p-value decrease, the
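The two formulas above translate directly into code. In the sketch below, the values K = 0.1 and lambda = 0.3 and the search-space sizes are placeholders picked only to show the behaviour; real values depend on the scoring scheme and the database.

```python
import math


def e_value(score, m, n, K, lam):
    """Expected number of chance alignments with at least this score."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such chance alignment: 1 - exp(-E)."""
    return -math.expm1(-e)      # numerically stable form of 1 - exp(-e)


# Placeholder parameters, for illustration only.
K, lam = 0.1, 0.3
m = 100                          # e.g. query length
for n in (1_000_000, 10_000_000):
    for score in (50, 80):
        e = e_value(score, m, n, K, lam)
        print(n, score, e, p_value(e))
```

The loop shows two trends: a higher score gives a smaller e-value, and a larger search space gives a larger e-value for the same score. For small e-values the p-value is almost equal to the e-value.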
• Basic Local Alignment Search Tool: Used to find local sequence alignments between protein or nucleotide sequences (Altschul et al., 1990, cited over 43,000 times)
• Heuristic, so it is an approximate best match (SW is a guarantee)
• calculate the high scoring matches instead of the maximum scoring matches (HSP instead of MSP)
BLAST
• break the sequence into "words" of a length (default is 28, we will look at 4)
GTTCACATCATCCTGC
GTTC
TTCA
TCAC
CACA
ACAT
CATC
ATCA
...
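The word list above is simply every overlapping substring of the chosen length. A minimal sketch (the function name `split_into_words` is invented for this example):

```python
def split_into_words(sequence, word_size):
    """Return all overlapping substrings of length word_size, left to right."""
    return [sequence[i:i + word_size]
            for i in range(len(sequence) - word_size + 1)]


words = split_into_words("GTTCACATCATCCTGC", 4)
print(len(words))
print(words[:7])   # the first seven words, as listed on the slide
```

A sequence of length L yields L - w + 1 words of length w, so the 16-base example gives 13 words of length 4.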
BLAST
• for each of the words look at "likely" mutants (based on scoring matrices)
• you could call this the neighborhood
GTTCACATCATCCTGC
GTTC: CTTC,GTTC,GATC...
TTCA: TTCT,TTGA,TTGT...
TCAC: AGAC,CCAC,TCTG...
CACA: ...
ACAT: ...
CATC: ...
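One way to picture the neighborhood is to score every possible word of the same length against the original word and keep those above a threshold. The sketch below assumes a toy scoring scheme (match = 1, mismatch = -1) and a threshold of 2 instead of a real scoring matrix, so with 4-letter words it keeps only words with at most one changed letter; the neighborhoods on the slide are evidently more permissive.

```python
from itertools import product


def neighborhood(word, threshold, alphabet="ACGT", match=1, mismatch=-1):
    """All words of the same length scoring at least threshold against word."""
    neighbors = []
    for candidate in product(alphabet, repeat=len(word)):
        score = sum(match if c == w else mismatch
                    for c, w in zip(candidate, word))
        if score >= threshold:
            neighbors.append("".join(candidate))
    return neighbors


# Toy threshold: 3 matches and 1 mismatch score 2, so one change is allowed.
hits = neighborhood("GTTC", threshold=2)
print(len(hits))
print(hits)
```

Enumerating every candidate is only practical for short words; it is meant to show the idea, not how a search tool builds its lookup table.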
BLAST
• calculate E values
• expectation that you would get that alignment by chance given the database of sequences
• return significant results
• we already talked about these e-values and p-values with Smith-Waterman significance
BLAST
• Types:
  • Nucleotide vs. Nucleotide: blastn
  • Protein vs. Protein: blastp
  • Translated Nucleotide vs. Protein: blastx
  • Protein vs. Translated Nucleotide: tblastn
  • Translated Nucleotide vs. translated database: tblastx
DNA vs protein
• Should you use blastn or blastp?
• There are four potential nucleotides (A, C, G, T) and therefore four potential states
• There are 20 standard amino acids (22 counting selenocysteine and pyrrolysine) and therefore at least 20 potential states
• blastp should be more sensitive than blastn: the larger state space lowers the chance of a random hit
• If there is the possibility of highly similar sequences, DNA works well
• If there aren't translated sequences available, DNA is required
  • intergenic spacers
Things to consider
nothing is 90% homologous
• things are homologous or they aren't
• there isn't a degree of homology
• there may be a degree of your belief in homology
statistical significance depends on the size of the alignments and the database
• e-value increases as the database gets bigger: more chance for a random hit
Therefore
• sequence similarity can suggest homology
• a significant alignment over the length of both sequences strongly suggests homology
• homologous sequences do not always produce significant alignments!
• regions with low complexity (but that are not cleaned out by initial steps in BLAST) can produce significant
Rules
There are no hard and fast rules
Nucleotides
• it has been suggested that sequence identity of more than 70% suggests homology
• e-values of 10^-6 or less too bad
Proteins
• 25% or more sequence identity
• e-values of 10^-3 or less nope
Next
We will go over some examples in lab
• Needleman-Wunsch