Protein Sequence Analysis
Overview
UDEL Workshop
Raja Mazumder
Research Associate Professor, Department of Biochemistry and Molecular Biology
Topics
Why do protein sequence analysis?
Searching sequence databases (similarity search)
Post-processing search results
Protein classification & function prediction. Detecting remote homologs
Protein bioinformatics:
protein sequence analysis
Helps characterize protein sequences in silico and allows prediction of protein structure and function
Statistically significant BLAST hits usually signify sequence homology
Homologous sequences may or may not have the same function, but with very few exceptions they share the same structural fold
Comparative protein sequence analysis
and evolution
Patterns of conservation in sequences allow us to determine which residues are under selective constraint (and thus likely important for protein function)
Comparative analysis of proteins is more sensitive than comparing DNA
Homologous proteins have a common ancestor
Different proteins evolve at different rates
Protein classification systems based on evolution:
Comparing proteins
Amino acid sequence of protein generated from
proteomics experiment
e.g. protein fragment
DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT
The amino acids of two sequences can be aligned, and the number of identical residues (or an index of similarity) can then be used as a measure of relatedness.
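A minimal sketch of the identity count, assuming the two sequences are already aligned and that gapped columns are skipped (other conventions for the denominator exist). The second sequence is a made-up variant of the start of the fragment above, and `percent_identity` is an illustrative name.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # assumption: gapped columns are left out of the count
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


# Toy pairwise alignment: seq2 is invented for illustration only.
seq1 = "DTIKDLLPNV"
seq2 = "DTLKD-LPNI"
identity = percent_identity(seq1, seq2)
print(f"{identity:.1f}% identity")
```

Here 7 of the 9 ungapped columns match, so the value is about 77.8%.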
Protein structures can be compared by
Protein sequence alignment
Pairwise alignment
a b a c d
a b _ c d
Multiple sequence alignment provides more information
a b a c d
a b _ c d
x b a c e
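One way to see the extra information in a multiple alignment is to score each column for conservation. The sketch below is only an illustration: it uses the toy alignment above, treats `_` as the gap symbol, and uses a simple "fraction sharing the most common symbol" score, which is an assumption rather than a standard named measure.

```python
from collections import Counter


def column_conservation(alignment, gap="_"):
    """For each column, fraction of sequences sharing the most common non-gap symbol."""
    n_sequences = len(alignment)
    scores = []
    for column in zip(*alignment):
        counts = Counter(symbol for symbol in column if symbol != gap)
        most_common = counts.most_common(1)[0][1] if counts else 0
        scores.append(most_common / n_sequences)
    return scores


# The three toy sequences from the slide, written without spaces.
msa = ["abacd", "ab_cd", "xbace"]
scores = column_conservation(msa)
for position, score in enumerate(scores, start=1):
    print(position, round(score, 2))
```

Columns 2 and 4 are fully conserved in this toy alignment; the other columns each have one differing or missing symbol.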
Protein sequence analysis overview
Protein databases
PIR (pir.georgetown.edu) and UniProt (www.uniprot.org)
Searching databases
Peptide search, BLAST search, Text search
Information retrieval and analysis
Protein records at UniProt and PIR
Multiple sequence alignment
Secondary structure prediction
Query Sequence
Unknown sequence is Q9I7I7
BLAST Q9I7I7 against the UniProt Knowledgebase
Are Q9I7I7 and SIR2_HUMAN homologs?
Check BLAST results
Protein structure prediction
Programs can predict
secondary structure information with 70% accuracy
Homology modeling - prediction of ‘target’ structure from closely related ‘template’ structure
Secondary structure prediction
http://bioinf.cs.ucl.ac.uk/psipred/
Homology modeling
Homology model of Q9I7I7
Blue - excellent; Green - so-so; Red - not good
Yellow - beta sheet; Red - alpha helix; Grey - loop
Multiple sequence alignment
Function prediction
Molecular Phylogenetics and Evolution
Overview
History of phylogenetics
Sequence analysis and classification
Phylogenetics
Field of biology that studies the evolutionary
relationships between organisms, proteins or genes that share a common ancestor
Phylogenetics includes the discovery (estimation) of
these relationships, and the study of the causes behind this pattern
Tree of Life
Aristotle (384 BC–322 BC) classified all living organisms as either a plant or an animal.
Whittaker (1969) summarized the "Five Kingdoms" of life: animals, plants, fungi, protists ("protozoa"), and monera (bacteria). R. H. Whittaker, Science 163, 150 (1969).
Zuckerkandl et al. (1965) put forward the concept that sequences could be used to relate organisms. E. Zuckerkandl et al. Biol. 8, 357 (1965).
Woese (1990) proposed "urkingdoms" or "domains": Eucarya (eukaryotes), Bacteria (initially called eubacteria), and Archaea (initially called archaebacteria). Woese et al. Proc. Natl. Acad. Sci. U.S.A. 87, 4576 (1990).
History of Phylogenetics
Charles Darwin. 1859. Author of The Origin of Species
Ernst Haeckel. 1892. Mapped a genealogical tree relating all animal life.
[Figure: Romanes's 1892 copy of Ernst Haeckel's allegedly fraudulent embryo drawings]
Monophyly, Paraphyly & Polyphyly
Molecular Phylogenetics
Morphological or organismal character evolution is less consistent than molecular evolution
Can be used to study any organism
Rates of evolution can be studied in greater
detail
Evolutionary Change in DNA
Several models have been proposed to study the mechanisms of DNA evolution
Jukes and Cantor’s One-Parameter Model – assumes no bias in the direction of change, so substitutions occur randomly among the four types of nucleotides.
Kimura’s Two-Parameter Model – transitions are generally more frequent than transversions, so the rate of transitional substitution differs from the rate of transversional substitution.
Rate of change is dependent upon the rate of substitution and pattern of substitution
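The two models lead to the usual distance corrections, sketched below under simplifying assumptions: the sequences are already aligned, contain only A, C, G and T, and have no gaps. The function names and the toy sequences are illustrative and not taken from the workshop material. Here P is the fraction of sites showing a transition and Q the fraction showing a transversion.

```python
import math

PURINES = {"A", "G"}  # C and T are the pyrimidines


def transition_transversion_fractions(seq1, seq2):
    """Return (P, Q): fractions of sites showing a transition / a transversion."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(seq1, seq2):
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    n = len(seq1)
    return transitions / n, transversions / n


def jukes_cantor_distance(p):
    """One-parameter correction: d = -(3/4) ln(1 - 4p/3), p = fraction of differing sites."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p_distance(P, Q):
    """Two-parameter correction: d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


# Made-up pair of sequences: 2 transitions and 1 transversion over 20 sites.
seq_a = "ACGTACGTACGTACGTACGT"
seq_b = "GCGTACGTATGTACGTACGA"
P, Q = transition_transversion_fractions(seq_a, seq_b)
jc = jukes_cantor_distance(P + Q)
k2p = kimura_2p_distance(P, Q)
print(round(P + Q, 3), round(jc, 4), round(k2p, 4))
```

Both corrected distances come out larger than the observed fraction of differing sites, because the models account for multiple substitutions at the same site.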
[Figure: types of nucleotide substitution between an ancestral sequence and two descendant sequences (Sequence 1, Sequence 2): single, multiple, coincidental, parallel, convergent, and back substitution]
Evolutionary Change in Protein
Synonymous and nonsynonymous substitutions: substitutions that result in an amino acid replacement are said to be nonsynonymous, while substitutions that do not cause an amino acid replacement are said to be synonymous
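As an illustration of the definition, the sketch below classifies a codon change using the standard genetic code. It assumes DNA codons written with T, and it treats a stop codon as its own symbol (`*`), which is a simplification. `classify_substitution` is a hypothetical helper name.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before, codon_after):
    """Label a codon change as 'identical', 'synonymous' or 'nonsynonymous'."""
    if codon_before == codon_after:
        return "identical"
    if CODON_TABLE[codon_before] == CODON_TABLE[codon_after]:
        return "synonymous"
    return "nonsynonymous"


print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
print(classify_substitution("GAA", "GTA"))  # Glu -> Val
```

The first change leaves leucine unchanged, so it is synonymous; the second replaces glutamate with valine, so it is nonsynonymous.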
Tutorial
Retrieve 1FSI (PDB id) sequence and
related sequences from UniProtKB using BLAST
Align all the sequences in Clustal (desktop
version)
Generate tree (using Clustal)
View tree (
Representation Of Phylogeny
The evolutionary relationship between two proteins can be represented in the form of a tree
A phylogeny is a bifurcating tree with nodes, branches, and a root (which represents the common ancestor)
[Figure: tree of homologous proteins (Protein 1a, Protein 1b, Protein 1c, Protein 1d) with node, branch, root, and clade labeled]
Terminology
Clade – A monophyletic taxon
Taxon – any named group of organisms;
not necessarily a clade
Branches – connect nodes
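The clade definition can be checked mechanically. The sketch below assumes a rooted tree written as nested tuples with leaf names as strings; the topology and the labels 1a to 1d are hypothetical, and `is_clade` is an illustrative name.

```python
def leaf_set(tree):
    """All leaf names below a node; a leaf is a plain string."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def is_clade(tree, group):
    """True if some node of the rooted tree has exactly `group` as its leaves."""
    group = frozenset(group)
    if leaf_set(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_clade(child, group) for child in tree)


# Hypothetical rooted tree: ((1a, 1b), 1c) grouped together, 1d as the outlier.
tree = ((("1a", "1b"), "1c"), "1d")
print(is_clade(tree, {"1a", "1b"}))        # monophyletic group
print(is_clade(tree, {"1a", "1c"}))        # not a clade: leaves out 1b
print(is_clade(tree, {"1a", "1b", "1c"}))  # monophyletic group
```

A group that passes this check contains an ancestor node and all of its descendants, which is what makes it monophyletic.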
Common Phylogenetic Tree Layout
Phylogram (branch lengths proportional to distance)
Rectangular cladogram
Slanted cladogram
Rooted vs. Unrooted Phylogenies
Unrooted: shows only relationships, not the evolutionary path
Rooted: the root (R) is the common ancestor
How to Construct A Phylogenetic Tree
Construct a multiple sequence alignment
Determine the substitution model
Build tree
Bootstrapping
Bootstrapping is a resampling tree evaluation method
The bootstrap value is a number associated with a particular branch in the tree that gives the proportion of bootstrap replicates that support the monophyly of the clade
Two-step process – generation of many new data sets from the original set and then the computation of a number that tells how often a particular branch appears in the tree
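The two steps can be sketched as follows. This is a deliberately simplified stand-in: instead of rebuilding a full tree for every replicate, it only asks how often two sequences come out as the closest pair by p-distance, used here as a rough proxy for "the branch grouping them appears". The toy alignment and all function names are assumptions for illustration.

```python
import random


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_replicate(alignment, rng):
    """Step 1: build a new data set by resampling alignment columns with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns) for name, seq in alignment.items()}


def closest_pair(alignment):
    """The pair of sequences with the smallest p-distance (ties broken by name)."""
    names = sorted(alignment)
    candidates = [
        (p_distance(alignment[a], alignment[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    _, a, b = min(candidates)
    return frozenset((a, b))


def bootstrap_support(alignment, group, replicates=200, seed=0):
    """Step 2: proportion of replicates in which `group` is the closest pair."""
    rng = random.Random(seed)
    target = frozenset(group)
    hits = sum(
        closest_pair(bootstrap_replicate(alignment, rng)) == target
        for _ in range(replicates)
    )
    return hits / replicates


# Made-up toy alignment of four sequences.
toy = {
    "A": "ACGTACGTACGTACGT",
    "B": "ACGTACGAACGTACGT",
    "C": "ACGAACGAACTTACGT",
    "D": "TCGAACGAACTTACGA",
}
support = bootstrap_support(toy, {"A", "B"})
print(f"support for grouping A with B: {support:.2f}")
```

In practice a tree is built from every replicate with the same method used for the original data, and the support is counted per branch of that tree.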
Distance - Neighbor-joining Method
The NJ algorithm is commonly applied in distance-based tree building. The fully resolved tree is “decomposed” from a fully unresolved “star” tree by inserting a branch between a pair of closest neighbors and the remaining terminals in the tree. The process is repeated until the tree is resolved. Rapid method.
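A compact sketch of neighbor joining, assuming the distances are supplied as a dictionary keyed by unordered pairs and the result is returned as nested tuples of (subtree, branch length). The five-taxon distance matrix is a toy example, and this is not the Clustal implementation.

```python
from itertools import combinations


def neighbor_joining(names, distances):
    """Join the pair minimizing the NJ criterion until two nodes remain.

    `distances` maps frozenset({a, b}) to a distance. Each internal node is a
    tuple ((child1, length1), (child2, length2)); the result is
    (node, node, length of the final connecting branch).
    """
    nodes = list(names)
    d = dict(distances)
    while len(nodes) > 2:
        n = len(nodes)
        # Sum of distances from each node to all the others.
        total = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # Pick the pair with the smallest Q = (n - 2) * d(a, b) - total(a) - total(b).
        a, b = min(
            combinations(nodes, 2),
            key=lambda pair: (n - 2) * d[frozenset(pair)] - total[pair[0]] - total[pair[1]],
        )
        d_ab = d[frozenset((a, b))]
        length_a = 0.5 * d_ab + (total[a] - total[b]) / (2 * (n - 2))
        length_b = d_ab - length_a
        new_node = ((a, length_a), (b, length_b))
        # Distance from the new internal node to every remaining node.
        for k in nodes:
            if k not in (a, b):
                d[frozenset((new_node, k))] = 0.5 * (
                    d[frozenset((a, k))] + d[frozenset((b, k))] - d_ab
                )
        nodes = [k for k in nodes if k not in (a, b)] + [new_node]
    left, right = nodes
    return (left, right, d[frozenset((left, right))])


# Toy distance matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
pairwise = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): value for pair, value in pairwise.items()}
tree = neighbor_joining(taxa, dist)
print(tree)
```

With these distances, a and b are joined first with branch lengths 2 and 3. Where several pairs tie on the criterion, this sketch simply takes the first one.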
Example PFK: Phosphofructokinase classification revealed that major functional specialization can result not only from major sequence changes but also from mutation of a single amino acid residue.
Function Prediction From Evolutionary Classification
[Figure: classification tree of PFK families: ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274]
E. coli (P06998): Gly105, Gly125
ATP-PFK: Gly105 + Gly125
PPi-PFK: Gly/Asp105 + Lys125
Contact
Myself-
UniProt-