Protein Sequence Analysis

(1)

Overview

-UDEL Workshop

Raja Mazumder

Research Associate Professor, Department of Biochemistry and Molecular Biology

(2)

Topics

 Why do protein sequence analysis?

 Searching sequence databases (similarity search)

 Post-processing search results

 Protein classification & function prediction. Detecting remote homologs

(3)

Protein bioinformatics: protein sequence analysis

 Helps characterize protein sequences in silico and allows

prediction of protein structure and function

 Statistically significant BLAST hits usually signify sequence homology

 Homologous sequences may or may not have the same function but, with very few exceptions, share the same structural fold
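As a rough illustration of what "statistically significant" means for a BLAST hit, the sketch below converts a bit score S' into an expect value using the standard relation E = m * n * 2^(-S'), where m and n are the query and database lengths. The function name and the example numbers are hypothetical, and BLAST itself uses effective (edge-corrected) lengths, so this is only an approximation.

```python
def blast_evalue(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate expect value from a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: a 300-residue query against 1e8 database residues.
for score in (30, 50, 100):
    print(score, blast_evalue(score, 300, 100_000_000))
```

The smaller the expect value, the less likely the hit arose by chance; every additional bit of score halves it.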

(4)

Comparative protein sequence analysis and evolution

 Patterns of conservation in sequences allow us to determine which residues are under selective constraint (and thus likely important for protein function)

 Comparative analysis of proteins is more sensitive than comparing DNA

 Homologous proteins have a common ancestor

 Different proteins evolve at different rates

 Protein classification systems based on evolution:

(5)

Comparing proteins

 Amino acid sequence of protein generated from

proteomics experiment

e.g. protein fragment

DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT

 The amino acids of two sequences can be aligned, and the number of identical residues (or an index of similarity) can then be counted as a measure of relatedness.

 Protein structures can be compared by

(6)

Protein sequence alignment

 Pairwise alignment

a b a c d

a b _ c d

 Multiple sequence alignment provides more information

a b a c d

a b _ c d

x b a c e
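The toy alignments on this slide can be scored with a few lines of Python. This is a minimal sketch, assuming "_" marks a gap and that identity is computed over ungapped columns only (other conventions for the denominator exist). The function names are illustrative and do not come from any particular package.

```python
from collections import Counter

GAP = "_"


def percent_identity(a: str, b: str) -> float:
    """Percent of ungapped aligned columns in which both residues are identical."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def column_conservation(msa: list[str]) -> list[float]:
    """Fraction of rows carrying the most common non-gap residue, per column."""
    scores = []
    for column in zip(*msa):
        residues = [r for r in column if r != GAP]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores


pair = ("abacd", "ab_cd")
msa = ["abacd", "ab_cd", "xbace"]

print(percent_identity(*pair))
print(column_conservation(msa))
```

Columns that score 1.0 in the multiple alignment are fully conserved, which is the kind of extra information a pairwise comparison cannot provide.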

(7)

Protein sequence analysis overview

 Protein databases

 PIR (pir.georgetown.edu) and UniProt (www.uniprot.org)

 Searching databases

 Peptide search, BLAST search, Text search

 Information retrieval and analysis

 Protein records at UniProt and PIR

 Multiple sequence alignment

 Secondary structure prediction

(8)

Query Sequence

 Unknown sequence is Q9I7I7

 BLAST Q9I7I7 against the UniProt Knowledgebase
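Before running a search, the query sequence has to be obtained in FASTA format. The sketch below shows one way to do that programmatically. The URL pattern is an assumption about the public UniProt REST service and should be checked against the current UniProt documentation; the helper names are hypothetical. The offline demonstration uses the protein fragment shown earlier rather than the real Q9I7I7 record.

```python
import urllib.request

# Assumed URL pattern for the UniProt REST service (verify before relying on it).
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Return (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_uniprot_fasta(accession: str, timeout: float = 30.0) -> list[tuple[str, str]]:
    """Download one entry (needs network access), e.g. fetch_uniprot_fasta("Q9I7I7")."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


# Offline demonstration with the fragment from the "Comparing proteins" slide.
FRAGMENT = "DTIKDLLPNVCAFPMEKGPCQTYMTRWFFNFETGECELFAYGGCGGNSNNFLRKEKCEKFCKFT"
demo = ">example_fragment\n" + FRAGMENT[:30] + "\n" + FRAGMENT[30:] + "\n"
print(parse_fasta(demo))
```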


(9)

(10)

(11)

Are Q9I7I7 and SIR2_HUMAN homologs?

 Check BLAST results

(12)

Protein structure prediction

 Programs can predict

secondary structure information with 70% accuracy

 Homology modeling - prediction of ‘target’ structure from closely related ‘template’ structure
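Accuracy figures for secondary structure prediction are usually three-state (Q3) scores: the fraction of residues whose predicted state (helix H, strand E, coil C) matches the observed one. A minimal sketch, with hypothetical state strings made up for illustration:

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Fraction of residues whose predicted state (H, E or C) matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("state strings must have equal length")
    if not observed:
        raise ValueError("state strings must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


# Hypothetical 10-residue example (not real prediction output).
observed = "HHHHCCEEEE"
predicted = "HHHCCCEEEC"
print(q3_accuracy(predicted, observed))  # 8 of 10 states agree -> 0.8
```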

(13)

Secondary structure prediction

http://bioinf.cs.ucl.ac.uk/psipred/

(14)

(15)

(16)

Homology modeling

(17)

Homology model of Q9I7I7

Model quality: Blue - excellent, Green - so so, Red - not good

Secondary structure: Yellow - beta sheet, Red - alpha helix, Grey - loop

(18)

(19)

(20)

Multiple sequence alignment

(21)

(22)

Function prediction

(23)

Function prediction

(24)

Molecular Phylogenetics and Evolution

Overview

 History of phylogenetics

 Sequence analysis and classification

(25)

Phylogenetics

 Field of biology that studies the evolutionary

relationships between organisms, proteins or genes that share a common ancestor

 Phylogenetics includes the discovery (estimation) of

these relationships, and the study of the causes behind this pattern

(26)

Tree of Life

 Aristotle (384 BC–322 BC) classified all living organisms as either plants or animals.

 Whittaker (1969) summarized the "Five Kingdoms" of life: animals, plants, fungi, protists ("protozoa"), and monera (bacteria). R. H. Whittaker, Science 163, 150 (1969).

 Zuckerkandl et al. (1965) put forward the concept that sequences could be used to relate organisms. E. Zuckerkandl et al., J. Theor. Biol. 8, 357 (1965).

 Woese (1990) proposed "urkingdoms" or "domains": Eucarya (eukaryotes), Bacteria (initially called eubacteria), and Archaea (initially called archaebacteria). Woese et al., Proc. Natl. Acad. Sci. U.S.A. 87, 4576 (1990).

(27)

History of Phylogenetics

Charles Darwin. 1859. Author of The Origin of Species

Ernst Haeckel. 1892. Mapped a genealogical tree relating all animal life.

[Figure: Romanes's 1892 copy of Ernst Haeckel's allegedly fraudulent embryo drawings]

(28)

Monophyly, Paraphyly & Polyphyly

(29)

Molecular Phylogenetics

 Morphological or organismal character evolution is less consistent than molecular evolution

 Can be used to study any organism

 Rates of evolution can be studied in greater

detail

(30)

Evolutionary Change in DNA

 Several models have been proposed to study the mechanisms of DNA evolution

 Jukes and Cantor’s One-Parameter Model – assumes no bias in the direction of change, so substitutions occur randomly among the four types of nucleotides.

 Kimura’s Two-Parameter Model – transitions are generally more frequent than transversions, so the rate of transitional substitution differs from the rate of transversional substitution.

 The rate of change depends on both the rate of substitution and the pattern of substitution

[Figure: an ancestral sequence diverging into Sequence 1 and Sequence 2, illustrating single, multiple, coincidental, parallel, convergent and back substitution]
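The two models correct an observed proportion of differences for multiple substitutions at the same site. The sketch below uses the standard distance formulas: Jukes-Cantor d = -(3/4) ln(1 - 4p/3), and Kimura d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)), where p is the proportion of differing sites and P and Q are the proportions of transitions and transversions. The two sequences are hypothetical, and the function names are illustrative only.

```python
import math

# Transitions are purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T) changes.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def substitution_fractions(seq1: str, seq2: str) -> tuple[float, float]:
    """Return (P, Q): proportions of transitional and transversional differences."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq1.upper(), seq2.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    ts = sum(1 for a, b in pairs if a != b and frozenset((a, b)) in TRANSITIONS)
    tv = sum(1 for a, b in pairs if a != b and frozenset((a, b)) not in TRANSITIONS)
    return ts / len(pairs), tv / len(pairs)


def jukes_cantor(p: float) -> float:
    """One-parameter distance; defined only for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def kimura_2p(P: float, Q: float) -> float:
    """Two-parameter distance with separate transition and transversion rates."""
    return -0.5 * math.log((1.0 - 2.0 * P - Q) * math.sqrt(1.0 - 2.0 * Q))


# Hypothetical aligned sequences: two transitions and one transversion in 14 sites.
seq1 = "ACTGAACGTAACGC"
seq2 = "ACAGAGCGTAATGC"
P, Q = substitution_fractions(seq1, seq2)
print(P, Q, jukes_cantor(P + Q), kimura_2p(P, Q))
```

Both corrected distances come out larger than the raw proportion of differences, because some substitutions are hidden by later changes at the same site.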

(31)

Evolutionary Change in Protein

 Synonymous and nonsynonymous substitutions: substitutions that result in amino acid replacements are said to be nonsynonymous, while substitutions that do not cause an amino acid replacement are said to be synonymous.
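The distinction can be checked mechanically by translating the codon before and after the change. This is a minimal sketch using the standard genetic code; the function name is hypothetical, and changes to or from a stop codon are simply reported as nonsynonymous here.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Label a codon change as 'no change', 'synonymous' or 'nonsynonymous'."""
    before, after = codon_before.upper(), codon_after.upper()
    if before == after:
        return "no change"
    if CODON_TABLE[before] == CODON_TABLE[after]:
        return "synonymous"
    return "nonsynonymous"


print(classify_substitution("CTT", "CTC"))  # Leu -> Leu
print(classify_substitution("GAA", "GTA"))  # Glu -> Val
```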

(32)

Tutorial

 Retrieve 1FSI (PDB id) sequence and

related sequences from UniProtKB using BLAST

 Align all the sequences in Clustal (desktop

version)

 Generate tree (using Clustal)

 View tree (

(33)

Representation Of Phylogeny

 The evolutionary relationship between two proteins can be represented in the form of a tree

 A phylogeny is a bifurcating tree with nodes, branches and a root (which represents the common ancestor)

[Figure: rooted tree of homologous proteins (Protein 1a, Protein 1b, Protein 1c, Protein 1d) with a node, a branch, the root and a clade labelled]

(34)

Terminology

 Clade – A monophyletic taxon

 Taxon – any named group of organisms;

not necessarily a clade

 Branches – branches connect nodes
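These terms can be made concrete with a small data structure. In the sketch below a rooted tree is written as nested tuples (a hypothetical representation chosen for brevity), and a group of leaves counts as a clade only if it is exactly the set of descendants of some internal node. The protein names are placeholders.

```python
def clades(tree) -> list[frozenset]:
    """Return the leaf set below every internal node of a nested-tuple tree."""
    found = []

    def walk(node) -> frozenset:
        if isinstance(node, str):          # a leaf (terminal)
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)               # one clade per internal node
        return leaves

    walk(tree)
    return found


def is_monophyletic(tree, group) -> bool:
    """True if `group` is exactly the set of leaves descending from one node."""
    return frozenset(group) in clades(tree)


# Hypothetical rooted tree: (((1a, 1b), 1c), 1d)
tree = ((("Protein 1a", "Protein 1b"), "Protein 1c"), "Protein 1d")
print(is_monophyletic(tree, ["Protein 1a", "Protein 1b"]))  # a clade
print(is_monophyletic(tree, ["Protein 1a", "Protein 1c"]))  # not a clade: 1b is left out
```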

(35)

Common Phylogenetic Tree Layout

Phylogram (branch lengths proportional to distance)

Rectangular cladogram

Slanted cladogram

(36)

Rooted vs. Unrooted Phylogenies

Unrooted: shows only relationships, not the evolutionary path

Rooted: the root (R) is the common ancestor

(37)

How to Construct A Phylogenetic Tree

 Construct a multiple sequence alignment

 Determine the substitution model

 Build tree

(38)

Bootstrapping

 Bootstrapping is a resampling tree evaluation method

 A bootstrap value is a number associated with a particular branch in the tree that gives the proportion of bootstrap replicates that support the monophyly of the clade

 Two-step process – generation of many new data sets from the original set (by resampling alignment columns with replacement), followed by computation of a number that tells how often a particular branch appears in the resulting trees
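The two steps can be sketched as follows. New data sets are made by sampling alignment columns with replacement, and support is the fraction of replicates that recover a given grouping. To keep the example short, "recovering the grouping" is simplified here to "the two sequences are the closest pair by p-distance", which is only a stand-in for building a full tree on every replicate. All names and the toy alignment are hypothetical.

```python
import random
from itertools import combinations


def bootstrap_replicate(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def p_distance(a: str, b: str) -> float:
    """Proportion of aligned positions at which two sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def closest_pair(alignment: dict[str, str]) -> tuple[str, str]:
    """Pair of sequence names with the smallest p-distance (ties: first in sorted order)."""
    return min(combinations(sorted(alignment), 2),
               key=lambda pair: p_distance(alignment[pair[0]], alignment[pair[1]]))


def pair_support(alignment: dict[str, str], pair, replicates: int = 100, seed: int = 0) -> float:
    """Fraction of bootstrap replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = sum(closest_pair(bootstrap_replicate(alignment, rng)) == target
               for _ in range(replicates))
    return hits / replicates


# Toy alignment: A and B are identical, C differs from both at every site.
toy = {"A": "ACGTACGT", "B": "ACGTACGT", "C": "TGCATGCA"}
print(pair_support(toy, ("A", "B")))
```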

(39)

Distance - Neighbor-joining Method

 The neighbor-joining (NJ) algorithm is commonly applied in distance-based tree building

 The fully resolved tree is “decomposed” from a fully unresolved “star” tree by inserting a branch between a pair of closest neighbors and the remaining terminals in the tree. The process is repeated until the tree is fully resolved. Rapid method.
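The procedure can be sketched in a few dozen lines. This is an illustrative version of the standard neighbor-joining recurrences (selection criterion Q, branch lengths, and updated distances), not the code used by Clustal or any other program. The distance matrix is a hypothetical five-taxon example, and internal nodes get made-up names ("node1", "node2", ...).

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return the unrooted tree as a list of (node, node, branch_length) edges.

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []

    def D(a, b):
        return d[frozenset((a, b))]

    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(D(i, k) for k in nodes if k != i) for i in nodes}
        # Pick the pair that minimises Q(i, j) = (n - 2) * D(i, j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * D(*p) - r[p[0]] - r[p[1]])
        counter += 1
        u = f"node{counter}"
        # Branch lengths from the joined pair to the new internal node.
        li = 0.5 * D(i, j) + (r[i] - r[j]) / (2 * (n - 2))
        lj = D(i, j) - li
        edges += [(i, u, li), (j, u, lj)]
        # Distances from the new node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (D(i, k) + D(j, k) - D(i, j))
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, D(a, b)))
    return edges


# Hypothetical additive distance matrix for five taxa.
names = ["a", "b", "c", "d", "e"]
matrix = [[0, 5, 9, 9, 8],
          [5, 0, 10, 10, 9],
          [9, 10, 0, 8, 7],
          [9, 10, 8, 0, 3],
          [8, 9, 7, 3, 0]]
dist = {frozenset((names[i], names[j])): matrix[i][j]
        for i in range(5) for j in range(i + 1, 5)}
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

Each pass through the loop joins one pair and removes one node from the star, so a tree of n terminals is resolved in n - 2 joins, which is why the method is fast.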

(40)

Function Prediction From Evolutionary Classification

Example PFK: Phosphofructokinase classification revealed that major functional specialization can occur not only through major sequence changes but also through mutation of a single amino-acid residue.

[Figure: classification tree of PFK families (ATP_PFK_DR0635, ATP_PFK_euk, PPi_PFK_PfpB, PPi_PFK_TM0289, PPi_PFK_TP0108, PPi_PFK_SMc01852, PFK_XF0274), with residues Gly105 and Gly125 marked on E. coli PFK (P06998). ATP-PFK: Gly105 + Gly125. PPi-PFK: Gly/Asp105 + Lys125.]

(41)

Contact

 Myself-

 UniProt-