Wright State University
CORE Scholar
Computer Science and Engineering Faculty Publications
Computer Science & Engineering
2003
Protein Structure, Function, and Folding
Dan E. Krane
Wright State University - Main Campus
Michael L. Raymer
Wright State University - Main Campus
Repository Citation
Krane, D. E., & Raymer, M. L. (2003). Protein Structure, Function, and Folding. https://corescholar.libraries.wright.edu/cse/389
Intro to Bioinformatics
Protein Folding 2
Determining Protein Structure
• There are O(100,000) distinct proteins in the
human proteome.
• 3D structures have been determined for 14,000
proteins, from all organisms
• Includes duplicates with different ligands bound, etc.
• Coordinates are determined by X-ray crystallography
X-Ray Crystallography
• Crystal size: ~0.5 mm
• The crystal is a mosaic of millions of copies of the protein.
X-Ray diffraction
• Image is averaged over:
• Space (many copies)
• Time (of the diffraction experiment)
The Protein Data Bank
ATOM   1  N    ALA E  1  22.382  47.782  112.975  1.00  24.09  3APR 213
ATOM   2  CA   ALA E  1  22.957  47.648  111.613  1.00  22.40  3APR 214
ATOM   3  C    ALA E  1  23.572  46.251  111.545  1.00  21.32  3APR 215
ATOM   4  O    ALA E  1  23.948  45.688  112.603  1.00  21.54  3APR 216
ATOM   5  CB   ALA E  1  23.932  48.787  111.380  1.00  22.79  3APR 217
ATOM   6  N    GLY E  2  23.656  45.723  110.336  1.00  19.17  3APR 218
ATOM   7  CA   GLY E  2  24.216  44.393  110.087  1.00  17.35  3APR 219
ATOM   8  C    GLY E  2  25.653  44.308  110.579  1.00  16.49  3APR 220
ATOM   9  O    GLY E  2  26.258  45.296  110.994  1.00  15.35  3APR 221
ATOM  10  N    VAL E  3  26.213  43.110  110.521  1.00  16.21  3APR 222
ATOM  11  CA   VAL E  3  27.594  42.879  110.975  1.00  16.02  3APR 223
ATOM  12  C    VAL E  3  28.569  43.613  110.055  1.00  15.69  3APR 224
ATOM  13  O    VAL E  3  28.429  43.444  108.822  1.00  16.43  3APR 225
ATOM  14  CB   VAL E  3  27.834  41.363  110.979  1.00  16.66  3APR 226
ATOM  15  CG1  VAL E  3  29.259  41.013  111.404  1.00  17.35  3APR 227
ATOM  16  CG2  VAL E  3  26.811  40.649  111.850  1.00  17.03  3APR 228
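To make the record layout concrete, here is a minimal Perl sketch that reads two of the ATOM records above and computes the distance between the alpha carbons of Ala 1 and Gly 2. It assumes the fields are whitespace-separated, as in the listing; real PDB files use fixed columns, so this is an illustration rather than a general parser. The names `parse_atom` and `distance` are hypothetical helpers.

```perl
use strict;
use warnings;

# Two of the ATOM records shown above (alpha carbons of Ala 1 and Gly 2).
my @records = (
    'ATOM 2 CA ALA E 1 22.957 47.648 111.613 1.00 22.40 3APR 214',
    'ATOM 7 CA GLY E 2 24.216 44.393 110.087 1.00 17.35 3APR 219',
);

# Assumption: fields are whitespace-separated, as in the listing above.
# Real PDB files are fixed-column, so robust code should slice by column.
sub parse_atom {
    my ($line) = @_;
    my ($tag, $serial, $name, $res, $chain, $resnum, $x, $y, $z, $occ, $bfac)
        = split ' ', $line;
    return { name => $name, res => $res, resnum => $resnum,
             xyz => [$x, $y, $z], occupancy => $occ, bfactor => $bfac };
}

# Euclidean distance between two parsed atoms, in angstroms.
sub distance {
    my ($p, $q) = @_;
    my $sum = 0;
    $sum += ($p->{xyz}[$_] - $q->{xyz}[$_]) ** 2 for 0 .. 2;
    return sqrt $sum;
}

my @atoms = map { parse_atom($_) } @records;
my $d = distance(@atoms);
printf "CA-CA distance, %s %d to %s %d: %.2f angstroms\n",
    $atoms[0]{res}, $atoms[0]{resnum}, $atoms[1]{res}, $atoms[1]{resnum}, $d;
```

The three numbers after the residue number are the x, y, z coordinates in angstroms, followed by the occupancy and the temperature factor.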
A Peek at Protein Function
• Serine proteases – cleave other proteins
Three Serine Proteases
• Chymotrypsin – Cleaves the peptide bond on
the carboxyl side of aromatic (ring) residues: Trp, Phe, Tyr; and large hydrophobic residues: Met.
• Trypsin – Cleaves after Lys (K) or Arg (R)
• Positive charge
• Elastase – Cleaves after small residues: Gly,
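The cleavage rules for trypsin and chymotrypsin can be tried out with a small in-silico digest. This is a simplified sketch: it cuts after every listed residue and ignores any other sequence context. Elastase is left out because its residue list is incomplete here. The `digest` helper and the test peptide are hypothetical.

```perl
use strict;
use warnings;

# Residues after which each protease cuts, per the list above.
# Elastase is omitted because its residue list is incomplete in the source.
my %cuts_after = (
    trypsin      => 'KR',
    chymotrypsin => 'WFYM',
);

# Split a one-letter sequence after every residue in the enzyme's set.
sub digest {
    my ($enzyme, $seq) = @_;
    my $set = $cuts_after{$enzyme} or die "unknown enzyme: $enzyme";
    return split /(?<=[$set])/, $seq;
}

# Hypothetical test peptide, not from the source.
my $peptide = 'AGKVLFRSTWE';
for my $enzyme (sort keys %cuts_after) {
    print "$enzyme: ", join('-', digest($enzyme, $peptide)), "\n";
}
```

The zero-width lookbehind keeps the cleaved residue at the end of each fragment, which matches "cleaves on the carboxyl side".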
The Protein Folding Problem
• Central question of molecular biology:
“Given a particular sequence of amino acid
residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
• Input: AAVIKYGCAL…
• Output: φ₁ψ₁, φ₂ψ₂…
Protein Folding – Biological perspective
• “Central dogma”: Sequence specifies structure
• Denature – to “unfold” a protein back to
random coil configuration
• β-mercaptoethanol – breaks disulfide bonds
• Urea or guanidine hydrochloride – denaturant
• Also heat or pH
• Anfinsen’s experiments
• Denatured ribonuclease
• Spontaneously regained enzymatic activity
Folding intermediates
• Levinthal’s paradox – Consider a 100 residue
protein. If each residue can take only 3
positions, there are 3¹⁰⁰ = 5 × 10⁴⁷ possible
conformations.
• If it takes 10⁻¹³ s to convert from 1 structure to another, exhaustive search would take 1.6 × 10²⁷
years!
• Folding must proceed by progressive
stabilization of intermediates
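The Levinthal estimate is easy to reproduce. The sketch below redoes the arithmetic from the figures on the slide; the only added assumption is the length of a year (365.25 days).

```perl
use strict;
use warnings;

my $residues         = 100;     # chain length from the slide
my $states           = 3;       # positions available to each residue
my $seconds_per_step = 1e-13;   # time to convert one structure to another

my $conformations    = $states ** $residues;
my $seconds          = $conformations * $seconds_per_step;
my $seconds_per_year = 365.25 * 24 * 3600;   # assumption: 365.25-day year
my $years            = $seconds / $seconds_per_year;

printf "conformations: %.1e\n", $conformations;
printf "years:         %.1e\n", $years;
```

The result is on the order of 5 × 10⁴⁷ conformations and 1.6 × 10²⁷ years, in line with the slide.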
Forces driving protein folding
• It is believed that hydrophobic collapse is a key
driving force for protein folding
• Hydrophobic core
• Polar surface interacting with solvent
• Minimum volume (no cavities)
• Disulfide bond formation stabilizes the fold
• Hydrogen bonds
Folding help
• Proteins are, in fact, only marginally stable
• Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form
• Many proteins help in folding
• Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
The Hydrophobic Core
• Hemoglobin A is the protein in red blood cells
(erythrocytes) responsible for binding oxygen.
• The mutation E6→V in the β chain places a
hydrophobic Val on the surface of hemoglobin
• The resulting “sticky patch” causes hemoglobin
S to agglutinate (stick together) and form fibers which deform the red blood cell and do not
carry oxygen efficiently
• Sickle cell anemia was the first identified
Sickle Cell Anemia
Computational Problems in Protein Folding
• Two key questions:
• Evaluation – how can we tell a correctly folded protein from an incorrectly folded protein?
• H-bonds, electrostatics, hydrophobic effect, etc.
• Derive a function, see how well it does on “real” proteins
• Optimization – once we get an evaluation function, can we optimize it?
• Simulated annealing/Monte Carlo
• EC (evolutionary computation)
• Heuristics
Fold Optimization
• Simple lattice models (HP-models)
• Two types of residues: hydrophobic and polar
• 2-D or 3-D lattice
• The only force is hydrophobic collapse
• H/P model scoring: count noncovalent
hydrophobic interactions.
• Sometimes:
• Penalize for buried polar or surface hydrophobic residues
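A minimal sketch of the basic H/P score on a 2-D square lattice, assuming a fold is given as a list of `[x, y]` points, one per residue. The optional penalties are not included, and `hp_score` is a hypothetical name.

```perl
use strict;
use warnings;

# Count non-covalent H-H contacts: pairs of hydrophobic residues that sit on
# adjacent lattice points but are not neighbours along the chain.
sub hp_score {
    my ($seq, @coords) = @_;     # $seq like 'HPPH'; @coords = ([x, y], ...)
    my @res   = split //, $seq;
    my $score = 0;
    for my $i (0 .. $#res - 2) {
        next unless $res[$i] eq 'H';
        for my $j ($i + 2 .. $#res) {
            next unless $res[$j] eq 'H';
            my $dist = abs($coords[$i][0] - $coords[$j][0])
                     + abs($coords[$i][1] - $coords[$j][1]);
            $score++ if $dist == 1;   # adjacent on the square lattice
        }
    }
    return $score;
}

# HPPH folded into a square brings the two H residues together.
print hp_score('HPPH', [0, 0], [1, 0], [1, 1], [0, 1]), "\n";   # prints 1
# The same chain fully extended has no contacts.
print hp_score('HPPH', [0, 0], [1, 0], [2, 0], [3, 0]), "\n";   # prints 0
```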
What can we do with lattice models?
• For smaller polypeptides, exhaustive search can
be used
• Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
• For larger chains, other optimization and search
methods must be used
• Greedy, branch and bound
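For short chains the exhaustive search can be written as a depth-first enumeration of every self-avoiding walk on the lattice. The sketch below assumes a 2-D square lattice and scores only H-H contacts; the function names and the example sequence are hypothetical.

```perl
use strict;
use warnings;

my @STEPS = ([0, 1], [1, 0], [0, -1], [-1, 0]);   # U, R, D, L

# H-H contacts between lattice neighbours that are not chain neighbours.
sub contacts {
    my ($res, $path) = @_;
    my $n = 0;
    for my $i (0 .. $#$path - 2) {
        for my $j ($i + 2 .. $#$path) {
            next unless $res->[$i] eq 'H' && $res->[$j] eq 'H';
            $n++ if abs($path->[$i][0] - $path->[$j][0])
                  + abs($path->[$i][1] - $path->[$j][1]) == 1;
        }
    }
    return $n;
}

# Depth-first enumeration of every self-avoiding walk for the chain.
sub search {
    my ($res, $path, $used, $best) = @_;
    if (@$path == @$res) {
        my $score = contacts($res, $path);
        @$best = ($score, [map { [@$_] } @$path]) if $score > $best->[0];
        return;
    }
    my ($x, $y) = @{ $path->[-1] };
    for my $step (@STEPS) {
        my ($nx, $ny) = ($x + $step->[0], $y + $step->[1]);
        next if $used->{"$nx,$ny"};      # occupied: this would be a bump
        $used->{"$nx,$ny"} = 1;
        push @$path, [$nx, $ny];
        search($res, $path, $used, $best);
        pop @$path;
        delete $used->{"$nx,$ny"};
    }
}

sub best_fold {
    my ($seq) = @_;
    my @res = split //, $seq;
    return (0, []) unless @res;
    my @best = (-1, []);
    search(\@res, [[0, 0]], { '0,0' => 1 }, \@best);
    return @best;                        # (score, list of [x, y])
}

my ($score, $fold) = best_fold('HPHPPHHPH');   # hypothetical 9-residue chain
print "best score: $score\n";
print join(' ', map { "($_->[0],$_->[1])" } @$fold), "\n";
```

The number of walks grows roughly fourfold per added residue, which is why greedy or branch-and-bound methods take over for longer chains.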
Learning from Lattice Models
• The “hydrophobic zipper” effect:
• Absolute directions
• UURRDLDRRU
• Relative directions
• LFRFRRLLFFL
• Advantage of relative directions: an immediate reversal (UD or RL in absolute directions) cannot be written
• Only three directions: L, R, F
• What about bumps? LFRRR
• Bad score
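The bump in LFRRR can be checked by decoding the string. This sketch rests on an assumed convention that is not stated on the slide: the chain starts at (0,0) facing "up", and each letter first turns the heading and then steps forward one lattice unit. `decode_relative` is a hypothetical name.

```perl
use strict;
use warnings;

# Decode a relative-direction string (L, R, F) into lattice coordinates.
# Assumptions: the chain starts at (0,0) facing "up", and each letter first
# turns the current heading and then steps one lattice unit forward.
sub decode_relative {
    my ($moves) = @_;
    my ($x, $y)   = (0, 0);
    my ($dx, $dy) = (0, 1);
    my @path  = ([$x, $y]);
    my %seen  = ("$x,$y" => 1);
    my $bumps = 0;
    for my $m (split //, $moves) {
        if    ($m eq 'L') { ($dx, $dy) = (-$dy,  $dx) }
        elsif ($m eq 'R') { ($dx, $dy) = ( $dy, -$dx) }
        elsif ($m ne 'F') { die "unknown move: $m" }
        ($x, $y) = ($x + $dx, $y + $dy);
        $bumps++ if $seen{"$x,$y"}++;    # point was already occupied
        push @path, [$x, $y];
    }
    return ($bumps, \@path);
}

my ($bumps, $path) = decode_relative('LFRRR');
print "bumps: $bumps\n";   # the final R re-enters an occupied point
```

Under this convention the three consecutive right turns close a unit square, so the last step lands on a point the chain already occupies.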
Preference-order representation
• Each position has two “preferences”
• If it can’t have either of the two, it will take the “least favorite” path if possible
• Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
• Can still cause bumps:
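One way to decode a preference-order string is sketched below. It assumes the preferences are relative directions, that the chain starts at (0,0) facing "up", and that when all three directions are blocked the first preference is taken anyway and counted as a bump. None of these conventions is spelled out on the slide, and the function names are hypothetical.

```perl
use strict;
use warnings;

# Turn a heading (dx, dy) according to a relative move.
sub turn {
    my ($m, $dx, $dy) = @_;
    return (-$dy,  $dx) if $m eq 'L';
    return ( $dy, -$dx) if $m eq 'R';
    return ( $dx,  $dy);                 # 'F'
}

# Decode a list of two-letter preferences, e.g. ('LR', 'FL', ...).
# Assumptions: relative directions, start at (0,0) facing "up"; when all
# three directions are blocked, the first preference is taken and counted
# as a bump.
sub decode_preferences {
    my @prefs = @_;
    my ($x, $y, $dx, $dy) = (0, 0, 0, 1);
    my @path  = ([0, 0]);
    my %seen  = ('0,0' => 1);
    my $bumps = 0;
    for my $pref (@prefs) {
        my @order = split //, $pref;
        my ($least) = grep { index($pref, $_) < 0 } qw(L R F);
        push @order, $least;             # the "least favorite" direction
        my ($choice) = grep {
            my ($tx, $ty) = turn($_, $dx, $dy);
            !$seen{ ($x + $tx) . ',' . ($y + $ty) }
        } @order;
        if (!defined $choice) { $choice = $order[0]; $bumps++ }
        ($dx, $dy) = turn($choice, $dx, $dy);
        ($x, $y)   = ($x + $dx, $y + $dy);
        $seen{"$x,$y"} = 1;
        push @path, [$x, $y];
    }
    return ($bumps, \@path);
}

my ($bumps, $path) = decode_preferences(qw(LR FL RL FR RL RL FR RF));
printf "bumps: %d, end: (%d,%d)\n", $bumps, @{ $path->[-1] };
```

With these conventions the example string decodes without a bump. The seventh position falls back to its least favorite direction because both of its preferences are occupied.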
“Decoding” the representation
• The optimizer works on the representation, but
to score, we have to “decode” into a structure that lets us check for bumps and score.
• Example: How many bumps in:
URDDLLDRURU?
• We can do it on graph paper
• Start at 0,0
• Fill in the graph
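The graph-paper procedure translates directly into a two-dimensional array. In this sketch a bump is taken to be any move that lands on a lattice point the chain already occupies, and the grid is made large enough that no walk can leave it. `count_bumps` is a hypothetical name.

```perl
use strict;
use warnings;

my %STEP = (U => [0, 1], D => [0, -1], L => [-1, 0], R => [1, 0]);

# Walk an absolute-direction string on a grid and count bumps, i.e. moves
# that land on a lattice point the chain already occupies.
sub count_bumps {
    my ($moves) = @_;
    my $n = length $moves;
    # A (2n+1) x (2n+1) grid with the start in the middle can hold any walk.
    my @grid = map { [ (0) x (2 * $n + 1) ] } 0 .. 2 * $n;
    my ($x, $y) = ($n, $n);              # grid cell standing in for (0,0)
    $grid[$x][$y] = 1;
    my $bumps = 0;
    for my $move (split //, $moves) {
        my $step = $STEP{$move} or die "unknown move: $move";
        $x += $step->[0];
        $y += $step->[1];
        $bumps++ if $grid[$x][$y];       # cell already filled in
        $grid[$x][$y]++;
    }
    return $bumps;
}

print count_bumps('URDDLLDRURU'), "\n";   # prints 3
```

Under that definition the last three moves (U, R, U) each land on an occupied point, so the example has three bumps.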
A two-dimensional array in PERL
Setting up the grid
foreach $move (@moves) {
More realistic models
• Higher resolution lattices (45° lattice, etc.)
• Off-lattice models
• Local moves
• Optimization/search methods and φ/ψ representations
• Greedy search
• Branch and bound
The Other Half of the Picture
• Now that we have a more realistic off-lattice
model, we need a better energy function to evaluate a conformation (fold).
• Theoretical force field:
• ∆G = ∆G(van der Waals) + ∆G(H-bonds) + ∆G(solvent) + ∆G(Coulomb)
• Empirical force fields
Threading: Fold recognition
• Given:
• Sequence: IVACIVSTEYDVMKAAR…
• A database of molecular coordinates
• Map the sequence onto each fold
• Evaluate
• Objective 1: improve scoring function
Secondary Structure Prediction
• Easier than folding
• Current algorithms can predict secondary structure with 70-80% accuracy
• Chou, P.Y. & Fasman, G.D. (1974).
Biochemistry, 13, 211-222.
• Based on frequencies of occurrence of residues in helices and sheets
• PHD – Neural network based
• Uses a multiple sequence alignment
Chou-Fasman Parameters
Name Abbrv P(a) P(b) P(turn) f(i) f(i+1) f(i+2) f(i+3)
Chou-Fasman Algorithm
• Identify α-helices
• 4 out of 6 contiguous amino acids that have P(a) > 100
• Extend the region until 4 amino acids with P(a) < 100 found
• Compute ΣP(a) and ΣP(b); If the region is >5 residues and ΣP(a) > ΣP(b) identify as a helix
• Repeat for β-sheets [use P(b)]
• If an α and a β region overlap, the overlapping
Chou-Fasman, cont’d
• Identify hairpin turns:
• P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the
residue at position (i+3)
• Predict a hairpin turn starting at positions where:
P(t) > 0.000075
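As a worked check of the product rule, the sketch below evaluates the tetrapeptide VRGP. The parameter table did not survive in this copy of the slides, so the four positional frequencies and P(turn) values are assumed from commonly reproduced Chou-Fasman tables; treat them as illustrative.

```perl
use strict;
use warnings;

# One entry per position of the tetrapeptide. Assumption: the frequencies and
# P(turn) values are commonly reproduced Chou-Fasman parameters, not values
# read from this document (its parameter table is missing).
my @tetrapeptide = (
    { res => 'V', f => 0.062, p_turn =>  50 },   # f(i)
    { res => 'R', f => 0.106, p_turn =>  95 },   # f(i+1)
    { res => 'G', f => 0.190, p_turn => 156 },   # f(i+2)
    { res => 'P', f => 0.068, p_turn => 152 },   # f(i+3)
);

my $p_t = 1;
$p_t *= $_->{f} for @tetrapeptide;               # product of the four f values

my $avg_p_turn = 0;
$avg_p_turn += $_->{p_turn} for @tetrapeptide;
$avg_p_turn /= @tetrapeptide;

my $is_turn = $p_t > 0.000075;                   # threshold from the slide
printf "P(t) = %.6f, average P(turn) = %.2f, turn predicted: %s\n",
    $p_t, $avg_p_turn, $is_turn ? 'yes' : 'no';
```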
Chou-Fasman Example
• CAENKLDHVRGPTCILFMTWYNDGP
• CAENKL – Potential helix (!C and !N)
• Residues with P(a) < 100: RNCGPSTY
• Extend: When we reach RGPT, we must stop
• CAENKLDHV: ΣP(a) = 972, ΣP(b) = 843
• Declare alpha helix
• Identifying a hairpin turn
• VRGP: P(t) = 0.000085
• Average P(turn) = 113.25
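The helix part of the example can be reproduced with a short sketch. The P(a) and P(b) values below cover only the residues of CAENKLDHV and are assumed from commonly reproduced Chou-Fasman tables, since the parameter table is missing from this copy. The helper names are hypothetical, and the extension step is taken as given rather than implemented.

```perl
use strict;
use warnings;

# P(a) and P(b) for the residues of CAENKLDHV only. Assumption: commonly
# reproduced Chou-Fasman values; the slide's parameter table is missing.
my %P = (
    C => { a =>  70, b => 119 },  A => { a => 142, b =>  83 },
    E => { a => 151, b =>  37 },  N => { a =>  67, b =>  89 },
    K => { a => 114, b =>  74 },  L => { a => 121, b => 130 },
    D => { a => 101, b =>  54 },  H => { a => 100, b =>  87 },
    V => { a => 106, b => 170 },
);

# Sum one parameter ('a' or 'b') over a region.
sub sum_param {
    my ($seq, $which) = @_;
    my $total = 0;
    $total += $P{$_}{$which} for split //, $seq;
    return $total;
}

# Nucleation test: at least 4 of the 6 residues have P(a) > 100.
sub nucleates_helix {
    my ($window) = @_;
    my $formers = grep { $P{$_}{a} > 100 } split //, $window;
    return $formers >= 4;
}

my $region = 'CAENKLDHV';   # nucleus CAENKL extended up to, not into, RGPT
my ($sum_a, $sum_b) = (sum_param($region, 'a'), sum_param($region, 'b'));
my $is_helix = nucleates_helix(substr($region, 0, 6))
            && length($region) > 5
            && $sum_a > $sum_b;
printf "sum P(a) = %d, sum P(b) = %d, helix: %s\n",
    $sum_a, $sum_b, $is_helix ? 'yes' : 'no';
```

In the nucleus CAENKL only C and N fall below 100, which gives the four helix formers that the rule requires.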