The weights on the terms in the Rosetta energy function can be optimized to create energy functions that perform better in certain applications. For example, there is considerable interest in obtaining an energy function that can reliably discriminate the native structure of a protein from thousands of low-energy decoys. Such an energy function would be extremely useful in ab initio structure prediction. Alternatively, one may want an energy function that can correctly predict the experimental change in free energy for mutations to a protein. The weights currently in Rosetta were parameterized so that redesigned proteins would reproduce native protein sequences. This result was obtained by maximizing the Boltzmann probability of the native amino acid over all positions in a set of 30 proteins. Sharabi et al. (25) recently described an optimized energy function for protein-protein interface design.

Residue  No.      No.     No.       No. correct/  No.      No.     No.       No. correct/  No.      No.      No.       No. correct/
         correct  native  designed  No. native    correct  native  designed  No. native    correct  native   designed  No. native
         core     core    core      core                                                   surface  surface  surface   surface
LEU      66       122     92        0.54          180      378     454       0.48          18       54       174       0.33
GLY      56       69      66        0.81          302      400     368       0.75          164      210      205       0.78
ASP      10       20      29        0.5           80       268     354       0.3           49       167      221       0.29
SER      20       37      106       0.54          79       258     348       0.31          25       119      128       0.21
GLU      2        7       16        0.29          54       289     308       0.19          31       180      158       0.17
PHE      37       60      69        0.62          98       193     302       0.51          6        23       68        0.26
LYS      2        11      9         0.18          62       325     285       0.19          30       193      148       0.16
ARG      4        14      20        0.29          37       185     262       0.2           10       85       76        0.12
ALA      57       103     94        0.55          118      385     259       0.31          9        144      30        0.06
ILE      41       83      70        0.49          105      242     231       0.43          7        33       39        0.21
THR      19       41      62        0.46          61       291     225       0.21          30       145      110       0.21
TYR      9        40      27        0.22          39       158     224       0.25          6        23       98        0.26
VAL      47       98      67        0.48          107      308     194       0.35          8        60       37        0.13
HIS      10       18      41        0.56          30       118     176       0.25          5        37       32        0.14
GLN      3        17      6         0.18          18       186     172       0.1           10       94       110       0.11
ASN      3        14      11        0.21          33       206     158       0.16          20       110      120       0.18
TRP      10       23      21        0.43          32       76      145       0.42          4        12       56        0.33
PRO      6        18      7         0.33          61       229     77        0.27          29       144      40        0.2
MET      10       31      25        0.32          20       90      70        0.22          1        20       5         0.05
CYS      0        12      0         0             0        27      0         0             0        2        0         0
Total    412      838               0.492         1516     4612              0.329         462      1855               0.249

Table 2.1: Native sequence recovery of Rosetta redesigns. Native sequence recoveries by amino acid type are reported for a set of 32 proteins redesigned with Rosetta.
A new, flexible, weight-fitting protocol was recently implemented in Rosetta (Andrew Leaver-Fay, in preparation). The protocol works by searching the space of all weights for a combination that maximizes the fitness of the selected objective function(s). For example, the protocol can be used to train weights for highest native sequence recovery. In this case, the protocol starts by calculating the unweighted energies of every possible rotamer at every position in a set of proteins. The positions surrounding the one being considered are held fixed at their native amino acid. The dot product of a candidate set of weights with the unweighted energies is used to obtain a fitness. In the case of native sequence recovery, the fitness is
calculated according to the equation below:

\[
\mathrm{Fitness} = \sum_{\mathrm{proteins}} \; \sum_{\mathrm{positions}} -\ln\left( \frac{e^{-E(aa_{\mathrm{nat}})}}{\sum_{i=1}^{20} e^{-E(aa_i)}} \right)
\]
where $E(aa_{\mathrm{nat}})$ is the energy of the native amino acid at a position and the denominator is the partition function over all 20 amino acids at that position. Instead of multiplying many small probabilities, the sum of the negative logarithms of the probabilities is minimized. If the protocol is instead used to optimize for predicting changes in free energy, the fitness is the sum of squared differences between the predicted and experimental ∆∆G for all mutants. Candidate weight sets are obtained by using particle swarm optimization (PSO) to search weight space. The best weight set found by PSO in each round is then refined by conjugate gradient minimization. If optimizing for native sequence recovery, the minimized weight set is then used to perform full protein redesigns. This complete redesign step ensures that weights optimized in a fixed environment still perform well in whole-protein redesigns. More details of the protocol are provided in the Methods section of chapter 3.
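The two fitness functions described above can be illustrated with a minimal Python sketch. It is not the Rosetta implementation. It assumes, for simplicity, that each amino acid at each position is summarized by a single vector of unweighted term energies (the protocol itself evaluates every rotamer), and that energies are expressed in units of kT, as the equation implies. The function names and array layouts are hypothetical.

```python
import numpy as np


def native_recovery_fitness(weights, unweighted, native_idx):
    """Sum over positions of -ln P(native amino acid); lower is better.

    weights:    (T,) candidate weight for each of T energy terms
    unweighted: (P, A, T) unweighted term energies for A amino acids
                at P positions (assumed layout, one entry per amino acid)
    native_idx: (P,) index of the native amino acid at each position
    """
    # Weighted energy of each amino acid: dot product of weights and terms.
    energies = unweighted @ weights  # shape (P, A)

    # ln of the partition function, computed with the log-sum-exp trick
    # so that large energies do not overflow.
    neg_e = -energies
    shift = neg_e.max(axis=1, keepdims=True)
    log_z = shift[:, 0] + np.log(np.exp(neg_e - shift).sum(axis=1))

    # -ln(exp(-E_nat) / Z) simplifies to E_nat + ln Z.
    native_e = energies[np.arange(len(native_idx)), native_idx]
    return float(np.sum(native_e + log_z))


def ddg_fitness(weights, unweighted_ddg, experimental_ddg):
    """Sum of squared differences between predicted and experimental ddG.

    unweighted_ddg:   (M, T) per-term energy differences (mutant - wild type)
    experimental_ddg: (M,) measured values for the same M mutants
    """
    predicted = unweighted_ddg @ weights
    return float(np.sum((predicted - experimental_ddg) ** 2))
```

With all energies equal, every amino acid has probability 1/20, so each position contributes ln 20 to the native recovery fitness. Writing the per-position term as E_nat + ln Z avoids forming the small probabilities explicitly, which is the same motivation given above for summing logarithms rather than multiplying probabilities.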
The ability to optimize the weights so that redesigns have native-like amino acid (AA) composition was added to the weight-fitting protocol later. Unlike native amino acid probability or ∆∆G prediction, energy function weights cannot be optimized directly for AA composition. Instead, AA composition is optimized after the complete redesign step of the protocol by adjusting each reference energy up or down depending on whether that residue type is designed too often or too rarely. The cross entropy between the designed and native amino acid distributions, a measure of the difference between two distributions, was used to determine whether AA composition was becoming more native-like. Only weight sets that increased the overall sequence recovery and decreased the cross entropy were accepted.
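The composition step can be sketched in the same spirit. The following is an illustration under stated assumptions, not the protocol's actual code: the fixed step size, the function names, and the sign convention (a higher reference energy makes a residue type less favorable during design) are assumptions made for this example.

```python
import numpy as np


def cross_entropy(native_counts, designed_counts, eps=1e-9):
    """H(native, designed) = -sum_a p_native(a) * ln p_designed(a).

    eps guards against ln(0) when a residue type is never designed.
    """
    p = np.asarray(native_counts, dtype=float)
    q = np.asarray(designed_counts, dtype=float)
    p = p / p.sum()
    q = q / q.sum()
    return float(-np.sum(p * np.log(q + eps)))


def adjust_reference_energies(ref_energies, native_counts, designed_counts, step=0.1):
    """Nudge reference energies toward native-like composition.

    Over-designed residue types get a higher (less favorable) reference
    energy and under-designed types a lower one. The fixed step is a
    hypothetical choice made for this sketch.
    """
    ref = np.asarray(ref_energies, dtype=float)
    native = np.asarray(native_counts, dtype=float)
    designed = np.asarray(designed_counts, dtype=float)
    return ref + step * np.sign(designed - native)


def accept_weight_set(new_recovery, old_recovery, new_ce, old_ce):
    """Accept only if sequence recovery rises and cross entropy falls."""
    return new_recovery > old_recovery and new_ce < old_ce
```

For example, calling `adjust_reference_energies` with native counts `[378, 385, 400]` and designed counts `[454, 259, 400]` (the LEU, ALA, and GLY counts from the unqualified middle columns of Table 2.1) raises the first reference energy, lowers the second, and leaves the third unchanged.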
The ideal energy function for protein design would produce native-like designed proteins and accurately predict changes in stability for point mutants. Using the weight-fitting protocol described above, considerable effort was spent optimizing weights and testing different combinations of energy terms during the development of the score term described in chapter 3. Energy function optimization is a very hard problem because of the vast size of weight space, the number of options present in Rosetta, and the variety of metrics that must be examined for each energy function. The first generations of weight optimization runs trained for overall native sequence recovery, and the only metrics considered were core and overall native sequence recovery. Shortly afterward, the goal was changed to optimizing weights to do well at both sequence recovery and ∆∆G prediction. This added the ∆∆G correlation coefficient to the list of metrics that had to be considered for a set of weights. Many different options in Rosetta were tested to see what effect they had on the metrics: extra rotamers at surface positions, extra rotamers throughout, multiple packing runs, inclusion of crystal structure rotamers, and modifications to the pair energy term and solvation term. Over time, the list of metrics expanded to include total hydrophobic surface area, percent of residues on the surface that were hydrophobic, and AA composition. In the end, because of difficulties in accurately predicting ∆∆G, energy function weights were optimized only for native sequence recovery and AA composition.
Training weights to accurately predict ∆∆G proved considerably more difficult than expected. We found that when weights were trained to reproduce changes in stability, designing with the resulting energy function produced proteins composed almost entirely of hydrophobic residues. This result makes sense because the folded state is almost always desolvated to some extent compared to the unfolded state. Because solvation energy is one of the largest contributions to the total energy, hydrophobic residues are favored even on the surface because of their favorable desolvation energy. We also found that many mutations were predicted to be significantly more destabilizing than they are in reality because of high Lennard-Jones repulsive energies. This result made us realize that ∆∆G prediction depends greatly on how the mutant structures are created. In fact, a study comparing methods of creating mutant structures and their effect on ∆∆G prediction (26) was published around the same time we were experimenting with ∆∆G prediction. The authors found that allowing more conformational freedom to relax away clashes during mutant structure creation greatly improves the correlations obtained. Instead of trying to optimize weights for ∆∆G prediction on a set of poorly made mutant structures, we elected to optimize energy functions only for native sequence recovery and AA composition and then to test ∆∆G prediction accuracy using the protocol described by Kellogg et al. (26).
References
1. Berman, H., Henrick, K., Nakamura, H., and Markley, J. (2006) The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Research ISSN 0305-1048
2. Rohl, C., Strauss, C., Misura, K., and Baker, D. (2004) Protein structure prediction using Rosetta. Methods in Enzymology 383, 66–93. ISSN 0076-6879
3. Grigoryan, G., Ochoa, A., and Keating, A. (2007) Computing van der Waals energies in the context of the rotamer approximation. Proteins: Structure, Function, and Bioinformatics 68, 863–878. ISSN 1097-0134
4. Brooks, B., Bruccoleri, R., Olafson, B., et al. (1983) CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. Journal of Computational Chemistry 4, 187–217. ISSN 1096-987X
5. MacKerell Jr, A., Bashford, D., Bellott, M., Dunbrack Jr, R., Evanseck, J., Field, M., Fischer, S., Gao, J., Guo, H., Ha, S., et al. (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. The Journal of Physical Chemistry B 102, 3586–3616. ISSN 1520-6106
6. Neria, E., Fischer, S., and Karplus, M. (1996) Simulation of activation free energies in molecular systems. The Journal of Chemical Physics 105, 1902
7. Eisenberg, D. and McLachlan, A. (1986) Solvation energy in protein folding and binding. Nature
8. Marshall, S., Vizcarra, C., and Mayo, S. (2005) One-and two-body decomposable Poisson-Boltzmann methods for protein design calculations. Protein Science 14, 1293–1304. ISSN 1469-896X
9. Vizcarra, C., Zhang, N., Marshall, S., Wingreen, N., Zeng, C., and Mayo, S. (2008) An improved pairwise decomposable finite-difference Poisson–Boltzmann method for computational protein design. Journal of Computational Chemistry 29, 1153–1162. ISSN 1096-987X
accurate continuum electrostatics and solvation. Protein science 13, 925–936. ISSN 1469-896X
11. Dzubiella, J., Swanson, J., and McCammon, J. (2006) Coupling nonpolar and polar solvation free energies in implicit solvent models. The Journal of Chemical Physics 124, 084905
12. Lazaridis, T. and Karplus, M. (1999) Effective energy function for proteins in solution. Proteins: Structure, Function, and Bioinformatics 35, 133–152. ISSN 1097-0134
13. Kortemme, T., Morozov, A., and Baker, D. (2003) An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein-protein complexes. Journal of Molecular Biology 326, 1239–1259. ISSN 0022-2836
14. Simons, K., Ruczinski, I., Kooperberg, C., Fox, B., Bystroff, C., and Baker, D. (1999) Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins: Structure, Function, and Bioinformatics 34, 82–95. ISSN 1097-0134
15. Dunbrack Jr, R. and Cohen, F. (1997) Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science 6, 1661–1681. ISSN 1469-896X
16. Kuhlman, B. and Baker, D. (2000) Native protein sequences are close to optimal for their structures. Proceedings of the National Academy of Sciences of the United States of America 97, 10383
17. Voigt, C., Gordon, D., and Mayo, S. (2000) Trading accuracy for speed: a quantitative comparison of search algorithms in protein sequence design. Journal of Molecular Biology 299, 789–803. ISSN 0022-2836
18. Desmet, J., Maeyer, M., Hazes, B., and Lasters, I. (1992) The dead-end elimination theorem and its use in protein side-chain positioning. Nature 356, 539–542. ISSN 0028-0836
19. Gordon, D. and Mayo, S. (1999) Branch-and-terminate: A combinatorial optimization algorithm for protein design. Structure 7, 1089–1098. ISSN 0969-2126
20. Eriksson, O., Zhou, Y., and Elofsson, A. (2001) Side chain-positioning as an integer programming problem. In Proceedings of the First International Workshop on Algorithms in Bioinformatics, WABI ’01, 128–141. Springer-Verlag. ISBN 3-540-42516-0
21. Leaver-Fay, A., Kuhlman, B., and Snoeyink, J. (2005) An adaptive dynamic programming algorithm for the side chain placement problem. In Pacific Symposium on Biocomputing, volume 10, 16–27
22. Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., Teller, E., et al. (1953) Equation of state calculations by fast computing machines. The Journal of Chemical Physics 21, 1087. ISSN 0021-9606
23. Kirkpatrick, S., Gelatt Jr, C., and Vecchi, M. (1983) Optimization by simulated annealing. Science 220, 671–680. ISSN 1095-9203
24. Dunbrack, R. and Karplus, M. (1994) Conformational analysis of the backbone-dependent rotamer preferences of protein sidechains. Nature Structural & Molecular Biology 1, 334–340
25. Sharabi, O., Yanover, C., Dekel, A., and Shifman, J. (2011) Optimizing energy functions for protein–protein interface design. Journal of Computational Chemistry ISSN 1096-987X
26. Kellogg, E., Leaver-Fay, A., and Baker, D. (2011) Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins: Structure, Function, and Bioinformatics ISSN 1097-0134