2.2 Background
2.2.3 Single-state Design Models
Single-state design is arguably the current state of the art. It has been used to design a wide range of de novo protein topologies[8, 19], de novo catalytic proteins[29], small-molecule binding domains with nanomolar affinity[31], and higher-order protein assemblies[17, 18, 32]. Relative to models with more complex representations, single-state design is computationally efficient, and replicate simulations can be run independently in a highly parallelized fashion. Generally, the many outputs of replicate simulations are passed through additional downstream filters to select designs for final analysis and experimental characterization.
2.2.3.1 Representation
Molecules in single-state design are typically represented as full-atom models. Every nucleus in the protein is represented as a point. Each atom is assigned one of a set of atom types (e.g. aliphatic carbon, aromatic nitrogen, etc.) that defines its atomic properties, including partial charge, hydrogen-bonding capacity, van der Waals radius, Lennard-Jones potential well depth, and desolvation potential. Electrons and electron density are not explicitly represented in full-atom representation as they are in quantum mechanical simulations. Protein backbones remain in a fixed conformation throughout single-state design. This is a feature that differentiates single-state design from other methodologies (Figure 2.1, page 15). The only degrees of freedom are in the side-chains, which are allowed to rotate around their chi angles. The discrete conformation of a side-chain is referred to as a rotamer. Rotamer libraries have been generated that sample the most probable rotamers based on statistics from the Protein Data Bank (PDB)[33, 34], and some design algorithms use continuous probability functions for continuous rotamer sampling[35, 36].
Figure 2.1: Conceptual diagram and process flowchart for A) single-state, B) ensemble, and C) multi-state design models. Circles represent the backbone conformations used in the model. Ensembles sampling the same energy well are shown as stacked circles of uniform color. Models sampling different energy wells are shown as spaced circles that are colored uniquely. Below each diagram is a flowchart of generic steps used in each approach.
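As an illustration of the full-atom representation described above, the following is a minimal sketch of how atom types, atoms, and rotamers might be organized in code. All class and field names are hypothetical, and the numeric values are placeholders rather than parameters from any published force field.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AtomType:
    """Properties shared by every atom of one type (illustrative fields only)."""
    name: str
    partial_charge: float
    can_donate_hbond: bool
    can_accept_hbond: bool
    vdw_radius: float       # angstroms
    lj_well_depth: float    # arbitrary energy units
    desolvation: float      # arbitrary energy units


@dataclass
class Atom:
    """A nucleus represented as a point; electrons are not modeled explicitly."""
    atom_type: AtomType
    xyz: Tuple[float, float, float]


@dataclass(frozen=True)
class Rotamer:
    """One discrete side-chain conformation, defined by its chi angles."""
    amino_acid: str
    chi_angles: Tuple[float, ...]  # degrees


@dataclass
class Position:
    """A design position on the fixed backbone with its candidate rotamers."""
    index: int
    rotamers: List[Rotamer]
    current: int = 0  # index into `rotamers` of the rotamer currently placed

    def set_rotamer(self, i: int) -> None:
        # The backbone never moves; only the choice of rotamer changes.
        if not 0 <= i < len(self.rotamers):
            raise IndexError("rotamer index out of range")
        self.current = i


if __name__ == "__main__":
    # Placeholder values, chosen only to make the example run.
    aliphatic_c = AtomType("aliphatic carbon", 0.0, False, False, 2.0, 0.1, 1.0)
    atom = Atom(aliphatic_c, (0.0, 0.0, 0.0))
    pos = Position(12, [Rotamer("VAL", (175.0,)), Rotamer("LEU", (-65.0, 175.0))])
    pos.set_rotamer(1)
    print(atom.atom_type.name, pos.rotamers[pos.current])
```

Because the rotamer list at a position may mix amino acid identities, changing `current` can change the sequence as well as the side-chain conformation.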
2.2.3.2 Sampling
Sampling in single-state design is performed by changing the rotamers (side-chain conformations) that are placed on the backbone. This allows for sampling the space of rotamer configurations, where a configuration is a set of rotamer conformations over the entire protein. Sampling can be performed with rotamers for a fixed amino acid sequence, or with a rotamer set that includes rotamers for multiple amino acids at each position, which samples sequence space and rotamer configuration space simultaneously.
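To give a sense of scale for the configuration space described above, the following sketch counts rotamer configurations as the product of the number of rotamers at each position. The rotamer counts (9 per amino acid, 20 amino acids, 20 positions) are hypothetical round numbers chosen for illustration, not values from any rotamer library.

```python
import math


def configuration_space_size(rotamers_per_position):
    """Number of rotamer configurations: the product of per-position counts."""
    return math.prod(rotamers_per_position)


if __name__ == "__main__":
    n_positions = 20
    # Fixed sequence: assume 9 rotamers for the one amino acid at each position.
    fixed_sequence = [9] * n_positions
    # Design: assume 20 amino acids x 9 rotamers available at each position.
    design = [20 * 9] * n_positions
    for label, counts in (("fixed sequence", fixed_sequence), ("design", design)):
        size = configuration_space_size(counts)
        print(f"{label}: about 10^{math.log10(size):.0f} configurations")
```

Under these assumptions, allowing multiple amino acids per position multiplies the per-position count and therefore raises the total combinatorially, which is why the number of design positions matters so much for the sampling methods below.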
Monte Carlo simulated annealing is one method that optimizes the rotamers to output low-energy conformations[8, 37]. In each step of the simulated annealing algorithm, a randomly selected rotamer at a randomly selected position is trialed and evaluated by the Metropolis criterion (Figure 2.2, page 17). A change to the model is accepted if it lowers the model energy, whereas moves that increase the model energy are accepted with a probability that depends on the change in energy (ΔE) divided by a temperature factor (kT). During the simulation, the temperature cools, so moves that raise the energy are accepted less often as the trajectory proceeds. One caveat to simulated annealing is that it is not guaranteed to converge on the lowest-energy rotamer configuration, and the output is stochastic. Convergence on a solution (e.g. rotamer configuration, sequence, or score) can only be tested by multiple repetitions of the simulation. The probability of the simulation converging on a small set of similar solutions is a function of the size of rotamer configuration space, and thus the number of design positions is limited. The maximum number of designable positions in current implementations of simulated annealing is affected by factors such as protein topology and the number of rotamers per position, but fewer than 100 positions is a reasonable limit for many applications.
Figure 2.2: Application of the Metropolis criterion to simulated annealing. The probability (equation, top) of accepting (green) or rejecting (red) a Monte Carlo move (e.g. rotamer substitution) is shown as a function of the change in biophysical energy after the move (ΔE) at three different temperatures (labels, right).
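The acceptance rule and cooling procedure described above can be sketched on a toy problem. This is a minimal illustration, not the implementation used by any design package: the energy model (random one-body and two-body tables), the geometric cooling schedule, the default parameters, and all function names are assumptions made for the example.

```python
import math
import random


def accept_probability(delta_e, kt):
    """Metropolis criterion: 1 for downhill moves, exp(-dE/kT) for uphill moves."""
    if delta_e <= 0.0:
        return 1.0
    return math.exp(-delta_e / kt)


def total_energy(config, one_body, two_body):
    """Sum of per-rotamer energies plus pairwise energies between positions."""
    energy = sum(one_body[i][r] for i, r in enumerate(config))
    for (i, j), table in two_body.items():
        energy += table[config[i]][config[j]]
    return energy


def toy_problem(n_positions, n_rotamers, rng):
    """Random energy tables standing in for a real scorefunction (assumption)."""
    one_body = [[rng.uniform(-1, 1) for _ in range(n_rotamers)]
                for _ in range(n_positions)]
    two_body = {
        (i, j): [[rng.uniform(-1, 1) for _ in range(n_rotamers)]
                 for _ in range(n_rotamers)]
        for i in range(n_positions) for j in range(i + 1, n_positions)
    }
    return one_body, two_body


def anneal(one_body, two_body, steps=5000, kt_start=2.0, kt_end=0.05, rng=None):
    """Return (energy, configuration) of the lowest-energy state visited."""
    rng = rng or random.Random()
    config = [rng.randrange(len(e)) for e in one_body]
    energy = total_energy(config, one_body, two_body)
    best_energy, best_config = energy, list(config)
    for step in range(steps):
        # Geometric cooling from kt_start down to kt_end (an assumed schedule).
        kt = kt_start * (kt_end / kt_start) ** (step / max(steps - 1, 1))
        # Trial move: a random rotamer at a random position.
        pos = rng.randrange(len(config))
        trial = list(config)
        trial[pos] = rng.randrange(len(one_body[pos]))
        trial_energy = total_energy(trial, one_body, two_body)
        if rng.random() < accept_probability(trial_energy - energy, kt):
            config, energy = trial, trial_energy
            if energy < best_energy:
                best_energy, best_config = energy, list(config)
    return best_energy, best_config


if __name__ == "__main__":
    one_body, two_body = toy_problem(8, 5, random.Random(0))
    # Replicates differ only in their random seed, so they can run independently.
    for seed in range(3):
        energy, config = anneal(one_body, two_body, rng=random.Random(seed))
        print(seed, round(energy, 3), config)
```

Running several replicates with different seeds and comparing their outputs is the convergence check described in the text; replicates that disagree indicate that the trajectories are ending in different minima.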
In contrast, dead-end elimination (DEE) is a method that provably outputs the global minimum energy rotamer configuration[38, 39]. While the importance of the global minimum energy rotamer configuration depends on the appropriateness of the model, identifying the global minimum avoids local minima that can potentially trap a simulated annealing trajectory. A further caveat to DEE is that the simulation frequently does not converge on a solution, in which case no output is produced. Strategies that implement stochastic steps in sampling side-chains allow for the identification of non-optimal but low-energy configurations[40]. However, the probability of convergence decreases as a function of the combinatorial size of rotamer configuration space, so a reasonable number of design positions in DEE is 10-20, with the same caveats as for simulated annealing.
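The pruning idea behind DEE can be sketched with the original single-rotamer elimination criterion: a rotamer r at position i is discarded when its best-case energy is still worse than the worst-case energy of some alternative rotamer t at the same position. The code below is a minimal sketch on an assumed pairwise energy model with hypothetical function names and toy energy values; practical implementations use stronger criteria and additional search steps.

```python
import itertools


def pair_energy(two_body, i, r, j, s):
    """Look up the pair energy regardless of the order of the two positions."""
    if i < j:
        return two_body[(i, j)][r][s]
    return two_body[(j, i)][s][r]


def total_energy(config, one_body, two_body):
    energy = sum(one_body[i][r] for i, r in enumerate(config))
    for (i, j), table in two_body.items():
        energy += table[config[i]][config[j]]
    return energy


def dead_end_elimination(one_body, two_body):
    """Return, per position, the set of rotamers that could not be eliminated."""
    n = len(one_body)
    alive = [set(range(len(e))) for e in one_body]

    def bound(i, r, pick):
        # pick=min gives the best case for rotamer r, pick=max the worst case.
        return one_body[i][r] + sum(
            pick(pair_energy(two_body, i, r, j, s) for s in alive[j])
            for j in range(n) if j != i
        )

    changed = True
    while changed:  # repeat until a full pass eliminates nothing
        changed = False
        for i in range(n):
            for r in sorted(alive[i]):
                best_case_r = bound(i, r, min)
                if any(best_case_r > bound(i, t, max)
                       for t in alive[i] if t != r):
                    alive[i].discard(r)  # r cannot be in the global minimum
                    changed = True
    return alive


def brute_force_minimum(one_body, two_body, alive):
    """Exhaustively search whatever configurations remain after pruning."""
    candidates = itertools.product(*(sorted(a) for a in alive))
    return min(candidates, key=lambda c: total_energy(c, one_body, two_body))


if __name__ == "__main__":
    # Toy energies: rotamer 2 is strongly disfavored at every position.
    one_body = [[0.0, 0.5, 10.0] for _ in range(3)]
    table = [[0.5, -0.5, 0.0], [-0.5, 0.5, 0.0], [0.0, 0.0, 0.5]]
    two_body = {(0, 1): table, (0, 2): table, (1, 2): table}
    alive = dead_end_elimination(one_body, two_body)
    print("surviving rotamers per position:", alive)
    print("minimum:", brute_force_minimum(one_body, two_body, alive))
```

When pruning leaves more than one rotamer at some positions, as it does in this toy case, the remaining space still has to be searched. Here that is done by brute force, which is only feasible because the example is tiny.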
2.2.3.3 Scoring
A scorefunction for computational design or molecular modeling must have predictive power for which sequences and rotamer configurations will stabilize a given fold. Scorefunctions can be statistical potentials drawn from large datasets such as multiple sequence alignments [41, 42] and PDB statistics [36, 43–46]. Scorefunctions can also be explicitly parameterized to include only terms based on physical first principles [39, 47, 48]. Many scorefunctions are hybrids of the two, containing terms for both statistical and physical potentials [49–51].
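As a rough illustration of a hybrid scorefunction, the sketch below combines one physical term (a 12-6 Lennard-Jones energy) with one statistical term (a negative log-probability) in a weighted sum. The weighted linear combination, the function names, the term names, and every numeric value are assumptions made for the example; the cited scorefunctions define their own terms and weights.

```python
import math


def lennard_jones(r, r_min, well_depth):
    """12-6 Lennard-Jones energy with its minimum of -well_depth at r = r_min."""
    ratio = r_min / r
    return well_depth * (ratio ** 12 - 2.0 * ratio ** 6)


def statistical_term(probability, kt=1.0):
    """Knowledge-based energy from an observed probability: -kT * ln(p)."""
    return -kt * math.log(probability)


def hybrid_score(terms, weights):
    """Weighted sum of named energy terms (the linear form is an assumption)."""
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    terms = {
        # Physical term: two atoms 3.8 A apart, placeholder parameters.
        "lj": lennard_jones(3.8, r_min=4.0, well_depth=0.2),
        # Statistical term: a rotamer seen with probability 0.3 (placeholder).
        "rotamer": statistical_term(0.3),
    }
    weights = {"lj": 1.0, "rotamer": 0.5}  # hypothetical weights
    print({name: round(value, 3) for name, value in terms.items()})
    print("total:", round(hybrid_score(terms, weights), 3))
```

In this form, a more probable rotamer contributes a lower statistical energy, and the weights set how strongly each kind of evidence influences the total.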
2.2.3.4 Optimization
Single-state design is frequently followed by a minimization protocol that allows the backbone and side-chain torsion angles to adjust to the final design sequence. Many single-state design protocols, such as the FastDesign protocol in the Rosetta protein design software suite, now iterate between design and minimization steps. This level of backbone movement deviates from our definition of single-state design, but it is generally of lower magnitude than that in design simulations that explicitly sample backbone movement.