Multi-state Design Models - Mapping structural and environmental constraints on the mutational

2.2 Background

2.2.5 Multi-state Design Models

We define multi-state models as having a framework to represent a macromolecular system with a group of functionally distinct conformations or states. We note that some definitions of multi- state models require the inclusion of negative states (off-target interactions or unfolded states). Under these definitions, a model that only includes positive states is a multi-constraint model. We chose to use the term multi-state here because we examine models that have the mathematical capacity to include negative states. The inclusion of negative states is not explicitly addressed

2.2.5.1 Representation In a multi-state model, a protein is represented as a group of states.

While ensemble models represent a protein with conformers from the same well on the energy landscape, states in a multi-state model are conformers from different wells. States are frequently conformations that can be characterized individually by structural methodologies such as x-ray crystallography and NMR spectroscopy. The structural difference between states may constitute a fold change, a change in small molecule binding, a protein-protein complex interaction, or loop motions that sample multiple separate low energy conformations.

2.2.5.2 Sampling Rotamer Configuration and Scoring Individual States Rotamer optimization and

scoring forms an internal loop in the multi-state design algorithm (Figure 2.1C, page 15). Optimization of rotamer configuration is performed using the same algorithms used in single-state design, except that the sequence is not optimized simultaneous to the rotamer configuration. As with ensemble design applied to enumerated sequence spaces, sequences on each individual state are evaluated by their score with an optimized rotamer configuration. Because each change to the sequence is followed by rotamer optimization for that sequence on all states, full atom minimization is not practical in many cases.

2.2.5.3 Sampling Sequences The external loop in multi-state design is an algorithm for sampling

changes to sequences. Many algorithms have been employed, including random Monte Carlo sampling [66] and genetic algorithms[67–69] that both sample randomly and merge successful random moves. Sampling of sequence space is typically slow. For runtimes of 1 week or less, the probability that repeated simulations will converge on similar optimal sequences decreases as a function of the number of design positions and the number of states, and the maximum for systems with only 2 states is ~25-30 designable positions.

Figure 2.3: Fitness for each state is modeled by a sigmoidal function. Fitness (blue line) is a function of the biophysical energy of the state. The sigmoid has three regimes (labels, top) based on the overall fitness of the state and how responsive the fitness is to changes in the biophysical energy as a result of mutation.

2.2.5.4 Objective Function for Design Each sequence that is sampled must be evaluated, and

design moves in sequence space must be accepted or rejected. Monte Carlo sampling utilizes the Metropolis criterion and genetic algorithms propagate a population by carrying a high scoring fraction of sequences from the current round of sampling to the next. In both cases, the scores must be integrated from all states into an objective function or fitness function that is optimized over the sampled sequence space. Many simulations use the straightforward mean energy of all states, but this method can yield poorly optimized outputs if contributions from one state dominate the other states. Furthermore, this approach does not take into account the observation that the states in a functional protein (e.g. an enzyme bound to the transition state of the chemical reaction step or and enzyme bound to the reaction products) will need to be differentially stabilized.

Warszawski et al. developed an objective function for multi-state computational design that used a sigmoidal fitness model. Scores for each state (Estate) were converted to a fitness score

representing the predicted probability that the trial sequence stabilizes the state (Equation 2.2, page 23). Because the sigmoid functional form for each state have a responsive regime surrounded by two regimes where fitness is not responsive to changes in energy, compromise is possible

Figure 2.4: Two user-set parameters modulate the sigmoid fitness function. A) The offset parameter (o) sets the center of the fitness curve on the energy axis. B) The steepness parameter (s) sets the rate of change along the fitness curve.

(Figure 2.3, page 22). A user-defined offset parameter (o) determines the midpoint, and a steepness parameter (s) modulates the magnitude of the fitness differential with respect to energy (Figure 2.4, page 23). Effectively, the offset is an energy normalization factor and the steepness term is an inverse temperature factor. Ultimately, the total fitness for the multistate system is calculated as the joint probability of fitness over all states (Equation 2.4, page 23).

f itnessstate= f = _{1 + e}_s_(E1_state_≠o) (2.2)

f itnesstotal = F = mj fj = mj

1 + es(Ej,state≠o) f or each j state (2.3)

2.2.5.5 Alternate Approaches The Constrained Optimization of Multistate Energies by Tree

Search (COMETS) algorithm was developed to identify provably optimal sequences for a multistate simulation using a tree search algorithm [70]. As with other provable optimization methods like dead-end elimination, COMETS slows relative to stochastic methods as the number of design positions increases. Alternative approaches have focused on increasing the performance of multi-state design with algorithms that trade exactness for efficiency.

recapitulate the energies from single state design for a training set of sequences[37, 71]. The resulting scorefunction predictions can then be used to rapidly sample sequence space without requiring time intensive rotamer optimization. This algorithm was used to design heterodimeric coiled coils that bind with high specificity as well as affinity [72] where the average error in scorefunction energy prediction is 1-2 units. These units are approximated as kcal/mol based on correlation with experimental measurements of Gmutation, so the error in the scorefuction

predictions from CLEVER is still within a regime that is usable with computational design. For other protein topologies, the CLEVER cluster expansion results in errors of 5-30 energy units.

Rosetta Mean Field Prediction (MFPred) uses a mean-field approach to estimate sequence preferences[61]. The mean field approach does not perform rotamer optimization, but it performs iterative calculations of probability-weighted averaging of all two-body energies, where rotamer probabilities are calculated from the energies in the previous round. Prediction performace with MFPred was dependent on pre-minimization of the input structures, which is expected to give strong bias to the starting sequence. Nevertheless, predictions of sequence preferences from MFPred had higher sequence entropy than predictions from single-state design (Equation 2.4, page 24).

entropy= ≠ tyr_alapiLog20pi f or each pi probability of each i amino acid (2.4)

Finally, the Rosetta Restrained Convergence (RECON) algorithm allows the sequence on each state to optimize independently and then drives states to converge, unlike multi-state approaches that enforce convergence throughout the optimization[73, 74]. During the RECON simulation, positions that converge to a single amino acid on all states are fixed, and a convergence bonus

designed with a greedy algorithm as follows. An unconverged position is selected randomly and all sequence substitutions are sampled at that site. The best amino acid is chosen, and the position selection continues until all positions are assigned an amino acid. When RECON was applied to a small benchmark set of proteins, its performance improved over genetic algorithms for design cases of more than 30 positions, but RECON generally output predictions with very little variation over repetitions.

In document Mapping structural and environmental constraints on the mutational landscape of functional proteins using high-throughput screening and computational modeling (Page 40-45)