1.3 Proteins: Modelling and Applications
1.3.5 Parameter Inference
Given a protein model and energy function, it is a non-trivial task to optimize parameters, such as hydrogen bond strength or atomic radii, to produce accurate and reliable results. A wide variety of techniques have been developed. General-purpose AA models optimize their parameters to match quantum mechanical calculations (108) or experimentally derived properties of small molecules (109).
In contrast, when optimizing parameters for CG models, especially those which include some form of bias, some terms of the energy function may not have a well-defined physical force associated with them. In these cases force field parameters should be optimized so that, at a physiologically relevant temperature, the native state is found at the global free energy minimum (62).
Traditionally, CG models have used statistical knowledge-based potentials (110), with parameters tuned to best reproduce features, such as dihedral angle distributions or atomic distance distributions, derived from a training set. This procedure relies on the Boltzmann hypothesis: the assumption that, within native structures, the features are statistically independent and are distributed according to the Boltzmann distribution. There is some empirical evidence for this hypothesis (111), yet the statistical independence of features is likely to be a poor assumption.
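As a minimal sketch of how a knowledge-based potential can follow from the Boltzmann hypothesis, the snippet below applies Boltzmann inversion, E_i = -kT ln(P_obs,i / P_ref,i), to a histogram of a single feature. The function name, the four-bin histogram, the uniform reference state and the pseudocount are illustrative assumptions and are not taken from the cited work.

```python
import math

def boltzmann_inversion(observed_counts, reference_probs, kT=1.0, pseudocount=1.0):
    """Convert a histogram of one feature into a per-bin energy.

    E_i = -kT * ln(P_obs_i / P_ref_i). The pseudocount keeps empty bins finite.
    """
    n_bins = len(observed_counts)
    total = sum(observed_counts) + pseudocount * n_bins
    energies = []
    for count, p_ref in zip(observed_counts, reference_probs):
        p_obs = (count + pseudocount) / total
        energies.append(-kT * math.log(p_obs / p_ref))
    return energies

# Hypothetical histogram of a dihedral angle over four bins of a training set.
counts = [50, 30, 15, 5]
uniform_reference = [0.25] * 4  # assumed reference state
potential = boltzmann_inversion(counts, uniform_reference)
```

Frequently observed bins receive low energies and rare bins high energies. Treating each feature separately in this way is exactly where the statistical independence assumption enters.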
Over twenty years ago Maiorov et al. and Goldstein et al. developed native structure discriminant methods for parameter inference (112, 113) and these remain popular to this day (114). These methods optimize the parameters so that the native state has the lowest energy when compared to a decoy set of protein-like conformations. A disadvantage of these methods, however, is that they do not take temperature and hence protein thermodynamics into account: only the strength of the intermolecular forces relative to the decoys is used in the parameter estimation.
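To illustrate the discriminative idea, the sketch below scores a parameter set by how far the native energy lies below a decoy set, using an energy gap and a Z-score. The linear energy function, the two features (hydrogen bond count and clash count) and all numerical values are hypothetical. The cited methods use their own objective functions, which are not reproduced here.

```python
import math

def energy(theta, features):
    """Assumed linear energy: a weighted sum of conformation features."""
    return sum(t * f for t, f in zip(theta, features))

def z_score(theta, native_features, decoy_features):
    """Native energy relative to the decoy energies, in decoy standard deviations.

    More negative means better discrimination. Temperature never appears.
    Assumes the decoy energies are not all identical (non-zero variance).
    """
    e_native = energy(theta, native_features)
    e_decoys = [energy(theta, f) for f in decoy_features]
    mean = sum(e_decoys) / len(e_decoys)
    variance = sum((e - mean) ** 2 for e in e_decoys) / len(e_decoys)
    return (e_native - mean) / math.sqrt(variance)

# Hypothetical features: (number of hydrogen bonds, number of steric clashes).
native = [5, 1]
decoys = [[2, 3], [1, 4], [3, 2]]
theta = [-1.0, 1.0]  # reward hydrogen bonds, penalise clashes

native_is_lowest = energy(theta, native) < min(energy(theta, d) for d in decoys)
score = z_score(theta, native, decoys)
```

A parameter search would adjust `theta` to make `score` as negative as possible. Note that the criterion depends only on energies relative to the decoys, which is the limitation described above.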
An alternative class of optimization algorithms uses the principle of maximum likelihood (ML). Given a data set Ω of observed experimental (or computationally generated) samples and a set of force field parameters Θ, an appropriate likelihood function L(Θ|Ω), typically the Boltzmann distribution at an appropriate temperature, is introduced. ML methods tune the force field parameters to maximize this likelihood function, improving them iteratively by following the gradient of the log-likelihood.
Models whose parameters maximize the likelihood function produce the distribution closest (in a suitably defined sense) to that of the original data set. ML approaches (also called relative entropy approaches) have been used successfully to infer parameters for CG water (107) and polyalanine (115) models, using samples from AA models as the data set. Parameter estimation methods for general protein models that use the PDB as the data set have also been developed (105, 106, 116).
Contrastive Divergence
For the simple case of a single parameter Θ = {θ}, a single conformation Ω = {Ω0} and inverse thermodynamic temperature β = 1, the gradient of the log-likelihood required for ML methods is given by

$$\frac{\partial \ln L}{\partial \theta} = \left\langle \frac{\partial E(\Omega, \theta)}{\partial \theta} \right\rangle - \frac{\partial E(\Omega_0, \theta)}{\partial \theta},$$

where E(·, θ) is the (potential) energy function using parameter value θ and the angular brackets correspond to the thermodynamic expectation of the system using energy function E(·, θ). A full derivation of this and the general case can be found in Chapter 3.
Although ∂E(Ω0, θ)/∂θ can be calculated directly, the thermodynamic average can only be accurately estimated by running an MC or MD sampling algorithm until equilibrium is reached and then taking the expectation over a large number of equilibrated samples. This is an expensive procedure which needs to be carried out for a different θ at every iteration of the ML procedure. For example, Winther and Krogh estimate the thermodynamic average by running extensive REMD simulations for each ML iteration (106).
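The cost of this procedure can be seen in a minimal sketch on a toy system. Everything below is an assumption made for illustration: a one-dimensional "conformation" x, the energy E(x, θ) = θx²/2, a plain Metropolis sampler at β = 1, and a hand-tuned learning rate. It is not the protocol of the cited work, which uses REMD on protein fragments.

```python
import math
import random

def energy(x, theta):
    """Toy energy E(x, theta) = theta * x^2 / 2 (illustrative assumption)."""
    return 0.5 * theta * x * x

def dE_dtheta(x, theta):
    """Partial derivative of the toy energy with respect to theta."""
    return 0.5 * x * x

def metropolis_step(x, theta, rng, step=1.0):
    """One Metropolis MC move at beta = 1 with a symmetric uniform proposal."""
    x_new = x + rng.uniform(-step, step)
    delta = energy(x_new, theta) - energy(x, theta)
    if delta <= 0.0 or rng.random() < math.exp(-delta):
        return x_new
    return x

def thermodynamic_average(theta, rng, burn_in=500, n_samples=5000):
    """Estimate <dE/dtheta> from an equilibrated MC run (the expensive part)."""
    x = 0.0
    for _ in range(burn_in):
        x = metropolis_step(x, theta, rng)
    total = 0.0
    for _ in range(n_samples):
        x = metropolis_step(x, theta, rng)
        total += dE_dtheta(x, theta)
    return total / n_samples

def ml_fit(data, theta=1.0, rate=8.0, iterations=40, seed=0):
    """Gradient ascent on the log-likelihood, averaged over the data set."""
    rng = random.Random(seed)
    for _ in range(iterations):
        data_term = sum(dE_dtheta(x0, theta) for x0 in data) / len(data)
        # A fresh equilibrium simulation is needed for every new theta.
        gradient = thermodynamic_average(theta, rng) - data_term
        theta = max(theta + rate * gradient, 1e-3)  # keep theta positive
    return theta

# Synthetic "observed" data drawn from the toy model with theta = 4.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 0.5) for _ in range(200)]
theta_ml = ml_fit(data)
# For this toy energy the ML estimate is known in closed form.
theta_exact = 1.0 / (sum(x * x for x in data) / len(data))
```

Even for this trivial system each of the 40 iterations spends 5,500 MC steps on the thermodynamic average. For a protein model the equivalent equilibration is far more costly.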
A few methods which aim to reduce the computational expense have been developed. For example, Shell et al. reweight samples from one iteration for use at later iterations so as to reduce the number of long equilibration runs required (115). Here we focus on an alternative method, known as contrastive divergence (CD). CD is a statistical machine learning technique, initially developed to efficiently learn the parameters of Boltzmann machines (117).
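The reweighting idea can be illustrated with the standard importance-sampling identity: samples equilibrated at an old parameter value are reused at a new value by giving each sample the weight exp(-[E(Ω, θ_new) - E(Ω, θ_old)]) at β = 1. The sketch below is an assumption-laden toy (one-dimensional x, energy θx²/2, Gaussian draws standing in for an equilibrated run). The details of the cited scheme are not given in this section and are not reproduced here.

```python
import math
import random

def energy(x, theta):
    """Toy energy E(x, theta) = theta * x^2 / 2 (illustrative assumption)."""
    return 0.5 * theta * x * x

def dE_dtheta(x, theta):
    """Partial derivative of the toy energy with respect to theta."""
    return 0.5 * x * x

def boltzmann_weights(samples, theta_old, theta_new):
    """Weights that turn samples drawn at theta_old into samples for theta_new."""
    return [math.exp(-(energy(x, theta_new) - energy(x, theta_old))) for x in samples]

def reweighted_average(samples, theta_old, theta_new):
    """Estimate <dE/dtheta> at theta_new without a new equilibration run."""
    weights = boltzmann_weights(samples, theta_old, theta_new)
    weighted = sum(w * dE_dtheta(x, theta_new) for w, x in zip(weights, samples))
    return weighted / sum(weights)

def effective_sample_size(samples, theta_old, theta_new):
    """Rough diagnostic: shrinks as theta_new moves away from theta_old."""
    weights = boltzmann_weights(samples, theta_old, theta_new)
    return sum(weights) ** 2 / sum(w * w for w in weights)

rng = random.Random(0)
theta_old, theta_new = 1.0, 1.5
# Stand-in for an equilibrated run at theta_old: for this toy energy the
# Boltzmann distribution is Gaussian with variance 1 / theta_old.
samples = [rng.gauss(0.0, 1.0 / math.sqrt(theta_old)) for _ in range(20000)]
estimate = reweighted_average(samples, theta_old, theta_new)
exact = 1.0 / (2.0 * theta_new)  # analytic <x^2 / 2> for the toy energy
ess = effective_sample_size(samples, theta_old, theta_new)
```

The effective sample size indicates when the parameters have drifted too far for the old samples to remain useful, at which point a new equilibration run is needed.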
For each ML iteration of the CD procedure, rather than running until equilibration, Ω0 is evolved for only K MC steps to conformation ΩK. K is a tunable parameter and can theoretically be as low as 1.
Rather than using the true log-likelihood gradient, we replace it by

$$\frac{\partial E(\Omega_K, \theta)}{\partial \theta} - \frac{\partial E(\Omega_0, \theta)}{\partial \theta}$$

when updating θ.
The idea behind this approximation is that even after only K steps, the data distribution has drifted towards the equilibrium distribution; ΩK is closer than Ω0 to the equilibrium distribution (for the current value of θ), where "closer" is appropriately defined. The drift in the observed energy gradient can then be used to guide the update procedure. A full justification and further discussion can be found in Chapter 3.
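The CD update can be sketched on a toy system as follows. As before, the one-dimensional "conformation" x, the energy E(x, θ) = θx²/2, the Metropolis move set, K = 5 and the learning rate are all assumptions chosen for illustration, not the settings of the cited protein studies.

```python
import math
import random

def energy(x, theta):
    """Toy energy E(x, theta) = theta * x^2 / 2 (illustrative assumption)."""
    return 0.5 * theta * x * x

def dE_dtheta(x, theta):
    """Partial derivative of the toy energy with respect to theta."""
    return 0.5 * x * x

def metropolis_step(x, theta, rng, step=1.0):
    """One Metropolis MC move at beta = 1 with a symmetric uniform proposal."""
    x_new = x + rng.uniform(-step, step)
    delta = energy(x_new, theta) - energy(x, theta)
    if delta <= 0.0 or rng.random() < math.exp(-delta):
        return x_new
    return x

def cd_fit(data, k=5, theta=1.0, rate=2.0, iterations=200, seed=0):
    """Contrastive divergence: start each chain at a data point, run K steps."""
    rng = random.Random(seed)
    for _ in range(iterations):
        gradient = 0.0
        for x0 in data:
            x_k = x0
            for _ in range(k):  # only K MC steps, no equilibration
                x_k = metropolis_step(x_k, theta, rng)
            # CD replacement for the log-likelihood gradient.
            gradient += dE_dtheta(x_k, theta) - dE_dtheta(x0, theta)
        theta = max(theta + rate * gradient / len(data), 1e-3)  # keep theta positive
    return theta

# Synthetic "observed" data drawn from the toy model with theta = 4.
data_rng = random.Random(1)
data = [data_rng.gauss(0.0, 0.5) for _ in range(200)]
theta_cd = cd_fit(data)
# Closed-form ML estimate for this toy energy, for comparison only.
theta_exact = 1.0 / (sum(x * x for x in data) / len(data))
```

When θ is too small the short chains spread out from the data and the update increases θ; when θ is too large they contract and the update decreases it. The update vanishes on average once the K-step drift away from the data disappears.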
CD is computationally much cheaper than traditional ML methods and therefore a larger data set can be used for parameter inference. For example, Winther and Krogh were restricted to 24 different 11–14 residue-long protein fragments (106), whereas using CD, Podtelezhnikov et al. were able to use a database of 247 protein PDB files as a data set (105). A large data set is important for transferability; Winther and Krogh found their force field performed poorly when used with proteins and peptides not in their data set.