1.4 Target Identification
1.4.4 Modelling Three-Dimensional Protein Structure
As discussed, thanks to advances in crystallographic methods, a vast amount of protein structural data is now available. However, the number of known protein sequences remains significantly higher than the number of solved structures, a disparity known as the sequence-structure gap.154 Fortunately, for many proteins with no published 3D structure, protein structure prediction can enable structural analysis. Indeed, modelling techniques have now matured to the point of routine use in complementing experimental techniques.175
Computational methods for protein structure modelling are widely used in the pharmaceutical industry, and a great deal of time and effort has been devoted to expanding the scope and improving the accuracy of the models. Current methods can be categorised as one, or a combination, of three approaches.176 The first is homology or ‘comparative’ modelling, in which a model of the target protein is generated from a sequence alignment with a homologous protein whose experimental structure serves as a template. This technique is derived from the work of Šali and Blundell.177 It relies on the observation that
evolutionarily related sequences typically adopt similar 3D structures as they retain folds characterised by core structures that are robust against sequence modifications. It is therefore most effective when the structure of a closely related protein family member is available.154
Secondly, where no structures of proteins with significant sequence similarity are available, fold recognition or ‘threading’ methods can be used. In threading, the target protein sequence is systematically aligned to a library of proteins of known structure and the fit of each alignment is assessed energetically. The best match, which produces the most reliable model, is the one with the lowest quasi-energy score, a measure of the structural similarity of the target and template proteins. This approach rests on the hypothesis that, because there are far more known proteins than folds, a protein of unknown structure is likely to adopt a fold resembling one already known.176
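The selection of the lowest quasi-energy match can be illustrated with a deliberately simplified sketch. The scoring function here is an assumption made purely for illustration: each template is reduced to a string of buried (B) and exposed (E) positions, a residue scores -1 when its polarity suits its environment and +1 otherwise, and the alignment is gapless. Real threading programs use far richer potentials and gapped alignments; the fold names, the hydrophobic residue set and the example sequence are all hypothetical.

```python
# Toy gapless threading: NOT a real threading potential, only an illustration
# of "score every template placement, keep the lowest quasi-energy".
HYDROPHOBIC = set("AVLIMFWC")  # assumed hydrophobic residue set


def quasi_energy(sequence, environments):
    """Return -1 per residue whose polarity suits its environment, +1 otherwise."""
    energy = 0
    for residue, env in zip(sequence, environments):
        suited = (residue in HYDROPHOBIC) == (env == "B")
        energy += -1 if suited else 1
    return energy


def thread(sequence, library):
    """Return (template_name, offset, energy) for the lowest-energy placement."""
    best = None
    for name, environments in library.items():
        # Slide the target along every position of the template.
        for offset in range(len(environments) - len(sequence) + 1):
            window = environments[offset:offset + len(sequence)]
            energy = quasi_energy(sequence, window)
            if best is None or energy < best[2]:
                best = (name, offset, energy)
    return best


# Hypothetical fold library: B = buried position, E = exposed position.
library = {
    "fold_A": "BBEEBBEE",
    "fold_B": "EEEEBBBBEE",
}
print(thread("KDLVIAES", library))
```

In this example the hydrophobic stretch of the target (LVIA) lines up with the buried core of fold_B at offset 2, which therefore gives the lowest score.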
Thirdly, when neither comparative modelling nor threading can be utilised due to a lack of available templates, de novo methods can be employed. De novo methods are used to predict the protein structure directly from the primary sequence using the physical principles of protein folding. Information from determined structures may be incorporated but without assumption of any evolutionary relationships. Whilst versatile, this process is hugely computationally demanding and therefore successes tend to be limited to predicting folding of short peptides.176
Of these approaches, homology modelling is considered the most accurate and is therefore the most commonly employed in drug discovery research.178
Template-based protein structure prediction is founded on two assumptions: that similar protein sequences will adopt similar folds, and that individual regions of a protein will exhibit folds already observed in the PDB. The process of homology modelling involves several phases: template selection, target-template sequence alignment, model building, model refinement and model quality estimation (Figure 1.12).154 These stages are amenable to implementation as automated pipeline workflows, in which the user inputs the target protein sequence and the pipeline outputs a predicted structure.176 For example, the SWISS-MODEL server (http://swissmodel.expasy.org)179 enables automated comparative modelling of 3D structures.
In template selection, the experimental structure most appropriate for modelling is identified. Generally this is the 3D structure of the most closely related protein available; however, there are additional criteria to consider. A single experimental structure, and by extension any model built from it, represents only one of the many conformations accessible to a protein, which, as discussed, is intrinsically highly dynamic. Given that proteins can undergo substantial rearrangement to accommodate a ligand, it can in some cases be favourable to use an experimental structure depicting the protein in a ligand-bound state.154 In addition,
the quality of the template structure should be taken into account. Crystal structures with poor resolution or which are missing critical residues or loops may not provide adequate structural information to build a reliable model. In practice, programs such as NCBI BLAST (Basic Local Alignment Search Tool)180 are used to search the PDB for suitable templates using the protein
Figure 1.12. Homology Modelling Pipeline. A typical homology modelling workflow of template selection, target-template sequence alignment, model building, model refinement and model quality estimation, annotated with suggested software for each step.
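The template selection criteria discussed above (relatedness to the target, structure quality and ligand-bound state) can be combined into a simple ranking, sketched below. The `Template` record, the resolution and missing-residue cut-offs and the candidate values are all hypothetical and chosen only for illustration; in practice the appropriate thresholds depend on the target and on the intended use of the model.

```python
from dataclasses import dataclass


@dataclass
class Template:
    name: str              # hypothetical identifier, not a real PDB entry
    identity: float        # % sequence identity to the target
    resolution: float      # crystallographic resolution in angstroms (lower is better)
    ligand_bound: bool     # True if the structure was solved with a ligand
    missing_residues: int  # number of unresolved residues


def rank_templates(candidates, max_resolution=2.5, max_missing=10):
    """Drop poor-quality structures, then rank the remainder.

    The cut-offs are illustrative assumptions, not recommended values.
    """
    usable = [
        t for t in candidates
        if t.resolution <= max_resolution and t.missing_residues <= max_missing
    ]
    # Highest identity first; ligand-bound structures win ties; then best resolution.
    return sorted(usable, key=lambda t: (-t.identity, not t.ligand_bound, t.resolution))


candidates = [
    Template("A", identity=62.0, resolution=1.8, ligand_bound=False, missing_residues=0),
    Template("B", identity=62.0, resolution=2.1, ligand_bound=True, missing_residues=2),
    Template("C", identity=75.0, resolution=3.4, ligand_bound=True, missing_residues=0),
    Template("D", identity=40.0, resolution=1.5, ligand_bound=True, missing_residues=0),
]
print([t.name for t in rank_templates(candidates)])
```

Here template C is excluded despite having the highest sequence identity, because its resolution falls outside the assumed cut-off, and B is preferred over A because it is ligand-bound.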
Target-template sequence alignment is performed by programs such as EMBL-EBI Clustal Omega.181,182 The sequence identity of the target-template alignment is a good indication of the quality of the resulting model. Models where the target-template sequence identity is above 50% are regarded as highly accurate and can be employed in drug discovery research. The protein core is typically modelled with high accuracy due to good evolutionary conservation in this region, and any flaws will likely only be observed in the packing of side chains or in loop regions. Models based on less than 30% sequence identity are considered low accuracy models. At this level, ambiguous alignment becomes a severe problem and it is possible that an entirely incorrect fold is predicted.176 In an attempt to prevent this, more sensitive methods can be employed in homologue detection, such as DELTA-BLAST (Domain Enhanced Lookup Time Accelerated BLAST).183 However, threading methods may provide better results in these instances.
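As a minimal sketch of how these thresholds might be applied, the following computes percent identity from a pairwise alignment and maps it onto the accuracy bands described above. Two assumptions are made: identity is calculated over aligned (non-gap) columns only, which is one of several conventions in use, and the 30-50% range, which the text does not name, is labelled "intermediate". The example sequences are invented.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap ('-')."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def expected_accuracy(identity):
    """Map % identity onto the bands in the text; 'intermediate' is an assumed label."""
    if identity > 50.0:
        return "high"
    if identity < 30.0:
        return "low"
    return "intermediate"


# Invented alignment: 8 identical residues across 9 gap-free columns.
identity = percent_identity("MKT-AYIAKQR", "MKTLAYLA-QR")
print(round(identity, 1), expected_accuracy(identity))
```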
Model building can be accomplished by two approaches. The first, rigid fragment assembly, was originally implemented in 1987 by the Blundell lab.184 The core of the model is constructed first from the best structurally conserved regions of one or more templates. Any insertions or deletions in the target-template alignment, such as loops, are then incorporated as fragments, each individually modelled on a template of close resemblance.185 The SWISS-MODEL server employs a rigid fragment assembly modelling procedure.179 The second is satisfaction of spatial restraints. Here spatial restraints are derived from a range of sources, including the target-template alignment, other known protein structures and molecular mechanics force fields. The target protein is then folded into the conformation which best satisfies these restraints. The most widely adopted implementation of this approach, and the industry standard in homology modelling, is MODELLER (https://salilab.org/modeller/).177
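The principle of folding a structure into the conformation that best satisfies a set of restraints can be illustrated with a toy example. The sketch below is not MODELLER's algorithm or objective function: it assumes three point "atoms" in two dimensions, a handful of invented distance restraints, and plain gradient descent on the sum of squared restraint violations. The step size and iteration count are arbitrary choices that happen to suit this small system.

```python
import math


def violation(coords, restraints):
    """Sum of squared differences between actual and target distances."""
    return sum(
        (math.dist(coords[i], coords[j]) - target) ** 2
        for (i, j), target in restraints.items()
    )


def satisfy_restraints(coords, restraints, step=0.05, iterations=2000):
    """Gradient descent on the restraint violation; returns new coordinates."""
    coords = [list(p) for p in coords]  # work on a copy of the input
    for _ in range(iterations):
        grads = [[0.0, 0.0] for _ in coords]
        for (i, j), target in restraints.items():
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            dist = math.hypot(dx, dy)
            # Derivative of (dist - target)^2 with respect to atom i's position.
            factor = 2.0 * (dist - target) / dist
            grads[i][0] += factor * dx
            grads[i][1] += factor * dy
            grads[j][0] -= factor * dx
            grads[j][1] -= factor * dy
        for point, grad in zip(coords, grads):
            point[0] -= step * grad[0]
            point[1] -= step * grad[1]
    return coords


# Invented restraints (angstroms) between three atoms, and a stretched start.
restraints = {(0, 1): 3.8, (1, 2): 3.8, (0, 2): 5.5}
start = [(0.0, 0.0), (5.0, 0.0), (9.0, 1.0)]
final = satisfy_restraints(start, restraints)
print(violation(start, restraints), violation(final, restraints))
```

In a real modelling program the restraints number in the thousands, are expressed as probability distributions rather than single target values, and are optimised with more sophisticated methods than simple gradient descent.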
Following generation of primary protein models, refinement is required to optimise geometry and stereochemistry and remove any unfavourable contacts. This typically involves energy minimisation utilising a molecular mechanics force field, which may be followed by molecular dynamics to improve side chain contacts and rotamer states, and Monte Carlo sampling to improve accuracy of backbone conformations and core side chains.176 Model building software
often incorporates model refinement and evaluation capabilities. For further optimisation, drug discovery software such as Discovery Studio186 and MOE (Molecular Operating Environment)187 offers automated functions to prepare and minimise models.
Model quality estimation is important as models can be produced with significant inaccuracies.154 The geometrical accuracy and completeness of the model are evaluated, and it is determined whether the proposed structure is energetically reasonable. Models can be assessed as to whether they possess the correct folds, which can aid in detecting errors in template selection, fold recognition and target-template alignment.188 A scoring system, such as the DOPE (Discrete Optimised Protein Energy) score incorporated into MODELLER189, can be used to identify the best of a series of models. The quality of model required is highly dependent on its intended use, and in some instances, even if it is low resolution, ‘any level of physical characterisation of a protein, as opposed to its absence, is valuable’.190 Lower accuracy models can be sufficient in designing mutagenesis experiments or in preliminary target validation, whereas greater accuracy is required for structure-based virtual screening applications.191
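Selecting the best of a series of models by score reduces to picking the minimum, since a lower (more negative) DOPE score indicates a more favourable model. The short sketch below assumes the scores have already been calculated and collected into a dictionary; the model names and score values are invented for illustration.

```python
def select_best_model(dope_scores):
    """Return (model_name, score) for the lowest (most negative) DOPE score."""
    if not dope_scores:
        raise ValueError("no models to compare")
    name = min(dope_scores, key=dope_scores.get)
    return name, dope_scores[name]


# Invented scores for three candidate models of the same target.
scores = {"model_1": -41250.3, "model_2": -42980.7, "model_3": -40112.9}
print(select_best_model(scores))
```

Because the raw score scales with protein size, a comparison of this kind is only meaningful between models of the same target sequence.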
Homology modelling capabilities are constantly improving, but several challenges remain to be overcome. The accuracy of template-based modelling is limited by the availability of appropriate template structures. Even at high sequence identity, where overall protein folds are well conserved, substrate specificity and mechanisms of catalysis vary greatly, indicating local structural divergence. Approaches that specifically scrutinise and refine local structure could be used to complement homology-based modelling in these instances. There are also prevailing difficulties in refining models away from the template and toward the target structure, especially at low target-template sequence identity where significant rearrangement can be necessary.176