[Figure 4.1 axis label: RMSD bins from native structure (Å): 10-9, 9-8, 8-7, 7-6, 6-5]
Figure 4.1: Overview of the structure refinement procedure of protein models predicted by ab initio methods. The iterative process of the structure refinement can take a large amount of processing time and resources. The aim of this method is to identify the native fold at an early stage of refinement in order to both accelerate the procedure and identify structural features from related structures that could improve the model.
The flowchart in figure 4.2 gives a general overview of the steps involved in the prediction and refinement of protein tertiary structure from amino acid sequence. From a given protein sequence, established ab initio methods would be employed to generate a number of low resolution structural models, i.e. predictions at the start of the structure refinement process. The method proposed in this chapter then attempts to recognise the most likely fold of this target structure by comparing
Chapter 4. Fold Recognition to Improve ab initio Protein Structure Prediction 155
these approximate models to a database of known structures. A consensus of the results from the database searches of all these models is then taken in order to assign the most likely fold. After the native fold has been identified by this method, it is then proposed that further structural refinement could be driven by constraining the models with highly conserved structural features identified from the related superfamilies within the native fold. This last step will not be covered in this chapter; however, the identification of conserved structural features such as inter-residue contacts is discussed in detail in chapter 2.
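The consensus step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each model's database search has already been reduced to the fold of its top-scoring hit and that the consensus is a simple majority vote; the function name `consensus_fold` and the example fold labels are hypothetical, and the actual scoring scheme used in this chapter may differ.

```python
from collections import Counter


def consensus_fold(top_hit_folds):
    """Return (fold, support) for the most frequent fold among the top hits.

    top_hit_folds: one fold label per query model, e.g. the CATH topology
    of the best-scoring database match for that model (an assumption).
    support: fraction of models whose top hit falls in the winning fold.
    Ties are resolved in favour of the fold encountered first.
    """
    if not top_hit_folds:
        raise ValueError("no database hits supplied")
    counts = Counter(top_hit_folds)
    fold, votes = counts.most_common(1)[0]
    return fold, votes / len(top_hit_folds)


# Hypothetical top hits for five models of the same target sequence.
hits = ["3.20.20", "3.20.20", "3.40.50", "3.20.20", "1.10.10"]
fold, support = consensus_fold(hits)
print(fold, support)
```

Reporting the support fraction alongside the fold gives a rough indication of how strongly the set of models agrees, which may be useful when deciding whether an assignment is reliable enough to drive further refinement.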
[Flowchart: Amino acid sequence -> ab initio methods -> Tertiary structure predictions -> Structure comparison (search structure databases) -> Consensus fold recognition (identify the native fold) -> Analyse structural templates (identify conserved features of native fold) -> Optimise predicted structure using native constraints -> Generate refined structure]
Figure 4.2: Flowchart describing a procedure to generate high quality ab initio predicted structures.
In summary, a method is presented which aims to recognise the native fold of a set of ab initio predicted model structures during an early stage of structure refinement, thus reducing the time and increasing the accuracy of further refinement. This protocol presents an alternative method of fold recognition that complements the more established threading methods currently in use. Also, the accelerated refinement time, together with advancing ab initio methods, could enable an application
of ab initio approaches to far larger datasets of protein sequences, such as genomic data.
The method presented has been developed to assist fold recognition for proteins with structural relatives using ab initio approaches. Threading methods have already been shown to perform well for some such targets. However, for very distant homologues or more diverse analogues, the potentials used in threading may not model the sequences sufficiently to distinguish the correct fold. Ab initio methods, which use more flexible approaches rather than the static templates used in threading, may perform better. Therefore, the fold recognition performance of this method was compared to the performance of traditional threading methods.
The work discussed in this chapter was conducted in collaboration with Xavier de la Cruz at University College London and was published in Proteins: Structure, Function and Genetics (de la Cruz et al., 2002).
4.2 Methods
4.2.1 Definition of Terms
The methods for assessing the consensus fold recognition protocol presented in this chapter can be separated into two parts. The first describes the procedure for generating the different datasets of protein models to which the fold recognition procedures are applied. This is discussed in more detail in section 4.2.2. The second part describes the two different structure comparison procedures used for the fold recognition (see section 4.2.3).
[Figure 4.3 panel labels: 1b54, 1aj0]
Figure 4.3: Example of a fold, or topology, relationship in CATH. Both PDB structures, 1b54 and 1aj0, share highly similar folding arrangements and are classified in the same TIM-barrel topology (3.20.20 in CATH). However, there is insufficient evolutionary evidence to guarantee a common ancestor, so they are classified into two different homologous superfamilies in CATH.
Throughout this chapter the topology, or fold, of a protein is defined by the first three numbers in the CATH classification database. Structures classified in the same topology in CATH share the same general spatial arrangement and connectivity of secondary structures. This can be illustrated by comparing the PDB structures 1b54 and 1aj0 (see figure 4.3), which are both contained in the triose phosphate isomerase (TIM) barrel fold in CATH, classification code 3.20.20. The first structure, 1b54, is a hypothetical protein found in Baker's yeast, Saccharomyces cerevisiae, which binds pyridoxal-5'-phosphate (vitamin B6 complex). It is classified in the alanine racemase superfamily in CATH with the classification code 3.20.20.10. The second structure, 1aj0, is a dihydropteroate (DHP) synthetase enzyme from E. coli, and is a member of the DHP synthetase superfamily in CATH (classification code 3.20.20.20). Both these proteins are enzymes and both have highly similar folding
arrangements; however, since there is currently insufficient evolutionary evidence to guarantee that they diverged from a common ancestor, they are classified into the same TIM-barrel topology, but different superfamilies, in CATH.
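The definition of topology used here, the first three numbers of the CATH code, can be made concrete with a short sketch. The helper names `cath_topology` and `same_fold` are hypothetical and introduced only for illustration; the codes are those quoted above for 1b54 and 1aj0.

```python
def cath_topology(code):
    """Return the topology (first three levels) of a dotted CATH code.

    "3.20.20.10" (Class.Architecture.Topology.Homologous superfamily)
    gives "3.20.20".
    """
    parts = code.strip().split(".")
    if len(parts) < 3:
        raise ValueError(f"CATH code {code!r} has fewer than three levels")
    return ".".join(parts[:3])


def same_fold(code_a, code_b):
    """True if two CATH codes share a topology, whatever their superfamily."""
    return cath_topology(code_a) == cath_topology(code_b)


# 1b54 (alanine racemase superfamily) and 1aj0 (DHP synthetase superfamily)
print(same_fold("3.20.20.10", "3.20.20.20"))
```

Under this definition the two structures count as the same fold even though their superfamily codes differ.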
Since fold recognition aims to identify structural, rather than specifically evolutionary, relationships, matching a relative in the correct topology is considered correct recognition. For clarity, it should be mentioned that the protein models that are used to search the structure database are referred to as query structures throughout this chapter.
Also, when assessing the fold recognition protocol, care was taken to ensure that any structural relationships in the structure database that could have been identified simply by sequence similarity were removed, unless explicitly stated otherwise. This was accomplished by removing any structures from the database search that had >35% sequence identity to the query structure.
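The sequence identity filter can be sketched as follows. This is a minimal illustration that assumes the percentage identity between the query and each database structure has already been computed by some alignment program; the function name `filter_by_identity` and the domain identifiers are hypothetical.

```python
def filter_by_identity(candidates, max_identity=35.0):
    """Keep database entries that are not trivially related to the query.

    candidates: iterable of (identifier, percent_identity_to_query) pairs.
    Entries with identity strictly greater than max_identity are removed,
    mirroring the >35% cut-off described in the text.
    """
    return [name for name, identity in candidates if identity <= max_identity]


# Hypothetical database entries with precomputed identities to the query.
database = [("domA", 92.0), ("domB", 35.0), ("domC", 18.5), ("domD", 35.1)]
print(filter_by_identity(database))
```

An entry at exactly 35% identity is retained in this sketch, since the text specifies removal only above that value.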
4.2.2 Generating the Datasets
4.2.2.1 Summary of Datasets
The ability to recognise the native fold from non-native structures was tested by examining protein models derived from three sources, covering the most frequently used techniques in different ab initio methods.
• Low resolution versions of native structures provided by Xavier de la Cruz (de la Cruz et al., 1997).
• ab initio predictions kindly provided by the David Baker group (Simons et al., 1997).
• ab initio predictions by various methods from the CASP3 protein structure prediction competition.
4.2.2.2 Low Resolution Versions of Native Structures
The aim of this dataset was to provide a set of protein structures that would approximate models generated by ab initio methods. Due to the enormous size of the conformational space that a polypeptide chain can adopt, many ab initio methods attempt to limit this search by restricting the chain to certain states, such as restricting torsion angles to a given set of values or restricting the position of residues to the nearest points in a 3D lattice. Work by de la Cruz et al. (1997) suggested a protocol to build a range of low resolution protein structures from the
native experimental structures. Using this protocol, this dataset of approximate protein structures was generated and provided by Xavier de la Cruz.
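To illustrate the kind of restriction mentioned above, the sketch below moves each residue position to the nearest point on a simple cubic lattice and measures the resulting deviation. It is not the protocol of de la Cruz et al. (1997), whose details are not reproduced here; the lattice type, the 3.8 Å default spacing, the function names and the example coordinates are all assumptions made for illustration. The sketch does not enforce chain connectivity or prevent two residues from occupying the same lattice point, and the RMSD is computed without superposition.

```python
import math


def snap_to_lattice(coords, spacing=3.8):
    """Move each (x, y, z) position to the nearest cubic lattice point."""
    return [tuple(round(c / spacing) * spacing for c in xyz) for xyz in coords]


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equal-length coordinate lists.

    No superposition is performed; the coordinates are compared as given.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = sum(
        (a - b) ** 2
        for xyz_a, xyz_b in zip(coords_a, coords_b)
        for a, b in zip(xyz_a, xyz_b)
    )
    return math.sqrt(total / len(coords_a))


# Hypothetical C-alpha positions (in Å) for a four-residue fragment.
native = [(0.0, 0.0, 0.0), (3.6, 1.2, 0.3), (6.9, 2.9, 1.1), (9.8, 5.4, 0.7)]
coarse = snap_to_lattice(native)
print(rmsd(native, coarse))
```

Increasing the lattice spacing gives progressively coarser approximations of the native structure, which is the general idea behind building a range of low resolution models.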