Structure Prediction
4.1.2 Predicting Structural Features from Sequence
The ab initio prediction of protein structure still poses one of the most challenging problems in Structural Biology. This difficulty arises from two important factors: the enormous size of conformational space (Dill, 1993) and the subtle interplay between the physicochemical properties involved in protein structure stability, especially when regarding the co-operative effects of large networks of residues. As a result, current ab initio methods of protein folding are often limited by the huge computational effort involved in simulating the folding process, even for small peptides. An additional problem is the difficulty in locating the energy minimum corresponding to the native conformation without converging on other, non-native local energy minima.
There are four major types of structural predictions that can be derived from the amino acid sequence information alone. These are the prediction of class, secondary
Chapter 4. Fold Recognition to Improve ab initio Protein Structure Prediction 150
structure, inter-residue contacts and tertiary structure. Each of these four areas of ab initio prediction will be discussed in the following sections.
4.1.2.1 Class Prediction
Many attempts have been made to predict general structural properties for proteins given the composition of the amino acid sequence. At the most basic level, composition is given as the fraction of each of the 20 amino acids in the sequence. This has also been extended to examine the composition of sequence fragments, for example using blocks of two or three residues, rather than the composition of individual residues.
This type of analysis has been used to predict the secondary structure content of a given protein sequence (i.e. percentage of helix, strand and coil) with a reasonable degree of accuracy, certainly of a comparable accuracy to the results from experimental methods such as circular dichroism (Rost & Sander, 1993; Eisenhaber et al., 1996a,b).
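The composition features described above can be illustrated with a minimal Python sketch. It computes the fraction of each of the 20 amino acids and the fraction of each overlapping block of k residues. The function names and the example sequence are hypothetical and are not taken from any of the cited methods.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def residue_composition(seq):
    """Return the fraction of each of the 20 amino acids in seq."""
    counts = Counter(seq)
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: counts[a] / total for a in AMINO_ACIDS}


def block_composition(seq, k=2):
    """Return the fraction of each overlapping block of k residues in seq."""
    blocks = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(blocks)
    return {block: n / len(blocks) for block, n in counts.items()}


example = "MKVLAAGIVK"  # hypothetical sequence, for illustration only
comp = residue_composition(example)
dipep = block_composition(example, k=2)
```

A predictor of secondary structure content would take vectors such as these as its input; the prediction step itself is not sketched here.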
4.1.2.2 Secondary Structure Prediction
When examining the secondary structure state (α-helix, β-strand or random coil) of residues in known structures, it can be noted that many amino acids display striking differences in propensity to adopt these different secondary structure states. For example, steric clashes between the pyrrolidine side chain of proline and the Cβ atom of the preceding residue generally restrict this amino acid from being found within an α-helix (although it can appear at the first turn of the helix). These intrinsic propensities for secondary structure were analysed by Chou & Fasman (1974a) using a very limited dataset of protein structures (only 15 protein structures were available in 1974), and the results were also used in a predictive method (Chou & Fasman, 1974b). A more successful method for using amino acid propensities to predict secondary structure was presented by Garnier, Osguthorpe and Robson in the GOR method (Garnier et al., 1978). Instead of using propensities for single amino acids, this approach applied techniques taken from information theory to analyse a window of eight residues either side of the amino acid being predicted.
As the sequence database has grown, many groups have attempted to use large alignments of related sequences to identify conserved patterns of amino acids typically seen in secondary structures. Examples of such patterns include repeats of hydrophobic and hydrophilic amino acids every three or four residues. Since there are 3.6 residues per turn of an α-helix, this recurring pattern often indicates a side
of an α-helix facing into the protein core and out to the solvent for hydrophobic and hydrophilic side-chains respectively. Also, positions in the sequence alignment with insertions and deletions usually coincide with random coil secondary structure, often on the surface of the protein. It is only when many sequences are compared that the random evolutionary changes, i.e. noise, can be differentiated from conserved sequence patterns derived from conserved features in the protein structure.
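The link between the 3.6-residue helical repeat and hydrophobic patterning can be made concrete with a small Python sketch. Each residue is placed at 100 degrees (360 / 3.6) from its predecessor around the helix axis, and the hydrophobic and hydrophilic contributions are summed as vectors, in the spirit of a hydrophobic moment calculation. This is an illustrative stand-in and not a method described in this chapter; the two-class hydrophobic set, the function name and the example sequences are all assumptions.

```python
import math

# Assumed, simplified two-class scheme: +1 for hydrophobic, -1 otherwise.
HYDROPHOBIC = set("AVLIMFWC")


def periodicity_strength(seq, degrees_per_residue=100.0):
    """Per-residue magnitude of the hydrophobicity signal at a given period.

    100 degrees per residue corresponds to 3.6 residues per turn.
    """
    delta = math.radians(degrees_per_residue)
    sx = sy = 0.0
    for i, aa in enumerate(seq):
        h = 1.0 if aa in HYDROPHOBIC else -1.0
        sx += h * math.cos(i * delta)
        sy += h * math.sin(i * delta)
    return math.hypot(sx, sy) / len(seq)


amphipathic = "LKKLLKKLLKLLKKLLKK"  # hypothetical helix with one hydrophobic face
uniform = "L" * 18                  # no hydrophobic/hydrophilic alternation

helix_signal = periodicity_strength(amphipathic)
flat_signal = periodicity_strength(uniform)
```

A sequence whose hydrophobic residues recur every three or four positions gives a strong signal at 100 degrees, whereas a uniformly hydrophobic stretch of five complete turns gives essentially none.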
The application of neural networks for the analysis of the sequence patterns in these multiple sequence families has so far proved the most successful method for automated prediction of secondary structure. The PHD method (Rost et al., 1994) trained a neural network on profiles built from multiple sequence alignments. This was able to correctly assign the secondary structure states of around 70% of residues for previously unseen sequences. More recently, this accuracy score was increased to around 77% with the PSIPRED method (Jones, 1999b) by improving the quality of the sequence profiles that are used to train the neural networks.
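The accuracy figures quoted above are per-residue, three-state scores: the fraction of residues whose predicted state (helix, strand or coil) matches the observed state. A minimal Python sketch of that calculation follows, assuming both assignments are given as strings over the letters H, E and C. The function name and the example strings are hypothetical.

```python
def three_state_accuracy(predicted, observed):
    """Fraction of residues whose predicted state matches the observed state."""
    if len(predicted) != len(observed):
        raise ValueError("assignments must have the same length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


observed = "CHHHHCEEEC"   # hypothetical observed states
predicted = "CHHHCCEECC"  # hypothetical prediction, wrong at two positions
accuracy = three_state_accuracy(predicted, observed)
```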
4.1.2.3 Inter-Residue Contact Prediction
Since secondary structure can be predicted with reasonable accuracy, the next level of complexity in ab initio structure prediction is to predict how these secondary structure elements may pack together. To this end, considerable research effort has been spent on the prediction of interactions between residues within a protein from sequence alone. Knowledge of sufficient numbers of these points of contact could then be used to constrain the secondary structure elements and generate a reasonable model of the tertiary structure.
Prediction of these inter-residue contacts can be made by exploiting the phenomenon of correlated mutations (Göbel et al., 1994; Taylor & Hatrick, 1994; Thomas et al., 1996; Ortiz et al., 1998; Fariselli et al., 2001; Pollastri & Baldi, 2002). These correlated mutations arise because the local steric and physicochemical environment changes following a given residue mutation. Mutations at spatially close positions that compensate for these changes are more likely to be accepted than the random changes observed in evolution. For this reason it is suggested that compensatory changes observed between two residues at the same point in the evolutionary ancestry may arise from the residue positions being close in the protein structure.
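One simple way to quantify covariation between two alignment columns is their mutual information. The Python sketch below uses it as an illustrative stand-in; it is not claimed to be the scoring scheme of any of the methods cited above. The function name and the toy alignment are hypothetical, and gaps and sequence weighting are ignored.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(alignment)
    col_i = [seq[i] for seq in alignment]
    col_j = [seq[j] for seq in alignment]
    count_i, count_j = Counter(col_i), Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        mi += p_ab * math.log2(p_ab / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Hypothetical toy alignment: columns 0 and 2 change together,
# column 1 is conserved and column 3 varies independently of column 0.
toy_alignment = ["DAKL", "DAKV", "EARL", "EARV"]

covarying = mutual_information(toy_alignment, 0, 2)
independent = mutual_information(toy_alignment, 0, 3)
```

Pairs of columns with high scores would be proposed as candidate contacts. With few sequences such scores are dominated by noise, which is consistent with the need for large alignments.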
Again, many groups have attempted to recognise the sequence patterns resulting from correlated mutations by training neural networks on multiple sequence
alignments from known sequence families (Ortiz et al., 1998; Fariselli et al., 2001; Pollastri & Baldi, 2002). Having been trained, these neural networks are then used in a predictive capacity with previously unseen protein sequences. However, unlike secondary structure prediction, the prediction of inter-residue contacts by such methods has proved difficult and unreliable due to the enormous number of related sequences required to recognise such sequence patterns and the large number of false positives. One reason for the lack of substantial success with this approach is that compensatory mutations could occur across networks of residues rather than simply between two residues. This would make the sequence patterns for correlated mutations more likely to be specific for a given structural family rather than follow predictable rules across the structural space.
4.1.2.4 Tertiary Structure Prediction
As mentioned previously, the main goal of protein structure prediction is to obtain the tertiary fold directly from the amino acid sequence. Generally, most methods for predicting protein tertiary structure can be broken down into two parts.
• A procedure for generating a series of possible conformations of the protein chain.
• A potential energy function which can evaluate these conformations to correctly identify the native structure.
A general difficulty with ab initio prediction is the enormous number of conformations that a protein chain can possibly adopt. Many groups have chosen to simplify this problem by restricting the residues in the chain to discrete points on a 3D lattice (Hinds & Levitt, 1994; Kolinski & Skolnick, 1994; Park & Levitt, 1995) or by restricting the protein chain to a small number of allowed torsion angles (Dandekar & Argos, 1994; Srinivasan & Rose, 1995).
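The effect of a lattice restriction can be illustrated by enumerating every self-avoiding chain on a simple cubic lattice, as in the Python sketch below. The lattice type, the function name and the exhaustive enumeration are assumptions made for illustration; the cited lattice methods use their own lattices and search strategies, which are not reproduced here.

```python
# The six unit moves available on a simple cubic lattice.
MOVES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def enumerate_conformations(n_residues):
    """Yield every self-avoiding chain of n_residues starting at the origin."""

    def extend(chain, occupied):
        if len(chain) == n_residues:
            yield tuple(chain)
            return
        x, y, z = chain[-1]
        for dx, dy, dz in MOVES:
            nxt = (x + dx, y + dy, z + dz)
            if nxt not in occupied:  # two residues may not share a lattice point
                chain.append(nxt)
                occupied.add(nxt)
                yield from extend(chain, occupied)
                chain.pop()
                occupied.remove(nxt)

    start = (0, 0, 0)
    yield from extend([start], {start})


n_four = sum(1 for _ in enumerate_conformations(4))
n_five = sum(1 for _ in enumerate_conformations(5))
```

Even on this coarse lattice the count grows by roughly a factor of five per added residue (rotated and reflected copies are included), so exhaustive enumeration is only feasible for very short chains.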
True ab initio methods then evaluate these predicted structures based solely on the fundamental physicochemical properties of amino acid residues, e.g. size and charge. However, a more pragmatic approach is to introduce knowledge-based techniques, i.e. methods that incorporate information from databases of known structures. The advantage of true ab initio methods is that, when successful, the results would be independent of any bias present in protein structure databases. Methods which rely on knowledge-based approaches alone, e.g. threading (see section 4.1.2.5), will have the inherent limitation that they can only provide accurate models for sequences adopting previously observed folds.
4.1.2.5 Fold Recognition
As mentioned in section 4.1.2.4, predicting an initial conformation for the protein chain can present a difficult problem due to the large number of conformational possibilities. To avoid this, a method was proposed that 'threaded' the query sequence into conformations adopted by experimentally solved protein structures, or templates (Jones et al., 1992). Each of these threaded structures was then assigned a global energy by comparing the distances between amino acids on this template structure with the distances seen in known structures. To allow for insertions and deletions, a double dynamic programming algorithm (see chapter 1) was employed to find the optimal alignment between the query sequence and template structure, i.e. the alignment that provided the lowest global energy using the same knowledge-based potential.
Therefore, threading methods avoid the computational expense of the first step of many ab initio procedures, i.e. generating putative conformations of the protein chain, by using known structures as templates. As a result, threading offers a fast method for recognising sequences that adopt known structural folds.
Profile-based sequence comparisons also provide a means of fold recognition using sequence information alone. A sequence profile provides a highly detailed description of the observed residue changes for each position of a large multiple sequence alignment (discussed in more detail in chapter 5). The variability of residue substitutions observed at each position in the sequence alignment reflects the flexibility of these positions in 3D space. As a result, these sequence profiles implicitly incorporate a great deal of structural information that is specific to the family of proteins they describe. The most powerful profile-based sequence methods, such as SAM (Karplus et al., 1998), can reach levels of recognition comparable to structure-based threading methods (Orengo et al., 1999).
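The idea of a profile as a per-position description of observed residues can be sketched in a few lines of Python. The example below builds raw column frequencies from an alignment and uses Shannon entropy as one possible measure of positional variability. It is a simplified illustration under stated assumptions: the function names and the toy alignment are hypothetical, and practical profile methods such as SAM add sequence weighting, pseudocounts and gap modelling that are omitted here.

```python
import math
from collections import Counter


def build_profile(alignment, gap="-"):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    profile = []
    for column in zip(*alignment):
        residues = [c for c in column if c != gap]
        counts = Counter(residues)
        total = len(residues)
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


def column_entropy(frequencies):
    """Shannon entropy (bits) of a column; 0 means fully conserved."""
    return sum(-p * math.log2(p) for p in frequencies.values() if p > 0)


toy_alignment = ["ACD", "AC-", "AGD", "ATE"]  # hypothetical aligned sequences
profile = build_profile(toy_alignment)
variability = [column_entropy(col) for col in profile]
```

In this toy case the first column is fully conserved and the second is the most variable, which is the kind of position-specific signal a profile comparison exploits.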
4.1.3 Aims
In a typical ab initio prediction method, a general packing arrangement of secondary structures is predicted, then this approximate protein structure undergoes a series of refinement stages. Often a large number of these models are generated using small variations in the parameters, then each is assessed for native-like structural features, such as solvent accessibility, good secondary structure packing and favourable inter-residue interactions. This step is used to determine whether each model is a likely candidate or should be discarded from the refinement process. This refinement process can prove extremely time consuming and computationally expensive since the protein chain can adopt so many conformational possibilities for each of these structures (see figure 4.1).