

2.3.4 Testing the algorithm

The ability of the algorithm to recognise the native fold of a protein from partial contact data was examined by randomly removing contacts from the three test structures and identifying the point at which the correct fold is no longer recognised. The test structures were first searched against the database of structures using CONALIGN, then the resulting scores were ranked by decreasing contact overlap score. A consensus view of the folds seen to occur in the top five positions was then taken from these results, i.e. a running total considering the frequency of the first three CATH classification digits (C.A.T) for the top five contact overlap scores. If the correct topology appears more frequently than any other topology in these top five positions, the CONALIGN algorithm is deemed successful as the fold has been correctly recognised. This scheme is also described in chapter 4.
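The consensus rule described above can be sketched in a few lines of Python. This is only an illustration, not the CONALIGN implementation: the function names, the `(cath_id, overlap_score)` input format and the example identifiers are assumptions, and ties are treated as failures because the text requires the correct topology to appear more frequently than any other.

```python
from collections import Counter


def cat_code(cath_id):
    """Return the first three CATH digits (C.A.T) of an id such as '3.40.50.720'."""
    return ".".join(cath_id.split(".")[:3])


def fold_recognised(hits, correct_cath_id, top_n=5):
    """Apply the top-five consensus rule to a list of database hits.

    hits: iterable of (cath_id, overlap_score) pairs (hypothetical format).
    Returns True only if the correct topology is strictly the most
    frequent C.A.T code among the top_n hits ranked by overlap score.
    """
    # Rank by decreasing contact overlap score and keep the top positions.
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)[:top_n]
    counts = Counter(cat_code(cath_id) for cath_id, _ in ranked)
    target = cat_code(correct_cath_id)
    # Highest frequency reached by any topology other than the correct one.
    best_other = max((n for code, n in counts.items() if code != target), default=0)
    return counts[target] > best_other


# Illustrative, made-up hit list.
example_hits = [
    ("3.40.50.720", 0.9),
    ("3.40.50.300", 0.8),
    ("1.10.10.10", 0.7),
    ("3.40.50.1", 0.6),
    ("2.60.40.10", 0.5),
    ("1.10.10.20", 0.4),
]
print(fold_recognised(example_hits, "3.40.50.999"))
```

In this made-up example the topology 3.40.50 occupies three of the top five positions, so a query belonging to it would be counted as correctly recognised.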

The reduced sets of contact data (models) were generated for each of the three test structures by randomly removing a given percentage (between 10% and 90%) of the contacts observed in the native structure. To account for possible irregularities when randomly removing contacts (i.e. some contacts may prove more important than others when attempting to recognise the protein fold), 20 different sets of randomly selected contact data were generated for each test structure at each percentage threshold (0, 10, 20...90%).
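The generation of reduced contact sets can be sketched as follows. This is a minimal illustration under stated assumptions: contacts are represented as `(i, j)` residue-index pairs, the function name and the `seed` parameter are hypothetical, and the percentage is expressed as the share of native contacts kept, so a threshold stated as a percentage removed would be passed as `100 - percent_removed`.

```python
import random


def reduced_contact_sets(native_contacts, percent_kept, n_models=20, seed=0):
    """Draw n_models random subsets holding percent_kept % of the native contacts.

    native_contacts: set of (i, j) residue-index pairs (assumed representation).
    Each subset is sampled independently and without replacement.
    """
    rng = random.Random(seed)            # fixed seed so the sketch is reproducible
    contacts = sorted(native_contacts)   # stable ordering before sampling
    n_keep = round(len(contacts) * percent_kept / 100)
    return [set(rng.sample(contacts, n_keep)) for _ in range(n_models)]


# Illustrative native contact map with 50 contacts.
native = {(i, i + 4) for i in range(50)}
models = reduced_contact_sets(native, percent_kept=20)
print(len(models), len(models[0]))
```

Drawing 20 independent subsets per threshold mirrors the motivation given in the text: a single random subset might by chance keep or lose the contacts that matter most for recognising the fold.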

The results of this experiment were then plotted for each test structure by counting the number of models able to recognise the correct fold within each percentage bin (out of the maximum number of 20 models). Figure 2.15 provides a histogram summarising the results for all three test structures. This can be illustrated by highlighting a particular set of results. For the fumarase protein (PDB code 1fup, chain A, domain 2), out of the 20 models generated by randomly selecting a set of 20% of the native contacts, only one model failed to recognise the correct fold. Indeed, even when using as few as 10% of the contacts observed in the native structure, the fold recognition protocol identified the correct fold in every model, for all three test structures.

Chapter 2. Inter-Residue Contacts for Structural Comparison 85

[Histogram. Series: 1fup, chain A (domain 2); 1cbj, chain A; 2bop, chain A. x-axis: percentage of native contacts.]

Figure 2.15: Results from the sets of reduced contact data. For each of the three test structures, 20 models of randomly selected contacts were generated from a series of thresholds for the percentage of native contacts (e.g. 10, 20, ... 90% of the contacts observed in the native structure). The y-axis counts the number of models that could be assigned the correct fold (from the maximum number of 20 models in each percentage bin).


2.4 Discussion

This chapter has described methods for displaying inter-residue contacts and has introduced simple scoring schemes for comparing contacts between 3D structures and identifying those contacts that are highly conserved across a family of related structures.

A proposal for future work is to invert this problem by assessing the number of correlated mutations occurring in spatially close residues. This would involve comparing mutation matrices for positions in a structural alignment that are known to be in contact. The CATH database (version 2.0) holds structural alignments for 362 homologous families and a procedure for identifying residue contacts within these structural alignments has already been presented in this chapter. This work would provide useful information on the occurrence and characteristics of correlated mutations between both individual and conserved contacts within a structural family. For example, what is the total number of correlated mutations observed in a given homologous family? Are contacts conserved across a structural family more likely to exhibit correlated mutation behaviour? Are correlated mutations more likely to occur in residues near an active site? Answers to these questions certainly have implications for contact prediction and may also help to increase our understanding of the mechanisms involved in the evolution of protein structure.

An algorithm has been presented which allows a protein with limited structural data (e.g. inter-residue distance constraints from NMR data) to be scanned against a library of 3D structures in order to identify the correct fold group. The three CONALIGN parameters have been optimised in a comprehensive manner through the introduction of cross-validated scoring schemes.

The initial testing protocol, discussed in section 2.3.4, provided an interesting set of results as it suggested that the native fold could be found from a small number (as low as 10%) of native contacts for the three structures tested. However, these initial results cannot be viewed as an exhaustive examination of this structure comparison algorithm.

If this algorithm were to be considered as a possible application of fold recognition from contacts predicted by sequence, it would be necessary to answer further questions. For example, how well does the method perform when incorrect contact data is included in the reduced set of native contacts? Could the information used to compare residues be expanded to include predicted accessibility or residue similarity scores from sequence substitution matrices? Is there any increase in performance when using highly conserved contact data from multiple structure alignments?


At the time of developing this algorithm, the accuracy of ab initio prediction of contacts from correlated mutations was deemed to be very poor. The contact analysis tools developed in this chapter were used to assess submissions to the contact prediction category of CASP3 (Orengo et al., 1999) (see attached paper). The accuracy of these sets of predicted contact data was scored by simply shifting the alignment between the predicted contacts and native structure one residue at a time and calculating the overlap score at each step. This effectively provided a distribution of random contact overlap scores and allowed a significance score (Z-score) to be calculated for the original alignment. However, the most significant prediction only gave a Z-score of 3.4 (Casadio group, Fariselli & Casadio (1999)) and all other predictions gave Z-scores in the range 0.2-2.8.
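The shift-based significance test can be sketched as below. Several details are assumptions made for illustration only: the overlap score is simplified to the number of contacts shared by the two sets (the overlap score used in this chapter is not reproduced here), shifted residue indices wrap around the end of the chain, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def overlap(predicted, native):
    """Simplified overlap score: number of contacts present in both sets."""
    return len(predicted & native)


def shift_contacts(contacts, shift, length):
    """Shift every contact by `shift` residues, wrapping at the chain end (assumption)."""
    return {
        tuple(sorted(((i + shift) % length, (j + shift) % length)))
        for i, j in contacts
    }


def shift_z_score(predicted, native, length):
    """Z-score of the unshifted overlap against the overlaps of all shifted alignments.

    Contacts are (i, j) pairs with 0-based residue indices and i < j.
    Returns None when the background has no spread or cannot be built.
    """
    observed = overlap(predicted, native)
    background = [
        overlap(shift_contacts(predicted, s, length), native)
        for s in range(1, length)
    ]
    if not background:
        return None
    sigma = pstdev(background)
    if sigma == 0:
        return None
    return (observed - mean(background)) / sigma


# Illustrative example: a perfect prediction on a made-up 30-residue contact map.
native_map = {(0, 5), (3, 12), (7, 20), (10, 25), (14, 29)}
z = shift_z_score(native_map, native_map, length=30)
print(z)
```

Because each shifted alignment preserves the sequence separations of the predicted contacts, the background retains the overall shape of the prediction while breaking its register with the native structure.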

Furthermore, the method being developed by our collaborator Prof. Ikura was also unable to rapidly obtain NOE contact data. This reduced the value of using CONALIGN to speed up structure determination by NMR since the most time-consuming step was still the assignment of peaks in the spectra (which precedes the assignment of NOE data). Therefore, rather than further optimising and testing CONALIGN as suggested above, it was decided to pursue other related research themes based on comparing inter-residue contacts for structure prediction and classification. However, this algorithm has demonstrated sufficient promise to warrant future research, especially since the accuracy of predicting contacts from sequence continues to improve (Lesk et al., 2001).

Chapter 3

Generation and Application of
