Background - Consensus templates for protein structure recognition

At the end of 2002, the Protein Data Bank (Berman et al., 2000) contained the 3D co-ordinates for more than 19,000 protein chains. One common approach in rationalising this amount of data is to group together proteins based on similarities in protein sequence, structure or function. Many classification schemes, such as CATH (Orengo et al., 1997; Pearl et al., 2001b), SCOP (Murzin et al., 1995; Lo Conte et al., 2000), FSSP (Holm et al., 1992) and 3Dee (Dengler et al., 2001) (see section 1.4.1 for descriptions of each), have been proposed in order to provide reliable clusters of related protein structures, using varying degrees of manual intervention. The number of structures available to these databases is expected to expand dramatically with the advent of several large-scale structural genomics initiatives (Pennisi, 1998). As a result, the challenge now facing structure databases is to provide methods to cope with this influx of structures and also to fully utilise this wealth of data.

Chapter 3. Structural Templates

Grouping related proteins together into evolutionary clusters with similar structures provides two main benefits. First, the enormous amount of redundancy present in the database can be reduced by selecting a single structure to represent a whole cluster of related proteins rather than considering each structure individually. Second, once these evolutionary clusters have been defined, the common structural features and highly conserved amino acid positions can be identified to help to provide insights into evolutionary relationships which may not be apparent from analysis of the separate structures.

The technique of using the consensus information from a series of related proteins to examine constraints on protein evolution is well established in the field of protein sequence analysis. When aligning a series of related sequences, it is possible to identify recurring patterns using either the identities or chemicophysical properties of amino acids at each position in the alignment. If a particular amino acid or amino acid property is seen to appear in a large number of non-redundant sequences then it is likely that this residue feature has been conserved due to functional or energetic constraints (Mirny & Shakhnovich, 1999). This constraint may be structural in nature, as it could represent an important interaction in the folding pathway, or it could be functional, as it could represent an active site residue which is vital for the biological function of the protein. Either way, the accumulation of this consensus information from a set of related proteins can be used as an identifying fingerprint that describes important evolutionary features.
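The idea of reading a consensus from each alignment column can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any of the cited works: the property classes in `PROPERTY_GROUPS`, the function names and the `threshold` parameter are all assumptions made for the example, and the four short sequences are toy data.

```python
from collections import Counter

# Hypothetical, deliberately coarse property classes (an assumption for this
# sketch; published classification schemes differ).
PROPERTY_GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "positive": set("KRH"),
    "negative": set("DE"),
}


def property_of(residue):
    """Return the property class of a residue, or None for gaps/unknowns."""
    for name, members in PROPERTY_GROUPS.items():
        if residue in members:
            return name
    return None


def column_consensus(alignment, threshold=0.8, gap="-"):
    """Label each alignment column with a conserved residue, a conserved
    property class, or None if neither reaches the threshold."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n = len(alignment)
    labels = []
    for i in range(length):
        column = [seq[i] for seq in alignment]
        residues = Counter(c for c in column if c != gap)
        props = Counter(p for p in map(property_of, column) if p is not None)
        label = None
        # Prefer an exact residue identity if it is frequent enough.
        if residues:
            residue, count = residues.most_common(1)[0]
            if count / n >= threshold:
                label = residue
        # Otherwise fall back to a shared property class.
        if label is None and props:
            prop, count = props.most_common(1)[0]
            if count / n >= threshold:
                label = prop
        labels.append(label)
    return labels


toy_alignment = ["HLKV", "HIRA", "HVK-", "HLRG"]  # toy data, not real proteins
print(column_consensus(toy_alignment, threshold=0.75))
```

In this toy case the first column is conserved by identity, the second and third only by property, and the last column is too variable to contribute to the fingerprint.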

The concept of gathering a consensus of information from related proteins can also be applied to the field of protein structure alignments. However, using structure rather than sequence information enables this concept to be extended to even more distant evolutionary relationships, since protein structure is more conserved than sequence during evolution (Sander & Schneider, 1991; Flores et al., 1993; Orengo et al., 1993). This is illustrated in figure 3.1 by comparing haemoglobin, α-chain from pig (1QPW, chain A) with haemoglobin, domain 1 from pig roundworm (1ASH) and haemoglobin, α-chain from horse (1IBE, chain A). The proteins involved in both of these comparisons have highly similar structures (SSAP scores greater than 80) and similar functional characteristics (oxygen-binding proteins), yet the sequence similarity between 1IBE and 1ASH is low (11% sequence identity).
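For reference, a percentage sequence identity such as the one quoted above is derived from a pairwise alignment. The sketch below assumes one common convention, counting identical residues over positions where neither sequence has a gap; other conventions exist, and the text does not state which one was used. The function name and the example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of identical residues over positions aligned without gaps.

    Assumes the two strings are rows of the same pairwise alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences contribute a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Toy strings, not taken from any real haemoglobin alignment.
print(percent_identity("ACD-EF", "ACEQEF"))
```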

For this reason, and also because tertiary structure contains so much more information than amino acid sequence, alignments from structural comparisons usually prove to be far more robust than those based on sequence for detecting distant evolutionary relationships. As a result, structural alignments have often been used to validate sequence alignments of distant proteins (Gotoh, 1996). Increasing the accuracy of the alignment in turn increases the ability to recognise conserved features, especially when aligning large numbers of distant structures. Thus, multiple structural alignments provide a powerful tool for identifying residues with functional importance and can therefore be used to assist the putative assignment of biological function.

Figure 3.1: Comparison of the structures in the globin-like superfamily illustrating that protein structure can be conserved even at low sequence similarity. Sequence identity is shown as the first number with the SSAP structural comparison score in parentheses. (A) haemoglobin, α-chain from pig (1QPW, chain A), (B) haemoglobin, domain 1 from pig roundworm (1ASH), (C) haemoglobin, α-chain from horse (1IBE, chain A). For ease of reference this figure is repeated from figure 1.7.

The evolutionary information found in a multiple structural alignment is often encoded into a structural template containing the conserved structural features at each position in the alignment. This is analogous to sequence 'profiles' generated from sequence alignment protocols such as PSI-BLAST (Altschul et al., 1997), a topic discussed in more detail in chapter 5. A representative family template can often prove more powerful than a series of single structures when identifying distant relatives, as structural features that are highly conserved during the process of evolution can be identified. Gathering all the structural information within a family can also identify the degree of conservation, i.e. the relative importance, of these conserved features. Thus, features such as secondary structure elements buried in the core and motifs integral to the function of the protein family would be given higher weighting in the template. Conversely, highly variable regions (e.g. peripheral coils) can be recognised as noise and removed from the signal. These consensus features act as a fingerprint for the whole family rather than the individual members and can provide a fast and sensitive probe for finding structural relationships and homologies.
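The weighting scheme described above can be illustrated with a small sketch. It is only a toy model under stated assumptions: each family member is reduced to an aligned string of secondary structure states (`H`, `E`, `C`), the weight of a position is simply the fraction of members sharing the most common state, and positions below a hypothetical `min_conservation` cutoff are discarded as noise. The function names, the cutoff and the data are all invented for the example and do not reproduce any published template method.

```python
from collections import Counter


def build_template(aligned_states, min_conservation=0.6, gap="-"):
    """Map alignment position -> (consensus state, weight), keeping only
    positions whose consensus state is shared by enough family members."""
    length = len(aligned_states[0])
    if any(len(s) != length for s in aligned_states):
        raise ValueError("all aligned state strings must have the same length")
    n = len(aligned_states)
    template = {}
    for pos in range(length):
        counts = Counter(s[pos] for s in aligned_states if s[pos] != gap)
        if not counts:
            continue  # column made up entirely of gaps
        state, count = counts.most_common(1)[0]
        weight = count / n
        if weight >= min_conservation:
            template[pos] = (state, weight)  # variable positions are dropped
    return template


def score_against_template(template, query_states):
    """Weighted fraction of template positions matched by a query that is
    already expressed in the template's alignment coordinates."""
    total = sum(weight for _, weight in template.values())
    if total == 0:
        return 0.0
    matched = sum(
        weight
        for pos, (state, weight) in template.items()
        if pos < len(query_states) and query_states[pos] == state
    )
    return matched / total


family = ["HHHCEE", "HHHCEE", "HHH-EC", "HHHHEE"]  # toy data
template = build_template(family)
print(template)
print(score_against_template(template, "HHHCCC"))
```

Here the variable fourth position is excluded from the template, so a query is judged only on the conserved helix and strand positions, with fully conserved positions counting more than partly conserved ones.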
