Generating Structural Templates - Consensus templates for protein structure recognition

3.2.3.1 Selecting Representative Structures

To distill the greatest amount of information from a large number of structures, a careful screening process is required in order to identify suitable representatives to include in the final template. Ensuring that a template identifies all the important structural features that have been conserved during the process of evolution necessitates including proteins that are highly divergent (i.e. have low sequence identity). That is, a structural feature shared by a series of very distantly related proteins is likely to be the result of important structural or functional constraints. It is more difficult to identify highly conserved structural features when comparing closely related structures as the similarity could be simply due to the lack of time for evolutionary divergence. However, if the structures within the template are too divergent then important consensus features can be hidden or missed altogether by the poor quality of the resulting alignment.

Chapter 3. Structural Templates 101

Also, it is often the case that some gene sequences and therefore protein structures are more thoroughly researched than others. This leads to some families containing large numbers of proteins that have highly similar sequences. It follows that these proteins will have near identical structures purely because there has not been time to adapt and evolve rather than from any specific structural constraint. If proteins involved in these highly populated areas of structural space were included with the same weighting alongside more distant structures, the template would be unfairly biased.

3.2.3.2 Selecting Structurally Coherent Sub-Groups

A multiple-linkage algorithm was written in order to group the structures within each of the superfamilies into structurally coherent clusters. The algorithm first reads a matrix of pairwise structural similarity scores generated by the SSAP structural comparison algorithm for all proteins being clustered. The algorithm then selects the highest resolution structure for each sequence family clustered at 35% identity, i.e. no two representative structures have sequence identities greater than 35%. This helps to remove redundancy in the structural templates as sequences that are more than 35% identical will nearly always have highly similar 3D structures. Starting with the highest SSAP score, these representative structures are then clustered on the basis that a structure can only join a cluster if it has a structural similarity above a given threshold to all the existing members of that cluster.
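The two steps described above (picking one representative per sequence family, then multiple-linkage clustering) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names, the input layout (a dictionary keyed by unordered pairs) and the toy scores are assumptions made for illustration, and processing pairs in descending score order is only one reading of "starting with the highest SSAP score".

```python
from itertools import combinations


def select_representatives(structures):
    """Keep the best-resolved structure per sequence family.

    `structures` is a list of (name, family_id, resolution_in_angstroms);
    a lower resolution value means a better-resolved structure.
    The family ids are assumed to come from a prior 35% identity clustering.
    """
    best = {}
    for name, family, resolution in structures:
        if family not in best or resolution < best[family][1]:
            best[family] = (name, resolution)
    return sorted(name for name, _ in best.values())


def multiple_linkage(names, score, threshold):
    """Cluster so that every pair inside a cluster scores >= threshold."""
    def s(a, b):
        return score[frozenset((a, b))]

    cluster_of = {n: {n} for n in names}  # each structure starts alone
    pairs = sorted(combinations(names, 2), key=lambda p: s(*p), reverse=True)
    for a, b in pairs:
        if s(a, b) < threshold:
            break  # all remaining pairs score lower still
        ca, cb = cluster_of[a], cluster_of[b]
        if ca is cb:
            continue
        # Merge only if every cross-cluster pair also meets the threshold.
        if all(s(x, y) >= threshold for x in ca for y in cb):
            merged = ca | cb
            for n in merged:
                cluster_of[n] = merged
    unique = {frozenset(c) for c in cluster_of.values()}
    return sorted(sorted(c) for c in unique)


# Toy pairwise scores (hypothetical values, not real SSAP output).
names = ["A", "B", "C", "D"]
ssap = {
    frozenset(("A", "B")): 92, frozenset(("C", "D")): 88,
    frozenset(("A", "C")): 85, frozenset(("B", "C")): 70,
    frozenset(("A", "D")): 60, frozenset(("B", "D")): 60,
}
clusters = multiple_linkage(names, ssap, threshold=80)
print(clusters)  # [['A', 'B'], ['C', 'D']]
```

In the toy data A and C score 85, but the two clusters are not merged because A-D and B-C fall below the threshold of 80, which is the internal-consistency guarantee the text describes.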

The multiple-linkage approach was chosen over single-linkage as the objective was to define structural clusters that were internally consistent. Single-linkage cannot guarantee this consistency as it joins clusters on the requirement that only one structure from each needs to be similar. This allows clusters to be chained together, so that a cluster can contain very remote structures which, in turn, can result in poor quality structural alignments. Figure 3.6 illustrates the differences between single-linkage and multiple-linkage clustering and highlights a single-linkage chain that results in two dissimilar structures being clustered together.
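The chaining effect can be reproduced with a small, self-contained Python sketch of single-linkage clustering. The union-find implementation, the helper names and the three toy scores are assumptions for illustration only; they are not taken from the thesis.

```python
from itertools import combinations


def single_linkage(names, score, threshold):
    """Join clusters whenever any one cross-cluster pair scores >= threshold."""
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the cluster root, shortening the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if score[frozenset((a, b))] >= threshold:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(sorted(c) for c in clusters.values())


def weakest_internal_score(cluster, score):
    """Lowest pairwise score inside a cluster (None for a singleton)."""
    pairs = [score[frozenset(p)] for p in combinations(cluster, 2)]
    return min(pairs) if pairs else None


# Hypothetical scores: A-B and B-C are similar, A-C is not.
chain_scores = {
    frozenset(("A", "B")): 90,
    frozenset(("B", "C")): 85,
    frozenset(("A", "C")): 60,
}
chained = single_linkage(["A", "B", "C"], chain_scores, threshold=80)
print(chained)                                             # [['A', 'B', 'C']]
print(weakest_internal_score(chained[0], chain_scores))    # 60
```

A and C end up in the same cluster through B even though their own score (60) is well below the threshold, which is the situation multiple-linkage is designed to exclude.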

Figure 3.6: Single and multiple linkage clustering (panels: MULTIPLE LINKAGE, SINGLE LINKAGE). Single-linkage clustering only requires one comparison to meet the clustering criteria (e.g. SSAP score d1 > 80) for a structure to be included in a cluster. This allows structures to be chained together and can result in clusters containing very diverse structures (e.g. SSAP score d2 < 80). Multiple-linkage will only allow a structure to join a cluster if the clustering criteria are met with all members of the cluster (e.g. SSAP score for d1, d2, ..., dn > 80).

A more robust method might be to introduce a weighting scheme that would allow all structures to be included in the structural template but would downweight the contribution from proteins that have similar sequences. This is a common feature of sequence alignment methods when dealing with large numbers of sequences, e.g. CLUSTALW, Thompson et al. (1994). However, in practice, the majority of structural families are small and this simple approach of removing redundancy has been seen to work well (Orengo, 1999). As the structural genomics initiatives expand the population and diversity of these homologous superfamilies a weighting scheme may well prove more appropriate in order to incorporate the maximum amount of evolutionary information.
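As a rough illustration of the down-weighting idea, the sketch below gives each structure a weight of one divided by the number of structures (itself included) whose sequence identity to it exceeds a cutoff. This neighbour-count scheme is a deliberately simple stand-in and is not the tree-based weighting used by CLUSTALW; the function name, the 35% default and the identity values are assumptions for illustration.

```python
def redundancy_weights(names, identity, cutoff=35.0):
    """Weight each structure by 1 / (number of close sequence relatives).

    `identity` maps frozenset({a, b}) to percentage sequence identity.
    A structure always counts itself, so an isolated structure gets weight 1.0.
    """
    weights = {}
    for a in names:
        neighbours = sum(
            1 for b in names
            if a == b or identity[frozenset((a, b))] > cutoff
        )
        weights[a] = 1.0 / neighbours
    return weights


# Hypothetical identities: P and Q are near-duplicates, R is distant from both.
identity = {
    frozenset(("P", "Q")): 90.0,
    frozenset(("P", "R")): 20.0,
    frozenset(("Q", "R")): 22.0,
}
weights = redundancy_weights(["P", "Q", "R"], identity)
print(weights)  # {'P': 0.5, 'Q': 0.5, 'R': 1.0}
```

Under this scheme the two near-identical structures together contribute as much as the single distant one, so a heavily studied family no longer dominates the template.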

3.2.3.3 Building the Structural Templates

Once the structurally coherent sub-clusters had been selected within an homologous superfamily, the CORA algorithm was used to generate a multiple structural alignment. As discussed in section 3.1.2.2, CORA analyses protein structural families, then identifies the consensus structural features and variability at each alignment position. The conserved structural characteristics of the cluster are then stored as a consensus structural template which can be used to align further structures.
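To make the idea of a per-position consensus and variability concrete, the following sketch summarises one numeric per-residue property across the rows of a multiple alignment. It does not reproduce CORA: the function name, the choice of mean and population standard deviation as the summary, and the sample values are all assumptions for illustration.

```python
from statistics import mean, pstdev


def column_profile(aligned_values):
    """Summarise each alignment column as (mean, spread, n_structures).

    `aligned_values` holds one row per structure, all of equal length;
    None marks a gap at that alignment position.
    """
    profile = []
    for column in zip(*aligned_values):
        present = [v for v in column if v is not None]
        if not present:
            profile.append((None, None, 0))  # column made entirely of gaps
        else:
            profile.append((mean(present), pstdev(present), len(present)))
    return profile


# Hypothetical per-residue property values for three aligned structures.
rows = [
    [10.0, 50.0, None],
    [12.0, 10.0, 30.0],
    [14.0, None, 30.0],
]
profile = column_profile(rows)
for position, (avg, spread, n) in enumerate(profile):
    print(position, avg, spread, n)
```

Columns with a small spread across many structures would correspond to conserved positions, while a large spread or few contributing structures would flag variable positions.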
