FLORAScore = n +
Chapter 4 Improving ab initio structure predictions by assigning
4.1.2 Assigning structural predictions to fold groups
Structure comparison methods have proved very successful in detecting distant structural relationships between experimentally derived structures (Orengo and Taylor, 1996; Holm and Sander, 1998; Kolodny et al, 2005). Indeed, Chapter 2 described the ability of the CATHEDRAL algorithm to assign a putative fold to novel structures in the PDB by scanning against previously characterised representatives from the CATH database.
A previous collaboration between the CATH group and de la Cruz and colleagues (de la Cruz et al., 2002) explored the use of structure comparison for assigning a known fold to ab initio models generated by the Rosetta method. They found that, for half of the data set, the correct CATH fold could be recognised as the top hit using models within 6 Å of the native structure. Although this result showed that structural comparison methods can still be applied to theoretical models, the approach was tested on only 4 proteins. Furthermore, it relied on a relatively slow structural comparison algorithm (SSAP) and could not automatically identify good models in advance.
Simons et al. (Simons et al., 2001) took a similar approach by comparing their Rosetta models against the PDB using DALI (see Section 1). Although the closest relative in the PDB was only found for around 50% of models, for matches with a Z-score greater than 4, they showed that structural comparison methods were applicable to models that deviated from the native structure by as much as 7 Å. They suggest that as ab initio methods improve, it may even be possible to recognise functional families for novel genes through an intermediate structure prediction stage.
4.1.2.1 Comparing protein structure models using MAMMOTH
In choosing structure comparison algorithms for matching ab initio models to fold groups in CATH or SCOP, an important consideration is how well the algorithm can cope with model structures in which the secondary structures are not well defined. MAMMOTH (Ortiz et al., 2002), a recent structure comparison method, was specifically designed for comparing theoretical models with experimental structures. The algorithm focuses purely on Cα co-ordinates, avoiding any dependence on primary sequence, secondary structure or contact maps. This can be especially important when using ab initio models, where the latter two features may not be fully formed with respect to the native structure.
MAMMOTH calculates its alignments in four stages. Firstly, each protein structure is broken into heptapeptide fragments. Each heptapeptide is then described by a set of unit vectors between successive Cα atoms and translated to the origin. Using a standard minimisation technique (McLachlan, 1979), a rotation matrix and unit vector root mean square (URMS) distance are calculated between all fragment pairs, and the URMS is converted to a similarity score based on the expected URMS between two random sets of n unit vectors (URMS^R). The scores between all possible pairs of heptapeptides populate a matrix, from which a global alignment is calculated using dynamic programming (Needleman and Wunsch, 1970). Finally, the overall structural similarity between the two structures is calculated using a variant of the MaxSub algorithm to determine the percentage of corresponding residues (PSI) that lie less than 4 Å apart in 3D. The PSI is then converted into a P-value using a distribution of random structural alignments from a data set of unrelated SCOP domains. MAMMOTH is able to detect 50% of fold matches at the 99% confidence level, compared to 60% for DALI. Given its superior speed, the authors suggest this makes it a relatively accurate tool for structure comparison of large databases. It certainly lends itself to suggesting putative fold matches, which may then be aligned with a more accurate, computationally intensive method.
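The stages described above can be illustrated with a minimal Python sketch. It is not the MAMMOTH implementation: the function names, the linear gap penalty and the normalisation by the shorter structure are assumptions made for illustration only. The optimal rotation and URMS calculation are omitted, so the fragment similarity matrix is taken as a precomputed input, and the two sets of Cα co-ordinates are assumed to be already superposed before the 4 Å criterion is applied.

```python
from math import dist


def fragment_unit_vectors(ca_coords, start, length=7):
    """Unit vectors between successive C-alpha atoms of one fragment.

    A heptapeptide (length=7) yields six unit vectors.
    """
    frag = ca_coords[start:start + length]
    vectors = []
    for a, b in zip(frag, frag[1:]):
        d = dist(a, b)
        vectors.append(tuple((q - p) / d for p, q in zip(a, b)))
    return vectors


def global_alignment(score, gap=-1.0):
    """Needleman-Wunsch global alignment over score[i][j].

    `score` is assumed to be a precomputed fragment similarity matrix;
    the linear gap penalty is an illustrative assumption.
    Returns the alignment score and the list of aligned index pairs.
    """
    n, m = len(score), len(score[0])
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j - 1] + score[i - 1][j - 1],
                           dp[i - 1][j] + gap,
                           dp[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if dp[i][j] == dp[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return dp[n][m], pairs


def percent_within_cutoff(coords_a, coords_b, pairs, cutoff=4.0):
    """Percentage of aligned residue pairs closer than `cutoff` angstroms.

    Assumes the co-ordinates are already superposed; normalising by the
    shorter structure is an assumption of this sketch.
    """
    close = sum(1 for i, j in pairs if dist(coords_a[i], coords_b[j]) < cutoff)
    return 100.0 * close / min(len(coords_a), len(coords_b))
```

In a full implementation the similarity matrix would be filled from the URMS-derived scores of every heptapeptide pair, and the final percentage would be converted to a P-value against a background distribution of random alignments.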
4.2 Aims
The purpose of the method presented here was to build on the work of the de la Cruz group in Barcelona, Spain (de la Cruz et al., 2002) by developing a fast and novel protocol (MODMATCH) for determining the correct fold for a given target structure by comparing ab initio models from the Rosetta method to the CATH fold library. This work was undertaken in collaboration with Xavier de la Cruz.
The first objective was to reduce a large set of initial predictions (999 models per target structure) to a smaller sample, ideally of higher quality. This was to both increase the speed of the structure comparison and reduce the noise generated by erroneous hits between CATH library domains and bad models. The second aim was to optimise the accuracy of fold assignments by combining structural similarity scores from the MAMMOTH (Ortiz et al., 2002) and SSAP (Taylor and Orengo, 1989) algorithms using a Support Vector Machine (SVM).
For this work, the MAMMOTH algorithm was utilised to identify putative folds from a CATH library which could then be more accurately aligned with SSAP. This is analogous to the approach presented in Chapter 2 in the implementation of CATHEDRAL, where GRATH was used to pre-select similar CATH folds within multi-domain protein chains to be aligned by SSAP. However, CATHEDRAL was thought to be unsuitable for this work as it was not designed to handle low resolution models where secondary structures (which form the basis of the GRATH algorithm) may not be fully formed. The use of SSAP in this work as an accurate structure comparison method was thought to be an improvement on DALI (used by Simons et al. (1999)) because DALI relies on conserved contacts to align residues, which again may not necessarily be present in theoretical protein structure models. The overall goal was to improve the assignment of folds to ab initio models by developing a fast, accurate protocol whereby the ab initio models could be assigned a fold in the CATH database, in a similar fashion to the way experimental structures are classified.
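The two-stage strategy outlined in this section, a fast pre-filter followed by a slower, more accurate comparison, can be sketched as follows. This is an illustration under stated assumptions rather than the MODMATCH implementation: `fast_score` and `slow_score` are hypothetical callables standing in for MAMMOTH and SSAP respectively, higher scores are assumed to mean greater similarity, and the `top_n` cut-off is an arbitrary placeholder.

```python
def two_stage_search(model, library, fast_score, slow_score, top_n=10):
    """Rank library domains against a model with a fast pre-filter.

    fast_score and slow_score are hypothetical callables taking
    (model, domain) and returning a similarity (higher is better).
    Only the top_n domains by the fast score are re-scored slowly.
    """
    # Stage 1: cheap scan of the whole library.
    prefiltered = sorted(library,
                         key=lambda domain: fast_score(model, domain),
                         reverse=True)[:top_n]
    # Stage 2: expensive comparison restricted to the shortlist.
    rescored = [(domain, slow_score(model, domain)) for domain in prefiltered]
    rescored.sort(key=lambda item: item[1], reverse=True)
    return rescored
```

The design choice here is that the expensive comparison is run only `top_n` times per model, so the overall cost is dominated by the fast scan of the library.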
4.3 Methods
This section describes the data sets used to benchmark the MODMATCH protocol and the details of the superposition of structures and models used in this method.