Conclusion - Gewehr, Jan Erik (2007): New Methods for the Prediction and Classification

However, a combination of pattern regions may be a better indicator than each pattern itself, especially if PROSITE and PRINTS are involved. As a simple means to combine individual pattern locations into a final domain prediction for a protein chain, we used a fold-based consensus. The underlying algorithm is as follows: (1) merge all patterns of the same fold as long as there are less than 20 residues between them, and (2) abstain for all regions where differing folds are annotated.

We tested our fold consensus in a blind-test like setup using our predicted patterns from the CAFASP 4 experiment, which allows us to compare against our specialized SSEP- Domain protein domain prediction method, which will be described in the next chapter. As mentioned in section 4.4.4, of the 46 CAFASP 4 targets which we could find in the ASTRAL 1.71, for 23 we can make AutoSCOP predictions. We analyzed the corresponding pattern occurrences on these targets and compared them to the CAFASP 4 domain assignments. With respect to the predicted domain number, when compared to CAFASP on the 23 targets, AutoSCOP is in 18 cases correct and in 5 cases wrong. Using SCOP as a reference, AutoSCOP is in 21 cases correct and two predictions are wrong. In all cases, AutoSCOP and SSEP-Domain agreed in the predicted domain number. We further compared the average overlap between the predicted domain regions and the CAFASP 4 definitions using the score described in section 5.3.2 (subsection ”Overlap Score”). We find that, on the 23 targets, the AutoSCOP patterns achieve an average overlap score of about 84% as compared to 91% for our SSEP-Domain approach, i.e. the predicted locations are good, but not completely correct, as could be expected from the evaluations given above.

4.6 Conclusion

AutoSCOP is a simple yet effective sequence-pattern-based approach to SCOP prediction on different SCOP levels. For the domains in the test set with known folds/superfamilies we achieve sensitivity values of more than 93%. Here, especially the specificity values of about 98% are striking. On the Vorolign test set, AutoSCOP even achieves specificity rates of up to 99.9% (fold level). This means that, if a prediction is made, it is indeed very reliable. A test on CAFASP4 targets also shows that the predictions made by AutoSCOP can provide useful information in blind-test protein structure prediction scenarios.

The combination with profile-profile alignment underlines the potential of the Auto- SCOP approach by improving the sensitivity of our predictions over the best individual method by about four percentage points. On family and superfamily level, this combination even outperforms the structure alignment methods in our comparison. AutoSCOP can be used as a filter for template selection and fold or superfamily recognition in addition to alignment-based recognition methods.

The inclusion of unique pattern combinations does not significantly improve the performance over unique patterns alone. One possible reason for this is the high redundancy between the InterPro member databases. In Table 4.3 we observe that, with the excep-

tion of Pfam and SUPERFAMILY, leaving out one database does not change the performance very much, especially on superfamily and fold level. Even after leaving out the SUPERFAMILY database, we still observe sensitivity values well above 80% for these levels. Nonetheless, using all found patterns is beneficial for our approach: we could show that using the higher-level InterPro entries, for instance, clearly decreased the performance of our method.

The low coverage and the low sensitivity of AutoSCOP on family level can be explained by the focus of the pattern databases, many of which concentrate on less fine-grained similarities (an obvious example is the SUPERFAMILY database). This implies that many patterns that are unique on coarse levels can be found in more than one SCOP family, and therefore the prediction of the family is not possible. It seems that most patterns work best on superfamily level, which also explains the similar performance of AutoSCOP on superfamily and fold level, as all unique patterns on superfamily level have to be unique on fold level by definition. Thus, especially for the family level, inclusion of specialized data sources such as ASTRAL Family HMMs is useful.

One problem for AutoSCOP as well as for many other SCOP predictors is the handling and recognition of domains belonging to new folds, superfamilies or families. Many predictions for such cases could be traced back to changes in the SCOP versions. However, sometimes we observe only low sequence identities, and in such cases it remains difficult to discriminate between known and new classifications. Discarding such targets can in- crease specificity but comes with the loss of many good predictions in the twilight zone of sequence identities. This is an interesting point for future development.

The proposed method can easily be extended by including sequence patterns from other data sources, which we have shown here by including predictions from ASTRAL Family HMMs. AutoSCOP is further applicable to any protein domain hierarchy, with SCOP being one very popular example. For the time between releases of such hierarchies, reliable predictions of potential protein classifications are important also for proteins with already available structures. The combination with Vorolign (Table 4.6) shows that there is potential to detect and avoid errors in assignments made on the basis of structure alignments. AutoSCOP may also be a useful additional component for systems like SCOPmap [Cheek et al., 2004] that combine both sequence- and structure-based predictors into a larger system.

If an InterProScan run is necessary, the runtime of AutoSCOP was found to be about half the runtime of a PPA run (with included profile generation for the target), but slightly longer than a Vorolign scan as described in [Birzele et al., 2007], using up to a few minutes per target on an AMD Athlon XP with 1.8 Ghz. If annotated patterns are available the whole AutoSCOP process is mainly reduced to a database lookup which can be done in a few seconds.

Further, for the computation of the training data, which may be time-consuming, we could show that it is possible to still achieve good results with strongly filtered training data (e.g. the ASTRAL 25, which is filtered for 25% sequence identity); this means that the effort needed for generating the training data can be reduced by more than one order of magnitude (in our case) if necessary without a large reduction in accuracy.

4.6 Conclusion 63

Structurally classified protein sequences and structures are a useful basis for research in protein structure prediction. We therefore provide the AutoPSI database of precomputed, predicted SCOP classifications for new PDB entries as well as for about two million UniProt

sequences, which is available at

that can help in two ways: researchers with an interest in specific proteins may get a clue on structural classification and, associated with this, possible further properties such as a general function; method developers can use the database to derive and compare larger scale data for their purpose.

With respect to domain prediction, the AutoSCOP method and its derived database also offer first insights into the domain structure of these sequences. It is known that InterPro pattern locations can to some degree reflect the locations of protein domains (an evaluation of the performance of InterPro as a domain predictor was done by the assessors of the CAFASP 4 experiment and will be discussed in the next chapter). With AutoSCOP, we can annotate our highly-specific SCOP classifications to each pattern we have already seen in our training data, i.e. in the available SCOP/ASTRAL sequences. With these annotations, we can combine the pattern locations in a consensus which can then give a clear picture of the regions where the corresponding SCOP classifications are located, even if such a consensus region is comprised of many short PROSITE patterns, for instance. This can help users to get a direct impression on potential SCOP domain locations. Our evaluations show that AutoSCOP annotated patterns when available performed well on the CAFASP 4 data in a blind-test like setup, which confirms their applicability. Therefore, this approach can also be regarded as a step towards large-scale protein domain recognition. In addition to our precomputed results, on our website we provide a web server at

∗_{on domain sequences.}

Chapter 5 SSEP-Domain: Template-Based

Protein Domain Prediction

The methods presented so far (Preselection and AutoSCOP) deal mostly with protein domains and the corresponding classifications. However, given a target protein of unknown structure, also the domain content, i.e. the partition of the target into structural domains, is unknown. Here, AutoSCOP can give good hints but is not necessarily accurate with respect to the correct domain boundaries.

In this chapter, we present the SSEP-Domain protein domain prediction approach, which is based on the application of secondary structure element alignment and profile- profile alignment in combination with InterPro pattern searches. As we could show for Preselection and Refinement in chapter 3, secondary structure element alignment (SSEA) allows rapid screening for topologically similar and therefore potential domain regions while profile-profile alignment provides us with the necessary specificity for selecting significant hits. Including InterPro patterns, which have turned out to be a valuable resource for the AutoSCOP approach, we can also find regions on the target sequence that share similarities to known family or superfamily definitions which are not necessarily based on structural homologs.

In the CAFASP 4 experiment, SSEP-Domain performed well and was placed in the top group of domain prediction algorithms. Since then, we have introduced some changes that both significantly speed-up the procedure and slightly improve the performance. The description of the method and its evaluation on CAFASP 4 data presented here is an extended version of our journal contribution on the SSEP-Domain approach that appeared in Bioinformatics [Gewehr and Zimmer, 2006].

In addition, a newer evaluation on CASP 7 data shows that SSEP-Domain as well as its template database (SCOP) tend towards defining too few domains on the target sequences with respect to the definitions of the CASP 7 assessors. We therefore describe a variant of SSEP-Domain that includes an alternative structure-based domain assignment for the template domains (SSEP-Domain∗_{), i.e. a different view on protein domains than} the SCOP standard, that performs better on multi-domain proteins when evaluated on the CASP 7 data.

5.1 Introduction

Domain assignment in proteins is an important subtask of structure prediction, as domains are usually considered the basic units for protein folding, evolution, and function [Heger and Holm, 2003, Vogel et al., 2005], and thus the decomposition of proteins into domains can help in areas such as functional classification, homology-based structure prediction, and structural genomics [Liu and Rost, 2003]. Since 2004, the CASP and CAFASP experiments have included a domain prediction subcategory into their evaluations, which confirms the importance of this task.

The algorithm described in this chapter, SSEP-Domain, predicts protein domains using the amino acid sequence of the target on the basis of alignments to known SCOP domains. Other recent approaches for domain recognition from sequence are also often alignment-based, such as ADDA [Heger and Holm, 2003], the Dompred-DomSSEA approach [Marsden et al., 2002], and DOMAINATION [George and Heringa, 2002a]. DO- PRO [von ¨Ohsen, 2005] uses stochastic models on the alignment-based output of the Arby structure prediction server [von ¨Ohsen et al., 2004]. Besides alignments, the basis for such approaches can also be machine learning methods as described for BIOZON in [Nagarajan and Yona, 2004] and PPRODO [Sim et al., 2005], statistics as in the DGS method [Wheelan et al., 2000], taxonomy [Coin et al., 2004] or clustering as in MKDOM [Gouzy et al., 1999] or DIVCLUS [Park and Teichmann, 1998]. Some approaches make in- herently use of predicted structures, such as SnapDragon [George and Heringa, 2002b] or the Robetta servers, namely Robetta-Rosettadom [Kim et al., 2005] and Robetta-GINZU [Chivian et al., 2003], which are part of the evaluation in the results section.

Similar to the preselection approach described in chapter 3, for SSEP-Domain, secondary structure elements of proteins play an important role. Moreover, as the preselection- based speedup for fold recognition, also this approach is based on the observation that the fold class of a protein domain is often defined by the topology of its secondary structure elements (i.e. the elements and their lengths, the order of these elements, and the contacts between them). For protein domains, this means that their secondary structure element topology may be an indicator for fold class membership even if the sequence differs from all known members of the respective class. If the structure of a protein is unknown, we are still able to make use of its secondary structure elements and their order based on secondary structure prediction. Therefore, the comparison of subsequences of a target protein with known protein domains based on these features may reveal regions of the target that have the potential for being independent domains.

The DomSSEA protein domain prediction uses secondary structure element alignment (SSEA) for selecting PDB chains of potential templates. Other template-based methods such as Arby select many seed sequences (derived from scans by PSI-BLAST [Altschul et al., 1997], InterProScan [Zdobnov and Apweiler, 2001], predicted secondary structures, and more) and then apply elaborate and often costly alignment and scoring procedures such as threading or log-average profile-profile alignment (PPA). In SSEP- Domain, as we aim at both a fast and an accurate prediction method, we combine the quick SSEA with the accurate PPA, allow InterPro pattern regions directly as potential

In document Gewehr, Jan Erik (2007): New Methods for the Prediction and Classification of Protein Domains. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 77-83)