6.4 Optimized Profile-Profile Alignments
6.4.2 Modified Genetic Algorithm
The modifications introduced to the genetic algorithm are simple:
• Representation: our chromosomes now consist of matrix entries for the sequence- based matrix, matrix entries for the secondary structure-based matrix, gap open and gap extension costs and weighting factors for the sequence-based score and the secondary structure-based score.
• Diagonal entries: matrix entries are initially drawn randomly from [-1,1]. In addi- tion, in order to speed-up the process, we increased the initial weights of the diagonal
Fold opt. PPA PPA ∆ #Sign. Changed #Neg #Pos a.3 0.471 0.438 0.032 10.6% 0.7% 9.9% a.118 0.288 0.294 -0.006 4.3% 2.8% 1.5% b.34 0.353 0.347 0.006 13.2% 4.7% 8.5% b.47 0.570 0.532 0.038 20.4% 2.2% 18.1% c.26 0.336 0.319 0.016 9.1% 1.9% 7.1% c.3 0.446 0.456 -0.010 10.2% 8.1% 2.1% d.15 0.313 0.301 0.012 8.7% 2.8% 5.8% d.19 0.558 0.505 0.053 31.0% 0.9% 30.0% Avg 0.416 0.399 0.017 13.4% 3.0% 10.3%
Table 6.2: Average TM-Scores after optimizing PPA parameters using a modified genetic algorithm. Differences are shown in bold face whenever the optimization process leads to better results on the test data. In addition, the number of significantly changed TM-Scores (a difference of more than 0.1 TM-Score) as compared to the original PPA parameters on the test alignments is given together with the relative number of positive changes (where optimized parameters lead to better TM-Scores by more than 0.1) and the relative number of negative changes (where the original parameters performed better by more than 0.1).
entries of the matrices by drawing from [0.8, 2]. Before being used in the PPA pro- cedure, all matrices are normalized such that the entries sum up to 1.0.
• Other parameters: All other parameters such as the weights for the sequence and the secondary structure-based parts are initially drawn from [0,20].
We used a population size of 50 and a mutation rate of 0.4. As each population member has to be evaluated by generating profile-profile alignments and then computing the TM-Score, the complete optimization procedure takes considerably longer than the genetic algorithm proposed in the previous section (up to several days on a single CPU). This is the main reason why we used only 8 exemplary fold classes in this evaluation instead of all available fold classes (about 800) of the ASTRAL distribution.
6.4.3
Results
For each fold class, we ran our genetic algorithm five times and then chose the parameters that performed best on the training data for the evaluation on the test set. The average changes of the mean-length normalized TM Score are given in Table 6.2. Further, we evaluated the number of alignments for which the TM-Score was in- or decreased by more than 0.1 (we refer to these alignments assignificantly changed alignments in the following). Within our 8 randomly chosen fold classes, we find that for three of them (a.3, b.47 and d.19) there is a clear improvement in average TM-Score of 0.032, 0.038 and 0.053, respectively, whereas we observe only slight increases for three classes and slight decreases for the remaining two classes (within a range of 0.02 difference in TM-Score between the
6.5 Discussion 115
original and the newly generated parameters). With respect to individual alignments, we find that, on the test sets for each fold class, on average (over the fold classes, weighing each class equally) 13.4% of the alignments have been significantly changed, 10.3% being ”better” and 3.0% being ”worse”. In other words, on average, nearly 4 of 5 significantly changed alignments have been improved. In the best case (d.19), 29 out of 30 significantly changed alignments have become better with respect to TM-Score.
Keeping in mind that the genetic algorithm starts more or less without any knowledge and that the PPA parameters have already been optimized for fold recognition purposes in general, this shows that it is possible to generate optimized parameters from random initial choices, at least for some of the available fold classes. By splitting the domains and the corresponding alignments into training and test data, it is further possible to select those fold classes where optimized parameters might be useful in future applications and where not, though this would require quite some CPU time when applied large-scale.
The GA parameters for this evaluation were chosen intuitively, and the training sets as well as the numbers of runs were small due to the runtime requirements. Therefore, in a larger experiment using larger training sets, larger populations and more runs, it may be possible to improve over the results shown in Table 6.2.
6.5
Discussion
In the first section of this chapter, we have described the QUASAR package, a simple alignment ranking software which allows for combination and, to some extent, optimization of alignment scores with respect to structure-based benchmark scores. The software is
available attogether with documentation, a tutorial
and examples. It has been developed on a Linux/Unix system, but, being implemented in JAVA, can also be used on other platforms.
QUASAR may be of interest for users who work with alignments and want to either evaluate them with structure-based scores or rank them according to an optimized score combination that will hopefully rank ”good” alignments (with respect to the structural quality of the resulting coordinate model) above ”bad” alignments. Both situations occur in CASP-like environments: the former for training and evaluation of the servers and methods being developed, and the latter for finding those alignments and models that are finally used as predictions.
Subsequently, two similar optimization problems have been discussed: (1) alignment scoring and ranking for selection of potentially good structure models on the alignment level, and (2) the optimization of fold-class specific parameters for improving the quality of profile-profile alignment.
For the first problem, two different methods have been proposed, namely least-squares optimized linear combination of well-known scoring matrices and the genetic optimization of new scoring matrices. We find that the first approach works well in our comparison, as the scores obtained from linear combinations show an improved correlation with our benchmark score. Here, more elaborate combination techniques allowing more operators
than only addition and weighting may be an interesting starting point for future research. The second method, the genetic optimization of scoring matrices and their gap parameters produces competitive matrices to the well-known matrices in our comparison. Though no individual GA matrix performed better than the best known matrix, inclusion of the GA matrices into a combination of all evaluated matrices could further improve accuracy.
For the second problem, we modified the genetic algorithm such that it optimizes all necessary parameters for PPA. The results show that, for some cases, a clear improvement of alignment quality with respect to TM-Score can be reached. For this problem, the optimization procedure itself is time-consuming, but has to be done only once for each fold class in order to obtain the new parameters.
Overall, the evaluations in this chapter lead to two conclusions. First, given enough computing power, it is to some degree possible to optimize parameters for alignment rank- ing or alignment generation for specific environments (i.e. alignment procedures, template data etc.). Second, however, it is also obvious that the currently used matrices and their parameters already perform well and that one has to choose carefully when to use adapted parameters and when not, especially in the second case (alignment computation with op- timized parameters).
Chapter 7
Additional Tools for Protein Domain
Representation and Classification
In addition to the methods described in the previous chapters, in cooperation with col- leagues, some additional tools have been developed that can contribute to research in pro- tein structure prediction, three of which are the topics of this chapter. The first two will be described only briefly, namely the Vorolign structural alignment server [Birzele et al., 2007] and the ProML Schema for representation of proteins and protein sets. The third, the BioWeka library [Gewehr et al., 2007b] that extends the Weka data mining framework with bioinformatics data formats and methods, is described in more detail.
7.1
Vorolign: Structural Alignment and SCOP Clas-
sification Prediction
So far, we have described methods that predicted SCOP classifications based on a protein domain’s amino acid sequence (and derived information) under the assumption that the target’s structure is unknown. If we know the structure of the target, prediction accuracy can be improved by including this information into the prediction process. For the Au- toSCOP method, this was shown by combining AutoSCOP with a structural alignment method which will be described in this section, namely the Vorolign method.
Vorolign is mainly the work of Fabian Birzele (in joint work with the author and Gergely Csaba), therefore we will keep this section short and present only an overview which, together with the previous chapters, completes our efforts in SCOP classification prediction based on different types of data (sequences, patterns and finally structures). More details about Vorolign can be found in the corresponding publication [Birzele et al., 2007]. In this work, Vorolign was used as an additional predictor for the AutoSCOP database, which provides predicted SCOP classifications using both AutoSCOP and Vorolign for new PDB entries.
Figure 7.1: The similarity computation method of the Vorolign approach (taken from [Birzele et al., 2007]): Given their neighbors (left panel), Vorolign computes the similarity between two residues via dynamic programming (middle panel, low level matrix). These individual similarities are then used for the overall alignment as shown in the right panel (high level matrix).
The Vorolign Approach
The task of alignment and comparison of protein structures has been investigated for several years. When considering the proposed methods, it is usually possible to categorize them into one of the following two classes: (1) methods treating proteins as rigid (such as DALI [Holm and Sander, 1996] and CE [Shindyalov and Bourne, 1998]) and (2) methods allowing flexibility in proteins (such as FATCAT [Ye and Godzik, 2003]). Especially for the former class, the alignment criterion is often the root mean square deviation (RMSD) of the resulting superimposition, which may be regarded either locally (i.e. over certain, similar parts of the aligned structures) or globally (over the complete structures). Belonging to the latter class, the Vorolign method allows for flexible alignment of protein structures based on the following concept:
Vorolign assumes that two structurally similar residues are also similar with respect to their environment in the corresponding structures. Thereby, the environment of a residue is defined by its neighbors in the Voronoi tessellation of its structure. Given these neigh- bors, for the contained residues, similarity is computed using dynamic programming based on both amino acid as well as secondary structure exchange scores, and using these sim- ilarity scores between residues, an overall alignment is computed again using dynamic programming (see Figure 7.1). For details, we refer to [Birzele et al., 2007].
7.2 Representation of Protein Information in ProML 119