
4.9 Web server

4.11.4 Derivation of BTMX

BTMX is an extension of the TMX method that we previously described for HMPs [62]. Three potential sources of input data, namely conservation indices, positional frequency profiles and PSSMs, were identified from the literature [17, 59, 70, 140], and all possible combinations were tested for the derivation of BTMX. These input factors were generated from the given protein sequence as described above. An rSASA value of 0.0 was used as the cutoff to label residues as buried or exposed in the training and cross-validation data set. This value was chosen so that both classes (buried and exposed) are equally populated. Figure 21 shows that, although the prediction accuracy of BTMX increases with the cutoff used for labelling, a higher cutoff also increases the number of residues labelled as buried and thereby biases the training and cross-validation data set.
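The effect of the labelling cutoff on class balance can be illustrated with a minimal sketch. The function names and the rSASA values below are hypothetical, and treating a residue as exposed when its rSASA is strictly greater than the cutoff is an assumption, since the comparison direction is not stated explicitly here.

```python
from collections import Counter


def label_by_rsasa(rsasa_values, cutoff=0.0):
    """Label a residue 'exposed' if its rSASA exceeds the cutoff, else 'buried'.

    Assumption: the comparison is strict (rSASA > cutoff means exposed).
    """
    return ["exposed" if value > cutoff else "buried" for value in rsasa_values]


def class_balance(labels):
    """Return the fraction of residues in each class."""
    if not labels:
        return {"buried": 0.0, "exposed": 0.0}
    counts = Counter(labels)
    return {name: counts[name] / len(labels) for name in ("buried", "exposed")}


# Hypothetical rSASA values for eight residues (illustration only).
rsasa = [0.0, 0.12, 0.0, 0.35, 0.0, 0.08, 0.41, 0.0]

for cutoff in (0.0, 0.1, 0.4):
    labels = label_by_rsasa(rsasa, cutoff)
    print(cutoff, class_balance(labels))
```

On this toy input the two classes are balanced at a cutoff of 0.0, and the buried class grows as the cutoff is raised, which is the kind of bias described above.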

BTMX is a two-stage classifier. In the first stage, a sliding window (centered at the target residue) over the input factor was used to obtain positional scores with a support vector machine for regression (ε-SVR) with a radial kernel. In the second stage, to incorporate contextual information once more for each positional score, a sliding window over the positional scores from the first stage was used as input to a support vector classifier (C-SVC) with a linear kernel, which predicts the exposure status of each residue. Sliding windows of 1 to 15 residues were tested because beta strands that span the OM (with a tilt of 20° to 45°) mostly consist of 9 to 11 residues [26]. For the first stage, a leave-one-out test was conducted to optimize the C value and the window size, over ranges of 1 to 7 and 1 to 15 (in steps of 2), respectively, selecting for higher prediction accuracy (see supplementary). Fisher's analysis was then conducted on the parameters yielding the highest accuracies. The size of the sliding window in the second stage was optimized in a similar way. Linear regression and an SVR with a linear kernel were also tested in the first stage. The R implementation of the support vector classifier (SVC) and support vector regression (SVR) [107, 116] was used for the current work. The use of multiple stages to incorporate contextual information is widespread in the literature [62, 64, 141].
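The two-stage windowing scheme can be sketched as follows. This is not the BTMX implementation: the function names are hypothetical, each residue is reduced to a single number (whereas the real input factors, such as PSSM rows, are vectors), zero padding at the sequence ends is an assumption, and the two callables are simple stand-ins for the trained ε-SVR and C-SVC models.

```python
def sliding_windows(values, size, pad=0.0):
    """Return one window per position, centred on that position.

    Positions beyond the sequence ends are filled with `pad` (an assumption).
    """
    if size < 1 or size % 2 == 0:
        raise ValueError("window size must be a positive odd integer")
    half = size // 2
    padded = [pad] * half + list(values) + [pad] * half
    return [padded[i:i + size] for i in range(len(values))]


def two_stage_predict(features, score_fn, classify_fn, size1, size2):
    """Stage 1 maps feature windows to positional scores;
    stage 2 maps windows of those scores to a class label."""
    scores = [score_fn(window) for window in sliding_windows(features, size1)]
    return [classify_fn(window) for window in sliding_windows(scores, size2)]


def mean_score(window):
    """Stand-in for the stage-1 regressor (NOT an SVR): the window mean."""
    return sum(window) / len(window)


def threshold_label(window):
    """Stand-in for the stage-2 classifier (NOT an SVC): threshold the mean."""
    return "exposed" if sum(window) / len(window) > 0.5 else "buried"


# Hypothetical per-residue feature values (illustration only).
features = [0.9, 0.1, 0.8, 0.2, 0.9, 0.1, 0.7]
predictions = two_stage_predict(features, mean_score, threshold_label, 3, 3)
print(predictions)
```

Keeping the window construction separate from the models makes it straightforward to scan odd window sizes (1, 3, ..., 15) in a leave-one-out loop while swapping in real regression and classification models.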

5 TMBHMM: A frequency-profile based HMM for predicting the topology of transmembrane beta barrel proteins and the exposure status of transmembrane residues

5.1 Overview

Acknowledgements: The software developed and the results presented in this chapter represent joint work by Aaron Goodman, a summer student from the University of Pennsylvania whom I supervised during a 2-month internship in summer 2008, by Nitesh Kumar Singh, whom I supervised during his Master's thesis in bioinformatics from June 2009 to January 2010, and by myself. Besides supervising both students, my own contribution to this work was coming up with the initial idea and formulating the problem statement. I generated the training data set and guided the aforementioned students to the relevant literature. I reviewed the TMBHMM software during its implementation stage and, together with Nitesh and Aaron, analysed the results generated by the TMBHMM program.

The existing sequence-based computational methods in the realm of TMBs can be classified into two main categories. Methods in the first category aim at identifying the TMBs in a given proteome based on the sequence [1, 22, 25, 137, 142, 143]. Methods in the second category focus on determining the structural topology of a given sequence [64, 137, 144]. There are also methods that combine both features by providing the structural topology of the identified TMBs [1, 25, 142]. In contrast to globular proteins, which have been studied extensively following the pioneering work of Rost et al. [145], the problem of predicting exposed and buried residues has remained untouched for TMB proteins. In addition to predicting membrane-spanning regions and the structural topology, prediction of the exposure status is of interest because of its implied applications in channel engineering and site-specific mutational studies [146, 147]. As discussed above, exposed residues are those in contact with the membrane lipids, whereas buried residues are hidden in the protein structure. To the best of our knowledge, no method so far gives the exposure status of the residues predicted to be in the transmembrane region of putative TMBs.

In this work, we have developed a comprehensive computational method (TMBHMM) based on a Hidden Markov Model to predict the structural topology of TMBs, using only the frequency profiles of the amino acids in a given sequence as input. The novelty of the method is that it also predicts the exposure status of the transmembrane residues. The prediction accuracy of TMBHMM has been compared with that of PRED-TMBB [25], which has been reported to have one of the highest prediction accuracies [142], and we show that TMBHMM is at least as good as PRED-TMBB in terms of strand prediction accuracy. We have also established the TMBHMM web server, which accepts an amino acid sequence or a multiple sequence alignment as input and predicts the structural topology of the given sequence annotated with the exposure status. TMBHMM was trained on a non-redundant data set of 19 TMBs. The self-consistency test yielded Q2 accuracy for beta strand of 0.95. In the self-consistency test, the method predicted 83.0% of transmembrane residues with the correct exposure status. The jack-knife test yielded a Q2 accuracy of 0.86, a Q3 accuracy of 0.83, an MCC of 0.72 and an SOV for beta strand of 0.92. TMBHMM predicts the exposure status of the correctly predicted transmembrane residues with an accuracy of 83.21%.
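For orientation, two of the per-residue measures quoted above can be computed along the following lines. This is a minimal sketch using the standard definitions: Q2 as the fraction of residues assigned the correct one of two states, and MCC as the Matthews correlation coefficient. The function names, the "S"/"-" label encoding and the example strings are hypothetical, and the evaluation code actually used for TMBHMM may differ in detail.

```python
import math


def confusion_counts(observed, predicted, positive="S"):
    """Count TP, TN, FP and FN for a two-state per-residue labelling."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have equal length")
    tp = tn = fp = fn = 0
    for obs, pred in zip(observed, predicted):
        if pred == positive:
            if obs == positive:
                tp += 1
            else:
                fp += 1
        else:
            if obs == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def q2(observed, predicted, positive="S"):
    """Fraction of residues whose two-state label is predicted correctly."""
    tp, tn, fp, fn = confusion_counts(observed, predicted, positive)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(observed, predicted, positive="S"):
    """Matthews correlation coefficient; returns 0.0 if the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(observed, predicted, positive)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


# Hypothetical labels: "S" = transmembrane strand residue, "-" = other.
observed = "SSSS--SS--"
predicted = "SSS---SSS-"
print(round(q2(observed, predicted), 3), round(mcc(observed, predicted), 3))
```

Q2 alone can look favourable when one state dominates the sequence, which is one reason MCC is usually reported alongside it.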