The steadily increasing number of protein-RNA complex structures makes it possible to develop various techniques for predicting RNA-binding proteins at different levels of functional detail. Sequence-based techniques using machine-learning methods are ineffective in separating RNA-binding proteins from non-RNA-binding proteins, DNA-binding proteins in particular. Our results show that a template-based technique is the only viable approach for RNA-binding discrimination. On the other hand, for a known RNA-binding protein, the best machine-learning techniques are often more accurate in locating RNA-binding residues than a template-based approach. This is particularly true for proteins that are not predicted as RBPs by the template-based approach. Only a few techniques have been developed to predict the type of RNA that interacts with an RBP. A template-based approach can make a reasonable prediction based on the type of RNA in the matching template protein-RNA complex structure. Similarly, a template-based approach is the only reliable tool available for predicting protein-RNA complex structures. As more and more protein-RNA complex structures are deposited into the Protein Data Bank, one can expect that a template-based approach will become increasingly useful. An application of such an approach to the human genome yielded more than 2000 novel RBPs, recovering 42.1% of known RBPs and 41.5% of 860 newly discovered mRNA-binding proteins [17] [Zhao et al. submitted]. The consistency of the recovery (or sensitivity) across two separate datasets highlights the robustness of template-based tools for predicting truly novel RNA-binding proteins. Further, the machine-learning-based and template-based approaches are likely complementary to each other. Combining the two approaches will likely further improve the accuracy of RNA-binding function prediction.
Chapter 8 Structure-based prediction of carbohydrate-binding proteins, binding residues and complex structures by a template-based approach
8.1 Introduction
Carbohydrates perform essential roles in cellular processes in living organisms by interacting with proteins through both non-covalent (carbohydrate-protein binding) and covalent (glycosylation) interactions. Glycosylation of proteins and lipids coats the surfaces of all living cells and tissues with carbohydrates. The spatial patterns of this carbohydrate coating change during cell development [1] and during tumor progression and metastasis [214,215]. Thus, recognition of cell-surface carbohydrates, one of the key functions of carbohydrate-binding proteins (CBPs), is the subject of intensive study for biomarker discovery and inhibitor design [214,216]. Abundant carbohydrates on human cell surfaces are also exploited by carbohydrate-binding proteins in pathogens to invade cells and avoid detection. As a result, CBPs in pathogens have been explored as potential drug targets [217]. It is therefore critically important to locate all CBPs and elucidate their binding mechanisms.
Experimentally, glycan arrays have been developed for high-throughput searching for novel CBPs and for investigating their binding specificity [218–220]. However, it is challenging to construct a sizeable, diverse glycan array because carbohydrates are difficult to synthesize and isolate. Here, we focus on an alternative approach: prediction of CBPs and their binding residues by computational techniques.
Currently, the prediction of CBPs and the prediction of their binding residues are treated as two separate problems [221–225]. Someya et al. [221] predicted carbohydrate-binding proteins by combining protein sequence information with support vector machines (SVM). This approach employed triple sequence patterns and frequencies of grouped amino acids as features and achieved a Matthews correlation coefficient of 0.67 (leave-one-out cross-validation) on a dataset of 345 CBPs and non-CBPs. This method is limited to CBP prediction. Most of the methods developed for predicting carbohydrate-binding residues, on the other hand, assume that the protein structures are known. For example, Shionyu-Mitsuyama et al. predicted binding residues by building empirical interaction rules [222]. Tsai et al. utilized 3D probability density maps [224]. Others employed machine-learning techniques based on binding propensity and solvent accessibility [226] or on selected geometric and chemical features [227]. These methods, however, cannot distinguish CBPs from non-CBPs.
Here, we introduce a single template-based method for predicting both CBPs and carbohydrate-binding residues. This work is inspired by our highly effective template-based technique, SPOT-Struc, for structure-based prediction of DNA-/RNA-binding proteins and their binding sites [32,34]. In that approach, the target structure is first structurally aligned to proteins with known protein-RNA/DNA complex structures. Significantly aligned structures are then employed to build model complex structures between the target structure and the template RNA/DNA and to predict binding affinities.
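The two-stage logic described above (structural alignment to templates, followed by an affinity estimate for the modeled complex) can be illustrated with a minimal sketch. It is not the SPOT-Struc implementation: the `TemplateHit` record, the threshold values, and the rule of reporting the lowest-energy surviving template are all assumptions made for illustration, and the alignment scores and energies are taken as precomputed inputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemplateHit:
    """One target-template comparison (hypothetical record, for illustration)."""
    template_id: str
    alignment_score: float  # structural similarity; higher means more similar
    binding_energy: float   # predicted affinity of the modeled complex; lower is better


def predict_binding(
    hits: List[TemplateHit],
    min_score: float = 0.5,     # assumed threshold, not taken from the text
    max_energy: float = -5.0,   # assumed threshold, not taken from the text
) -> Optional[TemplateHit]:
    """Return the best supporting template, or None if the target is predicted non-binding.

    A template supports a binding prediction only if it passes BOTH filters:
    sufficient structural similarity and a sufficiently favorable predicted energy.
    """
    candidates = [
        h for h in hits
        if h.alignment_score >= min_score and h.binding_energy <= max_energy
    ]
    if not candidates:
        return None
    # Assumption: report the template whose modeled complex has the lowest energy.
    return min(candidates, key=lambda h: h.binding_energy)


# Made-up example values, purely to exercise the function.
example_hits = [
    TemplateHit("tmplA", alignment_score=0.72, binding_energy=-8.1),
    TemplateHit("tmplB", alignment_score=0.41, binding_energy=-12.0),  # fails similarity filter
    TemplateHit("tmplC", alignment_score=0.65, binding_energy=-2.3),   # fails energy filter
]
best = predict_binding(example_hits)
print(best.template_id if best else "predicted non-binding")
```

In a scheme of this kind, the matching template also supplies the binding-site prediction: target residues close to the template ligand in the modeled complex would be labeled as binding residues.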
In this work, we extend SPOT-Struc to CBPs. Such an extension is possible because a reasonable number of protein-carbohydrate complex structures exist in the Protein Data Bank [18], despite the low binding affinity and high structural flexibility of carbohydrates. This dataset of complex structures allows us to develop the first distance-dependent, knowledge-based energy function for protein-carbohydrate interactions, which is essential for the accuracy of SPOT-Struc for CBPs. A distance-scaled, finite, ideal-gas reference (DFIRE) state is used, as was done for proteins [33] and for protein-DNA/RNA interactions [32,34]. This knowledge-based energy function is then combined with a recently developed structure alignment method, SPalign [42], for predicting CBPs and binding residues. The method was tested on 122 non-redundant CBPs and 2880 non-CBPs and achieved Matthews correlation coefficients of 0.61 and 0.58 for the prediction of CBPs and carbohydrate-binding residues, respectively. The sensitivity and precision of CBP prediction are 45% and 85%, respectively. A similar level of sensitivity is achieved for APO and HOLO structures. Application of this method to structural genomics targets revealed several novel CBPs.
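To make the relationship between the quoted sensitivity, precision, and Matthews correlation coefficient concrete, the following sketch computes all three from a confusion matrix. The dataset sizes (122 positives and 2880 negatives) come from the text; the individual counts of true and false positives are not reported there and are assumed here, chosen only so that the resulting rates are close to the quoted 45% and 85%.

```python
import math


def binary_metrics(tp: int, fp: int, fn: int, tn: int):
    """Return (sensitivity, precision, MCC) for a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # fraction of true binders that are recovered
    precision = tp / (tp + fp)     # fraction of predicted binders that are correct
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, precision, mcc


# Dataset sizes as stated in the text.
n_positive, n_negative = 122, 2880

# ASSUMED counts (not reported in the text), picked to approximate
# a sensitivity of about 45% and a precision of about 85%.
tp, fp = 55, 10
fn = n_positive - tp
tn = n_negative - fp

sensitivity, precision, mcc = binary_metrics(tp, fp, fn, tn)
print(f"sensitivity={sensitivity:.2f} precision={precision:.2f} MCC={mcc:.2f}")
```

Because negatives outnumber positives by more than 20 to 1, accuracy alone would be dominated by the negative class; the Matthews correlation coefficient uses all four cells of the confusion matrix, which is why it is a common choice for imbalanced benchmarks of this kind.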