A novel method based on physicochemical properties of amino acids and one class classification algorithm for disease gene identification

(1)

A novel method based on physicochemical properties of amino acids

and one class classiﬁcation algorithm for disease gene identiﬁcation

Abdulaziz Yousef, Nasrollah Moghadam Charkari

⇑

Faculty of Electrical & Computer Engineering, Tarbiat Modares University, Tehran, Iran

a r t i c l e

i n f o

Article history:

Received 13 February 2015 Revised 4 June 2015 Accepted 26 June 2015 Available online 2 July 2015 Keywords:

Disease gene identiﬁcation

Physicochemical properties of amino acid One class classiﬁcation

Support Vector Data Description

a b s t r a c t

Identifying the genes that cause disease is one of the most challenging issues to establish the diagnosis and treatment quickly. Several interesting methods have been introduced for disease gene identification for a decade. In general, the main differences between these methods are the type of data used as a prior-knowledge, as well as machine learning (ML) methods used for identification. The disease gene identification task has been commonly viewed by ML methods as a binary classification problem (whether any gene is disease or not). However, the nature of the data (since there is no negative data available for training or leaners) creates a major problem which affect the results. In this paper, sequence-based, one class classification method is introduced to assign genes to disease class (yes, no). First, to generate feature vector, the sequences of proteins (genes) are initially transformed to numerical vector using physicochemical properties of amino acid. Second, as there is no definite approach to define non-disease genes (negative data); we have attempted to model solely disease genes (positive data) to make a prediction by employing Support Vector Data Description algorithm. The experimental results confirm the efficiency of the method with precision, recall and F-measure of 79.3%, 82.6% and 80.9%, respectively.

1. Introduction

Identifying disease genes is an important issue to enhance our knowledge about disease mechanism, and to improve clinical methods.

Traditional linkage and association studies have been carried out to identify a large number of candidate genes probably related with diseases[1]. Since using experimental approaches to identify disease associated genes from the vast numbers of candidates is an expensive task, the requirement of computational approaches has been taken into account. In this regard, many interesting machine learning methods have been introduced to ﬁnd the similarity fea-tures between unknown genes (candidate genes) with the known disease genes. These methods differ in two ways. First, the type of genomic data used for generating the feature vector, such as pro-tein–protein interactions (PPIs)[2–6], gene expression proﬁles[7], gene ontology (GO) [8]. Other methods integrate multiple data sources to prioritize candidate disease genes[9–11].

Second, the type of the algorithm which has been used for train-ing the prediction model. However, the majority of the studies

consider this issue as two class classification problem. Some studies have defined the known disease gene as positive set and the unknown disease gene as negative set[12,13]. Since the unknown genes set is often comprised with some disease genes, other meth-ods have attempted to reduce the confusion in classification process by selecting a small fraction of unknown genes as negative set

[14,15]. Nevertheless, these methods are not robust and reliable enough as the negative set which has been achieved from unknown genes, still suffers from noisy data.

All the above mentioned methods might not be implemented properly, because these methods rely on the information of pro-teins achieved from prior-knowledge (PPI network, gene ontology, and protein domains) which contains some errors. Moreover, they usually suffer from incompleteness. Therefore, a universal prior-knowledge would be required to tackle this problem. The only data which are available for all proteins and has inﬂuential role in solving many problems such as predicting a subcellular locations

[16,17], protein–protein interactions[18–20], and protein struc-tural and functional classes[21–23]is the sequences of proteins

[24]. On the other hand, there is no information about the negative data (non-disease gene). Also, there is no guarantee of using the unknown genes or fraction of them as a negative data. Hence, using two class classiﬁcation algorithm may be not appropriate.

http://dx.doi.org/10.1016/j.jbi.2015.06.018

⇑Corresponding author. Tel.: +98 21 82883301; fax: +98 21 82884325. E-mail addresses:yousef@modares.ac.ir,Azizyousef1@yahoo.com (A. Yousef),

charkari@modares.ac.ir(N.M. Charkari).

Contents lists available atScienceDirect

Journal of Biomedical Informatics

j o u r n a l h o m e p a g e : w w w . e l s e v i e r . c o m / l o c a t e / y j b i n

(2)

In this paper, we present a novel sequence-based one class classi-fication method for disease gene identiclassi-fication. Since the earliest pro-tein sequences and the structures were determined, it would be clear that the positioning and properties of amino acids are key point to infer many biological processes. For example, the first protein struc-ture, haemoglobin provides a molecular explanation for the genetic disease sickle cell anaemia [25]. Therefore, a sequence-translated method based on physicochemical properties of amino acid, is employed to construct a feature vector for each protein. To improve the performance of the proposed method, some efficient features are selected using Principal Component Analysis (PCA). Since, the mutation of genes is always possible to happen, and also there is a likelihood of this mutation leading to disease[26], there is no avail-able conception asserts that some proteins are not involved in disease (non-disease genes). Hence, we have attempted to only train the pos-itive data (disease genes) using Support Vector Data Description (SVDD) algorithm. We have used two type of data as negative data to evaluate the model. The fist type has selected randomly from unknown dataset. While the second one has been selected from the likely negative data used in positive-unlabeled technique[15].

The proposed method has been compared with ProDiGe method [14], Smalter’s method [12], and yang method [15]. The experimental results achieved 79.3%, 82.6% and 80.9% for precision, recall and F-measure, respectively. These results indicate that our method overpassed the current state-of-the art methods in precision.

2. Material and method

In this section, the proposed method for identifying disease genes is described. The method consists of three steps: (1) Translate corresponding gene products (proteins) into numerical feature vector using physicochemical descriptor; (2) Principal Component Analysis (PCA) algorithm is applied to extract appro-priate features; (3) training the positive data using SVDD algorithm to make the identiﬁcation. The proposed method schema is illus-trated inFig. 1.

2.1. Protein sequence translation

Many representation methods have been introduced to extract the information encoded in amino acid sequence, including Geary

auto correlation (GA)[27], auto covariance (AC)[28], Normalized Moreau–Broto autocorrelation (NA) [29], and Moran auto-correlation (MA)[30]. These methods are based on different physicochemical properties. In this paper, six joint physicochemi-cal properties of amino acid which were used in many applications (e.g. predicting protein structural, functional classes, protein–pro-tein interactions, sub-cellular locations. . .) have been selected. Since by adding one more physicochemical property, thirty fea-tures will be added to the feature vector, we have selected out the more effective physicochemical properties with the minimum number of these properties to avoid complexity (Time, and Computation). These properties are polarity (POL) [31], residue accessible surface area in tripeptide (RAS) [32], hydrophilicity (HY-PHIL) [33], polarizability (POL2) [34], hydrophobicity (HY-PHOB) [35], solvation free energy (SFE) [36], respectively. The original values of these physicochemical properties for each amino acid were normalized using Min–Max normalization method as shown in Eq.(1):

Pij¼

Pi;j Pj;min Pj;Max Pj;min

ð1Þ

where Pi;jis the j-th descriptor value for i-th amino acid, Pj;minis the minimum value of j-th descriptor over the 20 amino acids and Pj;Max is the maximum value of j-th descriptor over the 20 amino acids.

Table 1 shows the normalized physicochemical properties. Since GA method has achieved a good result in other application[20], in this work, we have applied GA method as a representation method. 2.2. Principal Component Analysis (PCA)

To increase the overall performance of the proposed method, we tried to extract the most relevant and useful features from the high-dimensional represented feature vectors (180 features) gener-ated by GA method. PCA is presented as a dimensionality reduction methods. PCA is an appropriate statistical technique to identify pat-terns in data, and to expiree the data in a way that to high light their similarities and differences. Therefore, utilized to determine signif-icant features which preserves most of the information and removes the redundant components[37]. PCA is a linear combination which transforms one set of variables in Rm_{space into another set in R}n space containing the maximum amount of variance in the data where n is smaller than m. This is obtained as the following steps:

Fig. 1. The schematic of the proposed method. As is shown, the proposed method includes three layers which are representation layer, Feature extraction layer, predictor layer.

(3)

1. Compute the covariance matrix as follow; S ¼1 N XN k¼0 ðxk mÞ xð k mÞT; where m ¼1 N PN

kxk, the mean of the original feature vectors. 2. Compute the eigenvectors and eigenvalues of the covariance

matrix by solving the following decomposition

kiei¼ Sei;

where kiis the eigenvalue associated with the eigenvector ei. 3. Sort the result in decreasing order of eigenvalue. 4. Choose the most important components (features). 2.3. Support Vector Data Description (SVDD)

Since there is no information about the negative data (non-disease gene) and solely the positive data (target data) is available, one class classification (OCC) method would be the proper approach to efficiently solve the disease gene identification problem. The OCC method makes a description of a disease gene set and detects new genes to resemble this set. Support Vector Data Description (SVDD) method is inspired by support vector machine which has used for novelty or outlier detection. it is a one-class classification that constructs a hyper-sphere by enclosing the target instances properly to assign a test instance to positive class or not [38]. The objective of SVDD is to find a minimum hyper-sphere with center a and radius R, including all the target instances.Fig. 2shows the SVDD method sketch-ma. SVDD algo-rithm like SVM algoalgo-rithm provides more flexible and stable description of the boundary by applying kernel functions (e.g. Gaussian kernel). For disease gene identification, SVDD has the fol-lowing benefits[38]:

1. Sparsity: fewer training samples are needed. As we mentioned above, a small fraction of genes would be predicted as disease gene.

2. Good generalization: SVDD method avoids overfitting and yields good generalization results when compared with other conven-tional methods.

3. Use of kernels: by exploiting the ‘kernel trick’, the SVDD method is able to accurately model the support of non-trivial, multi-modal distributions.

3. Results

At first, we have evaluated the effect of different sequence rep-resentation methods on the performance of the proposed method. Furthermore, the best number of features extracted by PCA algo-rithm for each of these methods has been investigated. After selecting the most effective sequence representation method, we have assessed the impact of each physicochemical properties of amino acid on the identification process. Finally, to confirm the efficiency of the method, we have conducted the comparison between our method and the state-of-the-art techniques on gen-eral disease genes identification.

3.1. Experimental data

The data used by[15]has been employed in our experiments. This dataset consists of 5405 known disease genes (positive data) spanning 2751 disease phenotypes, where all the genes have been extracted by combining GENECARD[39] and OMIM[40]disease gene data. In addition 16 k genes from Ensembl[41]has been selected as the unknown gene set.

3.2. Evaluation metrics

To measure the performance of the proposed identiﬁcation method, different metrics have been used. The Recall, Precision, and F-measure which are deﬁned as follows:

Recall ¼ TP

TP þ FN

Precision ¼ TP

TP þ FP

F-measure ¼2 Recall Precision

Recall þ Precision

where FP is false positive (non-disease genes ‘negative data’ which have been identified as disease gene), TP is True Positive (disease genes which have been correctly identified), and FN is false negative (disease genes which have been identified as non-disease gene ‘neg-ative data’ not correctly). The Recall is the ratio of the number of disease genes retrieved to the total number of disease genes in the dataset. Precision is the ratio of the number of relevant disease genes retrieved to the total number of irrelevant and relevant dis-ease genes retrieved. Finally, the F-measure is the harmonic mean of recall and precision.

Table 1

The normalized value of six physicochemical properties for each of Amino Acid (A–Y).

HY-PHOB HY-PHIL POL POL2 SFE RAS

A 0.281 0.453 0.395 0.112 0.589 0.222 C 0.458 0.375 0.074 0.312 0.527 0.333 D 0 1 1 0.256 0.191 0.416 E 0.027 1 0.913 0.369 0.285 0.638 F 1 0.14 0.037 0.709 0.936 0.75 G 0.198 0.531 0.506 0 0.446 0 H 0.207 0.453 0.679 0.562 0.582 0.666 I 0.792 0.25 0.037 0.454 0.851 0.555 K 0.198 1 0.79 0.535 0.325 0.694 L 0.783 0.25 0 0.454 0.851 0.527 M 0.721 0.328 0.098 0.54 0.957 0.611 N 0.12 0.562 0.827 0.327 0.319 0.472 P 0.253 0.531 0.382 0.32 0.702 0.388 Q 0.123 0.562 0.691 0.44 0.4 0.583 R 0.222 1 0.691 0.711 0 0.833 S 0.235 0.578 0.53 0.151 0.448 0.222 T 0.318 0.468 0.456 0.264 0.557 0.361 V 0.687 0.296 0.123 0.342 0.765 0.444 W 0.56 0 0.061 1 1 1 Y 0.922 0.171 0.16 0.728 0.787 0.861

(4)

4. Discussion

In order to minimize the overfitting of the identification model, 5-fold cross-validations has been carried out in all of the experi-ments. 5000 positive and 5000 unlabeled instances have been employed. Since, the disease genes are available, we used SVDD which requires only positive data for training. For testing and esti-mating the error rates, some negative data would be required. In this regard, we have employed two evaluation strategies. In The first strategy, some of unknown data have been selected randomly as negative data and have been used in testing process. In the sec-ond strategy, we have used positive-unlabeled method[15]which considered the farthest unknown instances from the mean vector of the all positive data as negative instances.

4.1. Performance of the sequence representation methods

To evaluate the robustness of the representation methods, we have attempted to train and test the model using SVDD algorithm based on AC, GA, MA, NA representation methods separately. The results of each predicator have been reported in two ways. First, the results when all the features have been employed. Second, the new results when the PCA feature reduction method have been employed.

Table 2shows the comparison between the identiﬁcation per-formances of the sequence representation method using all fea-tures.Table 3indicates the results of each representation method after applying the PCA feature reduction method. It can be found that the performance of the GA method is better than the perfor-mance of the other ones. Therefore, we select the GA as the main sequence representation method for the next experiments. Moreover, Table 3 indicates the positive effect of using PCA in improving the experimental results.

4.2. Assessment of physicochemical properties

Since the sequence representation method is based on the physicochemical properties, in this experiment, our objective is to understand which physicochemical property has more impact in identification disease gene. In this regard, six different GA fea-ture vectors have been built. The first one is generated using all physicochemical properties, and the five other ones are generated by deleting one of the physicochemical properties, respectively. In this experiment, we could observe that the deletion of the HY-PHIL property reduces the F-measure dramatically. While, deleting the others has a small negative effect on the performance.Fig. 3 illus-trates the priority of physicochemical properties in disease gene identification. However, the deletion of any physicochemical prop-erties will decrease the performance.

4.3. Evaluation of different One Class Classiﬁcation methods

To evaluate the performance of SVDD method for disease gene identiﬁcation, two other well known one class classiﬁcation meth-ods, i.e. Guassian Mixture Model, and Parzen density estimation, have been implemented in our study.

All the employed classifiers originate from the Matlab toolbox ddTOOLS [42]. Each of these classifiers has its own parameters. Parzen classifier has a smoothing parameter s, and GMM has the number of mixture components k of the positive class. While SVDD has two parameters C, and kernel parameter

a, where C is

the regularization parameter which controls the trade-off between the volume of the hypersphere and the number of accepted train-ing samples. We used trial and error to optimize the Parzen and GMM parameters, and the values s = 0.38, and k = 29 were selected. While, grid search has been carried out to optimize SVDD parame-ters. Accordingly, the trained value of optimal kernel parameter as well as the trained value of optimal regularization parameter are

a

= 0.128, and C = 16, respectively.

The results of the estimation of precision, recall, and F-measure scores are shown inTable 4. The results of precision indicate that there is no significant differences between the methods. While, the analysis of the results of the recall scores clearly shows that the SVDD method provides better performance than the other mentioned methods. The significant results of SVDD can be explained as this classifier prevents overfitting by utilizing only few instances to support a boundary between disease genes and other genes. While, for other methods like Parzen method, it takes into account the information of all training samples. Hence, it leads to more overfitting as well as more false negative rate.

Table 2

Comparison between sequence representation methods using all features. Methods # of Feature Precision (%) Recall (%) F-measure (%)

AC-SVDD 180 74 76.2 75

GA-SVDD 180 77.3 79.5 78.3

NA-SVDD 180 70.9 82.4 76.2

MA-SVDD 180 69.8 79.7 74.4

Table 3

Comparison between sequence representation methods using PCA extracted features. Methods # of PCA Features Precision (%) Recall (%) F-measure (%)

AC-SVDD 55 78.6 74.2 76.3

GA-SVDD 60 79.3 82.6 80.9

NA-SVDD 80 76.1 81.2 78.5

MA-SVDD 55 72 80.4 75.9

Fig. 3. The effectiveness of physicochemical properties on the disease gene identiﬁcation.

Table 4

Comparative evaluation between different one class classiﬁcation methods. Methods Precision (%) Recall (%) F-measure (%)

SVDD 79.3 82.6 80.9

Parzen 79.6 80.3 79.9

(5)

4.4. Comparison with other works

We have compared the proposed method with the state-of-art methods, including, Smalter’s method [12], Xu’s method [6], ProDiGe method[14], and Yang method[15].Table 5shows the comparison between the proposed method with other related methods. The main differences between the proposed method and the related ones are:

1. The prior-knowledge used to extract feature vector. In this work, the sequence of protein as a more universal prior-knowledge has been carried out, while in the previous work, the prior-knowledge suffered from noise effect.

2. The classification method used for disease gene identification problem. All the previous methods consider this problem as two class classification problem. while, in this work the disease gene identification problem is treated as one class classification problem. The reason is that, there is no guarantee about the negative instances which are extracted from unknown genes. Therefore, it would be preferred to train the model using only the positive instances.

To test our model, we have applied two types of negative data. The First one, to reduce the noise data, we have attempted to select some more likely negative data from unknown data; In this regard, we selected the more farthest unknown data from the mean vector of all positive data. The second type of negative data have been randomly selected from all unknown data. We used the ﬁrst type to make comparison with PUDI algorithm, while the second type have been used to compare with the remaining algorithms. The results indicate that the prediction performance of the proposed method is better than the other ones.

To investigate how the proposed method can improve the per-formance of disease gene identiﬁcation compared with the state-of-the-art methods, we implemented the leave-one-out cross-validation test and drew the ROC curves of the test set. As shown inFig. 4, the curve of the SVDD predictor shows higher pre-diction performance on the whole range of false positive rates, and produces an AUC score of 95.7%. While the AUCs achieved by PUDI, ProDiGe, Xu’s Method, are 93%, 88.7%, 83.2% respectively. In other Table 5

Comparison between disease gene identiﬁcation methods.

Methods Precision (%) Recall (%) F-measure (%)

Xu’s method (1)[6] 68.4 54.2 60.4

Xu’s method (5)[6] 66.8 56.3 61.1

Smalter’s method[12] 67.9 61.5 64.5

ProDiGe[14] 70.4 76.2 73.1

Proposed method (randomly negative data) 72.9 78.8 75.7

PUDI[15] 72.7 82.4 77.2

Proposed method (Likely negative data) 79.3 82.6 80.9

Fig. 4. The ROC curve of the test set as analyzed by PUDI, ProDiGe, Xu’s Method, and Proposed Method.

Table 6

Novel Disease gene predicted using SVDD algorithm.

Gene Score Phenotype

TRPV1 98.5 [42] GP5 98.2 Thrombocytopenia[43] Platelet disorder[44] ANGPTL1 96.2 Melanoma[45] Tumors[46] BDNF 93.6 Huntington[47] WSB1 92.8 Neurobalstoma[48] TRPAL 88 PHLDA1 80.1 Tumors[49] ODAM 79.6 ITGB1BP2 76.5 Hypertrophy[50] EIF2AK2 69.3 Inﬂuenza[51] Hepatitis c[52]

(6)

words, the SVDD classiﬁer can decrease both false negatives and false positives compared with other methods.

4.5. Novel gene identiﬁcation

We made an effort to discover novel disease genes that are not available in the disease gene data set. In this regard, we employed the positive data (disease genes) to train the model. Then, we tested the model using other unknown genes. We selected the top 10 genes based on their model likelihoods (SVDD probabili-ties). By implementing the literature search, we found that 8 out of 10 predicted disease genes are related at least with one disease.

Table 6enumerates the novel disease genes which are predicted

[43–53].

4.6. Identiﬁcation speciﬁc disease class

To investigate the reliability of the proposed method in detect-ing disease genes for speciﬁc disease classes, Parkinson’s disease (PD) has been selected. We have obtained 32 PD-related proteins from OMIM disease gene data. And 18 PD-related proteins based on UniProt database[54]. We provided the dataset containing 100 samples. In this regard, the 50 PD-related proteins make up the pos-itive data. While 50 unlabeled genes which have been selected from likely negative set randomly, make up the negative data. To evalu-ate the identiﬁer model, leave-one-out cross-validation test has been performed. As shown inFig. 5, the proposed method achieved the area of 97% under ROC. While, the results of, Xu’s Methods, ProDiGe, and PUDI are 91.4%, 93%, 95.2%, respectively. These results indicate that the proposed method has good performance since it can identify PD-related genes with high probability.

5. Conclusions

In this article, we proposed a one class classification based on sequence of protein for disease genes identification. GA represen-tation method based on the physicochemical properties of amino acids have been used to transform the amino acids sequences to numerical feature vectors. Then, the more significant features are selected using PCA algorithm. Finally, we trained the model of

disease gene using SVDD algorithm. The results demonstrate the importance of solving the disease identification problem as one class classification problem. In addition, it is worth to mention that the physicochemical properties of amino acids would be highly beneficial. As comparison with other methods, we found that the proposed method achieved better results than the previous stud-ies. For future work, to achieve better classification performance, more physicochemical properties will be taken into consideration such as amino acid composition, CC in regression analysis, and graph shape index. To get better performance, we will also apply a combination of different type of one class classification methods.

Conﬂict of interest

The authors declare that they have no conﬂict of interest. References

[1]A.M. Glazier, J.H. Nadeau, T.J. Aitman, Finding genes that underlie complex traits, Science 298 (5602) (2002) 2345–2349.

[2]S. Kohler et al., Walking the interactome for prioritization of candidate disease genes, Am. J. Hum. Genet. 82 (4) (2008) 949–958.

[3]P. Yang et al., Inferring gene-phenotype associations via global protein complex network propagation, PLoS ONE 6 (7) (2011) e21502.

[4]W. Zhang, F. Sun, R. Jiang, Integrating multiple protein-protein interaction networks to prioritize disease genes: a Bayesian regression approach, BMC Bioinformatics 12 (Suppl. 1) (2011) S11.

[5]S. Navlakha, C. Kingsford, The power of protein interaction networks for associating genes with diseases, Bioinformatics 26 (8) (2010) 1057–1063. [6]J. Xu, Y. Li, Discovering disease-genes by topological features in human

protein-protein interaction network, Bioinformatics 22 (22) (2006) 2800– 2805.

[7]U. Ala et al., Prediction of human disease genes by human-mouse conserved coexpression analysis, PLoS Comput. Biol. 4 (3) (2008) e1000043.

[8]J. Freudenberg, P. Propping, A similarity-based method for genome-wide prediction of disease-relevant human genes, Bioinformatics 18 (suppl 2) (2002) S110–S115.

[9]S. Aerts et al., Gene prioritization through genomic data fusion, Nat. Biotechnol. 24 (5) (2006) 537–544.

[10] T. De Bie et al., Kernel-based data fusion for gene prioritization, Bioinformatics 23 (13) (2007) i125–i132.

[11]Y. Li, J.C. Patra, Integration of multiple data sources to prioritize candidate genes using discounted rating system, BMC Bioinformatics 11 (Suppl. 1) (2010) S20.

[12] A. Smalter, S.F. Lei, X.-W. Chen, Human disease-gene classiﬁcation with integrative sequence-based and topological features of protein-protein interaction networks, IEEE.

(7)

[13]P. Radivojac et al., An integrated approach to inferring gene-disease associations in humans, Proteins 72 (3) (2008) 1030–1037.

[14]F. Mordelet, J.P. Vert, ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples, BMC Bioinformatics 12 (1) (2011) 389.

[15]P. Yang et al., Positive-unlabeled learning for disease gene identiﬁcation, Bioinformatics 28 (20) (2012) 2640–2647.

[16]K.C. Chou, Y.D. Cai, Prediction of protein subcellular locations by GO-FunD-PseAA predictor, Biochem. Biophys. Res. Commun. 320 (4) (2004) 1236–1239. [17]Y. Fukasawa et al., Plus ça change-evolutionary sequence divergence predicts

protein subcellular localization signals, BMC Genomics 15 (1) (2014) 46. [18]S.L. Lo et al., Effect of training datasets on support vector machine prediction of

protein-protein interactions, Proteomics 5 (4) (2005) 876–884.

[19]C.-Y. Yu, L.-C. Chou, D.T.H. Chang, Predicting protein-protein interactions in unbalanced data using the primary structure of proteins, BMC Bioinformatics 11 (1) (2010) 167.

[20]A. Yousef, N. Moghadam Charkari, A novel method based on new adaptive LVQ neural network for predicting protein-protein interactions from protein sequences, J. Theor. Biol. 336 (2013) 231–239.

[21]C.Z. Cai et al., SVM-Prot: Web-based support vector machine software for functional classiﬁcation of a protein from its primary sequence, Nucleic Acids Res. 31 (13) (2003) 3692–3697.

[22]C.Z. Cai et al., Enzyme family classiﬁcation by support vector machines, Proteins 55 (1) (2004) 66–76.

[23]L.Y. Han et al., Prediction of RNA-binding proteins from primary sequence by a support vector machine approach, RNA 10 (3) (2004) 355–368.

[24]Z.R. Li et al., PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 34 (Web Server issue) (2006) W32–W37. [25]M.J. Betts, R.B. Russell, Amino-acid properties and consequences of

substitutions, Bioinformatics Geneticists 2 (2007) 311–342.

[26]X. Zhu et al., One gene, many neuropsychiatric disorders: lessons from Mendelian diseases, Nat. Neurosci. 17 (6) (2014) 773–781.

[27]R.R. Sokal, B.A. Thomson, Population structure inferred by local spatial autocorrelation: an example from an Amerindian tribal population, Am. J. Phys. Anthropol. 129 (1) (2006) 121–131.

[28]Y. Guo et al., Using support vector machine combined with auto covariance to predict protein–protein interactions from protein sequences, Nucleic Acids Res. 36 (9) (2008) 3025–3030.

[29]Z.P. Feng, C.T. Zhang, Prediction of membrane protein types based on the hydrophobic index of amino acids, J. Protein Chem. 19 (4) (2000) 269–275. [30]J.F. Xia, K. Han, D.S. Huang, Sequence-based prediction of protein–protein

interactions by means of rotation forest and autocorrelation descriptor, Protein Pept. Lett. 17 (1) (2010) 137–145.

[31]R. Grantham, Amino acid difference formula to help explain protein evolution, Science 185 (4154) (1974) 862–864.

[32]C. Chothia, The nature of the accessible and buried surfaces in proteins, J. Mol. Biol. 105 (1) (1976) 1–12.

[33]T.P. Hopp, K.R. Woods, Prediction of protein antigenic determinants from amino acid sequences, Proc. Natl. Acad. Sci. USA 78 (6) (1981) 3824–3828.

[34]M. Charton, B.I. Charton, The structural dependence of amino acid hydrophobicity parameters, J. Theor. Biol. 99 (4) (1982) 629–644.

[35]R.M. Sweet, D. Eisenberg, Correlation of sequence hydrophobicities measures similarity in three-dimensional protein structure, J. Mol. Biol. 171 (4) (1983) 479–488.

[36]D. Eisenberg, A.D. McLachlan, Solvation energy in protein folding and binding, Nature 319 (6050) (1986) 199–203.

[37]I. Jolliffe, Principal Component Analysis, in Encyclopedia of Statistics in Behavioral Science, John Wiley & Sons Ltd., 2005.

[38]D.M.J. Tax, R.P.W. Duin, Support vector data description, Mach. Learn. 54 (1) (2004) 45–66.

[39] M. Safran, et al., GeneCards Version 3: The Human Gene Integrator. Database (Oxford), 2010, p. baq020.

[40]V.A. McKusick, Mendelian Inheritance in Man and its online version, OMIM, Am. J. Hum. Genet. 80 (4) (2007) 588–604.

[41]P. Flicek et al., Ensembl 2011, Nucleic Acids Res. 39 (Database Issue) (2011) D800–D806.

[42] D.M.J. Tax, DDtools, the Data Description Toolbox for Matlab, 2007. [43]N.J. Marshall et al., A role for TRPV1 in inﬂuencing the onset of cardiovascular

disease in obesity, Hypertension 61 (1) (2013) 246–252.

[44]K. Acar et al., Soluble platelet glycoprotein V in distinct disease states of pathological thrombopoiesis, J. Natl Med. Assoc. 100 (1) (2008) 86–90. [45]Q. Shi et al., Targeting platelet GPIbalpha transgene expression to human

megakaryocytes and forming a complete complex with endogenous GPIbbeta and GPIX, J. Thromb. Haemost. 2 (11) (2004) 1989–1997.

[46]A. Smagur, J. Szary, S. Szala, Recombinant angioarrestin secreted from mouse melanoma cells inhibits growth of primary tumours, Acta Biochim. Pol. 52 (4) (2005) 875–879.

[47]Y. Xu, Y.J. Liu, Q. Yu, Angiopoietin-3 inhibits pulmonary metastasis by inhibiting tumor angiogenesis, Cancer Res. 64 (17) (2004) 6119–6126. [48]C. Zuccato et al., Loss of huntingtin-mediated BDNF gene transcription in

Huntington’s disease, Science 293 (5529) (2001) 493–498.

[49]Q.R. Chen et al., Increased WSB1 copy number correlates with its over-expression which associates with increased survival in neuroblastoma, Genes Chromosom. Cancer 45 (9) (2006) 856–862.

[50]M.A. Nagai et al., Down-regulation of PHLDA1 gene expression is associated with breast cancer progression, Breast Cancer Res. Treat. 106 (1) (2007) 49–56. [51]V. Palumbo et al., Melusin gene (ITGB1BP2) nucleotide variations study in hypertensive and cardiopathic patients, BMC Med. Genet. 10 (1) (2009) 140. [52]J.-Y. Min et al., A site on the inﬂuenza A virus NS1 protein mediates both

inhibition of PKR activation and temporal regulation of viral RNA synthesis, Virology 363 (1) (2007) 236–243.

[53]Y. Hiasa et al., Protein kinase R is increased and is functional in hepatitis C virus–related hepatocellular carcinoma, Am. J. Gastroenterol. 98 (11) (2003) 2528–2534.

[54]Z.R. Li et al., PROFEAT: a web server for computing structural and physicochemical features of proteins and peptides from amino acid sequence, Nucleic Acids Res. 34 (Web Server issue) (2006) W32–W37.