Title : PREDICTION OF PROTEIN STRUCTURAL CLASSES BY PSEUDO AMINO ACID COMPOSITION USING IMPROVED HARMONY SEARCH RELATIVE REDUCT FEATURE SELECTION AND ROUGH SET CLASSIFICATION ALGORITHMS Author (s) : M. BAGYAMATHI, H. HANNAH INBARANI

(1)

www.ijicse.com 55

PREDICTION OF PROTEIN STRUCTURAL CLASSES BY PSEUDO AMINO ACID COMPOSITION USING IMPROVED HARMONY SEARCH RELATIVE REDUCT

FEATURE SELECTION AND ROUGH SET CLASSIFICATION ALGORITHMS

MRS. M. BAGYAMATHI DR. H. HANNAH INBARANI

Assistant Professor & Head Assistant Professor Dept. of Computer Science & Applications Department of Computer Science Gonzaga College of Arts and Science for Women Periyar University Krishnagiri, Tamil Nadu, India. Salem, Tamil Nadu, India. Email: [email protected] Email: [email protected] Abstract

The paper proposes classification of protein sequences using rough set based feature

selection and classification algorithm. Sequence-derived structural and physiochemical features

have been frequently used for analyzing and predicting structural, functional, expression and

interaction profiles of proteins. The pseudo amino acid (PseAA) composition can represent a

protein sequence in a discrete model without completely losing its sequence-order information,

and hence has been widely applied for improving the prediction quality for various protein

attributes. The general form pseudo-amino acid composition (PseAAC) has been widely used to

represent protein sequences in predicting protein structural and functional attributes. A two

major complexity is the large number of feature vectors and the accurate classification of its

structure. The proposed framework includes feature selection mechanism embedded with

classification technique for the efficient classification of the structure of the protein. In this

study, we employed rough set improved harmony search relative reduct algorithm for selecting

the optimum number of features, which is fed as an input to the rough set classification algorithm

to classify the protein sequence to its class. The performance analysis of the proposed framework

is compared with the Rough Set Improved Harmony Search Quick Reduct framework. The

experimental results revealed high accuracy of 90% from the proposed framework.

Keywords: Protein Structure Classification, Feature Selection, Improved Harmony Search, Relative Reduct, Rough Set Classification.

Introduction

Bioinformatics involves the

development and application of novel

informatics techniques in the genomic

sciences. Proteins play a fundamental role in

all living organisms and are involved in a

variety of molecular functions and

biological processes. Proteins are the most

essential and versatile macromolecules of

life, and the knowledge of their functions is

an essential link in the development of new

(2)

www.ijicse.com 56 development of synthetic biochemical such

as biofuels [23]. Traditionally,

computational prediction methods use

features that are derived from protein

sequence, protein structure or protein

interaction networks predict function [24].

Proteins are composed of one or more

chains of amino acids and show several

levels of structure [17]. In fact, according to

their chain folding pattern, proteins are

usually folded into four structural classes

such as all α, all β, all α + β and all α / β

[16].

Pseudo amino acid composition

(PseAAC) is an algorithm that could convert

a protein sequence into a digital vector that

could be processed by data mining

algorithms. The design of PseAAC

incorporated the sequence order information

to improve the conventional amino acid

compositions. The application of pseudo

amino acid composition is very common,

including almost every branch of

computational proteomics [21]. In this

study, the features are extracted from protein

primary sequence, based on amino acid

composition and K-mer patterns, or K-grams

or K-tuples [15]. Different classification

techniques have been used to classify the

protein sequence into its particular classes.

In the analysis of protein sequences and

structures based on their critical feature set.

High dimensional dataset degrades the

classification quality. The optimum number

of features helps to classify the sequence

into its family.

Rough set theory [14], provides a

mathematical tool which is used to deal with

vague and imprecise concept. A rough set

uses the concepts of lower and upper

approximations to classify the protein

sequences into disjoint categories. It can

also be used for feature selection and

knowledge discovery. It helps us to find out

the minimal attribute sets called „reducts‟ to

classify objects without deterioration of

classification quality.

In this study, both the feature selection

and classification technique is applied as a

single framework to predict the class of the

protein. The Rough Set Improved Harmony

Search Relative Reduct algorithm is

employed to reveal the optimum number of

features [13] and the Rough Set

Classification [12] algorithm is used to

classify the sequences precisely to its family

of classes.

The rest of the paper is structured as

Sections 2 to 5. Section 2 describes a

Review of the Literature; Section 3 explains

the Research Motivation; Section 4

(3)

www.ijicse.com 57 covers dataset description, feature selection

and classification algorithms and the

experimental analysis with the results and

discussion were described in Section 5 and

the chapter concludes with a discussion on

the interpretation and highlights the

possibility of future work in this area.

Review of Literature

In the past, many computational

intelligence techniques have been proposed

in the literature to provide deep insights into

protein sequence data. In this section, we

provide a concise overview of existing

feature selection and classification methods

involving different representations for the

purpose of function prediction and

classification. Thus, there is a need of a

highly accurate and efficient system that can

classify the newly discovered protein

sequences into existing super families in a

very short period of time [10].

References Technique Employed Operational Principle

Iqbal, MJ, et.al [18]

Decision tree, Naïve Bayes,

Neural Network, Random Forest

and SVM

Classification

Saha and Chaki [9] Neural Network, Fuzzy

ARTMAP, Rough Set Classifier Classification

Datta et al.[4] Principal Component Analysis

and Nearest Neighbor Classifier

Feature Selection and

Classification

Ahmad et al.[6] AAC and Growing

Self-Organizing Map

Feature Extraction and

Clustering

Bagyamathi and Inbarani

[17 ]

Rough Set Based Improved

Harmony Search algorithm

classification

Inbarani et al. [22] Rough Set based Particle Swarm

Optimization Algorithm

Classification

Inbarani et. al [28]

Hybrid Rough Set Improved

Harmony Search Quick Reduct

algorithm

classification

Pedergnana M et al. [25] Genetic Algorithm and Random Forest Classifier

Feature Extraction, Feature

(4)

www.ijicse.com 58

Jeong et.al [27] Random Forest Classification

Mansoori et al. [29] Fuzzy Rule based Classifier Classification

[26]

Harmony Search Relative

Reduct

classification

Azar et al. [30] Unsupervised Particle Swarm

Optimization (PSO) based

Relative Reduct.

Feature Selection

[34 ]

Harmony Search Quick Reduct

and Rough Set Classification

framework

classification

Research Motivation

Due to the growth of molecular

biology technologies and techniques, more

large-scale biological data sets are becoming

available. Identifying biologically useful

information from these data sets has become

a significant challenge. Computational

biology aims to address this challenge [31].

Many problems in computational biology,

e.g., protein function prediction, sub cellular

localization prediction, protein-protein

interaction, protein secondary structure

prediction, etc., can be formulated as

sequence classification tasks [32], where the

amino acid sequence of a protein is used to

classify the protein in functional and

localization classes.

In this study, the application of

feature selection techniques is focussed,

which do not alter the original representation

of the variables, but merely select a subset

of them. Thus, they preserve the original

semantics of variables, hence offering the

advantage of interpretability by a domain

expert [11]. While the feature selection can

be applied to both supervised and

unsupervised learning, this chapter

concentrates on the problem of supervised

learning (classification), where the class

labels are known in advance. In recent

decade, Population-based

algorithms have been attracting an increased

attention due to their powerful search

capabilities.

From the comparative analysis of the

(5)

www.ijicse.com 59 for only a particular kind of data set, but

may not perform well on a high dimensional

dataset of protein sequences. In protein

analysis, especially in protein sequence

classification, selection of features is most

important. Our main motivation in this study

is to design a classiﬁcation model that uses a

novel feature selection and classification

algorithms in a single framework in order to

improve the classiﬁcation accuracy.

Methods and Materials

The framework shown in Figure 1 has

been proposed for an efficient and highly

accurate classification of protein sequences

into super families. The framework is

divided into two phases: Feature Selection

and Classification. Both the techniques are

hybridized with rough set theory to deal the

problem of high dimensionality and

uncertainty in classifying the protein

sequences [34].

In the first phase of this study, the

protein primary sequences are collected

from Protein Data Bank (PDB) in fasta

format [16]. The fasta sequence file is used

as input data to the PseAAC-builder [5], that

generates the protein feature space using

amino acid composition and amino acid K-

tuples or K- mer patterns. The real valued

data generated by the PseAAC server to be

discretized, since the rough set theory best in

dealing with discrete values [17]. The

discretized dataset is the feature vector

extracted from the PseAAC server. In the

next step, the feature vector is fed as an

input to the Rough Set Improved Harmony

Search Relative Reduct (RSIHSRR) feature

selection algorithm. The RSIHSQR

algorithm is used to select the feature subset.

In the second phase, the feature subset

revealed by the RSIHSRR algorithm are

divided into training and testing based on

10-fold cross validation. The training set is

subjected to classification process by Rough

Set Classification (RSC) algorithm to

generate the decision rules. In the next step,

the generated rules are evaluated with the

test data to validate the RSC algorithm. The

various classification measures are applied

to evaluate the performance of the proposed

algorithm and also compared with various

classification algorithms for the accurate

prediction of protein structure.

Data Source

In this work, the protein primary

sequence dataset is derived from PDB

(http://www.rcsb.org/pdb) on SCOP

classification. The Structural Classification

of Proteins (SCOP) database is a largely

manual classification of protein structural

domains based on similarities of their

(6)

www.ijicse.com 60 The data set consists of sequences with 7623

of all α, 10672 of all β, 11048 of all α + β

and 11961 of all α / β [16]. Among one

thousand sequences with combinations of all

α , all β, all α + β and all α / β, each 250

sequences are taken for this study [17].

Feature Extraction

Protein sequences are consecutive

amino acid residues, and it is represented

with an alphabet A of size |A| = 20. Many

feature extraction methods have been

developed in the past several years.

Typically, these methods can be classified

into two categories. One is based solely on

amino acid composition. The other one is an

extension of the atomic length from only

one amino acid to a K amino acid tuple,

where K is an integer and larger than one,

which can also be referred as „Ktuple‟, such

as 2-tuple in [33].

In the first step of feature generation,

20 amino acid features are adopted as our

initial representative features, which is

simple and effective. Along with the 20

features, K-tuple features are introduced.

The high computational cost of K-tuple

prediction caused by large feature amount,

the Rough Set based feature selection

methods are introduced to reveal most

relevant K-tuples and eliminating the

irrelevant ones. In this study, the supervised

rough set based feature selection algorithms

such as Quick Reduct, PSO Quick Reduct,

and Improved Harmony Search Quick

Reduct were applied and the protein

sequence feature vectors are classified to

their respective structural classes [17].

Many researchers used one method

of feature extraction among the following

two different strategies. (a) Without

dimension reduction: the prediction can be

based on full K-tuple space without any

dimension reduction, and the maximum

value of K is set to 5. As a result, at most

205 = 3.2×106 features are extracted. (b)

With the dimension reduction: when the

proteins are represented in a high

dimensional space, the occurrences of many

K-tuples will be very scarce. Some K-tuples

just occur only once or even never occur in

the dataset. Thus, lots of them must be

irrelevant to the protein classification

process since they are too sparse. Motivated

by this phenomenon, in this chapter, the

feature selection techniques are adopted to

filter the K-tuple feature set. Therefore, only

the relevant tuples are selected as best subset

in order to reveal the better classification

result [17].

Feature Selection

An Improved HS algorithm was

(7)

www.ijicse.com 61 proposed algorithm, HMCR, PAR, bw

parameters are considered, but PAR and bw

are very important parameters in fine-tuning

of optimized solution vectors. The

traditional HS algorithm uses a fixed value

for both PAR and bw. In the HS method,

PAR and bw values are adjusted and cannot

be changed during new generations [2]. The

main drawback of the HS method is that the

numbers of iterations increases to find an

optimal solution. To improve the

performance of the HS algorithm and to

eliminate the drawbacks that lies with fixed

values of PAR and bw, IHS algorithm uses

variables PAR and bw in improvisation step

(Step 3) [3]. The pitch adjustment rate

parameter causes a musician to select a

value neighboring to its current choice. To

achieve better results, we define lower limit

and upper limit of the PAR and bw. PAR

and bw change dynamically with generation

number and expressed as follows:

PAR(gn) = PARmin + -

* gn (12)

Where, PAR(gn)is the Pitch

Adjusting Rate for each generation,

PARminand PARmaxis the minimum and

maximum Pitch Adjusting Rate respectively,

NIis considered as Number of

Improvisations and gnis the Generation

Number.

bw(gn) = bwmax* exp(c *gn); c = ln

[(bwmin / bwmax)] / NI (13)

Where, bw(gn)is considered as the

bandwidth for each generation bwminand

bwmaxare minimum and maximum

bandwidthrepectively.

The improved harmony search

reveals best features which yield high

classification accuracy when hybridized

with rough set quick reduct algorithm. The

Pseudocode of the RSIHSRR algorithm is

given in [7].

Classification

Classification is to find the best class

that is closest to the classified pattern. The

goal of classifying the selected features is, to

map every pattern of readings into an output

class, and the percent of correct mapping

forms the accuracy of the classifier.

We adopted rough set based

classification to predict the structure of the

protein sequences. The main advantages for

employing rough set theory in this study are

the following: RST does not require any

primary or additional information about the

dataset; RST can deliver a valuable analysis,

even with missing values; and RST can

(8)

www.ijicse.com 62 both quantitative and qualitative data [19].

The rough set classifier generates decision

rule based on equivalence relations of the

attribute set.

Rule generation is a critical task in

any learning system .In this approach, the

lower and upper approximations of the

training data set based on decision class are

constructed, based on the equivalence

relation [1]. In the next step, deterministic

rules are generated based on lower

approximation and non-deterministic rules

are generated based on upper approximation

[12]. The rules framed by the RSC

algorithm are validated with the test data set.

Experimental Analysis

For experimental studies, 1000

sequences were extracted from the SCOP

classification of Protein Data Bank. A

sequence of experiments was conducted to

show the efficacy of the proposed feature

selection algorithm. The proposed Improved

Harmony Search hybridized with Rough Set

Improved Harmony Search Relative Reduct

feature selection algorithm is implemented

and compared with the existing Rough Set

Improved Harmony Search Quick Reduct

algorithm, which produced best and minimal

feature set for the classification process [17].

Table 2. Classification results of the RSC

with RSIHSQR and RSIHSRR framework

Classification

Overall Accuracy (%)

When k=1 When

k=2 RSC with

RSIHSQR 93.2 93.7

RSC with

RSIHSRR 94.5 95.1

Results and Discussion

In this study, the feature subset

selected by the RSIHSRR algorithm is

classified with Rough Set Classification

(RSC) algorithm. Accurate classification of

protein classes is the most critical part of the

sequence analysis. The result in Table 2

depicts the overall classification accuracy of

the existing and proposed framework. The

accuracy obtained from the proposed RSC

with RSIHSRR algorithm is compared with

RSC with RSIHSQR algorithm. In the Fig.

1, the overall accuracy of proposed

framework discloses the better classification

result when it is applied to the feature subset

revealed by the RSIHSRR algorithm. And

(9)

www.ijicse.com 63 procedure and the accomplishment rates of

test results, the proposed method can perk

up the prediction excellence of protein

structure class, and can afford as a

functional harmonizing framework.

Fig 1. Classification accuracy of RSC with

RSIHSQR and RSIHSRR algorithm (when

K=1, K=2)

Conclusion

The proposed framework well

predicts protein sequences to its family of

class or sub class based on its structure,

which is an effective combination of feature

extraction, feature selection and

classification methods. Feature extraction

from the chain of amino acid sequences

produces a huge number of feature vectors,

so the performance of the classification

model could decrease; hence, feature

selection method is essential. In this paper,

we applied RSIHSRR feature selection

algorithm to select or eliminate a minimal

subset of relevant and non-redundant

features. The proposed classification

approach is based on rough set is a novel

way to achieve protein sequence

classification. A comparison between the

obtained results of rough set with the

existing RSIHSQR and Rough Set classifier

algorithm has been employed. The Rough

set classification showed a higher overall

accuracy rates and it generates more

compact rules, when it is applied to the

reduct set from RSIHSRR feature selection

algorithm. Hence the analysis section clearly

proved the effectiveness of rough set based

approaches for Protein Sequence

classification.

References

[1] Inbarani, H.H., Kumar, S.S., Azar, A.T. and Hassanien, A.E., (2014), November. Soft rough sets for heart valve disease diagnosis. In International

Conference on Advanced Machine Learning

Technologies and Applications(pp. 347-356).

Springer, Cham.

[2] Al-Betar M, Khader A, and Liao I, (2010), “A harmony search with multi-pitch adjusting rate for the university course timetabling”, In: Geem, Z.W.

(ed.) Recent Advances in Harmony Search

Algorithm. SCI, vol. 270, pp. 147–161. Springer, Heidelberg.

[3] Chakraborty P, Roy GG, Das S, Jain D and Abraham A, (2009), “An improved harmony search algorithm with differential mutation operator”, Fundamenta Informaticae, Vol. 95, No. 4, pp. 1–26. [4] Datta A, Talukdar V, Konar A, and Jain LC, (2009), "A neural network based approach for protein structural class prediction," Journal of Intelligent and Fuzzy Systems, vol. 20, pp. 61-71.

90 92 94 96

RSIHSQR RSIHRRR

1-tuple

(10)

www.ijicse.com 64

[5] Du P, Wang X, Xu C, and Gao Y, (2012), “PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou‟s

pseudo-amino acid compositions”, Analytical

Biochemistry Vol. 425, No. 2, pp. 117–119.

[6] Ahmad N, Alahakoon D and Chau R, (2008), "Classification of protein sequences using the growing Self-Organizing map," Colombo, pp. 167-172.

[7] Bagyamathi M and Inbarani HH, (2015), “Feature Selection using Relative Reduct hybridized with Improved Harmony Search for Protein Sequence Classification”, International Journal of Trend in Research and Development (IJTRD), ISSN:2394-9333, Special Issue | PCIT-15.

[8] Mahdavi M, Fesanghary M and Damangir E, (2007), “An improved harmony search algorithm for

solving optimization problems”, Applied

Mathematics and Computation, Vol. 188, No. 2, pp. 1567– 1579.

[9] Saha S and Chaki R, (2013), “A brief review of data mining application involving protein sequence classification”, International Journal of Database Management Systems, Vol. 4, pp. 469–477.

[10] Iqbal M.J, Faye E, Samir B.B and Md Said A, (2014) “Efficient feature selection and classification of protein sequence data in bioinformatics”, The Scientific World Journal, pp. 1–12.

[11] Saeys, Y., Inza, I.N., Larrañaga, P., (2007), A

review of feature selection techniques in

bioinformatics. Bioinformatics 23(19), 507–2517. [12]Hassanien AE, Jafar M.H. ALI (2004), “Rough Set Approach for Generation of Classification Rules of Breast Cancer Data”, Informatica, Vol. 15, No. 1, pp. 23-28.

[13] Inbarani HH, Bagyamathi M and Azar AT (2015), “A novel hybrid feature selection method based on rough set and improved harmony search”, Neural Computing and Applications, Springer London, pp. 1-22.

[14] Pawlak, Z., (1993), Rough Sets: Present State and The Future. Foundations of Computing and Decision Sciences 18(3-4), 157–166.

[15] Chandran, C.P., (2008), Feature Selection from Protein Primary Sequence Database using Enhanced Quick Reduct Fuzzy-Rough Set. In: Proceedings of International Conference on Granular Computing, GrC 2008, Hangzhou, China, August 26-28, pp. 111– 114, doi:10.1109/GRC.2008.4664758.

[16] Cao, Y., Liu, S., Zhang, L., Qin, J., Wang, J., Tang, K, (2006), Prediction of protein structural class with Rough Sets. BMC Bioinformatics 7(1), doi:10.1186/1471-2105-7-20.

[17] M. Bagyamathi and H. Hannah Inbarani, (2015), “A Novel Hybridized Rough Set and Improved Harmony Search based Feature Selection for Protein

Sequence Classification”, Big Data in Complex Systems: Challenges and Opportunities, Studies in Big Data series, Vol. 9, 2015, pp. 173-204. Springer-Verlag International, Switzerland.

[18] Iqbal M.J, Faye E and Samir B.B, (2014) “Data mining of protein sequences with amino acid position-based feature encoding technique”, Kuala Lumpur. Lecture Notes in Electrical Engineering, Vol. 285, pp. 119–126.

[19] Udhaya kumar S and Inbarani HH, (2016), “PSO-based feature selection and neighborhood rough set-based classification for BCI multiclass motor imagery task”, Neural Computing and Applications, Springer, DOI 10.1007/s00521-016-2236-5.

[20] Chinnasamy A, Sung W.K and Mittal A, (2005), “Protein Structure and Fold Prediction Using

Tree-Augmented Bayesian Classifier”, Journal of

Bioinformatics and Computational Biology, Vol. 3, No. 4, pp. 803–819.

[21] Du, P., Wang, X., Xu, C., Gao, Y, (2012), “PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou‟s

pseudo-amino acid compositions”, Analytical

Biochemistry 425(2), 117–119.

[22] Inbarani, H.H., Azar, A.T., Jothi, G, (2014), “Supervised hybrid feature selection based on PSO and rough sets for medical diagnosis”, Computer Methods and Programs in Biomedicine 113(1), 175– 185.

[23] Nemati, S., Basiri, M.E., Ghasem-Aghaee, N., Aghdam, M.H, (2009), “A novel ACO–GA hybrid algorithm for feature selection in protein function prediction”, Expert Systems with Applications 36(10), 12086–12094.

[24] Rentzsch, R., Orengo, C, (2009), “Protein function prediction-the power of multiplicity”, Trends in Biotechnology 27(4), 210–219.

[25] Pedergnana M, Marpu PR, Mura MD, Benediktsson JA and Bruzzone L (2012). “A Novel supervised feature selection technique based on Genetic Algorithms”, In Proceedings of IEEE International Geoscience and Remote Sensing Symposium, 60-63.

[26] M. Bagyamathi and H. Hannah Inbarani, (2015), “Feature Selection using Improved Harmony Search Hybridized with Relative Reduct for Medical Data Classification”, International Journal of Applied Engineering Research (IJAER), 10(20): 19476-19480.

(11)

www.ijicse.com 65

[28] Inbarani HH, Bagyamathi M and Azar AT, (2015), “A Novel Hybrid Feature Selection Method based on Rough Set and Improved Harmony Search”, Neural Computing and Applications, Springer–

Verlag International, Switzerland, doi:

10.1007/s00521-015-1840-0.

[29] Mansoori, E. G., M. J. ZolghadrI, and S. D. Katebi. (2009), “Protein superfamily classification using fuzzy rule-based classifier”. IEEE Transactions on Nanobioscience, 8(1): 92–99.

[30] Azar, A.T., Banu, P.K.N., Inbarani, H.H, (2013), “PSORR - An Unsupervised Feature Selection Technique for Fetal Heart Rate”. In: 5th International Conference on Modelling, Identification and Control (ICMIC 2013), Egypt, August 31-September 1-2, pp. 60–65.

[31] Wei, X.(2010), “Computational approaches for biological data analysis. Doctoral Dissertation”, Tufts Uiversity Medford, MA, USA, ISBN: 978-1-124-21198-5

[32] Wong, A., Shatkay H, (2013), “Protein Function Prediction using Text-based Features extracted from the Biomedical Literature: The CAFA Challenge”. BMC Bioinformatics 14(3), S14, doi:10.1186/1471-2105-14-S3-S14.

[33] Park, K.J., Kanehisa, M, (2003) “Prediction of protein subcellular locations by support vector machines using compositions of amino acids and amino acid pairs”, Bioinformatics 19(13), 1656– 1663.