Fine Tuning Data Mining Algorithm for an Efficient Classification of E Coli




Abstract: E. coli was one of the first organisms to have its genome sequenced, and classification within the diarrheagenic E. coli (DEC) pathotypes has epidemiologic and clinical implications for managing diarrheal diseases. In many developing countries E. coli is a leading cause of diarrhea in children, with transmission taking place via food and water. Based on their pathogenic phenotype and the diseases they cause, strains can be classified into 6 groups. However, our awareness of the spectrum of diseases and syndromes that they cause is quite limited. Also, because infected patients cannot be readily identified, there are many complexities in defining the modes of acquisition, prevention and treatment strategies, and in estimating the burden of infectious sequelae. These infections create many challenges, and no progress will take place until the diagnostic potential for these agents is improved. Identifying E. coli isolates that co-express LA reiterates the difficulty of assigning bacteria to groups on the basis of their adherence phenotype or genotype. Analysis of E. coli with molecular methods demonstrates that strains carry more characteristics of typical EPEC and lack the AggR regulon; we therefore propose a novel classification approach for classifying E. coli in order to recognize pathogens. In addition, the ability to simultaneously induce attaching and effacing lesions and biofilm production may enhance the potential of the strains to cause diarrhea and prolong bacterial residence in the intestines, thus worsening malnutrition in the patients.

Keywords: E. coli, classification, BayesNet, Naive Bayes, RBFNetwork, SMO.

I. INTRODUCTION

The data, provided by the Institute of Molecular and Cell Biology, Osaka University, Japan, is used to predict the protein localization sites of E. coli bacteria. The data set consists of 336 instances of E. coli and 8 attributes. A novel model is developed from the data set using the data mining tool “Weka” to predict the protein localization sites of E. coli bacteria. The problem is suitable for a data mining solution because historical data is readily available (336 instances) and the problem is not well understood. The problem can therefore be characterized as an input-output relationship in which the inputs are causally related to the output.

Revised Manuscript Received on November 05, 2019.

A. Rama, Assistant Professor (Sg), Department of Computer Science and Engineering, Saveetha School of Engineering, SIMATS, Chennai, India

Email: [email protected]

A. Gayathri, Associate Professor, Department of Computer Science and Engineering, Saveetha School of Engineering, SIMATS, Chennai, India

S. Christy, Assistant Professor (Sg), Department of Information Technology, Saveetha School of Engineering, SIMATS, Chennai, India

II. E-COLI DATASET

A. Input attribute

The input attributes are defined as follows: sequence name, the accession number for the SWISS-PROT database; mcg, McGeoch's method for signal sequence recognition; gvh, von Heijne's method for signal sequence recognition; lip, von Heijne's Signal Peptidase II consensus sequence score (a binary attribute); chg, the presence of charge on the N-terminus of predicted lipoproteins (a binary attribute); aac, the score of discriminant analysis of the amino acid content of outer membrane and periplasmic proteins; alm1, the score of the ALOM membrane spanning region prediction program; and alm2, the score of the ALOM program after excluding putative cleavable signal regions from the sequence.

Table – I : E-Coli Data Set

B. Output Attribute

The data set represents a classification task because the output is nominal. The output attribute takes one of the following values:

cp (cytoplasm)

im (inner membrane without signal sequence)

pp (periplasm)

imU (inner membrane, uncleavable signal sequence)

om (outer membrane)

omL (outer membrane lipoprotein)

imL (inner membrane lipoprotein)

imS (inner membrane, cleavable signal sequence)

These values are declared in the arff file so that Weka treats them as class labels.
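The input and output attributes described above can be pictured as one record per protein. The following is a minimal sketch, assuming a whitespace-separated text layout with the sequence name first and the class label last; the names `EcoliRecord`, `parse_record` and `LOCALIZATION_SITES` are hypothetical, and the sample row in the usage note is illustrative only.

```python
from dataclasses import dataclass

# Class labels as listed in the Output Attribute section.
LOCALIZATION_SITES = {
    "cp": "cytoplasm",
    "im": "inner membrane without signal sequence",
    "pp": "periplasm",
    "imU": "inner membrane, uncleavable signal sequence",
    "om": "outer membrane",
    "omL": "outer membrane lipoprotein",
    "imL": "inner membrane lipoprotein",
    "imS": "inner membrane, cleavable signal sequence",
}


@dataclass
class EcoliRecord:
    """One instance: a sequence name, seven scores and a class label."""
    sequence_name: str
    mcg: float
    gvh: float
    lip: float
    chg: float
    aac: float
    alm1: float
    alm2: float
    site: str


def parse_record(line: str) -> EcoliRecord:
    """Parse one whitespace-separated row (assumed layout, see lead-in)."""
    fields = line.split()
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    name, *scores, site = fields
    if site not in LOCALIZATION_SITES:
        raise ValueError(f"unknown localization site: {site}")
    return EcoliRecord(name, *map(float, scores), site)
```

For example, `parse_record("AAT_ECOLI 0.49 0.29 0.48 0.50 0.56 0.24 0.35 cp")` yields a record whose `site` maps to "cytoplasm" in `LOCALIZATION_SITES`.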

III. EVALUATION METHOD

Here, 10-fold cross-validation is chosen as the model evaluation method because it carries out multiple train/test trials, which are normally more reliable than a single train/test trial. After the model is built, its accuracy is re-evaluated using the E. coli data set, supplied as “out-of-sample” data, to produce the performance results. The primary performance measure is the percentage of “Correctly Classified Instances”, which shows the percentage of instances classified correctly by the model.
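The idea of repeated train/test trials can be sketched in a few lines. This is a simplified illustration, not Weka's implementation: it shuffles the instances once and deals them into k folds, whereas Weka additionally stratifies the folds by class. The function names and the `fit`/`predict` callables are hypothetical placeholders for any classifier.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train, test) index lists; each instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices round-robin
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(X, y, fit, predict, k=10, seed=1):
    """Percentage of instances classified correctly across all k test folds."""
    correct = 0
    for train, test in k_fold_indices(len(y), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return 100.0 * correct / len(y)
```

With 336 instances and k = 10, every instance is used for testing once and for training nine times, so the reported percentage is computed over all 336 predictions.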


A.Rama, A.Gayathri, S.Christy

Data Set: E-Coli dataset

Input: Sequence Name, Mcg, Gvh, Lip, Chg, Aac, alm1, alm2

Output: localization site (cp, im, pp, imU, om, omL, imL, imS)


Published By:

Blue Eyes Intelligence Engineering & Sciences Publication

Retrieval Number: A3936119119/2019©BEIESP

A well-performing model generally has a high percentage of “Correctly Classified Instances”. When models have the same percentage of “Correctly Classified Instances”, we also use the “Mean Absolute Error” as a second performance measure; a well-performing model generally has a low “Mean Absolute Error”.
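The two performance measures can be written out explicitly. The sketch below assumes that, for a nominal class, the “Mean Absolute Error” is averaged over the per-class probability estimates of each instance; this is our reading of how Weka reports the figure and should be treated as an assumption. Function names are hypothetical.

```python
def correctly_classified_pct(actual, predicted):
    """Percentage of instances whose predicted label equals the actual one."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * hits / len(actual)


def mean_absolute_error(actual, distributions, classes):
    """Average absolute gap between predicted class probabilities and the
    0/1 indicator of the true class, over all classes and instances."""
    total = 0.0
    for label, dist in zip(actual, distributions):
        per_class = sum(
            abs(dist.get(c, 0.0) - (1.0 if c == label else 0.0))
            for c in classes
        )
        total += per_class / len(classes)
    return total / len(actual)
```

Under this assumption, a classifier that outputs hard 0/1 style or poorly calibrated probabilities can show a comparatively high “Mean Absolute Error” even when its percentage of correctly classified instances is high, which is one reason to look at both measures.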

IV. SEARCH FOR BEST MODEL

We checked whether any attribute in the E. coli data set has missing values; none were found. We nevertheless applied the filter weka.filters.unsupervised.attribute.ReplaceMissingValues, which replaces each missing value with the most frequent value, so that the data set used to build the models is guaranteed to contain no missing values.
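The replacement rule described above can be illustrated as follows. This is a minimal sketch of most-frequent-value imputation for a single column, not the Weka filter itself (Weka's documentation describes the filter as using modes for nominal attributes and means for numeric ones). The function name is hypothetical; "?" is the marker ARFF files use for a missing value.

```python
from collections import Counter


def replace_missing_with_mode(column, missing="?"):
    """Return a copy of the column with missing entries set to the mode."""
    present = [value for value in column if value != missing]
    if not present:
        return list(column)  # nothing to learn the mode from
    mode = Counter(present).most_common(1)[0][0]
    return [mode if value == missing else value for value in column]
```

When a column has no missing entries, as reported for this data set, the function returns the column unchanged.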

V. SELECTING AN ALGORITHM

Using the pre-processed data, we tried different algorithms from each classifier family with their default parameter settings and used the percentage of “Correctly Classified Instances” and the “Mean Absolute Error” as the basis for assessing the models. “Mean Absolute Error” is used as a second performance measure because the percentage of “Correctly Classified Instances” could be the same for several algorithms.

Table II - Algorithms and results from Weka

Algorithm                     Correctly Classified Instances   Mean Absolute Error

weka.classifiers.bayes
  BayesNet                    82.3881%                         0.0543
  NaiveBayes                  85.3731%                         0.0445
  NaiveBayesUpdateable        85.3730%                         0.0445

weka.classifiers.functions
  Logistic                    85.6716%                         0.048
  RBFNetwork                  82.9851%                         0.0495
  SMO                         85.9701%                         0.1896

weka.classifiers.lazy
  IB1                         80.2985%                         0.0493
  KStar                       80.8955%                         0.0554
  LWL                         64.4776%                         0.1212

weka.classifiers.meta
  Decorate                    86.2687%                         0.1546
  FilteredClassifier          78.806%                          0.0634
  LogitBoost                  83.5821%                         0.05

weka.classifiers.misc
  HyperPipes                  70.7463%                         0.2036
  VFI                         69.5522%                         0.1862

weka.classifiers.trees
  J48graft                    84.7761%                         0.0483
  BFTree                      42.3881%                         0.1822
  RandomTree                  62.0896%                         0.1293

weka.classifiers.rules
  DecisionTable               76.7164%                         0.1062
  DTNB                        81.4925%                         0.0597
  PART                        83.8806%                         0.0482

Taking the highest percentage of instances classified correctly, and the lowest mean absolute error when that percentage is the same, across the 7 classifier families, we chose the best 4 algorithms to build different models and evaluate the best model. They are:

1. “Decorate” with percentage “Correctly Classified Instances” of 86.2687% from the meta classifiers.

2. “SMO” with percentage “Correctly Classified Instances” of 85.9701% from the functions classifiers.

3. “Logistic” with percentage “Correctly Classified Instances” of 85.6716% from the functions classifiers.

4. “NaiveBayes” with percentage “Correctly Classified Instances” of 85.3731% from the bayes classifiers.
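The selection rule (highest percentage first, lowest mean absolute error as tie-breaker) can be reproduced from the figures in Table II. The sketch below is illustrative; the dictionary layout and the function name `rank_models` are our own choices.

```python
# (Correctly Classified Instances in %, Mean Absolute Error), from Table II.
RESULTS = {
    "BayesNet": (82.3881, 0.0543),
    "NaiveBayes": (85.3731, 0.0445),
    "NaiveBayesUpdateable": (85.3730, 0.0445),
    "Logistic": (85.6716, 0.048),
    "RBFNetwork": (82.9851, 0.0495),
    "SMO": (85.9701, 0.1896),
    "IB1": (80.2985, 0.0493),
    "KStar": (80.8955, 0.0554),
    "LWL": (64.4776, 0.1212),
    "Decorate": (86.2687, 0.1546),
    "FilteredClassifier": (78.806, 0.0634),
    "LogitBoost": (83.5821, 0.05),
    "HyperPipes": (70.7463, 0.2036),
    "VFI": (69.5522, 0.1862),
    "J48graft": (84.7761, 0.0483),
    "BFTree": (42.3881, 0.1822),
    "RandomTree": (62.0896, 0.1293),
    "DecisionTable": (76.7164, 0.1062),
    "DTNB": (81.4925, 0.0597),
    "PART": (83.8806, 0.0482),
}


def rank_models(results, top=4):
    """Sort by accuracy (descending), then by mean absolute error (ascending)."""
    ordered = sorted(results, key=lambda name: (-results[name][0], results[name][1]))
    return ordered[:top]


print(rank_models(RESULTS))  # ['Decorate', 'SMO', 'Logistic', 'NaiveBayes']
```

Sorting on a tuple keeps the tie-breaking rule explicit: the second element only matters when two algorithms share the same percentage.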

VI. SELECTING ALGORITHM TRAINING PARAMETERS

We now take “Decorate”, one of the best algorithms identified earlier, and explore its training parameters. We then re-train and re-test the model using the E. coli data set to compare the percentage of “Correctly Classified Instances” and the “Mean Absolute Error” for each parameter setting.

Table III - Algorithm and Evaluation Method

Algorithm and Evaluation Method: weka.classifiers.meta.Decorate

Training Parameter (Classifiers)   Correctly Classified Instances   Mean Absolute Error
FT                                 80.597%                          0.0974
J48graft                           85.0746%                         0.1526
LADTree                            85.0746%                         0.0698
RandomForest                       67.7612%                         0.1844
SimpleCart                         62.9851%                         0.1891

This algorithm has many training parameters. For simplicity and efficiency, we only change the classifier used in the training parameter, as changing the values of other parameters does not yield a significant improvement. Other parameters are left at their default values (artificialSize = 1.0, debug = False, desiredSize = 15, numIterations = 50, seed = 1). The parameters of the classifier set in the training parameter are also left at their defaults. From the results above, with training parameter classifiers = “J48graft”, the result is the best, with Correctly Classified Instances of 85.0746%, and it is better than the default settings.

VII. DATA PREPROCESSING

(A) Reducing input size

Before attribute selection, an attribute evaluator and a search method must be chosen. We selected the following in Weka:

Table IV – Attribute Evaluator

Attribute Evaluator: ClassifierSubsetEval (Classifier: meta.Decorate with training parameter: classifiers = J48graft and the others in their default)

Search Method: BestFirst (Direction: Forward)

Evaluation Mode: 10-fold cross-validation (Folds = 10, Seed = 1)
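A wrapper-style forward search of the kind configured above can be sketched as follows. This is a deliberately simplified greedy variant: Weka's BestFirst search also keeps a list of candidate subsets and can backtrack, which is omitted here. The `score` callable stands in for the classifier-based subset evaluation and is a hypothetical placeholder.

```python
def forward_selection(attributes, score):
    """Greedily add the attribute that most improves score(subset).

    `score` takes a list of attribute names and returns a number where
    higher is better (for example, cross-validated accuracy).
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidates = [(score(selected + [attr]), attr) for attr in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score <= best:
            break  # no single addition improves the current subset
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected, best
```

Running the search once per cross-validation fold gives, for each attribute, the number of folds in which it was selected, which is the quantity reported per attribute in the tables of this section.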

Attribute and Number of Folds (%)


Results for different cut-off numbers of folds are summarized as follows:

Table V – Algorithm and Training parameter

Algorithm: weka.classifiers.meta.Decorate

Training parameter: classifier = J48graft (Other parameters in default)

Rule:                     Attribute Selection:                                          Result from Weka:
Cut-off Number of Folds   Removed Attribute                                             Correctly Classified Instances, Mean Absolute Error
2                         1. AAT_E COLI, 4. 0.48, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35     74.3284%, 0.1647
3                         1. AAT_E COLI, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35              75.8209%, 0.1589
4                         1. AAT_E COLI, 6. 0.56, 7. 0.24, 8. 0.35                      75.5224%, 0.1683


Table VI – Reducing input size

Attribute Evaluator: ClassifierSubsetEval (Classifier: meta.Decorate with training parameter: classifiers = J48graft and the others in their default)

Search Method: BestFirst (Direction: Forward)

Evaluation Mode: 10-fold cross-validation (Folds = 10, Seed = 1)

Table VII – Algorithm Performance

Algorithm & Evaluation: weka.classifiers.functions.SMO

Training parameter (filterType)   Correctly Classified Instances   Mean Absolute Error
Normalize training data           85.9701%                         0.1896
Standardize training data         72.5373%                         0.1925

(B) Iteration for algorithm “weka.classifiers.functions.SMO”

Selecting algorithm training parameters:

This algorithm has many parameters. For simplicity, only the parameter “filterType” is altered, as changing the values of other parameters does not yield a significant improvement. Other parameters are left at their default values (buildLogisticModels = False, c = 1.0, checksTurnedOff = False, debug = False, epsilon = 1.0E-12, numFolds = -1, randomSeed = 1, toleranceParameter = 0.001). From the results above, with training parameter filterType = No normalization/standardization, the result is the best, with Correctly Classified Instances of 86.5672%, and it is better than the default settings.

Table VIII - Based on the attribute list from Weka

Attribute Evaluator: ClassifierSubsetEval (Classifier: functions.SMO with training parameters: filterType = No normalization/standardization and the others in their default)

Search Method: BestFirst (Direction: Forward)

Evaluation Mode

Attribute and Number of Folds (%): 1. AAT_E COLI – 10(100%), 2. 0.49 – 10(100%), 3. 0.29 – 10(100%), 4. 0.48 – 10(100%), 5. 0.5 – 0(0%), 6. 0.56 – 2(20%), 7. 0.24 – 10(100%), 8. 0.35 – 10(100%)

Table – IX Reducing input size

From the above results, the highest percentage of “Correctly Classified Instances” (72.5373%) and the lowest “Mean Absolute Error” (0.191) are obtained when only 2 attributes are removed. The “Correctly Classified Instances” results for cut-off number of folds = 3, cut-off number of folds = 4 and cut-off number of folds = 5 are the same. We therefore chose to remove 8 attributes, as the remaining attributes are important to the model and it has a better performance.
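One way to read a “cut-off number of folds” is as a threshold on how often an attribute was selected during the 10-fold attribute selection. The sketch below uses the fold counts listed for the attributes (10, 10, 10, 10, 0, 2, 10, 10) and assumes the rule “remove an attribute when it was selected in fewer folds than the cut-off”. This rule is a hypothetical interpretation for illustration; the removed-attribute lists reported in the tables may follow a different rule.

```python
# Attribute position -> number of folds (out of 10) in which it was selected.
FOLD_COUNTS = {1: 10, 2: 10, 3: 10, 4: 10, 5: 0, 6: 2, 7: 10, 8: 10}


def attributes_to_remove(fold_counts, cutoff):
    """Assumed rule: drop attributes selected in fewer than `cutoff` folds."""
    return sorted(attr for attr, folds in fold_counts.items() if folds < cutoff)


print(attributes_to_remove(FOLD_COUNTS, 2))  # [5]
print(attributes_to_remove(FOLD_COUNTS, 3))  # [5, 6]
```

Raising the cut-off can only enlarge the set of removed attributes under this rule, so comparing the model's accuracy at successive cut-offs shows how much each additional removal costs.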

Table – X Iteration for algorithm “weka.classifiers.functions.Logistic”

Algorithm: weka.classifiers.bayes.NaiveBayes

Training parameter: Debug = False (Other parameters in default)

Cut-off Number of Folds   Removed Attribute                                             Correctly Classified Instances, Mean Absolute Error
2                         1. AAT_E COLI, 4. 0.48, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35     68.6567%, 0.0901
3                         1. AAT_E COLI, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35              67.7612%, 0.0934
4                         1. AAT_E COLI, 6. 0.56, 7. 0.24, 8. 0.35                      67.7612%, 0.0939
5                         1. AAT_E COLI, 7. 0.24, 8. 0.35                               63.5821%, 0.1062

Algorithm: weka.classifiers.functions.SMO

Training parameter: filterType = No normalization/standardization (Other parameters in default)

Cut-off Number of Folds   Removed Attribute                                             Correctly Classified Instances, Mean Absolute Error
2                         1. AAT_E COLI, 4. 0.48, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35     72.5373%, 0.191
3                         1. AAT_E COLI, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35              71.0448%, 0.1915
4                         1. AAT_E COLI, 6. 0.56,


This algorithm has many parameters. For simplicity, only the parameter “debug” is altered, as changing the values of other parameters does not yield a significant improvement. Other parameters are left at their default values (debug = False, maxIts = -1, ridge = 1.0E-8).

From the results above, with training parameter debug = False, the result is the best, with Correctly Classified Instances of 85.6716%, which is the same as with the default settings.

Table – XI Based on the attribute list from Weka

Attribute Evaluator: ClassifierSubsetEval (Classifier: functions.Logistic with training parameters: debug = False and the others in their default)

Search Method: BestFirst (Direction = Forward)

Evaluation Mode

Attribute and Number of Folds (%): 1. AAT_E COLI – 10(100%), 2. 0.49 – 10(100%), 3. 0.29 – 10(100%), 4. 0.48 – 10(100%), 5. 0.5 – 0(0%), 6. 0.56 – 2(20%), 7. 0.24 – 10(100%), 8. 0.35 – 10(100%)

Table – XII

It is noted that, when cut-off number of folds = 4, 2 attributes are removed and the “Correctly Classified Instances” has the highest percentage (76.7164%). That means the remaining 5 input attributes are relevant to define the problem.

Table – XIII Iteration for algorithm “weka.classifiers.bayes.NaiveBayes”

This algorithm has many parameters. For simplicity, only the parameter “debug” is altered, as changing the values of other parameters does not yield a significant improvement. Other parameters are left at their default values (debug = False, displayModelInOldFormat = False, useKernelEstimator = False, useSupervisedDiscretization = False). From the results above, with training parameter debug = False, the result is the best, with Correctly Classified Instances of 85.4731%, and it is better than the default settings.

Table XIV Based on the attribute list from Weka

Algorithm & Evaluation: weka.classifiers.functions.Logistic

Training parameter (debug)   Correctly Classified Instances   Mean Absolute Error
FALSE                        85.67%                           0.048
TRUE                         85.67%                           0.048

Table – XV Reducing input size

Attribute Evaluator: ClassifierSubsetEval (Classifier: bayes.NaiveBayes with training parameters: debug = False and the others in their default)

Search Method: BestFirst (Direction: Forward)

Evaluation Mode

Attribute and Number of Folds (%): 1. AAT_E COLI – 10(100%), 2. 0.49 – 10(100%), 3. 0.29 – 10(100%), 4. 0.48 – 10(100%), 5. 0.5 – 0(0%), 6. 0.56 – 2(20%), 7. 0.24 – 10(100%), 8. 0.35 – 10(100%)

From the above results, the highest percentage of “Correctly Classified Instances” (68.6567%) is obtained when 2 attributes are removed.

VIII. RESULT ANALYSIS

The best performing models for each of the algorithms have been determined in the previous parts of this report. Their parameters and performance measure, the percentage of “Correctly Classified Instances”, are summarized in the following table. It is noted that the algorithm “weka.classifiers.meta.Decorate”, with training parameter “classifier = J48graft”, other parameters in default, and all 8 input attributes (i.e. no removed attributes), has the highest percentage of instances classified correctly (86.5672%) and is considered to be the final best performing model; this is our recommended model. Researchers can use the recommended model to predict the protein localization sites of E. coli.

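Table – XVI Result Analysis

Algorithm: weka.classifiers.meta.Decorate
  Training parameter: Classifier = J48graft (other parameters in default settings)
  Removed input attribute: None

Algorithm: weka.classifiers.functions.SMO
  Training parameter: filterType = No normalization/standardization (other parameters in default settings)
  Removed input attribute: None

Algorithm: weka.classifiers.functions.Logistic
  Training parameter: Debug = False (other parameters in default settings)
  Removed input attribute: 1. Sequence Name, 2. Mcg, 3. Gvh, 4. Lip, 5. Chg, 6. Aac, 7. alm1, 8. alm2

Algorithm: weka.classifiers.bayes.NaiveBayes
  Training parameter: Debug = False (other parameters in default settings)
  Removed input attribute: None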

Algorithm _{Attribute Selection}

Training

parameter Removed Attribute

Rule 1. AAT_E COLI, 4. 0.48, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35

Cut-off Number of Folds

1. AAT_E COLI, 5. 0.5, 6. 0.56, 7. 0.24, 8. 0.35

Instances, Mean

Absolute Error

2                         1. AAT_E COLI, 6. 0.56, 7. 0.24, 8. 0.35                      76.7164%, 0.0775
3                         1. AAT_E COLI, 6. 0.56, 7. 0.24, 8. 0.35                      76.4179%, 0.0798
4                         1. AAT_E COLI, 6. 0.56, 7. 0.24, 8. 0.35                      76.1194%, 0.0802

Algorithm: weka.classifiers.bayes.NaiveBayes

Training parameter (Debug)   Correctly Classified Instances   Mean Absolute Error
False                        85.4731%                         0.0445

Correctly Classified Instances by algorithm: weka.classifiers.meta.Decorate 85.0746%; weka.classifiers.functions.SMO 86.5672%; weka.classifiers.functions.Logistic 85.6716%; weka.classifiers.bayes.NaiveBayes 85.3731%

IX. CONCLUSION

In this paper we implemented various data mining algorithms to predict the protein localization site of E. coli. Given the input information for the model (the 8 input attributes: 1. sequence name, 2. mcg, 3. gvh, 4. lip, 5. chg, 6. aac, 7. alm1, 8. alm2), the model is able to predict the localization site of a protein accurately. Although the recommended model is unable to give a 100% accurate prediction of the protein localization site, it is an effective and efficient way to provide an initial prediction, as it combines both the knowledge base and the model's predictions, and it helps to save a lot of time and resources.


AUTHORS PROFILE

Dr. A. Rama, Senior Assistant Professor, Department of Computer Science and Engineering, Saveetha School of Engineering, Chennai, India.

Dr. A. Gayathri received the B.E. degree in Electronics and Communication Engineering from Periyar Maniammai College of Technology for Women (Bharathidasan University, India) in 2001 and the M.Tech. (CSE) degree with specialization in Computer Science and Engineering from Bharath University, Chennai, India in 2005. She completed her Doctorate in the Department of Information and Communication Engineering at Anna University. She is currently working as Associate Professor in Saveetha School of Engineering (Department of CSE), SIMATS, Chennai, Tamil Nadu. She is a member of CSI, IAENG and ACM.

Dr. S. Christy is one of the valuable faculty members in Saveetha School of Engineering (Department of IT). She has well over 13 years of teaching experience. She has also been awarded a Silver Medal in the M.Tech. Degree