
A Monthly Double-Blind Peer Reviewed Refereed Open Access International e-Journal - Included in the International Serial Directories.

GE-International Journal of Engineering Research (GE-IJER) ISSN: (2321-1717)


Vol. 4, Issue 1, Jan 2016 IF- 4.007 ISSN: (2321-1717) © Associated Asia Research Foundation (AARF) Publication

Website: www.aarf.asia Email: [email protected], [email protected]

ROLE OF PRE-PROCESSING IN A CONCEPTUAL FRAMEWORK FOR TOXICITY PREDICTION

Kavitha C.R¹, Dr. Mahalekshmi T²

¹ Research Scholar, R & D, Bharathiar University, Coimbatore, India

² Principal, Sree Narayana Institute of Technology, Kollam, India

ABSTRACT

In data mining research, much of the focus is on the performance and efficiency of algorithms. These algorithms perform well only when the input data is of good quality: noise and errors in the input can render their results useless. The input data therefore has to be explored and pre-processed. Data pre-processing lays the groundwork for data mining; before useful information or knowledge can be discovered, the target data set must be properly prepared. Unfortunately, pre-processing is ignored by most researchers in data mining because of its perceived difficulty. This paper outlines a proposed conceptual data mining framework for toxicity prediction. It also presents a comparative study of the performance of two classification algorithms, JRip and J48, on complete and incomplete datasets to demonstrate the importance of missing value imputation in data pre-processing.

KEYWORDS – Classification, Classifier, Data Pre-Processing, Missing Values, Toxicity Prediction

INTRODUCTION

Toxicity is an activity of chemicals that can cause harm to both humans and animals [1]. During the past several years, it has been found that some chemicals can cause severe adverse impacts, like the deaths from methyl isocyanate exposure in Bhopal, India, or [...] Toxicology is the study of the qualitative and quantitative effects of chemicals on living systems [2]. Several techniques, such as Quantitative Structure-Activity Relationships (QSAR), in vitro testing and in silico testing, are employed in predicting the activity of drugs [1]. TOXCAST, an initiative of the Environmental Protection Agency (EPA) [3], is also investigating quick and reliable methods for predicting the toxicity of compounds. Some commercially available toxicity prediction programs are TOPKAT (toxicity prediction by computer assisted technology), DEREK (deductive estimation of risk from existing knowledge), CSGenoTox, MetaDrug and HazardExpert [4]. Since these programs are developed using a non-congeneric set of chemicals, they are classified as "global" models [5].

Today data mining also plays an important role in predicting the adverse effects of

chemicals that lack experimental data in the pharmaceutical industry. Recent data mining

techniques are able to identify features that are relevant for toxicity in a comprehensive and

unbiased manner and use them for the identification of complex Structure-Activity

relationships. [6]

Toxicity datasets are available from different toxicity-based chemical databases [7]. These datasets may suffer from noise, missing values, duplication, etc., so they must be pre-processed before they can be considered for further investigation. The main aim of this paper is to develop a conceptual data mining framework for toxicity prediction and to illustrate the importance of handling missing values through an experiment. A comparison of how missing values in the input dataset affect the performance of the classifier is also given in detail. The remaining paper is organized into 10 sections. Section 2 explains the general process of data mining. Section 3 presents the conceptual framework for toxicity prediction. Section 4 introduces data pre-processing and the different techniques used for handling missing values. Related work on handling missing values is described in Section 5. The toxicity dataset is discussed in Section 6, followed by the experiment in Section 7 and the discussion in Section 8. The conclusion [...]

GENERAL PROCESS OF DATA MINING

Data mining [8] is the computer-aided extraction of useful information from datasets. The aim of data mining is to automatically or semi-automatically discover hidden knowledge, unexpected patterns and new rules from data.

The data mining process involves different steps, as shown in Figure 1. The first step is to define the goal after identifying the problem. The relevant data is collected from different sources based on this goal. The data must then be transformed into a standard form that is acceptable to the data mining techniques. The transformed data is analyzed using various data mining techniques, and models are built iteratively to find the best solution to the defined problem. Finally, the solution found is put into practice and its effect is validated.

Figure 1 General Process of Data Mining

PROPOSED CONCEPTUAL FRAMEWORK

Data mining methods are used for extracting relationships between chemical structures (or [...]

Figure 2 Conceptual Framework for Toxicity Prediction

Figure 2 depicts a generic data mining framework for toxicity prediction. The conceptual framework includes databases, assays, models and tools that help to predict toxicity. Toxicity can be predicted on the basis of chemical structure, physicochemical properties, biochemical properties or biological activity. The toxicity input data is obtained from various cheminformatics data sources. A dataset with a well-defined toxicity endpoint is pre-processed. Features that are relevant for toxicity prediction are identified. Models are developed that detect the relationships between these features and toxic activities. The predictive performance of the models is evaluated and interpreted.

The first step in the data mining process is problem definition. So the first step in this framework is to define the problem, i.e. to predict the toxicity of aliphatic carboxylic acids. To predict the toxicity, information about the targets and mechanisms of action related to toxicity must be collected. After defining the problem, the next phase is data preparation. Toxicity depends on certain chemical and biological properties, which are collected and used for the evaluation of toxicity. The collected data is then analyzed, i.e. pre-processed, to make it suitable for the subsequent steps. Classification models are developed and applied to the pre-processed dataset in order to predict the toxicity target classes. Thus, decision making in the proposed framework involves four main steps: collection of the toxicity input dataset, setting the toxicity target classes (high toxicity, low toxicity), [...] appropriate classes. The last step is model validation, which involves evaluating the performance and accuracy of the models applied to the pre-processed toxicity dataset.

Quality decisions are based on the quality of the input dataset. Data containing duplicate or missing values may lead to incorrect or even misleading decisions. Data pre-processing is an important step in the data mining process and can bring many economic benefits: the pre-processed data can be used for taking business decisions, the knowledge discovered from the dataset can be used for many purposes, and the early detection and rectification of data anomalies can earn huge payoffs. This paper presents and illustrates the importance of data pre-processing by conducting several experiments.

DATA PRE-PROCESSING

After the data is extracted from different sources, it is unrealistic to expect it to be perfect. Thorough cleansing of the data is required, since good models usually need good data. Data pre-processing is an important step for improving the quality of the results of data mining methods. The different processes in the pre-processing stage are dimensionality reduction, feature subset selection, removal of noise from the data and missing value imputation. Missing data can be a particularly harmful problem. Especially when the dataset is small or the number of missing fields is large, not all records with a missing field can be deleted from the dataset.

Figure 3 General Data pre-processing framework [8]

Data preparation is critical for data mining, as real world data that is collected from [...] amount of processing time. Figure 3 shows the general data pre-processing framework. Data pre-processing includes data preparation, comprising data cleaning, data integration, data transformation and data reduction. Data cleaning includes operations that correct bad data, filter out some incorrect data and reduce unnecessary detail in the data [10]. Data transformation includes operations such as smoothing, aggregation, normalization, discretization and generalization that convert the data. Data reduction refers to the set of techniques, such as feature selection and instance selection, that are used to obtain a reduced representation of the original data. The result of data pre-processing is a final dataset that is considered correct and useful for further data mining algorithms.

Figure 4 Forms of Data Preparation [10]
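Two of the data transformation operations named above, normalization and discretization, can be illustrated with a minimal sketch. The function names, the choice of min-max scaling and equal-width intervals, and the sample values are assumptions made for illustration only; the paper does not state which variants were used.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale numeric values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def equal_width_bins(values, n_bins=3):
    """Map each numeric value to the index (0..n_bins-1) of its equal-width interval."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall into bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Made-up descriptor values, used only to show the calls.
log_kow = [0.5, 1.0, 2.0, 4.5]
print(min_max_normalize(log_kow))
print(equal_width_bins(log_kow, n_bins=2))
```

Discretization of this kind is also what turns numeric attributes into the nominal attributes required by a Naïve Bayes model.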

Data from the real world is never perfect. The problems that arise from improper and incomplete data are described here:

(i) Missing attributes and missing attribute values

Each record in a dataset may consist of several numeric and non-numeric attributes. Sometimes, within a record, a number of attributes may not have values.

(ii) Improper types (numeric/symbolic data)

For an efficient and meaningful data analysis, the parameter values in each field have to be of the same type (numeric or non-numeric). Sometimes the attribute types may not be [...]

(iii) Out-of-range data

Out-of-range data refers to impossible data combinations, e.g. (Income=100), (Gender: Male; Pregnant: Yes).

The most commonly used approaches calculate a substitute value for the missing fields, such as the median or mean of a variable, or use clustering approaches, prediction from a model, learning methods, etc. In this research work, we have used the Naïve Bayes model to predict the missing values in the toxicity dataset because it is a popular classification technique based on computing probabilities [11]. It is known for its good performance and simple form; it is fast to train and fast to classify, is not sensitive to irrelevant features and is computationally very efficient.
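For comparison with the model-based approach chosen here, the simplest substitution strategy mentioned above (replacing a missing field with the mean or median of the variable) can be sketched as follows. The function name, the use of None as the missing-value marker and the sample values are assumptions for illustration.

```python
from statistics import mean, median


def impute_column(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed entries."""
    if strategy not in ("mean", "median"):
        raise ValueError("strategy must be 'mean' or 'median'")
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


# Made-up values for one numeric attribute with two missing fields.
pka = [4.8, None, 4.2, 5.0, None]
print(impute_column(pka, "mean"))
print(impute_column(pka, "median"))
```

Every missing field in a column receives the same value under this scheme, which is one reason model-based imputation is often preferred.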

The Naïve Bayes model is a prediction model that replaces missing values with predicted values [12]. This model can be used for supervised or unsupervised data sets. In a supervised data set, one of the attributes is designated as the class attribute (target attribute or dependent variable). An unsupervised data set has no class attribute. The model requires nominal (categorical) attributes; numeric attributes are automatically discretized into a specified number of intervals.

In a supervised data set, missing values are computed differently in training data and in test/execution data. In training data, missing values in an instance (a case) are predicted from the observed values in that instance. The observed value of every attribute, including the class attribute, is used to predict missing values in the remaining attributes. In testing data, where the values of the class attribute are assumed to be unknown, the observed values of the class attribute are not used to predict missing values. In an unsupervised data set, missing values in training data are computed in the same way as in testing data. All attributes can be used to predict missing values.
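The training-data case described above can be sketched in a few lines. This is a minimal illustration and not the implementation used in the experiment: it assumes that all attributes are already nominal (numeric attributes discretized beforehand), that None marks a missing value, and that Laplace smoothing with a hypothetical parameter alpha is acceptable. The function name nb_impute and the sample rows are invented for the example.

```python
from collections import Counter, defaultdict
import math


def nb_impute(rows, target, alpha=1.0):
    """Fill None values of column `target` with a Naive Bayes prediction.

    rows: list of dicts holding nominal values; None marks a missing value.
    Every other attribute, including the class attribute, acts as a predictor.
    """
    train = [r for r in rows if r[target] is not None]
    if not train:
        raise ValueError("no observed values to learn from")
    predictors = [a for a in rows[0] if a != target]
    label_counts = Counter(r[target] for r in train)
    cond = defaultdict(Counter)   # cond[(attr, label)][value] = co-occurrence count
    domain = defaultdict(set)     # values observed for each predictor
    for r in train:
        for a in predictors:
            if r[a] is not None:
                cond[(a, r[target])][r[a]] += 1
                domain[a].add(r[a])

    def predict(row):
        best, best_score = None, -math.inf
        for label, n_label in label_counts.items():
            score = math.log(n_label / len(train))      # log prior
            for a in predictors:
                v = row[a]
                if v is None:
                    continue                            # skip predictors that are missing too
                counts = cond[(a, label)]
                k = len(domain[a] | {v})
                # Laplace-smoothed estimate of P(attr = v | label)
                score += math.log((counts[v] + alpha) / (sum(counts.values()) + alpha * k))
            if score > best_score:
                best, best_score = label, score
        return best

    return [dict(r, **{target: predict(r)}) if r[target] is None else dict(r) for r in rows]


# Made-up, already discretized records; the last one lacks a pKa value.
rows = [
    {"logKow": "high", "pKa": "low", "C": "positive"},
    {"logKow": "high", "pKa": "low", "C": "positive"},
    {"logKow": "low", "pKa": "high", "C": "negative"},
    {"logKow": "low", "pKa": "high", "C": "negative"},
    {"logKow": "high", "pKa": None, "C": "positive"},
]
print(nb_impute(rows, "pKa")[-1])
```

For test data, the same function could be called on rows from which the class attribute has been dropped, so that the class value is not used as a predictor.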

RELATED WORK

Anjana Sharma et al. used three techniques, listwise deletion, K-NN imputation and mean/mode imputation, for handling missing values [13]. R. Malarvizhi et al. used a K-NN classifier and K-means clustering in missing value imputation [14]. Thomas et al. proposed that an existing fuzzy rule induction algorithm can incorporate missing values into the training procedure in a very general way, without any need for artificial replacement of the [...] missing value by making rules based on background knowledge, but still lose some usable rules [16]. Mei Ling Shyu et al. designed a framework named F-DCS for replacing missing values, which builds on the basic concept of a conditional probability approach. This framework can handle both nominal and numeric values with a high degree of certainty when compared with other techniques such as using the minimum, average or maximum value [17]. Olga et al. implemented three methods: a Singular Value Decomposition (SVD) based method, weighted K-nearest neighbours (K-NN) and row average. The K-NN and SVD based methods provide quick and accurate ways of estimating missing values for microarray data, though K-NN performs better than SVD [18]. Alireza Farhangfar et al. used mean, hot deck, C4.5, CLIP4 and Naive Bayes for replacing missing values [19]. Minakshi et al. used listwise deletion, mean imputation and KNN imputation for imputing missing values [20]. Edgar et al. implemented four methods, namely case deletion, mean imputation, median imputation and the KNN imputation procedure, for replacing missing values [21]. Xi-Yu et al. implemented the EM algorithm and Naive Bayesian to deal with missing values, which improves the accuracy of classification [22].

DATA SET

The data set used for the experiment is the toxicity data of aliphatic carboxylic acids, which was downloaded from “http://vincentarelbundock.github.io/Rdatasets/datasets.html” [23]. The downloaded toxicity dataset did not contain any noise or missing values. The characteristics of the data set are summarized in Table 1. The aim of the data set is to predict the toxicity of carboxylic acids on the basis of several molecular descriptors: toxicity, logKow, pKa, ELUMO, Ecarb, Emet, RM, IR, Ts and P. The problem is to predict whether a given aliphatic acid is toxic or not. This is a two-class problem with the class values positive and negative. The data set contains 38 observations and 11 variables, with no missing values reported. Of the eleven variables, one is the class variable; all other attributes are numeric-valued. The attributes are given below:

1. Toxicity - defined as log(IGC50^(-1)); typically the “response”

2. logKow - the partition coefficient

3. pKa - the dissociation constant

4. ELUMO - Energy of the lowest unoccupied molecular orbital

5. Ecarb - Electrotopological state of the carboxylic group

6. Emet - Electrotopological state of the methyl group

7. RM - [...]

8. IR - Refraction index

9. Ts - Surface tension

10. P - Polarizability

11. C - Class variable (positive or negative)

Table 1 Characteristics of Toxicity data sets

Data Set: Toxicity

No. of Examples: 38

Input Attributes: 10

Output Classes: 2

Total No. of Attributes: 11

Missing Attributes status: No

Noisy Attributes status: No

EXPERIMENT

The main steps that were carried out in this experiment are given below:

1. The complete dataset (CD) is loaded into WEKA [24], a data mining tool, and the classifiers J48 [25] and JRip [26] are applied to this dataset. The performance of the classifiers is then analysed based on criteria such as correctly and incorrectly classified instances, kappa statistic, mean absolute error, root mean squared error, relative absolute error and root relative squared error.

2. The dataset is made incomplete by deleting some of the attribute values of the toxicity dataset. This dataset now has some missing values and can be considered an incomplete dataset (ID).

3. The incomplete dataset (ID) is loaded into WEKA and the above classifiers are applied again. The performance of the classifiers is evaluated again.

4. The missing values are replaced with values predicted by the Naïve Bayes model, which is a prediction model. This fills the incomplete dataset with values. The dataset after handling missing values is referred to as the pre-processed dataset (PD).

5. The pre-processed dataset is loaded into WEKA and the same classifiers are applied to the dataset. The performance of these classifiers is evaluated once more.
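Step 2 of the procedure above can be sketched as follows. The paper does not state how many values were deleted or how they were chosen, so the random selection, the fraction parameter and the fixed seed are assumptions for illustration; the function name make_incomplete is hypothetical.

```python
import random


def make_incomplete(rows, attributes, fraction=0.1, seed=0):
    """Return a copy of rows in which a random fraction of the given cells is set to None."""
    rng = random.Random(seed)                     # fixed seed keeps the deletion reproducible
    cells = [(i, a) for i in range(len(rows)) for a in attributes]
    n_missing = round(fraction * len(cells))
    out = [dict(r) for r in rows]                 # leave the complete dataset untouched
    for i, a in rng.sample(cells, n_missing):
        out[i][a] = None
    return out


# Made-up records; only the input attributes are eligible for deletion.
complete = [{"logKow": i * 0.5, "pKa": 4.0 + i * 0.1, "C": "positive"} for i in range(10)]
incomplete = make_incomplete(complete, ["logKow", "pKa"], fraction=0.25)
print(sum(v is None for r in incomplete for v in r.values()))
```

Keeping the class attribute out of the attributes list means that only input values are removed, which matches the supervised setting of the experiment.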

Table 2 shows the performance of the classifiers with the complete dataset (CD), [...]

Table 2 Performance evaluation of classifiers before and after pre-processing

The Confusion Matrix can also be used for analyzing how well the classifier can

recognize tuples of different classes.

Figure 6 Confusion matrix [27]

True Positives (TP) are the positive tuples that were correctly labeled by the classifier [28]. True Negatives (TN) are the negative tuples that were correctly labeled by the classifier [28]. False Positives (FP) are the negative tuples that were incorrectly labeled as positive [28]. False Negatives (FN) are the positive tuples that were mislabeled as negative [28]. Accuracy is the probability that the test yields correct results. It is calculated as Accuracy = (TP+TN)/(P+N), where P and N are the total numbers of positive and negative tuples.
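As a minimal sketch of these definitions, the four counts and the accuracy can be computed from lists of actual and predicted class labels. The function names and the sample labels are illustrative assumptions; P is taken as TP + FN (all actual positives) and N as TN + FP (all actual negatives).

```python
def confusion_counts(actual, predicted, positive="positive"):
    """Return (TP, TN, FP, FN) for a two-class problem."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1      # positive tuple labeled positive
            else:
                fp += 1      # negative tuple labeled positive
        else:
            if a == positive:
                fn += 1      # positive tuple labeled negative
            else:
                tn += 1      # negative tuple labeled negative
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (P + N) with P = TP + FN and N = TN + FP."""
    p = tp + fn
    n = tn + fp
    return (tp + tn) / (p + n)


# Made-up labels for five tuples.
actual = ["positive", "positive", "positive", "negative", "negative"]
predicted = ["positive", "positive", "negative", "negative", "positive"]
print(accuracy(*confusion_counts(actual, predicted)))
```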

DISCUSSION

Classification was carried out on the complete dataset (CD), the incomplete dataset (ID) and the pre-processed dataset (PD), and the results of the experiment were compared. Based on the comparison of the confusion matrices, it was found that the accuracy of prediction decreased from 98% to 85% when the dataset contained missing values. After pre-processing, the accuracy increased to 94%.

The Kappa statistic for the incomplete dataset is low compared with those for the pre-processed dataset and the complete dataset. A Kappa statistic greater than or equal to 0.7 is considered to indicate good statistical agreement between the predicted and the actual classes, and the higher the value, the better the agreement. Therefore the complete dataset and the pre-processed dataset show better agreement.
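For reference, the Kappa statistic of a two-class confusion matrix can be computed as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below uses this standard definition of Cohen's kappa; the function name and the sample counts are illustrative and are not results from the experiment.

```python
def cohen_kappa(tp, tn, fp, fn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    observed = (tp + tn) / total
    # Chance agreement: product of the marginal totals for each class, summed over classes.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Made-up counts; the 0.7 threshold is the one quoted in the text.
kappa = cohen_kappa(tp=18, tn=17, fp=2, fn=1)
print(kappa, kappa >= 0.7)
```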

When the results of the experiment were analyzed based on the errors, it was found that the error rate increased for the incomplete dataset (ID). The relative absolute error is also better after pre-processing the dataset.
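The four error measures used as evaluation criteria can be written down in their generic textbook form, as in the sketch below. This is an illustration only: WEKA derives these figures from predicted class probabilities and reports the relative measures as percentages, so its numbers need not match this simplified version. The function name and the sample values are assumptions.

```python
import math


def error_measures(actual, predicted):
    """Return MAE, RMSE, relative absolute error and root relative squared error."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: errors made by always predicting the mean of the actual values.
    base_abs = sum(abs(a - mean_actual) for a in actual)
    base_sq = sum((a - mean_actual) ** 2 for a in actual)
    if base_abs == 0:
        raise ValueError("relative errors are undefined when all actual values are equal")
    return {
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae": abs_err / base_abs,
        "rrse": math.sqrt(sq_err / base_sq),
    }


# Made-up targets (1 = positive, 0 = negative) and predicted scores.
print(error_measures([1, 0, 1, 1], [0.9, 0.2, 0.6, 1.0]))
```

Relative measures below 1 (below 100% in WEKA's output) indicate that the classifier does better than the trivial baseline.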

For the other parameters as well, the pre-processed dataset showed better results than the dataset with missing values.

It is concluded that the presence of missing values in the dataset clearly affects the performance of the classifier. Therefore it is important to handle missing values before applying data mining techniques.

From Table 2 and Table 3, the performance of the classifier J48 is better than that of the classifier JRip. The graph in Figure 7 shows that pre-processing improves the performance of the classifiers.

Finally, a comparison between this work and other research works was also conducted. Anjana Sharma et al. used three imputation techniques, listwise deletion, mean/mode imputation and KNN, with the C4.5 classifier. KNN showed the highest accuracy (59%), ahead of listwise deletion (48%) and mean/mode imputation (54%). Malarvizhi et al. used KNN, K-means clustering and mean substitution as imputation methods and found that KNN showed higher accuracy (67%) than K-means clustering (62%) and mean substitution (60%). Alireza Farhangfar et al. used mean, hot deck, C4.5, CLIP4 and Naive Bayes imputation, performed an evaluation, and found that C4.5 has the highest accuracy, followed by Naive Bayes. Through this comparative study, it is found that the Naive Bayes model is one of the best imputation techniques for handling missing values.

Table 4 Comparison based on related works of Missing Imputation Methods

Sno | Author | Dataset | Classifier | Missing imputation method | Accuracy (%) | Highest accuracy
1 | Anjana Sharma et al | Exam cell of university with student records | C4.5 | Listwise deletion | 48 | KNN
  |  |  |  | Mean/Mode | 54 |
  |  |  |  | KNN | 59 |
2 | Mostafizur Rahman et al | Hull and Dundee clinical sites | J48 | J48 | 80 | J48
  |  |  |  | KNN | 80 |
  |  |  |  | FURIA | 80 |
  |  |  |  | SVM | 78 |
  |  |  |  | Mean | 80 |
  |  |  | KNN | J48 | 71 |
  |  |  |  | KNN | 81 |
  |  |  |  | FURIA | 79 |
  |  |  |  | SVM | 71 |
  |  |  |  | Mean | 77 |
  |  |  | FURIA | [...] |  |
3 | Minakshi et al | Student records | C4.5 | Listwise deletion | 73 | KNN
  |  |  |  | Mean | 73 |
  |  |  |  | KNN | 75 |
4 | Edgar Acuna et al | Iris, Hepatitis, Sonar, Heart, Bupa, Ionosphere, Crx, Breastw, Diabetes, Vehicle, German, Segment | LDA, KNN | Case deletion, mean, mode, KNN | Unknown (effect on misclassification error rate) |
5 | Malarvizhi et al | Unknown |  | Mean substitution |  | KNN
  |  |  |  | KNN | 67 |
  |  |  |  | K-means clustering | 62 |
6 | Xi-Yu Zhou | Seed | Multilayer Perceptron | Without imputation | 79 | NB-EM
  |  |  |  | EM | 81 |
  |  |  |  | NB-EM | 89 |
  |  | Columns |  | Without imputation | 70 | NB-EM
  |  |  |  | EM | 73 |
  |  |  |  | NB-EM | 78 |

Table 5 Results of the experiment using preprocessed dataset

CONCLUSION

In this work, we have proposed a conceptual data mining framework that would be helpful for toxicity prediction. The prediction of the toxicity of chemical compounds is a major challenge in the pharmaceutical industry. Several research works show that data mining techniques are good at predicting the toxicity of chemicals. However, the presence of noise and missing values in the dataset affects the accuracy of prediction, as shown by the experiment carried out in this research work. This study shows that pre-processing helps to remove noise and missing values, which improves the accuracy of prediction and the performance of the classifier. It is observed [...]

The future scope of the research lies in developing a hybrid pre-processing algorithm and feature selection, which would further improve the prediction accuracy.

REFERENCES

1 Meenakshi Mishra, Hongliang Fei and Jun Huan, “Computational prediction of toxicity”, Int. J. Data Mining and Bioinformatics, Vol. 8, No. 3, 2013

2 www.msmr.org/LAB_Notes_Toxicology.pdf accessed on 2/10/15

3 “TOXCAST”, Available at http://www2.epa.gov/chemical-research/toxicity-forecasting accessed on 4/10/15

4 Ekins, S.: Computational Toxicology: Risk Assessment for Pharmaceutical and Environmental Chemicals (Wiley Series on Technologies for the Pharmaceutical Industry). Wiley-Interscience (2007)

5 White, A., Mueller, R., Gallavan, R., Aaron, S., Wilson, A.: A multiple in silico program approach for the prediction of mutagenicity from chemical structure. Mutation Research/Genetic Toxicology and Environmental Mutagenesis 539 (2003) 77-89(13)

6 Data Mining for Toxicity Prediction, Chapter 5, Artificial Intelligence and Data Mining for Toxicity Prediction, Available at https://openaccess.leidenuniv.nl/bitstream/handle/1887/12954/Chapter%205.pdf?sequence=8 accessed on 6/10/2004

7 Kavitha C.R, Dr. Mahalakshmi T, “Chemical Databases: A Brief Walk”, International Journal of Emerging Technology and Advanced Engineering, Volume 3, Issue 8, August 2013, http://www.ijetae.com/files/Volume3Issue8/IJETAE_0813_68.pdf

8 Jiawei Han, Micheline Kamber, Jian Pei, “Data Mining: Concepts and Techniques”, third edition, Morgan Kaufmann Publishers.

9 Data preprocessing techniques for data mining, Available at http://iasri.res.in/ebook/win_school_aa/notes/Data_Preprocessing.pdf accessed on 7/11/2015

10 Salvador Garcia, Julian Luengo, Francisco Herrera, “Data Preprocessing in data [...]

11 M. Mostafizur Rahman and Darryl N. Davis, “Machine Learning Based Missing Value Imputation Method for Clinical Dataset”, IAENG Transactions on Engineering Technologies, Springer Netherlands, 245-247

12 K. Ming Leung, “Naive Bayesian Classifier”, Available at http://cis.poly.edu/~mleung/FRE7851/f07/naiveBayesianClassifier.pdf accessed on 5/10/15

13 Anjana Sharma, Naina Mehta, Iti Sharma, “Reasoning with Missing Values in Multi Attribute Datasets”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 5, May 2013.

14 R. Malarvizhi, A. Thanamani, “K-NN Classifier Performs Better Than K-Means Clustering in Missing Value Imputation”, IOSR Journal of Computer Engineering (IOSRJCE), vol. 6, pp. 12-15, Nov. - Dec. 2012.

15 Thomas R. Gabriel and Michael R. Berthold, “Missing Values in Fuzzy Rule Induction”, Systems, Man and Cybernetics, 2005 IEEE International Conference on (Volume: 2).

16 S. McClean, B. Scotney and M. Shapcott, “Using Background Knowledge with Attribute-Oriented Data Mining”, Knowledge Discovery and Data Mining (Digest No. 1998/310), IEE Colloquium on, 1998, pp. 1/1-1/4.

17 M. Shyu, I. P. Appuhamilage, S. Chen and L. Chang, “Handling Missing Values via Decomposition of the Conditioned Set”, IEEE Systems, Man, and Cybernetics Society, pp. 199-204, 2005.

18 Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, “Missing value estimation methods for DNA microarrays”, Bioinformatics (2001) 17 (6): 520-525.

19 Alireza Farhangfar, Lukasz Kurgan, Witold Pedrycz, “Experimental analysis of methods for imputation of missing values in databases”

20 Minakshi, Dr. Rajan Vohra, Gimpy, “Missing Value Imputation in Multi Attribute Data Set”, (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 5 (4), 2014, 5315-532, ISSN: 0975-9646

21 Edgar Acuna and Caroline Rodriguez, “The treatment of missing values and its effect in the classifier accuracy”, Classification, Clustering, and Data Mining Applications, Part of the series Studies in Classification, Data Analysis, and Knowledge Organisation, pp. 639-647, Available at http://sci2s.ugr.es/keel [...]

22 Xi-Yu Zhou, Joon S. Lim, “Replace Missing Values with EM algorithm based on GMM and Naïve Bayesian”, International Journal of Software Engineering and Its Applications, Vol. 8, No. 5 (2014), pp. 177-188, http://dx.doi.org/10.14257/ijseia.2014.8.5.14

23 Dataset, “Aliphatic Carboxylic acids data set”, Available at http://vincentarelbundock.github.io/Rdatasets/datasets.html accessed on

24 “WEKA”, Available at www.weka.net.nz/.

25 Aman Kumar Sharma, Suruchi Sahni, “A Comparative Study of Classification Algorithms for Spam Email Data Analysis”, IJCSE, Vol. 3, No. 5, 2011, pp. 1890-1895

26 JRip, Available at https://en.wikibooks.org/wiki/Data_Mining_Algorithms_In_R/Classification/JRip accessed on 11/10/15

27 Confusion matrix, Available at http://www.gepsoft.com/gepsoft/APS3KB/Chapter09/Section2/SS03.htm accessed on 10/10/15

28 Ali Serhan Koyuncugil, Nermin Ozgulbas, “Surveillance Technologies and Early Warning Systems: Data Mining Applications for Risk Detection”, Information Science