SUPERVISED MULTI ATTRIBUTE GENE MANIPULATION FOR CANCER

P. Lakshmidevi¹, J. Daphney Joann²

¹Student, Computer Science and Engineering, Kingston Engineering College (India)

²Assistant Professor, Computer Science and Engineering, Kingston Engineering College (India)

ABSTRACT

One of the major research areas in the medical field is pinpointing exact tumour types, which enables better treatment and minimizes the toxicity of medicines for patients. To gain clear insight into the problem, a cancer classification analysis system is needed, followed by a systematic approach to analysing global gene expression, which provides an optimized solution for the identified problem area. Molecular diagnostics offers a promising option for systematic human cancer classification, but these tests are not widely applied because characteristic molecular markers for most solid tumours have yet to be identified. Recently, DNA microarray-based tumour gene expression profiles have been used for cancer diagnosis. Existing systems range from nearest-neighbour analysis to support vector machines for the learning portion of the classification model. They do not provide a clear picture of a supervised classifier (Supervised Multi Attribute Clustering Algorithm) that can manage knowledge attributes coming from two different knowledge streams. The proposed system takes input from multiple sources, creates an ontological store, clusters the data with attribute-match association rules, and then classifies with the knowledge acquired.

Keywords- Association Rules, Cancer, Classification, Clustering, Data Mining, Gene Expression Data

1 INTRODUCTION

Cancer classification of different tumor types is of great importance in cancer diagnosis and drug discovery. A major challenge in clinical cancer research is the prediction of prognosis at the time of tumor discovery.

Accurate prediction of different tumor types can help in providing better treatment and minimizing toxicity for patients. The advent of microarray technology has made it possible to study the expression profiles of a large number of genes across different experimental conditions. Microarray-based gene expression profiling has shown great potential in the prediction of different cancer subtypes. Cancer leads to approximately 25% of all mortalities, making it the second leading cause of death. Cancer develops mainly in epithelial cells (carcinomas), connective/muscle tissue (sarcomas), and white blood cells (leukemias and lymphomas). Successive mutations in a normal cell that damage the DNA and impair the cell replication mechanism cause malignant tumors (cancers). There are a number of carcinogens, such as tobacco smoke, radiation, certain microbes, synthetic chemicals, and polluted water and air, that may accelerate the mutations. Thus, there is a need to identify the mutated genes that contribute to a cancerous state. Symptoms of cancer depend on the type and location of the cancer. For example, lung cancer can cause coughing, heavy breathing, and chest pain. Colon cancer often causes diarrhoea, constipation, dysentery, and blood in the stool. Some cancers may not have any symptoms at all. In certain cancers, such as pancreatic cancer, symptoms often do not start until the disease has reached an advanced stage. Cancer progression can be aggressive or benign, which determines the suitable treatment required.

One of the methods for cancer identification is the analysis of genetic data. The human genome contains approximately 10 million single nucleotide polymorphisms (SNPs). These SNPs are responsible for the variation that exists between human beings. Microarray technology is used to obtain the gene expression levels and SNPs of an individual. Due to the high cost, genetic data [...]. Removal of uninformative genes decreases noise, confusion, and complexity, and increases the chances of identifying the most important genes, classifying diseases, and predicting various outcomes, e.g., cancer type. By understanding the role of certain gene expression levels in a person's predisposition to cancer, medicine will be in a better position to prevent and cure cancers.

Techniques to solve biological problems include analyzing large biological data sets, which requires making sense of the data by inferring structure or generalizations from the data. Examples of this type of analysis include protein structure prediction, gene classification, cancer classification based on microarray data, clustering of gene expression data, and statistical modeling of protein-protein interaction.
Cancer classification using gene expression data usually relies on traditional supervised learning techniques, in which only labeled data (i.e., data from a sample with clinical follow-up) can be exploited for learning, while unlabeled data (i.e., data from a sample without clinical follow-up) are disregarded.

Recent research in the area of cancer diagnosis suggests that unlabeled data, in addition to the small number of labeled data, can produce significant improvement in accuracy, a technique called semi-supervised learning. Indeed, semi-supervised learning has proved to be effective in solving different biological problems, including protein classification, prediction of transcription factor–gene interaction, and gene-expression-based cancer subtype discovery.

1.1 Microarray Technology

Compared with the traditional approach to genomic research, which has focused on the local examination and collection of data on single genes, microarray technologies have now made it possible to monitor the expression levels of tens of thousands of genes in parallel. A microarray is a “virtual lab” on a chip. It is a 2D array on a solid platform (usually a glass slide) that is studied for biological references. It is used to assay spots (usually called probes or reporters) that are analyzed experimentally under a microscope, as the findings are hard to notice with the naked eye. The popularity of the microarray technique notwithstanding, the analysis of the data is far from trivial. Essentially, the information captured by a DNA microarray is at the level of transcription (a term from molecular biology). The raw microarray data are images, which have to be transformed into gene expression matrices: tables where rows represent genes, columns represent various samples such as tissues or experimental conditions, and the number in each cell characterizes the expression level of the particular gene in the particular sample. These matrices have to be analysed further if any knowledge about the underlying biological processes is to be extracted, and the inferences and results obtained by such studies have to be validated. More refinement can be achieved in the results by working on multi-tier data, which may lead to putative regulatory signals in the genome sequences. The two major types of microarray experiments are the cDNA microarray and oligonucleotide arrays (abbreviated oligo chip).
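The gene expression matrix described above can be illustrated with a minimal sketch. The gene names, sample names, and values are hypothetical and chosen only to show the layout: rows are genes, columns are samples, and each cell holds an expression level.

```python
# Hypothetical gene and sample names; the values are made up for illustration.
genes = ["gene_A", "gene_B", "gene_C"]
samples = ["tumor_1", "tumor_2", "normal_1"]

# Rows are genes, columns are samples, each cell is an expression level.
matrix = [
    [2.1, 1.9, 0.2],
    [0.3, 0.4, 0.3],
    [-1.2, -0.8, 0.1],
]


def expression_profile(gene):
    """Return one row of the matrix (one gene), keyed by sample name."""
    row = matrix[genes.index(gene)]
    return dict(zip(samples, row))


def sample_column(sample):
    """Return one column of the matrix (one sample), keyed by gene name."""
    col = samples.index(sample)
    return {gene: matrix[i][col] for i, gene in enumerate(genes)}
```

Reading a row gives the profile of one gene across samples, while reading a column gives the state of all genes in one sample; the clustering and classification steps discussed later work on one or the other view.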

Fig 1. A Gene Expression Matrix

Despite differences in the details of their experimental protocols, both types of experiments involve three common basic procedures.

Chip manufacture: A microarray is a small chip (made of chemically coated glass, nylon membrane, or silicon) onto which tens of thousands of DNA molecules (probes) are attached in fixed grids. Each grid cell relates to a DNA sequence.

Target preparation, labeling, and hybridization: Typically, two mRNA samples (a test sample and a control sample) are reverse transcribed into cDNA (targets), labeled using either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the surface of the chip.

The scanning process: Chips are scanned to read the signal intensity that is emitted from the labeled and hybridized targets. Generally, both cDNA microarray and oligo chip experiments measure the expression level for each DNA sequence by the ratio of signal intensity between the test sample and the control sample; therefore, data sets resulting from both methods share the same biological semantics. We term the measurements collected via both methods gene expression data.
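The ratio-based measurement produced by the scanning step can be sketched as follows. Taking the base-2 logarithm of the ratio is an assumption made here for readability (it makes up- and down-regulation symmetric around zero); the text itself only specifies the ratio. The probe names and intensities are hypothetical.

```python
import math


def log_ratio(test_intensity, control_intensity):
    """Log2 ratio of test to control signal intensity for one probe."""
    if test_intensity <= 0 or control_intensity <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(test_intensity / control_intensity)


# Hypothetical scanned intensities per probe: (test, control).
scanned = {
    "probe_1": (800.0, 200.0),  # stronger signal in the test sample
    "probe_2": (150.0, 600.0),  # weaker signal in the test sample
    "probe_3": (300.0, 300.0),  # unchanged
}

expression = {probe: log_ratio(t, c) for probe, (t, c) in scanned.items()}
```

Under this convention a positive value indicates a higher signal in the test sample than in the control, a negative value a lower signal, and zero no change.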

1.2 Literature Review

Shital Shah et al. state that the analysis of gene expression data leads to cancer identification and classification, which will facilitate proper treatment selection and drug development. Their integrated algorithm combines a genetic algorithm and correlation-based heuristics for data pre-processing and data mining, achieving high classification accuracy along with the ability to identify the most significant genes. It is computationally less complex, resulting in faster gene selection, and it is easier to implement and requires minimal domain knowledge. The early and accurate detection and classification of lung cancer using the integrated gene-search algorithm will help in selecting treatment options and developing drugs. Higher training classification accuracy can lead to overfitting the data; to check this, knowledge extracted from the training data set was used to predict the test samples.

Lipo Wang et al. (2007) proposed an approach for cancer classification using expressions of very few genes. The process involves two steps. The first is important gene selection, which is done using a gene-ranking scheme. The second is evaluating the classification accuracy of gene combinations using a fine classifier. A divide-and-conquer approach is used to attain good accuracy. Scoring methods such as the T-test and class separability (CS) are used for gene ranking. The data sets used in this experiment are the Lymphoma, SRBCT, Liver Cancer, and GCM data sets, collected from microarray gene expression data. Missing values in the data sets, except the GCM data set, are filled in by the k-nearest neighbour algorithm. The classifiers used are a fuzzy neural network and a support vector machine (SVM).

First, the whole data set is divided into two parts, one for training and the remainder for testing. Ranking is then performed using the scoring scheme, after which the top genes are selected from the ranked data set. Each selected gene is passed one by one into the classifier. If good accuracy is not attained, the next step, gene combination, is performed, with cross-validation carried out on the training data.

Two- or three-gene combinations are formed from the top genes using cross-validation and are then input into the classifier until good accuracy is achieved. The results for all data sets indicate that a minimal selection of genes is sufficient for cancer classification. The knowledge derived by the algorithm provides very good classification accuracy, and the T-score and CS are reported to be the best approaches for important gene selection.
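The gene-ranking step can be illustrated with a minimal sketch. It assumes a two-class problem and uses the absolute Welch t-statistic as the score; this is a simplification for illustration and not necessarily the exact T-score formula used by Wang et al. The gene names, labels, and values are hypothetical.

```python
import math
from statistics import mean, variance


def t_score(values_a, values_b):
    """Absolute Welch t-statistic between two classes for one gene."""
    diff = abs(mean(values_a) - mean(values_b))
    se = math.sqrt(variance(values_a) / len(values_a)
                   + variance(values_b) / len(values_b))
    if se == 0:
        # No within-class spread: perfectly separated or identical classes.
        return float("inf") if diff > 0 else 0.0
    return diff / se


def rank_genes(expression, labels, class_a, class_b):
    """Return gene names sorted by decreasing t-score.

    expression: dict mapping gene name -> list of values, one per sample
    labels:     list of class labels aligned with the samples
    """
    scores = {}
    for gene, values in expression.items():
        a = [v for v, lab in zip(values, labels) if lab == class_a]
        b = [v for v, lab in zip(values, labels) if lab == class_b]
        scores[gene] = t_score(a, b)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical training data: three tumor and three normal samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "gene_A": [5.0, 5.2, 4.8, 1.0, 1.1, 0.9],  # clearly separates the classes
    "gene_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
    "gene_C": [3.0, 1.0, 2.0, 2.5, 1.5, 2.6],  # noisy, weak separation
}
ranked = rank_genes(expression, labels, "tumor", "normal")
```

The top entries of `ranked` would then be passed to the classifier one at a time, and in small combinations if a single gene does not give good accuracy.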

N. Revathy et al. defined a new method of processing microarray data for cancer classification. Several methods are available to rank gene expression data; the most often used are the T-score and ANOVA. However, these are not suitable for large data sets. To rectify this problem, the authors proposed a technique called the enrichment score. The classifier used is a support vector machine (SVM). The data set is randomly divided into two parts, one for training and the remainder for testing, and the classifier is trained with the training data.

The lymphoma data set is used to demonstrate performance. Two processes are involved: gene ranking, done using the proposed enrichment score, and classification. The top genes are selected from the ranked data and passed into the classifier one by one; if good accuracy is not attained, gene combinations are formed from the ranked data set. The combinations of genes are then classified until good accuracy is achieved. The results are evaluated for SVM with the T-score and for SVM with the enrichment score.

The accuracy and classification time of the two are compared. The SVM with the enrichment score performed better, with higher accuracy than the SVM with the T-score.

2 GENE EXPRESSION DATA CLUSTERING

Gene-based clustering is the process of grouping genes into a set of classes (or clusters) according to their expression in given experimental conditions (samples or time points). Each cluster is intended to contain co-expressed genes that exhibit a common expression profile. Gene-based clustering was investigated to understand biological processes, since genes grouped in the same cluster are expected to be involved in common biological processes. The most popular gene-based clustering algorithms are partitional, self-organizing map (SOM), and hierarchical methods. Although clustering techniques have proved useful for identifying co-expressed genes, interpreting gene co-expression without ambiguity has remained a challenge, since it depends on other sources of knowledge such as the expert knowledge of biologists. Indeed, a microarray dataset contains numerous groups of co-expressed genes. A typical strategy for a biologist is to start from genes that are known to be closely related to a biological function, browse a preliminary rough clustering result, and focus on a small subset of those genes that are supposed to play a role. Thus, biologists currently follow exploratory strategies by manually selecting potential groups of genes according to their knowledge.
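The idea of grouping co-expressed genes can be sketched with a deliberately simple procedure. The code below is a greedy grouping by Pearson correlation against a seed gene; it is a simplified stand-in chosen for brevity and is not one of the partitional, SOM, or hierarchical algorithms named above. The threshold of 0.9 and the gene profiles are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equally long expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        return 0.0  # a flat profile carries no co-expression signal
    return cov / (sx * sy)


def group_coexpressed(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first cluster whose seed gene
    (the first member) it correlates with at or above the threshold;
    otherwise it starts a new cluster."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            if pearson(profiles[cluster[0]], profile) >= threshold:
                cluster.append(gene)
                break
        else:
            clusters.append([gene])
    return clusters


# Hypothetical expression profiles over four conditions.
profiles = {
    "gene_A": [1.0, 2.0, 3.0, 4.0],
    "gene_B": [2.0, 4.1, 6.0, 8.2],  # rises with gene_A
    "gene_C": [4.0, 3.0, 2.0, 1.0],  # falls while gene_A rises
    "gene_D": [8.0, 6.1, 4.0, 2.2],  # falls with gene_C
}
clusters = group_coexpressed(profiles)
```

The result depends on the order in which genes are visited, which is one reason practical systems prefer the established algorithms; the sketch only shows what "a common expression profile" means operationally.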

3 EXISTING SYSTEM

Gene expression is the activation of a gene that results in a protein. The existing system tends to address only gene manipulation for cancer therapeutics. The raw microarray data are images, which have to be transformed into gene expression matrices, in which rows represent genes and columns represent various samples. Analysis of gene expression data leads to cancer identification and classification, which will facilitate proper treatment selection and drug development. In the existing approach, identification of cancer from gene expression has been implemented. The genome of a differentiated cell contains all the genes, and microarrays are used to investigate the expression of thousands of genes at a time in order to find the affected cells. Discretized gene expressions can be used as descriptors of the specific states of a gene for cancer prediction analysis.
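Discretization of gene expression values can be sketched as follows. The three state names and the thresholds of -1.0 and 1.0 are assumptions made for illustration; the text does not specify how the discretization is performed.

```python
def discretize(value, low=-1.0, high=1.0):
    """Map an expression value (for example a log ratio) to a discrete state."""
    if value <= low:
        return "under"
    if value >= high:
        return "over"
    return "normal"


def discretize_sample(sample):
    """Discretize every gene of one sample (dict of gene -> value)."""
    return {gene: discretize(value) for gene, value in sample.items()}


# Hypothetical expression values for one sample.
states = discretize_sample({"gene_A": 2.3, "gene_B": -1.5, "gene_C": 0.2})
```

Each sample then becomes a set of symbolic descriptors such as "gene_A is over-expressed", which is the form that rule-based mining methods typically consume.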

Some of the methods are splicing, polyadenylation, and stability. Supervised and semi-supervised learning techniques use existing domain knowledge for filtering particular genes. The sequencing-based technique called SAGE (Serial Analysis of Gene Expression) can be implemented in parallel to integrate biological knowledge in all phases of the data mining process. The limitations include that fragments of transposons are often confused with protein-coding exons of genes, and that direct DNA injection leads to low-efficiency identification.

4 PROPOSED SYSTEM

The proposed concept of the project is predicting cancer by analyzing genes and converting the gene expression, which leads to identifying and analyzing the cancer result set. Controlling gene activity from gene to functional protein and phenotype has also been analyzed in order to identify the cancer cells. In the proposed methodology, the expert's documental data describe DNA methylation (gene expression segments), a kind of binding site for proteins that makes DNA inaccessible in an active state. A biomarker is a representation or an indicator of the severity and presence of some diseased state in a body. Molecular diagnostics techniques are used to analyze biological markers in the genome and proteome, which are used to diagnose and monitor disease, detect risk, and decide which therapies will work best for individual patients. An ontological store effectively combines data or information from multiple heterogeneous sources. Semantic ontology-based mining of gene expression compares the gene expression values using the comparative Knowledge Consolidator. The Supervised Multi Attribute Clustering Algorithm is used to find the best classification rule in the gene expression data and arrive at the final prediction of cancer disease. Its advantages are that comparison of genomes can help with gene finding, that it can help to visualize transient gene expression and to identify whether tissue is stably transgenic, and that it is useful for cellular and ecological studies.

5 SUPERVISED MULTI ATTRIBUTE GENE MANIPULATION

The following architecture explains the mining process of gene expression data with the help of the expert's documental data. The expert's documental data and the patient's gene expression data are taken as input to the system, which thereby creates an ontological store. An ontological store is used to manage and effectively combine data from multiple heterogeneous sources. The supervised multi attribute clustering method relies on a similarity measurement to automatically form groups of relevant or similar data members. Generalization and association rules are employed to find the relevant features of the gene. After the clustering process, the user can apply a best classification rule to extract the data pattern in each cluster for a better understanding of the cluster model. From the classification results, we are able to find the final disease prediction and provide suggestions for better treatment in early stages and for drug development.
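How an association rule over gene attributes is evaluated can be sketched with the two standard measures, support and confidence. The encoding of each sample as a set of "attribute=value" strings and the example data are hypothetical; the text does not state which rule measures or encoding the system uses.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0  # the rule never applies in this data
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / base


# Hypothetical samples, each a set of discretized gene attributes plus a class.
transactions = [
    {"gene_A=over", "gene_B=under", "class=tumor"},
    {"gene_A=over", "gene_B=normal", "class=tumor"},
    {"gene_A=normal", "gene_B=normal", "class=normal"},
    {"gene_A=over", "gene_B=under", "class=tumor"},
]

rule_support = support(transactions, {"gene_A=over", "class=tumor"})
rule_confidence = confidence(transactions, {"gene_A=over"}, {"class=tumor"})
```

In this toy data the rule "gene_A=over implies class=tumor" holds in every sample where it applies, so it would be a candidate feature for the later classification step.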

Fig 2. Mining Gene Expression for cancer identification

Supervised Multi Attribute Clustering Algorithm

INTERSECT(P1, P2)
  Answer ← ( )
  While P1 ≠ NIL and P2 ≠ NIL Do
    if docID(P1) = docID(P2)
      then ADD(Answer, docID(P1))
           P1 ← Next(P1)
           P2 ← Next(P2)
    else if docID(P1) < docID(P2)
      then P1 ← Next(P1)
    else P2 ← Next(P2)
  Return Answer
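The INTERSECT procedure above can be written as a short runnable sketch. It assumes that both inputs are sorted in ascending order of ID, which the pseudocode requires implicitly, and it uses Python lists with index pointers in place of linked lists. Interpreting the IDs as attribute identifiers coming from the two knowledge streams is an assumption for illustration.

```python
def intersect(p1, p2):
    """Return the IDs present in both sorted lists, in ascending order."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same ID in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, advance it
        else:
            j += 1  # p2 is behind, advance it
    return answer


# Hypothetical sorted ID lists from two knowledge streams.
common = intersect([1, 3, 5, 7, 9], [3, 4, 5, 9, 10])
```

Because each step advances at least one pointer, the running time is linear in the combined length of the two lists.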

Gene Knowledge Extraction

In this module, based on the analysis from the data set input modules, a comparison analyses the extracted gene to predict its characteristics and other features. An ODIS (Ontology Driven Information System) contains three kinds of components: application programs, information resources, and user interfaces.

The gene knowledge extraction module mainly focuses on the performance of the individual gene expression data. There are two kinds of methods for measuring semantic similarity within an ontology: edge counting methods and information-theoretic methods.
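An edge counting measure can be sketched on a small is-a hierarchy. The ontology below is hypothetical, and converting the edge count into a similarity with 1 / (1 + distance) is one common choice assumed here; the text does not specify which formula is used.

```python
# Hypothetical is-a hierarchy: child concept -> parent concept.
PARENT = {
    "kinase": "enzyme",
    "phosphatase": "enzyme",
    "enzyme": "protein",
    "receptor": "protein",
    "protein": "molecule",
}


def ancestors(concept):
    """Return the path from a concept up to the root, concept first."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def edge_distance(a, b):
    """Number of edges on the path between a and b through their
    lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    steps_in_b = {concept: i for i, concept in enumerate(path_b)}
    for steps_a, concept in enumerate(path_a):
        if concept in steps_in_b:
            return steps_a + steps_in_b[concept]
    raise ValueError("concepts share no common ancestor")


def edge_similarity(a, b):
    """Similarity in (0, 1]; identical concepts score 1.0."""
    return 1.0 / (1.0 + edge_distance(a, b))
```

For example, "kinase" and "phosphatase" are two edges apart (through "enzyme"), whereas "kinase" and "receptor" are three edges apart (through "protein"), so the first pair is scored as more similar. Information-theoretic methods would instead weight concepts by how often they occur in annotated data.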

Ontological Mapping

In this module, ontological mapping is performed, that is, the mapping of two different gene expression data sets to find the difference in gene characteristics. During this step, KEOPS (Kinase Endo peptidase and Other Proteins of Small size) suggests building a MODB (Mining Oriented Database) by mapping original data to ontology concepts. The database contains only bottom ontology concepts. The objective is to structure knowledge and data in order to perform mining tasks efficiently and to save time spent on data preparation. The idea is to allow the generation of multiple data sets from the MODB, using ontology relationships, without another preparation step from raw data. Furthermore, during ODIS construction, experts can express their knowledge using the ontology, which is consistent with the data. Databases often contain several tables sharing similar information. However, it is desirable that each MODB table contains all the semantically close information, and it is important to observe normal forms in these tables. During data set generation, it is easy to use joins to create interesting data sets to be mined. Ontological mapping is seen as a solution provider in today's analysis, and it will also provide insights into the pragmatics of ontology mapping for gene expression elaboration.
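The MODB idea can be illustrated with a small sketch: raw rows are mapped once to bottom ontology concepts, and several data sets are then generated from that store using an ontology relationship, without returning to the raw data. All names, mappings, and values are hypothetical, and averaging values when concepts are rolled up is an assumption made for illustration.

```python
# Hypothetical mapping from raw gene identifiers to bottom ontology concepts.
GENE_TO_CONCEPT = {"g1": "kinase", "g2": "phosphatase", "g3": "receptor"}

# Hypothetical ontology relationship: bottom concept -> more general concept.
GENERALIZES_TO = {
    "kinase": "enzyme",
    "phosphatase": "enzyme",
    "receptor": "membrane protein",
}


def build_modb(raw_rows):
    """Map raw (sample, gene, value) rows to rows keyed by bottom concepts."""
    return [
        {"sample": sample, "concept": GENE_TO_CONCEPT[gene], "value": value}
        for sample, gene, value in raw_rows
        if gene in GENE_TO_CONCEPT  # rows without a known concept are skipped
    ]


def generate_dataset(modb, generalize=False):
    """Build one record per sample from the MODB. With generalize=True,
    concepts are rolled up one ontology level and their values averaged."""
    grouped = {}
    for row in modb:
        concept = GENERALIZES_TO[row["concept"]] if generalize else row["concept"]
        grouped.setdefault(row["sample"], {}).setdefault(concept, []).append(row["value"])
    return {
        sample: {c: sum(vs) / len(vs) for c, vs in columns.items()}
        for sample, columns in grouped.items()
    }


raw = [("s1", "g1", 2.0), ("s1", "g2", 4.0), ("s1", "g3", 1.0),
       ("s2", "g1", 0.0), ("s2", "g2", 1.0)]
modb = build_modb(raw)
detailed = generate_dataset(modb)
general = generate_dataset(modb, generalize=True)
```

Both `detailed` and `general` are derived from the same `modb`, which is the point of the design: the costly preparation from raw data happens once, and different views are produced afterwards through ontology relationships.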

Gene Expression Design

In this module, genes work in coordination as gene expression or gene networks, and the module is designed to find gene expression patterns by grouping genes. Here, the genes whose expression is modulated by genetic variants (different genes in the human body) act as trans-regulated gene modules in humans to identify cancer. Trans-regulatory elements are genes that may modify the expression of distant genes. A supervised multi attribute clustering algorithm is used in the system to find the co-regulated clusters of genes whose collective expression is strongly associated with the sample categories of gene expressions.

6 CONCLUSION

In conclusion, a reliable and precise classification of tumors is essential for successful diagnosis and treatment of cancer. By allowing the monitoring of expression levels in cells for thousands of genes simultaneously, microarray experiments may lead to a more complete understanding of the molecular variations among tumors and hence to a finer and more informative classification. The ability to successfully distinguish between tumor classes using gene expression data is an important aspect of this novel approach to cancer classification. It is also noted that comparing the activity of genes in healthy and cancerous tissue may give some hints about the genes that are involved in cancer. However, this approach is very limited because many genes serve multiple functions, and changes in gene expression can be due to factors not directly concerned with the particular experiment. Indeed, a microarray data set contains numerous groups of co-expressed genes. Thus, biologists currently follow exploratory strategies by manually selecting potential groups of genes according to their knowledge. Consequently, when experimenting with these “superficial” data and applying various data mining techniques to them, the results are rather vague and imprecise. Therefore, the input data have to be concise and close to accurate to obtain results of the same nature.

REFERENCES

[1] B. Collard, “How to semantically enhance a data mining process?” Lecture Notes Bus. Inform. Process., vol. 13, pp. 103–116, 2009.

[2] D. Jiang, C. Tang, and A. Zhang, “Cluster analysis for gene expression data: A survey,” IEEE Trans. Knowl. Data Eng., vol. 16, no. 11, pp. 1370–1386, Nov. 2004.

[3] G.-M. Elizabeth and P. Giovanni, (2004, Dec.). “Clustering and classification methods for gene expression data analysis.” Johns Hopkins Univ., Dept. of Biostatist. Working Papers. Working Paper 70. [Online]. Available: http://biostats.bepress..com/jhubiostat/paper7

[4] Lipo Wang, Feng Chu, and Wei Xie, “Accurate cancer classification using expressions of very few genes,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 4, pp. 40–52, 2007.

[5] N. Pasquier, C. Pasquier, L. Brisson, and M. Collard, (2008). “Mining gene expression data using domain knowledge,” Int. J. Softw. Informat., vol. 2, no. 2, pp. 215–231.

[6] Revathy and R. Amalraj, “Accurate cancer classification using expressions of very few genes,” International Journal of Computer Applications, vol. 14, no. 4, 2011.

[7] S. Dudoit, J. Fridlyand, and T. P. Speed, “Comparison of discrimination methods for the classification of tumors using gene expression data,” J. Amer. Statist. Assoc., vol. 97, no. 457, pp. 77–87, Mar. 2002.

[8] Shasurya Jauhari and S. A. M. Rizvi, (2014). “Mining gene expression data focusing cancer therapeutics: A digest,” IEEE Transactions on Computational Biology and Bioinformatics, vol. 11, no. 3.

[9] Shital Shah and Andrew Kusiak, “Cancer gene search with data-mining and genetic algorithms,” Elsevier, vol. 37, 2007.

[10] The Human Epigenome Project. (2013). [Online]. Available: http://www.epigenome.org