ROUGH SET BASED CLUSTERING OF GENE EXPRESSION DATA: A SURVEY


J.JEBA EMILYN

Department of IT, Sona College of Technology, TPT Road, Salem, Tamilnadu, India*

DR.K.RAMAR

Principal, Sri Vidhya College of Engineering & Technology, Sivakasi Main Road, Virudhunagar, Tamilnadu, India

Abstract:

Microarray technology has made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. However, the high dimensionality of gene expression data makes it difficult to analyze. Many clustering algorithms are available for this task. In this paper we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering gene expression data. We then introduce rough clustering and explore its advantages over strict and fuzzy clustering. We also explain why rough clustering is preferred over other conventional methods by surveying a few clustering algorithms for gene expression data that are based on rough set theory. We conclude that this area is a promising field for the research community.

Keywords: Cluster Algorithm; Gene Expression Data; Rough Sets

1. Introduction

1.1 Genes and Their Functions

1.1.1) What are Genes?:

The nucleus embedded in the cell contains DNA, a one dimensional molecule, made of two complementary strands, coiled around each other as a double helix. A gene is a segment of DNA, which contains the formula for the chemical composition of one particular protein. Genetic information is encoded in the linear sequence in which the bases on the two strands are ordered along the DNA molecule.

1.1.2) Functional analysis of Genes:

The functional analysis of genes is a way to find what roles the genes play in the living organism. It is important to understand which proteins the genes code for, where the genes are expressed (in which tissues or organs), and when they are expressed. Methods for studying gene expression generally measure expression at the transcription level. An important and effective method for analysing gene expression at the transcription level is the microarray method.

1.2 Microarray Technology

Microarray technologies make possible the speedy and quantitative genome-scale analysis of gene expression patterns. It is often an important task to find genes with similar expression patterns (co-expressed genes) from microarray data. A microarray image is shown in Fig. 1.

Fig. 1. Microarray image

A microarray is a powerful tool for gene function analysis. Applications of microarrays include disease diagnosis, gene discovery, drug discovery, and toxicological research.

1.3 Gene Expression Data

Gene expression data is obtained by extracting quantitative information from the images or patterns that result from reading out fluorescent or radioactive hybridizations on a microarray chip. Usually, gene expression data is arranged in a data matrix, where each gene corresponds to one row and each condition to one column. Each element of this matrix is a real number representing the expression level of a gene under a specific condition, usually the logarithm of the relative abundance of the gene's mRNA under that condition.

Fig. 2. Data on n genes from m samples

Gene expression matrices have been extensively analyzed along two dimensions: the gene dimension and the condition dimension. These analyses correspond, respectively, to comparing the rows of the matrix to analyze the expression patterns of genes, and to comparing the columns of the matrix to analyze the expression patterns of samples.
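The matrix layout described above can be illustrated with a small sketch. The gene names, condition names, and abundance ratios are hypothetical, and base-2 logarithms are an assumption, since the text only says "the logarithm".

```python
import math

# Hypothetical relative mRNA abundances for 3 genes under 4 conditions.
ratios = {
    "gene_a": [1.0, 2.0, 4.0, 8.0],
    "gene_b": [1.0, 0.5, 0.25, 0.125],
    "gene_c": [2.0, 2.0, 2.0, 2.0],
}
conditions = ["c1", "c2", "c3", "c4"]
genes = list(ratios)

# Expression matrix: one row per gene, one column per condition.
# Each entry is the log2 of the relative abundance (base 2 is assumed here).
matrix = [[math.log2(value) for value in ratios[gene]] for gene in genes]


def gene_profile(i):
    """Row i: the expression pattern of one gene across all conditions."""
    return matrix[i]


def condition_profile(j):
    """Column j: the expression pattern of one condition across all genes."""
    return [row[j] for row in matrix]
```

Gene-based analysis compares the outputs of `gene_profile`, while sample-based analysis compares the outputs of `condition_profile`.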

Several obvious aims of such data analysis are the following:

1. Identify genes whose expression levels reflect biological processes of interest (such as development of cancer).

2. Group the tumors into classes that can be differentiated on the basis of their expression profiles, possibly in a way that can be interpreted in terms of clinical classification. For example one hopes to use the expression profile of a tumor to select the most effective therapy.

3. Finally, the analysis can provide clues and guesses for the function of genes (proteins) of yet unknown role.

1.4 Curse of Dimensionality

The term "curse of dimensionality" is due to Bellman; in statistics it relates to the fact that the convergence of any estimator to the true value of a smooth function defined on a space of high dimension is very slow. High-dimensional data such as gene expression data is difficult to work with for several reasons:

1. Adding more features can increase the noise, and hence the error.

2. There are not enough observations to get good estimates.

3. Most of the data is in the tails.
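The remark that most of the data lies in the tails can be made concrete with a simple volume argument. The sketch below assumes points spread uniformly over a unit hypercube, which is an illustrative assumption rather than a model of real expression data; the function name and the shell width of 0.05 are hypothetical.

```python
def shell_fraction(dim, shell=0.05):
    """Fraction of a unit hypercube's volume lying within `shell` of its boundary.

    The inner cube that stays at least `shell` away from every face has side
    length 1 - 2 * shell, so its volume is (1 - 2 * shell) ** dim.
    """
    inner_side = 1.0 - 2.0 * shell
    return 1.0 - inner_side ** dim


# The fraction of volume near the boundary grows quickly with the dimension.
for dim in (1, 2, 10, 100, 1000):
    print(dim, shell_fraction(dim))
```

With thousands of genes as features, almost all of the volume, and hence almost all uniformly spread samples, sits close to the boundary of the space.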

1.5 Clustering Gene Expression Data

Clustering techniques, which are essential in the data mining process for exploring natural structure and identifying interesting patterns in the underlying data, have proved useful in finding coexpressed genes. In cluster analysis, one wishes to partition a given dataset into groups based on given features such that the data objects in the same group are more similar to each other than to the data objects in other groups. The objects are grouped on the principle of maximizing intraclass similarity and minimizing interclass similarity.

A wide variety of clustering algorithms is available for gene expression data [1][2]. They are mainly classified as partitioning methods, hierarchical methods, density-based methods, model-based methods, graph-theoretic methods, soft computing methods, etc.

2. Problem Statement

The activity of genes can be uncorrelated over a number of experimental conditions, while genes are coexpressed and coregulated under multiple experimental conditions. Conventional clustering algorithms do not handle this phenomenon well. The main idea is to find methodologies or frameworks that can improve the quality of the clusters.

3. Rough set based clustering

3.1. Rough Sets

Rough Set Theory (RST) [5] can be approached as an extension of classical set theory for representing incomplete knowledge. Rough sets can be considered sets with fuzzy boundaries, that is, sets that cannot be precisely characterized using the available set of attributes. The basic concept of RST is the notion of an approximation space, which is an ordered pair $A = (U, R)$, where

• $U$: nonempty set of objects, called the universe

• $R$: equivalence relation on $U$, called the indiscernibility relation. If $x, y \in U$ and $xRy$, then $x$ and $y$ are indistinguishable in $A$.

Each equivalence class induced by $R$, i.e., each element of the quotient set $\tilde{R} = U/R$, is called an elementary set in $A$. An approximation space can alternatively be denoted by $A = (U, \tilde{R})$. It is assumed that the empty set is also elementary for every approximation space $A$. A definable set in $A$ is any finite union of elementary sets in $A$. For $x \in U$, let $[x]_R$ denote the equivalence class of $R$ containing $x$. Each $X \subseteq U$ is characterized in $A$ by a pair of sets, its lower and upper approximations in $A$, defined respectively as:

$$A_{\mathrm{low}}(X) = \{x \in U \mid [x]_R \subseteq X\}$$

$$A_{\mathrm{upp}}(X) = \{x \in U \mid [x]_R \cap X \neq \emptyset\}$$
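The two approximations can be computed directly from the equivalence classes. The following is a minimal sketch, assuming the indiscernibility relation is induced by a single attribute function `key`; the function names and the toy gene data are hypothetical.

```python
def equivalence_classes(universe, key):
    """Partition `universe` into classes of objects that share the same key(x)."""
    classes = {}
    for x in universe:
        classes.setdefault(key(x), set()).add(x)
    return list(classes.values())


def approximations(universe, key, target):
    """Return the (lower, upper) approximations of `target` in (universe, R)."""
    lower, upper = set(), set()
    for cls in equivalence_classes(universe, key):
        if cls <= target:   # class entirely inside the target set
            lower |= cls
        if cls & target:    # class shares at least one object with the target set
            upper |= cls
    return lower, upper


# Hypothetical toy data: genes described by a discretized expression level.
level = {"g1": "high", "g2": "high", "g3": "low", "g4": "low", "g5": "mid"}
universe = set(level)
target = {"g1", "g2", "g3"}

lower, upper = approximations(universe, level.get, target)
boundary = upper - lower
```

Here `g3` and `g4` are indiscernible, but only `g3` belongs to the target set, so both end up in the boundary region: they possibly, but not certainly, belong to the set.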

3.2. Clustering Using Rough Sets


3.3. Rough clustering Vs Fuzzy and Crisp Clustering

Crisp (strict) clustering algorithms such as k-means and k-medoid require that a data object belong to exactly one cluster. This can be too restrictive when clustering high-dimensional data such as gene expression data, because genes can be expressed under multiple conditions. Fuzzy clustering algorithms such as fuzzy c-means allow data objects to belong to multiple clusters with different degrees of membership. Although this is an improvement over crisp clustering, it can be too descriptive for interpreting cluster results, as stated in [3]. Rough set based clustering provides a solution that is less restrictive than traditional clustering algorithms such as k-means and less descriptive than fuzzy clustering methods.
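The difference between crisp and rough assignment can be sketched as follows. This is an illustration in the spirit of rough k-means, not the exact rule of any surveyed paper: the ratio test and the `threshold` value are assumptions, and the function names are hypothetical.

```python
import math


def crisp_assign(point, centroids):
    """Crisp rule: the point belongs to exactly one cluster, the nearest one."""
    dists = [math.dist(point, c) for c in centroids]
    return min(range(len(centroids)), key=dists.__getitem__)


def rough_assign(point, centroids, threshold=1.3):
    """Rough rule (assumed ratio test).

    If another centroid is almost as close as the nearest one, the point goes
    to the upper approximations of all such clusters and to no lower
    approximation. Otherwise it goes to the lower (and upper) approximation
    of the nearest cluster.
    """
    dists = [math.dist(point, c) for c in centroids]
    nearest = min(range(len(centroids)), key=dists.__getitem__)
    close = [j for j, d in enumerate(dists)
             if j != nearest and d <= threshold * dists[nearest]]
    if close:
        return {"lower": None, "upper": sorted([nearest] + close)}
    return {"lower": nearest, "upper": [nearest]}


centroids = [(0.0, 0.0), (10.0, 0.0)]
clear_case = rough_assign((1.0, 0.0), centroids)
ambiguous_case = rough_assign((4.5, 0.0), centroids)
```

Unlike a fuzzy membership vector, the result only records whether an object certainly or possibly belongs to a cluster, which is what makes it less descriptive than fuzzy clustering and less restrictive than crisp clustering.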

4. Cluster Algorithms Based On Rough Set Theory

Gene expression data can be analyzed along two dimensions: the gene dimension and the condition dimension. In gene-based clustering, the genes are treated as the objects and the samples as the features. In sample-based clustering, the samples are partitioned into homogeneous groups, with the genes regarded as features and the samples as objects. Both gene-based and sample-based clustering approaches search for exclusive and exhaustive partitions of objects that share the same feature space. A third category, called biclustering, captures clusters formed by a subset of genes across a subset of samples.

An approach called Rough Overlapping Biclusters (ROB), presented by Ruizhi Wang et al. [4], finds potentially overlapping biclusters in the framework of generalized rough sets. The method consists of two main phases. First, it generates a set of highly coherent seeds (original biclusters) based on two-way rough k-means clustering. The membership of a data object is given by the following ratio:

(1)

where $d(v, m_j)$ is the distance between the data object $v$ and the centroid of cluster $m_j$. Second, the seeds are iteratively adjusted (enlarged or degenerated) by adding or removing genes and conditions based on a proposed criterion. The method is illustrated on yeast gene expression data. The result is a set of biclusters of maximum size, with stronger coherence and a reasonable degree of overlap. By associating each bicluster with a lower and an upper approximation, the approach dynamically adjusts the memberships of genes and conditions. This approach is reported to work better than the Cheng and Church biclustering algorithm [6] and FLOC (FLexible Overlapped Biclusters) [7].
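The notion of bicluster coherence can be illustrated with the mean squared residue, the score used by the Cheng and Church algorithm [6] that serves as a baseline here. The sketch below does not reproduce the criterion proposed for ROB, which is not given in this text; the function name and the toy matrices are hypothetical.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster given by index lists rows and cols.

    A score of 0 means every entry is explained by its row mean, its column
    mean, and the overall mean, i.e. the bicluster is perfectly coherent.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(row) / n_cols for row in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum((sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
                for i in range(n_rows) for j in range(n_cols))
    return total / (n_rows * n_cols)


# Rows differ only by a constant shift, so this bicluster is coherent.
coherent = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
# The two rows move in opposite directions, so the residue is positive.
incoherent = [[1.0, 2.0], [2.0, 1.0]]

score_coherent = mean_squared_residue(coherent, [0, 1, 2], [0, 1, 2])
score_incoherent = mean_squared_residue(incoherent, [0, 1], [0, 1])
```

A seed-adjustment phase of the kind described above would add or remove a gene or condition and keep the change only if a coherence score such as this one stays acceptable.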

Lijun Sun et al. [8] proposed a method that combines correlation-based clustering with rough set attribute reduction for gene selection from gene expression data. Correlation-based clustering is used as a filter to eliminate redundant attributes, and then the minimal reduct of the filtered attribute set is computed using rough sets. The correlation coefficient between two genes is given as

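$$\rho(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} \qquad (2)$$

where $x$ and $y$ are the expression profiles of the two genes, var(·) denotes the variance and cov(·) the covariance. The result is a successful gene selection method based on rough set theory. The experimental results indicate that the rough set based method has the potential to become a useful tool in bioinformatics [8].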
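The filtering idea can be sketched in a few lines, assuming the correlation coefficient is the standard Pearson coefficient. The greedy filter, the 0.95 threshold, and the toy profiles are hypothetical simplifications for illustration and are not the clustering procedure of [8].

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


def filter_redundant(profiles, threshold=0.95):
    """Greedy filter (assumed rule): keep a gene only if it is not strongly
    correlated, in absolute value, with any gene that was already kept."""
    kept = []
    for name, profile in profiles.items():
        if all(abs(pearson(profile, profiles[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical expression profiles over four samples.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with g1, so redundant
    "g3": [4.0, 1.0, 3.0, 2.0],
}
selected = filter_redundant(profiles)
```

The surviving genes would then be passed to the rough set attribute reduction step, which is not shown here.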

Jung-Hsien Chiang and Shing-Hua Ho [9] present a novel rough-based feature selection method for gene expression data analysis. The method, which is combined with an RBF neural network (RBFNN), finds the relevant features without requiring the number of clusters to be known a priori and identifies centers that approximate the correct ones. The average distance between two seed points is calculated using the following formula:

(3)


Pradipta Maji [10] proposed a new clustering algorithm, termed fuzzy–rough supervised attribute clustering (FRSAC), to find groups of coregulated genes whose collective expression is strongly associated with sample categories. The algorithm is based on the theory of fuzzy–rough sets and directly incorporates the information of sample categories into the gene clustering process. A new quantitative measure based on fuzzy–rough sets uses the sample categories to measure the similarity among genes, whereby redundancy among the genes is removed. The clusters are refined incrementally based on sample categories. The algorithm is compared with existing supervised and unsupervised gene selection and clustering algorithms and is reported to perform better. The improvement is attributed to the fuzzy–rough supervised similarity measure, which generates coregulated gene clusters with strong association to the class labels [10]. The fuzzy–rough property makes it possible to deal with uncertainty, vagueness, and incompleteness in the class definition.

5. Observation

Many researchers have stated that rough set based methods work better for gene expression data than other conventional methods. However, very few rough set based clustering methods are available for gene expression data. This makes it a promising field for the research community.

6. Conclusion

We have presented a survey of the clustering methods for gene expression data that are based on rough set theory. From the approaches analyzed in Section 4, it is our opinion that rough set based clustering algorithms help in identifying hidden patterns and provide an enhanced understanding of functional genomics. Many other application domains, such as web mining, text mining, and collaborative filtering, are open to exploration using rough set based clustering algorithms.

References

[1] Daxin Jiang, Chun Tang, and Aidong Zhang (2004) Cluster Analysis for Gene Expression Data: A Survey, IEEE Transactions on Knowledge and Data Engineering, Vol. 16, No. 11, November 2004.

[2] Sara C. Madeira and Arlindo L. Oliveira (2004) Biclustering Algorithms for Biological Data Analysis: A Survey, IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 1, No. 1, 2004.

[3] S. Thilagamani and N. Shanthi (2010) Literature Survey on Enhancing Cluster Quality, International Journal on Computer Science and Engineering, Vol. 02, No. 06, 2010, pages 1999-2002.

[4] Ruizhi Wang, Duoqian Miao, Gang Li, and Hongyun Zhang (2007) Rough Overlapping Biclustering of Gene Expression Data, Bioinformatics and Bioengineering, 2007.

[5] Z. Pawlak (1982) Rough Sets, International Journal of Computer and Information Sciences, Vol. 11, 1982, pages 341–356.

[6] Yizong Cheng and George M. Church (2000) Biclustering of Expression Data, in Proceedings of the 8th International Conference on Intelligent Systems for Molecular Biology (ISMB'00), pages 93–103, 2000.

[7] Jiong Yang, Wei Wang, Haixun Wang, and Philip Yu (2003) Enhanced Biclustering on Expression Data, in Proceedings of the 3rd IEEE Conference on Bioinformatics and Bioengineering, pages 321–327, 2003.

[8] Lijun Sun, Duoqian Miao, and Hongyun Zhang (2007) Gene Selection with Rough Sets for Cancer Classification, Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), IEEE, 2007.

[9] Jung-Hsien Chiang and Shing-Hua Ho (2008) A Combination of Rough-Based Feature Selection and RBF Neural Network for Classification Using Gene Expression Data, IEEE Transactions on Nanobioscience, Vol. 7, No. 1, March 2008.