imArray – An Integrated Data Management and Data Mining System for Microarray Data Analysis

Proposal Summary

Microarray technology is a powerful tool for genomic research and has great potential for clinical diagnosis. Although the technique can monitor the expression levels of thousands of genes in parallel, it requires intelligent software to process its data. In this proposal, we introduce the framework of our imArray system, which is intended to provide a comprehensive data and information management environment for microarray image analysis and multi-modality microarray data mining. The imArray system consists of three components: an image analysis engine, a data analysis engine, and a knowledge discovery and data mining engine. These components are integrated with IBM’s Unstructured Information Management Architecture (UIMA), a software architecture for converting unstructured information into structured knowledge. With this architecture, it becomes easier to relate information from multi-modality sources, including not only raw data but also experimental results, literature, and documents in other formats, to the relevant domain knowledge, and thereby to discover new knowledge previously hidden in the unstructured information.

1. Background and Motivation

1.1 Introduction to Microarray

The DNA microarray, also known as the DNA chip or gene chip, was used by Vishwanath Iyer, Patrick Brown, and colleagues in 1999 to analyze gene expression on a global scale, and it has become one of the most widely used functional genomics tools[1]. The principle of the microarray is akin to the reverse of classical northern blotting, a technique commonly used in molecular biology to study gene expression[2]. A DNA microarray experiment comprises hybridization, fluorescent image acquisition, and data analysis[3]. Cellular mRNA is first extracted and reverse-transcribed into fluorescently labeled complementary DNA (cDNA), which serves as the probe. The probe is then hybridized with isolated DNA fragments affixed to a solid surface such as a glass, plastic, or silicon chip. The unbound probe is washed off, and the bound probe is detected with a laser scanner at different wavelengths. Afterward, the scanned images are processed and the resulting data are analyzed by computer software. Microarray technology thus enables biologists to monitor the differential expression levels of thousands of genes simultaneously. Applications include profiling complex diseases such as cancer, analyzing drug effects, and studying the transcriptional regulation of genes under conditions such as signaling activation or inhibition, which shows its great potential for clinical diagnostics.

1.2 The Microarray Image Analysis and its Problems

Though the microarray experiment is very powerful, several sources of variation affect the precision of its data: (1) biological variation, which is intrinsic to all organisms, (2) technical variation, which is introduced during the operation of the experiment, and (3) measurement error, which is associated with obtaining the fluorescent signals[4]. In this proposal, we focus on the third source.

After the fluorescent image is obtained from the microarray experiment, analysis software uses a grid to address the positions of the spots on the chip (gridding) and then determines the boundary of each spot (segmentation). Since microarray images are often contaminated with noise, the results of automatic gridding and boundary determination are usually unsatisfactory. In 2004, Ahmed et al. reported that the performance of spot segmentation substantially affects the results of subsequent stages of the analysis[5]. Therefore, to reduce the effect of inaccurate gridding and segmentation, human operators need to fine-tune the boundary of each spot manually, which makes the process time-consuming, labor-intensive, and error-prone. It generally takes an experienced technician approximately 20 to 40 minutes to fine-tune a single microarray slide with approximately 43,200 spots on it.
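To make the gridding step concrete, the following is a minimal sketch of gridding by axis projections of image intensity, one of the families of methods discussed in this section. It is not the gridding method of the imArray system. The function names, the `frac` cutoff, and the toy image are assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def projection_runs(profile: Sequence[float], frac: float = 0.5) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, where profile > frac * max(profile)."""
    if not profile:
        return []
    cutoff = frac * max(profile)
    runs: List[Tuple[int, int]] = []
    start = None
    for i, value in enumerate(profile):
        if value > cutoff and start is None:
            start = i                      # a band of spots begins here
        elif value <= cutoff and start is not None:
            runs.append((start, i))        # the band ended at the previous index
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def grid_from_projections(image: Sequence[Sequence[float]], frac: float = 0.5):
    """Estimate row bands and column bands from the two axis projections."""
    row_profile = [sum(row) for row in image]          # projection onto the vertical axis
    col_profile = [sum(col) for col in zip(*image)]    # projection onto the horizontal axis
    return projection_runs(row_profile, frac), projection_runs(col_profile, frac)


# Hypothetical toy image: a 2 x 2 arrangement of 2 x 2-pixel spots on a dark background.
toy_image = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 9, 9, 0],
    [0, 9, 9, 0, 9, 9, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 9, 9, 0],
    [0, 9, 9, 0, 9, 9, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
row_bands, col_bands = grid_from_projections(toy_image)
```

On a clean image the bands recover the grid directly. On a noisy image the projections contain spurious peaks, which is one reason automatic gridding often needs manual correction.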

Since automatic gridding and automatic spot segmentation are both challenging tasks, they have generally been dealt with separately in the literature[6]. Several methods have been proposed for automatic gridding, including Markov random fields (MRF), template matching, seeded region growing, and axis projections of image intensity. These methods usually require the numbers of rows and columns as mandatory input parameters[7-10]. For spot segmentation, several existing methods are used to analyze microarray images: (1) Fixed circle segmentation and adaptive circle segmentation are two similar methods. The former places a circle of fixed radius on the spot, while the latter fits a circle of adjustable radius to the spot[6]. (2) Clustering-based segmentation uses k-means clustering or partitioning around medoids (PAM) to generate a binary partition of the pixels based on the distribution of their intensities[11]. (3) Adaptive shape segmentation commonly uses the watershed method or seeded region growing, both of which rely heavily on the selection of a seed point[6, 12]. (4) Adaptive thresholding computes a threshold based on the Mann-Whitney (MW) test[13]. (5) Histogram methods use the histogram of a masked area to determine foreground and background[6]. (6) Markov random field modeling treats segmentation as a pixel labeling problem[6]. (7) Gradient-based spot segmentation uses morphological analysis[14].
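As an illustration of clustering-based segmentation, the sketch below partitions the pixels of one spot window into background and foreground with one-dimensional two-class k-means. The patch values and function names are hypothetical, and the cited methods differ in their details.

```python
from typing import List, Sequence


def two_means_threshold(intensities: Sequence[float], max_iter: int = 100) -> float:
    """Run 1-D k-means with k = 2 and return the midpoint between the two centroids."""
    lo, hi = float(min(intensities)), float(max(intensities))
    if lo == hi:
        return lo                          # flat patch: nothing to separate
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0              # nearest-centroid boundary in one dimension
        background = [v for v in intensities if v <= mid]
        foreground = [v for v in intensities if v > mid]
        new_lo = sum(background) / len(background)
        new_hi = sum(foreground) / len(foreground)
        if new_lo == lo and new_hi == hi:
            break                          # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def segment_spot(patch: Sequence[Sequence[float]]) -> List[List[bool]]:
    """Label each pixel of the patch as foreground (True) or background (False)."""
    flat = [v for row in patch for v in row]
    threshold = two_means_threshold(flat)
    return [[v > threshold for v in row] for row in patch]


# Hypothetical 3 x 3 window around one spot (bright pixels in the lower right).
toy_patch = [
    [10, 12, 11],
    [11, 200, 210],
    [12, 205, 198],
]
spot_mask = segment_spot(toy_patch)
```

Because the partition depends only on intensity, isolated bright noise pixels are labeled as foreground too, so a sketch like this would normally follow a noise-removal step.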

Our Preliminary Work:

In our preliminary work, we developed a three-step approach to automatic gridding and spot segmentation: (1) background identification and noise removal, (2) fully automatic gridding, and (3) spot segmentation. The approach starts with a pre-labeling process that uses a local-global thresholding technique to identify the pixels that are most likely background and removes them from the foreground. It is followed by a noise pre-elimination process in which a voting method based on spatial connectivity eliminates the majority of the noise. The cleaned and pre-labeled data are then passed to the second step, in which blocks and grids are generated in a fully unsupervised manner, without any user intervention. The third step takes the gridded data and applies a simple, progressive spot segmentation method, based on a progressive two-class clustering scheme, to deal with inner holes and noise in spots. Although more complex spot segmentation methods have been proposed in the literature, our experiments show that our method is much faster and still effective. The approach can deal with heavily contaminated microarray images, generate grids fully automatically without any user input, and produce robust spot segmentation results [15].
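The voting rule used in our method is described in [15] and is not reproduced here. The sketch below only illustrates the general idea of noise removal by spatial connectivity, under an assumed rule: a foreground pixel is kept only if at least `min_votes` of its eight neighbors are also foreground. The function name, the parameter, and the toy mask are hypothetical.

```python
from typing import List, Sequence


def remove_isolated_pixels(mask: Sequence[Sequence[int]], min_votes: int = 2) -> List[List[int]]:
    """Clear foreground pixels (1) that have fewer than min_votes foreground 8-neighbors."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    cleaned = [list(row) for row in mask]          # votes are counted on the original mask
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue
            votes = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if dr == 0 and dc == 0:
                        continue                   # a pixel does not vote for itself
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]:
                        votes += 1
            if votes < min_votes:
                cleaned[r][c] = 0                  # too few supporters: treat as noise
    return cleaned


# Hypothetical pre-labeled mask: one 2 x 2 spot plus two stray noise pixels.
noisy_mask = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
clean_mask = remove_isolated_pixels(noisy_mask)
```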

1.3 The Microarray Data Mining and its Problems

Since the DNA microarray is a high-throughput method for detecting gene expression levels, a single slide can generate an enormous amount of data. Such a volume of data cannot be analyzed, mined, and turned into knowledge manually. Thus, a (semi)automatic yet powerful tool is essential for obtaining the patterns of differentially expressed genes, integrating the differential expression profiles with existing knowledge in the literature, and finally discovering new knowledge.

To achieve these goals, several statistical and data mining techniques have been applied to extract new information from the data generated by microarray experiments. Recall that one of the three sources of variation is technical variation, which strongly affects the hybridization intensities. Data normalization is therefore an essential preprocessing step, especially for comparing gene expression levels across slides[16].
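One simple normalization scheme, shown here only as a sketch, centers the log2 ratios of each slide on their median so that slides become comparable. It is not necessarily the scheme that will be used in imArray. The function names and the intensity values are hypothetical, and all intensities are assumed to be positive.

```python
import math
from statistics import median
from typing import List, Sequence


def log_ratios(red: Sequence[float], green: Sequence[float]) -> List[float]:
    """Per-spot log2(red / green); both channels must be strictly positive."""
    return [math.log2(r / g) for r, g in zip(red, green)]


def median_center(values: Sequence[float]) -> List[float]:
    """Shift the values so that their median becomes zero."""
    m = median(values)
    return [v - m for v in values]


# Hypothetical two-channel intensities for three spots on one slide.
red_channel = [200.0, 400.0, 800.0]
green_channel = [100.0, 100.0, 100.0]

raw = log_ratios(red_channel, green_channel)
normalized = median_center(raw)
```

After centering, a log ratio of zero corresponds to the typical spot on the slide, which removes a global dye or scanner bias but not intensity-dependent effects.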

The first step in discovering the gene expression pattern is gene selection, which identifies the differentially expressed genes in the normalized data. Significance analysis of microarrays (SAM) is used to determine a threshold for finding the truly up-regulated or down-regulated genes while reducing false positives [16, 17]. Second, exploratory data analysis extracts the pattern of the differentially expressed genes. Several clustering and dimension-reduction methods, such as hierarchical clustering, self-organizing maps (SOM), k-means clustering, and principal component analysis, are used to group genes that behave similarly across all samples [16]. Once a pattern of differentially expressed genes is discovered, discriminant analysis is carried out to associate the profile with a variable, such as disease or drug effect, by training a classifier [18, 19]. This classifier can identify the class of new samples and is thus useful in clinical diagnostics [20].
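The gene selection step can be illustrated with the relative difference statistic defined in the SAM paper [17], d = (mean_a - mean_b) / (s + s0). The sketch below computes only this statistic. SAM additionally chooses s0 from the data and uses permutations to estimate the false discovery rate, and both steps are omitted here. The fixed `s0`, the cutoff of 2.0, and the expression values are hypothetical.

```python
import math
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def relative_difference(group_a: Sequence[float], group_b: Sequence[float], s0: float = 0.1) -> float:
    """SAM-style statistic d = (mean_a - mean_b) / (s + s0) for one gene."""
    n1, n2 = len(group_a), len(group_b)
    m1, m2 = mean(group_a), mean(group_b)
    # Pooled sum of squared deviations over both groups.
    ss = sum((x - m1) ** 2 for x in group_a) + sum((y - m2) ** 2 for y in group_b)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    s = math.sqrt(a * ss)                  # gene-specific scatter
    return (m1 - m2) / (s + s0)            # s0 damps genes with very small scatter


# Hypothetical normalized log expression values: (condition A, condition B).
expression: Dict[str, Tuple[List[float], List[float]]] = {
    "G1": ([5.0, 5.2, 4.8], [1.0, 1.2, 0.8]),   # clearly different between conditions
    "G2": ([2.0, 2.1, 1.9], [2.0, 1.9, 2.1]),   # same values in both conditions
}
scores = {gene: relative_difference(a, b) for gene, (a, b) in expression.items()}
selected = [gene for gene, d in scores.items() if abs(d) > 2.0]
```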

In addition, gene expression is the result of a very complex mechanism, and analyzing microarray data from a pathway perspective can yield a higher-level understanding of the system. It is reasonable to assume that genes assigned to the same cluster could be involved in the same pathway. However, a higher-level network may be hidden behind the clusters[21]. To understand the entire network, we have to associate functional meaning with genes by using annotations such as the Gene Ontology (GO), which describes the roles of genes and their products with a controlled vocabulary[22]. Researchers may rely on robust gene annotations to link gene function to transcription profiles. Various models, such as Bayesian networks and Boolean networks, are currently used to construct this network from microarray data [23]. Constructing such a complicated network extends current knowledge and requires sufficient domain knowledge to support the task [24].
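As a small illustration of the Boolean network model mentioned above, the sketch below simulates a three-gene network with synchronous updates until it reaches a fixed point. The genes A, B, and C and their update rules are invented for illustration. Inferring such rules from microarray data is the hard part and is not shown.

```python
from typing import Callable, Dict, List

State = Dict[str, bool]
Rules = Dict[str, Callable[[State], bool]]


def step(state: State, rules: Rules) -> State:
    """Synchronous update: every gene's next value is computed from the current state."""
    return {gene: rule(state) for gene, rule in rules.items()}


def simulate(state: State, rules: Rules, max_steps: int = 10) -> List[State]:
    """Iterate until a fixed point is reached or max_steps updates have been applied."""
    trajectory = [dict(state)]
    for _ in range(max_steps):
        nxt = step(trajectory[-1], rules)
        if nxt == trajectory[-1]:
            break                          # fixed point: the state no longer changes
        trajectory.append(nxt)
    return trajectory


# Hypothetical regulatory rules: A is a constant input, A activates B, B represses C.
toy_rules: Rules = {
    "A": lambda s: s["A"],
    "B": lambda s: s["A"],
    "C": lambda s: not s["B"],
}
trajectory = simulate({"A": True, "B": False, "C": False}, toy_rules)
```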

Despite the limited success of existing microarray mining techniques, there is still a need for a cross-mining tool that can automatically collect, mine, and link related information from different modalities (literature, microarray image data, etc.). Since the literature is published in different formats and experimental data may contain images, figures, tables, or other formats, existing techniques, which usually deal with a single modality (especially textual information), are incapable of discovering and linking information across modalities. Thus, one of our goals in this project is to take advantage of the Unstructured Information Management Architecture (UIMA) to convert unstructured information (text, images, etc.) into a well-organized database with cross-mining capability.

2. UIMA and the Proposed Research

2.1. Introduction to UIMA

The Unstructured Information Management Architecture (UIMA) is an architecture and a framework for developing applications that analyze large amounts of unstructured data and support the discovery, organization, and delivery of relevant knowledge to the end user. UIMA-based applications analyze and extract useful information embedded in various unstructured documents by using a series of primitive analysis engines (AEs), which work as annotators and generate Common Analysis Structure (CAS) objects with annotations. UIMA thus converts unstructured information into structured information and stores it in a database as collections of structured information for future use.

UIMA also provides the collection-processing engine (CPE), which controls the application of AEs to the elements of a collection and manages the routing of results to CAS consumers. The CAS consumers finally produce an application-specific data structure, such as an index or a database, for the end user. The CPE starts with the collection reader, which determines the format of each document, obtains the document from the collection, and initializes a CAS with its contents and with the original document metadata needed by the subsequent AEs. Afterward, the AEs append their analysis results to the CAS as annotations alongside the document, and the CAS is routed to the CAS consumer, which supports querying and presentation for the end user [25-27].
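The data flow described above (collection reader, analysis engines, CAS consumer) can be sketched as follows. UIMA itself is a Java framework, and the Python classes and functions below are hypothetical stand-ins that only mirror the flow. They are not the UIMA API, and the gene lexicon and documents are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Sequence


@dataclass
class Annotation:
    kind: str
    begin: int
    end: int
    text: str


@dataclass
class Cas:
    """Toy stand-in for a CAS: the document plus the annotations added so far."""
    doc_id: str
    document: str
    annotations: List[Annotation] = field(default_factory=list)


def collection_reader(collection: Dict[str, str]) -> Iterator[Cas]:
    """Initialize one CAS per document in the collection."""
    for doc_id, text in collection.items():
        yield Cas(doc_id=doc_id, document=text)


def gene_annotator(cas: Cas, lexicon: Sequence[str] = ("TP53", "BRCA1")) -> None:
    """Analysis engine: annotate every occurrence of a lexicon entry."""
    for gene in lexicon:
        start = cas.document.find(gene)
        while start != -1:
            cas.annotations.append(Annotation("Gene", start, start + len(gene), gene))
            start = cas.document.find(gene, start + 1)


def index_consumer(cases: Iterable[Cas]) -> Dict[str, List[str]]:
    """CAS consumer: build a gene -> document ids index from the annotations."""
    index: Dict[str, List[str]] = {}
    for cas in cases:
        for ann in cas.annotations:
            doc_ids = index.setdefault(ann.text, [])
            if cas.doc_id not in doc_ids:
                doc_ids.append(cas.doc_id)
    return index


def run_pipeline(collection: Dict[str, str], engines: Sequence[Callable[[Cas], None]]) -> Dict[str, List[str]]:
    """Toy collection-processing loop: read, annotate with each engine, then consume."""
    processed = []
    for cas in collection_reader(collection):
        for engine in engines:
            engine(cas)
        processed.append(cas)
    return index_consumer(processed)


toy_collection = {
    "doc1": "TP53 is induced after irradiation.",
    "doc2": "BRCA1 and TP53 were both repressed.",
}
gene_index = run_pipeline(toy_collection, [gene_annotator])
```

Because each engine only reads the CAS and appends annotations to it, engines can be added, removed, or reordered without changing the reader or the consumer, which is the component-based property this proposal relies on.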

UIMA serves as a bridge between unstructured information and structured knowledge. As an industrial-strength, scalable integration platform, it enables us to build a powerful information management system for unstructured data such as microarray data, which is essential for organizing the abundant documents and discovering useful knowledge that was previously unknown. UIMA can be regarded as a software agent that analyzes documents of interest, identifies the relevant entities and relationships in them, and finally generates and indexes the information in structured form for efficient searching in the future.

2.2. The Proposed Project

Biological experiments generate results in various formats, such as numerical data, images, audio, and video. These unstructured results contain an enormous amount of direct and indirect evidence for discovering useful knowledge, so UIMA fits our need to manage different types of experimental data and the related references. We propose to design and implement different types of analysis engines for processing and mining biological experimental data, especially microarray data. The analysis engines read and analyze the experimental data and add the discovered information to the CAS as annotations. Finally, we store the structured CAS along with the raw data in a database for future data mining.

In this project, we aim to develop and implement a fully automatic information management system for microarray image analysis and data mining. The analysis and mining components developed in this project are expected to be reusable by any party interested in microarray data analysis or related data queries. UIMA provides an ideal platform for this purpose because of its component-based nature.

This proposal describes the proposed imArray system, which integrates microarray image processing, data analysis, and knowledge discovery and data mining with the Unstructured Information Management Architecture (UIMA). The project is divided into three phases.

2.2.1 Phase I – Implementing the Analysis Engine for Microarray Image Analysis and Information Extraction

Our short-term goal is to develop an aggregate analysis engine on the basis of our previous work on microarray image analysis. Though the raw images obtained from microarray experiments are unstructured image files, we can turn them into structured information, including gene names, intensity values of gene expression levels, experimental conditions, and so on. On that basis, an aggregate analysis engine for microarray image collections will be developed. It will take as input the specific information about the spots, the detailed record of the experimental conditions of the sample, and the raw images produced by the microarray experiments, and it will then use our three-step approach to obtain the segmentation result for every spot. The aggregate analysis engine consists of several primitive analysis engines, which are responsible for identifying background pixels, removing noise, detecting slide margins, finding blocks, generating grids, and finally producing the segmentation results. Each analysis engine will append its analysis result to the CAS as annotations. The raw images, along with the annotations, are stored in the database for further analysis. The Phase I implementation is expected to be finished within three months.

2.2.2 Phase II – Implementing the Analysis Engine for Microarray Data Analysis

In the second phase, our goal is to design analysis engines that retrieve the data stored in the database, apply data normalization and statistical analysis, and record the statistical results as annotations. The information generated in Phase I needs further processing to derive data for comparative study, and the spot signals need further analysis to detect the gene expression pattern under a given experimental condition. Thus, in the second phase of this project, we will also implement an analysis engine for gene selection based on significance analysis of microarrays (SAM). This analysis engine identifies the differentially expressed genes in the normalized data and determines a threshold for finding the true positives. In addition, we intend to implement another analysis engine that uses exploratory data analysis to extract the pattern of the differentially expressed genes. All of these analysis engines will add new annotations to the CAS. This information will be saved in the database for knowledge discovery and data mining in the third phase. We expect that new algorithms for analyzing microarray data, with better robustness and computational efficiency, will be developed in this project. This phase will take about six months to implement a prototype system and another six to eight months to improve the performance of the analysis algorithms.

2.2.3 Phase III – Automatic Knowledge Discovery and Data Mining from Microarray Database

The long-term goal of this project, which is to perform multi-modality data mining and knowledge discovery for microarray-related research, is reflected in Phase III. First, a data collecting component will be developed for automatically searching and collecting multi-modality information from various resources, such as the content of a microarray website, related literature, public databases, and so on. Second, the UIMA architecture and the associated analysis engines will be used to convert the gathered unstructured information into well-organized and indexed information at the collection level. Thus, not only the microarray experimental data but also the related literature and other existing knowledge in this research area can be linked together. Last but not least, data mining and knowledge discovery techniques can be used to find the connections between the entities of interest, which can be very application-specific in many cases. Further, it is our long-term plan to implement a collection-processing engine (CPE) that analyzes a collection of documents of interest, identifies the relevant entities and relationships in them, and finally generates and indexes the knowledge in structured form. The documents in a collection do not have to be of the same type. Instead, documents from different modalities but within the same discipline of study can be grouped and processed as a single collection. Phase III is the most time-consuming phase of this project because it has to integrate an intelligent learning component that learns from the existing knowledge in the literature. In return, it is the phase that makes it possible to discover new knowledge from the abundant data. Since this is the ultimate goal of this project, designing the mechanisms and implementing automatic knowledge discovery and data mining for microarray data may take at least another year.

3. The Impact of the Proposed Project

The imArray system proposed here extends our previous work on fully automatic gridding and segmentation for cDNA microarray image analysis. The imArray system takes advantage of UIMA to analyze and manage multi-modality unstructured information and then uses the well-organized information to discover new knowledge hidden in the enormous amount of unstructured data. UIMA plays a key role in the entire process of data handling, analysis, and data mining. We believe that microarray technology will be widely used for clinical purposes around the world, and that the proposed imArray system will serve as an integrated data management system that deals with nearly every aspect of microarray data processing, indexing, and querying.

4. Collaborators and Student Training

Our project also involves top researchers from the Department of Biostatistics and the School of Medicine at the University of Alabama at Birmingham (UAB), as listed below:

David B. Allison: Professor, Dept. of Biostatistics, University of Alabama at Birmingham, Birmingham, AL, USA.

Yufeng Li: Research Assistant Professor, Biostatistician, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA.

This project, if funded, will support one Ph.D. student for one year at the Department of Computer and Information Sciences at UAB. The student to be supported has a relevant background, with degrees in both Biology (B.S.) and Computer Science (M.S.).

A partial list of the publications and the submitted articles related to the proposed project is shown below:

1). “An Automated Gridding and Segmentation Method for cDNA Microarray Image Analysis,” accepted for publication, the 19th IEEE International Symposium on Computer-Based Medical Systems, June 22-23, 2006, Salt Lake City, Utah, USA.

2). “A Web-based system for clinical trials: to turn Access into a web-enabled secure information system,” under minor revision, Clinical Trials.

3). “A personalized information search and visualization system,” under minor revision, BMC Medical Informatics and Decision Making.

4). “An Efficient Hybrid Clustering Algorithm for Molecular Sequences Classification,” ACMSE 2006, Melbourne, Florida, USA.

References

1. Iyer, V.R., et al., The transcriptional program in the response of human fibroblasts to serum. Science, 1999. 283(5398): p. 83-7.

2. Wikipedia contributors. Northern blot. 2006 [cited 2006 07:57, April 13]; Available from: http://en.wikipedia.org/w/index.php?title=Northern_blot&oldid=41171638.

3. Duggan, D.J., et al., Expression profiling using cDNA microarrays. Nat Genet, 1999. 21(1 Suppl): p. 10-4.

4. Churchill, G.A., Fundamentals of experimental design for cDNA microarrays. Nat Genet, 2002. 32 Suppl: p. 490-5.

5. Ahmed, A.A., et al., Microarray segmentation methods significantly influence data precision. Nucleic Acids Res, 2004. 32(5): p. e50.

6. Demirkaya, O., M.H. Asyali, and M.M. Shoukri, Segmentation of cDNA microarray spots using markov random field modeling. Bioinformatics, 2005. 21(13): p. 2994-3000.

7. Jain, A.N., et al., Fully automatic quantification of microarray image data. Genome Res, 2002. 12(2): p. 325-32.

8. Yang, Y.H., et al., Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res, 2002. 30(4): p. e15.

9. Katzer, M., F. Kummert, and G. Sagerer, Methods for automatic microarray image segmentation. IEEE Trans Nanobioscience, 2003. 2(4): p. 202-14.

10. Katzer, M., F. Kummert, and G. Sagerer, A Markov Random Field Model of microarray gridding, in Proceedings of the 2003 ACM Symposium on Applied computing (SAC). 2003, ACM Press: Melbourne, Florida. p. 72-77.

11. Jung, H.Y. and H.G. Cho, An automatic block and spot indexing with k-nearest neighbors graph for microarray image analysis. Bioinformatics, 2002. 18 Suppl 2: p. S141-51.

12. Yang, Y.H., M.J. Buckley, and T.P. Speed, Analysis of cDNA microarray images. Brief Bioinform, 2001. 2(4): p. 341-9.

13. Chen, Y., E.R. Dougherty, and M.L. Bittner, Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics, 1997. 2(4): p. 364-374.

14. Buhler, J., T. Ideker, and D. Haynor, Dapple: Improved techniques for finding spots on DNA microarrays. 2000, University of Washington: Seattle, WA. p. 1-12.

15. Chen, W.B., C. Zhang, and W.L. Liu. An Automated Gridding and Segmentation Method for cDNA Microarray Image Analysis. in Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems (CBMS 2006). 2006. Salt Lake City, Utah.

16. Saviozzi, S., et al., Microarray data analysis and mining. Methods Mol Med, 2004. 94: p. 67-90.

17. Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21.

18. Shipp, M.A., et al., Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med, 2002. 8(1): p. 68-74.

19. Pomeroy, S.L., et al., Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 2002. 415(6870): p. 436-42.

20. Golub, T.R., et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 1999. 286(5439): p. 531-7.

21. Pilpel, Y., P. Sudarsanam, and G.M. Church, Identifying regulatory networks by combinatorial analysis of promoter elements. Nat Genet, 2001. 29(2): p. 153-9.

22. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.

23. de Jong, H., Modeling and simulation of genetic regulatory systems: a literature review. J Comput Biol, 2002. 9(1): p. 67-103.

24. Stevens, R., C.A. Goble, and S. Bechhofer, Ontology-based knowledge representation for bioinformatics. Brief Bioinform, 2000. 1(4): p. 398-414.

25. Ferrucci, D. and A. Lally, Building an example application with the Unstructured Information Management Architecture. IBM Systems Journal, 2004. 43(3): p. 455-475.

26. Mack, R., et al., Text analytics for life science using the Unstructured Information Management Architecture. IBM Systems Journal, 2004. 43: p. 490-515.

27. Uramoto, N., et al., A text-mining system for knowledge discovery from biomedical documents. IBM Systems Journal, 2004. 43: p. 516-533.