MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R

(1)

MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R

By

Vasundhara Akkineni

B.Tech, University of Madras, 2003

A Thesis

Submitted to the Faculty of the

Graduate School of the University of Louisville In Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Department of Computer Engineering and Computer Science University of Louisville

Louisville, Kentucky

(2)

MICROARRAY DATA ANALYSIS USING JAVA AND R

By

Vasundhara Akkineni

B.Tech, University of Madras, 2003

A Thesis Approved on

April 14, 2006

By the following Thesis Committee:

Dr. Eric C. Rouchka, Thesis Director

Dr. Dar-jen Chang

(3)

DEDICATION

Dedicated to my parents

Mr. Sarat Kumar Akkineni

and

Mrs. Surya Rani Akkineni

(4)

ACKNOWLEDGEMENTS

First, I would like to thank my thesis director, Dr. Eric C. Rouchka for his direction, assistance and guidance. I would also like to thank the members of my thesis committee, Dr. Dar-jen Chang and Dr. Thomas Knudsen for their time. I thank Tim Hardin, Elizabeth Cha, Yamini Rudraraju and Eric Stutzenberger for making the Bioinformatics lab an enjoyable place to come to each day. Additional thanks to the other members of the Bioinformatics Research Group. I thank all the friends I have made over the years at the University of Louisville. I have learned a lot from each one of you. Finally, I thank my parents with all due respect for their love, support and encouragement. Support for this project was provided by NIH-NCRR grant # P20 RR16481 (Nigel Cooper, PI).

(5)

ABSTRACT

MICROARRAY DATA ANALYSIS USING JAVA AND R

VASUNDHARA AKKINENI

APRIL 14, 2006

Microarray technology has become an essential tool in functional genomics for monitoring the expression of many genes in parallel. Gene expression values obtained from microarray experiments help biologists to understand the way in which a cell responds to varying conditions (including, but not limited to development over time, response to environmental stimuli, or disease states) by analyzing the increase or decrease in the expression level of genes. We have developed web-based software that provides biologists with several statistical solutions for analyzing gene expression data. This platform independent java servlet first performs normalization of the gene expression values in order to eliminate any systematic bias in the measured intensity values arising from the microarray process. Several normalization methods like Total Intensity Normalization, Median Normalization and Lowess Normalization have been implemented. After normalization, visualization of the experimental data can be performed using scatter plots, MA plots, RI plots and image maps of the intensity ratios. For detection of genes which are differentially expressed the software provides fold-change detection and t-test techniques. The tool also provides the users the ability to create a workflow of the different analysis tools used to study the uploaded

(6)

data. All the statistical routines used in this software were developed in R called from Java code. This software is a freely available tool to statistically analyze microarray experiments.

(7)

TABLE OF CONTENTS DEDICATION………iii ACKNOWLEDGEMENTS………iv ABSTRACT……….v LIST OF TABLES………..xi LIST OF FIGURES...……….……...xii 1. INTRODUCTION ... 1

1.1 Overview of molecular biology ... 4

1.2 DNA... 4

1.3 RNA ... 6

1.4 mRNA ... 6

1.5 Gene ... 7

1.6 Central Dogma of Molecular Biology ... 7

1.7 MicroRNA (miRNA) ... 8

1.8 Microarrays ... 9

2. MICROARRAY ANALYSIS TECHNIQUES... 13

2.1 Microarray data analysis ... 13

2.2 Log ratios ... 16

2.3 Normalization ... 17

2.3.1 Total intensity normalization ... 17

2.3.2 Median normalization ... 18

2.3.3 Lowess normalization ... 18

(8)

2.5 MA plot... 20

2.6 RI plot ... 21

2.7 Difference between MA and RI plots ... 22

2.8 Identifying differentially expressed genes ... 23

2.8.1 Fold change... 23 2.9 Clustering... 24 2.10 Types of clustering... 24 2.10.1 Hierarchical clustering ... 25 2.10.2 Dendrogram ... 26 2.10.3 Heat maps... 27 3. LITERATURE REVIEW ... 29 3.1 Bioconductor... 29 3.2 TM4 and MIDAS... 31

3.3 BASE: BioArray Software Environment... 32

3.4 WebArray: an online platform for microarray data analysis ... 32

3.5 SNOMAD (Standardization and Normalization of MicroArray Data)... 33

4. IMPLEMENTATION SPECIFICS ... 35

4.1 R... 35

4.1.1 Statistics and R... 36

4.1.2 R and Windows™... 36

(9)

4.2.1 Installation of Rserve ... 38

4.3 Java ... 39

4.3.1 Java language ... 39

4.3.2 Java platform... 39

4.4 Java servlets ... 40

4.5 JSP (Java Server Pages) ... 42

4.6 JDBC (Java Database Connectivity)... 42

4.7 MySQL ... 43

4.8 Apache Tomcat ... 44

5. OBJECTIVES AND RESULTS ... 46

5.1 Uploading experiment data ... 47

5.2 Normalization methods... 48

5.3 Data visualization... 50

5.3.1 Scatter plot ... 50

5.3.2 MA plot... 51

5.3.3 RI plot ... 52

5.4 Creating a process pipeline ... 53

5.5 Identifying genes of interest... 57

5.5.1 Fold change cut-off ... 57

5.6 Clustering genes... 60

(10)

5.8 Search genes... 63

5.9 Output results as files... 65

6. CONCLUSIONS... 67

6.1 Possible improvements ... 68

REFERENCES………..70

APPENDICES………...72

(11)

LIST OF TABLES

Table 2.1 - Gene expression matrix with raw gene expression data... 14

Table 2.2 - Gene expression matrix with intensity ratio values... 15

Table 2.3 - Gene expression matrix with log 2 intensity ratio values ... 16

Table 3.1 - Bioconductor packages... 30

(12)

LIST OF FIGURES

Figure 1.1 - An overview of the formation of proteins... 4

Figure 1.2 - DNA double helix structure ... 5

Figure 1.3 - Formation of mRNA ... 6

Figure 1.4 - Central Dogma of Molecular Biology... 8

Figure 1.5 - Preparation of microarrays... 10

Figure 2.1 - Process of obtaining a gene expression matrix ... 13

Figure 2.2 - Effects of lowess normalization ... 19

Figure 2.3 - A scatter plot ... 20

Figure 2.4 - An MA plot ... 21

Figure 2.5 - An RI plot... 22

Figure 2.6 - Construction of a two-dimensional dendrogram representing a hierarchical cluster of related genes... 26

Figure 2.7 - A heat map with a dendrogram and a color key... 28

Figure 4.1 - R command line interface on startup ... 37

Figure 4.2 - Java servlet execution process ... 41

Figure 4.3 - The three-tier architecture of a JDBC connection... 43

Figure 5.1 - Sample data file for analysis ... 47

Figure 5.2 - Process of uploading data files... 48

Figure 5.3 - Normalized data file using total intensity normalization ... 49

(13)

Figure 5.5 - Total intensity normalization scatter plots and text files ... 50 Figure 5.6 - matrix of scatter plots with a zoomed out portion for two specific experiments ... 51

n n×

Figure 5.7 - matrix of MA plots with a zoomed out portion for two specific

experiments ... 52 n

n×

Figure 5.8 - matrix of RI plots with a zoomed out portion for two specific

experiments ... 53 n

n×

Figure 5.9 - Steps for pipelining analysis ... 54 Figure 5.10 - Pipeline screen to define a sequence of routines to be performed on the data ... 55 Figure 5.11 - Results screen after submitting the pipeline screen ... 56 Figure 5.12 - Color based image map of the gene’s intensity ratios between two

experiments ... 59 Figure 5.13 – Individual gene details... 59 Figure 5.14 - Heat map with a dendrogram to represent expression clusters ... 61 Figure 5.15 - User selected columns for calculating the intensity ratios and the number of top/bottom genes needed... 62 Figure 5.16 - Results showing top 10 ratios and bottom 25 ratios between two

experiments ... 63 Figure 5.17 - Search screen with the list of genes from the uploaded data file ... 64 Figure 5.18 - Output screen for a search done on gene information... 65

(14)

1. INTRODUCTION

Two complementary advances, one in knowledge and one in technology, are greatly facilitating the study of gene expression and the discovery of the roles played by specific genes in the development of disease. As a result of the Human Genome Project[1], there has been an explosion in the amount of information available about the DNA sequence of the human genome, including identification of a large number of genes within these previously unknown sequences. The challenge currently faced by scientists is to find a way to organize and catalog this vast amount of information into a usable form. The full impact of the Human Genome Project will be realized only after the functions of the new genes are discovered.

With this vast amount of information comes the need for tools to make sense of the data. This led to the second advance which facilitated the identification and classification of the DNA sequence information and the assignment of functions to these new genes- the DNA microarray technology. With the invention of the DNA chip, researchers have gone from looking at genes one at a time to tens of thousands at a time[2]. In order to really understand a genome, scientists need to understand how genes interact with each other and which genes are present under different conditions. This can be done by measuring the amount of each mRNA present in the cell. Microarrays enable us to measure this for thousands of genes simultaneously. With the aid of a computer, the amount of mRNA bound to the spots on the microarray is precisely measured, generating

(15)

a profile of gene expression in the cell. Microarrays generate huge amounts of valuable data and the handling and analysis of such data is becoming one of the major bottlenecks in the utilization of the technology. The raw microarray data are images, which have to be transformed into gene expression matrices—tables where rows represent genes, columns represent various samples such as tissues or experimental conditions, and numbers in each cell characterize the expression level of the particular gene in the particular sample. These matrices have to be analyzed further, if any knowledge about the underlying biological processes is to be extracted and this forms the basis for my thesis- microarray data analysis.

The data analysis process constitutes the analysis of the gene expression matrix using either supervised or unsupervised methods. Among the many statistical packages available for data analysis, ‘R’ is a statistical package which is widely used for the analysis of microarray data[3]. Several open source software are available which perform data analysis using R functionality as their base. Most of these packages either require some hands on programming experience and syntactical knowledge of the software in order to perform the analysis of the microarray data or are platform dependent and not universally available for all types of users.

The Bioinformatics Research Group (BRG) [http://kbrin.a-bldg.louisville.edu/brg/], which is a joint collaboration between the Speed School of Engineering and the School of Medicine at the University of Louisville, came up with the initiative for developing user-friendly software that can be used by biologists who generally lack programming knowledge. This thesis work is concentrated on developing a web based java tool which allows users to upload their data files in the format of a gene

(16)

expression matrix and then performs normalization of the data, produces plots to visualize the data, perform clustering of similar patterns of differentially expressed genes and lets users to save their results to a text file.

It should be noted that a good understanding of these methods and the biology behind the data is needed to choose the most appropriate for solving a particular problem. The rest of chapter one is devoted to an overview of molecular biology, including a discussion of DNA, RNA, genes and microarrays. Chapter two discusses microarray analysis techniques, including an overview of log ratios, normalization, visualization plots, differentially expressed genes using the fold change method, and clustering. Chapter three is a literature review of existing microarray data analysis software, their drawbacks and how the system being developed caters to the needs of the user who lacks programming expertise. Chapter four gives a detailed description of the software used for the development of the microarray analysis tool. Installation and implementation specifics are also covered in a detailed manner. An overall discussion of the system being developed in the form of its objectives and the results obtained is dealt with in chapter five. Conclusions and further improvements to the microarray data analysis tool are discussed in chapter six. A detailed glossary of terms is also available as part of this thesis for the reader’s reference.

(17)

1.1 Overview of molecular biology

Every cell in an organism contains a full set of chromosomes and identical genes. At a given point of time, only a subset of these genes is active. These genes define certain unique properties of a cell type. The information contained in the DNA is transcribed into messenger RNA (mRNA) molecules, which are then translated into proteins, which perform most of the important functions of the cell. Figure 1.1 illustrates this process.

Cell

Nucleus

Chromosome

Protein

Gene (mRNA),

Gene (DNA)

single strand

Cell

Nucleus

Chromosome

Protein

Gene (mRNA),

Gene (DNA)

single strand

Figure 1.1 - An overview of the formation of proteins

1.2 DNA

Deoxyribonucleic Acid (DNA) is the basis for the building blocks encoding the information of life. A single stranded DNA molecule, called a polynucleotide or oligomer, is a chain of small molecules called nucleotides. There are four different nucleotides, or bases: adenosine (A), cytosine (C), guanine (G) and thymine (T).

(18)

Stringing together a simple alphabet of four characters together we can get enough information to create a complex organism. The ends of the polynucleotide are marked either 3’ or 5’. The general convention is to label the coding strand from 5’ to 3’ (left to right). For instance, the following is a polynucleotide:

5’G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’

DNA can be either single-stranded or double stranded. When DNA is double-stranded, the second strand is referred to as the reverse complement strand. Complementary bases are determined by which pairs of nucleotides can form bonds between them. In the case of DNA, A binds to T, and C binds to G. For the polynucleotide given above, the double-stranded polynucleotide is as follows:

5’G→T→A→A→A→G→T→C→C→C→G→T→T→A→G→C 3’ | | | | | | | | | | | | | | | |

3’ C←A←T←T←T←C←A←G←G←G←C←A←A←T←C←G 5’

Two complementary polynucleotide chains form a stable structure known as the DNA double helix (Figure 1.2). Using the double stranded molecule as a template, proteins will be produced for active genes with the help of RNA molecule.

Image source: www.genecrc.org/site/ lc/lc2b.htm

(19)

1.3 RNA

Ribonucleic Acid (RNA) is similar to DNA in the fact that it is constructed from nucleotides. However, instead of thymine (T), an alternative base uracil (U) is found in RNA. RNA can be found as double-stranded or single-stranded, and can also be part of a hybrid helix where one strand is an RNA strand and the other is a DNA strand. RNA is important in the cell and contributes in a variety of ways. One of the most important roles of RNA is in protein synthesis. Two of the major RNA molecules involved in protein synthesis are messenger RNA (mRNA) and transfer RNA (tRNA).

1.4 mRNA

Messenger RNA (mRNA) is a linear molecule encoding genetic information copied from DNA molecules. DNA is copied into a single stranded mRNA molecule by the transcription process. This occurs as follows. Genes consist of coding regions called exons and non-coding regions called introns. mRNA processing removes introns and splices the exons together. Processed mRNA can be translated into a protein sequence.

Source: http://www.ebi.ac.uk/microarray/biology_intro.html

(20)

Therefore, in order to determine which genes are active in a cell (i.e., those that are producing a protein product) one can measure the amount of mRNA present. This gives an approximation of the activity of individual genes in a cell.

1.5 Gene

A gene can be described as the physical and functional unit of heredity that carries information from one generation to the next[4]. A gene can be thought of as the DNA sequence necessary for the synthesis of a functional protein or RNA molecule. Proteins are important components of the body that determine how the different kinds of molecules in the body are organized and act. Thus, proteins play a key role in the way we look and the also in making us a unique individual. Genes are expressed as proteins, a complex process consisting of two main steps: Each gene (DNA) is converted (transcribed) into messenger RNA (mRNA), RNA that serves as a template for protein synthesis. The resulting mRNA then guides the synthesis of a protein through a process called translation. Thus isolating the mRNA helps us to find expressed genes from the human genome.

1.6 Central Dogma of Molecular Biology

The Central Dogma of Molecular Biology states that the region of a double stranded DNA molecule that corresponds to a gene is copied, or transcribed, to a complementary single stranded mRNA molecule[5]. The single stranded mRNA molecule then gets translated to a protein (Figure 1.4). If mRNA molecules can be identified, the expression level of the corresponding genes can be determined.

(21)

Source: http://www.accessexcellence.org/RC/VL/GG/images/central.gif

Figure 1.4 - Central Dogma of Molecular Biology

1.7 MicroRNA (miRNA)

A miRNA is a form of single-stranded RNA which is typically 20-25 nucleotides long, and is thought to regulate the expression of other genes[6]. miRNAs are RNA genes which are transcribed from DNA, but are not translated into protein. The DNA sequence that codes for a miRNA gene is longer than the miRNA. This DNA sequence includes the miRNA sequence and an approximate reverse complement. When this DNA sequence is transcribed into a single stranded RNA molecule, the miRNA sequence and its reverse-complement base pair to form a double stranded hairpin loop which is a primary miRNA structure (pri-miRNA). Pri –miRNAs are processed in the nucleus into hairpin RNAs called Pre-miRNAs. The pre-miRNA molecule is then actively transported out of the nucleus by a carrier protein. Thus through a mechanism that is not fully characterized, the

(22)

bound mRNA remains untranslated resulting in reduced expression of the corresponding gene.

The function of miRNAs appears to be in gene regulation. miRNAs have been reported to be critical in the development of organisms; they are differentially expressed in tissues and are involved in viral infection processes. In the past two to three years, a great deal of effort has gone in understanding how, when and where miRNAs are produced and functions in cells, tissues and organisms. Several research groups have provided evidence that miRNAs may act as key regulators of processes as diverse as early development, cell proliferation and cell death, apoptosis and fat metabolism and cell differentiation. There is speculation that the role of miRNAs in regulating gene expression could be as important as that of transcription factors. The discovery of miRNAs and their functions has added insight into how gene regulation is much more complex than the Central Dogma of Molecular Biology previously led biologists to believe.

1.8 Microarrays

Microarrays, developed in the lab of professor Patrick Brown at Stanford, in the early 1990’s, took molecular biology by storm[7]. They are small slides spotted with fixed samples of DNA, each for a different gene. When a researcher prepares a labeled cell extract and incubates it with the slide, messengers in the sample anneal to the fixed DNA, showing which genes in the sample are active. Microarray technology helps to identify genes that are expressed under different conditions such as during the stages of a cell cycle, under different environmental conditions, under diseased states at a particular

(23)

time, or under different tissue or cell types. A microarray is typically a glass slide, on to which DNA, cDNA or Oligonucleotide molecules are attached at fixed locations (spots).

There may be tens of thousands of spots on an array, each containing a huge number of identical DNA molecules of varying lengths. For gene expression studies, each of these molecules ideally should uniquely identify a single gene in the genome. Microarrays are used to compare gene expression levels in two different samples, for example, a cell in a healthy state and a diseased state. A microarray employs the ability of a given mRNA molecule to bind specifically to, or hybridize to, the complementary DNA from which it is originated.

Source: http://www.bioteach.ubc.ca/MolecularBiology/microarray/index.htm

(24)

A microarray contains many DNA sequences, and the expression levels of thousands of genes can be determined in a single experiment by measuring the amount of mRNA bound onto each spot of the array. Arranged systematically, the particular sequences can be identified by the location of the spots on the slide.

For two channel experiments, the relative abundance of each of the gene-specific sequences in two RNA samples (test and reference) may be estimated by fluorescently labeling the samples, mixing them and hybridizing them to the sequences on the glass slide. The two samples of mRNA from the cells (target) are reverse transcribed into cDNA, and labeled using two different dyes (red Cyanine 5 and green Cyanine 3). Usually, the reference sample is labeled Cy3 and the test sample with Cy5. The mixture reacts with the spotted cDNA sequences (probes). This results in cDNA sequences from the targets and the probes base-pairing with one another. After this hybridization step is complete, the microarray is placed in a scanner, consisting of lasers with different wavelengths, a microscope and a camera. The slide is scanned twice, first using one colored laser and then the second. Laser light excites the fluorescent dyes, Cy3 is excited by green laser light and Cy5 is excited by red laser light[4]. Green spots indicate that the test substance has lower activity than the reference substance, red spots indicate that the test substance is more abundant than the reference substance; yellow spots mean that there is no change in the activity level between the two populations of test and reference substance. Black represents areas where neither test nor control substance has bound to the target DNA. The process of creating and labeling a microarray can be observed in Figure 1.5.

(25)

Having an introduction to the central dogma of molecular biology, genes, microarrays, their preparation process and uses, the next chapter introduces microarray data analysis and the techniques used to analyze microarray data for obtaining useful information and knowledge about the underlying biological processes.

(26)

2. MICROARRAY ANALYSIS TECHNIQUES

2.1 Microarray data analysis

Analysis of microarray data is performed to identify which genes are involved in the process being studied. It involves statistical analysis by various graphical and numerical means to select differentially expressed (DE) genes or to find groups of genes whose expression profiles can reliably classify the different RNA sources into meaningful groups. The analysis of gene expression data is performed by constructing the gene expression matrix that describes spot quantitations from different hybridizations. The process of constructing a gene expression matrix from the raw microarray data is summarized in Figure 2.1.

(27)

A gene expression matrix is a matrix, in which the first column represents the gene names, and the subsequent columns represent the different experimental conditions and the cell values usually represent the gene expression value for the given experiment. Given in Table 2.1 is a gene expression matrix with sample gene expression values.

Exp 1 Exp 2 Exp 3 Exp 4 Exp 5 Exp 6 Gene 1 403 409.3 611.5 569.2 536.6 580.2 Gene 2 757.3 574.4 826.7 595.3 755.2 956 Gene 3 284.4 327.3 421.6 336.6 391.3 412.6 Gene 4 2314.2 1685.3 2264.7 2204.1 2233.1 2458.4 Gene 5 1574.5 1273 1484.6 1321.2 1474.7 1774.1 Gene 6 2333.7 1796.8 2464.5 2372.5 2095.9 2735.7

Table 2.1 - Gene expression matrix with raw gene expression data

The cell values can also be the intensity ratios of the particular experiment with a preset experiment value. In the gene expression matrix in Table 2.2, the cell values are the intensity ratios of all the other experiments with experiment 1, calculated from the expression matrix of Table 2.1.

(28)

Exp 2/Exp 1 Exp 3/Exp 1 Exp 4/Exp 1 Exp 5/Exp 1 Exp 6/Exp 1 Gene 1 1.015633 1.51737 1.412407 1.331514 1.439702 Gene 2 0.758484 1.091641 0.786082 0.997227 1.26238 Gene 3 1.150844 1.482419 1.183544 1.375879 1.450774 Gene 4 0.728243 0.97861 0.952424 0.964955 1.062311 Gene 5 0.808511 0.942903 0.839124 0.936615 1.12677 Gene 6 0.769936 1.056048 1.016626 0.898102 1.172259

Table 2.2 - Gene expression matrix with intensity ratio values

Data analysis is based on the hypothesis that there are biologically relevant patterns to be discovered in the data. The microarray data analysis process depends on the analysis of the gene expression matrix using both supervised and unsupervised methods. Most data analysis methods use raw expression values, intensity ratios or both for their analysis routines. Data analysis methods take these huge sets of data as input and produce both visual and numerical results for interpretation and further analysis. The most commonly used microarray data analysis methods include log ratios of the gene intensity data in order to spread the values across a given range, normalization to identify and remove bias from the data, diagnostic plots of the microarray data for visualization purposes, and methods to identify differentially expressed genes and clustering of genes with similar behavior patterns. Each of these analysis techniques are discussed in detail in the following sections.

(29)

2.2 Log ratios

A logarithmic transformation produces a continuous spectrum of values and treats up and down regulated genes evenly across a range [8]. Equation 2.1 shows the formula for calculating log2 ratios.

i i i G R X =log2 [Equation 2.1]

Where i=1, 2…..Ngenes and Ri , Gi are the measured intensity values for gene ‘i’ from two different experimental conditions.

By using log2 values, X=0 represents equal expression, X=1 represents up regulation by a factor 2, X=-1 down regulation by a factor 2, X=2 up regulation by factor 4, and so on. Additionally, calculating log 2 values spreads the values more evenly across the intensity range and provides better visualization of the data and it tends to make the variability of data more constant over the intensity range[9]. Given in Table 2.3 are the log2 values of the intensity ratios calculated in Table 2.2.

Exp 2/Exp 1 Exp 3/Exp 1 Exp 4/Exp 1 Exp 5/Exp 1 Exp 6/Exp 1 Gene 1 0.02237883 0.60157266 0.49815582 0.413067216 0.52577046 Gene 2 -0.39880918 0.12649896 -0.34724803 -0.004006164 0.33614569 Gene 3 0.20269214 0.56795340 0.24311371 0.460353645 0.53682236 Gene 4 -0.45750812 -0.03119360 -0.07032387 -0.051465694 0.08720612 Gene 5 -0.30666134 -0.08481948 -0.25304488 -0.094472262 0.17219357 Gene 6 -0.37718928 0.07867587 0.02378897 -0.155049228 0.22929092

(30)

2.3 Normalization

The goal of normalization is to identify and remove any systematic bias in the measured fluorescence intensities, arising from variation in the microarray process rather than from biological differences between the RNA samples or the printed probes[10;11]. Sources of bias include:

• labeling efficiencies of the dyes

• different amounts of Cy3 and Cy5 labeled mRNA • scanning parameters

• spatial or plate effects, print tip effects, etc.

In the normalization process, a normalization factor (also referred to as scaling factor) is calculated and is multiplied to all the values of an experiment. Either of the experiments which are being compared can be multiplied with the normalization factor. This process is the same as taking a constant value away from the log of the normal ratio.

2.3.1 Total intensity normalization

Total intensity normalization computes the normalization factor by summing the measured intensities in both the experiments considered[10]. This is shown in Equation 2.2,

∑

= = = array k k array k k total N G N R N 1 1 [Equation 2.2]

where Narray is the total number of genes, Gk and Rk are the measured intensity values of the kth gene in both the experiments. The intensities are then rescaled such that Gk’ = Ntotal

(31)

Gk and Rk’= Rk and the normalized expression ratio for each feature are calculated (Equation 2.3). k k total k k k G R N G R T 1 ' ' '= = [Equation 2.3] This is equivalent to ) ( log ) ( log ) ' ( log2 Tk = 2Tk − 2 Ntotal 2.3.2 Median normalization

In median normalization the normalization factor is found by calculating the median of the array in question. Hence the equation becomes

a k

k T median

T ')=log ( )− (

log2 2 where a is the experiment array.

The advantage of using the median normalization is that it is insensitive to outliers which occur commonly in microarray data sets.

2.3.3 Lowess normalization

Lowess stands for Locally Weighted Linear Regression. It is also referred to as Loess. Lowess uses a linear regression model whereas Loess uses a quadratic regression model. The lowess normalization procedure subtracts a Lowess regression curve from the data to normalize it[10;12].

(32)

Figure 2.2 - Effects of lowess normalization

A Lowess curve is first drawn on the RI Plot. The lowess curve is calculated by a regression process which calculates the dependence of the ratio on the intensity and puts it in a mathematical context.

The dependence, for each gene (i) is calculated by observing its distance from the curve. On subtracting the dependence from the observed log

) (xi y 2 ratios, the equation becomes: )) ( ^ 2 ( log ) ( log ) ( ) ( log ) ' ( log2 Tk = 2 Tk −y xi => 2 Tk − 2 y xi [Equation 2.4]

Figure 2.2 shows the effect of lowess regression on a set of data. The plot on the right hand side is the RI plot itself and the plot on the left hand side is the RI plot fitted with the lowess curve. Lowess detects the systematic deviations in the RI plot and corrects them by carrying out a local weighted linear regression function given by Equation 2.4, and uses this function, point by point, to correct the measured ratio values. The results of applying such a lowess correction can be seen in the left hand side plot of Figure 2.2. Lowess analysis is used as a normalization method that can remove intensity dependant effects in the log2 ratio values.

2.4 Scatter plot

The scatter plot is an important graphical tool for studying the spread and linearity of data[8]. In its simplest form, two variables are plotted along the axes, and marks are

(33)

drawn according to these coordinates. The intensity values of genes under different experiments can be depicted as a scatter plot. A scatter plot is straightforward, but very high correlation between the two experimental intensity values makes the features of the plot difficult to discern. In an ideal scatter plot, all the spots are clustered around the diagonal line representing y=x. Figure 2.3 shows a scatter plot with most of the data points clustered around the diagonal line.

Figure 2.3 - A scatter plot

2.5 MA plot

An MA plot is a scatter plot with transformed axes[8]. The X-axis conforms to the logged total intensity value of the two experiments; the Y-axis shows exactly the log-ratio of the two experiments. MA plots are used to identify spot artifacts and detect intensity-dependant patterns in the log ratios. Since the interest lies in deviations of the points from the diagonal line, it is beneficial to rotate the axes and re-scale the axes as in the MA plot. The MA plot serves to increase the room available to represent the range of differential expression and makes it easier to see non-linear relationships between the log intensities. The MA plot in Figure 2.4 shows the differentially expressed genes more

(34)

clearly than the scatter plot in Figure 2.3. If an MA plot clearly shows the dependence of the log ratio M on overall spot intensity A, this suggests that intensity or ‘A’ dependent normalization method may be preferable.

Figure 2.4 - An MA plot

2 log 1

log2Exp 2Exp

M = − (log 1 log 2) 2 1 2 2Exp Exp A= + 2.6 RI plot

A ratio-intensity (RI) plot is also a scatter plot like the MA plot that shows the intensity specific effects for all the genes by plotting the log ratio as a function of the product of the intensities[9;12]. RI plots are used to determine if there is a rough correlation between the total intensity of a spot and its ratio. The easiest way to visualize intensity-dependent effects, and the starting point for the lowess analysis described in section 2.3.3, is to plot the measured log2 (Exp 1/Exp 2) for each gene as a function of the log10 (Exp 1*Exp 2) product intensities. This ‘R-I’ plot can reveal intensity-specific artifacts in the log2 (ratio) measurements which can be eliminated using lowess

(35)

normalization method. Under the assumption that most genes are not differentially expressed, most of the points in the RI plot should fall along the horizontal line. Figure 2.5 shows an RI plot where a large number of genes which are not differentially expressed fall along the horizontal line, and a number of differentially expressed genes are scattered away from the horizontal line.

Figure 2.5 - An RI plot

2.7 Difference between MA and RI plots

The MA plot and RI plot are used to check if the data exhibits an intensity dependent structure. RI plots and MA plots are used in an alternative manner by scientists.

In an MA plot, plot M=log 2 (R/G) Vs A= (1/2) log 2 (R*G) In a RI plot, plot R= log 2 (R/G) Vs I=log 10 (R*G)

where R and G are two different experiments.

The type of plots used for analysis is a source of confusion due to the fact that the RI plot looks very similar to an MA plot. It is important to know that MA plots are similar to RI plots but are not the same. RI plots are most commonly used to show the effect of lowess

(36)

normalization. MA plots are used instead of scatter plots because they serve to increase the room available to represent the range of differential expression and makes it easier to see non-linear relationships between the log intensities.

2.8 Identifying differentially expressed genes

One of the main goals of microarray experiments is to identify differentially expressed (DE) genes[11]. It will be practical to identify a limited number of genes which are the most likely candidates. This set of DE genes can be further analyzed using clustering techniques, etc.

2.8.1 Fold change

The fold change detection is a simple approach where a fixed fold-change-cutoff interval is used to find genes which are differentially expressed.[10; 13]

2 1 sample in value Expression sample in value Expression change Fold = [Equation 2.5]

If a gene’s experimental log-ratio exceeds the upper cutoff interval boundary, then it is marked as significant and over expressed. If a gene’s experimental log-ratio falls below the lower cutoff interval boundary, then it is marked as significant and under expressed. Genes with experimental log-ratios in the range of the interval are marked for regular behavior. Important factors to be considered in fold change method are what cutoff should be used and should the cutoff be the same for all the genes. Though it is a very straightforward method for classifying genes, the fold change method has the disadvantage of not considering variability. Hence, genes with large variances are more likely to make the cutoff just because of noise. For poorly expressed genes, small changes

(37)

in intensity can lead to large calculated fold changes. And it is not a statistically based method.

2.9 Clustering

Microarray experiments deal with a large amount of data, which has to be stored and analyzed. Therefore a general idea is to reduce the dimensionality of the data. The basic concepts in clustering are to try to identify and group together similarly expressed genes and then try to correlate and interpret the observations at the biological end. The basic principles in gene clustering are:

1. Organize the data into a small number of homogeneous groups.

2. Find similar expression patterns of genes. Both low and high expression level genes can be placed in the same cluster if their expression profiles have similar shape.

2.10 Types of clustering

Clustering can be hierarchical or flat, as well as agglomerative or divisive[10]. Agglomerative processes start out by considering each object as a separate cluster and proceed to group the most similar objects in an iterative fashion until all the data are included. Divisive methods start out with the complete set of data as one large group, or cluster, and proceed by partitioning the objects starting with those that are most dissimilar. Based on their background principle, the different types of clustering methods available are Hierarchical agglomerative clustering[9;10], Hierarchical divisive clustering[9;10], k-means clustering and self organizing maps (SOM’s)[9;10;13].

(38)

2.10.1 Hierarchical clustering

The clustering method used for analysis in this tool is hierarchical clustering. The hierarchical clustering algorithm uses a bottom-up approach where it iteratively joins the two closest clusters starting from a single cluster[9;10;13]. After each step, a new distance matrix between the newly formed clusters and the other clusters is recalculated. For a set of N genes to be clustered, and a N×N distance matrix, the hierarchical clustering is performed as follows:

1. Assign each gene to a cluster of its own.

2. Find the closest pair of clusters and merge them into a single cluster.

3. Compute the distances between the new cluster and each of the old clusters. 4. Steps 2 and 3 are repeated until all the genes are clustered.

The distance matrix is calculated by considering the shortest distance from any member of one cluster to any member of the other cluster. Hierarchical clustering has become popular for the following reasons:

• Hierarchical clustering techniques are meaningful to cluster data at the experiment level rather than at the level of individual genes. Such experiments are most often used to identify similarities in overall gene expression patterns in the context of different treatment regimens.

• The analysis reveals groups of similar genes that can be studied in greater depth. • It is possible to visualize the data in a hierarchical way using interactive computer

programs.

While intuitively appealing as a method, hierarchical clustering is not an efficient method for very large gene expression matrices as the full distance matrix of all pair-wise

(39)

distances has to be calculated in advance, which for n objects takes an order of n2 steps. Hierarchical clustering is also less suitable for noisy data.

2.10.2 Dendrogram

Hierarchical clustering can be represented as a tree called a dendrogram[9;10].

Source: http://www.awprofessional.com/articles/article.asp?p=357695&seqNum=4&rl=1

Figure 2.6 - Construction of a two-dimensional dendrogram representing a hierarchical cluster of related genes

(40)

By cutting the dendrogram at a particular height will give the different clusters and the ze of the clusters. The dissimilarity of the clusters is proportional to the length of the ertical lines projecting from each cluster. Figure 2.6 is an example of how a dendrogram f clusters is obtained.

Each column represents a different experiment, each row a different spot on the icroarray. The height of each link is inversely proportional to the strength of the orrelation. Relative correlation strengths are represented by integers in the ccompanying chart sequence. Genes 1 and 2 are most closely coregulated, followed by enes 4 and 5. The regulation of gene 3 is more closely linked with the regulation of enes 4 and 5 than any remaining link or combination of links. The strength of the orrelation between the expression levels of genes 1 and 2 and the cluster containing enes 3, 4, and 5 is the weakest (relative score of 10). (Adapted from: Jeffrey Augen, ioinformatics and Data Mining in Support of Drug Discover," Handbook of Anticancer rug Development. D. Budman, A. Calvert, E. Rowinsky, editors. Lippincott Williams

2.10.3 Heat maps

A heat map is a color image with a dendrogram attached to the left side and to the top of the image[10;14]. The rows and columns plotted in the heatmap are re-ordered based on the restrictions imposed by the dendrogram. Each row in the heatmap represents a gene and the columns represent the different experiments to which the gene is subjected. The colors in the heat map simply represent the values in the gene expression matrix. One can observe from a heat map (Figure 2.7) that genes with similar gene expression profiles (i.e. strings of similar colors) are grouped close together.

si v o m c a g g c g "B D and Wilkins. 2003)

(41)

Figure 2.7 - A heat map with a dendrogram and a color key

The next chapter is a literature review of existing microarray data analysis tools, and their advantages and disadvantages in terms of ease of use, availability and functionality. It also includes a discussion about the motivation for the developed microarray data analysis tool.

(42)

3. LITERATURE REVIEW

There are several commercial and non-commercial solutions as well as a growing body of freely available open source software for analyzing microarray data. A review of some popular open source microarray data analysis tools is presented here including Bioconductor, TM4, MIDAS, BASE, WebArray and SNOMAD.

3.1 Bioconductor

Bioconductor is an open source project for computational biology[15]. The main

focus is to d on analysis.

Biocon

s at least one vignette, a document that provides a textual, sk oriented description of the package’s functionality and can be used interactively. lthough initial efforts focused primarily on DNA microarray data analysis, many of the

ftware tools are general and can be used broadly for the analysis of genomic and xpression data. Bioconductor has adopted object-oriented programming as its primary rogramming paradigm.

he main features of the Bioconductor project are:

Use of R to provide a wide range of statistical and graphical methods for the analysis of genomic data.

eliver high-quality infrastructure and end-user tools for expressi

ductor is built completely on R[3;14] and R packages. A list of the different types of packages available is given in Table 3.1. In addition to providing genomic data analysis tools, Bioconductor has excellent integrated, dynamic documentation. Each Bioconductor package contain

ta A so e p T

(43)

Help integrate biological literature data from PubMed and LocusLink with the analysis of genomic data.

Allows the development of extensible, scalable and interoperable software.

Provide high qual le research.

nalysis tool with a simple user interface, which does not require the user upload data for analysis and

dow o the

web-ba e focus of this thesis.

ity documentation and reproducib

Provide training in computational and statistical methods for the analysis of genomic data.

Task Packages

General programming tools Biobase, graph, tkWidgets, reposTools, rhdf5

Annotation AnnBuilder,

Table 3.1 - Bioconductor packages

Although Bioconductor has the advantage of building on the existing toolkit of statistical applications, it is command line based which is imposing for many users. The tkWidgets package provides some functionality for creating GUI’s, but even that requires additional programming.

The need for an a

annotate

Graphics Geneplotter, hexbin

Preprocessing microarray data Affy, marrayClasses, marrayInput, marrayNorm, marrayPlots, marrayTools Differential gene expression Genefilter, multtest, ROC

any kind of programming skills, and instead lets

nl ad the results in a point-and-click fashion, is the main factor in developing sed application which is th

(44)

3.2

alysis suite of tools was developed to provide the

mic of the

mic ajor applications,

Mic a

System Multiexperiment Viewer (MeV). Since the focus of this project is in array data analysis, the discussion is confined to MIDAS.

M alysis System

MIDAS is a java application which pr users an intuitive interface to design

an sses combining one or more ring steps. MIDAS

reads “.tav” (TIGR ArrayViewer file type, w mn, tab-delimited text

fo purposes o

a single slide) files generated by TIGR Spo ia

M les include lo ormalization. It

also includes background- ate analysis and filtering,

and the

s the data in tav format. While TM4 overcomes some of the limitati

accessed through web, instead of

TM4 and MIDAS

The TM4 microarray an

roarray community with a comprehensive set of tools to handle all aspects roarray process[16]. The TM4 suite of tools consist of four m

ro rray Data Manager (MADAM), TIGR_Spotfinder, Microarray Data Analysis (MIDAS), and

micro

IDAS: Microarray Data An

ovides

alysis proce normalization and filte

hich is an eight-colu

rmat developed at TIGR for the f storing the intensity values of the spots on tfinder or retrieved from the database v ADAM. Normalization modu wess and total intensity n

and quality- control trimming, replic

identification of differentially expressed genes using intensity dependent Z-scores and user defined fixed fold-change cut-offs. MIDAS provides scatter plots that illustrate the effects of each algorithm on the data. When the normalization and filtering steps are complete, MIDAS output

ons of a command-line driven system, it has the disadvantage of requiring users to maintain current copies of the software locally and to update the system as it evolves.

Thus the need for an analysis tool which can accept data as a simple text file, instead of program specific formats and which can be

(45)

maintaining local copies of the program on a user’s computer, has been another motiva

BASE was developed using a web-based approach which closely integrates a data management system with a data analysis system[16]. Since expression analysis tools are evolving rapidly, BASE has a plug-in architecture that allows new modules to be easily added for data transformation, analysis, or visualization. BASE incorporates a data analysis interface that allows users to define an analysis method that passes data through multiple routines and to create transformed datasets and subsets. This allows the original unmodified data to be analyzed in a number of ways to create multiple analyses. BASE allows data to be visualized in a variety of ways. Unmodified and transformed datasets can be plotted interactively as scatter plots, displayed in histograms, or viewed as tables. Though BASE minimizes the software update problem through its web-based approach, it has the disadvantage that it loses a good deal of the graphical functionality that local applications can provide.

The motivation for creating a pipeline process in the application being developed comes from the analysis method of BASE. Also, though not yet implemented, the integration of the data analysis module with a data management system as done in BASE is a good future improvement.

ting fact in developing this application.

3.3 BASE: BioArray Software Environment

3.4 WebArray: an online platform for microarray data analysis

WebArray offers a convenient platform for biologists to access several cutting-edge microarray data analysis tools[17]. WebArray runs on a LAMP system (Linux +

(46)

Apache + MySQL + Python) system. Background computations are mostly done by R scripts. The currently implemented functions of WebArray were based on limma (Linear Models for Microarray Analysis) and affy package from Bioconductor, the spacings LOESS histogram (SPLOSH) method, PCA-assisted normalization method and genome mapping method. WebArray incorporates these packages and provides a user-friendly interface for accessing a wide range of key functions of limma and others, such as spot quality weight, background correction, graphical plotting, normalization, linear modeling, empirical bayes statistical analysis, false discovery rate (FDR) estimation, and chromosomal mapping for genome comparison. Microarray analysis using WebArray can be executed in three steps: 1) uploading and managing files; 2) selecting datasets and methods for analysis, 3) browsing results. A good help document is also available with detailed annotation of all the functions of WebArray. Thus WebArray is an excellent free open source software for microarray analysis that can be used by an average biologist after some training.

ardization and Normalization of MicroArray Data)

on to the regular transformations and visualization tools, SNOMAD includes two non-linear transformations which correct bias and variance which are non-uniformly distributed across the range of microarray element signal intensities: 1) local mean normalization; and 2) local variance correction (Z-score generation using locally calculated standard deviation).

3.5 SNOMAD (Stand

SNOMAD is an interactive, user-friendly web-application which can be accessed freely via the internet with any standard HTML browser[18]. SNOMAD is a collection of algorithms for the normalization and standardization of gene expression datasets derived from diverse sources. In additi

(47)

The SNOMAD tool is available at -

http://pevsnerlab.kennedykrieger.org/snomad.htm. No programming expertise or

software installation is required. Users can upload their gene expression data and specify the transformations they wish to apply on their data. Results come in the form of both a text file containing numeric values and image files of graphs of the data corresponding to all the transformations.

WebArray and SNOMAD are two user-friendly tools available in the market for microarray data analysis. But they have their own disadvantage of having limited functionality, confined to a certain set of routines that the user can perform on the data. They do not have the scope for adding new R programs to the already existing system. In such a case, biologists tend to use multiple tools for obtaining the required results. This lack of extensibility formed another motivation for the development of the application under discussion. Thus all these above discussed factors led to the development of the current application to provide a solution to the community driven need for an easy to use, readily available and extensible microarray data analysis tool, which uses R routines for analysis.

(48)

4. IMPLEMENTATION SPECIFICS

In this chapter, the software implementation specifics for the microarray analysis tool are discussed. A brief introduction to the R package, which forms the base for the statistic

4.1 R

R is a powerful software environment for data manipulation, calculation and graphical display. It is a GNU General Public License project similar to the S language. The name is partly based on the first names of the first two authors (Robert Gentleman and Ross Ihaka), and partly a play on the name of the Bell Labs language ‘S’[3;14].

supports a wide range of statistical techniques including descriptive statistics, linear and nonlinear modeling, classical statistical tests, probability distributions, analysis of variance (ANOVA), time series analysis, classification, clustering, robust regression and maximum likelihood.

al analysis of the microarray data is given. Description of Java which has been used to develop the user interface, and information about Rserve, which is the plug-in used to connect to R from Java are also given. Some background about MySQL database and the JDBC connection needed to connect to a database from Java code is also provided. A clear understanding of the software is needed to understand the implementation techniques discussed in this thesis.

(49)

R is extensible via user defined functions written in its own language, or through the use of dynamically loaded modules written in other languages. It can be used with Linux, UNIX and Microsoft Windows™.

4.1.1 Statistics and R

Most of the statistical techniques have been built into the base R environment and many more are supplied in the form of packages. There are about 10 packages called standard packages which are supplied with R and many more can be downloaded from the Comprehensive R Archive Network (CRAN) website (http://cran.r-project.org).

The major difference between R and other statistical systems is that in R, the statistical analysis is performed as a sequence of steps with the results of every step stored in objects. In systems like SAS and SPSS copious output is obtained from a regression or analysis whereas R will give minimal output and store the results in a fit

ct f further processing by R functions.

4.1.2 R and Windows™

The latest version of R for Windows™ can be downloaded from the CRAN website. The version used for development of this project is R 2.1.1. A full installation of R on Windows™ takes up to 50 MB of disk space. To install, double click on the icon for rw2011.exe and follow the instructions. R installed in this way can be started from the start menu or by double clicking the R shortcut. To add packages to the existing R system, download the packages from the CRAN website and unzip them into the R/rw2011/library folder directly.

(50)

Figure 4.1 - R command line interface on startup

erted into native data types. • Persistent connections until the connection are closed.

4.2 Rserve

Rserve [19]is a TCP/IP server which allows other programs to use R facilities from various languages without the need to initialize R or link against R library. Rserve supports remote connection, authentication and file transfer. Typical use of Rserve is to integrate R backend for computing statistical models, plots, etc from other applications. The features of Rserve include:

• R initialization is not necessary. • Most R data types are conv

(51)

• Offers client independence since the client is not linked to R.

• Rserve provides some basic security in the form of encrypted user/password authentication.

• Rserve allows transferring files between the client and the server.

Rserve itself is the server which responds to requests from the clients. It listens for incoming connections and processes incoming requests. A client framework was also developed – JRclient. JRclient is a client suite which allows a java application to access Rserve. It was developed in java. It provides automatic type translation for most objects such as int, double, arrays, string or vector and classes for special R objects such as RList, RBool, etc. The idea behind the separation of client/server side allows handling multi-threading better when linking to R library directly.

4.2.1 Installation of Rserve

R 1.5.0 or to be able to use

AIX and Windows™. The Windows™ version of Rserve was used for development. Although Rserve works on Windows™, it is not the recommended since Windows™ lacks important features that make the separation of namespaces possible. Therefore Rserve for Windows™ allows only one connection at a time and all subsequent connections share the same namespace.

Installation process for Windows™:

1. Make sure to download the proper binary based on the version of R.

2. Copy the binary Rserve.exe to the same directory where R.dll is located. By default it is in the R\rw2011\bin folder.

3. Run rserve.exe to start the server and to make connections to R. higher needs to be installed on your system in order Rserve. Rserve works on Linux, Solaris,

(52)

Rserve was developed by Simon Urbanek, a researcher at AT&T Research labs. Any e

(General Public License).

4.3

This microarray data analysis tool was mostly developed using Java in order to provide a platform independent solution. The IDE (Integrated Development Environment) used for code development is Eclipse SDK 3.1.1. Other IDE’s that can be used are Borland’s JBuilder or Netbeans.

ple object-oriented, distributed, interpreted, robust, secure, architecture neutral, portable, high-performance, multithreaded, and dynamic language[20]. A program written in java is both compiled and interpreted. A java compiler generates an architecture independent object file executable on any system supporting the java runtime environment. The object code consists of bytecode instructions designed to be both easy to interpret on any machine and easily translated into native machine code at load time. So compilation takes place only once, interpretation occurs each time the program is executed.

rogram runs. Som

operati e. The java platform differs from these

on interested to contribute to the project can do so since it is released under GPL

Java

4.3.1 Java language

Java as described by Sun Microsystems is, a sim

4.3.2 Java platform

A platform is the hardware or software environment in which a p

e popular platforms like Windows™, Linux, Mac OS, etc. are a combination of the ng system and the underlying hardwar

(53)

platform

parts:

1. The Java Virtual Machine (JVM)

2. The Java Application Programming Interface (API)

The JVM is the interpreter and the runtime system, which lets java programs run on any hardware-based platform where it has been already ported to. The API is a large collection of ready-made software components that provide several capabilities. It is a grouped up collection of libraries of related classes and interfaces. These libraries are also

4.4

Servlets[21] are java programs that run on a web server and build web pages. Servlets provide a component-based, platform-independent method for building web-based applications. Servlets are server- and platform- independent which leaves us free to select any server, platform and tools for running our application.

s based on the fact that it is a software-only platform that runs on top of other hardware-based platforms.

The java platform has two

known as packages[20].

(54)

Source: http://cs.nmu.edu/~jeffhorn/Classes/CS122/Figures/javaTranslation.gif

Figure 4.2 - Java servlet execution process

form that specifies method=POST. To be a servlet, a class should extend HttpServlet and contain the doGet and the doPost methods to handle the GET and POST requests respectively. Both these methods take two arguments: an HttpServletRequest and an HttpServletResponse. The HttpServletRequest has methods to handle all incoming information such as form data, HTTP request headers, and the client’s hostname. The HttpServletResponse lets you specify outgoing information such as HTTP status codes, content-type, cookies and most importantly lets you post document content back to the client. The two important packages that have to be imported into the servlet file are:

Servlets are class files which handle GET and POST requests. GET requests are the usual type of browser requests for web pages in HTTP, when a user types a URL on the address line or follows a link from a web page. Servlets also handle POST requests, which are generated when someone submits an HTML

(55)

1. javax.servlet (for HttpServlet) and

2. javax.servlet.http (for HttpServletRequest and HttpServletResponse).

4.5 JSP (Java Server Pages)

Java Server Pages[20;21] is a technology that lets you mix regular, static HTML with dynamically-generated HTML. You simply write the regular HTML in the normal manner, using whatever web-page building tools you normally use. You can then enclose the code for the dynamic parts in special tags which start with “<%” and end with “%>”. A JSP is saved with a .jsp extension and it can be invoked just like any other normal web page. Though it appears to be a normal HTML file, a JSP acts like a servlet behind the scene

.6 JDBC (Java D

s.

4 atabase Connectivity)

JDBC[20;21] defines how a java program can communicate with a database. JDBC API provides two packages – java.sql and javax.sql. By using JDBC API, one can connect to any database, send queries to the database and process the results.

JDBC architecture defines the different layers to work with any database and Java. 1. JDBC API interfaces and classes which are at top most layer (to work with java) 2. A driver which is at the middle layer (maps java to database specific language) 3. A database at the bottom (to store physical data)

(56)

Source: http://www.dbmsmag.com/9610i06.html#figure1

Figure 4.3 - The three-tier architecture of a JDBC connection

The three main interfaces provided by the JDBC API to work with databases are: connection functionality.

2.

hich comes

4.7

conjunction with web technology server applications[22]. The database has been designed for speed, which would be useful in large transactions. MySQL is currently the most widely installed database, a well respected product that is more than capable of commercial operation. In fact, the entire Google search engine is built upon MySQL technology[23]. MySQL offers most of the functionality one will

1. Connection interface provides database

Statement interface provides SQL query representation and execution functionality.

3. ResultSet interface provides functionality for retrieving the data w from the execution of a SQL query using Statement.

MySQL

MySQL is a very popular open source database server which is commonly used in to create dynamic and powerful

(57)

expect from an RDBMS. It ensures that transactions comply with the ACID model (Atomicity, Consistency, Isolation, and Durability), allows the building of indexes, supports standard data types and allows for database replication, among other features. One area where MySQL falls short is its lack of certain features like sub-queries, constraints, views, cursors and objects. MySQL is fast, easy to use, is open source and if the application is a web application then MySQL meshes in perfectly with most of the web development languages. When using MySQL with java, the MySQL Connector/J driver needs to be downloaded from MySQL’s website [http://www.mysq

.8 Apache Tomcat

developm

application container that was created to run Servlets and Java Server Pages (JSP) in web applications. Java m

eb pages, servlets and JSP into a single directory structure. It can be thought of as a container h a deployment directory where you can place all your web application files for them

l.com/products/connector/j/].

4

Apache Tomcat (also codenamed Catalina) is a standalone Web server used as a ent server on your desktop. The Tomcat server [21] is a java based web

ust be installed for Tomcat to operate.

Tomcat organizes all the parts of a web application such as static w

whic cts as the

to execute without any hassles.

The root folder is the deployment folder where all the static html files and JSPs can be placed. The Servlets are placed in the ROOT/WEB-INF/Classes folder. Pros of Tomcat are that it is an open source project, stays on top of the Servlet API developments, and works extremely well. The cons are that it is not the fastest implementation and that you are on your own for support.

(58)

In the next chapter, a detailed discussion on the developed microarray data analysis tool is presented. The discussion is in the form of objectives of the analysis tool and the results that have been achieved along with screen shots of the system for easier understanding.

(59)

5. OBJECTIVES AND RESULTS

The aim of this thesis was to develop a freely available, platform independent plication for visualization, normalization and analysis of microarray experiments and so a tool which will guide the users through the steps of normalization and data analysis such as identifying differentially expressed genes and to cluster those differentially expressed genes into clusters of genes exhibiting similar behavior.

R, the statistical package which is freely available can be used to perform all types of analysis on microarray data, but it has the disadvantage of being a command line based package which requires the biologists to know the syntax of various commands and also requires the users to be familiar with programming techniques and concepts. The users have to in short, be well versed in R to perform efficient data analysis. Thus the motivation for the tool comes from the need for a easy to use, point and click kind of interface which is easily accessible over the internet, to which users can easily connect to, upload their data files retailored to a particular format and get both numeric and visual results for interpreting the data without having to worry about the intricacies of programming.

I will now discuss about the different objectives of the tool and the way they were implemented and also discuss the results of normalizing and analyzing the microarray data using the software tool, which was developed during the course of this work.

ap al

(60)

5.1 Uploading experiment data

The user can upload experimental data as a text file, which is more importantly in the format of a gene expression m

the genes and the expression values of the genes for different experiments (which may be different biological conditions, different time-points, etc).

atrix. The data text files should contain a listing of all

Figure 5.1 - Sample data file for analysis

As shown in figure 5.1, the gene names form the first column, the other columns are the different experiments and the numerical values represent the gene expression data for each gene under the different experiments. The data file can be uploaded through a user friendly web page as shown below and it will be saved in the working directory of the application on the server and will be used for all further analysis (figure 5.2).

(61)

Figure 5.2 - Process of uploading data files

5.2 Normalization methods

A wide choice of common normalization methods is offered to the user to remove the systematic bias within the data. It is also possible to add newly developed normalization techniques at a later stage.

All the previou e applied to the data

from the uploaded file. The normalization can be applied to more than one experimental column or to all the experimental columns. In the case of multiple experimental columns, all the columns are normalized with respect to the first experimental column using the normalization method selected. The normalized data can be downloaded as a text file for reference or for input to another system.

(62)

Figure 5.3 - Normalized data file using total intensity normalization