A Sample Analysis Pipeline - EMMA2 : a MAGE-compliant system for the analysis of microarray data

sites and intron-exon boundaries can thereby be detected with a precision up to the length of a single oligonucleotide. Tiling microarrays are produced using in-situ synthesized oligos due to the higher density this technique allows.

Oligonucleotides for tiling arrays need to cover the complete genomic sequence; therefore they cannot be optimized with respect to low cross-hybridization and equal melting temperature. As a result a higher variation of signal intensity between oligonucleotides is likely to occur. This problem has to be dealt with in subsequent data analysis steps.
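The effect of fixed oligonucleotide positions on melting temperature can be illustrated with a small sketch. It is not taken from the source: the sequence is a made-up fragment, the function names `tile` and `wallace_tm` are hypothetical, and the Wallace rule (2 °C per A/T, 4 °C per G/C) is only a rough estimate intended for short oligonucleotides. It merely shows that oligos taken at fixed positions inherit the local base composition of the genome.

```python
def tile(sequence, oligo_len, step):
    """Return oligos of length oligo_len starting every `step` bases."""
    return [sequence[i:i + oligo_len]
            for i in range(0, len(sequence) - oligo_len + 1, step)]


def wallace_tm(oligo):
    """Rough melting temperature in deg C (Wallace rule): 2*(A+T) + 4*(G+C)."""
    at = oligo.count("A") + oligo.count("T")
    gc = oligo.count("G") + oligo.count("C")
    return 2 * at + 4 * gc


# Hypothetical genomic fragment: an AT-rich stretch followed by a GC-rich one.
genome = "ATATTTAAATATTAA" + "GCGCGGCCGCGGGCCGC"

# Overlapping 8-mers, one every 4 bases; positions are dictated by the genome,
# so the base composition of each oligo cannot be chosen freely.
oligos = tile(genome, oligo_len=8, step=4)
tms = [wallace_tm(o) for o in oligos]
tm_spread = max(tms) - min(tms)

for oligo, tm in zip(oligos, tms):
    print(oligo, tm)
print("Tm spread:", tm_spread)
```

In a design with freely selectable reporters, oligos with outlying melting temperatures would simply be discarded; a tiling design has to keep them, which is why the resulting intensity variation must be handled during analysis.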

3.1.5 Protein Arrays

Microarray production technology, originally developed to deposit DNA reporters, has since been adapted to small peptides, proteins, and antibodies. Protein microarrays can be produced either by spotting purified solutions of molecules or by light-directed synthesis of short peptides on the surface.

Spotted antibody arrays are mainly used for the detection and quantification of proteins and protein abundance in a complex mixture. New approaches allow the deposition of full-length functional proteins. Arrays of this type are used to study protein–protein, protein–DNA and protein–small molecule interactions (Bertone and M, 2005; Doi et al., 2002; Zhu et al., 2001). Protein microarray experiments currently use single-dye techniques.

3.2 A Sample Analysis Pipeline

Although an actual microarray experiment can be performed in many different ways, several steps of analysis are, on a more abstract level, common to every such experiment. A fixed series of operations or experimental steps has to be followed, imposed by the technological requirements. Some steps are not specific to microarrays but apply to biological experiments in general.

A rough categorization is sometimes based on laboratory versus computer-based analysis tasks, but this is not very helpful because it is an artificial separation by the location where the analysis steps are performed. It is better to divide experiments into three logical stages:

A planning and design phase, during which the experimenter defines the initial hypothesis, consults the literature, and decides which effects to study, which variables or quantities to measure, and which instruments to use.

The experiment conduction wherein the object of study is eventually exposed to experimental conditions and quantities of interest are measured.

The analysis phase, in the course of which measurement results are evaluated and interpreted to reach conclusions.

These abstract definitions can be further subdivided, as there can be various levels of complexity depending on the type of experiment; and for microarrays many of these steps involve complex biochemical reactions. The following description of a microarray experiment pipeline is in accordance with the steps described in several textbooks on microarray analysis, although none of them has explicitly defined a workflow diagram (Baldi and Hatfield, 2002; Kohane et al., 2003; Parmigiani et al., 2003; Schena, 2003).

The planning and design phase is most important for a successful experiment. Planning may even raise the question of whether to carry out the experiment at all, since similar measurements may already have been made. Moreover, it is necessary to identify the free and dependent variables in the experiment and, most importantly, to state the initial experimental hypothesis or experimental question.¹ It will be further assumed that the choice of measurement techniques has been made and includes microarrays.

The experimenters need to select the appropriate microarray technology and may even produce their own arrays. The number of replicate measurements also needs to be assessed so that expression can be measured at the desired level of confidence. Power analysis methods serve this purpose: they help in experiment design by providing an estimate of the number of replicates, given the desired power (the ability to detect a large proportion of the differentially expressed genes), the confidence level, and the variability of the data (see for example Pan et al. (2002), Black and Doerge (2002), Li et al. (2005), Page et al. (2006), and in particular Fu and Jansen (2006) for a software implementation of power analysis based on publicly available data). Efficient experiment design needs to consider which conditions to compare directly when using a multi-channel platform. A good assignment of experimental factors should result in high experimental power while minimizing the number of required microarrays and thus the costs. Many methods for finding good or even optimal designs have been proposed in the works of Kerr and Churchill (2001), Kerr (2003), Churchill (2002), Yang and Speed (2002), Fu and Jansen (2006) and many more.
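How power, confidence level, and variability combine into a replicate estimate can be sketched with a textbook approximation. The following is not the method of any of the cited works. It assumes a two-sided comparison of two conditions for a single gene, normally distributed log2 intensities with equal standard deviation `sigma` in both conditions, and the normal approximation n = 2 (z_{1-α/2} + z_{power})² σ² / δ². The function name and its parameters are hypothetical.

```python
import math
from statistics import NormalDist


def replicates_per_group(effect, sigma, alpha=0.05, power=0.8):
    """Approximate replicates per condition (normal approximation).

    effect: smallest log2 expression difference that should be detected
    sigma:  assumed standard deviation of log2 intensities within a condition
    alpha:  two-sided significance level for a single gene
    power:  desired probability of detecting a true difference of size `effect`
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    n = 2 * ((z_alpha + z_power) * sigma / effect) ** 2
    return math.ceil(n)


# Detecting a two-fold change (1 unit on the log2 scale) at two noise levels.
print(replicates_per_group(effect=1.0, sigma=0.5))
print(replicates_per_group(effect=1.0, sigma=1.0))

# A stricter per-gene significance level, as needed when thousands of genes
# are tested at once, increases the required number of replicates.
print(replicates_per_group(effect=1.0, sigma=0.5, alpha=0.001))
```

For the small replicate numbers typical of microarray studies the normal approximation tends to be optimistic; a t-distribution-based calculation, or one of the dedicated methods cited above, would give more conservative estimates.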

The phase of experiment conduction can be further divided into the generation of the actual biological sample and the measurement process. Sample generation is the process of studying an organism, a cell or parts thereof, or even a community of organisms, while controlling the free experimental variables. Biological molecules of interest (RNA, DNA, protein) are subsequently extracted. Sample generation is the most variable part of the experiment. Sample measurement is the next step, which involves labeling and hybridization as complex biochemical reactions.

Further data acquisition and processing involves computer hardware and software and is still part of the measurement process. Images of the arrays are acquired. In a first processing step, image analysis algorithms are applied to reduce pixel-related image data to intensity statistics for each feature on the array.
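The reduction from pixels to per-feature statistics can be illustrated with a minimal sketch. It assumes that the position and radius of a feature are already known and uses a fixed circle as foreground, with the remainder of a surrounding square window as local background. Real image analysis software additionally performs grid finding, more elaborate segmentation, and quality flagging. The function name `quantify_feature` and the toy image are hypothetical.

```python
from statistics import median


def quantify_feature(image, center_row, center_col, radius):
    """Reduce the pixels around one feature to a few intensity statistics.

    image is a list of rows of pixel intensities. Pixels within `radius` of
    the center are treated as foreground; all other pixels inside a square
    window of half-width 2 * radius are treated as local background.
    """
    foreground, background = [], []
    window = 2 * radius
    row_range = range(max(0, center_row - window),
                      min(len(image), center_row + window + 1))
    for r in row_range:
        col_range = range(max(0, center_col - window),
                          min(len(image[r]), center_col + window + 1))
        for c in col_range:
            inside = (r - center_row) ** 2 + (c - center_col) ** 2 <= radius ** 2
            (foreground if inside else background).append(image[r][c])
    fg, bg = median(foreground), median(background)
    return {"foreground": fg, "background": bg, "signal": fg - bg}


# Toy 5x5 image: uniform background of 10 with a small bright spot of 100.
image = [[10] * 5 for _ in range(5)]
for r, c in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    image[r][c] = 100

stats = quantify_feature(image, center_row=2, center_col=2, radius=1)
print(stats)
```

Medians are used here because they are less sensitive than means to isolated bright pixels such as dust particles; this is a design choice of the sketch, not a statement about any particular software package.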

¹ Often this is as simple as trying to find all the genes differentially expressed under condition X vs. condition Y.
