1 INTRODUCTION
1.8 Techniques for global gene expression analysis
The field of microarray technology evolved from E.M. Southern's realization that labelled nucleic acid molecules can be hybridized to their complementary counterparts and can therefore be used to detect the presence and amount of those sequences in the original sample (Southern, 1975). The sequencing of whole genomes from humans and from many laboratory animal species accelerated the development of new technologies for measuring the expression of several thousand genes in a single experiment (Brown & Botstein, 1999; Schena, 1996). Microarray technologies are now used for a wide spectrum of applications, such as drug discovery, basic research and target discovery, biomarker determination, pharmacology, toxicology, target selectivity, development of prognostic tests and disease-subclass determination (Butte, 2002). A wide range of platforms for global gene expression analysis is currently available. Although they are all either cDNA or oligonucleotide based, they differ in properties such as the type of probe (short or long oligonucleotides, cDNA), the number of genes, probe selection and design, competitive versus non-competitive hybridization, labelling methods and the method of production (in situ polymerization, spotting, microbeads). In the following paragraphs, the BeadChip technology of Illumina Inc. and the Affymetrix GeneChip, both used during this work, are introduced.
Illumina BeadChip arrays
In 2003, Illumina Inc. developed a bead-based technology for global gene expression analysis (Gunderson et al., 2004). The chips consist of a silicon wafer carrying 3 µm beads to which 50mer oligonucleotide probes are covalently bound. Each bead type carries a single probe type representing one gene, with more than 100,000 copies per bead. All bead types are pooled and applied to the surface of a silicon wafer (Figure 9), which has previously been plasma etched to provide wells at a regular spacing of 5 µm. Each array contains about 900,000 beads, so on a whole-genome array each bead type is represented about 30 times on average. At this redundancy, up to 30,000 genes can be detected simultaneously per array. Because the beads are arranged randomly and with high redundancy, local effects (scratches, impurities and intensity variation) are of minor consequence, but the random arrangement also makes an initial decoding step necessary. For this purpose, each probe consists not only of the gene-specific part (50 nt) but also of a 23 nt address sequence. Decoding is performed by Illumina Inc. by sequential hybridizations with differently coloured probes and serves at the same time as an important quality control step (Gunderson et al., 2004).
[Figure 9, panels A and B. Axis labels of panel B: amount of beads per chip; number of different bead types.]
Figure 9: The production process of an Illumina BeadChip array. A) The structure of a single bead, the generation of a bead pool and its combination with a previously etched silicon wafer to form a complete BeadChip. B) Histogram of the average abundance of bead types per chip.
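The redundancy figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch and not part of the Illumina workflow. It assumes that beads settle into the wells at random, so that the number of copies of a given bead type is roughly Poisson distributed. The helper `poisson_cdf` and the threshold of five copies are illustrative choices.

```python
import math

# Approximate figures quoted in the text above.
beads_per_array = 900_000
bead_types = 30_000

# Mean number of beads carrying the same probe if beads are drawn at random.
mean_copies = beads_per_array / bead_types


def poisson_cdf(k: int, lam: float) -> float:
    """Return P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k + 1))


# Assumption: copy numbers per bead type follow a Poisson distribution.
# Probability that a given bead type occurs fewer than 5 times on one array.
p_fewer_than_5 = poisson_cdf(4, mean_copies)

print(f"mean copies per bead type: {mean_copies:.0f}")
print(f"P(fewer than 5 copies):    {p_fewer_than_5:.2e}")
```

Under this assumption, a bead type that is almost absent from an array is very unlikely, which is consistent with the robustness against local defects described above.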
Affymetrix gene expression arrays
Affymetrix arrays are based on in situ synthesis of oligonucleotides directly on the array surface. The 25 nt probes are synthesized on a silicon wafer by a combination of photolithography and combinatorial chemistry (McGall & Fidanza, 2001). For each gene, Affymetrix uses 11 to 20 probe pairs, each pair consisting of a 25 nt perfect match and a 25 nt mismatch oligonucleotide, to guarantee
Figure 10: Scheme of the process and the architecture of an Affymetrix gene expression array (Taken from the Affymetrix homepage, www.affymetrix.com).
For the arrays of both suppliers, Illumina and Affymetrix, RNA has to be isolated from the sample and reverse transcribed in order to produce biotinylated cRNA before hybridization. This labelling procedure allows detection and quantification, which would not otherwise be possible. After scanning, the raw data must be preprocessed before statistical analysis. The relative expression level of each gene can then be determined by comparing the intensities of the genes to each other or to a control. Each set of microarray data has to be normalized in a way that is appropriate to the technical aspects and the layout of the experiment. Both techniques are discussed in detail in chapter 3.1.
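The preprocessing idea described above can be illustrated with a small numerical sketch. The text does not prescribe a particular normalization method. The example below assumes a simple log2 transformation followed by median centring of each array, and then compares treated arrays to control arrays. The function names and the toy intensity values are hypothetical.

```python
import numpy as np


def median_center_log2(intensities: np.ndarray) -> np.ndarray:
    """Log2-transform raw intensities and centre each array on its median.

    intensities: genes x arrays matrix of positive raw signal values.
    """
    log_values = np.log2(intensities)
    return log_values - np.median(log_values, axis=0, keepdims=True)


def log2_ratio_to_control(normalized, treated_cols, control_cols):
    """Per-gene difference of mean log2 signal, treated minus control."""
    treated_mean = normalized[:, treated_cols].mean(axis=1)
    control_mean = normalized[:, control_cols].mean(axis=1)
    return treated_mean - control_mean


# Hypothetical toy data: 5 genes x 4 arrays (columns 0-1 control, 2-3 treated).
# The treated arrays were scanned twice as bright overall, and gene 0 is
# additionally four-fold induced by the treatment.
raw = np.array([
    [100.0, 100.0, 800.0, 800.0],
    [200.0, 200.0, 400.0, 400.0],
    [400.0, 400.0, 800.0, 800.0],
    [800.0, 800.0, 1600.0, 1600.0],
    [1600.0, 1600.0, 3200.0, 3200.0],
])

normalized = median_center_log2(raw)
log_ratios = log2_ratio_to_control(normalized, [2, 3], [0, 1])
print(log_ratios)
```

In this toy example the overall brightness difference between the arrays cancels out, so only gene 0 retains a log2 ratio of about 2 (a four-fold change), while the remaining genes stay close to 0.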
Methods of data analysis
DNA microarray technology has made it possible to generate millions of data points in a relatively short time. The analytical steps needed to convert the noisy data into reliable and interpretable biological information are challenging and error prone. Because of the great number of available methods, only an overview of the most common and important methods and algorithms used during these studies is presented. In principle, there are two main statistical approaches to identifying genes or patterns of interest in microarray data. Supervised methods are used to identify patterns of gene expression, e.g. for the identification of marker genes or the classification of compounds. Unsupervised methods identify signatures in the data set without input of data-specific knowledge and can be used to summarize and reduce the complexity of the multidimensional data. Important unsupervised tools include Principal Component Analysis (PCA), hierarchical clustering, correlation and Self-Organizing Maps (SOM) (Butte, 2002). PCA is an attempt to reduce the dimensionality of microarray data. To this end, vectors (so-called eigenvectors) are calculated, each capturing the greatest remaining amount of variance of the data cloud of a given experiment (Figure 11). The eigenvectors with the largest eigenvalues, which are statistically the most relevant, are used as plot axes, resulting in a single point per sample in two- or three-dimensional space. PCA is therefore a good tool for data reduction and display (Yeung & Ruzzo, 2001).
Figure 11: Graphical representation of a PCA transformation in two dimensions (x and y). The variance of the data in the original space (x, y) is best captured by the basis vectors v1 and v2, which in turn serve as the axes for locating each experiment in the corresponding PCA plot.
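The PCA transformation described above can be sketched as follows. This is a minimal illustration and not the software used in this work. It assumes that the principal axes are obtained from a singular value decomposition of the mean-centred data, and it uses simulated data with two hypothetical sample groups.

```python
import numpy as np


def pca_scores(data: np.ndarray, n_components: int = 2):
    """Project samples onto the leading principal components.

    data: samples x genes matrix (e.g. log2 expression values).
    Returns (scores, explained_variance_ratio).
    """
    centered = data - data.mean(axis=0, keepdims=True)
    # Rows of vt are the eigenvectors of the gene covariance matrix,
    # ordered by decreasing variance (singular value).
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = s**2 / (data.shape[0] - 1)
    scores = centered @ vt[:n_components].T
    ratio = variances[:n_components] / variances.sum()
    return scores, ratio


# Simulated data (assumption): 6 samples x 50 genes in two groups of three.
# In the second group the first 10 genes are shifted upwards.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 0.1, size=(6, 50))
data[3:, :10] += 2.0

scores, ratio = pca_scores(data)
print(scores[:, 0])  # position of each sample on the first component
print(ratio)         # share of total variance captured per component
```

Each sample is reduced from 50 values to two coordinates. In this simulated case, the first component separates the two groups because the group difference is the dominant source of variance.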
Several different algorithms can be applied depending on the structure of the dataset and the aim of the analysis. Hierarchical clustering calculates the distances between the sample or gene profiles and visualizes them in the form of a dendrogram. Experiments closer to each other are more similar to each other than those further apart.
SOM was used to group genes according to their expression profile (Nikkilä et al., 2002).
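The principle of hierarchical clustering mentioned above can be sketched in a few lines. The example assumes average linkage and a distance of 1 minus the Pearson correlation, which is one common choice among several. The text does not state which linkage or distance was used, and the function name and toy profiles are hypothetical.

```python
import numpy as np


def average_linkage(profiles: np.ndarray):
    """Agglomerative clustering with average linkage on 1 - Pearson r.

    profiles: items x features matrix (items may be samples or genes).
    Returns the merge history as (members_a, members_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    dist = 1.0 - np.corrcoef(profiles)
    clusters = [[i] for i in range(profiles.shape[0])]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return merges


# Hypothetical profiles: items 0 and 1 rise, items 2 and 3 fall.
profiles = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.0, 3.2, 3.9],
    [4.0, 3.0, 2.0, 1.0],
    [3.9, 3.1, 2.2, 1.0],
])

merges = average_linkage(profiles)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

The two rising and the two falling profiles are each merged at a small distance, and the two resulting clusters are joined last at a large distance. This exhaustive search is only meant to show the principle. For real data sets a library routine such as `scipy.cluster.hierarchy.linkage` would usually be preferred.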
Supervised methods include the t-test and the Analysis of Variance (ANOVA). The t-test was applied to detect differences between the empirical mean values of two datasets and to assign statistical confidence to the detected differences. ANOVA was used to identify genes in a multivariate model whose expression is significantly altered between different biological samples. First described by R.A. Fisher in the 1920s, ANOVA partitions the observed variance into components attributable to different explanatory variables and allows the effects of two or more treatment variables to be studied simultaneously. Other supervised methods include classification methods, such as Support Vector Machines or k-nearest-neighbour analysis. These algorithms “learn” from a training set to classify data into preset categories and can then assign new data to one of these categories (Raudys, 2000). Additionally, the minimum number of genes needed for this discrimination can be determined by ranking the genes.
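The combination of gene ranking and classification described above can be sketched as follows. This is an illustration under stated assumptions and not the procedure used in this work: genes are ranked by the absolute Welch t-statistic, a 1-nearest-neighbour classifier stands in for the classification methods named above, and performance is estimated by leave-one-out cross-validation. The data are simulated and all function names are hypothetical.

```python
import numpy as np


def welch_t(group_a: np.ndarray, group_b: np.ndarray) -> np.ndarray:
    """Per-gene Welch t-statistic; inputs are samples x genes matrices."""
    mean_diff = group_a.mean(axis=0) - group_b.mean(axis=0)
    se = np.sqrt(group_a.var(axis=0, ddof=1) / group_a.shape[0]
                 + group_b.var(axis=0, ddof=1) / group_b.shape[0])
    return mean_diff / se


def loo_accuracy_1nn(data: np.ndarray, labels: np.ndarray, top_k: int) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier.

    Genes are re-ranked by |t| inside every fold, so the left-out
    sample never influences the gene selection.
    """
    n_samples = data.shape[0]
    correct = 0
    for i in range(n_samples):
        train = np.delete(np.arange(n_samples), i)
        x_train, y_train = data[train], labels[train]
        t = welch_t(x_train[y_train == 0], x_train[y_train == 1])
        genes = np.argsort(-np.abs(t))[:top_k]
        distances = np.linalg.norm(x_train[:, genes] - data[i, genes], axis=1)
        correct += int(y_train[np.argmin(distances)] == labels[i])
    return correct / n_samples


# Simulated data (assumption): 10 samples x 200 genes, two classes of five.
# Only the first three genes differ between the classes.
rng = np.random.default_rng(1)
labels = np.array([0] * 5 + [1] * 5)
data = rng.normal(0.0, 0.2, size=(10, 200))
data[labels == 1, :3] += 5.0

t_all = welch_t(data[labels == 0], data[labels == 1])
ranking = np.argsort(-np.abs(t_all))
accuracy_by_k = {k: loo_accuracy_1nn(data, labels, k) for k in (1, 3, 10)}

# Smallest gene set that reaches the best observed accuracy.
best = max(accuracy_by_k.values())
min_genes = min(k for k, acc in accuracy_by_k.items() if acc == best)
print(ranking[:3], accuracy_by_k, min_genes)
```

The t-statistics alone do not give p-values. For those, a routine such as `scipy.stats.ttest_ind` with `equal_var=False` could be used. The last step mirrors the idea of determining the minimum number of genes by ranking: gene sets of increasing size are evaluated and the smallest one with the best accuracy is reported.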