Gene expression profiling experiments typically involve the following steps:
1. Experimental design
2. Sample preparation
3. Hybridization and washing
4. Image acquisition (scanning)
5. Data pre-processing
6. Data analysis
These steps are discussed in more detail below.
2.2.1 Experimental design
Careful formulation of the questions to be addressed or the hypothesis to be tested lies at the heart of properly conducted gene expression studies (Yang and Speed, 2002). The aims of the experiment influence the choice of the microarray platform (cDNA arrays or oligonucleotide arrays), the number of technical replicates (arrays hybridized with the same sample) or biological replicates (arrays hybridized with different individual samples from the population being studied) needed, and the tools necessary for data pre-processing and analysis. Designing the experimental setup around clear objectives maximizes the information gained from the data whilst minimizing effort and cost (Stekel et al., 2003).
2.2.2 Sample preparation
Sample preparation consists of extracting mRNA from frozen tissue (freezing prevents further degradation or production of RNA). Isolation of mRNA can be performed directly from the cells or by purification from extracted total RNA. After isolation, mRNA is converted to cDNA using reverse transcriptase (RT) and labelled with a fluorescent dye. Sample preparation for Affymetrix GeneChips involves additional steps: isolated mRNA is linearly amplified before the RT reaction and the resulting cDNA is used to produce anti-sense complementary RNA (cRNA), which is fluorescently labelled. Detailed technical specifications of sample preparation are provided by a number of authors including Mahadevappa and Warrington (1999) and Hegde et al. (2000).
10 2.2. Gene expression profiling using microarrays
2.2.3 Hybridization and washing
The labelled sample is poured onto the microarray and incubation follows (at high salt concentrations, in the presence of formamide and at a temperature of 42 °C, or in an aqueous solution at a temperature of 65 °C) to promote probe-target hybridization. After the microarray is removed from the incubation chamber, stringent washes are performed (at lower salt concentrations and room temperature) to remove non-specifically and weakly bound targets. For high-quality gene expression profiling, the hybridization and washing solutions must be evenly mixed and spread on the microarray (Freeman et al., 2000).
2.2.4 Image acquisition
The microarray is scanned to produce an image containing the fluorescence intensity of each probe. This stage consists of exciting the dye using a laser beam, thus generating signals that depend on the quantity of target sample hybridized with each probe on the chip. Ideally, fluorescent signals should come only from targets that hybridized to their complementary probes. In practice, unwashed targets adhering to the slide, non-specific binding, and other chemicals used for array preparation, hybridization and washing also generate fluorescent signals (Scharpf et al., 2007). These residual signals (background noise) need to be accounted for as they affect probe value calculations.
2.2.5 Data pre-processing
Processing of the raw intensities generated during scanning consists of the following steps:
1. Background correction
2. Normalization
3. Summarization (Affymetrix GeneChips only)
These steps are discussed in more detail below.
Chapter 2. Feature selection methods for microarray gene expression data 11
Background correction
The purpose of background correction is to adjust the probe intensities by removing the background noise. Methods for estimating and subtracting the background noise are microarray technology dependent.
For cDNA microarrays, an unbiased estimate of the probe-specific background can be calculated from the local neighbourhood surrounding the probe by taking the mean of the pixel intensity values. The standard correction approach consists of subtracting the local background from the raw intensity values (Yin et al., 2005). To correct for negative adjusted intensities resulting from this approach, Edwards (2003) proposed a method that uses subtraction only when the difference exceeds a certain threshold; otherwise a smooth monotonic function that is linear on the log-scale with respect to the raw probe intensities is used. Yang et al. (2002) proposed sampling the probe-specific background at the probe nominal location from a background image estimated using morphological operations. This approach results in less variable background-adjusted intensities. Ritchie et al. (2007) provide a comparison of background correction methods for cDNA microarrays and their effect on identifying differentially expressed genes.
For oligonucleotide microarrays, the high density of the chip prohibits the use of local neighbourhood methods. Instead, the MM control probes, which are adjacent to their PM probes, can be used for background correction. Two widely used background correction methods are integral parts of the Affymetrix Microarray Suite 5.0 (MAS 5.0) (Affymetrix, 2002) and the robust multi-array average (RMA) (Irizarry et al., 2003).
The MAS 5.0 background correction method splits the chip into 16 rectangular zones of equal size. Zone-specific background and noise are taken as the mean and standard deviation of the lowest 2% of probe intensities in the zone. For each probe, background and noise are calculated as weighted averages of the zone-specific background and noise signals, respectively. Probe intensities are adjusted by subtracting their background unless the difference is less than the probe-specific noise, in which case the adjusted intensity is set to the noise.
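The zone-based procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the MAS 5.0 implementation: the function names, the inverse squared-distance weighting with a `smooth` constant, and the `noise_frac` parameter are assumptions made for the example. With `noise_frac=1.0`, the floor equals the probe-specific noise, as in the description above.

```python
import math
import statistics


def zone_background(intensities, zones_per_side=4, low_fraction=0.02):
    """Return a list of (centre, background, noise) tuples, one per zone.

    `intensities` is a 2-D list (rows x columns) of raw probe intensities.
    Background and noise are the mean and standard deviation of the lowest
    `low_fraction` of intensities in each zone (at least two values).
    """
    n_rows, n_cols = len(intensities), len(intensities[0])
    zones = []
    for zr in range(zones_per_side):
        r0 = zr * n_rows // zones_per_side
        r1 = (zr + 1) * n_rows // zones_per_side
        for zc in range(zones_per_side):
            c0 = zc * n_cols // zones_per_side
            c1 = (zc + 1) * n_cols // zones_per_side
            values = sorted(intensities[r][c]
                            for r in range(r0, r1) for c in range(c0, c1))
            k = max(2, math.ceil(low_fraction * len(values)))
            lowest = values[:k]
            centre = ((r0 + r1 - 1) / 2, (c0 + c1 - 1) / 2)
            zones.append((centre,
                          statistics.mean(lowest),
                          statistics.stdev(lowest)))
    return zones


def correct_background(intensities, smooth=100.0, noise_frac=1.0):
    """Subtract a distance-weighted background, flooring at the noise.

    The weights 1 / (squared distance to zone centre + smooth) are an
    assumption of this sketch, as is the `noise_frac` scaling of the floor.
    """
    zones = zone_background(intensities)
    corrected = []
    for r, row in enumerate(intensities):
        out_row = []
        for c, value in enumerate(row):
            weights = [1.0 / ((r - zr) ** 2 + (c - zc) ** 2 + smooth)
                       for (zr, zc), _, _ in zones]
            total = sum(weights)
            background = sum(w * z[1] for w, z in zip(weights, zones)) / total
            noise = sum(w * z[2] for w, z in zip(weights, zones)) / total
            out_row.append(max(value - background, noise_frac * noise))
        corrected.append(out_row)
    return corrected
```

With the default 4 x 4 grid, the input needs at least two probes per zone so that a standard deviation can be computed.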
The RMA background correction assumes that the observed PM values within a probe set follow a linear additive model consisting of an exponentially distributed true signal and a normally distributed background. The adjustment procedure consists of replacing each PM intensity with the expected value of the true signal given the observed intensity. This expectation depends on the particular PM being adjusted and on model parameters estimated from the PM values above their mode and the MM values below their mode.
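For illustration, the conditional expectation under this exponential-plus-normal model has a closed form that can be evaluated directly. The sketch below assumes that the background mean `mu`, the background standard deviation `sigma` and the exponential rate `alpha` have already been estimated; these names and the function itself are hypothetical, and the parameter estimation step is deliberately left out.

```python
import math


def _normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def rma_adjust(observed, mu, sigma, alpha):
    """Expected true signal given one observed PM intensity.

    Model assumed here: observed = signal + background, with
    signal ~ Exponential(rate=alpha) and background ~ Normal(mu, sigma^2).
    """
    a = observed - mu - sigma ** 2 * alpha
    b = sigma
    numerator = _normal_pdf(a / b) - _normal_pdf((observed - a) / b)
    denominator = _normal_cdf(a / b) + _normal_cdf((observed - a) / b) - 1.0
    return a + b * numerator / denominator
```

For intensities far above the background, the adjusted value approaches `observed - mu - sigma**2 * alpha`; for intensities near or below `mu`, it stays positive, which avoids the negative values produced by plain subtraction. The expression can lose numerical precision for intensities many standard deviations below `mu`.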
Normalization
The purpose of normalization is to adjust for systematic biases in sample collection (Fentz et al., 2004), hybridization (Han et al., 2006), labelling (Gregory Cox et al., 2004) and scanning (Bengtsson et al., 2004) so that unbiased comparisons can be made across technical and biological replicates. Robust and widely used normalization methods vary from global normalization (Stafford, 2012) to quantile normalization (Bolstad et al., 2003) and lowess normalization (Smyth and Speed, 2003).
Global normalization consists of multiplying the probe intensities of each array by an array-specific scaling factor so that the mean or median intensity is the same across arrays; quantile normalization consists of transforming the array-specific intensities so that their distributions have identical statistical properties; lowess normalization consists of removing the dependency of the log2 ratio values on the intensity using locally weighted linear regression. Normalization is an active field that has generated a considerable amount of literature. Comprehensive reviews are provided by a number of authors including Quackenbush (2002) and Bilban et al. (2002).
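The first two approaches are simple enough to sketch directly. The example below is a minimal illustration in which each array is a plain list of intensities; the function names are hypothetical, the common target for global scaling is assumed to be the mean of the array medians, and ties in quantile normalization are broken arbitrarily rather than averaged.

```python
import statistics


def global_normalize(arrays, target=None):
    """Scale each array so that its median equals a common target.

    If no target is given, the mean of the array medians is used
    (an assumption of this sketch).
    """
    medians = [statistics.median(a) for a in arrays]
    if target is None:
        target = statistics.mean(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


def quantile_normalize(arrays):
    """Give every array the same empirical distribution.

    The reference distribution is the mean of the sorted arrays, taken
    rank by rank; each value is replaced by the reference value of its rank.
    All arrays must have the same length.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [statistics.mean(s[i] for s in sorted_arrays)
                 for i in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # indices by rank
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = reference[rank]
        result.append(normalized)
    return result
```

After quantile normalization, every array contains exactly the same set of values, only in a different order, which is a stronger constraint than matching a single location statistic as global scaling does.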
Summarization
The purpose of summarization is to estimate probe set expression levels from associated probe-pair intensities. Summarization methods are divided into two major categories: single-chip methods (summarization is performed for each array independently) and multi-chip methods (summarization is performed by borrowing information across arrays).
One of the most widely used single-chip methods is MAS 5.0, which provides for each probe set a robust average of the differences between the PM values and the associated ideal mismatch (IM) values. The IM value is equal to the MM value if MM < PM and is slightly less than the PM value (according to an adjustment formula) if MM > PM. Subtraction of MM values from PM values can induce bias in probe set level summarization (Irizarry et al., 2003) as MM probes also measure specific hybridization (Wu et al., 2004b) and can often exceed the values of the PM probes (an effect observed in about one third of the probe sets) (Naef et al., 2002). In addition, the PM and MM intensities vary in probe-specific ways (Li and Wong, 2001). These factors, which confound accurate probe set level summarization, are not accounted for by MAS 5.0.
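The ideal mismatch rule and the robust average can be sketched as follows. This is an illustrative approximation, not the Affymetrix implementation: the robust average is taken to be a one-step Tukey biweight on the log2 scale, and the constants (`c`, `eps`, `contrast_tau`, `scale_tau`, `delta`) are assumed defaults recalled from the published algorithm description and should be checked against it before use. All function names are hypothetical.

```python
import math
import statistics


def tukey_biweight(values, c=5.0, eps=1e-4):
    """One-step Tukey biweight: a weighted mean that downweights outliers."""
    m = statistics.median(values)
    s = statistics.median(abs(v - m) for v in values)  # median abs. deviation
    weights = []
    for v in values:
        u = (v - m) / (c * s + eps)
        weights.append((1 - u * u) ** 2 if abs(u) < 1 else 0.0)
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def ideal_mismatch(pm, mm, contrast_tau=0.03, scale_tau=10.0):
    """Return IM values: MM where MM < PM, otherwise slightly below PM."""
    # Probe-set-specific background: robust mean of the log2(PM/MM) ratios.
    sb = tukey_biweight([math.log2(p / m) for p, m in zip(pm, mm)])
    im = []
    for p, m in zip(pm, mm):
        if m < p:
            im.append(m)
        elif sb > contrast_tau:
            im.append(p / 2 ** sb)
        else:
            shrink = contrast_tau / (1 + (contrast_tau - sb) / scale_tau)
            im.append(p / 2 ** shrink)
    return im


def mas5_signal(pm, mm, delta=2.0 ** -20):
    """Robust average of PM - IM for one probe set, on the original scale."""
    im = ideal_mismatch(pm, mm)
    log_diffs = [math.log2(max(p - i, delta)) for p, i in zip(pm, im)]
    return 2 ** tukey_biweight(log_diffs)
```

Because every IM value is strictly below its PM value, the differences PM - IM are always positive and the logarithm is defined, which is the purpose of replacing MM by IM.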
Multi-chip methods such as RMA, factor analysis for robust microarray summarization (FARMS) (Hochreiter et al., 2006) and multi-chip modified gamma Model for Oligonucleotide Signal (multi-mgMOS) (Liu et al., 2005) adjust for the systematic differences in probe hybridization affinities. RMA assumes that the log2 transformed PM intensities follow a linear additive model consisting of the true probe set level, a probe pair affinity effect and an independent and identically distributed error term. FARMS assumes that the log2 transformed PM intensities follow a factor analysis model consisting of a normally distributed true signal and normally distributed noise. Finally, multi-mgMOS uses a probabilistic model with gamma distributed PM and MM values that additionally accounts for the MM probes measuring specific hybridization. This method returns the estimated probe set levels and the uncertainties (standard errors) around these estimates, which are particularly useful for downstream analyses adopting a Bayesian framework.
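As an illustration of the RMA additive model, the sketch below fits it to one probe set by median polish, a robust fitting procedure commonly used for RMA. The input layout (rows are probes, columns are arrays), the function names and the stopping rule are assumptions of this example; the input is taken to be background-corrected, normalized, log2 transformed PM intensities.

```python
import statistics


def median_polish(matrix, max_iter=10, tol=1e-6):
    """Fit matrix[j][i] = overall + probe_effect[j] + array_effect[i] + residual.

    Row (probe) and column (array) medians are swept out of the residuals
    alternately until the medians removed in an iteration are all below `tol`.
    """
    resid = [row[:] for row in matrix]
    n_rows, n_cols = len(resid), len(resid[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(max_iter):
        # Sweep row medians into the probe effects.
        row_med = [statistics.median(r) for r in resid]
        for j in range(n_rows):
            row_eff[j] += row_med[j]
            resid[j] = [x - row_med[j] for x in resid[j]]
        shift = statistics.median(col_eff)
        col_eff = [c - shift for c in col_eff]
        overall += shift
        # Sweep column medians into the array effects.
        col_med = [statistics.median(resid[j][i] for j in range(n_rows))
                   for i in range(n_cols)]
        for i in range(n_cols):
            col_eff[i] += col_med[i]
        resid = [[x - col_med[i] for i, x in enumerate(r)] for r in resid]
        shift = statistics.median(row_eff)
        row_eff = [r - shift for r in row_eff]
        overall += shift
        if max(abs(m) for m in row_med + col_med) < tol:
            break
    return overall, row_eff, col_eff, resid


def rma_summarize(log2_pm):
    """Return one log2 expression estimate per array for a single probe set."""
    overall, _, array_eff, _ = median_polish(log2_pm)
    return [overall + a for a in array_eff]
```

Because medians rather than means are swept out, a single aberrant probe on one array has little influence on the estimated probe set levels, and the fitted probe effects absorb the systematic differences in probe affinities.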
2.2.6 Data analysis
The pre-processed data is analysed to extract biologically relevant information. Analysis methods are purpose-specific and the choice of algorithm influences the quality of the results, which are usually based on a list of selected genes. These genes need to be annotated with functional information (molecular functions and biological processes associated with the genes) in order to strengthen the understanding of the biological process being studied.
Functional annotation can be performed using Gene Ontology (Consortium et al., 2004) which unifies functional information in highly interconnected structures and under a common nomenclature. If valuable knowledge was gained by addressing the questions or testing the hypothesis of the experiment, the data is published following carefully defined standards (Brazma et al., 2001) into a public repository (Gardiner-Garden and Littlejohn, 2001).