
Chapter 2: Gene Expression Microarrays

2.2 Microarray Normalization

When examining microarray data, two types of variation exist: informative variation and obscuring variation. Informative (or interesting) variation results from the conditions under study, such as alterations accompanying a disease state, the effect of a protein or gene knockout, changes in environmental conditions (such as nutrients or temperature), introduction of infectious agents, mutations, or cellular stresses. Obscuring variation, on the other hand, arises while the experiment is carried out and can mask the interesting biological variation between the two conditions. It can be introduced during sample preparation, for example through differences in mRNA extraction, temperature fluctuations or reagent quality. It can also be introduced while manufacturing the arrays, through differences in the hybridization efficiency of the probes and in probe concentrations. Finally, it can arise from the processing of arrays, either during hybridization of the samples (differences in the amount of sample applied, buffer concentration and cross-hybridization interference) or after hybridization (variation in fluorescent intensity, optical measurements and imaging algorithms) [60]. The purpose of normalization prior to analysis is therefore to deal with these obscuring variations [61]; unless the arrays are appropriately normalized, comparing values from different samples can lead to misleading results [62].

2.2.1 Robust Multichip Average Algorithm

In order to produce gene expression levels that are representative of the hybridized DNA or mRNA samples, the probe intensities for each probe set have to be summarized. The robust multichip average (RMA) algorithm uses perfect match (PM) intensities only, as opposed to some other algorithms that use both PM and mismatch (MM) probes. The MM probes are excluded because reports have shown that the typical subtraction of MM values to correct for non-specific binding is not necessarily appropriate, since the mathematical subtraction does not directly correspond to biological subtraction [55]. The pre-processing of Affymetrix microarray data includes three main steps: background adjustment, normalization and final summarization of probe expression levels. Since all arrays are assumed to have a common mean background level, the PM intensities are adjusted to remove the background effect, thus providing a more accurate absolute level of probe expression. Following background correction, probe values are normalized using quantile normalization [62]. Values are transformed using the empirical distribution of each array and the empirical distribution of the averaged sample quantiles [61]. The purpose of quantile normalization is to make the distribution of probe intensities the same across all the arrays [61-62], and the method has been shown to produce favorable outcomes in terms of speed, variance and bias criteria when compared to other normalization algorithms [61]. Finally, for each probe set, the background-corrected, normalized and log2-transformed PM intensities are fitted to a linear additive model to remove probe-specific affinities. Median polishing is used to estimate the model parameters and to protect against outlier probes, which gives RMA its robustness [62].
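The normalization and summarization steps described above can be illustrated with a minimal NumPy sketch. It is not the reference RMA implementation: background adjustment is omitted, ties are not given averaged ranks, and the function names (quantile_normalize, median_polish) and the iteration settings are assumptions introduced here for illustration only.

```python
import numpy as np


def quantile_normalize(x):
    """Give every array (column) of a probes-by-arrays matrix the same distribution."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, axis=0)                    # sort order within each array
    sorted_x = np.take_along_axis(x, order, axis=0)  # empirical distribution per array
    reference = sorted_x.mean(axis=1)                # averaged sample quantiles
    out = np.empty_like(x)
    # Write the k-th reference quantile to the probe that ranked k-th in each array.
    np.put_along_axis(out, order, reference[:, None], axis=0)
    return out


def median_polish(y, n_iter=10, tol=1e-6):
    """Fit y[i, j] = overall + probe[i] + array[j] + residual by median polish.

    y is a probes-by-arrays matrix of log2 intensities for ONE probe set.
    n_iter and tol are illustrative defaults, not values from the source.
    """
    resid = np.asarray(y, dtype=float).copy()
    overall = 0.0
    probe = np.zeros(resid.shape[0])
    array = np.zeros(resid.shape[1])
    for _ in range(n_iter):
        # Sweep probe (row) medians out of the residuals.
        row_med = np.median(resid, axis=1)
        resid -= row_med[:, None]
        probe += row_med
        shift = np.median(array)
        array -= shift
        overall += shift
        # Sweep array (column) medians out of the residuals.
        col_med = np.median(resid, axis=0)
        resid -= col_med[None, :]
        array += col_med
        shift = np.median(probe)
        probe -= shift
        overall += shift
        if np.abs(row_med).max() < tol and np.abs(col_med).max() < tol:
            break
    return overall, probe, array, resid


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Simulated PM intensities: 11 probes x 4 arrays, treated here as one probe set.
    # In practice, normalization runs over all probes on the chip and the
    # summarization runs separately for each probe set.
    pm = rng.lognormal(mean=7.0, sigma=1.0, size=(11, 4))
    log_pm = np.log2(quantile_normalize(pm))
    overall, probe, array, _ = median_polish(log_pm)
    print("log2 expression per array:", overall + array)
```

In this sketch the expression value of a probe set on array j is overall + array[j], while probe[i] absorbs the probe-specific affinity. Because medians rather than means are swept out, a single outlying probe has little influence on the array-level estimates.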

2.2.2 Reference Robust Multichip Average

Classic microarray normalization and summarization methods, including RMA and other multiple-array-dependent algorithms, share a major limitation: the model is fitted to, and applied to, the same set of samples. In other words, the training samples and the test samples are the same. This dependency restricts the expansion of the model to additional data, because no archived parameters exist that can be applied to an updated database of microarray samples. As a result, data from two studies cannot be directly compared if each has been normalized separately, since each analysis used different data to define the normalization parameters and estimate the probe effects. The normalization technique must therefore be reapplied to the data as a whole to avoid pre-processing bias, a process that can create several constraints when dealing with large amounts of data, including time and memory restrictions. The reference robust multichip average (refRMA) algorithm, however, allows for the construction of a static normalization scheme that can be applied to added data on a continual basis [63]. The normalization process is termed static since the previously normalized data are not re-normalized with the addition of new data.

In short, a large number of biologically distinct Affymetrix microarrays are used to train the RMA model. The same steps are applied to the training data as in classical RMA, namely background adjustment, quantile normalization and median polishing. The training process then produces two archived vectors: a probe effect vector compiled from the individual log-scale probe affinity effects, and a normalization vector compiled from the transformed PM intensities. The resulting vectors can then be extended to new test data by using the predetermined group of arrays to estimate the effects and the average empirical distribution that should be used for the added data. The final step differs, however, in that a full median polish summarization cannot be performed, so the median is taken across the probes of each probe set, resulting in probe set level summaries [63].
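How the two archived vectors could be applied to a single new array is sketched below. This is a hedged illustration rather than the refRMA implementation from [63]: it assumes that the normalization vector is stored as log2-scale reference quantiles, that the new array contains the same probes in the same order as the training arrays, and that the probe effect is subtracted before the median is taken. All function and parameter names are hypothetical.

```python
import numpy as np


def normalize_to_reference(pm, reference_log2):
    """Map one new array onto a frozen reference distribution (log2 scale assumed)."""
    pm = np.asarray(pm, dtype=float)
    ranks = np.argsort(np.argsort(pm))         # 0-based rank of each probe on the new array
    return np.sort(reference_log2)[ranks]      # k-th ranked probe gets the k-th reference quantile


def summarize_new_array(pm, reference_log2, probe_effects, probe_set_ids):
    """Return one expression value per probe set for a single new array."""
    norm = normalize_to_reference(pm, reference_log2)
    adjusted = norm - np.asarray(probe_effects, dtype=float)  # remove archived probe affinities
    ids = np.asarray(probe_set_ids)
    # A full median polish needs several arrays, so a plain median across
    # the probes of each probe set is used instead.
    return {str(ps): float(np.median(adjusted[ids == ps])) for ps in np.unique(ids)}


if __name__ == "__main__":
    # Toy archived vectors; real ones would come from the training arrays.
    reference_log2 = np.array([6.0, 7.0, 8.0, 9.0])
    probe_effects = np.array([0.5, -0.5, 0.0, 1.0])
    probe_set_ids = np.array(["a", "a", "b", "b"])
    new_pm = np.array([100.0, 400.0, 200.0, 800.0])
    print(summarize_new_array(new_pm, reference_log2, probe_effects, probe_set_ids))
```

Because nothing in these functions depends on the other arrays in a database, previously processed samples never need to be revisited when a new array is added, which is the sense in which the scheme is static.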

2.2.3 Custom Chip Definition Files

Microarray data requires the use of chip definition files (CDFs) in order to process the raw information obtained from the data files. Affymetrix CDF files encode the physical design of the chip. They also contain the sequence details that can be used to link the oligonucleotide probes that are present on the chip to the investigated transcripts [64]. Much attention has been directed towards statistical algorithms for normalizing data and detecting differentially expressed genes, yet problems related to probe and probe set identity can result in significant errors, especially when expression changes are not dramatic. Affymetrix initially utilized the complete information available at the design stage, but with the immense progress achieved in genome sequencing and annotation in the past years, the Affymetrix probe set designs have become suboptimal [65]. A gap therefore exists in the correspondence between the probes and probe sets on the Affymetrix chips and the genes and transcripts [64-65]. Probe set annotation is constantly updated by Affymetrix and it has deviated from the original one-to-one correspondence between probe set and transcription locus. Nonetheless, the updates affect the qualitative attributes of the probe sets that control the effective matching between probes and genome sequences [64]. Analysis of chip definition files has revealed that several of the old probe sets do not truly reflect the expression levels of several significant genes in a given tissue [65].

Entrez custom CDF files are part of a collection of custom CDF files created by Dai et al. [65]. The process includes mapping probe sequences to individual sequences found in UniGene, dbSNP, and the genome sequence of the species and then aligning these sequences. Probes matching non-transcribed regions are excluded and only probes that have one perfect match with the corresponding genome sequence are retained. The probes in all probe sets are also required to be aligned in the same direction on the genome. Finally, each probe set has to contain at least three probe pairs [65]. The resulting expression data is representative of an individual gene, as opposed to data from the original Affymetrix CDFs, which produce intensities per probe set, where one gene can be represented by more than one probe set and one probe set can map to multiple genes. The development of custom CDF files has been shown to significantly improve the outcomes of differential expression microarray analysis [64-65].
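The filtering rules listed above can be expressed as a short sketch. It is an illustration under stated assumptions, not the pipeline of Dai et al. [65]: the ProbeAlignment record and its fields are hypothetical, each record is treated as one probe pair, and a candidate probe set whose probes align in mixed directions is simply discarded, which is a simplification chosen here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeAlignment:
    """Hypothetical alignment summary for one probe pair."""
    probe_id: str
    gene_id: str       # target gene identifier, e.g. an Entrez Gene ID
    perfect_hits: int  # number of perfect matches in the genome
    strand: str        # "+" or "-"
    transcribed: bool  # False if the probe matches a non-transcribed region


def build_custom_probe_sets(alignments, min_probes=3):
    """Group probes by gene and keep only probe sets that pass every filter."""
    by_gene = defaultdict(list)
    for a in alignments:
        # Keep probes in transcribed regions with exactly one perfect genome match.
        if a.transcribed and a.perfect_hits == 1:
            by_gene[a.gene_id].append(a)

    probe_sets = {}
    for gene_id, probes in by_gene.items():
        same_direction = len({p.strand for p in probes}) == 1
        # Require a single alignment direction and at least min_probes probes.
        if same_direction and len(probes) >= min_probes:
            probe_sets[gene_id] = [p.probe_id for p in probes]
    return probe_sets


if __name__ == "__main__":
    demo = [
        ProbeAlignment("p1", "g1", 1, "+", True),
        ProbeAlignment("p2", "g1", 1, "+", True),
        ProbeAlignment("p3", "g1", 1, "+", True),
        ProbeAlignment("p4", "g1", 2, "+", True),  # two perfect matches: dropped
    ]
    print(build_custom_probe_sets(demo))
```

Grouping by gene before applying the size threshold means that every surviving probe set corresponds to exactly one gene, which mirrors the gene-level summaries described in the text.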