2.4 Genetic and epigenetic alterations in cancer
2.6.2 Exon microarray normalisation
The raw microarray signals need to be pre-processed to correct for the effects and biases that arise during the experimental procedures. Several algorithms are available for normalising exon microarrays, such as Robust multiarray analysis (RMA) [180, 181], Frozen robust multiarray analysis (fRMA) [182] and PLIER (Probe Logarithmic Intensity Error, proposed by Affymetrix). However, the algorithms most commonly used in practice are RMA and its slightly modified versions, such as fRMA. PLIER has been reported to be technically biased and numerically unstable [183], and is rarely used.
Table 2.3 Exon microarrays probeset classification by confidence level [179].
Evidence level    Description
Core    Refers to probesets that are supported by the most reliable evidence from RefSeq and full-length mRNA GenBank records containing complete CDS information.
Extended    Refers to probesets that are supported by other cDNA evidence beyond what is used to support core probesets. Extended evidence comes from other GenBank mRNAs not annotated as full-length, EST sequences, ENSEMBL gene collections, syntenically mapped mRNA from Mouse, Rat, or Human, mitoMap mitochondrial genes, microRNA registry genes, vegaGene, and vegaPseudoGene records.
Full    Refers to probesets that are supported by computational gene prediction evidence only. They are supported by gene and exon prediction algorithms including GeneID, GenScan, GenScanSubOptimal, exoniphy, RNAGene, sgpGene and Twinscan.
Free    Refers to probesets that are supported by annotations which were merged such that no single annotation (or evidence) contains the probeset.
Ambiguous    Refers to probesets that cannot be unambiguously assigned to a particular transcript cluster.
2.6.2.1 Robust multiarray analysis (RMA)
The RMA algorithm proposed by Irizarry et al. [180] is one of the most commonly used normalisation methods for exon microarrays. The main advantage of RMA is that it uses only perfect match probes. RMA normalisation consists of three steps: background correction, quantile normalisation and summarisation.
The first step of RMA is background correction. The purpose of background correction is to adjust for non-specific binding, i.e. the hybridisation of sequences that are not complementary to the microarray probes. The model assumes that the observed probe intensities are a combination of the true signal and background noise. More specifically, as presented in Bolstad [181]:
\[ S = X + Y, \tag{2.1} \]
where $S$ is the observed signal intensity of a probe, $X$ is the true signal (assumed to follow an exponential distribution) and $Y$ is the background noise, normally distributed and truncated at 0 to avoid negative values. Under this model the background-corrected intensity is the conditional expectation of the true signal given the observed intensity, $E[X \mid S = s]$.
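As an illustration of the background model in Equation 2.1, the following minimal sketch evaluates $E[X \mid S = s]$ using the standard closed form for an exponential signal plus truncated normal noise. It assumes the model parameters are already known: `alpha` (rate of the exponential signal), `mu` and `sigma` (mean and standard deviation of the normal noise before truncation). These parameter names and the function name `rma_background_correct` are illustrative choices, not taken from [181], and the estimation of the parameters from the data is not shown.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution (mean 0, sd 1)


def rma_background_correct(s, alpha, mu, sigma):
    """Return E[X | S = s] for the model S = X + Y.

    Assumptions of this sketch (hypothetical parameter names):
      X ~ Exponential(rate=alpha)              true signal
      Y ~ Normal(mu, sigma**2), with Y >= 0    background noise
    """
    a = s - mu - sigma ** 2 * alpha
    b = sigma
    phi, Phi = _STD_NORMAL.pdf, _STD_NORMAL.cdf
    # Mean of a normal distribution with mean a and sd b, truncated to [0, s].
    numerator = phi(a / b) - phi((s - a) / b)
    denominator = Phi(a / b) + Phi((s - a) / b) - 1.0
    return a + b * numerator / denominator


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for observed in (40.0, 100.0, 500.0):
        corrected = rma_background_correct(observed, alpha=0.01, mu=50.0, sigma=10.0)
        print(observed, round(corrected, 3))
```

The corrected value is always positive, which is the practical reason for using a conditional expectation rather than simply subtracting an estimate of the background. For observed intensities far below `mu` the denominator can underflow to zero in floating point, so a production implementation would need to guard against that case.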
Next, the probe intensities are quantile normalised. Quantile normalisation [184] is a method designed to make the distribution of probe intensities identical across all microarrays. This is achieved by transforming the intensities so that the corresponding quantiles are equal across all microarrays.
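The idea of quantile normalisation can be sketched in a few lines. The example below is a simplified illustration, assuming each microarray is given as a list of intensities of equal length and that ties are broken by sort order rather than averaged; the function name `quantile_normalise` is a hypothetical choice. The reference distribution is taken as the mean of the sorted intensities at each rank, and each intensity is replaced by the reference value of its rank.

```python
def quantile_normalise(arrays):
    """Quantile normalise a list of arrays (each a list of probe intensities).

    Returns the normalised arrays and the reference distribution.
    Simplification: ties are broken by sort order, not averaged.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")

    # Reference distribution: mean of the k-th smallest value over all arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(values) / len(arrays) for values in zip(*sorted_arrays)]

    normalised = []
    for intensities in arrays:
        # Probe positions ordered from lowest to highest intensity.
        order = sorted(range(n_probes), key=lambda p: intensities[p])
        out = [0.0] * n_probes
        for rank, position in enumerate(order):
            out[position] = reference[rank]
        normalised.append(out)
    return normalised, reference


if __name__ == "__main__":
    arrays = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
    normalised, reference = quantile_normalise(arrays)
    print(reference)   # [1.5, 3.5, 5.5]
    print(normalised)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After normalisation every array contains exactly the same set of values, only in a different order, so all quantiles agree across arrays.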
The third step of RMA normalisation is the summarisation of the intensities of the probes within a probeset in order to obtain a single value, the estimated expression level of the probeset. Li and Wong [185] observed that the variation of the intensities of probes from the same probeset can be very large, due to probe-specific effects (or affinities). Sometimes the variation due to probe-specific effects was larger than the variance across microarrays [185]. Fortunately, these probe-specific effects are reproducible and predictable, and can be reliably accounted for. RMA uses the following linear additive model to account for probe affinities when estimating the probeset expression:
\[ Y_{ijn} = \mu_{in} + \alpha_{jn} + e_{ijn}, \quad i = 1, \dots, I,\; j = 1, \dots, J,\; n = 1, \dots, N, \tag{2.2} \]
where $i$ is the index of the microarray, $j$ is the probe index within the probeset, and $n$ is the probeset index on the microarray. $Y_{ijn}$ is the $\log_2$ background-adjusted and quantile-normalised expression level of probe $j$ from probeset $n$ on array $i$, $\mu_{in}$ is the $\log_2$ expression level of probeset $n$ on array $i$, $\alpha_{jn}$ is the probe affinity of probe $j$ from probeset $n$, and $e_{ijn}$ is an independent, identically distributed error term with mean 0 [180].
The parameters of the above model are estimated using the median polish algorithm [186], which is robust to outliers. In the end we are interested in the value of $\mu_{in}$, which represents the probeset expression level after correcting for probe affinities.
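To make the summarisation step more concrete, the sketch below applies a basic median polish to the intensities of a single probeset. It is an illustration under simplifying assumptions, not the reference RMA implementation: the input `y` is a matrix with one row per array and one column per probe, already on the $\log_2$ scale, and the function name, the iteration limit `max_iter` and the tolerance `tol` are hypothetical choices.

```python
from statistics import median


def median_polish_summarise(y, max_iter=10, tol=1e-6):
    """Fit y[i][j] ~ overall + array_eff[i] + probe_eff[j] by median polish.

    y: list of rows, one row per array, one column per probe (log2 scale).
    Returns (expression per array, probe effects, residuals).
    """
    n_arrays, n_probes = len(y), len(y[0])
    resid = [list(row) for row in y]
    overall = 0.0
    array_eff = [0.0] * n_arrays
    probe_eff = [0.0] * n_probes

    for _ in range(max_iter):
        # Sweep out the median of each row (array).
        row_med = [median(row) for row in resid]
        for i in range(n_arrays):
            array_eff[i] += row_med[i]
            resid[i] = [r - row_med[i] for r in resid[i]]
        shift = median(probe_eff)
        overall += shift
        probe_eff = [p - shift for p in probe_eff]

        # Sweep out the median of each column (probe).
        col_med = [median(resid[i][j] for i in range(n_arrays))
                   for j in range(n_probes)]
        for j in range(n_probes):
            probe_eff[j] += col_med[j]
            for i in range(n_arrays):
                resid[i][j] -= col_med[j]
        shift = median(array_eff)
        overall += shift
        array_eff = [a - shift for a in array_eff]

        # Stop once a full sweep no longer changes anything noticeably.
        if max(abs(m) for m in row_med + col_med) < tol:
            break

    expression = [overall + a for a in array_eff]  # estimate of mu_in
    return expression, probe_eff, resid


if __name__ == "__main__":
    # Three arrays, three probes; the value 20.0 is an artificial outlier.
    y = [[7.0, 8.0, 20.0],
         [9.0, 10.0, 11.0],
         [11.0, 12.0, 13.0]]
    expression, probe_eff, resid = median_polish_summarise(y)
    print(expression)  # [8.0, 10.0, 12.0]
    print(probe_eff)   # [-1.0, 0.0, 1.0]
```

In this toy example the outlying probe value ends up in the residuals and leaves the expression estimate of the first array unchanged, which illustrates why a median-based fit is preferred over least squares here.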
2.6.2.2 Frozen robust multiarray analysis (fRMA)
Frozen robust multiarray analysis (fRMA) [182] is an extension of the RMA algorithm. The main difference is that the reference distribution used in quantile normalisation, the probe effects and the error variances required by RMA are no longer estimated from the set of microarrays being analysed, but have been precomputed from a large number of microarrays available in public databases and then frozen. This allows fRMA to process single arrays or small batches separately and still obtain comparable expression estimates.
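The use of a frozen reference distribution can be illustrated with a short sketch. It assumes that the reference has the same number of values as the array has probes and that it was computed beforehand from a large training set; the function name `normalise_to_frozen_reference` and the toy numbers are hypothetical. Because the reference is fixed, a single array can be normalised without access to any other array.

```python
def normalise_to_frozen_reference(intensities, frozen_reference):
    """Map one array onto a precomputed (frozen) reference distribution.

    intensities: probe intensities of a single array.
    frozen_reference: reference values, one per probe rank (assumed given).
    """
    if len(intensities) != len(frozen_reference):
        raise ValueError("reference must have one value per probe")

    reference = sorted(frozen_reference)
    # Probe positions ordered from lowest to highest intensity.
    order = sorted(range(len(intensities)), key=lambda p: intensities[p])
    normalised = [0.0] * len(intensities)
    for rank, position in enumerate(order):
        normalised[position] = reference[rank]
    return normalised


if __name__ == "__main__":
    print(normalise_to_frozen_reference([30.0, 10.0, 20.0], [1.0, 2.0, 3.0]))
    # [3.0, 1.0, 2.0]
```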
More specifically, the background correction for fRMA is the same as for RMA, since background correction is a single-array procedure. For quantile normalisation, the probe intensities of the single array or batch are forced to the frozen reference distribution. Intuitively, one would expect the probe effects to be constant across studies. However, McCall et al. [182] found that the probe-specific effects varied between samples from different batches. McCall et al. [182] also noted that the variance of the error $e_{ijn}$ (see Equation 2.2) within batches is not constant. Therefore, the probe-level summarisation model was extended to account for batch effects and to allow the error variability to depend on the batch as well. The updated summarisation model is:
\[ Y_{ijkn} = \mu_{in} + \alpha_{jn} + \gamma_{jkn} + e_{ijkn}, \quad i = 1, \dots, I,\; j = 1, \dots, J,\; n = 1, \dots, N,\; k = 1, \dots, K, \tag{2.3} \]
where the index $k$ has been added to the notation to represent the batch. Compared to Model 2.2, this model has a new term, $\gamma_{jkn}$, that accounts for batch-specific variability, and the error term now depends on the batch as well.
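A rough sketch of how frozen parameters could be used to summarise one probeset on a single array is given below. It is a deliberate simplification and should not be read as the procedure of [182]: it assumes that the frozen probe effects and one combined variance per probe (batch-effect variance plus residual variance) are already available, and it replaces the robust estimation step with a plain inverse-variance weighted mean. All names (`frma_summarise_single_array`, `frozen_probe_effects`, `frozen_variances`) are hypothetical.

```python
def frma_summarise_single_array(log_intensities, frozen_probe_effects,
                                frozen_variances):
    """Estimate the log2 expression of one probeset on one array.

    log_intensities: background-corrected, normalised log2 probe values.
    frozen_probe_effects: precomputed probe affinities (assumed given).
    frozen_variances: precomputed combined variance per probe (assumed given).
    Simplification: inverse-variance weighted mean, no robust down-weighting.
    """
    if not (len(log_intensities) == len(frozen_probe_effects)
            == len(frozen_variances)):
        raise ValueError("inputs must have one entry per probe")

    # Remove the frozen probe effect from each probe value.
    corrected = [y - a for y, a in zip(log_intensities, frozen_probe_effects)]
    # Noisier probes receive less weight.
    weights = [1.0 / v for v in frozen_variances]
    return sum(w * c for w, c in zip(weights, corrected)) / sum(weights)


if __name__ == "__main__":
    # Hypothetical values for a probeset with three probes.
    estimate = frma_summarise_single_array(
        log_intensities=[9.0, 10.0, 11.5],
        frozen_probe_effects=[-1.0, 0.0, 1.0],
        frozen_variances=[1.0, 1.0, 4.0],
    )
    print(round(estimate, 4))
```

The design point this illustrates is that, once the probe effects and variances are frozen, no other arrays are needed to obtain an expression estimate for a new sample.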
The performance of fRMA is comparable to that of RMA. When the data were processed together, RMA slightly outperformed fRMA, whereas when the data were processed in separate batches, fRMA slightly outperformed RMA [182].