Normalization - Analysis methods - Sohler, Florian (2006): Contextual Analysis of Gene E

2.3 Analysis methods

2.3.2 Normalization

Background correction

The image analysis software usually returns foreground and background values for each spot. It is assumed that the foreground consists of two components: the signal from the bound cDNA molecules, and noise from natural fluorescence, unspecific binding and the imaging process. The noise can be estimated from the background, where no DNA was spotted. The simplest and most commonly used method for background correction is subtraction of the background intensities from the foreground intensities. Unfortunately, this increases the variance (foreground and background measurements both have variance, and if the two are independent, the variance of their difference is the sum of both) and produces many negative intensity values that cannot be handled by some subsequent methods. Therefore, other methods try to minimize the number of negative values by using more sophisticated background corrections. Kooperberg et al. (2002), for instance, use a Bayesian method, while Edwards (2003) uses background subtraction if the difference between foreground and background intensities is large, and a smooth, monotonic and positive function when the difference is small or negative.

Alternatively, the background correction step can simply be skipped, which can of course lead to biased ratios in cDNA arrays if the background differs between the two channels.
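The two drawbacks of plain background subtraction can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the intensities are synthetic, the noise is Gaussian and independent between foreground and background, and all numbers (signal level, noise level, number of spots) are hypothetical.

```python
import random
import statistics

def subtract_background(foreground, background):
    """Plain background correction: foreground minus background, spot by spot."""
    return [f - b for f, b in zip(foreground, background)]

random.seed(0)
n_spots = 5000
true_signal = 20.0      # hypothetical weak signal on top of the background
background_level = 100.0
noise_sd = 15.0         # hypothetical measurement noise, same in both readings

fg = [true_signal + background_level + random.gauss(0, noise_sd) for _ in range(n_spots)]
bg = [background_level + random.gauss(0, noise_sd) for _ in range(n_spots)]
corrected = subtract_background(fg, bg)

var_fg = statistics.variance(fg)
var_bg = statistics.variance(bg)
var_corrected = statistics.variance(corrected)   # close to var_fg + var_bg
n_negative = sum(1 for c in corrected if c < 0)  # spots with negative intensity

print(f"variance foreground: {var_fg:.1f}, background: {var_bg:.1f}, "
      f"corrected: {var_corrected:.1f}")
print(f"negative corrected intensities: {n_negative} of {n_spots}")
```

For weakly expressed spots such as these, the corrected values have roughly twice the variance of either measurement, and a noticeable share of them is negative.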

Lo(w)ess normalization

The lowess or loess method can be employed to remove intensity-dependent biases from two-channel microarray data. The name is derived from the term “locally weighted scatterplot smoothing”, as the method uses locally weighted regression to smooth data. A short introduction to smoothing that covers some additional approaches besides local regression can be found in Gentle et al. (2004). The lowess method was proposed by Cleveland (1979) and can be used for many different kinds of data that come in the form of one or more predictor variables and a response variable. It is based on the following principles:

• For each data point (the focal point), a polynomial fit on the local neighborhood is computed.

• The neighboring points are weighted according to their distance from the focal point for the fit.

• The value of the fitted polynomial is computed for the focal point as the “smoothed” value.

• Points with high residuals are iteratively down-weighted, as outliers should not affect the fit to a large degree.

The points used for the local fit can be specified either by an absolute distance to the focal point or by the fraction of points that should be used. In the first case, the number of points can vary with the focal point; in the latter case, a fixed number of nearest neighbors is used.

The points are weighted with a tri-cubic function:

$$w_i = \left(1 - \left(\frac{d_i}{\mathit{maxdist}}\right)^3\right)^3,$$

where $d_i$ denotes the distance of the $i$-th point from the focal point and $\mathit{maxdist}$ is the maximal allowed distance. Next, the polynomial is fitted using a least-squares approach. With that approach, outliers can have a large influence on the fit. In order to achieve a more robust fit, a re-weighting procedure was suggested, down-weighting points with large residuals. With these additional weights, the fit is re-computed and the procedure is iterated a specified number of times. The effect of down-weighting and choice of polynomial degree is demonstrated in Figure 2.3.
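The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not Cleveland's reference implementation. It assumes a local polynomial of degree 1, selects the neighborhood by a fraction of points (`frac`), and uses bisquare robustness weights with a cut-off of six times the median absolute residual. The function names and the toy data are hypothetical.

```python
import statistics

def tricube(d, maxdist):
    """Tri-cubic weight (1 - (d/maxdist)^3)^3; zero at or beyond maxdist."""
    if maxdist <= 0.0:
        return 1.0 if d == 0.0 else 0.0
    if d >= maxdist:
        return 0.0
    return (1.0 - (d / maxdist) ** 3) ** 3

def weighted_line_at(xs, ys, ws, x0):
    """Fit a weighted least-squares line and return its value at x0."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    if sxx == 0.0:  # only one distinct x carries weight
        return my
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    return my + sxy / sxx * (x0 - mx)

def lowess(xs, ys, frac=0.5, iterations=3):
    """Local linear lowess with `iterations` robust re-weighting passes."""
    n = len(xs)
    k = max(2, min(n, round(frac * n)))  # number of neighbors per local fit
    robust = [1.0] * n                   # robustness weights, all 1 at first
    fitted = list(ys)
    for _ in range(iterations + 1):
        for i, x0 in enumerate(xs):
            dists = [abs(x - x0) for x in xs]
            maxdist = sorted(dists)[k - 1]  # distance to the k-th nearest point
            local = [tricube(d, maxdist) for d in dists]
            ws = [l * r for l, r in zip(local, robust)]
            if sum(ws) == 0.0:  # every neighbor was down-weighted to zero
                ws = local
            fitted[i] = weighted_line_at(xs, ys, ws, x0)
        resid = [y - f for y, f in zip(ys, fitted)]
        s = statistics.median(abs(r) for r in resid)
        if s == 0.0:            # perfect fit, nothing to down-weight
            break
        # Bisquare weights: points with |residual| >= 6s get weight zero.
        robust = [(1 - (r / (6 * s)) ** 2) ** 2 if abs(r) < 6 * s else 0.0
                  for r in resid]
    return fitted

# Toy data: a noisy line y = 2x + 1 with one gross outlier at x = 10.
xs = list(range(20))
noise = [-0.6, 0.0, 0.6, -0.3, 0.3]
ys = [2.0 * x + 1.0 + noise[x % 5] for x in xs]
ys[10] += 30.0

plain = lowess(xs, ys, iterations=0)       # least-squares fit only
robust_fit = lowess(xs, ys, iterations=3)  # with iterative down-weighting
print(f"true value at x=10: 21.0, least-squares: {plain[10]:.2f}, "
      f"robust: {robust_fit[10]:.2f}")
```

With `iterations=0` the outlier pulls the smoothed value at its own position upwards, while the re-weighted fit stays close to the underlying line.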

The use of lowess for the normalization of microarray data was proposed by Yang et al. (2002).
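How such a smoother removes an intensity-dependent bias can be sketched as follows. The sketch assumes the usual MA representation of two-channel data, with M = log2(R/G) as the log-ratio and A = ½ log2(R·G) as the average log-intensity. A trend of M over A is estimated and subtracted. For brevity the smoother here is a simplified, non-robust local linear fit with tri-cubic weights rather than a full lowess, and the function names and the synthetic red and green intensities are hypothetical.

```python
import math

def tricube(d, maxdist):
    """Tri-cubic weight; zero at or beyond maxdist."""
    return (1.0 - (d / maxdist) ** 3) ** 3 if 0 <= d < maxdist else 0.0

def local_linear(xs, ys, frac=0.4):
    """Non-robust lowess-style smoother (assumes distinct x values)."""
    n = len(xs)
    k = max(3, min(n, round(frac * n)))
    fitted = []
    for x0 in xs:
        dists = [abs(x - x0) for x in xs]
        maxdist = sorted(dists)[k - 1]
        ws = [tricube(d, maxdist) for d in dists]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        fitted.append(my + (sxy / sxx) * (x0 - mx) if sxx > 0 else my)
    return fitted

def ma_normalize(red, green, frac=0.4):
    """Return A, raw M, and M with the intensity-dependent trend removed."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    trend = local_linear(a, m, frac)
    return a, m, [mi - ti for mi, ti in zip(m, trend)]

# Synthetic spots: a dye bias that falls linearly with intensity, plus noise.
n = 60
a_true = [6.0 + 8.0 * i / (n - 1) for i in range(n)]
m_true = [0.5 - 0.1 * (a - 6.0) + 0.05 * (-1) ** i for i, a in enumerate(a_true)]
red = [2.0 ** (a + m / 2.0) for a, m in zip(a_true, m_true)]
green = [2.0 ** (a - m / 2.0) for a, m in zip(a_true, m_true)]

a_vals, m_raw, m_norm = ma_normalize(red, green)
print(f"mean |M| before: {sum(abs(v) for v in m_raw) / n:.3f}, "
      f"after: {sum(abs(v) for v in m_norm) / n:.3f}")
```

After subtraction, the log-ratios scatter around zero over the whole intensity range, which is the intended effect of the normalization.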

Figure 2.3: Robust and least-squares lowess fits with polynomial degrees 1 and 2 (axes: x, y; curves: Robust, deg=2; Least-squares, deg=2; Robust, deg=1; Least-squares, deg=1). The least-squares fit without iterative down-weighting is clearly much more vulnerable to outliers.

Oligonucleotide array normalization

For the normalization of oligonucleotide arrays like the Affymetrix GeneChips, it is necessary to estimate the expression level for the probe sets from the measurements of the probe cell pairs. Rajagopalan (2003) compares three different statistical methods to estimate these levels and the expression ratios between different experiments. Among those methods is the MAS5 algorithm that is implemented in the Affymetrix software. The most basic feature of these algorithms is the present-call for each probe set, giving an estimate of whether the corresponding RNA was present in the sample or not. In MAS5, this present-call is based on a p-value computed by a Wilcoxon signed-rank test on the differences between the perfect match (PM) and mismatch (MM) cells.
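The idea of a rank-based present-call can be illustrated with a small example. The sketch below is not the MAS5 implementation: it simply applies an exact one-sided Wilcoxon signed-rank test to the raw PM − MM differences of one probe set and compares the p-value with a cut-off. The cut-off `alpha`, the function names and the probe intensities are hypothetical, and refinements of the actual Affymetrix algorithm are not reproduced.

```python
def signed_rank_pvalue(diffs):
    """Exact one-sided p-value for H1: the differences tend to be positive."""
    d = [x for x in diffs if x != 0]  # zero differences carry no sign
    n = len(d)
    if n == 0:
        return 1.0
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks2 = [0] * n  # twice the mid-rank, so tied ranks stay integers
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        for t in range(i, j + 1):
            ranks2[order[t]] = (i + 1) + (j + 1)  # 2 * mean of ranks i+1..j+1
        i = j + 1
    w_obs = sum(r for r, x in zip(ranks2, d) if x > 0)
    # counts[s]: number of sign assignments whose positive ranks sum to s.
    total = sum(ranks2)
    counts = [0] * (total + 1)
    counts[0] = 1
    for r in ranks2:
        for s in range(total, r - 1, -1):
            counts[s] += counts[s - r]
    return sum(counts[w_obs:]) / 2 ** n

def present_call(pm, mm, alpha=0.05):
    """Call a probe set 'present' if PM is significantly larger than MM."""
    p = signed_rank_pvalue([p_i - m_i for p_i, m_i in zip(pm, mm)])
    return ("present" if p < alpha else "absent"), p

# Hypothetical probe set with 11 probe pairs, PM clearly above MM.
pm_expressed = [300.0 + 20.0 * i for i in range(11)]
mm_expressed = [100.0 + 7.0 * i for i in range(11)]
call_expressed, p_expressed = present_call(pm_expressed, mm_expressed)

# Hypothetical probe set where PM and MM differ only by alternating noise.
mm_silent = [100.0] * 11
pm_silent = [100.0 + (i + 1) * (-1) ** i for i in range(11)]
call_silent, p_silent = present_call(pm_silent, mm_silent)

print(call_expressed, p_expressed)
print(call_silent, p_silent)
```

Because the test only uses the signs and ranks of the differences, a single very bright probe pair cannot dominate the call.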

Between-array normalization

Variations between arrays are usually not only due to the different conditions under study, but also due to differences in sample preparation, hybridization, and chip manufacturing. Between-array normalization is necessary in order to eliminate these effects as far as possible. In principle, the same methods can be used for oligonucleotide and cDNA arrays, but usually between-array normalization for cDNA arrays tries to retain the distribution of expression values for each array as far as possible and adjusts only a few summary values, such as the median and the deviation, using for instance a scale normalization (Yang et al., 2002). The reason for this difference is that many biases are already removed during within-array normalization.
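A scale normalization of this kind can be sketched as follows. The example assumes that the log-ratios of each array are centered at their median and rescaled so that all arrays share the same median absolute deviation (MAD), here the geometric mean of the individual MADs. This is one plausible reading of the approach, not necessarily the exact procedure of Yang et al. (2002). The function names and the log-ratios are hypothetical.

```python
import math
import statistics

def mad(values):
    """Median absolute deviation from the median."""
    med = statistics.median(values)
    return statistics.median(abs(v - med) for v in values)

def scale_normalize(arrays):
    """Center each array at median 0 and give all arrays a common MAD.

    Assumes every array has a MAD greater than zero.
    """
    mads = [mad(a) for a in arrays]
    # Common target spread: geometric mean of the per-array MADs.
    target = math.exp(sum(math.log(m) for m in mads) / len(mads))
    normalized = []
    for a, m in zip(arrays, mads):
        med = statistics.median(a)
        normalized.append([(v - med) * target / m for v in a])
    return normalized

# Two hypothetical arrays of log-ratios with different location and spread.
arrays = [
    [-1.0, -0.5, 0.0, 0.5, 1.0],
    [-1.8, -0.8, 0.2, 1.2, 2.2],
]
normalized = scale_normalize(arrays)
for before, after in zip(arrays, normalized):
    print(f"MAD before: {mad(before):.3f}, after: {mad(after):.3f}, "
          f"median after: {statistics.median(after):.3f}")
```

Only location and spread change; the shape of each array's distribution and the ordering of its values are retained.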

For oligonucleotide arrays, it has to be decided whether the normalization is carried out on the cell intensities or on the summarized probe set values. The simplest between-array normalization is a global scaling to a common target mean value. This global normalization on the probe set values is the standard procedure in the Affymetrix software.
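Global scaling amounts to one multiplication per array, as the following sketch shows. The target value of 500, the function name and the probe set values are hypothetical, and the plain arithmetic mean is used here; any trimming or other refinements of the Affymetrix software are not reproduced.

```python
def scale_to_target(values, target=500.0):
    """Rescale one array so that its mean equals `target`."""
    mean = sum(values) / len(values)
    return [v * target / mean for v in values]

# Hypothetical probe set values of two chips with different overall brightness.
chips = {
    "chip_a": [120.0, 480.0, 900.0, 60.0],
    "chip_b": [260.0, 1010.0, 1750.0, 140.0],
}
scaled = {name: scale_to_target(vals) for name, vals in chips.items()}

for name, vals in scaled.items():
    print(name, [round(v, 1) for v in vals], "mean:", sum(vals) / len(vals))
```

Since every value of an array is multiplied by the same factor, ratios between probe sets within an array are unchanged; only the overall level is aligned across arrays.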

Bolstad et al. (2003) compare different between-array normalization methods for oligonucleotide arrays. They suggest several new methods, some of which are based on an MA-plot for pairs of arrays, i.e. treating two different arrays like the two channels of a cDNA array. One of these methods is the cyclic lowess normalization, which computes a lowess fit for all pairwise comparisons, adjusts the expression values according to a combination of the resulting lowess curves and iteratively repeats the process until the changes are very small.

In general, between-array normalization methods adjust the data such that some statistics become similar or equal for all arrays. As shown by Fundel et al. (2005), the effects of these normalizations can be drastic and should be checked, either manually or with automatic methods that assess the stability of the final results, for example by sub-sampling from the available replicates.

In document Sohler, Florian (2006): Contextual Analysis of Gene Expression Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 41-44)