Robust Multi-Array Average - Statistical analysis of genotype and gene expression data

Since MMs not only measure non-specific binding and background noise, but also contain information about the abundance of the gene that the PMs are intended to probe, they should actually be considered in preprocessing methods. Irizarry et al. (2003), however, decided not to include MMs in their preprocessing procedure called RMA (Robust Multi-array Analysis), since at the time of developing RMA they did not know how to extract this information.

Motivated by the observation that the density of the observed PM intensities typically looks like the ones displayed in Figure 3.3, Irizarry et al. (2003) model the observed PM values of each sample $j$, $j = 1, \ldots, n$, as a sum of the specific signals $S \sim \mathrm{Exp}(\gamma_j)$ and the background noise $N \sim N(\nu_j, \sigma_j^2)$, where $S$ and $N$ are assumed to be independent, and $N$ is truncated at zero to avoid negative background corrected PM intensities.

FIGURE 3.3. Density of the PM intensities of four of the 38 Affymetrix HG-U133 Plus 2 chips described in Appendix A.1.

3.3 RMA 27

Starting from this model, each PM intensity is background corrected by setting it to the expected signal $E\bigl(S \mid O = PM_{hj}^{(i)}\bigr)$, where

$$E(S \mid O = o) = a_j + \sigma_j \, \frac{\phi\bigl(a_j/\sigma_j\bigr) - \phi\bigl((o - a_j)/\sigma_j\bigr)}{\Phi\bigl(a_j/\sigma_j\bigr) + \Phi\bigl((o - a_j)/\sigma_j\bigr) - 1} \qquad (3.2)$$

with $a_j = o - \nu_j - \sigma_j^2 \gamma_j$, and $\phi$ and $\Phi$ being the density and the distribution function, respectively, of the standard normal distribution (for a derivation of (3.2), see Bolstad, 2004).

In the actual implementation of RMA in the R function rma, $\Phi\bigl((o - a_j)/\sigma_j\bigr) - 1$ and $\phi\bigl((o - a_j)/\sigma_j\bigr)$ are omitted, since following Bolstad (2004) the latter value is negligible and $\Phi\bigl((o - a_j)/\sigma_j\bigr) \approx 1$ in most microarray experiments. The parameters $\nu_j$, $\gamma_j$ and $\sigma_j$, $j = 1, \ldots, n$, are estimated by ad-hoc approaches: For each sample $j$, $\nu_j$ is estimated by the mode of the density of the PM intensities, $\sigma_j$ is determined by the variability in the probe intensities less than $\hat{\nu}_j$, and $\gamma_j$ is estimated by the reciprocal of the mode of the density of the strictly positive $PM_{hj}^{(i)} - \hat{\nu}_j$ values.
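The background correction in (3.2) can be illustrated by a minimal Python sketch. It is not the code of the R function rma: the function names `std_normal_pdf`, `std_normal_cdf` and `expected_signal`, the `simplified` flag, and the example parameter values are assumptions made for illustration, and the parameters are taken as given rather than estimated by the ad-hoc approaches described above.

```python
import math


def std_normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(x):
    """Distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def expected_signal(o, nu, sigma, gamma, simplified=False):
    """Expected signal E(S | O = o) for one observed PM intensity o.

    nu, sigma: mean and standard deviation of the background noise.
    gamma: rate of the exponentially distributed signal.
    simplified=True drops phi((o - a) / sigma) and Phi((o - a) / sigma) - 1.
    """
    a = o - nu - sigma ** 2 * gamma
    numerator = std_normal_pdf(a / sigma)
    denominator = std_normal_cdf(a / sigma)
    if not simplified:
        numerator -= std_normal_pdf((o - a) / sigma)
        denominator += std_normal_cdf((o - a) / sigma) - 1.0
    return a + sigma * numerator / denominator


# Hypothetical parameter values, chosen only to exercise the function.
full_value = expected_signal(90.0, nu=100.0, sigma=20.0, gamma=0.01)
short_value = expected_signal(90.0, nu=100.0, sigma=20.0, gamma=0.01,
                              simplified=True)
print(full_value, short_value)
```

For these hypothetical values, $(o - a_j)/\sigma_j = 5.2$, so the omitted terms change the result only marginally, which is in line with the argument attributed to Bolstad (2004).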

In a comparison of several normalization methods, Bolstad et al. (2003) identify quantile normalization, described in Algorithm 3.3, as the approach that shows the best performance in terms of variance and bias reduction. Furthermore, the run time of quantile normalization is extremely short in comparison to the run times of other complete data methods, i.e. approaches that combine information from all arrays for normalization, which otherwise work almost as well as quantile normalization. The MA plots (see Appendix D.1) in Figure 3.4 reveal another important advantage of quantile normalization: contrary to scaling, see (3.1), it can effectively combat non-linearities between arrays that are frequently observed in microarray experiments.


FIGURE 3.4. Scaling vs. Quantile Normalization. For a subset of the (log₂-transformed) probe intensities of two of the 38 HG-U133 Plus 2 chips (see Appendix A.1), MA plots before normalization (left) and after normalization using scaling (middle) and quantile normalization (right) are shown. The solid lines are loess curves fitted through the data points. (Source: Schwender and Belousov, 2006)

The normalization step of RMA is therefore carried out by constructing a $\bigl(\sum_{i=1}^{m} H_i\bigr) \times n$ matrix in which each row corresponds to one of the $n_{PM} = \sum_{i=1}^{m} H_i$ PMs, and each column to one of the $n$ arrays, and by applying Algorithm 3.3 to this matrix.

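Algorithm 3.3 (Quantile Normalization) Let $Z$ be a $K \times n$ matrix.

1. Construct a $K \times n$ matrix $Z^{sort}$ with elements $z_{kj}^{sort} = z_{(k)j}$, $k = 1, \ldots, K$, $j = 1, \ldots, n$, where $z_{(k)j}$ is the $k$th smallest value in column $j$ of $Z$.

2. Construct a $K \times n$ matrix $Z^{rank}$ with entries $z_{kj}^{rank} = \Bigl\lfloor \frac{1}{|T_{kj}|} \sum_{\ell \in T_{kj}} \ell \Bigr\rfloor$ with $T_{kj} = \bigl\{ \ell : z_{\ell j}^{sort} = z_{kj} \bigr\}$.

3. Set $q = n^{-1} Z^{sort} 1_n$, where $1_n$ is a vector of length $n$ containing only ones.

4. Normalize the columns of $Z$ by setting $Z$ to $Z^{new}$ with elements $z_{kj}^{new} = q_{z_{kj}^{rank}}$.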
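The four steps of Algorithm 3.3 can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used by rma: the function name `quantile_normalize` and the small example matrix are assumptions, and tied values are handled by rounding their average position down to an integer, which is an assumed reading of Step 2.

```python
import math


def quantile_normalize(Z):
    """Quantile normalize a K x n matrix given as a list of K rows."""
    K, n = len(Z), len(Z[0])

    # Step 1: sort each column separately (columns of Z^sort).
    sorted_cols = [sorted(Z[k][j] for k in range(K)) for j in range(n)]

    # Step 3: q_k is the mean of the k-th smallest values over all columns.
    q = [sum(sorted_cols[j][k] for j in range(n)) / n for k in range(K)]

    Z_new = [[0.0] * n for _ in range(K)]
    for j in range(n):
        # Step 2: collect the 1-based positions of each value in the
        # sorted column, so that ties share one averaged position.
        positions = {}
        for pos, value in enumerate(sorted_cols[j], start=1):
            positions.setdefault(value, []).append(pos)
        for k in range(K):
            T = positions[Z[k][j]]
            rank = math.floor(sum(T) / len(T))
            # Step 4: replace the value by the mean quantile of its rank.
            Z_new[k][j] = q[rank - 1]
    return Z_new


# Hypothetical 4 x 3 example (4 probes, 3 arrays).
example = [[5.0, 4.0, 3.0],
           [2.0, 1.0, 4.0],
           [3.0, 4.0, 6.0],
           [4.0, 2.0, 8.0]]
for row in quantile_normalize(example):
    print(row)
```

After normalization, every column without ties contains exactly the values in $q$, so all arrays share the same empirical distribution; only the assignment of these values to the probes differs between arrays.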


FIGURE 3.5. Profiles of the probe sets with Affymetrix-ID 242059_at (left panel) and 1563090_at (right panel) for the 18 colorectal (red) and the 20 breast (blue) cancer samples (see Appendix A.1).

As exemplified by the profiles of the probe sets displayed in Figure 3.5, the variability of a single probe across several chips is typically smaller than the variation in a probe set from a single array. The summarization of the intensities of a probe set might thus benefit from considering all samples in one model. In fact, Bolstad (2004) shows that fitting such a multi-chip model outperforms methods that examine each array separately by, e.g., computing the expression value of each probe set by a (robust) mean over the corresponding intensities (as in Section 3.2). As the right panel of Figure 3.5 reveals, occasionally occurring outliers are a problem for the summarization step. Since tens of thousands of multi-chip models have to be fitted, a well-suited summarization method should therefore be able to deal with such outliers automatically. Assuming that the probe and the chip effects are multiplicative on the original scale, Irizarry et al. (2003) hence employ median polish (Tukey, 1977), described in Algorithm 3.4, to robustly fit a multi-chip model

$$\log_2 PM_{hj}^{(i)} = \mu^{(i)} + \alpha_h^{(i)} + \beta_j^{(i)} + \varepsilon_{hj}^{(i)}$$

for each probe set $i$, $i = 1, \ldots, m$, where $\mu^{(i)}$ is the intercept, $\alpha_h^{(i)}$ is the effect of probe $h$, $h = 1, \ldots, H_i$, and $\beta_j^{(i)}$ is the effect of chip $j$, $j = 1, \ldots, n$.


Algorithm 3.4 (Median Polish)

Let $y_{hj}$, $h = 1, \ldots, H$, $j = 1, \ldots, n$, be a set of observations, and $\tau$ be the tolerance for convergence.

1. Construct an $H \times n$ matrix $E$ with initial entries $e_{hj} = y_{hj}$, and set $s_{old} = 0$.

2. Compute the vector $r$ consisting of the row-wise medians of $E$, and sweep $E$ by setting its elements $e_{hj}$ to $e_{hj}^{new} = e_{hj} - r_h$.

3. Generate the vector $c$ consisting of the column-wise medians of $E$, and update $E$ by setting its entries $e_{hj}$ to $e_{hj}^{new} = e_{hj} - c_j$.

4. Set $s_{new} = \sum_{h,j} |e_{hj}|$. If $s_{new} > 0$ and $|s_{old} - s_{new}| \ge \tau s_{new}$, set $s_{old} = s_{new}$, and repeat Steps 2-4.

5. Let $\hat{Y}$ be an $H \times n$ matrix consisting of the elements $\hat{y}_{hj} = y_{hj} - e_{hj}$.

6. Estimate the parameters of the model $y_{hj} = \mu + \alpha_h + \beta_j + \varepsilon_{hj}$ by

(a) computing the medians $r_1$ and $c_1$ of the values in the first row and the first column of $\hat{Y}$, respectively,

(b) and setting $\hat{\alpha}_h = \hat{y}_{h1} - c_1$, $\hat{\beta}_j = \hat{y}_{1j} - r_1$, and $\hat{\mu} = \hat{y}_{11} - \hat{\alpha}_1 - \hat{\beta}_1$.

The expression value $x_{ij}$ of probe set $i$ and sample $j$ is then given by

$$x_{ij} = \hat{\mu}^{(i)} + \hat{\beta}_j^{(i)}.$$

Note that the RMA signals are already log₂-scaled, whereas the outcomes of MAS 5.0 are expression values on the original scale.
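The summarization of a single probe set by median polish can be sketched in Python along the lines of Algorithm 3.4. This is a minimal sketch under stated assumptions rather than the code used by rma: the function name `median_polish`, the default tolerance, the iteration cap `max_iter` (a safeguard that is not part of Algorithm 3.4), and the made-up log₂ intensities are all hypothetical.

```python
from statistics import median


def median_polish(Y, tol=0.01, max_iter=10):
    """Fit y_hj = mu + alpha_h + beta_j + eps_hj to an H x n list of rows."""
    H, n = len(Y), len(Y[0])
    E = [row[:] for row in Y]          # Step 1: residuals start as the data
    s_old = 0.0
    for _ in range(max_iter):          # max_iter is an assumed safety cap
        for h in range(H):             # Step 2: sweep out row medians
            r_h = median(E[h])
            E[h] = [e - r_h for e in E[h]]
        for j in range(n):             # Step 3: sweep out column medians
            c_j = median(E[h][j] for h in range(H))
            for h in range(H):
                E[h][j] -= c_j
        # Step 4: stop if the residual sum is zero or has stabilized.
        s_new = sum(abs(e) for row in E for e in row)
        if s_new == 0 or abs(s_old - s_new) < tol * s_new:
            break
        s_old = s_new
    # Step 5: fitted values.
    Y_hat = [[Y[h][j] - E[h][j] for j in range(n)] for h in range(H)]
    # Step 6: read the parameter estimates off the fitted values.
    r1 = median(Y_hat[0])
    c1 = median(Y_hat[h][0] for h in range(H))
    alpha = [Y_hat[h][0] - c1 for h in range(H)]
    beta = [Y_hat[0][j] - r1 for j in range(n)]
    mu = Y_hat[0][0] - alpha[0] - beta[0]
    return mu, alpha, beta, E


# Made-up log2 intensities of one probe set: 3 probes (rows), 3 arrays
# (columns), with one outlying value in the second probe on the third array.
log_pm = [[6.5, 7.0, 8.0],
          [7.5, 8.0, 19.0],
          [9.5, 10.0, 11.0]]
mu, alpha, beta, residuals = median_polish(log_pm)
expression = [mu + b for b in beta]    # one value per array for this probe set
print(expression)
```

In this made-up example the outlying intensity ends up entirely in the residual matrix and does not affect the estimated chip effects, which is the robustness property that motivates the use of median polish for summarization.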
