Modifications of RMA - Statistical analysis of genotype and gene expression data

Drawbacks of median polish are that this procedure, on the one hand, does not provide estimates for the standard errors and, on the other hand, is only applicable to a probe set $i$, $i = 1, \ldots, m$, if none of the probe intensities $PM_{hj}^{(i)}$, $h = 1, \ldots, H_i$, $j = 1, \ldots, n$, is missing. In Section 3.4.1, a summarization method is described that does not have these drawbacks.

In Section 3.4.2, a background correction method is presented as an alternative to the convolution model used in RMA. It employs the MMs to estimate the non-specific binding affecting the PMs.

3.4.1 Probe Level Model

Instead of median polish, robust regression using M-estimators (Huber, 1981) can be employed to fit the probe level model (PLM)

$$\log_2 PM_{hj}^{(i)} = \alpha_h^{(i)} + \beta_j^{(i)} + \varepsilon_{hj}^{(i)} \tag{3.3}$$

for each probe set $i$, $i = 1, \ldots, m$, where

– the probe effects $\alpha_h^{(i)}$, $h = 1, \ldots, H_i$, are constrained by $\sum_{h=1}^{H_i} \alpha_h^{(i)} = 0$,
– the expression values $x_{ij}$ are given by the chip effects $\beta_j^{(i)}$, $j = 1, \ldots, n$,
– the errors $\varepsilon_{hj}^{(i)}$ are assumed to be independently and identically distributed.

Fitting such a model can be considered as a weighted linear regression in which the observations are iteratively reweighted based on their standardized residuals (see Algorithm 3.5). The larger the absolute values of the standardized residuals, the smaller the weights, so that outliers have (almost) no influence on the estimation of the parameters $\beta = (\alpha_1 \; \ldots \; \alpha_{H_i} \;\; \beta_1 \; \ldots \; \beta_n)'$. (For a short introduction to M-estimation and its connection to weighted linear regression, see Appendix D.2.)


Algorithm 3.5 (Robust Linear Regression)

Given a response vector $y$ of $q$ observations, a $q \times p$ design matrix $Z$, a weighting function $w$, a maximum number $K_{iter}$ of iterations, and the tolerance $\tau$ for convergence, the parameter vector $\beta$ of the linear model $y = Z\beta + \varepsilon$ is robustly estimated as follows.

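1. Initially, estimate $\beta$ by $\hat\beta^{(0)} = (Z'Z)^{-1} Z'y$, and set $\hat\varepsilon^{(0)} = y - Z\hat\beta^{(0)}$.

2. For $k = 1, 2, \ldots$,

(a) compute the standardized residuals $\hat u^{(k-1)} = \hat\varepsilon^{(k-1)} / \hat s^{(k-1)}$ with scaling parameter $\hat s^{(k-1)} = 1.4826 \cdot \mathrm{median}_{\ell} \, |\hat\varepsilon_{\ell}^{(k-1)}|$,

(b) construct the $q \times q$ diagonal weight matrix $W^{(k-1)}$ with diagonal elements $w_{\ell\ell}^{(k-1)} = w(\hat u_{\ell}^{(k-1)})$,

(c) compute $\hat\beta^{(k)} = (Z'W^{(k-1)}Z)^{-1} Z'W^{(k-1)}y$,

(d) set $\hat\varepsilon^{(k)} = y - Z\hat\beta^{(k)}$,

(e) stop if $k = K_{iter}$, or if

$$\frac{(\hat\varepsilon^{(k-1)} - \hat\varepsilon^{(k)})' (\hat\varepsilon^{(k-1)} - \hat\varepsilon^{(k)})}{\max\left(10^{-16}, \; \hat\varepsilon^{(k-1)\prime} \hat\varepsilon^{(k-1)}\right)} \le \tau.$$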
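The steps of Algorithm 3.5 can be sketched in plain Python as follows. This is only a minimal illustration under stated assumptions, not the code behind rlm or fitPLM: the helper names (`solve`, `weighted_ls`, `residuals`, `robust_regression`), the default values of `k_iter` and `tol`, and the guard for a zero scale estimate are choices made for this sketch.

```python
from statistics import median


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_ls(z, y, w):
    """Return (Z'WZ)^{-1} Z'Wy for the diagonal weight matrix W = diag(w)."""
    q, p = len(z), len(z[0])
    zwz = [[sum(w[l] * z[l][a] * z[l][b] for l in range(q)) for b in range(p)]
           for a in range(p)]
    zwy = [sum(w[l] * z[l][a] * y[l] for l in range(q)) for a in range(p)]
    return solve(zwz, zwy)


def residuals(z, y, beta):
    """Return y - Z beta."""
    return [y_l - sum(z_la * b for z_la, b in zip(z_l, beta))
            for z_l, y_l in zip(z, y)]


def huber_weight(u, c=1.345):
    """Huber's weighting function (3.4)."""
    return 1.0 if abs(u) <= c else c / abs(u)


def robust_regression(z, y, weight=huber_weight, k_iter=20, tol=1e-4):
    """Iteratively reweighted least squares; k_iter and tol defaults are assumptions."""
    beta = weighted_ls(z, y, [1.0] * len(y))          # step 1: ordinary least squares
    eps = residuals(z, y, beta)
    for _ in range(k_iter):                           # step 2, at most K_iter times
        s = 1.4826 * median(abs(e) for e in eps)      # (a) scale estimate
        if s == 0.0:                                  # guard added for this sketch
            break
        w = [weight(e / s) for e in eps]              # (a), (b) weights
        beta = weighted_ls(z, y, w)                   # (c) weighted least squares
        new_eps = residuals(z, y, beta)               # (d) new residuals
        change = sum((a - b) ** 2 for a, b in zip(eps, new_eps))
        denom = max(1e-16, sum(a * a for a in eps))
        eps = new_eps
        if change / denom <= tol:                     # (e) convergence check
            break
    return beta


# A straight line y = 1 + 2x with one grossly wrong observation at x = 9.
z = [[1.0, float(x)] for x in range(10)]
y = [1.0 + 2.0 * x for x in range(9)] + [100.0]
print("least squares:", weighted_ls(z, y, [1.0] * len(y)))
print("robust fit:   ", robust_regression(z, y))
```

For the PLM (3.3), the rows of `z` would hold the indicator columns for the probe and chip effects of one probe set, with the sum-to-zero constraint on the probe effects built into the coding of the design matrix.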

The results of a comparison of Huber's weighting function

$$w_H(u) = I(|u| \le 1.345) + \frac{1.345}{|u|} \cdot I(|u| > 1.345) \tag{3.4}$$

with the Geman-McClure function $w_{GM}(u) = (1 + u^2)^{-2}$ and Tukey's biweight function $w_{TB}(u) = \left(1 - (u/4.6851)^2\right)^2 \cdot I(|u| \le 4.6851)$ carried out by Bolstad (2004) indicate that all three approaches fit the models almost equally well. To be in concordance with the standard R function rlm for fitting robust linear models, Bolstad (2004) thus uses (3.4) as default in the R function fitPLM for fitting the probe level models (3.3).
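To make the difference between the three weighting functions concrete, they can be written down directly, as in the following small sketch. The function names are chosen for this illustration, and the tuning constants are the ones quoted in the text.

```python
def huber_weight(u, c=1.345):
    """Huber (3.4): full weight up to c, then decaying like c / |u|."""
    return 1.0 if abs(u) <= c else c / abs(u)


def geman_mcclure_weight(u):
    """Geman-McClure: smooth decay, never exactly zero."""
    return (1.0 + u * u) ** -2


def tukey_biweight_weight(u, c=4.6851):
    """Tukey's biweight: weight zero for standardized residuals beyond c."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0


# Tabulate the weights for a few standardized residuals.
for u in (0.0, 1.0, 3.0, 6.0):
    print(u, huber_weight(u), geman_mcclure_weight(u), tukey_biweight_weight(u))
```

Only Tukey's biweight discards an observation completely; the other two merely shrink its weight as the standardized residual grows.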


3.4.2 Base Composition Based Background Correction

Each base at any position in a PM sequence affects the intensity of this probe (Wu et al., 2004). On the one hand, the higher the GC-content, i.e. the number of the bases G and C in the sequence, the stronger the hybridization, since G and C are connected via three hydrogen bonds, while A and T are joined by two hydrogen bonds (see Figure 2.1 on page 10). On the other hand, only the pyrimidines C and U are labeled with fluorescent dye, which might either impede hybridization if too many bases are labeled, or weaken the fluorescence of strongly binding sequences if too few bases are labeled (Naef and Magnasco, 2003).

To investigate the effect of the base composition on the probe intensities, Naef and Magnasco (2003) model the PM intensities by a sum over the position-dependent base effects $\theta_{k\ell}$, $k \in \{A, T, C, G\}$, $\ell = 1, \ldots, 25$. The resulting least squares estimates $\hat\theta_{k\ell}$ are shown in Figure 3.6.
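Summing position-dependent base effects over a 25-mer can be illustrated with a few lines of Python. The sketch below is hypothetical throughout: the function name `probe_affinity`, the layout of `theta`, and above all the numerical base effects are made up for illustration and are not the estimates of Naef and Magnasco (2003).

```python
BASES = "ATCG"
PROBE_LENGTH = 25


def probe_affinity(sequence, theta):
    """Sum the base effects theta[base][position] along a probe sequence."""
    if len(sequence) != PROBE_LENGTH:
        raise ValueError("expected a probe sequence of length 25")
    return sum(theta[base][pos] for pos, base in enumerate(sequence))


# Made-up base effects (NOT the estimates shown in Figure 3.6): a constant
# per base, scaled by a triangular profile that peaks in the probe center.
base_scale = {"A": -0.2, "T": -0.1, "C": 0.3, "G": 0.1}
theta = {
    base: [scale * (1.0 - abs(pos - 12) / 12.0) for pos in range(PROBE_LENGTH)]
    for base, scale in base_scale.items()
}

print(probe_affinity("GATTACAGATTACAGATTACAGATT", theta))
```

With real estimates in place of the made-up table, the same sum yields one affinity value per probe sequence.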

Based on this idea and the results of a few other microarray experiments, Wu et al. (2004) propose an alternative background correction step for RMA. In this modified approach called GCRMA, they assume for any probe pair that

$$PM_{hj}^{(i)} = O_j^{PM} + N_{hij}^{PM} + S_{hij} \quad \text{and} \quad MM_{hj}^{(i)} = O_j^{MM} + N_{hij}^{MM} + \varphi_{hij} S_{hij},$$

where the optical noise $O_j$, $j = 1, \ldots, n$, follows a lognormal distribution, the noise based on non-specific binding is distributed according to

$$\begin{pmatrix} \ln N_{hij}^{PM} \\ \ln N_{hij}^{MM} \end{pmatrix} \sim N\left( \begin{pmatrix} \nu_{hij}^{PM} \\ \nu_{hij}^{MM} \end{pmatrix}, \; \sigma_j^2 \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix} \right), \tag{3.5}$$

$S_{hij}$ is the signal of interest, and $0 < \varphi_{hij} < 1$ takes into account that MMs can also measure specific binding.

FIGURE 3.6. Position-dependent effects of the four bases A, T, C and G on the probe intensities. (Source: Naef and Magnasco, 2003, Figure 3)

For each chip $j$, $j = 1, \ldots, n$, Wu et al. (2004) estimate the parameters by ad-hoc approaches: $\varphi$ is set to zero, since following Wu et al. (2004) this has only a small impact on the results of the approach. The optical noise is assumed to be constant (since the variance of $O_j$ is almost zero), and calculated by

$$\hat O_j = \min_{i,h} \left\{ PM_{hj}^{(i)}, MM_{hj}^{(i)} \right\} - 1.$$

The parameters in (3.5) are estimated by firstly fitting a loess curve $f_j$ (Cleveland and Devlin, 1988) through the scatter plot of $\ln\left(MM_{hj}^{(i)} - \hat O_j\right)$ vs. $\hat\lambda_{hi}^{MM}$, where the probe affinities $\hat\lambda$ are determined by summing over the base effects $\hat\theta_{k\ell}$ corresponding to the respective probe sequence. (These base effects slightly differ from the $\hat\theta_{k\ell}$ shown in Figure 3.6, since the model used by Wu et al., 2004, differs slightly from the model of Naef and Magnasco, 2003.) Secondly, $\sigma_j$ is set to the MAD of the negative residuals resulting from this regression, and the means are estimated by $\hat\nu_{hij}^{PM} = f_j\left(\hat\lambda_{hi}^{PM}\right)$ and $\hat\nu_{hij}^{MM} = f_j\left(\hat\lambda_{hi}^{MM}\right)$.

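The background corrected intensity of each PM is then computed by a truncated maximum likelihood estimator for $S_{hij}$ given by

$$\hat S_{hij} = \begin{cases} PM_{hj}^{(i)} - \hat O_j - \hat N_{hij}^{PM}, & \text{if } PM_{hj}^{(i)} - \hat O_j - \hat N_{hij}^{PM} > \tau, \\ \tau, & \text{if } PM_{hj}^{(i)} - \hat O_j - \hat N_{hij}^{PM} \le \tau, \end{cases} \tag{3.6}$$

where

$$\hat N_{hij}^{PM} = \exp\left( 0.7 \cdot \ln\left(MM_{hj}^{(i)} - \hat O_j\right) + \nu_{hij}^{PM} - 0.7 \cdot \nu_{hij}^{MM} - \left(1 - 0.7^2\right) \sigma_j^2 \right),$$

and $\tau$ is the minimum value allowed for $S_{hij}$. In the R function gcrma, this value is set by default to $\tau = 6$.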
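The optical noise estimate and the truncated estimator (3.6) translate into a few lines of Python. The sketch below is an illustration only and not the code of the gcrma package: the function and argument names are invented here, and the values of $\nu^{PM}$, $\nu^{MM}$ and $\sigma_j$ are assumed to have been obtained beforehand from the loess fit described above.

```python
import math

RHO = 0.7  # correlation between ln N^PM and ln N^MM assumed in (3.5)


def estimate_optical_noise(pm_values, mm_values):
    """Optical noise of one chip: smallest PM or MM intensity minus 1."""
    return min(min(pm_values), min(mm_values)) - 1.0


def nonspecific_pm(mm, optical, nu_pm, nu_mm, sigma):
    """Estimate of the non-specific binding noise of a PM probe.

    mm - optical is at least 1 when optical comes from
    estimate_optical_noise, so the logarithm is defined.
    """
    return math.exp(RHO * math.log(mm - optical) + nu_pm - RHO * nu_mm
                    - (1.0 - RHO ** 2) * sigma ** 2)


def background_corrected(pm, mm, optical, nu_pm, nu_mm, sigma, tau=6.0):
    """Truncated estimator (3.6): never return less than tau."""
    s = pm - optical - nonspecific_pm(mm, optical, nu_pm, nu_mm, sigma)
    return s if s > tau else tau


# Toy numbers for a single probe pair (purely illustrative).
optical = estimate_optical_noise([120.0, 45.0, 300.0], [60.0, 31.0, 90.0])
print(optical)
print(background_corrected(300.0, 90.0, optical, nu_pm=3.0, nu_mm=2.8, sigma=0.5))
```

The truncation at `tau` keeps the corrected intensities positive, so that logarithms can still be taken in the later summarization step.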

In this implementation of GCRMA, two further corrections are made that are not mentioned in Wu et al. (2004). First, the linear model

$$\log_2 PM_{hj}^{(i)} = \gamma_0 + \gamma_1 \hat\lambda_{hi}^{PM} + \varepsilon_{hj}^{(i)}$$

is fitted for a randomly chosen subset of the PM intensities. For each chip $j$, $j = 1, \ldots, n$, the PM intensities background corrected by (3.6) are then additionally adjusted by setting $\log_2 PM_{hj}^{(i)}$ to

$$s_{hij} = \log_2 PM_{hj}^{(i)} - \hat\gamma_1 \hat\lambda_{hi}^{PM} + n_{PM}^{-1} \sum_{k=1}^{m} \sum_{\ell=1}^{H_k} \hat\gamma_1 \hat\lambda_{\ell k}^{PM}.$$

Afterwards, the final background corrected PM values are determined by setting $PM_{hj}^{(i)}$ to

$$S_{hij}^{new} = \exp\left( n_{PM}^{-1} \sum_{k,\ell} \ln PM_{kj}^{(\ell)} + 1.15 \left( \ln PM_{hj}^{(i)} - n_{PM}^{-1} \sum_{k,\ell} \ln PM_{kj}^{(\ell)} \right) \right).$$
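The final adjustment stretches the log intensities of a chip around their mean by the factor 1.15. A minimal Python sketch follows, under the assumption that $n_{PM}$ denotes the number of PM probes on the chip, which the text does not state explicitly; the function name `stretch` is likewise chosen for this illustration.

```python
import math


def stretch(pm_values, factor=1.15):
    """Spread the log intensities of one chip around their mean by `factor`.

    pm_values holds all (positive) PM intensities of the chip, so
    len(pm_values) plays the role of n_PM (an assumption of this sketch).
    """
    logs = [math.log(v) for v in pm_values]
    mean_log = sum(logs) / len(logs)
    return [math.exp(mean_log + factor * (x - mean_log)) for x in logs]


print(stretch([6.0, 50.0, 400.0, 3000.0]))
```

The mean of the log intensities is unchanged by this step; only the distances to that mean are enlarged.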
