Median polish has two drawbacks. First, it does not provide estimates of the standard errors. Second, it is applicable to a probe set $i$, $i = 1, \ldots, m$, only if none of the probe intensities $PM_{hj}^{(i)}$, $h = 1, \ldots, H_i$, $j = 1, \ldots, n$, is missing. Section 3.4.1 describes a summarization method that does not have these drawbacks.
Section 3.4.2 presents a background correction method that serves as an alternative to the convolution model used in RMA and employs the MMs to estimate the non-specific binding affecting the PMs.
3.4.1 Probe Level Model
Instead of median polish, robust regression using M-estimators (Huber, 1981) can be employed to fit the probe level model (PLM)
$$\log_2 PM_{hj}^{(i)} = \alpha_h^{(i)} + \beta_j^{(i)} + \varepsilon_{hj}^{(i)} \tag{3.3}$$
for each probe set $i$, $i = 1, \ldots, m$, where
– the probe effects $\alpha_h^{(i)}$, $h = 1, \ldots, H_i$, are constrained by $\sum_{h=1}^{H_i} \alpha_h^{(i)} = 0$,
– the expression values $x_{ij}$ are given by the chip effects $\beta_j^{(i)}$, $j = 1, \ldots, n$,
– the errors $\varepsilon_{hj}^{(i)}$ are assumed to be independently and identically distributed.
Fitting such a model can be considered as a weighted linear regression in which the observations are iteratively reweighted based on their standardized residuals (see Algorithm 3.5). The larger the absolute value of a standardized residual, the smaller its weight, so that outliers have (almost) no influence on the estimation of the parameters $\beta = (\alpha_1 \ldots \alpha_{H_i} \; \beta_1 \ldots \beta_n)'$. (For a short introduction to M-estimation and its connection to weighted linear regression, see Appendix D.2.)
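As a minimal sketch of how model (3.3) can be written as a linear model $y = Z\beta + \varepsilon$ for a single probe set, the following Python snippet builds a design matrix in which the sum-to-zero constraint is imposed by expressing $\alpha_{H_i}$ as $-(\alpha_1 + \ldots + \alpha_{H_i - 1})$. The function name `plm_design` and the ordering of the observations (probes running fastest within each chip) are assumptions made for illustration. They are not taken from fitPLM.

```python
def plm_design(n_probes, n_chips):
    """Design matrix for one probe set under the PLM (3.3).

    Columns 0..H-2 hold the free probe effects alpha_1..alpha_{H-1};
    alpha_H = -(alpha_1 + ... + alpha_{H-1}) enforces the sum-to-zero
    constraint. The remaining n columns hold the chip effects.
    Rows are ordered chip by chip, probes running fastest (an assumption).
    """
    rows = []
    for j in range(n_chips):
        for h in range(n_probes):
            if h < n_probes - 1:
                probe_part = [0.0] * (n_probes - 1)
                probe_part[h] = 1.0
            else:
                # last probe: its effect is minus the sum of the others
                probe_part = [-1.0] * (n_probes - 1)
            chip_part = [1.0 if c == j else 0.0 for c in range(n_chips)]
            rows.append(probe_part + chip_part)
    return rows


# Hypothetical toy probe set with 3 probes on 2 chips
Z = plm_design(n_probes=3, n_chips=2)
for row in Z:
    print(row)
```

With this parametrization the chip effects, and hence the expression values, are read directly from the last $n$ components of the estimated parameter vector.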
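Algorithm 3.5 (Robust Linear Regression)
Given a response vector $y$ of $q$ observations, a $q \times p$ design matrix $Z$, a weighting function $w$, a maximum number $K_{iter}$ of iterations, and the tolerance $\tau$ for convergence, the parameter vector $\beta$ of the linear model $y = Z\beta + \varepsilon$ is robustly estimated as follows.
1. Initially, estimate $\beta$ by $\hat\beta^{(0)} = (Z'Z)^{-1} Z'y$, and set $\hat\varepsilon^{(0)} = y - Z\hat\beta^{(0)}$.
2. For $k = 1, 2, \ldots$,
(a) compute the standardized residuals $\hat u^{(k-1)} = \hat\varepsilon^{(k-1)} / \hat s^{(k-1)}$ with scaling parameter $\hat s^{(k-1)} = 1.4826 \cdot \mathrm{median}\,|\hat\varepsilon^{(k-1)}|$,
(b) construct the $q \times q$ diagonal weight matrix $W^{(k-1)}$ with diagonal elements $w_{\ell\ell}^{(k-1)} = w\big(\hat u_\ell^{(k-1)}\big)$,
(c) update the estimate for $\beta$ by $\hat\beta^{(k)} = \big(Z'W^{(k-1)}Z\big)^{-1} Z'W^{(k-1)}y$,
(d) set $\hat\varepsilon^{(k)} = y - Z\hat\beta^{(k)}$,
(e) stop if $k = K_{iter}$, or if
$$\frac{\big(\hat\varepsilon^{(k-1)} - \hat\varepsilon^{(k)}\big)'\big(\hat\varepsilon^{(k-1)} - \hat\varepsilon^{(k)}\big)}{\max\big(10^{-16},\ \hat\varepsilon^{(k-1)\prime}\hat\varepsilon^{(k-1)}\big)} \le \tau.$$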
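The steps of Algorithm 3.5 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions. The helper names (`solve`, `wls`, `residuals`, `robust_regression`), the default values of `k_iter` and `tol`, the guard against a zero scale estimate, and the toy data are all hypothetical and are not taken from rlm or fitPLM.

```python
import statistics


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    p = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * d for a, d in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def wls(Z, y, w):
    """Weighted least squares (Z'WZ)^{-1} Z'Wy with W = diag(w)."""
    p = len(Z[0])
    A = [[sum(wi * z[a] * z[b] for z, wi in zip(Z, w)) for b in range(p)]
         for a in range(p)]
    c = [sum(wi * z[a] * yi for z, yi, wi in zip(Z, y, w)) for a in range(p)]
    return solve(A, c)


def residuals(Z, y, beta):
    return [yi - sum(zk * bk for zk, bk in zip(z, beta)) for z, yi in zip(Z, y)]


def huber(u, c=1.345):
    return 1.0 if abs(u) <= c else c / abs(u)


def robust_regression(Z, y, weight=huber, k_iter=20, tol=1e-4):
    beta = wls(Z, y, [1.0] * len(y))                 # step 1: ordinary least squares
    eps = residuals(Z, y, beta)
    for _ in range(k_iter):                          # step 2, at most K_iter times
        s = 1.4826 * statistics.median(abs(e) for e in eps)   # (a) scale
        if s == 0.0:                                 # guard added here, not in Algorithm 3.5
            break
        w = [weight(e / s) for e in eps]             # (a), (b) weights
        beta = wls(Z, y, w)                          # (c) weighted update
        new_eps = residuals(Z, y, beta)              # (d) new residuals
        change = sum((a - b) ** 2 for a, b in zip(eps, new_eps))
        denom = max(1e-16, sum(e * e for e in eps))
        eps = new_eps
        if change / denom <= tol:                    # (e) convergence check
            break
    return beta


# Toy data: y = 1 + 2x with one gross outlier at x = 9
Z = [[1.0, float(x)] for x in range(10)]
y = [1.0 + 2.0 * x for x in range(10)]
y[9] += 50.0
ols = wls(Z, y, [1.0] * len(y))
robust = robust_regression(Z, y)
print("OLS slope:", ols[1], "robust slope:", robust[1])
```

In this toy example the outlier pulls the least squares slope away from 2, whereas the reweighting progressively reduces its influence.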
The results of a comparison of Huber's weighting function
$$w_H(u) = I\big(|u| \le 1.345\big) + \frac{1.345}{|u|} \cdot I\big(|u| > 1.345\big) \tag{3.4}$$
with the Geman-McClure function $w_{GM}(u) = (1 + u^2)^{-2}$ and Tukey's biweight function $w_{TB}(u) = \big(1 - (u/4.6851)^2\big)^2 \cdot I\big(|u| \le 4.6851\big)$, carried out by Bolstad (2004), indicate that all three approaches fit the models almost equally well. For consistency with the standard R function rlm for fitting robust linear models, Bolstad (2004) thus uses (3.4) as the default in the R function fitPLM for fitting the probe level models (3.3).
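To make the difference between the three weighting functions concrete, they can be written as small Python functions and evaluated at a few standardized residuals. The function names are chosen here for illustration only. The tuning constants are those given in the text.

```python
def w_huber(u, c=1.345):
    """Huber weights (3.4): 1 inside [-c, c], c/|u| outside."""
    return 1.0 if abs(u) <= c else c / abs(u)


def w_geman_mcclure(u):
    """Geman-McClure weights (1 + u^2)^(-2)."""
    return (1.0 + u * u) ** -2


def w_tukey_biweight(u, c=4.6851):
    """Tukey's biweight: (1 - (u/c)^2)^2 inside [-c, c], 0 outside."""
    return (1.0 - (u / c) ** 2) ** 2 if abs(u) <= c else 0.0


for u in (0.0, 1.0, 3.0, 6.0):
    print(u, w_huber(u), w_geman_mcclure(u), w_tukey_biweight(u))
```

Huber's function never assigns a weight of exactly zero, whereas Tukey's biweight discards observations whose standardized residuals exceed 4.6851 in absolute value.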
3.4.2 Base Composition Based Background Correction
Each base at any position in a PM sequence affects the intensity of this probe (Wu et al., 2004). On the one hand, the higher the GC content, i.e. the number of the bases G and C in the sequence, the stronger the hybridization, since G and C are connected via three hydrogen bonds, while A and T are joined by two hydrogen bonds (see Figure 2.1 on page 10). On the other hand, only the pyrimidines C and U are labeled with fluorescent dye, which might either impede hybridization if too many bases are labeled, or weaken the fluorescence of strongly binding sequences if too few bases are labeled (Naef and Magnasco, 2003).
To investigate the effect of the base composition on the probe intensities, Naef and Magnasco (2003) model the PM intensities by a sum over the position-dependent base effects $\theta_{k\ell}$, $k \in \{A, T, C, G\}$, $\ell = 1, \ldots, 25$. The resulting least squares estimates $\hat\theta_{k\ell}$ are shown in Figure 3.6.
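The idea of summing position-dependent base effects along a 25-mer can be illustrated as follows. The effect values in `theta` are placeholders invented for this sketch. They are not the estimates of Naef and Magnasco (2003) or of Wu et al. (2004), and the function name `probe_affinity` is likewise hypothetical.

```python
def probe_affinity(sequence, theta):
    """Sum the base effects theta[k][l] over the positions l of a 25-mer."""
    if len(sequence) != 25:
        raise ValueError("expected a probe sequence of length 25")
    return sum(theta[base][pos] for pos, base in enumerate(sequence))


# Placeholder effects, constant over positions (purely hypothetical values)
theta = {
    "A": [0.0] * 25,
    "T": [-0.1] * 25,
    "C": [0.2] * 25,
    "G": [0.1] * 25,
}

seq = "ACGT" * 6 + "A"   # 7 A, 6 C, 6 G, 6 T
print(probe_affinity(seq, theta))
```

In a real application the effects would vary with the position, as Figure 3.6 suggests, so that two probes with the same base counts can still obtain different sums.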
Based on this idea and the results of a few other microarray experiments, Wu et al. (2004) propose an alternative background correction step for RMA. In this modified approach, called GCRMA, they assume for any probe pair that
$$PM_{hj}^{(i)} = O_j^{PM} + N_{hij}^{PM} + S_{hij} \quad \text{and} \quad MM_{hj}^{(i)} = O_j^{MM} + N_{hij}^{MM} + \varphi_{hij} S_{hij},$$
where the optical noise $O_j$, $j = 1, \ldots, n$, follows a lognormal distribution, $N_{hij}^{PM}$ and $N_{hij}^{MM}$ with
$$\begin{pmatrix} \ln N_{hij}^{PM} \\ \ln N_{hij}^{MM} \end{pmatrix} \sim N\left( \begin{pmatrix} \nu_{hij}^{PM} \\ \nu_{hij}^{MM} \end{pmatrix},\ \sigma_j^2 \begin{pmatrix} 1 & 0.7 \\ 0.7 & 1 \end{pmatrix} \right) \tag{3.5}$$
are the noise based on non-specific binding, $S_{hij}$ is the signal of interest, and $0 < \varphi_{hij} < 1$ takes into account that MMs can also measure specific binding.
FIGURE 3.6. Position-dependent effects of the four bases A, T, C and G on the probe intensities. (Source: Naef and Magnasco, 2003, Figure 3)
For each chip $j$, $j = 1, \ldots, n$, Wu et al. (2004) estimate the parameters by ad hoc approaches. The factor $\varphi_{hij}$ is set to zero, since according to Wu et al. (2004) this has only a small impact on the results of the approach. The optical noise is assumed to be constant (since the variance of $O_j$ is almost zero), and is calculated by
$$\hat O_j = \min_{i,\,h} \big\{ PM_{hj}^{(i)},\ MM_{hj}^{(i)} \big\} - 1.$$
The parameters in (3.5) are estimated by first fitting a loess curve $f_j$ (Cleveland and Devlin, 1988) through the scatter plot of $\ln\big(MM_{hj}^{(i)} - \hat O_j\big)$ vs. $\hat\lambda_{hi}^{MM}$, where the probe affinities $\hat\lambda$ are determined by summing over the base effects $\hat\theta_{k\ell}$ corresponding to the respective probe sequence. (These base effects differ slightly from the $\hat\theta_{k\ell}$ shown in Figure 3.6, since the model used by Wu et al. (2004) differs slightly from the model of Naef and Magnasco (2003).) Second, $\sigma_j$ is set to the MAD of the negative residuals resulting from this regression, and the means are estimated by $\hat\nu_{hij}^{PM} = f_j\big(\hat\lambda_{hi}^{PM}\big)$ and $\hat\nu_{hij}^{MM} = f_j\big(\hat\lambda_{hi}^{MM}\big)$.
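The background-corrected intensity of each PM is then computed by a truncated maximum likelihood estimator for $S_{hij}$ given by
$$\hat S_{hij} = \begin{cases} PM_{hj}^{(i)} - \hat O_j - \hat N_{hij}^{PM}, & \text{if } PM_{hj}^{(i)} - \hat O_j - \hat N_{hij}^{PM} > \tau, \\ \tau, & \text{if } PM_{hj}^{(i)} - \hat O_j - \hat N_{hij}^{PM} \le \tau, \end{cases} \tag{3.6}$$
where
$$\hat N_{hij}^{PM} = \exp\Big\{ 0.7 \cdot \ln\big(MM_{hj}^{(i)} - \hat O_j\big) + \nu_{hij}^{PM} - 0.7 \cdot \nu_{hij}^{MM} - \big(1 - 0.7^2\big)\,\sigma_j^2 \Big\},$$
and $\tau$ is the minimum value allowed for $S_{hij}$. In the R function gcrma, this value is set by default to $\tau = 6$.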
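The estimator (3.6) can be sketched for a single probe pair as follows. One way to read the expression for $\hat N_{hij}^{PM}$, consistent with (3.5), is that $\ln N_{hij}^{PM}$ given $\ln N_{hij}^{MM}$ is normal with mean $\nu_{hij}^{PM} + 0.7\,(\ln N_{hij}^{MM} - \nu_{hij}^{MM})$ and variance $(1 - 0.7^2)\,\sigma_j^2$, and that the mode of a lognormal distribution is $\exp(\mu - \sigma^2)$. The function names and all numerical inputs below are hypothetical, and the code is not taken from the R function gcrma.

```python
import math


def nonspecific_pm(mm, optical, nu_pm, nu_mm, sigma, rho=0.7):
    """Estimate of the non-specific binding of a PM, as given below (3.6)."""
    return math.exp(rho * math.log(mm - optical) + nu_pm - rho * nu_mm
                    - (1.0 - rho ** 2) * sigma ** 2)


def corrected_signal(pm, mm, optical, nu_pm, nu_mm, sigma, tau=6.0):
    """Truncated estimator (3.6): never returns less than tau."""
    s = pm - optical - nonspecific_pm(mm, optical, nu_pm, nu_mm, sigma)
    return s if s > tau else tau


# Hypothetical values for one probe pair on one chip
args = dict(mm=200.0, optical=50.0, nu_pm=5.0, nu_mm=5.0, sigma=0.5)
print(corrected_signal(pm=1000.0, **args))   # bright probe: signal kept
print(corrected_signal(pm=60.0, **args))     # dim probe: truncated at tau
```

The truncation at $\tau$ keeps the corrected intensities positive, so that logarithms can be taken in the subsequent steps.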
In this implementation of GCRMA, two further corrections are made that are not mentioned in Wu et al. (2004). First, the linear model
$$\log_2 PM_{hj}^{(i)} = \gamma_0 + \gamma_1 \hat\lambda_{hi}^{PM} + \varepsilon_{hj}^{(i)}$$
is fitted for a randomly chosen subset of the PM intensities. For each chip $j$, $j = 1, \ldots, n$, the PM intensities background-corrected by (3.6) are then additionally adjusted by setting $\log_2 PM_{hj}^{(i)}$ to
$$s_{hij} = \log_2 PM_{hj}^{(i)} - \hat\gamma_1 \hat\lambda_{hi}^{PM} + n_{PM}^{-1} \sum_{k=1}^{m} \sum_{\ell=1}^{H_i} \hat\gamma_1 \hat\lambda_{k\ell}^{PM}.$$
Afterwards, the final background-corrected PM values are determined by setting $PM_{hj}^{(i)}$ to
$$S_{hij}^{new} = \exp\left\{ n_{PM}^{-1} \sum_{k,\,\ell} \ln PM_{kj}^{(\ell)} + 1.15 \left( \ln PM_{hj}^{(i)} - n_{PM}^{-1} \sum_{k,\,\ell} \ln PM_{kj}^{(\ell)} \right) \right\}.$$
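The last formula stretches the log intensities of a chip around their mean by the factor 1.15. A minimal Python sketch, assuming that $n_{PM}$ denotes the number of PM probes on the chip and using the hypothetical function name `stretch_log_intensities`, reads:

```python
import math


def stretch_log_intensities(pm, factor=1.15):
    """Stretch the natural-log intensities of one chip around their mean."""
    logs = [math.log(v) for v in pm]
    mean_log = sum(logs) / len(logs)
    return [math.exp(mean_log + factor * (x - mean_log)) for x in logs]


# Hypothetical PM intensities of one chip
pm = [10.0, 100.0, 1000.0]
print(stretch_log_intensities(pm))
```

The mean of the log intensities is left unchanged by this step, while values below the geometric mean are decreased and values above it are increased.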