Conclusions - Statistical analysis of genotype and gene expression data

The scree plots in Figure 4.4 reveal that, in the applications to the signals themselves, MAS 5.0, qPLIER and PLIER need many more principal components to explain the same proportion of variability in the data than PLA and PLA+, which in turn require many more components than the RMA methods. If the project effect is removed, the distance between PLA(+) and the Affymetrix approaches shrinks almost to zero, whereas the RMA procedures still show the most parsimonious PCA representation. In this second application, the RMA approaches in which probe level models have been fitted slightly outperform the methods in which median polish has been used for summarization.
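The quantity read off a scree plot can be sketched in a few lines of base R. The example below is only an illustration on simulated data, not on the signals analyzed in this thesis: the matrix sizes, the two-level project label and the helper name components_needed are assumptions made for this sketch.

```r
set.seed(1)

# Simulated data (assumption): 38 samples in rows, 200 genes in columns.
n_samples <- 38
n_genes <- 200
project <- rep(c(0, 1), length.out = n_samples)   # hypothetical project label

noise <- matrix(rnorm(n_samples * n_genes), n_samples, n_genes)
# Add a gene-specific shift to all samples of the second project.
signals <- noise + outer(project, rnorm(n_genes, sd = 2))

# Smallest number of principal components whose cumulative share of the
# total variance reaches the proportion 'prop' (hypothetical helper).
components_needed <- function(x, prop = 0.8) {
  pca <- prcomp(x, center = TRUE, scale. = FALSE)
  cum_share <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_share >= prop)[1]
}

k_structured <- components_needed(signals)
k_noise <- components_needed(noise)
```

A preprocessing method with a more parsimonious PCA representation corresponds to a smaller value returned by components_needed for the same proportion.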

4.7 Conclusions

In this chapter, we have compared nine preprocessing procedures to identify the approach that provides the best reduction of the typically eleven pairs of probe intensities per gene and sample to one expression value.
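As a reminder of what such a reduction looks like, the following base R sketch summarizes one simulated probe set by median polish, the summarization step used in RMA. It is a minimal illustration under stated assumptions: the data are simulated log2 perfect match intensities (the mismatch probes are ignored), and the numbers of probes and chips as well as all variable names are chosen for this example only.

```r
set.seed(2)

n_probes <- 11   # probes of one probe set
n_chips <- 6     # hypothetical number of chips

probe_effect <- rnorm(n_probes)                    # probe affinities
true_expr <- seq(7, 9.5, length.out = n_chips)     # chip-wise log2 expression

# Rows are probes, columns are chips: additive model plus small noise.
log_pm <- outer(probe_effect, true_expr, "+") +
  rnorm(n_probes * n_chips, sd = 0.1)

# Median polish fits overall + row (probe) + column (chip) effects.
fit <- medpolish(log_pm, trace.iter = FALSE)

# One expression value per chip: overall level plus chip effect.
expr_values <- fit$overall + fit$col
```

The eleven rows of log_pm are thus reduced to the single vector expr_values with one entry per chip.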

The single-chip method MAS 5.0 and the multi-chip approach PLIER in which the normalization step is avoided show the worst performance in almost all of the comparisons: they lead to the lowest S/N ratios, they identify the smallest number of differentially expressed genes, and they need many more principal components than the other procedures to explain the same proportion of variation. Applying PLIER to quantile normalized probe intensities seems to compensate for the former two problems, and makes qPLIER a serious competitor to the other preprocessing methods.

Of the two groups of procedures, PLM seems to be the best RMA approach, and PLA the best PLIER or PLIER-like method, since both outperform the other approaches in their group, at least slightly.

In a comparison of these two methods, PLM requires fewer principal components to capture the same proportion of variation and shows a slightly larger S/N ratio, whereas PLA identifies virtually the same number of genes and exhibits a larger number of perfectly separated groups. PLA, however, also shows larger normality problems and the smallest correlations of both the housekeeping genes and the GAPDH probe sets.

FIGURE 4.5. The base effects used in GCRMA and GCPLM (left panel), and estimated using the 38 Affymetrix HG-U133 Plus 2 chips (right panel).

Therefore, the result of the study for Roche Diagnostics (cf. Section 4.1), namely to choose PLM as the standard preprocessing algorithm, still holds if the set of considered methods is extended by other (popular) approaches.

Since the base component based background method should actually improve RMA, it is somewhat surprising that GCRMA and GCPLM result in worse data reductions than RMA and PLM, respectively. In particular, this leads to smaller numbers of genes detected as differentially expressed. A reason for this might be that the position-dependent base effects θ (see Section 3.4.2) have been computed by Wu et al. (2004) once on a set of Affymetrix HG-U133A chips, and are used as the standard in any application of this background method, no matter which type of Affymetrix microarray is analyzed.

To check whether the base effects in our data set are similar to those of Wu et al. (2004), we computed these effects as described by Wu et al. (2004). As Figure 4.5 reveals, the base effects estimated from the Affymetrix HG-U133 Plus 2 data show a pattern similar to that of the base effects used in the background correction, but are much smaller.

Using the base effects displayed in the right panel of Figure 4.5 and an empirical Bayes approach, also proposed by Wu et al. (2004) as an alternative to the maximum likelihood estimator (3.6), might improve the performance of GCRMA and GCPLM.

Finally, we would like to remark that even though the comparisons presented in Bolstad (2004) and in this thesis show that it is advantageous to borrow strength from the whole set of chips by fitting multi-chip models, this only applies to experiments in which more than just a few microarrays are available. If one analyzes just three or four chips, a single-chip approach might be better. The same applies to quantile normalization: it is fast, even for a large set of arrays, and seems to provide a good normalization, but it should not be employed in an experiment consisting of only three or four chips. In this case, other normalization approaches such as cyclic loess (Bolstad et al., 2003) are preferable.
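For readers unfamiliar with the procedure, quantile normalization can be sketched in base R as follows. This is a minimal illustration on simulated intensities, not the implementation used in the affy package; the function name quantile_normalize and the toy matrix are assumptions made for this example, and ties are broken by order of appearance.

```r
# Force all columns (chips) of x to share the same empirical distribution.
quantile_normalize <- function(x) {
  # Rank of each value within its chip.
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the k-th smallest values across chips.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value by the reference quantile of its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

set.seed(3)
# Simulated probe intensities (assumption): 8 probes on 6 chips.
intensities <- matrix(rexp(8 * 6, rate = 1 / 200), nrow = 8,
                      dimnames = list(paste0("probe", 1:8),
                                      paste0("chip", 1:6)))
normalized <- quantile_normalize(intensities)
```

Since the reference distribution is an average over the chips, it is estimated from very few values per quantile when only three or four chips are available.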

Chapter 5 Preprocessing of a Huge Number of Microarrays Using R

5.1 Introduction

The BioConductor project (http://www.bioconductor.org) provides a large number of packages for the analysis of genomic data. For example, functions are available for all but the internal Roche preprocessing methods.

Typically, the gene expression values are computed by first reading the CEL files of interest into an AffyBatch object, say ab, by

> library(affy)
> cels <- list.celfiles(path, full.names = TRUE)
> ab <- read.affybatch(filenames = cels)

where path is the name of the directory in which the CEL files are stored, and cels is a character vector naming the CEL files with their full directory path. Afterwards, the probe intensities stored in ab are preprocessed by calling the corresponding R function. The PLM signals, e.g., can be generated using either the function threestep or by

> library(affyPLM)
> out <- fitPLM(ab)
> signal.plm <- pset2eset(out)


TABLE 5.1. Run times in seconds for the applications of, on the one hand, ReadAffy and fitPLM, and on the other hand, just.rmaplm to different numbers of Affymetrix HG-U133 Plus 2 chips on an AMD Athlon XP 3000+ machine with 1 GB of RAM.

Number of chips       10     20     25     30     50     70     90    100
ReadAffy/fitPLM      166    348  Error  Error      –      –      –      –
just.rmaplm          130    226    251    318    530    779  1,133  Error

As Table 5.1 shows, this approach works only for a small number of microarrays, since the construction of an AffyBatch object is very memory-consuming. In a CRCA project of Roche Diagnostics, however, the PLM signals of more than 400 Affymetrix HG-U133 Plus 2 chips have to be computed under Windows XP on a machine with 4 GB of RAM. Since this thesis is written on an AMD Athlon XP 3000+ machine with 1 GB of RAM, this task is extended to generating the PLM signals of 500 HG-U133 Plus 2 chips on this computer in a reasonable amount of time.
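A rough calculation illustrates the order of magnitude of the memory problem. The sketch below assumes that an HG-U133 Plus 2 chip consists of 1164 x 1164 cells and that each intensity is stored as an 8-byte double; the helper name intensity_matrix_mib is hypothetical. It only counts a single copy of the intensity matrix, whereas R typically creates further copies during reading and preprocessing.

```r
cells_per_chip <- 1164 * 1164   # assumed chip layout of the HG-U133 Plus 2
bytes_per_value <- 8            # one double-precision number per cell

# Size in MiB of one copy of the cells x chips intensity matrix.
intensity_matrix_mib <- function(n_chips) {
  n_chips * cells_per_chip * bytes_per_value / 2^20
}

# Sizes for some of the chip numbers in Table 5.1 and for the 500 chips.
chip_numbers <- c(10, 25, 100, 500)
sizes_mib <- setNames(intensity_matrix_mib(chip_numbers), chip_numbers)
print(round(sizes_mib))
```

Under these assumptions, a single copy of the intensities of 500 chips already exceeds both the 1 GB and the 4 GB of RAM mentioned above.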
