Application of startPLM - Statistical analysis of genotype and gene expression data

For the summarization step, each chunk file is imported and processed separately using the approach described in Section 5.2: A PM matrix is constructed from the probe intensities stored in the current chunk file, the names of the probes in this matrix are obtained from L, and the PLM signals are computed for the probe sets represented in the PM matrix. To ensure that these expression values do not have to be recalculated if startPLM stops because of, e.g., memory problems, they are stored chunk-wise as RData files in the subdirectory exprsChunks.

After all probe sets have been summarized, the chunk files containing the expression values are successively read in and combined with each other to construct an m × n matrix X comprising the signals of all m probe sets and n samples. To safeguard against memory problems, the combined matrix is exported each time a further save.combine chunks have been added, i.e. after combining save.combine, 2 · save.combine, ... chunks.
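The combination step with periodic export can be sketched as follows. This is only a minimal illustration in Python, not the implementation used by startPLM: the function name combine_chunks, the JSON checkpoint file, and the toy data are assumptions made for this example, and only the parameter name save_combine mirrors the argument save.combine mentioned above.

```python
import json
import os
import tempfile

def combine_chunks(chunks, save_combine, out_path):
    """Row-bind chunk matrices and export the partial result after every
    `save_combine` chunks, and once more after the last chunk."""
    combined = []
    n_exports = 0
    for i, chunk in enumerate(chunks, start=1):
        combined.extend(chunk)  # append the probe-set rows of this chunk
        if i % save_combine == 0 or i == len(chunks):
            with open(out_path, "w") as fh:  # overwrite the previous export
                json.dump(combined, fh)
            n_exports += 1
    return combined, n_exports

# Toy data (hypothetical): 5 chunks, each with 2 probe sets and 3 samples.
chunks = [[[float(c)] * 3 for _ in range(2)] for c in range(5)]
out_path = os.path.join(tempfile.mkdtemp(), "combined.json")
X, n_exports = combine_chunks(chunks, save_combine=2, out_path=out_path)
print(len(X), len(X[0]), n_exports)  # 10 3 3
```

With save_combine = 2 and five chunks, the matrix is exported after chunks 2, 4, and 5, so that at most two chunks have to be re-combined after a failure.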

Finally, the gene expression matrix and, if asExprs = TRUE in startPLM, an exprSet object containing this matrix are stored in the subfolder output. An exprSet object is the typical output of preprocessing functions such as rma or just.rma. (Note that the class exprSet will be replaced by the class ExpressionSet in Bioconductor 2.0, which will be released in April 2007, so that the output of startPLM will then be an ExpressionSet object.)

If the preprocessing unexpectedly fails because of memory (or other) problems, the procedure can be restarted using the function restartPLM, which has only one argument, namely folder. Using the information stored in folder, restartPLM performs the same analysis as startPLM, starting (virtually) at the point at which the preprocessing was interrupted.
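The idea of resuming from chunk-wise stored results can be illustrated with a small Python sketch. It is not the code of restartPLM: the function run_chunks, the JSON file names, and the row-mean stand-in for the PLM fit are all hypothetical, and only the subdirectory name exprsChunks is taken from the text.

```python
import json
import os
import tempfile

def summarize(chunk):
    """Stand-in for the PLM fit (assumption): simply the mean of each row."""
    return [sum(row) / len(row) for row in chunk]

def run_chunks(folder, chunks):
    """Process chunks in order and skip those whose result file already exists.
    Returns the indices of the chunks that were actually computed."""
    out_dir = os.path.join(folder, "exprsChunks")
    os.makedirs(out_dir, exist_ok=True)
    computed = []
    for i, chunk in enumerate(chunks):
        path = os.path.join(out_dir, f"chunk{i}.json")
        if os.path.exists(path):
            continue  # already summarized in an earlier run
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(summarize(chunk), fh)
        os.replace(tmp_path, path)  # rename only after a complete write
        computed.append(i)
    return computed

folder = tempfile.mkdtemp()
chunks = [[[1.0, 2.0, 3.0]], [[4.0, 5.0, 6.0]], [[7.0, 8.0, 9.0]]]
first_run = run_chunks(folder, chunks[:2])  # simulate an interruption after two chunks
second_run = run_chunks(folder, chunks)     # restart: only the missing chunk is computed
print(first_run, second_run)  # [0, 1] [2]
```

Writing each result to a temporary file and renaming it afterwards prevents a half-written file from being mistaken for a finished chunk when the run is resumed.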

5.4 Application of startPLM

Using an AMD Athlon XP 3000+ machine with 1 GB of RAM, startPLM is applied to 500 Affymetrix HG-U133 Plus 2 chips. The whole computation takes less than 10 hours: the background correction and normalization require about 2 hours 43 minutes, the fitting of the PLMs 6 hours 53 minutes, and the fusion of the chunks 19 minutes. During this procedure, about 5.13 GB of files are stored in the directory specified by folder, whereas the final output of startPLM, i.e. the exprSet object, has a size of 195 MB.

Except for folder, the default settings of the arguments of startPLM are used in this computation. In particular, this means that each chunk file contains at least 11 · 100 · 500 = 550,000 intensities, since by default chunk.size = 100 probe sets are considered at once. Lowering chunk.size would increase the number of chunks but decrease the number of intensities saved in each of the txt files, which might reduce the run time and would make it possible to apply startPLM to many more than 500 microarrays.
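The lower bound on the chunk size stated above can be reproduced with a few lines of Python. The figures 11, 100, and 500 are taken from the text; the function name and the alternative value chunk.size = 25 are chosen only for illustration.

```python
# Values from the text: at least 11 probes per probe set and 500 arrays.
PROBES_PER_SET_MIN = 11
N_ARRAYS = 500

def min_intensities_per_chunk(chunk_size, n_arrays=N_ARRAYS,
                              probes_per_set=PROBES_PER_SET_MIN):
    """Lower bound on the number of intensities stored in one chunk file."""
    return probes_per_set * chunk_size * n_arrays

default_chunk = min_intensities_per_chunk(100)  # default chunk.size
smaller_chunk = min_intensities_per_chunk(25)   # hypothetical smaller chunk.size
print(default_chunk)  # 550000
print(smaller_chunk)  # 137500
```

Reducing chunk.size by a factor of four thus reduces the minimum number of intensities per chunk file by the same factor, while the number of chunk files grows accordingly.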

PART III

High Level Analysis of SNP Data

Chapter 6 Adapting DNA Microarray Methods to SNP Data

6.1 Introduction

An important goal of microarray studies is the construction of a diagnostic chip, i.e. a microarray composed of a small number of genes that makes it possible to determine whether a person has cancer, or which (sub-)type of cancer a patient exhibits. In more statistical terms, this means that a rule for predicting the cancer status of a person based on as few variables as possible should be constructed.

Since the result of the preprocessing is an m × n matrix X comprising the expression values of m genes (or more exactly, m probe sets) and n samples/patients, where m is typically in the tens of thousands, the number of genes has to be reduced dramatically. This reduction can, e.g., be done in the following two steps: Firstly, several tens to a few hundred genes are selected using, e.g., multiple testing (see Section 6.3). Secondly, a discrimination method, often Support Vector Machines or Random Forests (see Section 7.3), is applied to this set of variables to further reduce the number of genes using, e.g., a backward elimination approach (e.g., Guyon et al., 2002) or a stepwise selection procedure such as SFFS (Sequential Floating Forward Selection; Pudil et al., 1994, Somol et al., 1999), and to construct a classification rule.

This is, of course, a very simplified description of the way from the output of a preprocessing method to a classification rule. In the actual analysis, one also has to consider approaches such as (inner and outer) cross-validation to avoid selection bias. Furthermore, there are other reasonable strategies. For detailed descriptions of how gene expression data might be analyzed, see Gentleman et al. (2005), and for feature extraction in a more general setting, see, e.g., Guyon et al. (2006).

While a large number of papers concerned with high level analyses of gene expression data have been published in recent years, only a few methods dealing with the specific needs of the analysis of SNP data have been proposed. Two of these exceptions are Multifactor Dimensionality Reduction (MDR; Ritchie et al., 2001) and logic regression (Ruczinski et al., 2003). In a comparison of these procedures, Rabe (2004) shows that logic regression has several advantages over MDR. Logic regression is, e.g., faster, can handle a larger number of variables, uses a better search strategy, and leads to classification rules that are easier to interpret. Furthermore, in MDR, a new observation can only be classified if this person exhibits a combination of genotypes that has also been observed in the training set. We therefore exclude MDR from our analyses, and take a closer look at logic regression in Chapters 7 and 8.

Since the goals of the analysis of gene expression and genotype data are similar (e.g., identifying genes/SNPs associated with the covariate of interest, classifying patients using genetic markers), one solution to the problem of how to analyze SNP data is to adapt methods developed particularly for the analysis of DNA microarrays to genotype data. These modified approaches can then not only be applied to genotype array data, but also to SNP data measured with, e.g., MALDI-TOF-MS.

In the following sections, this is exemplified by modifying three popular DNA microarray procedures: a method for imputing missing values (Section 6.2), a multiple testing approach (Section 6.3), and a discrimination procedure (Section 6.4). The latter two are improved versions of the methods presented in Schwender (2005).

In each of these sections, we propose an algorithm based on matrix algebra that enables the simultaneous computation of all statistics employed by the respective method. These procedures reduce the run time in R substantially in comparison to approaches in which the statistics are determined individually. Since in these algorithms element-wise matrix calculation is used, notations for these computations are introduced in the following definition.

Definition 6.1 (Element-wise Matrix Calculation)

Let $M$ and $N$ be two $R \times C$ matrices, and $n$ be a numerical value. Then,

(a) $M \ast N$ is an $R \times C$ matrix with elements $m_{rc} \cdot n_{rc}$, $r = 1, \ldots, R$, $c = 1, \ldots, C$,

(b) $\frac{M}{N}$ is an $R \times C$ matrix with entries $\frac{m_{rc}}{n_{rc}}$,
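Parts (a) and (b) of the definition can be illustrated by a short Python sketch using nested lists as matrices. The helper names elementwise, ew_product, and ew_quotient are introduced only for this example; in R, the same operations are obtained with the operators * and / applied to two matrices of equal dimension.

```python
def elementwise(M, N, op):
    """Apply the binary operation `op` entry by entry to two R x C matrices."""
    if len(M) != len(N) or any(len(a) != len(b) for a, b in zip(M, N)):
        raise ValueError("M and N must have the same dimensions")
    return [[op(m, n) for m, n in zip(row_m, row_n)]
            for row_m, row_n in zip(M, N)]

def ew_product(M, N):
    """Part (a): matrix with entries m_rc * n_rc."""
    return elementwise(M, N, lambda m, n: m * n)

def ew_quotient(M, N):
    """Part (b): matrix with entries m_rc / n_rc."""
    return elementwise(M, N, lambda m, n: m / n)

M = [[1.0, 2.0], [3.0, 4.0]]
N = [[2.0, 4.0], [6.0, 8.0]]
product = ew_product(M, N)
quotient = ew_quotient(M, N)
print(product)   # [[2.0, 8.0], [18.0, 32.0]]
print(quotient)  # [[0.5, 0.5], [0.5, 0.5]]
```

Note that the element-wise product differs from the usual matrix product: each entry of the result depends only on the entries of M and N at the same position.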
