Comparison to other Solutions - Schmidberger, Markus (2009): Parallel Computing for Biol

5.3 Results

5.3.3 Comparison to other Solutions

The presented affyPara package is not the only solution for preprocessing large amounts of microarray data. Some Bioconductor packages provide optimized functions. For example, the affy package offers the function justRMA(), which reads CEL files directly from the working directory and converts the raw data, without creating an AffyBatch object, to an expression measure using robust multi-array average. In addition, the aroma.affymetrix package uses efficient data structures that read data from a file on request instead of loading the whole file into main memory [Ben04]. These three solutions are compared in Table 5.2.

All packages use the R language for the basic operations. The affyPara package additionally requires an API for parallel computing (e.g., MPI), and the other solutions in the Bioconductor packages use the C language to accelerate the methods. The aroma.affymetrix package uses the file system for data handling and is therefore the slowest implementation in terms of computation time. The affyPara package uses the main memory of several computers and is therefore the fastest solution for large numbers of arrays. A simple time comparison of rma background correction and quantile normalization is shown in Figure 5.13. For small numbers of arrays the original affy code is the fastest solution; with more than 50 arrays the affyPara package is the fastest. For 200 arrays the affyPara package is up to 13 times faster than the aroma.affymetrix package and twice as fast as the code from the affy package.

All three solutions are open-source, run on the standard operating systems, and allow third-party extensions. Unlike the other solutions, the affyPara package only works for expression arrays. The described parallelization ideas can

| | affyPara | aroma.affymetrix | Affymetrix packages in Bioconductor |
|---|---|---|---|
| Programming language | R + MPI | R | R (+ C) |
| Depends on | affy, snow | aroma.core, affxparser, aroma.light, R.huge, aroma.apd | Biobase |
| Data handling | Memory | File System | Memory |
| Chip types | expression arrays | expression arrays, SNP chips, exon arrays, ... | expression arrays, SNP chips, exon arrays, ... |
| Usability | easy | difficult | easy |
| Max. number of arrays | limited by number of computers | limited by size of hard disk | limited by main memory |
| Computation time | fast | slow | medium |
| Operating system | Linux, Windows, Mac OS X | Linux, Windows, Mac OS X | Linux, Windows, Mac OS X |
| Third-party extensions | yes | yes | yes |
| Open-source | yes | yes | yes |

Table 5.2: Comparison of the affyPara package to the aroma.affymetrix package and other solutions.

[Figure 5.13: two panels, "BGC" and "Normalization"; x-axis: Arrays; y-axis: Time in sec; series: aroma, affy, affyPara]

Figure 5.13: Computation time for rma background correction and quantile normalization using code from the affy, affyPara and aroma.affymetrix packages.


there is a limitation in the maximum number of microarrays that can be preprocessed. Given the size of existing supercomputers and grid environments, the affyPara package should be able to preprocess the largest amount of microarray data. Using the cluster pool at the IBE, about 16,000 microarrays (32 nodes × approximately 500 microarrays) of the type HG-U133A can be preprocessed using the function preproPara().

The usability of the affyPara package is very good for trained R and Bioconductor users. Some simple code examples for creating an AffyBatch object and applying rma background correction with the affy and affyPara packages are shown in the following lines:

R> library(affy)
R> AB <- ReadAffy()
R> AB_bgc <- bg.correct(AB, method = "rma")

R> library(affyPara)
R> makeCluster(5, type = "MPI")
R> AB <- ReadAffy()
R> AB_bgc <- bgCorrectPara(AB, method = "rma")
R> stopCluster()

To use the power of parallel computing, the user only needs a working computer cluster and cluster start or administration programs (e.g., SGE). The R syntax is very similar, and only two additional lines for starting and stopping the cluster are required. The aroma.affymetrix package, in contrast, requires a complex data structure on the hard disk, and its commands differ substantially from those of other Bioconductor packages.

R> library(aroma.affymetrix)
R> cdf <- AffymetrixCdfFile$fromChipType("HG-U133A")
R> cs <- AffymetrixCelSet$fromName(name, tags, chipType = cdf)
R> bc <- RmaBackgroundCorrection(cs)
R> csBC <- process(bc)
R> AB_bgc <- extractAffyBatch(csBC)

5.4 Summary

For preprocessing of high-density oligonucleotide microarrays, the affyPara package [SM08, SM09], based on the snow package, was developed and is available at the Bioconductor repository: http://www.bioconductor.org/packages/release/bioc

Existing statistical algorithms and data structures had to be adjusted and reformulated for parallel computing. Using the parallel infrastructure, the methods could be enhanced and new methods are available. Parallelization of existing preprocessing methods produces, within machine accuracy, the same results as the serial methods. Partitioning the data and distributing it to several nodes solves the main memory problems and accelerates the methods by up to a factor of 15 for 300 arrays or more. Due to communication limitations in the snow package, good performance is achieved with about 20 arrays per node. For the complete rma preprocessing as many processors as possible can be used, but there should be no fewer than 10 arrays per processor.
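The two rules of thumb above (about 20 arrays per node in general, and no fewer than 10 arrays per processor for the complete rma preprocessing) can be turned into a small planning helper. The following is only a sketch: the function suggest_nodes() and its arguments are hypothetical and not part of affyPara or snow, and the thresholds are simply the numbers quoted in the text.

```r
# Hypothetical helper (NOT part of affyPara): suggests how many nodes to
# request, using the rules of thumb quoted in the summary.
suggest_nodes <- function(n_arrays, available_nodes, complete_rma = FALSE) {
  stopifnot(n_arrays >= 1, available_nodes >= 1)
  # About 20 arrays per node in general; for the complete rma
  # preprocessing use as many nodes as possible, but keep at least
  # 10 arrays on each of them.
  arrays_per_node <- if (complete_rma) 10 else 20
  wanted <- max(1, floor(n_arrays / arrays_per_node))
  # Never ask for more nodes than the cluster offers.
  min(available_nodes, wanted)
}

suggest_nodes(300, available_nodes = 32)                       # 15
suggest_nodes(300, available_nodes = 32, complete_rma = TRUE)  # 30
suggest_nodes(1000, available_nodes = 32, complete_rma = TRUE) # 32
```

The result would then be passed as the node count when starting the cluster; for very small data sets the helper falls back to a single node, which matches the observation that the serial affy code is fastest for few arrays.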

A new limitation will be imposed by the memory size of the ExpressionSet class, because the analysis will then have to be performed on these much larger expression sets. However, no data set is currently available on the market that would cause main memory problems with the

Chapter 6 Parallel Computing in Next-Generation Sequence Data

This chapter demonstrates the power and the problems of using parallel computing to address the problems described in Chapter 3.3 and the new challenges posed by next-generation sequence data. The amount of data, data handling in parallel computing environments, and communication costs are especially critical in the presented examples.

Section 6.1 outlines general ideas for parallel computing in high-throughput sequencing. An existing parallel implementation for data I/O in the ShortRead package by Martin Morgan (FHCRC, Seattle, WA, USA) is presented. Section 6.3 discusses a parallel test implementation in the BSgenome package. The bsapply() function, which applies a function to each chromosome in a genome, was parallelized using the Rmpi and/or snow packages. Additional standard optimization strategies for parallel computing are described, applied and tested.

6.1 Ideas

Due to the apply-like code structure of existing data analysis methods, the first idea for parallel computing in next-generation sequence data is the use of data decomposition. Depending on the kind of application and on the sequencing technique or machine used, there will be different ways of partitioning, distributing and implementing the data handling. This chapter examines data decomposition for next-generation sequence data and uses the snow and/or Rmpi packages for parallel computing.
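The data decomposition idea can be illustrated with a minimal sketch. As assumptions, it uses the parallel package that ships with R (its cluster functions derive from snow) instead of snow or Rmpi directly, a toy list of random reads per chromosome instead of real sequence data, and a hypothetical per-chromosome task gc_content(); none of these come from the implementations discussed in this chapter.

```r
library(parallel)  # ships with R; offers snow-style cluster functions

# Toy stand-in for sequence data: random 36-base reads per chromosome.
set.seed(1)
genome <- lapply(c(chr1 = 200, chr2 = 150, chr3 = 100), function(n) {
  vapply(seq_len(n), function(i) {
    paste(sample(c("A", "C", "G", "T"), 36, replace = TRUE), collapse = "")
  }, character(1))
})

# Hypothetical per-chromosome task: GC fraction over all reads.
gc_content <- function(reads) {
  bases <- unlist(strsplit(reads, "", fixed = TRUE))
  mean(bases %in% c("G", "C"))
}

# Data decomposition: each chromosome is an independent work unit that
# is shipped to a worker, processed there, and collected again.
cl <- makeCluster(2)                         # two local worker processes
gc_par <- parLapply(cl, genome, gc_content)
stopCluster(cl)

# The serial apply-style call has the same structure and result.
gc_ser <- lapply(genome, gc_content)
stopifnot(isTRUE(all.equal(unlist(gc_par), unlist(gc_ser))))
```

The serial lapply() and the parallel parLapply() calls differ only in the cluster argument, which is why apply-like code is a natural first candidate for this kind of parallelization. Note that the whole chromosome is sent to the worker, so communication cost grows with the data size.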

Parallel computing ideas from other research areas can be adapted to improve the methods, especially for pattern matching and alignment. For example, [Mut00] presents a simple parallel algorithm that solves the multi-pattern matching problem with optimal speedup. The implementation uses a Monte-Carlo algorithm based on fingerprint functions. Furthermore, parallel implementations exist in the literature for pair-wise alignment using the Needleman-Wunsch algorithm [Bar02] and for local alignment using the Smith-Waterman algorithm [RS00]. These methods are available in serial form in the Biostrings package, but are not yet implemented in parallel in the Bioconductor project or used in next-generation sequence analyses.
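For orientation, the dynamic programming recurrence behind Needleman-Wunsch global alignment can be sketched in a few lines of base R. This is a plain serial illustration with an assumed simple scoring scheme (match = 1, mismatch = -1, linear gap penalty = -1); it is neither the Biostrings implementation nor the parallel algorithm of [Bar02], and the function name nw_score() is hypothetical.

```r
# Serial sketch of the Needleman-Wunsch global alignment score.
# Scoring parameters are illustrative assumptions.
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -1) {
  x <- strsplit(a, "", fixed = TRUE)[[1]]
  y <- strsplit(b, "", fixed = TRUE)[[1]]
  n <- length(x)
  m <- length(y)
  # score[i + 1, j + 1] holds the best score for aligning the first i
  # characters of a with the first j characters of b.
  score <- matrix(0, nrow = n + 1, ncol = m + 1)
  score[, 1] <- gap * (0:n)   # prefix of a aligned against gaps only
  score[1, ] <- gap * (0:m)   # prefix of b aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      score[i + 1, j + 1] <- max(score[i, j] + s,        # (mis)match
                                 score[i, j + 1] + gap,  # gap in b
                                 score[i + 1, j] + gap)  # gap in a
    }
  }
  score[n + 1, m + 1]
}

nw_score("GATTACA", "GATTACA")  # 7: seven matches
nw_score("ACGT", "AGT")         # 2: three matches and one gap
```

Each cell depends only on its upper, left and upper-left neighbours, so cells on the same anti-diagonal are independent of each other. Aligning many read pairs is in addition an apply-like task, since every pair can be scored independently.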

Optimized parallel computing paradigms for large numbers of global variables exist. NetWorkSpaces and Sleigh (R package NWS) use a central server to store all data. This could be used to make the sequences available to every worker, so that data communication is only required for the requested sequences. This solution has not yet been tested for next-generation sequence data. In addition, first ideas for using graphics processing units (GPUs) for high-throughput sequence alignment have been published [STDV07]. For example, on a single workstation the processing speed of GPUs can be used to improve pattern matching. Here, too, no integration with the R language is available at the moment.

In document Schmidberger, Markus (2009): Parallel Computing for Biological Data. Dissertation, LMU München: Fakultät für Mathematik, Informatik und Statistik (Page 95-100)