This work focuses on biological data arising from the study of genomes, especially microarray data and data from next-generation sequencing. There are, however, many other kinds of biological data that are the subject of statistical computing, and it is not possible to list them all. In many cases, computational tools are required to generate biological knowledge from data in order to better understand living systems. Following [Xio06], the main analysis applications of bioinformatics can be divided into three subfields (see Figure 2.7): sequence, structural, and functional analysis.
Figure 2.7: Overview of subfields of bioinformatics [Xio06] with structure, sequence and function analysis.
The area of sequence analysis includes sequence alignment, sequence database searching, motif finding and pattern discovery, gene and promoter finding, reconstruction of evolutionary relationships, and genome assembly and comparison. This thesis focuses on classical tools for next-generation sequence data. Other kinds of sequencing data come from ChIP-Sequencing (ChIP-Seq), used to analyze protein interactions with DNA, and from RNA-Sequencing (RNA-Seq), used to sequence cDNA in order to get information about a sample’s RNA content.
Structural analysis includes protein and nucleic acid structure analysis, comparison, classification, and prediction.
Functional analysis includes gene expression profiling, protein-protein interaction prediction, protein subcellular localization prediction, metabolic pathway reconstruction, and simulations. In addition to expression arrays, relevant data include, for example, data from tiling arrays (ChIP-chip), which are very similar to microarray data. Another example of functional analysis is flow cytometry, a technique for counting and examining microscopic particles suspended in a stream of fluid. It allows simultaneous multiparametric analysis of the physical and/or chemical characteristics of single cells flowing through an optical and/or electronic detection apparatus. Certain applications may include physical sorting of components.
The analysis of biological data often generates new problems and challenges, which in turn drive the development of new and better computational tools.
Chapter 3
Bioinformatics Using R and Bioconductor
The open-source programming language R and the open-source Bioconductor project provide a wide spectrum of computational tools for the analysis and comprehension of genomic data. This chapter introduces existing methods for the analysis of DNA microarray data and next-generation sequence data. It discusses computational problems and challenges of existing programs. At the end, it examines solutions to improve performance and to allow analyses of very large amounts of biological data.
3.1 R and Bioconductor
R [R D08a] is an open-source programming language and software environment for statistical computing and graphics. The core R installation provides the language interpreter and many statistical and modeling functions. R was originally created by R. Ihaka and R. Gentleman in 1993 and is now being developed by the R Development Core Team. R is highly extensible through the use of packages. These are libraries for specific functions or specific areas of study, frequently created by R users and distributed under suitable open-source licenses. A large number of packages are available at the Comprehensive R Archive Network (CRAN) at http://CRAN.R-project.org or the Bioconductor repository [GCB+04] at http://www.bioconductor.org. The R language was developed to provide a powerful and extensible environment for statistical and graphical techniques.
Bioconductor is an open-source and open-development software project and R package repository for the analysis and comprehension of genomic data. It is based primarily on the R programming language. The Bioconductor project was started in the fall of 2001 and is overseen by the Bioconductor core team, based primarily at the Fred Hutchinson Cancer Research Center (FHCRC, Seattle, WA, USA), with other members coming from various US and international institutions. There are currently (release 2.4, 21 April 2009) 320 contributed packages in Bioconductor’s development repository. Releases occur twice a year, normally a few days after a new release of the R language. The project also maintains more than 400 annotation data packages that aid in the analysis of data from microarray experiments. Table 3.1 tracks the growth of the project over the semi-annual releases.

Release      1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9
# Packages    20   30   49   81  100  123  141  172  188

Release      2.0  2.1  2.2  2.3  2.4
# Packages   214  233  260  294  320

Table 3.1: Number of contributed packages included in each of the Bioconductor releases.

The download statistics (http://www.bioconductor.org/packages/stats/) for Bioconductor software packages report 150,000 package downloads per year. The repository is split into three parts:
Software: Packages for diverse areas of high-throughput biological analysis.
Metadata: Bioconductor ’Annotation’ packages contain biological information about microarray probes and the genes they are meant to interrogate, or contain ENTREZ gene-based annotations of whole genomes.
Experiment Data: Bioconductor ’Experiment Data’ packages contain example data sets directly stored in R variables.
Packages in the software repository mainly address the development of high-quality algorithms for genome data analysis. The packages used for microarray analyses in this work, which relate to the aspects described in Section 2.2, are presented in Section 3.2. Since the beginning of 2008, a set of new or expanded packages has introduced tools for next-generation DNA sequence data. Details of these packages are presented in Section 3.3.
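
The growth recorded in Table 3.1 can be quantified with a few lines of arithmetic. The following is a minimal sketch, written in Python purely for illustration. The package counts are copied from Table 3.1, while the names RELEASES, PACKAGES, packages_added, and mean_added are assumptions introduced here and are not part of R or Bioconductor.

```python
# Package counts per Bioconductor release, copied from Table 3.1.
RELEASES = ["1.1", "1.2", "1.3", "1.4", "1.5", "1.6", "1.7", "1.8", "1.9",
            "2.0", "2.1", "2.2", "2.3", "2.4"]
PACKAGES = [20, 30, 49, 81, 100, 123, 141, 172, 188,
            214, 233, 260, 294, 320]


def packages_added(counts):
    """Return the number of packages gained between consecutive releases."""
    return [later - earlier for earlier, later in zip(counts, counts[1:])]


def mean_added(counts):
    """Return the average number of packages gained per release step."""
    gains = packages_added(counts)
    return sum(gains) / len(gains)


if __name__ == "__main__":
    # Each gain belongs to the later release of the pair it was computed from.
    for release, gain in zip(RELEASES[1:], packages_added(PACKAGES)):
        print(f"release {release}: +{gain} packages")
    print(f"mean gain per release: {mean_added(PACKAGES):.1f}")
```

With the counts from Table 3.1, the repository gained 300 packages over 13 release steps, which is roughly 23 packages per release.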