to estimate the classification or prognostic error [RHPM04], will be very time consuming. Here, preprocessing has to be part of the cross-validation and resampling strategy which is necessary to estimate the rule’s prediction quality honestly.
3.3 Next-Generation Sequence Data
The open-source programming language R and the open-source project Bioconductor support the rapid developments in next-generation sequencing. The packages in the Bioconductor repository focus especially on downstream (post-alignment) analysis, quality assessment, data manipulation, ChIP-seq and other peak calling, and visualization problems. Because the technology is very new, most packages are still in development and their code structure is unstable. At the moment (August 2009), there are no publications about the new Bioconductor packages and no relevant textbook on high-throughput sequence analysis. This section gives an overview of the latest packages for next-generation sequence data in the Bioconductor repository.
3.3.1 Available Bioconductor Packages
All presented packages are available at the Bioconductor website. For more details see the vignettes or help files of the packages.
The Biostrings Package
The Biostrings package offers memory-efficient string containers, string matching algorithms, and other utilities for fast manipulation of large biological sequences or sets of sequences. It is designed in particular for the representation of DNA, RNA, amino acid, and general biological strings. For sequence manipulation it provides functions for sequence summaries (e.g., alphabetFrequency()), pattern matching (matchDNApattern()), subsequences, 'Views' and 'masks', and alignments (global, local, ends-free, ...).
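The following minimal sketch illustrates the sequence summary and pattern matching functionality on a toy sequence. It assumes that the Biostrings package is installed; the sequence itself is made up for illustration. The sketch uses matchPattern(), which is the pattern matching function exported by current Biostrings releases, so the function name may differ from the one in older versions of the package.

```r
# Minimal sketch, assuming the Bioconductor package Biostrings is installed.
library(Biostrings)

# A short, made-up DNA sequence stored in a memory-efficient container.
dna <- DNAString("ACGTTGCAACGTAC")

# Sequence summary: counts of A, C, G, T (and "other" letters).
freq <- alphabetFrequency(dna, baseOnly = TRUE)
print(freq)

# Exact pattern matching: returns 'Views' on the subject sequence.
hits <- matchPattern("ACGT", dna)
print(start(hits))  # start positions of all matches

# A simple sequence manipulation: the reverse complement.
print(reverseComplement(dna))
```

The result of matchPattern() is a set of 'Views' on the original sequence, so the matched subsequences are not copied.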
The BSgenome Package
The BSgenome package provides an infrastructure for Biostrings-based genome data packages. With this package, new packages containing genome data stored in classes of the Biostrings package can be built. Building such a package requires files containing the sequence data and files containing the mask data. The package thus provides a foundation for representing whole-genome sequences. At the moment, 14 organisms represented by 21 distinct genome builds are available:
R> library(BSgenome)
R> available.genomes()
 [1] "BSgenome.Amellifera.BeeBase.assembly4"
 [2] "BSgenome.Amellifera.UCSC.apiMel2"
 [3] "BSgenome.Athaliana.TAIR.01222004"
 [4] "BSgenome.Athaliana.TAIR.04232008"
 [5] "BSgenome.Btaurus.UCSC.bosTau3"
 [6] "BSgenome.Btaurus.UCSC.bosTau4"
 [7] "BSgenome.Celegans.UCSC.ce2"
 [8] "BSgenome.Cfamiliaris.UCSC.canFam2"
 [9] "BSgenome.Dmelanogaster.UCSC.dm2"
[10] "BSgenome.Dmelanogaster.UCSC.dm3"
[11] "BSgenome.Drerio.UCSC.danRer5"
[12] "BSgenome.Ecoli.NCBI.20080805"
[13] "BSgenome.Ggallus.UCSC.galGal3"
[14] "BSgenome.Hsapiens.UCSC.hg17"
[15] "BSgenome.Hsapiens.UCSC.hg18"
[16] "BSgenome.Hsapiens.UCSC.hg19"
[17] "BSgenome.Mmusculus.UCSC.mm8"
[18] "BSgenome.Mmusculus.UCSC.mm9"
[19] "BSgenome.Ptroglodytes.UCSC.panTro2"
[20] "BSgenome.Rnorvegicus.UCSC.rn4"
[21] "BSgenome.Scerevisiae.UCSC.sacCer1"
Additionally, there are some functions to manipulate and process whole-genome data. The bsapply() function applies a function FUN to each chromosome in a genome using the parameters contained in a BSParams object. This object holds the various parameters needed to configure the bsapply() function.
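As a hedged illustration of this mechanism, the sketch below counts the nucleotide frequencies on every chromosome of the worm genome. It assumes that the data package BSgenome.Celegans.UCSC.ce2 from the list above is installed; the choice of genome, the use of alphabetFrequency() as FUN, and the exclusion of the mitochondrial chromosome are illustrative assumptions and not part of the original text.

```r
# Minimal sketch, assuming BSgenome and the data package
# BSgenome.Celegans.UCSC.ce2 are installed.
library(BSgenome.Celegans.UCSC.ce2)

# The BSParams object bundles the genome (X), the function to apply (FUN),
# and a pattern of sequence names to skip (here: the mitochondrial chromosome).
params <- new("BSParams",
              X = Celegans,
              FUN = alphabetFrequency,
              exclude = "M")

# Apply FUN to each remaining chromosome; by default the result is a list
# with one element per chromosome.
chrom_freq <- bsapply(params)
print(names(chrom_freq))
```

Keeping the configuration in a separate object means that the same parameter set can be reused for several calls to bsapply().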
The ShortRead Package
The ShortRead package offers base classes, functions, and methods for the representation of high-throughput, short-read sequence data. It targets in particular data management, I/O, manipulation, and quality assessment of single-end Solexa short-read data. For data management and I/O, the package provides functions to navigate the output directory structure of the Solexa Genome Analyzer sequencing machine and to read and filter the raw data. Additionally, there are functions (e.g., qa()) to summarize read and alignment quality and to create quality reports.
Further Packages
Besides the three packages described above, there are several other packages for next-generation sequence analysis in the Bioconductor repository, and some others in development:
IRanges & genomeIntervals: These packages offer an emerging infrastructure for representing very large data objects, for range-based representations, and for manipulating intervals on sequences.
rtracklayer: Extensible framework for interacting with multiple genome browsers (currently UCSC built-in) and manipulating annotation tracks in various formats (currently GFF, BED and WIG built-in).
HilbertVis & HilbertVisGUI: These packages provide creative approaches for visualization of long vectors of integer data (or sequence data), using space-filling (Hilbert) curves that maintain, as much as possible, the spatial information implied by linear chromosomes.
chipseq: (in development) A package with tools that help to process short-read data from ChIP-seq experiments.
3.3.2 Computational Problems & Challenges
The computational problems are very similar to those described for microarray data (see Section 3.2.3). The continuing exponential accumulation of full genome data, including full diploid human genomes, creates new challenges not only for understanding genomic structure, function, and evolution, but also for the storage, navigation, and privacy of genomic data. Regardless of the software used for next-generation sequence data, current research and computations are limited by the available computer hardware.
Data from high-throughput sequencing experiments are very large. They consist of tens to hundreds of millions of 'reads' (each tens to hundreds of nucleotides long) and are coupled with whole-genome sequences (for example, 3 billion nucleotides in the human genome). Currently, publicly available genomes are typically stored as flat text files in GenBank (http://www.ncbi.nlm.nih.gov/Genbank/), but this approach is unlikely to scale up. Storing the diploid genomes of all currently living humans with this simple approach would take, without counting headers or any additional annotations, on the order of 36 × 10^18 bytes, or 36 exabytes, an amount difficult to store or download over the Internet, even using standard compression technologies (e.g., gzip) [BWB09]. First developments in data structures and algorithms addressing these problems are in progress.
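The order of magnitude of this estimate can be checked with a short back-of-the-envelope calculation in base R. The inputs are assumptions chosen to reproduce the quoted figure: one byte per nucleotide in a flat text file, two copies of the roughly 3 billion nucleotide genome per person, and a world population of about 6 billion.

```r
# Back-of-the-envelope check of the storage estimate (all inputs are assumptions).
bases_per_haploid_genome <- 3e9  # about 3 billion nucleotides (human genome)
ploidy                   <- 2    # diploid genome: two copies per person
bytes_per_base           <- 1    # one text character per nucleotide
population               <- 6e9  # assumed number of living humans

bytes_per_person <- bases_per_haploid_genome * ploidy * bytes_per_base
total_bytes      <- bytes_per_person * population
total_exabytes   <- total_bytes / 1e18  # 1 exabyte = 10^18 bytes

cat(sprintf("Per person: %.0f gigabytes\n", bytes_per_person / 1e9))
cat(sprintf("All humans: %.0f exabytes\n", total_exabytes))
```

Under these assumptions a single uncompressed diploid genome needs about 6 gigabytes, so the total is dominated by the population size rather than by any per-genome overhead.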
Furthermore, first-generation approaches with relatively short reads, restricted application domains, and small numbers of sample individuals are being supplanted by newer technologies producing longer and more numerous reads. New protocols and the intrinsic curiosity of biologists are expanding the range of questions being addressed, and creating a concomitant need for flexible and high-performance software analysis tools. The increasing affordability of high-throughput sequencing technologies means that multi-sample studies with non-trivial experimental designs are just around the corner.