Removing Sequential Bottlenecks in
Analysis of Next-Generation
Sequencing Data
Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang The Ohio State University
HiCOMB 2014
Outline
• Introduction
• Sequence Data Format Converter Design
• Experimental Results
• Conclusion
Explosion of Next-Generation
Sequencing Data
• NGS Advantages
– Faster and cheaper
• E.g., over one billion short reads per instrument run
– More accurate: higher resolution and deeper coverage
• Challenges
– Urgent need for turning raw data into knowledge
– Parallelism is the key
Historical Trends in Storage Prices
vs. DNA Sequencing Costs
[Figure: hard disk storage price (MB per dollar) and DNA sequencing cost (base pairs per dollar), log scale, 1990 to 2010; periods marked: pre-next-generation sequencing and next-generation sequencing]
Varieties of NGS Data Formats
• Different Formats
– SAM (Sequence Alignment/Map)
• The de-facto text format for storing large nucleotide sequence alignments
– BAM (Binary Alignment/Map)
• The compressed, indexable, binary form of the SAM format
• Indexing is supported by BAI (BAM Index) file
– Other formats
• BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.
Analysis Pipeline
• Current Pipeline
– Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST
• Reality
– Cross-utilization Problem: sequencing data ≠ input
– Some other analysis steps stay sequential
Motivation: Removing Other
Sequential Bottlenecks
• Parallel Format Conversion
– Current format conversion commonly makes use of a single core
– Current downstream tools may not be exchanged between different aligners
– Not hard to implement but important to scale out
• Parallelizing Certain Statistical Analysis Steps
– E.g., parallel analysis on the histogram data
Framework
• Sequence Data Format Converter
– Input: SAM/BAM
– Output:
• BAM/SAM
• FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML
• Statistical Analysis Module
– Parallelize other statistical analysis steps
– E.g., non-local means (NL-Means) and false discovery rate (FDR) computation
• Only the first component is discussed today
Sequence Data Format
Converter
• 3 Converter Instances
– SAM Format Converter
– BAM Format Converter
– Preprocessing-Optimized SAM Format Converter
• Supports partial format conversion on a specific chromosome region
SAM Format Converter
• No communication among processes after partitioning
• Partitioning is the key step for parallelization
• Extensibility and programmability
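As an illustration of the per-record work each process does after partitioning, here is a minimal Python sketch that converts one SAM alignment line to BED and FASTA. The function names (`reference_length`, `sam_to_bed`, `sam_to_fasta`) are hypothetical and not taken from the framework; the field order and CIGAR semantics follow the SAM specification.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume reference bases (SAM specification).
REF_CONSUMING = set("MDN=X")


def reference_length(cigar):
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def sam_to_bed(line):
    """Convert one SAM alignment line to a 6-column BED line (None if unmapped)."""
    f = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = f[0], int(f[1]), f[2], int(f[3]), f[4], f[5]
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped reads have no genomic interval
    start = pos - 1  # SAM positions are 1-based; BED is 0-based, half-open
    end = start + reference_length(cigar)
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


def sam_to_fasta(line):
    """Emit the read name and the SEQ field exactly as stored in the SAM line."""
    f = line.rstrip("\n").split("\t")
    return f">{f[0]}\n{f[9]}"


record = "r1\t0\tchr1\t100\t60\t5M2I3M1D4M\t*\t0\t0\tACGTACGTACGTAC\tIIIIIIIIIIIIII"
print(sam_to_bed(record))
print(sam_to_fasta(record))
```

Because each record is converted independently, a process needs nothing beyond its own partition, which is consistent with the no-communication property noted on the slide.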
Partitioning Algorithm
• Key: each SAM record is delimited by a line break
1. Initial even partitioning
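A minimal Python sketch of line-aligned partitioning follows. Only the initial even partitioning and the line-break delimiter come from the slide; moving each interior cut forward to the next line break is an assumed way to complete the algorithm, and `partition_by_lines` and `read_partition` are hypothetical names.

```python
import os


def partition_by_lines(path, num_parts):
    """Return num_parts (start, end) byte ranges that fall on line boundaries."""
    size = os.path.getsize(path)
    # Step 1: initial even partitioning by byte offset.
    cuts = [size * i // num_parts for i in range(num_parts + 1)]
    with open(path, "rb") as f:
        # Assumed step 2: move each interior cut forward to the start of the
        # next line, so that no record is split between two partitions.
        for i in range(1, num_parts):
            if cuts[i] == 0:
                continue
            f.seek(cuts[i] - 1)
            f.readline()  # consume the rest of the line holding byte cuts[i] - 1
            cuts[i] = f.tell()
    return [(cuts[i], cuts[i + 1]) for i in range(num_parts)]


def read_partition(path, start, end):
    """Read the bytes of one partition; each process would call this once."""
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start)
```

Each process can compute its own range from the file size and its rank alone, so the adjustment does not require any coordination. Some ranges may be empty when the file has fewer lines than partitions.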
BAM Format Converter
• Challenge
– No explicit delimiter between records
– Even partitioning -> unparsable records
• Solution: add a preprocessing phase
– Partition data by supporting random access
– This preprocessing phase cannot be parallelized
BAMX and BAIX
• BAMX (BAM eXtended) File
– Transform each varying-length BAM record into a regular-layout BAMX record
– Align varying-length BAM fields by padding
• BAIX (BAI eXtended) File
– Index file of the BAMX file
– Store the alignment starting positions in BAM (logically) and in BAMX (physically)
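The padding idea can be sketched in Python as follows. The record layout here (a 4-byte position, a 32-byte name, and a 128-byte sequence) is purely hypothetical, since the slides do not give the actual BAMX layout; the point is only that fixed-width records make the offset of record i computable as i times the record size.

```python
import struct

# Hypothetical fixed-width layout: int32 position, 32-byte name, 128-byte sequence.
RECORD = struct.Struct("<i32s128s")


def pack_record(pos, name, seq):
    """Pad variable-length fields to fixed width (struct pads 's' fields with NULs)."""
    name_b, seq_b = name.encode(), seq.encode()
    if len(name_b) > 32 or len(seq_b) > 128:
        raise ValueError("field too long for this illustrative layout")
    return RECORD.pack(pos, name_b, seq_b)


def unpack_record(buf):
    """Strip the padding and recover the original fields."""
    pos, name, seq = RECORD.unpack(buf)
    return pos, name.rstrip(b"\0").decode(), seq.rstrip(b"\0").decode()


def read_record(data, i):
    """Random access: record i starts at byte i * RECORD.size."""
    off = i * RECORD.size
    return unpack_record(data[off:off + RECORD.size])


reads = [(100, "r1", "ACGT"), (250, "read_two", "GGCCTTAA"), (900, "r3", "T")]
data = b"".join(pack_record(*r) for r in reads)
print(read_record(data, 1))
```

The trade-off is extra space for padding in exchange for partitioning that needs no scan of the file.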
Partial Conversion
• If only interested in a subset, no need for full conversion
• Based on the BAIX file
– Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)
– Evenly partition the subset and proceed in parallel
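The two steps above can be sketched in Python under a simplifying assumption: the index is modeled as a sorted list of alignment starting positions, where entry i corresponds to record i in the fixed-width file. The names `locate_subset` and `split_evenly` are hypothetical, and the real BAIX layout may differ.

```python
from bisect import bisect_left, bisect_right


def locate_subset(index_positions, region_start, region_end):
    """Binary search for records whose start lies in [region_start, region_end]."""
    lo = bisect_left(index_positions, region_start)
    hi = bisect_right(index_positions, region_end)
    return lo, hi  # half-open range of record numbers


def split_evenly(lo, hi, num_workers):
    """Split the record range [lo, hi) into num_workers near-equal ranges."""
    n = hi - lo
    return [
        (lo + n * w // num_workers, lo + n * (w + 1) // num_workers)
        for w in range(num_workers)
    ]


positions = [10, 20, 20, 35, 50, 70, 90]  # sorted alignment start positions
lo, hi = locate_subset(positions, 20, 70)
print(lo, hi)
print(split_evenly(lo, hi, 2))
```

With fixed-width records, a record range maps to a byte range by multiplying by the record size, so each worker can seek directly to its share of the subset. This sketch selects records by starting position only, as the slide describes.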
Preprocessing-Optimized SAM
Format Converter
• Main Ideas
– Preprocessing can also optimize the SAM format conversion
– Such preprocessing can be parallelized because of the easy partitioning on the SAM format
Outline
• Introduction
• Sequence Data Format Converter Design
• Parallelization of Statistical Analysis Steps
• Experimental Results
• Conclusion
Experimental Setup
• Dataset
– Whole genome DNA-sequencing of three mouse samples
– Approximately 125 million sequences providing about 40-fold coverage of the genome
– In the SAM/BAM format
• Cluster
– 8 GB Memory
– Up to 32 8-core machines (256 cores in total)
Performance of SAM Format
Converter
• Input: 100 GB SAM data
• Output: BED, BEDGRAPH and FASTA
[Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series: BED, BEDGRAPH, FASTA]
Performance of BAM Format
Converter
• Input: 117 GB BAM data
• Output: BED, BEDGRAPH and FASTA
[Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series: BED, BEDGRAPH, FASTA]
SAM Format Converter Comparison:
Preprocessing-Optimized vs. Original
• Input: 15.7 GB BAM data
• Output: BED, BEDGRAPH and FASTA
[Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series: BED_P, BED, BEDGRAPH_P, BEDGRAPH, FASTA_P, FASTA]
Conclusion
• In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed
• The first framework that can easily support parallel sequence format conversion in a distributed environment
– SAM format converter
– BAM format converter
– Preprocessing-optimized SAM format converter