Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

(1)

Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang

The Ohio State University

HiCOMB 2014

(2)

Outline

• Introduction

• Sequence Data Format Converter Design

• Experimental Results

• Conclusion

(3)

Explosion of Next-Generation Sequencing Data

• NGS Advantages

– Faster and cheaper

• E.g., over one billion short reads per instrument run

– More accurate: higher resolution and deeper coverage

• Challenges

– Urgent need for turning raw data into knowledge

– Parallelism is the key

(4)

Historical Trends in Storage Prices vs. DNA Sequencing Costs

[Figure: hard disk storage price (MB per dollar) and DNA sequencing cost (base pairs per dollar) on logarithmic axes, 1990 to 2010. Series: hard disk storage, pre-next-generation sequencing, next-generation sequencing.]

(5)

Varieties of NGS Data Formats

• Different Formats

– SAM (Sequence Alignment/Map)

• The de-facto text format for storing large nucleotide sequence alignments

– BAM (Binary Alignment/Map)

• The compressed, indexable, binary form of the SAM format

• Indexing is supported by BAI (BAM Index) file

– Other formats

• BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.

(6)

Analysis Pipeline

• Current Pipeline

– Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST

• Reality

– Cross-utilization problem: the format of the sequencing data ≠ the input format a downstream tool expects

– Some other analysis steps stay sequential

(7)

Motivation: Removing Other Sequential Bottlenecks

• Parallel Format Conversion

– Current format conversion commonly makes use of a single core

– Downstream tools often cannot be used interchangeably across the outputs of different aligners

– Not hard to implement but important to scale out

• Parallelizing Certain Statistical Analysis Steps

– E.g., parallel analysis on the histogram data

(8)

Framework

• Sequence Data Format Converter

– Input: SAM/BAM

– Output:

• BAM/SAM

• FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML

• Statistical Analysis Module

– Parallelize other statistical analysis steps

– E.g., non-local means (NL-Means) and false discovery rate (FDR) computation

(Only the first component is discussed today.)
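To make the converter's job concrete, here is a minimal sketch of converting one SAM record into a BED line and a FASTA entry. The function names (`reference_length`, `sam_to_bed`, `sam_to_fasta`) are hypothetical and are not taken from the framework; the field order follows the SAM text format, and the choice of BED columns (six-column BED with the read name and MAPQ as name and score) is an assumption.

```python
import re

# One CIGAR element: a count followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance along the reference sequence.
REF_CONSUMING = set("MDN=X")


def reference_length(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def sam_to_bed(line):
    """Convert one SAM alignment line to a 6-column BED line (assumed layout).

    Returns None for unmapped reads, which have no genomic interval.
    """
    f = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = f[0], int(f[1]), f[2], int(f[3]), f[4], f[5]
    if flag & 0x4 or rname == "*":
        return None
    start = pos - 1                      # SAM POS is 1-based, BED start is 0-based
    end = start + reference_length(cigar)  # BED end is exclusive
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


def sam_to_fasta(line):
    """Emit the read as a FASTA entry, using the sequence exactly as stored."""
    f = line.rstrip("\n").split("\t")
    return ">" + f[0] + "\n" + f[9]
```

Because each record is converted independently of every other record, a conversion of this shape can be applied to disjoint partitions of the input without coordination.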

(9)

Outline

• Introduction

• Conclusion

(10)

Sequence Data Format Converter

• 3 Converter Instances

– SAM Format Converter

– BAM Format Converter

– Preprocessing-Optimized SAM Format Converter

• Support partial format conversion on a specific chromosome region

(11)

SAM Format Converter

• No communication among processes after partitioning

• Partitioning is the key step for parallelization

• Extensibility and programmability

(12)

Partitioning Algorithm

• Key: each SAM record is delimited by a line break

1. Initial even partitioning
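The slide names only the first step, so the following is a sketch of one plausible way to finish the job, not the authors' algorithm: start from even byte offsets, then move every interior cut forward to the next line break so that no record is split. The function name `partition_by_lines` is hypothetical, and the input is held in memory here purely for illustration.

```python
def partition_by_lines(data: bytes, num_parts: int):
    """Split `data` into `num_parts` byte ranges that end on line breaks.

    Step 1: cut at even byte offsets.
    Step 2 (assumed): advance each interior cut to just past the next
    newline, so every range contains only whole records.
    """
    size = len(data)
    cuts = [size * i // num_parts for i in range(num_parts + 1)]
    for i in range(1, num_parts):
        # Never move backwards past a cut that was already adjusted.
        cut = max(cuts[i], cuts[i - 1])
        if 0 < cut < size and data[cut - 1:cut] != b"\n":
            nl = data.find(b"\n", cut)
            cut = size if nl == -1 else nl + 1
        cuts[i] = cut
    return [(cuts[i], cuts[i + 1]) for i in range(num_parts)]
```

Each resulting range can be handed to a separate process. A range may be empty when a single record is longer than an even share of the input.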

(13)

BAM Format Converter

• Challenge

– No explicit record delimiter

– Even partitioning -> unparsable records

• Solution: add a preprocessing phase

– Partition data by supporting random access

– The preprocessing phase itself cannot be parallelized
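The delimiter problem can be illustrated with a toy model of length-prefixed records. In the BAM format, each alignment record begins with a 32-bit little-endian `block_size` field giving the length of the rest of the record, so the start of record k is only known after reading records 0 to k-1. The sketch below models just that framing; it ignores the BAM header and BGZF compression, and `make_record` and `record_offsets` are hypothetical helper names.

```python
import struct


def make_record(payload: bytes) -> bytes:
    """Frame a payload the way BAM frames a record: length field, then body."""
    return struct.pack("<I", len(payload)) + payload


def record_offsets(stream: bytes):
    """Find where each record starts by walking the stream sequentially.

    There is no marker to search for: the only way to reach the next
    record is to read the current record's length field and skip ahead.
    """
    offsets = []
    pos = 0
    while pos < len(stream):
        offsets.append(pos)
        (block_size,) = struct.unpack_from("<I", stream, pos)
        pos += 4 + block_size
    return offsets
```

An arbitrary byte offset, such as the midpoint of the stream, generally falls inside a record, which is why even partitioning yields unparsable data and why this scan is inherently sequential.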

(14)

BAMX and BAIX

• BAMX (BAM eXtended) File

– Transform each varying-length BAM record into a regular-layout BAMX record

– Align varying-length BAM fields by padding

• BAIX (BAI eXtended) File

– Index file of the BAMX file

– Store the alignment starting positions in BAM (logically) and in BAMX (physically)
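The slides do not give the actual BAMX layout, so the following is only a sketch of the general idea of padding variable-length records into fixed-size slots. The slot format (a 4-byte length followed by a zero-padded payload) and the names `to_fixed_width` and `read_record` are assumptions made for illustration.

```python
import struct


def to_fixed_width(records, pad=b"\x00"):
    """Pack variable-length records into equal-sized slots.

    Each slot holds a 4-byte little-endian payload length followed by
    the payload padded to the longest record's length (assumed layout).
    Returns (slot_size, blob).
    """
    width = max(map(len, records), default=0)
    slots = [struct.pack("<I", len(r)) + r.ljust(width, pad) for r in records]
    return 4 + width, b"".join(slots)


def read_record(blob, slot_size, index):
    """Random access: record `index` starts at byte `index * slot_size`."""
    start = index * slot_size
    (length,) = struct.unpack_from("<I", blob, start)
    return blob[start + 4:start + 4 + length]
```

With a regular layout, any process can jump directly to its share of the records, at the cost of extra storage for the padding.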

(15)

Partial Conversion

• If only interested in a subset, no need for full conversion

• Based on the BAIX file

– Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)

– Evenly partition the subset and proceed in parallel
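The lookup and split described above can be sketched as follows, assuming the index exposes a sorted list of alignment start positions for one chromosome and that records are addressed by index. The names `locate_region` and `even_split` are hypothetical, and the sketch selects records by start position only, so a read that starts before the region but overlaps it is not included.

```python
from bisect import bisect_left, bisect_right


def locate_region(starts, region_start, region_end):
    """Binary-search sorted start positions for a query region.

    Returns the half-open range [lo, hi) of record indices whose
    alignment start lies in [region_start, region_end].
    """
    lo = bisect_left(starts, region_start)
    hi = bisect_right(starts, region_end)
    return lo, hi


def even_split(lo, hi, num_parts):
    """Divide the record range [lo, hi) into `num_parts` near-equal ranges."""
    n = hi - lo
    return [(lo + n * i // num_parts, lo + n * (i + 1) // num_parts)
            for i in range(num_parts)]
```

Under a fixed-size record layout, a record index converts to a physical byte offset by multiplying by the record size.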

(16)

Preprocessing-Optimized SAM Format Converter

• Main Ideas

– Preprocessing can also optimize the SAM format conversion

– Such preprocessing can be parallelized because of the easy partitioning on the SAM format

(17)

Outline

• Introduction

• Parallelization of Statistical Analysis Steps

• Conclusion

(18)

Experimental Setup

• Dataset

– Whole genome DNA-sequencing of three mouse samples

– Approximately 125 million sequences providing about 40-fold coverage of the genome

– In the SAM/BAM format

• Cluster

– 8 GB Memory

– Up to 32 8-core machines (256 cores in total)

(19)

Performance of SAM Format Converter

• Input: 100 GB SAM data

• Output: BED, BEDGRAPH and FASTA

[Figure: speedup vs. number of cores (8, 16, 32, 64, 128) for BED, BEDGRAPH and FASTA output; speedup axis from 0 to 80.]

(20)

Performance of BAM Format Converter

• Input: 117 GB BAM data

[Figure: speedup vs. number of cores (8, 16, 32, 64, 128) for BED, BEDGRAPH and FASTA output; speedup axis from 0 to 140.]

(21)

SAM Format Converter Comparison: Preprocessing-Optimized vs. Original

• Input: 15.7 GB BAM data

[Figure: speedup vs. number of cores (8, 16, 32, 64, 128); series BED_P, BED, BEDGRAPH_P, BEDGRAPH, FASTA_P and FASTA; speedup axis from 0 to 100.]

(22)

Outline

• Introduction

• Parallelization of Statistical Analysis Steps

• Conclusion

(23)

Conclusion

• In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed

• The first framework that can easily support parallel sequence format conversion in a distributed environment

– SAM format converter

– BAM format converter

– Preprocessing-optimized SAM format converter