Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang

The Ohio State University

HiCOMB 2014

Outline

• Introduction

• Sequence Data Format Converter Design

• Experimental Results

• Conclusion

Explosion of Next-Generation Sequencing Data

• NGS Advantages

– Faster and cheaper

• E.g., over one billion short reads per instrument run

– More accurate: higher resolution and deeper coverage

• Challenges

– Urgent need for turning raw data into knowledge

Parallelism is the key

Historical Trends in Storage Prices vs. DNA Sequencing Costs

[Figure: hard disk storage price (MB per dollar) and DNA sequencing cost (base pairs per dollar), 1990 to 2010, on logarithmic axes. Series: Hard Disk Storage, Pre-Next-Generation Sequencing, Next-Generation Sequencing.]

Varieties of NGS Data Formats

• Different Formats

– SAM (Sequence Alignment/Map)

• The de facto text format for storing large nucleotide sequence alignments

– BAM (Binary Alignment/Map)

• The compressed, indexable, binary form of the SAM format

• Indexing is supported by BAI (BAM Index) file

– Other formats

• BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.

Analysis Pipeline

• Current Pipeline

– Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST

• Reality

– Cross-utilization problem: the format of the sequencing data often does not match the input format a downstream tool expects

– Some other analysis steps stay sequential

Motivation: Removing Other Sequential Bottlenecks

• Parallel Format Conversion

– Existing format converters typically use a single core

– Downstream tools often cannot be reused across different aligners

– Not hard to implement but important to scale out

• Parallelizing Certain Statistical Analysis Steps

– E.g., parallel analysis on the histogram data

Framework

• Sequence Data Format Converter

– Input: SAM/BAM

– Output:

• BAM/SAM

• FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML

• Statistical Analysis Module

– Parallelize other statistical analysis steps

– E.g., non-local means (NL-Means) and false discovery rate (FDR) computation

Only the first component is discussed today.

Sequence Data Format Converter

• 3 Converter Instances

– SAM Format Converter

– BAM Format Converter

– Preprocessing-Optimized SAM Format Converter

• Support partial format conversion on a specific chromosome region

SAM Format Converter

• No communication among processes after partitioning

• Partitioning is the key step for parallelization

• Extensibility and programmability
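
Because each SAM record can be converted on its own, a process needs nothing from its peers once it holds a partition. The sketch below illustrates that per-record step for BED and FASTA output. It assumes the standard tab-separated SAM columns; the function names are hypothetical and are not the framework's API.

```python
import re
from typing import Optional

# CIGAR operations that consume reference bases (M, D, N, =, X in the SAM spec).
_REF_CONSUMING = set("MDN=X")
_CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment's CIGAR string."""
    return sum(int(n) for n, op in _CIGAR_OP.findall(cigar) if op in _REF_CONSUMING)


def sam_to_bed(line: str) -> Optional[str]:
    """Convert one SAM alignment line to a six-column BED line (None if unmapped)."""
    f = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = f[0], int(f[1]), f[2], int(f[3]), f[4], f[5]
    if flag & 0x4 or rname == "*":  # unmapped read: no interval to report
        return None
    start = pos - 1  # SAM POS is 1-based; BED start is 0-based
    end = start + reference_span(cigar)  # BED end is exclusive
    strand = "-" if flag & 0x10 else "+"
    return "\t".join([rname, str(start), str(end), qname, mapq, strand])


def sam_to_fasta(line: str) -> str:
    """Convert one SAM alignment line to a two-line FASTA entry."""
    f = line.rstrip("\n").split("\t")
    return f">{f[0]}\n{f[9]}"  # SEQ is emitted exactly as stored in the SAM record
```

Header lines (those starting with `@`) are assumed to be filtered out before these functions are called, and the MAPQ value is reused as the BED score purely for illustration.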

Partitioning Algorithm

• Key: each SAM record is delimited by a line break

1. Initial even partitioning
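
The slide text names only the first step, an initial even partitioning. One common way to complete it, assumed here rather than taken from the talk, is to move each tentative boundary forward to the next line break so that no record is split across processes. The function name and return shape are hypothetical.

```python
import os
from typing import List, Tuple


def partition_by_lines(path: str, num_parts: int) -> List[Tuple[int, int]]:
    """Split a text file into byte ranges [start, end) that begin at line starts."""
    size = os.path.getsize(path)
    starts = [0]
    with open(path, "rb") as f:
        for i in range(1, num_parts):
            boundary = size * i // num_parts  # initial even split by byte count
            if boundary > 0:
                # Assumed adjustment: step back one byte and read to the end of
                # that line, so the boundary lands just after a line break.
                f.seek(boundary - 1)
                f.readline()
                boundary = f.tell()
            starts.append(boundary)
    ends = starts[1:] + [size]
    return list(zip(starts, ends))
```

Each process can then open the file, seek to its own `start`, and read up to `end` without coordinating with the others. A range may be empty when a single line is longer than an even share of the file.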

BAM Format Converter

• Challenge

– No explicit record delimiter

– Even partitioning yields unparsable records

• Solution: add a preprocessing phase

– Partition data by supporting random access

– The preprocessing phase itself cannot be parallelized

BAMX and BAIX

• BAMX (BAM eXtended) File

– Transform each varying-length BAM record into a regular-layout BAMX record

– Align varying-length BAM fields by padding

• BAIX (BAI eXtended) File

– Index file of the BAMX file

– Store the alignment starting positions in BAM (logically) and in BAMX (physically)
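
The idea of padding varying-length records into a regular layout can be sketched as follows. This is a simplified illustration, not the actual BAMX or BAIX layout: it pads whole records rather than individual fields, and the length prefix, the index shape, and every name are assumptions made for the example.

```python
import struct
from typing import List, Tuple


def build_padded_file(records: List[Tuple[int, bytes]]):
    """Pad (alignment_start, payload) records to one fixed slot size.

    Returns the padded blob, an index of (alignment_start, byte_offset)
    pairs, and the slot size in bytes.
    """
    width = max(len(payload) for _, payload in records)
    slot = 4 + width  # 4-byte length prefix followed by the padded payload
    blob = bytearray()
    index = []
    for start, payload in records:
        index.append((start, len(blob)))  # logical position, physical offset
        blob += struct.pack("<I", len(payload)) + payload.ljust(width, b"\x00")
    return bytes(blob), index, slot


def read_record(blob: bytes, slot: int, i: int) -> bytes:
    """Random access: fetch the i-th record without scanning earlier ones."""
    offset = i * slot
    (length,) = struct.unpack_from("<I", blob, offset)
    return blob[offset + 4 : offset + 4 + length]
```

With fixed-size slots, record `i` always begins at byte `i * slot`, which is what makes even partitioning and random access straightforward. The cost is the extra space spent on padding.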

Partial Conversion

• If only interested in a subset, no need for full conversion

• Based on the BAIX file

– Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)

– Evenly partition the subset and proceed in parallel
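
The lookup and the even split described above can be sketched with Python's `bisect` module. The sketch assumes the index is a sorted list of alignment start positions for a single chromosome, with one entry per fixed-size record; the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple


def locate_region(starts: List[int], region_start: int, region_end: int) -> Tuple[int, int]:
    """Record index range [lo, hi) whose alignment start lies in the region."""
    lo = bisect_left(starts, region_start)   # first record at or after region_start
    hi = bisect_right(starts, region_end)    # one past the last record at or before region_end
    return lo, hi


def even_partition(lo: int, hi: int, num_parts: int) -> List[Tuple[int, int]]:
    """Split the record range [lo, hi) into num_parts nearly equal sub-ranges."""
    n = hi - lo
    return [(lo + n * i // num_parts, lo + n * (i + 1) // num_parts)
            for i in range(num_parts)]
```

Because records have a fixed size in this model, a record index converts to a physical byte offset by a single multiplication. Reads that start before `region_start` but overlap the region are not selected by this simple version.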

Preprocessing-Optimized SAM Format Converter

• Main Ideas

– Preprocessing can also optimize the SAM format conversion

– Such preprocessing can be parallelized because of the easy partitioning on the SAM format

Experimental Setup

• Dataset

– Whole genome DNA-sequencing of three mouse samples

– Approximately 125 million sequences providing about 40-fold coverage of the genome

– In the SAM/BAM format

• Cluster

– 8 GB Memory

– Up to 32 8-core machines (256 cores in total)

Performance of SAM Format Converter

• Input: 100 GB SAM data

• Output: BED, BEDGRAPH and FASTA

[Figure: speedup versus number of cores (8, 16, 32, 64, 128) for BED, BEDGRAPH and FASTA output.]

Performance of BAM Format Converter

• Input: 117 GB BAM data

• Output: BED, BEDGRAPH and FASTA

[Figure: speedup versus number of cores (8, 16, 32, 64, 128) for BED, BEDGRAPH and FASTA output.]

SAM Format Converter Comparison: Preprocessing-Optimized vs. Original

• Input: 15.7 GB BAM data

• Output: BED, BEDGRAPH and FASTA

[Figure: speedup versus number of cores (8, 16, 32, 64, 128). Series: BED_P, BED, BEDGRAPH_P, BEDGRAPH, FASTA_P, FASTA.]

Conclusion

• In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed

• The first framework that can easily support parallel sequence format conversion in a distributed environment

– SAM format converter

– BAM format converter

– Preprocessing-optimized SAM format converter
