Towards Integrating the Detection of Genetic
Variants into an In-Memory Database
Cindy Fähnrich, Dr. Matthieu-P. Schapranow
2nd International Workshop on Big Data in Bioinformatics and Healthcare
Oct 27, 2014
■ Next-generation sequencing (NGS) requires adapted analysis workflow
  □ Higher error rates
  □ Shorter reads
■ Base sequencing step produces output within a few hours
■ Subsequent processing steps take days up to several weeks
Motivation –
Genome Data Analysis Process
Cindy Fähnrich, Dr. Matthieu-P. Schapranow Detecting Genetic Variants within an In-Memory Database 2
Variant Detection within an In-memory Database | Cindy Fähnrich |
6th December 2013
■ NGS growth pattern more remarkable than Moore’s law
  → Addressing the data deluge with more computing power alone is no option
■ For variant calling: still options to improve data processing
  □ Single-threaded processing
  □ Data stored in files on disk
Motivation –
The Next-Generation Sequencing Data Deluge
[Figure: Main Memory Cost per Megabyte and Sequencing Cost per Megabase; y-axis: Cost in [USD], logarithmic, 0.001 to 10000; x-axis: Date, 01/12/01 to 01/12/13]
IMDB Building Blocks
■ Combined column and row store
■ Map/Reduce
■ Single and multi-tenancy
■ Lightweight compression
■ Insert only for time travel
■ Real-time replication
■ Working on integers
■ SQL interface on columns and rows
■ Active/passive data store
■ Minimal projections
■ Group key
■ Reduction of software layers
■ Dynamic multi-threading
■ Bulk load of data
■ Object-relational mapping
■ Text retrieval and extraction engine
■ No aggregate tables
■ Data partitioning
■ Any attribute as index
■ No disk
■ On-the-fly extensibility
■ Analytics on historical data
■ Multi-core/parallelization
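Two of the building blocks listed on this slide, lightweight compression and working on integers, are commonly realized through dictionary encoding of columns. The following is a minimal Python sketch of that general idea, not the database's actual implementation; the helper names `dictionary_encode` and `dictionary_decode` are hypothetical.

```python
def dictionary_encode(column):
    """Replace each value by a small integer ID (hypothetical helper)."""
    dictionary = sorted(set(column))  # distinct values in sorted order
    value_to_id = {value: i for i, value in enumerate(dictionary)}
    encoded = [value_to_id[value] for value in column]
    return dictionary, encoded


def dictionary_decode(dictionary, encoded):
    """Restore the original column from the dictionary and the integer IDs."""
    return [dictionary[i] for i in encoded]


# A toy column of bases: four distinct values fit into 2 bits per entry.
bases = list("AACTGAACTG")
dictionary, encoded = dictionary_encode(bases)
print(dictionary)  # ['A', 'C', 'G', 'T']
print(encoded)     # [0, 0, 1, 3, 2, 0, 0, 1, 3, 2]
```

Scans and comparisons can then run on the integer IDs instead of the original values.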
■ Different calling strategies for variant types with increasing complexity
  □ SNP calling (single-/multi-sample)
  □ Indel calling
  → Focus here on single-sample SNP calling
Different Types of Genetic Variants
Single Nucleotide Polymorphism (SNP): AACTG vs. ATCTG
Structural Variations (SV): AACTG vs. GTCAA
Insertion or Deletion (InDel): AACTG vs. AA_TG
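To illustrate why SNPs are the simplest of these variant types, here is a minimal Python sketch that finds single-base differences by comparing two equal-length sequences position by position. The function name `snp_positions` is hypothetical, and the sketch assumes the sequences are already aligned; it is not the calling method used in the database.

```python
def snp_positions(reference, sample):
    """Return 0-based positions where two equal-length sequences differ."""
    if len(reference) != len(sample):
        # Insertions and deletions shift positions and need a real alignment.
        raise ValueError("sequences must have equal length")
    return [i for i, (r, s) in enumerate(zip(reference, sample)) if r != s]


print(snp_positions("AACTG", "ATCTG"))  # [1]
```

Applied to the structural variation example (AACTG vs. GTCAA), the same comparison reports four differing positions, which hints at why more complex variant types call for different strategies.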
■ SNP calling implemented as core component of the database
■ Invocation of SNP calling via stored procedure call:
    CALL "_SYS_AFL"."CALL_SNPS"(
        SAMIMPORT.NA19240,
        REFERENCE.HG19CHR1,
        'chr1', 20, 20, 30, 40,
■ Built-in parallel scheduling and resource management of distinct SNP calling steps
Our Contribution
Integrating SNP Calling into an In-Memory Database
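The slides do not describe how the database schedules the distinct SNP calling steps. As a rough illustration only, assuming the work can be split by position ranges, the pattern could be sketched in Python with a thread pool. All names here (`split_ranges`, `call_snps_in_range`, `call_snps_parallel`) are hypothetical, and the per-chunk step is a toy base comparison rather than real SNP calling.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(start, end, chunk_size):
    """Split the half-open interval [start, end) into consecutive chunks."""
    return [(lo, min(lo + chunk_size, end)) for lo in range(start, end, chunk_size)]


def call_snps_in_range(reference, sample, lo, hi):
    """Hypothetical per-chunk step: report positions where the sample differs."""
    return [(pos, reference[pos], sample[pos])
            for pos in range(lo, hi) if reference[pos] != sample[pos]]


def call_snps_parallel(reference, sample, chunk_size=4, workers=4):
    """Run the per-chunk step on all chunks concurrently and merge in order."""
    chunks = split_ranges(0, len(reference), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves chunk order, so merged calls stay sorted by position.
        results = pool.map(lambda c: call_snps_in_range(reference, sample, *c), chunks)
        return [call for chunk_calls in results for call in chunk_calls]


print(call_snps_parallel("AACTGAACTG", "ATCTGAACTC"))
# [(1, 'A', 'T'), (9, 'G', 'C')]
```

Threads in CPython only illustrate the scheduling pattern here; they do not reflect the database's resource management.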
Reference Genome
■ Base sequence for comparison
■ Stored position-wise
Read Alignments
■ Reads mapped to the reference genome
■ Table conforming to the SAM format
Variant/SNP Calls
■ Detected SNPs
■ Table conforming to the VCF format
Our Contribution –
SNP Calling Data Artifacts
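A minimal sketch of how these three artifacts could be laid out as tables, using Python's built-in `sqlite3` with an in-memory database as a stand-in. The table and column names are assumptions for illustration: the alignment columns mirror the eleven mandatory SAM fields and the variant columns mirror the eight fixed VCF fields, but the schema actually used in the in-memory database is not given in the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # SQLite only stands in for the IMDB here
conn.executescript("""
    CREATE TABLE reference (chrom TEXT, pos INTEGER, base TEXT);
    CREATE TABLE alignments (qname TEXT, flag INTEGER, rname TEXT, pos INTEGER,
                             mapq INTEGER, cigar TEXT, rnext TEXT, pnext INTEGER,
                             tlen INTEGER, seq TEXT, qual TEXT);
    CREATE TABLE variants (chrom TEXT, pos INTEGER, vcf_id TEXT, ref_allele TEXT,
                           alt_allele TEXT, qual REAL, filter_status TEXT, info TEXT);
""")

# Store a toy reference sequence position-wise (1-based, as in SAM and VCF).
conn.executemany("INSERT INTO reference VALUES ('chr1', ?, ?)",
                 list(enumerate("AACTG", start=1)))

# One toy read covering positions 1 to 5 with a mismatch at position 2.
conn.execute("INSERT INTO alignments VALUES "
             "('read1', 0, 'chr1', 1, 60, '5M', '*', 0, 0, 'ATCTG', 'IIIII')")

ref_base = conn.execute(
    "SELECT base FROM reference WHERE chrom = 'chr1' AND pos = 2").fetchone()[0]
print(ref_base)  # A
```

Storing the reference position-wise makes the base at any position a simple lookup.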
■ Genotype calling = deriving the actual genotype at a particular position
■ Assign probability to all possible genotypes depending on given data
  → Formula applied by GATK’s UnifiedGenotyper
Our Contribution –
Genotype Calling Formula
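The formula itself was shown as an image and is not preserved in this text. The following is a hedged reconstruction, assuming the Bayesian genotype model documented for GATK's UnifiedGenotyper and using the symbols from this slide's legend. The index k over all genotypes, the individual base b, and the per-base error probability \epsilon_b are added assumptions.

```latex
P(G_i \mid D_j) = \frac{P(G_i)\, P(D_j \mid G_i)}{\sum_{k} P(G_k)\, P(D_j \mid G_k)},
\qquad
P(D_j \mid G_i) = \prod_{b \in D_j} \left( \frac{P(b \mid H_1)}{2} + \frac{P(b \mid H_2)}{2} \right)

P(b \mid H_l) =
\begin{cases}
1 - \epsilon_b & \text{if } b = H_l \\
\epsilon_b / 3 & \text{otherwise}
\end{cases}
```

With a uniform prior P(G_i), the prior cancels in the normalization, so the posterior is determined by the likelihood term alone.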
P(G_i) = uniform for all genotypes G_i, i.e. 1
D_j = all base occurrences at a particular position j
G_i = genotype for which to calculate the probability
H_l = haploid part of genotype G_i
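A minimal Python sketch of Bayesian genotype calling along the lines of this legend, assuming a diploid model with a uniform prior and a single fixed error rate for all bases. The function names and the error model (a miscalled base is equally likely to be any of the other three) are assumptions for illustration, not the database's or GATK's actual implementation.

```python
from itertools import combinations_with_replacement
from math import prod


def base_likelihood(observed, allele, error):
    """P(observed base | haploid allele), assuming uniform miscall errors."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_posteriors(pileup, error=0.01):
    """Posterior of each diploid genotype given all bases at one position."""
    genotypes = list(combinations_with_replacement("ACGT", 2))  # 10 genotypes
    prior = 1.0  # uniform prior; any constant cancels in the normalization
    unnormalized = {
        h1 + h2: prior * prod(
            0.5 * base_likelihood(b, h1, error) + 0.5 * base_likelihood(b, h2, error)
            for b in pileup)
        for h1, h2 in genotypes
    }
    total = sum(unnormalized.values())
    return {g: p / total for g, p in unnormalized.items()}


# Toy pileup with four A and four T observations at one position.
posteriors = genotype_posteriors("AATATTAT")
best = max(posteriors, key=posteriors.get)
print(best)  # AT
```

A real caller would use per-base quality scores instead of one fixed error rate and would work in log space to avoid underflow at high coverage.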
■ Data: 68.8M chr1 read alignments from the 1000 Genomes Project
■ Performance speedup by up to 22x for IMDB-based SNP calling
■ GATK's runtime depends on the system's I/O capabilities
■ Lower boundary for our approach around 369 s
Our Contribution
Experiment Results
[Figure: Duration (seconds), 0 to 10000, over Covered Positions on Chromosome 1 (millions), 0 to 40; legend: GATK]
■ Running SNP calling within an in-memory database satisfies expectations
  □ Main memory availability
  □ Built-in parallelization strategies
  → Memory access is the new bottleneck
■ SNP calling runtime improves by up to a factor of 22 compared to GATK
■ Further evaluations on runtime performance and result set quality
■ Extension of statistical formula to incorporate other aspects
Conclusion
Keep in contact with us.
Hasso Plattner Institute
Enterprise Platform & Integration Concepts
August-Bebel-Str. 88
14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow
http://we.analyzegenomes.com/
Cindy Fähnrich, M. Sc.