
Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Cindy Fähnrich, Dr. Matthieu-P. Schapranow

2nd International Workshop on Big Data in Bioinformatics and Healthcare

Oct 27, 2014

(2)

Motivation: Genome Data Analysis Process

Next-generation sequencing (NGS) requires an adapted analysis workflow
  Higher error rates
  Shorter reads
The base sequencing step produces output within a few hours
Subsequent processing steps take days up to several weeks

[Figure: process diagram; only the label "DNA" and the embedded caption "Variant Detection within an In-memory Database | Cindy Fähnrich | 6th December 2013" survive]

Cindy Fähnrich, Dr. Matthieu-P. Schapranow, Detecting Genetic Variants within an In-Memory Database

(3)

Motivation: The Next-Generation Sequencing Data Deluge

NGS growth pattern more remarkable than Moore's law
→ Addressing the data deluge with more computing power is not an option
For variant calling: still options to improve data processing
  Single-threaded processing
  Data stored in files on disk

[Figure: Main Memory Cost per Megabyte vs. Sequencing Cost per Megabase; y-axis: Cost in [USD], ticks from 0.001 to 10000; x-axis: Date, ticks from 01/12/01 to 01/12/13]

(4)

IMDB Building Blocks

Combined column and row store
Map/Reduce
Single and multi-tenancy
Lightweight compression
Insert only for time travel
Real-time replication
Working on integers
SQL interface on columns and rows
Active/passive data store
Minimal projections
Group key
Reduction of software layers
Dynamic multi-threading
Bulk load of data
Object-relational mapping
Text retrieval and extraction engine
No aggregate tables
Data partitioning
Any attribute as index
No disk
On-the-fly extensibility
Analytics on historical data
Multi-core/parallelization

(6)

Different Types of Genetic Variants

Single Nucleotide Polymorphism (SNP): AACTG vs. ATCTG
Insertion or Deletion (InDel): AACTG vs. AA_TG
Structural Variations (SV): AACTG vs. GTCAA

Different calling strategies for variant types with increasing complexity
  SNP calling (single-/multi-sample)
  Indel calling
→ Focus here on single-sample SNP calling
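The SNP example on this slide (AACTG vs. ATCTG) can be reproduced with a naive position-wise comparison. This is only a minimal sketch to illustrate what a SNP is: the function name `find_snps` is hypothetical, it assumes two already aligned sequences of equal length, and it does not perform the statistical calling the talk is about.

```python
from typing import List, Tuple


def find_snps(reference: str, sample: str) -> List[Tuple[int, str, str]]:
    """Return (0-based position, reference base, sample base) for each mismatch.

    Assumption: both sequences are already aligned and equally long.
    InDels and structural variations need an alignment step and are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length")
    return [
        (i, ref_base, sample_base)
        for i, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# The slide's SNP example: the second base differs (A in the reference, T in the sample).
print(find_snps("AACTG", "ATCTG"))  # [(1, 'A', 'T')]
```

A real caller works on many noisy reads per position instead of one clean sequence, which is why a probabilistic model is needed.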

(7)

Our Contribution: Integrating SNP Calling into an In-Memory Database

SNP calling implemented as core component of the database
Invocation of SNP calling via stored procedure call:

    CALL "_SYS_AFL"."CALL_SNPS"(
        SAMIMPORT.NA19240,
        REFERENCE.HG19CHR1,
        'chr1', 20, 20, 30, 40,

Built-in parallel scheduling and resource management of distinct SNP calling steps
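The idea of scheduling distinct SNP calling steps in parallel can be pictured as splitting the covered positions into independent regions and handing each region to a worker. The following is a rough sketch under assumptions, not the database's scheduler: `split_region`, `call_in_parallel`, and `demo_worker` are hypothetical names, and Python threads only stand in for the database's own parallelization.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def split_region(start: int, end: int, chunk: int) -> List[Tuple[int, int]]:
    """Split the half-open position range [start, end) into consecutive chunks."""
    return [(lo, min(lo + chunk, end)) for lo in range(start, end, chunk)]


def call_in_parallel(
    worker: Callable[[int, int], List[int]],
    start: int,
    end: int,
    chunk: int,
    max_workers: int = 4,
) -> List[int]:
    """Run a per-region worker on every chunk and concatenate results in position order."""
    regions = split_region(start, end, chunk)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the order of the submitted regions.
        parts = pool.map(lambda region: worker(*region), regions)
        return [item for part in parts for item in part]


def demo_worker(lo: int, hi: int) -> List[int]:
    """Placeholder for a real per-region SNP caller: reports every tenth position."""
    return [pos for pos in range(lo, hi) if pos % 10 == 0]


print(call_in_parallel(demo_worker, 0, 35, 10))  # [0, 10, 20, 30]
```

Regions are independent because a single-sample SNP call at one position only needs the reads covering that position, so results can simply be concatenated.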

(8)

Our Contribution: SNP Calling Data Artifacts

Reference Genome
  Base sequence for comparison
  Stored position-wise
Read Alignments
  Reads mapped to the reference genome
  Table conforming to the SAM format
Variant/SNP Calls
  Detected SNPs
  Table conforming to the VCF format
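The three data artifacts can be pictured as three tables. The sketch below uses Python's built-in `sqlite3` with an in-memory database purely as a stand-in; the table and column names are assumptions derived from the 11 mandatory SAM fields and the 8 fixed VCF fields, not the schema used in the talk.

```python
import sqlite3

# Hypothetical schema: one row per reference position, one row per aligned read,
# one row per detected variant.
SCHEMA = """
CREATE TABLE reference_genome (
    chrom TEXT NOT NULL,
    pos   INTEGER NOT NULL,
    base  TEXT NOT NULL,
    PRIMARY KEY (chrom, pos)
);
CREATE TABLE read_alignments (
    qname TEXT, flag INTEGER, rname TEXT, pos INTEGER, mapq INTEGER,
    cigar TEXT, rnext TEXT, pnext INTEGER, tlen INTEGER, seq TEXT, qual TEXT
);
CREATE TABLE variant_calls (
    chrom TEXT, pos INTEGER, "id" TEXT, ref TEXT, alt TEXT,
    qual REAL, "filter" TEXT, info TEXT
);
"""


def create_tables() -> sqlite3.Connection:
    """Create the three tables in a fresh in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def column_names(conn: sqlite3.Connection, table: str) -> list:
    """List a table's column names via PRAGMA table_info (name is field 1)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = create_tables()
conn.execute("INSERT INTO reference_genome VALUES ('chr1', 1, 'A')")
print(column_names(conn, "variant_calls"))
```

Storing the reference position-wise makes the comparison between a read base and the reference base a plain lookup by `(chrom, pos)`.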

(9)

Our Contribution: Genotype Calling Formula

Genotype calling = deriving the actual genotype at a particular position
Assign a probability to all possible genotypes depending on the given data
→ Formula applied by GATK's UnifiedGenotyper

P(G_i) = uniform for all genotypes G_i, i.e. 1
D_j = all base occurrences at a particular position j
G_i = genotype for which to calculate the probability
H_l = haploid part of genotype G_i
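A small worked example may help to see how these symbols interact. The sketch below assumes the commonly described diploid model, in which the posterior of a genotype G = H_1 H_2 is proportional to P(G) times the product over all observed bases D_j of (P(D_j | H_1) / 2 + P(D_j | H_2) / 2), with P(D_j | H) = 1 - e_j if the base matches H and e_j otherwise. The per-base error probability e_j, the function names, and the simplified error model are assumptions for illustration, not taken from the slide.

```python
from itertools import combinations_with_replacement
from typing import Dict, List, Tuple

BASES = "ACGT"


def base_likelihood(observed: str, haploid: str, error: float) -> float:
    """P(D_j | H): 1 - error if the observed base matches the haploid base, else error."""
    return 1.0 - error if observed == haploid else error


def genotype_posteriors(pileup: List[Tuple[str, float]]) -> Dict[str, float]:
    """Posterior over the 10 unordered diploid genotypes for one position.

    pileup holds (observed base, error probability) pairs for that position.
    The prior is uniform, so it cancels out during normalization.
    """
    unnormalized = {}
    for h1, h2 in combinations_with_replacement(BASES, 2):
        likelihood = 1.0
        for base, error in pileup:
            likelihood *= (0.5 * base_likelihood(base, h1, error)
                           + 0.5 * base_likelihood(base, h2, error))
        unnormalized[h1 + h2] = likelihood
    total = sum(unnormalized.values())
    return {genotype: value / total for genotype, value in unnormalized.items()}


# Five reads showing A and five showing T, each with a 1% error probability.
posteriors = genotype_posteriors([("A", 0.01)] * 5 + [("T", 0.01)] * 5)
print(max(posteriors, key=posteriors.get))  # AT
```

For deep coverage the product underflows, so a practical implementation would sum log-likelihoods instead of multiplying probabilities.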

(10)

Our Contribution: Experiment Results

Data: 68.8M chr1 read alignments from the 1000 Genomes Project
Performance speedup by up to 22x for IMDB-based SNP calling
GATK's runtime depends on the system's I/O capabilities
Lower boundary for our approach around 369 s

[Figure: Duration (seconds), 0 to 10000, over Covered Positions on Chromosome 1 (millions), 0 to 40; labeled series: GATK]

(11)

Conclusion

Running SNP calling within an in-memory database satisfies expectations
  Main memory availability
  Built-in parallelization strategies
  → Memory access is the new bottleneck
SNP calling runtime improves by up to a factor of 22 compared to GATK
Further evaluations on runtime performance and result set quality
Extension of statistical formula to incorporate other aspects
(12)

Keep in contact with us.

Hasso Plattner Institute

Enterprise Platform & Integration Concepts

August-Bebel-Str. 88

14482 Potsdam, Germany

Dr. Matthieu-P. Schapranow

[email protected]

http://we.analyzegenomes.com/

Cindy Fähnrich, M. Sc.

[email protected]
