Accelerating Life Science Discovery using a High-Performance Analytics Platform in a Collaborative Environment Overview

(1)

Accelerating Life Science Discovery using a High-Performance Analytics Platform in a Collaborative Environment

October 7, 2015

Kathy Tzeng, PhD

Worldwide Technical Lead

Healthcare & Life Sciences

IBM Systems Group

(2)

Genomic Solution Enablement Team

Mission:

• Porting and Optimization of Genomics/Translational applications on IBM solutions
• Developing Solutions with Partners
• Making IBM SW/HW available to Software developers

Members:

• Independent Software Vendor (ISV) team
• Toronto Compiler Lab
• Boeblingen Development Lab
• Tokyo Research Lab

(3)

GENOMIC MEDICINE– from Sequencing to Personalized Healthcare

NHGRI, a branch of NIH, has defined 5 steps for genomic medicine. (source: E. Green et al., Nature 470, 204–213)

Next Generation Sequencing (or other ingestion): the focus is on very large data generation, mainly from $1000 whole genome sequencing, and the data processing and reduction; includes human, plant, animal, and microbiome genomics

Translational Research/Early Discovery: the focus is on data integration including genomic data, and the analytics required to identify biomarkers, understand disease mechanisms, and to identify new medical treatments

Personalized Healthcare/Clinical Genomics: the focus is on delivering genomic medicine to patients to improve outcomes by associating patients with

(4)

A Computationally Challenging Problem

Breakthroughs in Genomic Medicine require quantifying associations between known population traits, environmental factors, and biological responses

F(t): Known Traits or Environmental Features. Quantities describing population traits or environmental factors at time t

W(t): Predictive Response Function. Model of associations between features and responses as a function of time t

R(t): Measured Biological Response. Quantities describing response events for an organism at time t

Computational Challenges
• Feature combinatorics
• Large file sizes
• Large population sizes
• Unstructured data types
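To give a feel for the "feature combinatorics" challenge, the following minimal sketch counts how many feature pairs and triples would have to be considered if associations were tested across combinations of features. The feature count of 3 million is a hypothetical, illustrative figure and is not taken from this slide; the function name is likewise an assumption.

```python
from math import comb


def feature_combinations(n_features: int, order: int) -> int:
    """Return the number of distinct feature subsets of size `order`."""
    return comb(n_features, order)


# Hypothetical feature count, chosen only for illustration.
N_FEATURES = 3_000_000

pairs = feature_combinations(N_FEATURES, 2)
triples = feature_combinations(N_FEATURES, 3)

print(f"pairwise combinations:  {pairs:,}")
print(f"three-way combinations: {triples:,}")
```

Even at pairwise order the count is in the trillions, which is one way to read why exhaustive association testing is treated as a computational challenge.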

(5)

Workload Challenge #1: ‘Big Data’ Analytics

Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples

Whole Human Genome: 3 billion DNA base pairs. Each human genome can have a few million variants.

Workflow stages and example tools:
• High-Throughput Sequencing: Raw Reads
• Assembly & Alignment
  De Novo Assembly: SOAPdenovo, Velvet, …
  Reference-Based Mapping: BWA, Bowtie, SOAP, …
  Reference Genomes: TCGA, GEO, dbSNP, …
• Variant Calling: Picard, GATK, SAMtools, SOAPsnp, …
• Variant Annotations
  Annotation Tools: ANNOVAR, Gene Ontology, …
  Sample: intergenic … SNP in IL23R associated with Crohn's disease …

File formats shown: FastQ, BAM, VCF
File sizes shown: ~ 150 GB (compressed), ~ 150 GB, 100 to 200 MB, 500 MB

Processing time per genome: 1 to 100 hours* on 1 compute node
* Duration depends on selection of analytical tools and hardware

(6)

Workload Challenge #2: Unstructured Information

Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis

Information must be transformed … for statistical analysis and relationship visualization

• Phenotypic Data (Ex. Clinical Histories, Medical Images): …was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …
• Omics Data (Variant Databases): exonic NOD2 16 … a frameshift … SNP… exonic GJB2 13 … associated with hearing loss … exonic CRYL1,GJB6 13 … a 342kb deletion
• Scientific Literature: Peer-Reviewed Articles, Clinical Guidelines, Textbooks, Patents

(7)

Workload Challenge #3: ‘Big Data’ Integration

Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment

Data sources:
1 Omics Data: Variant Calls & Annotations (VCF)
  ##FORMAT=<ID=DP, … ##FORMAT=<ID=HQ, …
  #CHROM POS ID REF ALT … 20 14370 rs6054257 G A …
2 Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses
3 Knowledge Base: Electronic Text & Web Sites

Patient-Centric Logical Data Model (‘Big’ Data Warehouse Environment):
• Patient, Population
• Genotypic Data: Patient ID, Variant ID; Variant List; Detail on a Single Variant
• Phenotypic Data: Patient ID, Phenotype ID; Observation Detail; Observed Traits & Responses
• Knowledge Base
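As a rough illustration of the patient-centric model, the sketch below parses the first five fixed VCF columns (CHROM, POS, ID, REF, ALT) from the data line shown on this slide and joins genotypic and phenotypic rows on a patient ID. The class names, the patient ID `P001`, and the pairing of that variant with an observation are all hypothetical, invented for illustration; this is not the schema of any IBM product.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    chrom: str
    pos: int
    variant_id: str
    ref: str
    alt: str


def parse_vcf_line(line: str) -> Variant:
    """Parse the first five fixed columns of a tab-separated VCF data line."""
    chrom, pos, variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]
    return Variant(chrom, int(pos), variant_id, ref, alt)


@dataclass
class PatientRecord:
    patient_id: str
    variants: list = field(default_factory=list)      # genotypic data
    observations: list = field(default_factory=list)  # phenotypic data


def build_patient_records(genotypes, phenotypes):
    """Join (patient_id, vcf_line) and (patient_id, observation) rows on patient ID."""
    records = {}
    for patient_id, vcf_line in genotypes:
        record = records.setdefault(patient_id, PatientRecord(patient_id))
        record.variants.append(parse_vcf_line(vcf_line))
    for patient_id, observation in phenotypes:
        record = records.setdefault(patient_id, PatientRecord(patient_id))
        record.observations.append(observation)
    return records


# Hypothetical rows: the patient ID and the variant/observation pairing are made up.
genotypes = [("P001", "20\t14370\trs6054257\tG\tA")]
phenotypes = [("P001", "fatigue")]

records = build_patient_records(genotypes, phenotypes)
print(records["P001"])
```

Keying both data types on the same patient identifier is what lets a single query span variants and observed traits; a production warehouse would add the knowledge-base link via the variant ID.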

(8)

Key Capabilities

Leading biomedical research organizations are asking for technology capabilities that will

give them a low-cost solution to accelerate scientific discovery in Genomic Medicine

• Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data
• Seamless integration of complex life science data types on a common analytical platform
• Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents
• Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results
• Tools for scientific collaboration that enable data and workload sharing across organizations and geographic boundaries in a secure environment that ensures data privacy

(9)

A Foundation for Computational Science

IBM’s Reference Architecture for Genomic Medicine supports ‘big data’ computational research on a

foundation of HPC compute, storage, and workload management capabilities

Research Applications (IBM Research, IBM Watson, IBM Business Partners)
Performance optimization for open source and commercial analytics applications
• Text Analytics / NLP (Apache UIMA, IBM System T): Text Analytics for the conversion of natural language concepts into structured data entities
• Image Analysis
• Genomic Analysis Pipelines
• Computational Modeling

‘Big Data’ Foundation
• Workload Orchestration with Metadata Capture (IBM Platform Computing, IBM Business Partners): Intelligent resource allocation, sharing, and monitoring across parallel HPC workloads
• ‘Big’ Data Warehouse (IBM BigInsights, IBM Business Partners): RDBMS or NoSQL database environments enabling rapid processing of large volumes of complex high-dimensional data structures in a data warehouse
• Data Management: File System & Storage / ILM (IBM Spectrum Scale / Elastic Storage Server): Low-cost, low-latency, easy-access storage & archiving of data and metadata across heterogeneous environments
• LAN

(10)

Data management and analytics tools can be accessed and shared across heterogeneous systems in

on-premise and cloud environments

IBM Systems Facilitate Scientific Collaboration

‘Big Data’ foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments

Users: Public Cloud Users, Private Cloud Users, On-Premise Users, External Collaborators (Heterogeneous Environments)
Environments: Local Data Center, On-Premise Cluster, Virtual Private Clouds
Connectivity: Encrypted VPN, WAN, HPC Network, 10GbE or InfiniBand, 1/10 GbE
Workload Burst
Layers: Applications; Workload Orchestration with Metadata Capture; ‘Big’ Data Warehouse; Data Management: File System / Storage ILM

(11)

Applications: Genomics, Translational, Personalized Healthcare, …
Access: Application & Workflow, File & Database, Visualization, System & Log
AppCenter (PAC, Galaxy, DataBiology, Lab7)
Orchestrator (ASC/EGO, LSF, Symphony, PPM)
Datahub (Spectrum Scale, Zato, Nirvana)
Platforms
• Compute: HPC Cluster, Big Data, Spark Cluster, Openstack, Docker
• Storage: SSD/Flash, FC/IB Attached, Low-cost Storage, HA/DR Storage, Cloud Storage

(12)

A framework for NGS and HPC Systems Architecture

Users, Devices
Scale-out cluster
Scale-up SMP
Active Archive: TSM/LTFS/HPSS
HPC Management Suite
Platform Software Stack
Spectrum Scale

(13)

IBM Genomics Reference Architecture

The IBM Reference Architecture is an ecosystem of data management and analytics tools

developed by IBM and industry-leading commercial and open source software providers

(14)

BioBuilds – Open Source Bioinformatics

Open Source bioinformatics tools for research, commercial, and regulated environments.

• Turn-key: Pre-built binaries and complete build scripts enable easy deployment
• Optimized: POWER8 binaries provide the best performance for your hardware
• Ready for the Clinic: A single source for tools streamlining verification and audit
• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools

http://biobuilds.org/

(15)

2014.11
• ALLPATHS-LG
• Bedtools
• Bfast
• BLAST (NCBI)
• Bowtie
• Bowtie2
• BWA
• Cufflinks
• FastQC
• HMMER
• HTSeq
• Mothur
• Numpy
• PICARD
• PLINK
• Python
• SAMTools
• SOAP3-DP
• SOAPDenovo
• SQLite
• Tabix
• TopHat
• Velvet/Oases

2015.02
• R
• Bioconductor
• FASTA
• Trinity
• SHRiMP
Updated tools
• HMMER (LE)
• OpenSSL

2015.04
• IGV
• iRODS
• RNAStar
• ISAAC
• TMAP
• SOAPaligner/soap2
Updated tools
• Bowtie2
• BWA
• OpenSSL

(16)

https://www.broadinstitute.org/gatk/blog?id=4833

Optimization of GATK from Broad Institute

IBM works with genomics leaders to improve performance of analytical workflows like GATK on IBM Power 8 Systems

(17)

Optimization of Broad’s Best Practice Pipeline

Steps                 Intel Runtime*   IBM Runtime
BWA                   7                3.88
Samtools              5                3.18
MarkDuplicates        11               7.46
RealignTargets        1                0.23
IndelRealigner        6.5              0.75
BaseRecalibrator      1.3              1.13
PrintReads+Index      12.3             2.48
Preprocessing Total   44               19.09
HaplotypeCaller                        2.03
Total                                  21.12

Input Dataset: G15512.HCC1954.1, coverage: 65x
Both IBM and Intel solution: # of Machines = 1, # of cores/Machine = 24
IBM Solution: 3.325 GHz Power8 with GPFS

~ 65X Whole Human Genome analysis done within a day
~ 150X Whole Exome analysis done in 3.45 hours
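The per-step ratios implied by the runtime table can be worked out with a few lines of arithmetic. The sketch below copies the Intel and IBM values from the table; the slide does not state the units, so the ratios are unit-free, and the variable names are assumptions for illustration.

```python
# (Intel runtime, IBM runtime) per step, copied from the slide's table.
# The slide does not state the units; the ratios below do not depend on them.
RUNTIMES = {
    "BWA": (7.0, 3.88),
    "Samtools": (5.0, 3.18),
    "MarkDuplicates": (11.0, 7.46),
    "RealignTargets": (1.0, 0.23),
    "IndelRealigner": (6.5, 0.75),
    "BaseRecalibrator": (1.3, 1.13),
    "PrintReads+Index": (12.3, 2.48),
}

# Preprocessing totals as printed on the slide.
INTEL_PREPROCESSING_TOTAL = 44.0
IBM_PREPROCESSING_TOTAL = 19.09


def speedup(baseline: float, optimized: float) -> float:
    """Ratio of baseline runtime to optimized runtime (higher is better)."""
    return baseline / optimized


per_step = {step: speedup(intel, ibm) for step, (intel, ibm) in RUNTIMES.items()}
preprocessing_speedup = speedup(INTEL_PREPROCESSING_TOTAL, IBM_PREPROCESSING_TOTAL)

for step, ratio in per_step.items():
    print(f"{step:<18} {ratio:5.2f}x")
print(f"{'Preprocessing':<18} {preprocessing_speedup:5.2f}x")
```

Read this way, the largest gains are in IndelRealigner and PrintReads+Index, while BaseRecalibrator is nearly unchanged.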

(18)

Performance of L3 Bioinformatics BALSA on Power 8 with GPU

(19)

IO Cache Library to Optimize Performance of Genomics Application

IBM uses a File Cache Library to improve I/O Performance and reduce workflow runtimes

Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data
Without cache library: Elapsed Time = 1730 min
With cache library: Elapsed Time = 107 min

(20)

Accelerating Genomics Applications using GPFS

IBM and BIOVIA’s Pipeline Pilot scale genomic analysis from the desktop to the enterprise using IBM GPFS

NGS Benchmarks on 2.6 GHz iDataPlex with GPFS and NFS
Bowtie2, Elapsed Time in Minutes (lower is better):
GPFS: 119
NFS: 437

Speed of the file system matters

(21)

Genomic Workflow Optimization

Typical Genomic Sequencing Workflow – Command Line

• bwa aln -t 12 -l 40 -n 3 -k 2
• bwa sampe -a 700 -P -o 1000
• samtools view -bt
• samtools sort
• Picard: java -Xmx8g -Djava.io.tmpdir MarkDuplicates.jar METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR
• Picard: java -Xmx8g -Djava.io.tmpdir AddOrReplaceReadGroups.jar SORT_ORDER=coordinate RGID=sample_lane RGLB=sample RGPL=illumina RGPU=lane RGSM=sample RGCN=center_name CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT TMP_DIR
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T RealignerTargetCreator -nt 1
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T IndelRealigner -targetIntervals -known 1000G_biallelic.indels.hg19.vcf
• Picard: java -Xmx8g -Djava.io.tmpdir FixMateInformation.jar SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true TMP_DIR
• Gatk lite: java -Xmx#{JAVA_REQMEM}g -Djava.io.tmpdir -T CountCovariates -recalFile -knownSites:dbsnp,VCF /gpfs/gpfs1/GENOME/SNP_INDEL_VCF/dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T TableRecalibration -recalFile -sMode SET_Q_ZERO -solid_nocall_strategy THROW_EXCEPTION -nback 7 --baq RECALCULATE
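Each step in a command-line workflow like this consumes the output of the previous one, and the deck notes elsewhere that dependency steps were expressed with the LSF `bsub -w` option. The following minimal sketch builds (but does not submit) a chain of `bsub` invocations in which every job waits for its predecessor through a `done(job_name)` dependency expression. The wrapper script names, the job-name prefix, and the helper function are hypothetical; the slide omits input and output file arguments, so they are not reconstructed here.

```python
import shlex


def build_bsub_chain(steps, prefix="wf"):
    """Return one bsub argument list per step; each job waits for the previous job."""
    commands = []
    previous_job = None
    for name, command in steps:
        job_name = f"{prefix}_{name}"
        args = ["bsub", "-J", job_name]
        if previous_job is not None:
            # LSF dependency expression: start only after the previous job is done.
            args += ["-w", f"done({previous_job})"]
        args += shlex.split(command)
        commands.append(args)
        previous_job = job_name
    return commands


# Hypothetical wrapper scripts standing in for the workflow steps.
STEPS = [
    ("align", "./run_bwa.sh"),
    ("sort", "./run_samtools_sort.sh"),
    ("markdup", "./run_markduplicates.sh"),
]

chain = build_bsub_chain(STEPS)
for args in chain:
    print(shlex.join(args))
```

Building argument lists first and printing them keeps the sketch side-effect free; passing each list to `subprocess.run` on a host with LSF installed would be the natural next step.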

(22)

Genomic Workflow Optimization

(23)

Runs                1st Set     2nd Set    3rd Set     4th Set     Total Sets
1 set on 8 nodes    10.06 hrs   ---        ---         ---         10.06 hrs
4 sets on 8 nodes   19.02 hrs   20.9 hrs   21.26 hrs   25.07 hrs   25.10 hrs

Data Set: 37x coverage of whole human genomes

Workflow Input: 74 fastq.gz files, Workflow Output: Recalibrated Bam file

Dependency steps = Using LSF bsub -w option

Genomic Workflow Optimization

IBM Platform LSF workload scheduler is linked to the Process Manager and maximizes the utilization of HPC resources to improve workflow runtimes
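One way to read the run-time table is as a throughput comparison. The sketch below assumes, as a simplification not stated on the slide, that the alternative to running four sets concurrently would be running them back to back at the single-set time of 10.06 hrs each; the variable names are illustrative.

```python
# Values from the slide's table.
SINGLE_SET_HOURS = 10.06       # 1 set on 8 nodes
FOUR_SET_TOTAL_HOURS = 25.10   # 4 sets on 8 nodes, total
N_SETS = 4

# Assumption: back-to-back runs would each take the single-set time.
back_to_back_hours = N_SETS * SINGLE_SET_HOURS
throughput_gain = back_to_back_hours / FOUR_SET_TOTAL_HOURS
hours_per_set = FOUR_SET_TOTAL_HOURS / N_SETS

print(f"back-to-back estimate: {back_to_back_hours:.2f} hrs")
print(f"concurrent total:      {FOUR_SET_TOTAL_HOURS:.2f} hrs")
print(f"throughput gain:       {throughput_gain:.2f}x")
print(f"effective time/set:    {hours_per_set:.3f} hrs")
```

Under that assumption each individual set finishes later than it would alone, but the cluster completes the four sets in less total time.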

(24)

Data Compression Appliance

Compression Algorithms, with compression ratio (lossless) and speed/throughput:

gzip on Power 8 with FPGA board (available now)
  Compression ratio (lossless): on average 1:3 for fastq files
  Speed/throughput: 2.5 GB/s on average (200 GB fastq can be compressed in 80 seconds)

CRAM
  Compression ratio (lossless): 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (from FASTQ to compressed BAM the ratio is 16X)
  Speed/throughput: achieved beyond 10 times speed-up using 12 cores (approximately 0.5 GB/min); FPGA acceleration is ongoing

• The Pistoia compression contest was held in 2012. James Bonfield of the Sanger Institute won with a 1:9 compression ratio and 0.1 GB/min
• CRAM was released by EBI in late 2012 to compress BAM files and has been accepted by the Global Alliance for Genomics and Health
• IBM is collaborating with the Sanger Institute and EBI on improving compression for genomics data – Samtools, Picard, CRAM
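The throughput figures on this slide can be checked with simple arithmetic. The sketch below reproduces the "200 GB in 80 seconds" claim from the 2.5 GB/s gzip rate and estimates the output size at the quoted 1:3 ratio. Applying the 0.5 GB/min figure to the same 200 GB input is a hypothetical comparison for scale only, since the slide quotes that rate for CRAM rather than for fastq; the function names are assumptions.

```python
def compression_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time in seconds to compress size_gb at the given throughput."""
    return size_gb / throughput_gb_per_s


def compressed_size_gb(size_gb: float, ratio: float) -> float:
    """Output size for a 1:ratio compression ratio."""
    return size_gb / ratio


FASTQ_GB = 200.0

# gzip on Power 8 with FPGA: 2.5 GB/s, 1:3 ratio (figures from the slide).
gzip_seconds = compression_seconds(FASTQ_GB, 2.5)
gzip_output_gb = compressed_size_gb(FASTQ_GB, 3.0)

# Hypothetical comparison: the same input size at 0.5 GB/min.
minutes_at_half_gb_per_min = FASTQ_GB / 0.5

print(f"gzip/FPGA time:   {gzip_seconds:.0f} s")
print(f"gzip/FPGA output: {gzip_output_gb:.1f} GB")
print(f"at 0.5 GB/min:    {minutes_at_half_gb_per_min:.0f} min")
```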

(25)

IBM works with Lab7 to deliver data provenance with performance, reliability and security

[Figure: sample raw sequence reads, e.g. >187_29_706_F3 T23302010303131123123022203111123200210100122001]

Workflow stages: Experimental Design, Sample Prep, Sequencing, Mapping, Analysis, Reporting, Meta Analysis

Components: Workflow Engine, Federated Data Engine, Pipeline Engine, Visualization/EDA, Sample LIMS, User Experience

Sample Data Reference Attribute Sheet Pipeline

IBM Power System Solution with GPFS and Platform LSF delivers:
• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  - Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  - IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems

Lab7 ESP
• Comprehensive software platform: combines LIMS and informatics functionalities
• Data provenance: maintains continuous data provenance by tracking the history of samples, analyses, and results, and by providing detailed audit trails
• Sequencing platform flexibility: manages data generated from any sequencing platform

(26)


3 C’s (Configure, Command, Collaborate)

Ontologies Annotation Samples Comments + Attachments Roles + Access Shopping Basket Social Scientific Lifecycle Management Meta Information Financial + Resource Mgmt Task Management

Project Management Applications Import Analysis Visualization Infrastructure Network Storage Compute Configuration Instruments

Compute and Storage Softlayer – LSF – GPFS

Transport DBE Download Manager S3, SCP, RSync, SFTP, FTP HTTP Logic Version Control + Reproducible Data Provenance Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines Portal API Custom Web Apps via API

DBE Multiprot Email + WF Integration Identity Management Information Management Interface Orchestration

Databiology for Enterprise Functional Architecture

Databiology for Enterprise
• SaaS + customer specific instances
• Central hub to manage all ‘omics data and to orchestrate all activities
• Functionally rich and orientated on key steps in R&D life cycle
• Insight to Instrument with best in class applications
• Easy integration with existing environments
• Automatic data provenance and reporting
• Cost neutral deployment
• Gradual roll-out / Low risk

(27)

tranSMART - Optimized on Power8 and Spectrum Scale

• tranSMART associates genotypic & phenotypic data for complex analytics
• Watson Explorer extracts insight from scientific literature and data record and provides

(28)

Architecture diagram components:
• Analytics tools: R Analytics Tools, Solr Full Text index, Gene Patterns, PLINK, Watson Analytics
• Users, Browser (HTTP)
• Web Server (Apache2)
• Application Server (Tomcat 7): tranSMART
• I2b2 Application Server
• Watson Analytics Server
• PostgreSQL tranSMART DB (JDBC)
• Quartz Job Call
• GPFS
• Power8

(29)

Accelerate tranSMART ETL by Power8/Spectrum Scale

Dataset      No. Records
TCGA_OV      5,789,632
Simulation   40,774,968
GSE32583     942,724
GSE13168     1,203,282
GSE1456      3,600,555
GSE15258     4,702,050

(30)

Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and GPFS

Data sources shown: NIH Data, CDC Data, NLM Data, Lab Results, Imaging Data, Radiology Reports, Microbiology Reports, Nursing Home Records, Claims Data, Electronic Health Record Data, Genomic Data, Accepted Medical Knowledge

Connectivity: Internet, VPN, LAN

(31)

(32)

(33)