Accelerating Life Science Discovery using a High-Performance
Analytics Platform in a Collaborative Environment
October 7, 2015
Kathy Tzeng, PhD
Worldwide Technical Lead
Healthcare & Life Sciences
IBM Systems Group
Genomic Solution Enablement Team
Mission:
• Porting and Optimization of Genomics/Translational applications on IBM solutions
• Developing Solutions with Partners
• Making IBM SW/HW available to Software developers
Members:
• Independent Software Vendor (ISV) team
• Toronto Compiler Lab
• Boeblingen Development Lab
• Tokyo Research Lab
GENOMIC MEDICINE: from Sequencing to Personalized Healthcare
NHGRI, a branch of NIH, has defined 5 steps for genomic medicine (source: E. Green et al., Nature 470, 204–213).
Next Generation Sequencing (or other ingestion)
The focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction. Includes human, plant, animal, and microbiome genomics.
Translational Research/Early Discovery
The focus is on data integration, including genomic data, and on the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments.
Personalized Healthcare/Clinical Genomics
The focus is on delivering genomic medicine to patients to improve outcomes by associating patients with …
Figure: predictive response function
W(t), Predictive Response Function: model of associations between features and responses as a function of time t
F(t), Known Traits or Environmental Features: quantities describing population traits or environmental factors at time t
R(t), Measured Biological Response: quantities describing response events for an organism at time t
Computational Challenges: feature combinatorics, large file sizes, large population sizes, unstructured data types
A Computationally Challenging Problem
Breakthroughs in Genomic Medicine require quantifying associations between known
population traits, environmental factors, and biological responses
Variant information requires a computationally intensive analysis of raw sequence data
across thousands of genomic samples
Workload Challenge #1: ‘Big Data’ Analytics
ANNOVAR
Gene Ontology
…
~ 150 GB
(compressed)
Each human genome can have a few million variants
High-Throughput
Sequencing
File Format
Assembly & Alignment
BAM
Raw Reads
De Novo Assembly
~ 150 GB
Whole Human Genome
SOAPdenovo
Velvet
…
Reference-Based Mapping
BWA
Bowtie
SOAP
…
Reference Genomes
TCGA
GEO
dbSNP
…
Variant Calling
Variant Calling
VCF
100 to 200 MB
Picard
GATK
SAMtools
SOAPsnp
…
Variant Annotations
Annotation Tools
intergenic … SNP in IL23R associated with Crohn's disease …
Sample:
Processing time per genome: 1 to 100 hours* on 1 compute node
* Duration depends on selection of analytical tools and hardware
FastQ
500 MB
3 billion DNA base pairs
Phenotypic Data
Ex. Clinical Histories,
Medical Images
…was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …
exonic NOD2 16 … a frameshift … SNP…
exonic GJB2 13 … associated with hearing loss …
exonic CRYL1,GJB6 13 … a 342kb deletion
Omics Data
Variant Databases
Scientific data must be extracted from very large volumes of natural language content, biomedical images,
and other unstructured data, and transformed into a structured format for analysis
Workload Challenge #2: Unstructured Information
Scientific Literature
Peer-Reviewed Articles, Clinical
Guidelines, Textbooks, Patents
Information must be transformed … for statistical analysis and relationship visualization
Workload Challenge #3: ‘Big Data’ Integration
1 Omics Data
2 Phenotypic Data
3 Knowledge Base
Discovery of genotype-phenotype associations requires an analysis of complex data types
that must be integrated within a common analytical environment
Variant Calls & Annotations
Electronic Text
& Web Sites
##FORMAT=<ID=DP, … ##FORMAT=<ID=HQ, …
#CHROM POS ID REF ALT … 20 14370 rs6054257 G A …
Figure: Patient-Centric Logical Data Model in a ‘Big’ Data Warehouse Environment (Patient, Population)
1 Genotypic Data: VCF, Variant List, Detail on a Single Variant, Variant ID, Patient ID
2 Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses, Observed Traits & Responses, Observation Detail, Phenotype ID, Patient ID
3 Knowledge Base
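As a rough illustration of the patient-centric model above, the following sketch links genotypic and phenotypic records through a shared patient ID using Python's built-in sqlite3 module. The table names, column names, patient IDs, and observation text are hypothetical and are not taken from the IBM architecture; only the variant rs6054257 (chromosome 20, position 14370, G to A) comes from the VCF excerpt on the slide.

```python
import sqlite3

# Hypothetical minimal schema; all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (patient_id TEXT PRIMARY KEY);
CREATE TABLE variant (
    variant_id TEXT PRIMARY KEY, chrom TEXT, pos INTEGER, ref TEXT, alt TEXT
);
CREATE TABLE genotype (
    patient_id TEXT REFERENCES patient(patient_id),
    variant_id TEXT REFERENCES variant(variant_id)
);
CREATE TABLE observation (
    phenotype_id TEXT,
    patient_id TEXT REFERENCES patient(patient_id),
    detail TEXT
);
""")

# Made-up patients and observations; the variant row is from the VCF excerpt.
conn.executemany("INSERT INTO patient VALUES (?)", [("P1",), ("P2",)])
conn.execute("INSERT INTO variant VALUES ('rs6054257', '20', 14370, 'G', 'A')")
conn.execute("INSERT INTO genotype VALUES ('P1', 'rs6054257')")
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?)",
    [("PH1", "P1", "fatigue"), ("PH2", "P2", "fatigue")],
)


def traits_for_variant(variant_id):
    """Return (patient_id, observation detail) pairs for carriers of a variant."""
    return conn.execute(
        """
        SELECT g.patient_id, o.detail
        FROM genotype AS g
        JOIN observation AS o ON o.patient_id = g.patient_id
        WHERE g.variant_id = ?
        ORDER BY g.patient_id, o.phenotype_id
        """,
        (variant_id,),
    ).fetchall()


result = traits_for_variant("rs6054257")
print(result)
```

Keying both data types on the patient ID is what lets a single query move from a variant to the traits observed in its carriers; a production warehouse would add the knowledge-base tables and far richer observation detail.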
Key Capabilities
Leading biomedical research organizations are asking for technology capabilities that will
give them a low-cost solution to accelerate scientific discovery in Genomic Medicine
Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data
Seamless integration of complex life science data types on a common analytical platform
Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents
Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results
Tools for scientific collaboration that enable data and workload sharing across organizations and geographic boundaries in a secure environment that ensures data privacy
A Foundation for Computational Science
IBM’s Reference Architecture for Genomic Medicine supports ‘big data’ computational research on a
foundation of HPC compute, storage, and workload management capabilities
Research Applications: Text Analytics / NLP (Apache UIMA, IBM System T), Image Analysis, Genomic Analysis Pipelines, Computational Modeling
  Performance optimization for open source and commercial analytics applications
  Text Analytics for the conversion of natural language concepts into structured data entities
  IBM Research, IBM Watson, IBM Business Partners
‘Big Data’ Foundation:
‘Big’ Data Warehouse
  RDBMS or NoSQL database environments enabling rapid processing of large volumes of complex high-dimensional data structures in a data warehouse
  IBM BigInsights, IBM Business Partners
Data Management: File System & Storage / ILM
  Low-cost, low-latency, easy-access storage & archiving of data and metadata across heterogeneous environments
  IBM Spectrum Scale / Elastic Storage Server
Workload Orchestration with Metadata Capture
  Intelligent resource allocation, sharing, and monitoring across parallel HPC workloads
  IBM Platform Computing, IBM Business Partners
LAN
Data management and analytics tools can be accessed and shared across heterogeneous systems in
on-premise and cloud environments
IBM Systems Facilitate Scientific Collaboration
External Collaborators
(Heterogeneous Environments)
Local Data Center
Virtual
Private Clouds
Public Cloud Users
Private Cloud Users
On-Premise Users
On-Premise
Cluster
Encrypted VPN
‘Big Data’ foundation
enables data access, data
management, and HPC
workload orchestration
across heterogeneous
on-premise, private cloud,
public cloud, and hybrid
cloud environments
HPC Network
Data Management: File System / Storage ILM
WAN
Workload Burst
Applications
10GbE or InfiniBand
1/10 GbE
Workload Orchestration with Metadata Capture
‘Big’ Data Warehouse
Genomics, Translational, Personalized Healthcare
AppCenter (PAC, Galaxy, DataBiology, Lab7)
Orchestrator (ASC/EGO, LSF, Symphony, PPM)
Datahub (Spectrum Scale, Zato, Nirvana)
Platforms, Compute: HPC Cluster, Big Data, Spark Cluster, Openstack, Docker, …; Scale-out cluster, Scale-up SMP
Platforms, Storage: SSD/Flash, FC/IB Attached, Low-cost Storage, HA/DR Storage, Cloud Storage, Active Archive (TSM/LTFS/HPSS)
Access: Users, Devices
Application & Workflow, File & Database, Visualization, System & Log
HPC Management Suite
Platform Software Stack
A framework for NGS and HPC Systems Architecture
Spectrum Scale
IBM Genomics Reference Architecture
The IBM Reference Architecture is an ecosystem of data management and analytics tools
developed by IBM and industry-leading commercial and open source software providers
BioBuilds – Open Source Bioinformatics
• Turn-key: Pre-built binaries and complete build scripts enable easy deployment
• Optimized: POWER8 binaries provide the best performance for your hardware
• Ready for the Clinic: A single source for tools streamlining verification and audit
• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools
http://biobuilds.org/
Open Source bioinformatics tools for research, commercial, and regulated environments.
2014.11
• ALLPATHS-LG
• Bedtools
• Bfast
• BLAST (NCBI)
• Bowtie
• Bowtie2
• BWA
• Cufflinks
• FastQC
• HMMER
• HTSeq
• Mothur
• Numpy
• PICARD
• PLINK
• Python
• SAMTools
• SOAP3-DP
• SOAPDenovo
• SQLite
• Tabix
• TopHat
• Velvet/Oases
2015.02
• R
• Bioconductor
• FASTA
• Trinity
• SHRiMP
Updated tools
• HMMER (LE)
• OpenSSL
2015.04
• IGV
• iRODS
• RNAStar
• ISAAC
• TMAP
• SOAPaligner/soap2
Updated tools
• Bowtie2
• BWA
• OpenSSL
https://www.broadinstitute.org/gatk/blog?id=4833
Optimization of GATK from Broad Institute
IBM works with genomics leaders to improve performance of analytical
workflows like GATK on IBM Power 8 Systems
Steps                 Intel Runtime*   IBM Runtime
BWA                   7                3.88
Samtools              5                3.18
MarkDuplicates        11               7.46
RealignTargets        1                0.23
IndelRealigner        6.5              0.75
BaseRecalibrator      1.3              1.13
PrintReads+Index      12.3             2.48
PreProcessing Total   44               19.09
HaplotypeCaller                        2.03
Total                                  21.12
Input Dataset: G15512.HCC1954.1, coverage: 65x
Both IBM and Intel solution: # of Machines = 1, # of cores/Machine = 24
IBM Solution: 3.325 GHz Power8 with GPFS
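The per-step runtimes listed above can be turned into speedup ratios with a few lines of Python. This is only a sketch of the arithmetic: the numbers are copied from the slide, the units are whatever the slide used, and the variable names are illustrative. HaplotypeCaller is left out because the slide gives no Intel figure for it.

```python
# (Intel runtime, IBM runtime) per pre-processing step, copied from the slide.
runtimes = {
    "BWA": (7, 3.88),
    "Samtools": (5, 3.18),
    "MarkDuplicates": (11, 7.46),
    "RealignTargets": (1, 0.23),
    "IndelRealigner": (6.5, 0.75),
    "BaseRecalibrator": (1.3, 1.13),
    "PrintReads+Index": (12.3, 2.48),
}

# Speedup = Intel runtime divided by IBM runtime (values above 1 favor IBM).
speedups = {
    step: round(intel / ibm, 2) for step, (intel, ibm) in runtimes.items()
}

# Column sums, to compare against the "PreProcessing Total" row.
intel_total = sum(intel for intel, _ in runtimes.values())
ibm_total = sum(ibm for _, ibm in runtimes.values())
total_speedup = round(intel_total / ibm_total, 2)

for step, ratio in speedups.items():
    print(f"{step:18s} {ratio:5.2f}x")
print(f"{'Pre-processing':18s} {total_speedup:5.2f}x")
```

Summing the columns gives values close to the reported totals of 44 and 19.09; the small differences are consistent with rounding of the per-step figures on the slide.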
Optimization of Broad’s Best Practice Pipeline
~ 65X Whole Human Genome analysis done within a day
~ 150X Whole Exome analysis done in 3.45 hours
Performance of L3 Bioinformatics BALSA on Power 8 with GPU
Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data
Without cache library: Elapsed Time = 1730 min
With cache library: Elapsed Time = 107 min
IO Cache Library to Optimize Performance of Genomics Application
IBM uses a File Cache Library to improve I/O Performance and reduce
workflow runtimes
Bowtie2: NGS Benchmarks on 2.6 GHz iDataPlex with GPFS and NFS
Elapsed Time in Minutes, lower is better
GPFS: 119
NFS: 437
Speed of the file system matters
Accelerating Genomics Applications using GPFS
IBM and BIOVIA’s Pipeline Pilot scale genomic analysis from the
desktop to the enterprise using IBM GPFS
Genomic Workflow Optimization
Typical Genomic Sequencing Workflow – Command Line
• bwa aln -t 12 -l 40 -n 3 -k 2
• bwa sampe -a 700 -P -o 1000
• samtools view -bt
• samtools sort
• Picard: java -Xmx8g -Djava.io.tmpdir MarkDuplicates.jar METRICS_FILE=metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT REMOVE_DUPLICATES=true ASSUME_SORTED=true TMP_DIR
• Picard: java -Xmx8g -Djava.io.tmpdir AddOrReplaceReadGroups.jar SORT_ORDER=coordinate RGID=sample_lane RGLB=sample RGPL=illumina RGPU=lane RGSM=sample RGCN=center_name CREATE_INDEX=True VALIDATION_STRINGENCY=LENIENT TMP_DIR
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T RealignerTargetCreator -nt 1
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T IndelRealigner -targetIntervals -known 1000G_biallelic.indels.hg19.vcf
• Picard: java -Xmx8g -Djava.io.tmpdir FixMateInformation.jar SO=coordinate VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true TMP_DIR
• Gatk lite: java -Xmx#{JAVA_REQMEM}g -Djava.io.tmpdir -T CountCovariates -recalFile -knownSites:dbsnp,VCF /gpfs/gpfs1/GENOME/SNP_INDEL_VCF/dbsnp_137.hg19.vcf -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate
• Gatk lite: java -Xmx8g -Djava.io.tmpdir -T TableRecalibration -recalFile -sMode SET_Q_ZERO -solid_nocall_strategy THROW_EXCEPTION -nback 7 --baq RECALCULATE
Genomic Workflow Optimization
Runs                1st Set     2nd Set     3rd Set     4th Set     Total Sets
1 set on 8 nodes    10.06 hrs   ---         ---         ---         10.06 hrs
4 sets on 8 nodes   19.02 hrs   20.9 hrs    21.26 hrs   25.07 hrs   25.10 hrs
Data Set: 37x coverage of whole human genomes
Workflow Input: 74 fastq.gz files; Workflow Output: recalibrated BAM file
Dependency steps: using the LSF bsub -w option
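The dependency mechanism mentioned above can be sketched as follows: each workflow step is submitted with `bsub -J <name>`, and every step after the first adds `-w "done(<previous name>)"` so that LSF holds it until its predecessor finishes successfully. The step names and script paths below are hypothetical placeholders, and the sketch only builds the command strings; it does not submit anything.

```python
import shlex


def bsub_chain(steps):
    """Build bsub command lines in which each step waits for the previous one.

    `steps` is a list of (job_name, command) pairs. The first job has no
    dependency; every later job gets -w 'done(<previous job name>)'.
    """
    commands = []
    previous = None
    for name, command in steps:
        cmd = ["bsub", "-J", name]
        if previous is not None:
            cmd += ["-w", f"done({previous})"]
        cmd.append(command)
        commands.append(shlex.join(cmd))
        previous = name
    return commands


# Hypothetical step names and wrapper scripts, for illustration only.
steps = [
    ("align", "./align.sh"),
    ("sort", "./sort.sh"),
    ("markdup", "./markdup.sh"),
]
commands = bsub_chain(steps)
for line in commands:
    print(line)
```

Expressing the dependencies at submission time lets the scheduler interleave steps from several sample sets on the same nodes, which is the situation measured in the four-set run above.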
Genomic Workflow Optimization
IBM Platform LSF workload scheduler is linked to the Process Manager and
maximizes the utilization of HPC resources to improve workflow runtimes
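To make the utilization claim concrete, the set timings reported earlier (one set in 10.06 hrs, four concurrent sets finished in 25.10 hrs, both on 8 nodes) can be compared with a hypothetical back-to-back schedule. This is a rough sketch that assumes the "Total Sets" column is the wall-clock time to finish all sets and that four sequential runs would each take the single-set time.

```python
# Elapsed hours copied from the workflow timing table.
ONE_SET_HOURS = 10.06
FOUR_SETS_TOTAL_HOURS = 25.10
N_SETS = 4


def back_to_back_hours(single_set_hours, n_sets):
    """Wall-clock time if the sets were run one after another (assumption)."""
    return single_set_hours * n_sets


sequential_hours = back_to_back_hours(ONE_SET_HOURS, N_SETS)
throughput_gain = sequential_hours / FOUR_SETS_TOTAL_HOURS
hours_per_set = FOUR_SETS_TOTAL_HOURS / N_SETS

print(f"Back-to-back estimate: {sequential_hours:.2f} hrs")
print(f"Concurrent run:        {FOUR_SETS_TOTAL_HOURS:.2f} hrs")
print(f"Throughput gain:       {throughput_gain:.2f}x")
print(f"Effective time/set:    {hours_per_set:.2f} hrs")
```

Each individual set takes longer when four run together, but the cluster completes more sets per hour, which is the trade-off the scheduler is exploiting.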
Data Compression Appliance
gzip on Power 8 with FPGA board (available now)
  Compression ratio (lossless): on average 1:3 for fastq files
  Speed/throughput: 2.5 GB/s on average (a 200 GB fastq can be compressed in 80 seconds)
CRAM
  Compression ratio (lossless): 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (the FASTQ to compressed BAM ratio is 16X)
  Speed/throughput: more than 10 times speed-up achieved using 12 cores (approximately 0.5 GB/min); FPGA acceleration is ongoing
• The Pistoia compression contest was held in 2012. James Bonfield of the Sanger Institute won with a 1:9 compression ratio and 0.1 GB/min
• CRAM was released by EBI in late 2012 to compress BAM files and has been accepted by the Global Alliance for Genomics and Health
• IBM is collaborating with the Sanger Institute and EBI on improving compression for genomics data (Samtools, Picard, CRAM)
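The throughput and ratio figures quoted on this slide can be checked with simple arithmetic. The sketch below uses only numbers from the slide (2.5 GB/s and a 1:3 ratio for FPGA gzip, a 200 GB FASTQ input, about 0.5 GB/min for CRAM); the function names are illustrative, and applying the CRAM rate to the same 200 GB input is a hypothetical comparison rather than a measured result.

```python
def seconds_to_compress(size_gb, rate_gb_per_s):
    """Time to compress `size_gb` at a sustained rate in GB per second."""
    return size_gb / rate_gb_per_s


def compressed_size_gb(size_gb, ratio):
    """Output size for a 1:ratio lossless compression."""
    return size_gb / ratio


FASTQ_GB = 200

# gzip on Power 8 with FPGA: 2.5 GB/s, 1:3 ratio on average for FASTQ.
gzip_seconds = seconds_to_compress(FASTQ_GB, 2.5)
gzip_output_gb = compressed_size_gb(FASTQ_GB, 3)

# CRAM at roughly 0.5 GB/min, expressed in GB/s for the same helper.
cram_minutes = seconds_to_compress(FASTQ_GB, 0.5 / 60) / 60

print(f"gzip/FPGA: {gzip_seconds:.0f} s, about {gzip_output_gb:.1f} GB output")
print(f"CRAM at 0.5 GB/min: about {cram_minutes:.0f} min for the same volume")
```

The first result reproduces the slide's own statement that a 200 GB FASTQ file compresses in 80 seconds.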
IBM works with Lab7 to deliver data provenance with performance, reliability
and security
Experimental Design, Sample Prep, Sequencing, Mapping, Analysis, Reporting, Meta Analysis
Workflow Engine
Federated Data Engine, Pipeline Engine
Visualization/EDA, Sample LIMS
User Experience
Sample Data Reference Attribute Sheet Pipeline
IBM Power System Solution with GPFS and Platform LSF delivers:
Superior compute infrastructure: superior performance, scalability & maximum throughput
Outstanding enterprise-grade reliability and security:
• Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
• IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
Total cost of ownership: very affordable compared to like-sized x86 systems
Lab7 ESP
Comprehensive software platform: combines LIMS and informatics functionalities
Data provenance: maintains continuous data provenance by:
• Tracking the history of samples, analyses, and results
• Providing detailed audit trails
Sequencing platform flexibility: manages data generated from any sequencing platform
3 C’s (Configure, Command, Collaborate)
Ontologies Annotation Samples Comments + Attachments Roles + Access Shopping Basket Social Scientific Lifecycle Management Meta Information Financial + Resource Mgmt Task Management
Project Management Applications Import Analysis Visualization Infrastructure Network Storage Compute Configuration Instruments
Compute and Storage Softlayer – LSF – GPFS
Transport DBE Download Manager S3, SCP, RSync, SFTP, FTP HTTP Logic Version Control + Reproducible Data Provenance Everything as an app: Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines Portal API Custom Web Apps via API
DBE Multiprot Email + WF Integration Identity Management Information Management Interface Orchestration
Databiology for Enterprise Functional Architecture
Databiology for Enterprise
SaaS + customer specific instances
Central hub to manage all ‘omics data and to orchestrate all activities
Functionally rich and orientated on key steps in R&D life cycle
Insight to Instrument with best in class applications
Easy integration with existing environments
Automatic data provenance and reporting
Cost neutral deployment
Gradual roll-out / Low risk
tranSMART - Optimized on Power8 and Spectrum Scale
• tranSMART associates genotypic & phenotypic data for complex analytics
• Watson Explorer extracts insight from scientific literature and data record and provides
R Analytics Tools Solr Full Text index Gene Patterns PLINK Watson Analytics
Figure labels: Users, Browser, Application, HTTP, Web Server (Apache2), Application Server (Tomcat 7), tranSMART, I2b2 Application Server, Quartz Job Call, JDBC, PostgreSQL tranSMART DB, GPFS, Watson Analytics Server, Power8
Dataset       No. Records
TCGA_OV       5,789,632
Simulation    40,774,968
GSE32583      942,724
GSE13168      1,203,282
GSE1456       3,600,555
GSE15258      4,702,050
Accelerate tranSMART ETL by Power8/Spectrum Scale
Figure labels: NIH Data, CDC Data, NLM Data, Internet, Lab Results, Imaging Data, Radiology Reports, Microbiology Reports, Nursing Home Records, Claims Data, Electronic Health Record Data, Genomic Data, Accepted Medical Knowledge; connections over VPN and LAN
Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and GPFS