Analyzing NGS data with clinical
data: open source software for
translational medicine
BASEL LIFE SCIENCE WEEK – NGS FORUM – SEPTEMBER 24, 2015
Agenda
1. Introduction
2. Open Source in Translational Medicine
3. cBioPortal
4. TranSMART
1.
INTRODUCTION
The Hyve
Professional support for open source software in bioinformatics and translational research, such as tranSMART, cBioPortal, i2b2, Galaxy, ADAM and OHDSI
Mission
Enable pre-competitive collaboration
in life science R&D by leveraging
open source software
Core values
Share, Reuse, Specialize
Office Locations
Utrecht, Netherlands
Cambridge, MA, United States
Services
Software development
Data science services
Consultancy
Hosting / SLAs
Fast-growing
Started in 2012
30 people by now
Interdisciplinary team
software engineers, data scientists, project managers & staff; expertise in bioinformatics, medical informatics, software engineering, biostatistics etc.
2.
OPEN SOURCE IN TRANSLATIONAL MEDICINE
http://lanyrd.com/2015/innovation-spotlight-session
The Open Source Definition
1. Free Redistribution
2. Availability of Source Code
3. Allow Derived Works
4. Integrity of The Author's Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
Open Source
Source code openly accessible and reusable for everyone
Enables pre-competitive collaboration: both academia and industry can use and enhance it, which grows a community
Transparency: verification (scientific as well as IT security) can be done by anyone, no ‘black box’
The software engineering process in an
open source community is not different
from a closed commercial setting…
But the stakeholders, contributors, business models, engagement models etc. are!
Different Non-Functional Requirements for Software
Bioinformatician in academia: create a novel solution for a
problem that has publication value
Basic Research: new frontiers
Software should demonstrate working
principle
Bioinformatician / IT Services in
pharma/clinic: mainly applied research:
Software should be well tested,
maintainable, extensible, scalable etc.
Need for commercial support for
Open Source in Translational Medicine
Study design
Biobanking
Scientific compute
Data visualisation
Workflow / NGS
Datawarehousing
Imaging
3.
CBIOPORTAL
cBioPortal – study portal
Gene alteration events per sample
Which genes are altered in each individual tumor sample?
Data type | Alteration event calls
Mutations | Non-synonymous somatic mutations
Copy number changes | Homozygous deletion or amplification
Methylation | Epigenetic silencing
mRNA and/or DNA | Gene fusions
mRNA expression changes | Over- or under-expression
Alteration types and thresholds can be customized for each gene
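To make the idea of per-gene thresholds concrete, here is a minimal Python sketch of how alteration events could be called for one gene in one sample. It is not cBioPortal code: the `Thresholds` class, the event labels, the default cut-offs and the discrete copy number coding (-2 for homozygous deletion, 2 for amplification) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-gene cut-offs; the default values are illustrative only."""
    expr_z: float = 2.0       # |z-score| at or above this is over/under-expression
    methylation: float = 0.8  # value at or above this counts as epigenetic silencing


def call_events(sample, thresholds=Thresholds()):
    """Return the alteration events for one gene in one sample.

    `sample` is a dict with optional keys (all names are assumptions):
    'mutation' (protein change or None), 'cna' (-2 = homozygous deletion,
    2 = amplification), 'methylation', 'fusion' (bool), 'expr_z' (z-score).
    """
    events = []
    if sample.get("mutation"):
        events.append("MUTATED")
    cna = sample.get("cna", 0)
    if cna == -2:
        events.append("HOMDEL")
    elif cna == 2:
        events.append("AMP")
    if sample.get("methylation", 0.0) >= thresholds.methylation:
        events.append("SILENCED")
    if sample.get("fusion"):
        events.append("FUSION")
    z = sample.get("expr_z", 0.0)
    if z >= thresholds.expr_z:
        events.append("UP")
    elif z <= -thresholds.expr_z:
        events.append("DOWN")
    return events


# Per-gene customization: a looser expression cut-off for one gene.
per_gene = {"TP53": Thresholds(expr_z=1.5)}
tp53_events = call_events({"cna": -2, "expr_z": -1.7}, per_gene["TP53"])
```

With the looser TP53 cut-off the sample is called both deleted and under-expressed; with the default cut-off only the deletion would be reported.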
Visualization of events across genes and data types
Review cancer genomics events in clinical context
GenePrint Visualisation (from cBioPortal) in tranSMART
TM2CBIO
In collaboration with Netherlands Cancer Institute
ETL pipeline between tranSMART and cBioPortal
TranSMART used as data warehouse, and cBioPortal as a
study-based analytics mart for cancer studies
Going from individual data points (e.g. mRNA intensity levels) in tranSMART to alteration events in cBioPortal
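The step from individual data points to alteration events can be sketched as follows. This is not the TM2CBIO pipeline itself: it assumes that mRNA intensities for one gene are turned into z-scores relative to all samples in the study and then thresholded, and the function names and the cut-off of 2.0 are hypothetical.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize values against their own mean and sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # needs at least two values
    if s == 0:
        return [0.0] * len(values)
    return [(v - m) / s for v in values]


def expression_events(intensities, cutoff=2.0):
    """Map sample id -> 'UP' or 'DOWN' for samples whose z-score passes the cut-off.

    `intensities` maps sample id -> mRNA intensity for a single gene.
    Samples within the cut-off are left out (no event).
    """
    sample_ids = list(intensities)
    scores = zscores([intensities[s] for s in sample_ids])
    events = {}
    for sample_id, z in zip(sample_ids, scores):
        if z >= cutoff:
            events[sample_id] = "UP"
        elif z <= -cutoff:
            events[sample_id] = "DOWN"
    return events


# Nine samples at baseline and one clear outlier.
gene_intensities = {f"S{i}": 10.0 for i in range(1, 10)}
gene_intensities["S10"] = 20.0
events = expression_events(gene_intensities)
```

The choice of reference population for the z-score (all samples, or only a subset such as diploid samples) changes which samples are called, so it should be treated as a study-level setting.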
4.
TRANSMART
TranSMART as a product
Data warehouse bringing together scientists from clinical
sciences, preclinical research and discovery – around the data
Combination of internal datasets and documents with public
datasets and knowledge
Tailored to both biologists/clinicians and bioinformaticians
Dual nature: in use for translational research in both pharma
In early 2014, over 50 tranSMART implementations
International Research Initiatives: IMI – eTRIKS, EMIF; CTMM – TraIT
Pharma & Biotech: Sanofi, Millennium, Pfizer, JNJ, Roche
Government Aligned Institutions: FDA
Non-Profits: 1Mind4Research, Orion Bionetworks, Critical Path Institute
Hospitals / Academics: U Michigan, Harvard / Boston Children's Hospital, HEGP, Johns Hopkins, St. Jude
Service Providers: ConvergeHEALTH, theHyve, Rancho Biosciences, BTGS, Thomson Reuters, Saama Tech, Cognizant
Start | Organization | Type | Stage
2008 | Johnson & Johnson | Pharma | Production
2008 | Recombinant by Deloitte | Services | Multiple
2010 | Sage Bionetworks | Non Profit | Production
2010 | Thomson Reuters | Services | Support
2010 | U-BIOPRED | Consortium | Production
2011 | SAFE-T | Consortium | Pilot
2011 | University of Michigan, Comprehensive Cancer Center | Academic | Production
2012 | APHP-HEGP Paris France | Academic | Production
2012 | BT Cure | Consortium | Pilot
2012 | CTMM/TraIT | Consortium | Dev
2012 | FDA | Government | Dev
2012 | IMI/eTRIKS | Consortium | Dev
2012 | Merck | Pharma | Pilot
2012 | Millennium Pharmaceuticals | Pharma | Production
2012 | One Mind for Research (1M4R) | Non Profit | Production
2012 | Pfizer | Pharma | Production
2012 | Roche | Pharma | Evaluation
2012 | Sanofi-Aventis | Pharma | Dev
2012 | St. Jude | Non Profit | Dev
2012 | U Michigan, Computational Medicine & Bioinformatics | Academic | Multiple
2013 | Agios | Biotech | Evaluation
2013 | CARPEM – Cancer personalized medicine | Academic | Dev
2013 | Harvard University / Boston Children's Hospital | Academic | Autism Pilot
2013 | Boehringer Ingelheim | Pharma | Pilot
2013 | Bristol Myers Squibb | Pharma | Evaluation
2013 | BT Global Services | Services | Pilot
2013 | Accelerated Cure Project for MS | Non Profit | Dev
2014 | Personalized medicine and colorectal cancers (France) | Academic | Dev
2014 | PCORI PRRN Phelan-McDermid Syndrome Data Network | Academic | Dev
TranSMART Open Source History
February 2012: J&J releases tranSMART as open
source on GitHub under GPL v3
December 2012: CTMM TraIT project decides to use
tranSMART as core infrastructure component
January 2013: IMI eTRIKS starts, uses tranSMART as
core infrastructure component
February 2013: kickoff of tranSMART Foundation, U.
Michigan publishes PostgreSQL port
March 2014: IMI EMIF kickoff, tranSMART is used as
Center for Translational Molecular Medicine (CTMM)
Public-private consortium
Dedicated to the development of Molecular
Diagnostics and Molecular Imaging technologies
Focusing on the translational aspects of molecular
medicine.
120 partners
universities, academic medical centers, medical
technology enterprises and chemical and pharmaceutical companies.
Budget 300 M€
22 projects / research consortia
TraIT is the Translational Research IT project
TraIT Consortium
TraIT data workflow
[Figure: TraIT data workflow]
Panels: Hospital (IT), Samples (IT), Pseudonymization, Translational Research (IT)
Data domains: clinical data, imaging data, experimental data, biobanking
Stages: integrated data, translational analytics workbench
Components shown: HIS, PACS, LIS, BIMS, OpenClinica, NBIA + AIM, CBM-NL, e.g. PhenotypeDB, Annai Systems, e.g. Galaxy, Chipster, Public Data, tranSMART/i2b2 data warehouse, tranSMART/cohort explorer, Galaxy, R
Amsterdam, June 2013: tranSMART Workshop
Attendees from 10 Pharma companies, 11 University Medical Centers and 12 IT companies
http://lanyrd.com/2013/transmart
VUmc, Sanofi, Recombinant / Deloitte, University of Michigan, Thomson Reuters, Pfizer, AstraZeneca, CDISC, University of Luxembourg, Philips, Johnson & Johnson, The Hyve
70
Ann Arbor, Michigan, October 2014: Annual Meeting
http://lanyrd.com/2014/transmart
Bio IT World, Boston, April 2015
http://bit.ly/1R2N6uz
The Hyve – tranSMART 1.3 Contributions
Improvements for handling GWAS data & cohort selection on
SNP data
Built a number of interactive advanced analytics workflows and
corrected statistical assumptions
Imaging workflow: ETL for imaging metadata and results
Prototype of a tranSMART 2.0 interface: new look & feel, user
experience
5.
USING APACHE SPARK FOR NGS DATA ANALYSIS
NGS data storage & analysis
Don't import BAM, CRAM, VCF and BCF into a database!
They are the databases!
Indexed
Compressed
Highly specialized & optimized storage formats
The whole ecosystem is built around this concept.
All tools read and write these through a rich API
HTS-JDK
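Why an indexed file can act as its own database is easiest to see with a toy example. The sketch below is not the BAM or tabix index format (those use binning and offsets into compressed blocks); it only illustrates, under that simplification, how sorted positions plus a binary search give region queries without scanning every record. The `RegionIndex` class and its method names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class RegionIndex:
    """Toy in-memory index over (chrom, pos, payload) records."""

    def __init__(self, records):
        by_chrom = defaultdict(list)
        for chrom, pos, payload in records:
            by_chrom[chrom].append((pos, payload))
        self._positions = {}
        self._payloads = {}
        for chrom, items in by_chrom.items():
            items.sort(key=lambda item: item[0])  # coordinate-sort per chromosome
            self._positions[chrom] = [pos for pos, _ in items]
            self._payloads[chrom] = [payload for _, payload in items]

    def fetch(self, chrom, start, end):
        """Return payloads with start <= pos <= end using binary search."""
        positions = self._positions.get(chrom, [])
        payloads = self._payloads.get(chrom, [])
        lo = bisect_left(positions, start)
        hi = bisect_right(positions, end)
        return payloads[lo:hi]


index = RegionIndex([
    ("chr1", 100, "a"),
    ("chr1", 250, "b"),
    ("chr2", 50, "c"),
    ("chr1", 180, "d"),
])
hits = index.fetch("chr1", 150, 250)
```

A relational import would have to rebuild this kind of access path on top of much larger, uncompressed rows, which is the argument for leaving the files as they are.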
Genome Analysis ToolKit (GATK)
MapReduce framework for processing BAM and VCF files / databases
Provides “walkers” that expose the access pattern as a stream through BAM and VCF files
On top of these walkers there are analysis tools:
Indel realignment
Base Quality recalibration
Unified Genotyper (= old variant caller)
Haplotype Caller (= new variant caller)
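The walker idea (stream over records, map each record, reduce the results) can be sketched in a few lines of Python. GATK itself is written in Java and this is not its API: `walk`, `parse_vcf_lines` and `is_snv` are hypothetical names, and the parser only reads the first five tab-separated VCF columns.

```python
from functools import reduce


def parse_vcf_lines(lines):
    """Yield (chrom, pos, ref, alt) from VCF text lines, skipping header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        yield chrom, int(pos), ref, alt


def walk(records, map_fn, reduce_fn, initial):
    """Stream over records once: map each record, then fold the mapped values."""
    return reduce(reduce_fn, (map_fn(record) for record in records), initial)


def is_snv(record):
    """Map step: 1 if REF and every ALT allele are a single base, else 0."""
    _chrom, _pos, ref, alt = record
    single_base = len(ref) == 1 and all(len(a) == 1 for a in alt.split(","))
    return 1 if single_base else 0


vcf_text = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tA\t50\tPASS\t.",
    "chr2\t300\t.\tC\tT,G\t50\tPASS\t.",
]
snv_count = walk(parse_vcf_lines(vcf_text), is_snv, lambda total, x: total + x, 0)
```

Because the map step only sees one record at a time, the same analysis can be split over genomic regions and the partial results reduced afterwards.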
ADAM (http://bdgenomics.org)
Genomics processing engine & specialized file format built with:
• Apache Avro (uniform data format definition)
• Apache Spark (memory-based cluster execution)
• Apache Parquet (Hadoop based columnar storage format)
Resulting data can be accessed by Hadoop Map-Reduce, Spark, Shark,
Impala, Pig, Hive etc.
Support for conversion to and from BAM and VCF. Support for MAF conversion and
somatic variant calling is unclear.
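What "columnar" means for a format such as Parquet can be shown with plain Python data structures. This sketch does not use ADAM, Spark or Parquet, and the field names (`contig`, `start`, `mapq`) are illustrative rather than ADAM's actual schema; it only shows that a query touching one field needs to read one column.

```python
def to_columns(rows):
    """Pivot a list of row dicts (all with the same keys) into column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def mean_column(columns, name):
    """Aggregate a single column without touching the other fields."""
    values = columns[name]
    return sum(values) / len(values)


# Row-oriented records, as a read-by-read format would present them.
reads = [
    {"contig": "chr1", "start": 100, "mapq": 60},
    {"contig": "chr1", "start": 150, "mapq": 20},
    {"contig": "chr2", "start": 5, "mapq": 40},
]
columns = to_columns(reads)
average_mapq = mean_column(columns, "mapq")
```

Values within one column are also similar to each other, which is what makes column-wise compression effective.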
Translational Research Infrastructure
[Figure: layered architecture diagram]
User interfaces: R, Spotfire etc.; Galaxy GUI; tranSMART GUI; tranSMART RESTful API
Analysis layer: Genome Analysis ToolKit (GATK); Transcriptome Analysis ToolKit (TATK) (R / Bioconductor); Proteome Analysis ToolKit (PATK); Clinical API
Libraries and engines: HTS-JDK; ADAM / Spark; cTAKES
Data: BAM, CRAM, VCF, BCF; transcriptome files / DB; MzML, MzIdent; clinical data / i2b2
Mapping between patients in clinical db and samples in omics data
Storage and compute: Isilon High Performance Storage; Sun Grid Engine (SGE cluster); Hadoop HDFS / Apache Parquet (ADAM); archiving object storage (e.g. Glacier)
Download / ETL from public archives: Sequence Read Archive (SRA), Gene Expression Omnibus (GEO), PRoteomics IDEntifications Database (PRIDE)
Other components: Galaxy; XNAT imaging data repository
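The mapping between patients in the clinical database and samples in the omics data is the piece that ties the layers together. A minimal sketch, assuming the clinical data, the sample-to-patient mapping and the omics file locations are each available as a simple dictionary (all identifiers, keys and paths below are hypothetical):

```python
def link_samples_to_patients(clinical, sample_map, omics_files):
    """Join omics files to clinical records via a sample -> patient mapping.

    clinical:    patient id -> dict of clinical attributes
    sample_map:  sample id  -> patient id
    omics_files: sample id  -> file location (e.g. a BAM or VCF path)
    Returns (linked rows, sample ids that could not be mapped).
    """
    linked = []
    unmapped = []
    for sample_id, path in sorted(omics_files.items()):
        patient_id = sample_map.get(sample_id)
        if patient_id is None or patient_id not in clinical:
            unmapped.append(sample_id)
            continue
        row = {"patient_id": patient_id, "sample_id": sample_id, "file": path}
        row.update(clinical[patient_id])
        linked.append(row)
    return linked, unmapped


clinical = {"P1": {"diagnosis": "CRC", "age": 61}}
sample_map = {"S1": "P1", "S2": "P9"}
omics_files = {"S1": "/data/S1.bam", "S2": "/data/S2.bam", "S3": "/data/S3.vcf"}
linked, unmapped = link_samples_to_patients(clinical, sample_map, omics_files)
```

Reporting unmapped samples explicitly, rather than dropping them silently, makes gaps in the mapping visible before any cohort-level analysis is run.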