Analyzing NGS data with clinical data: open source software for translational medicine

(1)

BASEL LIFE SCIENCE WEEK – NGS FORUM – SEPTEMBER 24, 2015

(2)

Agenda

1. Introduction
2. Open Source in Translational Medicine
3. cBioPortal
4. TranSMART

(3)

1. INTRODUCTION

(4)

The Hyve

• Professional support for open source software in bioinformatics and translational research, such as tranSMART, cBioPortal, i2b2, Galaxy, ADAM and OHDSI

Mission: Enable pre-competitive collaboration in life science R&D by leveraging open source software
Core values: Share, Reuse, Specialize
Office locations: Utrecht, Netherlands; Cambridge, MA, United States
Services: Software development, data science services, consultancy, hosting / SLAs
Fast-growing: started in 2012, 30 people by now

(5)

Interdisciplinary team

software engineers, data scientists, project managers & staff; expertise in bioinformatics, medical informatics, software engineering, biostatistics etc.

(6)

2. OPEN SOURCE IN TRANSLATIONAL MEDICINE

(7)

http://lanyrd.com/2015/innovation-spotlight-session

(8)

(9)

The Open Source Definition

1. Free Redistribution
2. Availability of Source Code
3. Allow Derived Works
4. Integrity of The Author's Source Code
5. No Discrimination Against Persons or Groups
6. No Discrimination Against Fields of Endeavor
7. Distribution of License
8. License Must Not Be Specific to a Product
9. License Must Not Restrict Other Software
10. License Must Be Technology-Neutral

(10)

Open Source

• Source code openly accessible and reusable for everyone
• Enables pre-competitive collaboration: both academia and industry can use and enhance it, which grows a community
• Transparency: verification (scientific as well as IT security) can be done by anyone, no ‘black box’

(11)

The software engineering process in an open source community is no different from that in a closed commercial setting…

But the stakeholders, contributors, business models, engagement models etc. are!

(12)

Different Non-Functional Requirements for Software

• Bioinformatician in academia: create a novel solution for a problem which has publication value
  • Basic research: new frontiers
  • Software should demonstrate working principle
• Bioinformatician / IT services in pharma/clinic: mainly applied research
  • Software should be well tested, maintainable, extensible, scalable etc.
  • Need for commercial support for

(13)

Open Source in Translational Medicine

Tool categories: study design, biobanking, scientific compute, data visualisation, workflow / NGS, data warehousing, imaging

(14)

3. CBIOPORTAL

(15)

cBioPortal – study portal

(16)

(17)

Gene alteration events per sample

Which genes are altered in each individual tumor sample?

Data type | Alteration event calls
Mutations | Non-synonymous somatic mutations
Copy number changes | Homozygous deletion or amplification
Methylation | Epigenetic silencing
mRNA and/or DNA | Gene fusions
mRNA expression changes | Over- or under-expression

Alteration types and thresholds can be customized for each gene
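
To make the idea of per-gene, customizable event calls concrete, here is a minimal Python sketch. It is not cBioPortal code: the field names (`mutation`, `cna`, `mrna_z`), the event labels, the discrete copy number scale (-2 for homozygous deletion, 2 for amplification) and the default z-score cut-off of 2 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    """Per-gene cut-offs; the default values are illustrative assumptions."""
    z_high: float = 2.0   # mRNA z-score at or above this counts as over-expression
    z_low: float = -2.0   # mRNA z-score at or below this counts as under-expression


def call_events(gene, sample, thresholds=None):
    """Return the alteration events for one gene in one tumor sample.

    `sample` is a dict with optional, hypothetical keys:
      "mutation": protein change of a non-synonymous somatic mutation
      "cna":      discrete copy number call, -2 (homozygous deletion) to 2 (amplification)
      "mrna_z":   mRNA expression z-score
    `thresholds` maps gene symbols to Thresholds; other genes use the defaults.
    """
    t = (thresholds or {}).get(gene, Thresholds())
    events = []
    if sample.get("mutation"):
        events.append("MUTATED")
    cna = sample.get("cna")
    if cna == 2:
        events.append("AMPLIFIED")
    elif cna == -2:
        events.append("HOMOZYGOUSLY_DELETED")
    z = sample.get("mrna_z")
    if z is not None:
        if z >= t.z_high:
            events.append("OVEREXPRESSED")
        elif z <= t.z_low:
            events.append("UNDEREXPRESSED")
    return events


# A stricter over-expression cut-off for one gene, defaults for all others.
custom = {"ERBB2": Thresholds(z_high=3.0)}
print(call_events("ERBB2", {"cna": 2, "mrna_z": 2.5}, custom))     # → ['AMPLIFIED']
print(call_events("TP53", {"mutation": "R175H", "mrna_z": -2.4}))  # → ['MUTATED', 'UNDEREXPRESSED']
```

Keeping the thresholds in a per-gene mapping means a single gene can be made stricter or more lenient without touching the calling logic.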

(18)

Visualization of events across genes and data types

(19)

Review cancer genomics events in clinical context

(20)

GenePrint Visualisation (from cBioPortal) in tranSMART

(21)

TM2CBIO

• In collaboration with Netherlands Cancer Institute
• ETL pipeline between tranSMART and cBioPortal
• TranSMART used as data warehouse, and cBioPortal as a study-based analytics mart for cancer studies
• Going from individual data points (e.g. mRNA intensity levels) in tranSMART to alteration events in cBioPortal
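
The step from individual data points to alteration events can be sketched as follows. The slides do not describe how TM2CBIO does this, so the approach here is an assumption: intensities are standardized per gene across samples (z-scores using the sample standard deviation) and values beyond a cut-off of 2 are reported as events. Function names, labels and data are hypothetical.

```python
from statistics import mean, stdev


def zscores(values):
    """Standardize values; return zeros when the spread cannot be estimated."""
    if len(values) < 2:
        return [0.0 for _ in values]
    m, s = mean(values), stdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


def intensities_to_events(intensities, threshold=2.0):
    """Turn {gene: {sample_id: intensity}} into {gene: {sample_id: "UP" | "DOWN"}}.

    Only samples whose z-score reaches the threshold are reported.
    """
    events = {}
    for gene, per_sample in intensities.items():
        sample_ids = sorted(per_sample)
        scores = zscores([per_sample[s] for s in sample_ids])
        calls = {}
        for sample_id, z in zip(sample_ids, scores):
            if z >= threshold:
                calls[sample_id] = "UP"
            elif z <= -threshold:
                calls[sample_id] = "DOWN"
        if calls:
            events[gene] = calls
    return events


# Ten samples with similar intensities and one clear outlier (made-up numbers).
intensities = {
    "EGFR": {
        "S01": 5.0, "S02": 5.1, "S03": 4.9, "S04": 5.0, "S05": 5.2,
        "S06": 4.8, "S07": 5.0, "S08": 5.1, "S09": 4.9, "S10": 9.0,
    }
}
print(intensities_to_events(intensities))  # → {'EGFR': {'S10': 'UP'}}
```

With the sample standard deviation, a single outlier among n samples can reach a z-score of at most (n - 1) / sqrt(n), so very small cohorts can never produce a call at a cut-off of 2.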

(22)

4. TRANSMART

(23)

TranSMART as a product

• Data warehouse bringing together scientists from clinical sciences, preclinical research and discovery, around the data
• Combination of internal datasets and documents with public datasets and knowledge
• Tailored to both biologists/clinicians and bioinformaticians
• Dual nature: in use for translational research in both pharma

(24)

In early 2014, over 50 tranSMART implementations

International Research Initiatives: IMI (eTRIKS, EMIF), CTMM (TraIT)
Pharma & Biotech: Sanofi, Millennium, Pfizer, JNJ, Roche
Government Aligned Institutions: FDA
Non-Profits: 1Mind4Research, Orion Bionetworks, Critical Path Institute
Hospitals / Academics: U Michigan, Harvard / Boston Children's Hospital, HEGP, Johns Hopkins, St. Jude
Service Providers: ConvergeHEALTH, theHyve, Rancho Biosciences, BTGS, Thomson Reuters, Saama Tech, Cognizant

Start | Organization | Type | Stage
2008 | Johnson & Johnson | Pharma | Production
2008 | Recombinant by Deloitte | Services | Multiple
2010 | Sage Bionetworks | Non Profit | Production
2010 | Thomson Reuters | Services | Support
2010 | U-BIOPRED | Consortium | Production
2011 | SAFE-T | Consortium | Pilot
2011 | University of Michigan, Comprehensive Cancer Center | Academic | Production
2012 | APHP-HEGP Paris France | Academic | Production
2012 | BT Cure | Consortium | Pilot
2012 | CTMM/TraIT | Consortium | Dev
2012 | FDA | Government | Dev
2012 | IMI/eTRIKS | Consortium | Dev
2012 | Merck | Pharma | Pilot
2012 | Millennium Pharmaceuticals | Pharma | Production
2012 | One Mind for Research (1M4R) | Non Profit | Production
2012 | Pfizer | Pharma | Production
2012 | Roche | Pharma | Evaluation
2012 | Sanofi-Aventis | Pharma | Dev
2012 | St. Jude | Non Profit | Dev
2012 | U Michigan, Computational Medicine & Bioinformatics | Academic | Multiple
2013 | Agios | Biotech | Evaluation
2013 | CARPEM – Cancer personalized medicine | Academic | Dev
2013 | Harvard University / Boston Children's Hospital | Academic | Autism Pilot
2013 | Boehringer Ingelheim | Pharma | Pilot
2013 | Bristol Myers Squibb | Pharma | Evaluation
2013 | BT Global Services | Services | Pilot
2013 | Accelerated Cure Project for MS | Non Profit | Dev
2014 | Personalized medicine and colorectal cancers (France) | Academic | Dev
2014 | PCORI PRRN Phelan-McDermid Syndrome Data Network | Academic | Dev

(25)

TranSMART Open Source History

• February 2012: J&J releases tranSMART as open source on GitHub under GPL v3
• December 2012: CTMM TraIT project decides to use tranSMART as core infrastructure component
• January 2013: IMI eTRIKS starts, uses tranSMART as core infrastructure component
• February 2013: kickoff of tranSMART Foundation, U. Michigan publishes PostgreSQL port
• March 2014: IMI EMIF kickoff, tranSMART is used as

(26)

Center for Translational Molecular Medicine (CTMM)

• Public-private consortium
• Dedicated to the development of Molecular Diagnostics and Molecular Imaging technologies
• Focusing on the translational aspects of molecular medicine
• 120 partners: universities, academic medical centers, medical technology enterprises, and chemical and pharmaceutical companies
• Budget 300 M€
• 22 projects / research consortia
• TraIT is the Translational Research IT project

(27)

TraIT Consortium

(28)

TraIT data workflow

Diagram labels:
• Zones: Hospital (IT), Samples (IT), Translational Research (IT), Public Data
• Data domains: clinical data, imaging data, experimental data, biobanking
• Source systems: HIS, PACS, LIS, BIMS
• Pseudonymization step
• Domain tools: OpenClinica, NBIA + AIM, e.g. PhenotypeDB, Annai Systems, e.g. Galaxy, Chipster, CBM-NL
• Integrated data: tranSMART/i2b2 datawarehouse
• Translational analytics workbench: Galaxy, tranSMART/cohort explorer, R
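
The pseudonymization step in this workflow can be illustrated with a keyed hash. This is only a sketch of one common technique (HMAC-SHA256 with a secret key), not a description of how TraIT implements it. The function name, the `PSN-` prefix and the truncation length are hypothetical.

```python
import hashlib
import hmac


def pseudonymize(patient_id: str, secret_key: bytes, length: int = 16) -> str:
    """Derive a stable pseudonym from a patient identifier.

    The same identifier and key always give the same pseudonym, so records
    from different source systems can still be linked. Without the key the
    pseudonym cannot be recomputed from a guessed identifier.
    """
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "PSN-" + digest[:length]


key = b"example-key-kept-outside-the-research-environment"  # placeholder value
clinical_record = pseudonymize("hospital-id-0001", key)
imaging_record = pseudonymize("hospital-id-0001", key)
other_patient = pseudonymize("hospital-id-0002", key)

print(clinical_record == imaging_record)  # same patient links across data domains
print(clinical_record == other_patient)   # different patients stay distinct
```

In practice the key would be held by a party outside the research environment, so that researchers see only pseudonyms.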

(29)

Amsterdam, June 2013: tranSMART Workshop

Attendees from 10 pharma companies, 11 university medical centers and 12 IT companies
http://lanyrd.com/2013/transmart
Organizations: VUmc, Sanofi, Recombinant / Deloitte, University of Michigan, Thomson Reuters, Pfizer, AstraZeneca, CDISC, University of Luxembourg, Philips, Johnson & Johnson, The Hyve

70

(30)

Ann Arbor, Michigan, October 2014: Annual Meeting

http://lanyrd.com/2014/transmart

(31)

Bio IT World, Boston, April 2015

http://bit.ly/1R2N6uz

(32)

(33)

The Hyve – tranSMART 1.3 Contributions

• Improvements for handling GWAS data & cohort selection on SNP data
• Built a number of interactive advanced analytics workflows & corrected statistical assumptions
• Imaging workflow: ETL for imaging metadata and results
• Prototype of a tranSMART 2.0 interface: new look & feel, user experience

(34)

(35)

5. USING APACHE SPARK FOR NGS DATA ANALYSIS

(36)

NGS data storage & analysis

Don't import BAM, CRAM, VCF and BCF into a database!

They are the databases!

• Indexed
• Compressed
• Highly specialized & optimized storage formats

The whole ecosystem is built around this concept.

All tools read and write these through a rich API: HTS-JDK
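
Why a sorted, indexed file can answer region queries without a database can be shown with a toy example. The sketch below is not HTS-JDK and not a real index format: real tools use block-compressed files with BAI or tabix indexes. Here a small uncompressed VCF-like text is indexed by the byte offset of each chromosome's first record, and a query seeks to that offset instead of scanning from the start. All names and records are made up.

```python
import io

# A tiny, coordinate-sorted VCF-like file held in memory (illustrative records).
VCF_BYTES = b"""##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
1\t100\t.\tA\tG\t.\t.\t.
1\t250\t.\tC\tT\t.\t.\t.
2\t50\t.\tG\tA\t.\t.\t.
2\t900\t.\tT\tC\t.\t.\t.
"""


def build_index(handle):
    """Map each chromosome to the byte offset of its first record."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"#"):
            continue  # skip header lines
        chrom = line.split(b"\t", 1)[0].decode()
        index.setdefault(chrom, offset)
    return index


def fetch(handle, index, chrom, start, end):
    """Yield (chrom, pos, ref, alt) for records with start <= pos <= end."""
    if chrom not in index:
        return
    handle.seek(index[chrom])  # jump straight to the chromosome
    for line in iter(handle.readline, b""):
        fields = line.rstrip(b"\n").decode().split("\t")
        pos = int(fields[1])
        if fields[0] != chrom or pos > end:
            break  # the file is sorted, so nothing further can match
        if pos >= start:
            yield fields[0], pos, fields[3], fields[4]


handle = io.BytesIO(VCF_BYTES)
index = build_index(handle)
print(list(fetch(handle, index, "2", 1, 100)))  # → [('2', 50, 'G', 'A')]
```

The query reads only the records from the start of chromosome 2 up to the first position past the region, which is the property that real genomic indexes provide at much finer granularity.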

(37)

Genome Analysis ToolKit (GATK)

MapReduce framework for processing BAM and VCF files / databases

Provides “walkers” that implement access patterns as a stream through BAM and VCF files

On top of these walkers there are analysis tools:

• Indel realignment
• Base quality recalibration
• Unified Genotyper (= old variant caller)
• Haplotype Caller (= new variant caller)
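
The walker idea (stream over records, filter, map each record to a value, reduce the values to a result) can be sketched in a few lines of Python. This is an illustration of the pattern only, not the GATK API. The record layout, function names and the mapping quality cut-off are assumptions.

```python
from collections import Counter
from functools import reduce

# Hypothetical, simplified reads: (chromosome, start position, mapping quality).
READS = [
    ("1", 100, 60),
    ("1", 150, 0),
    ("2", 40, 37),
    ("2", 80, 60),
]


def keep_read(read, min_mapq=20):
    """Filter step: drop reads with low mapping quality."""
    return read[2] >= min_mapq


def map_read(read):
    """Map step: emit one count for the read's chromosome."""
    return Counter({read[0]: 1})


def reduce_counts(accumulator, value):
    """Reduce step: add the mapped value into the running total."""
    accumulator.update(value)  # Counter.update adds counts
    return accumulator


def walk(reads, keep, map_fn, reduce_fn, initial):
    """Stream over reads once, applying filter, map and reduce."""
    return reduce(reduce_fn, (map_fn(r) for r in reads if keep(r)), initial)


reads_per_chromosome = walk(READS, keep_read, map_read, reduce_counts, Counter())
print(dict(reads_per_chromosome))  # → {'1': 1, '2': 2}
```

Because the analysis only supplies the filter, map and reduce functions, the traversal itself can be parallelized or sharded by the framework without changing the analysis code.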

(38)

ADAM – http://bdgenomics.org

• Genomics processing engine & specialized file format built with:
  • Apache Avro (uniform data format definition)
  • Apache Spark (memory-based cluster execution)
  • Apache Parquet (Hadoop-based columnar storage format)
• Resulting data can be accessed by Hadoop MapReduce, Spark, Shark, Impala, Pig, Hive etc.
• Support for conversion to and from BAM and VCF. Support for MAF conversion and somatic variant calling is unclear.
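
What "columnar storage" buys can be shown with a toy in-memory example. This is not the Parquet format or the ADAM API, and the record fields are hypothetical. It shows only the two properties that matter here: a query can read just the columns it needs, and a sorted column with many repeated values compresses well (run-length encoding is used as a simple stand-in).

```python
from itertools import groupby

# Hypothetical alignment records in row layout, sorted by contig.
rows = [
    {"contig": "1", "start": 100, "mapq": 60},
    {"contig": "1", "start": 150, "mapq": 0},
    {"contig": "1", "start": 300, "mapq": 37},
    {"contig": "2", "start": 40, "mapq": 60},
]


def to_columns(records):
    """Pivot row records (all with the same fields) into one list per field."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal neighbouring values into (value, run length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# Column projection: the mean mapping quality touches only the "mapq" column.
mean_mapq = sum(columns["mapq"]) / len(columns["mapq"])
print(mean_mapq)  # → 39.25

# Repeated values in a sorted column collapse into a few runs.
print(run_length_encode(columns["contig"]))  # → [('1', 3), ('2', 1)]
```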

(39)

(40)

Translational Research Infrastructure (architecture diagram)

• User interfaces: R, Spotfire etc.; Galaxy GUI; TranSMART GUI; TranSMART RESTful API
• Analysis toolkits: Genome Analysis ToolKit (GATK); Transcriptome Analysis ToolKit (TATK) (R / Bioconductor); Proteome Analysis ToolKit (PATK)
• Data access: HTS-JDK; ADAM / Spark; Clinical API; cTAKES
• Data: Clinical Data / i2b2; BAM, CRAM, VCF, BCF; transcriptome files / DB; MzML, MzIdent; XNAT imaging data repository
• Storage and compute: Isilon high performance storage; Sun Grid Engine (SGE cluster); Hadoop HDFS / Apache Parquet (ADAM); archiving object storage (e.g. Glacier)
• Public archives (download / ETL): Sequence Read Archive (SRA); Gene Expression Omnibus (GEO); PRoteomics IDEntifications Database (PRIDE)
• Galaxy
• Mapping between patients in clinical db and samples in omics data

(41)