Open source analytics for Big Data in Big Pharma
Big Data SIG
23 Apr 2015
Miika Ahdesmaki
Applications in next generation sequencing data
Crash course to molecular biology
• DNA is the ~static part
• RNA is the dynamic middle man
- Only 1% of DNA is protein-coding (or “exonic”)
• Proteins are involved in virtually all cell functions
• We can sequence DNA and RNA using ultra high throughput sequencing (3rd gen Next Generation Sequencing)
Central dogma
"Centraldogma nodetails" by Narayanese at English Wikipedia - Own work. Licensed under Public Domain via Wikimedia Commons –
Why NGS?
• Personalised medicine:
- One drug for all patients no longer realistic (especially in oncology)
- Different demographics have different variations of risks
- Understanding patient-specific needs will help guide their individual medication
• Cancer is a genetic disease, most often the result of spurious mutations in DNA
- Understanding changes in cancer DNA can help defeat the disease
• Next generation high throughput sequencing offers genome DNA analyses in days and under $10k
What is next generation sequencing?
• NGS: massively parallel DNA sequencing
• Oncology biggest consumer of NGS at AZ
• We sequence RNA and DNA e.g. from
- Clinical samples
- Cell lines
- Xenografts / explants
• The DNA/RNA is pre-processed, fragmented and the short fragments are sequenced (in random order)
• The short fragments are aligned to a reference sequence, such as the human reference genome
Alignment
• The alignments are further processed to answer the following questions
- How are the alignments different from the reference (SNPs, indels)?
- Which genes are expressed?
Downstream Processing (variants, expression)
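The align-then-compare idea above can be illustrated with a toy sketch. It assumes short reads of equal length, places each read at the reference offset with the fewest mismatches, and reports positions where the majority base in the pile-up differs from the reference. Production aligners and variant callers (bwa, bowtie2, freebayes and similar) use indexed, gap-aware and statistical methods instead; the function names, thresholds and sequences here are hypothetical.

```python
from collections import Counter, defaultdict


def best_offset(read, reference):
    """Return (offset, mismatches) for the placement with the fewest mismatches."""
    best = None
    for offset in range(len(reference) - len(read) + 1):
        window = reference[offset:offset + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (offset, mismatches)
    return best


def call_snps(reads, reference, max_mismatches=1, min_support=2):
    """Pile up placed reads and report (position, ref_base, alt_base) tuples.

    Positions are 0-based. Thresholds are illustrative assumptions.
    """
    pileup = defaultdict(Counter)
    for read in reads:
        offset, mismatches = best_offset(read, reference)
        if mismatches > max_mismatches:
            continue  # treat poorly matching reads as unaligned
        for i, base in enumerate(read):
            pileup[offset + i][base] += 1

    snps = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if base != reference[pos] and count >= min_support:
            snps.append((pos, reference[pos], base))
    return snps


# Toy data: three reads carry a G-to-A change at 0-based position 9.
reference = "ACGTTGCAAGCTTAGGCTAACG"
reads = ["TGCAAACT", "CAAACTTA", "AACTTAGG", "TAGGCTAA"]
snps = call_snps(reads, reference)
```

The brute-force scan is quadratic in read and reference length, which is fine for a toy but is exactly why real aligners rely on precomputed indexes of the reference.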
Uses of NGS
• NGS data from: Explants, Tumors (FFPE), Tumors (fresh frozen), Cell lines, Clinical samples
• RNA-Seq: Expression, Variants, Fusions
• DNA-Seq:
- Targeted: Coding and non-coding variants
- Whole exome: Coding variants
- Whole genome
• Applications: Patient stratification; Biomarkers for prognosis, drug response, safety; New Target ID; Mechanism of drug action; Mechanism of disease; Mechanisms of resistance
Data generation and volumes
• AZ: Mix of outsourced sequencing and internal data generation
• Typical size of files per sample:
- Whole genome: 60-180GB
- Exome DNA-seq: 10-20GB
- RNA-seq: 10-15GB
- Single gene targeted: 100-200MB
• In oncology, individuals are often studied in pairs (tumour/normal, parental/daughter), doubling the data volumes
• Typical study sizes: 100GB - 1TB raw compressed data
• One of our most frequent Big Data problems
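As a rough back-of-the-envelope sketch, the per-sample sizes and the pairing rule above can be combined into a study-size estimate. The size ranges are taken from the slide; the function name, the assay keys and the example of 25 individuals are assumptions for illustration only.

```python
# Per-sample file size ranges in GB, as quoted on the slide.
SIZE_GB = {
    "whole_genome": (60, 180),
    "exome": (10, 20),
    "rna_seq": (10, 15),
    "single_gene_targeted": (0.1, 0.2),
}


def study_volume_gb(assay, n_individuals, paired=True):
    """Return (low, high) estimated raw data volume in GB for a study.

    paired=True models tumour/normal (or parental/daughter) designs,
    which double the number of sequenced samples per individual.
    """
    low, high = SIZE_GB[assay]
    n_samples = n_individuals * (2 if paired else 1)
    return n_samples * low, n_samples * high


# Hypothetical study: 25 individuals, paired exome sequencing.
low, high = study_volume_gb("exome", 25)
print(f"Estimated volume: {low}-{high} GB")
```

Under these assumptions a 25-patient paired exome study lands at 500GB to 1TB, inside the typical study range quoted above.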
• Over the past 3-4 years we accumulated ~400TB of sequencing data via
- Acquiring public data sets (TCGA, ICGC)
- Vendor sequencing (major)
- Internal sequencing (minor)
• Over 2015-2016 we expect
- Internal sequencing to become the major data generation source (5 new sequencers in 2015 to accompany 2 sequencers in 2013-2014)
- 1PB of sequencing data by mid 2016
• Long term prediction of volumes difficult
• 3 tiered storage for processing, short term storage and long term storage
- Amazon Glacier strongly considered for long term storage
Partnering with the leaders
• “Illumina Announces Strategic Partnerships with AstraZeneca, Janssen and Sanofi to Redefine Companion Diagnostics for Oncology”
- http://investor.illumina.com/phoenix.zhtml?c=121127&p=irol-newsArticle&ID=1960007
- Illumina, Inc. … announced it has formed collaborative partnerships with leading pharmaceutical companies to develop a universal … NGS-based oncology test system
- The system will be used for clinical trials of targeted cancer therapies with a goal of developing and commercializing a multi-gene panel for therapeutic selection,
Production – Dealing with the complexity
The number of NGS tools increases daily:
annotateBed bcbio_nextgen.py ctest hash_tar plot_roc.r srf_info vcffilter vcfrandom append_sff bcftools cuffcompare index_tar plot-vcfstats srf_list vcffixup vcfrandomsample bam12auxmerge bed12ToBed6 cuffdiff interpolate_sam.pl prep_reads STAR vcfflatten vcfregionreduce bam12split bedGraphToBigWig cufflinks intersectBed psl2sam.pl subtractBed vcfgeno2alleles vcfregionreduce_and_cut bam12strip bedpeToBam cuffmerge io_lib-config qualimap tabix vcfgeno2haplo vcfregionreduce_pipe bam2fastx bedpeToBed12 dbilogstrip isnovoindex randomBed tabtk vcfgenosamplenames vcfregionreduce_uncompressed bamadapterclip bedpeToVcf dbiprof juncs_db rtg tagBam vcfgenosummarize vcfremap
bamadapterfind bedToBam dbiproxy kmerprob s3cmd tophat vcfgenotypecompare vcfremoveaberrantgenotypes bamauxsort bedToBigBed expandCols liftOver sam2vcf.pl tophat2 vcfgenotypes vcfremovenonATGC bamcat bedToIgv export2sam.pl linksBed sambamba tophat-fusion-post vcfglbound vcfremovesamples bamchecksort bed_to_juncs extract_fastq long_spanning_reads samblaster tophat_reports vcfglxgt vcfroc bamclipreinsert bedtools extract_qual lumpy sam_juncs trace_dump vcfgtcompare.sh vcfsample2info bamcollate bgzip extract_seq makeSCF samtools twoBitInfo vcfhetcount vcfsamplediff bamcollate2 bigBedInfo faCount map2gtf samtools.pl twoBitToFa vcfhethomratio vcfsamplenames bamdownsamplerandom bigBedSummary faSize mapBed scalpel unionBedGraphs vcfindelproximity vcfsitesummarize bamfilteraux bigBedToBed fastaFromBed maq2sam-long scf_dump variant_effect_predictor.pl vcfindels vcfsnps bamfilterflags bigWigInfo fastqc maq2sam-short scf_info vcf2fasta vcfindex vcfsom bamfilterheader bigWigSummary fastqtobam maskFastaFromBed scf_update vcf2sqlite.py vcfintersect vcfsort bamfilterrg bigWigToBedGraph faToTwoBit md5fa scramble vcf2tsv vcfkeepgeno vcfstats bamfixmateinformation bigWigToWig featureCounts md5sum-lite scram_flagstat vcfaddinfo vcfkeepinfo vcfstreamsort bamindex blast2sam.pl fetchChromSizes mergeBed scram_merge vcfafpath vcfkeepsamples vcf_strip_extra_headers bamleftalign bowtie2 filter_vep.pl multiBamCov scram_pileup vcfallelicprimitives vcfleftalign vcfToBedpe
bammapdist bowtie2-align fix_map_ordering multiIntersectBed segment_juncs vcfaltcount vcflength vcfuniq bammarkduplicates bowtie2-build flankBed muTect-1.1.6.jar seqtk vcfannotate vcfmultiallelic vcfuniqalleles bammarkduplicates2 bowtie2-inspect freebayes normalisefasta shuffleBed vcfannotategenotypes vcfmultiway vcfutils.pl bammaskflags bowtie2sam.pl gatk-framework novo2paf slopBed vcfbiallelic vcfmultiwayscripts vcfvarstats bammdnm brew GenomeAnalysisTK.jar novo2sam.pl snpEff vcfbreakmulti vcfnobiallelicsnps vep_convert_cache.pl bammerge bwa genomeCoverageBed novoalign soap2sam.pl vcfcat vcfnoindels vep_install.pl bam_merge ccmake get_comment novoalignCS SomaticAnalysisTK.jar vcfcheck vcfnosnps vt
bamrank closestBed getOverlap novoalignCSMPI sortBed vcfclassify vcfnulldotslashdot wgsim bamrecompress clusterBed gffread novoalignMPI speedseq vcfcleancomplex vcfnumalt wgsim_eval.pl bamreset cmake glia novobarcode speedseq.config vcfclearid vcfoverlay wigToBigWig bamseqchksum complementBed grabix novoindex splitReadSamToBedpe vcfclearinfo vcfparsealts windowBed bamsort contig_to_chr_coords groupBy novomethyl splitterToBreakpoint vcfcombine vcfplotaltdiscrepancy.r windowMaker bamsplit convert_trace gtf_juncs novope2bed.pl sra_to_solid vcfcommonsamples vcfplotaltdiscrepancy.sh xmlwf bamsplitdiv coverageBed gtf_to_fasta novorun.pl srf2fasta vcfcomplex vcfplotsitediscrepancy.r zoom2sam.pl bamToBed cpack gtfToGenePred novosort srf2fastq vcfcountalleles vcfplottstv.sh ztr_dump bamtofastq cpanm gtf_to_sam novoutil srf_dump_all vcfcreatemulti vcfprimers
bamToFastq cram_dump hash_exp nucBed srf_extract_hash vcfdistance vcfprintaltdiscrepancy.r bamtools cram_index hash_extract pairToBed srf_extract_linear vcfecho vcfprintaltdiscrepancy.sh bamtools-2.3.0 cramtools hash_list pairToPair srf_filter vcfentropy vcfqual2info
bamzztoname crc32 hash_sff platypus srf_index_hash vcfevenregions vcfqualfilter
Production – Overcoming the Complexity
• “Forced” to use open source tools and OS (Linux), no closed source alternatives exist
- Integration challenging
- Variant calling and expression analysis are still open research questions, with rapidly changing code
- No licensing costs, but costs in internal and external consulting
• bcbio-nextgen
- An open source Python toolkit providing best practice pipelines for fully automated NGS analysis
- Main developer Brad Chapman (HSPH)
- Unit tested, version controlled, development on GitHub: https://github.com/chapmanb/bcbio-nextgen
- Scalable across different clusters, schedulers, Amazon cloud
• AZ is an active, recognised contributor to bcbio-nextgen and collaborator with HSPH
• The user writes/modifies a high level configuration file specifying inputs and analysis parameters
- Very few “tuning parameters” -> Given the same data, two analysts will produce the same results
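One way to picture the reproducibility point is a declarative configuration that can be fingerprinted: if two analysts describe the same inputs and parameters, the canonical form of their configurations is identical. The sketch below is a minimal illustration, not bcbio-nextgen code. The field names are loosely modelled on a sample configuration and should be treated as assumptions, and `config_fingerprint` is a hypothetical helper.

```python
import hashlib
import json


def config_fingerprint(config):
    """Hash a canonical (key-sorted) JSON rendering of a configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Analyst A writes the configuration in one key order...
config_a = {
    "description": "NA12878",
    "files": ["NA12878_1.fq.gz", "NA12878_2.fq.gz"],  # hypothetical file names
    "genome_build": "GRCh37",
    "algorithm": {"aligner": "bwa", "variantcaller": "freebayes"},
}

# ...analyst B writes the same settings in a different order.
config_b = {
    "algorithm": {"variantcaller": "freebayes", "aligner": "bwa"},
    "genome_build": "GRCh37",
    "files": ["NA12878_1.fq.gz", "NA12878_2.fq.gz"],
    "description": "NA12878",
}

same = config_fingerprint(config_a) == config_fingerprint(config_b)
print("Configurations equivalent:", same)
```

A matching fingerprint only shows that the requested analysis is the same; identical results additionally depend on pinned tool versions and reference data.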
Getting it right
• Given the rapid changes in the individual analysis tools, how do we know the pipeline “gets it right”?
• Solution: reference standards
• For germline sequencing, the Genome in a Bottle Consortium established a gold standard for an individual (NA12878)
- Samples from NA12878 can be bought off the shelf
- Compare sequencing and analytics results to the gold standard, establish sensitivity and positive predictive value (PPV) of variant calls, compare to other people’s results
• For tumour sequencing, several standards exist
- Horizon Diagnostics’ tumour standard
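The comparison against a gold standard reduces to set arithmetic on variant calls. The sketch below assumes each variant is represented as a (chromosome, position, ref, alt) tuple and that calls and truth have already been normalised to the same representation; the function name and the example variants are hypothetical.

```python
def benchmark_calls(called, truth):
    """Compare called variants to a truth set.

    Sensitivity = TP / (TP + FN): fraction of true variants that were found.
    PPV         = TP / (TP + FP): fraction of calls that are true variants.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and in the truth set
    fp = len(called - truth)   # called but not in the truth set
    fn = len(truth - called)   # in the truth set but missed
    sensitivity = tp / (tp + fn) if truth else float("nan")
    ppv = tp / (tp + fp) if called else float("nan")
    return {"tp": tp, "fp": fp, "fn": fn,
            "sensitivity": sensitivity, "ppv": ppv}


# Hypothetical toy data: 4 true variants, 5 calls, 3 of them correct.
truth = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
         ("chr2", 40, "G", "A"), ("chr3", 77, "T", "C")]
called = [("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 40, "G", "A"), ("chr2", 99, "A", "C"),
          ("chr5", 12, "G", "T")]

metrics = benchmark_calls(called, truth)
print(metrics)
```

Exact tuple matching is the simplest possible comparison; indels and complex variants can be written in several equivalent ways, so real benchmarking needs a normalisation step first.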
Processing and managing the data
• NGS HPC clusters on 4 main R&D sites
- UK (SGE, ~200 cores, GPFS)
- Sweden (SLURM, >500 cores, Lustre)
- China (SGE, >100 cores, GPFS)
- US (UGE, >200 cores, GPFS)
• Data generated or received in one place is processed locally by the NGS Production Team (each member has access to all HPC clusters)
- Processed data handed over to disease area bioinformaticians in a controlled manner
• Quick pipes between the sites allow data sharing when required
• Cloud computing …
NGS + Cloud
• Large scale storage needs
• High computational power that can continue to scale
• Inherently (embarrassingly) parallel, easily ported
• Peaks and valleys in compute needs, so burst into cloud as needed instead of large investment upfront
• Launch-able computing centre utilising Amazon EC2
- StarCluster from MIT with our pipeline
[Diagram: cluster with a 40 TB GlusterFS volume mounted at /ngs, nodes labelled "320 SSD", and multiple 32-core compute nodes]
Why not Hadoop?
• We use a large number of mostly academic open source tools, 99.9% of which are not written for Hadoop
• No pipeline implements wrapping up of the above tools in a Hadoop framework
• Disk I/O admittedly the bottleneck in current parallel file system architectures for NGS analytics
- GPFS locally at AZ
Visualising the data
• Most popular genome analysis viewer is the Integrative Genomics Viewer (IGV, Broad Institute), a Java based standalone program
- Requires a Java app
- Requires configuration
• JBrowse, a web browser based genome viewer, is inherently easier for non-tech savvy people: point your browser to it and it just works
- Physical location of data less important, only the part that is shown is transferred
• Data of interest, such as genomic variants, can be annotated with a URL to JBrowse
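Annotating a variant with a viewer link can be as simple as building a URL around its coordinates. The sketch below assumes a JBrowse instance that accepts `loc` and `tracks` query parameters; the base URL, track names, flank size and the example coordinate are all hypothetical and would need to match the actual deployment.

```python
from urllib.parse import urlencode


def jbrowse_link(base_url, chrom, pos, flank=50, tracks=("DNA", "variants")):
    """Build a link that opens the viewer centred on a variant position.

    The window spans `flank` bases on each side of `pos` (1-based).
    """
    start = max(1, pos - flank)
    end = pos + flank
    query = {
        "loc": f"{chrom}:{start}..{end}",
        "tracks": ",".join(tracks),
    }
    # Keep ':' and ',' readable instead of percent-encoding them.
    return f"{base_url}?{urlencode(query, safe=':,')}"


# Hypothetical server and coordinate, for illustration only.
link = jbrowse_link("https://jbrowse.example.org/index.html", "chr13", 32914438)
print(link)
```

Such a link can be stored next to each variant in a report or table, so that a reader opens the supporting evidence in the browser with one click.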
JBrowse
BRCA2 gene screenshot, showing:
- Reference DNA sequence and amino acids
- BRCA2 alternative exons
- Detected gene variant (G to A mutation)
- Evidence in the data for the variant
Summary
• NGS data is accumulating faster and faster
• Analysing and interpreting the data is I/O intensive (+CPU and RAM)
• Easily parallelised using SMP and simple schedulers (SGE, Slurm)
• Current challenges in integrating all the processed data (in e.g. NoSQL databases)
• Long term storage (due to e.g. regulatory requirements) in e.g. Amazon Glacier