Titanic

Mohit Kothari
Computer Science and Engineering
University of California, San Diego

Roger Tanuatmadja
Computer Science and Engineering
University of California, San Diego

Gautam Akiwate
Computer Science and Engineering
University of California, San Diego
Abstract: Next-generation sequencing (NGS), also known as high throughput sequencing, is currently being utilized in the Bioinformatics Department of the University of California at San Diego as a research tool to examine and detect variants (mutations) in genomes of individuals and related family members. Unfortunately, given the current tools and compute environment, a typical NGS work-flow presently takes between 12 and 14 days to complete. There is therefore a desire within the department to investigate the possibility of reducing the turnaround time to accommodate current/future research needs that would tread a similar path.
This paper describes the investigation that was carried out over a period of a few weeks on the first phase of the sequencing, which takes around 5 to 7 days with the current software tool chain and hardware, utilizing a set of smaller but still representative data sets. It concludes with some recommendations and lessons learned. Additionally, we were also able to build a surprisingly accurate model which predicts the behavior of the tool chain.
Keywords: NGS, Variant Calling, BWA, Picard, GATK, BAM, SAM
I. INTRODUCTION
A full end to end NGS work-flow involves a number of smaller work-flows/phases, including utilizing hardware sequencers such as ones made by Illumina, pre-processing the output BAM files by running them through a set of software tools to perform tasks such as sequence mapping (hereby called the pre-processing phase), before concluding with variant calling, annotation and filtering. A BAM file is an industry standard binary version of a SAM file, with the latter being a tab-delimited text file containing sequence alignment data. For the purposes of this paper, we will only focus on the pre-processing phase since this is the work-flow that is currently taking the longest to complete.¹
An overview of the current hardware and software tool chain (including how much time each tool is currently taking to complete in a typical execution) has been provided in Table II.
The work-flow is currently executed via a Perl script that executes each stage in Table II in sequential order. When possible, the communication between the stages is done through Unix pipes to reduce I/O, with temporary files being used when piping is not feasible. Additionally, although each stage proceeds sequentially, wherever applicable, the tool utilized in each stage is currently executed with parameters that would take advantage of extra processors/cores.
1 http://www.slideshare.net/AustralianBioinformatics/introduction-to-nextgeneration
TABLE I: Machine Description
System | Description
Processor Model | Intel Xeon
Clock Speed | 1.8GHz
No. of Processors | 1
No. of Cores per Processor | 4
RAM Size | 100 GB
Disk | 228 GB
II. OVERVIEW
We start by talking about the different tools that make up the pipeline in Section III. We then move to describing the workings of the pipeline itself in Section IV. Section V describes our framework to measure different system resource usage while running the pipeline and also tries to explain the challenges faced while replicating the environment on a new system. Section VI describes the initial analysis and includes some insights into CPU, I/O and memory usage along with Java GC and hardware performance counters. Section VII attempts to model the entire pipeline in mathematical terms and Section VIII evaluates the model and lists some surprising results. Section IX uses the model to analyse recurring issues in the pipeline and makes some future predictions. Finally, Section X concludes with lessons learnt and some recommendations for the pipeline.
III. TOOLS AND CONFIGURATIONS
To understand the pipeline it is important that we understand the tools that make up the pipeline. Hence, we begin by describing the tools and the roles they play. The pipeline consists of the following tool chains for processing:²
1) BWA
2) SAMtools
3) HTSlib
4) Picard
5) Genome Analysis Toolkit (GATK)
A. BWA
BWA³ is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three variations of the Burrows-Wheeler Aligner algorithm: BWA-backtrack, BWA-SW and BWA-MEM. The pipeline uses BWA-MEM as it is capable of processing sequences of up to 1 million base pairs (bp)[1]. BWA-MEM
2 http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices
3
TABLE II: Software Tool Chain
Stage Name | Tool Family | Software Tool | Current Processing Time
Shuffling and Aligning Input File | N/A | htscmd, bwa, samtools (C language) | 33 hours
SAM File Sorting | Picard | SAMSort (Java) | 8 hours
Mark Duplicates | Picard | MarkDuplicates (Java) | 8 hours
BAM File Index construction | Picard | BuildBamIndex (Java) | 1 hour
Building Insert Delete (Indel) Realignment Targets | GATK | RealignerTargetCreator (Java) | 8 hours
Realignment around Indel | GATK | IndelRealigner (Java) | 8 hours
Base Q Covariance 1st Stage | GATK | BaseRecalibrator (Java) | 30 hours
Base Q Covariance 2nd Stage | GATK | BaseRecalibrator (Java) | 80 hours
Plot Base Q Results | GATK | Analyze Covariates (Java) | 0 hours
Base Q Recalibration | GATK | PrintRead (Java) | 33 hours
is highly parallel as it works on independent chunks of base pair reads.
B. SAMtools
The SAM (Sequence Alignment/Map) file format is a generic format for storing large nucleotide sequence alignments. SAM Tools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. It is inherently a single threaded application[2].
C. HTSlib
HTSlib⁴ is an implementation of a unified C library for accessing common file formats, such as SAM, CRAM and VCF, used for high-throughput sequencing data, and is the core library used by samtools and bcftools. The binary is called htscmd and it is used to shuffle the input data and convert it into a single fastq file. This file is then provided as an input to the BWA tool. This tool is also single threaded and doesn’t have any parallelism in it.
D. Picard tools
Picard⁵ is comprised of Java-based command-line utilities that manipulate SAM files, and a Java API (HTSJDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported by the tool. The pipeline uses 3 utilities from the Picard tool set for three purposes, namely, sorting the SAM file (SortSam), marking duplicates (MarkDuplicates) and building the BAM index (BuildBamIndex). All these utilities are again unfortunately single threaded and the algorithms don’t employ any parallelism.
E. Genome Analysis Toolkit (GATK)
Similar to Picard, GATK⁶ too is a set of tools and it serves as the core of the pipeline, performing the major analysis tasks on the genome. The GATK is the industry standard for analyses such as identifying rare mutations among exomes as well as specific mutations within a group of patients. The specific tools that are being used are RealignerTargetCreator, IndelRealigner, BaseRecalibrator and PrintReads. The GATK was built from the ground up with performance in mind. It employs the concept of Map Reduce, which is basically a strategy to speed up
4 https://github.com/samtools/htslib
5 http://picard.sourceforge.net/
6 http://www.broadinstitute.org/gatk/about/

TABLE III: Parallelism in GATK
Tool | Full Name | Supported Parallelism
RTC | Realign Target Creator | NT
IR | Indel Realigner | SG
BR | Base Recalibrator | NCT, SG
PR | Print Reads | NCT
performance by breaking down large iterative tasks into shorter segments whose output will then be merged into some overall result. Additionally, it also employs multi-threading heavily. Multi-threading is enabled simply by using the nt and nct command line arguments[3].
Here nt represents the number of data threads sent to the processor and nct represents the number of CPU threads allocated to each data thread. Apart from multi-threading, GATK also has a notion of “Scatter-Gather” which can be applied to a cluster of machines. Scatter-Gather (SG) is a very different process from multi-threading because the parallelization happens outside of the program itself. It basically creates a separate GATK command for each portion of the input data and sends these commands to different nodes of the cluster.
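The Scatter-Gather idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the department's script: the jar name, the per-chunk output names and the choice of whole chromosomes as intervals are hypothetical, and the flag spellings (-T, -R, -I, -L, -nct, -o) follow the GATK 3 style command line as we understand it, so they should be checked against the installed version.

```python
from typing import List


def scatter_commands(tool: str, reference: str, bam: str,
                     intervals: List[str], nct: int = 1) -> List[List[str]]:
    """Build one GATK command line per interval (the 'scatter' step)."""
    commands = []
    for interval in intervals:
        commands.append([
            "java", "-jar", "GenomeAnalysisTK.jar",   # jar location is an assumption
            "-T", tool,
            "-R", reference,
            "-I", bam,
            "-L", interval,                           # restrict this job to one region
            "-nct", str(nct),                         # CPU threads per data thread
            "-o", f"{tool}.{interval}.out",           # hypothetical per-chunk output name
        ])
    return commands


def gather(chunk_outputs: List[List[str]]) -> List[str]:
    """Concatenate per-chunk results in scatter order (the 'gather' step)."""
    merged: List[str] = []
    for chunk in chunk_outputs:
        merged.extend(chunk)
    return merged


if __name__ == "__main__":
    jobs = scatter_commands("BaseRecalibrator", "ref.fasta", "sample.bam",
                            ["chr1", "chr2", "chr3"], nct=4)
    for job in jobs:
        # On a cluster, each of these lines would be submitted to a different node.
        print(" ".join(job))
```

The sketch only builds the command lines; how they are dispatched to nodes and how the partial outputs are merged depends on the cluster scheduler and on the tool being scattered.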
Not all the tools support every kind of parallelism; Table III lists what each tool supports.
As seen in Table III, SG is not supported by all the tools. The ones that do support it are not the bottleneck of the pipeline, as explained in Section VII.
Section V describes our framework for measuring system performance.
IV. THE PIPELINE
In this section, we attempt to describe our understanding of each step in the workflow in computer science layman terms as opposed to the point of view of a trained Bioinformatics scientist. For the purposes of analysis, we also divide the pipeline into logical phases.
The input to the workflow is a set of BAM files, which is essentially a really large set of files (totalling 1 billion short reads per genome per person) containing a collection of short reads. Short reads are fragments of a much longer DNA sequence and are produced by hardware sequencers such as the ones made by Illumina. There are technologies in the marketplace that can produce long reads as well, but we will not discuss them here given our limited understanding of the topic.
Once sequencing is done, these short reads are fed into the pipeline, which is made up of 14 stages.
A. Phase 1: Shuffle and Align
In this stage, the short reads in the BAM files will first be shuffled to minimize bias during the alignment process.⁷ Once shuffling is completed, each of the short reads will then need to be aligned (mapped to a specific position) to a large reference genome (in our case this is a sequence 3 billion bases long). At first glance, read alignment seems to be a simple problem, i.e. find a particular substring within a bigger string. However, read errors can and do happen, hence turning this into an approximate string matching problem, which is then solved utilizing the Burrows-Wheeler Aligner (BWA) transform algorithm. It is our understanding that each of the short reads can be aligned in this stage without knowledge of any other short read data as long as the input BAM file is valid. While running initial experiments we found an interesting problem related to missing base-pairs, which is further discussed in Section V.
B. Phase 2: SAM Sorting
Many of the downstream analysis programs which utilize BAM files actually require the files to be sorted, since this allows reading from these files to be done more efficiently.
C. Phase 3: Remove Files
Phase 3 removes the temporary files created in the earlier phases.
D. Phase 4: Mark Duplicates
In this phase, duplicates of any particular unique short read will be marked to prevent a skew during the variant calling process. Duplicates are usually produced due to a particular DNA preparation process and may be unavoidable. Marking duplicates sounds like something that can be built on top of the canonical example of using Map Reduce, i.e. counting the number of words in a given document.
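The word-count analogy can be sketched as follows. This is a simplified illustration and not Picard's actual algorithm: we assume that two reads are duplicates when they share chromosome, position and strand, and we keep the copy with the highest value of a hypothetical summary quality field.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    chrom: str
    pos: int
    strand: str
    quality: int   # hypothetical summary score used to pick the copy to keep


def mark_duplicates(reads: List[Read]) -> Dict[str, bool]:
    """Return {read name: is_duplicate}."""
    # Map step: emit (key, read), the analogue of (word, 1) in word count.
    groups: Dict[Tuple[str, int, str], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.chrom, read.pos, read.strand)].append(read)

    # Reduce step: per key, keep the best read and flag every other copy.
    flags: Dict[str, bool] = {}
    for group in groups.values():
        best = max(group, key=lambda r: r.quality)
        for read in group:
            flags[read.name] = read is not best
    return flags


if __name__ == "__main__":
    sample = [
        Read("r1", "chr1", 100, "+", 30),
        Read("r2", "chr1", 100, "+", 35),
        Read("r3", "chr1", 200, "-", 20),
    ]
    print(mark_duplicates(sample))
```

Because the reduce step only needs the reads that share a key, the groups could in principle be processed in parallel once the reads are partitioned by key.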
E. Phase 5: Remove Files
In this phase, too, some files are removed.
F. Phase 6: Index Dedup (Build BAM Index)
In this phase, the output BAM files from preceding stages are indexed for fast access. This essentially allows a particular short read to be accessed by jumping immediately to a specific offset within a particular BAM file, thus negating the need to read preceding data into memory. The output of this process is a set of accompanying index files to the original BAM files.
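The "jump to an offset" idea can be illustrated with a small Python sketch over a tab-delimited text stream. The layout here is our own simplification: we index records by read name and plain byte offset, whereas a real BAM index is organised by genomic position over compressed blocks.

```python
import io
from typing import Dict


def build_index(stream: io.BufferedIOBase) -> Dict[str, int]:
    """Scan the file once and remember the byte offset at which each record starts."""
    index: Dict[str, int] = {}
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0].decode()   # first column is the record key
        index[key] = offset
    return index


def fetch(stream: io.BufferedIOBase, index: Dict[str, int], key: str) -> str:
    """Seek straight to the record instead of reading everything before it."""
    stream.seek(index[key])
    return stream.readline().decode().rstrip("\n")


if __name__ == "__main__":
    data = io.BytesIO(b"read1\tACGT\nread2\tTTGA\nread3\tGGCC\n")
    idx = build_index(data)
    print(idx)
    print(fetch(data, idx, "read3"))
```

The index is built once and stored next to the data, so later phases pay only for the seek and for the record they actually need.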
7 http://www.broadinstitute.org/gatk/guide/tagged?tag=bam
G. Phase 7 and 8: InDel Targets and Realign InDels
The next two phases pertain to Insertion Deletion (InDel) and thus would benefit from a short overview. The term InDel refers to a class of variations that is present in a human genome. The need to align short reads around InDels arises due to 2 major reasons. The first reason is that InDels can cause mappers (such as the BWA algorithm employed in Phase 1) to misalign short reads. The second reason is that those misalignments would then harm the accuracy of downstream processes such as base quality recalibration and variant detection.⁸
In Phase 7, the regions in the BAM files that will need to be realigned are identified. In general there are three types of realignment targets: known sites such as ones coming from the 1000 Genome project, InDels that are seen in the original alignments (as part of the application of the BWA algorithm), and finally sites where evidence suggests a hidden InDel.
Once the InDel regions have been identified, Phase 8 performs the actual realignment.
H. Phase 9: Remove Files
Phase 9 removes the temporary files created by the earlier phase.
I. Phase 10: Baseq (Base Quality) Covariance Stage 1 (Base Recalibration)
Hardware sequencers associate a quality score with their reads. There is however a tendency for sequencers to be overly optimistic in terms of their confidence scores. In this stage, a recalibration table will be built utilizing some machine learning algorithm based on covariation among several features of a base, such as read group, the original quality score from the sequencer, 1st/2nd read in a pair, etc.⁹
J. Phase 11: Baseq (Base Quality) Covariance Stage 2
In this phase, the recalibration table built in the previous stage is utilized to recompute the base quality score.
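The two covariance stages can be pictured as "build a table, then look it up". The sketch below is a heavily simplified assumption on our part, not GATK's model: it bins bases only by read group and reported quality, estimates the empirical error rate in each bin with +1/+2 smoothing, and converts it to the Phred scale, Q = -10 log10(p).

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int]   # (read group, quality score reported by the sequencer)


def build_table(observations: Iterable[Tuple[str, int, bool]]) -> Dict[Key, float]:
    """Stage 1: observations are (read_group, reported_quality, is_mismatch) per base."""
    counts: Dict[Key, List[int]] = defaultdict(lambda: [0, 0])   # key -> [seen, mismatches]
    for read_group, reported_q, is_mismatch in observations:
        entry = counts[(read_group, reported_q)]
        entry[0] += 1
        entry[1] += int(is_mismatch)

    table: Dict[Key, float] = {}
    for key, (seen, errors) in counts.items():
        error_rate = (errors + 1) / (seen + 2)          # smoothing avoids log(0)
        table[key] = -10.0 * math.log10(error_rate)     # empirical Phred quality
    return table


def recalibrate(read_group: str, reported_q: int, table: Dict[Key, float]) -> int:
    """Stage 2: replace the reported score with the empirical one when the bin exists."""
    return round(table.get((read_group, reported_q), reported_q))


if __name__ == "__main__":
    # 998 bases reported at Q30, 9 of which disagree with the reference.
    obs = [("rg1", 30, False)] * 989 + [("rg1", 30, True)] * 9
    table = build_table(obs)
    print(recalibrate("rg1", 30, table))
```

In this made-up data set the sequencer claimed Q30 (one error in a thousand) while the observed rate is about one in a hundred, which is the kind of over-optimism the recalibration is meant to correct.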
K. Phase 12: Plot Base Quality Covariance
In this stage, plots of the recalibration tables are generated so that an evaluation can be made on whether the recalibration has worked properly.
L. Phase 13: Base Quality Recalibration
In this phase, the recalibrated data is subjected to some final processing before being written out to disk.
M. Phase 14: Remove Files
Phase 14 simply removes the Index realignment files.
8 http://hmg.oxfordjournals.org/content/19/R2/R131.full
9 http://weallseqtoseq.blogspot.com/2013/10/gatk-best-practices-workshop-data-pre.html
V. FRAMEWORK
In this section, we describe the framework that was set up to run the pipeline and the changes and modifications made to the framework for the purposes of analysis. In addition, we discuss the challenges faced while trying to get the pipeline working with a subset of the reads to ease analysis. Further, we talk about the challenges we faced while duplicating the environment on another machine.
A. Framework Changes
The framework, essentially a script provided to us, needed an overhaul and further additions so that we could start measuring system performance for each phase of the pipeline. To measure basic resources like CPU utilization, memory utilization and disk I/O, we decided to use the dstat tool for each command. For collecting hardware performance counters, e.g. L1 cache misses, last level cache misses etc., we used the perf-stat utility. Also, since most of the tools are Java based, we thought it would be beneficial to collect JVM Garbage Collector logs as well.
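One way to picture this instrumentation is as a thin layer that rewrites each pipeline command before it is launched. The Python sketch below only builds the command lines; it is not the script we were given. The file names are hypothetical, the perf event names depend on what the CPU exposes, and the GC logging flags are the Java 7/8 era spellings, so all of them should be treated as assumptions to verify on the target machine.

```python
from typing import List


def dstat_command(csv_path: str, interval_s: int = 1) -> List[str]:
    """Sample CPU (-c), memory (-m) and disk (-d) once per interval and log to CSV."""
    return ["dstat", "-c", "-m", "-d", "--output", csv_path, str(interval_s)]


def perf_wrapped(command: List[str], perf_log: str) -> List[str]:
    """Prefix a pipeline command with perf stat so cache misses are counted for it."""
    events = "L1-dcache-load-misses,LLC-load-misses"   # availability varies by CPU
    return ["perf", "stat", "-e", events, "-o", perf_log, "--"] + command


def java_with_gc_log(jar: str, gc_log: str, tool_args: List[str]) -> List[str]:
    """Launch a Java tool with garbage collector logging switched on."""
    return ["java", "-verbose:gc", "-XX:+PrintGCDetails", f"-Xloggc:{gc_log}",
            "-jar", jar] + tool_args


if __name__ == "__main__":
    # Hypothetical jar name, arguments and log paths, for illustration only.
    stage = java_with_gc_log("tool.jar", "stage.gc.log", ["INPUT=in.sam"])
    print(" ".join(dstat_command("stage.dstat.csv")))
    print(" ".join(perf_wrapped(stage, "stage.perf.txt")))
```

Keeping one dstat log, one perf log and one GC log per stage makes it straightforward to line the measurements up with the phases described in Section IV.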
B. Data Sources
Our initial experiments were run on the master node of the Bioinformatics lab during a short period of time when it was not being utilized by the department. This gave us an advantage since we did not have to simultaneously understand both tools and environment at the outset. However, once the normal usage of the node was resumed, we had to utilize a different machine. The critical issue was that we had been using private patient data up to that point, and due to privacy concerns it was impossible for us to move the data set onto the new machine. We were thus faced with the challenge of finding an equivalent data set. We tried a number of sources and finally found the 1000 Genome project to be fruitful.¹⁰
C. Pair issue and 'R'
Our woes did not stop once we had found a data set from the 1000 Genomes Project. We expected a subset of the reads to run successfully through the pipeline. However, we faced cryptic errors in the very first phase of the pipeline. After multiple long debugging sessions and help from the bioinformatics people, we found out that the reads come in pairs and that the bwa tool requires every read to have its mate present in the data set.
Since we were running our experiments on a subset of the reads, there was a chance that certain reads did not have their mate, and because of this the pipeline was failing in the first phase itself. To remove this error, we had to write a wrapper script around the given pipeline to pre-process the input bam file and remove all the reads whose pairs did not exist.
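A minimal sketch of such a filter is given below. It is an illustration under stated assumptions, not the wrapper script itself: it assumes the reads are available as SAM text (for instance the output of samtools view -h), that both mates of a pair share the same read name in the first column, and that each read has exactly one record. The function name drop_unpaired_reads is hypothetical.

```python
from collections import Counter


def drop_unpaired_reads(sam_lines):
    """Keep header lines and reads whose mate is also present.

    Assumptions: SAM text, one record per line, read name (QNAME) in the
    first tab-separated column, both mates share the same QNAME, and each
    read has exactly one record (no secondary alignments).
    """
    lines = list(sam_lines)
    # First pass: count how many records carry each read name.
    counts = Counter(
        line.split("\t", 1)[0] for line in lines if not line.startswith("@")
    )
    # Second pass: keep headers and reads whose name appears exactly twice.
    return [
        line
        for line in lines
        if line.startswith("@") or counts[line.split("\t", 1)[0]] == 2
    ]
```

The two passes keep the filter independent of record order, at the cost of holding the subset in memory, which is acceptable for the small subsets used in these experiments.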
The challenges did not end here. There is a phase in the pipeline which uses 'R' to plot graphs, and it turns out that the tools use deprecated 'R' plotting libraries which we could not install while replicating the environment, so we had to skip that phase during our experiments. Our initial analyses on the bioinformatics machine showed that this phase did not take a significant share of the pipeline's running time, and hence skipping it should not impact our future experiments.

[10] http://www.1000genomes.org/
Finally, after all these changes, we were able to replicate the complete environment on a new server where we could experiment with different data-set sizes and analyse the results.
VI. BASELINE RUN
In order to gain a better understanding of the pipeline, we started our investigation by running the pipeline using the simplest possible configuration that still performs meaningful work. This is accomplished by setting the number of threads to 1 in each phase of the pipeline as well as utilizing only 1% of the full data set used in a typical run in the Bioinformatics department, which comes to around 10 million reads. The choice of 1% of the full set was informed by conversations with a student from the department.
The next level of experiments consisted of running the same pipeline and dataset but with a higher number of threads. Since we were working on a quad-core machine, we chose 4 threads for immediate comparison. Later in the paper, we look at how SMT performs.
Table IV shows that there are mainly 5 phases which contribute approx. 84% of the total 4-thread running time. Looking at the graphs for those phases, namely Figures 1 through 10, it is evident that neither I/O nor memory is a bottleneck. The remaining graphs of the phases can be found in Appendix A.
This is further evident from the resource usage pattern we captured for an actual run that lasted approx. 7 days. Figures 11 and 12 show that although I/O increased, it is not the bottleneck of the system, and the tools never ran out of memory.
Although we believe that memory bandwidth contention might be one of the possible reasons why these phases are slow, due to time constraints we were not able to explore that avenue.

Phase Name               Single Thread Time (s)   4 Thread Time (s)
Shuff Algn               4932                     1505
Sort Sam                 244                      244
Remove BAM               1                        1
DeDup Srtd               299                      299
Remove Srtd              1                        1
Index DeDup              40                       40
Indel Target             2961                     830
Realn Target             393                      394
Remove Dedup             1                        1
Base Covar               1276                     719
Base Covar 2             1907                     1330
Plot (Didn't Measure)    0                        0
Baseq Realn              1085                     721
Remove Realn             1                        1
TABLE IV: Time taken by different phases of the pipeline for different numbers of threads, for a dataset of 10 million reads
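The share of the dominant phases and the per-phase speed-up can be recomputed directly from Table IV. The short script below is only a worked check of that arithmetic; the numbers are copied from the table, while the function names are hypothetical.

```python
# Phase times in seconds, copied from Table IV (10 million reads).
# Tuple order: (single-thread time, 4-thread time).
PHASE_TIMES = {
    "Shuff Algn": (4932, 1505),
    "Sort Sam": (244, 244),
    "Remove BAM": (1, 1),
    "DeDup Srtd": (299, 299),
    "Remove Srtd": (1, 1),
    "Index DeDup": (40, 40),
    "Indel Target": (2961, 830),
    "Realn Target": (393, 394),
    "Remove Dedup": (1, 1),
    "Base Covar": (1276, 719),
    "Base Covar 2": (1907, 1330),
    "Plot": (0, 0),
    "Baseq Realn": (1085, 721),
    "Remove Realn": (1, 1),
}


def top_phase_share(times, column, n=5):
    """Fraction of the total time spent in the n slowest phases."""
    values = sorted((t[column] for t in times.values()), reverse=True)
    return sum(values[:n]) / sum(values)


def speedups(times):
    """Single-thread time divided by 4-thread time, per phase."""
    return {name: t1 / t4 for name, (t1, t4) in times.items() if t4 > 0}


print(f"top-5 share, 1 thread:  {top_phase_share(PHASE_TIMES, 0):.1%}")
print(f"top-5 share, 4 threads: {top_phase_share(PHASE_TIMES, 1):.1%}")
for name, s in sorted(speedups(PHASE_TIMES).items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name:13s} {s:.2f}x")
```

With these numbers the five slowest phases account for about 92.5% of the single-thread total and about 83.9% of the 4-thread total, and their speed-up from 1 to 4 threads ranges from roughly 1.4x to 3.6x.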
A. Investigating Java Runtime Environment
With the exception of the first phase of the pipeline, all the utilized tools are programs written in Java. This provided us with another avenue of investigation to pursue through the enabling of Java Garbage Collection (GC) logging.
Fig. 1: Single thread resource usage for Phase 1: Shuffle and Align. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).
Fig. 2: 4 threaded resource usage for Phase 1: Shuffle and Align. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).
Given the batch nature of each phase in the pipeline, we are primarily interested in knowing the phase throughput, i.e. the percentage of time spent by each pipeline phase doing useful work instead of GC. In general, any throughput number at 95% and above is considered good [11]. Additionally, should the throughput number fall below 95%, we are also interested in seeing any instances of a GC taking an excessively long amount of time.
We modified the script used to run the tool chain to augment every "java" command with the following flags:

-Xloggc:logs -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

[11] http://www.slideshare.net/jclarity/hotspot-garbage-collection-tuning-guide
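One way such an augmentation could be scripted is sketched below. This is an illustration only: the helper name add_gc_logging, the per-phase log file name and the example command line are hypothetical, and only the three flags themselves come from the text above.

```python
import shlex


def add_gc_logging(command, log_path="logs"):
    """Return the command with GC logging flags placed right after 'java'.

    Commands whose first word is not 'java' are returned unchanged.
    """
    args = shlex.split(command)
    if not args or args[0] != "java":
        return command
    flags = [
        f"-Xloggc:{log_path}",      # file that receives the GC log
        "-XX:+PrintGCDetails",      # per-generation details for each GC
        "-XX:+PrintGCTimeStamps",   # time at which each GC happened
    ]
    return shlex.join([args[0], *flags, *args[1:]])


# Hypothetical command line, for illustration only.
print(add_gc_logging("java -jar tool.jar -T Example", "gc_phase7.log"))
```

Writing one log file per phase, as in the example call, keeps the GC data of different phases from overwriting each other.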
This set of flags writes GC data to the specified log file in sufficient detail, including the amount of memory freed in both the Young and Old Generations in each iteration as well as the times at which GC happens. We then used a tool called jClarity Censum to visualize the data and collect the throughput metric.
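The throughput metric itself is simple to state in code. The sketch below assumes that the pause durations have already been extracted from the GC log and that all of them are stop-the-world pauses; it is not a description of how Censum computes its numbers, and the 1-second threshold for a "long" pause is an arbitrary assumption.

```python
def gc_throughput(pause_seconds, wall_clock_seconds):
    """Percentage of wall-clock time that was not spent in GC pauses."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return 100.0 * (1.0 - sum(pause_seconds) / wall_clock_seconds)


def long_pauses(pause_seconds, threshold_seconds=1.0):
    """Pauses longer than the chosen threshold (the threshold is an assumption)."""
    return [p for p in pause_seconds if p > threshold_seconds]
```

For example, 5 seconds of accumulated pauses in a 100-second phase gives a throughput of 95%, which sits exactly at the boundary of what is considered good.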
1) Throughput Results: When it comes to the pipeline, the phases utilizing Java that contribute the most to the running time are Indel Targets, Base Covar 1, Base Covar 2, and Baseq Recal, so we constrain our discussion to those 4 phases. For any combination of input size and number of threads, we found that the throughput number never dropped below 95%, with the exception of the Base Covariance 1 phase, where the number dropped steadily with growing input size when the number of threads specified is 8 (at 4 threads and below, the throughput remained above 95%).

Fig. 3: Single thread resource usage for Phase 7: Indel Target Index. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).

Fig. 4: 4 threaded resource usage for Phase 7: Indel Target Index. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).

Phase            10M      20M      40M      100M
Indel Targets    98.6%    98.9%    98.7%    98.9%
Base Covar 1     95.8%    94.7%    93.6%    92%
Base Covar 2     97.1%    96.8%    96.6%    96.6%
Baseq Recal      97.4%    97.5%    97.5%    97.6%
TABLE V: GC Throughput for number of threads = 8
We then made some modifications to the "java" command to see if we could improve the throughput number for that phase. In particular, we found that for a data size of 40 million, we were able to increase the throughput from 93.6% to 96.1% by specifying a number of additional flags:

-Xms15000m -Xmx15000m -Xmn10000m
The choice of those numbers was informed by the raw GC data collected, in particular the sizes of the Young and Old Generations at the end of the GC log for that phase, with some buffer built in to tolerate possible memory spikes.
Improvement aside, we do not think that it is particularly meaningful. For example, referring to Table V, improving Base Covar 1's throughput from 92% to 95% would only result in an improvement of around 1 minute (3% of 35 minutes) in running time. It is true that, given that this particular phase currently takes around 30 hours (using the full data set), we may see some real savings in time. However, we also know that the phase is currently being run with the number of threads set to 5 on the Bioinformatics head node (and we mentioned earlier that throughput is not an issue when the number of threads is 4). In other words, without further work, we are not able to ascertain how much of an improvement tuning this particular phase would bring.

Fig. 5: Single thread resource usage for Phase 10: Base Covariance 1. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).

Fig. 6: 4 threaded resource usage for Phase 10: Base Covariance 1. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).
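The back-of-the-envelope estimate above can be written out as a small calculation. The sketch assumes that the amount of useful work stays fixed and that only the fraction of time lost to GC changes; the function name time_saved is hypothetical, and applying the 92% figure to the 30-hour run is itself an assumption, since that run uses a different thread count.

```python
def time_saved(run_time, throughput_before, throughput_after):
    """Run time saved if useful work stays fixed and GC throughput rises.

    Useful work is run_time * throughput_before; the new run time is that
    amount of work divided by throughput_after.
    """
    useful = run_time * throughput_before
    return run_time - useful / throughput_after


# 35-minute run, throughput raised from 92% to 95%.
print(f"{time_saved(35.0, 0.92, 0.95):.2f} minutes")
# 30-hour run with the same (assumed) throughput change, in minutes.
print(f"{60 * time_saved(30.0, 0.92, 0.95):.1f} minutes")
```

Under these assumptions the saving is about 1.1 minutes for the 35-minute run and a little under an hour for the 30-hour run, consistent with the rough 3% figure used in the text.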
Another thing worth mentioning is that early in our experiments, we saw Java programs that were run in a single-threaded configuration through the use of nct (see the table listing the tools that accept a number of data/compute threads) consuming more than one thread's worth of CPU ("top" would occasionally show utilization above 30%). To investigate the issue we used JConsole, an application shipped with every JDK since version 5.0, since it is able to show all running threads in a particular Java Virtual Machine (JVM).
What we found was that although the tool did start the number of data/compute threads specified through the configuration option, there were also a number of utility threads started by the tool and the JVM, including a Progress Tracking thread (started by GATK), a number of GC threads and a number of TCP/IP threads. We were not able to ascertain why the JVM would start a number of TCP/IP threads, and there do not seem to be any flags specifically for turning those threads off. However, the presence of these extra threads does explain the CPU utilization phenomenon that we were seeing.
B. Performance counter measurements
Just to be sure that the delay in the phases is not caused by a large number of L1 cache misses, branch-predictor misses or off-chip accesses, we measured hardware performance counters using the perf-stat tool and found them to be consistent across multiple runs for different input sizes and thread counts. The L1 data cache miss rate was about 3.5% and the branch predictor miss rate about 1.2%. The LLC misses were an interesting result, but because of time constraints we could not explore this avenue either. However, since these measurements seemed pretty much constant across runs with different sizes and thread counts, we concluded that while the cache misses could be potentially interesting, they would not affect our ability to model the pipeline.

Fig. 7: Single thread resource usage for Phase 11: Base Covariance 2. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).

Fig. 8: 4 threaded resource usage for Phase 11: Base Covariance 2. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).
VII. THE PIPELINE MODEL
One of the ideas behind profiling the pipeline was to build a model on the basis of which we could make predictions. The idea was that if we could accurately model and predict the behavior of the software pipeline, we would have, in a manner of speaking, truly understood the workings of the system as a black box.
As mentioned previously, to better understand the pipeline we logically split the pipeline into phases. In retrospect, this turned out to be a crucial step in building the model, as it enabled us to better predict the behavior of the entire pipeline as a combination of phases rather than as a single entity. Essentially, we have built a model for each of the phases, which is then used to make a prediction for the entire pipeline. More specifically, the model utilizes the size of the input data and the number of threads to make a prediction for the time the pipeline will take to complete.
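The composition of per-phase models into a pipeline prediction can be sketched as follows. Only the structure (one model per phase, summed over phases, with input size and thread count as arguments) comes from the text; the names are hypothetical, and the linear-in-reads placeholder is an assumption used purely for illustration, not the fitted model.

```python
from typing import Callable, Dict

# A phase model maps (number of reads, number of threads) to seconds.
PhaseModel = Callable[[float, int], float]


def predict_pipeline_time(
    models: Dict[str, PhaseModel], reads: float, threads: int
) -> float:
    """Sum the per-phase predictions to get the whole-pipeline estimate."""
    return sum(model(reads, threads) for model in models.values())


def linear_in_reads(seconds_at_10m: float) -> PhaseModel:
    """Hypothetical placeholder: time grows linearly with the input size
    and ignores the thread count. Not the fitted model."""
    return lambda reads, threads: seconds_at_10m * reads / 10e6


# Two phases seeded with their 4-thread times from Table IV.
models = {
    "Shuff Algn": linear_in_reads(1505),
    "Indel Target": linear_in_reads(830),
}
print(predict_pipeline_time(models, reads=20e6, threads=4))
```

Keeping each phase behind the same function signature means a phase model can be replaced by a better fit without touching the code that combines them.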
A. Building the Model
In building the model, we realized that there are four important factors that affect the running time. Furthermore, each of the phases behaved differently, which seemed to stem from changes in these four factors.
The factors that affect running time and form an integral part of the model are:
Fig. 9: Single thread resource usage for Phase 13: Base Recalibration. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).
Fig. 10: 4 threaded resource usage for Phase 13: Base Recalibration. (a) CPU and I/O usage: CPU utilization (%) and read/write (MB) over time (s). (b) Memory consumption: memory utilization (GB) over time (s).
Fig. 11: CPU and I/O utilization for full run on 1 billion reads.