Titanic


Mohit Kothari, Roger Tanuatmadja, Gautam Akiwate

Computer Science and Engineering, University of California, San Diego

Abstract—Next-generation sequencing (NGS), also known as high throughput sequencing, is currently being utilized in the Bioinformatics Department of the University of California at San Diego as a research tool to examine and detect variants (mutations) in genomes of individuals and related family members. Unfortunately, given the current tools and compute environment, a typical NGS work-flow presently takes between 12 and 14 days to complete. There is therefore a desire within the department to investigate the possibility of reducing the turnaround time to accommodate current and future research needs that would tread a similar path.

This paper describes the investigation that has been carried out over a period of a few weeks on the first phase of the sequencing, which takes around 5 to 7 days with the current software tool chain and hardware. The investigation utilized a set of smaller but still representative data sets, and the paper concludes with some recommendations and lessons learned. Additionally, we were also able to build a surprisingly accurate model which predicts the behavior of the tool chain.

Keywords— NGS, Variant Calling, BWA, Picard, GATK, BAM, SAM

I. INTRODUCTION

A full end to end NGS work-flow involves a number of smaller work-flows/phases including utilizing hardware sequencers such as ones made by Illumina, pre-processing the output BAM files by running them through a set of software tools to perform tasks such as sequence mapping (hereby called the pre-processing phase) before concluding with variant calling, annotation and filtering. A BAM file is an industry standard binary version of a SAM file, with the latter being a tab-delimited text file containing sequence alignment data. For the purposes of this paper, we will only focus on the pre-processing phase since this is the work-flow that is currently taking the longest to complete¹.

An overview of the current hardware and software tool chain (including how much time each tool is currently taking to complete in a typical execution) has been provided in Table II.

The work-flow is currently executed via a Perl script that executes each stage in Table II in sequential order. When possible, the communication between the stages is done through Unix pipes to reduce I/O, with temporary files being used when piping is not feasible. Additionally, although each stage proceeds sequentially, wherever applicable, the tool utilized in each stage is currently executed with parameters that would take advantage of extra processors/cores.
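To make the pipe-based hand-off between stages concrete, here is a minimal Python sketch of two stages connected by an operating-system pipe, so that no temporary file is written between them. The actual pipeline uses a Perl script whose contents are not reproduced here; the two commands below (`ALIGN_STAND_IN`, `SORT_STAND_IN`) are hypothetical stand-ins that only print and sort a few lines.

```python
import subprocess
import sys

def run_piped(producer_cmd, consumer_cmd):
    """Run `producer_cmd | consumer_cmd` and return the consumer's stdout as text."""
    producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout,
                                stdout=subprocess.PIPE, text=True)
    # Close our copy of the pipe so the producer sees SIGPIPE if the consumer exits early.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output

# Hypothetical stand-ins for two pipeline stages (not the real tools).
ALIGN_STAND_IN = [sys.executable, "-c", "print('chr2\\nchr1\\nchr3')"]
SORT_STAND_IN = [sys.executable, "-c",
                 "import sys; sys.stdout.writelines(sorted(sys.stdin))"]

if __name__ == "__main__":
    print(run_piped(ALIGN_STAND_IN, SORT_STAND_IN), end="")
```

The data flows from the first process to the second through kernel buffers only, which is the I/O saving the work-flow relies on when piping is feasible.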

¹ http://www.slideshare.net/AustralianBioinformatics/introduction-to-nextgeneration

TABLE I: Machine Description

System Description
Processor Model | Intel Xeon
Clock Speed | 1.8GHz
No. of Processors | 1
No. of Cores per Processor | 4
RAM Size | 100 GB
Disk | 228 GB

II. OVERVIEW

We start by talking about the different tools that make up the pipeline in Section III. We then move to describing the workings of the pipeline itself in Section IV. Section V describes our framework to measure different system resource usage while running the pipeline and also explains the challenges faced while replicating the environment on a new system. Section VI describes the initial analysis and includes some insights into CPU, I/O and memory usage along with Java GC and hardware performance counters. Section VII attempts to model the entire pipeline in mathematical terms and Section VIII evaluates the model and lists some surprising results. Section IX uses the model to analyse recurring issues in the pipeline and makes some future predictions. Finally, Section X concludes with lessons learnt and some recommendations for the pipeline.

III. TOOLS AND CONFIGURATIONS

To understand the pipeline it is important that we understand the tools that make up the pipeline. Hence, we begin by describing the tools and the roles they play. The pipeline consists of the following tool chains for processing²:

1) BWA
2) SAMtools
3) HTSlib
4) Picard
5) Genome Analysis Toolkit (GATK)

A. BWA

BWA³ is a software package for mapping low-divergent sequences against a large reference genome, such as the human genome. It consists of three variations of the Burrows-Wheeler Aligner algorithm: BWA-backtrack, BWA-SW and BWA-MEM. The pipeline uses BWA-MEM as it is capable of processing up to 1 million base pairs (bp) [1]. BWA-MEM is highly parallel as it works on independent chunks of base pair reads.

² http://gatkforums.broadinstitute.org/discussion/2899/howto-install-all-software-packages-required-to-follow-the-gatk-best-practices

³

TABLE II: Software Tool Chain

Stage Name | Tool Family | Software Tool | Current Processing Time
Shuffling and Aligning Input File | N/A | htscmd, bwa, samtools (C language) | 33 hours
SAM File Sorting | Picard | SAMSort (Java) | 8 hours
Mark Duplicates | Picard | MarkDuplicates (Java) | 8 hours
BAM File Index construction | Picard | BuildBamIndex (Java) | 1 hour
Building Insert Delete (Indel) Realignment Targets | GATK | RealignerTargetCreator (Java) | 8 hours
Realignment around Indel | GATK | IndelRealigner (Java) | 8 hours
Base Q Covariance 1st Stage | GATK | BaseRecalibrator (Java) | 30 hours
Base Q Covariance 2nd Stage | GATK | BaseRecalibrator (Java) | 80 hours
Plot Base Q Results | GATK | Analyze Covariates (Java) | 0 hours
Base Q Recalibration | GATK | PrintRead (Java) | 33 hours

B. SAMtools

The SAM (Sequence Alignment/Map) file format is a generic format for storing large nucleotide sequence alignments. SAM Tools provides various utilities for manipulating alignments in the SAM format, including sorting, merging, indexing and generating alignments in a per-position format. It is inherently a single threaded application [2].
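Since SAM is a tab-delimited text format, a single alignment line can be split into its eleven mandatory fields with very little code. The sketch below is only an illustration of the record layout defined by the SAM specification; `SamRecord` and `parse_sam_line` are hypothetical names, the example line is made up, and optional tag fields are simply ignored.

```python
from typing import NamedTuple

class SamRecord(NamedTuple):
    qname: str   # read (query) name, shared by both mates of a pair
    flag: int    # bitwise flags, e.g. 0x1 = read is paired
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # alignment description
    rnext: str   # reference name of the mate
    pnext: int   # position of the mate
    tlen: int    # observed template length
    seq: str     # read bases
    qual: str    # per-base qualities, ASCII encoded

def parse_sam_line(line):
    """Parse the 11 mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    q, f, r, p, m, c, rn, pn, t, s, ql = fields[:11]
    return SamRecord(q, int(f), r, int(p), int(m), c, rn, int(pn), int(t), s, ql)

# Made-up example record for illustration.
EXAMPLE = "read1\t99\tchr1\t100\t60\t5M\t=\t150\t55\tACGTA\tIIIII"
record = parse_sam_line(EXAMPLE)
is_paired = bool(record.flag & 0x1)
```

A BAM file carries the same fields in a compressed binary encoding, which is why tools such as SAMtools and Picard can treat the two formats interchangeably.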

C. HTSlib

HTSlib⁴ is an implementation of a unified C library for accessing common file formats, such as SAM, CRAM and VCF, used for high-throughput sequencing data, and is the core library used by samtools and bcftools. The binary is called htscmd and it is used to shuffle the input data and convert the latter into a single fastq file. This file is then provided as an input to the BWA tool. This tool is also single threaded and doesn’t have any parallelism in it.

D. Picard tools

Picard⁵ is comprised of Java-based command-line utilities that manipulate SAM files, and a Java API (HTSJDK) for creating new programs that read and write SAM files. Both SAM text format and SAM binary (BAM) format are supported by the tool. The pipeline uses 3 utilities from the Picard tool set for three purposes, namely, sorting the SAM file (SortSam), marking duplicates (MarkDuplicates) and building the BAM index (BuildBamIndex). All these utilities are again unfortunately single threaded and the algorithms don’t employ any parallelism.

E. Genome Analysis Toolkit (GATK)

Similar to Picard, GATK⁶ too is a set of tools and it serves as the core of the pipeline, performing the major analysis tasks on the genome. The GATK is the industry standard for analyses such as identifying rare mutations among exomes as well as specific mutations within a group of patients. The specific tools that are being used are RealignerTargetCreator, IndelRealigner, BaseRecalibrator and PrintReads. The GATK was built from the ground up with performance in mind. It employs the concept of Map Reduce, which is basically a strategy to speed up performance by breaking down large iterative tasks into shorter segments whose output will then be merged into some overall result. Additionally, it also employs multi-threading heavily. Multi-threading is enabled simply by using the nt and nct command line arguments [3].

Here nt represents the number of data threads sent to the processor and nct represents the number of CPU threads allocated to each data thread. Apart from multi-threading, GATK also has a notion of “Scatter-Gather” which can be applied to a cluster of machines. Scatter-Gather (SG) is a very different process from multi-threading because the parallelization happens outside of the program itself. It basically creates separate GATK commands, each for a portion of the input data, and sends these commands to different nodes of the cluster.

Not all the tools support all the different kinds of parallelism. Table III lists the support offered by each tool.

TABLE III: Parallelism in GATK

Tool | Full Name | Supported Parallelism
RTC | Realign Target Creator | NT
IR | Indel Realigner | SG
BR | Base Recalibrator | NCT, SG
PR | Print Reads | NCT

As seen in Table III, SG is not supported by all the tools. The ones that do support it are not the bottleneck of the pipeline, as explained in Section VII. Section V describes our framework for measuring the system performance.

⁴ https://github.com/samtools/htslib
⁵ http://picard.sourceforge.net/
⁶ http://www.broadinstitute.org/gatk/about/
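The Scatter-Gather idea described in this section can be sketched in a few lines of Python: split the input into independent intervals, process each interval separately, and merge the partial results. This is only an illustration under stated assumptions. The helper names (`scatter`, `count_gc`, `scatter_gather`) are hypothetical, the per-interval work is a toy GC count rather than a GATK command, and a thread pool stands in for the cluster nodes that a real Scatter-Gather run would use.

```python
from concurrent.futures import ThreadPoolExecutor

def scatter(length, parts):
    """Split [0, length) into at most `parts` contiguous half-open intervals."""
    if length == 0:
        return []
    step = -(-length // parts)  # ceiling division
    return [(start, min(start + step, length)) for start in range(0, length, step)]

def count_gc(sequence, interval):
    """Toy per-interval job: count G and C bases inside the interval."""
    start, end = interval
    return sum(1 for base in sequence[start:end] if base in "GC")

def scatter_gather(sequence, parts=4):
    """Scatter the sequence into intervals, run the job on each, gather by summing."""
    intervals = scatter(len(sequence), parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partial = list(pool.map(lambda iv: count_gc(sequence, iv), intervals))
    return sum(partial)
```

The gather step here is a plain sum; for a tool that emits files, gathering would instead mean merging the per-interval outputs in order.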

IV. THE PIPELINE

In this section, we attempt to describe our understanding of each step in the workflow in computer science layman terms as opposed to the point of view of a trained Bioinformatics scientist. For the purposes of analysis, we also divide the pipeline into logical phases.

The input to the workflow is a set of BAM files, which is essentially a really large set of files (totalling 1 billion short reads per genome per person) containing a collection of short reads. Short reads are fragments of a much longer DNA sequence and these are produced by hardware sequencers such as ones produced by Illumina. There are technologies in the marketplace that can produce long reads as well, but we will not discuss them here given our limited understanding of the topic.

Once this is done, these short reads are fed into the pipeline, which is made up of 14 stages.

A. Phase 1: Shuffle and Align

In this stage, the short reads in the BAM files will first be shuffled to minimize bias during the alignment process⁷. Once shuffling is completed, each of the short reads will then need to be aligned (mapped to a specific position) to a large reference genome (in our case this is a sequence 3 billion base pairs long). On the surface, read alignment seems to be a simple problem, i.e. find a particular substring within a bigger string. However, read errors can and do happen, hence turning this into an approximate matching problem, which is then solved utilizing the Burrows-Wheeler Aligner (BWA) transform algorithm. It is our understanding that each of the short reads can be aligned in this stage without knowledge of any other short read data as long as the input BAM file is valid. While running initial experiments we found an interesting problem related to missing base-pairs which is further discussed in Section V.
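To give a feel for the substring-search view of alignment mentioned in this phase, the following Python sketch builds a Burrows-Wheeler transform of a tiny reference and counts exact occurrences of a read with backward search. It is a textbook illustration, not BWA's implementation: it handles exact matches only (no read errors), builds the transform by sorting all rotations (far too slow for a genome), and the function names are hypothetical.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    text += "$"  # sentinel, assumed absent from the text and smaller than any base
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def count_occurrences(bwt_text, pattern):
    """Count exact matches of `pattern` in the original text by backward search."""
    # first_row[c]: index of the first sorted rotation that starts with c.
    first_row = {}
    for index, char in enumerate(sorted(bwt_text)):
        first_row.setdefault(char, index)

    def occ(char, end):
        # Occurrences of char in bwt_text[:end]; a real index precomputes these.
        return bwt_text.count(char, 0, end)

    low, high = 0, len(bwt_text)
    for char in reversed(pattern):
        if char not in first_row:
            return 0
        low = first_row[char] + occ(char, low)
        high = first_row[char] + occ(char, high)
        if low >= high:
            return 0
    return high - low
```

For example, `count_occurrences(bwt("ACGTACGT"), "CGT")` evaluates to 2. Each read is searched independently of every other read, which is consistent with the observation that this phase parallelizes well.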

B. Phase 2: SAM Sorting

Many of the downstream analysis programs which utilize BAM files actually require the files to be sorted, since this allows reading from these files to be done more efficiently.

C. Phase 3: Remove Files

Phase 3 removes the temporary files created in the earlier phases.

D. Phase 4: Mark Duplicates

In this phase, duplicates of any particular unique short read will be marked to prevent a skew during the variant calling process. Duplicates are usually produced due to a particular DNA preparation process and may be unavoidable. Marking duplicates sounds like something that can be built on top of the canonical example of using Map Reduce, i.e. counting the number of words in a given document.
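The word-count analogy can be sketched as follows: key every read by where it aligned (the "map" step), then within each key keep one read and flag the rest (the "reduce" step). This is a simplified sketch, not Picard's MarkDuplicates. The choice of key (chromosome, position, strand) and the rule of keeping the first read seen are assumptions made for illustration, and `mark_duplicates` is a hypothetical name.

```python
from collections import defaultdict

def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of (name, chrom, pos, strand) tuples.
    """
    # "Map": group reads by their alignment signature.
    groups = defaultdict(list)
    for name, chrom, pos, strand in reads:
        groups[(chrom, pos, strand)].append(name)

    # "Reduce": keep the first read in each group, flag all others.
    duplicates = set()
    for names in groups.values():
        duplicates.update(names[1:])
    return duplicates

# Made-up reads: r1 and r3 share the same alignment signature.
EXAMPLE_READS = [
    ("r1", "chr1", 100, "+"),
    ("r2", "chr1", 100, "-"),
    ("r3", "chr1", 100, "+"),
    ("r4", "chr2", 100, "+"),
]
```

Because groups with different keys never interact, the grouping could in principle be partitioned by key across workers, which is what makes the Map Reduce framing attractive.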

E. Phase 5: Remove Files

In this phase too, we remove some files.

F. Phase 6: Index Dedup (Build BAM Index)

In this phase, the output BAM files from preceding stages are indexed for fast access. This essentially allows a particular short read to be accessed by jumping immediately to a specific offset within a particular BAM file, thus negating the need to read preceding data into memory. The output of this process is a set of accompanying index files to the original BAM files.
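The benefit of an index, jumping straight to an offset instead of scanning, can be shown with a toy example. The sketch below records the byte offset of each line in a small tab-delimited stream and then seeks directly to one record. It is only an analogy: a real BAM index is organised by genomic coordinate over compressed blocks rather than by record name, and `build_index` and `fetch` are hypothetical helpers.

```python
import io

def build_index(stream):
    """Map each record's key (text before the first tab) to its byte offset."""
    index = {}
    stream.seek(0)
    while True:
        offset = stream.tell()
        line = stream.readline()
        if not line:
            break
        index[line.split(b"\t", 1)[0]] = offset
    return index

def fetch(stream, index, key):
    """Jump to the recorded offset and read exactly one record."""
    stream.seek(index[key])
    return stream.readline().rstrip(b"\n")

# Made-up data standing in for a file on disk.
DATA = io.BytesIO(b"read1\tACGT\nread2\tTTGA\nread3\tGGCC\n")
INDEX = build_index(DATA)
```

Here `fetch(DATA, INDEX, b"read2")` reads one line starting at byte 11 and never touches the first record.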

⁷ http://www.broadinstitute.org/gatk/guide/tagged?tag=bam

G. Phase 7 and 8: InDel Targets and Realign InDels

The next two phases pertain to Insertion Deletion (InDel) and thus would benefit from a short overview. The term InDel refers to a class of variations that is present in a human genome. The need to align short reads around InDels arises due to 2 major reasons. The first reason is that InDels can cause mappers (such as the BWA algorithm employed in Phase 1) to misalign short reads. The second reason is that those misalignments would then harm the accuracy of downstream processes such as base quality recalibration and variant detection.⁸

In Phase 7, the regions in the BAM files that will need to be realigned are identified. In general there are three types of realignment targets: known sites such as ones coming from the 1000 Genome project, InDels that are seen in the original alignments (as part of the application of the BWA algorithm), and finally sites where evidence suggests a hidden InDel.

Once the InDel regions have been identified, Phase 8 performs the actual realignment.

H. Phase 9: Remove Files

Phase 9 removes the temporary files created by the earlier phase.

I. Phase 10: Baseq (Base Quality) Covariance Stage 1 (Base Recalibration)

Hardware sequencers associate a quality score with their reads. There is however a tendency for sequencers to be overly optimistic in terms of their confidence scores. In this stage, a recalibration table will be built utilizing some machine learning algorithm based on covariation among several features of a base such as read group, the original quality score from the sequencer, 1st/2nd read in a pair, etc.⁹
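As a rough illustration of what a recalibration table contains, the sketch below uses the standard Phred relation Q = -10 log10(p) between a quality score Q and an error probability p, and computes an empirical quality for each reported quality from observed mismatches. This is a simplification under stated assumptions, not GATK's BaseRecalibrator: it uses the reported quality as the only covariate, the +1/+2 smoothing is an assumption added to avoid taking the logarithm of zero, and the function names are hypothetical.

```python
import math
from collections import defaultdict

def phred_to_error(quality):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-quality / 10)

def error_to_phred(probability):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(probability)

def empirical_quality(observations):
    """Build {reported_quality: empirical_quality} from (reported_quality, is_mismatch) pairs."""
    totals = defaultdict(lambda: [0, 0])  # reported quality -> [bases seen, mismatches]
    for quality, is_mismatch in observations:
        totals[quality][0] += 1
        totals[quality][1] += 1 if is_mismatch else 0
    # Smoothed error rate (assumption for this sketch): (errors + 1) / (bases + 2).
    return {quality: error_to_phred((errors + 1) / (seen + 2))
            for quality, (seen, errors) in totals.items()}
```

If bases reported at Q30 (one expected error in 1000) actually mismatch about one time in ten, the table maps 30 to roughly 10, which captures the over-optimism described above.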

J. Phase 11: Baseq (Base Quality) Covariance Stage 2

In this phase, the recalibration table built in the previous stage is utilized to recompute the base quality scores.

K. Phase 12: Plot Base Quality Covariance

In this stage, plots of the recalibration tables are generated so that an evaluation can be made on whether the recalibration has worked properly.

L. Phase 13: Base Quality Recalibration

In this phase, the recalibrated data is subjected to some final processing before being written out to disk.

M. Phase 14: Remove Files

Phase 14 simply removes the Index realignment files.

⁸ http://hmg.oxfordjournals.org/content/19/R2/R131.full
⁹ http://weallseqtoseq.blogspot.com/2013/10/gatk-best-practices-workshop-data-pre.html


V. FRAMEWORK

In this section, we describe the framework that was set up to run the pipeline and the changes and modifications done to the framework for the purposes of analysis. In addition, we also discuss the challenges faced while trying to get the pipeline working with a subset of the reads to ease analysis. Further, we also talk about the challenges we faced while duplicating the environment on another machine.

A. Framework Changes

The framework, essentially a script provided to us, needed an overhaul and further additions so that we could start measuring system performance for each phase of the pipeline. To measure basic resources like CPU utilization, memory utilization and disk I/O, we decided to use the dstat tool for each command. For collecting hardware performance counters, e.g. L1 cache misses, last level cache misses etc., we used the perf-stat utility. Also, since most of the tools are Java based, we thought it would be beneficial to collect JVM Garbage Collector logs as well.
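The per-phase measurement idea can be sketched with the Python standard library alone: wrap each command, and record wall-clock time plus the CPU time and peak memory the operating system attributes to child processes. This is a stand-in, not the framework itself, which relies on dstat and perf-stat. The function `run_measured` and the demo command are hypothetical, and the `resource` module used here is available on Unix-like systems only.

```python
import resource
import subprocess
import sys
import time

def run_measured(phase_name, command):
    """Run one pipeline command and return basic resource usage for it."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.perf_counter()
    subprocess.run(command, check=True)
    wall_seconds = time.perf_counter() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return {
        "phase": phase_name,
        "wall_s": wall_seconds,
        "user_s": after.ru_utime - before.ru_utime,
        "sys_s": after.ru_stime - before.ru_stime,
        # Peak resident set size over all children waited for so far
        # (kilobytes on Linux).
        "max_rss": after.ru_maxrss,
    }

if __name__ == "__main__":
    # Hypothetical demo command standing in for a real pipeline phase.
    print(run_measured("demo", [sys.executable, "-c", "sum(range(100000))"]))
```

Comparing user plus system CPU time against wall-clock time gives a first hint of whether a phase is CPU bound or waiting on I/O, which is the kind of question the dstat traces are meant to answer in more detail.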

B. Data Sources

Our initial experiments were run on the master node of the Bioinformatics lab during a short period of time when it was not being utilized by the department. This gave us an advantage since we did not have to simultaneously understand both tools and environment at the outset. However, once the normal usage of the node was resumed, we had to utilize a different machine. The critical issue was that we had been using private patient data up to that point and, due to privacy concerns, it was impossible for us to move the data set onto the new machine. We were thus faced with the challenge of finding an equivalent data set. We tried a number of sources and finally found the 1000 Genome project to be fruitful.¹⁰

C. Pair issue and 'R'

Our woes did not stop once we had found a data set from the 1000 Genomes Project. We expected a subset of the reads to run successfully through the pipeline. However, we faced cryptic errors in the very first phase of the pipeline. After multiple long debugging sessions and help from the bioinformatics people, we found out that the reads come in pairs and that the bwa tool requires every read to have its pair in the data set.

Since we were running our experiments on a subset of the reads, some reads could be missing their pair, and because of this the pipeline was failing in the first phase itself. To remove this error, we wrote a wrapper script around the given pipeline that pre-processes the input BAM file and removes all the reads whose pairs do not exist.
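The pre-processing step can be pictured with the following minimal sketch. It is not the wrapper script itself: it assumes that reads are reduced to hypothetical `(read_name, sequence)` tuples and that the two reads of a pair share the same read name, whereas real BAM records carry many more fields.

```python
from collections import Counter


def drop_unpaired(reads):
    """Keep only reads whose name occurs exactly twice (one record per mate).

    `reads` is a list of (read_name, sequence) tuples. This simplified record
    type is an assumption made for illustration.
    """
    counts = Counter(name for name, _ in reads)
    return [(name, seq) for name, seq in reads if counts[name] == 2]


# Hypothetical subset in which read "r2" lost its mate during subsetting.
subset = [
    ("r1", "ACGT"),
    ("r2", "TTGA"),
    ("r1", "GGCA"),
    ("r3", "CATG"),
    ("r3", "AACC"),
]
print(drop_unpaired(subset))
```

Counting names first and filtering afterwards preserves the original read order, which matters if later phases expect the input ordering to be unchanged.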

The challenges did not end there. One phase in the pipeline uses R to plot graphs, and it turned out that the tools rely on deprecated R libraries for plotting, which we could not install while replicating the environment, so we had to skip that phase during our experiments. Our initial analyses on the bioinformatics machine showed that this phase did not take a significant share of the pipeline's running time, and hence skipping it should not impact our future experiments.

¹⁰ http://www.1000genomes.org/

Finally, after all these changes, we were able to replicate the complete environment on a new server where we could experiment with different data-set sizes and analyse results.

VI. BASELINE RUN

In order to gain a better understanding of the pipeline, we started our investigation by running the pipeline using the simplest possible configuration that still performs meaningful work. This was accomplished by setting the number of threads to 1 in each phase of the pipeline and by utilizing only 1% of the full data set used in a typical run in the Bioinformatics department, which comes to around 10 million reads. The choice of 1% of the full set was informed by conversations with a student from the department.

The next level of experiments consisted of running the same pipeline and data set with a higher number of threads. Since we were working on a quad-core machine, we chose 4 threads for immediate comparison. Later in the paper, we look at how SMT performs.

Table IV shows that there are mainly 5 phases, which contribute approx. 84% of the total time. Looking at the graphs for those phases, namely Figures 1, 2, 3, 4, 5, 6, 7, 8, 9 and 10, it is evident that neither I/O nor memory is a bottleneck. The remaining graphs of the phases can be found in Appendix A.

This is further evident from the resource usage pattern we captured for an actual run that lasted approx. 7 days. Figures 11 and 12 show that although I/O increased, it is not the bottleneck of the system, and the tools never ran out of memory.

We believe that memory bandwidth contention might be one of the possible reasons why these phases are slow, but due to timing constraints we were not able to explore that avenue.

Phase Name              Single Thread Time (s)    4 Thread Time (s)
Shuff Algn              4932                      1505
Sort Sam                244                       244
Remove BAM              1                         1
DeDup Srtd              299                       299
Remove Srtd             1                         1
Index DeDup             40                        40
Indel Target            2961                      830
Realn Target            393                       394
Remove Dedup            1                         1
Base Covar 1            1276                      719
Base Covar 2            1907                      1330
Plot (Didn't Measure)   0                         0
Baseq Realn             1085                      721
Remove Realn            1                         1

TABLE IV: Time taken by different phases of the pipeline for different number of threads for dataset of size 10 Million reads
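The share of running time attributed to the five dominant phases can be checked directly from the Table IV timings. The short sketch below does this for both columns; the helper name `top_share` is ours. With these numbers, the 4-thread column gives roughly 84%, while the single-thread column gives a somewhat higher share.

```python
# (phase, single-thread seconds, 4-thread seconds) as listed in Table IV
TABLE_IV = [
    ("Shuff Algn", 4932, 1505),
    ("Sort Sam", 244, 244),
    ("Remove BAM", 1, 1),
    ("DeDup Srtd", 299, 299),
    ("Remove Srtd", 1, 1),
    ("Index DeDup", 40, 40),
    ("Indel Target", 2961, 830),
    ("Realn Target", 393, 394),
    ("Remove Dedup", 1, 1),
    ("Base Covar 1", 1276, 719),
    ("Base Covar 2", 1907, 1330),
    ("Plot", 0, 0),
    ("Baseq Realn", 1085, 721),
    ("Remove Realn", 1, 1),
]


def top_share(rows, column, k=5):
    """Fraction of the total time spent in the k slowest phases of one column."""
    times = sorted((row[column] for row in rows), reverse=True)
    return sum(times[:k]) / sum(times)


single = top_share(TABLE_IV, 1)
four = top_share(TABLE_IV, 2)
print(f"single thread: {single:.1%}, 4 threads: {four:.1%}")
```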

A. Investigating Java Runtime Environment

With the exception of the first phase of the pipeline, all the utilized tools are programs written in Java. This gave us another avenue of investigation to pursue: enabling Java Garbage Collection (GC) logging.

Fig. 1: Single thread resource usage for Phase 1: Shuffle and Align. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory utilization (GB) over time (s).

Fig. 2: 4 threaded resource usage for Phase 1: Shuffle and Align. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory utilization (GB) over time (s).

Given the batch nature of each phase in the pipeline, we are primarily interested in the phase throughput, i.e. the percentage of time each pipeline phase spends doing useful work instead of GC. In general, a throughput of 95% or above is considered good.¹¹ Additionally, should the throughput fall below 95%, we are also interested in seeing any instances of a GC taking an excessively long amount of time.

¹¹ http://www.slideshare.net/jclarity/hotspot-garbage-collection-tuning-guide

We modified the script used to run the tool chain to augment every "java" command with the following flags:

-Xloggc:logs -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

These flags write GC data to the specified log file in sufficient detail, including the amount of memory freed in both the Young and Old Generations in each collection as well as the times at which GC happens. We then used a tool called jClarity Censum¹² to visualize the data and collect the throughput metric.
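For reference, the throughput metric can be written down in a few lines. This is a sketch of the usual definition (time not spent in GC pauses divided by total running time), not a description of how Censum computes it internally; the pause durations are made-up values, and the function name `gc_throughput` is ours.

```python
def gc_throughput(total_runtime_s, gc_pauses_s):
    """Fraction of wall-clock time spent outside GC pauses."""
    if total_runtime_s <= 0:
        raise ValueError("total runtime must be positive")
    return 1.0 - sum(gc_pauses_s) / total_runtime_s


# Hypothetical pause durations (seconds) for a phase that ran for 100 s.
pauses = [0.8, 1.2, 2.0, 1.0]
throughput = gc_throughput(100.0, pauses)
print(f"{throughput:.1%}")
print("good" if throughput >= 0.95 else "worth investigating")
```

In practice the pause durations would be taken from the GC log produced by the flags above, and the longest individual pause is worth reporting alongside the aggregate.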

1) Throughput Results: When it comes to the pipeline, the phases utilizing Java that contribute the most to the running time are Indel Targets, Base Covar 1, Base Covar 2, and Baseq Recal, so we constrain our discussion to those 4 phases. For every combination of input size and number of threads, we found that the throughput never dropped below 95%, with the exception of the Base Covariance phase, where the throughput dropped steadily with input size when the number of threads specified is 8 (at 4 threads and below, the throughput remained above 95%).

¹²

Fig. 3: Single thread resource usage for Phase 7: Indel Target Index. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory utilization (GB) over time (s).

Fig. 4: 4 threaded resource usage for Phase 7: Indel Target Index. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory utilization (GB) over time (s).

Phase            10M      20M      40M      100M
Indel Targets    98.6%    98.9%    98.7%    98.9%
Base Covar 1     95.8%    94.7%    93.6%    92%
Base Covar 2     97.1%    96.8%    96.6%    96.6%
Baseq Recal      97.4%    97.5%    97.5%    97.6%

TABLE V: GC Throughput for number of threads = 8

We then made some modifications to the "java" command to see if we could improve the throughput for that phase. In particular, we found that for a data size of 40 million reads, we were able to increase the throughput from 93.6% to 96.1% by specifying a number of additional flags:

-Xms15000m -Xmx15000m -Xmn10000m

The choice of those numbers was informed by the raw GC data collected, in particular the sizes of the Young and Old Generations at the end of the GC log for that phase, with some buffer built in to tolerate possible memory spikes.

Fig. 5: Single thread resource usage for Phase 10: Base Covariance 1. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory utilization (GB) over time (s).

Fig. 6: 4 threaded resource usage for Phase 10: Base Covariance 1. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory utilization (GB) over time (s).

Improvement aside, we do not think that it is particularly meaningful. For example, referring to Table V, improving Base Covar 1's throughput from 92% to 95% would only result in an improvement of around 1 minute (3% of 35 minutes) in running time. It is true that, given that this phase currently takes around 30 hours using the full data set, we may see some real savings in time. However, we also know that the phase is currently run with the number of threads set to 5 on the Bioinformatics head node (and earlier we mentioned that throughput is not an issue when the number of threads is 4). In other words, without further work, we are not able to ascertain how much of an improvement tuning this particular phase would yield.
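The estimate above is a first-order one: the time saved is taken to be the phase duration multiplied by the gain in throughput. The sketch below reproduces the arithmetic. The second call is purely hypothetical, since it applies the 8-thread throughput figures to the 30-hour full run, which actually uses 5 threads and whose throughput we did not measure.

```python
def minutes_saved(phase_minutes, throughput_before, throughput_after):
    """First-order estimate: time saved equals the share of GC time removed."""
    return phase_minutes * (throughput_after - throughput_before)


# Base Covar 1 on 100M reads with 8 threads: about 35 minutes at 92% throughput.
small_run = minutes_saved(35, 0.92, 0.95)

# Hypothetical: the same gain applied to a 30-hour run on the full data set.
full_run = minutes_saved(30 * 60, 0.92, 0.95)

print(f"{small_run:.2f} min, {full_run:.0f} min")
```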

Another thing worth mentioning is that early in our experiments, we saw Java programs run in a single-threaded configuration through the use of nct (see some other table for information on tools that accept a number of data/compute threads) consuming more than a thread's worth of CPU utilization ("top" would occasionally show utilization above 30%). We utilized an application called JConsole, which has shipped with every JDK since version 5.0, to investigate the issue, since the latter has the ability to show all running threads in a particular Java Virtual Machine (JVM).

What we found was that although the tool did start up the number of data/compute threads specified through the configuration option, there were also a number of utility threads started by the tool and the JVM, including a Progress Tracking thread (started by GATK), a number of GC threads and a number of TCP/IP threads. We were not able to ascertain why the JVM would start a number of TCP/IP threads, and there does not seem to be any flag specifically for turning those threads off; however, the presence of these extra threads does explain the CPU utilization phenomenon that we were seeing.

B. Performance counter measurements

Just to be sure that the delay in the phases is not caused by a large number of L1 cache misses, branch-predictor misses or off-chip accesses, we measured hardware performance counters using the perf-stat tool and found them to be consistent across multiple runs for different input sizes and thread numbers. There was about a 3.5% L1 data cache miss rate and 1.2% branch predictor misses. An interesting result was the rate of LLC misses, but because of timing constraints we could not explore this avenue either. However, since these measurements seemed pretty much constant across runs with different sizes and thread counts, we concluded that while the cache misses could be potentially interesting, they would not affect our ability to model the pipeline.

Fig. 7: Single thread resource usage for Phase 11: Base Covariance 2. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption (GB) over time (s).

Fig. 8: 4 threaded resource usage for Phase 11: Base Covariance 2. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption (GB) over time (s).

VII. THE PIPELINE MODEL

One of the ideas behind profiling the pipeline was to build a model based on which we could make predictions, the idea being that if we could accurately model and predict the behavior of the software pipeline, we had, in a manner of speaking, truly understood the workings of the system as a black box.

As mentioned previously, to better understand the pipeline we logically split it into phases. In retrospect, this turned out to be a crucial step in building the model, as it enabled us to better predict the behavior of the entire pipeline as a combination of phases rather than as a single entity. Essentially, we have built a model for each of the phases, which is then used to make a prediction for the entire pipeline. More specifically, the model uses the size of the input data and the number of threads to predict the time the pipeline will take to complete.
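To make the per-phase structure concrete, the sketch below shows one possible shape for such a model. It is not the model developed in this paper: it substitutes Amdahl's law for the per-phase behavior, fits the parallel fraction of each phase from the single-thread and 4-thread timings in Table IV, and assumes that running time scales linearly with the number of reads. All function names are ours.

```python
def parallel_fraction(t1, tn, n):
    """Amdahl's law solved for p, given tn = t1 * ((1 - p) + p / n)."""
    return (1.0 - tn / t1) / (1.0 - 1.0 / n)


def predict_phase(t1_ref, p, reads, threads, ref_reads=10_000_000):
    """Scale the reference time linearly with input size, then apply Amdahl's law."""
    scaled = t1_ref * reads / ref_reads
    return scaled * ((1.0 - p) + p / threads)


# (single-thread s, 4-thread s) for the five dominant phases of Table IV
PHASES = {
    "Shuff Algn": (4932, 1505),
    "Indel Target": (2961, 830),
    "Base Covar 1": (1276, 719),
    "Base Covar 2": (1907, 1330),
    "Baseq Realn": (1085, 721),
}


def predict_pipeline(reads, threads):
    """Sum the per-phase predictions to estimate the pipeline running time (s)."""
    total = 0.0
    for t1, t4 in PHASES.values():
        p = parallel_fraction(t1, t4, 4)
        total += predict_phase(t1, p, reads, threads)
    return total


for threads in (1, 4, 8):
    print(threads, round(predict_pipeline(10_000_000, threads)))
```

By construction the sketch reproduces the measured totals at 1 and 4 threads for 10 million reads; any other prediction is an extrapolation that would need to be validated against measurements.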

A. Building the Model

In building the model, we realized that there are four important factors that affect the running time. Furthermore, each of the phases had a different behavior, which seemed to stem from changes in the four factors.

The factors that affect running time and form an integral part of the model are:

Fig. 9: Single thread resource usage for Phase 13: Base Recalibration. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption (GB) over time (s).

Fig. 10: 4 threaded resource usage for Phase 13: Base Recalibration. (a) CPU utilization (%) and read/write (MB) over time (s); (b) memory consumption (GB) over time (s).

Fig. 11: CPU and I/O utilization for full run on 1 billion reads.