Particle MCMC algorithms and architectures for accelerating inference in state-space models.

(1)

Contents lists available atScienceDirect

International

Journal

of

Approximate

Reasoning

www.elsevier.com/locate/ijar

Particle

MCMC

algorithms

and

architectures

for

accelerating

inference

in

state-space

models

Grigorios Mingas

a

,

Leonardo Bottolo

b

,

Christos-Savvas Bouganis

a

,

∗

a_Department_of_Electrical_and_Electronic_Engineering,_Imperial_College_London,_London,_SW7_2AZ,_UK b_Department_of_Medical_Genetics,_University_of_Cambridge,_Cambridge,_CB2_0QQ,_UK

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory:

Received15May2016

Receivedinrevisedform23September 2016

Accepted24October2016 Availableonlinexxxx

Keywords:

MarkovChainMonteCarlo Particleﬁlter

Fieldprogrammablegatearray Bayesianinference

Hardwareacceleration

ParticleMarkovChainMonteCarlo(pMCMC)isastochasticalgorithmdesignedtogenerate samples froma probability distribution, when the density of the distributiondoes not admit aclosed form expression. pMCMC is most commonly used to sample from the Bayesian posterior distribution in State-Space Models (SSMs), a class of probabilistic models used in numerous scientific applications. Nevertheless, this task is prohibitive whendealingwithcomplexSSMswithmassivedata,duetothehighcomputationalcost of pMCMC and its poor performance when the posterior exhibits multi-modality. This paper aimstoaddressbothissuesby:1)Proposing anovelpMCMC algorithm(denoted ppMCMC), whichusesmultiple Markov chains (insteadofthe one used bypMCMC) to improve sampling efficiency for multi-modal posteriors, 2) Introducingcustom, parallel hardware architectures, which are tailored for pMCMC and ppMCMC. The architectures are implemented on Field Programmable Gate Arrays (FPGAs), a type of hardware accelerator with massive parallelization capabilities. The new algorithm and the two FPGA architectures are evaluated using a large-scale case study from genetics. Results indicatethatppMCMCachieves1.96xhighersamplingefficiencythanpMCMCwhenusing sequentialCPUimplementations.TheFPGAarchitectureofpMCMCis12.1xand10.1xfaster thanstate-of-the-art, parallelCPU and GPU implementationsof pMCMC and upto 53x moreenergyefficient;theFPGAarchitectureofppMCMCincreasesthesespeedupsto34.9x and 41.8x respectivelyand is 173xmorepower efficient,bringingpreviouslyintractable SSM-baseddataanalyseswithinreach.

1. Introduction

MarkovChainMonteCarlo(MCMC)algorithmsareoneofthefundamentaltoolsusedtosamplefromcomplexprobability distributions.TheyarepopularforsamplingfromposteriordistributionsinBayesianinference,whicharetypicallycomplex andhavelargedimensions. Thesesamplesare usedto estimateintegrals,whichare neededinorderto inferparameters, make predictions,performmodel comparison, etc.[1].This processis calledMonteCarlointegration.[2]and[1]contain examplesofhowMCMCsamplesareusedinvariousapplications.

MostMCMC algorithms [1]are based on theassumption that the densityof the sampleddistribution (denoted p

(θ )

, where

θ

isthesampledvariable)canbeevaluatedpointwiseuptomultiplicativeconstant.Nevertheless,therearemodeling

*

Correspondingauthor.

E-mailaddresses:[email protected](G. Mingas),[email protected](L. Bottolo).

http://dx.doi.org/10.1016/j.ijar.2016.10.011

(2)

scenarios in Bayesian statistics where it is not possible to evaluate the probability density because it doesnot admit a closedformexpression.TheseincludeinferenceforStateSpacemodels(SSMs)[3],stochastickineticmodels[4],undirected graphicalmodels[5].Thesecasesareoftencalled“analyticallyintractable”intheMCMCliterature.

The pseudo-marginalMCMC sampler introduced by [6]marked a major breakthroughin this ﬁeld:It showed that if an unbiased estimatorofthe sampleddensityisused insideMCMC (instead oftheclosed formexpression), thesampler still convergestothecorrectdistribution.Alargebodyofworkhasexploitedthisremarkablepropertyduringthelastﬁve years toconstructnewMCMCalgorithms.ThemostwidelyappliedamongthemisParticle MCMC(pMCMC)[3],whichis designedforinferenceonSSMswithunknownparameters.

SSMs withunknown parameters is a widely usedclass ofBayesian models. Theyare applicable toproblems wherea sequence ofunknown (hidden)statesneeds tobeinferredfromknownobservationsandtherelationshipsbetweenstates andobservationshaveunknownparameterswhichalsoneedtobeinferred.Asimpleexampleistargettracking,whereboth the positionofthetracked objectandthenoisein sensormeasurementsare unknown andneedtobe inferredfromthe measurements.ThistypeofSSMinferenceisknowntoleadtoanalyticallyintractabledensities[3].pMCMCusesaParticle Filter(PF)[7]tounbiasedlyestimatethe density(which isanaturalchoiceforSSMs).Therefore,itiscapable ofinferring theSSMposterior.

pMCMChasextendedtheapplicabilityofSSMstoareaslikeecology[8],communicationnetworks[9],systemsbiology

[10]andmarinebiogeochemistry[11].Nevertheless,thecomputationalcostofestimatingthedensityusingaPFisO

(

T

·

P

)

(where T isthe numberofhiddenSSM statesand P isthe numberofparticles ofthe PF) [12].Moreover, each pMCMC iteration requires a separate PF run andtypically thousands ofiterations are needed. This means that usingpMCMC for large SSMsiscomputationally prohibitive.When conventionalCentralProcessingUnits (CPUs)are employed theruntimes can reach months oryears [11,10]. The work presented herewas initially motivatedby such a complex problem:SSMs in genetics,where T,which corresponds toDNA bases,can reachmillions (see[13] andSection 6).This situationforces practitionerstocollectfewerMCMCsamples(whichleadstoincreasedvariance)oruseasimplermodeland/orfewerdata. Inaddition tothiscomputational burden,theissueofmulti-modalityintheposterior(i.e.theexistenceofmanyareas ofhighprobability)isentirelyunaddressedinpMCMCliterature.Nevertheless,multi-modalitycanappearingenetics appli-cations(seeSection6and[13]).Insuchcases,pMCMCexplorestheposteriordensityslowly,acommonproblemformany MCMCmethodswhichisknownasslowmixing.

Inordertohandletheabovelimitations,thisworkfollowstwomaindirections.First,itproposesanewalgorithm,which is abletosample frommulti-modaldistributions moreeﬃcientlythanpMCMC.Second, itleveragesthepower ofparallel hardwareacceleration.Morespeciﬁcally,itusesaclassofhardwaredevicescalledFieldProgrammableGateArrays(FPGAs), which offermassiveparallelresourcesandhavethedistinctivecharacteristicthat theirhardware architecturecan befully customizedaccordingtotherequirementsoftheimplementedalgorithm.Evenmorecrucially,thisworkcombinestheabove twoapproachestofurtherincreaseperformance;algorithmicandhardwaredesignaredonejointlysothatinformationabout thenatureoftheunderlyinghardwarecanbeexploitedwhendesigningthealgorithmandviceversa.Thecontributionsof thisworkaresummarizedbelow:

1. AnewMCMCalgorithm,denotedPopulation-based Particle MCMC–ppMCMC,whichis anextension ofpMCMCand whichusesapopulationofchainsinsteadofthesinglechainusedbypMCMC.ppMCMCimprovesmixingfordensities whichare bothanalyticallyintractableandmulti-modal(Section4).AjustiﬁcationofwhyppMCMCworks, i.e.whyit convergestothedesiredposteriordistribution,isalsoprovided.

2. AnovelparallelhardwarearchitectureforpMCMC,implementedonanFPGA,whichexploitstheparallelisminsidethe PFtoimprovesamplingthroughput(Section5.2).

3. Anovelparallelhardware architectureforppMCMC,alsoimplementedonanFPGA,whichpipelinesthecomputations ofthemultipleppMCMCchainstoincreasetheutilizationofthePFdatapathinsidetheFPGA(Section5.3).

ThenewalgorithmandthetwoFPGAsamplersareappliedtoalarge-scaleinferenceproblemingenetics–anSSMmodel ofDNAmethylationwithunknownparameters(Section6).Thismodelcanleadtouni-modalormulti-modalposteriors.In theformercase,theFPGApMCMCsampleriscomparedtooptimized,state-of-the-artCPUandGPUimplementationsandit isfoundtobe6.4x–14.9xand3.8x–30.8xfasterrespectively.Inthelattercase,thetrade-offbetweenthenumberofchains and thenumber ofPF particles inppMCMC is explored tomaximize performance. Whencomparing sequential software implementationsofthealgorithms,ppMCMCis1.96xfasterthanpMCMC.WhencomparingtheFPGAimplementations,the FPGAppMCMCisupto3.24xfasterthantheFPGApMCMCsamplerandupto34.9xand41.8xfasterthantheCPUpMCMC andGPUpMCMCsamplersrespectively.

Thepaperisorganizedasfollows.Section2providesthenecessarybackgroundonBayesianinference,SSMsandpMCMC, aswellasashortintroductiononFPGAs,whileSection3givesanoverviewofprevious literature.Section4introducesthe newppMCMC algorithmandSection 5describestheproposedFPGAarchitectures.Section 6containsa descriptionofthe case studies used to evaluate the method, followed by Section 7 which presents the results of the evaluation. Finally, Section8concludesthepaper.

(3)

Fig. 1.Hidden states, observations and unknown parameters of SSM.

2. Background

2.1. Bayesianinference

Bayesianinferenceisasimpleframework,whichconsistsininferringtheprobabilitydistributionoftheunknown quan-titiesofaprobabilisticmodelgiventheknowndata.Thisisdonebycombining:1)Informationabouttheunknownsbefore anydataareavailable,i.e.thepriordistributionoftheunknowns,2)Informationabouttheunknownsprovidedbythedata, i.e.thelikelihoodoftheunknowns. Theinferreddistributioniscalledtheposteriordistributionoftheunknownsgiventhe data.Practitionersaretypicallyinterestedinevaluatingorestimatingexpectations(integrals)overthisposterior.Theform oftheseintegralsdependsontheapplicationandisoutsidethescopeofthispaper(formoreinformationsee[2]and[1]).

2.2. State-spacemodelswithunknownparameters

SSMswithunknown parametersareaclassofBayesianmodels,whereahiddensequenceofstateswhichevolvesover time emitsasequence ofobservations,whilethe functionsthatdeﬁne therelationships betweenstatesandobservations haveunknownparameters.Moreformally,letXt

∈

R

nX bearandomvariablerepresentingthestateoftheSSMattimestep t

∈ {

1

,

...,

T

}

,wherenX

∈

N

.LetYk

∈

R

nY be another randomvariablerepresenting theobservationvector attime stept,

where nY

∈

N

. Timesteps here can representtime, space, etc., depending onthe application. AnSSM is deﬁnedby the followingsetofequations:

X1

∼

e

(X

1) (1)

Xt

∼

f

(X

t

|

Xt−1,

θ),

t

>

1 (2)

Yt

∼

g

(Y

t

|

Xt

,θ

),

t

>

0 (3)

Here,e

(

X1

)

istheinitialprobability density, f

(

Xt

|

Xt−1

,

θ

)

istheprobability densityofmovingto thecurrentstate given

thepreviousstate(transitiondensity)andg

(

Yt

|

Xt

,

θ

)

istheprobabilitydensityofobservingthecurrentobservationgiven

the current state (observationdensity). The two densities also depend on a set of unknown parameters,

θ

∈

R

nθ _with nθ

∈

N

.Usually,differentpartsofthe

θ

vector areusedin(2)and(3)butherethewholevectorappearsinbothdensities

forsimplicity.Symbol“

∼

”meansdistributedaccordingto.Fig. 1depictsthestructureofanSSM.

From the standpoint ofBayesian inference, the unknown quantities that need to be inferred are 1) the hidden state sequence X1:T and 2) the unknown parameters

θ

. It is assumed that the random variablerepresenting the observation

vectorisknown,i.e.Y1:T

=

y1:T,wherey1:T isaknownvector(alsocalledthedata).ThereforethegoalofSSMinferenceis

toinferthefollowingjointposterior:

p

(X

1:T

,θ

|

Y1:T

=

y1:T

)

=

p

(X

1:T

,

θ

|

y1:T

)

∝

ρ

(θ)

π

(X

1:T

|

θ) λ(y

1:T

|

X1:T

,θ

)

=

ρ

(θ

)

e

(X

1)

T t=2f

(X

t

|

Xt−1,θ

)

_tT₌₁g

(y

t

|

Xt

,

θ)

(4)

Here,

ρ

(

θ

)

is the Bayesian prior density ofthe parameter

θ

,

π

(

X1:T

|

θ

)

is the Bayesian prior densityof the states and

λ(

y1:T

|

X1:T

,

θ

)

isthelikelihoodofthedata(observations)giventheunknowns.Theright-most partofthesecond lineof

theequationiseasilyderivedfromthedensitiesofSection2.2(see[7]fordetails).

2.3. pMCMC:jointstateandparameterestimationinSSMs

pMCMC(Algorithm 1)solvestheSSMinferenceproblemstochastically,i.e.bydrawingsamplesfromthejointposteriorof equation(4).Inotherwords,pMCMCgeneratesasequenceofN jointsamples

(

X1:T

,

θ

)

1:N,whicharedistributedaccording

to p

(

X1:T

,

θ

|

y1:T

)

.

Ateach loop iterationin line9,pMCMCperforms threesteps:1) Itproposesa candidatesample

θ

∗ fortheunknown parameter basedonthe previous sample

θ

, usinga simpleprobability densityq

(

· | ·

)

(line10). 2) Itcalls a PFalgorithm (which willbeexaminedlater)toachieve twothings:First,togenerateasetof P samplesfromp

(

X1:T

|

y1:T

,

θ

∗

)

.One of

(4)

thesesamplesisselectedrandomlyinline12andserves astheproposed statesequencesample X∗₁_:_T.Second,toproduce anunbiasedestimateofthequantityl

(

y1:T

|

θ

∗

)

,whichisthelikelihoodofthedata,giventheSSMparameters.Theestimate

isdenotedby

˜

l

(

y1:T

|

θ

∗

)

anditisnecessaryforthefollowingstepofthealgorithm.ThePFrunisthemostcomputationally

intensivepartofpMCMC.3)Itacceptsthejointcandidatesample

(

X∗₁_:_T

,

θ

∗

)

withprobabilitymin

(

1

,

a

)

wheretheacceptance ratioais:

a

=

ρ(θ∗)˜l(y1:T|θ∗)q(θ|θ∗) ρ(θ)˜l(y1:T|θ)q(θ∗|θ)

(5)

Otherwisetheprevioussampleisreplicated(lines13–20).Theﬁrstiteration(lines2–7)onlyperformsstep2toinitializethe sampler.TheinputsofpMCMCarethenumberofMCMCiterations(N),theinitialparametersample(

θ

init)andthenumber ofparticles ofthePF(P).The algorithmreturnsthehistoryofsamples Sample

[

1

:

N

]

andposteriors P osterior

[

1

:

N

]

after all N iterationshavebeencompleted.

Algorithm1ParticleMCMC. 1:procedurepMCMC(P,N,

θ

init) 2:Firstiteration(i=1): 3: ˜l(y1:T|θinit),

X

11::PT ←PF(P,θinit)//PFrun 4: Sampleindexp∈ {1,...,P},set

X

init1:T=X

p 1:T 5: Sample[1]=(Xinit1:T,θ

init

) //saveinitialsample 6: P osterior[1]=ρ(θinit)l˜(y1:T|θinit)//saveposterior 7: θ=θinit//temporaryvariable

8:Remainingiterations: 9: fori=2,...,Ndo

10: θ∗∼q(θ∗|θ) //proposenewcandidatesample 11: ˜l(y1:T|θ∗),

X

11::PT

←PF(P,θ∗)//PFrun 12: Sampleindexp∈ {1,...,P},set

X

∗₁_:_T=X₁p_:_T

13: Accept(X∗1:T,θ∗)withprobabilitymin(1,a) 14: ifacceptedthen

15: //savecandidatesampleandposterior 16: Sample[i]=(X∗1:T,θ∗)

17: P osterior[i]=ρ(θ∗)˜l(y1:T|θ∗) 18: θ=θ∗//temporaryvariable 19: else

20: //replicateprevioussampleandposterior 21: Sample[i]=Sample[i−1]

22: P osterior[i]=P osterior[i−1]

23:return(Sample[1:N], P osterior[1:N])

The criticalcontributionof[3]istheproof that,aslongasthelikelihoodestimatesinthenumeratoranddenominator of theratio(5)are unbiased,the algorithmconvergesto the“correct” distribution,i.e.theposterior p

(

X1:T

,

θ

|

y1:T

)

. The

likelihoodestimatesprovidedbythePFareunbiased,sopMCMCsamplesfromthe“correct”posterior.Moreover,thisistrue regardlessofthevarianceofthePFestimate.

Asmentionedabove,runningaPFisnecessaryateachiterationofpMCMCinordertogenerateacandidatesampleofthe SSMstatesandan unbiasedestimateoftheSSMlikelihood.PFsareapowerfulclassofmethodsusedforstate estimation inSSMs.ABootstrapPF(Algorithm 2)isusedhere,asitisthemostpopularPFvariant.

The inputsofAlgorithm 2arethenumberofparticles (P) andthecandidate

θ

∗ (generatedinthepreviousstepofthe pMCMCalgorithm).ThePFusesthesetofP particlestoestimateeachstateinthesequenceX1:T.Ateveryiteration(line 8),

itpropagatesparticlestothenextSSMtimestepusing:1) Sampling(lines13–15)wherethetransitiondensity(2)isused to sample particles for thenext state, 2) Weighting (lines 16–17) wherethe observationdensity(3) is usedto calculate weightsforallparticles.Theweightquantiﬁesthequalityofaparticleasanestimateofthenextstate,3)Resampling(lines 9–12),whereanewparticlesetisgenerated,inwhichparticleswithlargeweightsarereplicatedmanytimesandparticles withlowweightsarediscarded (equivalenttomultinomialsampling).ResamplingiscriticalforthestabilityofthePF(see

[7]).

After all T time steps have ﬁnished, the ensemble of all particle sets X1₁:_:P_T is available. Also, the algorithm uses the computedweights toproduceanestimate ofthelikelihood(line19).Thisestimateisunbiased[7].Theoutputs ofthePF areusedinsidepMCMC.Speciﬁcally,arandomparticlesequenceischosenfromtheensembleX1:P

1:T (line12ofAlgorithm 1)

andthelikelihoodestimateisusedintheacceptancestep,asdescribedabove.

The main tuningparameter ofpMCMCis thenumberof particles(P). Withlarger P,each pMCMC iterationbecomes moretime-consumingbutthevarianceofthelikelihoodestimatedecreases.ThelatterleadstofasterpMCMCmixing[14], i.e.fasterexplorationoftheposterior,sincepMCMCmovesclosertoanexactMCMCalgorithm.Therefore,thereisatradeoff betweentheruntimeofthePFandthemixingofpMCMC.

Thechoiceoftheproposaldistributionq

(

·|·

)

isalsoatuningparameterofthealgorithm.Inpractice,aGaussianproposal isoftenusedanditsvarianceisadjustedtomaximize mixing.Thisistheapproachfollowedinthisarticle;amultivariate

(5)

GaussianproposalisusedanditscovariancematrixistunedforeachexaminedSSMposterior(theexactcovariancematrices canbefoundinSection6.1).Finally,likeinallMCMCalgorithms,itisusualtodiscardaninitialchunkoftheN samplesas burn-in,inordertomakesurethesamplerhasconvergedtothetargetdistribution.Thispracticeisfollowedinthisarticle, aswillbedetailedinSection6.1.

Algorithm2BootstrapParticleFilter.

1:procedurePF(P,

θ

∗) 2:Initialstate(t=1): 3: fork=1,...,Pdo

4: SampleparticlefrominitialdensityX˜k 1∼e(X1) 5: fork=1,...,Pdo

6: CalculateinitialweightW1k←g(y1| ˜Xk1,θ∗) 7:Remainingstates:

8: fort=2,...,T do

9: fork=1,...,Pdo

10: Sampleancestorindexak_from_{₁

,...,P}with 11: probabilitiesproportionalto{W1

t−1,...,W P t−1} 12: andsetresampledparticle

X

k

t−1← ˜X ak t−1 13: fork=1,...,Pdo

14: Sampleparticlefromtransitiondensity 15: X˜k t∼f(Xt|Xkt−1,θ∗) 16: fork=1,...,Pdo 17: CalculateweightWk t←g(yt| ˜Xkt,θ∗) 18:Likelihoodestimate: 19: ˜l(y1:T|θ∗)←tT=1 1 P P k=1Wkt 20:return(˜l(y1:T|θ∗),

X

11::TP= {X11:P,...,X1T:P})

2.4. FieldProgrammableGateArrays

FPGAsareapowerfulanduniqueclassofdigitalhardware devices,whicharetypicallyusedasacceleratorsfor compu-tationally intensivealgorithms. FPGAs fundamentally differfromother hardware devices suchasCentral ProcessingUnits (CPUs)or Graphics Processing Units (GPUs).While CPUs andGPUs haveﬁxed hardware architecturesdeﬁned beforethe chipismanufactured, FPGAsconsist ofare-programmable “fabric”upon whichanycustom hardware architecturecan be mapped[15].AsFig. 2 shows,the FPGAfabricconsist ofanarray ofprogrammable logicblocksandahierarchyof inter-connectswhichallowtheblockstobewiredtogether.Eachblockcanimplementasimplebooleanfunction.Byconnecting theblockstogether anydigitalcircuitcanbe implemented.Moreover,modernFPGAsareequippedwithdedicatedunitsto performcommonarithmeticoperations,aswellasseveralon-chipmemoryblocks.

FPGAsallowthedesignertotailorthehardwaretothespeciﬁccharacteristicsoftheapplication.Thenumberand gran-ularity ofparallel processingelements, the kinds ofarithmetic operators embedded in theseelements, the amount and architectureofcachememoriesandthearithmeticprecisionofcomputationsareallcustomizable.Thus FPGAscanexploit non-obvious and/or limitedparallelism, maximize pipelining eﬃciencyand adaptthe memory architecture to the access

(6)

pattern ofthe algorithm. Forexample, whilea GPU architecture hasa pre-defined amount ofon-chipmemory per pro-cessing core, an FPGA can allocatea custom amountof on-chipmemoryto each “processingelement” or evenallow all elements toaccessallthememory,depending ontherequirementsofthetargetedalgorithm.Moreover,theprogramming of thefabriccan bedone asmanytimesasdesiredandevenduring runtimeand/or partially.Allthisflexibility canlead to significantlyhigherperformance[15] andlowerpowerconsumption [16]comparedtofixed-architecturedevices(CPUs, GPUs).Nevertheless,FPGAdesignrequiresspecializedprogrammingskillsandlongerdevelopmenttimesthanCPUandGPU coding,despitetherecentemergenceofvarioushighleveltools[17].Inaddition,CPUsandGPUsremaincompetitiveforthe typesofproblemstheyaredesignedfor;sequential/conditionalcodeandSingle-Instruction-Multiple-Data(SIMD)algorithms respectively.

Thisisthe mainreasonwhichhasledhardwaremanufacturers todevelop devicesthatcombinetraditionalCPUswith acustomFPGAfabriconthesamechip[18].Thesedevicescombinethebeneﬁtsofbothworlds;theyallowthesequential and non-criticalpart ofthe targetedalgorithm (including librariesthat might be hard to migrateto an FPGA) tobe run on theCPU,while thecustom fabricisresponsible fortheparallel, computationallyintensivepart.Different architectures can be loaded onthe fabric depending onthe algorithm.It isexpected that thesehybridchips willplay an increasingly important roleinthe datacenterandhighperformance computingmarkets,wherethe demandforhighthroughputand low poweris strong.AlgorithmslikeMCMCandPFare commonlyusedinthesecomputingenvironments,since they are fundamentaltoolsfornumerouslarge-scalescientiﬁcapplications.

3. Relatedwork

TheﬁrstworkonparallelaccelerationofpMCMCusingGPUswaspublishedin2012[19].Thesampleandweightstepsof thePFwereparallelizedandasystematicresamplingalgorithmwasimplementedusingaparallelpreﬁx-sumtechnique.No detailsontheGPUdeviceandthespeedupsoversoftwarewerepresentedandthecasestudiesweresmallSSMs(T

=

100). Thereforetheeﬃciencyoftheimplementationcannotbeevaluated.Moreover,thelimitednumberofstatesandparticlesis notrepresentativeofreal,large-scaleproblemswherehardwareaccelerationisneeded.

A more recentwork [10] proposed an automated tool (LibBi) forSSM inference on CPUsand GPUs whichsupported pMCMC.LibBi providesadomain-specific languageforSSMsandaparallelprocessingback-end(combiningMPI, OpenMP, SSE andCUDA). Nevertheless,the language is inflexible in terms of the definable transition/observation densities; some populardensitiesandseveralcommontypesofcomputationsarenotsupported.Some ofthesedisadvantagescanbe mit-igatedby thefactthatthetool isopenlyavailableandcanbemodified. [10]alsocontains limitedresultsonapplyingthe framework to a Lorenz ’96SSMwith T

=

40 and P

=

8192.The evaluationshowedthat using the framework ona CPU withOpenMPandSSEoptimizationswas5xfasterthanrunningsequentialC++code;thecombinedCPU/GPUversion(with CUDA)providedanextra4xspeedup.

TheworkpresentedhereisthefirsttouseFPGAstoaccelerate pMCMC,takingadvantageoftheuniquefeaturesofthe platform (e.g. custom pipelining)to accelerate sampling. It isalso the firstto usea modified, parallelized version of the Residual SystematicResampling algorithm inside the PF (first introduced in [20]). Finally, thiswork isthe first to apply pMCMCtolargeSSMs,whereT andP scaletomanythousands.

In the algorithmic level,[21] and [14] haveproposed ways to optimize P based onthe tradeoffbetweenPF runtime andpMCMCmixing,whichwasmentionedabove.Dependingonassumptions,theyrecommendthat P shouldbechosenso thatthevarianceofthelikelihoodestimateisbetween1.0and3.2,inordertomaximizepMCMC’smixingpersecond(the “exploration”ofthestatespacethatpMCMCachievesperunitofcomputingtime).Nevertheless,thefollowingissuesremain unaddressed:1)WhenoptimizingthevalueofP,both[21]and[14]assumethatthePF’sruntimegrowsproportionatelyto

P;thisisnottruewhenusingparallelimplementationofthePFalgorithm.2)TheuseofmultipleMCMCchainstoimprove mixinghasnotbeenexamined,althoughitisclearthatpMCMCcanmixveryslowlyformulti-modalposteriors,evenwith large P.

This worktackles theﬁrst issueby takingintoaccount theperformance ofparallel PFimplementations (eitherin the FPGAorintheCPU/GPU)whenoptimizingthevalueofP.Inordertoaddressthesecondissue,anovelpMCMCsampler (de-notedppMCMC)isintroduced,alongwithan accompanyingFPGAarchitecture.ppMCMCimprovesmixingformulti-modal posteriors byutilizingmultipleMCMCchainsinsteadoftheoneusedinpMCMC.The“correctness” ofthenewalgorithm, i.e.thefactthatitsamplesfromthecorrectSSMposterior,isalsojustiﬁed.Moreover,incontrastto[21]and[14],where P

istheonlyoptimizedparameter,thepresentworkjointlyoptimizes P andM(thenumberofMCMCchainsinppMCMC)in ordertoﬁndthecombinationthatmaximizedmixingpersecond.

Finally, this is the ﬁrst use of pMCMC and FPGAs for methylation analysis (see the case study of Section 6). So far, pMCMC was considered intractable for such problems and approximatemethods were used [22], leading to bias in the resultingestimates.

4. ppMCMC:anewMCMCalgorithm

4.1. Summary

This section presents a novel MCMCalgorithm, which extendspMCMC by adding auxiliary MCMCchains in order to improve mixinginscenarios wheretheSSM posterior(4)is multi-modal,such asthe onepresented inSection 6.

(7)

Multi-modalityistheexistenceofmultipleareasofhighprobabilityintheposterior[23].StandardpMCMCfacesthewell-known issueofslowmixingwhensamplingfromsuchposteriors.Thisissueiscommoninmanysingle-chainMCMCmethods.The samplertendsto“getstuck”inoneofthemodesoftheposteriorandrarelymanagestomovetoanothermode[23].

ThenewalgorithmiscalledPopulation-based ParticleMCMC(ppMCMC)anditisacombinationoftwoexistingMCMC methods(Population-basedMCMC[23]andpMCMC).Thepseudo-codeofppMCMCisgiveninAlgorithm 3.Itmakesuseofa populationofMpMCMCchains.Ateachiteration(loopinline 10),ppMCMCperformstwokindsofoperationsforallchains. The ﬁrst is the updateoperation (loop in line12), during which the next MCMC sample of each chain is generated, usingthe samethreesteps described forpMCMC: 1)Propose acandidateunknown parameter sample

θ

∗ (line 13) using densityqj

(

· | ·

)

forchain j

∈ {

1

,

...,

M

}

.LikeinpMCMC,multivariateGaussianproposaldistributionsare usedinppMCMC.

The covariance matrix of these distributions can be tuned. 2) Run a PF to propose a candidate state sequence sample

X∗₁_:_T andan estimate ofthe likelihood (lines 14–15). 3) Accept thejoint candidatewith probability min

(

1

,

aj

)

for chain j

∈ {

1

,

...,

M

}

. Otherwise repeat the previous sample (lines 16–24). The difference compared to pMCMC is the use of a differentacceptance ratiofor each chain(step 3). This leads each chain tosample froma differentdistribution. Chain 1 samplesfromthe“correct”posterior(4).Theotherchainssamplefrom“smoothed”versionsof(4)andtheirmixingisfaster thanthemixingoftheﬁrstchain,aswillbeexplainedinSection 4.2.Onlythesamplesoftheﬁrstchainarekept.

Thesecondoperation withineachiterationofppMCMCistheexchangeoperation(loopinline26),whichwasnotused inpMCMC.Duringthisoperation, pairsofneighboring chains (e.g.

(

1

,

2

)

,

(

3

,

4

)

,etc.)exchange their MCMCsampleswith probabilitymin

(

1

,

eq

)

(forpair

(

q

,

q

+

1

),

q

∈ {

1

,

...,

M

−

1

}

).Sample exchangeshelp thealgorithmmixfasterbecausethey

pushsamplesfromtheauxiliarychainstotheﬁrstchain,helpingittoescapefromlocalmodes.Thisisexplainedinmore detailbelow.

Algorithm3Population-basedParticleMCMC.

1:procedureppMCMC(P,N,M,

θ

init 1:M,T emp1:M) 2:Firstiteration(i=1): 3: forj=1,...,Mdo 4: ˜l(y1:T|θinitj ), X11::PT ←PF(P,θinitj )//PFrun 5: Sampleindexp∈ {1,...,P}andset

X

init

1:T←X p 1:T 6: Sample[j][1]←(Xinit 1:T,θ init j )//savesample 7: P osterior[j][1]←ρ(θinit j ) ˜ l(y1:T|θinitj ) 1 T emp j

8: θj←θinitj //temporaryvariable(chain j) 9:Remainingiterations:

10: fori=2,...,Ndo

11: Chainupdates:

12: forj=1,...,Mdo

13: θ∗∼qj(θ∗|θj)//proposenew

θ

(chain j) 14: ˜ l(y1:T|θ∗),

X

11::PT ←PF(P,θ∗)//PFrun 15: Sampleindexp∈ {1,...,P}andset

X

∗₁_:_T=X₁p_:_T

16: Acceptcandidatesample(X∗₁_:_T,θ∗) 17: withprobabilitymin(1,aj) 18: ifacceptedthen

19: Sample[j][i]=(X∗1:T,θ∗)//savesample 20: P osterior[j][i]=ρ(θ∗)˜l(y1:T|θ∗)

1

T emp j

21: θj=θ∗//temporaryvariable(chain j) 22: else 23: Sample[j][i]=Sample[j][i−1] 24: P osterior[j][i]=P osterior[j][i−1] 25: Chainexchanges: 26: for(q,r)=(1,2),(3,4),. . .OR(q,r)=(2,3),(4,5),. . .do 27: Sample[q][i]↔Sample[r][i] 28: P osterior[q][i]↔P osterior[r][i] 29: θq↔θr

30: withprobabilitymin1,eq 31:return(Sample[1][1:N],P osterior[1][1:N])

4.2. Updateandexchangeoperations 4.2.1. Updates

Duringtheupdateoperationforchain j

∈ {

1

,

...,

M

}

,thefollowingacceptanceratioisused:

aj

=

ρ(θ ∗₎_˜_l₍_y 1:T|θ∗) 1 T emp j _q j(θj|θ∗) ρ(θj)˜l(y1:T|θj) 1 T emp j _q j(θ∗|θj)

,

j

∈ {

1

, ...,

M

}

(6)

(8)

where T empjisthetemperatureofchain j and1

=

T emp1

<

T emp2

< ...

<

T empM

<

∞

.Therefore,onlychain j

=

1 uses

theacceptanceratioofpMCMC(givenin(5))andthussamplesfromthe“correct”SSMposterior.Theremaining(auxiliary) chains (j

∈ {

2

,

...

M

}

)useatempered acceptanceratio,i.e.the estimatedlikelihoodinthenumeratorandthedenominator israisedtothepowerof _{T emp}1

j forchain j.Thisleadstheauxiliarychainstosample fromsomesetof“smoothed”(closer

to uniform) versions ofthe “correct” SSMposterior. Chains withsmoother densities mix faster, since itis easier forthe samplertojumpbetweenmodeswhenthemodesaresmoothed.

Althoughapplyingatemperaturetothelikelihoodisawell-knowntechniqueinpopulation-basedmethods,inthecaseof ppMCMCitisnotclearwhatthetargetdistributionofeachtemperedchainis.Thetermp

(θ )

p

˜

(

y1:T

|

θ )

1

T emp j_p

₍

_X₁_:_T

_|

_y₁_:_T

_,

_{θ )}

(which contains the likelihood estimate)is used in the acceptance ratio of chain j but this doesnot lead the chain to convergetotheposteriorp

(θ )

p

(

y1:T

|

θ )

1

T emp j_p

₍

_X₁_:_T

_|

_y₁_:_T

_,

_{θ )}

_,_as_one_would_intuitively_expect_after_comparing_to_the_pMCMC

acceptanceratioin(5).

According to the theory presented in [6], a pMCMC chain converges to a target distribution provided that unbiased estimates ofthe distribution’sdensity areused inthe numeratoranddenominator of theacceptance ratio.Nevertheless, the term p

(θ )

p

˜

(

y1:T

|

θ )

1

T emp j_p

₍

_X₁_:_T

_|

_y₁_:_T

_,

_{θ )}

_is _not_an _unbiased _estimator_of _p

_{(θ )}

_p

₍

_y₁_:_T

_|

_{θ )}

1

T emp j_p

₍

_X₁_:_T

_|

_y₁_:_T

_,

_{θ )}

_: _Running

a PFonthe givenSSM producesan unbiasedestimate p

˜

(

y1:T

|

θ )

of thelikelihood(i.e.

E

˜

p

(

y1:T

|

θ )

=

p

(

y1:T

|

θ )

).

How-ever, applying thetemperature afterthe likelihood estimate is generated (i.e.ﬁnding p

˜

(

y1:T

|

θ )

1

T emp j₎ _does_not _maintain

unbiasedness withrespecttothe “correct”temperedlikelihood (i.e. p

(

y1:T

|

θ )

1

T emp j_). _In_more_detail,_because_the _function

x

→

x

1

T emp j _is_concave_for_{T emp}_j

_≥

_1,_applying_Jensen’s_inequality_[24]_leads_to_the_following:

E

p

˜

(y1

_:T

|

θ )

1 T emp j

_≤

_E

_p

_˜

_(y1

:T

|

θ )

1 T emp j

₌

_p

_(y1

:T

|

θ )

1 T emp j (7)

Thisis trueforall ppMCMCchains sincealltemperaturesare greaterorequalto one.Equalityholdsonlyfor T empj

=

1.

Therefore, unbiasedestimatesofthe“correct”tempered likelihooddensities p

(

y1:T

|

θ )

1

T emp j _(and_thus_the_respective

pos-teriordensities)canbeacquiredonlywhenT empj

=

1 (i.e.onlyinthecaseoftheﬁrstchain).Nevertheless,thisdoesnot

mean that the tempered chains donot convergeto any target distribution. Infact, chain j convergesto the distribution whose density is unbiasedlyestimated by p

(θ )

˜

p

(

y1:T

|

θ )

1 T emp j_p

₍

_X

1:T

|

y1:T

,

θ )

. The densities of thesedistribution can be

writtenas: pj

(X

1:T

, θ

|

y1:T

)

=

p

(θ )

E

˜

p

(y

1:T

|

θ )

1 T emp j

_p

_(X

1:T

|

y1:T

, θ ),

j

∈ {

1

, ...,

M

}

(8)

These are theactual target densities of the M chains ofthe ppMCMC algorithm. Only for the ﬁrst chainthe density is equaltothe“correct”temperedposterior(withT emp1

=

1)andalsotothe“correct”SSMposterior,i.e. pj

(

X1:T

,

θ

|

y1:T

)

=

p

(

X1:T

,

θ

|

y1:T

)

.Thekeypoint hereisthat itisnotnecessaryfortheauxiliary chainstosample fromthesetof“correct”

temperedposteriors,sincetheirsamplesarenotkept.Onlythesamplesofthefirstchainarekeptbecausetheyaretheones distributedaccordingtothedesired,“correct”SSMposterior.Theauxiliarychainsareonlyemployedtohelpthefirstchain mix faster;they need toexplore thedistribution spacequickly(andthereforetheir target distributions needtobe closer touniform)andoccasionallyfeedthefirstchainwithsamplesthroughexchangemoves.Thesesampleshelpthefirstchain escapefromlocalmodes,aswillbeexplainshortly.Itisthereforeenoughfortheauxiliarychainstosamplefromsomeset oftemperedversionsoftheSSMposterior(andnotnecessarilyfromthe“correct”setoftemperedposteriors).Thedensities inequation(8)providethistemperingeffectandthereforefulfill theirpurpose,i.e.theymovefastinthedistributionspace andhelpthefirstchainmixfasterthroughexchangemoves.

Infact,theterm“correct”isonlyusedhereforreasonsofclarity;thereisnoreasontobelievethatthe“correct”densities

p

(

y1:T

|

θ )

1

T emp j _are_the_best_candidates_for_use_in_auxiliary_chains_(with_respect_to_the_mixing_gains_they_offer)._On_the_other

hand,thisdoesnot meanthatanydensitywouldserveasagoodauxiliarydensity,e.g. uniformauxiliary densitieswould not help the mixingof the ﬁrst chain because they are not concentrated around the true modes. In other words, some

smoothingmustbeappliedtothetruedensitiesbuttheformoftheoptimalauxiliarydensitiesisnotknown.

Regardingthechoiceofthetemperaturevalues,theyoftenfollowageometricoradditiveprogressioninMCMCliterature, i.e. Ti+1

Ti=constant orTi+1

−

Ti

=

constant,althoughother choiceshavealsobeenconsidered[25].Intheevaluationsectionof

thisarticle,ppMCMCtemperaturesfollowanadditiveprogression.TheexacttemperaturevaluesaregiveninSection6.

4.2.2. Exchanges

Inordertoexploitthetemperingprocessdescribedabovetoimprovemixing,ppMCMCmakesuseofsampleexchanges, which attempt to swap samples between chains at each iteration. Exchange moves are attempted between chain pairs

(

1

,

2

),

(

3

,

4

),

. . .

orchainpairs

(

2

,

3

),

(

4

,

5

),

. . .

(neighboring chains)inarotatingorder.Asmentionedabove, theexchange movespushMCMCsamplesfromthehigh-temperaturechains,whichareclosertotheuniformdistribution,tothe lower-temperature chains, which are closerto the “correct” target distribution. Eventually samplesreach the ﬁrst chain which

(9)

samplesfromthe“correctdistribution”andhelpitescapefromlocalmodes.Theexchangeacceptanceratiobetweenchains

(

q

,

r

)

is: eq

=

˜l(y1:T|θr) 1 T empq ˜_l₍_y 1:T|θq) 1 T empr ˜l(y1:T|θq) 1 T empq _˜_l₍_y 1:T|θr) 1 T empr (9)

whereq

∈ {

1

,

...,

M

−

1

}

,r

=

q

+

1 and

(

Xq₁_:_T

,

θ

q

₎

_and

₍

_Xr

1:T

,

θ

r

)

arethecurrentsamplesofchainsqandrrespectively.Ithas

tobenotedthattheaboveratiorequiresnoadditionalPFruns(allthevaluesarealreadyknownfromtheprecedingupdate steps).

Itisimportanttojustifywhytheaboveexchangemovefulﬁlls therequirementsofthetheoryofpMCMC[6]withregards to maintainingthecorrecttarget distributions ofthe twochains. Theexchange step isequivalent toa Metropolisupdate where the updated state is the jointstate ofboth chains (withindexes q and r). According to [6],a Metropolis update maintainsthetargetdistributionaslongasthenumeratoranddenominatoroftheacceptanceratioareunbiasedestimates ofthetargetdensity.Inthecaseoftheexchangestep(andfocusingonlyonthenumeratorforsimplicity),thismeansthat the product p

(θ

r

)

p

˜

(

y1:T

|

θ

r

)

1 T empq _p

₍

_Xr 1:T

|

y1:T

,

θ

r

)

p

(θ

q

)

p

˜

(

y1:T

|

θ

q

)

1 T empr _p

₍

_Xq 1:T

|

y1:T

,

θ

q

)

(denoted by U) hasto be an

unbiasedestimateoftheproductofthetargetdensitiesofthetwochains(which weregiveninequation(8)).Itiseasyto show that thisisthe case.Since p

(θ

r

)

, p

(

Xr₁_:_T

|

y1:T

,

θ

r

)

, p

(θ

q

)

and p

(

Xq₁_:_T

|

y1:T

,

θ

q

)

are all zero-varianceestimators,the

followingistrue: E

[

U

] =

E

[

p

(θ

r

)

p

˜

(y

1:T

|

θ

r

)

1 T empq _p

_(X

r 1:T

|

y1:T

, θ

r

)

p

(θ

q

)

˜

p

(y

1:T

|

θ

q

)

1 T empr _p

_(X

q 1:T

|

y1:T

, θ

q

)

] =

p

(θ

r

)

p

(X

r1:T

|

y1:T

, θ

r

)

p

(θ

q

)

p

(X

q₁_:_T

|

y1:T

, θ

q

)

E

˜

p

(y

1:T

|

θ

r

)

1 T empq _p

˜

_(y

1:T

|

θ

q

)

1 T empr

(10)

Moreover,it isknownthat p

˜

(

y1:T

|

θ

r

)

1

T empq _and _p

˜

₍

_y 1:T

|

θ

q

)

1

T empr _are _independent_estimators_(since_they_are _generated_by

two independent PFs, each assigned to its own MCMC chain). Therefore, their product is equal to the product of their expectations.Combiningthiswithequation(10),itisclearthat:

E

[

U

] =

p

(θ

r

)

p

(X

r 1:T

|

y1:T

, θ

r

)

p

(θ

q

)

p

(X

q 1:T

|

y1:T

, θ

q

)

E

p

˜

(y1

_:T

|

θ

r

)

1 T empq

_E

_p

˜

_(y1

:T

|

θ

q

)

1 T empr

=

_p_q

_(X1

_:_T

_{, θ}

r

|

_y1_:_T

₎

_p_r

_(X1

_:_T

_{, θ}

q

|

_y1_:_T

₎

(11)

Thelast lineoftheequation istruedueto equation(8)andproves thattheproduct U isan unbiasedestimateofthe of theproductofthetargetdensitiesofthetwochains.

5. FPGAarchitecturesforpMCMCandppMCMC

ThissectionpresentsnovelFPGAarchitecturesforpMCMCandppMCMC,whichexploittheinherentparallelismofeach algorithm.

5.1. Parallelisminthealgorithms

In pMCMC, the available parallelism is O

(

P

)

, since all particles inside the PF can be processed inparallel (although resampling requires communication between parallel processes). The N iterations of pMCMC are strictly sequential. In ppMCMC,the availableparallelism increasesto O

(

M

·

P

)

duetotheexistence of M MCMCchainswhich canbe updated independently(althoughexchangesrequireinter-chaincommunication).Theindependenceoftheupdateoperationscanbe exploitedtopipelinecomputationsinthePFdatapath.

5.2. pMCMCarchitecture

The pMCMCarchitectureis illustratedin Fig. 3.ThepMCMC blockwithin thearchitecture comprises allthe necessary parts of an MCMC sampler plus a PF block. The PF block comprises three stages: Sample & Weight, Partial sums and Resampling(groupedinadifferentwaycomparedtothedescriptionofSection2.3forimplementationreasons).Theoutput MCMCsamplesarestoredinexternal(off-chip)memory,whichisnotshownintheﬁgure(moredetailsbelow).

AteverypMCMCiteration(i

∈ {

1

,

...,

N

}

),thecurrent(latest)

θ

sampleisreadfromtheCurrentthetamemoryandsentto theSampleProposalblock,whichsamplesfromq

(

θ

∗

|

θ

)

.Thecandidate

θ

∗ iswrittentotheProposedthetamemoryand for-wardedtothePF.Thelatterreturnstheestimatedlikelihood

˜

l

(

y1:T

|

θ

∗

)

andarandomlyselectedstatesampleX₁∗_:_T

=

Xp₁_:_T;

thesearewrittentotherespectivememories.ThePriorEvaluationblockcomputestheprior

ρ

(

θ

∗

)

.TheUpdateblockaccepts orrejects

(

X∗₁_:_T

,

θ

∗

)

basedon(5),usingseveralinputvalues.Iftheupdateissuccessful,thecandidatesample andthe pro-posedlikelihoodandpriorvaluesarewrittentotherespectivecurrentmemories(equivalentto Sample

[

i

]

and P osterior

[

i

]

(10)

Fig. 3.FPGAarchitecturefor pMCMC.Theorangememoryblocksareinputs/outputsofthePF.ThepinkmemoryblocksareoutputsofpMCMC. (For interpretationofthereferencestocolorinthisﬁgurelegend,thereaderisreferredtothewebversionofthisarticle.)

each iteration,thecontentsofthecurrentmemoriesaresent totheoff-chipmemory.Theabovesteps arerepeatedforall pMCMCloopiterations.

PFblock:ThisblockisactivatedonceperpMCMCiteration.Thesample,weightandresamplingoperationsarerepeated

T times,onceforeverytimestep,withtheorderofoperationschangedcomparedtoAlgorithm 2.Eachstephastoprocess all P particles(loopsinlines9,13and16).Computationsforeachparticleareindependent,thereforetheyareparallelized andpipelined. Thearchitecture’sparallelism (number ofparallel modules)isthe sameforall stepsandis denotedby B. Eachparallel moduleprocesses P_B particles.Eachmemorythatfeedsthemodules ispartitionedintoB sub-memoriesand eachsub-memoryisassignedtoonemodule.

The transitionand weight operations are implemented inside the Sample&Weight stage of the architecture. At each stept,particlesX1_t₋:P₁ arereadfromtheParticlememoriesandfedtotheB Statetransitionblock.Thesesamplefrom(2)using

θ

∗astheparameter.TheoutputiswrittentoParticlememories2andpassedtotheBparallelobservationdensitymodulesto computethelog-weightslog

(

W1:P

t

)

using(3)(workingwithlog-weightshelpsavoidnumericalissues).Again,thecandidate

θ

∗isusedastheparameterin(3).Thelog-weightsarewrittentotheLog-weightmemoriesandpassedtoacomparatortree toﬁndthemaximumlog-weight.

Thefollowingtwostagesofthearchitectureimplementresampling(equivalenttolines9–12ofAlgorithm 2),usingthe Residual Systematic Resampling(RSR)algorithm[20].RSRrequiresthepartialsumsofweightsforeachparallelprocessing module,i.e.modulel

∈ {

1

,

...,

B

}

requiresthesum

(l−1)

P B

k=1 W

k

t.RSRoutputsthereplicationcountsofallparticles(r1:P),i.e.

thenumberoftimeseachparticleisreplicated.

In the PartialSumsstage: 1) The maximum log-weight isused for renormalizationto avoid numerical issues,2) Log-weightsareexponentiatedtoﬁndtheweightsand3) Bparallelaccumulatorsproducethesumsofweightsofeachmodule

l

P

B

k=(l−1)P B+1

W_tk

,

l

∈ {

1

,

...

B

}

.Thesearethen usedtogeneratethepartialsumsandthesumofall weights.Inthe Resam-plingstage,theRSRarchitecture proposedin[20]isused. B pipelineddatapathscomputethereplicationcountsandwrite them to therespective memory. Detailscan be found in[20]. Afterall countsare stored, they are read sequentiallyand particlekiscopiedfromParticlememories2toParticlememories1 rk _times._The_use_of_two_memories_is_necessary_to_avoid

overwriting particlesbefore their replicationcountsare examined. Theprocess cannot be parallelizedby replicatingeach bunch of P_B particlesindependently,sincethetotalnumberofreplicationsofabunchmightexceedthememorypartition assignedtoit(1_Bthofthetotalmemory).Forinstance,ifP

=

10 and B

=

2,eachparallelblockwouldprocess5replication counts/particles andhavea particlememoryoflength5to itsdisposal.Ifparticlek

=

1 hasareplication countof6,this

(11)

Fig. 4.FPGA architecture for ppMCMC.

meansthat theﬁrst parallelblock doesnothaveenough memoryspaceto replicatetheparticleandneeds toaccessthe secondmemoryblock.Thereplicationprocesswasnotexaminedin[20].

Inparalleltoresampling,themeanofthecurrentweightsiscomputed(usingthepartialsums),itslogarithmisfound, therenormalization(mentionedabove)isinvertedandtheresultisfedto anaccumulator(below theResamplingstage in

Fig. 3).AfterthePFstageshavebeenrepeated T times,thisaccumulatoroutputs theﬁnalestimate ofthelog-likelihood:

log

(

˜

l

(

y1:T

|

θ

∗

)

=

T

t=1log

(

1P

P

k=1Wtk

)

.ThisiswrittentotheProposedlikelihoodmemory.Also,afterT timestepstheSaved particlesmemorycontains arandomlyselectedsetofparticles,i.e.theproposedstate X∗₁_:_T,whichiscopiedtotheProposed statesmemory.

Allmemories haveconstant sizes, basedonuser-deﬁned parameters (maximum states Tmax,maximumparticles Pmax, nθ,nX andny).

5.3. ppMCMCarchitecture

The ppMCMC architecture is based on the pMCMC architecture but differs in two ways: 1) It permits many chains (M)to beprocessedconcurrently bythe PFdatapath,2)It performsexchange movesbetweenchains.The Proposedtheta,

Proposedstates, Proposedlikelihood, Currenttheta, Currentstates, Currentlikelihood and Priormemories are replicated Mmax

times(maximumnumberofchains,setbytheuser).Thisway,eachchainhasitsseparatememoryandcanbeupdatedand exchangedwithoutinterferingwiththeotherchains.

TheppMCMCarchitectureisshowninFig. 4.TheSampleproposalblockproposes

θ

∗samplesforall Mchainsandwrites themtotherespectivememories.Whenitﬁnishes,thePFcomputesthelikelihoodsandstatesamplesforallchainsusing coarse-grainpipelining(seenext paragraph).OnlyonePFblockisinstantiated(asinpMCMC);allchainsareprocessedby thisoneblock.ThisisfollowedbytheUpdateandExchangeblockswhichuseequations(6)and(9)toperformtherespective operations.TheUpdateblockneedsanextrainputcomparedtopMCMC–thechaintemperatureT empj.TheExchangeblock

attemptsswapsbetweenpairsofchainsassoonasbothchainshavebeenupdatedbytheUpdateblock(i.e.itdoesnotwait forallchaintobeupdatedaswasthecaseinAlgorithm 3).

(12)

Fig. 5.Coarse-grainpipelininginppMCMCarchitecture:Resamplinghasthelargestlatency,thusanewchainisfedtothePFdatapatheveryLatre=350 cycles.Chains j=3:5 areinthedatapath,allofthemattimestept=2.Chain j=4 willstartResamplingonecycleafterchainj=3 ﬁnishesResampling.

Coarse-grainpipelininginthePF:ThePFblockisthesameasinpMCMC(i.e.asingleblock)butexploitstheindependence of ppMCMCchains to increase datapathutilization. Running the PFfora single chain requirestraversing thedatapath T

times,witheach traversalcomprisingthe Sample&Weight, PartialsumsandResampling stages.Eachtraversalhasto wait untiltheprevious oneﬁnishes;also,each stagecanstartonlyafterthepreviousstage ﬁnishes.Thismeansthat,withone chain (asinthecaseofpMCMC), onlyoneofthe stagesisutilizedatagivenmoment.Let Latsw, Latps andLatre be the latenciesofthethreestagesforonetimestep(forall P particles).Thenthetotaldatapathlatencyis Latsw

+

Latps

+

Latre.

Withoutpipelining,thetotallatencyforM chains(withT timestepseach)isM

·

T

· (

Latsw

+

Latps

+

Latre

)

.

The ppMCMC architecture coarsely pipelines chain computations,exploiting the fact that the PF run of chain j does not need to wait forthe PF run ofchain j

−

1 toﬁnish. Multipletasks are processed inthe PFat the sametime, with the following order: First, time step t

=

1 of chain j

=

1 is processed,followed by time step t

=

1 of chain j

=

2, etc. When all the ﬁrsttime steps have ﬁnished,the second time steps (t

=

2) are fed to thePF forall chains andso on.By thischangeintheorderofoperations,theavailable parallelismisexposed.Pipelining iscoarse-grain,sincenoPFstage is shared bytwo chainsatthe sametime(toavoidsigniﬁcant controloverheads);each chainwaits fortheprevious oneto ﬁnisha stagebeforeentering thesamestage.Using thistechnique (Fig. 5), anewchaincanbe fedtothedatapathevery

max

(

Latsw

,

Latps

,

Latre

)

cycles,bringingthetotallatencyforMchains(T timestepseach)toM

·

T

·

max

(

Latsw

,

Latps

,

Latre

))

.

Resourceoverheads:Inordertorunmany“virtual”PFsonthesinglePFhardwareblocksimultaneously,multiplememories areusedforallPF-accessedvariables.TheppMCMCarchitectureusesMmax Particle,Log-weight,Weight,Savedparticles, Repli-cationcountsandPartialsumsmemories (eachpartitionedinto B blocksasinpMCMC). TheExchangeblockisanadditional overheadbuttakesfewresources(seeSection7).

5.4. Performancemodels

ThelatencyoftheSample&weightstageinpMCMCis:

Latsw

=

C1

+

ceil

(

log

(

B

))

·

Latcomp

+

P_B

(12)

whereC1isthe(applicationdependent)latencyoftheparallelmodules,thesecondtermisduetothecomparatortreeand

P

B

isthelatencyforpassing P particlesthroughtheparallelmodules.ThelatencyofthePartialsumsstageis:

Latps

=

C2

+

P_B

·

Latadd

+

B

·

Latadd (13)

where C2 isthe latencyof theparallel modules,

_BP

·

Latadd is thelatency forpassing P particles throughthe datapath

(Latadd istheadderlatency)andB

·

Latadd isthelatencyofthepartialsumsloop.ThelatencyoftheResamplingstageis:

Latre

=

C3

+

P_B

+

P

·

Latrep (14)

whereC3isthelatencyoftheparallelmodules,

P_B

isthelatencyforpassingP particlesthroughthedatapathandP

·

Latrep

istheparticlereplicationlatency(Latrepisthemeanreplicationcount).OnePFruninpMCMCcosts:

Latp f_pMC MC

=

PB

+

T

· (

Latsw

+

Latps

+

Latre

)

(15)

where

P_B

isduetoparticleinitialization.

InppMCMC,processingonetimestepforallchainscosts:

Latts_ppMC MC

=

C4

+

M

·

max

(

Latsw

,

Latps

,

Latre

)

+

Latsw

+

Latps

+

Latre

−

max

(

Latsw

,

Latps

,

Latre

)

(16)

where C4 isduetominorcomputationsbeforeeachtime step.Thelatencyofprocessingall T time stepsofall M chains

is:

Latp f_ppmcmc

=

M

·

_BP

+

T

·

Latts_ppmcmc (17)

(13)

6. Casestudy:SSMsforDNAmethylationproﬁling

Inordertoevaluatetheproposedalgorithmandarchitecturesoftheprevious sections,alarge-scalesyntheticSSMcase studywhichfocusesonDNAmethylationisused.Thiscasestudyisdesignedtodemonstratethebeneﬁtsofthealgorithmic andhardwarearchitecturenoveltiesproposedinthisarticle,usingarealistic,demandingapplication.

MethylationisabiochemicalprocesswhichhappensinspeciﬁcpositionsoftheDNAandisassociatedwithanumberof diseases.Thepositionswheremethylationhappenscannotbedetecteddirectly.Methylationanalysisconsistsindiscovering these “hidden” positions by applying a process called sodium bisulfate treatment to the DNA bases, several times. The treatmentis essentiallyatest formethylation,whose outputiseitherasuccessfulora faileddetection. Nevertheless,the test can belargely unreliableandnoisy in manycasesandfurther analysisis neededtosuccessfully detectmethylation. Methylation analysiscan be applied to single-tissue ofmulti-tissue DNA. The latter caseoccurs when processing mixed tissue,e.g.whole-bloodsamples.

MethylationdatacanbesimulatedusingSSMmodels;thehiddenmethylatedstatescan berepresentedbythehidden statesoftheSSMandSSMinferencecanbeusedtoestimatethembasedonknownobservations(i.e.thebisulfatetreatment testresults).SSMshavetheadvantagethattheycanrepresentdependenciesbetweenneighboring statesusingthetransition density;thisisimportantformethylationmodeling,sincethemethylationstateofaDNAbasedependsonthestateofthe neighboring bases.

Here,twosyntheticDNAmethylationdatasetsaresimulatedusingSSMs:1) Asingle-tissuedataset,2)Amulti-tissue dataset.First,therealmethylationproblemsettingwillbepresented,followedbyadescriptionofhowthisismappedtoan SSM.Ineachcase,theDNAsequenceisassumedtohavealengthofT bases,wherebaset

∈ {

1

,

...,

T

}

hasprobability pt of beingmethylated.Thegoalistoestimatetheseprobabilitiesusingtheresultsofmethylationtestson4biologicalreplicates

k

∈ {

1

,

...,

4

}

(i.e.individualswithidenticalDNAsequences).Atbaset andforreplicatek,asodiumbisulfatetreatmenttest isappliednkt times.Outofthesetests, ykt aresuccessful. ThephysicalpositionofDNAbaset is

δ

t.The physicalposition

is neededbecause typically some gaps exist betweenbases inreal data sets. The goalof theanalysis is todiscover the probability pt thatpositiont ismethylated,forallt.Inthiscasestudy,T rangesupto16384.

TheabovesettingismappedtoanSSMasfollows:

•

The logit-probabilitythat thebaset ismethylated, i.e.logit

(

pt

)

isthe hiddenstate Xt ofthe SSM.Therefore,X1:T

=

logit

(

p1:T

)

.

•

Thenumberofsuccessfultestsatbaset oftheDNAsequence,forallfourreplicates(y1:4,t)istheobservationyt ofthe

SSM,i.e.y1:T

=

y1:4,t.Thenumberoftestsatpositiont(n1:4,t)isalsopartoftheobservations.

The transition,prior andobservationdensities oftheSSMare constructedasfollowsto simulatetherealsetting: The transitiondensity(2) forthe single-tissueDNAcase modelsthe fact that theprobabilities of methylationare dependent betweensubsequentDNAbases:

Xt

∼

Normal

(X

t−1,

τ

2

t

),

t

>

1 (18)

where

τ

2

t isthetransitionvariance.Thisisafunction ofacommonvariance

σ

12 andtheDNAphysicalposition(

δ

t):

τ

t2

=

σ

2

1

|

δ

t

−

δ

t−1

|

.Thevariance

σ

12representstheamountofdependencebetweenthemethylationstatesofsubsequentbases.

The transitiondensity inthe multi-tissue DNAcase (with two tissues)is similar butin thiscase a mixture of Gaus-sian densitiesis usedinsteadof oneGaussian. Thismixturerepresents thefact thattwo tissues withdifferentinter-base dependenceexistintheDNAsample:

Xt

∼

2c=1

wc

·

Normal

(X

t−1, ψ_tc2

)

,

t

>

1 (19)

where wc

=

0

.

5 is the weight of the mixture componentwith indexc

∈ {

1

,

2

}

and

ψ

_tc2

=

σ

2

c

|

δ

t

−

δ

t−1

|

. Here,

σ

c2 isthe

transitionvarianceofcomponentc(twovariances

σ

2

1 and

σ

22intotal).Thepriorstateequation(1)isX1

∼

Normal

(

0

,

1

)

in

boththesingle- andmulti-cases.

Fortheobservationequations(3),abinomialmodelisused.Thisissuitableformodeling numbersofsuccessesinaﬁxed numberoftests.Theequationsareasfollows:

ykt

∼

Binomial

(

nkt

,

pkt

),

k

∈ {

1

, ...

4

}

,

t

>

0 (20)

Thismeansthatthenumberofsuccessfultests ykt atpositiont forreplicatek,thenumberoftestsnkt andtheprobability

ofmethylation pkt arerelatedaccordingtoabinomiallikelihood.Theseobservationequationsapplytobothcasestudies. Theonly remainingpartofthemodelisa connectionbetweenthelogit-probabilitieslogit

(

p1:T

)

=

X1:T thatappear in

the transitiondensity(which are commonforall replicates)andthe probabilities p1:4,1:T that appear inthe observation

equations(oneprobabilityperreplicate).Thisconnectionisachievedbythefollowingequation:

logit

(

pkt

)

∼