• No results found

Sequential Monte Carlo with kernel embedded mappings: The mapping particle filter

N/A
N/A
Protected

Academic year: 2021

Share "Sequential Monte Carlo with kernel embedded mappings: The mapping particle filter"

Copied!
16
0
0

Loading.... (view fulltext now)

Full text

(1)

Contents lists available atScienceDirect

Journal

of

Computational

Physics

www.elsevier.com/locate/jcp

Sequential

Monte

Carlo

with

kernel

embedded

mappings:

The mapping

particle

filter

Manuel Pulido

a

,

b

,∗

,

Peter

Jan van Leeuwen

a

,

c

aDataAssimilationResearchCentre,DepartmentofMeteorology,UniversityofReading,UK bDepartmentofPhysics,FaCENA,UniversidadNacionaldelNordesteandCONICET,Argentina cNationalCentreforEarthObservation,DepartmentofMeteorology,UniversityofReading,UK

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory:

Received18October2018

Receivedinrevisedform15May2019 Accepted24June2019

Availableonline27June2019 Keywords:

Optimaltransport Particleflows Kernelembedding

SteinVariationalGradientDescent

In this work, a novel sequential Monte Carlo filter is introduced which aims at an efficient sampling of the state space. Particles are pushed forward from the prediction to the posterior density using a sequence of mappings that minimizes the Kullback-Leibler divergence between the posterior and the sequence of intermediate densities. The sequence ofmappings represents a gradient flow based on the principles of local optimal transport. A key ingredient of the mappings is that they are embedded in a reproducingkernel Hilbertspace,whichallowsfor apracticaland efficient MonteCarlo algorithm.Thekernelembeddingprovidesadirectmeanstocalculatethegradientofthe Kullback-Leiblerdivergenceleadingtoquickconvergenceusingwell-knowngradient-based stochasticoptimizationalgorithms.Evaluationofthemethodisconductedinthechaotic Lorenz-63 system, the Lorenz-96 system, which is a coarse prototype of atmospheric dynamics, and an epidemic model that describes cholera dynamics. No resampling is requiredinthemappingparticlefilterevenforlongrecursivesequences.Thenumberof effectiveparticlesremainsclosetothetotalnumberofparticlesinallthesequence.Hence, themappingparticlefilterdoesnotsufferfromsampleimpoverishment.

CrownCopyright©2019PublishedbyElsevierInc.Thisisanopenaccessarticleunder theCCBYlicense(http://creativecommons.org/licenses/by/4.0/).

1. Introduction

Severalapplicationsrangingfrommeteorology,oceanography,andhydrologytobiologicalsystemsneedtoestimate the stateofasystemfromanimperfectnumericalmodelandpartialnoisyobservations.Thecombinationofadynamicalmodel whichrepresentsthenonlinearevolutionofthe(hidden)statewithsparsenoisyobservationsisusuallyrepresentedthrough a hiddenMarkovmodel,also knownasastate-space model[9,5]. Theaim ofthisworkis todevelop a methodologyfor sequential Bayesianinferenceofstate-space modelscomposedofnonlinear chaoticdynamical systemswithnon-Gaussian uncertaintyinthestate-spacevariablesusingasmallnumberofsamples.

Particle filters are Monte Carlo techniques that sample the space through so-called particles–state realizations– that representthe probabilitydensityofthestate conditionedon thepartialnoisy observations,i.e.theposterior density.The challenge forparticlefiltersistorepresentrecursivelythehighprobabilityregions oftheposteriordensitywithalimited number ofparticles. Ingeneral, particle filterssufferfromfilterdegeneracy [9]. Aftera fewtime iterations, they tend to

*

Correspondingauthor.

E-mailaddress:[email protected](M. Pulido). https://doi.org/10.1016/j.jcp.2019.06.060

0021-9991/CrownCopyright©2019PublishedbyElsevierInc.ThisisanopenaccessarticleundertheCCBYlicense (http://creativecommons.org/licenses/by/4.0/).

(2)

give all theweight to a single particle.In the importance samplingframework, a proposaldistribution of thetransition densityconditionedtothecurrentobservationscanbechosen toimprovethesampling[9,2].Although,agoodchoiceofa proposaldensitymayalleviatethefilterdegeneracy,ifthefilterisappliedrecursivelyaresamplingstepisstillrequired.The applicationof resamplingproduces anotherissue,particle impoverishment,since onlya few particles withlarge weights are keptandcopied, theparticlediversity isdecreased inthe resamplingstep.Thisis animportant contentionpointfor hidden-Markovmodelsinhigh-dimensionalstatespaces.

Chorin etal., [7], proposed an implicit samplingschemein which theparticles froma Gaussian proposal densityare mapped to the highprobability regions of the posterior density through the solution of an algebraic equation for each particle.The latterrelatesthe modeof theproposaldensitywiththe modeofthe posteriordensityforeach particle.To findthemodeoftheposteriordensityforeachparticle,aminimizationisrequired.Toreducethevarianceoftheweights, Van Leeuwen, [26], proposes a mapping that enforces equal weights by scaling the deterministic movement of particles inthe optimalproposalstep.Since thehighprobability region oftheproposal densityischosen smallerthan themodel transitiondensity, thefilter can be overlyoptimistic aboutits performance [27]. Recently, a two-stepmapping hasbeen proposedtoalleviate theseissuesandremovethesebiases.However,amathematicalunderstandingoftheconvergenceof thefilterinthelimitforlargenumberofparticlesisstillmissing.Reich,[23],appliedoptimaltransportprinciplestofilters. A linear ensemble transform is proposed that minimizes the Euclidean distancewith optimal coupling in one mapping step. The degeneracy problemis still present in this method. It is, however, well suited for localization procedures for high-dimensional applications[6], inwhichan observationonlyaffectstheregion ofthestate spacethatis closetoitin Euclideandistance.

Inthiswork,weproposeanovelparticlefilterwhichisalsobasedonamappingfromthepriordensitytotheposterior densityastheimplicitsamplingfilter[7,3],theimplicitequalweightparticlefilter[26,31] andthenonparametricensemble transformmethod[23].Themaindifferenceisthatthemappingisdoneiterativelyviaagradientflow.Asampling method-ologybasedonoptimaltransportisderivedtodrivetheparticlesfromthepredictiondensitytotheposteriordensity.The transportofparticlesaimstominimizetheKullback-Leiblerdivergencethatmeasuresthedifferencesbetweenthe interme-diatedensitiesandtheposteriordensity.ThemappingisembeddedinareproducingkernelHilbertspacewhichallowsfor ananalyticalexpressionforthegradientoftheKullback-Leiblerdivergence.Theproposedsamplingmethodfortheparticle filterisbasedintheStein variationalgradient descentintroduced in[14].Theyfound aconnectionbetweenthegradient Kullback-LeiblerdivergenceandtheSteindiscrepancy.Inthiswork,wedevelopasequentialMonteCarlofilterbasedonthe variationalmapping,evaluateitinthreehiddenMarkovmodelsanddiscussitspotentialapplications.

2. Methodology

2.1. SequentialBayesianestimation

Weassumetheestimationproblemisencompassedofadynamicalmodel

M

,whichpredictsthestate

x

fromaprevious state,andan observationalmodel

H

,whichtransformsthestatefromthe(hidden) statespacetotheobservationalspace. Thesetofequationsthatdefinestheestimationproblemisknownasastate-spacemodelorahiddenMarkovmodel.These are

xk

=

M

(

xk−1

,

η

k

),

(1)

yk

=

H

(

xk

,

ν

k

),

(2)

where

x

k

R

Nx isthestateattimek,

y

k

R

Ny aretheobservations,

η

k

p

(

η

)

istherandommodelerror,and

ν

k

p

(

ν

)

istheobservationalerror.ThemethoddevelopedhereisgeneralanddoesnotrelyontheadditiveorGaussianassumption inmodelnorobservationalerrors.

Observations

y

k areassumedtobemeasuredatdiscretetimes.Someaprioriknowledgeofthestateatk

=

0 isassumed, theinitialpriorstatedensityp

(

x0

)

.ThesequentialBayesianstateinferenceisgivenintwostages:

1. Firstly,intheevolutionstagethepredictiondensityisdeterminedas p(xk

|

y1:k−1

)

=

p(xk

|

xk−1

)p(

xk−1

|

y1:k−1

)

dxk−1

,

(3) wherewedenote

{

y1

,

· · ·

,

yk−1

}

by

y

1:k−1.Atk

=

1,

y

1:0

=

sothatthepredictiondensityisp

(

x1

)

=

p

(

x1

|

x0

)

p

(

x0

)

dx0. 2. Secondly,intheassimilationstageBayesruleisusedtoexpresstheinferenceasasequentialprocess

p(xk

|

y1:k

)

=

p(yk

|

xk

)p(

xk

|

y1:k−1

)

p(yk

|

y1:k−1

)

,

(4)

wherep

(

yk

|

xk

)

istheobservationlikelihoodandp

(

yk

|

y1:k−1

)

=

(3)

Note that the marginalized posterior densityfor the current state is inferred in Eq. (4) instead ofthe posterior density forthewholetrajectory

x

0:K.Thisinference ofthecurrentstate isusefulforsystemswhichroutinelyreceiveobservations andthereforethe estimationisproducedonly atthecurrentstate withoutkeepinginformationofthewholepath ofthe particles.ThisapproachisfollowedbyensembleKalmanfilters,particleflowfilters(e.g.[8])andtheimplicitequal-weight particlefilter[31,28].

2.2. Particleflowsandoptimaltransport

Afilterneedstoupdateitspriorknowledge,providedinthepredictiondensity,withtheobservationlikelihood,Eq.(4). The concept of homotopy can be used tosmoothly transformthe prior densityq

(

x

)

towards the posterior density p

(

x

)

. Continuous deformationsthrough a parametercan be usedto achieve this, such asT

(

x

,

λ)

:

R

Nx

× [

0

,

1

]

R

, T

(

x

,

λ)

=

p

(

x

)

λq

(

x

)

1−λ. (To simplify the notation we haveleft the conditioning to observations implicitin this andthe following subsection.)Theparameter

λ

isinterpretedasapseudo-timewhichisvariedfrom0to1atafixedrealtime.Thisconcept wasusedin[8] forparticlefiltersandisdirectlyrelatedtotempering[16].Thedensitiesarerepresentedthroughparticles, sothatthedeformationsareinterpretedasparticlesmovinginaflowaccordingtoasetofordinarydifferentialequations,

dxλ

=

vλ

(

xλ

),

(5)

where

v

λisthedriftorvelocityfield.Wefocusondeterministicflowssothatdiffusivityprocessesarenotconsidered.The evolutionofthedensityunderthisflowisthengivenbytheLiouvilleequation,

tqλ

+ ∇ ·

(

vλqλ

).

(6)

Insome recentworks, theparticlesare pushedsmoothlyforwardinpseudo-times usingaGaussian approximationto rep-resent the velocity field(e.g. [4,13]).Ontheother hand,in thecurrentwork we avoidtheuse ofthat approximationby obtainingthegradientflowfromoptimallocaltransport.

Theoptimaltransportisanotherpromisingapproachforthesamplingofcomplexposteriordistributions[17].Giventhe prior probabilitymassdistributionq

(

x

)

andthetargetprobability massdistribution p

(

z

)

,wewanttotransportthe proba-bilitymassfromq to pusinga mappingT

:

x

z.The optimaltransportproblemseeksthetransformation T thatgives the minimalcostto transportthedistributionofmassfromq to p.Thisrepresentstheclassical Mongeoptimaltransport problem. Thereisa rigorousproof oftheexistenceofsuchoptimalmappingundermildconditions[18]. Furthermore,[1] haveformulatedtheMonge-Kantorovichproblemasasteepestdescentgradientoptimization.

Ourapproachcombinestheideasofoptimaltransportandparticleflows.Wetakethelocalapproachtooptimaltransport [29] inwhicha gradientflowneeds tobe obtained.Through asequence ofmappingsgivenbythe flowwe seekto push forward the particles from the prior to the target density. The mappings in the sequence are required to be assmooth aspossible.Theparticles behaveasactive Lagrangiantracersinaflow. Thevelocity fieldateachpseudo-time stepofthe mappingsequenceischosenfollowingalocaloptimaltransportcriterion.

2.3. Variationalmappings

Ateachiterationofthesequence,weproposealocaltransformationT thatfollowsEq.(5),

xλ+

=

T(xλ

)

=

xλ

+

v

(

xλ

),

(7)

where

=

δλ

isassumedtobesmalland

v

(

xλ

)

isanarbitraryvectorialfunction“sufficiently”smooth,whichrepresentsthe velocityoftheflowdefinedinEq.(5).

The Kullback-Leiblerdivergenceisusedasa measureofthedifference betweenthe intermediatedensityq

(

x

)

andthe posteriordensityp

(

x

)

.TheKullback-Leiblerdivergencebetweenqandpafterthetransformation,T,intermsoftheinverse mappingis

D

K L

(T

)

=

qX

(

x

)

log

qX

(

x

)

pT−1

(

x

)

dx

,

(8)

wherepT−1

(

x

)

=

p

(

T

(

x

))

det

xT

(

x

)

anddet

xT

(

x

)

isthedeterminantJacobianofthemappingandforbrevity

x

=

xλ(time indexisleftimplicit).

Ourgoalistofindthevelocityfield

v

(

x

)

ofthetransformationT thatgivestheminimumof

D

K L

(

T

)

foreachmapping ofthesequence.Inotherwords,weneedtofindthedirection

v

(

x

)

,thatgivesthesteepestdescentof

D

K L.

WeusetheGateauxderivativeofafunctional,whichisageneralizationofthedirectionalderivative.Giventhefunctional F,itisdefinedas

DhF

=

lim

→0

F

(

x

+

h

(

x

))

F

(

x

)

(4)

TheGateauxderivativeoftheKullback-Leiblerdivergence,Eq.(8),inthedirection

v

(

x

)

isgivenby Dv

D

K L

= −

q(x

)

dlogpT−1

(

x

)

=

0 dx

.

(10)

Thederivativeofthetransformedlog-posteriordensityis

dlogpT−1

(

x

)

=

0

= ∇

Tlogp(T(x

))

dT

+

Tr

(

xT− 1d

xT)

=0

.

(11) Consideringthat T

(

x

)

=

x

+

v

(

x

)

inEq.(11) andreplacinginEq.(10),thedirectionalGateauxderivativeof

D

K L results in Dv

D

K L

= −

q(x

)

xlogp(x

)

v

(

x

)

+

Tr

(

xv

)

dx

.

(12)

Equation (12) gives the

D

K L derivative along v. However, we require the negative gradient of

D

K L in terms of the samples of q forthe optimization of

D

K L as a function of T. In general, the flow vbelongs to an infinitedimensional Hilbertspace.Hence,thefulloptimizationproblemisstill intractableinpracticesincewedonothaveawaytodetermine the

Dv

D

K Lthatgivesthesteepestdescentdirection.

Onewaytolimitourfunctionaloptimizationproblemsothatitbecomestractableischoosingasspaceoffunctionsthe unitballofareproducingkernelHilbertspace(RKHS),whichwedenoteas

F

.Thiswas proposedby[14].Inthisway,we constrain

v

to

F

andrequirethat

v

F

1 tofindthegradientof

D

K L.Theoptimizationproblemisthentofindthe

v

F

thatgivesthedirectionofsteepestdescentof

D

K L.ThemainpropertiesoftheRKHSarediscussedin[24].

ThefunctionstoberepresentedintheRKHSarevector-valued,i.e.

v

(

x

)

R

Nx.Thekernelisthusassumedtobediagonal andisotropic,

K

=

KINx,where

I

Nx istheidentitymatrixandK isascalarkernel.Thereproducingpropertystatesthatany functionfrom

F

canbeexpressedasthedotproductbythekernelthatdefinestheRKHS,

v

(

x

)

=

K

(

·

,

x

),

v

(

·

)

F

.

(13)

Equation(13) isknownasthereproducingproperty.ThescalarkernelK defines

F

1.

Replacing the vector field, v

(

x

)

, inthe two terms inEq. (12) by the expression givenin Eq.(13) and then usingdot productproperties,wefindthat

Dv

D

K L

= −

q(x

)

[K

(

x

,

·

)

xlogp(x

)

+ ∇

xK(x

,

·

)

] dx

,

v

(

·

)

F1

.

(14) This is validfor any vsuch that

v

F

1. Therefore, the first term ofthe dot product inEq. (14) is by definitionthe gradientof

D

K L atT=0

=

x,Dv

D

K L

= ∇

D

K L

,

v

F1 sothat

D

K L

(

x

)

= −

E

xq K(x

,

x

)

xlogp(x

)

+ ∇

xK

(

x

,

x

)

,

(15)

where

E

representstheexpectationoperator

E

xp

[

f

(

x

)

]

p

(

x

)

f

(

x

)

dx.(NotethatbyDv

D

K L

(

x

)

wedenotethedifferences between

D

K L witha mapping T atxinadirection

v

fromthe Kullback-Leiblerdivergencewithoutthemapping,i.e.the Gateauxderivate, while

D

K L

(

x

)

denotesthedirectioninwhich

D

K L increasesthemostwhenmappingsoftheformEq. (7) areconductedat

x

.)

This expression for the gradient, Eq. (15), is particularlysuitable for Monte-Carlo integration when q is only known througha setofsample points. Sincewe seektominimize theKullback-Leiblerdivergencewe chooseasdirectionofthe transformationT thenegativeofitsgradient,thesteepestdescentdirection,

xλ+

=

x

D

K L

(

x

).

(16)

UsingthekernelreproducingpropertyandintegrationbypartsinEq.(15),theresultinggradientflowofthemappingis

vλ

(

x

)

= −∇

D

K L

(

x

)

=

(

x

)

logp(x

)

− ∇

(

x

).

(17)

As shownin [25], thegradient flow, Eq.(17), has asstationarysolution q

(

x

)

=

p

(

x

)

sothat the gradient flowconverges towardthetargetdensity.

In[19],theKullback-Leiblerdivergenceisalsousedbutwithanextraregularizationtermtofindasingleglobaloptimal map.Here,weapply agradientflowviaasequenceoflocalmappings.Underweaksmoothnessconstrains [29],the mini-mumofthecost functionasafunctionofthemappings,Eq.(7), isuniquelydeterminedinEq.(16), i.e.noregularization termisrequired.

(5)

2.4. Mappingparticlefilter

Consider wehaveasetofequal-weightedparticles

x

(1:Np)

k−1 whichsampletheposteriordensityattimek

1.Thetarget densityattimek whichweaimtosampleusingthevariationalmappingistheposteriordensity p

(

xk

|

y1:k

)

.Themapping approachonlyrequiresasetofsamplesofthepriordensity.Itisstartedbythesetofunweightedparticlesthatareevolved fromthepreviousestimatetothepresentassimilationtimebythedynamicalmodel,i.e.

xk,0(j)

=

M

(

x(kj)1

,

η

(j)k

)

Np j=1,where the second subscript represents the mapping iteration. These sample points of the prediction density are then pushed towardsthesequentialposteriordensitybythemappingiterations.

Giventhesetofparticles

x

(1:Np)

k,i−1 thataresamplesoftheintermediatedensityatmappingiterationi

1,thegradientof theKullback-LeiblerdivergencefromEq.(15) atastatespaceposition

x

byMonte-Carlointegrationis

D

K L

(

x

)

= −

1 Np Np

l=1

K(xk,i(l)1

,

x

)

logp(xk,i(l)1

)

+ ∇

xK

(

x(l)k,i−1

,

x

)

.

(18)

Thisexpressionisequivalenttotheoneobtainedin[14] foratimeindependentinference.Thenegativeofthisgradient representsthevelocityfieldofthegradientflow.

ThefirsttermofthegradientoftheKullback-Leiblerdivergence,Eq.(18),tendstodrivetheparticlestowardsthepeaks oftheposteriordensity.Itgivesaweightedaverageof

logpincludingthecurrentparticleandsurroundingoneswithin thekernelscaleofthecurrentone.Thisisthetypicalbehaviorofavariationalimportancesampler sothatitaccumulates particles atthehighprobabilityregions oftheposteriordensity.Ontheotherhand,thesecondtermof

D

K L inEq.(18) tends toseparatetheparticles. Ifthecurrentparticle

x

(j)k isintheinfluenceregionofanotherparticlesayx(l)k ,i.e.within thekernelscale,theterm

x(l)

k,i−1

K

(

x(l)k,i1

,

x(j)k,i1

)

willtendtoseparatethemactingasarepulsiveforcebetweenparticles. Atthemappingiterationi,theparticle j istransformed,accordingtothemappingEq.(16),by

x(k,ij)

=

x(k,ij)1

D

K L

(

xk,i(j)1

),

(19)

where

−∇

D

K L

(

xk,i(j)1

)

isthesteepestdescentdirectionattheparticleposition.Eq.(19) maybeinterpretedasthemovement oftheparticlesalongthestreamlinesoftheflowassumingsmall

.

An ingredient needed to evaluate Eq. (18) is an expression for the gradient of the posterior density. A problem in sequential Bayesian inference is that there is no exact expression forthe posterior density.We do knowthe likelihood function,butweonlyhaveasetofparticlesthatrepresentthepriordensity,notthedensityitself.Thepriordensityusing theparticlerepresentationoftheposteriordensityattimek

1 inEq.(3) resultsin

p(xk

|

y1:k−1

)

1 Np Np

j p(xk

|

xk(j)1

),

(20)

andtheexpressionfortheposteriordensityfromEq.(4) becomes

p(xk

|

y1:k

)

1 Np p(yk

|

xk

)

Np

j p(xk

|

xk(j)1

).

(21)

Becauseofthefiniteensemblesizeattimek

1,thisexpressionisanapproximationoftheposteriordensity.Equation (21) isthetargetdensityinthemappingsequence.

Ateach mappingiteration,all theparticles aremovedfollowingEq.(19).Successiveapplicationsofthetransformation, through gradientdescent,withthecorrespondingupdatesof

D

K L,willconvergetowards theminimumofthe Kullback-Leiblerdivergence.Thepseudo-codeoftheimplementedalgorithmwhichcombinesvariationalmappingwiththesequential particlefilterisshowninAlgorithm1.

The forminwhichthevariationalmappingisobtained,assteepestgradientdescent,issuitableforthestochastic opti-mizationalgorithmsusedinthemachinelearningcommunity.Inthiscase,

,knownasthelearningrate,canbedetermined adaptivelyusing

D

K Lfrompreviousiterations(e.g.[30]).

Animportantpartofanyiterativevariationalmethodisastoppingcriterion.OnepossibilityinAlgorithm1istocheck thevalue of

|∇

D

K L

|

foreachparticle,oraveraged overallparticles.Anotheroptionisto useimportance samplingandto interpretthefinaldensityofthemappingsequenceasaproposaldensityandcalculateweightswithrespecttotheposterior density,whichwillthen automaticallyleadtoan unbiasedestimator.Tousethis,an expressionforthefinalintermediate densityq isrequired,however,wehaveonlyits particlerepresentation.Toimplementthisinapracticalway,thedensity transformations can be used,that we dohaveexplicitly, andthefinal densitycan be relatedtothe prior density.Inthe Appendix,theimplementationofimportancesamplingwithinthevariationalmappingparticlefilterisdescribed.
(6)

Algorithm 1 Mappingparticlefilteralgorithm. Input:Givenx(1:Np) k−1 ,yk,M(·),p(y|x)andp(η) forj=1,Npdo xk,(j)0M(x(kj)1,ηk) Forecaststage end for

repeat Mappingiterations

forj=1,Npdo

xk,i(j)x(j)k,i1DK L(x(j)k,i−1) ∇DK LfromEq.(18) end for

i←i+1

untilStoppingcriterionmet Output: x(1:Np)

k,i

Toavoid the resamplingin the importance samplingstep, one can calculatethe effectiveensemble size as Ne f f (see Appendix),anddecidetocontinuetheiterationsinAlgorithm1until Ne f f reachesacertainthresholdNt.Inthefollowing assimilation cycle, the weights of the particles representing the final intermediate density has to be considered in the sequentialposteriordensity,resultinginaweighted posteriordensityinsteadofEq.(21).Inthisway,themappingparticle filterwouldreadily extendto includeimportance sampling. One shouldkeep inmind, however,that ourestimate ofthe posteriordensityisrather poorin high-dimensionalsystems witha smallnumberofparticles,so theweights wouldnot bevery accurate.Thus,ourstandard implementationusesthemagnitudeofthegradientasthestoppingcriteria,without calculatingtheweights.

3. Numerical experiments

Thesequentialimplementationofthemappingparticlefilterisevaluatedusingthreechaoticnonlineardynamicalmodels withstate spaces of up to 40dimensions(see Section 3.1). As usual inthese experiments,the observational andmodel errors areassumedGaussian. A derivationofthe hiddenMarkovmodel withGaussian errorsis giveninSection 3.2.We conductedasetofstochastictwinexperiments,inwhichthedynamicalmodelusedtoproducethesyntheticobservations isthesameastheone usedinthestate-spacemodel,includingthestatisticalparameters. Detailsoftheexperimentsare giveninSection3.3.

3.1. Descriptionofthesystems TheLorenz-63systemisgivenby

dx dt

=

σ

(y

x), dy dt

=

x(

ρ

z)

y, (22) dz dt

=

xy

βz.

Theequationsaresolvedusingafourth-orderRunge-Kuttascheme.Theparametersarechosen attheirstandardvalues

σ

=

10,

ρ

=

28 and

β

=

8

/

3.The integrationtime stepis0

.

001 and theassimilationcyclesareevery 0

.

01 (10integration timesteps).Theobservationoperator

H

istakenastheidentitymatrix,forthefull-observedstateexperimentsinSection4. Themodelerrorcovariance

Q

ischosendiagonal.Itsdiagonalelementsaresetto30%oftheclimatologicalvarianceofthe variable,e.g. Q11

=

0

.

3

σ

x2,where

σ

x2 isthetime-seriesvarianceofvariablex.Theobservationerrorcovarianceis

R

=

0

.

5I.

Thestochasticdynamicalmodelforcholeraisa5-variablesystemallowingforsusceptible,infected,andthreeclassesof recoveryindividualsinwhichcholeramortalityistheonlyobservedvariable.Fromtheinferencepointofview,itpresents somechallenges[10].Ithaspartialnoisyobservationswhosevariancedependsontime.Transmissionisassumedstochastic andhasamultiplicativemodelerror.Thecompartmentdynamicalmodelforcholerausedinthisworkistheonethroughly described in [10]. A shortdescription is givenhere. The individuals are classifiedassusceptible (S), infectedindividuals (I),whilerecoveredindividualsbelong tothreeclasses(R1:3) toallowfordifferentimmune periods.Theequationsforthe numberofindividualsforeachcategoryaregivenby

dSt

=

dNtB S

dNtN I

dNtS D

+

dNR kS t dIt

=

dNtS I

dNI R 1 t

dNtI C

+

dNtI D dRt1

=

dNtI R1

dNtR1R2

dNtR1D (23) dR2t

=

dNtR1R2

dNtR2S

dNtR2D dR3t

=

dNtR2R3

dNtR3S

dNtR3D
(7)

Transmissionisgivenbyastochasticdifferentialequation

dNtS I

=

λ

tStdt

+

ItSt

/P

tdWt

,

(24) wheredWt isaGaussianwhitenoise.Thetransitionsbetweencategoriesaregivenby

dNtI R1

=

γ

Itdt, dNR l−1Rl t

=

rkRlt−1dt, dNtRlS

=

rkRtkdt, dNtS D

=

mStdt, dNtI D

=

mItdt, dNR lD t

=

mRltdt

,

(25) dNtI C

=

mcItdt, dNtB S

=

d Pt

+

m Ptdt,

wherel

=

2

,

3, B representsbirth,C ischolera mortalityand D denotesdeathfromothercauses.Equations(23) are inte-gratedusinganEuler-Maruyamaschemewithtimestepof1

/

20 month.Theobservationsarecholeramortalitydatawhich aregivenby yt

N

NtI C

NtI C1

,

τ

2

(N

tI C

NtI C1

)

2

.

(26)

Whilethemappingparticlefiltercanhandlenon-additivemodelerrors,wetransformthesystemofequationsaboveto allow foran additive model errorimplementation.We usean augmented state spacedefining a newstate variable T as dTt

=

dWt,sothattheaugmentedsystemhasadditiveGaussianmodelerrorwhosegradientofthelog-posteriordensityis represented byEq.(30). Notethat anon-Gaussian densityforadditive modelerrorwouldgive morerealisticfeatures. To adddiversitytotheparticlesinthefilterasmalladditivenoisetermisaddedinalltheequations[15].Thevarianceofthis additivenoiseissetto10%ofthedWt variance.Thisnoiseisnotusedtogeneratethetruetrajectory,itisonlyincludedin theparticlefilter.

Forthe40-variableLorenz-96systemexperiments,thesetofequationsis

dxi

dt

=

(x

i+1

xi−2

)x

i−1

xi

+

F

,

(27)

where i

=

1

,

· · ·

,

40 and cyclicboundary conditions are imposed, x0

=

x40, x−1

=

x39 and x41

=

x1.The equationswere integratedusingafourth-orderRunge-Kuttascheme,withanintegrationstepof0.001.Theforcingis F

=

8 whichresultsin chaoticdynamics.Theobservationaltimeresolutionis0

.

05.Theobservationalandmodelerrorcovariancesare

R

=

0

.

5Iand

Q

=

0

.

3I,respectively.Thisamplitudeofthemodelerrornoiseisrepresentativeforthetwo-scale Lorenz96systemwhich was estimated to be 0.3 using information measures by [20] and 0.3-0.5 using an Expectation-Maximization algorithm coupledtoanETKF[21].Theinitialensembleparticlesarestatestakenrandomlyfromaclimatology.

3.2. HiddenMarkovmodelswithGaussianerrors

Next, we derive the expressionof the posteriordensity inthe sequential framework asa function ofthe particles at k

1.Forthenumericalexperiments,modelandobservationalerrorsareassumedGaussianinthethreedynamicalsystems used inthiswork (Section3.1). Theseare takenasan example,buttheframework is general.In particular,it issuitable fornon-additiveandnon-Gaussianerrors.Thepredictionprobabilitydensitycanbeobtainedanalyticallyundertheadditive Gaussianmodelerrorassumption,

η

k

N

(

0

,

Qk

)

inEq.(1).TheevolutionoftheparticlesfromEq.(3) gives

p(xk

|

y1:k−1

)

Np

m=1 exp

1 2

xk

M

(

x (m) k−1

)

2 Qk

,

(28) where

xk

M

(

x(m)k1

)

2 Qk

(

xk

M

(

x (m) k−1

))

Q− 1 k

(

xk

M

(

x (m)

k−1

))

.Weassumeequal-weightedparticleshere;itis straight-forwardtoassumeweightedparticlesattimek

1 ifsodesired.

Theobservational errorisalsoassumedadditiveGaussian,

ν

k

N

(

0

,

Rk

)

inEq.(2),sothattheobservationallikelihood atk is p(yk

|

xk

)

exp

1 2

yk

H

(

xk

)

2 Rk

.

(29)

The Bayes rule, Eq.(4), is used toobtain the sequential posterior densitycombining theprediction density, Eq.(28), and theobservationlikelihood, Eq.(29). The sequentialposterior densitygiventhe particles attheprevious time step is proportionalto p(xk

|

y1:k

)

Np

m=1 exp

1 2

xk

M

(

x (m) k−1

)

2 Qk

·

exp

1 2

yk

H

(

xk

)

2 Rk

.

(30)
(8)

Thisisthetargetdensitythattheparticles

x

k(1:N)asMonteCarlosamplesshouldberepresentinginthefilter. Thegradientofthelogarithmofthesequentialposteriordensity,Eq.(30),at

x

(l)k,i1 isgivenby

xlogp(xk,i(l)1

)

=

Cp−1

(

x(l)k,i1

)

Np

m=1

ψ

l,mQ

ψ

lR

HRk1

yk

H

(

x(l)k,i1

)

Qk−1

x(l)k,i1

M

(

x(m)k1

)

(31) with

ψ

l,mQ

=

exp

1 2

x (l) k,i−1

M

(

x (m) k−1

)

2 Qk

,

ψ

lR

=

exp

1 2

yk

H

(

x l k

)

2 Rk

and p

(

x(l)k,i1

)

=

C

mN=p1

ψ

l,mQ

ψ

lR.Reducingwe obtain

xlogp(xk,i(l)1

)

=

HRk1

yk

H

(

xk,i(l)−1

)

Qk−1

xk,i(l)1

Np m=1

ψ

Q l,m

M

(

x (m) k−1

)

Np m=1

ψ

Q l,m

.

(32)

This gradient of the log-posterior, Eq.(32), does not depend on the unknown normalization constant of the sequential posteriordensity.Hence,thenumericalevaluationofthegradientoftheKullback-Leiblerdivergenceisfeasible.

3.3. Experimentdetails

AGaussiankernelischosenfortheexperiments, K(x

,

x

)

=

exp

1

2

(

x

x

)

A−1

(

x

x

)

.

(33)

The kernelcovariance

A

istakenproportional to themodelerrorcovariance matrix,Q,i.e.A

=

α

Q.Thisappears tobe a convenientchoicesincethemodeluncertainty,

Q

,alreadyincludesthephysicssothatitisexpectedtorepresentthescaling betweenthevariablesandalsothecorrelationsbetweenvariables.Underthischoicetheonlyparameterofthekernelthat requiresto be definedis

α

.Withgrowing dimension ofthestate vector andthe numberofparticles small, thedistance betweentheparticlesgrowslarger.Toaccountforthis,

α

shouldbechosenlargertoincreasethedistancebetweenparticles. Intheexperiments,wehavefoundthat

α

shouldbechosenoftheorderofthedimensionofthestatevector.

3.4. Thestochasticoptimizationmethodsandtheirconvergence

AnoptimizationoftheKullback-Leiblerdivergenceisconductedinthemappingparticlefiltertopushthepriorparticles tosamplesfromtheposteriordensity.Theparticles aredriven inapurelydeterministic transformationbythevariational mappingfromthepriordensitytotheposterior.Theintermediatedensitiesarerepresentedthroughparticles.Eachparticle isthenmovedalongthedirectionofthesteepestdescentoftheKullback-Leiblerdivergence,whichconsidersalltheparticle positions.

Severalstochastic gradient-basedoptimizationmethodshavebeenrecentlydevelopedinthemachinelearning commu-nity[30,11] whicharedirectlyapplicabletothemappingparticlefilter.Theyarefocusedonefficientfirst-orderconvergence inhigh-dimensionalcontrolspaceswhenthecostfunctionisnoisy.Thesourceofthisnoiseissubsampling.These optimiza-tionmethods,andtheirsuccessful convergencerates,havebeeninstrumentalforthesuccessofdeeplearningapplications (e.g.[12]).Thevariationalmappinginthisworkisbasedonstochasticgradient-basedoptimizationmethods.

The gradientdescent methodhasa fixedlearning rateforall theiterations andall thecontrol variables.This is well-knowntocausesomeinconveniences,inparticularalongextendedvalleyswherezig-zagconvergencearises.Asthegradient decreases in some directions, the convergence along those variables is very slow. The adaptative schemes search for a dynamic learningrateforeachcontrol variable.Ideally,aNewton methodwouldbetheoptimallearning rateifthe opti-mizationproblemisquadratic,butitrequiresHessianevaluationswhicharecomputationallyprohibitiveinhigh-dimensional spaces.Thelearningrateforeachcontrolvariableinthestochasticoptimizationmethod,adadelta[30],isbasedonaglobal learningrateoverarunningaverageofthepastabsolutedirectionalderivativesalongthedirectionofthecontrolvariable. Therefore,the learning rateincreasesfordirectionswithsmalldirectional derivatives.Themethod improvessubstantially butstillsomestagnationafterthefirstiterationshasbeenreported.Toovercomethisweakness,thealgorithmforstochastic optimizationadam[11] estimatesthefirstandsecondmomentsofthegradientswithbiascorrection.Thesetwostochastic optimizationalgorithms,adadeltaandadam, togetherwithadescentgradientmethodwereevaluated intheexperiments. Thechosen initiallearning rateoftheoptimizationmethodsisatradeoffbetweenconvergencespeedandconservingthe smoothnessoftheflow.Inthissense,consideringthateachparticleisfollowingastreamline,theparticlesshouldnotcross thestreamlinesofotherparticles.Thevaluesrecommendedin[30] exhibitedagoodperformance,sothatnomanualtuning ofthelearningratewasrequired.

These first-order adadelta and adam optimization methods were chosen because the cost function was expected in principletobenoisy.ThesemethodsonlyaccountforthediagonaloftheHessianusinganadaptativestepineachvariable. Forhigher-dimensionaloptimizationproblems,theinformationofoff-diagonalelementsoftheHessianmaybeessentialfor relativelysmoothcostfunctions.Inthatcase,quasi-Newtonmethodsmaybeafeasibleoptionsincetherequirednumberof

(9)

Fig. 1.Theroot-mean-square ofthe DK L gradientasafunctionofthe mappingiterationforthedescentgradient(blueline),adadelta(redline)and adam(greenline)optimizationalgorithms(upperpanel).ThiscorrespondstothefirstassimilationcycleoftheLorenz-63experimentinwhichthethree optimizationalgorithmssharethesamepriordensity.ThenumberofeffectiveparticlesasafunctionofthemappingiterationforNp=100 (lowerpanel). (Forinterpretationofthecolorsinthefigure(s),thereaderisreferredtothewebversionofthisarticle.)

Fig. 2.As Fig.1but for the 10-th time iteration, so that the prior density for the optimization algorithms is not the same.

particles inthefilterisexpectedtobe small(

<

200).Intermsoftheconvergencerate, notethattheoptimizationstepin themappingisconstrained.Itshouldberelativelysmalltoavoidthatparticlepathsintersect.

The convergence ofthe variationalmapping isshownin Fig.1 asa function ofthe mappingiteration atthe first as-similation cycle.Atthiscycle,thethreeoptimizationmethods,descentgradient,adadeltaandadam, havethesameinitial density.UpperpanelsofFig.1showtheroot-mean-squareoftheKullback-Leiblergradient,andlowerpanelsthenumberof effectiveparticles.Theadadeltaconvergesfastest,closetoadam.Foralargernumberofiterations,thegradientof Kullback-Leiblerdivergenceisstilldiminishingforadambutitshowsalmostnochangeafter25iterationsforadadelta.However,the numberofeffectivesamplesdoesnotdifferbetweenthem,40iterationsarerequiredtoachieve themaximum–98%ofthe totalnumberofparticles.Thedescentgradientrequiresabout200iterationstoachievethisnumberofeffectiveparticles.

Fig. 2 showstheconvergence atthe 10th assimilation cycleofthe optimizationmethods for evaluating therecursive effects. Notably, adadelta has the fastest convergence with the minimum root-mean-square gradient similar to the one obtained in the first iteration. On the other hand, adam converges slowly in the first mapping iterations and starts to increasetheconvergencerateafter20iterations(aneffectthatislikelycausedbythemomentumequationsinadamwhich requiresome iterations to“warmup”).Adadelta hasreachedthemaximumnumberofeffectiveparticlesin50iterations. Becauseofthefastconvergenceandbecausetheeffectiveparticlenumbersisthemostrelevantparameterfortheparticle filterwetookadadeltawith50iterationsandaninitiallearningrateof0.03asthedefaultsettingfortheexperiments.

There isaclearcorrelation betweenthe decreaseoftheroot-mean-squareofthe Kullback-Leiblergradientandthe in-creaseinthenumberofeffectiveparticlesinFig.2forthethreestochasticoptimizationmethods.Thisimpliesasmentioned thatbothmeasurescouldbeusedtodeterminetheconvergenceintheoptimizationandtherequirednumberofiterations foragiventhreshold.Alongthisline,wenotethattheKullback-Leiblerdivergencecanbedirectlyexpressedintermsofthe weights.

(10)

Fig. 3.TheGaussianpriordensity(blueline)andthebimodalposteriordensity(redline),bluedotsarethesamplesfromthepriordensity(forvisibility hasbeenlocatedatp=0.05)andreddotsshowtheresultofapplyingthemappingparticlefilterinasingleobservationcycle(leftpanel).Thepathofthe particlesasafunctionofpseudotimeλ(rightpanel),atλ=0 theyarethesamplesofthepriordensityandatλ=1 istheresultofthefilterwhenthe convergencecriteriaaresatisfied.

Fig. 4.MarginalizedpriordensityforeachvariableofLorenz-63systemdeterminedatthe115-thtimeiteration,k=115 (blueline),thefinaldensityafter thevariationalmappings(redline)andthemarginalizedsequentialposterior“target”distribution(black).

UsingMonteCarlointegrationoftheKullback-Leiblerdivergence,

D

K L

(q

p)

=

1 Np Np

j=1 log

q(x(kj)

)

p(x(kj)

|

y1:k

)

,

(34)

fromEq.(A.1),wefind

D

K L

(q

p)

= −

1 Np Np

j=1 log

wk(j)Np

.

(35)

As expected ifthe weigths are equally distributedin the particles, the Kullback-Leibler divergenceis zero, then q

(

xk

)

p

(

xk

|

y1:k

)

.Inother words,asthe numberofeffectiveparticles Ne f f tends to Np,theKullback-Leiblerdivergencetendsto zero,itsminimum.

4. Results

A firstsimple experiment illustrates the potential ofthe method ina non-Gaussian inverse problem. Suppose a one-dimensionalspacewithGaussianapriori andGaussianobservationalerror.Theobservationaloperatoristhemoduleofthe state,i.e.

H

(

x

)

= |

x

|

.Inthiscase, theresultingposteriordensityisabimodaldistribution.We applythemappingparticle filterto assimilate a single observation. Theexperiment could representan instrumentthat measures thespeed andthe statevariableisthevelocity. Fig.3ashowsthepriorandposteriordistribution.Blue dotscorrespondstotheprior density sampleswhile reddots representthedots afterapplicationof themappingparticle filter.Fig.3bshowsthepaths ofthe particlesasa functionofpseudo-time

λ

.Theparticles followingthepaths ofminimumtransportcost flownicelytoward thebimodaldistribution.

Fig. 4shows the marginalized prior densityin the experimentwith theLorenz-63 systemas a function ofthe three statevariablesatthe115-thtimeassimilationcycle,k

=

115 (eachpanelexhibitsavariable).Forplotting,themarginalized densityisdeterminedwithkerneldensityestimation.The cyclewas chosen sothatthe differencesbetweenthemapping particlefilter(MPF)andthesamplingimportanceresampling(SIR) filterareemphasized (i.e.acycleforwhichthea pri-ori density isnot agood representationoftheposterior density).Theexperimentuses 100particles andthefull stateis
(11)

Fig. 5.Marginalizedposteriordensity(contours)andtheparticlessamplingthepriordensityatthe115-thtimeiterationforthethreesectionsatthemode location(upperpanels).Theparticlessamplingthefinaldensityafterthevariationalmappingwith50iterations(lowerpanels).

observed. Becauseparticles arespreadin andout the highobservationlikelihood region,theproposaldensityis broader, asymmetric compared withthemarginalized sequentialposteriordensity, Eq.(30). Afterthevariationalmapping,the re-sultingdensityisacloserepresentationofthesequentialposteriordensity,thetargetdensityforthemapping,inthethree variables(seethethreepanelsinFig.4).

The distributionoftheparticlescanbeseen inFig. 5.UpperpanelsofFig.5showtheparticles belongingtotheprior density for each variable, the initial densityfor the mapping. Contours show the posterior density, marginalized to the 2-dimensional plane.Lowerpanelsshow thefinal locationsofthesamplesafter50mapping iterations.The mappinghas pushedmostparticlestowardsthehighprobabilityregionoftheposteriordensity.Notethatthealgorithmisnotcollapsing theparticlestowardthemodeoftheposterior,butrepresentingthedensitywiththesamplesasawhole.Thediversityof theparticlesproducedbythemappingisnotable.

ThesameexperimentconductedwiththestandardSIR,orbootstrapfilterusingresamplingwhenNe f f

<

Np

/

2 isshown in Fig. 6.The sampling deficienciesin the SIR filterare visible in Fig.6 where the particles ofthe prior density (upper panels) andthe resampleddensity(lower panels)are shownwithdots. Because theresamplingconserves andreplicates statisticallyonlytheparticleswithhighlikelihood,thereisanevidentsampleimpoverishmenteveninthislow-dimensional experiment.

Because ofthe goodsampling, theMPF exhibitsavery weak sensitivitytothe numberofparticles inthe root-mean-square error(RMSE) metrics.Fig.7showsthe RMSEoftheanalysismeanwithrespecttothetruestate, forMPF andSIR filtersfor100, 20and5particles (Fig.7a,bandcrespectively).Evenforalargenumberofparticles(withrespect tothe statedimension),theMPFoutperformsthemeanestimationwithrespecttotheSIRfilter.SIRfilterdivergesfor5particles. Fig. 7a, bandc showsthat the time-meanRMSE forthe MPFpractically doesnot change between100particles witha time-meanRMSEof0.482to5particleswhosetime-meanRMSEis0.489.Thus,theperformanceoftheMPFisquiterobust, onlysmallchangesintheRMSEarefoundwhendecreasingthenumberofparticles.

Note thecomplexity ofAlgorithm1 isproportional to N2

p whileitis proportional to Np forthe SIRfilter, sothat the computationalcostoftheassimilationstageishigherfortheMPFwiththesamenumberofparticles.However,itrequires asmallernumberofparticlessothatitrepresentsapromisingvenueforcomputationallyexpensivedynamicalmodels.The SIRfiltercouldusemoreparticlesatthesamecomputationalcost.Howeveragenerallyapplicablecomparisonofthefilters andconclusionofthiskindisnot possiblebecauseitdependsonthecomputational costofthedynamical model

M

that integrates thestate fromtime k

1 tok.Inparticular, one ofthepotential applicationsforwhich we envisagethe MPF, geophysical systems,typically useextremely computationally demanding models. Thus, only a smallnumber (

<

100) of modelintegrations–particles–areaffordable.TheoverallcomputationalcostoftheMPFforsuchasystemwouldbesimilar to3Dvariationalassimilation(i.e.MAPestimation)foreachparticle.

Athirdsetofexperimentswas conductedusingacompartmentstochastic dynamicalmodelforcholera dynamics[10]. Fig. 8a shows the truemortality time series andthe one estimatedby the MPF. The experiment started withan initial

(12)

Fig. 6.As in Fig.5for the SIR particle filter.

Fig. 7.RMSE for the MPF and the SIR filters as a function of the number of particles. Panels (a) 100, (b) 20, and (c) 5 particles.

meanstatewhichunderestimatesthenumberoftruesusceptibleindividualsby20%.Agoodperformance isfoundforthe MPF(total RMSEis0.0031). TheSIR filterfollowsthecholera outbreaksbuttheestimate ismuchnoisierwhenthesame level ofnoise is usedasfor theMPF (total RMSEis 0.0039). By decreasingthe levelof noise inthe SIR filter,the noise diminishesbutattheexpenseofastrongunderestimationofthepeaks(totalRMSEis0.0077).Overalltheperformanceof themappingparticle filterisexcellent, itcan handlevery well thispartial observationconfiguration,i.e.5state variables andoneobservedvariable,andthenonadditivemodelerror.

ToevaluatetheperformanceoftheMPFfora largerstate-spacemodel,anexperimentusingthe40-variableLorenz-96 dynamicalsystemwasconducted.Anensembleof20particleswasusedintheMPFexperiment.Scalablesamplingmethods are expectedto produceusefulresults fortheseexperiments inwhich Np

<

Nx. Fig.9a compares theresulting RMSEas a function oftime given by theMPF experimentand two stochastic EnKFexperiments one using20 members(inflation factor1.1)andonewith100members(inflationfactor1.05).TheSIRfilterisnotshownsinceitexhibitsdegeneracyinthe Lorenz-96systembecauseofthehighdimensionalstateandobservationspaces.TheMPFtakesalongertimetoconverge (20cycles),buttheEnKFwiththesamenumberofmembers(20)showsalargerRMSE.Ontheotherhand,theEnKFwith 100members presentsrathersimilar RMSEto theMPF experimentwith20 particles.Furthermore,the mappingparticle filterestimatesare ratherstablein time,which isa resultofits deterministic mappings.The spreadisrather stableand comparabletotheRMSEinbothfilters(Fig.9b).

An experiment to evaluate the performance of MPF for a partially observed Lorenz-96 system (40 variables and 20 observations) was conducted. The settings are similar to the previous experiment butnow the observations are sparse. Fig.10ashowstheRMSEgivenbytheMPFandthestochasticEnKF.TheMPFoutperformstheEnKFintermsoftheRMSE metrics.Fig.10bshowsthespreadoftheparticlesforbothfilters.Nolocalizationisusedinthefilters.Thisisnotbyfaran

(13)

Fig. 8.Mortalitytimeseriesforamodelofcholeradynamics,truemortalityandtheoneestimatedwiththeMPF(a),withtheSIRfilterwithhighadditive noise(b)andwiththeSIRfilterwithlowadditivenoise(c).

Fig. 9.RMSE (a) and spread (b) of the state as a function of the assimilation cycles produced by EnKF and MPF in the 40-variables Lorenz-96 system.

Fig. 10.RMSE(a)andspread(b)ofthestateasafunctionoftheassimilationcyclesproducedbyEnKFandMPFinthe40-variablesLorenz-96systemwith 20observations.Bothfiltersuse20particles.

exhaustivecomparison,theonlypurposeistoshowthattheMPFcangivepromisingresultsinrelativelylargedimensional spacesinwhichtheSIRfilterdoesnotwork.Furtherexperimentsandmoreextensivecomparisonsarerequiredtoevaluate theoverallperformanceofMPFinmedianandhigher-dimensionalsystems(

>

100 statespacedimensions),thesegobeyond thepresentworkwhichisfocused ontheintroductionanddevelopmentofthetechnique.

The radialbasisfunctionsusedaskernelrequiretosetthebandwidth,whichinthisworkwas chosenas

A

=

α

Q.The uknown parameter

α

dependson the number of particles andthe dimensions ofthe problem so that it requires some tuning.Theoptimalparameterwillbetheonethatproducesthesamplespreadequaltothespreadofthesequential poste-riordensity.Thebandwidthparameter

α

usedintheLorenz-63experimentwith20particlesis

α

=

1.Fortheexperiment with 100particles we use

α

=

0

.

5. Fig. 11 showsexperiments forthe 40 variables Lorenz-96 systemin which we vary the bandwidthfrom

α

=

2 to

α

=

100.The RMSEshowsonlyslightchangeswithsimilarvaluesinthethreeexperiments (Fig. 11a). The RMSEmeasure for theMPF appears ratherrobust to the kernelbandwidth. Fig.11b showsthe spreadof the experiments for different

α

values. A choice of

α

=

20 appears to give an optimal spread of the particles for this experiment.
(14)

Fig. 11.(a)RMSEfortheMPFintheLorenz-96experimentsusingabandwidthofα=2,20,100.(b)Thecorrespondingspreadoftheparticlesofthe posteriordensity.

5. Discussion and conclusions

Theproposedmappingparticlefilterisbasedonadeterministicgradientflow.Itisabletokeepalargeeffectivesample size,avoidingtheresamplingstep,evenforlongrecursionssubjecttotheconvergenceofthemappingsequences. Further-more,itshowsarobustbehaviorintermsoftheRMSEwitha consistentensemblespreadintheconductedexperiments, usingonlyasmallnumberofparticles.

Inthelimit ofasingle particle,Np

=

1,thegradient oftheKullback-Leiblerdivergenceisequaltothe negativeofthe gradientofthelog-posteriordensity(seeEq.(18)).Inthatcase,theoptimizationviagradientdescentdeterminesthemode oftheposteriordensity.Therefore,themethodforNp

=

1 is equivalenttothree-dimensionalvariationaldataassimilation (3D-Var)with

Q

intheroleof

B

.

The MPF requires theadjoint of the observational operator to evaluate the gradient ofthe posterior density(see Eq. (32)).AnextensionoftheMPFwas proposedin[22] whichrepresentstheobservationaloperatorintheRKHS,andinthis wayitavoidstheneedoftheJacobianoftheobservationaloperator.TheexperimentsshowthatthisextensionoftheMPF retainsthemainnon-Gaussianpropertiesoftheposteriordensity.

Themappingofsamplesinthevariationalmappingparticlefilterisproducedbymovingtheparticleswherethey max-imizetheinformationgainoftheproposaldensitywithrespecttothesequentialposteriordensity,viatheKullback-Leibler divergence.In these terms,the proposed mapping particle filterseeks to maximize the amount ofinformation available in a complexhigh-dimensional posteriordensity usinga stochastic optimization methodand givena limited numberof particles.

Acknowledgements

ThisworkhasbeenfoundedbytheEuropeanResearchCouncilviatheCUNDAprojectnumber694509underthe Euro-peanUnionHorizon2020programme.

Appendix A. Importance sampling and the mapping particle filter

The variational mappingsampling schemeis suitable to be coupled withimportance sampling. The last intermediate densityobtainedwiththevariationalmappingsamplingschemecanbeusedasaproposaldensityoftheposteriordensity. Supposewe denotetheparticles obtainedinthelast mappingiterationby

x

k(1:N)

=

.

x(1k,I:N).The valuesoftheproposaland posteriordensitiesattheparticlepositionsareusedtoestimatetheweights

˜

w(kj)

=

p(yk

|

x (j) k

)p(

x (j) k

|

y1:k−1

)

q(x(kj)

)

,

(A.1)

andthentheyarenormalizedwithw(kj)

= ˜

w(kj)

/

Njpw

˜

(j)k .TheeffectiveensemblesizeisgivenbyNe f f

=

1

/

j

(

w (j) k

)

2. Sincetheoutputofthefilterisasetofweightedsamples,thesequentialposteriordensity(e.g.Eq.(30))hastoinclude theweightsofthepreviouscycle,

p(x(j)k

|

y1:k

)

p(yk

|

x(kj)

)

l

wk(l)1p(xk(j)

|

x(l)k1

).

(A.2)

Tocompute theweights between theposterior densityand theproposal density–final density ofthe MPF–,we also needto evaluatetheproposaldensityattheparticlelocations. Werequiretheproposaldensitytobe known(apart from

(15)

Fig. A.12.(a)EvolutionoftheweightsofalltheparticlesfortheLorenz-63systeminalongtimesequenceforI=50 andwithoutthreshold.Becausethe weightsofalltheparticlesareplotted,thelastparticleplottedisthemostvisibleone.(b)Thevarianceoftheweightsasafunctionoftime.(c)Number ofeffectiveparticles.

the set of samples that we have fromthat density). There are two ways to obtain it. One wayis to usekernel density estimationwiththesetofsamples,

x

(1k:N),usingthesamekernelsasinthemapping.Thesecondoptionistodetermineat eachmappingiterationhowthedensityistransformedwiththemapping(e.g.[19]).

Before themapping,wechooseasinitialintermediate density,forinstance,anequallyweighted Gaussianmixture cen-teredat

M

(

x(i)k1

)

andwithcovariances

Q

,

q(xk,0

|

y1:k

)

Np

i=1 exp

1 2

xk,0

M

(

x (i) k−1

)

2 Q

.

(A.3)

Inthefollowingmappingiterationsi

>

0,thedensitychangesaccordingtothemappingT, qT

(

x(k,ij)

)

=

q(x(k,ij)1

)

det

xT

=

q(x(k,ij)1

)

|

I

H

|

,

(A.4)

where

|

I

H

|

isthedeterminantJacobianofthetransformationand

H

istheHessianoftheKullback-Leiblerdivergence. FromEq.(18),theHessianisgivenby

H

=

1 Np Np

l=1

x(j) k,i−1 K(xk,i(l)1

,

x(j)k,i1

)

xlogp(xk,i(l)1

)

+ ∇

x(j) k,i−1

x (l) k,i−1 K

(

x(l)k,i1

,

x(k,ij)1

)

.

(A.5)

ThecalculationoftheHessianoftheKullback-Leiblerdivergence,Eq.(A.5) iscomputationallydemanding.Althoughnote thatthesecondderivativesareknownanalytically forstandardkernelfunctionsandthesymmetryofthekernelcanbeused to reduce thecalculations.The Hessiancalculation isrequiredateachmapping iterationtoevolve qk,i(j) fromtheprevious intermediatedensityq(j)k,i1.

Theweightsaccountforthebiasintroducedinthefilterwhenthesetofsamplesobtainedbythesequenceofmappings is not exactly distributed according to the target density. However, it should be kept in mind that our estimate of the posteriordensityEq.(A.2) isnotexact,soitisunclearwhat thecalculatedweights actuallymean.Itwouldbeinteresting todeterminetheaccuracyofthisestimate,butthatisbeyondthescopeofthiswork.

Theweightsarealsousefulasadiagnosticofthequalityoftheconvergenceoftheoptimizationmethod.Inthatcase,an evaluationoftheweightsforthecurrentintermediatedensitycouldbeconductedineachmappingiteration.Thisallowsto check inthealgorithmifthegivenweights givean effectivesample sizewhichisoverthe thresholdno furthermapping iterations arerequired.Intheconductedexperimentsofthiswork,ingeneral,theeffectivesamplesizewasover98%after 50mappingiterationsduringthewholerecursion.Inotherwords,ifthenumberofmappingiterationsislargeenough

50, theweightswillbealmostequal.Inhigh-dimensionalsystems,aconvergencecriterionbasedonthemoduleofthegradient ofKulback-Leibler divergencecanbe usedinsteadoftheeffectivesample size.ThisavoidsthecalculationsoftheHessian orkerneldensityestimations.The convergenceofthetechnique isevaluatedthrough experimentswithbothmeasures in Section3.4.

The performance of the mapping particle filtercan be evaluated through the weights of the particles. As a function of the mapping iterations of the filter,the sequence of mappings is expectedto start with a set of particles withmost of the particles with almost zero weights and a few particles with very large weights. Note from Eq. (35) that parti-cles withalmost zero weights produce a largeKullback-Leibler divergence.Then, asthe particles are pushed toward the posterior density, their weights will tend to be equally distributedfor mostof the particles.If the variationalmappings work effectively,onlyafewparticlesshouldremainwithlowweights. Inother words,thevarianceoftheweightsshould

(16)

Fig. A.13.Idem Fig.A.12forI=100 and without threshold.

be smallfor all the cycles. Fig. A.12a shows the evolution ofthe weights of all the particles as a function of time for I

=

50 andwithout thresholdforthe Lorenz-63system. Most ofthe particleweights are concentrated around 0.05,in a rangebetween0.04 and 0.07 (Np

=

20 particles).There are some cyclesin whichone particleobtains a low weight but in thefollowing cyclethe weight ofthat particle isagain within theequally-weighted range.Note that we didnot per-form anyresampling. The variance of the weights is around 1

.

25 10−4 (Fig. A.12b). The numberof effectiveparticles is ingeneralabout19(from atotalof 20particles),only inafew cyclesit descendsdown to16(Fig. A.12c).Afurther ex-perimentwas conductedinwhichthemaximumnumberofiterationswas increasedto100. Inthatcase, alltheparticles remain with weights larger than 0.04 during the whole time sequence andthe peaks of variance of the weights found in Fig. A.12b are not present (see Fig. A.13). The numberof effective particles is in this caseover 18 for all the cycles (Fig.A.13c).

References

[1]S.Angenent,S.Haker,A.Tannenbaum,MinimizingflowsfortheMonge–Kantorovichproblem,SIAMJ.Math.Anal.35(2003)61–97.

[2]M.S.Arulampalam,S.Maskell,N.Gordon,T.Clapp,Atutorialonparticlefiltersforonlinenonlinear/non-GaussianBayesiantracking,IEEETrans.Signal Process.50(2002)174–188.

[3]E.Atkins,M.Morzfeld,A.J.Chorin,Implicitparticlemethodsandtheirconnectionwithvariationaldataassimilation,Mon.WeatherRev.141(2013) 1786–1803.

[4]P.Bunch,S.Godsill,ApproximationsoftheoptimalimportancedensityusingGaussianparticleflowimportancesampling,J.Am.Stat.Assoc.111(2016) 748–762.

[5]O.Cappé,E.Moulines,T.Rydon,InferenceinHiddenMarkovModels,SpringerScience+BusinessMedia,NewYork,NY,2005.

[6]Y.Cheng,S.Reich,Assimilatingdataintoscientificmodels:anoptimalcouplingperspective,in:NonlinearDataAssimilation,Springer,2015,pp. 75–118. [7]A.J.Chorin,X.Tu,Implicitsamplingforparticlefilters,Proc.Natl.Acad.Sci.USA106(2009)17249–17254.

[8]F.Daum,J.Huang,Nonlinearfilterswithlog-homotopy,in:SignalandDataProcessingofSmallTargets2007,vol. 6699,2007,p. 669918. [9]A.Doucet,S.Godsill,C.Andrieu,OnsequentialMonteCarlosamplingmethodsforBayesianfiltering,Stat.Comput.10(2000)197–208. [10]E.L.Ionides,C.Bretó,A.A.King,Inferencefornonlineardynamicalsystems,Proc.Natl.Acad.Sci.USA103(2006)18438–18443.

[11]D.Kingma,J.Ba,Adam:amethodforstochasticoptimization,in:Int.Conf.onLearningRepres(ICLR),2015,arXivpreprintarXiv:1412.6980. [12]Y.LeCun,Y.Bengio,G.Hinton,Deeplearning,Nature521(2015)436–444.

[13]Y.Li,M.Coates,Particlefilteringwithinvertibleparticleflow,IEEETrans.SignalProcess.65(2017)4102–4116.

[14]Q.Liu,D.Wang,Steinvariationalgradientdescent:ageneralpurposeBayesianinferencealgorithm,in:AdvancesinNeuralInformationProcessing Systems,2016,pp. 2378–2386.

[15]J.Liu,M.West,Combinedparameterandstateestimationinsimulation-basedfiltering,in:SequentialMonteCarloMethodsinPractice,Springer,New York,2001,pp. 197–223.

[16]R.M.Neal,Samplingfrommultimodaldistributionsusingtemperedtransitions,Stat.Comput.6(1996)353–366.

[17]Y. Marzouk, T.Moselhy,M.Parno,A.Spantini,An introductiontosamplingviameasuretransport,in: R.Ghanem,D. Higdon,H.Owhadi (Eds.), HandbookofUncertaintyQuantification,Springer,2017,inpress,arXiv:1602.05023.

[18]R.McCann,Existenceanduniquenessofmonotonemeasure-preservingmaps,DukeMath.J.80(1995)309–323. [19]T.A.Moselhy,Y.M.Marzouk,Bayesianinferencewithoptimalmaps,J.Comput.Phys.231(2012)7815–7850.

[20] M.Pulido,O.Rosso,Modelselection:usinginformationmeasuresfromordinalsymbolicanalysistoselectmodelsub-gridscaleparameterizations, J. Atmos.Sci.74(2017)3253–3269,https://doi.org/10.1175/JAS-D-16-0340.1.

[21]M.Pulido,P.Tandeo,M.Bocquet,A.Carrassi,M.Lucini,Parameterestimationinstochasticmulti-scaledynamicalsystemsusingmaximumlikelihood methods,Tellus70(2018)1442099.

[22]M.Pulido,P.J.vanLeewen,D.J.Posselt,Kernelembeddednonlinearobservationalmappingsinthevariationalmappingparticlefilter,in:Computational Science-ICCS2019,in:LectureNotesinComputerScience,vol. 11539,2019,arXiv:1901.10426,2019.

[23]S.Reich,AnonparametricensembletransformmethodforBayesianinference,SIAMJ.Sci.Comput.35(2013)A2013–A2024. [24]B.Scholkopf,A.J.Smola,LearningWithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond,MITPress,2002. [25]E.G.Tabak,E.Vanden-Eijnden,Densityestimationbydualascentofthelog-likelihood,Commun.Math.Sci.8(2010)217–233.

[26]P.J.vanLeeuwen,Nonlineardataassimilationingeosciences:anextremelyefficientparticlefilter,Q.J.R.Meteorol.Soc.136(2010)1991–1999. [27]P.J.vanLeeuwen,Nonlineardataassimilationforhigh-dimensionalsystems,in:NonlinearDataAssimilation,Springer,2015,pp. 1–73.

[28] P.J.vanLeeuwen,H.R.Künsch,L.Nerger,R.Potthast,S.Reich,Particlefiltersforhigh-dimensionalgeoscienceapplications:areview,Q.J.R.Meteorol. Soc.(2019),inpress,https://doi.org/10.1002/qj.3551.

[29]C.Villani,OptimalTransport:OldandNew,vol. 338,SpringerScience&BusinessMedia,2008. [30]M.D.Zeiler,ADADELTA:anadaptivelearningratemethod,arXiv:1212.5701 [cs.LG],2012.

Journal of Computational Physics 396 (2019) 400–415 ScienceDirect www.elsevier.com/locate/jcp (http://creativecommons.org/licenses/by/4.0/ 61–97. 174–188. 1786–1803. 748–762. 2005. 2015, 17249–17254. 2007, 197–208. 18438–18443. 2015, 436–444. 4102–4116. 2016, 2001, 353–366. 2017, 309–323. 7815–7850. https://doi 1442099. 2019. 35 2002. 217–233. 1991–1999. 2015, https://doi 2008. 2012. 1904–1919.

References

Related documents