Sequential Monte Carlo with kernel embedded mappings: The mapping particle filter

(1)

Contents lists available atScienceDirect

Journal

of

Computational

Physics

www.elsevier.com/locate/jcp

Sequential

Monte

Carlo

with

kernel

embedded

mappings:

The mapping

particle

ﬁlter

Manuel Pulido

a

,

b

,∗

,

Peter

Jan van Leeuwen

a

,

c

a_Data_Assimilation_Research_Centre,_Department_of_Meteorology,_University_of_Reading,_UK b_Department_of_Physics,_FaCENA,_Universidad_Nacional_del_Nordeste_and_CONICET,_Argentina c_National_Centre_for_Earth_Observation,_Department_of_Meteorology,_University_of_Reading,_UK

a

r

t

i

c

l

e

i

n

f

o

a

b

s

t

r

a

c

t

Articlehistory:

Received18October2018

Receivedinrevisedform15May2019 Accepted24June2019

Availableonline27June2019 Keywords:

Optimaltransport Particleﬂows Kernelembedding

SteinVariationalGradientDescent

In this work, a novel sequential Monte Carlo filter is introduced which aims at an efficient sampling of the state space. Particles are pushed forward from the prediction to the posterior density using a sequence of mappings that minimizes the Kullback-Leibler divergence between the posterior and the sequence of intermediate densities. The sequence ofmappings represents a gradient flow based on the principles of local optimal transport. A key ingredient of the mappings is that they are embedded in a reproducingkernel Hilbertspace,whichallowsfor apracticaland efficient MonteCarlo algorithm.Thekernelembeddingprovidesadirectmeanstocalculatethegradientofthe Kullback-Leiblerdivergenceleadingtoquickconvergenceusingwell-knowngradient-based stochasticoptimizationalgorithms.Evaluationofthemethodisconductedinthechaotic Lorenz-63 system, the Lorenz-96 system, which is a coarse prototype of atmospheric dynamics, and an epidemic model that describes cholera dynamics. No resampling is requiredinthemappingparticlefilterevenforlongrecursivesequences.Thenumberof effectiveparticlesremainsclosetothetotalnumberofparticlesinallthesequence.Hence, themappingparticlefilterdoesnotsufferfromsampleimpoverishment.

1. Introduction

Severalapplicationsrangingfrommeteorology,oceanography,andhydrologytobiologicalsystemsneedtoestimate the stateofasystemfromanimperfectnumericalmodelandpartialnoisyobservations.Thecombinationofadynamicalmodel whichrepresentsthenonlinearevolutionofthe(hidden)statewithsparsenoisyobservationsisusuallyrepresentedthrough a hiddenMarkovmodel,also knownasastate-space model[9,5]. Theaim ofthisworkis todevelop a methodologyfor sequential Bayesianinferenceofstate-space modelscomposedofnonlinear chaoticdynamical systemswithnon-Gaussian uncertaintyinthestate-spacevariablesusingasmallnumberofsamples.

Particle filters are Monte Carlo techniques that sample the space through so-called particles–state realizations– that representthe probabilitydensityofthestate conditionedon thepartialnoisy observations,i.e.theposterior density.The challenge forparticlefiltersistorepresentrecursivelythehighprobabilityregions oftheposteriordensitywithalimited number ofparticles. Ingeneral, particle filterssufferfromfilterdegeneracy [9]. Aftera fewtime iterations, they tend to

*

Correspondingauthor.

E-mailaddress:[email protected](M. Pulido). https://doi.org/10.1016/j.jcp.2019.06.060

(2)

give all theweight to a single particle.In the importance samplingframework, a proposaldistribution of thetransition densityconditionedtothecurrentobservationscanbechosen toimprovethesampling[9,2].Although,agoodchoiceofa proposaldensitymayalleviatetheﬁlterdegeneracy,iftheﬁlterisappliedrecursivelyaresamplingstepisstillrequired.The applicationof resamplingproduces anotherissue,particle impoverishment,since onlya few particles withlarge weights are keptandcopied, theparticlediversity isdecreased inthe resamplingstep.Thisis animportant contentionpointfor hidden-Markovmodelsinhigh-dimensionalstatespaces.

Chorin etal., [7], proposed an implicit samplingschemein which theparticles froma Gaussian proposal densityare mapped to the highprobability regions of the posterior density through the solution of an algebraic equation for each particle.The latterrelatesthe modeof theproposaldensitywiththe modeofthe posteriordensityforeach particle.To findthemodeoftheposteriordensityforeachparticle,aminimizationisrequired.Toreducethevarianceoftheweights, Van Leeuwen, [26], proposes a mapping that enforces equal weights by scaling the deterministic movement of particles inthe optimalproposalstep.Since thehighprobability region oftheproposal densityischosen smallerthan themodel transitiondensity, thefilter can be overlyoptimistic aboutits performance [27]. Recently, a two-stepmapping hasbeen proposedtoalleviate theseissuesandremovethesebiases.However,amathematicalunderstandingoftheconvergenceof thefilterinthelimitforlargenumberofparticlesisstillmissing.Reich,[23],appliedoptimaltransportprinciplestofilters. A linear ensemble transform is proposed that minimizes the Euclidean distancewith optimal coupling in one mapping step. The degeneracy problemis still present in this method. It is, however, well suited for localization procedures for high-dimensional applications[6], inwhichan observationonlyaffectstheregion ofthestate spacethatis closetoitin Euclideandistance.

Inthiswork,weproposeanovelparticlefilterwhichisalsobasedonamappingfromthepriordensitytotheposterior densityastheimplicitsamplingfilter[7,3],theimplicitequalweightparticlefilter[26,31] andthenonparametricensemble transformmethod[23].Themaindifferenceisthatthemappingisdoneiterativelyviaagradientflow.Asampling method-ologybasedonoptimaltransportisderivedtodrivetheparticlesfromthepredictiondensitytotheposteriordensity.The transportofparticlesaimstominimizetheKullback-Leiblerdivergencethatmeasuresthedifferencesbetweenthe interme-diatedensitiesandtheposteriordensity.ThemappingisembeddedinareproducingkernelHilbertspacewhichallowsfor ananalyticalexpressionforthegradientoftheKullback-Leiblerdivergence.Theproposedsamplingmethodfortheparticle filterisbasedintheStein variationalgradient descentintroduced in[14].Theyfound aconnectionbetweenthegradient Kullback-LeiblerdivergenceandtheSteindiscrepancy.Inthiswork,wedevelopasequentialMonteCarlofilterbasedonthe variationalmapping,evaluateitinthreehiddenMarkovmodelsanddiscussitspotentialapplications.

2. Methodology

2.1. SequentialBayesianestimation

Weassumetheestimationproblemisencompassedofadynamicalmodel

M

,whichpredictsthestate

x

fromaprevious state,andan observationalmodel

H

,whichtransformsthestatefromthe(hidden) statespacetotheobservationalspace. Thesetofequationsthatdeﬁnestheestimationproblemisknownasastate-spacemodelorahiddenMarkovmodel.These are

xk

=

M

(

xk−1

,

η

k

),

(1)

yk

=

H

(

xk

,

ν

k

),

(2)

where

x

k

∈

R

Nx isthestateattimek,

y

k

∈

R

Ny aretheobservations,

η

k

∼

p

(

η

)

istherandommodelerror,and

ν

k

∼

p

(

ν

)

istheobservationalerror.ThemethoddevelopedhereisgeneralanddoesnotrelyontheadditiveorGaussianassumption inmodelnorobservationalerrors.

Observations

y

k areassumedtobemeasuredatdiscretetimes.Someaprioriknowledgeofthestateatk

=

0 isassumed, theinitialpriorstatedensityp

(

x0

)

.ThesequentialBayesianstateinferenceisgivenintwostages:

1. Firstly,intheevolutionstagethepredictiondensityisdeterminedas p(xk

|

y1:k−1

)

=

p(xk

|

xk−1

)p(

xk−1

|

y1:k−1

)

dxk−1

,

(3) wherewedenote

{

y1

,

· · ·

,

yk−1

}

by

y

1:k−1.Atk

=

1,

y

1:0

=

∅

sothatthepredictiondensityisp

(

x1

)

=

p

(

x1

|

x0

)

p

(

x0

)

dx0. 2. Secondly,intheassimilationstageBayesruleisusedtoexpresstheinferenceasasequentialprocess

p(xk

|

y1:k

)

=

p(yk

|

xk

)p(

xk

|

y1:k−1

)

p(yk

|

y1:k−1

)

,

(4)

wherep

(

yk

|

xk

)

istheobservationlikelihoodandp

(

yk

|

y1:k−1

)

=

(3)

Note that the marginalized posterior densityfor the current state is inferred in Eq. (4) instead ofthe posterior density forthewholetrajectory

x

0:K.Thisinference ofthecurrentstate isusefulforsystemswhichroutinelyreceiveobservations andthereforethe estimationisproducedonly atthecurrentstate withoutkeepinginformationofthewholepath ofthe particles.ThisapproachisfollowedbyensembleKalmanfilters,particleflowfilters(e.g.[8])andtheimplicitequal-weight particlefilter[31,28].

2.2. Particleﬂowsandoptimaltransport

Aﬁlterneedstoupdateitspriorknowledge,providedinthepredictiondensity,withtheobservationlikelihood,Eq.(4). The concept of homotopy can be used tosmoothly transformthe prior densityq

(

x

)

towards the posterior density p

(

x

)

. Continuous deformationsthrough a parametercan be usedto achieve this, such asT

(

x

,

λ)

:

R

Nx

× [

₀

,

₁

]

→

R

_, _T

(

_x

,

λ)

=

p

(

x

)

λ_q

(

_x

)

1−λ_. _(To _simplify _the _notation _we _have_left _the _conditioning _to _observations _implicit_in _this _and_the _following subsection.)Theparameter

λ

isinterpretedasapseudo-timewhichisvariedfrom0to1atafixedrealtime.Thisconcept wasusedin[8] forparticlefiltersandisdirectlyrelatedtotempering[16].Thedensitiesarerepresentedthroughparticles, sothatthedeformationsareinterpretedasparticlesmovinginaflowaccordingtoasetofordinarydifferentialequations,

dxλ

dλ

=

vλ

(

xλ

),

(5)

where

v

λisthedriftorvelocityfield.Wefocusondeterministicflowssothatdiffusivityprocessesarenotconsidered.The evolutionofthedensityunderthisflowisthengivenbytheLiouvilleequation,

∂

tqλ

+ ∇ ·

(

vλqλ

).

(6)

Insome recentworks, theparticlesare pushedsmoothlyforwardinpseudo-times usingaGaussian approximationto rep-resent the velocity ﬁeld(e.g. [4,13]).Ontheother hand,in thecurrentwork we avoidtheuse ofthat approximationby obtainingthegradientﬂowfromoptimallocaltransport.

Theoptimaltransportisanotherpromisingapproachforthesamplingofcomplexposteriordistributions[17].Giventhe prior probabilitymassdistributionq

(

x

)

andthetargetprobability massdistribution p

(

z

)

,wewanttotransportthe proba-bilitymassfromq to pusinga mappingT

:

x

→

z.The optimaltransportproblemseeksthetransformation T thatgives the minimalcostto transportthedistributionofmassfromq to p.Thisrepresentstheclassical Mongeoptimaltransport problem. Thereisa rigorousproof oftheexistenceofsuchoptimalmappingundermildconditions[18]. Furthermore,[1] haveformulatedtheMonge-Kantorovichproblemasasteepestdescentgradientoptimization.

Ourapproachcombinestheideasofoptimaltransportandparticleflows.Wetakethelocalapproachtooptimaltransport [29] inwhicha gradientflowneeds tobe obtained.Through asequence ofmappingsgivenbythe flowwe seekto push forward the particles from the prior to the target density. The mappings in the sequence are required to be assmooth aspossible.Theparticles behaveasactive Lagrangiantracersinaflow. Thevelocity fieldateachpseudo-time stepofthe mappingsequenceischosenfollowingalocaloptimaltransportcriterion.

2.3. Variationalmappings

Ateachiterationofthesequence,weproposealocaltransformationT thatfollowsEq.(5),

xλ+

=

T(xλ

)

=

xλ

+

v

(

xλ

),

(7)

where

=

δλ

isassumedtobesmalland

v

(

xλ

)

isanarbitraryvectorialfunction“sufficiently”smooth,whichrepresentsthe velocityoftheflowdefinedinEq.(5).

The Kullback-Leiblerdivergenceisusedasa measureofthedifference betweenthe intermediatedensityq

(

x

)

andthe posteriordensityp

(

x

)

.TheKullback-Leiblerdivergencebetweenqandpafterthetransformation,T,intermsoftheinverse mappingis

D

K L

(T

)

=

qX

(

x

)

log

qX

(

x

)

p_T−1

(

x

)

dx

,

(8)

wherep_T−1

(

x

)

=

p

(

T

(

x

))

det

∇

xT

(

x

)

anddet

∇

xT

(

x

)

isthedeterminantJacobianofthemappingandforbrevity

x

=

xλ(time indexisleftimplicit).

Ourgoalistoﬁndthevelocityﬁeld

v

(

x

)

ofthetransformationT thatgivestheminimumof

D

K L

(

T

)

foreachmapping ofthesequence.Inotherwords,weneedtoﬁndthedirection

v

(

x

)

,thatgivesthesteepestdescentof

D

K L.

WeusetheGateauxderivativeofafunctional,whichisageneralizationofthedirectionalderivative.Giventhefunctional F,itisdeﬁnedas

DhF

=

lim

→0

F

(

x

+

h

(

x

))

−

F

(

x

)

(4)

TheGateauxderivativeoftheKullback-Leiblerdivergence,Eq.(8),inthedirection

v

(

x

)

isgivenby Dv

D

K L

= −

q(x

)

dlogpT−1

(

x

)

₌

0 dx

.

(10)

Thederivativeofthetransformedlog-posteriordensityis

dlogpT−1

(

x

)

₌

0

= ∇

Tlogp(T(x

))

dT

+

Tr

(

∇

xT− 1_d

∇

xT)

₌₀

.

(11) Consideringthat T

(

x

)

=

x

+

v

(

x

)

inEq.(11) andreplacinginEq.(10),thedirectionalGateauxderivativeof

D

K L results in Dv

D

K L

= −

q(x

)

∇

xlogp(x

)

v

(

x

)

+

Tr

(

∇

xv

)

dx

.

(12)

Equation (12) gives the

D

K L derivative along v. However, we require the negative gradient of

D

K L in terms of the samples of q forthe optimization of

D

K L as a function of T. In general, the ﬂow vbelongs to an inﬁnitedimensional Hilbertspace.Hence,thefulloptimizationproblemisstill intractableinpracticesincewedonothaveawaytodetermine the

−

Dv

D

K Lthatgivesthesteepestdescentdirection.

Onewaytolimitourfunctionaloptimizationproblemsothatitbecomestractableischoosingasspaceoffunctionsthe unitballofareproducingkernelHilbertspace(RKHS),whichwedenoteas

F

.Thiswas proposedby[14].Inthisway,we constrain

v

to

F

andrequirethat

v

F

≤

1 toﬁndthegradientof

D

K L.Theoptimizationproblemisthentoﬁndthe

v

∈

F

thatgivesthedirectionofsteepestdescentof

D

K L.ThemainpropertiesoftheRKHSarediscussedin[24].

ThefunctionstoberepresentedintheRKHSarevector-valued,i.e.

v

(

x

)

∈

R

Nx_._The_kernel_is_thus_assumed_to_be_diagonal andisotropic,

K

=

KINx,where

I

Nx istheidentitymatrixandK isascalarkernel.Thereproducingpropertystatesthatany functionfrom

F

canbeexpressedasthedotproductbythekernelthatdeﬁnestheRKHS,

v

(

x

)

=

K

(

·

,

x

),

v

(

·

)

_F

.

(13)

Equation(13) isknownasthereproducingproperty.ThescalarkernelK deﬁnes

F

1.

Replacing the vector ﬁeld, v

(

x

)

, inthe two terms inEq. (12) by the expression givenin Eq.(13) and then usingdot productproperties,weﬁndthat

Dv

D

K L

= −

q(x

)

[K

(

x

,

·

)

∇

xlogp(x

)

+ ∇

xK(x

,

·

)

] dx

,

v

(

·

)

_F1

.

(14) This is validfor any vsuch that

v

F

≤

1. Therefore, the ﬁrst term ofthe dot product inEq. (14) is by deﬁnitionthe gradientof

D

K L atT=0

=

x,Dv

D

K L

= ∇

D

K L

,

v

_F1 sothat

∇

D

K L

(

x

)

= −

E

x∼q K(x

,

x

)

∇

xlogp(x

)

+ ∇

xK

(

x

,

x

)

,

(15)

where

E

representstheexpectationoperator

E

x∼p

[

f

(

x

)

]

p

(

x

)

f

(

x

)

dx.(NotethatbyDv

D

K L

(

x

)

wedenotethedifferences between

D

K L witha mapping T atxinadirection

v

fromthe Kullback-Leiblerdivergencewithoutthemapping,i.e.the Gateauxderivate, while

∇

D

K L

(

x

)

denotesthedirectioninwhich

D

K L increasesthemostwhenmappingsoftheformEq. (7) areconductedat

x

.)

This expression for the gradient, Eq. (15), is particularlysuitable for Monte-Carlo integration when q is only known througha setofsample points. Sincewe seektominimize theKullback-Leiblerdivergencewe chooseasdirectionofthe transformationT thenegativeofitsgradient,thesteepestdescentdirection,

xλ+

=

x

−

∇

D

K L

(

x

).

(16)

UsingthekernelreproducingpropertyandintegrationbypartsinEq.(15),theresultinggradientﬂowofthemappingis

vλ

(

x

)

= −∇

D

K L

(

x

)

=

qλ

(

x

)

∇

logp(x

)

− ∇

qλ

(

x

).

(17)

As shownin [25], thegradient ﬂow, Eq.(17), has asstationarysolution q

(

x

)

=

p

(

x

)

sothat the gradient ﬂowconverges towardthetargetdensity.

In[19],theKullback-Leiblerdivergenceisalsousedbutwithanextraregularizationtermtoﬁndasingleglobaloptimal map.Here,weapply agradientﬂowviaasequenceoflocalmappings.Underweaksmoothnessconstrains [29],the mini-mumofthecost functionasafunctionofthemappings,Eq.(7), isuniquelydeterminedinEq.(16), i.e.noregularization termisrequired.

(5)

2.4. Mappingparticleﬁlter

Consider wehaveasetofequal-weightedparticles

x

(1:Np)

k−1 whichsampletheposteriordensityattimek

−

1.Thetarget densityattimek whichweaimtosampleusingthevariationalmappingistheposteriordensity p

(

xk

|

y1:k

)

.Themapping approachonlyrequiresasetofsamplesofthepriordensity.Itisstartedbythesetofunweightedparticlesthatareevolved fromthepreviousestimatetothepresentassimilationtimebythedynamicalmodel,i.e.

x_k,0(j)

=

M

(

x(_k₋j)₁

,

η

(j)_k

)

Np j=1,where the second subscript represents the mapping iteration. These sample points of the prediction density are then pushed towardsthesequentialposteriordensitybythemappingiterations.

Giventhesetofparticles

x

(1:Np)

k,i−1 thataresamplesoftheintermediatedensityatmappingiterationi

−

1,thegradientof theKullback-LeiblerdivergencefromEq.(15) atastatespaceposition

x

byMonte-Carlointegrationis

∇

D

K L

(

x

)

= −

1 Np Np

l=1

K(x_k,i(l)₋₁

,

x

)

∇

logp(x_k,i(l)₋₁

)

+ ∇

xK

(

x(l)k,i−1

,

x

)

.

(18)

Thisexpressionisequivalenttotheoneobtainedin[14] foratimeindependentinference.Thenegativeofthisgradient representsthevelocityﬁeldofthegradientﬂow.

TheﬁrsttermofthegradientoftheKullback-Leiblerdivergence,Eq.(18),tendstodrivetheparticlestowardsthepeaks oftheposteriordensity.Itgivesaweightedaverageof

∇

logpincludingthecurrentparticleandsurroundingoneswithin thekernelscaleofthecurrentone.Thisisthetypicalbehaviorofavariationalimportancesampler sothatitaccumulates particles atthehighprobabilityregions oftheposteriordensity.Ontheotherhand,thesecondtermof

∇

D

K L inEq.(18) tends toseparatetheparticles. Ifthecurrentparticle

x

(j)_k isintheinﬂuenceregionofanotherparticlesayx(l)_k ,i.e.within thekernelscale,theterm

∇

_x(l)

k,i−1

K

(

x(l)_k,i₋₁

,

x(j)_k,i₋₁

)

willtendtoseparatethemactingasarepulsiveforcebetweenparticles. Atthemappingiterationi,theparticle j istransformed,accordingtothemappingEq.(16),by

x(_k,ij)

=

x(_k,ij)₋₁

−

∇

D

K L

(

x_k,i(j)₋₁

),

(19)

where

−∇

D

K L

(

x_k,i(j)₋₁

)

isthesteepestdescentdirectionattheparticleposition.Eq.(19) maybeinterpretedasthemovement oftheparticlesalongthestreamlinesoftheﬂowassumingsmall

.

An ingredient needed to evaluate Eq. (18) is an expression for the gradient of the posterior density. A problem in sequential Bayesian inference is that there is no exact expression forthe posterior density.We do knowthe likelihood function,butweonlyhaveasetofparticlesthatrepresentthepriordensity,notthedensityitself.Thepriordensityusing theparticlerepresentationoftheposteriordensityattimek

−

1 inEq.(3) resultsin

p(xk

|

y1:k−1

)

≈

1 Np Np

j p(xk

|

x_k(j)₋₁

),

(20)

andtheexpressionfortheposteriordensityfromEq.(4) becomes

p(xk

|

y1:k

)

∝

1 Np p(yk

|

xk

)

Np

j p(xk

|

x_k(j)₋₁

).

(21)

Becauseoftheﬁniteensemblesizeattimek

−

1,thisexpressionisanapproximationoftheposteriordensity.Equation (21) isthetargetdensityinthemappingsequence.

Ateach mappingiteration,all theparticles aremovedfollowingEq.(19).Successiveapplicationsofthetransformation, through gradientdescent,withthecorrespondingupdatesof

∇

D

K L,willconvergetowards theminimumofthe Kullback-Leiblerdivergence.Thepseudo-codeoftheimplementedalgorithmwhichcombinesvariationalmappingwiththesequential particleﬁlterisshowninAlgorithm1.

The forminwhichthevariationalmappingisobtained,assteepestgradientdescent,issuitableforthestochastic opti-mizationalgorithmsusedinthemachinelearningcommunity.Inthiscase,

,knownasthelearningrate,canbedetermined adaptivelyusing

∇

D

K Lfrompreviousiterations(e.g.[30]).

Animportantpartofanyiterativevariationalmethodisastoppingcriterion.OnepossibilityinAlgorithm1istocheck thevalue of

|∇

D

K L

|

foreachparticle,oraveraged overallparticles.Anotheroptionisto useimportance samplingandto interpretthefinaldensityofthemappingsequenceasaproposaldensityandcalculateweightswithrespecttotheposterior density,whichwillthen automaticallyleadtoan unbiasedestimator.Tousethis,an expressionforthefinalintermediate densityq isrequired,however,wehaveonlyits particlerepresentation.Toimplementthisinapracticalway,thedensity transformations can be used,that we dohaveexplicitly, andthefinal densitycan be relatedtothe prior density.Inthe Appendix,theimplementationofimportancesamplingwithinthevariationalmappingparticlefilterisdescribed.

(6)

Algorithm 1 Mappingparticleﬁlteralgorithm. Input:Givenx(1:Np) k−1 ,yk,M(·),p(y|x)andp(η) forj=1,Npdo x_k,(j)₀←M(x(_k₋j)₁,ηk) Forecaststage end for

repeat Mappingiterations

forj=1,Npdo

x_k,i(j)←x(j)_k,i₋₁−∇DK L(x(j)k,i−1) ∇DK LfromEq.(18) end for

i←i+1

untilStoppingcriterionmet Output: x(1:Np)

k,i

Toavoid the resamplingin the importance samplingstep, one can calculatethe effectiveensemble size as Ne f f (see Appendix),anddecidetocontinuetheiterationsinAlgorithm1until Ne f f reachesacertainthresholdNt.Inthefollowing assimilation cycle, the weights of the particles representing the ﬁnal intermediate density has to be considered in the sequentialposteriordensity,resultinginaweighted posteriordensityinsteadofEq.(21).Inthisway,themappingparticle ﬁlterwouldreadily extendto includeimportance sampling. One shouldkeep inmind, however,that ourestimate ofthe posteriordensityisrather poorin high-dimensionalsystems witha smallnumberofparticles,so theweights wouldnot bevery accurate.Thus,ourstandard implementationusesthemagnitudeofthegradientasthestoppingcriteria,without calculatingtheweights.

3. Numerical experiments

Thesequentialimplementationofthemappingparticleﬁlterisevaluatedusingthreechaoticnonlineardynamicalmodels withstate spaces of up to 40dimensions(see Section 3.1). As usual inthese experiments,the observational andmodel errors areassumedGaussian. A derivationofthe hiddenMarkovmodel withGaussian errorsis giveninSection 3.2.We conductedasetofstochastictwinexperiments,inwhichthedynamicalmodelusedtoproducethesyntheticobservations isthesameastheone usedinthestate-spacemodel,includingthestatisticalparameters. Detailsoftheexperimentsare giveninSection3.3.

3.1. Descriptionofthesystems TheLorenz-63systemisgivenby

dx dt

=

σ

(y

−

x), dy dt

=

x(

ρ

−

z)

−

y, (22) dz dt

=

xy

−

βz.

Theequationsaresolvedusingafourth-orderRunge-Kuttascheme.Theparametersarechosen attheirstandardvalues

σ

=

10,

ρ

=

28 and

β

=

8

/

3.The integrationtime stepis0

.

001 and theassimilationcyclesareevery 0

.

01 (10integration timesteps).Theobservationoperator

H

istakenastheidentitymatrix,forthefull-observedstateexperimentsinSection4. Themodelerrorcovariance

Q

ischosendiagonal.Itsdiagonalelementsaresetto30%oftheclimatologicalvarianceofthe variable,e.g. Q11

=

0

.

3

σ

x2,where

σ

x2 isthetime-seriesvarianceofvariablex.Theobservationerrorcovarianceis

R

=

0

.

5I.

Thestochasticdynamicalmodelforcholeraisa5-variablesystemallowingforsusceptible,infected,andthreeclassesof recoveryindividualsinwhichcholeramortalityistheonlyobservedvariable.Fromtheinferencepointofview,itpresents somechallenges[10].Ithaspartialnoisyobservationswhosevariancedependsontime.Transmissionisassumedstochastic andhasamultiplicativemodelerror.Thecompartmentdynamicalmodelforcholerausedinthisworkistheonethroughly described in [10]. A shortdescription is givenhere. The individuals are classiﬁedassusceptible (S), infectedindividuals (I),whilerecoveredindividualsbelong tothreeclasses(R1:3) toallowfordifferentimmune periods.Theequationsforthe numberofindividualsforeachcategoryaregivenby

dSt

=

dNtB S

−

dNtN I

−

dNtS D

+

dNR k_S t dIt

=

dNtS I

−

dNI R 1 t

−

dNtI C

+

dNtI D dR_t1

=

dN_tI R1

−

dN_tR1R2

−

dN_tR1D (23) dR2_t

=

dN_tR1R2

−

dN_tR2S

−

dN_tR2D dR3_t

=

dN_tR2R3

−

dN_tR3S

−

dN_tR3D

(7)

Transmissionisgivenbyastochasticdifferentialequation

dN_tS I

=

λ

tStdt

+

ItSt

/P

tdWt

,

(24) wheredWt isaGaussianwhitenoise.Thetransitionsbetweencategoriesaregivenby

dN_tI R1

=

γ

Itdt, dNR l−1_Rl t

=

rkRlt−1dt, dN_tRlS

=

rkR_tkdt, dN_tS D

=

mStdt, dN_tI D

=

mItdt, dNR l_D t

=

mRltdt

,

(25) dN_tI C

=

mcItdt, dNtB S

=

d Pt

+

m Ptdt,

wherel

=

2

,

3, B representsbirth,C ischolera mortalityand D denotesdeathfromothercauses.Equations(23) are inte-gratedusinganEuler-Maruyamaschemewithtimestepof1

/

20 month.Theobservationsarecholeramortalitydatawhich aregivenby yt

∼

N

N_tI C

−

N_tI C₋₁

,

τ

2

(N

_tI C

−

N_tI C₋₁

)

2

.

(26)

Whilethemappingparticleﬁltercanhandlenon-additivemodelerrors,wetransformthesystemofequationsaboveto allow foran additive model errorimplementation.We usean augmented state spacedeﬁning a newstate variable T as dTt

=

dWt,sothattheaugmentedsystemhasadditiveGaussianmodelerrorwhosegradientofthelog-posteriordensityis represented byEq.(30). Notethat anon-Gaussian densityforadditive modelerrorwouldgive morerealisticfeatures. To adddiversitytotheparticlesintheﬁlterasmalladditivenoisetermisaddedinalltheequations[15].Thevarianceofthis additivenoiseissetto10%ofthedWt variance.Thisnoiseisnotusedtogeneratethetruetrajectory,itisonlyincludedin theparticleﬁlter.

Forthe40-variableLorenz-96systemexperiments,thesetofequationsis

dxi

dt

=

(x

i+1

−

xi−2

)x

i−1

−

xi

+

F

,

(27)

where i

=

1

,

· · ·

,

40 and cyclicboundary conditions are imposed, x0

=

x40, x−1

=

x39 and x41

=

x1.The equationswere integratedusingafourth-orderRunge-Kuttascheme,withanintegrationstepof0.001.Theforcingis F

=

8 whichresultsin chaoticdynamics.Theobservationaltimeresolutionis0

.

05.Theobservationalandmodelerrorcovariancesare

R

=

0

.

5Iand

Q

=

0

.

3I,respectively.Thisamplitudeofthemodelerrornoiseisrepresentativeforthetwo-scale Lorenz96systemwhich was estimated to be 0.3 using information measures by [20] and 0.3-0.5 using an Expectation-Maximization algorithm coupledtoanETKF[21].Theinitialensembleparticlesarestatestakenrandomlyfromaclimatology.

3.2. HiddenMarkovmodelswithGaussianerrors

Next, we derive the expressionof the posteriordensity inthe sequential framework asa function ofthe particles at k

−

1.Forthenumericalexperiments,modelandobservationalerrorsareassumedGaussianinthethreedynamicalsystems used inthiswork (Section3.1). Theseare takenasan example,buttheframework is general.In particular,it issuitable fornon-additiveandnon-Gaussianerrors.Thepredictionprobabilitydensitycanbeobtainedanalyticallyundertheadditive Gaussianmodelerrorassumption,

η

_k

∼

N

(

0

,

Qk

)

inEq.(1).TheevolutionoftheparticlesfromEq.(3) gives

p(xk

|

y1:k−1

)

∝

Np

m=1 exp

−

1 2

xk

−

M

(

x (m) k−1

)

2 Qk

,

(28) where

xk

−

M

(

x(m)_k₋₁

)

2 Qk

(

xk

−

M

(

x (m) k−1

))

Q− 1 k

(

xk

−

M

(

x (m)

k−1

))

.Weassumeequal-weightedparticleshere;itis straight-forwardtoassumeweightedparticlesattimek

−

1 ifsodesired.

Theobservational errorisalsoassumedadditiveGaussian,

ν

k

∼

N

(

0

,

Rk

)

inEq.(2),sothattheobservationallikelihood atk is p(yk

|

xk

)

∝

exp

−

1 2

yk

−

H

(

xk

)

2 Rk

.

(29)

The Bayes rule, Eq.(4), is used toobtain the sequential posterior densitycombining theprediction density, Eq.(28), and theobservationlikelihood, Eq.(29). The sequentialposterior densitygiventhe particles attheprevious time step is proportionalto p(xk

|

y1:k

)

∝

Np

m=1 exp

−

1 2

xk

−

M

(

x (m) k−1

)

2 Qk

·

exp

−

1 2

yk

−

H

(

xk

)

2 Rk

.

(30)

(8)

Thisisthetargetdensitythattheparticles

x

_k(1:N)asMonteCarlosamplesshouldberepresentingintheﬁlter. Thegradientofthelogarithmofthesequentialposteriordensity,Eq.(30),at

x

(l)_k,i₋₁ isgivenby

∇

xlogp(x_k,i(l)₋₁

)

=

Cp−1

(

x(l)_k,i₋₁

)

Np

m=1

ψ

_l,mQ

ψ

_lR

HR−_k1

yk

−

H

(

x(l)_k,i₋₁

)

−

Q_k−1

x(l)_k,i₋₁

−

M

(

x(m)_k₋₁

)

(31) with

ψ

_l,mQ

=

exp

−

1 2

x (l) k,i−1

−

M

(

x (m) k−1

)

2 Qk

,

ψ

_lR

=

exp

−

1 2

yk

−

H

(

x l k

)

2 Rk

and p

(

x(l)_k,i₋₁

)

=

C

_mN₌p₁

ψ

_l,mQ

ψ

_lR.Reducingwe obtain

∇

)

=

HR−_k1

yk

−

H

(

xk,i(l)−1

)

−

Q_k−1

⎡

⎣

x_k,i(l)₋₁

−

Np m=1

ψ

Q l,m

M

(

x (m) k−1

)

Np m=1

ψ

Q l,m

⎤

⎦

.

(32)

This gradient of the log-posterior, Eq.(32), does not depend on the unknown normalization constant of the sequential posteriordensity.Hence,thenumericalevaluationofthegradientoftheKullback-Leiblerdivergenceisfeasible.

3.3. Experimentdetails

AGaussiankernelischosenfortheexperiments, K(x

,

x

)

=

exp

−

1

2

(

x

−

x

)

_A−1

(

_x

−

_x

)

.

₍₃₃₎

The kernelcovariance

A

istakenproportional to themodelerrorcovariance matrix,Q,i.e.A

=

α

Q.Thisappears tobe a convenientchoicesincethemodeluncertainty,

Q

,alreadyincludesthephysicssothatitisexpectedtorepresentthescaling betweenthevariablesandalsothecorrelationsbetweenvariables.Underthischoicetheonlyparameterofthekernelthat requiresto be deﬁnedis

α

.Withgrowing dimension ofthestate vector andthe numberofparticles small, thedistance betweentheparticlesgrowslarger.Toaccountforthis,

α

shouldbechosenlargertoincreasethedistancebetweenparticles. Intheexperiments,wehavefoundthat

α

shouldbechosenoftheorderofthedimensionofthestatevector.

3.4. Thestochasticoptimizationmethodsandtheirconvergence

AnoptimizationoftheKullback-Leiblerdivergenceisconductedinthemappingparticleﬁltertopushthepriorparticles tosamplesfromtheposteriordensity.Theparticles aredriven inapurelydeterministic transformationbythevariational mappingfromthepriordensitytotheposterior.Theintermediatedensitiesarerepresentedthroughparticles.Eachparticle isthenmovedalongthedirectionofthesteepestdescentoftheKullback-Leiblerdivergence,whichconsidersalltheparticle positions.

Severalstochastic gradient-basedoptimizationmethodshavebeenrecentlydevelopedinthemachinelearning commu-nity[30,11] whicharedirectlyapplicabletothemappingparticlefilter.Theyarefocusedonefficientfirst-orderconvergence inhigh-dimensionalcontrolspaceswhenthecostfunctionisnoisy.Thesourceofthisnoiseissubsampling.These optimiza-tionmethods,andtheirsuccessful convergencerates,havebeeninstrumentalforthesuccessofdeeplearningapplications (e.g.[12]).Thevariationalmappinginthisworkisbasedonstochasticgradient-basedoptimizationmethods.

The gradientdescent methodhasa fixedlearning rateforall theiterations andall thecontrol variables.This is well-knowntocausesomeinconveniences,inparticularalongextendedvalleyswherezig-zagconvergencearises.Asthegradient decreases in some directions, the convergence along those variables is very slow. The adaptative schemes search for a dynamic learningrateforeachcontrol variable.Ideally,aNewton methodwouldbetheoptimallearning rateifthe opti-mizationproblemisquadratic,butitrequiresHessianevaluationswhicharecomputationallyprohibitiveinhigh-dimensional spaces.Thelearningrateforeachcontrolvariableinthestochasticoptimizationmethod,adadelta[30],isbasedonaglobal learningrateoverarunningaverageofthepastabsolutedirectionalderivativesalongthedirectionofthecontrolvariable. Therefore,the learning rateincreasesfordirectionswithsmalldirectional derivatives.Themethod improvessubstantially butstillsomestagnationafterthefirstiterationshasbeenreported.Toovercomethisweakness,thealgorithmforstochastic optimizationadam[11] estimatesthefirstandsecondmomentsofthegradientswithbiascorrection.Thesetwostochastic optimizationalgorithms,adadeltaandadam, togetherwithadescentgradientmethodwereevaluated intheexperiments. Thechosen initiallearning rateoftheoptimizationmethodsisatradeoffbetweenconvergencespeedandconservingthe smoothnessoftheflow.Inthissense,consideringthateachparticleisfollowingastreamline,theparticlesshouldnotcross thestreamlinesofotherparticles.Thevaluesrecommendedin[30] exhibitedagoodperformance,sothatnomanualtuning ofthelearningratewasrequired.

These ﬁrst-order adadelta and adam optimization methods were chosen because the cost function was expected in principletobenoisy.ThesemethodsonlyaccountforthediagonaloftheHessianusinganadaptativestepineachvariable. Forhigher-dimensionaloptimizationproblems,theinformationofoff-diagonalelementsoftheHessianmaybeessentialfor relativelysmoothcostfunctions.Inthatcase,quasi-Newtonmethodsmaybeafeasibleoptionsincetherequirednumberof

(9)

Fig. 1.Theroot-mean-square ofthe DK L gradientasafunctionofthe mappingiterationforthedescentgradient(blueline),adadelta(redline)and adam(greenline)optimizationalgorithms(upperpanel).ThiscorrespondstotheﬁrstassimilationcycleoftheLorenz-63experimentinwhichthethree optimizationalgorithmssharethesamepriordensity.ThenumberofeffectiveparticlesasafunctionofthemappingiterationforNp=100 (lowerpanel). (Forinterpretationofthecolorsintheﬁgure(s),thereaderisreferredtothewebversionofthisarticle.)

Fig. 2.As Fig.1but for the 10-th time iteration, so that the prior density for the optimization algorithms is not the same.

particles intheﬁlterisexpectedtobe small(

<

200).Intermsoftheconvergencerate, notethattheoptimizationstepin themappingisconstrained.Itshouldberelativelysmalltoavoidthatparticlepathsintersect.

The convergence ofthe variationalmapping isshownin Fig.1 asa function ofthe mappingiteration atthe ﬁrst as-similation cycle.Atthiscycle,thethreeoptimizationmethods,descentgradient,adadeltaandadam, havethesameinitial density.UpperpanelsofFig.1showtheroot-mean-squareoftheKullback-Leiblergradient,andlowerpanelsthenumberof effectiveparticles.Theadadeltaconvergesfastest,closetoadam.Foralargernumberofiterations,thegradientof Kullback-Leiblerdivergenceisstilldiminishingforadambutitshowsalmostnochangeafter25iterationsforadadelta.However,the numberofeffectivesamplesdoesnotdifferbetweenthem,40iterationsarerequiredtoachieve themaximum–98%ofthe totalnumberofparticles.Thedescentgradientrequiresabout200iterationstoachievethisnumberofeffectiveparticles.

Fig. 2 showstheconvergence atthe 10th assimilation cycleofthe optimizationmethods for evaluating therecursive effects. Notably, adadelta has the fastest convergence with the minimum root-mean-square gradient similar to the one obtained in the first iteration. On the other hand, adam converges slowly in the first mapping iterations and starts to increasetheconvergencerateafter20iterations(aneffectthatislikelycausedbythemomentumequationsinadamwhich requiresome iterations to“warmup”).Adadelta hasreachedthemaximumnumberofeffectiveparticlesin50iterations. Becauseofthefastconvergenceandbecausetheeffectiveparticlenumbersisthemostrelevantparameterfortheparticle filterwetookadadeltawith50iterationsandaninitiallearningrateof0.03asthedefaultsettingfortheexperiments.

There isaclearcorrelation betweenthe decreaseoftheroot-mean-squareofthe Kullback-Leiblergradientandthe in-creaseinthenumberofeffectiveparticlesinFig.2forthethreestochasticoptimizationmethods.Thisimpliesasmentioned thatbothmeasurescouldbeusedtodeterminetheconvergenceintheoptimizationandtherequirednumberofiterations foragiventhreshold.Alongthisline,wenotethattheKullback-Leiblerdivergencecanbedirectlyexpressedintermsofthe weights.

(10)

Fig. 3.TheGaussianpriordensity(blueline)andthebimodalposteriordensity(redline),bluedotsarethesamplesfromthepriordensity(forvisibility hasbeenlocatedatp=0.05)andreddotsshowtheresultofapplyingthemappingparticlefilterinasingleobservationcycle(leftpanel).Thepathofthe particlesasafunctionofpseudotimeλ(rightpanel),atλ=0 theyarethesamplesofthepriordensityandatλ=1 istheresultofthefilterwhenthe convergencecriteriaaresatisfied.

Fig. 4.MarginalizedpriordensityforeachvariableofLorenz-63systemdeterminedatthe115-thtimeiteration,k=115 (blueline),theﬁnaldensityafter thevariationalmappings(redline)andthemarginalizedsequentialposterior“target”distribution(black).

UsingMonteCarlointegrationoftheKullback-Leiblerdivergence,

D

K L

(q

p)

=

1 Np Np

j=1 log

q(x(_kj)

)

p(x(_kj)

|

y1:k

)

,

(34)

fromEq.(A.1),weﬁnd

D

K L

(q

p)

= −

1 Np Np

j=1 log

w_k(j)Np

.

(35)

As expected ifthe weigths are equally distributedin the particles, the Kullback-Leibler divergenceis zero, then q

(

xk

)

≈

p

(

xk

|

y1:k

)

.Inother words,asthe numberofeffectiveparticles Ne f f tends to Np,theKullback-Leiblerdivergencetendsto zero,itsminimum.

4. Results

A ﬁrstsimple experiment illustrates the potential ofthe method ina non-Gaussian inverse problem. Suppose a one-dimensionalspacewithGaussianapriori andGaussianobservationalerror.Theobservationaloperatoristhemoduleofthe state,i.e.

H

(

x

)

= |

x

|

.Inthiscase, theresultingposteriordensityisabimodaldistribution.We applythemappingparticle ﬁlterto assimilate a single observation. Theexperiment could representan instrumentthat measures thespeed andthe statevariableisthevelocity. Fig.3ashowsthepriorandposteriordistribution.Blue dotscorrespondstotheprior density sampleswhile reddots representthedots afterapplicationof themappingparticle ﬁlter.Fig.3bshowsthepaths ofthe particlesasa functionofpseudo-time

λ

.Theparticles followingthepaths ofminimumtransportcost ﬂownicelytoward thebimodaldistribution.

Fig. 4shows the marginalized prior densityin the experimentwith theLorenz-63 systemas a function ofthe three statevariablesatthe115-thtimeassimilationcycle,k

=

115 (eachpanelexhibitsavariable).Forplotting,themarginalized densityisdeterminedwithkerneldensityestimation.The cyclewas chosen sothatthe differencesbetweenthemapping particleﬁlter(MPF)andthesamplingimportanceresampling(SIR) ﬁlterareemphasized (i.e.acycleforwhichthea pri-ori density isnot agood representationoftheposterior density).Theexperimentuses 100particles andthefull stateis

(11)

Fig. 5.Marginalizedposteriordensity(contours)andtheparticlessamplingthepriordensityatthe115-thtimeiterationforthethreesectionsatthemode location(upperpanels).Theparticlessamplingtheﬁnaldensityafterthevariationalmappingwith50iterations(lowerpanels).

observed. Becauseparticles arespreadin andout the highobservationlikelihood region,theproposaldensityis broader, asymmetric compared withthemarginalized sequentialposteriordensity, Eq.(30). Afterthevariationalmapping,the re-sultingdensityisacloserepresentationofthesequentialposteriordensity,thetargetdensityforthemapping,inthethree variables(seethethreepanelsinFig.4).

The distributionoftheparticlescanbeseen inFig. 5.UpperpanelsofFig.5showtheparticles belongingtotheprior density for each variable, the initial densityfor the mapping. Contours show the posterior density, marginalized to the 2-dimensional plane.Lowerpanelsshow theﬁnal locationsofthesamplesafter50mapping iterations.The mappinghas pushedmostparticlestowardsthehighprobabilityregionoftheposteriordensity.Notethatthealgorithmisnotcollapsing theparticlestowardthemodeoftheposterior,butrepresentingthedensitywiththesamplesasawhole.Thediversityof theparticlesproducedbythemappingisnotable.

ThesameexperimentconductedwiththestandardSIR,orbootstrapﬁlterusingresamplingwhenNe f f

<

Np

/

2 isshown in Fig. 6.The sampling deﬁcienciesin the SIR ﬁlterare visible in Fig.6 where the particles ofthe prior density (upper panels) andthe resampleddensity(lower panels)are shownwithdots. Because theresamplingconserves andreplicates statisticallyonlytheparticleswithhighlikelihood,thereisanevidentsampleimpoverishmenteveninthislow-dimensional experiment.

Because ofthe goodsampling, theMPF exhibitsavery weak sensitivitytothe numberofparticles inthe root-mean-square error(RMSE) metrics.Fig.7showsthe RMSEoftheanalysismeanwithrespecttothetruestate, forMPF andSIR filtersfor100, 20and5particles (Fig.7a,bandcrespectively).Evenforalargenumberofparticles(withrespect tothe statedimension),theMPFoutperformsthemeanestimationwithrespecttotheSIRfilter.SIRfilterdivergesfor5particles. Fig. 7a, bandc showsthat the time-meanRMSE forthe MPFpractically doesnot change between100particles witha time-meanRMSEof0.482to5particleswhosetime-meanRMSEis0.489.Thus,theperformanceoftheMPFisquiterobust, onlysmallchangesintheRMSEarefoundwhendecreasingthenumberofparticles.

Note thecomplexity ofAlgorithm1 isproportional to N2

p whileitis proportional to Np forthe SIRfilter, sothat the computationalcostoftheassimilationstageishigherfortheMPFwiththesamenumberofparticles.However,itrequires asmallernumberofparticlessothatitrepresentsapromisingvenueforcomputationallyexpensivedynamicalmodels.The SIRfiltercouldusemoreparticlesatthesamecomputationalcost.Howeveragenerallyapplicablecomparisonofthefilters andconclusionofthiskindisnot possiblebecauseitdependsonthecomputational costofthedynamical model

M

that integrates thestate fromtime k

−

1 tok.Inparticular, one ofthepotential applicationsforwhich we envisagethe MPF, geophysical systems,typically useextremely computationally demanding models. Thus, only a smallnumber (

<

100) of modelintegrations–particles–areaffordable.TheoverallcomputationalcostoftheMPFforsuchasystemwouldbesimilar to3Dvariationalassimilation(i.e.MAPestimation)foreachparticle.

Athirdsetofexperimentswas conductedusingacompartmentstochastic dynamicalmodelforcholera dynamics[10]. Fig. 8a shows the truemortality time series andthe one estimatedby the MPF. The experiment started withan initial

(12)

Fig. 6.As in Fig.5for the SIR particle ﬁlter.

Fig. 7.RMSE for the MPF and the SIR ﬁlters as a function of the number of particles. Panels (a) 100, (b) 20, and (c) 5 particles.

meanstatewhichunderestimatesthenumberoftruesusceptibleindividualsby20%.Agoodperformance isfoundforthe MPF(total RMSEis0.0031). TheSIR filterfollowsthecholera outbreaksbuttheestimate ismuchnoisierwhenthesame level ofnoise is usedasfor theMPF (total RMSEis 0.0039). By decreasingthe levelof noise inthe SIR filter,the noise diminishesbutattheexpenseofastrongunderestimationofthepeaks(totalRMSEis0.0077).Overalltheperformanceof themappingparticle filterisexcellent, itcan handlevery well thispartial observationconfiguration,i.e.5state variables andoneobservedvariable,andthenonadditivemodelerror.

ToevaluatetheperformanceoftheMPFfora largerstate-spacemodel,anexperimentusingthe40-variableLorenz-96 dynamicalsystemwasconducted.Anensembleof20particleswasusedintheMPFexperiment.Scalablesamplingmethods are expectedto produceusefulresults fortheseexperiments inwhich Np

<

Nx. Fig.9a compares theresulting RMSEas a function oftime given by theMPF experimentand two stochastic EnKFexperiments one using20 members(inflation factor1.1)andonewith100members(inflationfactor1.05).TheSIRfilterisnotshownsinceitexhibitsdegeneracyinthe Lorenz-96systembecauseofthehighdimensionalstateandobservationspaces.TheMPFtakesalongertimetoconverge (20cycles),buttheEnKFwiththesamenumberofmembers(20)showsalargerRMSE.Ontheotherhand,theEnKFwith 100members presentsrathersimilar RMSEto theMPF experimentwith20 particles.Furthermore,the mappingparticle filterestimatesare ratherstablein time,which isa resultofits deterministic mappings.The spreadisrather stableand comparabletotheRMSEinbothfilters(Fig.9b).

An experiment to evaluate the performance of MPF for a partially observed Lorenz-96 system (40 variables and 20 observations) was conducted. The settings are similar to the previous experiment butnow the observations are sparse. Fig.10ashowstheRMSEgivenbytheMPFandthestochasticEnKF.TheMPFoutperformstheEnKFintermsoftheRMSE metrics.Fig.10bshowsthespreadoftheparticlesforbothﬁlters.Nolocalizationisusedintheﬁlters.Thisisnotbyfaran

(13)

Fig. 8.Mortalitytimeseriesforamodelofcholeradynamics,truemortalityandtheoneestimatedwiththeMPF(a),withtheSIRﬁlterwithhighadditive noise(b)andwiththeSIRﬁlterwithlowadditivenoise(c).

Fig. 9.RMSE (a) and spread (b) of the state as a function of the assimilation cycles produced by EnKF and MPF in the 40-variables Lorenz-96 system.

Fig. 10.RMSE(a)andspread(b)ofthestateasafunctionoftheassimilationcyclesproducedbyEnKFandMPFinthe40-variablesLorenz-96systemwith 20observations.Bothﬁltersuse20particles.

exhaustivecomparison,theonlypurposeistoshowthattheMPFcangivepromisingresultsinrelativelylargedimensional spacesinwhichtheSIRﬁlterdoesnotwork.Furtherexperimentsandmoreextensivecomparisonsarerequiredtoevaluate theoverallperformanceofMPFinmedianandhigher-dimensionalsystems(

>

100 statespacedimensions),thesegobeyond thepresentworkwhichisfocused ontheintroductionanddevelopmentofthetechnique.

The radialbasisfunctionsusedaskernelrequiretosetthebandwidth,whichinthisworkwas chosenas

A

=

α

Q.The uknown parameter

α

dependson the number of particles andthe dimensions ofthe problem so that it requires some tuning.Theoptimalparameterwillbetheonethatproducesthesamplespreadequaltothespreadofthesequential poste-riordensity.Thebandwidthparameter

α

usedintheLorenz-63experimentwith20particlesis

α

=

1.Fortheexperiment with 100particles we use

α

=

0

.

5. Fig. 11 showsexperiments forthe 40 variables Lorenz-96 systemin which we vary the bandwidthfrom

α

=

2 to

α

=

100.The RMSEshowsonlyslightchangeswithsimilarvaluesinthethreeexperiments (Fig. 11a). The RMSEmeasure for theMPF appears ratherrobust to the kernelbandwidth. Fig.11b showsthe spreadof the experiments for different

α

values. A choice of

α

=

20 appears to give an optimal spread of the particles for this experiment.

(14)

Fig. 11.(a)RMSEfortheMPFintheLorenz-96experimentsusingabandwidthofα=2,20,100.(b)Thecorrespondingspreadoftheparticlesofthe posteriordensity.

5. Discussion and conclusions

Theproposedmappingparticleﬁlterisbasedonadeterministicgradientﬂow.Itisabletokeepalargeeffectivesample size,avoidingtheresamplingstep,evenforlongrecursionssubjecttotheconvergenceofthemappingsequences. Further-more,itshowsarobustbehaviorintermsoftheRMSEwitha consistentensemblespreadintheconductedexperiments, usingonlyasmallnumberofparticles.

Inthelimit ofasingle particle,Np

=

1,thegradient oftheKullback-Leiblerdivergenceisequaltothe negativeofthe gradientofthelog-posteriordensity(seeEq.(18)).Inthatcase,theoptimizationviagradientdescentdeterminesthemode oftheposteriordensity.Therefore,themethodforNp

=

1 is equivalenttothree-dimensionalvariationaldataassimilation (3D-Var)with

Q

intheroleof

B

.

The MPF requires theadjoint of the observational operator to evaluate the gradient ofthe posterior density(see Eq. (32)).AnextensionoftheMPFwas proposedin[22] whichrepresentstheobservationaloperatorintheRKHS,andinthis wayitavoidstheneedoftheJacobianoftheobservationaloperator.TheexperimentsshowthatthisextensionoftheMPF retainsthemainnon-Gaussianpropertiesoftheposteriordensity.

Themappingofsamplesinthevariationalmappingparticleﬁlterisproducedbymovingtheparticleswherethey max-imizetheinformationgainoftheproposaldensitywithrespecttothesequentialposteriordensity,viatheKullback-Leibler divergence.In these terms,the proposed mapping particle ﬁlterseeks to maximize the amount ofinformation available in a complexhigh-dimensional posteriordensity usinga stochastic optimization methodand givena limited numberof particles.

Acknowledgements

ThisworkhasbeenfoundedbytheEuropeanResearchCouncilviatheCUNDAprojectnumber694509underthe Euro-peanUnionHorizon2020programme.

Appendix A. Importance sampling and the mapping particle ﬁlter

The variational mappingsampling schemeis suitable to be coupled withimportance sampling. The last intermediate densityobtainedwiththevariationalmappingsamplingschemecanbeusedasaproposaldensityoftheposteriordensity. Supposewe denotetheparticles obtainedinthelast mappingiterationby

x

_k(1:N)

=

.

x(1_k,I:N).The valuesoftheproposaland posteriordensitiesattheparticlepositionsareusedtoestimatetheweights

˜

w(_kj)

=

p(yk

|

x (j) k

)p(

x (j) k

|

y1:k−1

)

q(x(_kj)

)

,

(A.1)

andthentheyarenormalizedwithw(_kj)

= ˜

w(_kj)

/

N_jpw

˜

(j)_k .TheeffectiveensemblesizeisgivenbyNe f f

=

1

/

j

(

w (j) k

)

2_. Sincetheoutputoftheﬁlterisasetofweightedsamples,thesequentialposteriordensity(e.g.Eq.(30))hastoinclude theweightsofthepreviouscycle,

p(x(j)_k

|

y1:k

)

∝

p(yk

|

x(_kj)

)

l

w_k(l)₋₁p(x_k(j)

|

x(l)_k₋₁

).

(A.2)

Tocompute theweights between theposterior densityand theproposal density–ﬁnal density ofthe MPF–,we also needto evaluatetheproposaldensityattheparticlelocations. Werequiretheproposaldensitytobe known(apart from

(15)

Fig. A.12.(a)EvolutionoftheweightsofalltheparticlesfortheLorenz-63systeminalongtimesequenceforI=50 andwithoutthreshold.Becausethe weightsofalltheparticlesareplotted,thelastparticleplottedisthemostvisibleone.(b)Thevarianceoftheweightsasafunctionoftime.(c)Number ofeffectiveparticles.

the set of samples that we have fromthat density). There are two ways to obtain it. One wayis to usekernel density estimationwiththesetofsamples,

x

(1_k:N),usingthesamekernelsasinthemapping.Thesecondoptionistodetermineat eachmappingiterationhowthedensityistransformedwiththemapping(e.g.[19]).

Before themapping,wechooseasinitialintermediate density,forinstance,anequallyweighted Gaussianmixture cen-teredat

M

(

x(i)_k₋₁

)

andwithcovariances

Q

,

q(xk,0

|

y1:k

)

∝

Np

i=1 exp

−

1 2

xk,0

−

M

(

x (i) k−1

)

2 Q

.

(A.3)

Inthefollowingmappingiterationsi

>

0,thedensitychangesaccordingtothemappingT, qT

(

x(_k,ij)

)

=

q(x(_k,ij)₋₁

)

det

∇

xT

=

q(x(_k,ij)₋₁

)

|

I

−

H

|

,

(A.4)

where

|

I

−

H

|

isthedeterminantJacobianofthetransformationand

H

istheHessianoftheKullback-Leiblerdivergence. FromEq.(18),theHessianisgivenby

H

=

1 Np Np

l=1

∇

_x(j) k,i−1 K(x_k,i(l)₋₁

,

x(j)_k,i₋₁

)

∇

)

+ ∇

_x(j) k,i−1

∇

x (l) k,i−1 K

(

x(l)_k,i₋₁

,

x(_k,ij)₋₁

)

.

(A.5)

ThecalculationoftheHessianoftheKullback-Leiblerdivergence,Eq.(A.5) iscomputationallydemanding.Althoughnote thatthesecondderivativesareknownanalytically forstandardkernelfunctionsandthesymmetryofthekernelcanbeused to reduce thecalculations.The Hessiancalculation isrequiredateachmapping iterationtoevolve q_k,i(j) fromtheprevious intermediatedensityq(j)_k,i₋₁.

Theweightsaccountforthebiasintroducedintheﬁlterwhenthesetofsamplesobtainedbythesequenceofmappings is not exactly distributed according to the target density. However, it should be kept in mind that our estimate of the posteriordensityEq.(A.2) isnotexact,soitisunclearwhat thecalculatedweights actuallymean.Itwouldbeinteresting todeterminetheaccuracyofthisestimate,butthatisbeyondthescopeofthiswork.

Theweightsarealsousefulasadiagnosticofthequalityoftheconvergenceoftheoptimizationmethod.Inthatcase,an evaluationoftheweightsforthecurrentintermediatedensitycouldbeconductedineachmappingiteration.Thisallowsto check inthealgorithmifthegivenweights givean effectivesample sizewhichisoverthe thresholdno furthermapping iterations arerequired.Intheconductedexperimentsofthiswork,ingeneral,theeffectivesamplesizewasover98%after 50mappingiterationsduringthewholerecursion.Inotherwords,ifthenumberofmappingiterationsislargeenough

≥

50, theweightswillbealmostequal.Inhigh-dimensionalsystems,aconvergencecriterionbasedonthemoduleofthegradient ofKulback-Leibler divergencecanbe usedinsteadoftheeffectivesample size.ThisavoidsthecalculationsoftheHessian orkerneldensityestimations.The convergenceofthetechnique isevaluatedthrough experimentswithbothmeasures in Section3.4.

The performance of the mapping particle ﬁltercan be evaluated through the weights of the particles. As a function of the mapping iterations of the ﬁlter,the sequence of mappings is expectedto start with a set of particles withmost of the particles with almost zero weights and a few particles with very large weights. Note from Eq. (35) that parti-cles withalmost zero weights produce a largeKullback-Leibler divergence.Then, asthe particles are pushed toward the posterior density, their weights will tend to be equally distributedfor mostof the particles.If the variationalmappings work effectively,onlyafewparticlesshouldremainwithlowweights. Inother words,thevarianceoftheweightsshould

(16)

Fig. A.13.Idem Fig.A.12forI=100 and without threshold.

be smallfor all the cycles. Fig. A.12a shows the evolution ofthe weights of all the particles as a function of time for I

=

50 andwithout thresholdforthe Lorenz-63system. Most ofthe particleweights are concentrated around 0.05,in a rangebetween0.04 and 0.07 (Np

=

20 particles).There are some cyclesin whichone particleobtains a low weight but in thefollowing cyclethe weight ofthat particle isagain within theequally-weighted range.Note that we didnot per-form anyresampling. The variance of the weights is around 1

.

25 10−4 (Fig. A.12b). The numberof effectiveparticles is ingeneralabout19(from atotalof 20particles),only inafew cyclesit descendsdown to16(Fig. A.12c).Afurther ex-perimentwas conductedinwhichthemaximumnumberofiterationswas increasedto100. Inthatcase, alltheparticles remain with weights larger than 0.04 during the whole time sequence andthe peaks of variance of the weights found in Fig. A.12b are not present (see Fig. A.13). The numberof effective particles is in this caseover 18 for all the cycles (Fig.A.13c).

References

[1]S.Angenent,S.Haker,A.Tannenbaum,MinimizingﬂowsfortheMonge–Kantorovichproblem,SIAMJ.Math.Anal.35(2003)61–97.

[2]M.S.Arulampalam,S.Maskell,N.Gordon,T.Clapp,Atutorialonparticleﬁltersforonlinenonlinear/non-GaussianBayesiantracking,IEEETrans.Signal Process.50(2002)174–188.

[3]E.Atkins,M.Morzfeld,A.J.Chorin,Implicitparticlemethodsandtheirconnectionwithvariationaldataassimilation,Mon.WeatherRev.141(2013) 1786–1803.

[4]P.Bunch,S.Godsill,ApproximationsoftheoptimalimportancedensityusingGaussianparticleﬂowimportancesampling,J.Am.Stat.Assoc.111(2016) 748–762.

[5]O.Cappé,E.Moulines,T.Rydon,InferenceinHiddenMarkovModels,SpringerScience+BusinessMedia,NewYork,NY,2005.

[6]Y.Cheng,S.Reich,Assimilatingdataintoscientiﬁcmodels:anoptimalcouplingperspective,in:NonlinearDataAssimilation,Springer,2015,pp. 75–118. [7]A.J.Chorin,X.Tu,Implicitsamplingforparticleﬁlters,Proc.Natl.Acad.Sci.USA106(2009)17249–17254.

[8]F.Daum,J.Huang,Nonlinearﬁlterswithlog-homotopy,in:SignalandDataProcessingofSmallTargets2007,vol. 6699,2007,p. 669918. [9]A.Doucet,S.Godsill,C.Andrieu,OnsequentialMonteCarlosamplingmethodsforBayesianﬁltering,Stat.Comput.10(2000)197–208. [10]E.L.Ionides,C.Bretó,A.A.King,Inferencefornonlineardynamicalsystems,Proc.Natl.Acad.Sci.USA103(2006)18438–18443.

[11]D.Kingma,J.Ba,Adam:amethodforstochasticoptimization,in:Int.Conf.onLearningRepres(ICLR),2015,arXivpreprintarXiv:1412.6980. [12]Y.LeCun,Y.Bengio,G.Hinton,Deeplearning,Nature521(2015)436–444.

[13]Y.Li,M.Coates,Particleﬁlteringwithinvertibleparticleﬂow,IEEETrans.SignalProcess.65(2017)4102–4116.

[14]Q.Liu,D.Wang,Steinvariationalgradientdescent:ageneralpurposeBayesianinferencealgorithm,in:AdvancesinNeuralInformationProcessing Systems,2016,pp. 2378–2386.

[15]J.Liu,M.West,Combinedparameterandstateestimationinsimulation-basedﬁltering,in:SequentialMonteCarloMethodsinPractice,Springer,New York,2001,pp. 197–223.

[16]R.M.Neal,Samplingfrommultimodaldistributionsusingtemperedtransitions,Stat.Comput.6(1996)353–366.

[17]Y. Marzouk, T.Moselhy,M.Parno,A.Spantini,An introductiontosamplingviameasuretransport,in: R.Ghanem,D. Higdon,H.Owhadi (Eds.), HandbookofUncertaintyQuantiﬁcation,Springer,2017,inpress,arXiv:1602.05023.

[18]R.McCann,Existenceanduniquenessofmonotonemeasure-preservingmaps,DukeMath.J.80(1995)309–323. [19]T.A.Moselhy,Y.M.Marzouk,Bayesianinferencewithoptimalmaps,J.Comput.Phys.231(2012)7815–7850.

[20] M.Pulido,O.Rosso,Modelselection:usinginformationmeasuresfromordinalsymbolicanalysistoselectmodelsub-gridscaleparameterizations, J. Atmos.Sci.74(2017)3253–3269,https://doi.org/10.1175/JAS-D-16-0340.1.

[21]M.Pulido,P.Tandeo,M.Bocquet,A.Carrassi,M.Lucini,Parameterestimationinstochasticmulti-scaledynamicalsystemsusingmaximumlikelihood methods,Tellus70(2018)1442099.

[22]M.Pulido,P.J.vanLeewen,D.J.Posselt,Kernelembeddednonlinearobservationalmappingsinthevariationalmappingparticleﬁlter,in:Computational Science-ICCS2019,in:LectureNotesinComputerScience,vol. 11539,2019,arXiv:1901.10426,2019.

[23]S.Reich,AnonparametricensembletransformmethodforBayesianinference,SIAMJ.Sci.Comput.35(2013)A2013–A2024. [24]B.Scholkopf,A.J.Smola,LearningWithKernels:SupportVectorMachines,Regularization,Optimization,andBeyond,MITPress,2002. [25]E.G.Tabak,E.Vanden-Eijnden,Densityestimationbydualascentofthelog-likelihood,Commun.Math.Sci.8(2010)217–233.

[26]P.J.vanLeeuwen,Nonlineardataassimilationingeosciences:anextremelyeﬃcientparticleﬁlter,Q.J.R.Meteorol.Soc.136(2010)1991–1999. [27]P.J.vanLeeuwen,Nonlineardataassimilationforhigh-dimensionalsystems,in:NonlinearDataAssimilation,Springer,2015,pp. 1–73.

[28] P.J.vanLeeuwen,H.R.Künsch,L.Nerger,R.Potthast,S.Reich,Particleﬁltersforhigh-dimensionalgeoscienceapplications:areview,Q.J.R.Meteorol. Soc.(2019),inpress,https://doi.org/10.1002/qj.3551.

[29]C.Villani,OptimalTransport:OldandNew,vol. 338,SpringerScience&BusinessMedia,2008. [30]M.D.Zeiler,ADADELTA:anadaptivelearningratemethod,arXiv:1212.5701 [cs.LG],2012.

Journal of Computational Physics 396 (2019) 400–415

ScienceDirect

www.elsevier.com/locate/jcp

(http://creativecommons.org/licenses/by/4.0/

61–97.

174–188.

1786–1803.

748–762.

2005.

2015,

17249–17254.

2007,

197–208.

18438–18443.

2015,

436–444.

4102–4116.

2016,

2001,

353–366.

2017,

309–323.

7815–7850.

https://doi

1442099.

2019.

35

2002.

217–233.

1991–1999.

2015,

https://doi

2008.

2012.

1904–1919.