On the convergence of two sequential Monte Carlo methods for maximum a posteriori sequence estimation and stochastic global optimization


DOI 10.1007/s11222-011-9294-4


Joaquín Míguez · Dan Crisan · Petar M. Djurić

Received: 28 January 2011 / Accepted: 29 September 2011 © Springer Science+Business Media, LLC 2011

Abstract This paper addresses the problem of maximum a posteriori (MAP) sequence estimation in general state-space models. We consider two algorithms based on the sequential Monte Carlo (SMC) methodology (also known as particle filtering). We prove that they produce approximations of the MAP estimator and that they converge almost surely. We also derive a lower bound for the number of particles that are needed to achieve a given approximation accuracy. In the last part of the paper, we investigate the application of particle filtering and MAP estimation to the global optimization of a class of (possibly non-convex and possibly non-differentiable) cost functions. In particular, we show how to convert the cost-minimization problem into one of MAP sequence estimation for a state-space model that is "matched" to the cost of interest. We provide examples that illustrate the application of the methodology as well as numerical results.

Keywords Sequential Monte Carlo · MAP sequence estimation · Convergence of particle filters · State space models · Global optimization

J. Míguez (✉)
Department of Signal Theory & Communications, Universidad Carlos III de Madrid, Madrid, Spain
e-mail: [email protected]

D. Crisan
Department of Mathematics, Imperial College London, London, UK
e-mail: [email protected]

P.M. Djurić
Department of Electrical & Computer Engineering, Stony Brook University, New York, USA
e-mail: [email protected]

1 Introduction

State-space stochastic models are useful in representing a multitude of problems appearing in different scientific and engineering fields. They involve a random sequence $\{X_t\}_{t \ge 0}$, representing the unobserved "state" of the system, and a sequence of related observations $\{Y_t\}_{t \ge 1}$. The goal is to perform inference on the states using the observed data. When the model can be described by linear equations with Gaussian perturbations, the Kalman filter (Kalman 1960) provides an exact solution for the (Gaussian) probability distribution of $X_t$ given the observations $Y_1, \ldots, Y_t$. However, the analytical intractability of general state-space models (nonlinear and/or non-Gaussian) has motivated a great amount of work on approximation techniques, including variations on the Kalman filter (Anderson and Moore 1979; Julier and Uhlmann 2004) and the family of sequential Monte Carlo (SMC) methods, also known as particle filters (Gordon et al. 1993; Liu and Chen 1998; Doucet et al. 2000, 2001b; Ristic et al. 2004). Particle filters generate discrete random measures that can be naturally used to approximate integrals with respect to the posterior distribution of $X_t$ given the data (including, e.g., its mean). A number of theoretical results ensure the convergence of such approximations; see Del Moral (2004), Legland and Oudjane (2004), Bain and Crisan (2008), Heine and Crisan (2008), Hu et al. (2008) and the references therein.

In this paper, we investigate the use of SMC algorithms for the approximation of maximum a posteriori (MAP) sequence estimates, i.e., we study the problem of finding the sequence of state values $X_k = x_k$, $k = 0, 1, \ldots, t$, that presents the highest probability density conditional on the available data $Y_k = y_k$, $k = 1, 2, \ldots, t$. We do not claim […] of the posterior distribution of the state $X_k$, which minimizes the expected value of a quadratic cost function and, hence, is often referred to as the minimum mean square error (MMSE) estimator. In comparison, the MAP estimator results from the minimization of the posterior expectation of a 0-1 cost function (Robert 2007). The adoption of the latter criterion is natural in some applications. Consider, for example, a system that yields a multimodal posterior probability distribution. In such a case, the MMSE estimate of $X_k$ may lie in a low-density region and may therefore be misleading about the actual state of the system, while the MAP criterion produces an estimate located in a high-probability region of the state space. Such scenarios often appear in (multi-)target tracking problems (Bar-Shalom and Blair 2000).

A limited number of methods for MAP sequence estimation using particle filtering exist. The straightforward approach consists of a linear search for the particle with the highest posterior density. In Godsill et al. (2001), it is suggested to use the collection of particles at times $t = 1, 2, \ldots$ to build a trellis representation of the state space. Then, it is possible to run a Viterbi algorithm (Forney 1973) to find the path in the trellis with the highest posterior density. This method has become standard and has found some applications in engineering; see, e.g., Nyblom et al. (2008). However, its computational complexity grows with $N^2$, where $N$ is the number of particles generated by the SMC algorithm, and it can be prohibitive in some applications. In Klaas and Lang (2005), it is suggested to use tree search procedures to achieve a computationally more efficient implementation. More recently, in Saha et al. (2009), it has been proposed to perform marginal MAP estimation of the state $X_t$ by obtaining an approximation of the filtering density and then using standard optimization techniques to compute its maximum.

It is important to point out that MAP estimates cannot be computed as integrals with respect to the posterior distribution of $X_t$. As a consequence, the classical convergence results for particle filters in Del Moral and Miclo (2000), Del Moral (2004), Bain and Crisan (2008) or Legland and Oudjane (2004) do not guarantee the convergence of the approximate MAP estimates produced by the algorithm in Godsill et al. (2001).

In this paper, we address a formal analysis of the convergence of the MAP estimates computed using SMC algorithms. In particular, we consider two methods based on the standard sequential importance resampling (SIR) algorithm (Gordon et al. 1993) (see also Doucet et al. 2001a). The first one involves a direct search over the sample paths in the space of $\{X_0, \ldots, X_t\}$ generated by the SIR algorithm and has a complexity that grows linearly with the number of sample paths. The second one is the algorithm of Godsill et al. (2001), which performs a trellis search over an extended grid of paths using the Viterbi algorithm and has a quadratic complexity. Both search procedures can be implemented sequentially and together with the SIR method. Our analysis includes:

(a) The derivation of explicit convergence rates for the $L_p$ errors (with arbitrary integer $p$) in the approximation of integrals with respect to the joint posterior distribution of $X_0, \ldots, X_t$ given a fixed record of observations. A similar result was originally introduced in Del Moral et al. (2001) for a general class of interacting particle systems. However, the conditions assumed here are minimal and easily verified for the class of state-space models of interest in this paper.

(b) A proof, based on the rates for the $L_p$ errors with $p \ge 4$, of the almost sure convergence of the MAP sequence estimates produced by the SIR algorithm, both with the simple direct search over the sample paths and using the Viterbi algorithm on a trellis grid.

(c) Lower bounds on the number of particles (sample paths) that are needed to achieve a prescribed accuracy with the MAP estimator.

Another contribution of this work is to show how the MAP sequence estimation algorithms can be used as tools for the global optimization of a broad class of objective functions. We identify a family of cost functions that admit a certain recursive decomposition and describe how it is possible to design state-space models that are "matched" to the cost, meaning that (a) the unknowns of the cost function are assimilated to the random sequence of states in the model and (b) the maxima of the posterior probability density function (pdf) of the state-space model coincide with the minima of the cost function. With this reformulation of the problem, SMC algorithms can be used to produce, in a natural way, a random grid in the space of the unknowns that is dense in the regions where the cost is low and sparse elsewhere. We illustrate this approach by way of two examples, including a typical global optimization problem, the Neumaier 3 problem (Ali et al. 2005), and the design of cross-talk cancellation acoustic filters by a minimax optimization criterion (Rao et al. 2007). We present computer simulation results for the Neumaier 3 problem that illustrate the advantage of using random, instead of deterministic, grids for the search of solutions.


The rest of this paper is organized as follows. A brief survey of the basic notations is presented in Sect. 2. The problem of MAP sequence estimation for state-space models is formally stated in Sect. 3. In Sect. 4, we explicitly describe the algorithms to be analyzed. Our main results on the convergence of the SIR algorithm and the MAP estimation algorithms designed around it are introduced in Sect. 5. An application in the context of target tracking is given in Sect. 6. The application of the MAP sequence estimation tools for the global minimization of cost functions is introduced in Sect. 7. The paper concludes with a brief summary in Sect. 8.

2 Notation

Random variables (possibly vector valued) and their realizations are represented by the same letter in upper and lower case, respectively, e.g., the random variable $X$ and its realization $X = x$. Random sequences are denoted as $\{X_t\}_{t \in \mathbb{N}}$.

We use $\mathbb{R}^d$, with integer $d \ge 1$, to denote the set of $d$-dimensional vectors with real entries. The Borel $\sigma$-algebra in $\mathbb{R}^d$ is indicated as $\mathcal{B}(\mathbb{R}^d)$. The set of bounded real functions over $\mathbb{R}^d$ is denoted as $B(\mathbb{R}^d)$.

Pdf's are indicated by the letter $\pi$. This is an argument-wise notation, hence for the random variables $X$ and $Y$, $\pi(x)$ signifies the density of $X$, possibly different from $\pi(y)$, which represents the pdf of $Y$. The integral of a function $f(x)$ with respect to a measure with density $\pi(x)$ is denoted by the shorthand $(f, \pi) \triangleq \int f(x)\pi(x)\,dx$.

For a discrete-time sequence $x_0, x_1, \ldots, x_t, \ldots$, the shorthand $x_{t_1:t_2} = \{x_{t_1}, x_{t_1+1}, \ldots, x_{t_2}\}$ denotes the subsequence from time $t_1$ up to time $t_2$.

3 Problem statement

Let $\{X_t\}_{t \ge 0}$ and $\{Y_t\}_{t > 0}$ be discrete-time stochastic processes that take values in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$, respectively. The common probability measure for the pair $(X_t, Y_t)$ is denoted as $\mathbb{P}$ and assumed to be absolutely continuous with respect to the Lebesgue measure. We refer to $\{X_t\}_{t \ge 0}$ as the "state" or "signal" process, while $\{Y_t\}_{t > 0}$ is termed the "observation" or "measurement" process.

For $t = 0$, the random variable $X_0$ has a pdf with respect to the Lebesgue measure in $\mathbb{R}^{d_x}$, which we denote as $\pi(x_0)$, and, for $t > 0$, the process evolves according to the conditional probability law

$$\mathbb{P}\{X_t \in A \mid x_{0:t-1}\} = \int_A \pi(x_t \mid x_{0:t-1})\,dx_t,$$

where $\pi(x_t \mid x_{0:t-1})$ denotes the pdf, with respect to the Lebesgue measure, of $X_t$ given $X_{0:t-1} = x_{0:t-1}$ and $A$ is any Borel subset of $\mathbb{R}^{d_x}$, i.e., $A \in \mathcal{B}(\mathbb{R}^{d_x})$.

The observation process, $\{Y_t\}_{t > 0}$, follows the conditional probability law

$$\mathbb{P}\{Y_t \in A' \mid x_{0:t}, y_{1:t-1}\} = \int_{A'} \pi(y_t \mid x_{0:t}, y_{1:t-1})\,dy_t,$$

where $\pi(y_t \mid x_{0:t}, y_{1:t-1})$ denotes the conditional pdf of $Y_t$ given $X_{0:t} = x_{0:t}$ and $Y_{1:t-1} = y_{1:t-1}$, again with respect to the Lebesgue measure, and $A' \in \mathcal{B}(\mathbb{R}^{d_y})$.

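We refer to the densities $\pi(x_0)$ and $\pi(x_t \mid x_{0:t-1})$ as the prior pdf and the transition pdf of the state process, respectively, while for fixed observations $Y_{1:t} = y_{1:t}$, the function $g_t(x_{0:t}) \triangleq \pi(y_t \mid x_{0:t}, y_{1:t-1})$ is referred to as the likelihood of the signal path $X_{0:t} = x_{0:t}$ at time $t$. Together, the densities

$$\pi(x_0), \quad \pi(x_t \mid x_{0:t-1}) \quad \text{and} \quad \pi(y_t \mid x_{0:t}, y_{1:t-1}) \tag{1}$$

determine a random state-space model. The a posteriori pdf of a signal path $X_{0:t} = x_{0:t}$ given a sequence of observations $Y_{1:t} = y_{1:t}$ admits, by Bayes' theorem, the recursive factorization

$$\pi(x_{0:t} \mid y_{1:t}) \propto \pi(y_t \mid x_{0:t}, y_{1:t-1})\,\pi(x_t \mid x_{0:t-1})\,\pi(x_{0:t-1} \mid y_{1:t-1}). \tag{2}$$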

In this paper, we address the problem of finding the maxima of the a posteriori pdf. In particular, let $T$ be an arbitrarily large but finite time horizon and let the observations $Y_{1:T} = y_{1:T}$ be fixed. We seek the sequences of length $T + 1$ in the state space that maximize the function $\pi(x_{0:T} \mid y_{1:T})$, i.e., we aim at finding the solution set $\mathcal{X}_T^\pi$ defined as

$$\mathcal{X}_T^\pi = \arg\max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi(x_{0:T} \mid y_{1:T}). \tag{3}$$

Note that every element $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$ is a MAP estimate of the sequence $X_{0:T}$ given the observations $Y_{1:T} = y_{1:T}$.

Since the observation sequence is kept fixed, in the sequel we use the shorthand $\pi_{0:t}(x_{0:t}) = \pi(x_{0:t} \mid y_{1:t})$ for the posterior density at any time $t = 0, \ldots, T$.

4 Algorithms

[…] et al. 1993; Doucet et al. 2000) in order to obtain two random-grid approximations, with different coarseness, of the signal-path space $(\mathbb{R}^{d_x})^{T+1}$. Then, it is possible to either directly choose the node of a "coarse" grid with the highest posterior density or run the Viterbi algorithm (Forney 1973) on a "fine" (trellis-shaped) grid, as suggested in Godsill et al. (2001).

4.1 Discretization of the state-space: the sequential importance resampling algorithm

We aim at numerically computing the elements $\hat{x}_{0:T}$ of the solution set $\mathcal{X}_T^\pi$ in problem (3). Even if the posterior pdf $\pi_{0:T}(x_{0:T})$ can be evaluated up to a proportionality constant using the factorization of (2), this is, in general, a difficult optimization problem in a high-dimensional space, possibly with multiple global and/or local extrema. In this paper, we propose to tackle these difficulties by using an SMC method in order to obtain a suitable discretization of the path space $(\mathbb{R}^{d_x})^{T+1}$. Different search methods can subsequently be applied to find the point of the discretized space with the highest density.

SMC algorithms aim at recursively computing approximations of the sequence of posterior probability laws

$$\mathbb{P}\{X_{0:t} \in A \mid y_{1:t}\} = \int_A \pi_{0:t}(x_{0:t})\,dx_{0:t}, \tag{4}$$

$t = 1, \ldots, T$, where $A \in \mathcal{B}((\mathbb{R}^{d_x})^{t+1})$ is a Borel set. Specifically, at each time $t$, an SMC algorithm generates random paths $\Omega_{0:t}^N = \{x_{0:t}^{(n)}\}_{n=1,\ldots,N}$ such that integrals with respect to the pdf $\pi_{0:t}(x_{0:t})$ can be approximated by summations (Crisan and Doucet 2000), i.e.,

$$\int f(x_{0:t})\,\pi_{0:t}(x_{0:t})\,dx_{0:t} \approx \frac{1}{N}\sum_{n=1}^{N} f(x_{0:t}^{(n)}),$$

where $f : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}$ is a real function defined in the path space and integrable with respect to the posterior probability law.

Although various possibilities exist (Doucet et al. 2001b), in this paper we consider the standard sequential importance sampling algorithm with resampling at every time step (Doucet et al. 2000), also known as the bootstrap filter (Gordon et al. 1993). We refer to this algorithm as SIR throughout the paper. The algorithm is based on the recursive decomposition of $\pi_{0:t}(x_{0:t})$ given by (2) and the computational procedure is simple.

– Initialization. At time $t = 0$, we draw $N$ independent and identically distributed (i.i.d.) samples from the prior probability distribution with density $\pi(x_0)$. Let us denote this initial sample as $\Omega_0^N = \{x_0^{(n)}\}_{n=1,\ldots,N}$.

– Recursive step. Assume that a random sample $\Omega_{0:t-1}^N = \{x_{0:t-1}^{(n)}\}_{n=1,\ldots,N}$ has been generated up to time $t - 1$. Then, at time $t$, we take the following steps.

i. Draw $N$ new samples in the state space $\mathbb{R}^{d_x}$ from the probability distributions with densities $\pi(x_t \mid x_{0:t-1}^{(n)})$, $n = 1, \ldots, N$, and denote them as $\{\bar{x}_t^{(n)}\}_{n=1,\ldots,N}$. Set $\bar{x}_{0:t}^{(n)} = \{x_{0:t-1}^{(n)}, \bar{x}_t^{(n)}\}$.

ii. Weight each sample according to its likelihood, i.e., compute importance weights

$$\tilde{w}_t^{(n)} = \pi(y_t \mid \bar{x}_{0:t}^{(n)}, y_{1:t-1})$$

and normalize them to obtain

$$w_t^{(n)} = \frac{\tilde{w}_t^{(n)}}{\sum_{k=1}^{N} \tilde{w}_t^{(k)}}.$$

iii. Resampling: for $n = 1, \ldots, N$, set $x_{0:t}^{(n)} = \bar{x}_{0:t}^{(k)}$ with probability $w_t^{(k)}$, $k \in \{1, \ldots, N\}$. Reset the weights to $w_t^{(n)} = 1/N$ for $n = 1, \ldots, N$.
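The initialization and steps i. to iii. above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `sir_paths`, the callback signatures, the scalar linear-Gaussian model and the observation values are all assumptions introduced here for the example.

```python
import math
import random


def sir_paths(ys, n_particles, sample_prior, sample_transition, likelihood, rng):
    """Bootstrap filter (SIR) returning the N resampled paths x_{0:T}^{(n)}."""
    # Initialization: N i.i.d. draws from the prior pdf pi(x_0).
    paths = [[sample_prior(rng)] for _ in range(n_particles)]
    for y in ys:
        # Step i: extend every path with a draw from the transition pdf.
        proposed = [path + [sample_transition(path, rng)] for path in paths]
        # Step ii: importance weights are the likelihoods, then normalized.
        raw = [likelihood(y, path) for path in proposed]
        total = sum(raw)
        weights = [w / total for w in raw]
        # Step iii: multinomial resampling; weights are implicitly reset to 1/N.
        ancestors = rng.choices(range(n_particles), weights=weights, k=n_particles)
        paths = [proposed[k] for k in ancestors]
    return paths


# Hypothetical scalar model, chosen only for illustration:
# X_0 ~ N(0, 1), X_t = 0.9 X_{t-1} + N(0, 1), Y_t = X_t + N(0, 0.5^2).
def sample_prior(rng):
    return rng.gauss(0.0, 1.0)


def sample_transition(path, rng):
    return rng.gauss(0.9 * path[-1], 1.0)


def likelihood(y, path):
    # Gaussian likelihood up to a constant, which cancels on normalization.
    return math.exp(-0.5 * ((y - path[-1]) / 0.5) ** 2)


rng = random.Random(0)
observations = [0.3, -0.1, 0.5]  # made-up data
paths = sir_paths(observations, 500, sample_prior, sample_transition, likelihood, rng)
# Equal-weight particle estimate of the posterior mean of the last state.
posterior_mean_x3 = sum(p[-1] for p in paths) / len(paths)
```

The callbacks receive the whole path, so non-Markov transition and likelihood densities of the general form in (1) fit the same interface.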

The multinomial resampling procedure in step iii. can be substituted by other techniques. A number of alternative resampling methods and their associated computational complexity are discussed in Carpenter et al. (1999), while Douc et al. (2005) provides analytical results regarding the effect of multinomial, residual, stratified and systematic resampling techniques on the variance of Monte Carlo estimates. In Crisan (2001), a tree-based branching algorithm is described that minimizes the variance of the (random) number of offspring in the resampling step. In practice, the use of a low-variance resampling procedure should result in a (not necessarily large) improvement in the accuracy of the estimators computed using the random samples generated by the SIR algorithm.
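As an illustration of one of the alternatives mentioned above, the following sketch shows a common form of systematic resampling, in which a single uniform draw positions $N$ evenly spaced pointers on the cumulative weights. The function name and interface are assumptions made for this example; the cited references should be consulted for the precise variants they analyze.

```python
import random


def systematic_resample(weights, rng):
    """Return N ancestor indices using one uniform draw (systematic scheme)."""
    n = len(weights)
    u0 = rng.random() / n  # single random offset in [0, 1/N)
    indices = []
    k = 0
    cumulative = weights[0]
    for i in range(n):
        u = u0 + i / n  # i-th evenly spaced pointer
        # Advance to the first particle whose cumulative weight covers u.
        while u > cumulative and k < n - 1:
            k += 1
            cumulative += weights[k]
        indices.append(k)
    return indices


example_indices = systematic_resample([0.5, 0.25, 0.25, 0.0], random.Random(0))
```

With this scheme, a particle with normalized weight $w$ is copied either $\lfloor Nw \rfloor$ or $\lceil Nw \rceil$ times, which is why its offspring numbers vary less than under multinomial resampling.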

We shall use the random grid $\Omega_{0:T}^N = \{x_{0:T}^{(n)}\}_{n=1,\ldots,N}$ as a discrete approximation of the path space $(\mathbb{R}^{d_x})^{T+1}$ where the random sequence $X_{0:T}$ takes its values. Note that the SIR algorithm also yields "marginal grids" for each time $t$, denoted $\Omega_t^N = \{x_t^{(n)}\}_{n=1,\ldots,N}$, $t = 0, 1, \ldots, T$. The points of the grid $\Omega_{0:T}^N$ (often also the points of every $\Omega_t^N$) are called particles and the SMC methods that generate them are referred to as particle filters (Doucet et al. 2000) or particle smoothers (Godsill et al. 2004) depending on whether one is interested in the filtering pdf's $\pi(x_t \mid y_{1:t})$, $t = 1, 2, \ldots$, or the smoothing pdf's $\pi_{0:t}(x_{0:t})$, $t = 1, 2, \ldots$, respectively. Using the particles in $\Omega_{0:t}^N$, it is straightforward to build a random measure $\pi_{0:t}^N(dx_{0:t}) = \frac{1}{N}\sum_{n=1}^{N} \delta_n(dx_{0:t})$, where $\delta_n$ is the unit delta measure centered at $x_{0:t}^{(n)}$, and use it to approximate integrals of the form

$$(f, \pi_{0:t}) = \int f(x_{0:t})\,\pi_{0:t}(x_{0:t})\,dx_{0:t}, \tag{5}$$

where $f : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}$ is a real function over the space of the paths up to time $t$. Indeed, we write

$$(f, \pi_{0:t}^N) = \int f(x_{0:t})\,\pi_{0:t}^N(dx_{0:t}) = \frac{1}{N}\sum_{n=1}^{N} f(x_{0:t}^{(n)})$$

for the particle approximation of $(f, \pi_{0:t})$. If the function $f$ is, for example, bounded, then $(f, \pi_{0:t}^N)$ is a good approximation of $(f, \pi_{0:t})$ for $N$ sufficiently large (Crisan and Doucet 2000). We will take advantage of this result for the analysis in Sect. 5.
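A minimal sketch of the particle approximation $(f, \pi_{0:t}^N)$ follows. The helper name `particle_integral`, the four hand-written paths and the two test functions are hypothetical and serve only to show how an equally weighted set of paths turns an integral into an average.

```python
import math


def particle_integral(f, paths):
    """(f, pi^N) = (1/N) * sum_n f(x_{0:t}^{(n)}) for equally weighted paths."""
    return sum(f(path) for path in paths) / len(paths)


# Four made-up paths x_{0:1}^{(n)} in a scalar state space.
toy_paths = [[0.0, 0.0], [1.0, 1.0], [0.1, -0.1], [3.0, 0.0]]


def in_ball(path, center=(0.0, 0.0), radius=0.5):
    # Bounded test function: indicator of an open Euclidean ball in path space.
    return 1.0 if math.dist(path, center) < radius else 0.0


fraction_in_ball = particle_integral(in_ball, toy_paths)          # share of paths in the ball
mean_last_state = particle_integral(lambda p: p[-1], toy_paths)   # average of the last state
```

Indicator functions of balls are exactly the bounded functions used later in the convergence proof, where the approximation reduces to the fraction of particles that fall inside the ball.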

The standard particle filter described in this section is an instance of the general class of sequential importance sampling (SIS) methods in which (a) the particles are drawn from the transition pdf $\pi(x_t \mid x_{0:t-1})$ and (b) resampling is carried out at every time step. There are, however, various versions of this class of algorithms that can be used to obtain the random grid $\Omega_{0:T}^N$. For example, it is possible to use an importance function different from $\pi(x_t \mid x_{0:t-1})$ in order to generate particles (Liu and Chen 1998) or to perform resampling every $m \ge 1$ time steps, where $m$ can be deterministic or random (Doucet et al. 2000). Indeed, not only the standard SIR technique but also most of the SIS-like methods can be easily plugged into the MAP estimation algorithms that we present below.

4.2 Sequence estimation algorithms

We propose to use the random grids generated by the particle filtering algorithm to search for approximate maximizers of the pdf $\pi_{0:T}(x_{0:T})$. In particular, we investigate two algorithms. The first one is a straightforward extension of the SIR procedure, while the second one combines it with the Viterbi algorithm as suggested in Godsill et al. (2001). We will subsequently refer to them as Algorithm 1 and Algorithm 2, respectively.

4.2.1 Algorithm 1

We simply search for the element of $\Omega_{0:T}^N$ with the highest posterior density. For this purpose, note that we can easily extend the SIR algorithm described in Sect. 4.1 to recursively compute the posterior density of each particle up to a proportionality constant. To be specific, we need to perform the following additional computations.

– At the initialization step, let $a_0^{(n)} = \log(\pi(x_0^{(n)}))$ for $n = 1, \ldots, N$.

– At the recursive step, modify steps ii. and iii. as follows.

ii. Weight each sample according to its likelihood, i.e., compute the importance weights $\tilde{w}_t^{(n)} = \pi(y_t \mid \bar{x}_{0:t}^{(n)}, y_{1:t-1})$ and normalize them to obtain $w_t^{(n)} = \tilde{w}_t^{(n)} / \sum_{k=1}^{N} \tilde{w}_t^{(k)}$. Compute

$$\bar{a}_t^{(n)} = a_{t-1}^{(n)} + \log(\pi(y_t \mid \bar{x}_{0:t}^{(n)}, y_{1:t-1})) + \log(\pi(\bar{x}_t^{(n)} \mid x_{0:t-1}^{(n)})).$$

iii. Resampling: for $n = 1, \ldots, N$, set $x_{0:t}^{(n)} = \bar{x}_{0:t}^{(k)}$ and $a_t^{(n)} = \bar{a}_t^{(k)}$ with probability $w_t^{(k)}$, $k \in \{1, \ldots, N\}$. Reset the weights to $w_t^{(n)} = 1/N$ for $n = 1, \ldots, N$.
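The bookkeeping of the log-densities $a_t^{(n)}$, together with a final selection of the path with the largest accumulated value, can be sketched as follows. The scalar linear-Gaussian model, the parameter names (`phi`, `sx`, `sy`) and the observation values are assumptions made only for this example; the log-posterior is tracked up to an additive constant, as in the text.

```python
import math
import random


def log_normal_pdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def map_by_direct_search(ys, n, rng, phi=0.9, sx=1.0, sy=0.5):
    """SIR with log-density tracking for the hypothetical model
    X_0 ~ N(0, 1), X_t = phi X_{t-1} + N(0, sx^2), Y_t = X_t + N(0, sy^2)."""
    paths = [[rng.gauss(0.0, 1.0)] for _ in range(n)]
    log_post = [log_normal_pdf(p[0], 0.0, 1.0) for p in paths]  # a_0^{(n)}
    for y in ys:
        new_paths, new_log_post, log_w = [], [], []
        for path, a_prev in zip(paths, log_post):
            x_new = rng.gauss(phi * path[-1], sx)               # step i
            log_lik = log_normal_pdf(y, x_new, sy)
            log_trans = log_normal_pdf(x_new, phi * path[-1], sx)
            new_paths.append(path + [x_new])
            new_log_post.append(a_prev + log_lik + log_trans)   # \bar a_t^{(n)}
            log_w.append(log_lik)
        # Step ii: shift by the maximum for numerical stability;
        # rng.choices uses relative weights, so no explicit normalization is needed.
        shift = max(log_w)
        weights = [math.exp(lw - shift) for lw in log_w]
        # Step iii: resample paths and their log-densities jointly.
        idx = rng.choices(range(n), weights=weights, k=n)
        paths = [new_paths[k] for k in idx]
        log_post = [new_log_post[k] for k in idx]
    best = max(range(n), key=lambda i: log_post[i])
    return paths[best], log_post[best]


ys_demo = [0.3, -0.1, 0.5]  # made-up observations
best_path, best_log_post = map_by_direct_search(ys_demo, 300, random.Random(1))
```

The cost per time step is linear in the number of particles, since each path only carries one extra scalar.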

Finally, we select $\hat{x}_{0:T}^N = x_{0:T}^{(n_o)}$, where

$$n_o = \arg\max_{n \in \{1,\ldots,N\}} a_T^{(n)}, \tag{6}$$

as the approximate maximizer of $\pi_{0:T}(x_{0:T})$.

4.2.2 Algorithm 2

Let us briefly describe the MAP estimation algorithm of Godsill et al. (2001). Instead of $\Omega_{0:T}^N$, we consider now a finer discretization of $(\mathbb{R}^{d_x})^{T+1}$, namely the product space

$$\bar{\Omega}_{0:T}^N = \Omega_0^N \times \bar{\Omega}_1^N \times \cdots \times \bar{\Omega}_T^N,$$

where $\Omega_0^N = \{x_0^{(n)}\}_{n=1,\ldots,N}$ and $\bar{\Omega}_t^N = \{\bar{x}_t^{(n)}\}_{n=1,\ldots,N}$ for $t = 1, 2, \ldots, T$. Specifically, note that $\bar{\Omega}_t^N$ is constructed from the particles available at step ii. of the SIR algorithm, i.e., before resampling, to avoid duplicate samples.

We assume for clarity¹ that $\pi(x_t \mid x_{0:t-1}) = \pi(x_t \mid x_{t-1})$ and $\pi(y_t \mid x_{0:t}, y_{1:t-1}) = \pi(y_t \mid x_t)$. Given the random grids $\Omega_0^N$ and $\bar{\Omega}_t^N$, $t = 1, \ldots, T$, the Viterbi algorithm outputs a sequence $(x_0^{(n_0)}, \bar{x}_1^{(n_1)}, \ldots, \bar{x}_T^{(n_T)}) \in \bar{\Omega}_{0:T}^N$, $n_i \in \{1, \ldots, N\}$, with the highest posterior density, i.e., it solves the discrete optimization problem

$$\bar{x}_{0:T}^N \in \arg\max_{\bar{x}_{0:T} \in \bar{\Omega}_{0:T}^N} \pi_{0:T}(\bar{x}_{0:T}) \tag{8}$$

exactly. The procedure is described below.

– Initialization. For $n = 1, \ldots, N$, let $a_0^{(n)} = \log(\pi(x_0^{(n)}))$.

– Recursive step. At time $t > 0$, the random grids $\bar{\Omega}_{t-1}^N$ and $\bar{\Omega}_t^N$, as well as $\{a_{t-1}^{(n)}\}_{n=1,\ldots,N}$, are available. Then, for $n = 1, \ldots, N$, compute

$$a_t^{(n)} = \log(\pi(y_t \mid \bar{x}_t^{(n)})) + \max_{k \in \{1,\ldots,N\}} \left[ a_{t-1}^{(k)} + \log(\pi(\bar{x}_t^{(n)} \mid \bar{x}_{t-1}^{(k)})) \right], \tag{9}$$

$$\ell_t^{(n)} = \arg\max_{k \in \{1,\ldots,N\}} \left[ a_{t-1}^{(k)} + \log(\pi(\bar{x}_t^{(n)} \mid \bar{x}_{t-1}^{(k)})) \right].$$

– Backtracking. Computation of an optimal sequence.

i. At time $T$, let $j_T = \arg\max_{k \in \{1,\ldots,N\}} a_T^{(k)}$ and assign $\bar{x}_T^N = \bar{x}_T^{(j_T)}$.

ii. For $t = T-1, T-2, \ldots, 0$, let $j_t = \ell_{t+1}^{(j_{t+1})}$ and assign $\bar{x}_t^N = \bar{x}_t^{(j_t)}$.

¹ The algorithm can also be applied when […]

The Viterbi recursion can be run sequentially, together with the SIR algorithm described in Sect. 4.1. Specifically, we can take a complete recursive step of the Viterbi algorithm right after step ii. of the SIR method (i.e., once the random marginal grid $\bar{\Omega}_t^N$ is obtained). The combination of the SIR and Viterbi methods to compute $\bar{x}_{0:T}^N$ will be termed Algorithm 2 in the sequel.

Compared to Algorithm 1, the application of the Viterbi method in Algorithm 2 adds a considerable computational burden. Specifically, it requires the calculation of $N^2$ branch metrics (associated with the indices $\ell_t^{(n)}$, $n = 1, \ldots, N$, $t = 1, \ldots, T$, in (9)) per time step. As a consequence, the computational complexity of the method is $O(N^2 T)$.
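The Viterbi recursion and the backtracking step can be sketched as below for a Markov model, given the marginal grids as plain lists. The function name `viterbi_map`, the callback signatures, the Gaussian densities and the tiny hand-written grids are assumptions introduced for illustration; in Algorithm 2 the grids would be the pre-resampling particles produced by the SIR method.

```python
import math


def log_normal_pdf(x, mean, std):
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((x - mean) / std) ** 2


def viterbi_map(grids, ys, log_prior, log_trans, log_lik):
    """grids[t] holds the N marginal particles at time t = 0..T; ys[t-1] is y_t."""
    n = len(grids[0])
    a = [log_prior(x) for x in grids[0]]  # a_0^{(n)}
    back = []                             # back[t-1][n] stores ell_t^{(n)}
    for t in range(1, len(grids)):
        a_new, ell = [], []
        for x in grids[t]:
            # N branch metrics per node, hence N^2 per time step.
            scores = [a[k] + log_trans(x, grids[t - 1][k]) for k in range(n)]
            k_best = max(range(n), key=scores.__getitem__)
            a_new.append(log_lik(ys[t - 1], x) + scores[k_best])
            ell.append(k_best)
        a = a_new
        back.append(ell)
    # Backtracking from the best terminal node.
    j = max(range(n), key=a.__getitem__)
    best_log_post = a[j]
    path = [grids[-1][j]]
    for t in range(len(grids) - 1, 0, -1):
        j = back[t - 1][j]
        path.append(grids[t - 1][j])
    path.reverse()
    return path, best_log_post


# Hypothetical densities and grids (N = 3, T = 2), for illustration only.
def log_prior(x):
    return log_normal_pdf(x, 0.0, 1.0)


def log_trans(x_new, x_prev):
    return log_normal_pdf(x_new, 0.9 * x_prev, 1.0)


def log_lik(y, x):
    return log_normal_pdf(y, x, 0.5)


grids = [[-1.0, 0.0, 1.0], [-0.5, 0.2, 0.9], [0.1, 0.4, -0.3]]
ys = [0.2, 0.3]
trellis_path, trellis_score = viterbi_map(grids, ys, log_prior, log_trans, log_lik)
```

The two nested loops over the grid at time $t$ and the grid at time $t-1$ make the quadratic cost per time step explicit, while the search effectively covers all $N^{T+1}$ paths of the product grid.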

5 Analysis

5.1 Outline

We now establish the almost sure convergence of the two MAP sequence estimation algorithms described in Sect. 4.2. In the results that follow, we assume that:

– The sequence $Y_{1:T} = y_{1:T}$ is fixed (not random).

– The likelihoods $g_t(x_{0:t}) = \pi(y_t \mid x_{0:t}, y_{1:t-1})$ are bounded functions of $x_{0:t}$ for every $t = 1, 2, \ldots, T$.

– Let $p_{0:t}(x_{0:t}) = \pi(x_{0:t} \mid y_{1:t-1})$ be the posterior pdf of the sequence $X_{0:t}$ given the observations $Y_{1:t-1} = y_{1:t-1}$. The integral of the likelihood $g_t(x_{0:t})$ with respect to the measure $p_{0:t}(x_{0:t})\,dx_{0:t}$ is positive, i.e., $(g_t, p_{0:t}) > 0$ for $1 \le t \le T$.

– The set $\mathcal{X}_T^\pi$ is not empty and the posterior pdf $\pi_{0:T}(x_{0:T})$ is continuous at every point $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$.

The first three assumptions are applied to show that the SIR algorithm converges in an adequate way, while the fourth one is used to show that $\hat{x}_{0:T}^N \to \hat{x}_{0:T}$. Specifically, note that continuity is only assumed at the global maxima of $\pi_{0:T}(x_{0:T})$ and not necessarily over its whole support.

Obviously, the convergence of the MAP estimation Algorithms 1 and 2 relies on the convergence of the SIR algorithm. To be precise, given a real bounded function $f \in B((\mathbb{R}^{d_x})^{T+1})$, our analysis requires the convergence of $(f, \pi_{0:T}^N)$ toward the actual integral $(f, \pi_{0:T})$ in the $L_p$ norm for some $p \ge 4$. Similar, but not directly applicable, results exist. Convergence rates for $(f, \pi_{0:T}^N) \to (f, \pi_{0:T})$ in the $L_2$ norm can be found in Crisan and Doucet (2000), while the convergence of $(f, \pi_T^N) \to (f, \pi_T)$ (where $\pi_T(x_T) = \pi(x_T \mid y_{1:T})$ is the filtering density and $\pi_T^N(dx_T) = \frac{1}{N}\sum_{n=1}^{N} \delta_n(dx_T)$ its approximation) in terms of generic $L_p$ errors was established in Del Moral and Miclo (2000) under additional constraints.

In Lemma 1 below, we establish the rate of convergence of $(f, \pi_{0:T}^N)$ toward $(f, \pi_{0:T})$ in a generic $L_p$ norm, $p \ge 1$. This is required for the subsequent analysis of the approximate MAP estimates $\hat{x}_{0:T}^N$ and $\bar{x}_{0:T}^N$. Specifically, in Theorem 1, we use Lemma 1 to show that Algorithm 1 converges almost surely (a.s.). More precisely, we prove that $\pi_{0:T}(\hat{x}_{0:T}^N)$ converges to $\pi_{0:T}(\hat{x}_{0:T})$, with $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$. The convergence of Algorithm 2 follows immediately (see Corollary 1). Finally, in Theorem 2, we establish a lower bound on the number of particles $N$ needed to achieve a certain accuracy in the approximation of the MAP estimates.

5.2 Asymptotic convergence results

In the sequel, $\|\xi\|_p$ denotes the $L_p$ norm of the random variable $\xi$, defined as $\|\xi\|_p = E[|\xi|^p]^{1/p}$, where $E[\cdot]$ denotes mathematical expectation, and

$$\|f\|_\infty = \sup_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} |f(x_{0:T})| < \infty$$

denotes the supremum norm of the real bounded function $f \in B((\mathbb{R}^{d_x})^{T+1})$.

Lemma 1 For every $f \in B((\mathbb{R}^{d_x})^{T+1})$ there exists a constant $c = c(p, T, y_{1:T})$, independent of $N$, such that

$$\|(f, \pi_{0:T}^N) - (f, \pi_{0:T})\|_p \le \frac{c\,\|f\|_\infty}{\sqrt{N}},$$

for all $N \ge 1$.

See the Appendix for a proof.

Remark 1 Lemma 1 is similar to Theorem 3.1 in Del Moral et al. (2001). The latter result is derived for a general class of interacting particle systems that assume a certain regularity condition (condition (K) on page 6). This condition is satisfied for processes defined on a compact (or finite) state space but, in general, it is quite hard to check for processes defined on $\mathbb{R}^{d_x}$. By contrast, the proof of Lemma 1 does not require this additional condition.

[…] Moulines 2008) contains an alternative attempt to resolve this issue. Note, however, that the convergence results in the latter references (Hu et al. 2008; Heine and Crisan 2008; Douc and Moulines 2008) refer to the approximation of integrals with respect to the filtering measure $\pi(x_t \mid y_{1:t})\,dx_t$, while Lemma 1 refers to the approximation of integrals with respect to the measure $\pi_{0:t}(x_{0:t})\,dx_{0:t}$.

Theorem 1 Let $\hat{x}_{0:T}^N$ be the output sequence of Algorithm 1. Then, almost surely,

$$\lim_{N \to \infty} \pi_{0:T}(\hat{x}_{0:T}^N) = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}).$$

Moreover, any convergent subsequence of $\hat{x}_{0:T}^N$ has a limit $\hat{x}_{0:T}$ that belongs to the solution set $\mathcal{X}_T^\pi$.

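Proof Let $f : (\mathbb{R}^{d_x})^{T+1} \to \mathbb{R}$ be a bounded real function of the path $x_{0:T}$. From Lemma 1, we obtain

$$\|(f, \pi_{0:T}^N) - (f, \pi_{0:T})\|_p \le \frac{c\,\|f\|_\infty}{\sqrt{N}}, \tag{10}$$

where $c$ is a constant independent of $N$. Choose $p \ge 4$ and an arbitrary constant $0 < \varepsilon < 1$, and construct the positive random variable

$$\Theta_T^{p,\varepsilon} = \sum_{N=1}^{\infty} N^{\frac{p}{2} - 1 - \varepsilon}\, |(f, \pi_{0:T}^N) - (f, \pi_{0:T})|^p.$$

From Fatou's lemma and Lemma 1,

$$E[\Theta_T^{p,\varepsilon}] \le \sum_{N=1}^{\infty} N^{\frac{p}{2} - 1 - \varepsilon}\, \frac{c^p \|f\|_\infty^p}{N^{\frac{p}{2}}} = c^p \|f\|_\infty^p \sum_{N=1}^{\infty} N^{-1-\varepsilon} < \infty,$$

hence $\Theta_T^{p,\varepsilon}$ is a.s. finite. Obviously, $N^{\frac{p}{2} - 1 - \varepsilon} |(f, \pi_{0:T}^N) - (f, \pi_{0:T})|^p \le \Theta_T^{p,\varepsilon}$, and solving for $|(f, \pi_{0:T}^N) - (f, \pi_{0:T})|$, with $p \ge 4$, yields

$$|(f, \pi_{0:T}^N) - (f, \pi_{0:T})| \le \frac{\Theta_T^\delta}{N^{\frac{1}{2} - \delta}}, \tag{11}$$

where

$$\delta = \frac{1 + \varepsilon}{p} \tag{12}$$

and $\Theta_T^\delta = (\Theta_T^{p,\varepsilon})^{\frac{1}{p}}$. Note that, since $p \ge 4$ and $0 < \varepsilon < 1$, it turns out that $0 < \delta < \frac{1}{2}$. As a consequence of (11), the integral $(f, \pi_{0:T}^N)$ converges with probability 1, i.e.,

$$\lim_{N \to \infty} |(f, \pi_{0:T}^N) - (f, \pi_{0:T})| = 0 \quad \text{a.s.} \tag{13}$$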

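Now, choose any MAP estimate $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$ and consider the open ball

$$B_k(\hat{x}_{0:T}) = \left\{ z \in (\mathbb{R}^{d_x})^{T+1} : \|z - \hat{x}_{0:T}\| < \frac{1}{k} \right\},$$

where $k$ is a positive integer and $\|\cdot\|$ denotes the norm of the Euclidean space $(\mathbb{R}^{d_x})^{T+1}$. The indicator function

$$I_{B_k(\hat{x}_{0:T})}(x_{0:T}) = \begin{cases} 1 & \text{if } x_{0:T} \in B_k(\hat{x}_{0:T}), \\ 0 & \text{otherwise} \end{cases} \tag{14}$$

is real and bounded, hence, from (13),

$$\lim_{N \to \infty} |(I_{B_k(\hat{x}_{0:T})}, \pi_{0:T}^N) - (I_{B_k(\hat{x}_{0:T})}, \pi_{0:T})| = 0 \quad \text{a.s.} \tag{15}$$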

Since the posterior pdf $\pi_{0:T}(x_{0:T})$ is continuous at $\hat{x}_{0:T} \in \mathcal{X}_T^\pi$ and $\pi_{0:T}(\hat{x}_{0:T}) > 0$, it follows that $\pi_{0:T}(x_{0:T})$ is positive on an open ball around $\hat{x}_{0:T}$. In particular, the value $A_k = (I_{B_k(\hat{x}_{0:T})}, \pi_{0:T})$ is strictly positive. Also note that the particle approximation of $A_k$ has the form

$$A_k^N = (I_{B_k(\hat{x}_{0:T})}, \pi_{0:T}^N) = \frac{m(N, k)}{N},$$

where $m(N, k)$ denotes the number of elements of the discretized path space $\Omega_{0:T}^N$ that belong to the ball $B_k(\hat{x}_{0:T})$ (equivalently, $m(N, k) = |\Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})|$ is the number of points in the discrete intersection set $\Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})$). Since $\lim_{N \to \infty} |A_k^N - A_k| = 0$ a.s., it follows that, for any $k \ge 1$,

$$\lim_{N \to \infty} m(N, k) > 0 \quad \text{a.s.} \tag{16}$$

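The limit (16) implies that, for any $k \ge 1$, the intersection $\Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})$ is a.s. nonempty when $N$ is sufficiently large. Therefore, let us choose a point $x_{0:T}^{N,k} \in \Omega_{0:T}^N \cap B_k(\hat{x}_{0:T})$. Obviously, $\pi_{0:T}(x_{0:T}^{N,k}) \le \pi_{0:T}(\hat{x}_{0:T})$ but, given the selection rule (6), we also have that $\pi_{0:T}(x_{0:T}^{N,k}) \le \pi_{0:T}(\hat{x}_{0:T}^N)$. Therefore,

$$\pi_{0:T}(x_{0:T}^{N,k}) \le \pi_{0:T}(\hat{x}_{0:T}^N) \le \pi_{0:T}(\hat{x}_{0:T}).$$

Since $\pi_{0:T}$ is continuous at $\hat{x}_{0:T}$ and $\|\hat{x}_{0:T} - x_{0:T}^{N,k}\| < 1/k$, we deduce that $\lim_{k \to \infty} \pi_{0:T}(x_{0:T}^{N,k}) = \pi_{0:T}(\hat{x}_{0:T})$ and, as a consequence,

$$\lim_{N \to \infty} \pi_{0:T}(\hat{x}_{0:T}^N) = \pi_{0:T}(\hat{x}_{0:T}) \quad \text{a.s.}$$

Moreover, if $\{\hat{x}_{0:T}^{N_i}\}_{i \in \mathbb{N}}$ is a convergent subsequence of $\{\hat{x}_{0:T}^N\}_{N \in \mathbb{N}}$ with limit, say, $\check{x}_{0:T}$, it follows that $\pi_{0:T}(\check{x}_{0:T}) = \lim_{i \to \infty} \pi_{0:T}(\hat{x}_{0:T}^{N_i}) = \pi_{0:T}(\hat{x}_{0:T})$. Therefore $\check{x}_{0:T} \in \mathcal{X}_T^\pi$, which […]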

Remark 3 In Najim et al. (2006) a different approach is used to prove a result similar to Theorem 1 based on the propagation-of-chaos property of genealogical tree simulation models (see Del Moral 2004 for details). The basic idea is that a sub-sample from $\{x_{0:T}^{(i)}\}_{i=1,\ldots,N}$ behaves asymptotically as a perfect sample from $\pi_{0:T}$. More precisely, using Theorem 8.3.3 in Del Moral (2004) one can show that if $\pi_{0:T}^{\otimes q}$ is the tensor product of $q$ copies of the measure $\pi_{0:T}$, then

$$\|\mathrm{Law}(x_{0:T}^{(1)}, x_{0:T}^{(2)}, \ldots, x_{0:T}^{(q)}) - \pi_{0:T}^{\otimes q}\|_{tv} \le \frac{q^2}{N}\, c(T), \tag{17}$$

where $\|\cdot\|_{tv}$ denotes the total variation norm between two probability measures and $c(T)$ is a constant with respect to $N$. By choosing $q = q(N)$ to be of order $o(N)$ and denoting $M_\delta = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}) - \delta$, one can show that, for any $\delta > 0$,

$$P\left( \max_{i=1,\ldots,q(N)} \pi_{0:T}(x_{0:T}^{(i)}) < M_\delta \right) \le \frac{c(T)\,q(N)^2}{N} + \pi_{0:T}(A(\delta))^{q(N)},$$

where $A(\delta)$ is defined to be the set

$$A(\delta) = \{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1} : \pi_{0:T}(x_{0:T}) < M_\delta\}$$

and $\pi_{0:T}(A(\delta))$ is a shorthand for the integral

$$\pi_{0:T}(A(\delta)) = \int I_{A(\delta)}(x_{0:T})\,\pi_{0:T}(x_{0:T})\,dx_{0:T}.$$

This, in turn, leads to the convergence in probability (but not a.s.) of the estimator toward $\max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T})$.

Remark 4 Theorem 1 is valid for general topological spaces provided the posterior distribution charges any open neighborhood of points in $\mathcal{X}_T^\pi$ and its density is lower semicontinuous (and hence continuous) at the points in $\mathcal{X}_T^\pi$. This includes discrete spaces (finite or infinite) with the corresponding discrete topology.

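Corollary 1 Assume that $\pi(x_t \mid x_{0:t-1}) = \pi(x_t \mid x_{t-1})$, $\pi(y_t \mid x_{0:t}, y_{1:t-1}) = \pi(y_t \mid x_t)$ and let $\bar{x}_{0:T}^N$ be the output sequence of Algorithm 2. Then,

$$\lim_{N \to \infty} \pi_{0:T}(\bar{x}_{0:T}^N) = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T}) \quad \text{a.s.}$$

Moreover, any convergent subsequence of $\bar{x}_{0:T}^N$ has a limit $\hat{x}_{0:T}$ that belongs to the solution set $\mathcal{X}_T^\pi$.

Proof Simply note that $\Omega_{0:T}^N \subset \bar{\Omega}_{0:T}^N$ and, as a consequence, $\pi_{0:T}(\hat{x}_{0:T}^N) \le \pi_{0:T}(\bar{x}_{0:T}^N) \le \pi_{0:T}(\hat{x}_{0:T})$. □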

Remark 5 We emphasize that the sequences $\{\hat{x}_{0:T}^N\}_{N \in \mathbb{N}}$ and $\{\bar{x}_{0:T}^N\}_{N \in \mathbb{N}}$ may not necessarily be convergent themselves, as they may contain subsequences that converge to different elements of the solution set $\mathcal{X}_T^\pi$ (we have not assumed uniqueness of the global maximizer). Moreover, if

$$\limsup_{\|x_{0:T}\| \to \infty} \pi_{0:T}(x_{0:T}) = \max_{x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}} \pi_{0:T}(x_{0:T})$$

then the sequence may contain subsequences that diverge to infinity, or the entire sequence can diverge to infinity. If that is the case, we need to restrict the search for a global maximizer to a (sufficiently large) compact set. However, in general, $\lim_{\|x_{0:T}\| \to \infty} \pi_{0:T}(x_{0:T}) = 0$ and, therefore, ending up with a sequence divergent to infinity does not occur.

Equation (11) states that, for a real bounded function of $x_{0:T}$, the approximation error vanishes at a rate governed by $\sqrt{N}$, which determines the accuracy of the discretization of the state space provided by $\Omega^N_{0:T}$. This enables us to find how large the number of particles $N$ should be for the (random) grids $\Omega^N_{0:T}$ (respectively, $\bar{\Omega}^N_{0:T}$) to contain points at a distance from a true MAP estimate smaller than $\frac{1}{k}$, for $k$ arbitrary but sufficiently large.

Theorem 2 For sufficiently large $k$, the (random) grids $\Omega^N_{0:T}$ and $\bar{\Omega}^N_{0:T}$ contain points at a distance from a true MAP estimate smaller than $\frac{1}{k}$ provided that $N > \Theta k^{\frac{d_x(T+1)}{\frac{1}{2}-\delta}}$, where $\Theta$ is a positive random variable independent of $N$ and $k$, and $0 < \delta < \frac{1}{2}$ is a constant.

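Proof From (11), there exists a positive random variable $\Theta^\delta_T$ such that, for all $N > 0$, we have

$$\left| \frac{m(N,k)}{N} - A_k \right| \le \frac{\Theta^\delta_T}{N^{\frac{1}{2}-\delta}} \tag{18}$$

a.s. for $0 < \delta < \frac{1}{2}$ (the constant $\delta$ can be chosen as small as desired by taking a large value of $p$ in (12)).

Recall that $A_k = \int_{B_k(\hat{x}_{0:T})} \pi_{0:T}(x_{0:T})\,dx_{0:T}$, for some $\hat{x}_{0:T} \in \mathcal{X}^\pi_T$. When $k$ is sufficiently large, $\pi_{0:T}(x_{0:T})$ is very close to $\pi_{0:T}(\hat{x}_{0:T})$ for any $x_{0:T} \in B_k(\hat{x}_{0:T})$. In particular, we can assume that $\frac{1}{2}\pi_{0:T}(\hat{x}_{0:T}) \le \pi_{0:T}(x_{0:T}) \le \pi_{0:T}(\hat{x}_{0:T})$ for any $x_{0:T} \in B_k(\hat{x}_{0:T})$. Therefore we can deduce that $A_k \ge \frac{q_{T+1}}{2}\pi_{0:T}(\hat{x}_{0:T})\left(\frac{1}{k}\right)^{d_x(T+1)}$, where $q_{T+1}$ is the volume of the unit ball in $(\mathbb{R}^{d_x})^{T+1}$, and from (18) we arrive at

$$\frac{q_{T+1}}{2}\pi_{0:T}(\hat{x}_{0:T})\left(\frac{1}{k}\right)^{d_x(T+1)} - \frac{\Theta^\delta_T}{N^{\frac{1}{2}-\delta}} \le \frac{m(N,k)}{N}. \tag{19}$$

Hence, $m(N,k) > 0$ is guaranteed whenever the left-hand side of (19) is strictly positive, i.e., it suffices for the inequality

$$\frac{q_{T+1}}{2}\pi_{0:T}(\hat{x}_{0:T})\left(\frac{1}{k}\right)^{d_x(T+1)} > \frac{\Theta^\delta_T}{N^{\frac{1}{2}-\delta}}$$

to hold true. Solving for $N$, we obtain $N > \Theta k^{\frac{d_x(T+1)}{\frac{1}{2}-\delta}}$ for $\Theta = \left( \frac{2\Theta^\delta_T}{q_{T+1}\pi_{0:T}(\hat{x}_{0:T})} \right)^{\frac{1}{\frac{1}{2}-\delta}}$. □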
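To get a feel for how the bound of Theorem 2 scales, the following minimal sketch evaluates the exponent $d_x(T+1)/(\frac{1}{2}-\delta)$ and the resulting threshold on $N$. The function names are hypothetical, and the value passed for `theta` is purely illustrative: in the theorem, $\Theta$ is a random variable that is not available in closed form.

```python
def particle_exponent(dx: int, T: int, delta: float) -> float:
    """Exponent of k in the bound N > Theta * k**(dx*(T+1)/(1/2 - delta))."""
    if not 0.0 < delta < 0.5:
        raise ValueError("delta must lie strictly between 0 and 1/2")
    return dx * (T + 1) / (0.5 - delta)


def particle_threshold(theta: float, k: float, dx: int, T: int, delta: float) -> float:
    """Threshold on N for a given (illustrative) value of Theta and accuracy 1/k."""
    return theta * k ** particle_exponent(dx, T, delta)


# Illustrative values only: scalar state, two time steps, delta = 1/4.
example_exponent = particle_exponent(dx=1, T=1, delta=0.25)
example_threshold = particle_threshold(theta=1.0, k=2.0, dx=1, T=1, delta=0.25)
```

The exponent grows linearly with the dimension $d_x(T+1)$ of the path space, which is why the polynomial bound becomes demanding for long sequences.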

Remark 6 Under additional assumptions (for example if the state space is compact), one can deduce² a smaller lower bound for the size $N$ of the sample required to obtain a point at a distance less than, say, $\frac{1}{k}$. The basis of this is the following exponential bound (see Del Moral and Miclo 2000 for details and the required assumptions). One can show that there exist constants $c_1 = c_1(T, f, \delta)$ and $c_2 = c_2(T, f, \delta)$ such that

$$P\{ |(f, \pi^N_{0:T}) - (f, \pi_{0:T})| \ge \delta \} \le c_1 e^{-c_2 N \delta^2} \tag{20}$$

for an arbitrarily small $\delta > 0$. Using a standard argument, one can deduce from (20) that there exist two positive random variables $\Theta^1_T$ and $\Theta^2_T$ such that, for all $N > 0$, we have

$$\left| \frac{m(N,k)}{N} - A_k \right| \le \Theta^1_T \exp\{-\Theta^2_T N\}$$

which implies that if $N > \Theta \log k$ for a suitably chosen positive random variable $\Theta$, then $m(N,k)$ is strictly positive and, hence, the (random) grids $\Omega^N_{0:T}$ and $\bar{\Omega}^N_{0:T}$ contain points at a distance smaller than $\frac{1}{k}$.

6 Application example: target tracking

6.1 Problem statement

MAP sequence estimation methods find a natural application in problems where the posterior densities $\pi_{0:t}(x_{0:t})$, $t = 1, \ldots, T$, are multimodal. In such cases, the mean of the a posteriori probability distribution may yield a path lying in a low probability region and it is often preferred to use a mode of the distribution as an estimate.

The problem of tracking a target that moves over a two-dimensional region using only two sensors that provide distance-dependent observations falls within this category. Let the system state at discrete time $t$ be the four-dimensional random (column) vector $X_t = [X_{1,t}, \ldots, X_{4,t}]^\top \in \mathbb{R}^4$, where $R_t = [X_{1,t}, X_{2,t}]^\top \in \mathbb{R}^2$ determines the position of the target and $V_t = [X_{3,t}, X_{4,t}]^\top$ denotes its velocity. The state vector is assumed to evolve with time according to the constant-velocity model³

$$X_t = A_{T_o} X_{t-1} + \sigma_x U_t, \quad t = 1, 2, \ldots, T,$$

where $A_{T_o}$ is the $4 \times 4$ constant matrix

$$A_{T_o} = \begin{bmatrix} 1 & 0 & T_o & 0 \\ 0 & 1 & 0 & T_o \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},$$

$T_o$ is the observation period in seconds (s), i.e., the duration of the discrete-time unit in the model, $\sigma_x^2$ is the variance of the Gaussian perturbation of the state and $U_t$ is a standard (zero mean, identity covariance matrix) normal vector, i.e., $\pi(u_t) = N(u_t; 0, I_4)$, where $I_4$ is the $4 \times 4$ identity matrix and $0 \in \mathbb{R}^4$. Therefore, the transition pdf at time $t$ is

$$\pi(x_t|x_{0:t-1}) = \pi(x_t|x_{t-1}) = N(x_t; A_{T_o} x_{t-1}, \sigma_x^2 I_4). \tag{21}$$

We assume a Gaussian prior $\pi(x_0) = N(x_0; 0, \Sigma_0)$, where $0 \in \mathbb{R}^4$ and $\Sigma_0$ is a diagonal, $4 \times 4$, positive definite matrix.

² This approach was suggested to us by Pierre Del Moral.

³ The model assumes that the target velocity remains constant in intervals of length $T_o$, the observation period. See, e.g., Gustafsson et al. (2002) for a discussion of kinetic models for target tracking.

Two sensors measure the power of a radio signal transmitted by the target. The observation collected at discrete time $t$ by the $i$-th sensor is modeled as

$$Y_{i,t} = 10 \log_{10}\left( \frac{P_o}{\|R_t - s_i\|^2} \right) + \sigma_y Z_{i,t} \quad \text{(dB)},$$

where $i = 1, 2$, $P_o$ is the power of the signal transmitted by the target, $s_i \in \mathbb{R}^2$ is the position of the $i$-th sensor, $\sigma_y^2$ is the variance of the observational noise and $Z_{i,t}$ is a standard Gaussian variable, $\pi(z_{i,t}) = N(z_{i,t}; 0, 1)$. We assume that the sequences $\{Z_{1,t}\}_{t\ge1}$ and $\{Z_{2,t}\}_{t\ge1}$ are white and mutually independent. With this model, the conditional pdf of the observations $Y_t = [Y_{1,t}, Y_{2,t}] \in \mathbb{R}^2$ given the state $X_t$ of the system is $\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_t)$, where

$$\pi(y_t|x_t) = \prod_{i=1}^{2} N\left( y_{i,t};\ 10 \log_{10}\left( \frac{P_o}{\|r_t - s_i\|^2} \right),\ \sigma_y^2 \right). \tag{22}$$

The goal is to compute an estimate of the sequence of states $X_{0:T}$ (and, especially, of the positions $R_{0:T}$) given a fixed sequence of observations $Y_{1:T} = y_{1:T}$.

6.2 Numerical results

We have carried out computer simulations for the model described by (21) and (22) with the following set of parameters.

– The prior pdf $\pi(x_0) = N(x_0; 0, \Sigma_0)$ has covariance matrix

$$\Sigma_0 = \begin{bmatrix} 100 & 0 & 0 & 0 \\ 0 & 100 & 0 & 0 \\ 0 & 0 & 0.05 & 0 \\ 0 & 0 & 0 & 0.05 \end{bmatrix}.$$

– The observation period is $T_o = \frac{1}{4}$ s and the variance of the signal noise is selected proportional to $T_o^2$, namely, $\sigma_x^2 = \frac{1}{2}T_o^2 = \frac{1}{32}$.

– The target transmits a signal of unit power, $P_o = 1$, and the sensor positions are $s_1 = [0, 0]^\top$ and $s_2 = [20, 0]^\top$. The observational noise variance is $\sigma_y^2 = \frac{1}{2}$.

– The target is observed during $T_d = 20$ s, which yields $T = \frac{T_d}{T_o} = 80$ discrete-time steps.

For the simulations, we have generated a single trajectory $x_{0:T}$ with its associated observations $y_{1:T}$ using the described state-space model. Therefore, the observations are fixed in all the computer experiments. Recall that $R_t = [X_{1,t}, X_{2,t}]^\top$ denotes the target position at time $t$.
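As a minimal sketch of how such a trajectory and its observations might be generated, the code below simulates the constant-velocity model (21) and the log-power observations (22) with the parameters listed above. The function names, the random seed and the use of NumPy are assumptions made for illustration, and the value $\sigma_y^2 = 1/2$ is our reading of the parameter list; this is not the authors' simulation code.

```python
import numpy as np

SENSORS = np.array([[0.0, 0.0], [20.0, 0.0]])  # s_1 and s_2, one row per sensor


def simulate_tracking(T=80, To=0.25, Po=1.0, sigma_x2=1.0 / 32, sigma_y2=0.5, seed=0):
    """Draw one state trajectory x_{0:T} and observations y_{1:T} from the model."""
    rng = np.random.default_rng(seed)
    A = np.array([[1, 0, To, 0],
                  [0, 1, 0, To],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)
    Sigma0 = np.diag([100.0, 100.0, 0.05, 0.05])
    x = np.empty((T + 1, 4))
    y = np.empty((T, 2))
    x[0] = rng.multivariate_normal(np.zeros(4), Sigma0)
    for t in range(1, T + 1):
        # Constant-velocity transition with Gaussian perturbation, Eq. (21).
        x[t] = A @ x[t - 1] + np.sqrt(sigma_x2) * rng.standard_normal(4)
        # Squared distances from the position to each sensor.
        d2 = np.sum((x[t, :2] - SENSORS) ** 2, axis=1)
        # Received power in dB plus observational noise.
        y[t - 1] = 10 * np.log10(Po / d2) + np.sqrt(sigma_y2) * rng.standard_normal(2)
    return x, y


def log_likelihood(y_t, x_t, Po=1.0, sigma_y2=0.5):
    """Logarithm of the likelihood (22) for a single time step."""
    d2 = np.sum((x_t[:2] - SENSORS) ** 2, axis=1)
    mean = 10 * np.log10(Po / d2)
    return float(-0.5 * np.sum((y_t - mean) ** 2) / sigma_y2
                 - np.log(2 * np.pi * sigma_y2))


x_true, y_obs = simulate_tracking()
```

Because both sensors lie on the horizontal axis, reflecting a position about that axis leaves both distances unchanged, so `log_likelihood` takes the same value at a state and at its mirror image.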

Figure 1 (top) displays a histogram of $\pi(r_T|y_{1:T})$ obtained with $N = 4 \times 10^5$ particles generated using the standard SMC algorithm of Sect. 4.1. It is clearly seen that the filtering distribution for this system is bimodal. Note that, in order to obtain a unimodal posterior distribution using distance-dependent observations in dimension $d_x$ one needs either to collect at least $d_x + 1$ observations or to choose a prior for the position that prevents ambiguities.

A consequence of the shape of the distribution in Fig. 1 (top) is that the mean of the posterior distribution is not a useful estimate of the trajectory $R_0, \ldots, R_T$ given the observations $y_{1:T}$. Indeed, Fig. 1 (bottom) shows:

– The true target trajectory, as a dark-colored solid line.

– The sensor positions, as circles.

– The mean of the posterior distribution obtained from the bootstrap filter with $N = 4 \times 10^5$ particles, as a light-colored thick line.

– The 100 sample paths, $x^{(i_1)}_{0:T}, \ldots, x^{(i_{100})}_{0:T}$, with the highest a posteriori density generated by the particle filter, displayed as thin light-colored lines. These paths have been computed using Algorithm 1.

It is apparent that the posterior mean of $\pi(x_{0:T}|y_{1:T})$ yields a path that lies far away from the two regions of high probability density. The collection of high-density sample paths, however, reveals clearly the two modes of the distribution. Any of these paths is a useful practical estimator of the target trajectory but it should be noted that the system is ambiguous. Both modes are equally likely and only a modification of the model (e.g., the addition of new observations from different sensors, the modification of the prior distribution or a restriction of the region where the target can move) would make it possible to discriminate one from the other.

Fig. 1 Top: Histogram generated from $N = 4 \times 10^5$ particles of the bootstrap filter at time $t = T$. It shows that the posterior distribution for this system is bimodal. Bottom: True trajectory (dark-colored line), posterior mean estimate (thick light-colored line) and the 100 sample paths with highest a posteriori density, computed using Algorithm 1 (thin and light-colored)

Fig. 2 Box-and-whiskers plots for the logarithm of the posterior densities of the approximate MAP estimates produced by Algorithms 1 and 2 in 100 independent simulation runs. The dark-colored plot shows the outcomes for Algorithm 1, i.e., $\log \pi(\hat{x}^N_{0:t}|y_{1:t})$ for $t = 0, 1, \ldots, T$. The light-colored plot shows the outcomes for Algorithm 2, i.e., $\log \pi(\bar{x}^N_{0:t}|y_{1:t})$ for $t = 0, 1, \ldots, T$. The boxes show the inter-quartile range (IQR) and the whiskers show the smallest (largest) datum still within $1.5 \times$ IQR of the lower (upper) quartile. Data between $1.5 \times$ IQR and $3 \times$ IQR away from the lower or upper quartiles are displayed with the '+' symbol. Data further than $3 \times$ IQR away from the lower or upper quartile are displayed with the '◦' symbol

Figure 2 shows a comparison of Algorithms 1 and 2 for MAP sequence estimation. For the same record of observations as in Fig. 1, we have run the two algorithms 100 times, with $N = 1{,}000$ particles⁴, and recorded the outputs $\hat{x}^N_{0:T}$ and $\bar{x}^N_{0:T}$ for each simulation. In particular, the figure displays box-and-whiskers plots for the logarithms of the sequence of posterior densities

$$\log \pi(\hat{x}^N_{0:t}|y_{1:t}), \quad t = 0, 1, 2, \ldots, T,$$

for Algorithm 1 (dark-colored), and

$$\log \pi(\bar{x}^N_{0:t}|y_{1:t}), \quad t = 0, 1, 2, \ldots, T,$$

for Algorithm 2 (light-colored). It can be seen that the posterior density of the estimates produced by Algorithm 2 ($\bar{x}^N_{0:T}$) is always higher than that produced by Algorithm 1 ($\hat{x}^N_{0:T}$). This is because in Algorithm 2 we carry out a search over a random grid approximation of the path space, $\bar{\Omega}^N_{0:T}$, that is a refinement of the random grid used by Algorithm 1, denoted $\Omega^N_{0:T}$. As a consequence, $\Omega^N_{0:T} \subset \bar{\Omega}^N_{0:T}$ and $\pi(\hat{x}^N_{0:T}|y_{1:T}) \le \pi(\bar{x}^N_{0:T}|y_{1:T})$ (most often with strict inequality).

⁴ We have used a very large number of particles ($N = 4 \times 10^5$) to generate Fig. 1 in order to accurately show the two (symmetric) modes in the posterior pdf $\pi_{0:T}(x_{0:T})$. However, the practical application of the
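The nesting of the two grids can be illustrated with a short sketch on a toy scalar Markov model. Everything below is an assumption made for illustration: the toy densities, the function names, and the use of random-walk paths as a stand-in for the output of the particle filter. It is not the implementation used for Fig. 2. A linear search picks the best of the $N$ given paths, whereas a Viterbi recursion searches all sequences that can be formed by choosing one of the $N$ nodes at each time step.

```python
import numpy as np


def log_prior(x0):
    return -0.5 * x0 ** 2                      # toy prior, up to a constant


def log_trans(xt, xprev):
    return -0.5 * (xt - 0.9 * xprev) ** 2      # toy Markov transition


def log_lik(yt, xt):
    return -0.5 * (yt - xt) ** 2               # toy likelihood


def path_log_posterior(path, y):
    """Unnormalized log posterior density of a single path x_{0:T}."""
    lp = log_prior(path[0])
    for t in range(1, len(path)):
        lp += log_trans(path[t], path[t - 1]) + log_lik(y[t - 1], path[t])
    return float(lp)


def best_path_linear_search(paths, y):
    """Pick the best among the N given paths (the coarse grid)."""
    scores = np.array([path_log_posterior(p, y) for p in paths])
    i = int(np.argmax(scores))
    return paths[i], float(scores[i])


def best_path_viterbi(paths, y):
    """Search every combination of nodes, one per time step (the finer grid)."""
    N, L = paths.shape
    score = log_prior(paths[:, 0])
    back = np.zeros((L, N), dtype=int)
    for t in range(1, L):
        # cand[j, i]: best score ending at node i at time t-1, then moving to node j.
        cand = score[None, :] + log_trans(paths[:, t][:, None], paths[:, t - 1][None, :])
        back[t] = np.argmax(cand, axis=1)
        score = cand[np.arange(N), back[t]] + log_lik(y[t - 1], paths[:, t])
    j = int(np.argmax(score))
    best = float(score[j])
    out = np.empty(L)
    for t in range(L - 1, -1, -1):             # backtrack through the stored pointers
        out[t] = paths[j, t]
        j = back[t, j]
    return out, best


rng = np.random.default_rng(1)
T, N = 10, 50
paths = np.cumsum(rng.standard_normal((N, T + 1)), axis=1)  # stand-in for SIR paths
y = rng.standard_normal(T)
_, linear_score = best_path_linear_search(paths, y)
viterbi_path, viterbi_score = best_path_viterbi(paths, y)
```

Since every original path is also a sequence of grid nodes, the Viterbi score can never fall below the linear-search score, which mirrors the inequality stated above.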

7 Application example: global optimization

7.1 Problem statement

As an application of the MAP sequence estimation techniques investigated in this paper, we address the problem of finding the global minima of a certain class of cost functions with recursive structure. For this purpose, let $\{x_t\}_{t\ge0}$ and $\{y_t\}_{t\ge1}$ be discrete-time vector-valued sequences in $\mathbb{R}^{d_x}$ and $\mathbb{R}^{d_y}$, respectively. For some arbitrarily large but finite horizon $T$, we aim at computing

$$\mathcal{X}^c_T = \arg\min_{x_{0:T}\in(\mathbb{R}^{d_x})^{T+1}} \mathcal{C}_T(x_{0:T}; y_{1:T}), \tag{23}$$

where $\mathcal{C}_T(\cdot; y_{1:T}) : (\mathbb{R}^{d_x})^{T+1} \to \mathbb{R}^+$ is the real non-negative cost function of interest, the subsequence $x_{0:T}$ denotes the unknowns to be optimized and the subsequence $y_{1:T}$ is known and provides the fixed parameters that determine the specific form of $\mathcal{C}_T$.

The MAP sequence estimation methods of Sect. 4 can be applied to solve problem (23) when the cost function can be constructed recursively, i.e., when there exists a sequence of functions $\mathcal{C}_t(\cdot; y_{1:t}) : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}^+$, $t = 0, 1, \ldots, T$, such that $\mathcal{C}_t(x_{0:t}; y_{1:t})$ can be computed from $\mathcal{C}_{t-1}(x_{0:t-1}; y_{1:t-1})$ by some known update rule. In particular, we assume that $\mathcal{C}_t$ can be decomposed as

$$\mathcal{C}_t(x_{0:t}; y_{1:t}) = H\big(\mathcal{C}_{t-1}(x_{0:t-1}; y_{1:t-1}),\ c_t(x_{0:t}; y_{1:t})\big), \quad t = 1, \ldots, T,$$

where $H : \mathbb{R}^+ \times \mathbb{R}^+ \to \mathbb{R}^+$ is the update function and $c_t(\cdot; y_{1:t}) : (\mathbb{R}^{d_x})^{t+1} \to \mathbb{R}^+$ is termed the partial cost at time $t$. The recursion is initialized with some function $\mathcal{C}_0 : \mathbb{R}^{d_x} \to \mathbb{R}^+$ which does not formally depend on any element of the sequence $y_{1:T}$.

Despite the simplicity of the recursive structure, we may realistically expect problems of the form of (23) to be hard to solve in practical scenarios. Indeed, $\mathcal{C}_T(x_{0:T}; y_{1:T})$ may be analytically intractable and present multiple minima. Also, due to the potentially high dimension, $d_x(T+1)$, of the unknown, $x_{0:T} \in (\mathbb{R}^{d_x})^{T+1}$, it may be hard to devise effective numerical optimization algorithms with acceptable computational complexity.

In order to compute approximate solutions, we propose to recast the optimization problem (23) as one of MAP sequence estimation in a state-space model and then apply the SMC algorithms of Sect. 4. The first step in our approach, therefore, is to select a suitable state-space model. We say that valid models are matched to the cost function.

Definition 1 Let $y_{1:T}$ be a fixed sequence in $\mathbb{R}^{d_y}$. A state-space model of the form of (1) is matched to the cost function $\mathcal{C}_T(x_{0:T}; y_{1:T})$ if, and only if, $\mathcal{X}^c_T = \mathcal{X}^\pi_T$.

Therefore, a state-space model is matched to the cost $\mathcal{C}_T$ when the maxima of the posterior pdf $\pi(x_{0:T}|y_{1:T})$ exactly coincide with the minima of $\mathcal{C}_T(x_{0:T}; y_{1:T})$. There is not a unique model matched to a given cost, but rather a complete class of systems that yield the same solution set $\mathcal{X}^\pi_T = \mathcal{X}^c_T$, as exemplified in Sect. 7.2 below.

7.2 Examples

In this Section we illustrate the construction of state-space models matched to cost functions by way of two examples, each of them dealing with a different class of update functions $H(\cdot, \cdot)$. For notational conciseness, in the rest of this section we use the shorthands $\mathcal{C}_{0:t}(x_{0:t}) = \mathcal{C}_t(x_{0:t}; y_{1:t})$ and $c_t(x_{0:t}) = c_t(x_{0:t}; y_{1:t})$, for $t = 0, \ldots, T$, where the fixed parameters $y_{1:t}$ are omitted.

The first example involves a purely additive rule, $H(a, b) = a + b$. Additive costs appear frequently in scientific and engineering problems, e.g., in positioning and navigation (AH et al. 2005), finance (Ziemba and Vickson 2006) or operational research (Baker 2000). Let us consider the generic additive form

$$\mathcal{C}_{0:t}(x_{0:t}) = \mathcal{C}_{0:t-1}(x_{0:t-1}) + c_t(x_{0:t}). \tag{24}$$

This cost can be related to the posterior pdf easily by means of the exponential transformation

$$\pi_{0:t}(x_{0:t}) = \kappa_t \exp\{-\mathcal{C}_{0:t}(x_{0:t})\}, \tag{25}$$

where the proportionality constant $\kappa_t$ is independent of $x_{0:t}$. Substituting (24) into (25), we readily obtain that

$$\pi_{0:t}(x_{0:t}) = \kappa_t \exp\{-\mathcal{C}_{0:t-1}(x_{0:t-1})\} \exp\{-c_t(x_{0:t})\} \tag{26}$$

and, comparing (26) and (2), it becomes apparent that any state-space model such that $\pi(x_0) \propto \exp\{-\mathcal{C}_0(x_0)\}$ and

$$\pi(y_t|x_{0:t}, y_{1:t-1})\,\pi(x_t|x_{0:t-1}) \propto \exp\{-c_t(x_{0:t})\},$$

for $t = 1, \ldots, T$, is matched to $\mathcal{C}_{0:T}(x_{0:T})$.

Now, we discuss a specific example taken from the global optimization literature.

Example 1 (Neumaier 3 problem) The Neumaier 3 problem is included in the collection of Ali et al. (2005) and consists in the minimization of the cost function

$$J(x_{1:T}) = \sum_{t=1}^{T} (x_t - 1)^2 - \sum_{t=2}^{T} x_t x_{t-1}, \tag{27}$$

subject to $-T^2 \le x_t \le T^2$ for all $t \in \{1, \ldots, T\}$. The number of local minima of $J(x_{1:T})$ is not known, but the global minimum can be expressed as

$$J(x^o_{1:T}) = -\frac{T(T+4)(T-1)}{6},$$

where $x^o_t = t(T + 1 - t)$, $t = 1, \ldots, T$.
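The closed-form minimum quoted above is easy to check numerically. The sketch below evaluates the cost (27) at the stated minimizer and compares it with the closed-form value; the function names are hypothetical and NumPy is assumed only for convenience.

```python
import numpy as np


def neumaier3(x):
    """Neumaier 3 cost, Eq. (27), for a vector x = (x_1, ..., x_T)."""
    x = np.asarray(x, dtype=float)
    return float(np.sum((x - 1.0) ** 2) - np.sum(x[1:] * x[:-1]))


def neumaier3_minimizer(T):
    """Stated global minimizer x_t = t (T + 1 - t), t = 1, ..., T."""
    t = np.arange(1, T + 1)
    return t * (T + 1 - t)


def neumaier3_minimum(T):
    """Stated global minimum -T (T + 4) (T - 1) / 6."""
    return -T * (T + 4) * (T - 1) / 6


T_demo = 10
gap = neumaier3(neumaier3_minimizer(T_demo)) - neumaier3_minimum(T_demo)
```

For `T_demo = 10` the two values coincide, so `gap` is zero up to floating-point rounding.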

We can easily adapt the cost of (27) to the notation in this paper by defining

$$\mathcal{C}_{0:T}(x_{0:T}) = \frac{1}{\sigma^2}\left[ \sum_{t=1}^{T} (x_t - y_t)^2 - \sum_{t=2}^{T} x_t x_{t-1} \right], \tag{28}$$

where $y_t = 1$ for all $t \ge 1$ and $\sigma^2 > 0$ is an arbitrary scale parameter. Note that, subject to $-T^2 \le x_t \le T^2$,

$$\arg\min_{x_{1:T}} J(x_{1:T}) = \arg\min_{x_{1:T}} \mathcal{C}_{0:T}(x_{0:T}), \tag{29}$$

and $x_0$ is a dummy unknown included only for notational compatibility. The functions $\mathcal{C}_{0:t}$, $t = 2, \ldots, T$, admit the recursive decomposition

$$\mathcal{C}_{0:t}(x_{0:t}) = \mathcal{C}_{0:t-1}(x_{0:t-1}) + \frac{1}{\sigma^2}\left[ (x_t - y_t)^2 - x_t x_{t-1} \right]. \tag{30}$$

Let us construct the matched state-space model with signal process $X_{0:T}$ and fixed observations $Y_{1:T} = y_{1:T}$. Since $X_0$ is a dummy variable in this example, it is trivial to choose the prior $\pi(x_0) = U(x_0; -T^2, +T^2)$, which does not modify the location of the maxima of the posterior pdf. We also note that the partial cost at time $t = 1$ is independent of $X_0$, $\mathcal{C}(x_{0:1}) = \frac{1}{\sigma^2}(x_1 - y_1)^2$, hence we can also select a uniform density for the random variable $X_1$, $\pi(x_1|x_0) = \pi(x_1) = U(x_1; -T^2, +T^2)$, and let the likelihood function be Gaussian,

$$\pi(y_1|x_1) \propto \exp\left\{ -\frac{1}{\sigma^2}(x_1 - y_1)^2 \right\}.$$

Thus, the posterior pdf at time $t = 1$ is a truncated Gaussian function, namely

$$\pi(x_1|y_1) \propto \pi(y_1|x_1), \quad \text{where } x_1 \in [-T^2, T^2].$$

In order to determine the form of the matched state-space model for $t \ge 2$ we have to select the transition densities and the likelihood functions to comply with the relationship

$$\pi(y_t|x_{0:t}, y_{1:t-1})\,\pi(x_t|x_{0:t-1}) \propto \exp\left\{ -\frac{1}{\sigma^2}\left[ (x_t - y_t)^2 - x_t x_{t-1} \right] \right\}, \tag{31}$$

where the proportionality constant must be independent of $x_{0:t}$. There are several choices compatible with (31).

A simple one is to choose the transition to be uniform, $\pi(x_t|x_{0:t-1}) = \pi(x_t) = U(x_t; -T^2, +T^2)$, and let the likelihood account for the partial cost,

$$\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_{t-1:t}) \propto \exp\left\{ -\frac{(x_t - y_t)^2 - x_t x_{t-1}}{\sigma^2} \right\}.$$


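where $s_t = 1 + \frac{1}{2}x_{t-1}$. The proportionality constant for this pdf is

$$\kappa_t = \left[ \int_{-\infty}^{T^2} N\left( x_t; s_t, \frac{\sigma^2}{2} \right) dx_t - \int_{-\infty}^{-T^2} N\left( x_t; s_t, \frac{\sigma^2}{2} \right) dx_t \right]^{-1},$$

hence

$$\pi(x_t|x_{t-1}) = \kappa_t \exp\left\{ -\frac{1}{\sigma^2}(x_t - s_t)^2 \right\}, \quad \text{for } -T^2 \le x_t \le +T^2.$$

Let us note that, expanding the square with $s_t = 1 + \frac{1}{2}x_{t-1}$,

$$(x_t - s_t)^2 = (1 - x_t)^2 - x_t x_{t-1} + x_{t-1}\left( \frac{x_{t-1}}{4} + 1 \right),$$

i.e., $\pi(x_t|x_{t-1}) \propto \exp\{-c_t(x_{t-1:t})\}$ but the proportionality constant depends on the variable $x_{t-1}$.

The likelihood $\pi(y_t|x_{0:t}, y_{1:t-1})$ has to be selected to account for the choice of $\pi(x_t|x_{t-1})$ and comply with (31) when $y_k = 1$ for all $1 \le k \le t$. In particular, we define

$$\pi(y_t|x_{0:t}, y_{1:t-1}) = \pi(y_t|x_{t-1}) = N\left( y_t; z_t, \frac{\sigma^2}{2} \right),$$

where

$$z_t \in \left\{ 1 \pm \sqrt{ \sigma^2 (b_T + \log \kappa_t) - x_{t-1}\left( \frac{x_{t-1}}{4} + 1 \right) } \right\} \tag{32}$$

and $b_T \ge \frac{T^2}{\sigma^2}\left( \frac{T^2}{4} + 1 \right)$ is a constant chosen to guarantee that $z_t \in \mathbb{R}$. Equation (32) ensures that

$$\pi(y_t = 1|x_{t-1}) = \frac{\kappa_t^{-1} e^{-b_T}}{\sqrt{\pi\sigma^2}} \exp\left\{ \frac{x_{t-1}}{\sigma^2}\left( \frac{x_{t-1}}{4} + 1 \right) \right\}$$

and, hence, (31) holds true.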

Another class of problems that abound in engineering, finance and other disciplines consists of the minimization of the maximum value of a certain function (see, e.g., Du and Pardalos 1995; Rao et al. 2007; Pankov et al. 2003). Let $a \vee b$ and $a \wedge b$ denote the maximum and the minimum, respectively, of $a$ and $b$. In a second example, we study cost functions of the form $\mathcal{C}_t(x_{0:t}) = \mathcal{C}_{t-1}(x_{0:t-1}) \vee c_t(x_{0:t})$.

We can also apply the exponential transformation in this case, to obtain

$$\pi(x_{0:t}|y_{1:t-1}) \propto \exp\{-\mathcal{C}_{0:t}(x_{0:t})\} = \exp\{-(\mathcal{C}_{t-1}(x_{0:t-1}) \vee c_t(x_{0:t}))\}, \tag{33}$$

for $t = 1, \ldots, T$. Since $a \vee b = a + b - (a \wedge b)$, we can put (33) in a form comparable with (2),

$$\pi(x_{0:t}|y_{1:t-1}) \propto \exp\{-\mathcal{C}_{0:t-1}(x_{0:t-1})\} \times \frac{\exp\{-c_t(x_{0:t})\}}{\exp\{-(\mathcal{C}_{0:t-1}(x_{0:t-1}) \wedge c_t(x_{0:t}))\}},$$

with the proportionality constant independent of $x_{0:t}$, and reduce the problem of building a matched state-space model to selecting a transition density and a likelihood function such that

$$\pi(y_t|x_{0:t}, y_{1:t-1})\,\pi(x_t|x_{0:t-1}) \propto \frac{\exp\{-c_t(x_{0:t})\}}{\exp\{-(\mathcal{C}_{0:t-1}(x_{0:t-1}) \wedge c_t(x_{0:t}))\}}. \tag{34}$$

We work out a brief example from the signal processing literature.

Example 2 (Cross-talk cancellation) In Rao et al. (2007), the design of an acoustic filter for cross-talk cancellation in a 3D audio system is stated as a minimax problem. Let $h_a(n)$, $n \in \mathbb{Z}$, be a sequence that represents the combined effect of the acoustic impulse responses between the sound sources (loudspeakers) and (say) the listener's left ear and let $h_f(n)$, $n \in \mathbb{Z}$, be the cross-talk cancellation filter that should let the desired source signal pass while mitigating all other signals coming from different sources (see Rao et al. 2007 for details). The impulse response $h_a(n)$ is causal with length $2M - 1$, i.e., $h_a(n) = 0$ for all $n < 0$ and $n \ge 2M - 1$, while the filter $h_f(n)$ is assumed causal with length $K$, i.e., $h_f(n) = 0$ for all $n < 0$ and $n \ge K$.

The goal is to find the response $h_f(n)$ such that the convolution $c(n) = h_a(n) * h_f(n) = \sum_{k=0}^{2M-2} h_a(k) h_f(n - k)$ is the closest to the desired response

$$d(n) = \begin{cases} 1, & \text{if } n = 0, \\ 0, & \text{otherwise,} \end{cases} \tag{35}$$

i.e., the filter $h_f(n)$ is selected to invert the combined acoustic response $h_a(n)$. Perfect inversion is not possible, since $h_f(n)$ has a finite length, hence we seek to solve the equations $d(n) - c(n) = 0$, $n = 0, \ldots, K + 2M - 3$, approximately instead. Let us collect the complete set of filter coefficients into the vector $h_f = [h_f(0), \ldots, h_f(K-1)]^\top \in \mathbb{R}^K$. In Rao et al. (2007) it is proposed to select $h_f$ as the solution


We can easily rewrite problem (36) using our notation. For the unknowns, we let $x_t = h_f(t - 1) \in \mathbb{R}$ for $t = 0, 1, \ldots, K$ (hence, $x_0 = 0$ and $x_{t>K} = 0$). The desired sequence $d(n)$ plays the role of the observations, hence $y_t = d(t - 1)$, $t = 1, 2, \ldots, K + 2M - 2$. We define the partial cost at time $t \ge 1$ as

$$c_t(x_{0:t}) = \left| y_t - \sum_{k=0}^{2M-2} h_a(k)\, x_{t-k} \right|$$

while, trivially, $\mathcal{C}_0(x_0) = 0$. The overall cost at time $t$ then becomes $\mathcal{C}_{0:t}(x_{0:t}) = \mathcal{C}_{0:t-1}(x_{0:t-1}) \vee c_t(x_{0:t})$. The time horizon is $T = K + 2M - 2$ and $\mathcal{C}_{0:T}(x_{0:T}) = J(h_f)$.
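The recursive construction of the minimax cost can be sketched as follows. The code makes two assumptions that are not spelled out in the text above: the objective $J(h_f)$ is taken to be the largest absolute deviation between $d(n)$ and $c(n)$, and the filter taps are indexed so that the sum at time $t$ reproduces $c(t-1)$. The function names and the example impulse responses are hypothetical.

```python
import numpy as np


def minimax_cost_recursive(ha, hf):
    """Accumulate C_t = max(C_{t-1}, c_t) with c_t = |y_t - sum_k ha(k) x_{t-k}|."""
    ha = np.asarray(ha, dtype=float)
    hf = np.asarray(hf, dtype=float)
    K, L = len(hf), len(ha)          # L plays the role of 2M - 1
    T = K + L - 1                    # time horizon K + 2M - 2
    x = np.zeros(T + 1)              # x[t] = hf(t - 1); x[0] = 0 and x[t] = 0 for t > K
    x[1:K + 1] = hf
    C = 0.0                          # C_0 = 0
    for t in range(1, T + 1):
        y_t = 1.0 if t == 1 else 0.0  # y_t = d(t - 1)
        conv = sum(ha[k] * x[t - k] for k in range(L) if t - k >= 0)
        C = max(C, abs(y_t - conv))
    return C


def minimax_cost_direct(ha, hf):
    """Assumed objective: max_n |d(n) - (ha * hf)(n)| over the full convolution."""
    c = np.convolve(ha, hf)
    d = np.zeros_like(c)
    d[0] = 1.0
    return float(np.max(np.abs(d - c)))


ha_demo = [1.0, 0.5, 0.25]           # illustrative response of length 2M - 1 = 3
hf_demo = [1.0, -0.5, 0.0, 0.1]      # illustrative filter of length K = 4
recursive_value = minimax_cost_recursive(ha_demo, hf_demo)
direct_value = minimax_cost_direct(ha_demo, hf_demo)
```

Under these assumptions the two evaluations agree, which is the property that lets the cost be built one time step at a time.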

Note that $x_0 = 0$ with probability 1. Also, assume the filter coefficients $x_t$ (equivalently, $h_f(t-1)$) are restricted to the interval⁵ $(-a, +a)$. The simplest way to choose the transition density and likelihood function compatible with (34) is to let

$$\pi(x_t|x_{0:t-1}) = \pi(x_t) = U(x_t; -a, +a), \quad t = 1, \ldots, T,$$

and make the likelihood account for the cost update,

$$\pi(y_t|x_{0:t}, y_{1:t-1}) \propto \frac{\exp\{-c_t(x_{0:t})\}}{\exp\{-(\mathcal{C}_{0:t-1}(x_{0:t-1}) \wedge c_t(x_{0:t}))\}}.$$

As in Example 1, other choices of $\pi(x_t|x_{0:t-1})$ and $\pi(y_t|x_{0:t}, y_{1:t-1})$ are possible.

7.3 Numerical results

In this Section we apply Algorithm 2 to the Neumaier 3 problem described in Example 1. The goal of this numerical example is to illustrate the advantage of using a random grid of points (generated by the particle filter and coherent with the posterior probability distribution) over the space of the unknowns in order to perform the optimization. To show it, we have also implemented a deterministic optimization procedure that consists of

1. building a deterministic grid of $N$ equally spaced points in the interval $[-T^2, +T^2]$, denoted $G^N_T$, and

2. running the Viterbi algorithm to compute the sequence of points in the grid $G^N_T$ with the least cost, i.e.,

$$\tilde{x}^N_{1:T} = \arg\min_{x_{1:T}\in(G^N_T)^T} \mathcal{C}_T(x_{0:T}).$$

Note that this is exactly the same scheme as Algorithm 2, with the only difference that the grid is deterministic rather than stochastic.
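The two-step deterministic procedure can be sketched as a dynamic program over the grid. The code below is a minimal illustration written for the unscaled Neumaier 3 cost (the scale factor does not change the minimizer on a fixed grid); the function name and the small problem size in the usage lines are assumptions, not the configuration used in the experiments.

```python
import numpy as np


def viterbi_neumaier3(T, N):
    """Least-cost sequence on a grid of N equally spaced points in [-T^2, T^2]."""
    grid = np.linspace(-T ** 2, T ** 2, N)
    unary = (grid - 1.0) ** 2            # (x_t - 1)^2 for every grid node
    pair = -np.outer(grid, grid)         # pair[j, i] = -x_t x_{t-1}, x_t = grid[j]
    cost = unary.copy()                  # accumulated cost at t = 1
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = cost[None, :] + pair      # cand[j, i]: come from node i, move to node j
        back[t] = np.argmin(cand, axis=1)
        cost = cand[np.arange(N), back[t]] + unary
    j = int(np.argmin(cost))
    best = float(cost[j])
    path = np.empty(T)
    for t in range(T - 1, -1, -1):       # backtrack to recover the sequence
        path[t] = grid[j]
        j = back[t, j]
    return path, best


# Small illustrative case: with T = 5 and N = 51 the grid has unit spacing.
path_demo, cost_demo = viterbi_neumaier3(T=5, N=51)
```

The recursion costs on the order of $T N^2$ operations, which is the same search that Algorithm 2 performs on the particle grid. In the usage lines the grid happens to contain the integer-valued minimizer $x^o_t = t(T+1-t)$, so the search can recover it exactly; on a coarser grid the result is only the best grid sequence.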

⁵ This is an actual constraint in a practical fixed-point hardware implementation of the filter.

In the first experiment, we check the influence of the scale factor $\sigma^2$ on the solutions generated by the proposed optimization algorithm. Note that, even if the choice of $\sigma^2 > 0$ is irrelevant from the perspective of the solution set $\mathcal{X}^\pi_T = \mathcal{X}^c_T$, the convergence rate of the numerical algorithms used to approximate the solutions in $\mathcal{X}^\pi_T$ may indeed be affected by this parameter.

Therefore, we have applied Algorithm 2 to the Neumaier 3 problem with dimension $T = 200$, using $N = 400$ particles and values of $\sigma^2$ ranging from $\sigma^2 = 100 \times T^2$ to $\sigma^2 = 10^4 \times T^2$. Figure 3 (top) shows the average cost (normalized by $T^2$) of the solutions generated by Algorithm 2 for the various values of $\sigma^2$. Each point in the plot has been obtained by averaging the normalized cost of the solution, $\mathcal{C}_{0:T}(\hat{x}^N_{0:T})/T^2$, over 50 independent simulation runs. The figure also depicts the true minimum cost for reference (labeled 'optimum'). It is observed that the smaller scale factors yield solutions which are poorer than the output of the deterministic algorithm (labeled 'Deterministic VA'), while for $\sigma^2 \ge 400 \times T^2$ the solutions generated by the (stochastic) Algorithm 2 yield a clearly lower cost.

Figure 3 (bottom) shows the convergence of Algorithm 2 as the number of particles, $N$, is increased. For a fixed scale factor $\sigma^2 = 4000 \times T^2$ and $T = 200$ variables, we have carried out 50 independent simulation trials and averaged the normalized cost of the approximate solution, $\mathcal{C}_{0:T}(\hat{x}^N_{1:T})/T^2$, for several values of the number of particles $N$. The error reduction as $N$ grows is apparent. We observe that for the lowest number of particles considered, $N = 100$, the Viterbi method with a deterministic grid outperforms Algorithm 2. For $N \ge 200$, however, the random grid generated by the particle filter always yields a lower cost.

8 Summary

We have analyzed the asymptotic convergence of two SMC algorithms for MAP sequence estimation. Both methods rely on the standard SIR technique to generate random-grid approximations of the state space. They differ, however, in the way the "best node" of the grid is sought. In Algorithm 1, a simple linear search among the paths $\{x^{(n)}_{0:T}\}_{n=1,\ldots,N}$ produced by the SIR method is carried out, while in Algorithm 2 these paths are combined to create a finer grid which is then explored using the Viterbi algorithm, as proposed in Godsill et al. (2001). The output of the algorithms is the node in the (corresponding) grid with the highest posterior density.

Our analysis starts with an extension of well-known results on the convergence of particle filters by Del Moral and Miclo (2000) and Crisan and Doucet (2000). We provide explicit convergence rates for the $L_p$ error in the

Fig. 3 Performance of Algorithm 2 for the Neumaier 3 problem with dimension $T = 200$. Top: Average cost of the solution $\hat{x}^N_{0:T}$, with $N = 400$ particles, for several values of the scale parameter $\sigma^2$. Both the cost (in the vertical axis) and $\sigma^2$ (horizontal axis) are normalized by $T^2$. Bottom: For fixed $\sigma^2 = T^2 \times 4 \times 10^4$, average cost of $\hat{x}^N_{0:T}$ (normalized by $T^2$) for $N = 100, 200, 400, 800, 1600, 3200$

the joint posterior probability measure of the sequence of states $X_{0:T}$. Using this new result, we prove that the posterior density of the output paths of Algorithms 1 and 2 converge almost surely (as $N \to \infty$) to the actual maximum of the posterior pdf. We have also found explicit lower bounds on the number of particles that are needed to ensure a given accuracy in the approximation of the maximum of the pdf.

The last part of the paper is devoted to the application of Algorithms 1 and 2 to the global minimization of a class of objective functions (possibly non convex and possibly non differentiable) that admit a certain recursive factorization. By way of two examples, we have described how to select state-space models "matched" to a given cost function. For these models, the global minima of the cost function coincide with the global maxima of the a posteriori pdf and, hence, we can apply Algorithms 1 and 2 to locate them. In this context, the SIR method can be interpreted as a tool to generate a random grid (in the space of the unknowns of the cost function) that is dense in the region where the cost is low and sparse elsewhere. We have presented numerical simulations that show how this approach can be more efficient than the use of deterministic grids.

Acknowledgements J.M. acknowledges the support of the Ministry of Science and Technology of Spain (program Consolider-Ingenio 2010 CSD2008-00010 COMONSENS and project DEIPRO TEC2009-14504-C02-01) and a joint program of Comunidad de Madrid and Universidad Carlos III de Madrid (project ETORS CCG10-UC3M/TIC-5225 ETORS).

Part of this work was done during D.C.'s visit to the Department of Signal Theory & Communications, Universidad Carlos III (Spain), in April 2008. The hospitality of the Department is gratefully acknowledged.

The work of P.M.D. has been supported by the National Science Foundation under Award CCF-1018323 and by the Office of Naval Research under Award N00014-09-1-1154. Part of this work was carried out while P.M.D. held a Chair of Excellence of Universidad Carlos III de Madrid-Banco de Santander.

Appendix: Proof of Lemma 1

We proceed by induction in $T$. For $T = 0$, the random measure $\pi^N_{0:0}(dx)$ is constructed from an i.i.d. sample of size $N$ from the distribution with pdf $\pi_{0:0}$. Hence, it is straightforward to check that

$$\|(f, \pi^N_{0:0}) - (f, \pi_{0:0})\|_p \le \frac{c^p_0 \|f\|_\infty}{\sqrt{N}},$$

where $c^p_0$ is a constant independent of $N$.
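The $1/\sqrt{N}$ rate of the base case can be observed empirically with a short Monte Carlo experiment. The sketch below is an illustration under assumptions of our own choosing: a standard Gaussian sampling distribution, the bounded test function $f = \cos$, $p = 2$, and hypothetical function names. It estimates the $L_2$ error of the sample average for two sample sizes.

```python
import numpy as np


def l2_error(N, runs=2000, seed=0):
    """Empirical L2 error of the N-sample average of f(x) = cos(x), x ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((runs, N))
    estimates = np.mean(np.cos(x), axis=1)   # one estimate (f, pi^N) per run
    exact = np.exp(-0.5)                     # E[cos X] for a standard Gaussian X
    return float(np.sqrt(np.mean((estimates - exact) ** 2)))


err_small = l2_error(100)
err_large = l2_error(400)
rate_ratio = err_small / err_large
```

Quadrupling the sample size should roughly halve the error, so `rate_ratio` is expected to be close to 2, up to Monte Carlo fluctuations.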

Now we assume that

$$\|(f, \pi^N_{0:T}) - (f, \pi_{0:T})\|_p \le \frac{c^p_T \|f\|_\infty}{\sqrt{N}}, \tag{37}$$

for an integer $T > 0$ and aim at proving the corresponding inequality for $T + 1$.

The recursive step of the SIR algorithm, as presented in Sect. 4.1, consists of three sub-steps. Let $p^N_{0:T+1}$ be the empirical measure obtained after the first sub-step, i.e., $p^N_{0:T+1}(dx) = \frac{1}{N}\sum_{n=1}^{N} \delta_{\bar{x}^{(n)}_{0:T+1}}(dx)$, where $\delta_{\bar{x}^{(n)}_{0:T+1}}$ denotes the unit delta measure centered at $\bar{x}^{(n)}_{0:T+1}$. Also let $\mathcal{G}_{T,N}$ denote the σ-algebra generated by the random variables $X^{(n)}_{0:T}$, $n = 1, \ldots, N$. Then, for $f : (\mathbb{R}^{d_x})^{T+2} \to \mathbb{R}$, we have

$$E[(f, p^N_{0:T+1})|\mathcal{G}_{T,N}] = (\bar{f}, \pi^N_{0:T}), \tag{38}$$

where $\bar{f}$ is obtained from $f$ by integrating with respect to the measure $\pi(x_{T+1}|x_{0:T})dx_{T+1}$, i.e.,

$$\bar{f}(x_{0:T}) = \int f(x_{0:T+1})\,\pi(x_{T+1}|x_{0:T})\,dx_{T+1}. \tag{39}$$

Obviously, $\bar{f}$ is bounded (since $\|\bar{f}\|_\infty \le \|f\|_\infty$) and, from the induction hypothesis (37), we deduce that

$$\|(\bar{f}, \pi^N_{0:T}) - (\bar{f}, \pi_{0:T})\|_p \le \frac{c^p_T \|f\|_\infty}{\sqrt{N}}. \tag{40}$$

Moreover, since

$$E\big[\big((f, p^N_{0:T+1}) - E[(f, p^N_{0:T+1})|\mathcal{G}_{T,N}]\big)^p \big| \mathcal{G}_{T,N}\big]^{\frac{1}{p}} \le \frac{\tilde{c}^p_{T+1} \|f\|_\infty}{\sqrt{N}}, \tag{41}$$

where $\tilde{c}^p_{T+1}$ is a positive random variable independent of $N$, it is straightforward to combine (38), (40) and (41) using the triangle inequality to arrive at

$$\|(f, p^N_{0:T+1}) - (\bar{f}, \pi_{0:T})\|_p \le \frac{\bar{\tilde{c}}^{\,p}_{T+1} \|f\|_\infty}{\sqrt{N}}, \tag{42}$$

where $\bar{\tilde{c}}^{\,p}_{T+1} = E[\tilde{c}^p_{T+1}] + c^p_T$.

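Consider next the measure $\bar{\pi}^N_{0:T+1}$ that is obtained after sub-step ii. of the algorithm. This measure can be defined by

$$(f, \bar{\pi}^N_{0:T+1}) = \frac{(f g_{T+1}, p^N_{0:T+1})}{(g_{T+1}, p^N_{0:T+1})} \tag{43}$$

(recall that $g_{T+1}(x_{0:T+1}) = \pi(y_{T+1}|x_{0:T+1}, y_{1:T})$ is the bounded likelihood function). Also let $p_{0:T+1}(x_{0:T+1}) = \pi(x_{T+1}|x_{0:T})\pi_{0:T}(x_{0:T})$ be the predictive pdf at time $T + 1$, which satisfies $(f, p_{0:T+1}) = (\bar{f}, \pi_{0:T})$, and rewrite (42) as

$$\|(f, p^N_{0:T+1}) - (f, p_{0:T+1})\|_p \le \frac{\bar{\tilde{c}}^{\,p}_{T+1} \|f\|_\infty}{\sqrt{N}}. \tag{44}$$

Since, from the Bayes' rule,

$$(f, \pi_{0:T+1}) = \frac{(f g_{T+1}, p_{0:T+1})}{(g_{T+1}, p_{0:T+1})}, \tag{45}$$

we can take (43) and (45) together in order to obtain

$$(f, \bar{\pi}^N_{0:T+1}) - (f, \pi_{0:T+1}) = \frac{(f g_{T+1}, p^N_{0:T+1})}{(g_{T+1}, p^N_{0:T+1})} - \frac{(f g_{T+1}, p_{0:T+1})}{(g_{T+1}, p_{0:T+1})}.$$

By adding and subtracting the term $(f g_{T+1}, p^N_{0:T+1})/(g_{T+1}, p_{0:T+1})$ in the equation above, we easily arrive at

$$(f, \bar{\pi}^N_{0:T+1}) - (f, \pi_{0:T+1}) = \frac{(f g_{T+1}, p^N_{0:T+1})\big[(g_{T+1}, p_{0:T+1}) - (g_{T+1}, p^N_{0:T+1})\big]}{(g_{T+1}, p^N_{0:T+1})(g_{T+1}, p_{0:T+1})} + \frac{(f g_{T+1}, p^N_{0:T+1}) - (f g_{T+1}, p_{0:T+1})}{(g_{T+1}, p_{0:T+1})}$$

and, since $|(f g_{T+1}, p^N_{0:T+1})| \le \|f\|_\infty (g_{T+1}, p^N_{0:T+1})$, it follows that

$$|(f, \bar{\pi}^N_{0:T+1}) - (f, \pi_{0:T+1})| \le \frac{\|f\|_\infty \big|(g_{T+1}, p_{0:T+1}) - (g_{T+1}, p^N_{0:T+1})\big| + \big|(f g_{T+1}, p^N_{0:T+1}) - (f g_{T+1}, p_{0:T+1})\big|}{(g_{T+1}, p_{0:T+1})}.$$

The latter inequality, together with (44) and the assumed boundedness of the likelihood $g_{T+1}$, yields

$$\|(f, \bar{\pi}^N_{0:T+1}) - (f, \pi_{0:T+1})\|_p \le \frac{\breve{c}^p_{T+1} \|f\|_\infty}{\sqrt{N}}, \tag{46}$$

where $\breve{c}^p_{T+1} = 2\|g_{T+1}\|_\infty \bar{\tilde{c}}^{\,p}_{T+1}/(g_{T+1}, p_{0:T+1})$ is a constant independent of $N$.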
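The weighting sub-step analyzed above is a ratio of two sample averages, as in (43). A minimal sketch of this ratio estimator on a toy scalar example follows; the Gaussian predictive distribution, the Gaussian likelihood, the bounded test function and all names are assumptions chosen so that the exact answer is known, and they are unrelated to the models used elsewhere in the paper.

```python
import numpy as np


def weighted_estimate(f_vals, g_vals):
    """Ratio estimator (f g, p^N) / (g, p^N) built from samples of the predictive pdf."""
    return float(np.sum(f_vals * g_vals) / np.sum(g_vals))


rng = np.random.default_rng(0)
y_obs_toy = 1.0
samples = rng.standard_normal(200_000)              # draws from the predictive pdf N(0, 1)
weights = np.exp(-0.5 * (y_obs_toy - samples) ** 2)  # likelihood N(y; x, 1), up to a constant
# Bounded test function: indicator of the event {x > y/2}.
indicator = (samples > y_obs_toy / 2).astype(float)
posterior_prob = weighted_estimate(indicator, weights)
```

In this toy case the posterior is Gaussian with mean $y/2$, so the exact value of the estimated probability is $1/2$, and `posterior_prob` approaches it as the number of samples grows.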

In order to analyze the last sub-step (the resampling), we introduce the σ-algebra generated by the random variables $\bar{X}^{(n)}_{0:T+1}$, $n = 1, \ldots, N$, and denote it as $\bar{\mathcal{G}}_{T+1,N}$. It is straightforward to obtain that $E[(f, \pi^N_{0:T+1})|\bar{\mathcal{G}}_{T+1,N}] = (f, \bar{\pi}^N_{0:T+1})$, hence the conditional expectation of the error becomes

$$E\big[\big((f, \pi^N_{0:T+1}) - (f, \bar{\pi}^N_{0:T+1})\big)^p \big| \bar{\mathcal{G}}_{T+1,N}\big]^{\frac{1}{p}} \le \frac{\acute{c}^p_{T+1} \|f\|_\infty}{\sqrt{N}},$$

where $\acute{c}^p_{T+1}$ is a positive random variable independent of $N$. As a consequence, taking the expectation on $X^{(n)}_{0:T+1}$, $n = 1, \ldots, N$, yields

$$\|(f, \pi^N_{0:T+1}) - (f, \bar{\pi}^N_{0:T+1})\|_p \le \frac{\bar{\acute{c}}^{\,p}_{T+1} \|f\|_\infty}{\sqrt{N}}, \tag{47}$$

where $\bar{\acute{c}}^{\,p}_{T+1}$ is the expected value of $\acute{c}^p_{T+1}$. Combining (46) and (47) by way of the triangle inequality yields

$$\|(f, \pi^N_{0:T+1}) - (f, \pi_{0:T+1})\|_p \le \|(f, \pi^N_{0:T+1}) - (f, \bar{\pi}^N_{0:T+1})\|_p + \|(f, \bar{\pi}^N_{0:T+1}) - (f, \pi_{0:T+1})\|_p \le \frac{c^p_{T+1} \|f\|_\infty}{\sqrt{N}},$$

where $c^p_{T+1} = \bar{\acute{c}}^{\,p}_{T+1} + \breve{c}^p_{T+1}$ is a constant independent of $N$. □

References