Estimating 3D Motion and Occupancy - Multi-view dynamic scene modeling

Given the identified dependencies, ideally the goal is to compute the full distribution of 𝑝(𝒟,𝒢∣ℐ) since the primary interest is to estimate the displacement field and occupancy probabilities at time 𝑡 given image observations. Unfortunately this inference is intractable given the huge state spaces of 𝒟 and 𝒢. However the problem is well suited for an EM formulation [Dempster et al. (1977)] with some simplifications. By

focusing on estimating the Maximum A Posteriori (MAP) of 𝑝(𝒟∣ℐ) and treating 𝒢 as a latent, unobserved variable set, an EM procedure can be constructed that iteratively reﬁnes estimates𝑑0_{, 𝑑}1_{⋅ ⋅ ⋅} _{, 𝑑}∗ _{of the motion ﬁeld} _𝒟_{(M-step), while providing with each} new iteration 𝑘+ 1 an estimate of 𝑝(𝒢∣ℐ, 𝑑𝑘_{) (E-step). The latter term corresponds to}

estimating voxel occupancy probabilities given the image data at time 𝑡 and using the previous iteration’s computed motion ﬁeld, thus providing a probabilistic shape estimate representation analog to previous probabilistic methods [Broadhurst et al. (2001); Franco and Boyer (2005)]. A MAP-EM formulation is chosen also because the tradi- tional EM formulation does not consider the prior 𝑝(𝒟) on motion ﬁelds, which is used to enforce spatial continuity.

5.4.1 Expectation Maximization Formulation

Here is a MAP-EM formulation, whose goal is to ﬁnd the optimal 𝑑∗ such that:

𝑑∗ = argmax𝒟 𝐹(𝒟) with 𝐹(𝒟) =𝑝(𝒟∣ℐ). (5.4)

This goal is achieved iteratively starting from an initial guess 𝑑0_{, by building a se-}

quence of motion ﬁeld estimates𝑑0_{, 𝑑}1_,_{⋅ ⋅ ⋅} _{, 𝑑}∗ _{which increase the log-posterior objective} function𝐹(𝒟), i.e. 𝐹(𝑑0₎_{≤ ⋅ ⋅ ⋅ ≤}_𝐹₍_𝑑∗_{). This is usually achieved at each step by max-} imizing a lower bound of 𝐹(𝒟), whose maximum coincides with an analytically simpler function 𝑄(𝒟∣𝑑𝑘). For a good choice of lower bound and 𝑄(𝒟∣𝑑𝑘), the maximization guarantees an increase of the log-posterior, such that 𝑑𝑘+1 can be deﬁned as follows:

The M-Step: 𝑑𝑘+1 = argmax_𝒟 𝑄(𝒟∣𝑑𝑘). (5.5)

our problem using the following conditional expectation:

𝑄(𝒟∣𝑑𝑘) =𝐸𝒢∣ℐ,𝑑𝑘{ln𝑝(ℐ,𝒢,𝒟)}

=∑

𝒢

𝑝(𝒢∣ℐ, 𝑑𝑘) ln𝑝(ℐ,𝒢,𝒟). (5.6)

The E-step evaluates 𝑝(𝒢∣ℐ, 𝑑𝑘_{) of Eq. 5.6, i.e. the grid occupancy probabilities}

given images ℐand the previously predicted displacement 𝑑𝑘_{. Given this distribution,}

the log-expectation of 𝑝(ℐ,𝒢,𝒟) can be computed for all possible 𝒟as in Eq. 5.6. The E-step of the algorithm then consists in evaluating the 𝑝(𝒢∣ℐ, 𝑑𝑘) term of the expression, i.e. the grid occupancy probabilities given images and the previously predicted displacement. Both probability distributions can be obtained from Eq. 5.3. Each step is developed subsequently in further detail.

5.4.2 E-step: Occupancy Probability Update

In order to compute voxel probabilities at a given EM iteration, one needs to express

𝑝(𝒢∣ℐ, 𝑑𝑘_{) in terms of the joint probability distribution (5.3). Bayes’ rule is used (5.7) and}

the summations is re-factored (5.8) to depending terms, with∝denoting proportionality up to a unit normalization factor:

𝑝(𝒢∣ℐ, 𝑑𝑘)∝∑ ¯ 𝒢 𝑝( ¯𝒢𝒢𝑑𝑘ℐ) (5.7) ∝∏ 𝑋 Φ(𝒢𝑋)⋅ ∑ ¯ 𝒢 𝑋−𝑑𝑘 𝑋 𝑝( ¯𝒢_𝑋₋_𝑑𝑘 𝑋)𝑝(𝒢𝑋∣ ¯ 𝒢_𝑋₋_𝑑𝑘 𝑋), (5.8) where∑ ¯ 𝒢 𝑋−𝑑𝑘 𝑋 𝑝( ¯𝒢_𝑋−𝑑𝑘 𝑋)𝑝(𝒢𝑋∣ ¯ 𝒢_𝑋−𝑑𝑘

𝑋) sums possibilities over occupancy states of the voxel 𝑋 − 𝑑𝑘

𝑝(𝒢𝑋∣𝒢¯𝑋−𝑑𝑘

𝑋) is set deterministically: if the previous voxel 𝑋−𝑑

𝑘

𝑋 was occupied (resp.

empty), then once displaced to 𝑋it is still occupied (resp. empty) with probability 1. Expression (5.8) then becomes:

𝑝(𝒢∣ℐ, 𝑑𝑘)∝∏ 𝑋 ( Φ(𝒢𝑋)⋅𝑝([ ¯𝒢𝑋−𝑑𝑘 𝑋 =𝒢𝑋]) ) . (5.9)

For the purpose of providing a probabilistic shape estimate, each voxel 𝑋’s probability after an E-step can thus be identiﬁed as 𝑝(𝒢𝑋∣ℐ, 𝑑𝑘)∝Φ(𝒢𝑋)⋅𝑝([ ¯𝒢𝑋−𝑑𝑘

𝑋 =𝒢𝑋]), the product of current observation terms at time 𝑡, with the probability of the voxel mapped to 𝑋from 𝑡−1.

Φ(𝒢𝑋) can be computed by expliciting the image formation terms 𝑝(ℐ𝑖∣𝒢𝑋). For

every pixel 𝑥 in every image, it is assumed that the parameters ℬ of a background model have been learned oﬄine from images of a quasi-static scene with no object of interest. Similar to Section 2.4.2, the image term 𝑝(ℐ𝑖∣𝒢𝑋) can then be set as following

[Franco and Boyer (2005)]:

𝑝(ℐ𝑖∣𝒢𝑋) =𝑝(𝒢𝑋= 0)𝑝(ℐ𝑖∣ℬ) +𝑝(𝒢𝑋= 1)𝒰(ℐ𝑖),

where 𝒰(ℐ𝑖) is the uniform distribution over pixel color space, used to model the

aspect of objects of interest since no information about it is used, and 𝑝(ℐ𝑖∣ℬ) is the

probability ofℐ𝑖to be drawn from the background modelℬ. 𝑝(ℐ𝑖∣ℬ) can be, for instance,

a Normal or Gaussian Mixture Model distribution. Here the voxel 𝑋’s occupancy 𝒢𝑋

serves as a mixture label to drawℐ𝑖 from foreground or background aspect distributions.

The silhouette information is latent in this representation and does not require any binary segmentation decision.

5.4.3 M-step: 3D Motion Field Update

To optimize Eq. 5.5, one still needs to expand the expression of𝑄(𝒟∣𝑑𝑘_{) in Eq. 5.6. The}

distribution𝑝(ℐ,𝒢,𝒟) can be computed by marginalizing Eq. 5.3 over ¯𝒢,and simpliﬁed

similarly to Eq. 5.8: 𝑝(ℐ,𝒢,𝒟)∝𝑝(𝒟)∏ 𝑋 ( Φ(𝒢𝑋)⋅𝑝([ ¯𝒢𝑋−𝒟𝑋 =𝒢𝑋]) ) (5.10)

Taking the logarithm of Eq. 5.10 to compute𝑄(𝒟∣𝑑𝑘), and noting that theΦ(𝒢𝑋)term

does not depend on 𝒟, the M-step becomes:

𝑑𝑘+1 = argmax_𝒟 ln(𝑝(𝒟)) +∑ 𝑋 ∑ 𝒢𝑋 𝑝(𝒢𝑋∣ℐ, 𝑑𝑘)⋅ln𝑝([ ¯𝒢𝑋−𝒟𝑋 =𝒢𝑋]) (5.11)

where 𝑝(𝒢𝑋∣ℐ, 𝑑𝑘) is computed in the E-step (Eq. 5.9).

To model 3D motion ﬁeld continuity, 𝑝(𝒟) is chosen to be a Markov Random Field. The M-step in the EM framework becomes a standard ﬁrst-order MRF MAP problem. From time 𝑡 to 𝑡 + 1, if the displacement possibilities at every point is quantized to

𝑛 displacement options denoted as a label set ℒ = {𝑙1,⋅ ⋅ ⋅, 𝑙𝑛}, then Eq. 5.11 can be represented as standard graph optimization pairwise and unary energy terms:

𝐸𝑀𝑅𝐹 = ∑ 𝑋 ∑ 𝑌∈𝒩(𝑋) 𝐸𝑋𝑌(𝑙𝑋, 𝑙𝑌) + ∑ 𝑋 𝐸𝑋(𝑙𝑋), (5.12)

where 𝒩(𝑋) is the neighborhood system of point 𝑋in the 3D graph. In Eq. 5.12, ∑

𝑋

∑

𝑌∈𝒩(𝑋)𝐸𝑋𝑌(𝑙𝑋, 𝑙𝑌) are the binary terms, and

∑

They correspond to the negative of ln(𝑝(𝒟)) and ∑ 𝑋 ∑ 𝒢𝑋𝑝(𝒢𝑋∣ℐ, 𝑑 𝑘₎_⋅_ln_𝑝_{([ ¯}_𝒢 𝑋−𝒟𝑋 = 𝒢𝑋]) in Eq. 5.11 respectively.

The M-step in this EM framework thus becomes a discrete multi-labeling problem, with the go∑

𝑋

∑

𝑌∈𝒩(𝑋)𝐸𝑋𝑌(𝑙𝑋, 𝑙𝑌) are the binary terms, and

∑

𝑋𝐸𝑋(𝑙𝑋) are the

unary terms. They correspond to the negative of ln(𝑝(𝒟)) and ∑

𝑋

∑

𝒢𝑋 𝑝(𝒢𝑋∣ℐ, 𝑑

𝑘₎_⋅

ln𝑝([ ¯𝒢𝑋−𝒟𝑋 =𝒢𝑋]) in Eq. 5.11 respectively.

The M-step in our EM framework becomes a discrete multi-labeling problem, with the goal of computing a labeling 𝐿∈ ℒ∣𝒳 ∣, which assigns each grid node 𝑋 ∈ 𝒳 a label fromℒsuch that the energy 𝐸𝑀𝑅𝐹 is minimized. Thus

𝑑𝑘+1 = argmin_𝐿𝐸𝑀𝑅𝐹. (5.13)

The solution to this MRF thus provides the updated displacement ﬁeld in the EM iteration. Further details on MRF implementation is given in the sections below.

In document Multi-view dynamic scene modeling (Page 107-112)