Performance evaluation of multi-camera visual tracking


Lucio Marcenaro, Pietro Morerio, Mauricio Soto, Andrea Zunino, Carlo S. Regazzoni

DITEN, University of Genova

Via Opera Pia 11A 16145 Genoa - Italy Emails: {pmorerio}@ginevra.dibe.unige.it {lucio.marcenaro,carlo.regazzoni}@unige.it

Abstract—The main drawbacks of single-camera multi-target visual tracking can be partially overcome by increasing the amount of information gathered on the scene, i.e. by adding cameras. In such a multi-camera approach, multiple sensors cooperate for overall scene understanding. However, new issues arise, such as data association and data fusion. This work addresses the issue of evaluating, on real data, the performance of a multi-camera tracking algorithm based on Rao-Blackwellized Monte Carlo data association (RBMCDA). For this purpose, a new metric based on three performance indexes is developed.

I. INTRODUCTION

Video-based tracking techniques are widely used in different fields related to ambient intelligence and scene understanding. Video surveillance is only one of the possible applications, which range from security to man-machine interaction; monitoring of elderly people for health assistance or rehabilitation is another field where video sensors and automatic processing algorithms can be used efficiently. Moreover, technological improvements in imaging hardware and the related price cuts have enabled an even more pervasive usage of such sensors: IP-based video cameras allow easy hardware installation and remote monitoring and configuration, while megapixel sensors and efficient video compression algorithms have greatly improved the quality and spatio-temporal resolution of acquired images.

In this scenario, visual tracking is one of the principal techniques used for extracting high-level descriptors of the scene. The main purpose of tracking algorithms is to estimate the trajectories of moving objects by correctly assigning each of them a specific label that must be propagated over time as the object moves in the scene. Each tracking technique starts from low-level data acquired directly from images, and can thus be modelled as a state estimation problem starting from noisy observations. In this sense, tracking can be seen as a filtering process that cancels observation noise and extracts the actual movement of each object in the scene. Several high-level applications can be developed on the basis of trajectories and detected movements, such as people counting, traffic monitoring, abandoned object detection, automotive safety, etc.

The complexity of visual tracking is very high; the main problems to be solved are related to unpredictable object movements, variations in target or scene appearance, non-rigid parts, occlusions (intra-, inter- or environmental) and undesired sensor movements. A multi-camera approach can be adopted in order to deal with these issues, although data fusion and association problems must be considered when multiple sensors have to cooperate for overall scene understanding.

The work is organized as follows: Section II reviews the probabilistic approach to visual tracking and introduces the reader to particle filtering, Rao-Blackwellized particle filtering and Rao-Blackwellized Monte Carlo data association (RBMCDA). Section III describes the experimental set-up and the model used for performance evaluation. Section IV introduces the novel metric and performance indexes. In Section V performance evaluation results are given. Conclusions are drawn in Section VI, where future developments of the work are also proposed.

II. PROBABILISTIC APPROACH

Visual tracking problems can be mathematically modelled through probabilistic reasoning. The target is described by a state vector containing its descriptive features, and the target state must be estimated from noisy observations obtained from the acquired images. Stochastic state and observation models are needed for this kind of approach. Bayesian theory can be used to estimate the a posteriori probability density function (pdf) of the state: once the pdf is estimated, it can be used for evaluating the new state and for measuring the precision of the state estimate. The hidden state vector to be estimated can be expressed as $x_t \in \mathbb{R}^{n_x}$, where $n_x$ is the state vector dimension and $t \in \mathbb{N}$ is the considered time instant. The state vector $x_t$ can contain object position, speed, acceleration or shape-related features (e.g., corners). The observations $z_{1:t} = \{z_1, \dots, z_t\}$ available at time instant $t$ are used for the a posteriori state estimate.

An on-line solution is needed for object tracking, as one must be able to estimate the new state of each tracked object as new observations become available. The a posteriori state pdf $p(x_t \mid z_{1:t})$ is computed recursively through a two-step prediction-update procedure. If the posterior pdf of the previous step, $p(x_{t-1} \mid z_{1:t-1})$, is known, the pdf of the following state can be predicted by means of the state transition model $p(x_t \mid x_{t-1})$:

$$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid z_{1:t-1})\, dx_{t-1}. \tag{1}$$

Moreover, by applying Bayes' theorem, it is possible to obtain the a posteriori state pdf $p(x_t \mid z_{1:t})$ through the following update equation, which shows its dependency on the observation probability $p(z_t \mid z_{1:t-1})$ and on the likelihood coming from the observation model $p(z_t \mid x_t)$:

$$p(x_t \mid z_{1:t}) = \frac{p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}. \tag{2}$$

This formulation assumes that $x_t$ is the state of a Markov process with initial distribution $p(x_0)$. Equations (1) and (2) have an analytical solution only in the case of linear-Gaussian models, for which the Kalman filter provides the optimal solution. In the case of non-linear or non-Gaussian models, a numerical approximation, like the one provided by a Particle Filter (PF), must be used [1].
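For the linear-Gaussian case, the prediction and update steps of Equations (1) and (2) reduce to the Kalman filter recursion. The following is a minimal sketch for a scalar random-walk model; the function name `kalman_step`, the model and the noise values are illustrative assumptions and are not taken from the toolbox used in this work.

```python
def kalman_step(mean, var, z, q, r):
    """One predict/update cycle for x_t = x_{t-1} + w, z_t = x_t + v.

    q and r are the (assumed) process and measurement noise variances.
    """
    # Prediction, Eq. (1): the Gaussian posterior is propagated through the dynamics.
    pred_mean = mean
    pred_var = var + q
    # Update, Eq. (2): the prediction is corrected with the new observation.
    gain = pred_var / (pred_var + r)
    new_mean = pred_mean + gain * (z - pred_mean)
    new_var = (1.0 - gain) * pred_var
    return new_mean, new_var


# Hypothetical measurements of a quantity close to 1.
mean, var = 0.0, 1.0
for z in [0.9, 1.1, 1.0]:
    mean, var = kalman_step(mean, var, z, q=0.01, r=0.25)
```

After a few measurements the estimate moves towards the observed values and its variance shrinks, which is the filtering behaviour described above.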

A. Particle filtering

The Particle Filter approximates the solution of the Bayesian filtering problem by representing the a posteriori pdf as a finite set of samples (particles) $\chi_t = \{x_t^{(m)}, w_t^{(m)}\}_{m=1}^{N_s}$ that are weighted by using the importance sampling technique. According to this algorithm, a set of $N_s$ samples $\{\tilde{x}_t^{(m)}\}_{m=1}^{N_s}$ is drawn from a reference distribution $q(x_{0:t} \mid z_{1:t})$ instead of the a posteriori pdf $p(x_{0:t} \mid z_{1:t})$ which, in general, cannot be expressed in closed form. The difference between the reference pdf and the a posteriori one is compensated by a correction procedure that assigns a weight to each sample. Under independence hypotheses, the reference distribution can be expressed as $q(x_t \mid x_{t-1}, z_t)$ and the weights can be computed by using the Sequential Importance Sampling (SIS) technique as

$$w_t^{(m)} \propto w_{t-1}^{(m)}\, \frac{p(z_t \mid x_t^{(m)})\, p(x_t^{(m)} \mid x_{t-1}^{(m)})}{q(x_t^{(m)} \mid x_{t-1}^{(m)}, z_t)}. \tag{3}$$

According to the SIS technique, $N_s$ samples $\{x_t^{(m)}\}$ can be generated from the importance distribution $q(x_t \mid x_{t-1}, z_t)$ and the corresponding weights $w_t^{(m)}$ can be computed by using Eq. (3). After a normalization step, the a posteriori pdf at time $t$ can be approximated with a sum of $N_s$ weighted Dirac deltas

$$p(x_t \mid z_{1:t}) \approx \sum_{m=1}^{N_s} w_t^{(m)}\, \delta(x_t - x_t^{(m)}). \tag{4}$$

It has been demonstrated that this algorithm leads to an unstable behaviour due to the continuous increase of the variance of the weights. This can be solved by using re-sampling procedures that redistribute particles in such a way that those with low weights are discarded while those with higher weights are duplicated. The efficiency of the particle filter strongly depends on the appropriateness of the importance distribution $q(x_t \mid x_{t-1}, z_t)$, which must be close to the a posteriori pdf. Within visual tracking, the definition of a reliable importance distribution is a complex task because of the high variability of movements and deformations that can be associated with a monitored target.
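The SIS weight update and the re-sampling step can be sketched as follows for a scalar random-walk state. This is only an illustration under stated assumptions: the transition prior is used as importance distribution (so the weight update of Eq. (3) reduces to a multiplication by the likelihood), re-sampling is multinomial and triggered by the effective sample size, and all names and noise values are hypothetical.

```python
import math
import random


def gaussian_pdf(x, mean, std):
    """Density of a scalar Gaussian."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def particle_filter_step(particles, weights, z, proc_std, meas_std, rng):
    """One SIS step, followed by re-sampling when the weights degenerate."""
    n = len(particles)
    # Sample from the importance distribution; here q = p(x_t | x_{t-1}),
    # so Eq. (3) reduces to w_t proportional to w_{t-1} * p(z_t | x_t).
    particles = [x + rng.gauss(0.0, proc_std) for x in particles]
    weights = [w * gaussian_pdf(z, x, meas_std) for w, x in zip(weights, particles)]
    total = sum(weights)
    weights = [w / total for w in weights]  # normalization step
    # Effective sample size: small values signal a high weight variance.
    n_eff = 1.0 / sum(w * w for w in weights)
    if n_eff < n / 2:
        # Low-weight particles tend to be dropped, high-weight ones duplicated.
        particles = rng.choices(particles, weights=weights, k=n)
        weights = [1.0 / n] * n
    return particles, weights


rng = random.Random(0)
particles = [rng.gauss(0.0, 2.0) for _ in range(500)]
weights = [1.0 / 500] * 500
for z in [1.0, 1.2, 0.9, 1.1]:  # hypothetical measurements
    particles, weights = particle_filter_step(particles, weights, z, 0.1, 0.5, rng)
# Point estimate from the weighted Dirac approximation of Eq. (4).
estimate = sum(w * x for w, x in zip(weights, particles))
```

Using the transition prior as importance distribution is the simplest choice, not necessarily the most efficient one: as noted above, the closer the importance distribution is to the a posteriori pdf, the fewer particles are wasted.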

Typically, several features are extracted and tracked from a single target: this can be addressed as a multiple target tracking problem for each object. This is a challenging problem because of the uncertainty of the measurement origin (i.e., the centre of the object) and of the clutter and missed detections due to inaccuracies of the feature extractor and/or possible (self-)occlusions. It can be demonstrated that data association uncertainty in tracking leads to an NP-hard problem, i.e. the number of association hypotheses grows exponentially with time, as in the Multiple Hypotheses Tracker (MHT) [2]. As a result, approximations are required in order to make the problem feasible for a reasonable number of targets.

In the past few years, great focus has been given to sequential Monte Carlo (SMC) strategies under the Bayesian framework because they apply a semi-soft-decision data association approach, i.e. hard-decision associations per particle while keeping multiple parallel association hypotheses across all particles. Different strategies that avoid explicit enumeration of all possible hypotheses, and thus scale well with an increasing number of targets, are described in [3].

B. Rao-blackwellized particle filter

The algorithm that will be considered hereinafter is an extended version of Rao-Blackwellized Monte Carlo data association (RBMCDA) [4], which is basically an extended multiple target tracking algorithm based on Rao-Blackwellized particle filtering [5]. The algorithm models as stochastic processes not only the state evolution of the target, but also its creation and suppression, i.e. the estimates of the starting and ending time instants of target tracks are not based on heuristics, but are accounted for in the probabilistic model.

The basic idea behind the Rao-Blackwell algorithm is that, by conditioning on the data associations and on the target creation and suppression events, the a posteriori probability distribution of the target states can be approximated by means of Gaussian pdfs (marginalization of the filter, or Rao-Blackwellization). Suppose the state $x$ can be split into two substates $r$ and $s$ such that

$$p(x_t \mid x_{t-1}) = p(s_t \mid r_{t-1}, s_{t-1})\, p(r_t \mid r_{t-1}), \tag{5}$$

and $p(s_{0:t} \mid z_{1:t}, r_{0:t})$ is analytically tractable. Decompose the posterior as

$$p(r_{0:t}, s_{0:t} \mid z_{1:t}) = p(s_{0:t} \mid z_{1:t}, r_{0:t})\, p(r_{0:t} \mid z_{1:t}). \tag{6}$$

The term $p(s_{0:t} \mid z_{1:t}, r_{0:t})$ can be analytically estimated, while $p(r_{0:t} \mid z_{1:t})$ can be derived by means of the particle filter as

$$p(r_{0:t} \mid z_{1:t}) = \frac{p(z_t \mid z_{1:t-1}, r_{0:t})\, p(r_t \mid r_{t-1})\, p(r_{0:t-1} \mid z_{1:t-1})}{p(z_t \mid z_{1:t-1})}. \tag{7}$$

This way, target states can be computed analytically, while SMC particle filtering only needs to be applied to the data associations and to the creation and deletion processes. This greatly reduces computational complexity, thus increasing the efficiency of the particle filter. Moreover, the Rao-Blackwell theorem states that the mean squared error of the Rao-Blackwell estimator does not exceed that of the original estimator (i.e. of the non-marginalized particle filter).

In this framework, a data association event can be modelled through a data association indicator $c_t$, which takes $T_t + 1$ values: $c_t = 0$ for clutter and $c_t = j$, $j = 1, \dots, T_t$, for target $j$, where $T_t$ is the total number of targets at time instant $t$.

A priori probabilities for data association can be modelled as an $m$-th order Markov chain with uniform priors over targets and clutter:

$$p(c_t \mid c_{t-1}, \dots, c_{t-m}, T_{t-1}, \dots, T_{t-m}) = \frac{1}{1 + T_{t-1}}. \tag{8}$$

Clutter measurements can be uniformly distributed in a measurement space of volume $V$, namely

$$p(z_t \mid c_t = 0) = 1/V. \tag{9}$$
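To illustrate how the uniform prior of Eq. (8) and the clutter model of Eq. (9) combine with a target likelihood, the following sketch computes the posterior of the association indicator for a single 2-D measurement. It is a simplification under stated assumptions: each target likelihood is an isotropic Gaussian centred on a predicted position (the full algorithm would use the Kalman-filter predictive density), and the function name, the positions and the noise value are hypothetical.

```python
import math


def association_posterior(z, predicted, meas_std, volume):
    """Posterior over the association indicator c for one 2-D measurement z.

    predicted: list of predicted (x, y) target positions; index j-1 is target j.
    Returns [p(c=0 | z), p(c=1 | z), ..., p(c=T | z)].
    """
    prior = 1.0 / (1 + len(predicted))   # uniform prior, Eq. (8)
    scores = [prior / volume]            # clutter likelihood 1/V, Eq. (9)
    for px, py in predicted:
        d2 = (z[0] - px) ** 2 + (z[1] - py) ** 2
        # Isotropic 2-D Gaussian likelihood (assumed form).
        lik = math.exp(-0.5 * d2 / meas_std ** 2) / (2.0 * math.pi * meas_std ** 2)
        scores.append(prior * lik)
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical example: two targets and a measurement close to the first one.
probs = association_posterior((1.0, 1.1), [(1.0, 1.0), (4.0, 4.0)], 0.3, 5.5 * 5.5)
```

In a particle-based data association scheme, each particle would sample its own association from such a distribution rather than enumerating all joint hypotheses.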

Target measurements and target dynamics are basically modelled as in a Kalman filter (linear dynamic model and Gaussian noise):

$$p(z_t \mid x_{t,j}, c_t = j) = \mathcal{N}(z_t \mid H_{t,j} x_{t,j}, R_{t,j}), \tag{10}$$

$$p(x_{t,j} \mid x_{t-1,j}) = \mathcal{N}(x_{t,j} \mid A_{t-1,j} x_{t-1,j}, Q_{t-1,j}), \tag{11}$$

where the measurement, measurement noise covariance, transition and process noise covariance matrices $H_{t,j}$, $R_{t,j}$, $A_{t,j}$ and $Q_{t,j}$ can be different for each target. Extension to non-linear models is straightforward (EKF, UKF).

Target births only occur jointly with an association event: no association to a newborn target, no birth. Equivalently, the state prior is constant until a first measurement is associated, i.e. the target state does not follow the dynamic model before that moment. This very time instant is then identified as the target birth.

After an association is made, the lifetime $t_d$ of the target is given a probability density $p(t_d)$. A Gamma distribution $\Gamma(\alpha, \beta)$ has been chosen for modelling such a probability. Increasing values of $\alpha$ shift the distribution to the right, thus resulting in lower probabilities for short lifetimes, while increasing values of $\beta$ flatten the distribution. Basically, by modelling deaths, targets which have not been assigned a measurement for a certain amount of time are removed. If the last association with target $j$ was made at time $\tau$ and at the previous time step $t-1$ the sampled hypothesis corresponds to an alive target, then the probability of having a dead target at the current time step $t$ is

$$p(\text{death}_j) = P(t_d \in [t-1-\tau,\, t-\tau] \mid t_d \ge t-1-\tau). \tag{12}$$
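The conditional probability in Eq. (12) can be written with the cumulative distribution function $F$ of the lifetime as $(F(t-\tau) - F(t-1-\tau)) / (1 - F(t-1-\tau))$. The sketch below evaluates it, assuming that $\alpha$ is the shape and $\beta$ the scale of the Gamma distribution (an assumption, suggested by the remark that larger $\beta$ flattens the density). The helper names are hypothetical, and the CDF is computed with the standard series of the regularized incomplete gamma function.

```python
import math


def gamma_cdf(x, alpha, beta):
    """CDF of a Gamma distribution with shape alpha and scale beta (assumed)."""
    if x <= 0.0:
        return 0.0
    y = x / beta
    # Series: P(a, y) = y^a e^{-y} / Gamma(a) * sum_n y^n / (a (a+1) ... (a+n)).
    term = 1.0 / alpha
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= y / (alpha + n)
        total += term
        n += 1
    return total * math.exp(-y + alpha * math.log(y) - math.lgamma(alpha))


def death_probability(t, tau, alpha, beta):
    """Eq. (12): probability that a target last associated at tau dies at step t."""
    lo = gamma_cdf(t - 1 - tau, alpha, beta)
    hi = gamma_cdf(t - tau, alpha, beta)
    survive = 1.0 - lo
    return (hi - lo) / survive if survive > 0.0 else 1.0


# With alpha = 4 the death probability grows with the time since the last association.
p_recent = death_probability(3, 0, 4.0, 1.0)
p_stale = death_probability(10, 0, 4.0, 1.0)
```

For shape values above one the hazard increases with the elapsed time, which matches the intended behaviour: the longer a target goes without an associated measurement, the more likely it is to be removed.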

III. SETUP AND MODEL

RBMCDA has been tested through a Matlab toolbox [6], which implements all the algorithm's functionalities and can also provide simulations for testing. Here, tests have instead been run on real sequences. The dataset that has been considered can be found in [7], together with the ground-truth. It consists of four video sequences corresponding to four different views (i.e. different cameras) of a scene. Six people randomly enter a 5.5 m x 5.5 m area in a room and walk around, also randomly, for about 150 seconds at 25 fps. Together with the ground-truth data, homography parameters (coded in 3x3 matrices) for each camera are supplied, to allow ground-plane projections (or, equivalently, a top view). A screenshot of the four video views is shown in Fig. 1, while projections of the tracked objects on the world ground plane are given in Fig. 2. Tracked objects are marked with (∆) symbols (one colour for each camera view), while the ground-truth is marked with a (∗).

Fig. 1. Four different views of the scene

Fig. 2. Ground projections of tracked objects (∆) and ground-truth (∗)
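The ground-plane projection enabled by a 3x3 homography can be sketched as follows. The mapping direction (image to ground plane) and the matrix values are assumptions made for illustration only; the matrices supplied with the dataset may follow a different convention.

```python
def project_to_ground(h, u, v):
    """Map image point (u, v) to ground-plane coordinates with a 3x3 homography h."""
    # Homogeneous coordinates: [x, y, w]^T = h [u, v, 1]^T.
    x = h[0][0] * u + h[0][1] * v + h[0][2]
    y = h[1][0] * u + h[1][1] * v + h[1][2]
    w = h[2][0] * u + h[2][1] * v + h[2][2]
    if w == 0:
        raise ValueError("point maps to infinity")
    return x / w, y / w


# Made-up homography, not taken from the dataset.
H_EXAMPLE = [[0.01, 0.0, -1.0],
             [0.0, 0.02, -2.0],
             [0.0, 0.001, 1.0]]
gx, gy = project_to_ground(H_EXAMPLE, 320.0, 240.0)
```

Applying the homography of each camera to its own detections brings all views into a common ground-plane reference frame, where they can be compared with the ground-truth.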

The model that has been employed is now briefly sketched. The considered state vector is

$$x_t = \begin{bmatrix} x & y & \dot{x} & \dot{y} \end{bmatrix}^T, \tag{13}$$

where $x$, $y$, $\dot{x}$ and $\dot{y}$ are the horizontal and vertical coordinates and their time derivatives, respectively. The measurement vector is simply given by averaged position measurements

$$z_t = \begin{bmatrix} x & y \end{bmatrix}^T. \tag{14}$$

The transition matrix for the state dynamic evolution is

$$A_t = \begin{bmatrix} 1 & 0 & \Delta t & 0 \\ 0 & 1 & 0 & \Delta t \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \tag{15}$$

where $\Delta t$ is simply 1, as only discrete unitary time steps are considered here. The observation matrix is

$$H_t = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}. \tag{16}$$
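As a small worked example of the constant-velocity model, the sketch below propagates a state through the transition matrix of Eq. (15) and extracts the expected measurement with the observation matrix of Eq. (16). The state values and the helper name `mat_vec` are hypothetical, and noise is omitted.

```python
DT = 1.0  # unitary time step

# Transition matrix, Eq. (15): positions are advanced by the velocities.
A = [[1.0, 0.0, DT, 0.0],
     [0.0, 1.0, 0.0, DT],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 0.0, 1.0]]

# Observation matrix, Eq. (16): only the position is measured.
H = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]


def mat_vec(m, v):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


state = [2.0, 3.0, 0.5, -0.25]                 # [x, y, x_dot, y_dot], made-up values
predicted = mat_vec(A, state)                  # [2.5, 2.75, 0.5, -0.25]
expected_measurement = mat_vec(H, predicted)   # [2.5, 2.75]
```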

The measurement noise, with covariance $R_t$, is taken to be zero-mean Gaussian with zero off-diagonal correlation terms, while the process noise covariance matrix is of the form

$$Q_t = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & \sigma & 0 \\ 0 & 0 & 0 & \sigma \end{bmatrix}, \tag{17}$$

where $\sigma$ is an appropriate constant.

IV. PERFORMANCE EVALUATION

To evaluate the algorithm's performance, traditional global metrics, which exploit the information carried by all the particles, could not be employed. The key point is that they lead to an estimate that is corrupted by misassociations between particles and targets, and is thus quite meaningless: a particle can in fact be associated with different targets. Even when only the particle with the highest weight is considered, the association problem persists and its complexity grows exponentially with the number of targets.

A novel metric has therefore been designed, based on three performance indices which are discussed in the following, namely

• a fragmentation index $I_F$;
• a tracking index $I_T$;
• a visibility index $I_V$.

a) Fragmentation index: This index focuses on trajectory fragmentation and tries to measure its amount. It is the difference between the number of trajectories $T$ generated by the algorithm and the actual number of trajectories $T_{gt}$, divided by $T_{gt}$:

$$I_F := \frac{T - T_{gt}}{T_{gt}}. \tag{18}$$

Such an index approaches the optimal value of zero as the number of generated trajectories approaches the number of the real ones; it takes positive values when more trajectories are generated, i.e., roughly speaking, when fragmentation takes place. Negative values of $I_F$ basically result from losing a target without retrieving it.
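A direct transcription of Eq. (18) is given here as a minimal sketch; the function name and the trajectory counts in the example are hypothetical.

```python
def fragmentation_index(n_generated, n_ground_truth):
    """Eq. (18): relative excess of generated trajectories over real ones."""
    if n_ground_truth <= 0:
        raise ValueError("at least one ground-truth trajectory is required")
    return (n_generated - n_ground_truth) / n_ground_truth


fragmented = fragmentation_index(6, 5)   # one trajectory split in two: positive
lost = fragmentation_index(4, 5)         # one target lost and never retrieved: negative
ideal = fragmentation_index(5, 5)        # one trajectory per target: zero
```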

b) Tracking index: This index accounts for the number of frames in which target position estimates are effectively related to real data (independently of trajectory fragmentation). Equivalently, since the frame rate is constant, the index measures the amount of time in which position estimates are supported by ground-truth data. Given the set of all the measured trajectories $\gamma_i : (\Delta t)_i \to \mathbb{R}^2$, let $S$ be the sum of all the supports of the $\gamma_i$, $S = \sum_i (\Delta t)_i$, and let $S_{gt}$ be the same quantity computed for the ground-truth trajectories. The index is defined as

$$I_T := \frac{S_{gt} - S}{S_{gt}}. \tag{19}$$

It approaches zero (the optimal value) as fast as the generated trajectories can “hook up” the real ones when a birth occurs. In fact, the former never outlast the latter since, in the case that has been considered, real trajectories persist up to the end of the sequence.
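The tracking index of Eq. (19) can be sketched as follows, representing the support of each trajectory as an inclusive frame interval. This representation, the function names and the frame numbers are assumptions made for illustration.

```python
def total_support(intervals):
    """Sum of trajectory durations; each interval is (first_frame, last_frame), inclusive."""
    return sum(last - first + 1 for first, last in intervals)


def tracking_index(estimated, ground_truth):
    """Eq. (19): fraction of ground-truth frames not covered by estimated trajectories."""
    s_gt = total_support(ground_truth)
    if s_gt == 0:
        raise ValueError("ground-truth support must be non-empty")
    return (s_gt - total_support(estimated)) / s_gt


# Two real targets present for 500 frames; the tracker hooks them after 10 and 40 frames.
i_t = tracking_index([(10, 499), (40, 499)], [(0, 499), (0, 499)])
```

In this made-up example 50 of the 1000 ground-truth frames are missed, so the index equals 0.05; a shorter delay in hooking the targets would bring it closer to zero.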

c) Visibility index: This performance index measures the difference between the actual number of people on the scene ($N_{gt}$) and the number estimated by the algorithm ($N$), averaged over all the $N_f$ processed frames:

$$I_V := \frac{1}{N_f} \sum_{i=1}^{N_f} (N_{gt} - N)_i. \tag{20}$$

This index has an optimal value of zero, like the previous ones. However, its optimization turns out to be extremely hard, as it is strongly affected by frequent and unpredictable occlusions in the scene.
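A minimal sketch of the visibility index follows. It uses the sign convention of Eq. (20) as written, i.e. ground-truth count minus estimated count; the function name and the per-frame counts are hypothetical.

```python
def visibility_index(n_ground_truth, n_estimated):
    """Eq. (20): per-frame difference N_gt - N, averaged over all frames."""
    if not n_ground_truth or len(n_ground_truth) != len(n_estimated):
        raise ValueError("both sequences must be non-empty and of equal length")
    n_frames = len(n_ground_truth)
    return sum(g - e for g, e in zip(n_ground_truth, n_estimated)) / n_frames


# Three people on the scene; one of them is missed in the first two frames.
i_v = visibility_index([3, 3, 3, 3], [2, 2, 3, 3])
```

Because positive and negative per-frame differences can cancel out in the average, a value close to zero does not by itself guarantee a correct count in every frame.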

V. RESULTS

Performance has been evaluated by tuning some of the parameters introduced in Section II-B, namely $P_b$, which models the birth probability density, and the two parameters $\alpha$ and $\beta$, which shape the Gamma distribution that models the lifetime probability of each target.

Two 500-frame scenes of different intrinsic complexity have been studied. They will be referred to as Scene 1 and Scene 2 (the simpler and the more complex scene, respectively).

The three parameters have been varied in turn, while keeping the other two fixed.

Table I summarizes some results for Scene 1. The ranges of $\alpha$, $\beta$ and $P_b$ have been explored systematically to locate local optima of the three performance indexes. Low values of $\alpha$ result in short lifetimes, and hence in trajectory fragmentation, as one can infer from the positive values of $I_F$ (Table I(a)). On the other hand, the values of $I_T$ and $I_V$ are not monotonic; instead, there seems to be a minimum in absolute value at $\alpha = 4$. By varying $\beta$ (Table I(b)), the fragmentation index $I_F$ remains constant at the optimal value of zero, while the other two indexes show no clear trend; $\beta = 1$ seems to give the best fit. Finally (Table I(c)), better performance is obtained with the highest considered value of $P_b$, i.e. $P_b = 0.15$. $I_V$ and $I_T$ improve as $P_b$ rises: this is basically due to the fact that there is less delay in hooking the target as the birth probability increases. Higher values of $P_b$ can, however, result in positive values of $I_F$ (as for $P_b = 0.1$), as the algorithm generates too many targets.

TABLE I
SCENE 1

(a) Varying $\alpha$ ($\beta = 1$, $P_b = 0.01$)

| $\alpha$ | $I_F$ | $I_T$ | $I_V$ |
|---|---|---|---|
| 2 | 0.5 | 0.1496 | -0.2020 |
| 3 | 0.5 | 0.1585 | -0.2140 |
| 4 | 0 | 0.1274 | -0.1720 |
| 5 | 0 | 0.1481 | -0.2000 |
| 6 | 0 | 0.1481 | -0.2000 |

(b) Varying $\beta$ ($\alpha = 4$, $P_b = 0.01$)

| $\beta$ | $I_F$ | $I_T$ | $I_V$ |
|---|---|---|---|
| 0.5 | 0 | 0.1541 | -0.2080 |
| 1 | 0 | 0.1274 | -0.1720 |
| 2 | 0 | 0.1600 | -0.2160 |
| 3 | 0 | 0.1481 | -0.2000 |
| 4 | 0 | 0.1422 | -0.1920 |

(c) Varying $P_b$ ($\alpha = 4$, $\beta = 1$)

| $P_b$ | $I_F$ | $I_T$ | $I_V$ |
|---|---|---|---|
| 0.05 | 0 | 0.1437 | -0.1940 |
| 0.10 | 0.5 | 0.1215 | -0.1640 |
| 0.15 | 0 | 0.1111 | -0.1500 |

Table II summarizes some results for Scene 2. Here the scene is complicated by more targets, resulting in more occlusions and misdetections. The optimal value for $\alpha$ has been found to be $\alpha = 8$ (Table II(a)). It can be noticed that the values of the three indexes show here a kind of local minimum: in particular, the algorithm estimates more trajectories than the real ones for $\alpha \le 7$ ($I_F > 0$) and fewer for $\alpha > 8$ ($I_F < 0$). Table II(b) shows how $\beta = 5$ represents a kind of turning point for the values of all the indexes. In particular, $I_V$ turns positive, meaning that excessively delayed target deaths result in too many people being detected on the scene. From the analysis of Table II(c), it looks like the algorithm cannot avoid fragmenting trajectories in a more complex scene.

For Scene 2 a deeper analysis was performed by jointly varying both the parameters $\alpha$ and $\beta$, while keeping $P_b$ fixed (Figure 3). The vertical coordinate of the plot is $z = |I_F| + |I_T| + |I_V|$. It can be noticed that $z$ clearly decreases as $\alpha$ and $\beta$ increase. This is consistent with the above-mentioned fact that real trajectories in the scene last until the end of the sequence, i.e. no target exits the scene before the very end. As a matter of fact, high values of $\alpha$ and $\beta$ allow for long lifetimes.
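The aggregate score $z$ can be computed directly from the three indexes. The sketch below applies it to the rows of Table II(b) ($\alpha = 8$, $P_b = 0.1$) rather than to the data behind Figure 3, which are not listed in the text; the function and variable names are hypothetical.

```python
def combined_score(i_f, i_t, i_v):
    """z = |I_F| + |I_T| + |I_V|; lower is better, zero is optimal."""
    return abs(i_f) + abs(i_t) + abs(i_v)


# (beta, I_F, I_T, I_V) rows copied from Table II(b): alpha = 8, Pb = 0.1.
rows = [
    (1, 0.0, 0.0689, -0.1040),
    (2, -0.2, 0.0765, -0.1340),
    (4, -0.2, 0.0618, -0.0760),
    (5, 0.0, 0.0263, -0.0640),
    (6, 0.0, 0.0263, -0.0640),
]
scores = {beta: combined_score(i_f, i_t, i_v) for beta, i_f, i_t, i_v in rows}
best_beta = min(scores, key=scores.get)  # first beta reaching the lowest score
```

On these rows the lowest score is reached at $\beta = 5$ and is matched by $\beta = 6$. Summing absolute values weights the three indexes equally, which is a design choice: a different weighting would be needed if, for instance, fragmentation were considered more harmful than a delayed track start.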

By exploiting the best values of $\alpha$ and $\beta$ just found, the average distance error between real and estimated trajectories has been computed. Its dependence on the number of particles has also been evaluated. Results are given in Table III. It can be noticed that the lowest average error corresponds to $n = 10$ particles. However, as one can infer from the (negative) value of $I_F$ given in the table, this is due to the fact that fewer trajectories have been generated. For an equal number of trajectories, better performance is reached as the number of particles increases. This is consistent with the theory, which predicts a performance improvement as the number of particles grows, especially in noisy conditions.

TABLE II
SCENE 2

(a) Varying $\alpha$ ($\beta = 4$, $P_b = 0.1$)

| $\alpha$ | $I_F$ | $I_T$ | $I_V$ |
|---|---|---|---|
| 6 | 0.2 | 0.1109 | -0.2700 |
| 7 | 0.2 | 0.0841 | -0.1640 |
| 8 | 0 | 0.0689 | -0.1040 |
| 9 | -0.2 | 0.0668 | -0.0960 |
| 10 | -0.2 | 0.0967 | -0.2140 |

(b) Varying $\beta$ ($\alpha = 8$, $P_b = 0.1$)

| $\beta$ | $I_F$ | $I_T$ | $I_V$ |
|---|---|---|---|
| 1 | 0 | 0.0689 | -0.1040 |
| 2 | -0.2 | 0.0765 | -0.1340 |
| 4 | -0.2 | 0.0618 | -0.0760 |
| 5 | 0 | 0.0263 | -0.0640 |
| 6 | 0 | 0.0263 | -0.0640 |

(c) Varying $P_b$ ($\alpha = 8$, $\beta = 5$)

| $P_b$ | $I_F$ | $I_T$ | $I_V$ |
|---|---|---|---|
| 0.05 | 0.2 | 0.0613 | -0.0740 |
| 0.10 | 0.2 | 0.0466 | -0.0160 |
| 0.15 | 0.2 | 0.0476 | -0.0200 |

Fig. 3. $P_b = 0.01$

TABLE III
AVERAGE ERROR BETWEEN ESTIMATED AND REAL TRAJECTORIES (GROUND-PIXELS)

| Number of particles | Average error (pixels) | $I_F$ |
|---|---|---|
| 10 | 9.50 | -0.2 |
| 100 | 11.16 | 0 |
| 600 | 10.33 | 0 |
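An average distance error of the kind reported in Table III can be sketched as the mean Euclidean distance between corresponding points. The text does not state how estimated and ground-truth trajectories are matched, so the sketch assumes that the two point sequences are already associated and time-aligned; the function name and the coordinates are hypothetical.

```python
import math


def average_distance_error(estimated, ground_truth):
    """Mean Euclidean distance between time-aligned 2-D points (assumed pairing)."""
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("both sequences must be non-empty and of equal length")
    distances = [math.dist(p, q) for p, q in zip(estimated, ground_truth)]
    return sum(distances) / len(distances)


# Made-up ground-plane positions: errors of 0 and 5 pixels give a mean of 2.5.
error = average_distance_error([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
```

Such an average is only computed over frames where an estimate exists, which is why a configuration that generates fewer trajectories can show a lower error, as noted for the 10-particle case.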

VI. CONCLUSIONS

A set-up for RBMCDA testing has been presented. A novel metric based on three indexes has been exploited in order to evaluate the algorithm performance, both qualitatively and quantitatively. Each index accounts for a typical problem in visual tracking. Performance has been evaluated as $\alpha$, $\beta$ and $P_b$, which model target birth and death probabilities, vary.

Future developments of this work include a comparative performance analysis with other trackers. Including the evaluation of a simple (non-marginalized) particle filter could provide empirical evidence for the Rao-Blackwell theorem.

REFERENCES

[1] A. Doucet, S. Godsill, and C. Andrieu, “On Sequential Monte Carlo Sampling Methods for Bayesian Filtering,” Statistics and Computing, vol. 10, pp. 197–208, 2000. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.12.5875

[2] D. B. Reid, “An algorithm for tracking multiple targets,” IEEE Transactions on Automatic Control, vol. 24, pp. 843–854, 1979.

[3] J. Vermaak, S. J. Godsill, and P. Pérez, “Monte Carlo filtering for multi-target tracking and data association,” IEEE Transactions on Aerospace and Electronic Systems, vol. 41, pp. 309–332, 2005.

[4] S. Särkkä, A. Vehtari, and J. Lampinen, “Rao-Blackwellized particle filter for multiple target tracking,” Information Fusion Journal, vol. 8, p. 2007, 2005.

[5] A. Doucet, N. De Freitas, K. Murphy, and S. Russell, “Rao-Blackwellised particle filtering for dynamic Bayesian networks,” in Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 2000, pp. 176–183.

[6] LCE Models and Methods: RBMCDA Toolbox for Matlab TKK -http://www.lce.hut.fi/research/mm/rbmcda/.