Real-Time People Localization and Tracking through Fixed Stereo Vision

S. Bahadori¹, L. Iocchi¹,², G.R. Leone¹, D. Nardi¹, and L. Scozzafava¹

¹ Dipartimento di Informatica e Sistemistica, University of Rome “La Sapienza”, Rome, Italy

E-mail: {lastname}@dis.uniroma1.it

² Artificial Intelligence Center, SRI International, Menlo Park, CA, USA

Abstract. Detecting, locating, and tracking people in a dynamic environment is important in many applications, ranging from security and environmental surveillance to assistance to people in domestic environments, to the analysis of human activities. To this end, several methods for tracking people have been developed using monocular cameras, stereo sensors, and radio frequency tags.

In this paper we describe a real-time People Localization and Tracking (PLT) System, based on a calibrated fixed stereo vision sensor. The system analyzes three interconnected representations of the stereo data (the left intensity image, the disparity image, and the 3-D world locations of measured points) to dynamically update a model of the background, extract foreground objects (such as people and rearranged furniture), and track their positions in the world.

The system can detect and track people moving in an area approximately 3 x 8 meters in front of the sensor with high reliability and good precision.

1 Introduction

Localization and tracking of people in a dynamic environment is a key building block for many applications, including surveillance, monitoring, and elderly assistance. The fundamental capability for a people tracking system is to determine the trajectory of each person within the environment.

In recent years this problem has been primarily studied by using two different kinds of sensors: i) markers placed on the person to transmit their real world position to a receiver in the environment; ii) video cameras. The first approach provides high reliability, but is limited by the fact that it requires markers to be placed on the people being tracked, which is not feasible in many applications.

There are several difficulties to be faced in developing a vision-based people tracking system: first, people tracking is difficult even in moderately crowded environments, because of occlusions and people walking close to each other or to the sensor; second, people recognition is difficult and cannot easily be integrated in the tracking system; third, people may leave the field of view of the sensor and re-enter it after some time (or they may enter the field of view of another sensor), and applications may require the ability to recognize (or re-acquire) a person previously tracked (or tracked by another sensor in the network of sensors).

Proc. of International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems (IEA/AIE), 2005

Several approaches have been developed for tracking people in different applications. At the top level, these approaches can be grouped into classes on the basis of the sensors used: a single camera (e.g. [17, 18]); stereo cameras (e.g. [4, 6, 2, 3]); or multiple calibrated cameras (e.g. [5, 13]).

Although it is possible to determine the 3-D world positions of tracked objects with a single camera (e.g. [18]), a stereo sensor provides two critical advantages:

1) it makes it easier to segment an image into objects (e.g., distinguishing people from their shadows); 2) it produces more accurate location information for the tracked people.

On the other hand, approaches using several cameras viewing a scene from significantly different viewpoints are able to deal better with occlusions than a single stereo sensor can, because they view the scene from many directions.

However, such systems are difficult to set up (for example, establishing their geometric relationships or solving synchronization problems), and the scalability to large environments is limited, since they may require a large number of cameras.

This paper describes the implementation of a People Localization and Tracking (PLT) System, using a calibrated fixed stereo vision sensor.

The novel features of our system can be summarized as follows: 1) the background model is a composition of intensity, disparity and edge information, and is adaptively updated with a learning factor that varies over time and is different for each pixel; 2) plan-view projection computes height maps, which are used to detect people in the environment and refine foreground segmentation in case of partial occlusions; 3) plan-view positions and temporal color-based appearance models are integrated in the tracker, and an optimization problem is solved in order to determine the best matching between the observations and the current status of the tracker.

2 System Architecture

The architecture of the PLT System, shown in Figure 1, is based on the following components:

– Stereo Computation, which computes disparities from the stereo images acquired by the camera.

– Background Modelling, which maintains an updated model of the background, composed of intensities, disparities, and edges (see Section 3).

– Foreground Segmentation, which extracts foreground pixels and image blobs from the current image, by a type of background subtraction that combines intensity and disparity information (see Section 3).

– Plan View Projection, which projects foreground points into a real world (3-D) coordinate system and computes world blobs identifying moving objects in the environment (see Section 4).

Fig. 1. PLT System Architecture

– People Modelling, which creates and maintains appearance models of the people being tracked (see Section 5).

– Tracker, which maintains a set of tracked objects, associating them with world blobs by using an integrated representation of people location and appearance and a Kalman Filter for updating the status of the tracker (see Section 6).

The stereo vision system is composed of a pair of synchronized fire-wire cameras and the Small Vision System (SVS) software [10], which provides a real-time implementation of a correlation-based stereo algorithm. We assume that the stereo camera has been “calibrated” in three ways: correcting lens distortion (done by the SVS software), computing the left-right stereo geometry (also done by the SVS software) and estimating the sensor’s position and orientation in the 3-D world (done by standard external calibration methods). Given this calibration information, the system can compute quantities such as the location of the ground plane and the 3-D locations of all stereo measurements.

For best results in the localization and tracking process, we have chosen to place the camera high on the ceiling, pointing down at an angle of approximately 30 degrees with respect to the horizon. This placement serves both tracking and person modelling well.

In the following sections we describe in further detail the components of our system, except for the Stereo Computation module, whose description can be found in [10].

3 Background modelling and Foreground segmentation

When using a static camera for object detection and tracking, maintaining a model of the background and consequent background subtraction is a common technique for image segmentation and for identifying foreground objects. In order to account for variations in illumination, shadows, reflections, and background changes, it is useful to integrate information about intensity and range and to dynamically update the background model. Moreover, such an update must be different in the parts of the image where there are moving objects [17, 16, 8].

In our work, we maintain a background model that includes information about intensity, disparity, and edges, represented as a Gaussian probability distribution. Although more sophisticated representations can be used (e.g. mixture of Gaussians [16, 8]), we decided to use a simple model for efficiency reasons. We also decided not to use color information in the model, since intensity and range usually provide a good segmentation, while reducing computational time.

The model of the background is represented at every time $t$ and for every pixel $i$ by a vector $X_{t,i}$, including information about intensity, disparity, and edges computed with a Sobel operator. In order to take into account the uncertainty in these measures, we use a Gaussian distribution over $X_{t,i}$, denoted by mean $\mu_{X_{t,i}}$ and variance $\sigma^2_{X_{t,i}}$. Moreover, we assume the values for intensity, disparity, and edges to be independent of each other.

This model is dynamically updated at every cycle (i.e., for each new stereo image every 100 ms) and is controlled by a learning factor $\alpha_{t,i}$ that changes over time $t$ and is different for every pixel $i$.

$$\mu_{X_{t,i}} = (1 - \alpha_{t,i})\,\mu_{X_{t-1,i}} + \alpha_{t,i}\,X_{t,i}$$

$$\sigma^2_{X_{t,i}} = (1 - \alpha_{t,i})\,\sigma^2_{X_{t-1,i}} + \alpha_{t,i}\,(X_{t,i} - \mu_{X_{t-1,i}})^2$$
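As an illustration of the two update rules, the following minimal Python sketch applies them to a single scalar channel of one pixel (for example intensity). The names `PixelModel` and `update_pixel` are hypothetical and not part of the system described here; a full model would hold one such pair per channel (intensity, disparity, edge) and per pixel, under the independence assumption stated above.

```python
from dataclasses import dataclass


@dataclass
class PixelModel:
    """Running Gaussian for one channel of one pixel (hypothetical helper)."""
    mean: float
    var: float


def update_pixel(model: PixelModel, x: float, alpha: float) -> PixelModel:
    """Apply one step of the mean/variance update with learning factor alpha.

    The variance term uses the deviation from the PREVIOUS mean,
    as in the variance equation above.
    """
    diff = x - model.mean
    new_mean = (1.0 - alpha) * model.mean + alpha * x
    new_var = (1.0 - alpha) * model.var + alpha * diff * diff
    return PixelModel(new_mean, new_var)


# Training-phase example with alpha = 0.25 (the value quoted in the text).
m = update_pixel(PixelModel(mean=100.0, var=4.0), x=110.0, alpha=0.25)
print(m)
```

With `alpha = 0` the model is left unchanged, which is how a pixel covered by a moving object can be excluded from the background update.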

The learning factor $\alpha_{t,i}$ is set to a higher value (e.g. 0.25) for all pixels during an initial training phase (e.g. the first 5 seconds) after the application is started, in order to quickly acquire a model of the background. In this phase we assume the scene contains only background objects. Note that the training phase can be removed completely: the system is able to build a background model even when moving foreground objects are present from the beginning of the run, although the model will then take longer to stabilize.

After this training phase $\alpha_{t,i}$ is set to a lower nominal value (e.g. 0.10) and modified depending on the status of pixel $i$. In regions of the image where there are no moving objects, the learning factor $\alpha_{t,i}$ is increased (e.g. to 0.15), speeding up model updating, while in regions where there are moving objects this factor is decreased (or set to zero). In this way we are able to quickly update the background model in those parts of the image that contain stationary objects and avoid including people (and, in general, moving objects) in the background. The numerical values used for $\alpha_{t,i}$ depend on the characteristics of the application and can be used to tune how quickly the background model reacts to changes.

In order to determine regions of the images in which background should not be updated, the work in [8] proposes to compute activities of pixels based on intensity difference with respect to the previous frame. In our work, instead, we have computed activities of pixels as the difference between the edges in the current image and the background edge model. The motivation behind this choice is that people produce variations in their edges over time even if they are standing still (due to breathing, small variations of pose, etc.), while static objects, such as chairs and tables, do not. However, note that edge variations correctly determine only the contour of a person or moving foreground object, and not all the pixels inside this contour; therefore, if we consider as active only those pixels that have high edge variation, we may not be able to correctly identify the internal pixels of a person. For example, if a person with uniform color clothes is standing still in a scene, there is high probability that the internal pixels of his/her body have constant intensity over time, and a method for background update based only on intensity differences (e.g., [8]) will eventually integrate these internal pixels into the background.

To overcome this problem we have implemented a procedure that computes activities of pixels included in a contour with high edge variation. This computation is based on first determining horizontal and vertical activities $H_t(v)$ and $V_t(u)$, obtained for each row $v$ and each column $u$ of the image by summing, over the pixels $(u, v)$ of that row or column, the variation between the current edge $E$ and the edge component $\mu_E$ of the background model.

$$H_t(v) = \sum_{u} \left| E_{t,(u,v)} - \mu_{E,t,(u,v)} \right| \qquad V_t(u) = \sum_{v} \left| E_{t,(u,v)} - \mu_{E,t,(u,v)} \right|$$

Then, these values are combined in order to assign higher activity values to those pixels that belong to both a column and a row with high horizontal and vertical activity:

$$A_t(u, v) = (1 - \lambda)\,A_{t-1}(u, v) + \lambda\,H_t(v)\,V_t(u)$$

In this way, the pixels inside a contour determined by edge variations will be assigned a high activity level. Note also that, since the term $H_t(v)V_t(u)$ takes into account internal pixels for people with uniformly colored clothes, the learning factor $\lambda$ can be set to a high value to quickly respond to changes. In our implementation the learning factor $\lambda$ used for updating activities is set to 0.20.

The value $A_t(u, v)$ is then used for determining the learning factor of the background model: the higher the activity $A_t(u, v)$ at each pixel $i = (u, v)$, the lower the learning factor $\alpha_{t,i}$. More specifically, we set $\alpha_{t,(u,v)} = \alpha_{NOM}\,(1 - \eta A_t(u, v))$, where $\eta$ is a normalizing factor.
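The activity computation can be sketched in a few lines of Python. This is only an illustrative sketch on a tiny edge image stored as nested lists indexed `[v][u]`; the function names, the value chosen for the normalizing factor `eta`, and the clamping of the learning factor at zero are assumptions made for the example, not details taken from the system.

```python
def row_col_activity(edges, bg_edges):
    """Return (H, V): per-row and per-column sums of |E - mu_E|."""
    rows, cols = len(edges), len(edges[0])
    diff = [[abs(edges[v][u] - bg_edges[v][u]) for u in range(cols)]
            for v in range(rows)]
    h = [sum(diff[v]) for v in range(rows)]                        # H_t(v)
    v_act = [sum(diff[v][u] for v in range(rows)) for u in range(cols)]  # V_t(u)
    return h, v_act


def update_activity(prev, h, v_act, lam=0.20):
    """A_t(u,v) = (1 - lam) * A_{t-1}(u,v) + lam * H_t(v) * V_t(u)."""
    return [[(1.0 - lam) * prev[v][u] + lam * h[v] * v_act[u]
             for u in range(len(v_act))] for v in range(len(h))]


def learning_factor(activity, alpha_nom=0.10, eta=0.5):
    """alpha = alpha_nom * (1 - eta * A), clamped at 0 (clamp is an assumption)."""
    return [[max(0.0, alpha_nom * (1.0 - eta * a)) for a in row]
            for row in activity]


# A 3x3 contour with an edge-free interior pixel at (u, v) = (1, 1).
edges = [[1, 1, 1],
         [1, 0, 1],
         [1, 1, 1]]
background = [[0, 0, 0] for _ in range(3)]
previous = [[0.0, 0.0, 0.0] for _ in range(3)]

H, V = row_col_activity(edges, background)
A = update_activity(previous, H, V)
alpha = learning_factor(A)
print(H, V, A[1][1], alpha[1][1])
```

In this example the interior pixel has no edge variation of its own, yet it receives a non-zero activity because its row and its column both cross the contour, which is the effect the product $H_t(v)V_t(u)$ is meant to achieve.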

Foreground segmentation is then performed by background subtraction from the current intensity and disparity images. By taking into account both intensity and disparity information, we are able to correctly deal with shadows, detected as intensity changes, but not disparity changes, and foreground objects that have the same color as the background, but different disparities. Therefore, by combining intensity and disparity information in this way, we are able to avoid false positives due to shadows, and false negatives due to similar colors, which typically affect systems based only on intensity background modeling.
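The decision logic described in this paragraph can be illustrated with a per-pixel sketch. The exact rule used by the system is not given here, so the following is an assumption: a value counts as changed when it deviates from the background mean by more than `k` standard deviations, and pixels without a valid disparity fall back to the intensity test alone. The names `Background`, `deviates`, and `classify_pixel` are hypothetical.

```python
import math
from typing import NamedTuple, Optional


class Background(NamedTuple):
    """Per-pixel background statistics (hypothetical container)."""
    mean_i: float  # intensity mean
    var_i: float   # intensity variance
    mean_d: float  # disparity mean
    var_d: float   # disparity variance


def deviates(x: float, mean: float, var: float, k: float) -> bool:
    """True if x lies more than k standard deviations from the mean."""
    return abs(x - mean) > k * math.sqrt(var)


def classify_pixel(intensity: float, disparity: Optional[float],
                   bg: Background, k: float = 2.5) -> str:
    intensity_changed = deviates(intensity, bg.mean_i, bg.var_i, k)
    if disparity is None:
        # No stereo match: assumed fallback to intensity only.
        return "foreground" if intensity_changed else "background"
    if deviates(disparity, bg.mean_d, bg.var_d, k):
        # Range changed: foreground even if the color matches the background.
        return "foreground"
    # Range unchanged: an intensity change alone is treated as a shadow.
    return "shadow" if intensity_changed else "background"


bg = Background(mean_i=120.0, var_i=16.0, mean_d=30.0, var_d=1.0)
print(classify_pixel(60.0, 30.5, bg))   # darker, same range
print(classify_pixel(121.0, 45.0, bg))  # same color, different range
print(classify_pixel(122.0, 30.0, bg))  # nothing changed
```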

The final steps of the foreground segmentation module are to compute connected components (i.e. image blobs) and characterize the foreground objects in the image space. These objects are then passed to the Plan View Segmentation module.

4 Plan View Segmentation

In many applications it is important to know the 3-D world locations of the tracked objects. We do this by employing a plan view [3]. This representation also makes it easier to detect partial occlusions between people.

Our approach projects all foreground points into the plan view reference system, by using the stereo calibration information to map disparities into the sensor’s 3-D coordinate system and then the external calibration information to map these points from the sensor’s 3-D coordinate system to the world’s 3-D coordinate system.
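The two mapping steps can be sketched as follows, assuming a rectified pinhole stereo pair, for which depth is $Z = f\,b/d$ with focal length $f$ (in pixels), baseline $b$, and disparity $d$. In the system this conversion is provided by the calibration data and the SVS software; the function names and numeric values below are hypothetical and only illustrate the geometry.

```python
def disparity_to_camera(u, v, d, f, baseline, cx, cy):
    """Back-project pixel (u, v) with disparity d into the sensor frame.

    Assumes rectified images, focal length f in pixels, principal point
    (cx, cy), and baseline in metres.
    """
    z = f * baseline / d
    return ((u - cx) * z / f, (v - cy) * z / f, z)


def camera_to_world(p, rotation, translation):
    """Apply the external calibration: world = R * p + t."""
    return tuple(
        sum(rotation[r][k] * p[k] for k in range(3)) + translation[r]
        for r in range(3)
    )


# Illustrative numbers only (not the calibration of the actual sensor).
p_cam = disparity_to_camera(u=420, v=240, d=10.0, f=500.0,
                            baseline=0.1, cx=320, cy=240)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p_world = camera_to_world(p_cam, identity, (0.0, 0.0, 2.5))
print(p_cam, p_world)
```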

For plan view segmentation, we compute a height map, that is a discrete map relative to the ground plane in the scene, where each cell of the height map is filled with the maximum height of all the 3-D points whose projection lies in that cell, in such a way that higher objects (e.g., people) will have a high score.

The height map is smoothed with a Gaussian filter to remove noise, and then it is searched to detect connected components that we call world blobs (see Fig. 2b where darker points correspond to higher values). Since we are interested in person detection, world blobs are filtered on the basis of their size in the plan view and their height, thus removing blobs with sizes and heights inconsistent with people. The Plan View Segmentation returns a set of world blobs that could be people moving in the scene.
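A minimal sketch of the height map and of the world-blob extraction is given below. It assumes world points expressed in metres with $z$ as the height above the ground plane, a 10 cm cell size over a 3 x 8 m area, and 4-connected components; these values, the thresholds, and the function names are assumptions for illustration, and the Gaussian smoothing step is omitted for brevity.

```python
from collections import deque


def height_map(points, cell=0.1, width=30, depth=80):
    """Fill each plan-view cell with the maximum height of its 3-D points."""
    grid = [[0.0] * width for _ in range(depth)]
    for x, y, z in points:
        cx, cy = int(x // cell), int(y // cell)
        if 0 <= cx < width and 0 <= cy < depth:
            grid[cy][cx] = max(grid[cy][cx], z)
    return grid


def world_blobs(grid, min_height=1.0, min_cells=2):
    """Return 4-connected groups of occupied cells that pass the filters."""
    depth, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(depth)]
    blobs = []
    for r in range(depth):
        for c in range(width):
            if grid[r][c] <= 0.0 or seen[r][c]:
                continue
            queue, cells = deque([(r, c)]), []
            seen[r][c] = True
            while queue:  # breadth-first flood fill
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr + 1, cc), (cr - 1, cc),
                               (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < depth and 0 <= nc < width
                            and grid[nr][nc] > 0.0 and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            top = max(grid[cr][cc] for cr, cc in cells)
            if top >= min_height and len(cells) >= min_cells:
                blobs.append(cells)
    return blobs


points = [
    (0.52, 1.03, 1.70), (0.63, 1.04, 1.20),  # person A
    (1.55, 1.05, 1.80), (1.55, 1.15, 1.00),  # person B
    (2.55, 3.05, 0.50), (2.65, 3.05, 0.45),  # low object, e.g. a chair
]
blobs = world_blobs(height_map(points))
print(len(blobs))
```

The low object forms a connected component as well, but it is discarded by the height filter, mirroring the removal of blobs whose height is inconsistent with a person.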

It is important to notice that Plan View Segmentation is able to correctly deal with partial occlusions that are not detected by foreground analysis. For example, in Figure 2 a situation is shown in which a single image blob (Fig. 2a) covers two people, one of which is partially occluded, while the Plan View Segmentation process detects two world blobs (Fig. 2b). By considering the association between pixels in the image blobs and world blobs, we are able to determine image masks corresponding to each person, which we call person blobs. This process allows for refining foreground segmentation in situations of partial occlusions and for correctly building person appearance models.

5 People Modelling

In order to track people over time in the presence of occlusions, or when they leave and re-enter the scene, it is necessary to have a model of the tracked people.

Several models for people tracking have been developed (see for example [7, 15, 12, 9]), but color histograms and color templates (as presented in [15]) are not sufficient for capturing complete appearance models, because they do not take into account the actual position of the colors on the person.

Following [7, 12], we have defined temporal color-based appearance models of a fixed resolution, represented as a set of unimodal probability distributions in the RGB space (i.e. 3-D Gaussians), for each pixel of the model. Computation of such models is performed by first scaling the portion of the image characterized by a person blob to a fixed resolution and then updating the probability distribution for each pixel in the model. Appearance models computed at this stage are used during tracking to improve the reliability of the data association process.
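The two steps (rescaling and per-pixel update) can be sketched as follows. The update rule for the appearance model is not specified here, so this sketch makes several assumptions: nearest-neighbour rescaling, a running update with a hypothetical learning factor `beta`, and independent R, G, B channels (a diagonal covariance). The function names are also hypothetical.

```python
def rescale(patch, out_h, out_w):
    """Nearest-neighbour rescaling of a person blob to a fixed resolution."""
    in_h, in_w = len(patch), len(patch[0])
    return [[patch[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)] for r in range(out_h)]


def update_appearance(model, patch, beta=0.1):
    """Update each model pixel, stored as (mean_rgb, var_rgb) 3-tuples.

    `patch` must already have the resolution of `model`.
    """
    updated = []
    for model_row, patch_row in zip(model, patch):
        row = []
        for (mean, var), rgb in zip(model_row, patch_row):
            new_mean = tuple((1 - beta) * m + beta * x
                             for m, x in zip(mean, rgb))
            new_var = tuple((1 - beta) * s + beta * (x - m) ** 2
                            for s, m, x in zip(var, mean, rgb))
            row.append((new_mean, new_var))
        updated.append(row)
    return updated


# 4x4 blob whose red channel encodes the pixel position, scaled to 2x2.
blob = [[(r * 10 + c, 0, 0) for c in range(4)] for r in range(4)]
small = rescale(blob, 2, 2)
model = [[((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)) for _ in range(2)]
         for _ in range(2)]
model = update_appearance(model, small, beta=0.5)
print(small, model[1][1])
```

Because each model pixel keeps its own distribution, the model retains where on the person a color appears, which is the property that histograms lack.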

Fig. 2. a) Foreground segmentation (1 image blob); b) Plan View Projection (2 world blobs); c) Plan View Segmentation (2 person blobs).

6 Tracking

Tracking is performed by maintaining a set of tracked people, updated with the measurements of person and world blobs (extracted by the previous phases). We use a probabilistic framework in which tracked people $P_t = \{\mathcal{N}(\mu_{i,t}, \sigma_{i,t}) \mid i = 1..n\}$ and measurements $Z_t = \{\mathcal{N}(\mu'_{j,t}, \sigma'_{j,t}) \mid j = 1..m\}$ are represented as multi-dimensional Gaussians including information about both the person position in the environment and the color-based person model. The update step is performed by using a Kalman Filter for each person. The system model used for predicting the people position is the constant velocity model, while their appearance is updated with a constant model. This model is adequate for many normal situations in which people walk in an environment. It provides a clean way to smooth the trajectories and to hold onto a person that is partially occluded for a few frames.
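To make the constant velocity model concrete, the following sketch implements a Kalman Filter for a single plan-view coordinate, with state (position, velocity) and a position-only measurement. It is a simplified illustration: the actual state also includes the other coordinate and the appearance model, and the class name `Track1D` and the noise values `q_pos`, `q_vel`, and `r` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Track1D:
    """Constant-velocity Kalman Filter along one axis (illustrative only)."""
    pos: float
    vel: float
    cov: list = field(default_factory=lambda: [[1.0, 0.0], [0.0, 1.0]])

    def predict(self, dt, q_pos=0.01, q_vel=0.01):
        """x <- F x and P <- F P F^T + Q, with F = [[1, dt], [0, 1]]."""
        (p00, p01), (p10, p11) = self.cov
        self.pos += dt * self.vel
        self.cov = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + q_pos, p01 + dt * p11],
            [p10 + dt * p11, p11 + q_vel],
        ]

    def update(self, z, r=0.01):
        """Correct the state with a position measurement z of variance r."""
        (p00, p01), (p10, p11) = self.cov
        s = p00 + r                      # innovation variance
        k0, k1 = p00 / s, p10 / s        # Kalman gain
        residual = z - self.pos
        self.pos += k0 * residual
        self.vel += k1 * residual
        self.cov = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]


track = Track1D(pos=1.0, vel=0.5)
track.predict(dt=0.1)        # one 100 ms cycle
predicted = track.pos
track.update(z=1.25)
print(predicted, track.pos, track.vel)
```

When a person is occluded for a few frames, calling `predict` without `update` keeps the track moving along its last estimated velocity, which is the behavior described above.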

With this representation data association is an important issue to deal with.

In general, at every step, the tracker must make an association between $m$ observations and $n$ tracked people. Association is solved by computing the Mahalanobis distance $d_{i,j}$ between the predicted estimate (through the Kalman Filter) of the $i$-th person $\mathcal{N}(\mu_{i,t|t-1}, \sigma_{i,t|t-1})$ and the $j$-th observation $\mathcal{N}(\mu'_{j,t}, \sigma'_{j,t})$.

An association between the predicted state of the system $P_{t|t-1}$ and the current observations $Z_t$ is denoted by a function $f$ that associates each tracked person $i$ to an observation $j$, with $i = 1..n$, $j = 1..m$, and $f(i) \neq f(j)$, $\forall i \neq j$.

The special value $\perp$ is used for denoting that the person is not associated to any observation (i.e. $f(i) = \perp$). Let $F$ be the set of all possible associations of the current tracked people with current observations. The best data association is computed by minimizing $\sum_i d_{i,f(i)}$ over $f \in F$. A fixed maximum value is used for $d_{i,f(i)}$ when $f(i) = \perp$.

Although this is a combinatorial problem, the sizes of the sets $P_t$ and $Z_t$ to which it is applied are very limited (not greater than 4), so $|F|$ is small and the problem can be solved effectively by enumeration.
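Since the sets are this small, the minimization can be carried out by exhaustive enumeration. The sketch below assumes that the distances $d_{i,j}$ have already been computed and are given as a matrix; `None` plays the role of $\perp$, and the function name `best_association` and the numeric values are hypothetical.

```python
from itertools import product


def best_association(dist, d_max):
    """Return (f, cost) minimizing the sum of d[i][f(i)] over all associations.

    dist[i][j] is the distance between predicted person i and observation j.
    f is a tuple with one entry per person: an observation index or None.
    Two people may not share an observation; unassociated people cost d_max.
    """
    n = len(dist)
    m = len(dist[0]) if n else 0
    best, best_cost = None, float("inf")
    for f in product([None, *range(m)], repeat=n):
        used = [j for j in f if j is not None]
        if len(used) != len(set(used)):
            continue  # an observation was assigned to two people
        cost = sum(d_max if j is None else dist[i][j]
                   for i, j in enumerate(f))
        if cost < best_cost:
            best, best_cost = f, cost
    return best, best_cost


# Three tracked people, two observations.
distances = [
    [0.5, 4.0],
    [3.5, 0.8],
    [6.0, 7.0],
]
f_star, cost = best_association(distances, d_max=5.0)
print(f_star, cost)
```

With at most 4 people and 4 observations the loop visits at most $5^4 = 625$ candidate tuples, so the search is negligible compared with the image processing stages.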

The association $f^*$ that solves this problem is chosen and used for computing the new status of the system $P_t$. During the update step a weight $w_{i,t}$ is computed for each Gaussian in $P_t$ (depending on $w_{i,t-1}$ and $d_{i,f(i)}$), and if such a weight goes below a given threshold, the person is considered lost.

Moreover, for each observation in $Z_t$ that is not associated to any person by $f^*$, a new Gaussian is entered in $P_t$.

The main difference from previous approaches [2, 11, 13] is that we integrate both plan-view and appearance information in the status of the system. By solving the above optimization problem, we find the best matching between observations and tracker status while jointly considering the position of the people in the environment and their appearance.

7 Applications and experiments

The system presented in this paper is in use within the RoboCare project [1, 14], whose goal is to build a multi-agent system that generates services for human assistance and develops support technology which can play a role in allowing elderly people to lead an independent lifestyle in their own homes. The RoboCare Domestic Environment (RDE), located at the Institute for Cognitive Science and Technology (CNR, Rome, Italy), is intended to be a testbed environment in which to test the ability of the developed technology.

In this application scenario the ability to track people in a domestic environment or within a health-care institution is a fundamental building block for a number of services requiring information about pose and trajectories of people (elders, human assistants) or robots acting in the environment.

In order to evaluate the system in this context we have performed a first set of experiments aiming at evaluating efficiency and precision of the system. The computational time of the entire process described in this paper is below 100 ms on a 2.4GHz CPU for high resolution images (640x480)³, thus making it possible to process a video stream at a frame rate of 10 Hz. The frame rate of 10 Hz is sufficient to effectively track walking people in a domestic environment, where velocities are typically limited.

For measuring the precision of the system we have marked 9 positions in the environment at different distances and angles from the camera and recorded the position returned by the system for a person standing at each of these positions.

Although this error analysis is affected by imprecise positioning of the person on the markers, the results of our experiments, averaging 40 measurements for each position, show a precision in localization (i.e. average error) of about 10 cm, with a standard deviation of about 2 cm, which is sufficient for many applications.

Furthermore, we have performed specific experiments to evaluate the integration of plan-view and appearance matching during tracking. We have compared two cases: the first in which only the position of the people is considered during tracking, the second in which appearance models of people are combined with their location (as described in Section 6). We have counted the number of association errors (i.e. all the situations in which either a track was associated to more than one person or a person was associated to more than one track) in these two cases. The results of our experiments have shown that the integrated approach reduces the association errors by about 50% (namely, from 39 in the tracker with plan-view position only to 17 in the one with integrated information, over a set of video clips with a total of 200,000 frames, of which about 3,500 contain two people close to each other).

³ Although some processing is performed at low resolution (320x240).

8 Conclusions and Future Work

In this paper we have presented a People Localization and Tracking System that integrates several capabilities into an effective and efficient implementation: dynamic background modelling, intensity and range based foreground segmentation, plan-view projection and segmentation for tracking and determining object masks, integration of plan-view and appearance information in data association and Kalman Filter tracking. The novel aspects introduced in this paper are:

1) a background modelling technique that is adaptively updated with a learning factor that varies over time and is different for each pixel; 2) a plan-view segmentation that is used to refine foreground segmentation in case of partial occlusions;

3) an integrated tracking method that considers both plan-view positions and color-based appearance models and solves an optimization problem to find the best matching between observations and the current state of the tracker.

Experimental results on efficiency and precision show good performance of the system. However, we intend to address other aspects of the system: first, using a multi-modal representation for tracking in order to better deal with uncertainty and association errors; second, evaluating the reliability of the system in medium-term re-acquisition of people leaving and re-entering a scene.

Finally, in order to expand the size of the monitored area, we are planning to use multiple tracking systems. This is a challenging problem because it emphasizes the need to re-acquire people moving from one sensor’s field of view to another. One way of simplifying this task is to arrange an overlapping field of view for close cameras; however, this arrangement increases the number of sensors needed to cover an environment and limits the scalability of the system.

In the near future we intend to extend the system to track people with multiple sensors that do not overlap.

Acknowledgments

This research is partially supported by MIUR (Italian Ministry of Education, University and Research) under project RoboCare (A Multi-Agent System with Intelligent Fixed and Mobile Robotic Components). Luca Iocchi also acknowledges SRI International where part of this work was carried out and, in particular, Dr. Robert C. Bolles for his interesting discussions and useful suggestions.

References

1. S. Bahadori, A. Cesta, L. Iocchi, G. R. Leone, D. Nardi, F. Pecora, R. Rasconi, and L. Scozzafava. Towards ambient intelligence for the domestic care of the elderly. In P. Remagnino, G. L. Foresti, and T. Ellis, editors, Ambient Intelligence: A Novel Paradigm. Springer, 2004.

2. D. Beymer and K. Konolige. Real-time tracking of multiple people using stereo. In Proc. of IEEE Frame Rate Workshop, 1999.

3. T. Darrell, D. Demirdjian, N. Checka, and P. F. Felzenszwalb. Plan-view trajectory estimation with dense stereo background models. In Proc. of 8th Int. Conf. On Computer Vision (ICCV’01), pages 628–635, 2001.

4. T. Darrell, G. Gordon, M. Harville, and J. Woodfill. Integrated person tracking using stereo, color, and pattern detection. International Journal of Computer Vision, 37(2):175–185, 2000.

5. D. Focken and R. Stiefelhagen. Towards vision-based 3-d people tracking in a smart room. In Proc. 4th IEEE Int. Conf. on Multimodal Interfaces (ICMI’02), 2002.

6. I. Haritaoglu, D. Harwood, and L. S. Davis. W4S: A real-time system detecting and tracking people in 2 1/2D. In Proceedings of the 5th European Conference on Computer Vision, pages 877–892. Springer-Verlag, 1998.

7. I. Haritaoglu, D. Harwood, and L. S. Davis. An appearance-based body model for multiple people tracking. In Proc. of 15th Int. Conf. on Pattern Recognition (ICPR’00), 2000.

8. M. Harville, G. Gordon, and J. Woodfill. Foreground segmentation using adaptive mixture models in color and depth. In Proc. of IEEE Workshop on Detection and Recognition of Events in Video, pages 3–11, 2001.

9. J. Kang, I. Cohen, and G. Medioni. Object reacquisition using invariant appearance model. In Proc. of 17th Int. Conf. on Pattern Recognition (ICPR’04), 2004.

10. K. Konolige. Small vision systems: Hardware and implementation. In Proc. of 8th International Symposium on Robotics Research, 1997.

11. J. Krumm, S. Harris, B. Meyers, B. Brumitt, M. Hale, and S. Shafer. Multi-camera multi-person tracking for EasyLiving. In Proc. of Int. Workshop on Visual Surveillance, 2000.

12. J. Li, C. S. Chua, and Y. K. Ho. Color based multiple people tracking. In Proc. of 7th Int. Conf. on Control, Automation, Robotics and Vision, 2002.

13. A. Mittal and L. S. Davis. M2Tracker: A multi-view approach to segmenting and tracking people in a cluttered scene using region-based stereo. In Proc. of the 7th European Conf. on Computer Vision (ECCV’02), pages 18–36. Springer-Verlag, 2002.

14. Robocare project. http://robocare.istc.cnr.it.

15. K. Roh, S. Kang, and S. W. Lee. Multiple people tracking using an appearance model based on temporal color. In Proc. of 15th Int. Conf. on Pattern Recognition (ICPR’00), 2000.

16. C. Stauffer and W. E. L. Grimson. Adaptive background mixture models for real-time tracking. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’99), pages 246–252, 1999.

17. C. R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: Real-time tracking of the human body. IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(7):780–785, 1997.

18. T. Zhao and R. Nevatia. Tracking multiple humans in crowded environment. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR’04), 2004.