5.2.2 Person tracking
Tracking people over time presents additional challenges due to the complexity of data association in different scenes. Several groups have investigated person tracking with laser range finders (Montemerlo et al. (2002), Schulz et al. (2003), Arras et al. (2008)). These approaches usually track only the motion of people and do not try to distinguish individuals. One approach that distinguishes different motion states in laser data is presented by Taylor and Kleeman (2004). Combinations of laser and vision data are presented by Bennewitz et al. (2005) and Schulz (2006). Both detect the position
of people in the laser scan and distinguish persons based on vision data. Bennewitz et al. (2005) base the vision part on color histograms, whereas Schulz (2006) learns silhouettes of individuals from training data. This, however, requires a time-consuming learning phase for each new person.
In machine vision, people tracking is a well-studied problem. Two main approaches can be distinguished: model-based and feature-based methods. In model-based tracking approaches, a model of the object is learned in advance, usually from a large set of training images that show the object from different viewpoints and in different poses (Rohr (1994)). Learning a model of a human is difficult because of the large number of degrees of freedom of the human body and the variability of human motion. Current approaches include simplified human body models, for example stick, ellipsoidal, cylindrical or skeleton models (Bregler et al. (2004), Urtasun et al. (2006), Mikic et al. (2003)), or shape-from-silhouette models (Cheung et al. (2005)). Although these approaches have reached good performance in laboratory settings with static cameras, they are usually not applicable in real-world environments on a mobile system: they usually do not operate in real time and often rely on a static, uniform background. Other promising approaches for human tracking are online learning methods, which handle the complex appearance variation of human poses. Their basic idea is to learn a model that represents the target object and to update this model in every frame so that it adapts to appearance variation. Examples include incremental learning (Ross et al. (2008)), online multiple instance learning (Babenko et al. (2011)) and visual tracking using L1 minimization (Mei and Ling (2009)). Jepson and Fleet (2003) use a Gaussian mixture model with online updates to adapt to object appearance variations. Kwon and Park (2008) improve particle filtering by using multiple observation and motion models to address large appearance and motion variation. Although these methods have achieved considerable success, they still have unsolved problems, such as drift.
Drift usually occurs when the target changes drastically while the training samples are too few to cover potential target locations far from the currently tracked position. Grabner and Bischof (2006) proposed an online boosting algorithm that selects features for tracking arbitrary objects. Their appearance model is updated with one positive sample and a few negative samples: the positive sample is taken from the current tracker location, and the negative samples are collected at positions around it. If the tracker location is not precise, the appearance model may be updated with a wrong positive sample. Over time this can degrade the model and cause drift and misalignment. Babenko et al. (2009) improve the efficiency of online tracking algorithms with a multiple instance learning framework in which samples are collected in positive and negative bags. This method has been shown to be robust to partial occlusion and drift. Nevertheless, processing its training samples has a high computational complexity, so it is infeasible to apply this method on mobile robots, which are required to run in real time. To deal with appearance changes of the object and its partial occlusion, Zhang et al. (2012) proposed compressive tracking. This method uses
compressed features, extracted from the tracked object, to update a simple (naive) Bayes classifier online. As a result, this classifier can quickly adapt to changes in the object's pose, rotation and deformation, and to self-occlusion. In addition, this method is suitable for real-time applications because of its low computational cost. Since the mobile robot and the humans often move and change their directions and orientations, an effective improvement of the compressive tracker can be a good solution for adapting to all of these changes and tracking humans reliably.
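To make the idea of compressed features feeding an online-updated Bayes classifier more concrete, the following is a minimal sketch, not the implementation of Zhang et al. (2012). It assumes a very sparse random projection with weights +1 and -1, per-feature Gaussian class models, and an exponentially weighted update with a learning rate `rate`. All function names, the synthetic "patches" and the parameter values are hypothetical and chosen only for illustration; the published method may differ in its features and update rule.

```python
import math
import random


def sparse_projection(n_features, dim, nonzeros, rng):
    """Build a very sparse random matrix, stored row by row.

    Each row keeps `nonzeros` input indices with weight +1 or -1, so a
    compressed feature is a signed sum of a few raw values.
    """
    rows = []
    for _ in range(n_features):
        indices = rng.sample(range(dim), nonzeros)
        rows.append([(i, rng.choice((-1.0, 1.0))) for i in indices])
    return rows


def compress(rows, x):
    """Project a raw feature vector x onto the sparse rows."""
    return [sum(w * x[i] for i, w in row) for row in rows]


def log_gauss(value, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)


class OnlineNaiveBayes:
    """Two-class naive Bayes with per-feature Gaussians, updated online."""

    def __init__(self, n_features, rate=0.85):
        self.rate = rate  # weight kept for the old parameters (assumed value)
        self.mean = {0: [0.0] * n_features, 1: [0.0] * n_features}
        self.var = {0: [1.0] * n_features, 1: [1.0] * n_features}

    def update(self, samples, label):
        """Blend the statistics of the new samples into the stored ones."""
        lam = self.rate
        for k in range(len(self.mean[label])):
            values = [s[k] for s in samples]
            m = sum(values) / len(values)
            v = sum((a - m) ** 2 for a in values) / len(values)
            old_m, old_v = self.mean[label][k], self.var[label][k]
            # Variance of the mixture of the old and the new distribution.
            new_v = lam * old_v + (1 - lam) * v + lam * (1 - lam) * (old_m - m) ** 2
            self.var[label][k] = max(new_v, 1e-6)  # avoid a zero variance
            self.mean[label][k] = lam * old_m + (1 - lam) * m

    def score(self, features):
        """Log-likelihood ratio of target vs. background (equal priors)."""
        return sum(
            log_gauss(a, self.mean[1][k], self.var[1][k])
            - log_gauss(a, self.mean[0][k], self.var[0][k])
            for k, a in enumerate(features)
        )


def demo(seed=0, dim=64, n_features=20, frames=20):
    """Train on synthetic patches, then score one target and one background patch."""
    rng = random.Random(seed)
    rows = sparse_projection(n_features, dim, nonzeros=4, rng=rng)
    clf = OnlineNaiveBayes(n_features)

    def patch(is_target):
        # Hypothetical data: the target is bright in its first half only.
        base = [1.0 if (is_target and i < dim // 2) else 0.0 for i in range(dim)]
        return [b + rng.gauss(0.0, 0.1) for b in base]

    for _ in range(frames):
        clf.update([compress(rows, patch(True)) for _ in range(5)], 1)
        clf.update([compress(rows, patch(False)) for _ in range(5)], 0)
    return (clf.score(compress(rows, patch(True))),
            clf.score(compress(rows, patch(False))))
```

In a tracker, the candidate window with the highest score in the new frame would become the next target location, and samples drawn near and away from it would drive the next update. Because each feature touches only a few raw values, both scoring and updating stay cheap, which is the property that makes this family of methods attractive for real-time use.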
Feature-based tracking approaches, on the other hand, do not learn a model but track an object based on simple features such as color cues or edges. One approach for feature-based tracking is the Mean Shift algorithm (Comaniciu and Meer (2002), Comaniciu et al. (2000)), which classifies objects according to a color distribution. Variations of this method are presented by Bradski (1998) and Perez et al. (2002). Although most of these approaches were not designed specifically for person tracking, they might be applicable in this area as well. One remaining limitation of the above methods is that they operate only on color and therefore depend on colored objects.
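The way a color distribution can drive the tracking window is sketched below. This is a simplified back-projection variant, assuming a single color channel with values in [0, 1) and a flat rectangular window. It is not the kernel-based formulation of the cited papers, and all names, the bin count and the toy image are hypothetical.

```python
import math


def color_histogram(pixels, bins=8):
    """Normalized histogram of color values in [0, 1)."""
    hist = [0.0] * bins
    for p in pixels:
        hist[min(int(p * bins), bins - 1)] += 1.0
    total = sum(hist)
    return [h / total for h in hist]


def back_project(image, hist):
    """Replace each pixel by the target-histogram value of its color bin."""
    bins = len(hist)
    return [[hist[min(int(p * bins), bins - 1)] for p in row] for row in image]


def mean_shift(weights, center, half_size, max_iter=20, eps=0.5):
    """Move a square window to the weighted centroid until it stops moving."""
    cy, cx = center
    height, width = len(weights), len(weights[0])
    for _ in range(max_iter):
        y0, y1 = max(0, int(round(cy)) - half_size), min(height, int(round(cy)) + half_size + 1)
        x0, x1 = max(0, int(round(cx)) - half_size), min(width, int(round(cx)) + half_size + 1)
        total = sum_y = sum_x = 0.0
        for y in range(y0, y1):
            for x in range(x0, x1):
                w = weights[y][x]
                total += w
                sum_y += w * y
                sum_x += w * x
        if total == 0.0:
            break  # no target-colored pixel inside the window
        new_cy, new_cx = sum_y / total, sum_x / total
        shift = math.hypot(new_cy - cy, new_cx - cx)
        cy, cx = new_cy, new_cx
        if shift < eps:
            break
    return cy, cx


def demo():
    """Toy image: dark background with a bright 5x5 target block."""
    image = [[0.1] * 20 for _ in range(20)]
    for y in range(10, 15):
        for x in range(12, 17):
            image[y][x] = 0.9
    target_pixels = [image[y][x] for y in range(10, 15) for x in range(12, 17)]
    weights = back_project(image, color_histogram(target_pixels))
    # Start the window off-center; it should converge to the block center.
    return mean_shift(weights, center=(8.0, 10.0), half_size=4)
```

In this toy setup the window starts partly overlapping the block and settles on its center after a few iterations. The sketch also shows the limitation noted above: if the background shares the target's color bins, the weights no longer single out the target and the window can be pulled away from it.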
In the field of mobile robots, there are many approaches to tracking multiple humans, such as sample-based joint probabilistic data association filters (Schulz et al. (2001)) and Kalman filters (Bellotto and Hu (2009)), but most of them have not been successful in adapting to changes in human pose, scale and appearance, or to partial or full occlusions. State-of-the-art human detection algorithms (Viola and Jones (2001b), Dalal and Triggs (2005)) have contributed greatly to tracking-by-detection approaches (Wojek et al. (2009), Choi and Savarese (2010)), significantly improving the tracking capability of mobile robots. Choi et al. (2011b) proposed a method for detecting and tracking people with mobile robots based on reversible jump Markov chain Monte Carlo (RJ-MCMC) particle filtering. Because it detects humans from relatively reliable observation cues in each frame, this method was shown to be robust to complicated changes of human pose and to partial occlusions. These observation cues include a human detector using Histograms of Oriented Gradients (Dalal and Triggs (2005)), a face detector using the Viola-Jones method of object detection (Viola and Jones (2001b)), and detectors of skin, motion and depth-based shape. However, the detectors and the RJ-MCMC particle filtering algorithm are computationally very expensive, so this algorithm does not meet the real-time requirement of human-robot interaction.