Taking mobile multi-object tracking to the next level
Selected Topics in Computer Vision, edited by Prof. Dr. Bastian Leibe, Lehr- und Forschungsgebiet Informatik 8 (Computer Vision), RWTH Aachen University. Volume 1. Dennis Mitzel: Taking Mobile Multi-Object Tracking to the Next Level. Shaker Verlag, Aachen 2014.
Bibliographic information published by the Deutsche Nationalbibliothek. The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de. Also published as: D 82 (Diss. RWTH Aachen University, 2013). Copyright Shaker Verlag 2014. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without the prior permission of the publishers. Printed in Germany. ISBN 978-3-8440-2524-8. ISSN 2198-3372. Shaker Verlag GmbH • P.O. BOX 101818 • D-52018 Aachen. Phone: 0049/2407/9596-0 • Telefax: 0049/2407/9596-9. Internet: www.shaker.de • e-mail: [email protected].
Abstract

Recent years have seen considerable progress in automotive safety and autonomous navigation applications, fueled by remarkable advances in individual Computer Vision components such as object detection, tracking, stereo, and visual odometry. The goal in such applications is to automatically infer a semantic understanding of the environment as observed from a moving vehicle equipped with a camera system. Pedestrian detection and tracking are an actively researched part of scene understanding and are important for safe navigation, path planning, and collision avoidance. Classical tracking-by-detection approaches require a robust object detector that needs to be executed in every frame. However, the detector is typically the most computationally expensive component, especially if more than one object class needs to be detected. A first goal of this thesis was to develop a vision system based on stereo camera input that is able to detect and track multiple pedestrians in real time. To this end, we propose a hybrid tracking system that combines a computationally cheap low-level tracker with a more complex high-level tracker. The low-level trackers are based either on level-set segmentation or on stereo range data together with a point registration algorithm, and they are employed to follow individual pedestrians over time, starting from an initial object detection. In order to cope with drift and to bridge occlusions that cannot be resolved by the low-level trackers, the resulting tracklet outputs are fed to a high-level multi-hypothesis tracker, which performs longer-term data association. With this integration we obtain a real-time tracking framework by reducing object detector applications to fewer frames, or even to a few small image regions when stereo data is available.
Reducing expensive detector evaluations is especially relevant for deployment on mobile platforms, where real-time performance is crucial and computational resources are notoriously limited. A classical tracking-by-detection pipeline has a further limitation: it can only track objects for which a pre-trained object classifier is available. To overcome it, we propose a tracking-before-detection system that is able to track known and unknown objects robustly, based purely on stereo information. With this approach we track all visible objects in the scene by first segmenting the point cloud into individual objects and then associating them to trajectories based on a simple registration algorithm. The core of our approach is a compact 3D representation that allows us to robustly track a large variety of objects while building up models of their 3D shape online. In addition to improving tracking performance, this representation allows us to detect anomalous shapes, such as carried items on a person's body. Moreover, classical pedestrian tracking approaches ignore important aspects of human behavior that should be considered for better scene understanding. Humans do not move independently; they closely interact with their surroundings, which include not only other persons but also further scene objects. Being able to track not only humans but also their objects, such as child strollers, suitcases, walking aids, and bicycles, we propose a probabilistic approach for classifying person-object interactions, which simultaneously associates objects with persons and predicts their interaction type. In order to demonstrate the capabilities of the proposed tracking algorithms, we evaluated them on several challenging video sequences captured in busy and crowded shopping street environments. As our experiments show, we come closer to the goal of better scene understanding, being able to detect and track multiple objects in the scene in real time and to predict their possible interactions.
Zusammenfassung (Summary)

In recent years, the development of driver assistance systems and mobile robots has made considerable progress. This was made possible by remarkable advances in individual computer vision methods such as object detection, object tracking, stereo depth estimation, and stereo-camera-based odometry. When deployed on mobile robots, the goal of these methods is to give the robot an understanding of the scene. This is achieved by automatically analyzing images from a camera mounted on the robot. Object detection and object tracking are the most important components for scene understanding, since they enable safe navigation, path planning, and collision avoidance, and they are therefore among the heavily researched areas of computer vision. A classical approach to object tracking is the so-called tracking-by-detection approach. Here, an object detector is evaluated on every video frame, and the resulting object detections are then linked across frames into trajectories with the help of odometry. The drawback of this classical approach is the mandatory use of an object detector on every frame. Since this detector is typically the most computationally expensive component of the tracking pipeline, tracking-by-detection becomes infeasible for real-time-critical applications. For this reason, the first goal of this thesis was to develop an object tracking method that, starting from the images of a stereo camera, can find and track pedestrians in real time. To this end, we developed a hybrid object tracking approach that combines a computationally efficient low-level tracker with a high-level tracker. The low-level tracker is based either on level-set segmentation or on stereo depth combined with the ICP algorithm. These trackers are responsible for following pedestrians over time, based on an initial object detection. Since the low-level trackers cannot cope with deviations from the true position of the object, which are often caused by occlusions, the tracking system is extended by a high-level tracker. The high-level tracker generates long trajectories and uses appropriate consistency tests to detect when the low-level trackers diverge. With this combination, detector evaluation is reduced to a few frames, or even to a few small image regions per frame. This drastic reduction creates the precondition for a real-time capable system, which is what makes deployment on mobile robots possible in the first place.

In the second part of the thesis we present a new tracking-before-detection approach. It allows us to track not only known object categories, such as pedestrians, but also unknown, previously unseen categories. With this approach we also overcome the strong restriction of typical tracking-by-detection methods that a pre-trained object detector is required, and can thus track all visible objects in the scene. For this we use the point clouds extracted by means of stereo estimation. The point clouds are segmented into individual objects and linked into object trajectories. This is done with a registration method that registers two point clouds to each other. The core of the method is a new, compact 3D object representation that on the one hand allows robust tracking of arbitrary objects and on the other hand enables learning 3D object shapes online. The learned 3D object shapes for pedestrians allow us to detect carried objects such as bags.

Building on the ability to track all objects in a scene, this thesis also investigated a further important aspect of human motion. Humans do not move independently but interact strongly with their environment. This environment consists not only of other humans but also of further unknown objects such as child strollers, suitcases, walking aids, and bicycles. To model these interactions, we present a new probabilistic approach that allows us to associate objects with persons. At the same time, the type of interaction can be predicted, which in turn can be used to improve object tracking. To demonstrate the performance of the presented methods, we evaluated the algorithms on several challenging sequences from very busy shopping streets. Our experiments show that we have come considerably closer to the goal of better scene understanding. We are able to find and track objects in real time and to predict their interactions.
Acknowledgments

This dissertation is a product of the invaluable support I received over the last few years from a number of great people. First and foremost, I thank my supervisor Prof. Dr. Bastian Leibe for his continuing intensive support in all stages of my PhD, for great and inspirational ideas, for countless fruitful discussions, and for teaching me what research is all about. My thanks also go to Prof. Dr. Luc Van Gool for his interest in my work and for agreeing to co-examine my thesis. I would like to express my gratitude to my diploma thesis supervisor Prof. Dr. Daniel Cremers, who sparked my interest in Computer Vision from the first lecture. I am deeply grateful to all my colleagues Esther Horbert, Tobias Weyand, Patrick Sudowe, Georgios Floros and Wolfgang Mehner for making life in UMIC really fun. Furthermore, Tobias, thank you for offering me a place to sleep during the nights close to the deadlines, which became longer and longer the closer the deadline approached, and especially for the excellent espresso in the morning. Georgios, thank you for all the great Greek supplies, such as the amazing feta and spinach pies or the excellent olive oil from Crete. Esther, thank you very much for being supportive during the difficult time in the first year of my PhD. Patrick, thank you for all the discussions about cars in the context of car detection in video sequences, but also for convincing me why the Volkswagen Golf GT is the best car ever. Moreover, I would like to thank the team of KHAO-LAKBEACH for always being punctual in delivering the best Vietnamese food during rough deadline times. Thanks also to my diploma and master thesis students who have contributed to this thesis with their work: Esther Horbert, Tobias Baumgartner, Philipp Fischer, Stefan Breuers, Wolfgang Mehner, Emmanouil Tzouridis, Seyed Hamidreza Odabai-Fard, Jonathan Meyer.

Furthermore, I would like to thank Tanja, my two-year-old daughter, for coaching me to get along with only 3-4 hours of sleep per night, which helped me to stay awake and concentrated during the long nights before the deadlines. Last but not least, I would like to thank my family and my friends, especially my parents, for always believing in me and always being there.
Contents

1 Introduction
  1.1 Motivation
  1.2 Contributions
  1.3 Structure of the Thesis
2 State of the Art
  2.1 Image-based Tracking
  2.2 Tracking-by-Detection
  2.3 Stereo-based Tracking
3 Preliminaries
  3.1 Object Detection
  3.2 Stereo Estimation
  3.3 Visual Odometry
  3.4 Multi-Hypothesis Tracking
  3.5 Real-Time Tracking-by-Detection
  3.6 Camera Setup
  3.7 Discussion

I Hybrid High-Level/Low-Level Tracking
4 Hybrid High-Level/Low-Level Tracking
  4.1 Motivation
  4.2 Related Work
  4.3 Integrated Tracking Frameworks
  4.4 Hybrid Tracking with Level Sets
  4.5 Hybrid Tracking with ICP
  4.6 Discussion

II ROI based Object Detection and Tracking
5 Robust ROI Extraction and Segmentation
  5.1 Point Cloud Labeling
  5.2 ROI Extraction
  5.3 ROI Segmentation
  5.4 Experimental Results
  5.5 Discussion
6 Close-Range Human Detection and Tracking
  6.1 Related Work
  6.2 Approach
  6.3 Experimental Evaluation
  6.4 Extensions
  6.5 Discussion
7 Tracking with Time-Constrained Detection
  7.1 Related Work
  7.2 System Overview
  7.3 Poisson Process Attention Model
  7.4 Detailed Implementation
  7.5 Experimental Results
  7.6 Discussion

III Tracking People and Their Objects
8 Tracking Known and Unknown Objects
  8.1 Related Work
  8.2 Overview
  8.3 3D Object Representation
  8.4 Stereo Depth-Based Tracking-Before-Detection
  8.5 Carried Item Detection
  8.6 Experimental Results
  8.7 Discussion
9 Person-Person and Person-Object Interaction
  9.1 Related Work
  9.2 Modeling Person-Object Interactions
  9.3 Learning
  9.4 Inference and Prediction
  9.5 Robust 3D Data Association and Tracking
  9.6 Experimental Results
  9.7 Discussion

10 Conclusion
  10.1 Contributions
  10.2 Perspectives
Bibliography
1. Introduction

1.1. Motivation

Computer Vision is a broad and varied research field concerned with the problem of extracting semantic information from images of a scene. Since its beginnings in the early 1970s, it has been a very vibrant research area. Scientists around the world have put enormous effort into developing algorithms and methods for extracting relevant information from images. As a result, the techniques of this field have applications in a wide range of scenarios, including manufacturing, security, robotics, the automotive industry, communication, and many more. For many applications the behavior of humans in urban scenarios is of particular interest. For example, a traffic safety application could analyze a video stream from a camera system mounted inside a car or on a mobile robot in order to issue warnings in case of future path intersections or possible collisions. Achieving this goal requires methods that can process video streams automatically and in real time. Furthermore, in order to understand the behavior of people, it is also important to recognize and track other objects in their surroundings. In practical scenarios, this includes a large variety of objects such as bicycles, child strollers, shopping carts, trolleys, or wheelchairs. In recent years a number of tracking-by-detection approaches have been proposed to address these goals, reaching remarkable performance for robust people detection and tracking in dynamic and complex real-world scenes. However, those approaches have two major limitations. On the one hand, they are not yet satisfactory for use on autonomous platforms with respect to their requirements for computational power and energy consumption. On the other hand, they are naturally restricted to tracking objects for which pre-trained detector models (e.g., pedestrians) are available.
In this thesis, we investigate the problem of multi-object tracking in busy inner-city scenarios. Starting with classical tracking-by-detection approaches, we focus on algorithmic means for improving run-time efficiency in order to make them applicable for use on a mobile robot. Based on the lessons learned from this endeavor, we investigate different means for reducing the dependency on an expensive object detector by introducing a hybrid tracking framework. Such a framework combines a computationally cheap low-level tracker with a high-level tracker. The low-level tracker follows pedestrians over time after an initial detection and thus takes over the role of the computationally expensive object detector. We investigate different choices for the low-level tracker, exploring both appearance-based and depth-based approaches. In both cases the low-level tracker is augmented by a high-level tracker that uses the tracklets output by the low-level tracker to perform longer-term data association, bridging drift and occlusions that cannot be resolved by the low-level trackers. In the second part of the thesis, we take this idea further. Assuming that a mechanism exists for extracting regions-of-interest (ROIs) from the input video data (in our case from stereo data), we explore how those ROIs can be used for both simplifying and improving the object detection and tracking stages. Finally, we address the problem of tracking both known and unknown scene objects, which is a prerequisite for robust performance in many real-world settings such as mobile robotics and intelligent vehicles. For this, we extend the ROI-based scheme to a true tracking-before-detection approach, which can automatically track a large number of object candidates even before knowing their categories. This paradigm shift has important consequences for the design of the tracking pipeline. In particular, before tracking we first need to decide how an object candidate that we want to track is defined. To this end, we make use of regions-of-interest, which are robustly segmented into candidate objects. Each such region is then tracked independently in 3D using a model-based point cloud registration tracker.
In order to learn a representation of the objects, we develop an approach that reconstructs 3D shape models of each tracked object, which allow us not only to robustly track a large variety of objects but also to analyze their shape. In addition, relying on the tracking results of known and unknown objects, we analyze person-object interactions and use this knowledge to make improved predictions for the continuation of observed trajectories. In this sense, we believe that the contributions of this thesis have brought tracking a significant step towards the next level.

1.2. Contributions

In detail we have made the following contributions:

• We show how the classical tracking-by-detection framework can be complemented by a cheap and fast low-level tracker, based either on appearance or depth, resulting in a real-time tracking system. We systematically present the consistency checks and interactions between the components that are required to handle the non-trivial challenges of street-level mobile tracking.

• We present an integrated system for upper body detection which is purely based on depth information. The system overcomes the drawback of classical full-body detectors, which often fail to detect pedestrians close to the camera due to strong occlusions. It is highly optimized for inner-city scenarios and yields superior performance on very challenging outdoor data at more than 40 fps.

• We present a novel tracking-before-detection system that, relying only on stereo depth information, is able to track a large variety of objects with unknown appearance in very complex street scenarios, while simultaneously building up their 3D shape models. The system combines visual odometry, ground plane estimation, point cloud classification, ROI extraction, segmentation, and tracking into a robust framework.

• Relying on the output of our tracking-before-detection system, including the tracked objects and their 3D shape models, we present a probabilistic framework for classifying person-object interactions. The system associates unknown objects with persons and can help in stabilizing trajectories and adapting dynamic models for certain object/person constellations.

• We acquired several hours of video material from mobile stereo camera systems and annotated various sequences with bounding boxes around pedestrians, object/person segmentations, and corresponding actions between pedestrians and objects. These sequences, including corresponding annotations, stereo depth, visual odometry, and estimated ground planes, have been made publicly available, allowing other researchers to build upon our results without having to assemble a complete system on their own.

1.3. Structure of the Thesis

The thesis is structured as follows: In Chapter 2 - “State of the Art” we summarize the general state-of-the-art work related to tracking-by-detection approaches. Each following chapter presents additional related work and discusses the differences between the proposed approach and existing approaches in more detail. Chapter 3 - “Preliminaries” introduces the basic components we rely on in the following chapters.
This includes the mobile platforms we constructed and used to acquire data sets, a brief introduction to object detection, visual odometry, and stereo estimation, and the combination of these components into a multi-hypothesis tracking system that runs in real time. In Chapter 4 - “Hybrid High-Level/Low-Level Tracking” we present an integrated framework for mobile street-level tracking of multiple persons. In contrast to classic tracking-by-detection approaches, we propose a hybrid tracking approach that employs efficient low-level trackers in order to follow individual pedestrians over time. The low-level trackers can be based either on a level-set color segmentation or on stereo depth data combined with ICP registration. Both trackers are initialized and periodically updated by a pedestrian detector and are kept robust through a series of consistency checks. In order to cope with drift and to bridge occlusions, the resulting tracklet outputs are fed to a high-level multi-hypothesis tracker (Mitzel et al., 2011b), which performs longer-term data association. This design has the advantage of simplifying short-term data association, resulting in higher-quality tracks that can be maintained even in situations where the pedestrian detector no longer yields good detections. In addition, it requires the pedestrian detector to be active only part of the time, resulting in computational savings. The chapter is based on research originally presented in (Mitzel and Leibe, 2011; Mitzel et al., 2010). The following three chapters explore different aspects of region-of-interest based tracking, making use of stereo depth as an additional cue. In Chapter 5 - “Robust ROI Extraction and Segmentation” we describe a robust region-of-interest (ROI) extraction approach based on depth information, optimized for busy shopping street scenarios. The idea behind classical ROI extraction using stereo data is to focus the attention of the detector/tracker only on the few regions which may contain a target object. The ROIs themselves are represented by grouped bins from a 2D histogram collected from projected 3D points. We introduce several extensions to the classical approach in order to cope with the problem of partially occluded objects, and we propose a segmentation procedure that uses the Quick Shift algorithm (Vedaldi and Soatto, 2008) to divide ROIs, which usually contain several objects, into individual objects.
The chapter describes a major component we build upon in several of the following chapters, as well as in several publications (Baumgartner et al., 2013; Mitzel and Leibe, 2011, 2012a,b; Mitzel et al., 2011a). In Chapter 6 - “Close-Range Human Detection and Tracking” we consider the problem of multi-person detection from the perspective of a head-mounted stereo camera. Since pedestrians close to the camera cannot be detected by classical full-body detectors due to strong occlusion, we propose a stereo depth-template based detection approach for close-range pedestrians. We perform a sliding window procedure in which we measure the similarity between a learned depth template and the depth image. To reduce the search space of the detector, we slide it only over a few selected ROIs. The ROI selection allows us to further constrain the number of scales to be evaluated, significantly reducing the computational cost. Besides the technical design and evaluation of our proposed detector, a second main contribution of this chapter is its empirical demonstration of the somewhat surprising fact that such a relatively simple and fast approach can reach superior detection performance on very challenging outdoor data. The chapter is based on research originally presented in (Mitzel and Leibe, 2012a; Mitzel et al., 2011b).
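Both Chapter 5 and Chapter 6 rely on depth-based ROIs. As a rough illustration of the ROI extraction summarized above (grouping bins of a 2D histogram collected from projected 3D points), the following Python sketch may help. It is not the implementation from the thesis: the function name `extract_rois`, the coordinate convention (y is height above the ground plane), the cell size, the height band, and the support threshold are all assumptions chosen for illustration, and the extensions for partially occluded objects are not modeled.

```python
from collections import deque


def extract_rois(points, cell=0.25, min_count=3, max_height=2.0):
    """Group occupied ground-plane histogram bins into ROIs.

    points: iterable of (x, y, z) in metres, where y is the height above
            the ground plane (an assumed convention for this sketch).
    Returns a list of ROIs, each a sorted list of (ix, iz) bin indices.
    """
    # 1. Project the 3D points onto the ground plane and build a 2D histogram.
    hist = {}
    for x, y, z in points:
        if 0.0 < y <= max_height:  # skip ground points and overhanging structures
            key = (int(x // cell), int(z // cell))
            hist[key] = hist.get(key, 0) + 1

    # 2. Keep only bins with enough supporting points (hypothetical threshold).
    occupied = {key for key, count in hist.items() if count >= min_count}

    # 3. Group 8-connected occupied bins into ROIs with a breadth-first search.
    rois, seen = [], set()
    for start in sorted(occupied):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            ix, iz = queue.popleft()
            component.append((ix, iz))
            for dx in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    neighbour = (ix + dx, iz + dz)
                    if neighbour in occupied and neighbour not in seen:
                        seen.add(neighbour)
                        queue.append(neighbour)
        rois.append(sorted(component))
    return rois
```

In a pipeline of the kind described here, each returned group of bins would be back-projected into the image to obtain the region handed to the detector or tracker; that step depends on the camera calibration and is omitted.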
In Chapter 7 - “Tracking with Time-Constrained Detection” we consider the problem of making best use of an object detector with a fixed and very small time budget. This constraint is not unusual and often arises in robotic scenarios where, e.g., several vision-based components need to share processing power. The question we pose is: Given a fixed time budget that allows for detector-based verification of k small regions-of-interest in the image, what are the best regions to attend to in order to obtain stable tracking performance? We address this problem by applying a statistical Poisson process model in order to rate the urgency with which individual ROIs should be attended to. These ROIs are initially extracted from a 3D depth-based occupancy map of the scene, as described in Chapter 5, and are then tracked over time. This allows us to balance the system resources in order to satisfy the twin goals of detecting newly appearing objects and maintaining the quality of existing object trajectories. The chapter is based on research originally presented in (Mitzel et al., 2011a,b). Finally, the following two chapters develop methods that allow us to extend tracking to objects people are interacting with. In Chapter 8 - “Tracking Known and Unknown Objects” we aim to take mobile multi-object tracking to the next level. Our approaches presented in the previous chapters work in a tracking-by-detection manner, which limits them to object categories for which pre-trained detector models (e.g., for pedestrians) are available. In contrast, in this chapter we propose a tracking-before-detection approach that can track both known and unknown object categories in very challenging street scenes. Our approach relies on noisy stereo depth data in order to segment and track objects in 3D.
At its core is a novel, compact 3D representation (Generalized Christmas Tree - GCT) that allows us to robustly track a large variety of objects while building up models of their 3D shape online. In addition to improving tracking performance, this representation allows us to detect anomalous shapes, such as carried items on a person's body. The chapter is based on our recent research presented in (Mitzel and Leibe, 2012b; Mitzel et al., 2011b). In Chapter 9 - “Person-Person and Person-Object Interaction” we investigate whether, given the ability to track both known and unknown objects, we can derive relationships between those objects. Here, we rely on the fact that humans do not move independently but closely interact with their environment, which includes not only other persons but also different scene objects. Typical everyday scenarios include people moving in groups, pushing child strollers, or pulling luggage items. Thus, we propose a probabilistic approach for classifying such person-object interactions, associating objects with persons, and predicting how the interaction will most likely continue. Our approach relies on stereo depth information in order to track all scene objects in 3D while simultaneously building up their 3D shape models, as presented in the previous chapter. These models and their relative spatial arrangement are then fed into a probabilistic graphical model which jointly infers pairwise interactions and object classes. The inferred interactions can then be used to support tracking by recovering lost object tracks. The chapter fuses the research presented in (Baumgartner et al., 2013; Mitzel and Leibe, 2012b; Mitzel et al., 2011b) and adds some further experiments. Chapter 10 concludes the thesis and gives an outlook on possible future research directions and extensions.

Note: The thesis is based on the technical contributions of my respective first author publications (Baumgartner et al., 2013; Mitzel and Leibe, 2011, 2012a,b; Mitzel et al., 2010, 2011a,b). Several images and text passages in the three major parts of this thesis are taken from these articles. However, additional and new content about further extensions has been added in order to provide deeper insights into our approaches. Several publications resulted from diploma or master thesis projects which were supervised by Prof. Bastian Leibe and myself. For these particular parts an additional note will be given in the footnote of the corresponding chapter.
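The attention idea summarized for Chapter 7 (rating the urgency of ROIs with a Poisson process and spending a budget of k detector evaluations on the most urgent ones) can be illustrated with a minimal Python sketch. The exact model is developed in Chapter 7 and is not reproduced here. The sketch assumes that each ROI carries a hypothetical event rate (expected number of relevant changes per second, such as a new object appearing or a track drifting) and that urgency is the probability of at least one such event since the ROI was last verified, 1 - exp(-rate * elapsed). The names `urgency` and `select_rois` are illustrative.

```python
import math


def urgency(rate, elapsed):
    """Probability of at least one event within `elapsed` seconds for a
    Poisson process with `rate` events per second (assumed urgency measure)."""
    return 1.0 - math.exp(-rate * elapsed)


def select_rois(rois, now, k):
    """Pick the k ROIs that should be verified by the detector next.

    rois: dict mapping an ROI id to (rate, last_checked_time).
    now:  current time in seconds.
    Returns the ids of the k most urgent ROIs, most urgent first.
    """
    ranked = sorted(
        rois,
        key=lambda roi_id: urgency(rois[roi_id][0], now - rois[roi_id][1]),
        reverse=True,
    )
    return ranked[:k]
```

Under these assumptions, an ROI that has not been verified for a long time, or that lies in an area where changes are frequent, rises in the ranking, which reflects the balance between detecting new objects and maintaining existing trajectories described above.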
2. State of the Art

In this thesis, we will mainly focus on multi-object tracking approaches from moving vehicles, e.g., a mobile robot that is equipped with a pair of synchronized, forward-looking cameras. This task is very challenging, since multiple objects that need to be detected may appear or emerge from occlusions in every frame. Since background modeling (Stauffer and Grimson, 1999) is no longer applicable in a mobile scenario, detection is typically done using visual object detectors (Dollar et al., 2009). Consequently, tracking-by-detection has become the dominant paradigm for multi-object tracking applications (Andriluka et al., 2008; Ess et al., 2009b; Huang et al., 2008; Leibe et al., 2008a; Okuma et al., 2004; Wu and Nevatia, 2007). In such a framework, a generic person detector is applied to every frame of the input video sequence, and the resulting detections are associated to tracks. This leads to challenging data association problems, since the detections may themselves be noisy, containing false positives and misaligned detection bounding boxes (Dollar et al., 2009). Several approaches have been proposed to address this issue, e.g., by optimizing over a larger temporal window using dynamic programming (Berclaz et al., 2006), multi-hypothesis tracking (Arras et al., 2008), model selection (Leibe et al., 2008a), network flow optimization (Zhang and Nevatia, 2008), hierarchical (Huang et al., 2008) or MCMC data association (Zhao et al., 2008). In the following, we will review the different tracking approaches in more detail, starting with the early methods based on background subtraction, proceeding with state-of-the-art tracking-by-detection approaches, and concluding with stereo-based detection and tracking methods.

2.1. Image-based Tracking

A considerable number of object tracking approaches has been proposed starting in the seventies.
Each of these tracking approaches requires an object detection procedure which is executed either in every frame or only initially when an object enters the tracking area. Early detection methods relied on temporal information in order to find changing image regions, caused by dynamics of the objects, by simply differencing temporally adjacent frames. The pixels undergoing strong changes are then marked as foreground and are connected to coherent blobs by a simple connected components algorithm. These frame differencing approaches are known as background subtraction and have been intensively studied since the seventies, starting with the pioneering work of Jain and Nagel (1979). However, simple frame differencing is sensitive to illumination changes and usually yields many false positives. To cope with this problem, Wren et al. (1997) propose to model the intensity of each pixel in the static background with a normal distribution, learning the mean and the variance from several consecutive frames. Once the model is learned, pixels in the new frames deviating from the distribution are marked as foreground. Stauffer and Grimson (2000) propose to model the pixel values using mixtures of Gaussians, which provide better fits for outdoor scenes: repetitive movements of tree leaves, shadows, or reflections are correctly classified as background, whereas they yield false positives under the unimodal model of Wren et al. (1997). Elgammal et al. (2000) extend the subtraction process under the assumption that neighboring pixels are likely to have the same label, such that an individual pixel should also match nearby pixel values. This extension corresponds to a typical smoothness assumption and significantly reduces the typical salt-and-pepper noise.

The limitation of all background subtraction methods is that they are only applicable to image sequences acquired from a static camera. Although there are approaches, e.g., from Irani and Anandan (1998), that attempt to compensate for the camera motion by building up mosaics for the background pixels, they are still limited to slow camera motion, which makes them inapplicable to our scenarios.
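To make the unimodal background model concrete, the following is a minimal sketch of a per-pixel Gaussian model in the spirit of the description above. It is not the implementation of Wren et al. (1997): the function names, the threshold factor `k`, and the lower bound `min_std` on the standard deviation are assumptions chosen for illustration, and frames are represented as plain nested lists of grayscale values.

```python
from math import sqrt

def learn_background(frames):
    """Estimate a per-pixel mean and variance from several background frames.

    `frames` is a list of equally sized grayscale images (lists of rows).
    """
    n = len(frames)
    height, width = len(frames[0]), len(frames[0][0])
    mean = [[sum(f[y][x] for f in frames) / n for x in range(width)]
            for y in range(height)]
    var = [[sum((f[y][x] - mean[y][x]) ** 2 for f in frames) / n
            for x in range(width)]
           for y in range(height)]
    return mean, var

def foreground_mask(frame, mean, var, k=2.5, min_std=2.0):
    """Mark pixels deviating more than k standard deviations as foreground.

    k and min_std are illustrative assumptions, not values from the thesis.
    """
    mask = []
    for y, row in enumerate(frame):
        mask.append([
            abs(value - mean[y][x]) > k * max(sqrt(var[y][x]), min_std)
            for x, value in enumerate(row)
        ])
    return mask
```

A mixture-of-Gaussians model would keep several such mean/variance pairs per pixel instead of one, which is what allows repetitive background motion to be absorbed into the background.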
Given the connected regions in the image corresponding to moving objects, the next step in the tracking pipeline is to track these blobs by establishing blob correspondence across frames. Grimson et al. (1998) and Wren et al. (1997) perform blob tracking based on a Kalman Filter (Gelb, 1974) by updating the position of the center of the blobs and their size. Paragios and Deriche (2000) rely on background subtraction in order to obtain an initial detection of the moving regions in the image, which are used as initialization for a level set contour. The level set formulation captures the motion detection and the tracking task simultaneously, by forcing the contour to converge towards the moving area, avoiding areas with a high gradient or static objects. Isard and MacCormick (2001) propose to jointly model the foreground and background by using mixtures of Gaussians, similar to (Stauffer and Grimson, 2000). Given a ground plane estimate, the foreground regions, modeled as cylinders, are projected into a 3D world coordinate system and are tracked by employing a particle filter, which models 3D position, shape and velocity of the objects.

The limitation of the standard blob tracking approaches is that they require either that only one object is present in the scene or that the moving objects keep a certain distance from each other in the image plane. The lack of a mechanism to separate blobs into individual objects makes it difficult to generate unique track labels for different people. To address this problem, Haritaoglu et al. (2000) propose to utilize the vertical projection histogram of the contour resulting from background subtraction in order to determine whether the foreground region contains multiple people. To this end, a vertical projection histogram template is learned from single person annotations and each contour from an incoming blob is compared to the learned silhouette from a single person using the Sum of Absolute Differences (SAD) method (Haritaoglu et al., 1998), yielding a separation into individual pedestrian regions. The regions are then approximated by a rectangle and are tracked using a second order motion model. Collins et al. (2001) classify the blobs into different classes before tracking by providing class labels for each blob based on a neural network approach. Building on this, they propose a classifier based on linear discriminant analysis to distinguish between different vehicle types. Similar to (Hager and Belhumeur, 1998; Zheng and Chellappa, 1995), they then apply an image region matching approach for the tracking part, which determines the best match to the current region by normalized cross correlation using intensity values around candidate regions in the new image.

Another well-established route towards efficient image-based tracking is to use a sparse collection of features such as edges and prior knowledge about the model of the target objects. In general, model-based tracking approaches try to obtain more information about the tracked objects by estimating their precise pose, and they use this information for predicting future motion more accurately. Gennery (1982) presented one of the first approaches for tracking of solid 3D objects assuming the model to be known. Using a procedure similar to the Kalman Filter, including prediction and updating steps, Gennery (1982) proposes a six degree of freedom model for the position and the orientation of the 3D object, which are predicted using previous knowledge and updated by extracting edge elements closest to the predicted line segments of the model.
Koller et al. (1993) propose a vehicle detection and tracking framework based on a 3D car model. By clustering coherently moving image features first, image regions likely to contain target objects are extracted. Assuming that the vehicles are moving on a planar ground surface and given a rough estimate of the plane and camera parameters, edge segments extracted from the image are matched to the 2D model edges obtained from back-projection of the 3D polyhedral model placed on the ground plane. The 3D object is tracked on the ground plane assuming a uniform motion model along a circular arc, using a prediction/update framework. Similarly, Dellaert and Thorpe (1997) propose an approach that tracks vehicles in highway scenarios by predicting how an imaginary cube around a car position in 3D will fit the projection to the image plane.

Prior knowledge about the objects to track was also extensively employed in region-based level-set tracking approaches (Cremers, 2006; Leventon et al., 2000; Tsai et al., 2001). The approaches perform a local optimization, iterating between a segmentation and a warping step to track an object’s contour over time, incorporating prior knowledge about the shape. Since both steps only need to be evaluated in a narrow band around the currently tracked contour, they can be implemented very efficiently. However, in general, region-based approaches suffer from the fact that a fixed model of the target object is required, which makes it hard to apply them to complex outdoor settings where, e.g., a huge number of vehicle types and shapes exists. Furthermore, occlusions and clutter in real scenarios will often cause divergence in the described approaches.
A number of approaches have been proposed in the context of template matching based tracking. The general procedure in this category of tracking approaches is to perform a brute force search for a region in the image similar to a predefined template. A template can be fixed (Comaniciu et al., 2003; Schweitzer et al., 2002) or generated and updated online (Grabner et al., 2006; Jepson et al., 2003). Comaniciu et al. (2003) represent a target object by a simple weighted color histogram. Instead of a brute force search for the new location of the object in the next frame, they propose a Mean-Shift approach (Cheng, 1995) that searches for the mode by trying to find a position in the new frame that maximizes the appearance similarity of the template and the corresponding image location. Grabner et al. (2006) propose an on-line AdaBoost (Freund and Schapire, 1995) based approach that, starting from an initial detection, learns a template of the object in an AdaBoost framework, extracting features inside the detection bounding box as positive examples and around the bounding box as negative background examples. In the following frames, the most likely new position is found by sampling the neighborhood of the previous object position. The template is then updated based on the new features. This allows the template to be adapted while tracking the object, coping with appearance changes of the object caused by illumination changes or plane rotations. In general, online updated template-based methods perform better due to their ability to compensate for rotations or scale and appearance changes. Because fixed template-based approaches encode the object appearance usually generated from only a single view, they are only suitable for tracking objects that undergo small pose changes. For a thorough review of image-based tracking approaches, we refer the reader to the survey by Yilmaz et al. (2006).

2.2. Tracking-by-Detection

As mentioned before, measurement extraction approaches employed for early tracking methods often relied on background subtraction, which makes them inapplicable to our scenarios with a moving camera. Low-level region/template-based tracking approaches are usually sensitive to illumination, appearance changes or occlusions. In addition, they require a high-level tracker for robust multi-object tracking, which performs consistency checks for recovering from failures caused by the low-level trackers, as we will show in Chapter 4. Furthermore, these approaches do not have a discriminative model that classifies objects into different categories of interest, making the data association much harder. However, due to remarkable progress in object detection and classification (Dalal and Triggs, 2005; Dollar et al., 2010; Felzenszwalb et al., 2010b), the most successful approach for tracking in recent years has been tracking-by-detection. In this process, the output of an object detector, e.g., (Dalal and Triggs, 2005; Felzenszwalb et al., 2010b; Leibe et al., 2008b), executed in each frame is integrated into long-term trajectories. In general, a typical tracking-by-detection pipeline is divided into two parts: first, a trajectory hypothesis generation process that, using an appearance and dynamical model, links the detections from adjacent frames into trajectory hypotheses; second, an optimization process that selects a set of hypotheses that is most likely representative for the scene.

Many approaches (Choi and Savarese, 2010; Ess et al., 2009b; Leibe et al., 2008a; Okuma et al., 2004; Wu and Nevatia, 2007; Yang and Nevatia, 2012; Zhang and Nevatia, 2008), including our own approaches presented later, employ a simple color histogram extracted from the image content inside a detection bounding box in order to decide, based on some histogram distance measure, whether two detections should be linked together or not. As similarity measure, Ess et al. (2009b), Leibe et al. (2008a), and Wu and Nevatia (2007) employ the Bhattacharyya distance. Kuo et al. (2010) build an appearance model based on several complementary features such as color histograms, HOG features (Dalal and Triggs, 2005) modeling object shape, and covariance matrices (Tuzel et al., 2006) describing the texture of a detection. With these features, an appearance representation is discriminatively trained in a one vs. all manner for each trajectory using a boosting framework. A simple color-based data association procedure alone is not robust for a pedestrian tracking scenario due to illumination changes, shadow artifacts, or indistinguishable clothing (e.g., pedestrians dress mostly dark in winter and are not distinguishable during the linking process). Therefore, in the data association process the appearance model is usually reinforced by a dynamic model responsible for spatially plausible association. When tracking is performed in 3D world coordinates, many approaches employ a common dynamic model assumption, the constant velocity model, which has been shown to be sufficient for our tracking scenarios and in other state-of-the-art approaches (Choi and Savarese, 2010; Ess et al., 2009b; Gavrila and Munder, 2007; Leibe et al., 2008a).
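The two cues discussed above, a histogram-based appearance similarity and a constant velocity prediction, can be illustrated with a minimal sketch. The function names are hypothetical, and the distance is written in one common variant, the square root of one minus the Bhattacharyya coefficient; the cited approaches may use a different normalization or variant.

```python
from math import sqrt

def normalize(hist):
    """Scale a histogram so that its bins sum to one."""
    total = float(sum(hist))
    return [h / total for h in hist]

def bhattacharyya_coefficient(p, q):
    """Overlap of two normalized histograms: 1 for identical, 0 for disjoint."""
    return sum(sqrt(a * b) for a, b in zip(p, q))

def bhattacharyya_distance(hist_a, hist_b):
    """Distance variant sqrt(1 - BC); an assumption, other definitions exist."""
    bc = bhattacharyya_coefficient(normalize(hist_a), normalize(hist_b))
    return sqrt(max(0.0, 1.0 - bc))

def predict_constant_velocity(position, velocity, dt):
    """Extrapolate a position under the constant velocity assumption."""
    return tuple(p + v * dt for p, v in zip(position, velocity))
```

In a data association step, a detection would typically be linked to a track only if it lies close to the predicted position and its histogram distance to the track's appearance model is small; how the two cues are weighted is a design choice that is not specified here.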
In 2D, an implicit assumption of constant velocity is made by associating detections with a similar 2D position and scale (Wu and Nevatia, 2007; Yang and Nevatia, 2012; Zhang and Nevatia, 2008).

We employ a hypothesise-and-verify procedure (cf. Chapter 3.5) similar to (Ess et al., 2009b; Leibe et al., 2008a) as the optimization process in our tracking approaches in order to infer the most likely trajectory set that best represents the observations from past and current frames. The hypothesise step generates an overcomplete set of trajectory hypotheses by linking the pedestrian detections in a space-time volume using the Extended Kalman Filter with a constant velocity motion model and a histogram-based appearance model. For obtaining an optimal set of trajectory hypotheses in the verify step, we apply the MDL (Minimum Description Length) approach, similar to (Leibe et al., 2008a).

Another approach for data association often used for offline tracking is based on a tracklet generation process followed by a global optimization (Andriluka et al., 2008; Huang et al., 2008; Kuo et al., 2010; Singh et al., 2008; Stauffer, 2003; Yang and Nevatia, 2012). Tracklets are confident snippets of trajectories including detections with a very high affinity. For the final global data association and the connection of tracklets to final trajectories, many approaches (Huang et al., 2008; Singh et al., 2008; Wu and Nevatia, 2007) rely on the Hungarian algorithm proposed by Kuhn (1955). Another popular optimization process employed for the global data association task is based on the general min-cut/max-flow network paradigm (Berclaz et al., 2011; Izadinia and Shah, 2012; Leal-Taixe et al., 2011; Pirsiavash et al., 2011; Zhang and Nevatia, 2008), where the tracking problem is modeled as an optimization of a flow network. The nodes of the network represent the individual detections and frame-to-frame links represent the affinity between the corresponding detections. In order to obtain the optimal assignment of detections to trajectories, Zhang and Nevatia (2008) use a min-cost flow algorithm from Goldberg (1997) which is repeatedly applied with different amounts of flow (equivalent to the number of objects), inferring occlusions and associations iteratively. In order to cope with long-term occlusions, besides varying the amount of flow, the graph is expanded with possible occlusion hypotheses that are linked with observed tracklet pairs if consistent with the appearance and scale. The complexity of the employed optimization procedure (Goldberg, 1997) is polynomial in the number of frames. More recently, however, Pirsiavash et al. (2011) proposed to solve the multi-object tracking problem formulated as a flow network using a greedy algorithm that sequentially instantiates tracks using a shortest path procedure, which results in linear time complexity in the number of frames and the number of objects.

Andriyenko and Schindler (2011), Andriyenko et al. (2012), and Yamaguchi et al. (2011) propose to formulate the data association problem as the minimization of a continuous energy function. The energy function consists of different terms that model a desired configuration for a pedestrian trajectory by linking detections plausibly with regard to pedestrian dynamics, collision avoidance and object persistence. In trying to approximate the most realistic trajectory configuration that reflects the real world scenario, the energy functionals usually become highly non-convex.
To cope with non-convexity, Andriyenko and Schindler (2011) propose to optimize the energy function using a conjugate gradient method which is augmented with trans-dimensional jumps. These jumps allow the optimization to leave local minima and thus to find a better configuration that minimizes the energy. Yamaguchi et al. (2011), however, employ a variant of the simplex algorithm combined with several restarts in order to escape the local minima.

So far, the presented tracking-by-detection approaches perform tracking without taking into account the interaction between the individual persons, imposing only weak constraints on the pedestrian motion by assuming a constant velocity model for human dynamics. However, in a real world scenario human behavior is usually influenced by many factors such as the intended goal, other scene objects and obstacles. Based on this fact, several approaches have been proposed to model the human motion by incorporating physical and social constraints of the surroundings (Leal-Taixe et al., 2011; Luber et al., 2010; Pellegrini et al., 2009; Yamaguchi et al., 2011). Originally, social force models were employed for crowd simulations (Helbing and Molnár, 1995; Klügl and Rindsfüser, 2007; Saboia and Goldenstein, 2011), modeling the behavior of pedestrians in evacuation scenarios, or for modeling plausible dynamics of crowds of virtual pedestrians in computer graphics (Heigeas et al., 2003; Lerner et al., 2007).

Helbing and Molnár (1995) propose to model the dynamics of pedestrians by social forces. In particular, the assumption is made that each object in the scene emanates a repulsive force. These forces represent the fact that humans always want to keep a certain minimum distance to other scene participants and static objects. Consequently, the movement of each pedestrian is constrained by forces from other scene objects while moving to the desired destination with a desired speed. The proposed social-force model consists of three parts: the first part models the acceleration towards the desired velocity of motion; the second part models the repulsive effect of walls and other people; and the third part models an attraction energy emanated by the motivation of pedestrians to reach a certain goal. Helbing and Molnár (1995) show that a weighted combination of these terms allows a very realistic pedestrian dynamics simulation.

Inspired by this social force concept, Luber et al. (2010) propose to integrate it into a multi-hypothesis target tracker using only measurements from a laser range finder. To this end, the social force model is combined with the prediction step of a Kalman Filter, pulling the prediction to a position in the scene that is consistent with the individual goal and desired speed and accounts for the influence of the environment and other people. The presented experiments show that using social forces results in a more realistic prediction model of human motion, reducing the data association error especially after occlusions. Parallel to the approach of Luber et al. (2010), Pellegrini et al. (2009) propose a social force model based approach (Linear Trajectory Avoidance) for motion prediction in the context of visual multi-person tracking. The proposed approach models the dynamics of each pedestrian based on an energy field which is affected by three different terms similar to the work of Helbing and Molnár (1995): interaction cost, desired speed and direction, and intermediate goal.
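As a rough illustration of how social forces can drive a motion prediction, the sketch below combines a driving term that relaxes the velocity towards a desired velocity with an exponentially decaying repulsion from other pedestrians. The functional forms, the parameters `tau`, `strength`, and `falloff`, and all function names are assumptions made for this example; the attraction term and the repulsion from walls are omitted, and the sketch does not reproduce the exact model of any of the cited works.

```python
from math import exp, hypot

def driving_force(velocity, desired_direction, desired_speed, tau=0.5):
    """Acceleration that relaxes the velocity towards the desired velocity."""
    return tuple((desired_speed * e - v) / tau
                 for e, v in zip(desired_direction, velocity))

def repulsive_force(position, other, strength=2.0, falloff=0.5):
    """Repulsion pointing away from `other`, decaying with distance."""
    dx, dy = position[0] - other[0], position[1] - other[1]
    dist = hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)  # direction undefined for coinciding positions
    magnitude = strength * exp(-dist / falloff)
    return (magnitude * dx / dist, magnitude * dy / dist)

def social_force_step(position, velocity, desired_direction, desired_speed,
                      others, dt=0.1):
    """One Euler step: sum the forces, update velocity, then position."""
    fx, fy = driving_force(velocity, desired_direction, desired_speed)
    for other in others:
        rx, ry = repulsive_force(position, other)
        fx, fy = fx + rx, fy + ry
    new_velocity = (velocity[0] + fx * dt, velocity[1] + fy * dt)
    new_position = (position[0] + new_velocity[0] * dt,
                    position[1] + new_velocity[1] * dt)
    return new_position, new_velocity
```

Because the repulsion here depends only on the current positions, it becomes noticeable only when pedestrians are already close to each other.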
In particular, Pellegrini et al. (2009) define an individual energy field for each pedestrian, which is a function over the possible velocity vectors the person can choose. Consequently, in each frame the moving direction and velocity of each pedestrian are chosen as the minimum of this individual energy field, which makes the pedestrians move in the optimal direction. For computing the interaction cost, the point of closest approach to other pedestrians in the near future (obtained by a simple linear extrapolation of the trajectory observed so far) is used, rather than just the current positions as in (Helbing and Molnár, 1995). This point of closest approach is then used to adapt the speed and walking direction in order to minimize the collision likelihood.

In the course of his master thesis project, Fischer (2012) reimplemented both state-of-the-art approaches (Luber et al., 2010; Pellegrini et al., 2009) and integrated the proposed models into the prediction step of our high-level tracker presented in Chapter 3. The disadvantage of the approach from Luber et al. (2010) is that energy potentials are modeled only at the pedestrians' current locations, so that the forces take effect late and cause very abrupt direction changes as soon as the pedestrians are really close to each other. Modeling the driving force from the point of closest approach as proposed by Pellegrini et al. (2009) results in more plausible pedestrian trajectories. Since this point of closest approach is usually further away from the current position, the path and velocity adaptation begins earlier, yielding much smoother trajectories compared to the results from Luber et al. (2010). A further interesting fact we derived during the evaluation of both approaches on our sequences is that social forces can describe and predict human interaction only for scenarios where the objects are moving independently and not in groups. Pedestrians moving in groups violate the social force rules, generating an undesired repulsive effect that yields a wrong prediction. Leal-Taixe et al. (2011) and Yamaguchi et al. (2011) have recognized this problem and first perform group detection before applying social force models for prediction of the future motion. This is also a motivation behind our group detection and interaction type classification approach developed in Chapter 9.

2.3. Stereo-based Tracking

As already mentioned in the previous chapter, the enormous progress in object detection (Dalal and Triggs, 2005; Felzenszwalb et al., 2010b) made the development of robust tracking-by-detection approaches possible in the first place (Andriluka et al., 2008; Ess et al., 2009b; Gavrila and Munder, 2007; Huang et al., 2008; Leibe et al., 2008a; Okuma et al., 2004; Wu and Nevatia, 2007). However, classical tracking-by-detection methods typically require the approach to execute a computationally expensive object detector in each frame, making it hard to achieve real-time performance at the system level. Many object detection approaches targeted at real-time applications follow a simple strategy in order to alleviate this problem: they reduce the detector search space by extracting ROIs based on object motion (Enzweiler et al., 2008), which is not applicable to our scenes since our camera is also moving, or on texture content (Shashua et al., 2004). In our work, we mostly rely on depth as an additional cue in order to constrain object detection to small ROIs, similar to (Bajracharya et al., 2009a; Bansal et al., 2010; Gavrila and Munder, 2007; Geronimo et al., 2010a). These ROIs are extracted in each frame and are evaluated by the detector to feed a tracking-by-detection process. Bansal et al.
(2010) extract ROIs by projecting the 3D points from a stereo depth map onto the estimated 2D ground plane. The local maxima of this projection are backprojected to the image, forming the ROIs which are evaluated in each frame by the detector. The detector output is then associated to trajectories using a correlation tracker. Since only a small number of ROIs is processed in each frame, their approach nearly reaches real-time performance. Similar to Bansal et al. (2010), Bajracharya et al. (2009a) use range data from stereo in order to generate regions-of-interest. Pedestrians are detected using shape features that are extracted from the 3D points of the ROIs. Since the strategy of Bajracharya et al. (2009a) is to robustly detect and track pedestrians in open land scenes with few potential ROI candidates, they did not track individual objects, but ROIs (usually consisting of several pedestrians in crowded scenarios). The goal of tracking ROIs is to reduce the number of false positives by aggregating the classification information of single frames and to estimate the velocity of the ROIs. The association of ROIs from frame to frame is performed by simply matching the color histograms of the individual ROIs, extracted from the corresponding 2D positions in the image.

Gavrila and Munder (2007) constrain the search space using ROIs for generating detection hypotheses. Detection hypotheses are obtained by measuring the Chamfer distance between a learned shape contour model and the image input, and are then verified by cross correlation between the two stereo images. For trajectory generation, Gavrila and Munder (2007) employ a simple α-β filter (Benedict and Bordner, 1962) (closely related to the Kalman Filter) that propagates the uncertainty of the bounding box position and the depth in a predict-update manner. Similar to our trajectory generation process presented in Chapter 3.5, the resulting hypothesis set can contain multiple tracks with one and the same measurement assigned. In order to obtain an optimal hypothesis set, Gavrila and Munder (2007) use the aforementioned Hungarian algorithm (Kuhn, 1955).

Luber et al. (2011) propose a 3D pedestrian detection and tracking approach based on RGB-D data from a Microsoft Kinect camera. Pedestrians are detected by combining the popular HOG detector (Dalal and Triggs, 2005) applied to the color image with a depth image based detector, Histograms of Oriented Depths (HOD). Both detectors are executed independently and the resulting output is fused using a weighted mean. The detections are then fed into an online-tracking system proposed by Grabner et al. (2006) which learns the model of tracked objects based on a boosting framework. The output of the detector and the online-tracker is used in order to generate observations for a Multi-Hypotheses Tracker (MHT) similar to (Reid, 1979) that generates and maintains long-term trajectories. This approach is similar to our proposed hybrid-tracking approaches presented in Chapter 4. On the one hand, the online-tracker is a low-level tracker that in case of false negatives generates new observations which are fed to the MHT (high-level tracker). On the other hand, the high-level tracker drives the low-level tracker to updated object positions and thus reduces the sensitivity to drift, typical for online image-based trackers.
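For readers unfamiliar with the α-β filter mentioned above, its predict-update cycle can be sketched for a single scalar state, e.g., one bounding box coordinate or the depth. This is a generic textbook formulation with fixed gains, not the configuration used by Gavrila and Munder (2007); the default values of `alpha` and `beta` are arbitrary assumptions, and the sketch does not track any uncertainty.

```python
def alpha_beta_step(x, v, z, dt=1.0, alpha=0.85, beta=0.005):
    """One predict-update cycle of a scalar alpha-beta filter.

    x, v: current position and velocity estimates
    z:    new measurement of the position
    """
    # Predict: extrapolate the position with the current velocity estimate.
    x_pred = x + v * dt
    # Update: correct position and velocity by fixed fractions of the residual.
    residual = z - x_pred
    x_new = x_pred + alpha * residual
    v_new = v + (beta / dt) * residual
    return x_new, v_new
```

In contrast to the Kalman Filter, the gains are constant rather than computed from propagated covariances, which makes the filter very cheap but requires the gains to be tuned by hand.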
3. Preliminaries

In this thesis we focus on the development of tracking approaches for mobile platforms equipped with a stereo camera. Stereo cameras allow us to extract an additional cue, the depth, which we will show to be very useful over the course of the thesis. We started our experiments with already published datasets captured with a stereo rig mounted on a child stroller in urban scenarios, courtesy of Ess et al. (2009a). In addition, we captured more challenging datasets, especially required for the evaluation of our novel tracking-before-detection approach for tracking of unknown objects (cf. Chapter 8). In this chapter, we will present and discuss a subset of approaches we have chosen as components (stereo, visual odometry, object detection and multi-hypothesis tracker), which we constantly employed while developing our tracking frameworks.

In the following Sec. 3.1, we will introduce a subset of pedestrian detection approaches we used. Then, in Sec. 3.2 we will discuss stereo estimation and explain the use of depth information in our tracking frameworks. In Sec. 3.3 we will briefly present the visual odometry which is used as a necessary means in our framework, allowing us to reason about object trajectories in world coordinates. In Sec. 3.4 we will introduce the Multi-Hypothesis Tracking approach by Ess et al. (2009b) which served as basis for our own extended reimplementation. In Sec. 3.5 a real-time tracking-by-detection approach will be presented combining all of the aforementioned components in a unified framework. Finally, in Sec. 3.6 we will describe the datasets we have captured and used for systematic evaluation of our frameworks.

3.1. Object Detection

The ability to reliably detect pedestrians in real-world images made the progress in tracking-by-detection possible in the first place.
Pedestrian detectors proposed in recent years have reached remarkable detection performance in street scenes (Benenson et al., 2012; Dalal and Triggs, 2005; Dollar et al., 2010; Felzenszwalb et al., 2010b). Even though these approaches rely on conceptually simple Histograms-of-Oriented-Gradients (HOG) features, they still reach the best performance for fully observed pedestrians, as shown in (Dollar et al., 2009).

Figure 3.1.: Example results of the pedestrian detectors on the Bahnhof sequence. The first row shows the output of the detector proposed by Felzenszwalb et al. (2010b), the second row the output of the detector by Sudowe and Leibe (2011), and the last row the output of our upper body detector presented in Chapter 6.

The HOG descriptors were originally presented by Dalal and Triggs (2005) in the context of a sliding-window pedestrian detector. The central idea behind HOG features is that the local appearance and shape of an object can be robustly described by the distribution of gradients and their magnitudes. The distribution is described by histograms that group the gradients into bins according to their orientation, with each gradient weighted by its magnitude. These histograms are computed for each square cell of a dense, uniformly spaced grid that decomposes the image. In order to reach better illumination invariance, Dalal and Triggs (2005) propose to contrast-normalize the cells using a block-wise pattern before concatenating them to a descriptor. To this end, the histograms of the cells embedded in a block (usually consisting of 2×2 cells) are accumulated and all cells are normalized by this accumulated value (cf. Fig. 3.2).

Figure 3.2.: HOG feature extraction pipeline. (a) Original image, cropped from the first frame of the Bahnhof sequence. (b) Gradient image. (c) Decomposition of the image into small square cells. Cells are represented by gradients binned into histograms. Cell histograms are contrast-normalized by the intensity across a block (usually 2×2 cells). (d) Resulting HOG feature.

The extracted features are then used to train a discriminative, linear SVM classifier. The SVM makes a binary decision for a given image window, namely whether this window contains an object or not. During testing, the decision is repeated for all rectangular windows at each possible position by sliding the window over the entire image. To detect objects at different scales, the image is rescaled with several scale strides and the decision process is repeated for each rescaled image. The final detections are obtained after non-maximum suppression, which is necessary because the multi-scale approach produces several additional detections on a person at a number of neighboring scales.

A disadvantage of HOG features is their high computational cost, which makes approaches relying on them expensive to evaluate and thus limits their use on mobile platforms. The high computational cost is related to the multi-scaling required to detect pedestrians at different scales: for each rescaled image, the feature extraction procedure needs to be repeated. Generally, the image is not only downscaled, but also upscaled in order to detect pedestrians that are far away from the camera.
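The cell histogram and block normalization steps described above can be illustrated with a minimal sketch. It is not the implementation used in this thesis or in Dalal and Triggs (2005): it assumes a grayscale image given as a list of rows, central-difference gradients, unsigned orientations with hard bin assignment (no interpolation), and a simple normalization by the accumulated block value as described in the text. The function names and the default parameters (8-pixel cells, 9 bins) are illustrative assumptions.

```python
import math


def cell_histograms(image, cell=8, bins=9):
    """Magnitude-weighted orientation histograms, one per square cell.

    Assumption: `image` is a list of rows of grayscale values.
    """
    h, w = len(image), len(image[0])
    rows, cols = h // cell, w // cell
    hist = [[[0.0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            r, c = y // cell, x // cell
            if r >= rows or c >= cols:
                continue  # pixel lies outside the last complete cell
            # Central-difference gradient.
            gx = image[y][x + 1] - image[y][x - 1]
            gy = image[y + 1][x] - image[y - 1][x]
            mag = math.hypot(gx, gy)
            # Unsigned orientation in [0, 180) degrees, hard-assigned to a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            hist[r][c][b] += mag
    return hist


def hog_descriptor(image, cell=8, bins=9, block=2, eps=1e-6):
    """Concatenate block-normalized cell histograms into one descriptor."""
    hist = cell_histograms(image, cell, bins)
    rows, cols = len(hist), len(hist[0])
    descriptor = []
    # Overlapping blocks of block x block cells, shifted by one cell.
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            values = [v
                      for dr in range(block)
                      for dc in range(block)
                      for v in hist[r + dr][c + dc]]
            # Simplification: normalize by the accumulated block value.
            norm = sum(values) + eps
            descriptor.extend(v / norm for v in values)
    return descriptor
```

For a 16×16 image with 8-pixel cells, the sketch produces a single 2×2 block and hence a 36-dimensional descriptor. In a sliding-window detector, such a descriptor would be computed for every window position and scale, which is where the computational cost discussed above comes from.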
Many approaches have been proposed in recent years to speed up detection, including detection cascades (Felzenszwalb et al., 2010a; Viola and Jones, 2004), feature representations that are more efficient than HOG (Benenson et al., 2012; Dollar et al., 2010), and alternatives to the sliding-window search strategy based on ROI extraction relying on, e.g., stereo range data (Bajracharya et al., 2009a; Bansal et al., 2010; Gavrila and Munder, 2007), motion (Enzweiler et al., 2008) and scene geometry (Geronimo et al., 2010a). In order to cope with the problem of partial occlusions, several frameworks have been proposed that combine object part detector outputs in a mixture-of-experts manner (Enzweiler et al., 2010; Wojek et al., 2011).
In this thesis, we employed four different publicly available human detectors (Benenson et al., 2012; Felzenszwalb et al., 2010b; Prisacariu and Reid, 2009; Sudowe and Leibe, 2011) and our proposed upper body detector (Mitzel and Leibe, 2012a), which will be explained in detail in Chapter 6. The first two detectors that we used (Prisacariu and Reid, 2009; Sudowe and Leibe, 2011) are efficient reimplementations of the popular HOG-based approach proposed by Dalal and Triggs (2005). Both reimplementations use the GPU to speed up the computationally expensive components of the pipeline, such as the feature computation. The reimplementation by Sudowe and Leibe (2011) includes several further extensions, such as limiting the computational effort to small corridors in the image plane that are estimated by exploiting the given scene geometry, ground plane and camera parameters. Another detector employed in our recent work is the part-based detector by Felzenszwalb et al. (2010b). It uses HOG-based classifiers to detect the body parts and combines them in a probabilistic model for the final detection. During the integration of our tracking framework within the European Project EUROPA, we employed the detector by Benenson et al. (2012), which is based on an efficient detector by Dollar et al. (2010) but proposes several extensions to the detection pipeline that reduce the number of image scales by approximating the image features. By using a Stixel representation (Badino et al., 2009) without dense depth computation, the detector evaluation area can further be reduced to small image regions. These extensions led to a pedestrian detector that runs at 100 fps on a GPU.

3.2. Stereo Estimation

The association of detected objects is usually performed in 3D world coordinates, which enables us to employ physically plausible motion models for the individual objects.
The motion model becomes particularly important when the object detector fails, in which case it can continue to predict the most likely current object position. There are several options for converting the detector output (a bounding box) into 3D world coordinates. An approach often used in single-camera setups is to compute the intersection of a ray through the bounding box footpoint with an estimated ground plane (Ess et al., 2009b; Gavrila and Munder, 2007; Leibe et al., 2008a). This approach works reasonably well if the bounding box footpoint and the ground plane can be estimated precisely. However, due to imprecise scale selection during the non-maximum suppression process of the object detector and the approximate fit of the ground plane, the 3D positions are usually very noisy, as can be seen in Fig. 3.3(c). In several cases, detections jump more than 0.5 m from one frame to the next, meaning that they move at a speed of approximately 25 km/h, given the frame rate of 14 fps. In other frames, the projected positions oscillate around the same point on the ground plane, although both pedestrians are moving at around 4 km/h. Given the stereo data, however, the obtained 3D positions are much smoother and more accurate, as shown in Fig. 3.3(d).
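The footpoint back-projection and the speed estimate quoted above can be made concrete with a short sketch. It is an illustration under simplifying assumptions, not the method of the cited works: a pinhole camera with focal length `f` (in pixels) and principal point `(cx, cy)`, an optical axis parallel to a flat ground plane, and a known camera height. All function and parameter names are hypothetical.

```python
def backproject_footpoint(u, v, f, cx, cy, cam_height):
    """Intersect the viewing ray through pixel (u, v) with a flat ground plane.

    Assumed camera frame: x right, y down, z forward, with the ground
    plane at y = cam_height and the optical axis parallel to the ground.
    Returns the 3D point (x, y, z) in metres.
    """
    # Ray direction through the pixel, scaled so that its z component is 1.
    dx = (u - cx) / f
    dy = (v - cy) / f
    if dy <= 0:
        raise ValueError("footpoint must lie below the horizon line")
    # Scale the ray until it reaches the ground plane y = cam_height.
    t = cam_height / dy
    return (t * dx, cam_height, t)


def implied_speed_kmh(jump_m, fps):
    """Speed implied by a position jump of jump_m metres between two frames."""
    return jump_m * fps * 3.6  # m/frame -> m/s -> km/h
```

A jump of 0.5 m per frame at 14 fps corresponds to 0.5 · 14 = 7 m/s, i.e. about 25 km/h, as stated in the text. The sketch also shows why the back-projected positions are noisy for distant pedestrians: with the hypothetical values f = 500, cy = 240 and a camera height of 1 m, a footpoint at v = 265 maps to a depth of 20 m, and shifting it by only two pixels to v = 267 changes the depth by roughly 1.5 m.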
Figure 3.3.: (a,b) Detector output using the HOG-based implementation by Sudowe and Leibe (2011). (c) "3D position - Backprojection bbox": 3D positions on the ground plane obtained by intersecting the bounding box footpoint with the ground plane. (d) "3D position - Using stereo": 3D positions on the ground plane obtained by employing stereo information. Both axes of (c) and (d) are given in meters.

Laser scanners are often employed in robotic scenarios, since they yield precise and reliable distance measurements. Fusing camera and laser information allows for a more precise retrieval of the 3D position of detected pedestrians. Generally, this is achieved by back-projecting the distances measured by a front laser on the robot into the image plane. The z-values of the detected objects are obtained by taking the minimum of the laser points falling inside the detection bounding box, as shown in the example images in Fig. 3.4. With such a fusion, the association step of a classical tracking-by-detection approach can be improved significantly, especially for distant objects. Furthermore, in the context of a tracking system this fusion has one more advantage: the tracking results can in turn be employed by the local planning system of a robot. In particular, the tracker output can be used to annotate the laser points as belonging to objects of interest. Using laser scanners for distance measurements eliminates most issues arising with stereo data (e.g., failures of algorithms in homogeneous and low-textured image regions or on reflections in shop windows), but it introduces different problems. Many current laser range sensors yield measurements for only a few horizontal planes with a relatively low radial resolution.
This means that thin structures of target objects, such as pedestrian legs, will often not be hit, and the resulting distance inside the detection bounding box will correspond to a different object at a greater distance. We can cope with this problem by assuming some correlation between the projected bounding box footpoint and the retrieved laser distance.

Figure 3.4.: Tracking images showing the back-projected laser information (red dots) falling inside the detection bounding boxes. These distance measurements from the laser were used for a precise 3D position estimation of pedestrians.

Basic Stereo Estimation Pipeline. For a given pair of images from a stereo setup, the goal is to estimate the depth value for each pixel. For this, stereo algorithms usually perform a search for corresponding points or patches (small image regions) based on similarity in intensity or in orientation. The correspondence search is usually reduced to a 1D search by rectifying the two images. The rectification sets the epipolar lines of the images parallel to the rows, such that the y-coordinates of corresponding points become the same. The output of the stereo algorithm is a disparity image, where each pixel states the distance d between the x-position of a point in the left image and the x-position of the corresponding point in the right image. Then, given the focal length f and the camera baseline B, we obtain the depth z as

$$ z = \frac{f B}{d}. \tag{3.1} $$
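As a small illustration of Eq. (3.1), the following sketch converts a disparity image into a depth image. It assumes that the disparity is given in pixels as a list of rows, that the focal length is given in pixels and the baseline in metres, and that pixels without a valid match carry a non-positive disparity. The function name, the parameter names and the handling of invalid pixels are assumptions made for this example.

```python
def disparity_to_depth(disparity, focal_px, baseline_m, min_disp=1e-6):
    """Apply z = f * B / d to every pixel of a disparity image.

    Pixels whose disparity is not larger than min_disp are treated as
    invalid (no match or point at infinity) and mapped to None.
    """
    depth = []
    for row in disparity:
        depth.append([
            focal_px * baseline_m / d if d > min_disp else None
            for d in row
        ])
    return depth
```

With the hypothetical values f = 500 pixels and B = 0.4 m, a disparity of 10 pixels corresponds to a depth of 20 m. Since z is inversely proportional to d, a fixed disparity error leads to a depth error that grows quadratically with distance.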
Figure 3.5.: (a) A rectified stereo pair, where the red line shows that the corresponding points have the same y-coordinate. (b) Corresponding disparity image obtained with the approach by Geiger et al. (2010).

In this thesis we relied on two different stereo approaches (Felzenszwalb and Huttenlocher, 2006; Geiger et al., 2010). The approach by Felzenszwalb and Huttenlocher (2006) is a global MRF-based approach. Its advantage is that, due to the smoothness enforced between neighboring pixels, the resulting stereo estimation is dense and accurate. In general, however, the MRF formulation for stereo yields an NP-hard optimization problem due to the large label set, so its solution needs to be approximated by graph cut (Boykov et al., 2001) or belief propagation based techniques (Weiss and Freeman, 2001). Both approximation approaches limit the application in real-time scenarios due to their space and time complexity (20-30 sec. on a 640×480 image). The second method we used for the experiments was recently introduced by Geiger et al. (2010). It produces accurate stereo depth maps of comparable quality to global approaches while running close to real time. The approach first extracts a set of support points that can be robustly matched in both images. By performing triangulation on this set of points, one obtains a strong prior on possible disparities and reduces the matching ambiguities of the remaining points around the support points. This allows for efficient exploitation of the disparity search space and yields accurate, dense reconstructions.

3.3. Visual Odometry

Our proposed tracking approaches in this thesis are based on a moving camera setup. This means that, in order to estimate object trajectories in global world coordinates, we need to estimate the camera position in each frame.
Often, the camera position is estimated based on wheel speed sensors or inertial measurement units (IMUs). However, the estimation of odometry based on wheel speed sensors is very imprecise. Especially when the robot is moving on slippery terrain and