1.3 Related Work
1.3.4 Dense Correspondence-Based Methods
The systems we will present in this thesis all fall into the category of dense correspondence-based methods. Such methods use a trained function to predict correspondences between the pixels of the image and positions in the local coordinate system of the object. Similarly to the key-point-based methods, multiple correspondences are selected in a RANSAC-like fashion and combined to generate pose hypotheses.
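The sampling step described above can be illustrated in a strongly simplified setting. The following Python sketch is a toy example under our own assumptions, not the procedure used in this thesis or in the cited works: it replaces the 6D pose with a 2D rigid transform (rotation angle and translation), fits each hypothesis to two randomly drawn correspondences, and ranks hypotheses by their inlier count. All names (`fit_rigid_2d`, `transform`, `count_inliers`, `ransac_pose_2d`) and the default parameters are hypothetical.

```python
import math
import random


def fit_rigid_2d(obj_pair, img_pair):
    """Fit a 2D rigid transform (theta, tx, ty) to two correspondences."""
    (ox0, oy0), (ox1, oy1) = obj_pair
    (ix0, iy0), (ix1, iy1) = img_pair
    # Rotation: difference between the segment directions in both spaces.
    theta = math.atan2(iy1 - iy0, ix1 - ix0) - math.atan2(oy1 - oy0, ox1 - ox0)
    c, s = math.cos(theta), math.sin(theta)
    # Translation: map the rotated object midpoint onto the image midpoint.
    mx, my = (ox0 + ox1) / 2, (oy0 + oy1) / 2
    tx = (ix0 + ix1) / 2 - (c * mx - s * my)
    ty = (iy0 + iy1) / 2 - (s * mx + c * my)
    return theta, tx, ty


def transform(pose, point):
    """Apply the pose (theta, tx, ty) to a 2D object point."""
    theta, tx, ty = pose
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return c * x - s * y + tx, s * x + c * y + ty


def count_inliers(pose, correspondences, threshold):
    """Count correspondences whose transformed object point lands near the image point."""
    inliers = 0
    for obj_pt, img_pt in correspondences:
        px, py = transform(pose, obj_pt)
        if math.hypot(px - img_pt[0], py - img_pt[1]) < threshold:
            inliers += 1
    return inliers


def ransac_pose_2d(correspondences, n_hypotheses=200, threshold=0.05, seed=0):
    """Sample minimal sets, fit one hypothesis per set, keep the best-supported one."""
    rng = random.Random(seed)
    best_pose, best_inliers = None, -1
    for _ in range(n_hypotheses):
        (obj0, img0), (obj1, img1) = rng.sample(correspondences, 2)
        if obj0 == obj1:
            continue  # degenerate minimal set: rotation is undefined
        pose = fit_rigid_2d((obj0, obj1), (img0, img1))
        inliers = count_inliers(pose, correspondences, threshold)
        if inliers > best_inliers:
            best_pose, best_inliers = pose, inliers
    return best_pose, best_inliers
```

Here `correspondences` is a list of `((ox, oy), (ix, iy))` pairs. In the 3D case, a minimal set would need more correspondences and a different solver; the sample-fit-verify structure is the part this sketch is meant to convey.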
The general idea of densely predicting correspondences was first introduced for the problem of human body pose estimation. In the Vitruvian Manifold [73], Taylor et al. address this problem by predicting each pixel’s position on the surface of the human body, before finally fitting a parametric model.
Following this work, Shotton et al. describe a similar system for camera pose estimation in known scenes. They predict each pixel’s position in the coordinate system of the scene and use it as a stepping stone for their final estimate.
Inspired by this approach, Brachmann et al. [31] published a method for 6D object pose estimation. This method will provide the framework for the systems we will present in this thesis. We will give a description of the framework in Chapter 2. An overview is provided in Figure 1.4.
The method from [31] uses a random forest to jointly predict whether or not a pixel is part of the object and, if it is, where it is located in the object’s local coordinate system. The former prediction is referred to as object probabilities, the latter as object coordinates.
Brachmann et al. use a RANSAC-based sampling scheme to generate a pool of pose hypotheses, which serve as the starting point for a search procedure. During this procedure, pose hypotheses are evaluated via a scoring function based on analysis-by-synthesis: the function uses a 3D model of the object to render images under the pose hypothesis and then performs a pixel-wise comparison against the observed images.
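The pixel-wise comparison can be made concrete with a small sketch. The following Python code is a hypothetical illustration, not the scoring function of [31]: it assumes that rendering under each hypothesis has already produced a depth image (the renderer itself is omitted), that `None` marks pixels not covered by the rendered object, and that the error is a truncated absolute depth difference. The truncation value and the names `synthesis_score` and `best_hypothesis` are our assumptions.

```python
def synthesis_score(rendered, observed, truncation=0.05):
    """Mean truncated absolute depth difference over rendered pixels (lower is better)."""
    total, count = 0.0, 0
    for rendered_row, observed_row in zip(rendered, observed):
        for r, o in zip(rendered_row, observed_row):
            if r is None:
                continue  # pixel not covered by the rendered object
            total += min(abs(r - o), truncation)
            count += 1
    # A hypothesis that renders no pixels at all gets the worst possible score.
    return total / count if count else float("inf")


def best_hypothesis(rendered_by_hypothesis, observed, truncation=0.05):
    """Return the key of the hypothesis whose rendering matches the observation best."""
    return min(
        rendered_by_hypothesis,
        key=lambda name: synthesis_score(
            rendered_by_hypothesis[name], observed, truncation
        ),
    )
```

In this sketch, an occluder in front of the object changes the observed depth at pixels that the rendering covers, so even the correct hypothesis accumulates error there.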
The framework from [31] combines a number of advantages: (i) The predicted correspondences can be trained to be invariant to lighting changes. (ii) By using densely predicted correspondences, the method is able to combine some inherent robustness against occlusion, as found in sparse key-point-based methods, with an ability to handle texture-less objects. (iii) It performs well in cluttered scenes, as its hypothesis-based search and analysis-by-synthesis-based scoring allow it to effectively tell similar-looking objects apart. The systems presented in this thesis will build on [31] to benefit from these advantages.
We see the biggest limitation of [31] in its handling of occluded objects. Even though dense predictions bring inherent robustness to occlusion, the simple analysis-by-synthesis approach struggles when applied to occluded objects, which can have a dramatically different appearance compared to an unoccluded rendered 3D model. In System II we will present a machine learning-based way to improve this scoring function.
While Brachmann et al. also use sampling to generate hypotheses, their view on sampling differs significantly from the one we will take in this thesis. The RANSAC-based sampling procedure in [31] functions as a black box, producing hypotheses. The systems we will present in this thesis will include this procedure as well. However, they will additionally introduce different Monte Carlo sampling schemes in the various parts of the framework.
In contrast to [31], we will always view sampling as a way to represent various probability distributions we are interested in. In Systems I and II, sampling is used to represent a posterior distribution over poses that ultimately coincides with our knowledge of the pose. In System III, we use samples to describe a probability distribution governing the behavior of an RL agent.
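As a generic illustration of what it means for samples to represent a distribution, the following Python sketch turns unnormalized log-scores of a set of samples into normalized weights and uses them to approximate an expectation. It is a textbook-style example under our own assumptions (scalar samples, hypothetical names `normalize_log_weights` and `weighted_expectation`) and does not describe the sampling schemes of Systems I to III.

```python
import math


def normalize_log_weights(log_weights):
    """Convert unnormalized log-weights into weights that sum to one."""
    # Subtracting the maximum avoids overflow in exp without changing the result.
    m = max(log_weights)
    unnormalized = [math.exp(lw - m) for lw in log_weights]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


def weighted_expectation(samples, weights, f=lambda x: x):
    """Approximate the expectation of f under the distribution the samples represent."""
    return sum(w * f(x) for x, w in zip(samples, weights))
```

With such a representation, any quantity of interest, for example the mean or the spread of a pose parameter, can be estimated directly from the weighted sample set.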
Apart from the systems to be presented in this thesis, multiple other works have successfully built upon [31]. In [74], Michel et al. extend the method to objects that are not rigid but include moving parts, such as laptop computers or cupboards with drawers and doors. In [75], Mund et al. adapt the system for use with a LiDAR sensor to improve airfield safety. In [45], the system is modified to work without depth information and applied to camera pose estimation. Finally, in [46], Brachmann et al. replace the random forest with a CNN and find an RL-inspired perspective to allow end-to-end training.
The approach from [46] is related to System III, which we will present in Chapter 6. Both systems use RL to learn a pose estimation pipeline including discrete choices. However, Brachmann et al. learn only a single discrete choice at the end of their pipeline, namely which of the hypotheses is to be selected as the final estimate. This enables them to learn the prediction of object coordinates at an earlier stage of the pipeline in an end-to-end fashion. In contrast, we will describe how to train an iterative system to dynamically make repeated discrete choices with the goal of making the best use of a restricted budget.