We have demonstrated that we can successfully recover 3D geometry from only a single input depth image by making structured predictions. Our algorithm efficiently combines both shape selection and pose estimation, using simple feature test evaluations to predict local geometry occupancy. Voxlets are learned from training data, which has so far included only man-made table-top sized objects. Even similar objects vary in size and viewing direction, but objects from distinct semantic classes share enough 3D shape components to allow good, though not perfect, reconstructions.
Sharp edges are frequently rounded due to averaging. Also, as a supervised learning method, our algorithm is limited by the data available at training time. This can be seen in Figure 5.13, as no computer-sized objects are in the training set. Currently, our voxlets are of a fixed size, and success correlates with the test scene containing similarly sized objects.
Timings and complexity Predicting the occupancy for a single scene takes less than 30 seconds using our unoptimized Python implementation. We found that memory was a limiting factor at both training and test time. This could be improved using a more efficient voxel storage scheme.
5.6.1 Future work
We present here two interesting opportunities for future research to improve the quality of the results from the algorithm.
Physics-based reasoning For some applications, the quality of our predictions may already be enough, e.g. to aid robot grasping or navigation. However, there is currently nothing guaranteeing that our results are physically stable or smooth. Learning to make physically plausible predictions is difficult. How to best incorporate physics-based reasoning [178, 146] is still an open problem, but enforcing this strong prior may result in even more accurate results.
Trading off goodness-of-fit against coverage We showed in Section 5.5.5 that by selecting the best-fitting voxlets from all the proposals, instead of accumulating all the proposals from all the sampled points, we can get a better overall result. The intuition behind this is clear: at some of the sampled points, the forest is not able to provide a good prediction. In many cases the bad prediction from the forest is immediately obvious, because the geometry of the predicted voxlet does not match the geometry of the scene at the proposed location. However, at other nearby points, the forest is able to make a good prediction which does match the geometry of the scene. By excluding badly fitting voxlets from the scene altogether, we should be able to increase the performance of the algorithm. However, the point at which we should stop adding voxlets is unclear. If we make voxlet predictions everywhere in the target image, then the good predictions are marred by the bad voxlet proposals. Conversely, if we add too few voxlets, we fail to make predictions in many areas. We demonstrate this concept visually in Figure 5.18. Finding a way to balance ‘goodness-of-fit’ against coverage of the unknown space is a key opportunity for future research.
(a) Here we have made a set of predictions for the scene, as described in this chapter. Some of the predictions are good fits to the observed data (✓), while others do not match the observed data well (✗). This goodness-of-fit can be evaluated from the observed data.
(b) By averaging all the predictions together we find that the bad voxlet predictions (✗) can cause a poor final prediction.
(c) We could just average together the predictions which fit the observed data the best (✓). However, this may mean that some areas of the scene do not get any predictions made at all, leading to a low rate of recall.
(d) The best solution can be found by balancing goodness-of-fit against coverage. Here we have used well-fitting predictions (✓) where possible. Where no good predictions exist, we have made the best predictions possible in order to make some predictions for the unknown space.
Figure 5.18: An overview of proposed future work, regularising the voxlet predictions to ensure a balance between goodness-of-fit and coverage of unknown space.
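The trade-off between goodness-of-fit and coverage can be illustrated with a small sketch. This is not the method used in this chapter, and it is only one of many possible formulations. It assumes a hypothetical greedy rule: each voxlet proposal carries a fit error measured against the observed data, and a proposal is added only while the fraction of unknown space it newly covers, scaled by an assumed weight, exceeds that error. The names Proposal, select_proposals and coverage_weight are illustrative assumptions, as is the convention that fit errors lie in the range 0 to 1.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple

Voxel = Tuple[int, int, int]


@dataclass(frozen=True)
class Proposal:
    """Hypothetical voxlet proposal (illustrative, not the thesis data structure)."""
    voxels: FrozenSet[Voxel]  # voxels this proposal would fill in
    fit_error: float          # assumed in [0, 1]; lower means a better fit to the observed data


def select_proposals(
    proposals: Iterable[Proposal],
    unknown: Set[Voxel],
    coverage_weight: float,
) -> Tuple[List[Proposal], Set[Voxel]]:
    """Greedily add proposals while the coverage gained outweighs the fit error."""
    remaining = list(proposals)
    selected: List[Proposal] = []
    covered: Set[Voxel] = set()
    total = max(len(unknown), 1)  # guard against an empty unknown region

    def gain(p: Proposal) -> float:
        # Reward only unknown voxels that no selected proposal has covered yet.
        newly_covered = (p.voxels & unknown) - covered
        return coverage_weight * len(newly_covered) / total - p.fit_error

    while remaining:
        best = max(remaining, key=gain)
        if gain(best) <= 0.0:
            break  # every remaining proposal costs more in fit than it adds in coverage
        selected.append(best)
        covered |= best.voxels & unknown
        remaining.remove(best)
    return selected, covered
```

Under these assumptions, a small weight keeps only the best-fitting proposals and leaves parts of the unknown space empty, as in panel (c) of Figure 5.18. A larger weight admits poorer proposals where nothing better exists, as in panel (d).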
Learning to discover objects for object completion
In this thesis we set out to infer the occupancy of a scene given as input a single depth image. We have made some steps towards achieving this goal in Chapters 4 and 5. These methods used statistical inference, together with carefully constructed features, to predict the missing geometry of the scene.
However, the results we have obtained have only been predictions of the missing data. Our overall problem, as given, is an ambiguous inverse problem: the only real way to know what the back of the scene looks like is to observe it from more than one viewpoint. In Chapter 1 we described how, in many cases, it is impossible to capture all the images required for such a reconstruction. In this chapter, therefore, we introduce a novel method to recover the 3D shape of scenes.
If we had a library of training objects, each with ground truth occupancy, we might be able to recover the full 3D scene structure from a single depth image by fitting the 3D models to the scene. However, such a library of training objects is not available for many objects in the world (Section 3.3). We hypothesise that the scene viewed in many images is made up of individual objects, each of which we may have seen in other scenes. These other images may have been captured by a different device in a different location in the world, or by the same device nearby to the current image. If any of the other scenes containing the object has had its full geometry recovered (e.g. with a Kinect Fusion system [80]), we can use the reconstructions of each object from these other scenes to help to directly complete the geometry of objects in the test scene.
[Figure 6.1 text labels: "Algorithms presented in this thesis"; scene locations London, Tokyo and San Francisco; "I've observed what these objects are used for"; "I've observed what these objects are called"; "I've observed the full 3D shape of this scene"; "I've now managed to complete the 3D shape of this scene"; "And I've learned what these objects are called"; "I now know what some of these objects might be used for".]
Figure 6.1: An overview image motivating the high-level concept of the algorithms presented in this chapter. By combining multiple partial views at an object level, we hypothesise that we should be able to complete many unknown regions of incomplete scans. Instead of using a library of training images, we require each object to be seen from differing viewpoints throughout a collection of input images.
However, this concept is non-trivial. Not only must a final alignment be found, at an object level, but regions of images corresponding to different objects must also be found between depth images captured of different scenes.
In this chapter we make two specific contributions to enable this method of 3D reconstruction:
1. We introduce a novel probabilistic framework for discovering objects from RGB-D images. In contrast to previous works in this field, which typically require one or more human-set parameters for a clustering algorithm, we are able to learn how to discover objects by making use of training data.
2. We introduce a method for using these corresponding regions to enable the completion of some of the unseen parts of objects embedded in a test image.
6.1 Problem statement
We reason that many objects found in the world can be considered as instantiations of abstract object templates. Each template can be thought of as a mould which holds information on the three-dimensional shape and size of the object, its colour, texture, material and other physical properties. We define a scene as being a set of one or more object instantiations together with background environment (e.g. walls and floors), and further properties such as lighting and background colour. Each scene is then captured in a single static RGB-D image D.
The aim of our object discovery system is to recover the number of object templates and each of their appearance models, given a set of input RGB-D images $\{D_1, \dots, D_N\}$, and to find the set of pixels in each image corresponding to each object. We assume that we do not know information about the shape of object templates in advance, instead requiring this as an output from the algorithm. The recovered correspondences between regions in each image are then used as input to our system to recover missing parts of the 3D shape of the objects in each scene. An overview of this approach is given in Figure 6.2.
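To make the inputs and outputs of this problem statement concrete, the following sketch shows one way the output of object discovery might be organised. It is purely illustrative: the class names ObjectTemplate and DiscoveryResult, the dictionary used as a placeholder appearance model, and the correspondences helper are all assumptions and do not describe the implementation used in this chapter.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) in one RGB-D image


@dataclass
class ObjectTemplate:
    """Hypothetical container for one discovered object template."""
    template_id: int
    # Placeholder appearance model; the real model is not specified here.
    appearance_model: Dict[str, float] = field(default_factory=dict)


@dataclass
class DiscoveryResult:
    """Hypothetical output of object discovery over images D_1 ... D_N."""
    templates: List[ObjectTemplate]
    # pixel_sets[i][template_id] is the set of pixels in image i assigned to that template.
    pixel_sets: List[Dict[int, Set[Pixel]]]

    def num_templates(self) -> int:
        """The number of templates is an output of discovery, not an input."""
        return len(self.templates)

    def correspondences(self, template_id: int) -> List[Tuple[int, Set[Pixel]]]:
        """Return (image index, pixel set) pairs for every image showing the template."""
        return [
            (image_index, regions[template_id])
            for image_index, regions in enumerate(self.pixel_sets)
            if template_id in regions
        ]
```

In this sketch, the list returned by correspondences plays the role of the corresponding regions that are passed on to the completion stage.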