6.5.6 Long Training Results
We trained our CNN for 5000 iterations in the previous experiments because we observed that this is usually enough to see how well different framework configurations perform. To give insight into how longer training affects our results, we repeated the training on the A dataset (Section 6.5) with 5 perturbed frames over 22000 iterations. Compared to 5000 iterations, the testing loss on the E dataset dropped from 1.03 to 0.58 (-44%), the rotation error dropped from 0.57 to 0.42 degrees (-26%), and the error of the translation direction improved from 34.46 to 24.91 degrees (-28%).

Frames              Weighting  Tr(E)xTe(E)   Tr(A)xTe(A)   Tr(A)xTe(E)   Tr(O)xTe(E)   Tr(A)xTe(R)
                               Rot   Tran    Rot   Tran    Rot   Tran    Rot   Tran    Rot   Tran
2 frames            ∅ grad.    0.63  34.82   0.64  35.25   0.67  37.03   0.70  38.14   0.79  46.89
                    Equal      0.60  33.45   0.68  38.06   0.70  39.14   0.72  38.61   0.86  47.23
5 frames perturbed  Equal      0.48  29.59   0.51  27.80   0.53  32.04   0.61  34.96   0.95  43.91
                    ∅ grad.    0.49  29.12   0.54  29.99   0.57  34.46   0.61  35.08   0.88  46.06

Table 6.5: Dependence of the training results on the weighting of images during training. We trained our network once by weighting all images equally and once by weighting images based on their average image gradient (rows) and compare the results for different frame configurations (row blocks) and training / testing data (columns).
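The relative changes quoted for the long training run follow directly from the absolute values at 5000 and 22000 iterations. The following minimal sketch recomputes them; the helper name `relative_change` is only illustrative and not part of our framework.

```python
def relative_change(before: float, after: float) -> float:
    """Return the relative change from `before` to `after` in percent."""
    return (after - before) / before * 100.0


# Values quoted in the text for 5000 vs. 22000 training iterations.
results = {
    "testing loss": (1.03, 0.58),
    "rotation error [deg]": (0.57, 0.42),
    "translation direction error [deg]": (34.46, 24.91),
}

for name, (before, after) in results.items():
    change = relative_change(before, after)
    print(f"{name}: {before} -> {after} ({change:+.0f}%)")
```

Rounded to whole percent, the three values reproduce the -44%, -26% and -28% given above.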
6.6 Summary
In this chapter, we presented a framework for training a CNN for Visual Odometry on realistic images that are rendered online and at high speed by the Unreal Game Engine. We described how we integrated the UE into a machine learning framework. Our results show that our synthetically generated data generalizes to other rendered scenery and to real-world captures. The online generation of training data enables us to run several special evaluations, such as training with different numbers of frames per training sample or with different label distributions. Such evaluations are not easily achievable with pre-generated or captured training data. Based on our framework, we showed how the number of frames given for tracking, the resolution, and the distribution of camera rotations and translations in the training samples influence the quality of the results. We also gave detailed insights into the behavior of our CNN on difficult, confusable camera motions and showed how different tracking errors behave over time during training.
In the future, this evaluation could be extended to more recent and more complex NN structures. Another interesting component that could easily be integrated is learning the camera intrinsics.
Chapter 7
Visualization
Examining our reconstruction results based on raw numbers or rendered images is often cumbersome because it is difficult to get an impression of the 3D structure of the observed content and of effects distributed across a scene, such as the relationships between scene surface points and their corresponding features in different camera images. We therefore developed an interactive, OpenGL based 3D visualization framework for the reconstructed scenes which works quite similarly for sparse, feature based and dense, depth map based methods.
Our visualization framework (see Figures 7.1 and 7.2) runs in a separate thread and opens its own OpenGL window. This enables us to update the presented content virtually anywhere in our reconstruction process: we can either show only the final reconstruction result at the end or update the presentation on every tiny reconstruction step, such as each newly triangulated point, each additional camera, or even each separate LM iteration, which gives us a very detailed insight into the reconstruction process. Furthermore, because the visualization runs in its own thread, it stays fully responsive while the reconstruction is performed in the background, and it even interoperates nicely with debuggers such as GDB in non-stop mode: it is possible to stop the reconstruction thread exclusively, which enables us to fly through our 3D reconstruction visually while debugging the code that processes it in the same program, without having to restart it.
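The hand-off between a reconstruction thread and a viewer thread can be sketched as follows. This is a minimal illustration under stated assumptions, not our actual implementation: the class `SceneBuffer`, its methods and the function `viewer_loop` are hypothetical names, and the OpenGL drawing is replaced by simply recording the snapshots that the viewer would render.

```python
import threading
import time


class SceneBuffer:
    """Hypothetical hand-off point between reconstruction and viewer threads."""

    def __init__(self):
        self._lock = threading.Lock()
        self._points = []
        self._version = 0

    def publish(self, points):
        # Called by the reconstruction thread after any step it wants to show.
        with self._lock:
            self._points = list(points)
            self._version += 1

    def snapshot(self):
        # Called by the viewer thread once per rendered frame.
        with self._lock:
            return self._version, list(self._points)


def viewer_loop(scene, stop, frames):
    # Stand-in for the render loop; a real viewer would draw with OpenGL here.
    while not stop.is_set():
        frames.append(scene.snapshot())
        time.sleep(0.001)


scene = SceneBuffer()
stop = threading.Event()
frames = []
viewer = threading.Thread(target=viewer_loop, args=(scene, stop, frames))
viewer.start()

# Stand-in for the reconstruction: publish after each newly triangulated point.
points = []
for i in range(5):
    points.append((float(i), 0.0, 1.0))
    scene.publish(points)
    time.sleep(0.005)

stop.set()
viewer.join()
final_version, final_points = scene.snapshot()
print(final_version, len(final_points), len(frames))
```

Because the lock is held only while a snapshot is copied, the viewer loop in this sketch keeps running regardless of how long an individual reconstruction step takes.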
Our visualizer shows the reconstructed scene in one big window which can be explored in a free flight, WASD-style mode. It is possible to select cameras or scene points to obtain a numerical representation of the selected entities. Moreover, the user can seamlessly move the observer camera to a reconstructed camera pose to see the captured image or features as well as the related parts of the scene reconstruction as captured from that camera pose. The user may also zoom in on specific image regions to see the subpixel-precise alignment of image feature points and the corresponding 3D scene point reprojections, or the subpixel alignment of warped keyframe and tracked frame image content.
There are some differences in the representation of the scene structure and the image content for sparse, feature based and for dense, depth map based methods:
Figure 7.1: Visualization of a sparse surface, feature based reconstruction. 3D scene points are represented in light gray or, if seen by the currently selected camera, in red. For the selected camera, the image feature points are represented in blue. Image planes are shown in dark gray. The top image shows the reconstruction in free flight mode while the bottom row contains two images from a camera’s perspective, closely zoomed in on the right so that reprojection errors become visible as offset between a red scene point and its corresponding blue image feature.
Figure 7.2: Visualization of a dense, depth map based reconstruction. The depth maps of the keyframes are represented in red while their variances are shown in blue. The images from the tracked frames which have no depth information are rendered in green color. The top split image shows the reconstruction in free flight mode with (right) and without (left) visualized variances of the depth values. The bottom row contains two images from a camera’s perspective. The left view shows that the new, green frame is nicely aligned with the warped keyframe data (red) after tracking. The closeup view on the right reveals where the warped pixels are placed exactly on the tracked frame. In addition, one can see that the variances are in the order of magnitude of the pixel size when projected to the image plane while being much larger in the pixel’s direction in space (top right).
7.1 Feature Based Visualization
For the visualization of sparse scenes (see Figure 7.1), the dimensions of the image plane are shown in front of every camera pose, and the directions towards all features detected for a camera pose are visualized only when the camera is selected (to avoid cluttering the screen). In addition, all scene points are shown and marked red if they are attached to the currently selected camera pose. This enables us to quickly see the points on which a camera’s pose is based. If a camera pose and an attached point are selected, our tool visualizes the reprojection error from point to feature as a line which can be analyzed in 3D or in image space.
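The reprojection error that is drawn as a line can be illustrated with a short sketch. It assumes a plain pinhole camera without lens distortion and a pose given as a world-to-camera rotation `R` and translation `t`; the function names, the sign convention (projection minus feature) and all numeric values are hypothetical and chosen only for this example.

```python
import math


def project(point_world, R, t, fx, fy, cx, cy):
    """Project a 3D world point to pixel coordinates (pinhole model, no distortion)."""
    # Transform into the camera frame: p_cam = R * p_world + t.
    p_cam = [sum(R[r][c] * point_world[c] for c in range(3)) + t[r] for r in range(3)]
    if p_cam[2] <= 0.0:
        raise ValueError("point is behind the camera")
    u = fx * p_cam[0] / p_cam[2] + cx
    v = fy * p_cam[1] / p_cam[2] + cy
    return u, v


def reprojection_error(point_world, feature_px, R, t, fx, fy, cx, cy):
    """Return the 2D offset (projection minus feature) and its length in pixels."""
    u, v = project(point_world, R, t, fx, fy, cx, cy)
    du, dv = u - feature_px[0], v - feature_px[1]
    return (du, dv), math.hypot(du, dv)


# Hypothetical example: identity pose, 500 px focal length, principal point (320, 240).
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
offset, error_px = reprojection_error(
    point_world=(0.2, -0.1, 2.0),
    feature_px=(371.0, 214.0),
    R=R, t=t, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
)
print(offset, error_px)
```

In this example the point projects to roughly (370, 215), so the line between projection and feature is about one pixel long in each image direction, which is the kind of offset that only becomes visible when zooming in closely.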