Learning a mapping from features to voxlets

We pose the estimation of unobserved geometry, given partial observed information, as a supervised learning problem. More specifically, our goal is to learn a function f : 𝒳 → 𝒴 that maps a feature vector x ∈ 𝒳, computed from observed geometry, to an output Y ∈ 𝒴 representing the corresponding 3D geometry in the region R around s. Unlike standard classification, where the goal is to predict a category label for each x, our output is a three-dimensional array Y ∈ ℝ^{w×d×h} that encodes the TSDF values in the local region. The dimensionality of Y is prohibitively large, making it difficult to use standard multivariate regression approaches, e.g. [30]. Inspired by the recent work of Dollár and Zitnick [40], we use a structured Random Forest to learn the function f.

5.3.1 Training

Our training set, {(x1, Y1), ..., (xn, Yn)}, comprises region and feature pairs sampled from the full 3D reconstructions of scenes. To train the structured forest we pass a random subset of 50% of the training set to each tree, starting at the root node. This use of a random subset for each tree is known as ‘bagging’. It helps to reduce overfitting [16] and helps ensure that the predictions from each tree exhibit diversity. Each node is then tasked with splitting the data using the x variables such that the data sent to each child node are as similar as possible in shape, i.e. with similar Y values. One way of achieving this would be to find a split in the data which minimises the sum of squared differences between the labels and their mean at the left node (𝒴_L) and right node (𝒴_R), i.e. which minimises:

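$$E(S) = \sum_{d \in \{L, R\}} \sum_{Y \in \mathcal{Y}_d} \lVert Y - \bar{Y}_d \rVert_2^2, \tag{5.2}$$

where

$$\bar{Y}_d = \frac{1}{|\mathcal{Y}_d|} \sum_{Y \in \mathcal{Y}_d} Y. \tag{5.3}$$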
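As a minimal sketch of what evaluating this energy involves for a single candidate split, the NumPy function below computes the summed squared distance of each child's labels to that child's mean. The function name `split_energy` and the array layout (one label per row along the first axis) are assumptions made for illustration, not part of the original implementation.

```python
import numpy as np


def split_energy(Y_left, Y_right):
    """Sum over both children of squared distances to the child mean.

    Each argument is an array of labels with shape (n, ...); the trailing
    axes (e.g. w x d x h TSDF values) are flattened before comparison.
    """
    energy = 0.0
    for Y_d in (Y_left, Y_right):
        if len(Y_d) == 0:
            continue  # an empty child contributes nothing
        flat = Y_d.reshape(len(Y_d), -1)
        mean = flat.mean(axis=0)  # the child mean
        energy += float(((flat - mean) ** 2).sum())
    return energy
```

Every call touches all label dimensions of every example at the node, which is why repeating it for each candidate split, node and tree becomes costly when each label is a full w×d×h array.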

In effect this energy function rewards having a small spread of labels at the child nodes. However, for our high-dimensional label space, computing this energy for each candidate split at each node in each tree would become prohibitively expensive. Instead of minimising this loss directly, we follow [40] in approximating it at each node using a classification loss. To do so, at each node each Y ∈ 𝒴 is assigned a proxy label in {0, 1}. A split is then found in the data which minimises the classification loss, as if we were performing binary classification at the node. To convert the structured problem into a classification one, we create our two proxy classes by clustering: before splitting the data at a node, we sample a random subset of the dimensions of the Yi (a different subset at each node), reduce their dimensionality to M dimensions, and then cluster. A standard classification loss can then be used on this new discretization to evaluate the quality of different candidate splits on the xi; in our case we use the Gini impurity measure. In practice, we efficiently perform this dimensionality reduction and clustering at each node using randomized PCA [65]. A training example is then assigned to one of the two possible clusters based on the sign of the value of its first principal component. See Figure 5.4 for a pictorial overview of this process.
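The proxy-label step can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used here: an exact SVD stands in for randomized PCA [65], the number of sampled label dimensions (`n_dims=256`) is a hypothetical default, and the names `proxy_labels`, `gini` and `split_gini` are invented for the example.

```python
import numpy as np


def proxy_labels(Y, n_dims=256, rng=None):
    """Binary proxy labels from the sign of the first principal component."""
    rng = np.random.default_rng() if rng is None else rng
    flat = Y.reshape(len(Y), -1)
    # Sample a random subset of label dimensions (n_dims is a hypothetical value).
    k = min(n_dims, flat.shape[1])
    dims = rng.choice(flat.shape[1], size=k, replace=False)
    sub = flat[:, dims]
    centred = sub - sub.mean(axis=0)
    # Exact SVD used here in place of randomized PCA (assumption for brevity).
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    scores = centred @ vt[0]  # projection onto the first principal direction
    return (scores > 0).astype(int)


def gini(labels):
    """Gini impurity of a set of binary labels: 2 p (1 - p)."""
    if len(labels) == 0:
        return 0.0
    p = float(np.mean(labels))
    return 2.0 * p * (1.0 - p)


def split_gini(feature_column, threshold, labels):
    """Size-weighted Gini impurity of the children of one candidate split."""
    go_left = feature_column < threshold
    left, right = labels[go_left], labels[~go_left]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
```

The proxy labels are computed once per node, after which each candidate split only needs to be scored against a vector of binary labels, so the cost of evaluating a split no longer depends on the dimensionality of Y. The sign of a principal direction is arbitrary, so which cluster receives label 0 or 1 carries no meaning.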

This process is repeated until we reach our maximum depth (which we set to 14), or we have fewer than five training examples at a node. In either of these cases, the node automatically becomes a leaf node, and splitting stops. Finally, as in [40], each leaf node stores the medoid of all the examples that have arrived there, which we refer to as a voxlet. We store the medoid for efficiency reasons but it is also possible to store multiple modes, e.g. [59].
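The stopping rule and the medoid stored at each leaf can be sketched as below. The depth limit (14) and minimum node size (five) come from the text; the function names, the convention that depth is compared with `>=`, and the use of Euclidean distance to define the medoid are assumptions made for illustration.

```python
import numpy as np

MAX_DEPTH = 14    # maximum tree depth, as stated in the text
MIN_EXAMPLES = 5  # nodes with fewer examples than this become leaves


def is_leaf(depth, n_examples):
    """True if a node should stop splitting and become a leaf."""
    return depth >= MAX_DEPTH or n_examples < MIN_EXAMPLES


def medoid(Y):
    """Return the example whose summed distance to all others is smallest.

    Euclidean distance between flattened labels is an assumption here.
    """
    flat = Y.reshape(len(Y), -1)
    sq = (flat ** 2).sum(axis=1)
    # Pairwise squared distances via the Gram matrix; clip rounding noise.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (flat @ flat.T), 0.0)
    return Y[np.argmin(np.sqrt(d2).sum(axis=1))]
```

Unlike the mean, the medoid is always one of the training examples that reached the leaf, so the stored voxlet is an actual observed shape rather than an average of several.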

5.3.2 Features

To describe the neighbourhood of s we use the surface feature x_surface as described in Section 4.5.3. The surface feature is suitable for our purpose as it describes the shape of the surface in the vicinity of the query point. It is also very quick to compute, and invariant to camera translation along the z-axis. By training on scenes captured from multiple angles we cover a wide range of possible camera x-y translations and camera rotations.

[Figure panels (a)-(e); label-space axes y₁ and y₂, principal directions e₁ and e₂ centred on Ȳ.]

(a) Here we show the label space for a structured labelling problem. Each training data point at a node, shown here as a circle, is also associated with a point in feature space, which is not shown here. The aim is to find a split in feature space which corresponds to a ‘good’ division in label space.

(b) Each proposed split in feature space divides the training examples in two; here we depict the label space result of a split, using red and green to colour the points. Traditionally, the split which minimises Equation 5.3 (i.e. which forms the tightest clusters in label space) may be selected from the set of candidates. This evaluation is expensive when dimensionality is high.

(c) [...] principal directions (e₁, e₂) in label space. These form a coordinate system centred on the mean of all the points, Ȳ.

(d) By looking at the sign of the first principal component of each data point, we can assign it a temporary, proxy label. We depict these proxy labels here as orange and purple outer rings.

(e) Finally we evaluate our splits, which again have been proposed in feature space (not shown here). We choose the split which assigns the points left and right (red and green) in a way that most closely agrees with the proxy labels. The split shown here has a fairly good agreement of split labels with proxy labels.

Pre-segmentation for clutter. Real-world scenes contain clutter and interacting objects. This poses a challenge when we are extracting our training regions. While it is possible to represent the shape variation of isolated objects, modelling variations in arrangements of objects is much harder. This is intuitive, as the space of geometry induced by object combinations is much larger than that of individual objects. To overcome this problem, we perform an unsupervised segmentation of the training scenes to separate individual objects. We use the same method for this segmentation of the training data as we do in Section 4.6.1, and an example result is shown in Figure 5.6(b). This segmentation encourages each training region to only model the shape of isolated objects. If objects are not well segmented at this stage, it is not a problem, but that node will be likely to make conjoined predictions at test time.
