• No results found

5.3 Large-scale Grasp Prediction as Classification

5.3.2 Learning Methods

A key observation about the proposed feature representation is that it can be interpreted as a 2D image representation for which different channels have different semantic meanings. Deep neural networks are the state-of-the-art learning approach on large datasets with image like data. A key aspect for these methods is that they do not require data pre-processing but rather extract the feature transformations in a data-driven manner. Inspired by the successful applications of proxy objectives such as unsupervised (Erhan et al. 2010) and supervised pre-training (Collobert et al. 2008), we analyze the possibility to exploit the synthetically generated labeled datasets as initialization and regularizer for real-world data for grasping. Therefore we train a simple logistic classifier (Chapter 2.2.2.3) as a baseline learning method. This approach provides further insights into the overall data complexity since it only optimizes a linear mapping from the feature representation to a scalar output. (Herzog et al. 2014) showed promising results in previous work using this classifier on smaller datasets, both in terms of objects and data samples. The logistic classifier in (Herzog et al. 2014) was manually optimized as part of a non-parametric classifier which does not scale to our dataset size. We compare this method to a deep neural network, LeNet like architecture (Lecun et al. 1998), a highly parametric non-linear function approximator. We chose this architecture since it exploits 2D image like structures while being very efficient to evaluate.

From

Grasp

From

Side

Figure 5.10:Influence of the viewpoint on the local shape representation. All images show these representations for the same grasp (cyan line = approach direction) but with a different viewpoint (pink line). The top row shows the heightmap from the grasp direction while the bottom row shows the same heightmap from the side. While the surface points stay largely the same, the occlusion area changes drastically.

5.3.3 Experiments

Our experiments, presented hereafter, demonstrate that a simple convolutional neural network (CNN), representing the state-of-the-art deep learning architecture for computer vision, can be used to learn a mapping from our feature representation to grasp success, better leveraging the dataset size. The CNN model applied in our experiments can be seen as a logistic classifier which learns an additional non-linear feature mapping from data. The output of each method can be interpreted as a certainty that the given grasp is stable (close to 1) or unstable (close to 0). A threshold on this certainty can be used to achieve different rates of true and false positives (Figure 5.13).

The dataset for this experiment contains eight different rolls of the hand around every approach direction. This structure could be used to split the problem into eight different learning tasks. However, these eight different rolls can be directly represented in the template by simple rotation. Thus, a single learning method of appropriate complexity should be able to distinguish these rolls. Our feature representation cannot disambiguate different grasp standoffs. Thus we learn a different classifier for these cases.

The classification performance is reported on each dataset, using a binary threshold on the ratio- metric of choice at 0.9. For theε-ratio, we threshold the rawεat 0.002 to obtain a comparable data distribution for both ratio metrics. Each of our datasets contains randomly sampled objects within the data group (bottle, small, medium, large, all) if more than one object per category is available. Otherwise, it is only added to the training dataset. We exclude all objects that result in point clouds with less than 30 points, yielding more than 600 remaining object instances. Data from one object is exclusive to either the training, validation or testing dataset. Two thirds of the data are used for training and validation data and one third for reporting the results in Figure 5.13, 5.14, and 5.15. The validation dataset is used to select the best-learned predictor and tune hyperparameters such that the network convergences. We always report results on the test dataset, composed out of unique object not seen during training and validation. Although the majority of possible grasps, both in simulation and reality are unstable, resulting in a very biased dataset, we conducted experiments with balanced datasets to allow comparison with prior work, e.g., (Lenz et al. 2014). However, learning on balanced datasets improves performance but cannot be applied to the real unbalanced data distribution. We believe that in robotic grasping, it is important to achieve a high true positive rate at a low false positive rate. This tradeoff lowers the risk of applying unsuccessful grasps. Therefore, we report the ROC-curve up to 20% false positive rate. We want to stress that this analysis is not about the absolute performance of the deep learning method. We rather want to demonstrate that the data is too complex for linear methods (Figure 5.15b) and that the data sample size is within a domain which is reasonable for frameworks such as deep learning. One insight from the learning performance is that the physics-

0 50 100 150 200 small medium large true positive true negative true positive true negative true positive true negative (a) 0 50 100 150 200 small medium large true positive true negative true positive true negative true positive true negative (b)

Figure 5.12:The ground truth labels for the three object groups (small,medium,large) assigned to grasps selected by (a) physics-metric and (b)ε-metric, assuming a rejection ratio of 0.3. For

every result we report the correlation with the human labels for which at least>2 (bottom),

>3 (middle), and>4 (top) labeled the grasp accordingly. In all cases the proposed physics- metric outperforms theε-metric. The x axis denotes the absolute number of grasps with labels. In

green we show consistent labels from both humans and metrics, red means human have selected the opposite outcome and blue reflects undecided.

0.00 0.05 0.10 0.15 0.20 false positive rate

0.0 0.2 0.4 0.6 0.8 1.0

true positive rate

0.0-e-cnn (r: 0.18) 0.0-p-cnn (r: 0.13) 0.0-e-lreg (r: 0.18) 0.0-p-lreg (r: 0.13)

Figure 5.13:ROC curve for the bottle dataset (standoffs 0; results for standoff 25mm are reported online) Labels based onε-ratio (e) and physics-ratio (p) for CNN and logistic classification (lreg).

CNN based prediction always outperforms lreg and physics-ratio outperforms ε-ratio based

datasets. The data ratio (r) for each dataset suggests that despite the stronger negative bias of the physics-ratio based dataset it is still more consistent and thus easier to optimize for.

metric-based results are almost always better than the one based on theε-metric (Figure 5.15b),

even though the physics-metric-based datasets are more biased and thus more challenging for classification. Since the presented data is strongly biased towards predicting a negative outcome, we believe that this interesting dataset can also result in new algorithms within the deep learning framework.

0.00 0.05 0.10 0.15 0.20 false positive rate

0.0 0.2 0.4 0.6 0.8 1.0

true positive rate

e-cnn (r: 0.20) p-cnn (r: 0.15) e-lreg (r: 0.20) p-lreg (r: 0.15) (a) 0.00 0.05 0.10 0.15 0.20

false positive rate 0.0 0.2 0.4 0.6 0.8 1.0

true positive rate

e-cnn (r: 0.10) p-cnn (r: 0.08) e-lreg (r: 0.10) p-lreg (r: 0.08)

(b)

Figure 5.14:Results for small objects (a) and medium objects (b) illustrate that a linear logistic classifier cannot cope with the data complexity.

0.00 0.05 0.10 0.15 0.20

false positive rate 0.0 0.2 0.4 0.6 0.8 1.0

true positive rate

e-cnn (r: 0.07) p-cnn (r: 0.03) e-lreg (r: 0.07) p-lreg (r: 0.03) (a) 0.00 0.05 0.10 0.15 0.20

false positive rate 0.0 0.2 0.4 0.6 0.8 1.0

true positive rate

e-cnn (r: 0.12) p-cnn (r: 0.08) e-lreg (r: 0.12) p-lreg (r: 0.08)

(b)

Figure 5.15: Results for large objects (a) and all objects (b) illustrate that a linear logistic classifier cannot cope with the data complexity. In addition to that, the presented result supports the hypothesis that the physics-ratio generates more consistent data which is in term easier to learn.