SUMMARY & FUTURE WORK - End-to-end learning of local point cloud feature descriptors

5.1 Summary

This thesis uses an experimental approach to study the use of a pointwise convolutional neural network (PwCNN) for the purpose of creating local 3D feature descriptors in point clouds. A network architecture that produces per-point feature descriptors and keypoint scores is designed, and a suitable loss function for the problem is developed for training the network. Two primary evaluation metrics are introduced and described: precision-recall curves summarize the descriptor’s matching ability, and performance in a typical object registration pipeline provides a holistic view of the overall descriptor quality.
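As a minimal sketch of how a precision-recall curve for descriptor matching can be computed, the following assumes each candidate match has a descriptor distance and a ground-truth correctness flag. The function name, the threshold sweep, and the definition of recall relative to the set of correct candidate matches are illustrative assumptions and may differ from the exact definitions used in the evaluation chapter.

```python
from typing import List, Sequence, Tuple


def precision_recall_curve(
    distances: Sequence[float],
    is_correct: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, precision, recall) for each distance threshold.

    A candidate match is accepted when its descriptor distance is at most
    the threshold. Hypothetical helper, not the thesis implementation.
    """
    total_correct = sum(is_correct)
    curve = []
    for t in thresholds:
        # Correctness flags of the matches accepted at this threshold.
        accepted = [ok for d, ok in zip(distances, is_correct) if d <= t]
        true_pos = sum(accepted)
        # Convention: precision is 1.0 when nothing is accepted.
        precision = true_pos / len(accepted) if accepted else 1.0
        recall = true_pos / total_correct if total_correct else 0.0
        curve.append((t, precision, recall))
    return curve


if __name__ == "__main__":
    dists = [0.1, 0.2, 0.3, 0.4]
    flags = [True, True, False, True]
    for t, p, r in precision_recall_curve(dists, flags, [0.15, 0.35, 0.5]):
        print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Sweeping the threshold from strict to permissive traces the curve: precision typically falls as recall rises.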

A number of experiments were run on four different objects representing various surface characteristics. Each object was artificially augmented with noise representative of a 3D camera, and the noisy data was used for both training and testing. Further, three of the objects were 3D printed and scanned to evaluate how the synthetic noise compares to true sensor noise.
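The kind of augmentation described above can be sketched as follows. This section does not specify the noise model, so the sketch assumes zero-mean Gaussian noise applied along the viewing ray from a hypothetical camera origin. The function name and parameters are illustrative, not the thesis implementation.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def add_depth_noise(
    points: Sequence[Vec3],
    sigma: float,
    camera_origin: Vec3 = (0.0, 0.0, 0.0),
    rng: Optional[random.Random] = None,
) -> List[Vec3]:
    """Displace each point along its viewing ray by Gaussian noise.

    Assumption: sensor noise acts mainly along the ray from the camera
    to the point, with standard deviation `sigma` in the units of `points`.
    """
    rng = rng or random.Random()
    noisy = []
    for p in points:
        ray = tuple(pi - ci for pi, ci in zip(p, camera_origin))
        length = math.sqrt(sum(r * r for r in ray))
        if length == 0.0:
            # A point at the camera origin has no ray direction; keep it.
            noisy.append(tuple(p))
            continue
        offset = rng.gauss(0.0, sigma)
        noisy.append(tuple(pi + offset * r / length for pi, r in zip(p, ray)))
    return noisy


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 2.0), (0.5, 0.1, 1.5)]
    print(add_depth_noise(cloud, sigma=0.005, rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the augmentation reproducible between training runs.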

The results indicate that the PwCNN in the proposed architecture is able to produce feature descriptors which show moderate resistance to noise. Further, the keypoint scores are accurate enough to reduce the number of RANSAC iterations by a factor of three, while using fewer than 25% of the original points.
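To give a sense of why a higher share of good correspondences lowers the RANSAC workload, the standard iteration bound N = log(1 - p) / log(1 - w^s) can be evaluated directly, where w is the inlier ratio, s the sample size, and p the desired confidence. The sketch below is illustrative only: the function name, the sample size of three correspondences, the 99% confidence, and the inlier ratios are assumptions, not values measured in the experiments.

```python
import math


def ransac_iterations(
    inlier_ratio: float, sample_size: int = 3, confidence: float = 0.99
) -> int:
    """Iterations needed to draw at least one all-inlier sample.

    Uses the standard bound N = log(1 - p) / log(1 - w^s), rounded up.
    """
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_good_sample = inlier_ratio ** sample_size
    if p_good_sample >= 1.0:
        # Every sample is all-inlier, so one iteration suffices.
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good_sample))


if __name__ == "__main__":
    # Hypothetical inlier ratios, chosen only to show the trend.
    for w in (0.1, 0.2, 0.3, 0.5):
        print(f"inlier ratio {w:.1f}: {ransac_iterations(w)} iterations")
```

Because the bound depends on the inlier ratio raised to the sample size, even a modest improvement in the fraction of correct correspondences among the selected keypoints reduces the required iterations considerably.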

An analysis of the internal learned filters of the network provides insights into what geometrical features are encoded by the network, and consequently, the descriptors themselves. This analysis reveals difficulties with representing rotation-invariant features and highlights the network’s sensitivity to the voxel size hyperparameter.

5.2 Future Work

There are a number of areas where further work can be done to better understand the model and its limitations, and improve its results. This section will highlight some of the most apparent next steps.

The effects of noise in the training data can be further explored. The experiments revealed that introducing noise narrows the performance gap between low and high noise test data, but this was sometimes accompanied by an overall decrease in performance, and sometimes by an overall increase. The precise relationship between noise and performance is yet to be fully explained, and is apparently dependent upon the object’s shape. Tests with more objects could answer this question more clearly.

To relate these descriptors more closely to other recent work in the field, the described model could be run on a standard benchmark dataset, such as 3DMatch [34]. Although this benchmark was developed in the context of resistance to occlusion, rather than noise, it would still be a valuable comparison.

On the aspect of descriptor quality, it is observed that rotation invariance presents a difficulty for the network. This could be alleviated either by changing the input point cloud representation, or by modifying the PwCNN operation. The point pair features used in [6] are an example of a rotation-invariant point representation, and similar ideas could be applied to the input representation here. Alternatively, the PwCNN operation itself could be modified to be more rotation-invariant, for example, by doing calculations relative to the center point’s normal rather than the global reference frame. The descriptor quality can also be improved by better tuning the hyperparameters, specifically the voxel size.
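To illustrate why point pair features are rotation-invariant, the sketch below computes the commonly used four-dimensional feature for two oriented points: the distance between them and the three angles formed by the two normals and the connecting vector. The helper names are hypothetical, and how such features would be fed into the PwCNN is left open here.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a: Vec3) -> float:
    return math.sqrt(_dot(a, a))


def _angle(a: Vec3, b: Vec3) -> float:
    """Angle in [0, pi] between two vectors; 0.0 if either is zero."""
    denom = _norm(a) * _norm(b)
    if denom == 0.0:
        return 0.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, _dot(a, b) / denom)))


def point_pair_feature(
    p1: Vec3, n1: Vec3, p2: Vec3, n2: Vec3
) -> Tuple[float, float, float, float]:
    """Return (|d|, angle(n1, d), angle(n2, d), angle(n1, n2)), d = p2 - p1."""
    d = _sub(p2, p1)
    return (_norm(d), _angle(n1, d), _angle(n2, d), _angle(n1, n2))


def rotate_z(v: Vec3, theta: float) -> Vec3:
    """Rotate a vector about the z-axis (used only to demonstrate invariance)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1], v[2])


if __name__ == "__main__":
    p1, n1 = (0.1, 0.2, 0.3), (0.0, 0.0, 1.0)
    p2, n2 = (1.0, -0.5, 0.8), (0.0, 1.0, 0.0)
    print(point_pair_feature(p1, n1, p2, n2))
    rotated = [rotate_z(v, 0.7) for v in (p1, n1, p2, n2)]
    print(point_pair_feature(*rotated))  # agrees up to rounding
```

Since the feature is built only from lengths and angles, applying the same rigid transformation to both points and both normals leaves it unchanged.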

Lastly, the PwCNN model is very slow during training, primarily because its irregular memory access patterns make poor use of GPU memory during backpropagation. Further engineering and algorithmic improvements can be made to the implementation to speed it up.

REFERENCES

[1] M. Abadi et al., TensorFlow: Large-scale machine learning on heterogeneous systems, Software available from tensorflow.org, 2015.

[2] K. S. Arun, T. S. Huang, and S. D. Blostein, “Least-Squares Fitting of Two 3-D Point Sets,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PAMI-9, no. 5, pp. 698–700, Sep. 1987.

[3] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature Verification Using a “Siamese” Time Delay Neural Network,” in Proceedings of the 6th International Conference on Neural Information Processing Systems, ser. NIPS’93, San Francisco, CA, USA, 1993, pp. 737–744.

[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst, “Geometric deep learning: Going beyond euclidean data,” IEEE Signal Processing Magazine, vol. 34, no. 4, pp. 18–42, Jul. 2017.

[5] H. Deng, T. Birdal, and S. Ilic, “PPFNet: Global Context Aware Local Features for Robust 3d Point Matching,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, Jun. 2018, pp. 195–205.

[6] H. Deng, T. Birdal, and S. Ilic, “PPF-FoldNet: Unsupervised learning of rotation invariant 3d local descriptors,” in Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, Sep. 2018, pp. 602–618.

[7] D. Eggert, A. Lorusso, and R. Fisher, “Estimating 3-d rigid body transformations: A comparison of four major algorithms,” Machine Vision and Applications, vol. 9, no. 5, pp. 272–290, Mar. 1997.

[8] Ford, Ford engine block, https://www.thingiverse.com/thing:40257, Accessed: 2019-03-29.

[9] Fotonic, Fotonic P60UA series, Mar. 2017.

[10] Y. Guo, M. Bennamoun, F. Sohel, M. Lu, J. Wan, and N. M. Kwok, “A Comprehensive Performance Evaluation of 3d Local Feature Descriptors,” International Journal of Computer Vision, vol. 116, no. 1, pp. 66–89, Jan. 2016.

[11] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality Reduction by Learning an Invariant Mapping,” in 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2 (CVPR’06), vol. 2, New York, NY, USA, 2006, pp. 1735–1742.

[12] B. Hua, M. Tran, and S. Yeung, “Pointwise convolutional neural networks,” in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, Jun. 2018, pp. 984–993.

[13] A. E. Johnson and M. Hebert, “Using spin images for efficient object recognition in cluttered 3d scenes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 5, pp. 433–449, May 1999.

[14] M. Khoury, Q. Zhou, and V. Koltun, “Learning Compact Geometric Features,” in 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, Oct. 2017, pp. 153–161.

[15] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations, ICLR, Conference Track Proceedings, San Diego, CA, USA, May 2015.

[16] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural networks,” in Advances in neural information processing systems, vol. 30, Long Beach, California, USA, Dec. 2017, pp. 971–980.

[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM, vol. 60, no. 6, pp. 84–90, May 2017.

[18] B. Kulis, “Metric Learning: A Survey,” Foundations and Trends in Machine Learning, vol. 5, no. 4, pp. 287–364, 2013.

[19] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, p. 436, May 2015.

[20] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox, “DeepIM: Deep iterative matching for 6D pose estimation,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds., Munich, Germany, Sep. 2018, pp. 695–711.

[21] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.

[22] M. Oberweger, M. Rad, and V. Lepetit, “Making deep heatmaps robust to partial occlusions for 3d object pose estimation,” in Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part XV, Sep. 2018, pp. 125–141.

[23] PortedtoReality, Toy train, https://www.thingiverse.com/thing:1545967, Accessed: 2019-05-08.

[24] C. R. Qi, H. Su, M. Kaichun, and L. J. Guibas, “Pointnet: Deep learning on point sets for 3d classification and segmentation,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, USA, Jul. 2017, pp. 77–85.

[25] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “Pointnet++: Deep hierarchical feature learning on point sets in a metric space,” in Advances in Neural Information Processing Systems 30, Long Beach, California, USA, Dec. 2017, pp. 5099–5108.

[26] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International Conference on Computer Vision, Nov. 2011, pp. 2564–2571.

[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, Dec. 2015.

[28] R. B. Rusu, N. Blodow, and M. Beetz, “Fast Point Feature Histograms (FPFH) for 3D registration,” in 2009 IEEE International Conference on Robotics and Automation, Kobe, Japan, May 2009, pp. 3212–3217.

[29] H. Samet, “An Overview of Quadtrees, Octrees, and Related Hierarchical Data Structures,” in Theoretical Foundations of Computer Graphics and CAD, R. A. Earnshaw, Ed., 1988, pp. 51–68.

[30] H. Sarbolandi, D. Lefloch, and A. Kolb, “Kinect range sensing: Structured-light versus Time-of-Flight Kinect,” Computer Vision and Image Understanding, vol. 139, pp. 1–20, Oct. 2015.

[31] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, Massachusetts, USA, Jun. 2015, pp. 815–823.

[32] The Stanford 3D Scanning Repository, http://graphics.stanford.edu/data/3Dscanrep/, Accessed: 2019-02-12.

[33] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, “PoseCNN: A convolutional neural network for 6d object pose estimation in cluttered scenes,” in Robotics: Science and Systems (RSS) XIV, May 2018.

[34] A. Zeng, S. Song, M. Nießner, M. Fisher, J. Xiao, and T. Funkhouser, “3DMatch: Learning Local Geometric Descriptors from RGB-D Reconstructions,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii, USA, Jul. 2017, pp. 199–208.
