3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation
Team Artoo:
Abhinav Gupta Shubodh Sai Pawan Verma
Prof. Avinash Sharma
Aniket Joshi
Implementation Results
We tested a few random scenes to generate the RGB semantic segmentation. Unfortunately, the ScanNet dataset does NOT provide ground truth for the test scenes.
The 3D voxels Semantic Voxel labelling
Implementation Results
Original RGB-D Scans of Scene0707 from the ScanNet Dataset
Implementation Results (continued)
Point Cloud Semantic voxel labelling
Semantic Voxel Labeling
Outline:
• Recap
• Improving upon 2D semantic segmentation
• The 3D Network
• Salient Features: Joint 2D-3D network and Backprojection Layer
• Implementation Results
• Future Work
Why we chose this paper
• The most spectacular aspect of this paper was its immense advantage over other state-of-the-art methods! The existing methods used either geometry OR the RGB data as the input.
• But here, both the inputs are used in a joint, end-to-end fashion. The heart of the paper is the core idea of combining both geometric and RGB features in a joint network architecture. The combination of RGB and geometric information nicely complements each other.
• Secondly, semantic scene segmentation is a very useful task in computer and robotic vision. A robot must know ‘where’ the objects are in its environment!
• And thirdly, because it is co-authored by Matthias Niessner of TU Munich, who is an absolute legend and recently had 8 papers accepted to CVPR 2020. It was truly amazing to implement something he has worked on.
Recap: 3DMV
3DMV (joint 3D-multi-view prediction)
• A novel pipeline which combines 2D feature maps with 3D voxel data to generate a 3D semantic scene segmentation.
• Existing methods project colour data directly onto the volumetric grid; this method instead backprojects learned 2D features and achieves better accuracy than existing volumetric architectures.
CORE IDEA:
• Extract 2D feature maps from 2D images using the full-resolution RGB input.
• Downsample the features through convolutions in the 2D domain, then backproject the resulting 2D feature map into 3D space.
• Key highlight is the formulation of a joint, end-to-end convolutional neural network which learns to infer 3D semantics from both 3D geometry and 2D RGB input.
Input: Reconstruction of the RGB-D scan as well as the RGB images used for the reconstruction
Output: The 3D semantic segmentation, as per-voxel labels
Workflow overview
[Figure: workflow diagram with four numbered stages]
Recap: 2D Network Training
• The 2D network aims at extracting features from the input RGB images of the 3D scene.
• The 2D network architecture is based on ENet primarily due to its training speed and memory efficiency, which are essential when performing a joint 2D-3D analysis.
• ENet (Efficient Neural Network) gives the ability to perform pixel-wise semantic segmentation in real-time.
ENet Architecture
• The ENet architecture comprises a series of bottleneck modules. Each bottleneck module consists of:
• 1x1 projection that reduces the dimensionality
• A main convolution layer ( 3x3 )
• 1x1 expansion
• Batch Normalization and PReLU between all convolutional layers
• If the bottleneck is downsampling, a max pooling layer is added to the main branch. Also, the first 1x1 projection is replaced with a 2x2 convolution with stride 2.
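To make the bottleneck description concrete, here is a minimal shape-bookkeeping sketch in plain Python. It only tracks (channels, height, width) through the layers listed above and does no actual convolution. The function names, the padding of 1 on the 3x3 convolution, and the internal channel reduction factor of 4 are assumptions made for illustration; they are not stated on the slides.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_shapes(channels_in, channels_out, height, width,
                      downsample=False, reduction=4):
    """Trace (channels, height, width) through one bottleneck module.

    reduction=4 is an assumed internal channel reduction factor.
    """
    inner = channels_out // reduction
    shapes = []

    # 1x1 projection, replaced by a 2x2 stride-2 convolution when downsampling
    kernel, stride = (2, 2) if downsample else (1, 1)
    h = conv_out(height, kernel, stride)
    w = conv_out(width, kernel, stride)
    shapes.append(("projection", (inner, h, w)))

    # 3x3 main convolution; padding 1 (assumed) keeps the spatial size
    h, w = conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)
    shapes.append(("main_conv", (inner, h, w)))

    # 1x1 expansion back to the output channel count
    shapes.append(("expansion", (channels_out, h, w)))

    # Skip branch (the slide's "main branch"): 2x2 max pooling when downsampling
    if downsample:
        skip_h, skip_w = conv_out(height, 2, 2), conv_out(width, 2, 2)
    else:
        skip_h, skip_w = height, width
    shapes.append(("skip_branch", (channels_in, skip_h, skip_w)))
    return shapes


if __name__ == "__main__":
    for name, shape in bottleneck_shapes(16, 64, 128, 128, downsample=True):
        print(name, shape)
```

In the downsampling case both branches end at half the input resolution, which is what lets them be merged. Their channel counts can still differ; as far as we recall, the ENet paper handles that by zero-padding the pooled branch.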
Recap: Implementing the 2D Network
PHASE 2
Improving the 2D semantic segmentation
The 3D Network
Features of the 3D Network:
• The 3D network is composed of a series of 3D convolutions operating on a regular volumetric grid.
• The volumetric grid is a sub-volume of the voxelized 3D representation of the scene. Each sub-volume is centred around a specific x-y location, at a size of 31 x 31 x 62 voxels with a voxel size of 4.8 cm.
• The network takes these sub-volumes as input and predicts the semantic labels for the centre column of each sub-volume at a resolution of 1 x 1 x 62 voxels; i.e., it simultaneously predicts labels for 62 voxels.
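The sub-volume numbers above can be checked with a short sketch. The chunk size and voxel size come from the slide; the sparse set-of-occupied-voxels representation and all function names are hypothetical, chosen only to keep the example self-contained.

```python
VOXEL_SIZE_CM = 4.8          # voxel edge length, from the slide
CHUNK_DIMS = (31, 31, 62)    # sub-volume size in voxels (x, y, z), from the slide


def chunk_extent_cm(dims=CHUNK_DIMS, voxel_size=VOXEL_SIZE_CM):
    """Physical size of one sub-volume in centimetres."""
    return tuple(round(d * voxel_size, 1) for d in dims)


def chunk_bounds(center_x, center_y, dims=CHUNK_DIMS):
    """Half-open voxel index ranges of the chunk centred on (center_x, center_y)."""
    half_x, half_y = dims[0] // 2, dims[1] // 2
    return ((center_x - half_x, center_x + half_x + 1),
            (center_y - half_y, center_y + half_y + 1),
            (0, dims[2]))


def extract_chunk(occupied, center_x, center_y, dims=CHUNK_DIMS):
    """Crop a sparse set of occupied voxels to one chunk, in chunk-local coordinates."""
    (x0, x1), (y0, y1), (z0, z1) = chunk_bounds(center_x, center_y, dims)
    return {(x - x0, y - y0, z - z0)
            for (x, y, z) in occupied
            if x0 <= x < x1 and y0 <= y < y1 and z0 <= z < z1}


def center_column(dims=CHUNK_DIMS):
    """Chunk-local coordinates of the 1 x 1 x 62 column that receives labels."""
    return [(dims[0] // 2, dims[1] // 2, z) for z in range(dims[2])]


if __name__ == "__main__":
    print(chunk_extent_cm())      # chunk size in cm along x, y, z
    print(chunk_bounds(40, 50))   # index ranges for a chunk centred at x=40, y=50
    print(len(center_column()))   # number of voxels labelled per chunk
```

So each chunk covers roughly 1.5 m x 1.5 m of floor area and about 3 m of height, while only the 62 voxels of its centre column receive labels.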
Salient Features - Backprojection Layer
• The authors assume known 6-DoF pose alignments for the input RGB images with respect to each other and the 3D reconstruction.
• The layer is essentially a loop over every voxel in the 3D sub-volume with which a given image is associated.
• For every voxel, the 3D-to-2D projection is computed from the corresponding camera pose, the camera intrinsics, and the world-to-grid transformation matrix.
• In order to handle multiple 2D input streams, voxel-to-pixel associations with respect to each input view are computed.
• Some voxels will be associated with multiple pixels from different views. In order to combine projected features from multiple input views, a voxel max-pooling operation is done that computes the maximum response on a per feature channel basis.
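The loop described above can be sketched in plain Python as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: it uses a pinhole camera with intrinsics (fx, fy, cx, cy), takes a grid-to-world matrix (the inverse of the world-to-grid matrix mentioned above), rounds to the nearest pixel, and skips any depth-based occlusion check. The dictionary keys and function names are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a 4x4 matrix (nested lists) by a homogeneous 4-vector."""
    return [sum(matrix[r][c] * vector[c] for c in range(4)) for r in range(4)]


def project_voxel(voxel, grid_to_world, world_to_camera, intrinsics, size):
    """Return the pixel (u, v) a voxel projects to, or None if it is not visible."""
    fx, fy, cx, cy = intrinsics
    width, height = size
    p_world = mat_vec(grid_to_world, [voxel[0], voxel[1], voxel[2], 1.0])
    x, y, z, _ = mat_vec(world_to_camera, p_world)
    if z <= 0:                      # behind the camera
        return None
    u = int(round(fx * x / z + cx))
    v = int(round(fy * y / z + cy))
    if 0 <= u < width and 0 <= v < height:
        return (u, v)
    return None                     # outside the image


def backproject(voxels, views):
    """Gather 2D features per voxel and max-pool them across views, per channel."""
    pooled = {}
    for voxel in voxels:
        for view in views:
            pixel = project_voxel(voxel, view["grid_to_world"],
                                  view["world_to_camera"],
                                  view["intrinsics"], view["size"])
            if pixel is None:
                continue
            u, v = pixel
            feature = view["features"][v][u]    # list of channel responses
            if voxel in pooled:
                pooled[voxel] = [max(a, b) for a, b in zip(pooled[voxel], feature)]
            else:
                pooled[voxel] = list(feature)
    return pooled


IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def make_view(center_feature):
    """Toy 3x3 feature map with two channels; only the centre pixel is non-zero."""
    features = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
    features[1][1] = center_feature
    return {"features": features, "grid_to_world": IDENTITY,
            "world_to_camera": IDENTITY, "intrinsics": (1.0, 1.0, 1.0, 1.0),
            "size": (3, 3)}


if __name__ == "__main__":
    views = [make_view([0.2, 0.9]), make_view([0.7, 0.1])]
    print(backproject([(0, 0, 2), (0, 0, -1)], views))
```

In the toy run, the voxel in front of both cameras takes the larger response from each channel across the two views, while the voxel behind the cameras gets no feature at all.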
Salient Features - Joint 2D-3D Network
Features of the Joint 2D-3D Network:
• The joint 2D-3D network combines 2D RGB features and 3D geometric features using the mapping from the backprojection layer.
• These two inputs are processed with a series of 3D convolutions, and then concatenated together; the joined feature is then further processed with a set of 3D convolutions.
• The joint 2D-3D network operates on a per-chunk basis; i.e., it takes fixed sub-volumes of a 3D scene as input and predicts labels for the voxels in the centre column of the given chunk.
Benchmark Testing
Future Work
• In research, one often needs to draw a line between ‘research’ and ‘development’. What we have done so far is implement a research paper. Coding something that has already been shown to work is development.
• But demonstrating that something CAN work is research! That 'something' is a novel idea, that we come up with after much thought and insight, and one can never know whether it will work or not.
• But that's what research is all about, to come up with new, interesting ideas and try to see if something can work! Failures will inevitably be present, but they will guide us towards success.
• For our project, we decided to go a little further than merely implementing 3DMV!
Transformation Estimation between Two Disparate Views
• We came up with a novel idea: use the 3D semantic scene segmentation given by the network to estimate rigid transformations between two disparate views, a common problem in robotic vision and SLAM.
• Our goal is to use the 3D semantics obtained from 3DMV to assist visual odometry in the SLAM pipeline by generating constraints, which are obtained by estimating the transformation between two given disparate views.
• Given any disparate view pair (an RGB-D pair), our only goal is to recover the precise transformation, even with low overlap.
Our Proposed Pipeline
Toy Results
• We could not implement the pipeline completely, as that would require much more time, but we did obtain some simple results on the following pair of highly disparate views:
First Point Cloud Second Point Cloud
Final Registered Point Cloud
Results Visualisation
Challenges Faced
• One of the most crucial challenges we faced was with the ScanNet dataset, which is a richly annotated dataset with 3D reconstructions of indoor scenes.
• The dataset is HUGE and uses a rare format, .sdf.ann, for storing 3D data. Figuring that out was a big task! The documentation was incomplete, and the authors had not explained how to use it properly.
• We tried emailing the authors whenever we had queries, but to no avail. We then started mailing their students, and finally one of them, an MS student under Matthias Niessner himself, got back to us!
• Thanks to his help, we were able to successfully get the network to run.
But yes, ideas matter and not the dataset!
What We Learnt
• Up until now, we had mostly been fine-tuning existing networks for custom datasets. This really gave us some valuable experience in implementing papers from scratch and by ourselves.
• We noticed that often, just reading the paper is not enough. There are so many more things we learn when we go into the implementation level!
• We hope to continue working on the pipeline we proposed over the summer!