3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

(1)

Team Artoo:

Abhinav Gupta Shubodh Sai Pawan Verma

Prof. Avinash Sharma

Aniket Joshi

(2)

Implementation Results

We ran the network on a few randomly chosen scenes to generate RGB semantic segmentations. Unfortunately, the ScanNet dataset does NOT contain ground truth for the test scenes, so these results can only be inspected visually.

The 3D voxels Semantic Voxel labelling

(3)

Implementation Results

Original RGB-D Scans of Scene0707 from the ScanNet Dataset

(4)

Implementation Results (continued)

Point Cloud Semantic voxel labelling

Semantic Voxel Labeling

(5)

Outline:

• Recap

• Improving upon 2D semantic segmentation

• The 3D Network

• Salient Features: Joint 2D-3D network and Backprojection Layer

• Implementation Results

• Future Work

(6)

Why we chose this paper

• The most striking aspect of this paper is its clear advantage over other state-of-the-art methods! Existing methods use either geometry OR RGB data as input.

• Here, both inputs are used in a joint, end-to-end fashion. The heart of the paper is the idea of combining geometric and RGB features in one network architecture, because RGB and geometric information complement each other nicely.

• Secondly, semantic scene segmentation is a very useful task in computer and robotic vision. A robot must know ‘where’ the objects are in its environment!

• And thirdly, because it is co-authored by Matthias Niessner from TU Munich, who is an absolute legend and recently got 8 papers accepted to CVPR 2020. It was truly amazing to implement something he has worked on.

(7)

Recap: 3DMV

3DMV (Joint 3D-Multi-View Prediction)

• A novel pipeline which combines 2D feature maps with 3D voxel data to generate a 3D semantic scene segmentation.

• Existing methods project colour data directly onto the volumetric grid. 3DMV backprojects learned 2D features instead, and provides better accuracy than existing volumetric architectures.

CORE IDEA:

• Extract 2D feature maps from 2D images using the full-resolution RGB input.

• Down-sample the features through convolutions in the 2D domain, then backproject the resulting 2D feature map into 3D space.

• The key highlight is the formulation of a joint, end-to-end convolutional neural network which learns to infer 3D semantics from both 3D geometry and 2D RGB input.

(8)

Input: The reconstruction of the RGB-D scan, as well as the RGB images used for the reconstruction.

Output: The 3D semantic segmentation, as per-voxel labels!

(9)

Workflow overview

[Figure: workflow diagram with four numbered stages (1 to 4)]

(10)

Recap: 2D Network Training

• The 2D network aims at extracting features from the input RGB images of the 3D scene.

• The 2D network architecture is based on ENet, primarily due to its training speed and memory efficiency, which are essential when performing a joint 2D-3D analysis.

• ENet (Efficient Neural Network) gives the ability to perform pixel-wise semantic segmentation in real-time.

ENet Architecture

• The ENet architecture comprises a series of bottleneck modules. Each bottleneck module consists of:

• 1x1 projection that reduces the dimensionality

• A main convolution layer ( 3x3 )

• 1x1 expansion

• Batch Normalization and PReLU between all convolutional layers

• If the bottleneck is downsampling, a max pooling layer is added to the main branch. Also, the first 1x1 projection is replaced with a 2x2 convolution with stride 2.
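To see why the bottleneck design helps with memory efficiency, here is a rough weight count for one non-downsampling bottleneck versus a plain 3x3 convolution at the same width. This is only a back-of-the-envelope sketch: the channel width of 128 and the reduction factor of 4 are assumptions chosen for illustration, and biases, Batch Normalization and PReLU parameters are ignored.

```python
def conv_weights(c_in, c_out, k):
    """Number of weights in a k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def bottleneck_weights(channels, reduction=4, k=3):
    """Weights in 1x1 projection + k x k main conv + 1x1 expansion.

    `reduction` is an assumed ratio between outer and inner channel width.
    """
    inner = channels // reduction
    projection = conv_weights(channels, inner, 1)
    main = conv_weights(inner, inner, k)
    expansion = conv_weights(inner, channels, 1)
    return projection + main + expansion


channels = 128  # illustrative width, not taken from the slides
plain = conv_weights(channels, channels, 3)
bottleneck = bottleneck_weights(channels)
print(f"plain 3x3: {plain} weights, bottleneck: {bottleneck} weights")
```

Under these assumptions the bottleneck needs roughly an eighth of the weights of the plain convolution, which is the kind of saving that makes a joint 2D-3D network fit in memory.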

(11)

Recap: Implementing the 2D Network

(12)

PHASE 2

Improving the 2D semantic segmentation

The 3D Network

(13)

Improving the 2D semantic segmentation

(14)

The 3D Network

Features of the 3D Network:

• The 3D network is composed of a series of 3D convolutions operating on a regular volumetric grid.

• The volumetric grid is a sub-volume of the voxelized 3D representation of the scene. Each sub-volume is centered around a specific x-y location and has a size of 31 x 31 x 62 voxels, with a voxel size of 4.8 cm.

• The network takes these sub-volumes as input and predicts the semantic labels for the centre column of the respective sub-volume at a resolution of 1 x 1 x 62 voxels; i.e., it simultaneously predicts labels for 62 voxels.
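The sub-volume sizes above translate to roughly 1.5 m x 1.5 m x 3 m of physical space. The NumPy sketch below works out that arithmetic and cuts one sub-volume out of a scene grid. The function names, the zero padding at scene borders and the assumption that the scene grid is exactly 62 voxels high are ours, for illustration only.

```python
import numpy as np

VOXEL_SIZE_M = 0.048       # 4.8 cm voxels, as stated above
CHUNK_DIMS = (31, 31, 62)  # sub-volume size in voxels


def chunk_extent_m(dims=CHUNK_DIMS, voxel=VOXEL_SIZE_M):
    """Physical size of one sub-volume in metres."""
    return tuple(d * voxel for d in dims)


def extract_chunk(occupancy, cx, cy):
    """Cut a 31 x 31 x 62 sub-volume centred on column (cx, cy).

    `occupancy` is an (X, Y, 62) grid. The grid is zero-padded in x and y
    so that columns near the scene border still get a full sub-volume.
    """
    half = CHUNK_DIMS[0] // 2
    padded = np.pad(occupancy, ((half, half), (half, half), (0, 0)))
    return padded[cx:cx + CHUNK_DIMS[0], cy:cy + CHUNK_DIMS[1], :CHUNK_DIMS[2]]


scene = np.zeros((10, 10, 62), dtype=np.float32)
scene[0, 0, 5] = 1.0                 # one occupied voxel in a corner column
chunk = extract_chunk(scene, 0, 0)
centre_column = chunk[15, 15, :]     # the 1 x 1 x 62 column that gets labelled
print(chunk_extent_m(), chunk.shape, centre_column.shape)
```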

(15)

Salient Features - Backprojection Layer

• The authors assume known 6-DoF pose alignments for the input RGB images with respect to each other and to the 3D reconstruction.

• The layer is essentially a loop over every voxel in the 3D sub-volume that a given image is associated with.

• For every voxel, the 3D-to-2D projection is computed from the corresponding camera pose, the camera intrinsics, and the world-to-grid transformation matrix.

• In order to handle multiple 2D input streams, voxel-to-pixel associations are computed with respect to each input view.

• Some voxels will be associated with multiple pixels from different views. In order to combine projected features from multiple input views, a voxel max-pooling operation computes the maximum response on a per-feature-channel basis.
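The steps above can be sketched in NumPy as follows. This is a simplified illustration under our own assumptions, not the authors' layer: the function names are hypothetical, a pinhole camera with nearest-pixel rounding is assumed, voxels with no associated pixel are filled with zeros, and any occlusion check against the depth frame (which a real implementation would likely need) is left out.

```python
import numpy as np


def voxel_to_pixel(grid_dims, world_to_grid, cam_to_world, intrinsic, image_hw):
    """Project every voxel of a sub-volume into one image.

    Returns integer pixel coordinates (u, v) per voxel and a validity mask.
    """
    nx, ny, nz = grid_dims
    ii, jj, kk = np.meshgrid(np.arange(nx), np.arange(ny), np.arange(nz), indexing="ij")
    grid_pts = np.stack([ii, jj, kk, np.ones_like(ii)], axis=-1)
    grid_pts = grid_pts.reshape(-1, 4).astype(np.float64)
    # grid -> world -> camera
    grid_to_cam = np.linalg.inv(cam_to_world) @ np.linalg.inv(world_to_grid)
    cam_pts = grid_pts @ grid_to_cam.T
    z = cam_pts[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero
    u = intrinsic[0, 0] * cam_pts[:, 0] / safe_z + intrinsic[0, 2]
    v = intrinsic[1, 1] * cam_pts[:, 1] / safe_z + intrinsic[1, 2]
    uv = np.rint(np.stack([u, v], axis=1)).astype(np.int64)
    h, w = image_hw
    valid = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, valid


def backproject(features, uv, valid):
    """Copy (C, H, W) image features to voxels; -inf marks 'no pixel'."""
    out = np.full((uv.shape[0], features.shape[0]), -np.inf)
    out[valid] = features[:, uv[valid, 1], uv[valid, 0]].T
    return out


def pool_views(per_view):
    """Per-channel maximum over views; voxels seen by no view become 0."""
    pooled = np.max(np.stack(per_view, axis=0), axis=0)
    return np.where(np.isfinite(pooled), pooled, 0.0)


# Toy setup: 3 voxels along the optical axis, camera at the grid origin.
K = np.array([[10.0, 0.0, 2.0], [0.0, 10.0, 1.0], [0.0, 0.0, 1.0]])
uv, valid = voxel_to_pixel((1, 1, 3), np.eye(4), np.eye(4), K, image_hw=(3, 4))
view_a = backproject(np.ones((2, 3, 4)), uv, valid)
view_b = backproject(np.full((2, 3, 4), 2.0), uv, valid)
pooled = pool_views([view_a, view_b])
print(valid, pooled)
```

In the toy setup the voxel at the camera centre has no valid projection, while the other two land on the principal point and receive the larger of the two views' feature values.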

(16)

Salient Features - Joint 2D-3D Network

Features of the Joint 2D-3D Network:

• The joint 2D-3D network combines 2D RGB features and 3D geometric features using the mapping from the backprojection layer.

• These two inputs are each processed with a series of 3D convolutions and then concatenated; the joined feature is then further processed with a set of 3D convolutions.

• The joint 2D-3D network operates on a per-chunk basis; i.e., it takes fixed sub-volumes of a 3D scene as input and predicts labels for the voxels in the centre column of the given chunk.
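The per-chunk behaviour can be illustrated with a small NumPy sketch that slides over every x-y column of a scene, concatenates the two feature streams along the channel axis and labels the centre column. Note that `fuse_and_predict` is only a placeholder of ours: the real network applies learned 3D convolutions and a classifier where this sketch simply takes an argmax over the concatenated channels.

```python
import numpy as np

CHUNK_XY, CHUNK_Z = 31, 62


def fuse_and_predict(geom_feat, color_feat):
    """Placeholder for the joint head (NOT the learned network).

    Concatenates geometry and colour features along the channel axis and
    returns one integer per voxel of the centre column.
    """
    joint = np.concatenate([geom_feat, color_feat], axis=0)  # (Cg + Cc, 31, 31, 62)
    c = CHUNK_XY // 2
    column = joint[:, c, c, :]                               # (Cg + Cc, 62)
    return column.argmax(axis=0)


def label_scene(geom, color):
    """geom: (Cg, X, Y, 62), color: (Cc, X, Y, 62) -> labels (X, Y, 62)."""
    _, nx, ny, nz = geom.shape
    half = CHUNK_XY // 2
    pad = ((0, 0), (half, half), (half, half), (0, 0))
    g, c = np.pad(geom, pad), np.pad(color, pad)
    labels = np.zeros((nx, ny, nz), dtype=np.int64)
    for x in range(nx):
        for y in range(ny):
            g_chunk = g[:, x:x + CHUNK_XY, y:y + CHUNK_XY, :]
            c_chunk = c[:, x:x + CHUNK_XY, y:y + CHUNK_XY, :]
            labels[x, y] = fuse_and_predict(g_chunk, c_chunk)
    return labels


geom = np.zeros((1, 4, 5, CHUNK_Z))   # toy geometry features, 1 channel
color = np.zeros((2, 4, 5, CHUNK_Z))  # toy backprojected colour features
color[1] = 1.0                        # make the last joint channel the largest
labels = label_scene(geom, color)
print(labels.shape)
```

Because each chunk only labels its centre column, a full scene needs one forward pass per x-y location, which is why the chunk size and network speed matter so much.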

(17)

Benchmark Testing

(18)

Future Work

• One often needs to draw a line between ‘research’ and ‘development’. What we have done so far is implement a research paper. Coding something that has already been shown to work is development.

• But demonstrating that something CAN work is research! That 'something' is a novel idea, that we come up with after much thought and insight, and one can never know whether it will work or not.

• But that's what research is all about, to come up with new, interesting ideas and try to see if something can work! Failures will inevitably be present, but they will guide us towards success.

• For our project, we decided to go a little further than merely implementing 3DMV!

(19)

Transformation Estimation between Two Disparate Views

• We came up with a novel idea: use the 3D semantic scene segmentation given by the network to estimate rigid transformations between two disparate views, a common problem in robotic vision and SLAM.

• Our goal is to use the 3D semantics obtained from 3DMV to assist the visual odometry in the SLAM pipeline. The estimated transformation between two disparate views acts as an additional constraint.

• Given any disparate view pair (an RGB-D pair), our only goal is to recover the precise transformation, even with low overlap.
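As an illustration of the final step of such a pipeline, the sketch below recovers a rigid transformation from matched 3D points with the SVD-based Kabsch method. This is a standard technique we are swapping in for illustration, not necessarily the method used in our pipeline. It assumes that correspondences are already available (for example, centroids of the same semantic segments in both views, which is our assumption) and that the points are not degenerate.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Least-squares R, t such that dst is approximately R @ src + t.

    src, dst: (N, 3) arrays of matched 3D points, N >= 3.
    """
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)        # 3 x 3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))     # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Synthetic check: rotate 30 degrees about z and translate.
rng = np.random.default_rng(0)
src = rng.normal(size=(20, 3))
theta = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
t_true = np.array([1.0, 2.0, 3.0])
dst = src @ R_true.T + t_true
R_est, t_est = estimate_rigid_transform(src, dst)
print(np.round(R_est, 3), np.round(t_est, 3))
```

With noisy or partly wrong correspondences, a robust wrapper such as RANSAC around this solver would likely be needed; that is outside this sketch.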

(20)

Our Proposed Pipeline

(21)

Toy Results

• We could not implement the complete pipeline, as that would require much more time, but we obtained some simple results on the following pair of highly disparate views:

(22)

First Point Cloud Second Point Cloud

Final Registered Point Cloud

Results Visualisation

(23)

Challenges Faced

• One of the most crucial challenges we faced was with the ScanNet dataset, which is a richly annotated dataset with 3D reconstructions of indoor scenes.

• The dataset is HUGE and stores its 3D data in a rarely used format, .sdf.ann. Figuring that out was indeed a big task! The documentation was incomplete, and the authors had not mentioned how to use it properly.

• We tried emailing the authors whenever we had queries, but to no avail. We then started mailing their students, and finally one of them, an MS student under Matthias Niessner himself, got back to us!

• Thanks to his help, we were able to successfully get the network to run.

(24)

But yes, ideas matter and not the dataset!

(25)

What We Learnt

• Up until now, we had mostly been fine-tuning existing networks for custom datasets. This project gave us valuable experience in implementing a paper from scratch by ourselves.

• We noticed that often, just reading the paper is not enough. There are so many more things we learn when we go into the implementation level!

• We hope to continue working on the pipeline we proposed over the summer!

(26)

Thank You Very Much!