
3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation


Academic year: 2022


(1)

3DMV: Joint 3D-Multi-View Prediction for 3D Semantic Scene Segmentation

Team Artoo:

Abhinav Gupta Shubodh Sai  Pawan Verma

Prof. Avinash Sharma

Aniket Joshi

(2)

Implementation Results

We tested a few randomly chosen scenes to generate RGB semantic segmentations. Unfortunately, the ScanNet dataset does NOT contain the ground truth for the test scenes.

Figure panels: the 3D voxels; semantic voxel labelling

(3)

Implementation Results

Original RGB-D scans of Scene0707 from the ScanNet dataset

(4)

Implementation Results (continued)

Figure panels: point cloud; semantic voxel labelling

(5)

Outline:

Recap

Improving upon 2D semantic segmentation

The 3D Network

Salient Features: Joint 2D-3D network and Backprojection Layer

Implementation Results

Future Work

(6)

Why we chose this paper

What impressed us most about this paper was the large accuracy gain it reports over other state-of-the-art methods! Existing methods used either geometry OR RGB data as input.

But here, both inputs are used in a joint, end-to-end fashion. The core idea of the paper is to combine geometric and RGB features in a joint network architecture. RGB and geometric information complement each other nicely.

Secondly, semantic scene segmentation is a very useful task in computer and robotic vision. A robot must know ‘where’ the objects are in its environment!

And thirdly, because it’s written by Matthias Niessner from TU Munich, who is an absolute legend and recently got 8 papers accepted to CVPR 2020. It was truly amazing to implement something he has worked on.

(7)

Recap: 3DMV

3DMV (Joint 3D-Multi-View Prediction)

A novel pipeline which combines 2D feature maps with 3D voxel data to generate a 3D semantic scene segmentation.

Existing methods project colour data directly onto the volumetric grid. By back-projecting learned 2D feature maps instead, this method achieves better accuracy than existing volumetric architectures.

CORE IDEA:

Extract 2D feature maps from 2D images using the full-resolution RGB input.

Down-sample the features through convolutions in the 2D domain, then back-project the resulting 2D feature map into 3D space.

The key highlight is the formulation of a joint, end-to-end convolutional neural network which learns to infer 3D semantics from both 3D geometry and 2D RGB input.

(8)

Input: The reconstruction of the RGB-D scan, as well as the RGB images used for the reconstruction

Output: The 3D semantic segmentation, as per-voxel labels

(9)

Workflow overview


(10)

Recap: 2D Network Training

The 2D network aims at extracting features from the input RGB images of the 3D scene.

The 2D network architecture is based on ENet primarily due to its training speed and memory efficiency, which are essential when performing a joint 2D-3D analysis.

ENet (Efficient Neural Network) gives the ability to perform pixel-wise semantic segmentation in real-time.

ENet Architecture

The ENet architecture is built from a series of bottleneck modules. Each bottleneck module consists of:

1x1 projection that reduces the dimensionality

A main convolution layer (3x3)

1x1 expansion

Batch Normalization and PReLU between all convolutional layers

If the bottleneck is downsampling, a max pooling layer is added to the main branch. Also, the first 1x1 projection is replaced with a 2x2 convolution with stride 2.
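The bottleneck structure described above can be made concrete with a small sketch that only tracks tensor shapes through the convolutional branch of one bottleneck; it performs no actual convolution and leaves out the max pooling branch. The function names, the `reduction` factor of 4, and the example channel counts are assumptions chosen for illustration, not values taken from these slides.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def bottleneck_shapes(in_ch, out_ch, height, width, downsample=False, reduction=4):
    """Trace (name, channels, height, width) through one ENet-style bottleneck.

    `reduction` (how much the projection shrinks the channel count) is an
    assumed value chosen for illustration.
    """
    mid_ch = max(1, in_ch // reduction)
    shapes = [("input", in_ch, height, width)]

    if downsample:
        # The first 1x1 projection becomes a 2x2 convolution with stride 2.
        height = conv_out(height, 2, stride=2)
        width = conv_out(width, 2, stride=2)
        shapes.append(("2x2 projection, stride 2", mid_ch, height, width))
    else:
        shapes.append(("1x1 projection", mid_ch, height, width))

    # Main 3x3 convolution; padding 1 keeps the spatial size unchanged.
    height = conv_out(height, 3, padding=1)
    width = conv_out(width, 3, padding=1)
    shapes.append(("3x3 main conv", mid_ch, height, width))

    # The 1x1 expansion restores the requested number of output channels.
    shapes.append(("1x1 expansion", out_ch, height, width))
    return shapes
```

Calling `bottleneck_shapes(16, 64, 128, 128, downsample=True)` lists the four stages of a downsampling bottleneck, which makes it easy to see where the spatial resolution is halved and where the channel count is reduced and restored.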

(11)

Recap: Implementing the 2D Network

(12)

PHASE 2

Improving the 2D semantic segmentation

The 3D Network

(13)

Improving the 2D semantic segmentation

(14)

The 3D Network

Features of the 3D Network:

The 3D network part is composed of a series of 3D convolutions operating on a regular volumetric grid.

The volumetric grid is a sub-volume of the voxelized 3D representation of the scene. Each sub-volume is centered around a specific x-y location at a size of 31 x 31 x 62 voxels, with a voxel size of 4.8 cm.

The network takes these sub-volumes as input and predicts the semantic labels for the centre column of each sub-volume at a resolution of 1 x 1 x 62 voxels; i.e., it simultaneously predicts labels for 62 voxels.
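The sub-volume numbers above translate into physical sizes and index ranges as in the following sketch. The function names are hypothetical, and the sketch assumes that the chunk spans the full height of the grid starting at z index 0; handling of chunks that extend past the scene border is left out.

```python
VOXEL_SIZE_M = 0.048        # 4.8 cm voxels, as stated above
CHUNK_DIMS = (31, 31, 62)   # sub-volume size in voxels (x, y, z)


def chunk_extent_m(dims=CHUNK_DIMS, voxel_size=VOXEL_SIZE_M):
    """Physical size of one sub-volume in metres."""
    return tuple(round(d * voxel_size, 3) for d in dims)


def chunk_bounds(center_x, center_y, dims=CHUNK_DIMS):
    """Half-open voxel index ranges of the sub-volume centred at (center_x, center_y).

    Assumes the chunk covers z indices 0 .. dims[2] - 1 (an assumption).
    """
    half_x, half_y = dims[0] // 2, dims[1] // 2
    return (
        (center_x - half_x, center_x + half_x + 1),
        (center_y - half_y, center_y + half_y + 1),
        (0, dims[2]),
    )
```

With these values, `chunk_extent_m()` gives roughly 1.49 m x 1.49 m x 2.98 m per sub-volume, and the 62 labels predicted for a chunk belong to the single column at `(center_x, center_y)`.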

(15)

Salient Features - Backprojection Layer

The authors assume known 6-DoF pose alignments for the input RGB images with respect to each other and the 3D reconstruction.

The layer is essentially a loop over every voxel in the 3D sub-volume with which a given image is associated.

For every voxel, the 3D-to-2D projection is computed from the corresponding camera pose, the camera intrinsics, and the world-to-grid transformation matrix.

In order to handle multiple 2D input streams, voxel-to-pixel associations with respect to each input view are computed.

Some voxels will be associated with multiple pixels from different views. In order to combine projected features from multiple input views, a voxel max-pooling operation computes the maximum response on a per-feature-channel basis.
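The loop described above can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the layer from the paper: the function names and argument layout are hypothetical, matrices are 4x4 row-major lists, the camera is a simple pinhole model with intrinsics `fx, fy, cx, cy`, and no occlusion or depth test is performed.

```python
def mat_vec(matrix, vec):
    """Multiply a 4x4 matrix (list of rows) by a length-4 vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def project_voxel(voxel, grid_to_world, world_to_camera, fx, fy, cx, cy):
    """Project a voxel index (i, j, k) to a pixel (u, v), or None if behind the camera.

    grid_to_world is the inverse of the world-to-grid matrix and
    world_to_camera is the inverse of the camera pose (both assumed given).
    """
    p_world = mat_vec(grid_to_world, [voxel[0], voxel[1], voxel[2], 1.0])
    x, y, z, _ = mat_vec(world_to_camera, p_world)
    if z <= 0:
        return None
    # Pinhole projection followed by rounding to the nearest pixel.
    return (int(round(fx * x / z + cx)), int(round(fy * y / z + cy)))


def backproject(voxels, feature_maps, cameras, grid_to_world):
    """Max-pool per-channel 2D features from several views into each voxel.

    feature_maps[view][row][col] is a list of channel values;
    cameras[view] is (world_to_camera, fx, fy, cx, cy).
    Voxels that no view observes are left out of the result.
    """
    pooled = {}
    for voxel in voxels:
        for fmap, (w2c, fx, fy, cx, cy) in zip(feature_maps, cameras):
            pixel = project_voxel(voxel, grid_to_world, w2c, fx, fy, cx, cy)
            if pixel is None:
                continue
            u, v = pixel
            if not (0 <= v < len(fmap) and 0 <= u < len(fmap[0])):
                continue  # projects outside this view's feature map
            feat = fmap[v][u]
            if voxel in pooled:
                # Voxel max pooling: keep the maximum response per channel.
                pooled[voxel] = [max(a, b) for a, b in zip(pooled[voxel], feat)]
            else:
                pooled[voxel] = list(feat)
    return pooled
```

Voxels are passed as tuples so they can serve as dictionary keys. Taking the per-channel maximum makes the result independent of the order in which the views are visited.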

(16)

Salient Features - Joint 2D-3D Network

Features of the Joint 2D-3D Network:

The joint 2D-3D network combines 2D RGB features and 3D geometric features using the mapping from the backprojection layer.

These two inputs are each processed with a series of 3D convolutions and then concatenated together; the joined feature is then further processed with a set of 3D convolutions.

The joint 2D-3D network operates on a per-chunk basis; i.e., it takes fixed sub-volumes of a 3D scene as input and predicts labels for the voxels in the center column of the given chunk.
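The per-chunk operation described above can be sketched as a sliding window over the x-y plane of the scene. In this sketch, `predict_column` is a hypothetical callable standing in for the trained joint network, only the geometry input is shown (the back-projected 2D features would be passed alongside it), and the stride of one column and the padding outside the scene are assumptions.

```python
def label_scene(occupancy, predict_column, dims=(31, 31, 62), pad_value=0):
    """Label every voxel column of a scene by sliding a chunk over x-y.

    occupancy[x][y] is one voxel column of length dims[2];
    predict_column receives one chunk and returns dims[2] labels
    for the chunk's centre column.
    """
    size_x, size_y = len(occupancy), len(occupancy[0])
    half_x, half_y = dims[0] // 2, dims[1] // 2
    empty_column = [pad_value] * dims[2]
    labels = [[None] * size_y for _ in range(size_x)]

    for cx in range(size_x):
        for cy in range(size_y):
            # Gather the chunk centred at (cx, cy), padding outside the scene.
            chunk = [
                [
                    occupancy[x][y]
                    if 0 <= x < size_x and 0 <= y < size_y
                    else empty_column
                    for y in range(cy - half_y, cy + half_y + 1)
                ]
                for x in range(cx - half_x, cx + half_x + 1)
            ]
            labels[cx][cy] = predict_column(chunk)
    return labels
```

Because each call labels only the centre column, neighbouring chunks overlap heavily; the surrounding voxels in a chunk serve purely as spatial context for the prediction.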

(17)

Benchmark Testing

(18)

Future Work

One often needs to draw a line between ‘research’ and ‘development’. What we have done so far is implement a research paper. Coding something that has already been shown to work is development.

But demonstrating that something CAN work is research! That 'something' is a novel idea, that we come up with after much thought and insight, and one can never know whether it will work or not.

But that's what research is all about, to come up with new, interesting ideas and try to see if something can work! Failures will inevitably be present, but they will guide us towards success.

For our project, we decided to go a little further than merely implementing 3DMV!

(19)

Transformation Estimation between Two Disparate Views

We came up with a novel idea: use the 3D semantic scene segmentation given by the network to estimate rigid transformations between two disparate views, a common problem in robotic vision and SLAM.

Our goal is to use the 3D semantics obtained from 3DMV to assist visual odometry in the SLAM pipeline. Estimating the transformation between two given disparate views generates additional constraints for it.

Given any disparate view pair (an RGB-D pair), our goal is only to recover the precise transformation, even when the overlap is low.
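One way such a transformation could be estimated from semantic labels is sketched below. This is an illustrative assumption, not the pipeline proposed in these slides: it takes the centroid of each semantic class in both views as a correspondence and solves for the rigid transform with the Kabsch algorithm using NumPy. All function names are hypothetical, and the approach presumes at least three shared, non-collinear classes with one object per class; with low overlap, class centroids only roughly coincide, so the result would serve as a coarse initial guess at best.

```python
import numpy as np


def semantic_centroids(points, labels):
    """Mean 3D position of the points carrying each semantic label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return {int(l): points[labels == l].mean(axis=0) for l in np.unique(labels)}


def kabsch(src, dst):
    """Least-squares rotation R and translation t such that dst ~= R @ src + t."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    h = (src - src_mean).T @ (dst - dst_mean)   # 3x3 cross-covariance
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))      # guard against reflections
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return rot, dst_mean - rot @ src_mean


def align_by_semantics(points_a, labels_a, points_b, labels_b):
    """Estimate the rigid transform from view A to view B using shared labels."""
    cent_a = semantic_centroids(points_a, labels_a)
    cent_b = semantic_centroids(points_b, labels_b)
    shared = sorted(set(cent_a) & set(cent_b))
    if len(shared) < 3:
        raise ValueError("need at least 3 shared semantic labels")
    return kabsch([cent_a[l] for l in shared], [cent_b[l] for l in shared])
```

The returned rotation and translation map points of view A into the frame of view B, and could in principle be refined afterwards by a local registration method.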

(20)

Our Proposed Pipeline

(21)

Toy Results

We could not implement the complete pipeline, as it would require much more time, but we obtained some simple results on the following pair of highly disparate views:

(22)

Results Visualisation

Figure panels: first point cloud; second point cloud; final registered point cloud

(23)

Challenges Faced

One of the most crucial challenges we faced was with the ScanNet dataset, which is a richly annotated dataset with 3D reconstructions of indoor scenes.

The dataset is HUGE and stores its 3D data in a rarely used format, .sdf.ann.

Trying to figure that out was indeed a big task! The documentation was incomplete, and the authors had not explained how to use the format properly.

We tried emailing the authors whenever we had queries, but to no avail. We then started mailing their students, and finally one of them, an MS student under Matthias Niessner himself, got back to us!

Thanks to his help, we were able to successfully get the network to run.

(24)

But yes, ideas matter and not the dataset!

(25)

What We Learnt

Up until now, we had mostly been fine-tuning existing networks for custom datasets. This project gave us valuable experience in implementing papers from scratch by ourselves.

We noticed that often, just reading the paper is not enough. There are so many more things we learn when we go into the implementation level!

We hope to continue working on the pipeline we proposed over the summer!

(26)

Thank You Very Much!
