ANNOTATING OBJECT INSTANCES WITH A POLYGON-RNN

Academic year: 2021

Authors: Castrejon et al. (Dept. of CS, University of Toronto)

Presented by Mandar Pradhan

(2)

OBJECTIVE OF THE PAPER

● To annotate object instances in an image as fast as possible (AUTOMATIC ANNOTATION)

● To make the annotation as close to the ground truth as possible (POLYGON FOR ANNOTATION)

● To leave scope for human intervention to correct automated annotations (SEMI-AUTOMATIC ANNOTATION)

(3)

MOTIVATION BEHIND THE IDEA

● More data == more annotation == time consuming and lots of hard work!! (if done by manual polygon annotation)

● Cheaper forms of annotation (image tags, bounding boxes, scribbles, single points) are an easier way to obtain ground truth, but not as accurate as fully supervised methods

● Human intervention is needed to correct automated annotations and prevent the model from breaking down

(5)

EARLIER RELATED WORKS

Semi automatic annotations:

○ Scribbles / multi-scribbles - segmentation using graph cut, combining appearance cues and a smoothness term (additional layer of training examples, not accurate)

○ GrabCut - annotation as 2D bounding boxes + per-pixel labelling using the EM algorithm (idea extended to 3D bounding boxes + point clouds)

Drawbacks:

- Hard to incorporate shape priors

- Labellings with holes

(7)

EARLIER RELATED WORKS

● Semi automatic annotations:

- Done at super pixel level

- May merge small objects or parts

● Object instance segmentation (**USED IN THIS PAPER)

- CNNs label a box or patch

- Detect edges and link them to obtain a coherent region

- Combine small polygons into object regions to label images

- HERE RNNS HAVE BEEN USED TO DIRECTLY PREDICT FINAL POLYGONS

(8)

Polygon - RNN (High level overview)

● Automates annotation using a CNN followed by an RNN

● CNN extracts image features from the crop inside the instance's bounding box

● RNN input: CNN features of the image crop inside the bounding box + vertices predicted at times t-1 and t-2 + the initial vertex (details in a subsequent slide)

● RNN output: “Polygon object” outlining the instance inside the bounding box (a polygon is a list of 2-D vertices)

● Trained end to end

● The CNN is fine-tuned to object boundaries; the RNN encodes priors on object shapes

(10)

Polygon - RNN (Some more details)

● “Polygon object” : List of vertices of bounding polygon

● Defining a specific polygon may involve multiple parameterizations. (We can choose any vertex as starting point and then move on to the next points using any orientation)

● Convention: Any starting point, Clockwise orientation

● Why are the vertices from both t-1 and t-2 fed into the RNN input?

○ To account for the orientation

● Why is the initial point of the polygon fed into the RNN input?

○ To decide when to close the polygon
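
The clockwise convention above can be enforced with a signed-area check. The following is a minimal sketch, not the authors' code: the function names and the `y_down` flag are assumptions made for illustration, and vertices are assumed to be `(x, y)` tuples in image coordinates.

```python
def signed_area(vertices):
    """Shoelace formula. Positive when the vertices run counter-clockwise
    in a coordinate system whose y axis points up."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def to_clockwise(vertices, y_down=True):
    """Return the vertices ordered clockwise as seen on screen.

    With image coordinates (y grows downward) a positive shoelace area
    already looks clockwise on screen; with y-up coordinates it is the
    opposite, hence the flag.
    """
    area = signed_area(vertices)
    is_clockwise = area > 0 if y_down else area < 0
    return list(vertices) if is_clockwise else list(reversed(vertices))
```

Any starting vertex is still allowed under this convention; only the traversal direction is fixed, which removes one of the two sources of ambiguity in the parameterization.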

(12)

CNN Module - CNN + Skip connects

● Based on the VGG16 architecture, with the fully connected layers and the last max-pooling layer removed

● Skip connections from the lower layers are each passed through a 3×3 convolutional layer + ReLU, rescaled to 28 × 28, and stacked

● Output is downsampled by a factor of 16

● Why skip connections? - To pull out low-level features (like edges and corners) as well as the semantics of the instance

● How to handle skip connections with different spatial dimensions?

- Bilinear upsampling after the additional convolution at conv5

- 2×2 max-pooling before the additional convolution at pool2
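
How feature maps of different resolutions can be brought to a common 28 × 28 grid before stacking is sketched below. This is an illustration only: it works on single-channel maps stored as nested lists, omits the 3×3 convolution + ReLU entirely, and uses nearest-neighbour upsampling as a simpler stand-in for the bilinear upsampling named on the slide. The 56/28/14 input sizes are assumptions.

```python
def max_pool_2x2(grid):
    """Halve the resolution by taking the maximum of each 2x2 block."""
    h, w = len(grid), len(grid[0])
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]


def upsample_2x_nearest(grid):
    """Double the resolution by repeating every value (NOT bilinear)."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def stack_skip_features(high_res, mid_res, low_res):
    """Bring three maps (assumed 56x56, 28x28, 14x14) to 28x28 and stack them."""
    return [max_pool_2x2(high_res), mid_res, upsample_2x_nearest(low_res)]
```

The stacked result plays the role of a multi-channel 28 × 28 feature volume; in a real network each entry would itself have many channels.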

(13)

RNN Module for vertex prediction

● Aim of the RNN - capture history (previous edges) and predict the future (next edges of the polygon)

● Makes coherent predictions in ambiguous cases (occlusion, shadows)

● Units: convolutional LSTMs - they operate in 2D, preserve the spatial information from the CNN, and reduce the number of parameters
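
The parameter saving can be made concrete with a rough count. The sketch below compares a convolutional LSTM layer with a fully connected LSTM applied to the flattened feature map. It assumes four gates with biases and no peephole connections; the 128 input channels used in the usage note are an assumption, not a number from the slides.

```python
def conv_lstm_params(in_channels, hidden_channels, kernel_size):
    """Four gates, each a convolution over the concatenated [input, hidden]
    channels, plus one bias per hidden channel and gate."""
    weights = 4 * (in_channels + hidden_channels) * hidden_channels * kernel_size ** 2
    biases = 4 * hidden_channels
    return weights + biases


def dense_lstm_params(input_size, hidden_size):
    """Four gates, each a dense layer over the concatenated [input, hidden]
    vectors, plus one bias per hidden unit and gate."""
    return 4 * (input_size + hidden_size) * hidden_size + 4 * hidden_size
```

For example, `conv_lstm_params(128, 16, 3)` counts a layer with 3×3 kernels and 16 hidden channels, while `dense_lstm_params(28 * 28 * 128, 28 * 28 * 16)` counts a dense LSTM over the same 28 × 28 maps flattened into vectors. The dense count is larger by several orders of magnitude because its weights do not share across spatial positions.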

(15)

RNN Module for vertex prediction

● 2-layer RNN with 16 channels and 3×3 kernels

● Representation of the output vertex - one-hot encoded vector of size D × D + 1

● The D × D entries represent the possible 2D positions of the vertex

● The additional entry denotes the end-of-sequence token (polygon is complete)

● At the input, apart from the CNN representation of the image, we have the one-hot encoded vertices at t-1 and t-2 along with the initial vertex
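
The output representation can be sketched as follows. This is a minimal illustration rather than the authors' code: the helper names are hypothetical, the row-major flattening is an assumption, and D = 28 is assumed to match the 28 × 28 feature resolution.

```python
D = 28  # assumed grid size


def encode_vertex(row, col, d=D):
    """One-hot vector of length d*d + 1 with a 1 at the vertex's grid cell."""
    vec = [0] * (d * d + 1)
    vec[row * d + col] = 1
    return vec


def encode_end_of_sequence(d=D):
    """One-hot vector whose extra last entry marks 'polygon is complete'."""
    vec = [0] * (d * d + 1)
    vec[d * d] = 1
    return vec


def decode(vec, d=D):
    """Return (row, col) of the most likely cell, or None for end of sequence."""
    idx = max(range(len(vec)), key=vec.__getitem__)
    if idx == d * d:
        return None
    return divmod(idx, d)
```

`decode` also works on a predicted probability vector, in which case it picks the most probable location.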

(16)

RNN Module for vertex prediction

● Prediction of starting point

- Reuse the CNN architecture with 2 additional layers

- The first layer predicts object boundaries

- The second layer takes the output of the first layer as well as the image features as inputs and predicts the vertices

(17)

Training Details

● Loss - Cross entropy

● Smoothing of the target distribution (the D × D + 1 target is not one-hot) - to avoid over-penalising predictions that are close to the target

- Assign non-zero probability to locations within a distance of 2 from the target in the grid

● Optimizer - Adam

● Batch size - 8

● Learning rate - 10^-4, decayed by a factor of 10 every 10 epochs

● β1 = 0.9, β2 = 0.999 (Adam momentum constants)

● First-vertex prediction uses a logistic loss

● Ground truth for the object-boundary layer - edges of the ground truth polygon

● Ground truth for the vertex layer - vertices of the ground truth polygon

● GPU - NVIDIA TITAN-X
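
The target smoothing mentioned above could look like the sketch below. The slides only say that locations within a distance of 2 receive non-zero probability; the chessboard distance, the geometric `decay` weighting and the function names are assumptions chosen for illustration.

```python
import math


def smoothed_target(row, col, d=28, radius=2, decay=0.5):
    """Target distribution over d*d + 1 entries (last entry = end of sequence).

    Cells within `radius` (chessboard distance) of the true vertex get a
    weight of decay**distance; the weights are then normalised to sum to 1.
    """
    target = [0.0] * (d * d + 1)
    for r in range(max(0, row - radius), min(d, row + radius + 1)):
        for c in range(max(0, col - radius), min(d, col + radius + 1)):
            dist = max(abs(r - row), abs(c - col))
            target[r * d + c] = decay ** dist
    total = sum(target)
    return [v / total for v in target]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted) if t > 0)
```

With such a target, a prediction one cell away from the true vertex is penalised less than one on the other side of the grid, which is the stated purpose of the smoothing.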

(18)

Implementational details

● How is the vertex chosen at each RNN time step? - Take the location with the highest log-probability

● How is a vertex corrected? - The annotator's corrected vertex is fed into the model at the next time step

● Inference time - 250 ms

● Polygon simplification

- Remove the middle vertex when 3 vertices lie on the same line, and keep only one vertex when 2 fall in the same grid cell

● Data augmentation:

- Flip the image crop and annotation at random

- Randomly enlarge the context around the bounding box (by 10-20% of the bounding box)
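
The polygon simplification step can be sketched as below. This is one possible reading of the rule on the slide, not the authors' implementation: vertices are assumed to be integer grid coordinates of a closed polygon, and degenerate shapes (for example a spike that doubles back on itself) are not handled.

```python
def simplify_polygon(vertices):
    """Drop repeated vertices and the middle vertex of collinear triples."""
    # Step 1: keep one vertex when consecutive vertices share a grid cell.
    deduped = []
    for v in vertices:
        if not deduped or v != deduped[-1]:
            deduped.append(v)
    if len(deduped) > 1 and deduped[0] == deduped[-1]:
        deduped.pop()  # the polygon is closed, so first and last are neighbours

    # Step 2: drop a vertex when it lies on the line through its neighbours
    # (zero cross product of the incoming and outgoing edge vectors).
    kept = []
    n = len(deduped)
    for i in range(n):
        px, py = deduped[i - 1]
        cx, cy = deduped[i]
        nx, ny = deduped[(i + 1) % n]
        cross = (cx - px) * (ny - cy) - (cy - py) * (nx - cx)
        if cross != 0:
            kept.append((cx, cy))
    return kept
```

Fewer vertices mean fewer points an annotator has to inspect, without changing the outlined region.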

(19)

Results

● Datasets: KITTI, Cityscapes

● Goals of the model:

- Polygon must be as accurate as possible

- Minimal number of clicks

● Yardsticks to gauge performance:

- Intersection over Union (IoU) measure

- Number of vertex corrections needed to predict the polygon

● Annotation of polygon done by in-house detector; bounding boxes are easy to obtain using AMT

(21)

Results : Cityscapes

● What is in this dataset? - 27 cities, 2975 train images, 500 validation, 1525 test

● Issue faced - Test set has no ground truth instances

● Solution - The 500 validation images are used as test images

- The images from Weimar and Zurich are used as the validation set

● Labels - Person, Car, Rider, Truck, Bus, Train, Motorcycle and Bicycle

● Size of instances - 28 to 1792 pixels

● Ground truth instance segmentation is provided both as pixel labelling and as polygons

● New problem - Polygons in Cityscapes include the occluded portions of objects

● Solution - Depth ordering to remove the occluded part (we want only the visible part)
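
The depth-ordering step can be illustrated with a small sketch. It assumes each instance mask is a set of `(row, col)` pixels and that the masks are already sorted from nearest to farthest; both the representation and the function name are assumptions made for this example.

```python
def visible_masks(masks_near_to_far):
    """Remove from each mask the pixels covered by any nearer instance."""
    covered = set()
    visible = []
    for mask in masks_near_to_far:
        visible.append(mask - covered)  # keep only pixels not hidden so far
        covered |= mask                 # this instance now hides what lies behind it
    return visible
```

An instance whose visible part ends up split into several disconnected pieces is the situation discussed on the next slide.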

(22)

Results : Cityscapes

● What about objects split into multiple components by occlusion?

- The authors treat each component as a single object

● So what happens if the RNN keeps adding new vertices without reaching a termination???

(23)

Results : Evaluation Metric

● Intersection over Union (IoU): prediction vs ground truth, averaged over all instances

● How to evaluate the human action (correction of vertices)? - Simulate an annotator who corrects a vertex each time the predicted vertex deviates from the ground truth

● Testing game plan: first do a sanity check in PREDICTION mode (no interaction from annotators to correct vertices), then evaluate the amount of human intervention needed
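
The IoU measure can be written down directly. The sketch below assumes each mask is a set of `(row, col)` pixels and, as a convention chosen here, returns 1.0 when both masks are empty.

```python
def iou(predicted, ground_truth):
    """Intersection over Union of two pixel sets."""
    union = predicted | ground_truth
    if not union:
        return 1.0  # assumed convention: two empty masks agree perfectly
    return len(predicted & ground_truth) / len(union)


def mean_iou(pairs):
    """Average IoU over (predicted, ground_truth) pairs, one pair per instance."""
    scores = [iou(p, g) for p, g in pairs]
    return sum(scores) / len(scores)
```

Averaging per instance, rather than pooling all pixels, gives small objects the same weight as large ones.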

(24)

Results : Baselines

● DeepMask: uses a CNN to output pixel labels, agnostic to class

● SharpMask: improves on DeepMask by upsampling the output to obtain a higher resolution

● Performance is reported using ground truth boxes

● Network structure: 50-layer ResNet architecture trained on the COCO dataset

● For DeepMask and SharpMask, the ResNet part is trained for 150 epochs

(25)

Results : Baselines

● SquareBox: the object is mapped to a bounding box (of reduced dimensions), with an individual box for each component of the object

● Dilation10: trained on a segmentation dataset; pixels mapped to the object class are grouped into instance masks

(28)

Results : Baselines

● Verdict

- Baselines are hard to correct

- Polygon-RNN has a better overall average and is best in 6 of 8 categories

- Outperforms SharpMask in the Car, Rider and Person classes by 12%, 6% and 7% respectively

- Why is the previous point worth noting? - SharpMask uses a ResNet architecture, which is much more powerful than VGG

- The baselines have an advantage on larger objects like bus and train due to their better resolution

(30)

Results : Annotators in the loop

● How are annotation quality and the amount of human intervention quantified? - Number of mouse clicks needed to reach different levels of accuracy

● What is meant by different “levels” of segmentation accuracy? - Thresholds on the chessboard distance between a predicted vertex and the ground truth

● The resulting IoU is also shown for comparison

● Methodology in a nutshell

- In the first method, pick 10 images per annotator and ask them to annotate freely without any cues or hint.

- In the second method, crop images and place blue markers on the
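
The chessboard metric used to define the accuracy levels can be sketched as a click counter. This is a simplification with stated assumptions: predicted and ground truth vertices are assumed to be matched one to one, and the real model would re-predict the later vertices after each correction, which this sketch does not do.

```python
def chessboard_distance(a, b):
    """Chebyshev distance between two grid points."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def simulate_corrections(predicted, ground_truth, threshold):
    """Replace every predicted vertex farther than `threshold` from its
    ground truth vertex and count how many clicks that takes."""
    corrected, clicks = [], 0
    for p, g in zip(predicted, ground_truth):
        if chessboard_distance(p, g) > threshold:
            corrected.append(g)
            clicks += 1
        else:
            corrected.append(p)
    return corrected, clicks
```

A larger threshold tolerates more error and therefore needs fewer clicks, which is the trade-off between clicks and IoU that the evaluation reports.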

(32)

Results : Annotators in the loop

● Verdict

- Human annotator IoU: 69.5% in the free viewing method and 78.60% for cropped images

- Indicates need to collect multiple annotations to reduce variations and biases in the annotators

(33)

Results : Annotators in the loop

● Comparison with GrabCut:

- 54 randomly chosen instances

● GrabCut stats: 42.2 s and 17.5 clicks per instance, 70.7% IoU

● Polygon-RNN stats: 5-9.6 clicks per instance, 77.6% IoU

● Verdict - Polygon-RNN is faster, as it needs fewer clicks for a comparable inference time

(36)

Results : Final Verdict

Advantages

● Polygon-RNN provides plausible annotations with relatively low latency

● Performance is good on smaller objects. This is visible across instances of varying sizes within the same dataset (Cityscapes) as well as between the 2 datasets (smaller objects in KITTI vs larger objects in Cityscapes)

● Competes well with SharpMask, which has a ResNet-based architecture

● Clearly reduces annotation cost at an IoU comparable to human annotation

● Introduction of human intervention adds scope to avoid extremely bad

(37)

Results : Final Verdict

Disadvantages

● Lower resolution and associated quantization error manifest in segmentation of larger instances.

● Memory intensive - Polygons have more vertices to predict than a single bounding box which may add latency in return for more accuracy.

● Cannot exploit Velodyne point clouds in the KITTI dataset as other methods do, which puts it at a disadvantage

(38)

Results : Final Verdict

Takeaways

● Tries to address the speed and accuracy of annotation

● Allowing human intervention guards against very bad performance

● Performance is good for smaller objects but drops for larger instances

● Scope for future work: improving resolution and exploiting Velodyne point cloud data to address the performance issues on the KITTI dataset

(39)

OTHER REFERENCES

[1] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.

[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.
