ANNOTATING OBJECT INSTANCES WITH A POLYGON-RNN

(1)

Authors: Castrejon et al.

(Dept. of Computer Science, University of Toronto)

Presented by Mandar Pradhan

(2)

OBJECTIVE OF THE PAPER

● To annotate object instances in an image as quickly as possible (AUTOMATIC ANNOTATION)

● To keep the annotation as close to the ground truth as possible (POLYGON FOR ANNOTATION)

● To leave scope for human intervention to correct the automated annotations (SEMI-AUTOMATIC ANNOTATION)

(3)

MOTIVATION BEHIND THE IDEA

● More data == more annotation == time-consuming, hard work (if done by manual polygon annotation)

● Cheaper forms of annotation (image tags, bounding boxes, scribbles, single points) are an easier way to obtain ground truth, but not as accurate as full supervision

● Human intervention is needed to correct automated annotations and keep the model from breaking down

(4)

EARLIER RELATED WORKS

● Semi automatic annotations:

○ Scribbles / multi-scribbles - segmentation using graph cuts that combine appearance cues with a smoothness term (additional layer of training examples, not accurate)

○ GrabCut - annotation as 2D bounding boxes + per-pixel labelling using an EM algorithm (idea extended to 3D bounding boxes + point clouds)

Drawbacks:

- Hard to incorporate shape priors

- Labellings with holes

(6)

EARLIER RELATED WORKS

● Semi automatic annotations:

- Done at super pixel level

- May merge small objects or parts

● Object instance segmentation (USED IN THIS PAPER)

- A CNN labels a box / patch

- Detect edges and link them to obtain a coherent region

- Combine small polygons into object regions to label images

- HERE, RNNs ARE USED TO PREDICT THE FINAL POLYGONS DIRECTLY

(8)

Polygon - RNN (High level overview)

● Automates annotation using a CNN followed by an RNN

● The CNN extracts image features from the crop inside the instance's bounding box

● RNN input: CNN features of the image crop inside the bounding box + the vertices at time steps t-1 and t-2 + the initial point (details on subsequent slides)

● RNN output: “Polygon object” outlining the instance inside the bounding box (a polygon is a list of 2-D vertices)

● Trained end to end

● The CNN is fine-tuned to object boundaries; the RNN encodes priors on object shapes

(9)

Polygon - RNN (Some more details)

● “Polygon object”: list of vertices of the bounding polygon

● A given polygon has multiple parameterizations (any vertex can serve as the starting point, and the remaining vertices can be traversed in either orientation)

● Convention: any starting point, clockwise orientation

● Why are the vertices from both t-1 and t-2 fed into the RNN?

○ To account for the orientation

● Why is the initial point of the polygon fed into the RNN?

○ To decide when to close the polygon
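
Sketch (not from the paper): one way to enforce the clockwise convention on an annotated polygon is to check the sign of the shoelace area and reverse the vertex order when needed. The helper names `signed_area` and `to_clockwise` are hypothetical, and image coordinates (x to the right, y downwards) are assumed.

```python
def signed_area(vertices):
    """Shoelace formula. With x to the right and y downwards (image
    coordinates), a positive result means the vertices are listed
    clockwise as seen on screen."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def to_clockwise(vertices):
    """Return the same polygon in clockwise order, keeping the first
    vertex as the starting point."""
    vertices = list(vertices)
    if signed_area(vertices) < 0:
        # Reverse the traversal but keep vertices[0] in front.
        return [vertices[0]] + vertices[:0:-1]
    return vertices


# A unit square listed counter-clockwise on screen.
square = [(0, 0), (0, 1), (1, 1), (1, 0)]
clockwise_square = to_clockwise(square)
```

The starting vertex is left untouched, which matches the "any starting point" half of the convention.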

(11)

CNN Module - CNN + Skip connects

● Based on the VGG16 architecture, with the fully connected layers and the last max-pooling layer removed and replaced

● Skip connections from the lower layers are each passed through a 3x3 convolutional layer + ReLU, rescaled to 28 x 28, and stacked

● Output is downsampled by a factor of 16

● Why skip connections? - To pull out both low-level features (edges and corners) and the semantics of the instance

● How are skip connections with different spatial dimensions handled?

- Bilinear upsampling after the additional convolution at conv5

- 2x2 max-pooling before the additional convolution at pool2
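
Sketch (not the authors' code): the two resizing operations named above, written in pure Python on single-channel 2-D lists. The map sizes (56 x 56 at pool2, 14 x 14 at conv5) are assumed for illustration, the helper names are hypothetical, and the corner-aligned interpolation is one common convention that the paper may not use. The additional 3x3 convolutions are left out.

```python
def max_pool_2x2(fmap):
    """2x2 max-pooling with stride 2; height and width must be even."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def bilinear_resize(fmap, out_h, out_w):
    """Bilinear interpolation with the corner pixels aligned."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Assumed sizes: bring a 56 x 56 map and a 14 x 14 map to 28 x 28.
pool2_like = [[float(r + c) for c in range(56)] for r in range(56)]
conv5_like = [[float(r * c) for c in range(14)] for r in range(14)]
from_pool2 = max_pool_2x2(pool2_like)
from_conv5 = bilinear_resize(conv5_like, 28, 28)
```

Once every branch is 28 x 28, the maps can be stacked along the channel axis.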

(13)

RNN Module for vertex prediction

● Aim of the RNN - capture the history (previous edges) and predict the future (next edges / polygon).

● Makes coherent predictions in ambiguous cases (occlusion, shadows)

● Units: convolutional LSTMs - they operate in 2D, preserve the spatial information from the CNN, and reduce the number of parameters to deal with
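
Sketch (illustrative arithmetic only): a rough parameter count for a convolutional LSTM layer versus a fully connected LSTM working on the flattened feature map. The 128 input channels and the 28 x 28 map size are assumptions, the 16 hidden channels and 3x3 kernel are the values quoted on the later RNN slide, and peephole connections are ignored.

```python
def conv_lstm_params(in_channels, hidden_channels, kernel_size):
    """Four gates, each a convolution over [input, hidden] plus one bias
    per output channel (no peephole terms)."""
    per_gate = (kernel_size * kernel_size
                * (in_channels + hidden_channels) * hidden_channels
                + hidden_channels)
    return 4 * per_gate


def fc_lstm_params(input_size, hidden_size):
    """Four gates, each a dense layer over [input, hidden] plus a bias."""
    per_gate = (input_size + hidden_size) * hidden_size + hidden_size
    return 4 * per_gate


SIDE = 28          # assumed spatial size of the feature map
IN_CHANNELS = 128  # assumed number of input channels
HIDDEN = 16        # hidden channels of the ConvLSTM

conv_count = conv_lstm_params(IN_CHANNELS, HIDDEN, kernel_size=3)
fc_count = fc_lstm_params(SIDE * SIDE * IN_CHANNELS, SIDE * SIDE * HIDDEN)
print(conv_count, fc_count)
```

Under these assumptions the convolutional layer needs tens of thousands of weights, while the dense equivalent needs billions, because the convolution shares its weights across all grid positions.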

(14)

(15)

RNN Module for vertex prediction

● 2-layer RNN with 16 channels and 3x3 kernels

● Representation of the output vertex - one-hot encoding of size D x D + 1

● The D x D entries represent the possible 2D coordinates of the vertex

● The additional entry denotes the end-of-sequence token (polygon is complete)

● At the input, apart from the CNN representation of the image, we have the one-hot encodings of the vertices at t-1 and t-2 along with the initial vertex.
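
Sketch (an illustration, not the authors' code): encoding a vertex on a D x D grid as a one-hot vector with one extra end-of-sequence slot, and decoding the highest-scoring entry back into a grid cell. D = 28 is assumed here to match the 28 x 28 feature map; the row-major layout and the function names are hypothetical choices.

```python
D = 28  # assumed grid size


def encode_vertex(row, col, d=D):
    """One-hot vector of length d*d + 1 with the vertex cell set to 1."""
    vec = [0] * (d * d + 1)
    vec[row * d + col] = 1
    return vec


def encode_end_of_sequence(d=D):
    """One-hot vector whose last entry marks 'polygon is complete'."""
    vec = [0] * (d * d + 1)
    vec[d * d] = 1
    return vec


def decode(scores, d=D):
    """Pick the highest-scoring entry; return (row, col), or None for
    the end-of-sequence token."""
    best = max(range(len(scores)), key=scores.__getitem__)
    if best == d * d:
        return None
    return divmod(best, d)


vertex = encode_vertex(3, 5)
```

`decode` works on any score vector of the same length, so it can also be applied to predicted probabilities or log-probabilities.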

(16)

RNN Module for vertex prediction

● Prediction of starting point

- Reuse the CNN architecture with 2 additional layers

- The first layer predicts the object boundaries

- The second layer takes the output of the first layer as well as the image features as input and predicts the vertices

(17)

Training Details

● Loss - cross entropy

● Smoothing of the target distribution (the D x D + 1 target is non-binary) - to avoid over-penalising predictions that are nearly correct

- Assign non-zero probability to grid locations within a distance of 2 from the target

● Optimizer - Adam

● Batch size - 8

● Learning rate - 10^-4, decayed by a factor of 10 every 10 epochs

● β₁ = 0.9, β₂ = 0.999 (Adam momentum constants)

● Use logistic regression

● Ground truth of object boundaries - edges of the ground-truth polygon

● Ground truth of vertex layer - vertices of the ground-truth polygon

● GPU - NVIDIA TITAN-X
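
Sketch (assumptions flagged): a smoothed target over the D x D + 1 classes and the cross-entropy loss against it. The slides only say that locations within a distance of 2 get non-zero probability; the chessboard distance, the uniform spread over neighbours, and the `peak` value of 0.6 are all hypothetical choices made for this example.

```python
import math


def smoothed_target(row, col, d=28, radius=2, peak=0.6):
    """Soft target of length d*d + 1: `peak` on the true cell, the rest
    spread evenly over cells within `radius` (chessboard distance)."""
    target = [0.0] * (d * d + 1)
    neighbours = [
        (r, c)
        for r in range(max(0, row - radius), min(d, row + radius + 1))
        for c in range(max(0, col - radius), min(d, col + radius + 1))
        if (r, c) != (row, col)
    ]
    target[row * d + col] = peak
    for r, c in neighbours:
        target[r * d + c] = (1.0 - peak) / len(neighbours)
    return target


def cross_entropy(target, probs, eps=1e-12):
    """Cross entropy between a target distribution and predicted
    probabilities; eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)


target = smoothed_target(10, 10)
uniform_prediction = [1.0 / len(target)] * len(target)
loss = cross_entropy(target, uniform_prediction)
```

With a smoothed target, a prediction one cell away from the ground truth still receives some credit instead of being penalised as heavily as a far-off one.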

(18)

Implementational details

● How is the best vertex chosen at each RNN time step? - Take the one with the highest log-probability

● How does the correction of a vertex take place? - The annotator's corrected vertex is fed in at the next time step

● Inference time - 250 ms

● Polygon simplification

- Eliminate redundant vertices: 3 vertices on the same line, and 2 vertices in the same grid cell

● Data augmentation:

- Flip image crop and annotation at random

- Randomly increase context (10-20% of the bounding box)
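
Sketch (one possible reading of the simplification step, not the authors' code): drop consecutive vertices that fall in the same grid cell, then drop any vertex lying on the straight line through its two neighbours. Integer grid coordinates and a closed polygon are assumed; `simplify_polygon` is a hypothetical name.

```python
def simplify_polygon(vertices):
    """Remove repeated and collinear vertices from a closed polygon
    given as a list of integer (x, y) grid cells."""
    # Step 1: collapse consecutive vertices in the same grid cell.
    points = []
    for v in vertices:
        if not points or v != points[-1]:
            points.append(v)
    if len(points) > 1 and points[0] == points[-1]:
        points.pop()  # the polygon wraps around onto its first vertex

    # Step 2: repeatedly remove a vertex whose neighbours are collinear
    # with it (zero cross product), until nothing changes.
    changed = True
    while changed and len(points) > 3:
        changed = False
        for i in range(len(points)):
            x0, y0 = points[i - 1]
            x1, y1 = points[i]
            x2, y2 = points[(i + 1) % len(points)]
            if (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1) == 0:
                points = points[:i] + points[i + 1:]
                changed = True
                break
    return points


raw = [(0, 0), (2, 0), (2, 0), (4, 0), (4, 4), (0, 4)]
simplified = simplify_polygon(raw)
```

In this example the duplicate (2, 0) and the mid-edge vertex are removed, leaving the four corners of the square.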

(19)

Results

● Datasets: KITTI, Cityscapes

● Goals of the model:

- Polygon must be as accurate as possible

- Minimal number of clicks

● Yardsticks to gauge performance:

- Intersection over Union (IoU) measure

- Number of vertex corrections needed to predict the polygon

● Annotation of polygon done by in-house detector; bounding boxes are easy to obtain using Amazon Mechanical Turk (AMT)

(20)

Results : Cityscapes

● What is in this dataset? - 27 cities, 2975 train images, 500 validation, 1525 test

● Issue faced - the test set has no ground-truth instances

● Solutions - the 500 validation images are used as test images

- The images from Weimar and Zurich form the validation set

● Labels - Person, Car, Rider, Truck, Bus, Train, Motorcycle and Bicycle

● Size of instances - 28 to 1792 pixels

● The dataset's instance segmentation is given both as pixel labelling and as polygons

● New problem - polygons in Cityscapes also cover the occluded portions

● Solution - depth ordering to remove the occluded part (we want only the visible part)

(22)

Results : Cityscapes

● What do we do about objects with multiple components due to occlusion?

● The authors have treated each component as a single object

● So what happens if the RNN keeps adding new vertices without reaching a termination?

(23)

Results : Evaluation Metric

● Intersection over Union (IoU): obtained prediction vs ground truth (averaged over all instances)

● How to evaluate the human action (corrections of vertices)? - Simulate an annotator who corrects the vertex each time the predicted vertex deviates too far from the ground truth

● Testing game plan: first do a sanity check in PREDICTION mode (no interaction from annotators to correct). Then evaluate the amount of human intervention needed
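
Sketch (a simplification, not the paper's evaluation code): IoU between two binary masks, plus a crude count of how many vertices a simulated annotator would correct. It assumes predicted and ground-truth vertices are already paired one-to-one and uses the chessboard distance that the annotator-in-the-loop slide mentions; the threshold value and all function names are hypothetical.

```python
def mask_iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks (2-D lists of 0/1)."""
    intersection = union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            intersection += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return intersection / union if union else 1.0


def chessboard_distance(a, b):
    """Largest coordinate difference between two grid points."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def count_corrections(predicted, ground_truth, threshold):
    """Number of paired vertices whose error exceeds the threshold,
    i.e. simulated clicks by the annotator."""
    return sum(
        1 for p, g in zip(predicted, ground_truth)
        if chessboard_distance(p, g) > threshold
    )


pred = [[1, 1], [0, 0]]
gt = [[1, 0], [1, 0]]
iou = mask_iou(pred, gt)
clicks = count_corrections([(0, 0), (5, 5), (9, 9)],
                           [(0, 1), (5, 9), (9, 9)], threshold=1)
```

A larger threshold tolerates more error and so needs fewer clicks, which is the trade-off between annotation effort and accuracy being measured.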

(24)

Results : Baselines

● DeepMask: uses a CNN to output pixel labels; class-agnostic

● SharpMask: improves on the DeepMask idea by upsampling the output to obtain a higher resolution

● Performance is reported using ground-truth boxes

● Network structure: 50-layer ResNet architecture trained on the COCO dataset

● For DeepMask and SharpMask, the ResNet part is trained for 150 epochs

(25)

Results : Baselines

● SquareBox: the object is mapped to a bounding box (of reduced dimensions), with an individual box for each component of the object

● Dilation10: uses the segmentation dataset. Pixels mapped to the object class are grouped into instance masks

(26)

(27)

Results : Baselines

● Verdict

- Baselines are hard to correct

- Polygon-RNN has the better overall average and is best in 6 / 8 categories

- Outperforms SharpMask in the Car, Rider and Person classes by 12%, 6% and 7% respectively

- Why is the previous point worth noting? - SharpMask uses a ResNet architecture, which is much more powerful than VGG

- The baselines have the advantage on larger objects like bus and train, due to better resolution

(29)

Results : Annotators in the loop

● How are the quality of annotation and the amount of human intervention quantified? - Number of mouse clicks needed to reach different levels of accuracy

● What do they mean by different “levels” of segmentation accuracy? - Thresholds on the chessboard distance of the errors

● Also, show the resulting IoU to compare

● Methodology in a nutshell

- In the first method, pick 10 images per annotator and ask them to annotate freely without any cues or hints.

- In the second method, crop images and place blue markers on the

(31)

(32)

Results : Annotators in the loop

● Verdict

- Human annotator IoU: 69.5% in the free-viewing method and 78.6% for cropped images

- Indicates the need to collect multiple annotations to reduce variation and bias across annotators

(33)

Results : Annotators in the loop

● Comparison with GrabCut:

- 54 randomly chosen instances

● GrabCut stats: 42.2s and 17.5 clicks per instance, 70.7% IoU

● Polygon-RNN stats: 5-9.6 clicks per instance, 77.6% IoU

● Verdict - Polygon-RNN is faster, as it needs fewer clicks for a comparable inference time

(34)

(35)

(36)

Results : Final Verdict

Advantages

● Polygon-RNN provides plausible annotations with relatively low latency

● Performance is good on smaller objects. This is visible in the performance across instances of varying sizes within the same dataset (Cityscapes) as well as between the 2 datasets (smaller objects in KITTI vs larger objects in Cityscapes)

● Competes well with SharpMask, which has a ResNet-based architecture

● Definitely reduces annotation cost at an IoU comparable to human annotation

● Introduction of human intervention adds scope to avoid extremely bad

(37)

Results : Final Verdict

Disadvantages

● Lower output resolution and the associated quantization error show up in the segmentation of larger instances.

● Memory intensive - a polygon has more vertices to predict than a single bounding box, which may add latency in return for more accuracy.

● Unlike other methods, it cannot exploit the Velodyne point clouds in the KITTI dataset, which puts it at a disadvantage

(38)

Results : Final Verdict

Takeaways

● Tries to address issues of speed and accuracy of annotations

● Allowing human intervention, the novelty here, keeps it from performing very badly

● Performance is good for smaller objects but drops on larger ones

● Scope for future work: improving the resolution and exploiting Velodyne point cloud data to address the performance issues on the KITTI dataset

(39)

OTHER REFERENCES

[1] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.

[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.

(40)