ANNOTATING OBJECT INSTANCES
WITH A POLYGON-RNN
Authors: Castrejon et al.
(Dept of CS,
University of Toronto)
Presented by Mandar Pradhan
OBJECTIVE OF THE PAPER
● To annotate object instances in an image as fast as possible (AUTOMATIC ANNOTATION)
● To produce annotations as close to the ground truth as possible (POLYGON FOR ANNOTATION)
● To leave scope for human intervention to correct automated annotations (SEMI-AUTOMATIC ANNOTATION)
MOTIVATION BEHIND THE IDEA
● More data == more annotation == time-consuming and lots of hard work!! (if done by manual polygon annotation)
● Other, cheaper forms of annotation (image tags, bounding boxes, scribbles, single points on objects) - not as accurate as fully supervised methods (but an easier way to obtain ground truth)
● Need for human intervention to correct automated annotations and prevent the model from breaking down
EARLIER RELATED WORKS
● Semi-automatic annotations:
○ Scribbles / multi-scribbles - segmentation using graph cut by combining appearance cues and a smoothness term (additional layer of training examples, not accurate)
○ GrabCut - annotation as 2D bounding boxes + per-pixel labelling using the EM algorithm (idea extended to 3D bounding boxes + point clouds)
Drawbacks:
- Hard to incorporate shape priors
- Labellings with holes
EARLIER RELATED WORKS
● Semi-automatic annotations:
- Done at superpixel level
- May merge small objects or parts
● Object instance segmentation (**USED IN THIS PAPER)
- CNN applied to a box / patch to produce the labelling
- Detect edges and link them to obtain coherent regions
- Combine small polygons into object regions to label images
- HERE RNNS HAVE BEEN USED TO DIRECTLY PREDICT FINAL POLYGONS
Polygon - RNN (High level overview)
● Performs automated annotation using a CNN followed by an RNN
● CNN extracts image features from the crop inside the bounding box of the instance
● RNN input: CNN features of the image crop inside the bounding box + vertices at times t-1 and t-2 + initial vertex (details in subsequent slides)
● RNN output: “polygon object” outlining the instance within the bounding box (polygons are lists of 2D vertices)
● Trained end to end
● The CNN is fine-tuned to object boundaries; the RNN encodes priors on object shapes
Polygon - RNN (Some more details)
● “Polygon object” : List of vertices of bounding polygon
● Defining a specific polygon may involve multiple parameterizations. (We can choose any vertex as starting point and then move on to the next points using any orientation)
● Convention: any starting point, clockwise orientation
● Why are the vertices from both t-1 and t-2 fed into the RNN?
○ To account for the orientation (two consecutive vertices fix the direction of travel along the polygon)
● Why is the initial point of the polygon fed into the RNN?
○ To decide when to close the polygon
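The clockwise convention can be illustrated with a small sketch. This is not the authors' code: the helper names `signed_area` and `to_clockwise` are hypothetical, and the sketch assumes image coordinates with x to the right and y pointing down, where a visually clockwise polygon has a positive shoelace area.

```python
def signed_area(vertices):
    """Shoelace formula over a closed polygon given as (x, y) tuples."""
    total = 0.0
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        total += x0 * y1 - x1 * y0
    return total / 2.0


def to_clockwise(vertices):
    """Return the same polygon in clockwise order, keeping the start vertex.

    Assumption: image coordinates (y grows downwards), so a positive
    shoelace area means the vertices already run clockwise on screen.
    """
    if signed_area(vertices) > 0:
        return list(vertices)
    # Reverse the traversal direction but keep the first vertex in place.
    return [vertices[0]] + list(reversed(vertices[1:]))


square_ccw = [(0, 0), (0, 1), (1, 1), (1, 0)]
print(to_clockwise(square_ccw))
```

With a fixed orientation, the only remaining freedom in the parameterization is the choice of starting vertex.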
CNN Module - CNN + Skip connects
● Based on the VGG16 architecture with the fully connected layers and the last max-pooling layer (pool5) removed
● Skip connections from several layers are each passed through a 3 x 3 convolutional layer + ReLU, brought to 28 x 28 resolution and stacked (concatenated)
● Output is downsampled by a factor of 16
● Why skip connections? - To extract both low-level features (edges and corners) and the semantics of the instance
● How to handle skip connections with different spatial dimensions?
- Bilinear upsampling after the additional convolution at conv5
- 2 x 2 max-pooling before the additional convolution at pool2
RNN Module for vertex prediction
● Aim of the RNN - capture history (previous edges) and predict the future (next edges / polygon)
● Gives coherent predictions in ambiguous cases (occlusion, shadows)
● Units: convolutional LSTMs - they operate in 2D, preserve spatial information from the CNN, and reduce the number of parameters
RNN Module for vertex prediction
● 2-layer ConvLSTM with 16 channels and 3 x 3 kernels
● Representation of output vertex - one-hot encoding of size D x D + 1
● The D x D entries represent the possible 2D grid positions of a vertex
● The additional entry denotes the end-of-sequence token (polygon is complete)
● At the input, apart from the CNN representation of the image, we have the one-hot encodings of the vertices at t-1 and t-2 along with the initial vertex
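A minimal sketch of the D x D + 1 output encoding described above. The function names are hypothetical, and D = 28 is an assumption chosen to match the 28 x 28 feature resolution mentioned on the CNN slide.

```python
D = 28  # assumed grid resolution, matching the 28 x 28 feature map


def encode_vertex(vertex, d=D):
    """One-hot list of length d*d + 1; vertex=None encodes end of sequence."""
    one_hot = [0] * (d * d + 1)
    if vertex is None:
        one_hot[d * d] = 1          # extra slot: polygon is complete
    else:
        row, col = vertex
        one_hot[row * d + col] = 1  # flatten the 2D grid position
    return one_hot


def decode_vertex(scores, d=D):
    """Pick the highest-scoring entry; return (row, col) or None for end of sequence."""
    index = max(range(len(scores)), key=scores.__getitem__)
    if index == d * d:
        return None
    return divmod(index, d)


print(decode_vertex(encode_vertex((3, 5))))
print(decode_vertex(encode_vertex(None)))
```

Treating the end-of-sequence token as one more class lets a single softmax decide both where the next vertex goes and whether the polygon should be closed.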
RNN Module for vertex prediction
● Prediction of starting point
- Reuse the CNN architecture with 2 additional layers (branches)
- The first branch predicts object boundaries
- The second branch takes the output of the first branch as well as the image features as inputs and predicts the vertices
Training Details
● Loss - cross entropy
● Smoothing of the target distribution (the target over the D x D + 1 grid is non-binary)
- To avoid over-penalising incorrect predictions that land near the target
- Assigns non-zero probability to locations within a distance of 2 from the target in the grid
● Optimizer - Adam
● Batch size - 8
● Learning rate - 10^-4, decayed by a factor of 10 every 10 epochs
● β1 = 0.9, β2 = 0.999 (Adam momentum constants)
● Logistic loss for the first-vertex prediction
● Ground truth for the object boundary branch - edges of the ground-truth polygon
● Ground truth for the vertex branch - vertices of the ground-truth polygon
● GPU - NVIDIA Titan X
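The target smoothing can be sketched as follows. The slides only state that cells within a distance of 2 of the target get non-zero probability; the use of chessboard distance, the uniform spread, and the `neighbour_mass` value of 0.3 are all assumptions made for illustration.

```python
import math


def smoothed_target(vertex, d=28, radius=2, neighbour_mass=0.3):
    """Soft target over d*d + 1 entries, centred on the ground-truth cell.

    Assumptions: neighbours are cells within chessboard distance `radius`,
    and they share `neighbour_mass` equally (both choices are hypothetical).
    """
    row, col = vertex
    neighbours = [
        (r, c)
        for r in range(max(0, row - radius), min(d, row + radius + 1))
        for c in range(max(0, col - radius), min(d, col + radius + 1))
        if (r, c) != (row, col)
    ]
    target = [0.0] * (d * d + 1)
    target[row * d + col] = 1.0 - neighbour_mass
    for r, c in neighbours:
        target[r * d + c] = neighbour_mass / len(neighbours)
    return target


def cross_entropy(target, probs, eps=1e-12):
    """Cross entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs) if t > 0)


target = smoothed_target((10, 10))
uniform = [1.0 / len(target)] * len(target)
print(round(cross_entropy(target, uniform), 4))
```

A prediction one cell away from the ground truth still overlaps with the soft target, so it is penalised less than a prediction on the other side of the grid.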
Implementational details
● How to choose the best vertex at each time step of the RNN? - Take the one with the highest log-probability
● How does correction of a vertex take place? - The annotator's corrected vertex is fed into the model at the next time step
● Inference time - 250 ms
● Polygon simplification
- Eliminate the middle vertex when 3 vertices lie on the same line, and the duplicate when 2 vertices fall in the same grid cell
● Data augmentation:
- Flip image crop and annotation at random
- Randomly increase context (10-20% of the bounding box)
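The polygon simplification step can be sketched like this. It is an illustrative version under stated assumptions, not the authors' implementation: vertices are integer grid cells, the polygon is closed implicitly, and `simplify_polygon` is a hypothetical name.

```python
def simplify_polygon(vertices):
    """Drop repeated grid cells and the middle vertex of collinear triples."""
    # 1. Remove consecutive vertices that fall in the same grid cell.
    deduped = []
    for v in vertices:
        if not deduped or v != deduped[-1]:
            deduped.append(v)
    if len(deduped) > 1 and deduped[0] == deduped[-1]:
        deduped.pop()  # the polygon is closed implicitly
    n = len(deduped)
    if n < 3:
        return deduped
    # 2. Keep a vertex only if the polygon actually turns there.
    kept = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = deduped[i - 1], deduped[i], deduped[(i + 1) % n]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross != 0:
            kept.append(deduped[i])
    return kept


raw = [(0, 0), (2, 0), (4, 0), (4, 4), (4, 4), (0, 4)]
print(simplify_polygon(raw))
```

Fewer vertices means fewer points for the annotator to inspect and correct, without changing the outlined region.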
Results
● Datasets: KITTI, Cityscapes
● Goals of the model:
- Polygon must be as accurate as possible
- Minimal number of clicks
● Yardsticks to gauge performance:
- Intersection over Union (IoU) measure
- Number of vertex corrections needed to predict the polygon
● Annotation of polygon done by in-house detector; bounding boxes are easy to obtain using AMT (Amazon Mechanical Turk)
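As a reminder of how the first yardstick is computed, here is a minimal IoU sketch over binary masks stored as nested lists. Rasterising the predicted polygon into a mask is assumed to have happened already, and returning 1.0 for two empty masks is a convention chosen here, not something the slides specify.

```python
def mask_iou(pred, truth):
    """Intersection over Union of two equally sized binary masks."""
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            intersection += bool(p and t)
            union += bool(p or t)
    # Assumed convention: two empty masks count as a perfect match.
    return intersection / union if union else 1.0


def mean_iou(pairs):
    """Average IoU over (prediction, ground truth) mask pairs."""
    return sum(mask_iou(p, t) for p, t in pairs) / len(pairs)


pred = [[1, 1, 0],
        [1, 1, 0]]
truth = [[0, 1, 1],
         [0, 1, 1]]
print(mask_iou(pred, truth))
```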
Results : Cityscapes
● What is in this dataset? - 27 cities, 2975 train images, 500 validation, 1525 test
● Issue faced - ground-truth instances are not available for the test set
● Solutions
- The 500 validation images are used as the test set
- The images from Weimar and Zurich (taken from the training set) form the validation set
● Labels - Person, Car, Rider, Truck, Bus, Train, Motorcycle and Bicycle
● Size of instances - 28 to 1792 pixels
● The dataset provides instance segmentation both as pixel labelling and as polygons
● New problem - polygons in Cityscapes also cover occluded portions of objects
● Solution - depth ordering to remove the occluded part (we want only the visible part)
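The depth-ordering idea can be sketched with masks stored as sets of pixel coordinates. This is only an illustration: it assumes the instances are already rasterised and sorted from nearest to farthest, and `visible_masks` is a hypothetical helper name.

```python
def visible_masks(masks_near_to_far):
    """Remove from each mask the pixels covered by any nearer instance.

    Assumption: each mask is a set of (row, col) pixels and the list is
    ordered from the nearest instance to the farthest one.
    """
    occupied = set()
    visible = []
    for mask in masks_near_to_far:
        visible.append(mask - occupied)  # keep only unoccluded pixels
        occupied |= mask
    return visible


car = {(0, 0), (0, 1)}
bus = {(0, 1), (0, 2)}  # partly hidden behind the car
print(visible_masks([car, bus]))
```

After this step an occluded object may break into several disconnected pieces, which is the multi-component case discussed next.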
Results : Cityscapes
● What do we do about objects with multiple components due to occlusion?
● The authors treat each component as a single object
● So what happens if the RNN keeps adding new vertices without reaching a termination?
Results : Evaluation Metric
● Intersection over Union (IoU): prediction vs ground truth (averaged over all instances)
● How to evaluate the human action (corrections of vertices)? - Simulate an annotator who corrects a vertex each time the predicted vertex deviates too far from the ground truth
● Testing game plan: first do a sanity check in PREDICTION mode (no interaction from annotators to correct), then evaluate the amount of human intervention needed
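The simulated annotator can be sketched as a loop that counts clicks. Everything here is a simplification for illustration: `predict_next` is a hypothetical stand-in for the model, predictions are assumed to line up one-to-one with ground-truth vertices, and the chessboard distance threshold follows the accuracy levels discussed on the annotators-in-the-loop slides.

```python
def chessboard_distance(a, b):
    """Chebyshev distance between two grid cells."""
    return max(abs(a[0] - b[0]), abs(a[1] - b[1]))


def simulate_annotator(predict_next, gt_vertices, threshold):
    """Count the corrections a simulated annotator would make.

    predict_next(polygon_so_far) is a hypothetical model call that returns
    the next vertex given the vertices accepted so far.
    """
    polygon, clicks = [], 0
    for gt in gt_vertices:
        predicted = predict_next(polygon)
        if chessboard_distance(predicted, gt) > threshold:
            predicted = gt  # simulated click: replace with the true vertex
            clicks += 1
        polygon.append(predicted)  # the corrected vertex feeds the next step
    return polygon, clicks


# Toy stand-in for the model: replays a fixed list of guesses.
guesses = [(0, 0), (5, 5), (9, 1)]
polygon, clicks = simulate_annotator(lambda poly: guesses[len(poly)],
                                     [(0, 1), (5, 9), (9, 1)], threshold=1)
print(polygon, clicks)
```

A smaller threshold forces more corrections and yields a polygon closer to the ground truth, which is the trade-off between clicks and IoU that the evaluation reports.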
Results : Baselines
● DeepMask: uses a CNN to output pixel labels, agnostic to class
● SharpMask: improves on the DeepMask idea by upsampling the output to obtain higher resolution
● Performance is reported based on ground-truth boxes
● Network structure: 50-layer ResNet architecture trained on the COCO dataset
● For DeepMask and SharpMask, the ResNet part is trained for 150 epochs
Results : Baselines
● SquareBox: the object is mapped to a bounding box (of reduced dimensions), with an individual box for each component of the object
● Dilation10: uses a semantic segmentation model; pixels mapped to the object class are grouped into instance masks
Results : Baselines
● Verdict
- Baselines are hard to correct
- Polygon-RNN has the best overall average and tops the charts in 6 / 8 categories
- Outperforms SharpMask in the Car, Rider and Person classes by 12%, 6% and 7% respectively
- Why is the previous point worth noting? - SharpMask uses a ResNet architecture, which is much more powerful than VGG
- Baselines have an advantage on larger objects like bus and train due to their higher output resolution
Results : Annotators in the loop
● How are the quality of annotation and the amount of human intervention quantified? - Number of mouse clicks needed to reach different levels of accuracy
● What do they mean by different “levels” of segmentation accuracy? - Thresholds on the chessboard distance between predicted and ground-truth vertices
● Also, show the resulting IoU to compare
● Methodology in a nutshell
- In the first method, pick 10 images per annotator and ask them to annotate freely without any cues or hints.
- In the second method, crop images and place blue markers on the
Results : Annotators in the loop
● Verdict
- Human annotator IoU: 69.5% in the free-viewing method and 78.6% for cropped images
- Indicates the need to collect multiple annotations to reduce variation and bias across annotators
Results : Annotators in the loop
● Comparison with GrabCut:
- 54 randomly chosen instances
● GrabCut stats: 42.2 s and 17.5 clicks per instance, 70.7% IoU
● Polygon-RNN stats: 5-9.6 clicks per instance, 77.6% IoU
● Verdict - Polygon-RNN is faster, as it needs fewer clicks for a comparable inference time
Results : Final Verdict
Advantages
● Polygon-RNN provides plausible annotations with relatively low latency
● Performance is good on smaller objects. This is visible in the performance over instances of varying sizes within the same dataset (Cityscapes) as well as between the 2 datasets (smaller objects in KITTI vs larger objects in Cityscapes)
● Competes well with SharpMask, which has a ResNet-based architecture
● Definitely reduces annotation cost for IoU comparable to human annotation
● Introduction of human intervention adds scope to avoid extremely bad
Results : Final Verdict
Disadvantages
● Lower output resolution and the associated quantization error show up in the segmentation of larger instances.
● Memory intensive - polygons have more vertices to predict than a single bounding box, which may add latency in return for more accuracy.
● Cannot exploit the Velodyne point clouds in the KITTI dataset as other methods do, which puts it at a disadvantage
Results : Final Verdict
Takeaways
● Tries to address both the speed and the accuracy of annotation
● Allowing human intervention keeps it from giving very bad results
● Performance is good for smaller objects but drops for larger instances
● Scope for future work: improving the output resolution and exploiting Velodyne point cloud data to address the performance issues on the KITTI dataset
OTHER REFERENCES
[1] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. ScribbleSup: Scribble-supervised convolutional networks for semantic segmentation. In CVPR, 2016.
[2] C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts. In SIGGRAPH, 2004.