IMP: Instance Mask Projection - IMP: INSTANCE MASK PROJECTION FOR HIGH ACCURACY

CHAPTER 5: IMP: INSTANCE MASK PROJECTION FOR HIGH ACCURACY

5.1.1 IMP: Instance Mask Projection

The Instance Mask Projection (IMP) operator projects the segmentation masks from an instance mask prediction, each defined on a detection bounding box, onto a canvas defined over the whole image. This canvas is then used as an input feature layer for semantic segmentation.¹

Each predicted instance mask has a Class, a Score, a BBox location, and an h × w Mask.² First, the score for each pixel in the Mask is scaled by the object Score for the Class. Then, locations in the canvas layer for the Class are sampled from the scaled mask. Note that the canvas is updated only if the scaled mask value is larger than the current canvas value. This is illustrated in Figure 5.1, where a “dress” is detected by Mask R-CNN and then projected onto the canvas at its detected BBox location. The projected layer shows the low-resolution instance mask, which predicts an outline of the dress, while the next step of semantic segmentation uses some of the FPN feature layers as well as the canvas as features to produce a more accurate parse.

¹ The resolution of the canvas can be chosen according to the feature layer it is attached to.

² The resolution of the Mask is 28 × 28.
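The projection step described above can be sketched for a single instance in plain Python. This is a minimal sketch under stated assumptions, not the thesis implementation: the function name `project_instance`, the nested-list canvas layout, the integer BBox convention with exclusive end coordinates, and the nearest-neighbour sampling are all illustrative choices that the text does not specify.

```python
def project_instance(canvas, cls, score, bbox, mask):
    """Project one instance mask onto canvas[cls] with a max update.

    canvas: list of D channels, each an H x W list of floats (modified in place)
    bbox:   (x0, y0, x1, y1) in canvas pixels, end coordinates exclusive (assumed)
    mask:   h x w list of per-pixel mask scores
    """
    x0, y0, x1, y1 = bbox
    mh, mw = len(mask), len(mask[0])
    H, W = len(canvas[cls]), len(canvas[cls][0])
    bw, bh = x1 - x0, y1 - y0
    if bw <= 0 or bh <= 0:
        return  # degenerate box, nothing to project
    for y in range(max(y0, 0), min(y1, H)):
        # Nearest-neighbour lookup of the mask row that covers canvas row y.
        my = min((y - y0) * mh // bh, mh - 1)
        for x in range(max(x0, 0), min(x1, W)):
            mx = min((x - x0) * mw // bw, mw - 1)
            v = mask[my][mx] * score  # scale the mask value by the object Score
            if v > canvas[cls][y][x]:  # update only if larger than current value
                canvas[cls][y][x] = v
```

Calling the function once per detection, in any order, leaves each canvas pixel holding the largest scaled mask value among the instances of that class that cover it.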

Algorithm 1: CUDA implementation of the forward pass of IMP

Input: (C, P, M, B): Mask R-CNN results, Class C:[N], Probability P:[N], Mask M:[N,28,28], and BBox B:[N], where N is the number of detections

Output: (F): the projected feature map, denoted by F:[D,H,W], where D is the Class dimension, and H, W are the height and width of the feature map

Function IMP(C, P, M, B):
    for cell c_i ∈ Mask:[N,28,28] do in parallel
        n_i, maskh_i, maskw_i ← DecodeIndexes(c_i);
        v_i = M[n_i, maskh_i, maskw_i] ∗ P[n_i];
        x_min, y_min, x_max, y_max ← ProjectRegion(B[n_i], maskh_i, maskw_i);
        foreach pixel p_j ∈ F[C[n_i], y_min:y_max, x_min:x_max] do
            p_j ← atomicMax(p_j, v_i);
    return F;

The IMP operator can be implemented efficiently using custom CUDA kernels; see Algorithm 1. The input parameters are the instance segmentation results: Class C:[N], Probability P:[N], Mask M:[N,28,28], and BBox B:[N], where N is the number of masks. For each cell c_i in Mask M, the kernel first identifies its indices in the Mask using the DecodeIndexes function and then obtains the projected value v_i by multiplying the cell's value by the probability P[n_i]. The projected region x_min, y_min, x_max, y_max is calculated from the BBox location B[n_i] and the cell's indexes maskh_i, maskw_i in the Mask. In the projected region F[C[n_i], y_min:y_max, x_min:x_max], we use the atomicMax operation to update the value of each pixel. Each cell runs concurrently in the CUDA kernel, and the atomicMax operation guarantees that only the maximum value is kept when multiple cells project to the same pixel.
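The per-cell formulation of Algorithm 1 can be mirrored serially in plain Python to make the data flow concrete. This is only a sketch under assumptions: the text does not define DecodeIndexes or ProjectRegion, so the row-major index layout, the floor/ceil rounding of the projected region, and the names `decode_indexes`, `project_region`, and `imp_forward` are hypothetical. The outer loop stands in for the parallel CUDA threads, and Python's `max` stands in for atomicMax.

```python
import math

MASK_SIZE = 28  # mask resolution used in Algorithm 1


def decode_indexes(cell, size=MASK_SIZE):
    """Assumed row-major layout: cell = (n * size + h) * size + w."""
    n, rem = divmod(cell, size * size)
    h, w = divmod(rem, size)
    return n, h, w


def project_region(bbox, h, w, size=MASK_SIZE):
    """Canvas rectangle covered by mask cell (h, w); end indices are exclusive."""
    x0, y0, x1, y1 = bbox
    cell_w = (x1 - x0) / size
    cell_h = (y1 - y0) / size
    x_min = math.floor(x0 + w * cell_w)
    x_max = math.ceil(x0 + (w + 1) * cell_w)
    y_min = math.floor(y0 + h * cell_h)
    y_max = math.ceil(y0 + (h + 1) * cell_h)
    return x_min, y_min, x_max, y_max


def imp_forward(C, P, M, B, D, H, W, size=MASK_SIZE):
    """Serial stand-in for the kernel: returns the canvas F as [D][H][W] lists."""
    F = [[[0.0] * W for _ in range(H)] for _ in range(D)]
    for cell in range(len(C) * size * size):  # one CUDA thread per cell originally
        n, h, w = decode_indexes(cell, size)
        v = M[n][h][w] * P[n]
        x_min, y_min, x_max, y_max = project_region(B[n], h, w, size)
        # Clip the region to the canvas before writing.
        for y in range(max(y_min, 0), min(y_max, H)):
            for x in range(max(x_min, 0), min(x_max, W)):
                F[C[n]][y][x] = max(F[C[n]][y][x], v)  # stands in for atomicMax
    return F
```

Because the update is a maximum, the result does not depend on the order in which cells are processed, which is what allows the cells to run concurrently in the kernel.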

We concatenate the IMP canvas with the feature layer(s) (either P2 or P2-5) to let the network use it as a strong prior for object location, allowing the semantic segmentation part of the model to focus on making improvements to the instance predictions during learning.
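As a rough illustration of this concatenation, the sketch below joins feature maps along the channel axis in plain Python. The helper name `concat_channels` and the nested-list `[channels][H][W]` layout are hypothetical; a real model would use its framework's tensor concatenation instead.

```python
def concat_channels(*feature_maps):
    """Concatenate [channels][H][W] maps along the channel axis."""
    H = len(feature_maps[0][0])
    W = len(feature_maps[0][0][0])
    out = []
    for fmap in feature_maps:
        for channel in fmap:
            # All inputs must share the same spatial resolution.
            if len(channel) != H or len(channel[0]) != W:
                raise ValueError("all maps must share the same spatial size")
            out.append(channel)
    return out
```

For example, concatenating a C-channel canvas with a 256-channel P2 layer at the same resolution yields a feature map with C + 256 channels.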

[Figure 5.2 diagram, four panels: (a) Mask R-CNN-IMP, (b) Panoptic-P2, (c) Panoptic-P2-IMP, (d) Panoptic-FPN. In every panel, FPN + Mask R-CNN produces the feature layers P2: 1x256x(1/4), P3: 1x256x(1/8), P4: 1x256x(1/16), P5: 1x256x(1/32) and the instance detections (BBox, Class, Score, and Instance Mask of size 1xNx28x28). Where present, the Instance Mask Projection output is 1xCx(1/4) and the Semantic Segmentation Head is 4x conv3 + 1x conv1 + 4x up.]

Figure 5.2: Variants of the models we used in the experiments. (a) Mask R-CNN-IMP uses the IMP to generate the semantic segmentation prediction directly, without any learnable parameters. (b) Panoptic-P2 uses the P2 layer in FPN to generate the semantic segmentation, which is the minimal way to add semantic segmentation to the FPN architecture. (c) Panoptic-P2-IMP demonstrates how to apply IMP to Panoptic-P2. (d) Panoptic-FPN combines the feature layers {P2, P3, P4, P5} for semantic segmentation. See Figure 5.3 for Panoptic-FPN-IMP.

[Figure 5.3 diagram: FPN + Mask R-CNN (Instance Detection) produces P2: 1x256x(1/4), P3: 1x256x(1/8), P4: 1x256x(1/16), P5: 1x256x(1/32) and the instance detections (BBox, Class, Score, and Instance Mask of size 1xNx28x28). The Instance Mask Projection output, 1xCx(1/4), is concatenated with the Semantic Segmentation Module output (P2 to P5, each transformed to 1x128x(1/4) and summed) and passed to the Semantic Segmentation Head (4x conv3 + 1x conv1 + 4x up) to give the Semantic Segmentation Result.]

Figure 5.3: Architecture of Panoptic-FPN-IMP. Our full model contains four parts. The first part is FPN + Mask R-CNN, which is used for object/instance detection. The Instance Mask Projection Module takes the output of instance detection and generates the feature layer (1xCx1/4). For the Semantic Segmentation Module, we adopt Panoptic FPN [5], which upsamples and transforms {P2, P3, P4, P5} to 1x128x1/4 and sums them. We then concatenate the results of the Instance Mask Projection and the Semantic Segmentation Module and forward them to the semantic segmentation prediction head. See Figure 5.2 for the other models.
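The tensor shapes named in the Figure 5.3 caption can be checked with a little shape arithmetic. The sketch below is illustrative only: the function name `panoptic_fpn_imp_shapes`, the (batch, channels, height, width) ordering, and the use of integer division for the 1/4 to 1/32 scales are assumptions, and `num_classes` plays the role of C.

```python
def panoptic_fpn_imp_shapes(image_h, image_w, num_classes):
    """Return (batch, channels, height, width) shapes along the Figure 5.3 path."""
    strides = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}
    # FPN outputs: 256 channels at 1/4, 1/8, 1/16 and 1/32 resolution.
    shapes = {
        name: (1, 256, image_h // s, image_w // s) for name, s in strides.items()
    }
    q_h, q_w = image_h // 4, image_w // 4
    # Each pyramid level is transformed to 128 channels at 1/4 scale, then summed.
    shapes["summed"] = (1, 128, q_h, q_w)
    # IMP canvas: one channel per class at 1/4 scale.
    shapes["canvas"] = (1, num_classes, q_h, q_w)
    # Channel-wise concatenation feeds the semantic segmentation head.
    shapes["head_input"] = (1, 128 + num_classes, q_h, q_w)
    return shapes
```

With a hypothetical 512 x 512 input and 13 classes, the head would receive a 141-channel feature map at 128 x 128 resolution.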
