• No results found

Object Segment Proposal Generation

2.2 Early Stage Object Proposal Generation

2.2.2 Object Segment Proposal Generation

Compared to bounding box proposals, object segment proposals are more informa- tive, for providing better object localization and boundary alignment. Also, segment proposals can be easily transformed into bounding box proposals, simply by taking the tight bounding box enclosing the segment. However, generating segment propos- als is more challenging than producing bounding box proposals. Early segment pro- posal methods generally take a perceptual grouping paradigm, where homogeneous perceptual pixels are grouped to form a segment candidate. The pixel grouping can

§2.2 Early Stage Object Proposal Generation 21

Figure 2.5: The pipeline of multiscale combinatorial grouping (MCG) [5].

be achieved by solving multiple segmentations with diverse seeds or by merging superpixels in multiple over-segmentations.

CPMC [31] places multiple seeds on an image grid and solves a constrained para- metric min-cut problem for each seed to compute figure-ground object hypotheses. To reduce the redundancy of the proposals, they remove a large part of propos- als through maximum relevance ranking and diversification. This approach outper- forms significantly previous low-level segmentation methods, but solving a sequence of CPMC problems is quite inefficient.

Endres[66] generates a segmentation hierarchy from occlusion boundaries and itera-

tively performs graph-cut on superpixel affinity graphs with a variety of parameters to produce segments. These initial segments are then ranked based on multiple cues. This method achieves good boundary alignment but it is time-consuming.

SelectiveSearch [40] iteratively groups superpixels into a segmentation hierarchy

and selects the segments from the hierarchy as object proposals. In order to diver- sify the pool of segments, they generate multiple segmentation hierarchies using a variety of grouping criteria and complementary colour spaces. Because of its good performance in terms of recall and quality, it has been broadly used in many top- performing object detection algorithms as the proposal method. But this method lacks an effective way of estimating proposal importance.

RandomizedPrim’s [67] can be seen as a modified version of SelectiveSearch. It

further introduces a randomised superpixel merging process which learns superpixel similarity measures to diversify the grouping results. However, it cannot attain high object recalls.

GOP [42] begins with a segmentation hierarchy and places multiple seeds for a geodesic distance transform on the image by learned classifiers. For each seed they generate foreground and background masks, for which they then compute a signed geodesic distance transform over the image. Finally, they extract a set of object pro-

22 Literature Review

posals by identifying certain level sets of the distance transforms. This method, however, is unable to assign scores to proposals.

Rantalankila [68] combines the superpixel merging with graph cut segmentations.

They first generate a part set of proposals by taking regions from a segmentation hierarchy and then perform CPMC [31] on an intermediate level of the hierarchy to obtain a complementary set of proposals. The results, however, are inferior to state-of-the-art.

Rigor[69] is similar to CPMC [31] but speeds it by pre-computing a graph for solving

multiple grape cut problems. But the higher efficiency is obtained at the cost of lower object recalls.

LGO [70] shares the same spirit of SelectiveSearch [40], except that they use differ- ent features and train a Random Forest to guide the grouping process. This brings better object recalls, indicating learning a grouping process is beneficial to proposal generation.

MCG [5] generates multi-scale segmentation hierarchies based on [71] and takes a subset of singletons, pairs, triplets and 4-tuples from the hierarchies as object pro- posals. MCG shows high performance in terms of recall and quality and is widely used as the proposal method in many instance segmentation systems. However, this algorithm runs slowly. The overview is shown in Figure 2.5.

Lee[72] proposes a parametric energy function for structured multiple output learn- ing and combines multiple mid-level cues to yield segment proposals by grouping superpixels under this framework. But the performance of this method is inferior to the state-of-the-art in terms of object recall rate and object proposal accuracy.

Wang [73] follows the similar paradigm to SelectiveSearch, but learns complemen- tary superpixel merging strategies so that errors generated in one merging strategy can be corrected by the others. Results show that additional learned grouping pro- cess enables this method to achieve better object recalls.

LPO[41] trains an ensemble of complementary figure-ground segmentation models and performs bottom-up segmentation by applying each model on the image to pro- duce segment proposals. This algorithm can recall more small objects at the cost of complex learning process.

Bleyer [74] designs an iterative labeling strategy to segment object proposals from

stereo images. It makes use of depth information to identify the extent of objects. It can work with stereo images, however, it is computationally expensive.

Similar to bounding box proposal methods, except the work by Bleyer et al. [74], early segment proposal approaches also rarely exploit the geometric cue and se- mantic information to further improve the quality of the proposals. Besides, most superpixel grouping methods mainly rely on low-level image features to compute the affinity between superpixels, which may lead to inaccuracy. Using color, size

§2.3 3D Reconstruction 23

and edge etc. low-level features to represent a superpixel tends to bring ambigu- ity into the perceptual grouping. For example, an image of a cat with a colourful coat may be segmented into many meaningless parts just according to color infor- mation and these parts are hardly to regroup into a cat just using low-level image features. On the other hand, high-level semantic information provides more stable cue to merge superpixels into a meaningful object. Although the parts of the coat of the cat have different colours, they all belong to this cat. Therefore, incorporat- ing high-level semantic features into the descriptor of a superpixel will benefit the grouping process. In addition, the early superpixel grouping methods seldom di- rectly learn the similarity measures of superpixels from the data. Instead, most of them use pre-defined similarity metrics, such as boundary strengths, to guide the grouping process, which lacks global context support and may be ineffective. We believe that a learning method for computing the similarity will be more effective, as it can better adapt to the data distribution. From these arguments, we propose a deep learning-based segment proposal approach that learns to group neighboring image regions into meaningful objects for stereo images in our second work.

2.3

3D Reconstruction

To extract geometric cues for an object in stereo images, we need first reconstruct the scene from stereo images. As this is not our focus, we briefly introduce the methods used in our work to estimate the disparity map and simply describe the calculation of converting the disparity map into the point cloud.