Segmentation - Visual Perception For Robotic Spatial Understanding

Segmentation is a well-known preprocessing step for many algorithms in computer vision [67], [123], [189], [191], [198] and robotics [2], [34], [51], [94], [226]. However, as discussed in the first section, it is extremely difficult to obtain an automatic segmentation that captures what we want [16], [138]. Therefore, we tend to generate more segments than objects, i.e., over-segment the scene, and then design algorithms to group these segments back together for a particular application. These segments are often called superpixels, since they tend to group many pixels of similar color, texture, normal, and/or depth together. Superpixels help reduce the complexity of scene parsing and object recognition algorithms by reducing the number of pairwise similarities or classification evaluations that must be considered [145], [175].

Figure 2.8 Levinshtein et al. develop the TurboPixels oversegmentation algorithm (image (a) above). TurboPixels is a form of superpixel algorithm, used to group individual pixels into larger (and easier to compute with) groups. Normalized cuts [187] output is shown in image (b). Image reproduced from [124], ©2009 IEEE.

There are multiple well-known methods that automatically compute superpixel-like segmentations (see Fig. 2.8). Normalized cuts [188], a spectral graph-based segmentation method that generates partitions that approximately maximize the similarity within a segment and minimize the similarity between segments, is a common choice for generating superpixels. The method by Felzenszwalb and Huttenlocher [55] (FH) is also used by many algorithms due to its comparatively low computational complexity. TurboPixels [124], a method due to Levinshtein et al., utilizes geometric flow from uniformly distributed seeds in order to maintain compactness and preserve boundaries. Simple linear iterative clustering (SLIC) [1] is an extension of k-means for fast superpixel segmentation. Like TurboPixels, it is initialized with uniformly distributed seeds, which are then associated with pixels and updated in a manner similar to k-means clustering.
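The seed-then-update loop described for SLIC can be illustrated with a minimal sketch. This is not the reference implementation from [1]: it runs plain k-means over color-plus-position features and omits the local search window that makes SLIC fast. The function name `slic_like` and its parameters (`n_segments`, `compactness`, `n_iters`) are assumptions chosen for illustration.

```python
import numpy as np


def slic_like(image, n_segments=16, compactness=10.0, n_iters=5):
    """Simplified SLIC-style clustering of an (H, W, C) image.

    Assumption: every pixel is compared against every center, which is
    only practical for small images; real SLIC restricts the search to a
    window around each center.
    """
    image = np.asarray(image, dtype=float)
    h, w, c = image.shape
    step = max(int(np.sqrt(h * w / n_segments)), 1)  # seed grid interval

    # Uniformly distributed seeds on a regular grid.
    ys = np.arange(step // 2, h, step)
    xs = np.arange(step // 2, w, step)
    seed_y, seed_x = [a.ravel() for a in np.meshgrid(ys, xs, indexing="ij")]

    # Per-pixel feature: color channels plus scaled (y, x) position.
    # A larger compactness weights position more, giving rounder segments.
    yy, xx = np.mgrid[0:h, 0:w]
    scale = compactness / step
    feats = np.column_stack(
        [image.reshape(-1, c), yy.ravel() * scale, xx.ravel() * scale]
    )
    centers = feats[seed_y * w + seed_x].copy()

    for _ in range(n_iters):
        # Association step: assign each pixel to its nearest center.
        dists = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = feats[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    return labels.reshape(h, w)
```

Lowering `compactness` lets segments follow color boundaries more closely, at the cost of less regular shapes.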

While the previous methods originally operated on images, there have been extensions to point clouds, including the algorithm we extend in this paper. Depth-adaptive superpixels (DASP) [210] is an algorithm for generating superpixels on RGB-D image pairs where each segment covers roughly the same surface area, independent of the distance from the camera. A more recent paper directly modifies the SLIC algorithm to include depth in order to reduce undersegmentation error over depth discontinuities [147].

Figure 2.9 Hu et al. apply a fixed volumetric discretization to generate an efficient over-segmentation in streaming point clouds. Their approach focuses on producing usable segments for their semantic labeling algorithm. Image reproduced from [97].

Incremental segment generation has primarily been applied in the video segmentation and analysis realm. Segmentation of video [25], [65], [76], [224], [225] comes in many forms, but the main focus is foreground/background segmentation, moving object tracking (often non-rigid), and multi-class semantic labeling. The common thread in these approaches is to generate segmentations that are consistent across the video frames; i.e., if a pixel belonged to one class at time T1, then the corresponding pixel (if still visible) at time Tn belongs to the same class. While these algorithms aim for the same consistency constraints, they are only able to utilize images, and therefore their techniques are necessarily more time-consuming. On the other hand, they try to propagate and constrain higher-level features within both static and dynamic scenes, while we explicitly take advantage of estimated camera motion and leave the semantic processing to a higher-level component.

Figure 2.10 Finman et al. apply a modified version of Felzenszwalb and Huttenlocher's efficient graph-based segmentation to slices of a TSDF model volume, then merge the segments as additional slices are output. Image reproduced from [59], ©2014 IEEE.

Other research related to segmentation in streaming point clouds includes a simultaneous localization and mapping (SLAM) algorithm for outdoor multi-line laser scanners that directly incorporates segmentation and moving object detection with spatial and temporal constraints [226]. In a similar vein (although not integrated with SLAM), Hu et al. [97] apply a fixed volumetric segmentation scheme and a 2.5D spatial data structure to aid in incremental scene classification in streaming point clouds. In that work, they are less concerned with the segmentation being proper (i.e., segments can overlap and do not directly correspond to a distance or similarity function) and more concerned with efficient access to, and subdivision of, the cloud in order to apply their scene labeling algorithm. Henry et al. [86] also apply FH segmentation to point clouds during ego-motion estimation and model building; however, their goal is to use the segments as swappable and manipulable components for the purpose of generating a graph for loop closing and GPU memory management.
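As a rough illustration of what a fixed volumetric discretization means for streaming data, the sketch below assigns each incoming point to the fixed-size voxel that contains it and keeps voxel labels stable across frames. It is not the method of Hu et al. [97], whose segments may overlap and which also relies on a 2.5D structure. The class name `VoxelSegmenter`, the `voxel_size` parameter, and the assumption that points arrive already expressed in a common world frame are all hypothetical.

```python
import numpy as np


class VoxelSegmenter:
    """Label points by the fixed-size voxel they fall in (illustrative only)."""

    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.ids = {}  # voxel index (i, j, k) -> persistent segment label

    def label(self, points):
        """Return one integer label per point in an (N, 3) array.

        Assumes points are already in a common world frame, so a voxel
        seen in an earlier frame receives the same label again.
        """
        pts = np.asarray(points, dtype=float)
        keys = np.floor(pts / self.voxel_size).astype(int)
        # A new voxel receives the next unused label.
        return np.array(
            [self.ids.setdefault(tuple(k), len(self.ids)) for k in keys.tolist()]
        )
```

Because the partition depends only on point coordinates, each new frame can be labeled in a single pass without revisiting earlier points.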

Finman et al. [59] apply a modified version of FH segmentation to slices of a model output by the latest Kintinuous ego-motion estimation and model generation algorithm. In this case, non-overlapping segments of the world are generated by the modeling algorithm and then segmented using the FH graph-based method. Segments are merged and recomputed as necessary according to the segment border set and the computed segment thresholds using an iterative voting scheme. The main difference in our work is our goal of consistent over-segmentation and the capability of handling overlapping clouds without a TSDF volume.
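For readers unfamiliar with the FH graph-based method, its merging rule can be sketched as follows. This is the standard criterion as commonly described for [55], not the modified variant of Finman et al. nor the algorithm developed in this work. The function name `fh_segment`, the edge-list input format, and the scale parameter `k` are assumptions for illustration.

```python
def fh_segment(n_nodes, edges, k=1.0):
    """Felzenszwalb-Huttenlocher style merging on a weighted graph.

    edges: iterable of (weight, a, b) tuples with node indices a and b.
    Returns one component representative per node.
    """
    parent = list(range(n_nodes))
    size = [1] * n_nodes
    internal = [0.0] * n_nodes  # largest edge weight merged into a component

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Process edges from most to least similar (ascending weight).
    for w, a, b in sorted(edges):
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        # Merge only if the edge is no heavier than either component's
        # internal difference plus a size-dependent tolerance k / |C|.
        if w <= min(internal[ra] + k / size[ra], internal[rb] + k / size[rb]):
            if size[ra] < size[rb]:
                ra, rb = rb, ra
            parent[rb] = ra
            size[ra] += size[rb]
            internal[ra] = w  # edges are sorted, so w is the new maximum
    return [find(i) for i in range(n_nodes)]
```

A larger `k` raises the tolerance for small components and therefore produces fewer, larger segments.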
