Summary - Context-driven Object Detection and Segmentation with Auxiliary Information

Object detection and segmentation have wide application in computer vision and robotics. For object detection, our task is to infer a bounding box-based parametrization of an object hypothesis. We reviewed two broad groups of methods based on sliding windows and the Hough transform respectively. Most importantly, the availability of RGBD data allows depth information to be incorporated both in terms of feature engineering and model design. Our focus in this thesis is to build an object detection system with better context and occlusion reasoning made possible by the addition of depth data. In particular, due to the limited availability of RGBD data compared to RGB imagery, we are interested in the scenario where depth data are only available during model training.

For object segmentation, our task is to infer a pixelwise foreground object mask. We reviewed relevant methods with a focus on those based on MRFs. In addition, we discussed two main issues in MRF-based object segmentation: context modeling and inference. Because the focus of our work in this thesis is the glass object segmentation problem, we also discussed the related literature on that problem. Our work is among the first to leverage the additional depth data and the partial depth readings caused by irregular refractive properties of the glass surface. Also, to the best of our knowledge, we are the first to explore nonparametric label transfer for glass object segmentation.

Finally, we reviewed work on semi-supervised learning and boosting algorithms. We showed that boosting algorithms are an essential component of many object detection and segmentation systems. In addition, we revisited the MDBoost algorithm that directly optimizes the margin distribution. Its formulation gives us the flexibility to incorporate manifold regularization and to extend the algorithm to a semi-supervised learning scenario.

Despite the progress discussed in this chapter, many object detection and segmentation models have limitations when only partial information is available during either model training or testing. Three main issues remain, although auxiliary depth information offers a promising outlook for resolving them: partial object observation, incomplete and imperfect data modalities, and partial ground-truth annotation. A key problem here is depth-aware context modeling in the presence of occlusion and under varying levels of depth information availability. In this thesis, we are interested in utilizing auxiliary depth information to model the spatial context for localizing both generic and glass objects. In particular, glass objects exhibit large appearance variations, and depth information obtained with RGBD cameras can be noisy and incomplete near glass boundaries. In addition, it is important to incorporate unlabeled data for object detection and segmentation when precise and complete ground-truth annotations are expensive to obtain. This thesis proposes a series of context-driven object detection and segmentation approaches to address these issues.

Structured Hough Voting for Joint Object Detection and Occlusion Prediction

3.1 Introduction

Object detection remains a challenging task for cluttered or crowded scenes, such as indoor environments, where objects are frequently occluded by neighboring objects or the viewing window [53, 206]. The partial objects being observed usually provide limited information on the object position and pose, so many previous object detection approaches are prone to failure as they rely solely on image cues from the objects themselves.

It is widely acknowledged that contextual information plays an important role in detecting and localizing objects in such adverse conditions. Many context-aware object detection methods have been proposed recently [219, 201, 127, 16]. However, most existing contextual models focus on 2D spatial relationships between objects on the image plane, and fewer works have extended the modeling to 3D scenarios [8, 193]. One main difficulty in modeling 3D context has been the lack of accessible 3D data. With recent progress in consumer-level depth sensors (e.g., Kinect), however, it has become feasible to collect a large amount of high-quality depth and registered color images for indoor environments [77, 145].

Modeling context from a 3D perspective has several conceptual advantages over its 2D counterpart. First, spatial relationships have smaller variations and are easier to interpret semantically. Second, more spatial relationships in the physical world can be captured, instead of being limited to relative positions on the image plane. In particular, occlusion can be viewed as a special type of contextual relationship in 3D, which would become an intrinsic component of object and scene models. Finally, joint modeling of an object class and its 3D context may provide effective constraints on the object’s scope on the image plane and lead to a coarse-level object segmentation. See Figure 3.1 for an example.

Figure 3.1: Illustration of structured Hough voting. (a) RGB frame with object bounding box (red) and visible part bounding box (green). (b) Object centroid voting from multiple layers. (c) Combined object centroid voting results. (d) Detector output (red) with visibility pattern prediction (green). (e) Object visibility pattern prediction results. (f) Final segmentation results.

Our work aims to utilize RGBD datasets to learn a context-aware object detection model which encodes depth cues and a coarse level of 3D relationships. We focus on training a depth-dependent appearance model for each object class and its context. The learned depth-encoded object and context model is then applied to 2D images at test time, so it can be used to facilitate generic object detection [195].

Specifically, we propose a structured Hough voting method that incorporates depth-dependent contexts into a codebook-based object detection model. Our model generalizes the traditional Hough voting detection methods in three ways. First, we design a multi-layer representation of image context for indoor scenes that captures the layout structure of scenes. An image region contributes to each object hypothesis in a different manner based on its depth layer. Second, we define a new object hypothesis space in which both the object’s center and its visibility mask are predicted. Each image patch generates a weighted vote to a joint score of the object center and its support mask in the image. Finally, we view occlusion as special contextual information, which can provide cues for localizing objects and help with reasoning about the visibility of object parts. The overall output of our approach is a simultaneous object detection and coarse segmentation.
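To make the idea of layer-dependent voting concrete, the following is a minimal sketch rather than the model developed in this chapter. It assumes each image patch has already been matched to a single codebook entry and assigned to a depth layer, that each codebook entry stores one displacement to the object center, and that vote weights are indexed by (layer, codebook entry). The function name `layered_hough_votes` and all of its parameters are hypothetical.

```python
import numpy as np

def layered_hough_votes(patch_xy, patch_layer, patch_code, offsets, weights, image_shape):
    """Accumulate weighted object-center votes into a 2D score map.

    patch_xy:    (N, 2) integer patch positions as (x, y)
    patch_layer: (N,) depth-layer index of each patch (assumed given)
    patch_code:  (N,) codebook entry matched by each patch (assumed given)
    offsets:     (K, 2) integer displacement from a codebook entry to the center
    weights:     (L, K) vote weight for each (layer, codebook entry) pair
    image_shape: (height, width) of the voting space
    """
    height, width = image_shape
    score = np.zeros((height, width))

    # Each patch proposes a center by adding the offset of its codebook entry.
    centers = patch_xy + offsets[patch_code]
    # The vote weight depends on the depth layer the patch lies in.
    vote_w = weights[patch_layer, patch_code]

    # Discard votes that fall outside the image.
    inside = ((centers[:, 0] >= 0) & (centers[:, 0] < width) &
              (centers[:, 1] >= 0) & (centers[:, 1] < height))

    # np.add.at accumulates correctly when several patches vote for the same cell.
    np.add.at(score, (centers[inside, 1], centers[inside, 0]), vote_w[inside])
    return score
```

In this toy setting, the peak of the returned map is the most supported object center. Two patches with the same codebook entry but different depth layers contribute differently to it, which is the only property the sketch is meant to illustrate.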

Our detection and segmentation are achieved by maximizing the joint score of the object center and visibility mask. We derive an efficient alternating ascent method to search for modes of the Hough voting score maps. To learn the model from partially labeled RGBD data, we adopt an approximate learning procedure based on the max-margin Hough transform [129]. We evaluate our approach on two public RGBD datasets and demonstrate its efficiency.
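The alternating scheme can be illustrated with a toy objective. The sketch below is not the score function derived in this chapter. It assumes a simplified joint score in which each vote supports one candidate center and one cell of a coarse visibility mask, and in which every visible cell pays a fixed penalty. The names `joint_score` and `alternating_ascent`, the `penalty` parameter, and the mask-cell encoding are all hypothetical.

```python
import numpy as np

def joint_score(center, mask, vote_center, vote_cell, vote_weight, penalty):
    """Toy joint score: weight of agreeing votes in visible cells, minus a mask-size penalty."""
    agrees = np.all(vote_center == center, axis=1)
    return float(np.sum(vote_weight * agrees * mask[vote_cell]) - penalty * mask.sum())

def alternating_ascent(vote_center, vote_cell, vote_weight, n_cells, penalty=0.5, max_iter=20):
    """Alternate between the best center for a fixed mask and the best mask for a fixed center."""
    candidates = np.unique(vote_center, axis=0)
    center = candidates[0]
    mask = np.ones(n_cells)  # start by assuming every cell is visible

    for _ in range(max_iter):
        # Step 1: with the mask fixed, pick the candidate center with the highest score.
        scores = [joint_score(c, mask, vote_center, vote_cell, vote_weight, penalty)
                  for c in candidates]
        new_center = candidates[int(np.argmax(scores))]

        # Step 2: with the center fixed, a cell is marked visible only if the
        # agreeing votes it receives outweigh the per-cell penalty.
        agrees = np.all(vote_center == new_center, axis=1)
        support = np.bincount(vote_cell, weights=vote_weight * agrees, minlength=n_cells)
        new_mask = (support > penalty).astype(float)

        # Stop once neither block changes (a fixed point of the two updates).
        if np.array_equal(new_center, center) and np.array_equal(new_mask, mask):
            break
        center, mask = new_center, new_mask

    return center, mask
```

Each step maximizes the toy score over one block of variables while holding the other fixed, so the score never decreases from one step to the next. This is the property that makes alternating ascent a natural fit for a joint center-and-mask hypothesis space.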

are introduced in Section 3.2. Section 3.3 describes the inference procedure in our structured Hough voting, followed by the max-margin learning for model estimation. Details on experimental evaluation are reported in Section 3.4, and Section 3.5 summarizes this chapter.
