2.1.4 Occlusion reasoning for object detection
Most object detection methods introduced in the previous sections rely on one important assumption: the majority of images used for both training and testing include only fully visible views of an object. Partially visible objects receive no special handling and can therefore degrade both training and testing, because the algorithms can confuse the very different appearances of a fully visible object and a partially visible one. See Figure 2.3 for an example. In Figure 2.3 (c), the table in front of the chair looks very different from the chair seat and base that it occludes, yet it could still be predictive of the presence of a chair behind it, as such table-chair configurations are common in office scenes.
Figure 2.3: The frontal views of two visually similar chairs (cropped). For each chair, the original image is shown on the left ((a) and (c)), with the visualized HOG feature map [208] on the right ((b) and (d)). For the partially occluded chair, the seat and the base are occluded by a table in front. See text for details.
One possible way to deal with this partial visibility problem is to require all bounding box annotations to include only visible object parts, and to treat partially visible objects as separate subcategories using methods such as mixture models. For example, in the chair category we may have a dedicated subcategory that detects backrests. In fact, this simple strategy has proven effective in state-of-the-art object detection systems such as the DPM [46]. The downside, however, is that it requires more training data to cover all typical viewpoint variations. Conceptually, it is preferable to treat the backrests of the two chairs in Figure 2.3 as a single object part, and to build an object model that allows certain parts of an object to be occluded.
It should also be noted that the partial observation issue is more prevalent in indoor object detection problems. This is primarily due to two underlying facts that produce two typical partial observation scenarios. Firstly, due to the compact nature of indoor spaces, many objects have to be arranged close to each other. In particular, some objects are arranged in functional groups to facilitate human interactions. Examples include the typical configurations of table and chairs, and the various components of a desktop computer (e.g., a monitor, a keyboard and a mouse). We refer to this scenario, where one object blocks the view of another object, as occlusion. Secondly, the viewer (or camera) may be so close to the object that the object does not fit in the viewing window. This results in a partially visible object truncated by image boundaries. We refer to this case as truncation.
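The truncation case defined above can be made concrete with a small sketch. It assumes, hypothetically, that an annotation gives the full extent of the object in pixel coordinates, possibly extending beyond the image, which is not a convention stated in this section. The function names `visible_fraction` and `is_truncated` are illustrative only.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max) in pixels,
# where coordinates may fall outside the image for truncated objects.
Box = Tuple[float, float, float, float]


def visible_fraction(full_box: Box, image_width: float, image_height: float) -> float:
    """Fraction of the full-extent box area that lies inside the image."""
    x0, y0, x1, y1 = full_box
    full_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    if full_area == 0.0:
        return 0.0
    # Clip the box to the image rectangle [0, width] x [0, height].
    cx0, cy0 = max(x0, 0.0), max(y0, 0.0)
    cx1, cy1 = min(x1, image_width), min(y1, image_height)
    clipped_area = max(0.0, cx1 - cx0) * max(0.0, cy1 - cy0)
    return clipped_area / full_area


def is_truncated(full_box: Box, image_width: float, image_height: float,
                 tol: float = 1e-9) -> bool:
    """True if any part of the full-extent box falls outside the image."""
    return visible_fraction(full_box, image_width, image_height) < 1.0 - tol
```

A box that extends halfway past the left image border has a visible fraction of 0.5. Occlusion by another object cannot be computed this way, since it depends on scene content and not only on image geometry.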
The presence of occlusion and truncation makes object detection more challenging. For detectors that do not explicitly reason about occlusion and truncation, inconsistent part appearances or geometric distributions are likely to be mixed up with regular ones, resulting in much larger intraclass appearance variations. The models introduced in the previous sections could easily fail in the presence of occlusion, as features from the occluded parts contribute adversely to the score of object hypotheses. In this regard, explicit occlusion reasoning is necessary for objects that are frequently occluded.
Because of its prevalence in many real-world applications, occlusion has been well studied in the computer vision literature. One basic strategy is to allow object detectors to identify partial occlusion so that the occluder does not adversely affect the score of an object hypothesis. For the simple template matching based sliding window detector in Section 2.1.1, we can use the scores of individual HOG cells to infer occlusion [213]. For part-based models, Girshick et al. [59] use an occluder part in their grammar model when not all parts can be placed. Tang et al. [197] leverage the fact that occlusions often form characteristic patterns and extend the DPM for joint person detection and tracking. Wojek et al. [218] combine object and part detectors based on their expected visibility using a 3D scene model. Wu and Nevatia [220] maximize a joint likelihood that involves responses of multiple part detectors for multiple, partially occluded humans. Li et al. [111] present a method for detecting partially occluded cars based on And-Or models. Brox et al. [26] use a part-based poselet detector and align the corresponding part masks to image boundary cues. Another work that reasons about occlusion within bounding boxes for object detectors is [53]. It augments the bounding box representation with a set of latent variables to generate a binary occlusion pattern, and enforces consistency between the visibility patterns of multiple objects and their relative depth ordering. This is inspired by an earlier paper that uses structured output regression for detection with partial truncation [206]. To reduce noise in occlusion classifications, local coherency of regions is often enforced [50].
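The per-cell idea mentioned above for template detectors can be illustrated with a minimal sketch. It is not the procedure of [213]. It assumes a linear template whose score decomposes into a sum of per-cell dot products, and it assumes, purely for illustration, that cells scoring below a fixed threshold are treated as occluded. The names `cell_scores` and `masked_score` and the threshold value are hypothetical.

```python
import numpy as np


def cell_scores(weights: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Per-cell contributions of a linear template.

    Both inputs have shape (rows, cols, dims); the result has shape (rows, cols).
    """
    return np.sum(weights * features, axis=2)


def masked_score(weights: np.ndarray, features: np.ndarray,
                 bias: float = 0.0, threshold: float = 0.0):
    """Return the full score, the score over cells deemed visible, and the mask."""
    scores = cell_scores(weights, features)
    # Illustrative assumption: a low per-cell score signals a likely occluder.
    visible = scores >= threshold
    total = float(scores.sum() + bias)
    masked = float(scores[visible].sum() + bias)
    return total, masked, visible
```

With this decomposition, cells covered by an occluder no longer pull down the hypothesis score. A practical system would also need spatial coherence of the mask, as noted above, because thresholding cells independently is noisy.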
One common feature of the papers mentioned above is that they mainly model occlusion without complex reasoning about the underlying 3D scene. This is partly because depth data has not been easily accessible, which makes it difficult to study the real 3D configuration of objects in a scene.
Figure 2.4: Examples of indoor scenes. Note how objects are occluded or truncated by image boundaries. Groups of objects are also arranged together to facilitate human interactions.
Recently, with accessible 3D data collected from affordable RGBD sensors, there has been an increasing amount of work on occlusion reasoning in 3D. For example, Meger et al. [136] use depth inconsistency from 3D sensor data to classify occlusions. Pepik et al. [157] leverage fine-grained 3D annotated urban street scenes to mine distinctive, reoccurring occlusion patterns. Detectors based on DPM with explicit occluder parts are then trained for each of these patterns. Zia et al. [241] model occlusions on a 3D geometric object class model by enumerating a finite number of occlusion patterns. Hsiao and Hebert [74] explicitly model occlusions by reasoning about 3D interactions of objects. These works reason about the 3D geometric configurations of parts, objects and cameras, which helps to explain occlusions more naturally. In addition, Bonde et al. [20] address object instance recognition in clutter, learning discriminative 3D shape features for individual object instances. Similarly, Tejani et al. [198] propose a latent-class Hough forest in which the class distributions at leaf nodes are treated as latent variables. Unlike our work, their method focuses on 3D pose estimation where a dense 3D model of each object instance is needed.
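As a rough illustration of what a depth inconsistency cue can look like, the sketch below compares an observed depth map with the depth expected under an object hypothesis. It is not the method of Meger et al. [136]. It assumes that an expected depth map for the hypothesized object is available (for example, rendered from a 3D model), that depths share one unit, and that a zero or non-finite observed depth means a missing sensor reading. The names `occlusion_mask` and `occluded_fraction` and the margin value are hypothetical.

```python
import numpy as np


def occlusion_mask(observed_depth, expected_depth, margin: float = 0.1) -> np.ndarray:
    """Mark pixels where the sensor sees something clearly in front of the object."""
    observed = np.asarray(observed_depth, dtype=float)
    expected = np.asarray(expected_depth, dtype=float)
    # Ignore missing readings and pixels outside the object footprint.
    valid = np.isfinite(observed) & np.isfinite(expected) & (observed > 0)
    return valid & (observed < expected - margin)


def occluded_fraction(observed_depth, expected_depth, margin: float = 0.1) -> float:
    """Share of the object footprint flagged as occluded.

    Pixels with missing observed depth count as not occluded.
    """
    expected = np.asarray(expected_depth, dtype=float)
    footprint = int(np.isfinite(expected).sum())
    if footprint == 0:
        return 0.0
    mask = occlusion_mask(observed_depth, expected_depth, margin)
    return float(mask.sum()) / footprint
```

The margin absorbs sensor noise and small model errors. Setting it too low flags parts of the object itself as occluders, while setting it too high misses occluders that sit close to the object surface.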
Despite this progress, 3D occlusion reasoning in general is less studied because suitable data is less available. As discussed in Section 2.1.3, although a few large RGBD datasets are publicly available, most imagery available and being created today consists of color images only. Therefore, one key issue is to train a better occlusion-aware object detector with auxiliary depth information and apply it to a test scenario without depth. Another issue is to integrate depth-aware occlusion reasoning into a coherent object detection framework. In Chapter 3, we present an object detection system that aims to resolve these issues.