2.2 Object segmentation in computer vision
2.2.4 Glass object segmentation
So far we have discussed generic foreground object segmentation with a focus on related work based on MRFs. In this thesis, we are particularly interested in the glass object segmentation problem. We make this choice because glass objects play an important role in daily human activities and are commonly found in indoor environments such as homes, offices and laboratories. Therefore, it is essential for a visual recognition system to be able to localize them.
Figure 2.8: Example RGBD image pairs (color frames and depth frames) containing glass objects. Note the distinctive but irregular missing patterns in and around glass regions. See text for details.
Despite the progress in generic object segmentation, the segmentation of glass objects remains a particularly challenging problem in scene understanding [75, 137]. The main difficulty in detecting glass objects lies in the semi-transparent nature of glass surfaces, which results in very large appearance variations depending on the background. As a result, there is a lack of locally discriminative visual features to capture the appearance variations at glass regions and boundaries [135, 51]. For example, visual cues commonly used for image labeling, such as color and texture, are less effective due to the changing background. In fact, a glass surface can be seen as an overlay on the background, so relative features that identify the difference between two image regions may better help localize glass boundaries. In addition, glass objects are usually made for a specific use and can come in very different and irregular shapes. It is therefore difficult to assume shape templates for glass objects.
In this thesis, we are interested in pixelwise segmentation for semi-transparent objects (including not only glass but also some plastic objects, for example), and we use the terms glass objects and semi-transparent objects interchangeably. In particular, we are interested in making use of RGBD data to localize glass objects. See Figure 2.8 for example RGBD image pairs containing glass objects. Note how the appearances of glass objects in color images are affected by background clutter, and the various overlay effects in glass regions such as blurring, texture distortion, and saturation changes. In addition, notice the distinctive but irregular missing patterns (shown in white) in depth images resulting from attenuation of structured light signals passing through glass. Although these patterns may roughly indicate the presence of glass, a missing pattern can be either dilated or eroded depending on local refractive properties. Moreover, these patterns can spatially overlap with missing patterns that have other causes, such as occlusion boundaries. These missing patterns can be a nuisance for RGBD imaging but, as we will show in our work, can also be used as an effective feature for glass object segmentation. In this section, we review related work on glass object detection, segmentation and pose estimation.
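To make the missing-depth cue described above more concrete, the following is a minimal sketch, not the feature developed later in this thesis. It assumes that missing depth readings are encoded as 0 (a common convention, but an assumption here), and the helper name `missing_depth_ratio` and the window size are hypothetical. For every pixel it computes the fraction of missing readings in a local window, which is one simple way to turn an irregular missing pattern into a soft feature.

```python
import numpy as np

def missing_depth_ratio(depth, window=5, missing_value=0):
    """Fraction of missing depth readings in a window x window neighbourhood.

    Assumption: missing readings are stored as `missing_value` (0 by default).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    missing = (depth == missing_value).astype(np.float64)
    pad = window // 2
    # Replicate the border so windows near the image edge stay well defined.
    padded = np.pad(missing, pad, mode="edge")
    # Integral image with a leading row and column of zeros.
    integral = np.zeros((padded.shape[0] + 1, padded.shape[1] + 1))
    integral[1:, 1:] = padded.cumsum(axis=0).cumsum(axis=1)
    h, w = depth.shape
    # Box sum for every pixel from four integral-image lookups.
    window_sum = (integral[window:window + h, window:window + w]
                  - integral[:h, window:window + w]
                  - integral[window:window + h, :w]
                  + integral[:h, :w])
    return window_sum / (window * window)
```

A soft ratio is used here instead of the raw binary mask because, as noted above, the missing region may be dilated or eroded relative to the true glass region, so a hard mask is unlikely to align with object boundaries.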
Localizing glass objects with color images. We begin our discussion with related work on localizing glass objects with color images only. In general, there are two major problems. First, we have to obtain effective visual features to identify glass regions and boundaries locally. Second, we need to build an object model in order to piece together the local estimates and suppress local noise where possible. For the first problem, as it is difficult to design features that identify a glass region by itself, most previous work has focused on detecting special properties of glass surfaces and their interactions with the opaque environment in images [151, 144]. Metelli [138] is among the first to study the perception of transparency in terms of spatial and intensity relations of light reflected from a relatively wide field. See [188] for a review and study on the theory of perceptual transparency from the psychology community. One of the early works in the computer vision community, by Adelson and Anandan [3], introduces a linear model for the intensity of a transparent surface:
\[ I = \alpha I_B + e \tag{2.26} \]
where $I_B$ is the intensity of the background, $\alpha$ is a blending factor, and $e$ is the emission of the semi-transparent surface. They relate the characteristics of visual transparency to the characteristics of the X junctions resulting from patterns on overlapping distinct layers. In addition to this overlay model, highlights are another useful cue as glass is known to be highly specular, and highlights can be found in color images by assuming a dichromatic reflection model [86]. In particular, McHenry, Ponce and Forsyth [135] design a classifier that attempts to find a glass/non-glass boundary based on a combination of visual cues. They compute relative features at both sides of a boundary fragment to partially address the appearance variation issue. Similar cues are also used in [91]. The cues used in their papers include:
• Color similarity: the color tends to be similar on both sides of a glass boundary;
• Blurring: the texture on the glass side is blurrier;
• Overlay consistency: the intensity distribution on the glass side is constrained by the intensity distribution on the non-glass side. In particular, pixels on the glass side usually have a lower saturation value;
• Texture distortion: the texture on the glass side is slightly different;
• Highlights and caustics: the presence of highlights and caustics increases the probability that a transparent material is nearby;
• Cross-correlation: distortion produced by a semi-transparent object can also be captured by region analysis, e.g., a cross-correlation measure.
Usually these cues are treated as noise and discarded in object detection and segmentation. However, they are characteristic of glass/non-glass boundaries. In particular, Osadchy et al. [151] recognize objects from specular reflections using knowledge of their 3D shapes.
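As an illustration of how relative cues of this kind might be computed from the two sides of a boundary fragment, here is a minimal sketch. It is not the classifier of [135]. All function names are hypothetical, the patches are assumed to be equally sized RGB arrays with values in [0, 1], and the specific formulas (mean color distance, ratio of mean gradient magnitudes, difference of mean HSV saturation, and a least-squares fit of the overlay model I = αI_B + e on sorted intensities) are illustrative assumptions.

```python
import numpy as np

def saturation(rgb):
    """HSV saturation for RGB values in [0, 1], array shape (..., 3)."""
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    return np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)

def sharpness(gray):
    """Mean gradient magnitude; lower values indicate a blurrier patch."""
    gy, gx = np.gradient(gray.astype(np.float64))
    return float(np.mean(np.hypot(gx, gy)))

def fit_overlay(intensity, background):
    """Least-squares estimate of (alpha, e) in I = alpha * I_B + e.

    Assumption: intensities are sorted first, so the two intensity
    distributions are compared rather than pixel positions.
    """
    i_sorted = np.sort(intensity.ravel())
    b_sorted = np.sort(background.ravel())
    design = np.stack([b_sorted, np.ones(b_sorted.size)], axis=1)
    (alpha, e), *_ = np.linalg.lstsq(design, i_sorted, rcond=None)
    return float(alpha), float(e)

def relative_cues(glass_rgb, other_rgb):
    """Relative features between a candidate glass patch and the patch on
    the other side of the boundary (both of shape (H, W, 3))."""
    glass_gray = glass_rgb.mean(axis=-1)
    other_gray = other_rgb.mean(axis=-1)
    alpha, e = fit_overlay(glass_gray, other_gray)
    mean_diff = glass_rgb.mean(axis=(0, 1)) - other_rgb.mean(axis=(0, 1))
    return {
        "color_distance": float(np.linalg.norm(mean_diff)),
        "sharpness_ratio": sharpness(glass_gray) / (sharpness(other_gray) + 1e-12),
        "saturation_drop": float(saturation(other_rgb).mean()
                                 - saturation(glass_rgb).mean()),
        "overlay_alpha": alpha,
        "overlay_emission": e,
    }
```

Under these assumptions, a small color distance, a sharpness ratio below 1 and a positive saturation drop would each be weak evidence that the first patch lies on the glass side. In practice such values would be fed to a learned classifier rather than thresholded by hand.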
In terms of object models, McHenry and Ponce [134] propose two complementary measures of affinity and another of discrepancy between regions to group image regions into glass/non-glass surfaces. The local predictions are combined using the geodesic active contour framework [29]. Their work focuses on binary criteria that answer whether two regions are made of the same material, and does not consider unary region estimates.
Fritz et al. [51] model local patch appearances with an additive model of latent factors in order to detect transparent visual words, and then use latent topic activations to generate object hypotheses. The basic idea behind the additive latent model is that the appearance of a glass region is a combination of factors including background and one or more patterns that have been affected by refraction effects. Their method uses a sliding-window based approach to infer latent topic activations based on linear SVMs. Therefore, it only generates bounding boxes for likely glass object locations instead of a pixelwise segmentation.
Localizing glass objects with multimodal data. The challenging nature of glass object detection and segmentation has encouraged researchers to utilize additional sensory information beyond single-view visual cues. In most cases, range (depth) cameras are employed to detect semi-transparent objects by exploiting the attenuation of signal intensities.
Klank, Carton and Beetz [84] use two images from a time-of-flight camera to detect and reconstruct transparent objects. Their active infrared camera is robust to illumination changes, but exhibits a shadow-like behavior for glass objects. To deal with this, they adopt a two-step reconstruction scheme and assume glass objects to be piecewise planar to get an initial reconstruction. Lee and Shim [105] use a stereo time-of-flight camera setup and derive a generalized depth imaging formulation for translucent objects. They find that the depth readings of a time-of-flight camera with semi-transparent objects present a systematic distortion and that the distorted depth values can be refined using an iterative optimization. Phillips and colleagues [159] use a stereo camera and exploit the fact that glass objects generate anomalies in the stereo inverse perspective map. Glass objects are assumed to be standing on a flat supporting plane. The plane needs to be somewhat textured to facilitate 2D homography estimation. Their method identifies extruding points from textured surfaces that violate the inverse perspective mapping, and uses a dataset of 3D models to generate shape templates for detailed localization. In particular, they use a similarity score that maximizes the homography inconsistency inside the shape template while minimizing the inconsistency in the neighborhood around the template. Wallace and Csakany [209] develop a time-of-flight laser sensor based on photon counts to measure 3D data from transparent surfaces. Liu et al. [116] propose a frequency-based 3D reconstruction method, which incorporates a frequency-based matting method that is similar to structured light methods. Ma et al. [125] derive a formulation of light transport in refractive media using light fields and the transport of intensity equation. Ye et al. [227] augment a Kinect camera with an ultrasonic sensor that is able to measure distance to any object, including transparent surfaces. Xu et al. [222] use linearity in light-field images to estimate the likelihood of a pixel belonging to a transparent object or a Lambertian background. Lei et al. [108] use a LIDAR device along with a registered RGB camera for glass object segmentation. Object candidates are proposed by highlight spots in RGB images and refined by running GrabCut [173] on depth and laser reflectance intensity images. In addition, when the viewpoint is fixed, Han et al. [67] develop an approach for dense transparent surface reconstruction based on refraction of light.
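The template similarity score described above for the stereo method of Phillips and colleagues [159] can be illustrated with a small sketch. This is only one plausible reading under stated assumptions, not the formulation of [159]: a per-pixel homography inconsistency map is assumed to be given, the neighborhood is taken to be a band of fixed width around the template, and the score is the mean inconsistency inside the template minus the mean in the band. The names `dilate`, `template_score` and the parameter `band` are hypothetical.

```python
import numpy as np

def dilate(mask, radius):
    """Binary dilation with a (2 * radius + 1) square window, NumPy only."""
    h, w = mask.shape
    padded = np.pad(mask.astype(bool), radius, mode="constant",
                    constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    # OR together all shifted copies of the mask within the window.
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def template_score(inconsistency, template, band=3):
    """Mean inconsistency inside the template minus the mean in a band
    of width `band` around it (higher means a better placement)."""
    template = template.astype(bool)
    ring = dilate(template, band) & ~template
    if not template.any() or not ring.any():
        raise ValueError("template and surrounding band must be non-empty")
    return float(inconsistency[template].mean() - inconsistency[ring].mean())
```

With a score of this form, a template placed exactly over a region of high inconsistency scores higher than a shifted copy, because the shifted copy both loses inconsistent pixels inside and gains them in its surrounding band.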
The work closest to ours is that of Lysenkov et al. [123], in the sense that they also use an RGBD camera for glass object detection and pose estimation. They propose a model taking into account both silhouette and surface edges, and perform CAD-based pose estimation. An extension to this work from the same group [124] focuses on pose estimation in transparent clutter. Another extension proposed by Luo et al. [122] improves the method by integrating visual cues so that non-transparent objects that produce unknown depth values are not mistaken for transparent objects. However, these methods require 3D models of objects, obtained by covering transparent objects with paint in order to make their surfaces Lambertian. In our work, we aim to make our method more flexible with unseen objects and to avoid using strong shape priors. Albrecht and Marsland [4] also propose a detection and reconstruction method for glass objects from point cloud data. Their method utilizes the shadows that are left in RGBD images from two or more distinct viewpoints to facilitate reconstruction. In our work, however, we are interested in glass object segmentation from a single viewpoint.