Comparing Monocular and Binocular Segmentation

6.3 Experimental Results and Discussion

6.3.2 Comparing Monocular and Binocular Segmentation

A binocular camera system will always outperform a monocular system, simply because more information is available. However, in many situations a monocular system is able to detect in- dependent motion and segment the moving objects in the scene. In this section we demonstrate the segmentation of independently moving objects using a monocular and a binocular camera system and discuss the results.

In a monocular setting, motion which is aligned with the epipolar lines cannot be detected without prior knowledge about the scene. Amongst other motion patterns, this includes objects

6.4 Ideas for Future Research (Flow Cut) 90

moving parallel to the camera motion. For a camera moving in depth this includes all (directly) preceding objects and (directly) approaching objects. The PreceedingCar and HillSide sequences in Figure 6.13 show such constellations.

Using the ground plane assumption in the monocular setting (no virtually triangulated point is allowed to be located below the road surface) facilitates the detection of preceding objects. This can be seen in the PreceedingCar experiment, where the lower parts of the car become visible. If compared to the stereo settings, which does not use any information about scene structure, the motion likelihood for the lower part of the preceding car is more discriminative. However, if parts of the scene are truly located below the ground plane, as the landscape at the right in the HillSide experiment, these will always be detected as moving, too. Additionally, this does not help to detect approaching objects. Both situations are solved using a binocular camera.

If objects do not move parallel to the camera motion, they are essentially detectable in the monocular setting (Bushes and Running sequences in Figure 6.14). However, the motion likelihood using a binocular system is more discriminative. This is due to the fact that the three- dimensional position of an image point is known from the stereo disparity. Thus, the complete viewing ray for a pixel does not need to be tested for apparent motion in the images, as in the monocular setting. In the unconstrained setting (not considering the ground plane assumption), the stereo motion likelihood therefore is more restrictive than the monocular motion likelihood. Note, that non-rigid objects (as in the Running sequence in Figure 6.14) are detected as well as rigid objects and do not limit the detection and segmentation at any stage.

6.4 Ideas for Future Research

Building on the approach to scene flow estimation presented in Chapter 5, I proposed in this chapter an energy minimization method to detect and segment independently moving objects filmed in two synchronised video cameras installed in a driving car. The central idea is to assign, to each pixel in the image plane, a motion likelihood which specifies whether based on 3D structure and motion, the point is likely to be part of an independently moving object. Subsequently, these local likelihoods are fused in an MRF framework and a globally optimal spatially coherent labelling is computed using the min-cut max-flow duality. In challenging real world scenarios where traditional background subtraction techniques would not work (because the whole image content is moving), this approach enables to accurately localize independently moving objects. The results of the presented algorithm could directly be employed for automatic driver assistance. Further research should focus on feedback loops in the whole motion estimation and segmentation process. Certainly, if motion boundaries and the segmentation of moving rigid objects are known, this provides additional cues for the motion estimation step. The work in [57] estimates piecewise parametric motion fields and a meaningful motion segmentation of the image in a joint approach. A similar approach could segment moving objects in the input images.

Another possibility for a feedback loop is the segmentation content itself for the segmentation stage. In the presented approach every motion, as long as it is not induced by the stationary scene, is segmented. Multiple objects may end up in the same segment and noise may influence segmentation boundaries. An individual object usually moves into the same direction and covers only a small disparity range. Such information could directly be used to iteratively improve the segmentation boundary and to separate individual objects.

Certainly, there are numerous other areas of research, such as the speeding up of the segmentation process (see also Appendix D) or using multiple frames of the sequence and propagating segmentation boundaries over time. This chapter presented a first application of scene flow; the next chapter will sketch concepts for three more applications and hopefully inspire the reader to come up with many more ideas for future research.

PreceedingCar

HillSide

Optical Flow Result

Monocular Segmentation of Independently Moving Objects

Binocular Segmentation of Independently Moving Objects

Figure 6.13: The figure shows the energy images and the segmentation results for objects moving parallel to the camera movement. This movement cannot be detected monocularly without additional constraints, such as a planar ground assumption. Moreover if this assumption is violated, this yields errors (as in the HillSide sequence). In a stereo setting prior knowledge is not needed to solve the segmentation task in these two scenes.

6.4 Ideas for Future Research (Flow Cut) 92

Bushes

Running

Optical Flow Result

Monocular Segmentation of Independently Moving Objects

Binocular Segmentation of Independently Moving Objects

Figure 6.14: The figure shows the energy images and the segmentation results for objects which move not parallel to the camera motion. In such constellations a monocular as well as a binocular segmentation approach is successful. However, one can see in the energy images and in the more accurate segmentation results (the head of the person in the Running sequence) that stereo is more discriminative. Note, that the also non-rigid independently moving objects are segmented.

Research Ideas using Visual Kinesthesia

The future is in your hands.

under the creative commons license, MichaelMarlatt, www.flickr.com

Contents

7.1 Extending Warp Cut . . . 94 7.2 Motion from Motion . . . 96 7.3 Dynamic Free Space . . . 97

In this chapter, I discuss ideas for future research, building on the results obtained in this thesis. The precise estimation of optical flow and scene flow allows for novel approaches to many applications and problems arising in driver assistance. Here, concepts for three such applications in driver assistance are sketched:

1. Object Segmentation from scene flow vectors is only one way to detect moving objects. Especially if additional sensors are available (such as radar or lidar), information such as relative speed and distance to objects can be directly employed. The detection of stationary and moving objects then becomes applicable solely using the gray values and image warping. A joint approach, employing motion vectors and gray values in the segmentation process is within the scope of future research.

2. Object Motion Estimation is another research topic. In Chapter 3 methods were

presented to estimate the motion of the camera from stationary points. The same principle can be used to estimate the relative motion between the camera and another (segmented) moving object. Then, the motion of the object can be inferred from the motion difference. 3. Dynamic Free Space has greatly been ignored in computer vision; it is important to note, that relative motion is a major issue though when using a radar sensor. In robotics, depth maps are commonly established from depth correspondences only. This might be motivated by history as laser and ultra-sound devices were used before stereo cameras became more popular. Using a camera system, tracking becomes possible and I am convinced that motion cues will one day be at least as important as depth cues. The idea of a dynamic free-space representation is outlined as another application of scene flow.

7.1 Extending Warp Cut (Research Ideas using Visual Kinesthesia) 94

Figure 7.1: The Warp Cut Segmentation algorithm models the world with three planes: the ground plane and the plane at infinity, stitched together at the horizon, and the obstacle plane. Segmentation of the object is then performed on the brightness differences between the original first frame and the backward- motion-warped second frame for the respective planes (see Fig. 7.2). The approach is essentially limited because (additional) objects which violate the three-plane-assumption affect the results. The direct use of optical flow vectors may overcome some of these difficulties and yield an increase in robustness.

7.1 Extending Warp Cut

Autonomous collision avoidance in vehicles requires an accurate separation of obstacles from the background, particularly near the focus of expansion. In [15], I presented a technique for fast segmentation of stationary obstacles from video, recorded by a single camera that is installed in a moving vehicle. In [7], this approach was generalized for moving objects, where the distance and relative speed of the object was known from radar measurements.

The basic idea is illustrated in Figure 7.1. The input image is divided into three motion segments consisting of the ground plane, the background, and the obstacle. Due to the given scenario the following assumptions were imposed:

1. The street is approximately planar. Hence, the image motion in this area is described by a homography Hs. The homography can be approximated from the known camera motion

and the camera parameters.

2. Visible object points on distant obstacles have approximately the same depth. Applying the weak perspective camera model, the motion field in the obstacle region is affine, which can be expressed by another homography Ho.

3. Finally, the background region, i.e., the region above the horizon can be approximated as a plane at infinity, which leads to a third homography Hb.

Consequently, there are three regions, each with a different motion model between frames. This constrained scenario allows for good initial estimates of the motion models for each segment, which are iteratively refined during segmentation. The separation of the obstacle region from the other two regions is done by the sought segmentation of the obstacle.

The street and background regions are separated a-priori by a horizontal line y = yhor that can

be derived analytically from the camera parameters, which leaves us with a binary partitioning problem. This enables accurate obstacle segmentation without prior knowledge about the size, the shape, or the base point of obstacles.

Similar to the flow cut approach in Chapter 6, graph cut segmentation was used to segment the obstacle within the image, where the binary labeling Lt(x) of each pixel x= (x, y)> is

Lt(x) =

1 ifxdepicts the obstacle

Figure 7.2: Motion-compensated difference images. From left to right: original gray value input images and difference images for foreground (Ef) and background (Eb) based on the squared difference

between the motion-compensated image It−1and image Itafter the last iteration. The camera translation

is 1.9 m. Hf and Hb denote foreground and background motion.

Segmentation by grouping similar gray values is not suitable in the presented context because the gray value of obstacles is not fixed and may be similar to the gray value of the street. Thus, the labeling is based on motion information. The classical approach to segmentation minimizes a joint energy on the labeling of the form

E(Lt) = EData(Lt) + αESmooth(Lt) . (7.2)

I will now review the data term in more detail: The key idea of distinguishing between obstacles and background is to penalize the difference between the current frame and the motion- compensated (warped ) previous frame. Separate motion predictions are computed for the obstacle region (Ho) and the non-obstacle regions (Hb for background and Hs for the street re-

gion). Notice, that for the presented application this is much more sensible than the approaches described in [58, 128] as it allows one to drop the assumption of a static camera. The motion- compensated images are composed as follows (recall, that yhor is the horizon in the image):

Background: I_0,t−1mc (x) = It−1(Hb(x)) y < yhor It−1(Hs(x)) y ≥ yhor , (7.3) Object: I_1,t−1mc (x) = It−1(Ho(x)) . (7.4)

Values between grid points are determined by bi-linear interpolation. Figure 7.2 shows the motion-compensated difference images for an example situation. The data term evaluates the consistency between the warped previous image and the current image. It consists of the sum over the squared differences between both images (recall, that Lt(x) ∈ {0, 1} denotes the la-

belling): EData(Lt) = X x∈Ω It(x) − I_Lmc_t₍_x_),t−1(x) 2 . (7.5)

Such an approach is essentially limited, because (additional) objects which violate the three- plane-assumption affect the results. The direct use of optical flow vectors, combining the warp cut approach and the flow cut approach, may take away some of these limitations. On the other hand, gray value information can help to detect inconsistencies in the segmentation result derived from optical flow vectors (as presented in Chapter 6). Furthermore, the warp cut approach can be extended to stereo vision and combined with the flow cut approach employing scene flow vectors. Both ideas are expected to yield an increase in robustness and accuracy.

In document 3D Motion Analysis via Energy Minimization (Page 96-103)