Supervised Techniques - Video object segmentation from long term trajectories

2.4 Supervised Techniques

As previously defined asupervised VO segmentation technique utilizes user supplied information in order to generate high quality segmentations. The user may supply for some frames in a sequence partially labelledmattes calledtrimaps, or fully segmented binaryalpha mattes, which we define askey frames. Also the user may just supply a few foreground and background paint strokes on selected frames.

In atrimap, image regions of definite foreground (αf(s) = 1) and background (αf(s) = 0) are labelled along with unknown regions (αf(s) = ?). The segmentation system is subsequently required to generatealphavalues for the pixels in these unknown region (αf(s)∈[0,1]), thus pro- ducing a continuousalpha matte. Note that binary and continuousalpha mattes are sometimes referred to as ‘hard’ and ‘soft’alpha mattes respectively in the literature. Also VO segmentation techniques produce ‘soft’alpha mattesfor the frames in a sequence are calledvideo matting techniques. For the rest of this discussion it should be assumed that all VO segmentation techniques mentioned produce binary (‘hard’) alpha mattes unless we specify otherwise.

2.4.1 Adaptation of Image Segmentation Techniques

Some supervised VO segmentation techniques [10, 27, 52] to date are simply adaptations of successful interactive image (supervised) segmentation techniques [24, 66, 71, 102, 103, 134] to the problem of segmenting video sequences. Wang and Cohen [125] provide a comprehensive review of these interactive image segmentation approaches.

Operating on each individual frame of a video independently using only an interactive image segmentation (IIS) technique would create some undesired problems. These IIS techniques [21, 26, 67, 71, 102, 103, 124, 134] usually require that the user supplies for an image a trimap or paint foreground and background strokes. Doing this for every frame in a video would be a tedious task. An additional problem is that slight differences in the extraction of the foreground from frame to frame would lead to obvious temporal inconsistencies, especially along the edges of the foreground.

Unlike still image segmentation techniques, VO segmentation techniques can exploit motion information. Each video frame is temporally correlated with the nearby frames. Hence the challenge forsupervised VO segmentation techniques is to accurately identify foreground regions with minimal user assistance by utilizing all available motion and appearance information.

Chuang et al. [27] in 2002 proposed a video matting technique based on a Bayesian image matting technique [134] they presented earlier. In this technique [27] optical flow was used to propagate a set of user drawn trimap to the unspecified frames in a sequence. A clean background plate was estimated as well to assist in the segmentation process. With a trimap

propagated to every frame, the Bayesian image matting technique [134] which utilized this clean plate for estimating background colours was then used to generatealpha mattes for each frame in a sequence. The performance of this VO segmentation technique is limited by the accuracy of

the optical flow estimation, which is often quite erroneous. Furthermore, the background plate estimation assumes the background undergoes only planar-perspective transformation, which is not true of most real world sequences of interest. This technique would fail if the objects in the background are moving.

2.4.2 Propagation of Binary Mattes

A subcategory of supervised VO segmentation techniques [11, 12, 70, 100, 123] require the user to supply fully segmented binary alpha mattes (key frames) for selected frames in a sequence. The segmentation system then automatically produces binary alpha mattes for the remaining frames in the sequence by propagating the labels in these key frame in some way. The binary

alpha mattes are then generally used to automatically generate trimaps for each frame in a sequence. Once these trimaps are estimated, an image matting technique can then be used to create continuous (‘soft’) alpha mattes. These VO segmentation techniques [11, 12, 70, 100, 123] reduces the work load of the user, astrimapsdo not have to be specified manually for every frame in a sequence. Hence less user interactions are required compared to the previously discussed approaches that are adaptations of image segmentation techniques.

Liet al.[70] in 2005 extended the pixel-level 3D Graph cut technique proposed by Boykov [26] to better handle video sequences. They use three main steps in their technique, which are a 3D Graph cut binary segmentation, tracking user defined local image windows for correcting the binary segmentation, and extracting continuous alpha mattes with an image matting technique [107]. The first two steps are used to produce accurate binaryalpha mattes, from whichtrimaps

are generated for the final image matting step. The user initially supplies fully segmented binary

key frames for some of the frames in a sequence. These key frames are articulated in the 3D Graph cut segmentation step, and the user then corrects any undesired segmentation errors.

In the next two sections we will discuss the Video SnapCut [12] and Feature-Cut [100] techniques, as we compare our work to both techniques later in this thesis. Both techniques proposed in 2009 are state of the artunsupervised VO segmentation techniques which propagate user defined binaryalpha mattes.

2.4.2.1 Video SnapCut

TheVideo SnapCutsegmentation algorithm was proposed by Baiet al.[12]. An implementation of this algorithm is bundled into Adobe After Effects CS5 [1] as a matting tool called Roto Brush.

This algorithm uses local motion information to propagate appearance information from a segmented frame t to the current unsegmented frame t+ 1. This appearance information is captured in overlapping rectangular windows of fixed sizes placed along the contour of the foreground. These local windows are defined aslocal classifiers in the paper [12]. The top row

of fig. 2.7 shows examples of these local classifiers Wt

2.4. Supervised Techniques 23

Figure 2.7: This figure is taken from the paper by Bai et al. [12] which is an illustration of the local classifiers proposed in the paper. (a): Overlapping classifiers (yellow squares) are initialized at frame t along the contour of the foreground (red curve). (b): These classifiers are then propagated onto the next frame using local motion information. (c): Each classifier contains a local colour and shape model, which are initialized at frametand refined if necessary at frame t+ 1. (d): Local foreground/background classification results are then combined to generate a global likelihood for all the pixels in frame t+ 1. (e): The final segmentation for frame t+ 1. Video courtesy of Artbeats.

kin frame t(left) and propagated to the current framet+ 1 (right) using motion information obtained from optical flow [53] and SIFT [73] feature matching. Here the local classifiers W_kt

(window) in frametcorrespond toW_kt+1 in framet+ 1. In eachlocal classifier colour and shape models (middle row of fig. 2.7) are generated for the foreground/background image region inside the corresponding window. These models are then used in the design of a posterior distribution for the alpha labels for the current frame t+ 1. Finally, a MAP estimate for the binary alpha matte for this framet+1 is generated using a Graph cut [135] solution. Thebottom row of fig. 2.7 shows an example of the likelihood probability map on the left (d) for the frame segmented on the right (e).

Figure 2.8: This figure is taken from the paper [100] by Ring. This is an example of propagating a user define alpha matte for frame 1 to the an unsegmented 5th frame in a sequence of a Polo player. Using feature correspondences (green lines) between these frames, regions of the known matte at frame 1 are ‘pushed’ into frame 5. Each of the blueand red circles represent the neighbourhoods of the respective SIFT features that have been matched in both frames. The partial matte in frame 5 is missing some foreground regions, but the subsequent optimization process (MAP estimation for labels) usually is able to fill in these missing region.

key frame) in a sequence, and all subsequent frames are processed in the order 2 toT (whereT

is the number of frames in the sequence). The user is allowed to correct any segmentation errors generated for the current frame t+ 1, and by default these changes only affect the frames after

t+ 1 (forward propagation), i.e. frames t+ 2, . . . , T. However the user can make the current frame t+ 1 a key frame, and allow previous frames (2 to t) to be influenced by this key frame

(backward propagation).

It will be shown later in this thesis that this technique generally performs well. However, it has some issues with distinguishing revealed background regions from legitimate foreground regions.

2.4.2.2 Feature-Cut

The Feature-Cut segmentation algorithm proposed by Ring [100], uses SIFT features matches between frames to propagate a collection of user key frames to the unsegmented frames in a sequence. Feature matches are used to identify corresponding image regions between a key frame and the current unsegmented frame of interest. These key frames contain foreground (αf(s) = 1) and background labels (αf(s) = 0), and the image region correspondences are used to propagate these labels to the unsegmented frames. Fig. 2.8 demonstrates this idea, using two

2.5. Sparse Video Object Segmentation 25

In document Video object segmentation from long term trajectories (Page 33-37)