Human-guided video segmentation methods accept human input in the first frame or a subset of frames, then propagate the information to the remaining frames (Nagaraja et al., 2015; Tsai et al., 2016; Badrinarayanan et al., 2010; Jain and Grauman, 2014; Wen et al., 2015; Perazzi et al., 2015; Maerki et al., 2016; Tsai et al., 2010).
Mask propagation techniques. Among this group are semi-supervised or semi-automatic approaches, which assume that an object mask in the first frame is known and aim to track this mask throughout the video. Appearance similarity and motion smoothness across time are used to propagate the first-frame annotation across the video (Maerki et al., 2016; Wang and Shen, 2017; Tsai et al., 2016). These methods usually leverage optical flow and long-term trajectories. Existing approaches focus on propagating superpixels (Wen et al., 2015; Jain and Grauman, 2014), constructing graphical models (Maerki et al., 2016; Tsai et al., 2016), or utilizing object proposals (Perazzi et al., 2015).
The label propagation method of Badrinarayanan et al. (2010) jointly models appearance and semantic information. The key idea is to influence the learning of frame-to-frame patch correlations as a function of both appearance and class labels. Budvytis et al. (2011) extended this method to include correlations between non-successive frames using a decision forest classifier. Tsai et al. (2010) propose to jointly optimize for temporal motion and semantic labels in an energy minimization framework. For efficiency, a sliding window approach is used to process overlapping n-frame grids. The result of one n-frame grid is employed as a hard constraint in the next grid, and so on.
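The sliding-window scheme can be illustrated with a minimal Python sketch. The names `window_ranges`, `propagate` and `solve_window`, the choice of overlap, and the rule that already decided labels are never overwritten are assumptions made for illustration; the energy minimization of Tsai et al. (2010) is not reproduced here and is represented only by the hypothetical `solve_window` callback.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def window_ranges(num_frames: int, n: int, overlap: int) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges of length <= n sharing `overlap` frames."""
    if not 0 <= overlap < n:
        raise ValueError("overlap must satisfy 0 <= overlap < n")
    ranges = []
    start = 0
    while True:
        end = min(start + n, num_frames)
        ranges.append((start, end))
        if end == num_frames:
            break
        start = end - overlap  # next window re-uses the last `overlap` frames
    return ranges


def propagate(
    frames: Sequence[object],
    n: int,
    overlap: int,
    solve_window: Callable[[Sequence[object], Dict[int, object]], List[object]],
    first_frame_labels: object,
) -> List[object]:
    """Label all frames window by window, keeping earlier results fixed."""
    labels: Dict[int, object] = {0: first_frame_labels}
    for start, end in window_ranges(len(frames), n, overlap):
        # Labels decided so far act as hard constraints inside this window.
        fixed = {t - start: labels[t] for t in range(start, end) if t in labels}
        result = solve_window(frames[start:end], fixed)
        for offset, lab in enumerate(result):
            labels.setdefault(start + offset, lab)  # never overwrite fixed labels
    return [labels[t] for t in range(len(frames))]
```

With 10 frames, n = 4 and an overlap of one frame, the windows cover frames 0 to 3, 3 to 6 and 6 to 9, so each window after the first starts from a frame whose labels were fixed by its predecessor.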
Fathi et al. (2011) use active learning for video segmentation. Each unlabelled pixel is assigned a confidence measure based on its distance to a labelled point, computed on a neighbourhood graph. These confidences are used to recommend frames in which more interaction is desired. In the work of Nagaraja et al. (2015), video object segmentation is formulated as a spatio-temporal Markov random field optimization problem, with a cost function including user input, motion and appearance cues, and spatio-temporal consistency.
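The distance-based confidence used to recommend frames can be sketched as follows. This is a simplified illustration and not the formulation of Fathi et al. (2011): the multi-source Dijkstra search, the exponential mapping from distance to confidence, the mean-confidence frame score, and all function names are assumptions. Nodes are assumed to be `(frame, pixel)` tuples, and every node is assumed to appear as a key of the graph.

```python
import heapq
import itertools
import math
from typing import Dict, Iterable, List, Tuple

Node = Tuple[int, int]  # (frame index, pixel index), an assumed encoding
Graph = Dict[Node, List[Tuple[Node, float]]]


def distance_to_labelled(graph: Graph, labelled: Iterable[Node]) -> Dict[Node, float]:
    """Shortest-path distance from every node to its nearest labelled node."""
    dist = {node: math.inf for node in graph}
    counter = itertools.count()  # tie-breaker so the heap never compares nodes
    heap = []
    for node in labelled:
        dist[node] = 0.0
        heap.append((0.0, next(counter), node))
    heapq.heapify(heap)
    while heap:
        d, _, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for neighbour, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(neighbour, math.inf):
                dist[neighbour] = nd
                heapq.heappush(heap, (nd, next(counter), neighbour))
    return dist


def recommend_frame(graph: Graph, labelled: Iterable[Node], scale: float = 1.0) -> int:
    """Return the frame whose pixels have the lowest mean confidence."""
    dist = distance_to_labelled(graph, labelled)
    per_frame: Dict[int, List[float]] = {}
    for (frame, _pixel), d in dist.items():
        # Confidence decays with distance; unreachable pixels get confidence 0.
        per_frame.setdefault(frame, []).append(math.exp(-d / scale))
    return min(per_frame, key=lambda f: sum(per_frame[f]) / len(per_frame[f]))
```

Under these assumptions, the frame farthest (in graph distance) from any annotation is the one suggested for further user interaction.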
Tsai et al. (2016) build a graph over pixels and superpixels, use convnet-based appearance terms, and interleave video segmentation with optical flow estimation. For the segmentation model, they construct a multi-level graphical model that consists of pixels and superpixels, which play different roles in the segmentation. At the superpixel level, each superpixel is likely to contain pixels from both the foreground and the background, as the object boundary may not be clear. At the pixel level, each pixel is less informative, although it can be used for more accurate estimation of motion and segmentation. By combining these two levels, the object boundary can be better identified by exploiting both the statistics contained in superpixels and the details at the pixel level.
Wen et al. (2015) construct a graph over neighboring frames connecting superpixels and (generic) object parts to solve the video labeling task. Perazzi et al. (2015) propose to build a global graph structure over object proposal segments, and then infer a consistent segmentation. A limitation of methods utilizing long-range connections is that they have to operate on larger image regions such as superpixels or object proposals for acceptable speed and memory usage, compromising their ability to handle fine details. In contrast, the systems introduced in Chapter 9 and Chapter 10 are efficient at test time due to their feed-forward architecture, operate at the pixel level, and generate high-quality results in a single pass over the video, without the need to consider more than one frame at a time.
Instead of using superpixels or proposals, Maerki et al. (2016) formulate a fully connected pixel-level graph between frames and efficiently infer the labeling over the vertices of a spatio-temporal bilateral grid (Chen et al., 2007). Because this method
2.4 video segmentation 29
propagates information only across neighboring frames, it has difficulties ensuring globally consistent segmentation. In contrast, our approaches in Chapters 9 and 10 learn the specific appearance of the object of interest via online tuning and therefore produce temporally consistent results.
Box tracking. Classic work on video object tracking focused on bounding box tracking. Many of the insights from these works have been re-used for mask tracking. Some previous works have investigated approaches that improve segmentation quality by leveraging box-level tracking and vice versa (Ren and Malik, 2007; Godec et al., 2011; Duffner and Garcia, 2013; Chockalingam et al., 2009).
Traditional box tracking smoothly updates a linear model over hand-crafted features across time (Henriques et al., 2012; Breitenstein et al., 2009; Kristan et al., 2014). Since then, convnets have been used as improved features (Danelljan et al., 2016, 2015; Ma et al., 2015; Wang et al., 2015a), and eventually to drive the tracking itself (Held et al., 2016; Bertinetto et al., 2016; Tao et al., 2016; Nam et al., 2016; Nam and Han, 2016). Convnet-based approaches need data for pre-training and learning the task.
In Chapter 9 we propose a mask tracking method that is closely related to the convnet-based box trackers of Held et al. (2016) and Nam and Han (2016). Held et al. (2016) propose to train a convnet offline so as to directly regress the bounding box in the current frame based on the object position and appearance in the previous frame. Nam and Han (2016) propose to use online fine-tuning of a convnet to model the object appearance. Our training strategy in Chapter 9 is inspired by Held et al. (2016) for the offline part, and by Nam and Han (2016) for the online stage. Compared to the aforementioned methods, our approach operates on pixel-level masks instead of boxes. Unlike Nam and Han (2016), we do not replace the domain-specific layers, instead fine-tuning all the layers on the available annotations for each individual video sequence.
Convnet-based mask tracking. Following the trend in box-level tracking, convnets have recently been proposed for mask tracking. What makes convnets particularly suitable for the task is that they can learn the common statistics of the appearance and motion patterns of objects, as well as what makes objects distinctive from the background, and exploit this knowledge when tracking a single particular object. This aspect gives convnets an edge over traditional techniques based on low-level features.
Caelles et al. (2017b) train a generic object saliency network, and fine-tune it per video using the first-frame annotation to make the output sensitive to the specific object instance being tracked. The resulting fine-tuned network is then applied to each frame of the video individually. Unlike our approach in Chapters 9 and 10, their segmentation is not guided, and therefore it cannot distinguish multiple instances of the same object. Instead, they incorporate the notion of the object to be segmented based solely on the first-frame annotation, which might result in performance decay over time as the object appearance diverges from the initial frame. Furthermore, their method relies on expensive dense video annotations for pre-training, while we employ static images.
Caelles et al. (2017a) extend the work of Caelles et al. (2017b) by incorporating the semantic information of an instance segmentation method into the video object segmentation pipeline. More recently, Voigtlaender and Leibe (2017b) have proposed to integrate an online adaptation mechanism into the pipeline of Caelles et al. (2017b). To adapt to changes in object appearance, they update the network per frame based on training examples selected online. To avoid drift, training examples are carefully selected: pixels for which the network is very certain that they belong to the object of interest serve as positive examples, and pixels that are far away from the previous frame mask serve as negative examples.
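The selection of online training examples can be sketched in a few lines of Python. This is an illustrative simplification and not the implementation of Voigtlaender and Leibe (2017b): the threshold values, the city-block distance, the treatment of all remaining pixels as ignored, the precedence of the distance test over the probability test, and all names are assumptions.

```python
from collections import deque
from typing import List

POSITIVE, NEGATIVE, IGNORE = 1, 0, -1  # assumed label encoding


def distance_to_mask(mask: List[List[bool]]) -> List[List[float]]:
    """City-block distance of each pixel to the nearest mask pixel (BFS)."""
    h, w = len(mask), len(mask[0])
    dist = [[float("inf")] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dist[y][x] = 0.0
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and dist[ny][nx] == float("inf"):
                dist[ny][nx] = dist[y][x] + 1.0
                queue.append((ny, nx))
    return dist


def select_training_pixels(
    prob: List[List[float]],
    prev_mask: List[List[bool]],
    pos_threshold: float = 0.97,  # assumed certainty threshold
    neg_distance: float = 3.0,    # assumed "far away" distance in pixels
) -> List[List[int]]:
    """Assign each pixel a positive, negative, or ignore training label."""
    dist = distance_to_mask(prev_mask)
    targets = []
    for y, row in enumerate(prob):
        out = []
        for x, p in enumerate(row):
            if dist[y][x] > neg_distance:
                out.append(NEGATIVE)   # far from the previous-frame mask
            elif p > pos_threshold:
                out.append(POSITIVE)   # network is very certain: foreground
            else:
                out.append(IGNORE)     # uncertain pixels are not trained on
        targets.append(out)
    return targets
```

Leaving uncertain pixels out of the update is one way to limit drift, since only confident foreground and clearly distant background contribute to the per-frame adaptation.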
Jampani et al. (2016a) combine convnets with ideas from bilateral filtering. They introduce a Video Propagation Network (VPN) that propagates information forward through video data. The VPN architecture is composed of two components. The first is a temporal bilateral network that performs image-adaptive spatio-temporal dense filtering. The bilateral network densely connects all pixels from the current and previous frames and propagates the associated pixel information to the current frame. This is followed by a standard spatial CNN applied to the bilateral network output to refine and predict the mask for the present video frame.
To cope with frequent occlusions and appearance variations in dynamic scenes, Li et al. (2017b) have most recently proposed to employ an adaptive object re-identification module along with our mask propagation introduced in Chapter 9 to retrieve missing instances. Specifically, when missing instances are re-identified with high confidence, they are assigned a higher priority to be recovered during the mask propagation process. For each retrieved instance, its frame is taken as the starting point and the mask propagation is applied bi-directionally. Both the mask propagation and re-identification modules are applied iteratively to the whole video sequence until no more high-confidence instances can be found. Following our work in Chapter 10, they employ a two-stream convnet with RGB and optical flow magnitude branches for mask propagation. However, they adopt the much deeper ResNet network (He et al., 2016) with atrous spatial pyramid pooling and multi-scale testing (Chen et al., 2017a) to increase the model capacity and the resolution of prediction.
The network architecture employed in Chapter 10 is similar to those of Caelles et al. (2017b) and Jain et al. (2017). Other than implementation details, there are two differentiating factors. One, we use a different strategy for training: while other works (Caelles et al., 2017b; Jampani et al., 2016a; Voigtlaender and Leibe, 2017b) all rely on consecutive video training frames and/or use external image datasets (Voigtlaender and Leibe, 2017b; Perazzi et al., 2017; Li et al., 2017b), our approach focuses on using the first-frame annotations provided with each targeted video benchmark without relying on external annotations. Two, our approach exploits optical flow more effectively than these previous methods.
Interactive video segmentation. Applications such as video editing for movie production often require a level of accuracy beyond the current state of the art. Thus several works have also considered video segmentation with variable annotation effort, leveraging a human in the loop to provide guidance or correct errors, e.g. (Jain and Grauman, 2016; Fan et al., 2015; Nagaraja et al., 2015). Several methods employ flexible user inputs, enabling human interaction using clicks (Jain and Grauman, 2016; Spina and Falcão, 2016; Wang et al., 2014b) or strokes (Bai et al., 2009; Zhong et al., 2012; Fan et al., 2015).
Although our techniques in Chapters 9 and 10 can be adapted for more flexible inputs, we focus on maximizing quality for the non-interactive case with no additional hints along the video.