1. Introduction and Background
1.3 Semantic Image Segmentation
Semantic image segmentation is an important vision task that is essential for comprehensive scene understanding. Given an input image, a semantic image segmentation model should produce a dense segmentation mask, which partitions the image into semantic regions. Figure 1.6 illustrates typical inputs and outputs of image segmentation models. The set of semantic categories is typically fixed, and there are two common types of categories: things (e.g. cars, cups, horses) and stuff (e.g. sky, grass, road). Some formulations of the semantic image segmentation task include a special background class, which represents all image pixels that do not belong to the current set of semantic classes.
Image segmentation with structured models. Image segmentation is known to be a very complex task, as it requires dense semantic analysis of high-dimensional raw visual data. The first relatively successful framework for tackling image segmentation tasks emerged in the 2000s. This framework relies on supervised machine learning techniques, hand-crafted local image feature descriptors and learning/inference techniques for structured models [95]. Prominent and widely used examples of structured models are conditional random fields (CRFs) [78] and structured support vector machines [140].
Local image feature descriptors are designed to provide concise high-level information about raw visual data. Typically, these features describe the local distribution of image gradients. Widely used descriptors include SIFT [86], HOG [30] and SURF [6]. These descriptors can be enhanced by concatenating additional information, such as the distribution of local image colors or relative image coordinates.
Structured models make it possible to model high-order segmentation properties, such as local smoothness of segmentation masks [116], connectivity [94] or convexity [47] of segmented regions, and many others. Note that inference in structured models is computationally intractable in general, so approximate inference techniques are often used.
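As a small illustration of what a smoothness property can look like in practice, the sketch below counts label disagreements between 4-connected neighboring pixels, which is the penalty used by a Potts-style pairwise term. The function name `potts_smoothness_penalty`, the choice of a 4-connected neighborhood and the unit penalty per disagreement are assumptions made for this example; they are not taken from [116] or any other cited work.

```python
from typing import Sequence


def potts_smoothness_penalty(mask: Sequence[Sequence[int]]) -> int:
    """Count 4-connected neighboring pixel pairs that carry different labels.

    Hypothetical helper for illustration: a lower value means a smoother mask.
    """
    height = len(mask)
    width = len(mask[0]) if height else 0
    penalty = 0
    for r in range(height):
        for c in range(width):
            # Compare with the pixel below (each vertical pair is visited once).
            if r + 1 < height and mask[r][c] != mask[r + 1][c]:
                penalty += 1
            # Compare with the pixel to the right (each horizontal pair once).
            if c + 1 < width and mask[r][c] != mask[r][c + 1]:
                penalty += 1
    return penalty


smooth_mask = [[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 1, 1]]
noisy_mask = [[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]]

print(potts_smoothness_penalty(smooth_mask))  # 3
print(potts_smoothness_penalty(noisy_mask))   # 17
```

A structured model that includes such a term prefers the first mask over the second whenever the local evidence for both is comparable.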
The aforementioned semantic image segmentation framework can be summarized in four main steps:
1. Employ a hand-crafted image descriptor (e.g. SIFT) to compute a dense image feature representation.
2. Formulate a parametrized structured model (e.g. CRF), which models the dependency between local image descriptors and semantic labels as well as higher-order label interactions, such as smoothness.
3. Learn the unknown parameters using the available supervised data.
4. At test time, for any given input image, use the learned probabilistic graphical model (e.g. CRF) to produce the most probable segmentation mask.
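One common way to write down steps 2 to 4 is a pairwise CRF. The notation below is an assumption introduced for illustration and is not taken from the cited works: $x$ is the input image, $f_i(x)$ the local descriptor at pixel $i$, $y_i$ its semantic label, $\mathcal{N}$ the set of neighboring pixel pairs, and $\theta$ the model parameters. Maximum likelihood is shown as one possible learning criterion; structured models can also be trained with other objectives.

```latex
% Step 2: energy with unary (descriptor-label) and pairwise (label-label) terms
E(y \mid x; \theta) = \sum_{i} \psi_u\bigl(y_i, f_i(x); \theta\bigr)
                    + \sum_{(i,j) \in \mathcal{N}} \psi_p\bigl(y_i, y_j; \theta\bigr),
\qquad
p(y \mid x; \theta) \propto \exp\bigl(-E(y \mid x; \theta)\bigr)

% Step 3: parameter learning from N fully labeled training pairs (x^{(n)}, y^{(n)})
\theta^{*} = \arg\max_{\theta} \sum_{n=1}^{N} \log p\bigl(y^{(n)} \mid x^{(n)}; \theta\bigr)

% Step 4: the most probable mask is the minimizer of the learned energy
\hat{y} = \arg\max_{y} \, p(y \mid x; \theta^{*}) = \arg\min_{y} \, E(y \mid x; \theta^{*})
```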
A prominent example of an approach that follows this framework is [131]. There are numerous improvements of this general approach in the literature. For instance, [48] proposes a technique that dynamically groups neighboring pixels into semantically and geometrically consistent regions, which are then assigned semantic labels. Another example is [109], which proposes to use global semantic image context in order to rule out semantic labels that do not fit into the overall image context.
Nevertheless, structured prediction models have multiple crucial drawbacks. First, they rely on hand-crafted image descriptors, which often do not carry enough information to discriminate certain semantic categories. Second, inference and parameter learning in structured probabilistic models are often slow or even computationally intractable, thus requiring careful development of specialized approximate learning or inference techniques. Overall, these models are quite slow in practice and their performance falls far behind that of the human visual system.
Image segmentation with deep neural networks. Recently, a new paradigm based on deep convolutional neural networks has become the dominant choice for solving the semantic segmentation task for natural images. It relies on end-to-end training of a convolutional neural network, which maps input images to segmentation masks. Unlike structured models based on hand-crafted image descriptors, deep neural networks learn the image representation automatically from the available data.
The authors of [85] demonstrate that modern deep convolutional neural networks significantly outperform structured models. Subsequently, their model was revised and substantially improved. The paper [22] proposes to use dilated convolutions, which drastically increase the size of the network’s receptive field without increasing the number of free parameters. In [170], the authors develop a methodology for fusing deep neural networks and the fully-connected CRF model from [74] in an end-to-end fashion during both the training and evaluation phases.
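The effect of dilation on the receptive field can be checked with a short calculation. The sketch below assumes a stack of stride-1 convolutions with a square kernel and the same number of input and output channels in every layer; the helper names and the dilation rates 1, 2, 4, 8 are chosen for illustration and are not the configuration used in [22].

```python
from typing import Sequence


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field (in pixels, along one axis) of stacked stride-1 convolutions."""
    field = 1
    for dilation in dilations:
        # A kernel with k taps and dilation d spans (k - 1) * d + 1 input pixels,
        # so each layer widens the receptive field by (k - 1) * d.
        field += (kernel_size - 1) * dilation
    return field


def num_weights(kernel_size: int, channels: int, num_layers: int) -> int:
    """Number of convolution weights (biases ignored); independent of dilation."""
    return num_layers * kernel_size * kernel_size * channels * channels


plain = receptive_field(kernel_size=3, dilations=[1, 1, 1, 1])
dilated = receptive_field(kernel_size=3, dilations=[1, 2, 4, 8])

print(plain)    # 9
print(dilated)  # 31
print(num_weights(kernel_size=3, channels=64, num_layers=4))  # 147456 in both cases
```

Under these assumptions, four dilated 3x3 layers cover a 31x31 input window instead of 9x9, while the number of weights stays the same because dilation only changes where the kernel taps are placed.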
1.3.1 Weakly-supervised semantic image segmentation
What makes semantic image segmentation especially challenging is the cost of producing labeled data in the form of dense segmentation masks. Fully annotating a single image takes a trained human annotator at least a few minutes. This makes producing a large image segmentation dataset prohibitively expensive in many realistic situations. Thus, a large body of previous research is devoted to weakly-supervised models that can learn from much weaker (and cheaper to produce) forms of annotation. In particular, image-level labels are much cheaper to produce and, thus, this type of weak annotation has attracted a lot of attention in the computer vision research community.
One of the first successful attempts to learn a semantic image segmentation model from image-level labels is [149]. It is based on a probabilistic structured model, which combines a signal from image-level labels with a prior assumption on label smoothness. Smoothness is enforced for spatially neighboring superpixels within one image as well as for similar-looking superpixels across images. Moreover, the authors utilize an objectness prior [1] to further improve the segmentation quality. A follow-up paper [150] extends this approach by introducing additional hyperparameters for controlling the importance of various terms in the structured segmentation model. Importantly, the authors also propose a novel criterion for selecting these hyperparameters, which can be evaluated without fully-labeled segmentation data.
The emergence of powerful deep convolutional neural networks sparked rapid progress in the performance of weakly-supervised image segmentation models. In [104], the authors derive a weakly-supervised segmentation model by combining the multiple instance learning (MIL) framework [2] with a deep convolutional neural network for image segmentation. A follow-up work [105] improves the design of the MIL loss function and, additionally, introduces image segmentation priors, which result in substantial performance gains. Another line of work [100, 102] utilizes variants of the Expectation-Maximization algorithm [31] (EM algorithm). The main idea is to iterate between two steps: an expectation step and a maximization step. In the expectation step, the learner combines image-level labels and segmentation predictions produced by the current approximation of the segmentation model in order to produce surrogate ground-truth segmentation masks. In the maximization step, the current segmentation model is updated in order to better match the surrogate ground-truth segmentation masks. The proposed EM-based models differ mostly in how exactly the expectation step is performed.
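The alternation between the two steps can be sketched as follows. This is a deliberately simple variant and not the procedure of [100] or [102]: here the expectation step takes a per-pixel argmax over the current class scores, restricted to the classes listed in the image-level labels. The names `expectation_step` and `em_iteration`, as well as the `predict` and `update` callables standing in for the segmentation model, are hypothetical.

```python
from typing import Callable, List, Sequence, Set

Scores = List[List[List[float]]]  # height x width x num_classes
Mask = List[List[int]]            # height x width


def expectation_step(scores: Scores, image_labels: Set[int]) -> Mask:
    """Build a surrogate mask: per-pixel argmax restricted to image-level labels."""
    allowed = sorted(image_labels)
    return [[max(allowed, key=lambda k: pixel[k]) for pixel in row]
            for row in scores]


def em_iteration(predict: Callable, update: Callable,
                 images: Sequence, labels: Sequence[Set[int]]) -> List[Mask]:
    """One EM round with a hypothetical model interface (predict / update)."""
    # Expectation step: combine current predictions with image-level labels.
    surrogate = [expectation_step(predict(image), image_labels)
                 for image, image_labels in zip(images, labels)]
    # Maximization step: fit the model to the surrogate ground-truth masks.
    update(images, surrogate)
    return surrogate


# Toy 2x2 image with three classes; only classes 0 and 1 are present.
scores = [[[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]],
          [[0.2, 0.5, 0.3], [0.35, 0.25, 0.4]]]
surrogate_mask = expectation_step(scores, {0, 1})
print(surrogate_mask)  # [[1, 0], [1, 0]]
```

In the toy example, class 2 has the highest score at two pixels, but it is ruled out because it does not appear among the image-level labels.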
Overall, recently proposed weakly-supervised segmentation models deliver roughly 60% of the performance of their fully-supervised counterparts. In this thesis we aim to substantially narrow this performance gap.