
2.2 Structural Salient Object Detection

2.2.1 Salient Object Detection from Appearance

The majority of salient object detection methods operate on RGB input. Early interest in computational modelling of visual attention was sparked by the seminal work of Itti and Koch [68]. The major insight of this work was that image areas exhibiting high local centre-surround contrast are more likely to be salient to the human visual system. An example is shown in Figure 2-1, where the salient object has high contrast with its local surroundings. Since then, colour contrast measurement has formed the foundation of many salient object detection methods [4, 93, 80, 29, 30].
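The centre-surround idea can be illustrated with a minimal sketch. It is not the multi-scale, multi-feature model of [68]: it assumes a single greyscale channel, square windows, and a hypothetical function name and radii chosen only for illustration. The score at a pixel is the absolute difference between the mean intensity of a small centre window and the mean of the ring that surrounds it.

```python
def center_surround_contrast(image, row, col, center_radius=1, surround_radius=3):
    """Absolute difference between the mean of a centre window and the mean
    of the surrounding ring (surround window with the centre removed)."""
    height, width = len(image), len(image[0])

    def window_sum(radius):
        # Sum and pixel count of the square window, clipped to the image.
        total, count = 0.0, 0
        for r in range(max(0, row - radius), min(height, row + radius + 1)):
            for c in range(max(0, col - radius), min(width, col + radius + 1)):
                total += image[r][c]
                count += 1
        return total, count

    center_total, center_count = window_sum(center_radius)
    outer_total, outer_count = window_sum(surround_radius)
    ring_count = outer_count - center_count
    if ring_count == 0:
        return 0.0
    center_mean = center_total / center_count
    surround_mean = (outer_total - center_total) / ring_count
    return abs(center_mean - surround_mean)


# Toy 9x9 image: dark background with a bright 3x3 patch in the middle.
image = [[0.0] * 9 for _ in range(9)]
for r in range(3, 6):
    for c in range(3, 6):
        image[r][c] = 1.0

patch_score = center_surround_contrast(image, 4, 4)       # centre of the patch
background_score = center_surround_contrast(image, 0, 0)  # image corner
```

On this toy image the centre of the bright patch scores far higher than the corner, which matches the intuition that a region standing out from its local surroundings is more likely to be salient.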

There is a wide variety of techniques for measuring contrast and detecting salient object regions in an RGB image. Achanta et al. [4] present a frequency-tuned model to detect salient objects, where contrast is measured as the pixel-wise difference between the average image colour and a Gaussian-filtered image. Liu et al. [93] measure regional contrast through the chi-squared histogram difference between a rectangular image region and its surrounding region. Klein et al. [80] compute this quantity in an information-theoretic way, using the Kullback-Leibler divergence. Cheng et al. [29] measure the global contrast between a superpixel and all other superpixels, taking into account spatial coherence. Cheng et al. [30] perform saliency computation from a soft abstraction of the image, allowing a larger spatial support and more uniform highlighting of objects compared to many superpixel-based methods. Shen et al. [132] formulate saliency as a low-rank matrix recovery problem, where the background regions correspond to a low-rank matrix and the salient regions appear as sparse noise.
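Two of the contrast measures mentioned above, the chi-squared histogram difference and the Kullback-Leibler divergence, can be sketched as follows. The function names, the toy histograms and the `eps` smoothing term are assumptions made for illustration, and the chi-squared form used here is one common variant; neither function is claimed to reproduce the exact formulation of [93] or [80].

```python
import math


def normalise(histogram):
    """Scale a histogram of non-negative counts so that it sums to one."""
    total = sum(histogram)
    if total <= 0:
        raise ValueError("histogram must contain positive mass")
    return [count / total for count in histogram]


def chi_squared_distance(region, surround):
    """0.5 * sum((p_i - q_i)^2 / (p_i + q_i)) over bins that are not both empty."""
    p, q = normalise(region), normalise(surround)
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)


def kl_divergence(region, surround, eps=1e-12):
    """D_KL(p || q) = sum(p_i * log(p_i / q_i)).

    Empty surround bins are clamped to eps (an assumption of this sketch)
    so that the logarithm stays finite.
    """
    p, q = normalise(region), normalise(surround)
    return sum(a * math.log(a / max(b, eps)) for a, b in zip(p, q) if a > 0)


# Toy 3-bin colour histograms for a candidate region and its surround.
region_hist = [8, 1, 1]
surround_hist = [1, 1, 8]
chi2 = chi_squared_distance(region_hist, surround_hist)
kl = kl_divergence(region_hist, surround_hist)
```

Both measures are zero when the region and its surround share the same colour distribution and grow as the distributions diverge. Note that the Kullback-Leibler divergence is asymmetric, so the order of its arguments matters.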

Prior knowledge about the task and human visual system plays an important role in salient object detection. The most widely used prior is the spatial prior, which re-weights the saliency score of a pixel according to the spatial distance between the pixel and the image centre. This prior is used in almost all existing saliency systems, and is based on the biological tendency of the human visual system to focus on central image regions [146]. Similarly, many methods also make use of a background prior for saliency estimation, which exploits the idea that the borders of the image are more likely to contain background. Methods using the background prior compute saliency as contrast with the border region of the image, referred to as the ‘pseudo-background’. Wei et al. [153] construct an undirected weighted graph where each superpixel and the pseudo-background are nodes, and saliency is computed as the geodesic distance between the superpixel and the pseudo-background. Li et al. [88] compute saliency as dense and sparse reconstruction errors with respect to the pseudo-background. Jiang et al. [70] use an absorbing Markov chain to compute saliency, in which border superpixels are absorbing nodes and non-border superpixels are transient nodes. In this approach, the saliency of a superpixel is computed as the absorbed time, that is, the expected time for a random walk starting at the transient node to reach the absorbing nodes.
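The geodesic formulation of the background prior can be sketched with a shortest-path search. This is a simplified illustration rather than the method of [153]: it assumes one scalar appearance feature per superpixel, edge weights equal to the absolute feature difference between adjacent superpixels, and zero-cost links from border superpixels to the pseudo-background. The function and argument names are hypothetical.

```python
import heapq


def geodesic_saliency(features, edges, border_nodes):
    """Shortest-path distance from every superpixel to the pseudo-background.

    features: one scalar appearance value per superpixel (assumed here).
    edges: pairs (i, j) of adjacent superpixels.
    border_nodes: indices of superpixels touching the image border.
    """
    n = len(features)
    adjacency = [[] for _ in range(n)]
    for i, j in edges:
        weight = abs(features[i] - features[j])
        adjacency[i].append((j, weight))
        adjacency[j].append((i, weight))

    # Multi-source Dijkstra: starting every border node at distance zero is
    # equivalent to linking them all to one background node at zero cost.
    distance = [float("inf")] * n
    heap = []
    for b in border_nodes:
        distance[b] = 0.0
        heapq.heappush(heap, (0.0, b))
    while heap:
        d, node = heapq.heappop(heap)
        if d > distance[node]:
            continue  # stale queue entry
        for neighbour, weight in adjacency[node]:
            candidate = d + weight
            if candidate < distance[neighbour]:
                distance[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return distance


# Toy chain of five superpixels; the middle one differs from the border.
features = [0.0, 0.25, 1.0, 0.25, 0.0]
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
saliency = geodesic_saliency(features, edges, border_nodes=[0, 4])
```

In the toy chain, the middle superpixel can only be reached from the border by crossing large appearance changes, so it receives the largest geodesic distance and hence the highest saliency.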

Aside from spatial and background priors, many other heuristics have been used to improve salient object detection performance. For example, Liu et al. [93] observe that the spatial distribution of colour within an image correlates with saliency, since background colours are more likely to be spread out. Chang et al. [25] take advantage of the objectness prior, fusing the object proposal generation model of Alexe et al. [5] and region saliency detection in a graph-based framework.
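The colour spatial-distribution heuristic can be sketched as follows, under strong simplifying assumptions: each pixel carries a discrete colour label (whereas a real system would typically use a soft colour model), and the score is a min-max normalised inverse of the spatial variance of each label. The function names and the toy label grid are hypothetical and are not taken from [93].

```python
def colour_spatial_variance(labels):
    """Map each colour label to the variance of its pixel positions
    (x-variance plus y-variance)."""
    positions = {}
    for y, row in enumerate(labels):
        for x, label in enumerate(row):
            positions.setdefault(label, []).append((x, y))

    variances = {}
    for label, points in positions.items():
        mean_x = sum(x for x, _ in points) / len(points)
        mean_y = sum(y for _, y in points) / len(points)
        variances[label] = sum(
            (x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in points
        ) / len(points)
    return variances


def distribution_saliency(labels):
    """Score each colour in [0, 1]: compact colours score high, and widely
    spread colours score low."""
    variances = colour_spatial_variance(labels)
    low, high = min(variances.values()), max(variances.values())
    span = high - low
    if span == 0:
        # All colours are equally spread, so the cue carries no information.
        return {label: 0.0 for label in variances}
    return {label: 1.0 - (v - low) / span for label, v in variances.items()}


# Toy 4x4 label grid: a compact "r" object on a spread-out "g" background.
labels = [
    ["g", "g", "g", "g"],
    ["g", "r", "r", "g"],
    ["g", "r", "r", "g"],
    ["g", "g", "g", "g"],
]
scores = distribution_saliency(labels)
```

The compact colour receives the highest score and the colour spread across the whole image receives the lowest, mirroring the observation that background colours tend to be widely distributed.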

Several approaches have employed machine learning methods to more closely model the properties that lead to an object being perceived as salient. Liu et al. [93] learn a CRF model to segment salient objects based on a set of image features. Li et al. [87] use an SVM to predict saliency based on the difference between a target region and its local surroundings in feature space. Lu et al. [96] use a large-margin framework to classify salient regions. While these methods offer good performance, the properties that make an object appear salient can be difficult to capture with linear classifiers. As such, subsequent approaches have used boosted decision trees [105], random forests [70], and a mixture of linear SVMs [77] to measure saliency according to non-linear classification of regional descriptors.

Recently, deep learning methods have produced state-of-the-art results for salient object detection. Zhao et al. [162] propose a multi-contextual CNN for salient object detection, which jointly models local and global contexts of superpixels. This model is pretrained on the ImageNet dataset, due to the insufficient size of existing saliency datasets. Wang et al. [152] compute a local saliency map using a CNN and objectness-based refinement, and then use a fully connected CNN to produce the final saliency map from global features of the object proposals. More recent methods have taken advantage of learned features from fully convolutional object detection networks such as VGG16 [135] and GoogLeNet [142], which provide strong performance when fine-tuned for saliency detection [86, 92, 83]. Li and Yu [86] combine VGG16 with a region-based CNN to better model saliency discontinuities along object boundaries. Liu and Han [92] propose a hierarchical recurrent CNN based on VGG16, which predicts saliency in a coarse-to-fine manner. Lee et al. [83] combine the high-level features from VGG16 with a low-level map which encodes distances between superpixel features. Fully convolutional object detection networks, used in conjunction with optimisations for segmentation accuracy, currently represent the state of the art in appearance-based salient object detection.
