• No results found

SBIR descriptors: from model fitting to deep learning

2.2 Sketch based image retrieval

2.2.1 SBIR descriptors: from model fitting to deep learning

Shallow features

The earliest SBIR studies dated back in the 90s with the work of Katoet al. [85] and Bimbo and Pala [12]. Katoet al. studied cross-correlation between neighbour patches in sketches and image edgemaps. Bimbo and Pala used elastic matching to fit the sketched query to a candidate image minimising deformation energy. The energy required to fit the sketch to image is then used as the ranking score. These approaches, to some extent, can compensate

human imprecision when drawing sketch. However, model fitting must be computedon the flyso it is not desirable for real time SBIR.

Several approaches work on blob-based sketches – a type of drawing that describes objects via coarse attributes of colour, texture and shape usinge.g. paint brush (see Fig. 3.1 (a) in Chapter. 3 for examples of this sketch type). As blob-based sketches look similar to photo images, the same image descriptors can be used on sketches. Jacobet al. [77] used Haar wavelet decomposition, keeping only the average colour together with sign and indices of few largest Haar coefficients. TheQBICsystem [48] utilises global colour histogram to extract colour, moment for shape, and coarseness and contrast for texture description. The

VisualSEEksearch engine [160] encodes colour sets together with size and location of the blobs as well as their spatial distances, then performs exhaust search to find the best matched image. InDrawSearch[36], colour is averaged on a 4x4 grid and shape is extracted using Fourier descriptor. DrawSearchalso supports relevant feedback by iteratively updating the query’s signature close towards the positive examples in the feature space.

Major SBIR studies were dedicated to line-art sketches (see Fig. 3.1 (b,c) for examples of this type). To deal with the cross-domain problem, early approaches chose to extract edgemaps from photo images then match against sketches assuming a unified domain between them. Extensive research has been focused on extracting structural information from sketches and edgemaps. Matusiaket al. [111] applied Curvature Scale Space (CSS) to describe object contour. CSS is affine invariant and can be applied in real time system, however it assumes that the image is already segmented and represented by its contour. Rajendranet al. [136] also achieved affine invariance by performing multi-scale decomposition on the strongest edges before extracting curvature direction histogram from it. These approaches both use global descriptors which are not robust against spatial deformation, commonly found in sketches.

Following the introduction of the MPEG-7 standards, Edge Histogram Descriptor (EHD) and its variants were quickly adopted in SBIR. Under EHD-related methods, images are gridded and features are extracted from individual cells and concatenated to form a global descriptor. Eitzet al. [41] expresses each cell as a single unit vector with dominant orientation using structure tensors. Chalechaleet al. [22] grids the images into angular partitions and collects 1-D discrete Fourier Transform of the polar edge points in every partition, thus able to achieve affine invariance.

Following the success of BoVW framework for photographic search in the mid-2000s, BoVW was first extended to SBIR by Huet al. [71]. Huet al. [71, 72] extrapolates edge orientations from strokes to produce a dense gradient field, computing multi-scale HoG descriptors (GF-HOG) local to strokes that is later quantised within a hard-assignment

BoVW pipeline. This method strengthens the power of HOG descriptor and compensates for the drawbacks of the BoVW framework by encapsulating spatial-coherence information in a smooth gradient field. Another contribution from their work is the Flickr15K benchmark containing 15k images and 330 sketches from 33 categories. This dataset has become a defacto benchmark in class-level SBIR and has been employed in more than 20 publications to date (Table. 2.1).

Alternatively, Eitzet al. [42] explored several local descriptors including Shape Context (log-polar distribution of edge points relative to the centre point), Spark (similar to Shape Context except that the centre point is not on the edgemap and only the properties of the nearest edge point are recorded for each angle), and SHOG (similar to HOG but only the dominant orientation in each cell is computed), all encoded with BoVW reporting SHOG the highest performance.

Qi et al. [131] took a different route in solving SBIR problems. Instead of focusing on feature description, the authors paid more attention to edge extraction from database images and came up with a perceptual-grouping edge technique. Their perceptual edgemap is constructed using RankSVM (to estimate the likelihood two edge points belong to the same object), and graph-cut (to group salient edges together). The edgemap is described by standard HOG, yielding top retrieval accuracy among shallow-learning approaches.

Several authors addressed the difference between image’s edgemaps and sketches. Saave- dra et al. [144, 143, 146] proposed a family of HELO descriptors in which their latest approach (RST-SP-HELO) uses a mid-level contour detector (SketchToken [98]) to extract sketch-like edgemaps from photo images. Next, representative edge orientations are extracted locally for each cell in the gridded sketches and image’s tokens. Finally the local descriptors are concatenated in an EHD fashion to form a global one. The same authors in another publi- cation [147] proposed a BoVW approach based on a set of 6 predefined mid-level features named KeyShapes. The method was later improved in [145], allowing the KeyShapes to be automatically learned.

Approaches based on computing similarity between a sketch and a candidate image via direct edgemap alignment has also been investigated. This type of method is similar to model fitting however significant efforts were invested in computing image descriptors offline (rather than comparing sketch and image models online as with model fitting approaches). Caoet al. in MindFinder [21, 20] came up with an indexable version of Oriented Chamfer Matching (OCM). Their approach is memory- and speed-effective, however users must draw a query as close to the image’s edgemaps as possible because OCM is neither translation, rotation nor scaling invariant. Tolias and Chum [174] proposed Asymmetric Feature Maps (AFM) that stores sketch/edgemap structure information such as pixel position, edge orientation and

Method Dim. mAP (%) Query-adaptive re-ranking CNN [11] 5120 32.30 Quadruplet_MT [151] 1024 32.16 Asymmetric feature map (AFM) [174] 243 30.40 Learned KeyShapes (LKS) [145] 1350 24.50 Rst-SP-SHELO [144] 3060 20.05 Siamese with Contrastive Loss1[132] 64 19.54 Perceptual Edge [131] 3780 18.37 HLR+S+C+R [186] 2000 17.10 SHELO [143] 1296 12.36 GF-HoG [72] 3500 12.22 SHoG [42] 1000 10.93 HELO [146] 1296 9.67 SSIM [153] 500 9.57 SIFT [104] 1000 9.11 Shape Context [9] 3500 8.14 Structure Tensor [41] 500 7.98

Table 2.1: Performance of different SBIR algorithms on the Flickr15K benchmark.

edge strength under a set of stationary kernels. These kernels are approximated to ensure fixed database descriptors. To reduce querying time, a set of candidate images are short-listed followed by a local refinement and re-ranking step within the candidates. AFM currently enjoys state-of-art performance among the non-deeplearning approaches on the Flickr15K benchmark.

Deep features

Following the hype of deep learning in object recognition and CBIR, approaches based on deep CNNs are quickly adopted for sketch analysis and particularly for SBIR. Yuet al. [202] and Wanget al. [183] built the first deep sketch models although not directly for SBIR. The former surpasses human performance in sketch recognition using SketchANet – a light-weight version of AlexNet optimised for sketches. SketchANet also forms a component of the very recent work of Bhattacharjee et al. [11], coupled with a complex pipeline including object proposals, and query expansion to achieve state-of-art reported performance on Flickr15K. The latter worked on sketch-based 3D shape retrieval using a two-branch CNN (siamese2) regularised by a contrastive loss function. Here, domain gap is narrowed

1The authors of this work split the Flickr15K into two equal parts, trained on one and test on the other. Their

result is therefore not directly comparable.

2There are two different notations forsiamesein the literature. One defines siamese asidenticalwhile the

other refers its meaning simply astwo branches. In this thesis we use the latter notation for siamese, in contrast with the notation of triplet which indicatesthree branches.

using two independently learned CNNs, one for sketches and the another for 2D views of the 3D objects. Siamese networks are also used in the SBIR work of Qi et al. [132] to match sketches with image’s edgemaps, however its performance is far behind state-of-art on Flickr15K by a large margin (Table. 2.1).

Similar to siamese CNNs, triplet CNNs are another alternative of deep learning with regression. A triplet network employs a three-term loss function: (i) an anchor term which represents the reference object, (ii) a positive term representing objects similar to the reference one and (iii) a negative term representing dissimilar objects. The triplet loss function measures the distance cost considering the relationship between the three terms. Triplet CNNs have recently been explored by Yu et al. [201] in order to refine search within a single object class (e.g. fine-grain search within a dataset of shoes). Similarly, a fine-grained approach to SBIR was adopted by the recent Sketchy system of Sangkloyet al. [148] in which careful reproduction of stroke detail is invited for object instance search. In the former work [201], the authors train one model for each target category, and the embedding is learned using an edgemap extracted from a relatively clutter-free image. They report that using a fully-shared network was better than use two branches without weight sharing. However, the authors in [148] suggest it is more beneficial to avoid sharing any layers at all in a cross-category retrieval context.

The work in [201, 148] are examples of fine-grain SBIR – relevant images are required to have exact matching with the sketch queryi.e. same view point, pose, scale and sub-type other than just category. Seddatiet al. [151] improved the Sketchy system with a quadruplet loss that combines both category-level and instance-level triplet losses in dealing with inter- and intra-class variance. Sarvadevabhatlaet al. [149] proposed SketchParser that integrates a super-category classifier to redirect input sketches/images to finer category-dependent models. Songet al. [163] introduced an attention layer to model subtle features in sketches/images . The same authors in another work [162] added text as the fourth branch in their quadruplet networks, claiming sketch-photo and text-photo matchings can benefit each other if modelled jointly. All these fine-grain SBIR approaches rely heavily on expensive human-annotated groundtruth in training the modelse.g. labels of object parts in [149], exact-matching sketch- photo pairs in [201, 148, 151, 163] and fine-grain sketch-photo-text triplet annotations in [162].

Recently a few novel approaches based on generative models have emerged in the SBIR literature. These approaches deliver representations of a sketch/image through the recon- struction process of the input sketch/image itself. Cross-domain embedding is achieved by unifying domain of the reconstructed target. Panget al. [123] employed a triplet CNN where each branch is an autoencoder converting images to sketches. The network is regularised

using a combination of triplet loss, reconstruction loss and classification loss. On the other hand, Guoet al. [60] took an opposite direction by using a conditional generative adversary network (CGAN) to generate an imaginary photo image from an input sketch. Feature extracted from the generated image is used as the sketch’s representation, effectively turning SBIR into a single domain CBIR problem. Again, these approaches all require instance-level labels to work.