Generalisation - Sketch based image retrieval on big visual data.

4.4 Experiments

4.4.2 Generalisation

We first report how well our triplet networks generalise as the amount of training data varies. A series of experiments was conducted to investigate the ability of our models to generalise beyond their training data for category-level SBIR. We constructed training sets with different numbers of categories: 20, 40, 80, 130 and 250, sampled at random from the TU-Berlin-Class/Flickr25K datasets. We also varied the number of training sketches per category: 20, 40, 60 or all 80 sketches per category were randomly sampled for training, while the remaining sketches of the chosen categories were used for validation. Additionally, we experimented with different levels of sharing between the sketch and image branches, comparing the proposed partially-shared networks with the fully-shared and no-share ones. Which layers are shared is discussed in subsections 4.4.3–4.4.4; here we report just one partial-sharing configuration. After training the CNN, we evaluated mAP over the Flickr15k test dataset to measure how well the embedding generalises to unseen categories.
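The sampling protocol described above can be sketched as follows. This is a minimal illustration only: the function name `sample_training_split`, the dictionary layout and the toy sketch identifiers are assumptions made for the example, not the code used in the experiments.

```python
import random

def sample_training_split(sketches_by_category, n_categories, n_per_category, seed=0):
    """Pick categories at random, then split each chosen category's sketches.

    sketches_by_category: dict mapping category name -> list of sketch ids
    (hypothetical layout). Returns (train, val) dicts over the chosen categories.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(sketches_by_category), n_categories)
    train, val = {}, {}
    for cat in chosen:
        items = list(sketches_by_category[cat])
        rng.shuffle(items)
        train[cat] = items[:n_per_category]   # sketches used for training
        val[cat] = items[n_per_category:]     # remaining sketches for validation
    return train, val

# Toy stand-in for a sketch dataset: 250 categories with 80 sketch ids each.
toy = {f"cat{c:03d}": [f"cat{c:03d}_{i:02d}" for i in range(80)] for c in range(250)}
train, val = sample_training_split(toy, n_categories=40, n_per_category=60)
```

Note that in this sketch, choosing all 80 sketches per category leaves the validation split empty.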

For simplicity we used SketchANet-SketchANet for the sketch-edgemap experiments and SketchANet-AlexNet for the sketch-image experiments. In the case of sketch-image matching, we modified the SketchANet design to enable sharing with AlexNet. Specifically, layers 1-3 of the sketch branch have the SketchANet architecture and layers 6-7 mirror AlexNet, while the middle layers 4-5 are modified from SketchANet as a hybrid of the two designs. The modified sketch branch is trained from scratch, while the image branch is initialised from the ImageNet pre-trained model [91].

Figure 4.11: Generalisation capability of our learned models w.r.t. (a) number of training categories (20 sketches per category); (b) number of training sketches per category (250 categories); (c) fixed training volume (4800 training samples); tested on the Flickr15k benchmark.

Fig. 4.11 (a) shows that performance benefits from increasing the number of training categories. All five network designs achieved a near-linear improvement in retrieval performance on the Flickr15k benchmark (discarding the four categories that intersect with the training set) with exposure to a more diverse set of categories during training. The mAP of all models rose by ∼20% when the training data was increased from 20 to 250 categories. Fig. 4.11 (b) shows a similar trend when we keep the number of training categories fixed at 250 and vary the number of training sketches per category. As a result of seeing more data during training, all models achieve an improvement of up to 4% mAP on Flickr15k. Fig. 4.11 (c) shows that the number of training samples is not the only factor that matters. Here we increase the number of categories from 20 to 80 while decreasing the number of samples per category, keeping the training volume fixed at 4800 sketches. The general trend is an improvement as the number of categories increases. We conclude that category diversity is crucial for training a generalised network.

Figure 4.12: t-SNE visualization of the Flickr15k data distribution after training with (a) 20 classes and (b) 250 classes. Each color represents a semantic class in Flickr15k.
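For readers unfamiliar with the metric, the mAP values reported for Fig. 4.11 can be understood through the following minimal sketch. It assumes the whole database is ranked for every query and that relevance is a binary same-category flag; the function names are illustrative and not taken from the evaluation code.

```python
def average_precision(ranked_relevance):
    """AP for one query; ranked_relevance lists True/False in rank order."""
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / hits if hits else 0.0

def mean_average_precision(per_query_relevance):
    """Mean of the per-query AP values."""
    aps = [average_precision(r) for r in per_query_relevance]
    return sum(aps) / len(aps)

# Two toy queries: AP = (1/1 + 2/3) / 2 and AP = 1/2 respectively.
toy_map = mean_average_precision([[True, False, True], [False, True]])
```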

To visualise the effect of category diversity in training, Fig. 4.12 plots the distribution of the first 6 categories in the Flickr15k dataset, with embedding features extracted from the sketch-edgemap models trained on 20 and 250 categories respectively. A qualitative improvement in inter-class separation is observable with the larger number of training categories, mirroring the mAP gains observed as the category count increases. Quantitatively, after increasing the number of training categories from 20 to 250, the inter-class distance of the embedding features increases by 33%, while the average intra-class distance is reduced by 10%.
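One plausible way to obtain such inter- and intra-class figures is sketched below. The exact definitions used for the numbers above are not stated here, so this example assumes Euclidean distance, with intra-class distance measured from each embedding to its class centroid and inter-class distance measured between pairs of class centroids. The function names and toy data are hypothetical.

```python
import math
from itertools import combinations

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]

def class_distances(embeddings_by_class):
    """Return (mean intra-class distance, mean inter-class distance)."""
    centroids = {c: centroid(pts) for c, pts in embeddings_by_class.items()}
    # Assumed definition: distance of every embedding to its own class centroid.
    intra_terms = [math.dist(p, centroids[c])
                   for c, pts in embeddings_by_class.items() for p in pts]
    # Assumed definition: distance between every pair of class centroids.
    inter_terms = [math.dist(a, b) for a, b in combinations(centroids.values(), 2)]
    return (sum(intra_terms) / len(intra_terms),
            sum(inter_terms) / len(inter_terms))

# Toy 2-D embeddings for two classes.
toy = {"A": [[0.0, 0.0], [0.0, 2.0]], "B": [[4.0, 1.0], [6.0, 1.0]]}
intra, inter = class_distances(toy)
```

Comparing the two values returned for a 20-category model against those for a 250-category model would give relative changes of the kind quoted above.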

All three plots in Fig. 4.11 show the superior performance of the partially-shared triplet architecture over the no-share and fully-shared networks, regardless of the matching format (sketch-edgemap or sketch-photo). This behaviour can be explained in two ways. First, a regression loss (contrastive/triplet) in general provides looser regularisation than a classification loss (softmax), since it regulates only the relative distance between the network’s outputs. Thus, a Siamese/triplet network is more prone to overfitting and harder to train. In this sense, layer sharing can be interpreted as a way to combat overfitting, since it reduces the number of trainable parameters. Second, a degree of information sharing makes intuitive sense. Even if two domains are visually different, at a certain level of interpretation they must share some common traits, e.g. a “duck” object often has a long neck, wings and two legs regardless of whether its medium (domain) is sketch, photo or edgemap. By sharing the top layers, we assume these layers are responsible for learning such high-level common features. On the other hand, the bottom layers should be left non-shared to capture low-level and domain-specific features (Fig. 4.13).

(a) sketch-edgemap: sketch conv1 (b) sketch-edgemap: edge conv1

Figure 4.13: Visualisation of the first convolution layer for the sketch-edgemap (a-b) and sketch-image (c-d) models. The green and red boxes in (a-b) highlight single and double edge filters, which correspond to the differences between sketch and edgemap in Fig. 4.6. The compositions of the filter banks of the two branches are quite different, supporting independent learning of the early-stage layers in our partial-share framework.
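The point that a triplet loss constrains only relative distances can be illustrated with a small sketch. It assumes the common margin-based hinge form with Euclidean distance, which may differ in detail from the loss defined elsewhere in this chapter; the vectors and margin are toy values.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss (assumed form): max(0, margin + d(a,p) - d(a,n))."""
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, margin + d_pos - d_neg)

def shift(vector, offset):
    """Translate every component of an embedding by the same offset."""
    return [x + offset for x in vector]

# Toy embeddings: the matching image is further from the sketch than the non-match.
sketch, match, non_match = [0.0, 0.0], [3.0, 4.0], [0.0, 1.0]
loss = triplet_loss(sketch, match, non_match)  # 1 + 5 - 1 = 5

# Translating all three embeddings leaves the loss unchanged: only relative
# distances are constrained, not where the outputs lie in the embedding space.
shifted_loss = triplet_loss(shift(sketch, 10.0), shift(match, 10.0),
                            shift(non_match, 10.0))
```

Because absolute positions in the embedding space are unconstrained in this form, the loss places fewer restrictions on the outputs than a softmax classifier does, which is consistent with the looser regularisation discussed above.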
