Spatial Grounding for Improving Model Classification at Training Time

Dropout was first introduced by Hinton et al. [40] and Srivastava et al. [112] as a way to prevent neural units from co-adapting too much on the training data by randomly omitting subsets of neurons at each iteration of the training phase.
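As a minimal sketch of the mechanism described above, the following NumPy snippet randomly omits units during training. It uses the common "inverted" scaling convention (rescaling the surviving activations at training time), which is an assumption made here for simplicity and not necessarily the exact formulation of [40, 112]; the function name `dropout_forward` and the parameter `retain_prob` are illustrative.

```python
import numpy as np

def dropout_forward(activations, retain_prob, rng, training=True):
    """Randomly zero units; each unit is kept with probability retain_prob.

    Inverted scaling (an assumption of this sketch): kept activations are
    divided by retain_prob so their expected value matches the test-time
    forward pass, which leaves activations untouched.
    """
    if not training:
        return activations
    mask = rng.random(activations.shape) < retain_prob  # True = keep the unit
    return activations * mask / retain_prob

rng = np.random.default_rng(0)
h = np.ones((4, 8))                                   # toy hidden activations
h_train = dropout_forward(h, retain_prob=0.5, rng=rng)
h_test = dropout_forward(h, retain_prob=0.5, rng=rng, training=False)
```

A fresh mask is drawn on every call, so each training iteration sees a different random subset of active units.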

Some follow-up works have explored different schemes for determining how much dropout is applied to neurons/weights. Wager et al. [120] described the dropout mechanism in terms of an adaptive regularization, establishing a connection to the AdaGrad algorithm. Inspired by information-theoretic principles, Achille and Soatto [2] proposed Information Dropout, a generalization of dropout which can be automatically adapted to the data. Kingma et al. [59] showed that a relationship between dropout and Bayesian inference can be extended when the dropout rates are directly learned from the data. Kang et al. [52] introduced Shakeout, which, instead of randomly discarding units as dropout does, randomly enhances or reverses each unit’s contribution to the next layer. Wan et al. [121] introduced the DropConnect framework, adding dynamic sparsity on the weights of a deep model. DropConnect generalizes standard dropout by randomly dropping weights rather than neuron activations in the network. Rennie et al. [99] proposed a time schedule for the retaining probability of the neurons in the network. Their adaptive regularization scheme smoothly decreases, over time, the number of neurons turned off during training. Recently, Morerio et al. [83] proposed Curriculum Dropout, which adjusts the dropout rate in the opposite direction, exponentially increasing the unit suppression rate during training and leading to better generalization on unseen data.
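Two of the ideas above can be illustrated with a short, hedged sketch: masking weights instead of activations (in the spirit of DropConnect [121]) and a retain probability that decays over training so that unit suppression increases (in the spirit of Curriculum Dropout [83]). The function names, the parameters `final_retain_prob` and `gamma`, and the specific exponential form are assumptions for illustration; this section does not state the exact schedule or inference procedure used in the cited works.

```python
import numpy as np

def dropconnect_linear(x, weights, retain_prob, rng):
    """Training-time linear layer with a random mask on individual weights.

    Unlike dropout, which zeroes whole activations, each weight is kept
    independently with probability retain_prob. Test-time inference is
    omitted from this sketch.
    """
    mask = rng.random(weights.shape) < retain_prob
    return x @ (weights * mask)

def retain_prob_schedule(step, final_retain_prob, gamma):
    """Illustrative exponential schedule (assumed form).

    Starts at 1.0 (no units dropped) and decays toward final_retain_prob,
    so the suppression rate grows as training progresses.
    """
    return (1.0 - final_retain_prob) * np.exp(-gamma * step) + final_retain_prob

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5))       # batch of 2 inputs with 5 features
w = rng.standard_normal((5, 3))       # weights of a 5 -> 3 linear layer
y = dropconnect_linear(x, w, retain_prob=0.8, rng=rng)
probs = [retain_prob_schedule(t, final_retain_prob=0.5, gamma=0.01)
         for t in (0, 100, 1000)]
```

A schedule in the direction proposed by Rennie et al. [99] would instead increase the retain probability over time, so that fewer neurons are turned off late in training.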

Other works focus on which neurons to drop out. Dropout is usually applied to fully-connected layers of a deep network. Conversely, Wu and Gu [136] studied the effect of dropout in convolutional and pooling layers. The selection of neurons to drop depends on the layer where they reside. In contrast, we select neurons within a layer based on their contribution. Wang and Manning [129] demonstrated that sampling neurons from a Gaussian approximation gives an order of magnitude speedup and more stability during training. Li et al. [72] proposed to use multinomial sampling for dropout, i.e., keeping neurons according to a multinomial distribution with specific probabilities for different neurons. Ba and Frey [4] jointly trained a binary belief network with a neural network to regularize its hidden units by selectively setting activations to zero according to their magnitude. While this takes into consideration the magnitude of the forward activations, it does not take into consideration the relationship of these activations to the ground truth. In contrast, we drop neurons based on how they contribute to a network’s decision.
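To make the idea of multinomial sampling for dropout concrete, here is a small sketch in which units are kept according to a multinomial distribution with per-unit probabilities. The scaling by `num_draws * keep_probs`, chosen so that each unit's expected multiplier equals 1, and all names below are assumptions of this sketch rather than details given in this section.

```python
import numpy as np

def multinomial_dropout(activations, keep_probs, num_draws, rng):
    """Keep units according to a multinomial draw over the units.

    keep_probs must be positive and sum to 1. Units that are never drawn
    are zeroed; units drawn more often receive a larger multiplier.
    """
    counts = rng.multinomial(num_draws, keep_probs)
    multipliers = counts / (num_draws * keep_probs)  # expected value is 1 per unit
    return activations * multipliers, counts

rng = np.random.default_rng(0)
keep_probs = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical per-unit probabilities
h = np.ones(4)                                # toy activations of four units
h_dropped, counts = multinomial_dropout(h, keep_probs, num_draws=2, rng=rng)
```

With uniform probabilities this behaves much like standard dropout with random retention; non-uniform probabilities let some units be retained more often than others.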

We compare our results against Morerio et al. [83], which is the current state of the art. To the best of our knowledge, we are the first to probabilistically select neurons to drop out based on their task relevance.

2.4 Spatial Grounding for Improving Model Classification at Test Time

Fine-grained classification is an important model classification problem for which a large number of approaches have been proposed. The key module in fine-grained classification is finding discriminative parts. Some approaches use supervision to find such discriminative features, i.e., they use annotations for the whole object and/or for semantic parts. Zhang et al. [149] train part models such that the head/body can be compared; however, this requires extensive annotation of parts. Krause et al. [62] use whole-object annotations and no part annotations. Branson et al. [13] normalize the pose of object parts before computing a deep representation for them. Zhang et al. [146] introduce part-abstraction layers in the deep classification model, enabling weight sharing between the two tasks. Huang et al. [41] introduce a part-stacked CNN which encodes part and whole-object cues in parallel based on supervised part localization. Wang et al. [130] retrieve neighboring images from the dataset, i.e., those having similar object pose, and automatically mine discriminative triplets of patches with geometric constraints as the image representation. Deng et al. [21] include humans in the loop to help select discriminative features. Subsequent work of Krause et al. [63] does not use whole or part annotations, but augments fine-grained datasets by collecting web images and experimenting with filtered and unfiltered versions of them. Wang et al. [122] use the ontology tree to obtain hierarchical multi-granularity labels. In contrast to such approaches, we do not require any whole or part annotations at train or test time and do not use additional data or hierarchical labels.

Other approaches are weakly supervised. Such approaches only require an image label, and our approach lies in this category. Lin et al. [73] demonstrate the applicability of a bilinear CNN model in the fine-grained classification task. Sun et al. [115] implement an attention module that learns to localize different parts and a correlation module to coherently enforce correlations among different parts in training. Fu et al. [31] learn where to focus by recurrently zooming into one location from coarse to fine using a recurrent attention CNN. In contrast, we are able to zoom into multiple image locations. Zhang et al. [151] use convolutional filters as part detectors, since the responses of distinctive filters usually focus on consistent parts. Zhao et al. [152] use a recurrent soft attention mechanism that focuses on different parts of the image at every time step. This work enforces a constraint to minimize the overlap of attention maps used in adjacent time steps to increase the diversity of part selection. Zheng et al. [154] implement a multiple attention convolutional neural network with a final fully-connected layer combining the softmax for each part with one classification loss function. Cui et al. [19] introduce a kernel pooling scheme and also demonstrate benefit to the fine-grained classification task. Jaderberg et al. [45] introduce spatial transformers for convolutional neural networks, which result in models that learn invariance to translation, scale, rotation and more generic warping, showing improvement for the task of fine-grained classification.

In contrast, our approach assesses whether the network evidence used to make a prediction is reasonable, i.e., whether it is coherent with the evidence of correctly classified training examples of the same class. We use multiple salient regions, eliminating error propagation from incorrect initial saliency localization, and implicitly enforce part-label correlations, enabling the model to make more informed predictions at test time.
