
2.3 FACILITATING LEARNING IN CHALLENGING DATA SETTINGS

2.3.3 Noisy data

Noisy or outlier data is a perennial problem in computer vision when relying on data automatically harvested from the web. Even when crowdsourcing is used to obtain human annotations on such webly-harvested data, the annotations often contain some noise and require special handling [33] or multiple annotations per image, which increases expense. All of the datasets studied in this dissertation contain some level of noise due to automatic harvesting. One way to mitigate uncertain or noisy data is through noise-robust losses [235, 36, 351, 275, 410]. For example, [410] propose generalized cross entropy, which down-weights the gradient on highly incorrect samples. Others [158, 296, 196] predict weights for samples based on the estimated reliability of the data, but require some clean data. [122, 203] leverage self-learning approaches in which models first train on noisy labels, then predict pseudo-labels on the data, which are also used for training. However, self-learning can suffer from error amplification if the model incorrectly learns from the initial noisy labels. Moreover, in our settings the problem of noisy labels is more pronounced because the semantics we seek to model are latent within the data and lack a consistent visual appearance across the class. This visual incoherence within the class makes it more difficult for models to learn the concept, particularly in the presence of label noise. We indirectly handle noise in our photographic style and visual persuasion projects by restricting the types of features the model can learn (for photographs) or by restricting the data, training only on faces detected within the images (for ads). We explicitly handle noise in our task of modeling political bias by applying automated techniques to clean the data.
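As a rough illustration of the down-weighting behavior attributed to generalized cross entropy above, the following minimal sketch computes the per-sample loss (1 - p^q) / q, where p is the probability the model assigns to the given (possibly noisy) label. The function names and the value q = 0.7 are illustrative assumptions, not taken from this dissertation. Differentiating the loss shows that it rescales the standard cross-entropy gradient by p^q, so samples on which the model strongly disagrees with the label contribute less.

```python
import math


def gce_loss(p_true: float, q: float = 0.7) -> float:
    """Generalized cross entropy for one sample: (1 - p^q) / q.

    p_true is the predicted probability of the annotated label.
    q = 0.7 is an illustrative choice, not a value from the text.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    return (1.0 - p_true ** q) / q


def gce_gradient_scale(p_true: float, q: float = 0.7) -> float:
    """Factor by which GCE rescales the cross-entropy gradient: p^q."""
    return p_true ** q


def cross_entropy(p_true: float) -> float:
    """Standard cross entropy for one sample, for comparison."""
    return -math.log(p_true)


# A sample the model finds very unlikely under its label (p = 0.01)
# receives a much smaller gradient scale than a confident one (p = 0.9).
scale_suspect = gce_gradient_scale(0.01)
scale_confident = gce_gradient_scale(0.9)
```

At q = 1 the loss reduces to 1 - p, and as q approaches 0 it approaches the standard cross entropy, so q trades off noise robustness against the faster convergence of cross entropy.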

In our work on modeling multimodal political bias (Chapter 5), we propose a two-stage approach in which the text is used to guide the visual model towards semantics of interest. Then, in a second stage, we remove the requirement of text and learn to make purely visual predictions. Our method thus leverages the text domain as a form of guidance to contend with high noise and visual diversity. Similarly, our work on learning general abstract semantics in multimedia also addresses noise. Because of the challenge of cross-modal image-text matching, approaches for contending with label noise have also been adopted in the retrieval setting. [244] learn image-text embeddings on noisy web data by exploiting metadata (tags), while [421] condition a generative model on noisy texts. In contrast, both of our methods for learning abstract semantics require no annotations or metadata beyond image-text co-occurrence. Our first approach (Chapter 6) relies on the image-text complementarity found in communicative multimedia and makes use of semantically neighboring images, which is inherently robust to noise. Our second method (Chapter 7) also relies on complementarity and explicitly handles noise by enhancing semantically informative samples while down-weighting samples suspected to be outliers. Most similar to ours, [7] estimate density by computing the correlation between samples from different modalities. We too aim to detect outliers, but we model density in the image and text spaces independently through modality-specific variational Gaussian mixtures [28]. This has the benefit of taking global statistics into account: for example, a sample from a small, tight cluster of outliers would receive a low weight under our approach, but a high weight under [7]. We show that our approach outperforms [7].
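To make the role of global statistics concrete, the modality-specific densities described above can be written in the standard Gaussian mixture form. This is only a notational sketch: the symbols (the numbers of components K_I and K_T, the mixing weights, means, and covariances) are assumptions introduced here, and this section does not specify how the densities are converted into sample weights.

```latex
p_I(x) = \sum_{k=1}^{K_I} \pi_k^{I}\, \mathcal{N}\!\left(x \mid \mu_k^{I}, \Sigma_k^{I}\right),
\qquad
p_T(t) = \sum_{k=1}^{K_T} \pi_k^{T}\, \mathcal{N}\!\left(t \mid \mu_k^{T}, \Sigma_k^{T}\right)
```

Because the mixing weights are fitted on the whole dataset, a component that explains only a few samples receives a small mixing weight. This is one way in which dataset-wide statistics, rather than only local agreement between samples, can enter the density estimate.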

Weakly supervised learning. Recently, weakly supervised approaches have been proposed for classic topics such as object detection [265, 55, 414, 375, 391] and action localization [365, 299]. Researchers have also developed techniques for learning from potentially noisy web data, e.g. [51]. Also related to our work is research on unsupervised pattern discovery and topic modeling. For example, [323, 416] use an iterative clustering-detection pipeline to discover patterns that are both frequent and discriminative. [199, 206] and [319] leverage deep networks to mine discriminative patterns. [152] and [72] discover patterns informative for the architectural style of a city or the evolving design of cars over the decades. Both of these rely on finding clusters of image patches that are compact in terms of the top-level weak label (e.g. “Paris” or “1950s car”), i.e. clusters that primarily contain samples from a given label, and ignore clusters with a near-uniform label distribution.
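The cluster-selection idea described above (keep clusters dominated by one weak label, ignore clusters with a near-uniform label distribution) can be sketched with a simple entropy test. This is a minimal illustration under stated assumptions, not the criterion used by the cited works: the function names, the normalized-entropy threshold of 0.5, and the toy clusters are all hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in nats) of the weak-label distribution in a cluster."""
    counts = Counter(labels)
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def select_discriminative_clusters(clusters, num_labels, max_normalized_entropy=0.5):
    """Keep clusters whose label distribution is far from uniform.

    clusters maps a cluster id to the list of weak labels of its members.
    The threshold is an illustrative assumption; num_labels must be >= 2.
    """
    max_entropy = math.log(num_labels)  # entropy of a uniform distribution
    kept = []
    for cluster_id, labels in clusters.items():
        if label_entropy(labels) / max_entropy <= max_normalized_entropy:
            kept.append(cluster_id)
    return kept


# Toy example: cluster "a" is dominated by one label, cluster "b" is uniform.
clusters = {
    "a": ["Paris"] * 19 + ["London"],
    "b": ["Paris"] * 5 + ["London"] * 5,
}
kept = select_discriminative_clusters(clusters, num_labels=2)
```

In this toy setting only cluster "a" is retained, since cluster "b" carries no information about the weak label.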

Our work is related to weakly supervised discovery methods in the sense that, other than the (often noisy) labels, our method receives no information about what makes an image contain the latent visual concepts we seek to model. In contrast to these weakly supervised discovery works, however, the problems we study exhibit much larger within-class variance (e.g. with the classes being photographers' identities, types of ads, or whether an image is politically biased). Unlike objects and styles, the differences between our classes live in semantic space as much as (if not more so than) in visual space, so these methods are not guaranteed to succeed. Nevertheless, we borrow intuitions from these methods and help our methods by focusing them on the higher-level semantics of the problem, such as by injecting external semantics or leveraging guided training.

Curriculum learning. Several of our methods use multi-stage training as a strategy to facilitate learning (Table 2). Thus, also relevant to our work are self-paced and curriculum learning approaches [157, 282, 398, 409, 158]. These attempt to simplify learning by finding “easy” examples to learn from first or by leveraging multi-stage training procedures. Several of our methods employ a type of curriculum learning. For example, we first train a multimodal classifier to predict bias, under the assumption that the relation between text and bias is more direct. We then leverage this model as a feature extractor by adding an image-only politics classifier on top of it. Thus, our method focuses the model on relevant visual concepts using text. Related work by [164] and [108] learns semantic concepts on a separate, auxiliary training task, which aids the classifier in performing inference on the target task. Because prior work [266, 128] has shown that using a larger batch size improves classification performance on noisy data by smoothing the gradient, we compare against a baseline curriculum-learning approach designed to alleviate the problem of noisy minibatches in our work on photographic style and predicting political bias. To do so, we freeze the lower layers of the model after training and then perform a second stage of training of just the classifier, using all features in the training set for optimization, which we show slightly improves performance on both of these problems.
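The second-stage baseline described above (freeze the lower layers, then retrain only the classifier using all training features at once) can be sketched as follows. This is a toy illustration under stated assumptions, not the dissertation's implementation: `frozen_features` is a hypothetical stand-in for the frozen lower layers, the classifier is assumed to be a binary logistic model, and the learning rate, step count, and data are arbitrary.

```python
import math


def frozen_features(x):
    """Hypothetical stand-in for the frozen lower layers (never updated)."""
    return [x[0] + x[1], x[0] - x[1]]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_classifier_full_batch(features, labels, lr=0.1, steps=200):
    """Stage two: fit only the classifier, averaging the gradient over
    ALL training features at every step instead of a noisy minibatch."""
    dim = len(features[0])
    w, b = [0.0] * dim, 0.0
    n = len(features)
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for f, y in zip(features, labels):
            err = sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) - y
            for j in range(dim):
                grad_w[j] += err * f[j] / n
            grad_b += err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, f):
    return 1 if sigmoid(sum(wi * fi for wi, fi in zip(w, f)) + b) >= 0.5 else 0


# Toy data: features are extracted once, then reused at every step.
inputs = [(2.0, 1.0), (1.5, 2.0), (-1.0, -2.0), (-2.0, -0.5)]
labels = [1, 1, 0, 0]
features = [frozen_features(x) for x in inputs]
w, b = train_classifier_full_batch(features, labels)
predictions = [predict(w, b, f) for f in features]
```

Because the extractor is frozen, the features can be computed once and cached, which is what makes optimizing over the full training set at every step affordable in this kind of setup.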

In document Modeling Visual Rhetoric and Semantics in Multimedia (Page 63-66)