2.2 MULTIMODAL LEARNING
2.2.2 Learning visual semantic embeddings
A fundamental problem in cross-modal inference is the creation of a shared semantic manifold on which multiple modalities may be represented. The goal is to learn a space in which content with related semantics (e.g. images of “border wall” and text about “border wall”) projects to nearby points, regardless of the modality it comes from. We note that such embeddings are generally task-agnostic; that is, they seek to learn a representation that preserves cross-modal semantics in the absence of any particular applied task (e.g. bias detection). Nevertheless, such visual-semantic embeddings (VSE) have received tremendous interest due to their broad downstream applications, such as retrieval [43, 328], captioning [169, 384], tagging [92], and visual question answering [383]. Most VSE approaches learn a joint visual-text space in which some distance metric between embedded samples reflects their semantic relationship [377]. Following the early deep VSE models [92, 241], research has focused on improving the learning objectives [353, 364, 113, 360, 361], e.g. to preserve order [353] rather than distance, to preserve structure within modalities [364], to ground embeddings via generation [113], or to provide modality invariance [360, 361]. Others have leveraged properties of text to improve the visual representation, e.g. through cross-modal attention techniques [195, 253] which consider all possible alignments between detected regions and words. [147] extract visual concepts from images and organize them semantically using the paired text (to determine their correct semantic order).
Unlike the above approaches, which rely on additional tasks and losses and may require extra annotated data, our approaches exploit the structure of each unimodal space (image and text) by leveraging the semantic complementarity found in communicative multimedia. We propose two approaches for learning task-agnostic visual semantic embeddings: one relies on a complementarity-based loss which imposes constraints to preserve intra- and inter-modal semantics, and the other relies on a sample weighting strategy which leverages complementarity between the image and text modalities to assess whether samples are semantically informative. Both of our methods use traditional, well-understood two-stream visual semantic embedding models trained via ranking losses, such as [92, 85, 328].
embeddings (Chapter 6). Most image-text embedding methods rely on a two-stream architecture, with one stream handling visual content (e.g. captured by a CNN) and the other stream handling textual content (e.g. through an RNN). Both streams are trained with paired data, e.g. an image and its captions, and a variety of loss functions are used to encourage both streams to produce similar embeddings for paired data. One common loss used to train such retrieval models is the triplet loss, which originates in the (single-modality) metric learning literature, e.g. for learning face representations [309]. In cross-modal retrieval, the triplet loss has been used broadly [252, 418, 245, 271, 390, 85]. Alternative choices include angular loss [363], N-pairs loss [326], hierarchical loss [99], and clustering loss [259]. Triplet loss [309, 134] takes into account the relative similarity of positives and negatives, such that positive pairs are closer to each other than positives are to negatives. [408] generalize triplet loss by fusing it with classification loss. [260] propose a lifted structure loss which integrates all positive and negative pairs within a minibatch, such that all pair combinations are updated jointly rather than independently. [364] propose a structural loss which pulls together multiple pieces of text paired with the same image, but which requires more than one ground-truth caption per image (which most datasets lack). In contrast, our approach pulls semantically similar images and text together and requires only a single caption per image.
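To make the triplet ranking objective discussed above concrete, the following is a minimal NumPy sketch of a bidirectional (image-to-text and text-to-image) triplet loss over a minibatch of paired embeddings. It is an illustration under stated assumptions rather than the exact formulation of any cited work: cosine similarity as the metric, a margin of 0.2, and summation over all in-batch negatives are assumed here, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def bidirectional_triplet_loss(img_emb, txt_emb, margin=0.2):
    """Sum-of-hinges triplet loss for a batch where row i of img_emb is paired
    with row i of txt_emb. The margin value is an assumption for illustration."""
    img = l2_normalize(np.asarray(img_emb, dtype=float))
    txt = l2_normalize(np.asarray(txt_emb, dtype=float))
    sim = img @ txt.T                  # sim[i, j]: image i versus text j
    pos = np.diag(sim)                 # similarities of the matched pairs

    # Image as anchor: every non-matching text j in the batch is a negative.
    cost_i2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text as anchor: every non-matching image i in the batch is a negative.
    cost_t2i = np.maximum(0.0, margin + sim - pos[None, :])

    off_diag = ~np.eye(sim.shape[0], dtype=bool)   # exclude the positives
    total = cost_i2t[off_diag].sum() + cost_t2i[off_diag].sum()
    return float(total / sim.shape[0])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 8))
    texts = images + 0.1 * rng.normal(size=(4, 8))  # roughly aligned pairs
    print(bidirectional_triplet_loss(images, texts))
```

The loss is zero once every matched pair is more similar than every mismatched pair by at least the margin, which is the relative-similarity property described above; it places no constraint on how images relate to other images or text to other text.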
While single-modality losses like triplet, angular and N-pairs have been used across and within modalities, they are not sufficient for cross-modal retrieval. First, these losses do not ensure that the general semantics of the text are preserved; thus, the cross-modal matching task might distort them too much. This phenomenon resembles forgetting [207, 111], but in the cross-modal retrieval domain. Second, these losses do not exploit the complementary relationship between images and text found in communicative multimedia. In particular, two images might depict substantially different visual content but nonetheless be semantically related. For example, one image of a wedding might show a couple dancing, while another shows a large number of guests eating at several tables; these images are visually diverse but still semantically related. However, there is no component in standard metric learning losses that enforces this semantic coherence at the image level. This is less of a problem for traditional image captioning datasets, which feature literal, descriptive image-text relationships. In contrast, in real-world communicative multimedia, the complementarity of image and text is much more pronounced.
Note that we do not propose new models for image-text alignment, but instead propose cross-modal embedding constraints or weighting metrics which can be used to train any such model. For example, we compare to the recent polysemous visual semantic embedding (PVSE) model of Song et al. [328], which uses global and local features to compute self-attention residuals. Our loss-based and weighting-based approaches improve upon the performance of [328]. Our work is also related to cross-modal knowledge distillation [92, 325, 119, 103], which transfers supervision across modalities. None of these approaches exploit cross-modal complementarity, e.g. the semantic signal that text neighborhoods carry for the image space, to constrain a learned metric space as we do. Finally, [406, 185] detect different types of image-text relationships (e.g. parallel, complementary) but do not retrieve across modalities.
We propose a second approach (Chapter 7) for learning semantically robust embeddings in communicative multimedia, which relies on weighting samples judged to be abstract (i.e. exhibiting latent visual concepts) and therefore important for learning. Our work again exploits image-text complementarity, here to estimate the emphasis the model should place on a given sample. It is thus related to work on sample mining and weighting-based methods. For example, it has long been known that triplet loss can be challenging to train [129] due to the difficulty of choosing informative dissimilar samples. Many have exploited hard negative mining [85, 130, 309, 320, 397, 142], while others have tackled issues stemming from negative sample choice [362, 272, 60, 326, 395]. For example, [326] push multiple negatives away at a time, lessening the need to pick a single hard negative, while [395] correct the distribution shift of the chosen triplets relative to the dataset. Other approaches [43, 360, 413, 65] rely on classification labels or metadata, e.g. to ensure that negatives in the triplet belong to different classes than the positive. Unlike these, our approach works in self-supervised settings without requiring additional labels. Rather than hard selection of negatives, others have used soft weights over samples. In [243], positive samples which violate the margin but are still correctly retrieved are weighted less, while others incur a larger penalty. [212] use sample weights to address hubness (a phenomenon where a small number of embeddings remain undesirably close to many others), such that samples which are hubs receive more attention. Our weights are designed to improve the semantic properties of the learned space by emphasizing samples where the relation between image and text is abstract, not necessarily “hard” samples. This is an important distinction, because some “hard” samples may actually be noisy; we found that hard negative mining prevented methods from training successfully on several of our challenging datasets. Our method outperforms [243] and [212]. We show that our method preserves challenging, abstract and latent semantic concepts such as “justice” or “freedom” in real-world multimedia significantly better, in a task-agnostic, data-driven manner.
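The difference between hard negative mining and soft sample weighting described above can be illustrated with a short NumPy sketch. This is not the weighting scheme proposed in this work, nor the exact formulation of [85], [243] or [212]: the per-sample weights are passed in as a hypothetical, externally computed vector, the margin of 0.2 is an assumption, and only the image-to-text direction is shown.

```python
import numpy as np


def hinge_matrix(sim, margin=0.2):
    """Image-to-text hinge costs for a similarity matrix whose diagonal holds
    the matched pairs; entry (i, j) is the cost of text j as a negative for
    image i."""
    sim = np.asarray(sim, dtype=float)
    pos = np.diag(sim)[:, None]
    cost = np.maximum(0.0, margin + sim - pos)
    np.fill_diagonal(cost, 0.0)        # a positive is never its own negative
    return cost


def hardest_negative_loss(sim, margin=0.2):
    """Hard mining: each anchor is penalized only by its worst negative."""
    return float(hinge_matrix(sim, margin).max(axis=1).mean())


def weighted_loss(sim, weights, margin=0.2):
    """Soft weighting: all negatives contribute, and each anchor's cost is
    scaled by a per-sample weight (hypothetical input, e.g. some measure of
    how informative the sample is judged to be)."""
    w = np.asarray(weights, dtype=float)
    per_anchor = hinge_matrix(sim, margin).sum(axis=1)
    return float((w * per_anchor).sum() / w.sum())


if __name__ == "__main__":
    sim = np.array([[0.9, 0.8, 0.1],
                    [0.2, 0.5, 0.6],
                    [0.3, 0.1, 0.7]])
    print(hardest_negative_loss(sim))
    print(weighted_loss(sim, weights=[1.0, 3.0, 1.0]))
```

In the hard-mining variant a single mislabeled or noisy negative can dominate an anchor's gradient, whereas in the weighted variant the emphasis is set per sample by a criterion that need not coincide with difficulty.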