4.5 Additional features of simulator architecture
5.1.1 Unified hyperparameter setup
Figure 5.1 illustrates the basic architecture layout shared by all of the evaluated models: (a) an image module processes the (raw or pretrained) input image features and outputs a map of positional image embeddings; (b) a language module embeds and processes the input question/caption words and outputs a language embedding; (c) a core module combines positional image and language features and outputs a fusion embedding; and (d) a classifier module maps the fusion vector to a softmax distribution over responses. In the following, each module is described in detail, including layer sizes and other hyperparameter choices.
Image module. The visual input is processed by a sequence of convolutional layers. Each layer consists of a convolution operation (LeCun et al., 1989) with kernel size 3×3 and stride size 1 or 2, followed by two-rank batch normalisation (Ioffe and Szegedy, 2015) and subsequently a rectified linear unit (Glorot et al., 2011). The output image features are of size w × h × c, that is, w · h positional image embeddings of c dimensions. The number of convolutional layers, the number of kernels and the stride size per layer – and consequently the size of the image features – depend on data and architecture:
• CLEVR with pretrained ResNet-101 image features: Following (Johnson et al., 2017a; Johnson et al., 2017b; Perez et al., 2018), after resizing to images of size 224×224×3, features of size 14 × 14 × 1024 are extracted from the conv4 layer of a pretrained ResNet-101 (He et al., 2016). In this case, the image module consists of only one convolutional layer with 128 kernels and stride 1, yielding output features of size 14 × 14 × 128. The CNN+LSTM+REL model uses a stride size of 2 instead, yielding an output of size 7 × 7 × 128 (to keep the number of pairwise combinations moderate).
• CLEVR from raw images: Similar to (Santoro et al., 2017; Malinowski and Doersch, 2018; Perez et al., 2018), but without resizing images, four convolutional layers with 128 kernels and stride 2 are applied to the input, thus reducing the initial size of 480 × 320 × 3 to 30 × 20 × 128. The CNN+LSTM+REL model adds a fifth convolutional layer, yielding an output of 15 × 10 × 128 (again, to keep the number of pairwise combinations moderate).
• ShapeWorld from raw images: The same configuration as above is used, but with only three instead of four convolutional layers, due to the smaller input image dimensions of 64 × 64 × 3. The output image features are thus of size 8 × 8 × 128.
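The feature-map sizes listed above follow from simple stride arithmetic. The sketch below is only a consistency check, not part of the original implementation: the helper name `conv_stack_output` is hypothetical, and it assumes "same" padding, so that a stride-s layer maps a side of length n to ceil(n / s). It also assumes CLEVR's native image size of 480 × 320.

```python
import math

def conv_stack_output(width, height, strides, channels=128):
    """Spatial size after a stack of 3x3 convolutions.

    Assumes 'same' padding (an assumption of this sketch), so each
    layer maps a side of length n to ceil(n / stride).
    """
    for stride in strides:
        width = math.ceil(width / stride)
        height = math.ceil(height / stride)
    return (width, height, channels)

# CLEVR with pretrained ResNet-101 features (14 x 14 x 1024 input)
resnet_default = conv_stack_output(14, 14, [1])       # (14, 14, 128)
resnet_rel = conv_stack_output(14, 14, [2])           # (7, 7, 128)

# CLEVR from raw images, assuming the native 480 x 320 size
clevr_default = conv_stack_output(480, 320, [2] * 4)  # (30, 20, 128)
clevr_rel = conv_stack_output(480, 320, [2] * 5)      # (15, 10, 128)

# ShapeWorld from raw 64 x 64 images
shapeworld = conv_stack_output(64, 64, [2] * 3)       # (8, 8, 128)

print(resnet_default, resnet_rel, clevr_default, clevr_rel, shapeworld)
```

The smaller maps for CNN+LSTM+REL matter because the relation module processes (w · h)² pairs: 7 × 7 positions give 2,401 pairs, whereas 14 × 14 positions would give 38,416.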
Language module. The words of the input question/caption are mapped to 128-dimensional word embeddings and processed by an LSTM (Hochreiter and Schmidhuber, 1997) – or GRU (Cho et al., 2014) in case of CNN+GRU+FILM – of size 512, with the output at the final word serving as the 512-dimensional language embedding. The CNN+LSTM+REL model uses an LSTM of size 128 instead (to keep the size of the pairwise combination embeddings moderate).
Core modules.
• CNN baseline: No language input. A 128-dimensional linear transformation is applied to the 128-dimensional positional image embeddings, with subsequent max-pooling over all embeddings to obtain a 128-dimensional output fusion embedding.
• LSTM baseline: No image input. The 512-dimensional language feature vector is passed on as the output fusion embedding.
• CNN+LSTM baseline: A combination of the two modules above, which fuses visual and language features after global pooling. A linear transformation is applied to the 128-dimensional positional image embeddings before max-pooling, and the resulting vector is concatenated with the 512-dimensional language embedding to yield the 640-dimensional output fusion embedding.
• CNN+LSTM+SA model (stacked attention, Yang et al. (2016)), which infuses the language features before global pooling by conditioning a series of attention maps over image features: An initial 256-dimensional linear transformation is applied to the 128-dimensional positional image embeddings and the 512-dimensional language embedding, respectively. Subsequently, two stacked attention layers process the input. For each layer, a 256-dimensional linear transformation processes the positional image and language embeddings, respectively, before adding them and applying a tanh activation function. Another 1-dimensional linear transformation with subsequent softmax operation gives the multiplicative attention map over positional image embeddings. The resulting 256-dimensional weighted sum of positional image embeddings is added to either the transformed input language embedding or the output of the previous layer. This yields a final 256-dimensional output fusion embedding.
• CNN+LSTM+REL model (relation module, Santoro et al. (2017)), which combines pairs of positional image with language features and processes them in a series of additional fully-connected layers before global pooling: An initial 32-dimensional linear transformation turns the 128-dimensional positional image embeddings into 32-dimensional image features, which are then concatenated with a 2-dimensional map of relative spatial coordinates. Subsequently, each pair of 34-dimensional embeddings is concatenated with a copy of the 128-dimensional language embedding and processed by four 256-dimensional fully-connected layers with rectified linear units. The resulting (w · h)² 256-dimensional vectors are sum-pooled to obtain a single 256-dimensional output fusion embedding.
• CNN+LSTM+MC model (multimodal core, Malinowski and Doersch (2018)), which combines visual and language features and processes them in a series of additional fully-connected layers before global pooling: Each 128-dimensional positional image embedding is concatenated with a copy of the 512-dimensional language embedding. Following two-rank batch normalisation, each positional vector is processed by four 256-dimensional fully-connected layers with rectified linear units. Finally, sum-pooling the resulting 256-dimensional vectors yields a single 256-dimensional output fusion embedding.
• CNN+GRU+FILM model (feature-wise linear modulation, Perez et al. (2018)), which infuses the language features before global pooling by conditioning the modulation values following batch normalisation in a series of additional convolutional layers: The image features are processed by four FiLM layers. For each layer, the 128-dimensional positional input embeddings are concatenated with a 2-dimensional map of relative spatial coordinates and processed by a 128-dimensional fully-connected layer with rectified linear unit. Subsequently, a convolution operation with 128 kernels of size 3 × 3 and stride size 1 is applied, followed by two-rank batch normalisation. However, instead of being learned, the scale and offset values are obtained via two 128-dimensional linear transformations from the 512-dimensional language embedding. A rectified linear unit is applied to the output, which is then added as a residual to the vectors from before the convolution operation. Finally, the 128-dimensional positional vectors of the fourth FiLM layer are, again, concatenated with a spatial coordinate map and processed by a final 128-dimensional linear transformation, followed by two-rank batch normalisation and a rectified linear unit, and then max-pooled to a 128-dimensional output fusion embedding.
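To make the fusion operations of the core modules more concrete, the following NumPy sketch traces the tensor shapes for four of them. It is an illustration under assumptions, not the implementation from the repository: all function names are hypothetical, weights are random and untrained, biases are omitted, and the FiLM function computes normalisation statistics over the positions of a single example rather than over a mini-batch, leaving out the convolution and the residual connection.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, out_dim):
    """Random, untrained linear map over the last axis (illustration only)."""
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.01
    return x @ w

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def late_fusion(image, language):
    """CNN+LSTM baseline: linear map, max-pool over positions, concatenate."""
    pooled = linear(image, 128).max(axis=0)                     # (128,)
    return np.concatenate([pooled, language])                   # (640,)

def stacked_attention(image, language, num_layers=2):
    """CNN+LSTM+SA: language-conditioned attention over positions."""
    img = linear(image, 256)                                    # (n, 256)
    query = linear(language, 256)                               # (256,)
    for _ in range(num_layers):
        hidden = np.tanh(linear(img, 256) + linear(query, 256)) # (n, 256)
        attention = softmax(linear(hidden, 1), axis=0)          # (n, 1)
        query = query + (attention * img).sum(axis=0)           # (256,)
    return query

def relation_pairs(image, language):
    """CNN+LSTM+REL input: all ordered pairs of positions plus language."""
    n = image.shape[0]
    left = np.repeat(image, n, axis=0)                          # (n*n, d)
    right = np.tile(image, (n, 1))                              # (n*n, d)
    lang = np.tile(language, (n * n, 1))                        # (n*n, l)
    return np.concatenate([left, right, lang], axis=1)

def film_modulation(features, language, eps=1e-5):
    """FiLM-style conditioning: normalise per channel, then apply a
    scale and offset predicted from the language embedding."""
    normed = (features - features.mean(axis=0)) / np.sqrt(features.var(axis=0) + eps)
    scale = linear(language, features.shape[-1])                # (c,)
    offset = linear(language, features.shape[-1])               # (c,)
    return scale * normed + offset

n = 7 * 7                                  # positions of a 7 x 7 feature map
image = rng.standard_normal((n, 128))
language = rng.standard_normal(512)

late = late_fusion(image, language)
attended = stacked_attention(image, language)
modulated = film_modulation(image, language)
# REL uses 34-dimensional positions (32 + 2 coordinates) and a
# 128-dimensional language embedding.
pairs = relation_pairs(rng.standard_normal((n, 34)), rng.standard_normal(128))

print(late.shape, attended.shape, modulated.shape, pairs.shape)
```

The pair tensor illustrates why the relation module is the most expensive variant: for n positions it holds n² rows of 34 + 34 + 128 = 196 dimensions each, before any fully-connected layer is applied.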
Classifier module. The 128-, 256-, 512- or 640-dimensional output fusion embedding of the core module is processed by a fully-connected layer of size 1024, followed by one-rank batch normalisation and a rectified linear unit, before being mapped to answer logits by another linear layer and passed through a softmax operation to retrieve a distribution over answers.
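The classifier head can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions: the function name `classifier` is hypothetical, weights are random, batch normalisation uses the statistics of the current batch without learned scale and offset, and the number of answers (28) is chosen purely for the example.

```python
import numpy as np

def classifier(fusion, w1, b1, w2, b2, eps=1e-5):
    """Map fusion embeddings (batch, d) to answer distributions."""
    hidden = fusion @ w1 + b1                          # (batch, 1024)
    # Batch normalisation over the batch axis; learned scale/offset
    # are omitted in this sketch.
    hidden = (hidden - hidden.mean(axis=0)) / np.sqrt(hidden.var(axis=0) + eps)
    hidden = np.maximum(hidden, 0.0)                   # rectified linear unit
    logits = hidden @ w2 + b2                          # (batch, num_answers)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
batch, fusion_dim, num_answers = 64, 640, 28           # illustrative sizes
fusion = rng.standard_normal((batch, fusion_dim))
w1 = rng.standard_normal((fusion_dim, 1024)) * 0.01
b1 = np.zeros(1024)
w2 = rng.standard_normal((1024, num_answers)) * 0.01
b2 = np.zeros(num_answers)

probs = classifier(fusion, w1, b1, w2, b2)
print(probs.shape)
```

Only the input dimension of the first layer changes between models; the rest of the head is identical regardless of the core module.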
Optimisation. Models are trained by Adam (Kingma and Ba, 2015), a first-order gradient-based stochastic optimiser, with a learning rate of 3 · 10⁻⁴ and mini-batches of size 64.
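For reference, a single Adam update can be written as below. Only the learning rate of 3 · 10⁻⁴ is taken from the setup described here; the values of `beta1`, `beta2` and `eps` are the defaults suggested by Kingma and Ba (2015) and are an assumption, as is the toy objective f(x) = x² used to exercise the update.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Toy example: minimise f(x) = x^2, whose gradient is 2x.
x, m, v = 1.0, 0.0, 0.0
x_after_first, m, v = adam_step(x, 2 * x, m, v, t=1)

x = x_after_first
for t in range(2, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t)

print(x_after_first, x)
```

Because the update is normalised by the running gradient magnitude, the first step moves the parameter by almost exactly the learning rate, independent of the gradient's scale.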
Codebase and contribution. My implementation(s) can be found as part of the GitHub repository under https://github.com/AlexKuhnle/film, which extends and modifies the FiLM (Perez et al., 2018) repository under https://github.com/ethanjperez/film, which itself is based on the original PG+EE repository under https://github.com/facebookresearch/clevr-iep (Johnson et al., 2017b). Besides modifying the existing code to support the unified hyperparameter setup and to make it compatible with ShapeWorld, I added the implementation of the CNN+LSTM+REL and CNN+LSTM+MC models.
Effective differences between unified models. The core modules differ in a variety of aspects. First, all models but the CNN+LSTM baseline rely on early as opposed to late fusion, that is, they combine visual and language information before pooling all positional embeddings into a single fusion embedding. In case of early fusion, the fusion mechanism is applied either pointwise to, or pairwise between all positional embeddings. Language and (pairs of) positional image embeddings are combined either by concatenation, pointwise addition or an affine operation (multiplication plus addition). Some core modules add a map of relative spatial coordinates to the positional image embeddings before processing them. The core module itself consists either of one or more fully-connected layers applied per position, or a residual convolutional layer applied to a local window of embeddings. The entire process may be repeated multiple times. Finally, the processed positional embeddings are pooled to a global embedding either via concatenation/flattening, global sum- or max-pooling, or by weighted attention. The following table summarises the differences between the core modules with respect to these key characteristics.
| Model | Fusion: when | Fusion: type | Fusion: operation | Coord map | Module operation | Depth | Positional pooling |
| --- | --- | --- | --- | --- | --- | --- | --- |
| CNN+LSTM | late | – | concat | no | – | 1 | concat |
| …+SA | early | pointwise | additive | no | fc | 2 | attention |
| …+REL | early | pairwise | concat | yes | 4 × fc | 1 | sum |
| …+MC | early | pointwise | concat | no | 4 × fc | 1 | sum |
| …+FILM | early | pointwise | affine | yes | res conv | 4 | max |