The CLEVR dataset (Johnson et al., 2017a) has had a big impact on research around visual question answering and inspired a range of new models, including most of the ones presented in section 5.1. Moreover, CLEVR consists of abstract data and is intended for diagnostic evaluation, two aspects it shares with the data and motivation of the ShapeWorld framework. Assessing model performance on CLEVR is thus a natural starting point for the experimental part.
Previous research has mostly followed the precedent of Johnson et al. (2017a) and trained models on visual features extracted from a pretrained ResNet model. Only in some cases was a simple feature extractor for raw images trained as part of the architecture (Santoro et al., 2017; Malinowski and Doersch, 2018; Perez et al., 2018). I conduct experiments using both pretrained and raw image features, and also compare my implementation of a unified hyperparameter version with the original model configurations. The purpose is mainly to connect later results on ShapeWorld – which will not use a pretrained feature extractor and only focus on the unified version – with results in the literature, and to present a complete overview of the different experimental variants.
5.2.1 Data
The CLEVR visual question answering dataset consists of rendered images of abstract three-dimensional scenes, associated with questions and their ground-truth answers, which are generated from a variety of templates. Figure 5.2 shows an example instance. Similar to ShapeWorld, CLEVR’s internal world models are defined by a list of objects located on a two-dimensional plane, although rendered in three dimensions. Objects have one of three shape types (“cube”, “sphere” or “cylinder”), two discrete sizes (“small” or “large”), two materials (“shiny metal” or “matte rubber”), and eight colours. Questions are categorised into different types, depending on the ability required to answer them correctly: existential or counting questions, questions asking to compare object numbers (“equal”, “less” or “more”), or questions either querying or asking to compare the attribute of an object (“shape”, “size”, “material” or “colour”). Overall, there are 28 answers, of which a subset is applicable depending on the question type: yes/no, numbers from 0 to 10, and the 15 attribute values.
• How many small spheres are there? – 2
• What number of cubes are small things or red metal objects? – 2
• Does the metal sphere have the same colour as the metal cylinder? – Yes
• Are there more small cylinders than metal things? – No
• There is a cylinder that is on the right side of the large yellow object behind the blue ball; is there a shiny cube in front of it? – Yes
Figure 5.2: An image and five example questions plus corresponding answers from the CLEVR dataset.
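The answer vocabulary described above can be enumerated to check the stated count of 28. The following is a minimal sketch: the variable names are illustrative, and the eight colour names and the one-word material labels ("metal", "rubber") are assumed from the public CLEVR release rather than taken from the text above.

```python
# Illustrative enumeration of the CLEVR answer vocabulary.
# All names here are assumptions made for this sketch.
ATTRIBUTES = {
    "shape": ["cube", "sphere", "cylinder"],
    "size": ["small", "large"],
    "material": ["metal", "rubber"],  # "shiny metal" / "matte rubber"
    "colour": ["gray", "red", "blue", "green",
               "brown", "purple", "cyan", "yellow"],
}

# Flatten the attribute values: 3 shapes + 2 sizes + 2 materials + 8 colours.
attribute_values = [v for values in ATTRIBUTES.values() for v in values]

boolean_answers = ["yes", "no"]
number_answers = [str(n) for n in range(0, 11)]  # 0 to 10 inclusive

# Full answer vocabulary: yes/no, counts, and attribute values.
answers = boolean_answers + number_answers + attribute_values

print(len(attribute_values))  # number of attribute values
print(len(answers))           # size of the full answer vocabulary
```

Only a subset of these answers is valid for any given question type, so a model that predicts over the full vocabulary also has to learn which answers are applicable.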
The CLEVR training set consists of 70,000 images with 10 questions each, thus overall 700k training instances. In the following, only the number of training iterations is reported – given the batch size of 64 used in all experiments, 100k iterations are equivalent to roughly 9.1 epochs. The validation set contains another 15,000 images with 10 questions each, summing up to 150k validation instances. Accuracy is always measured on the entire validation set, every 2,000 iterations for the first 10k and every 5,000 iterations afterwards.
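The conversion between iterations and epochs, and the evaluation schedule, can be sketched as follows. The function names are hypothetical, and the sketch assumes that evaluation after the first 10k iterations happens at 15k, 20k, and so on.

```python
BATCH_SIZE = 64
NUM_TRAIN_INSTANCES = 70_000 * 10  # 70,000 images with 10 questions each


def iterations_to_epochs(iterations, batch_size=BATCH_SIZE,
                         num_instances=NUM_TRAIN_INSTANCES):
    """Number of passes over the training set after the given iterations."""
    return iterations * batch_size / num_instances


def evaluation_points(total_iterations):
    """Iterations at which validation accuracy is measured:
    every 2,000 up to 10k, every 5,000 afterwards (assumed to start at 15k)."""
    early = list(range(2_000, min(total_iterations, 10_000) + 1, 2_000))
    late = list(range(15_000, total_iterations + 1, 5_000))
    return early + late


print(round(iterations_to_epochs(100_000), 1))  # 9.1
print(evaluation_points(20_000))
```

Under these assumptions, the 200k and 300k iteration budgets mentioned in the results correspond to roughly 18 and 27 epochs, respectively.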
5.2.2 Results
Performance of baseline models. The vision-only CNN baseline achieves an accuracy of slightly above 20% (see figure 5.3). The language-only LSTM baseline reaches around 47% accuracy in accordance with Johnson et al. (2017a), and learning already plateaus after only 5-10k iterations. The multimodal CNN+LSTM baseline obtains around 56% when using pretrained image features, saturating after roughly 50k iterations, and 58% when learning from raw images, plateauing after 80-100k iterations. This is slightly better than the 52.3% reported by Johnson et al. (2017a) and subsequent papers.
Performance of original models. The learning curve for the CNN+GRU+FILM model reaches around 96% accuracy after 200k iterations using pretrained image features (see figure 5.3), and the same score for raw images after 300k iterations, without indicating saturation in either case, which corresponds to the results reported by Perez et al. (2018). The CNN+LSTM+MC model with pretrained features obtains around 86% accuracy in the same time and, surprisingly, only around 71% when using raw images. These accuracy levels fall short of the better results of Malinowski and Doersch (2018), which may simply be a matter of training the models for longer, or may be due to insufficient details on implementation and hyperparameters in their paper. Performance of the CNN+LSTM+SA model using pretrained image features stays below the CNN+LSTM baseline, in stark contrast to the 68.5% of Johnson et al. (2017a), despite the fact that my codebase is based on theirs (to be precise: the one of Johnson et al. (2017b)) with default hyperparameters. Interestingly, CNN+LSTM+SA obtains 66% accuracy after 300k iterations for raw images, which fits better with the results of Johnson et al. (2017a) for pretrained features. Finally, the CNN+LSTM+REL model does not improve upon the CNN+LSTM baseline in either case, which is not consistent with the results of Santoro et al. (2017).
[Figure 5.3 plot, two panels: “Pretrained image features” (0 to 200k iterations) and “Raw images” (0 to 300k iterations); accuracy axis from 0.4 to 1; curves for CNN, LSTM, CNN+LSTM, +SA, +REL, +MC and +FILM.]
Figure 5.3: Performance of original models on the CLEVR dataset (x-axis: iterations in 1000, y-axis: accuracy).
Performance of models with unified hyperparameters. The learning curve for most models changes substantially when moving to the unified configuration (see figure 5.4). For the CNN+LSTM+MC model, performance stays almost the same – unsurprisingly, as its architecture changes comparatively little in the unified setup. The CNN+GRU+FILM model is the only one whose performance decreases substantially, reaching only 88% after 200k iterations with pretrained features, and 84% after 300k using raw images. The other two models both improve upon the CNN+LSTM baseline in this setup. In the case of pretrained image features, all models are roughly on par, with accuracies between 83% and 88%, whereas performance levels vary between 68% and 84% using raw images. The 74% accuracy of the CNN+LSTM+SA model is similar to the 76.6% of the (unpublished) implementation of Santoro et al. (2017), which is “trained fully end-to-end”, so presumably learned from raw images. Interestingly, accuracy of the same model with pretrained features is much better than any reported result for this model.
5.2.3 Conclusion
Unfortunately, only some of the experimental results on CLEVR are in accordance with the literature. While only the CNN+GRU+FILM model reaches roughly the expected accuracy level in all cases, the unified hyperparameter setting makes a big difference for CNN+LSTM+REL and CNN+LSTM+SA. This may be due to the use of batch normalisation.
[Figure 5.4 plot, two panels: “Pretrained image features” (0 to 200k iterations) and “Raw images” (0 to 300k iterations); accuracy axis from 0.4 to 1; curves for CNN+LSTM, +SA, +REL, +MC and +FILM.]
Figure 5.4: Performance of models with unified hyperparameters on the CLEVR dataset (x-axis: iterations in 1000, y-axis: accuracy).
It is well known that even simple reproduction of machine learning results can be problematic, as discussed in section 2.1. The situation here is aggravated by the fact that open-source code was not available for all of the evaluated models, and in some cases details of the architecture were not sufficiently specified in the corresponding paper. However, since the aim of this section is not to tune models for optimal performance on CLEVR, but just to compare my implementations, these issues are not investigated further. Importantly, though, the results generally confirm that the unified model variants obtain good results: they learn to handle CLEVR instances substantially better than the baselines, and often even better than my implementation of their original version.