This section describes the various configurations for ShapeWorld data which form the basis for later experiments. Each configuration focuses on one type of caption pattern like, for instance, statements containing spatial relations. The following overview categorises the different types of datasets (see section 4.2.1 for the corresponding caption components and their interpretation).
EXISTENTIAL
    SINGLE-EXISTENTIAL: EXISTENTIAL ONE/TWO/THREE-SHAPE, EXISTENTIAL FULL
    DOUBLE-EXISTENTIAL: RELATIONAL-TRIVIAL, LOGICAL
QUANTIFICATION: NUMBERS, QUANTIFIERS
RELATIONAL
    NON-SPATIAL: ATTRIBUTE-EQUALITY, ATTRIBUTE-RELATIVE
    SPATIAL: SPATIAL-EXPLICIT
    SPATIAL-: SPATIAL-COMPARATIVE
Since data generation takes much longer than model training for ShapeWorld data, sufficiently large datasets are produced once and reused for all experiments. Training datasets consist of 500k instances. Validation and test sets each consist of an additional 10k instances, with validation instances following the same configuration as the training data, while test instances may additionally exhibit withheld numbers of objects and attribute combinations. Scenes generally contain 1/4/5 to 10/15 objects depending on the dataset, to encourage ‘interesting’ non-trivial situations.²
For all datasets, 5, 10 and 15 are withheld numbers of objects which are only generated for test data. Special cases of datasets with instances consisting of only one, two or three shapes are indicated by the labels ONE/TWO/THREE-SHAPE, and whether overlapping objects are avoided is indicated by COLLISION-FREE.
The following object attribute combinations are withheld for training and validation data: “red square”, “green triangle”, “blue circle”, “yellow rectangle”, “magenta cross”, “cyan ellipse”, “red/green/blue/grey pentagon”, “grey square/triangle/circle/pentagon”. The fact that shape and colour attributes appear in multiple combinations and within varying caption patterns encourages systems to disentangle the two properties in a factored attribute-level representation. Test instances and their withheld combinations consequently evaluate whether such a representation is learned, as otherwise it would not be possible to generalise to these unseen object descriptions. In particular “pentagons” and “grey shapes”, for which only about half of the possible combinations are seen during training, test the degree of robustness of this generalisation ability.
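The withheld combinations listed above can be written down as a small lookup. The following is a minimal sketch, not the ShapeWorld implementation: the names `WITHHELD`, `is_withheld` and `allowed_in_training` are hypothetical, and a scene is assumed to be a plain list of (colour, shape) pairs.

```python
# Hypothetical encoding of the withheld (colour, shape) pairs from the text.
WITHHELD = {
    ("red", "square"), ("green", "triangle"), ("blue", "circle"),
    ("yellow", "rectangle"), ("magenta", "cross"), ("cyan", "ellipse"),
}
# "red/green/blue/grey pentagon"
WITHHELD |= {(colour, "pentagon") for colour in ("red", "green", "blue", "grey")}
# "grey square/triangle/circle/pentagon" (grey pentagon is already included)
WITHHELD |= {("grey", shape) for shape in ("square", "triangle", "circle", "pentagon")}


def is_withheld(colour, shape):
    """True if this combination may only appear in test data."""
    return (colour, shape) in WITHHELD


def allowed_in_training(scene):
    """A scene (list of (colour, shape) pairs) is usable for training or
    validation only if none of its objects is a withheld combination."""
    return not any(is_withheld(colour, shape) for colour, shape in scene)
```

Since "grey pentagon" appears in both of the last two groups, the set contains 13 distinct combinations.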
Existential statements. SINGLE-EXISTENTIAL datasets consist of simple statements referring to the existence of (at least) one object of a certain description, which may be partially underspecified, that is, only mention either the shape or colour of an object. The following list illustrates the different possible surface statements referring to a red square in a scene.
• “There is a square.”
• “There is a red shape.”
• “A shape is a square.”
• “A shape is red.”
• “There is a red square.”
• “A shape is a red square.”
• “A square is red.”
• “A red shape is a square.”
SINGLE-EXISTENTIAL datasets can be seen as a language variant of the object recognition vision task. Expressing the ‘object category’ in language incentivises a system to learn attribute-factored representations rather than independent classes, which allows it to generalise to unseen combinations.
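The agreement of such a statement with a scene can be illustrated with a short sketch. It assumes a hypothetical `Obj` record with `shape` and `colour` fields and treats an unmentioned attribute as a wildcard; none of these names come from ShapeWorld itself.

```python
from typing import NamedTuple, Optional


class Obj(NamedTuple):
    # Hypothetical minimal object record; real scenes carry more information.
    shape: str
    colour: str


def matches(obj: Obj, shape: Optional[str] = None, colour: Optional[str] = None) -> bool:
    """An underspecified description leaves `shape` or `colour` as None,
    in which case that attribute is not checked."""
    return (shape is None or obj.shape == shape) and (
        colour is None or obj.colour == colour
    )


def exists(scene, shape: Optional[str] = None, colour: Optional[str] = None) -> bool:
    """'There is a <colour> <shape>.' holds if at least one object matches."""
    return any(matches(obj, shape, colour) for obj in scene)
```

For a scene containing a red square, `exists(scene, shape="square")`, `exists(scene, colour="red")` and `exists(scene, shape="square", colour="red")` all hold, mirroring the surface variants listed above.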
² SINGLE-EXISTENTIAL: 1-10 objects; DOUBLE-EXISTENTIAL and RELATIONAL: 4-10 objects; QUANTIFICA-
Statements with logical connectives. The LOGICAL dataset combines two existential statements with one of the following logical connectives: “and”, “or”, “if” or “if and only if”. The existential components each refer to a different object, and either of them may be partially underspecified. The following list contains an example for each connective.
• “There is a square and a shape is a circle.”
• “There is a square or a circle is green.”
• “A square is red if there is a circle.”
• “A square is red if and only if there is a green circle.”
The LOGICAL dataset requires detecting the existence or non-existence of two independent descriptions of objects. The connective determines which combinations of existence and non-existence are considered correct. Note that it is not necessary to keep track of both sets of objects simultaneously. For instance, in the case of an “or” statement, either the second part can be ignored if the first description already applies, or the first can be forgotten if it does not apply. This distinguishes the dataset from the RELATIONAL datasets below.
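The following sketch illustrates how a connective maps the truth values of the two existential components to an overall truth value. It is an illustration under assumptions, not ShapeWorld code: the function name `evaluate` is hypothetical, “if” is assumed to be material implication (“A if B” reads as “B implies A”), and the two components are passed as zero-argument callables so that the second is only evaluated when needed.

```python
def evaluate(connective, first, second):
    """Combine two existential components.

    `first` and `second` are zero-argument callables returning bool, given
    in the order in which they appear in the caption.
    """
    if connective == "and":
        return first() and second()
    if connective == "or":
        # If the first description applies, the second is never evaluated.
        return first() or second()
    if connective == "if":
        # Assumption: "<first> if <second>" means "second implies first".
        return first() or not second()
    if connective == "if and only if":
        return first() == second()
    raise ValueError(f"unknown connective: {connective}")
```

The short-circuiting `or` reflects the observation above that the second component can be ignored once the first one is known to apply.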
Statements with numbers or quantifiers. The QUANTIFICATION datasets both consist of quantified statements about a set of objects. In the case of NUMBERS, the quantification is based on an absolute number, whereas QUANTIFIERS statements specify a fraction relative to the total number of objects matching a description. In addition, one of the following comparison modifiers defines the quantification more precisely: “more than”, “at least”, “exactly”, “not”, “at most” or “less than”. A variant of the dataset which uses only the modifier “exactly” is indicated by the suffix -EXACT.
The NUMBERS dataset uses numbers from “zero” to “five”, with one example per number given in the following list.³
• “More than zero shapes are squares.”
• “At least one shape is red.”
• “Exactly two shapes are red squares.”
• “Not three squares are red.”
• “At most four red shapes are squares.”
• “Less than five shapes are red squares.”
The QUANTIFIERS dataset is based on the fractions “half”, “third” and “quarter”, in addition to the ‘trivial’ fractions “no” and “all”. An example for each fraction can be found in the following list.
• “More than no shape is a square.”
• “At least a quarter of the shapes is red.”
• “Exactly a third of the shapes is a red square.”
• “Not half the squares are red.”
• “At most two thirds of the red shapes are squares.”
• “Less than three quarters of the shapes are red.”
• “Not all red shapes are squares.”
³ Some of the sentences may sound unnatural to English speakers. However, I decided to treat numbers/quantifiers and modifiers as fully compositional in ShapeWorld. Note that models in this thesis are trained from scratch on the resulting data, but the captioner can be configured to exclude unnatural combinations, for instance, when using pretrained word embeddings or language models.
The crucial difference between NUMBERS and QUANTIFIERS in terms of quantification complexity is that NUMBERS statements can be correctly answered solely by counting the number of objects satisfying the combined description of noun and verb phrase (“Two squares are red.” → “red squares”), while QUANTIFIERS statements generally require comparing the cardinality of this object set with the number of objects matching only the noun phrase part of the description (“Half of the squares are red.” → “red squares” relative to “squares”). Note also that the -EXACT version with only one modifier, while less complex in terms of linguistic variety, does not contain approximate modifiers like “at most”, and thus requires more precise recognition of numbers.
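The difference between the two kinds of quantification can be made concrete with a small sketch. Everything here is hypothetical rather than taken from ShapeWorld: objects are (shape, colour) tuples, the noun phrase (`restrictor`) and verb phrase (`body`) are passed as predicates, and fractions are compared by cross-multiplication so that an empty restrictor set does not cause a division by zero. How empty restrictor sets are actually treated is not specified in the text, so the behaviour in that case is an assumption.

```python
import operator
from fractions import Fraction

# The six comparison modifiers, mapped to the corresponding comparison.
MODIFIERS = {
    "more than": operator.gt,
    "at least": operator.ge,
    "exactly": operator.eq,
    "not": operator.ne,
    "at most": operator.le,
    "less than": operator.lt,
}


def evaluate_number(scene, modifier, number, restrictor, body):
    """NUMBERS: count objects satisfying noun and verb phrase together."""
    matching = sum(1 for obj in scene if restrictor(obj) and body(obj))
    return MODIFIERS[modifier](matching, number)


def evaluate_quantifier(scene, modifier, fraction, restrictor, body):
    """QUANTIFIERS: compare the matching count relative to the restrictor set.

    `fraction` is a Fraction, e.g. Fraction(1, 2) for "half", Fraction(0)
    for "no" and Fraction(1) for "all". matching / total is compared with
    the fraction via cross-multiplication.
    """
    restricted = [obj for obj in scene if restrictor(obj)]
    matching = sum(1 for obj in restricted if body(obj))
    return MODIFIERS[modifier](
        matching * fraction.denominator, fraction.numerator * len(restricted)
    )
```

In a scene with two red squares, one blue square and one red circle, “Exactly two squares are red.” only needs the count of red squares, whereas “More than half the squares are red.” additionally needs the number of squares (two out of three).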
Relational statements. The dataset category RELATIONAL comprises various relational statements between two or more objects. Where the relation requires an additional comparison object (for instance, “closer to... than”), this description is constrained to unambiguously refer to a single object in the scene. RELATIONAL datasets are further distinguished by the type of relation they contain, as described in the following paragraphs.
First, the RELATIONAL-TRIVIAL dataset consists of ‘trivial’ statements without relational content beyond the co-existence of two objects.
• “A square exists besides a green shape.”
The ATTRIBUTE-EQUALITY dataset comprises relational statements which compare the shape or colour of two objects, whether they are the same or different.
• “A red shape is the same shape as a green shape.”
• “A square is the same colour as a circle.”
• “A red shape is a different shape from a green shape.”
• “A square is a different colour from a circle.”
These instances do not mention the shape/colour in question, as otherwise they would effectively reduce to a kind of existential statement: for instance, “A red shape is the same shape as a green square.” reduces to “There is a red square.”, since only the shape information in “green square” is relevant for the first part of the sentence.
The ATTRIBUTE-RELATIVE dataset contains relations comparing either the size of a shape or the shade of a colour of two objects. Note that these relations implicitly require the same shape/colour to avoid ambiguous comparisons, which is why the corresponding attribute is only mentioned once.
• “A red shape is smaller than a green circle.”
• “A red shape is bigger than a green circle.”
• “A square is darker than a green circle.”
• “A square is lighter than a green circle.”
The SPATIAL-EXPLICIT dataset involves various spatial relations, including two relying on a third comparison object for relative distances. The two relations “behind” and “in front of”, which require overlapping objects, are excluded in the case of COLLISION-FREE datasets.
• “A square is to the left of a circle.”
• “A red square is above a circle.”
• “A red square is behind a circle.”
• “A square is closer to the triangle than a circle.”
• “A square is to the right of a green circle.”
• “A red square is below a green circle.”
• “A red square is in front of a green circle.”
• “A square is farther from the triangle than a green circle.”
Besides these ‘explicitly’ relational statements, ShapeWorld supports two other forms of implicit spatial statements, which consist of an adjectival form of one of the relations above. The SPATIAL-COMPARATIVE dataset comprises statements with adjectives in positive/comparative form. They require the object set described by the noun phrase to contain exactly two objects, between which the spatial relation selects the referred target.
• “The left square is red.”
• “The right red shape is a square.”
• “The upper circle is green.”
• “The lower green shape is a circle.”
• “The red shape closer to the triangle is a square.”
• “The square farther from the triangle is red.”
The SPATIAL-SUPERLATIVE dataset consists of similar statements with adjectives in superlative form. Here, the noun phrase refers to at least two objects, of which the one ‘maximally’ satisfying the spatial relation (that is, all pairwise comparisons with the other objects under consideration) is selected.
• “The leftmost square is red.”
• “The rightmost red shape is a square.”
• “The uppermost circle is green.”
• “The lowermost green shape is a circle.”
• “The red shape closest to the triangle is a square.”
• “The square farthest from the triangle is red.”
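The target selection behind comparative and superlative statements can be sketched as follows. This is an illustration only: the `Obj` record, the `SCORES` table and the `select` function are hypothetical, object positions are reduced to centre coordinates, the y axis is assumed to grow downwards as in image coordinates, and ties are not handled.

```python
import math
from typing import NamedTuple


class Obj(NamedTuple):
    # Hypothetical object record with centre coordinates in [0, 1].
    shape: str
    colour: str
    x: float
    y: float


def _distance(a, b):
    return math.hypot(a.x - b.x, a.y - b.y)


# Higher score means the object satisfies the relation more strongly.
# Assumption: y grows downwards, so "upper" prefers small y values.
SCORES = {
    "left": lambda obj, ref: -obj.x,
    "right": lambda obj, ref: obj.x,
    "upper": lambda obj, ref: -obj.y,
    "lower": lambda obj, ref: obj.y,
    "closer": lambda obj, ref: -_distance(obj, ref),
    "farther": lambda obj, ref: _distance(obj, ref),
}


def select(candidates, relation, reference=None, superlative=False):
    """Pick the object the noun phrase refers to, or None if the statement
    is not applicable: comparative form needs exactly two candidates,
    superlative form at least two."""
    if superlative:
        if len(candidates) < 2:
            return None
    elif len(candidates) != 2:
        return None
    score = SCORES[relation]
    return max(candidates, key=lambda obj: score(obj, reference))
```

A statement such as “The leftmost square is red.” would then be checked by selecting among all squares with `select(squares, "left", superlative=True)` and testing whether the colour of the selected object is red.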