5.2 Model Definition
5.2.2 Deriving an Incremental Model
The leftmost panel of Figure 5.3 shows a model based on an entire RE, but what we want is a model that works incrementally, word by word. That is, given an intended object I that is constant across an entire RE U, we want to model the contribution of each word and, as mentioned above, we assume that the properties in R correspond more directly to words in a RE. This is represented graphically in the centre panel of Figure 5.3, which spans 3 words. This of necessity alters the formulation in (5.4): I is now conditioned on a joint distribution of all other variables, with a corresponding U_k and R_k for each word in a RE. Clearly, this isn’t quite what we want, as it would require a different formulation for each length of RE. On a more practical level, a reference resolution component implemented to reflect such a model would, as it processes word by word, have to compute over the entire RE up to that point; i.e., it would recompute parts that have already been computed in a previous increment. This is the restart-incremental variant of incremental processing as explained in Chapter 2. An update-incremental model has the pleasing result of saving processing time during resolution, and we show presently that it also allows for a simpler derivation (given some additional assumptions) than the restart-incremental version would have. We show this for a two-word RE, but it is trivial to show that it works for REs of all lengths. In the update-incremental model, we treat I as a different variable at each increment, where I in the current step is dependent on all other variables in the current step and the previous step (i.e., word increment):
$$P(I_2 \mid I_1, U_1, U_2, R_1, R_2) = \frac{P(I_1, I_2, U_1, U_2, R_1, R_2)}{P(I_1, U_1, U_2, R_1, R_2)} \tag{5.5}$$
This can be factored in a similar way to (5.4), marginalising out R_1 and R_2:
$$P(I_2 \mid I_1, U_1, U_2) = P(I_2 \mid I_1)\, P(I_1) \sum_{r_2 \in R_2} \frac{P(U_2 \mid R_2)\, P(R_2 \mid I_2)}{P(U_2)} \sum_{r_1 \in R_1} \frac{P(U_1 \mid R_1)\, P(R_1 \mid I_1)}{P(U_1)} \tag{5.6}$$
With this, we can further apply a trick to marginalise over I_1, but explicitly define P(I_2|I_1) as an enforcement of object identity; i.e., we define it as a function that is explicitly set to zero when I_1 does not equal I_2. This is similar to a simple Bayesian update and makes it unnecessary to range over all possible combinations, which brings us the complexity savings (and hence computation savings) that we are looking for. Moreover, the rightmost summation in (5.6), over R_1, is precisely the computation that occurred in the previous incremental step (i.e., the first word of this example two-word RE), which is a distribution over I. We therefore treat P(I_1) as that distribution, effectively making it a prior probability that is set to the posterior of the previous step. We further drop P(U_k) by assuming that all words are equally likely to be uttered. By applying these simplifications, we arrive at the model we are looking for, applied at each word increment, which fits the intuitions of the theoretical IU-network described above:
$$P(I \mid U) = P(I) \sum_{r \in R} P(U \mid R = r)\, P(R = r \mid I) \tag{5.7}$$
It can be trivially shown that (5.6), and equivalently (5.7), can be applied to REs of all lengths, where each R_k is marginalised, as is each I_k, forcing identity at each step. This results in the graphical model portrayed on the right of Figure 5.3, unrolled over 3 words. Though I_k is not the same variable across increments, forcing identity effectively makes it remain constant across the RE.
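The per-word update in (5.7) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the two-object scene, and all probability values are hypothetical, and the explicit normalisation step is our addition (the derivation above drops P(U), so (5.7) itself yields unnormalised scores).

```python
from typing import Dict

Dist = Dict[str, float]


def normalise(scores: Dist) -> Dist:
    """Rescale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("all candidates received zero probability")
    return {k: v / total for k, v in scores.items()}


def update(prior: Dist, word: str,
           p_word_given_prop: Dict[str, Dist],
           p_prop_given_obj: Dict[str, Dist]) -> Dist:
    """One application of (5.7): P(I) * sum_r P(U|R=r) P(R=r|I), then normalise."""
    scores = {}
    for obj, p_obj in prior.items():
        likelihood = sum(
            p_word_given_prop.get(prop, {}).get(word, 0.0) * p_prop
            for prop, p_prop in p_prop_given_obj[obj].items()
        )
        scores[obj] = p_obj * likelihood
    return normalise(scores)


# Hypothetical toy tables (values invented for illustration only).
P_PROP_GIVEN_OBJ = {
    "obj1": {"red": 0.5, "cross": 0.5},
    "obj2": {"green": 0.5, "cross": 0.5},
}
P_WORD_GIVEN_PROP = {  # outer key: property, inner key: word
    "red": {"red": 0.8, "cross": 0.1, "the": 0.1},
    "green": {"green": 0.8, "cross": 0.1, "the": 0.1},
    "cross": {"cross": 0.7, "red": 0.1, "green": 0.1, "the": 0.1},
}

belief = {"obj1": 0.5, "obj2": 0.5}  # uniform prior before the first word
for w in ["the", "red", "cross"]:
    # The posterior after each word becomes the prior for the next word.
    belief = update(belief, w, P_WORD_GIVEN_PROP, P_PROP_GIVEN_OBJ)
```

In this toy setting only the word red discriminates between the two objects; the and cross are equally likely for both, so they leave the belief unchanged. Each increment touches only the current word, which is the update-incremental behaviour described above.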
The model is composed of several sub-models: P(I), P(U|R), and P(R|I). When implemented in a component that resolves references, each sub-model performs a specific task and, in some instances, the models are learned from data. We explain these sub-models in greater detail in the remainder of this section. That is then followed by some toy examples. First, however, we explain further what we mean by the properties in R, which play an important role in our model.
Properties in R
Properties in our model can be visual properties such as colour (e.g., red or green), shape (e.g., cross or v-shaped), or spatial placement (left-of, below, etc.). The purpose of the properties is to ground objects with language in a more fine-grained way than with the object itself. This is an intuitive observation, in that the words people use to refer to objects (particularly visible objects) are visual properties such as shape, spatial arrangement, etc. As is shown in the experiments, the choice of properties is crucial to the success of the model. One can perhaps see the set of properties as a flat (or at least very shallow) ontology; properties of all types are treated as equals.
The properties can also be where additional modalities are incorporated into the model. For example, in a scene where a speaker is pointing at an object and saying the word that, the object being pointed at could have a pointed-at property; the model would then learn that pointed-at grounds to demonstrative words, such as that. We explore this in Experiment 3 below.
Linking Objects and Properties: P(R|I)
The sub-model P(R|I) provides the link between objects and the properties that those objects have. Here we follow what is, to our knowledge, a novel approach, by deriving this distribution directly from the scene representation. We assume that, with equal probability, one of the properties that the intended object actually has is picked to be verbalised, leaving zero probability for the ones that it does not have.² This is in a way the rationality assumption that we alluded to above: a rational speaker will, if at all, mention properties that are realised and not others (at least in non-negative contexts).
² Certainly, this is a rather naive assumption, as certain properties could be more salient or allow the object to be uniquely identified more easily, but this formulation works well in practice.
As alluded to above, there may be cases where the property to be uttered isn’t quite clear. For this reason, P(R|I) can also have uncertainty in its representation by maintaining a distribution over properties, where the probability of a property represents the degree of belief that it belongs to that particular object. It isn’t completely intuitive for a generative model such as this to consider uncertainty in the scene, given the way it has been formulated. Uncertainty in colours, for example, could mean two things: 1) that there is an actual problem on the side of the speaker with the perception of the properties, or 2) that the speaker recognises that the properties to be uttered are not completely prototypical of what might be understood by the listener, so uncertainty is implicit in the distribution over colours in how the utterance is expressed (e.g., the red one vs. the reddish one). Option 2 makes more sense here; producing a RE that signals to the listener that there might be some uncertainty in the colours allows the listener to be more accommodating in how the distribution is spread away from the prototypical colour that was uttered. In either case, the rationality assumption of the speaker should hold, i.e., that certain properties are picked out that the object does have, but if the speaker recognises a problem in perception either on her side or the side of the listener, accommodations can be made in the way the RE is produced.
Besides uncertainty, P(R|I) could also encode saliency information in the distribution over properties, giving some objects a higher probability of being the referred one (in which case, P(R) in the derivation would not be uniform and would be better left in the model). Whether P(R|I) encodes uncertainty about a scene or saliency in properties is a design decision. In Experiment 4, we examine how the model performs when varying uncertainty.
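As a minimal sketch of how P(R|I) might be read off a scene representation, the Python below builds the uniform distribution over the properties each object has, plus a variant that normalises per-object degrees of belief. The scene format (a mapping from object identifiers to property sets), the function names, and the example values are assumptions made for illustration; they are not taken from the experiments.

```python
from typing import Dict, Set

Dist = Dict[str, float]


def p_prop_given_obj(scene: Dict[str, Set[str]]) -> Dict[str, Dist]:
    """Uniform P(R|I) over the properties each object actually has.

    Properties an object does not have are simply absent, i.e. they
    receive zero probability.
    """
    table = {}
    for obj, props in scene.items():
        if not props:
            raise ValueError(f"object {obj!r} has no properties")
        table[obj] = {prop: 1.0 / len(props) for prop in sorted(props)}
    return table


def p_prop_given_obj_uncertain(beliefs: Dict[str, Dist]) -> Dict[str, Dist]:
    """Variant with uncertainty: normalise per-object degrees of belief."""
    table = {}
    for obj, weights in beliefs.items():
        total = sum(weights.values())
        if total <= 0.0:
            raise ValueError(f"object {obj!r} has no positive belief mass")
        table[obj] = {prop: w / total for prop, w in weights.items()}
    return table


# Hypothetical scene: property names are invented for illustration.
SCENE = {
    "obj1": {"red", "cross", "left"},
    "obj2": {"green", "cross"},
}
CRISP = p_prop_given_obj(SCENE)

# Hypothetical degrees of belief from an uncertain colour detector.
UNCERTAIN = p_prop_given_obj_uncertain({"obj1": {"red": 0.6, "orange": 0.3}})
```

Encoding saliency instead of uncertainty would use the same second function with saliency weights in place of degrees of belief, which reflects the design decision mentioned above.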
Linking Language and Properties: P(U|R)
The sub-model P(U|R) represents the grounded mapping between properties and language; i.e., aspects of REs that can be used to pick out those properties. More semantically, this model can be seen as a function from a linguistic element, such as a word, to a semantic concept (e.g., the word red maps to the concept of redness represented in this model by a corresponding property) where the set of properties represents the set of semantic concepts that words can map to.
This is one point where our model departs from most previous work (as presented in Chapter 3) in that the mapping between words and concepts is not pre-defined by rules. Rather, P(U|R) can be learned directly from data by (smoothed) Maximum Likelihood estimation. For training, we assume that the property R that is picked out for verbalisation is actually observable. In our data, we know which properties the referent actually has, and so we can simply count how often a word (or its derived semantic representation) co-occurred with a given property, out of all cases where that property was present.
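The counting procedure just described can be sketched as follows. The smoothing scheme is not specified in the text, so add-alpha (Laplace) smoothing over the observed vocabulary is used here purely as an assumption; the function name, the sample format (a word paired with the property set of its referent), and the data are likewise hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

Dist = Dict[str, float]


def estimate_p_word_given_prop(
    samples: Iterable[Tuple[str, Set[str]]], alpha: float = 1.0
) -> Dict[str, Dist]:
    """Smoothed maximum-likelihood estimate of P(U|R) from co-occurrence counts.

    Each sample pairs one word of a RE with the properties of the referent.
    """
    samples = list(samples)
    vocab = sorted({word for word, _ in samples})
    cooc = defaultdict(Counter)  # property -> word -> count
    for word, props in samples:
        for prop in props:
            cooc[prop][word] += 1
    table = {}
    for prop, counts in cooc.items():
        # Denominator: all word tokens seen with this property, plus smoothing mass.
        denom = sum(counts.values()) + alpha * len(vocab)
        table[prop] = {w: (counts[w] + alpha) / denom for w in vocab}
    return table


# Hypothetical training data: (word, properties of the referred object).
SAMPLES = [
    ("red", {"red", "cross"}),
    ("cross", {"red", "cross"}),
    ("green", {"green", "cross"}),
    ("cross", {"green", "cross"}),
]
P_WORD_GIVEN_PROP = estimate_p_word_given_prop(SAMPLES)
```

With alpha set to 1, a word never seen with a property (here green with the property red) still receives a small non-zero probability, which keeps a single unseen pairing from zeroing out a candidate object during resolution.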
For the experiments described below, we make a technical modification to the model by applying Bayes’ Rule to P(U|R):
$$P(I \mid U) = P(I) \sum_{r \in R} P(R = r \mid U)\, P(R = r \mid I) \tag{5.8}$$
which cancels P(U) (before it was dropped from (5.7)) and introduces P(R) into the summation, but P(R) can be dropped since (in this work) it can be approximated with a uniform distribution. This is motivated by the assumption that P(R|U) is easier to learn using standard, readily available classifiers (with R as class labels; another approach, which we do not explore here, would be to train P(U|R) as a family of language models). The formulation in (5.8) represents the model that we use in the experiments below.
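The claim that (5.7) and (5.8) rank objects identically when P(R) is uniform can be checked numerically. The sketch below is an illustration only: all tables are invented, the names are hypothetical, and P(R|U) is obtained here by applying Bayes’ Rule to the toy P(U|R) table rather than by training a classifier.

```python
from typing import Dict

Dist = Dict[str, float]


def normalise(scores: Dist) -> Dist:
    total = sum(scores.values())
    return {k: v / total for k, v in scores.items()}


def posterior(prior: Dist, prop_scores: Dist,
              p_r_given_i: Dict[str, Dist]) -> Dist:
    """P(I) * sum_r score(r) * P(R=r|I), normalised over objects.

    prop_scores holds P(U=u|R=r) for (5.7) or P(R=r|U=u) for (5.8).
    """
    return normalise({
        obj: p_obj * sum(prop_scores.get(r, 0.0) * p
                         for r, p in p_r_given_i[obj].items())
        for obj, p_obj in prior.items()
    })


# Hypothetical tables (values invented for illustration only).
P_U_GIVEN_R = {  # outer key: property, inner key: word
    "red": {"red": 0.8, "green": 0.1, "cross": 0.1},
    "green": {"red": 0.1, "green": 0.8, "cross": 0.1},
    "cross": {"red": 0.15, "green": 0.15, "cross": 0.7},
}
P_R = {r: 1.0 / len(P_U_GIVEN_R) for r in P_U_GIVEN_R}  # uniform P(R)
P_R_GIVEN_I = {
    "obj1": {"red": 0.5, "cross": 0.5},
    "obj2": {"green": 0.5, "cross": 0.5},
}
PRIOR = {"obj1": 0.5, "obj2": 0.5}


def flip(word: str) -> Dist:
    """Bayes' Rule: P(R=r|U=word) = P(word|r) P(r) / P(word)."""
    p_word = sum(P_U_GIVEN_R[r][word] * P_R[r] for r in P_R)
    return {r: P_U_GIVEN_R[r][word] * P_R[r] / p_word for r in P_R}


word = "red"
via_57 = posterior(PRIOR, {r: P_U_GIVEN_R[r][word] for r in P_R}, P_R_GIVEN_I)
via_58 = posterior(PRIOR, flip(word), P_R_GIVEN_I)
```

Under a uniform P(R), the two scores differ only by a factor that does not depend on the object, so the normalised posteriors coincide. With a non-uniform P(R) this would no longer hold, and P(R) would have to stay in the summation.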
Contextual Prior: P(I)
The sub-model P(I) acts as a prior in our model and provides a way of keeping track of the distribution over I as the RE incrementally unfolds. At the beginning of the computation for an incoming RE, we set the prior P(I) to a uniform distribution (alternatively, it can be used to encode initial expectations about intentions, e.g., prior gaze information). For later words, it is set to the posterior of the previous step, and so this constitutes a Bayesian updating of belief (as explained above, with a trivial, constant transition model that equates P(I_{t-1}) and P(I_t)).³
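To make the role of the trivial transition model concrete, the sketch below marginalises over the previous step with an explicit identity-enforcing transition and confirms that the result is simply the previous posterior. All names and numbers are hypothetical and serve only as an illustration of the argument.

```python
from typing import Callable, Dict, Iterable

Dist = Dict[str, float]


def uniform_prior(objects: Iterable[str]) -> Dist:
    """Prior used before the first word of a RE."""
    objects = list(objects)
    return {obj: 1.0 / len(objects) for obj in objects}


def identity_transition(current: str, previous: str) -> float:
    """P(I_t = current | I_{t-1} = previous): zero unless the object is the same."""
    return 1.0 if current == previous else 0.0


def carry_over(prev_posterior: Dist,
               transition: Callable[[str, str], float]) -> Dist:
    """Prior for step t: sum over j of P(I_t = i | I_{t-1} = j) P(I_{t-1} = j)."""
    return {
        cur: sum(transition(cur, prev) * p for prev, p in prev_posterior.items())
        for cur in prev_posterior
    }


prior_first_word = uniform_prior(["obj1", "obj2", "obj3"])
# Hypothetical posterior after the first word has been processed.
posterior_first_word = {"obj1": 0.7, "obj2": 0.2, "obj3": 0.1}
prior_second_word = carry_over(posterior_first_word, identity_transition)
```

Because the identity transition zeroes every cross term, the quadratic sum in carry_over collapses to a copy of the previous posterior. An implementation can therefore skip the marginalisation and reuse the posterior directly as the next prior.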