3.1.3 Incremental Approaches

Stoness et al. (2004, 2005) investigated incrementally resolving references to visually present objects in order to provide feedback to earlier processing modules. Consider a sentence with an ambiguous reference such as put the apple in the box in the corner, which could mean either that the apple is in the box, or that the apple is not in the box and the box is in the corner. A syntactic parser would produce parses with both possible attachments; knowing something about the scene, namely that there is an apple in a box, provides useful feedback that lets the parser disprefer the parse that does not resolve to a visually present object. The authors found that their parsers produced far fewer parse hypotheses when given this feedback. Reference resolution itself followed the approach described in Tetreault and Allen (2004) (which mostly focused on pronoun resolution); it was not grounded, but rather rule-based, like the approaches we saw above. The approach was incremental to a degree: it generally worked with complete phrases (e.g., REs) before it attempted to resolve, but it still worked at a finer-grained level than the sentence. Similarly, Schuler et al. (2009) described a framework for incorporating referential semantic information from a world model directly into a language model (similar in spirit to feedback from a world model to the ASR); their approach was incremental, though the referenced objects were not necessarily visually present.

The work presented in Peldszus et al. (2012) also used RR as a task to prune away unlikely syntactic parses. Their main contribution was a semantic processing module that was robust against ill-formed input (again, common in speech and ASR), respected both syntactic and pragmatic constraints, used a principled semantic representation (robust minimal recursion semantics, which will be described in greater detail in Chapter 5), and worked incrementally; that is, it produced underspecified semantic output monotonically at each word increment. However, as with the work presented above, the extension (denotation) of the REs was determined by a set of rules which checked whether a word matched the name given to a specific symbolic property. The data used in the evaluation was from the Pentomino (PENTO) puzzle piece domain (also used in, inter alia, Fernández et al. (2007), which also looked at reference as a task, although the focus of that paper was not on modelling RR). An example of this kind of scene is shown in Figure 3.12; example instructions are given in Example (10).⁶

Figure 3.12: Example PENTO scene.

(10) a. move the yellow v in the top left to the bottom.
     b. take the piece in the middle on the right side.
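As a rough illustration of the rule-based denotation described above, the following sketch narrows down a symbolic scene by checking whether each word of an RE matches the name of a property value. The scene encoding, the property names, and the helper `resolve_by_rules` are assumptions made for illustration only; they are not taken from Peldszus et al. (2012).

```python
from typing import Dict, List

# Hypothetical symbolic scene: object id -> {property: symbolic name}.
Scene = Dict[str, Dict[str, str]]


def resolve_by_rules(words: List[str], scene: Scene) -> List[str]:
    """Return the ids of all objects consistent with every word of the RE.

    A word constrains the candidates only if it names some property value
    in the scene; all other words (e.g. "the") are ignored.
    """
    known_names = {name for props in scene.values() for name in props.values()}
    candidates = list(scene)
    for word in words:
        if word in known_names:
            candidates = [obj for obj in candidates
                          if word in scene[obj].values()]
    return candidates


scene = {
    "obj1": {"colour": "yellow", "shape": "v"},
    "obj2": {"colour": "yellow", "shape": "x"},
    "obj3": {"colour": "red", "shape": "v"},
}
print(resolve_by_rules(["the", "yellow", "v"], scene))
```

Because the word-to-property mapping is fixed by name matching, such a resolver cannot handle a word whose symbolic name it has not been given, which is the limitation that grounded approaches aim to address.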

⁶ We will see PENTO data more throughout the course of this thesis; several corpora will be explained in greater

Siebert and Schlangen (2008) also used the PENTO domain (albeit with different data; the task was similar to the one described above) to resolve references to visually present PENTO objects. Features extracted from the scene (e.g., object features such as size, length, and shape; topological features such as groupings and distance to other objects) made up W, and a grounded meaning of the words was learned using a set of tags that identified what contribution each word in an utterance makes (e.g., whether it belongs to the referred object or to a nearby landmark; whether it is a colour word or a shape word, etc.); these tags, along with the words, comprised U. The meaning of a word was learned by extracting all instances in the corpus where the word is found and identifying which features in the corresponding scene predict the appropriateness of that word. This was done by simple co-occurrence counting between features and words, with an additional step that filtered out features irrelevant to a particular word: a feature was discarded if the variance of that feature for that word was above a certain threshold (determined by hand, though the authors hint that it could be determined automatically). Given U and W, their model resolved the referred object correctly 80% of the time (selecting 1 out of 7 objects; baseline of 14%).
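The counting-and-filtering idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of Siebert and Schlangen (2008): the corpus format, the feature names, the use of population variance, and the function `learn_word_meanings` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance
from typing import Dict, Iterable, List, Tuple

# Hypothetical corpus item: (words of the RE, features of the referred object).
Instance = Tuple[List[str], Dict[str, float]]


def learn_word_meanings(corpus: Iterable[Instance],
                        max_variance: float) -> Dict[str, Dict[str, float]]:
    """Map each word to the features that behave consistently when it is used.

    For every word, collect the feature values of all scenes it co-occurs
    with, then keep only features whose variance does not exceed the
    hand-set threshold. The kept features are summarised by their mean.
    """
    observed: Dict[str, Dict[str, List[float]]] = defaultdict(
        lambda: defaultdict(list))
    for words, features in corpus:
        for word in set(words):
            for name, value in features.items():
                observed[word][name].append(value)

    return {
        word: {name: mean(values)
               for name, values in by_feature.items()
               if pvariance(values) <= max_variance}
        for word, by_feature in observed.items()
    }


# Made-up toy data: "red" co-occurs with stable redness but varying size.
corpus = [
    (["the", "red", "cross"], {"redness": 0.90, "size": 0.2}),
    (["red", "piece"], {"redness": 1.00, "size": 0.8}),
    (["the", "big", "red", "one"], {"redness": 0.95, "size": 0.9}),
]
meanings = learn_word_meanings(corpus, max_variance=0.01)
print(meanings["red"])
```

In this toy data the size feature varies widely across uses of red and is filtered out, while redness is retained. Note that a word seen only once has zero variance on every feature, so a practical system would presumably also need a minimum-count condition.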

Schlangen et al. (2009) also used the PENTO domain to apply an RR model. They used a Bayesian filtering model in which the intended referent is treated as a latent variable that generates a sequence of observations. Formally,

$$P(r \mid w_{1:n}) = \alpha \cdot P(w_n \mid r, w_{1:n-1}) \cdot P(r \mid w_{1:n-1}) \tag{3.3}$$

where $P(w_n \mid r, w_{1:n-1})$ is the likelihood of the new observation, which is modelled by referent-specific language models that approximate the joint probabilities of referents and word sequences (n-grams in the RE; i.e., object names are part of the language model sequence, e.g., for the RE the red circle referring to piece X, the bigram sequence would be the X, red X, circle X). $P(r \mid w_{1:n-1})$ is the prior at step $n$ and the posterior at step $n-1$ (at the initial word, this is just a uniform distribution over the possible referents), and $\alpha$ is a normalising constant. Their model was thus grounded, as it learned these joint probabilities from data. Their model was also able to take disfluencies such as filled pauses (i.e., the speaker took extra time in producing the RE) into account: the filled pauses actually provided useful information to the model and improved its belief as to which object was being referred to (e.g., an object with an unusual shape would be more difficult to describe than an object with a common shape, causing disfluencies). Importantly, their model was update-incremental; it maintained a belief state in the form of a distribution over the potentially referred objects, which was updated at each word (i.e., the belief was updated with new information rather than recomputed). They report that in about 55% of the cases, the object selected by their model (i.e., the argmax of the distribution) was the intended object by the end of the RE. They also report incremental metrics, which we will use in the evaluation of our models in Chapters 5 and 6.
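The word-by-word update in Equation (3.3) can be sketched as below. This is a minimal sketch under assumptions: the likelihood is supplied as a plain function, and the toy table uses made-up unigram probabilities that ignore the word history, whereas Schlangen et al. (2009) used referent-specific n-gram language models. The names `update`, `resolve`, `TOY`, and `toy_likelihood` are hypothetical.

```python
from typing import Callable, Dict, Iterator, List, Sequence

Belief = Dict[str, float]
# likelihood(word, referent, history) stands in for P(w_n | r, w_{1:n-1}).
Likelihood = Callable[[str, str, Sequence[str]], float]


def update(belief: Belief, word: str, history: Sequence[str],
           likelihood: Likelihood) -> Belief:
    """One filtering step: posterior is proportional to likelihood * prior."""
    unnormalised = {r: likelihood(word, r, history) * prior
                    for r, prior in belief.items()}
    total = sum(unnormalised.values())
    if total == 0.0:
        # Assumption: if no referent explains the word, keep the old belief.
        return dict(belief)
    return {r: p / total for r, p in unnormalised.items()}  # alpha = 1/total


def resolve(words: Sequence[str], referents: Sequence[str],
            likelihood: Likelihood) -> Iterator[Belief]:
    """Yield the belief state after each word of the RE."""
    belief = {r: 1.0 / len(referents) for r in referents}  # uniform prior
    history: List[str] = []
    for word in words:
        belief = update(belief, word, history, likelihood)
        history.append(word)
        yield belief


# Toy unigram table with made-up numbers; the history argument is ignored.
TOY = {
    "X": {"the": 0.4, "red": 0.3, "circle": 0.3},
    "Y": {"the": 0.4, "green": 0.3, "circle": 0.3},
    "Z": {"the": 0.4, "red": 0.3, "square": 0.3},
}


def toy_likelihood(word: str, referent: str, history: Sequence[str]) -> float:
    return TOY[referent].get(word, 0.01)  # small floor for unseen words


re_words = ["the", "red", "circle"]
beliefs = list(resolve(re_words, ["X", "Y", "Z"], toy_likelihood))
for word, belief in zip(re_words, beliefs):
    print(word, max(belief, key=belief.get), belief)
```

Each step reuses the previous posterior as the new prior, so only the newest word has to be scored; this is what makes the procedure update-incremental in the sense described above.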

Discussion Though ample work has been done in RR as a task, fewer approaches learned grounded word meanings, even fewer resolved references incrementally, and fewer still did both. Chapters 5 and 6 present two models that do both (albeit in somewhat different ways from each other).