Further Work - Incrementally resolving references in order to identify visually present objects

Though the two models presented in this thesis go beyond previous work in several respects, they by no means handle all kinds of REs under all circumstances. Both models, for example, assume that another component does the difficult yet important work of segmenting REs (including quantifier scope), identifying that they are in fact REs, and identifying possible relations between them. At the moment, only WAC can handle REs that contain relations, and neither model has any way of handling negation. There are other important aspects of language found in REs that the models might not handle directly, but I think that these models present a way of resolving references for a fair portion of the kinds of REs that people come across day-to-day.

Specifically, we are interested in looking further into the following:

• Compositionality. In both models, composition happens at the level of extensions (i.e., over the set of candidate objects), not necessarily at the level of composing word meanings into phrase meanings. Improved composition would mean composing the meaning representations themselves, e.g., through a semantic calculus.

• Quantification. At the moment, we assume reference to a single, visually present object. Definite descriptions mark this type of reference with the word "the". Other words denote different types of reference: for example, "a" implies that any object falling in a certain class (or set of classes) can be the referent, "all" implies that all objects falling in a certain set of classes are referred to, and numerals such as "two" specify how many such objects are referred to.

• Language Generation. Being a generative model, SIUM is a natural candidate for NLG research, in particular for the generation of REs. It was mentioned in Chapter 3 that SIUM is similar in principle to the NLG model in Mast et al. (2014), but actually using SIUM in an NLG task is left for future work. We have also looked into using the WAC classifiers for generating REs by clustering the classifier coefficients, with some promising initial findings.

• Reference Domains. At the moment, both models require that the reference domain, i.e., the set of candidate objects, be pre-specified and visually present. More could be done to relax this requirement: a reference could be made to an object that is not visually present but is perceived later. References can also be made to abstract entities that are not visually present but represent an object that could be imagined (e.g., a unicorn).

• Non-visual Reference. Though the focus of this thesis has been on reference to visually present objects, the models could also be used to resolve references to entities that are not perceivable, at least not in the way objects are: for example, a particular person, city, or idea (e.g., health or democracy). This could potentially be done using SIUM if each entity has properties and those properties (though not visual; e.g., for a city, the country it is in, its population, its nicknames, etc.) could be learned to ground with the words that refer to those entities. For WAC, the features would need to be determined. Possibly, fitting WAC into a formal framework would also allow it to be used in more abstract situations, such as referring to non-visual entities.

• Demonstratives (and Pronouns). While SIUM can handle the types of REs that we are interested in, WAC can at the moment only handle definite descriptions. It could be made to handle demonstratives by incorporating additional features (e.g., that a hand is pointing, and the coordinates of where it is pointing). It is unclear how pronouns could be incorporated, though a binary feature like the one used for SIUM could also be used.

• Further Fitting into a Formal Framework. WAC represents individual words as classifiers. We have seen how those classifiers can be applied to an object using lambda calculus. However, the classifiers could in turn be fit into a larger semantic framework, thus combining the benefits of the WAC model for grounded semantics with those of formal frameworks, which provide the necessary scoping, relations, etc., without, for example, having to ground non-content words like "the".

• Learning through Association. When words are represented as classifiers, they need to be presented with visual features during training. If, for example, we had a richer set of features that could distinguish animals from each other using WAC, but the model had no direct acquaintance with tigers, then it would be feasible to ask "what is a tiger?" If the answer were "a tiger is a kind of large cat", then the model could possibly take the classifier for "cat", copy it as a new classifier for "tiger", and then adjust the weights for size. If additional information is also given, such as the tiger's colour and the fact that tigers generally have stripes, then the weights that represent those features could also be adjusted, thus gaining an intensional notion of what a tiger is without ever having seen one.

• Learning by Discovery in Interactive Dialogue. Both models are trained on a corpus and evaluated offline. It would be developmentally motivated, for example, to begin the WAC model with no classifiers and have it learn the meanings of words online, through experience, as it interacts with a human who points to objects, produces definite descriptions of them, etc.
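The Compositionality item notes that both models compose at the level of extensions rather than by composing word meanings. The following is a minimal Python sketch of what that can look like for WAC, under several assumptions: each word classifier is a logistic-regression function over object features (written as a closure, echoing the lambda-calculus view mentioned under Further Fitting into a Formal Framework), phrase meaning is obtained by averaging the per-object classifier responses, and all weights, features, and object names are invented for illustration. It also treats "the" as an ungrounded function that simply selects the highest-scoring object.

```python
import math

# Hypothetical per-word logistic-regression classifiers, written as closures
# (roughly the lambda term  \x. sigmoid(w . x + b)). All weights are invented.
def make_classifier(weights, bias):
    def classify(features):
        z = bias + sum(w * f for w, f in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-z))  # how well the word fits the object
    return classify

# Toy object features: (redness, greenness, size), all made up.
objects = {
    "obj1": (0.9, 0.1, 0.8),
    "obj2": (0.1, 0.9, 0.7),
    "obj3": (0.8, 0.2, 0.2),
}

classifiers = {
    "red": make_classifier((4.0, -4.0, 0.0), -1.0),
    "big": make_classifier((0.0, 0.0, 6.0), -3.0),
}

def compose(words, objs):
    """Extension-level composition: average the content words' scores per object."""
    content = [w for w in words if w in classifiers]  # "the" has no classifier
    return {
        name: sum(classifiers[w](feats) for w in content) / len(content)
        for name, feats in objs.items()
    }

def the(scores):
    """The definite article as an ungrounded function: pick the best-scoring object."""
    return max(scores, key=scores.get)

scores = compose(["the", "big", "red"], objects)
print(the(scores))  # obj1
```

Note that the words are never combined with one another; only their per-object scores are, which is exactly the limitation the item describes.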
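For the Quantification item, one hypothetical way to extend the current single-referent setup is to let the determiner decide how per-object scores are turned into candidate referent sets. The sketch below is an illustration only: the THRESHOLD for class membership, the `interpret` and `members` functions, and the scores are assumptions, not part of either model.

```python
from itertools import combinations

THRESHOLD = 0.5  # assumed cut-off for "falls in the described class"
NUMERALS = {"two": 2, "three": 3}

def members(scores):
    """Objects whose composed score puts them in the described class."""
    return sorted(o for o, p in scores.items() if p >= THRESHOLD)

def interpret(determiner, scores):
    """Return the admissible referent sets for a determiner (hypothetical mapping)."""
    cls = members(scores)
    if determiner == "the":
        # current behaviour of both models: a single best object
        return [frozenset([max(scores, key=scores.get)])]
    if determiner == "a":
        # any single member of the class is an acceptable referent
        return [frozenset([o]) for o in cls]
    if determiner == "all":
        # the whole class is referred to at once
        return [frozenset(cls)]
    if determiner in NUMERALS:
        # any subset of the class with exactly that many members
        return [frozenset(c) for c in combinations(cls, NUMERALS[determiner])]
    raise ValueError(f"unsupported determiner: {determiner!r}")

# Invented composed scores for something like "... red block(s)".
scores = {"obj1": 0.91, "obj2": 0.12, "obj3": 0.78, "obj4": 0.85}
```

Here "a" and "all" draw on the same class but differ in what counts as the referent: one member at a time versus the whole set.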
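The Language Generation item mentions clustering WAC classifier coefficients. A minimal sketch of the general idea follows, using plain k-means with a deterministic farthest-point initialization. The coefficient vectors, the choice of k-means, and k = 3 are all assumptions made for illustration, and the sketch does not reproduce the preliminary experiments referred to in the text. The hope behind such clustering would be that words whose classifiers weight features similarly (e.g., "red" and "maroon") end up together, giving a generator a set of interchangeable candidate words.

```python
# Invented coefficient vectors (one per word) over three features,
# e.g. (redness, greenness, size).
coefs = {
    "red":    [3.9, -0.2, 0.1],
    "maroon": [3.5, 0.1, 0.0],
    "green":  [-0.1, 4.1, 0.2],
    "lime":   [0.2, 3.7, -0.1],
    "big":    [0.0, 0.1, 4.5],
    "large":  [0.1, -0.1, 4.2],
}

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def init_farthest(points, k):
    """Deterministic seeding: start with the first point, then repeatedly add
    the point farthest from all centroids chosen so far."""
    names = list(points)
    centroids = [points[names[0]]]
    while len(centroids) < k:
        far = max(names, key=lambda n: min(dist2(points[n], c) for c in centroids))
        centroids.append(points[far])
    return centroids

def kmeans(points, k, iters=20):
    """Plain k-means; returns a list of clusters, each a list of words."""
    names = list(points)
    centroids = init_farthest(points, k)
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for n in names:
            j = min(range(k), key=lambda i: dist2(points[n], centroids[i]))
            clusters[j].append(n)
        # keep the old centroid if a cluster ends up empty
        centroids = [mean([points[n] for n in c]) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

print(kmeans(coefs, 3))
```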
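The Non-visual Reference item suggests that SIUM could ground words in non-visual properties. The sketch below assumes that SIUM's update takes the form P(I | U) ∝ P(I) Σ_r P(U | r) P(r | I), with P(r | I) uniform over an entity's properties and the posterior after each word serving as the prior for the next. With cities as entities, it might look like this. The entities, properties, and P(w | r) values are invented; in the models they would be learned from data.

```python
# Hypothetical entities described only by non-visual properties.
entities = {
    "paris":  {"country:france", "size:large", "nick:city-of-light"},
    "lyon":   {"country:france", "size:medium"},
    "berlin": {"country:germany", "size:large"},
}

# Invented word-given-property probabilities P(w | r); unseen pairs
# fall back to a small smoothing value.
P_W_GIVEN_R = {
    ("french", "country:france"): 0.6,
    ("german", "country:germany"): 0.6,
    ("big", "size:large"): 0.5,
}
SMOOTH = 0.01

def update(prior, word):
    """One incremental step: P(I | w) is proportional to P(I) * sum_r P(w | r) P(r | I)."""
    post = {}
    for ent, props in entities.items():
        p_r_given_i = 1.0 / len(props)  # uniform over the entity's properties
        likelihood = sum(P_W_GIVEN_R.get((word, r), SMOOTH) * p_r_given_i
                         for r in props)
        post[ent] = prior[ent] * likelihood
    z = sum(post.values())
    return {e: p / z for e, p in post.items()}

def resolve(words):
    dist = {e: 1.0 / len(entities) for e in entities}  # uniform prior P(I)
    for w in words:  # the posterior after each word becomes the next prior
        dist = update(dist, w)
    return dist

dist = resolve(["the", "big", "french", "city"])
print(max(dist, key=dist.get))  # paris
```

Words with no learned association, such as "the" and "city" here, give every entity the same likelihood and so leave the distribution unchanged.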
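The Learning through Association item can be sketched in a few lines, assuming a WAC word classifier is a logistic regression over named features. The feature names, the weights of the "cat" classifier, the `learn_by_association` helper, and the size of each weight adjustment are hypothetical. The point is only the mechanism: copy an existing classifier and shift the weights that a verbal definition mentions.

```python
import math

# Hypothetical features, scaled to [-1, 1] (e.g. size: -1 small, 1 large).
FEATURES = ("size", "orange", "stripes")

def score(clf, feats):
    """Logistic-regression response of a word classifier to an object."""
    z = clf["bias"] + sum(clf["w"][f] * feats[f] for f in FEATURES)
    return 1.0 / (1.0 + math.exp(-z))

# Invented trained classifier for "cat": prefers small objects, ignores colour.
classifiers = {"cat": {"bias": 0.0, "w": {"size": -3.0, "orange": 0.0, "stripes": 0.0}}}

def learn_by_association(new_word, base_word, adjustments):
    """Copy an existing classifier and shift the weights named in a definition."""
    base = classifiers[base_word]
    clf = {"bias": base["bias"], "w": dict(base["w"])}  # copy, do not alias
    for feat, delta in adjustments.items():
        clf["w"][feat] += delta
    classifiers[new_word] = clf
    return clf

# "A tiger is a kind of large cat": flip the size preference.
# "Tigers are orange and have stripes": raise those weights too.
learn_by_association("tiger", "cat", {"size": 6.0, "orange": 2.0, "stripes": 2.0})

house_cat = {"size": -0.8, "orange": -1.0, "stripes": -1.0}
big_striped = {"size": 0.9, "orange": 1.0, "stripes": 1.0}
```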
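For the Learning by Discovery in Interactive Dialogue item, a minimal sketch of online learning follows. It assumes each word is a logistic-regression classifier updated by one stochastic-gradient step per observation, that the described object is a positive example and every other object in the scene a negative one, and an invented learning rate. The `OnlineWAC` class and its methods are hypothetical names, not an existing API.

```python
import math

LEARNING_RATE = 0.5  # assumed step size

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class OnlineWAC:
    """Words-as-classifiers that start empty and are updated one observation at a time."""

    def __init__(self, n_features):
        self.n_features = n_features
        self.weights = {}  # word -> [bias, w_1, ..., w_n]

    def score(self, word, feats):
        w = self.weights.get(word)
        if w is None:
            return 0.5  # unknown word: no evidence either way
        return sigmoid(w[0] + sum(wi * fi for wi, fi in zip(w[1:], feats)))

    def observe(self, words, objects, referent):
        """A human described objects[referent] with `words`: the referent is a
        positive example for each word, all other objects are negatives."""
        for word in words:
            w = self.weights.setdefault(word, [0.0] * (self.n_features + 1))
            for idx, feats in enumerate(objects):
                label = 1.0 if idx == referent else 0.0
                err = label - self.score(word, feats)  # gradient of the log-loss
                w[0] += LEARNING_RATE * err
                for j, f in enumerate(feats):
                    w[j + 1] += LEARNING_RATE * err * f

model = OnlineWAC(n_features=2)
scene = [(1.0, 0.0), (0.0, 1.0)]  # invented (redness, size) features
for _ in range(10):
    model.observe(["red"], scene, referent=0)
    model.observe(["big"], scene, referent=1)
```

Because classifiers are created on first use, the model begins with no vocabulary at all; an unknown word scores 0.5 for every object until it has been observed.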

In document Incrementally resolving references in order to identify visually present objects in a situated dialogue setting (Page 191-193)