In this section we will look into some literature on natural language understanding (NLU). We
will define NLU, see some examples of NLU in the literature, and compare it with the task of
RR.
NLU (also called spoken language understanding, SLU; we use both interchangeably here) is defined in Hazen (2011) as “interpretation of signs (e.g., words) conveyed by a speech signal”. Interpretation can be seen as a classification of groups of signs into classes being identified by a semantic label describing a type of semantic constituent. As put in Tur et al. (2012) for the setting of dialogue systems, NLU “aims to automatically identify the domain and intent of the user [speaker] as expressed in natural language and to extract associated arguments or slots to achieve a goal.” Thus NLU goes a step beyond a meaning representation (i.e., where words are converted into logical constants, and their relations are annotated; e.g., a logical form such as FOL) and attempts to determine what the intention of the speaker was in uttering what she did.
The example in (11) shows an utterance in (11-a), a FOL abstraction over that utterance in
(11-b), which is a meaning representation, and an NLU interpretation of that meaning in a useful format (in this case, a semantic frame (Fillmore and Baker, 2001)) in (11-c).
(11) a. Could you please hand me that red one?
b. ιxy. speaker(x) ∧ thing(y) ∧ red(y) ∧ give(x, y)
c. DIALOGUE-ACT request
   REFERENT red+thing
   ACTION give(REFERENT,speaker)
Note the RE in (11-a) denoted in bold typeface, which is represented as an entity x in the FOL representation in (11-b), and as the slot value referent in the frame in (11-c). This frame gives more practical information for a later component (such as the dialogue manager, as explained in Chapter 2); if such a component controlled a robotic arm, this information would be more useful than the logical form in (11-b) to tell it what the speaker intended; namely, a request to hand over a particular object. The component would need to be able to interpret the constants in (11-c) in order to perform the action to move the robotic arm.
We can see from this example that NLU is a slightly broader task than RR. Like RR, NLU is concerned with going beyond a meaning representation and interpreting a speaker’s intention in performing an utterance, though RR assumes that the intention is to refer to an object. Note, however (following Heintze et al. (2010)), that the RE represented abstractly in the REFERENT slot in (11-c) has not yet been resolved. The whole NLU frame breaks down the utterance and determines the overall goal of the utterance (i.e., the dialogue act, the first slot), the referent, and the action to be performed (slot 3); namely, giving the referent to the speaker. The speaker is resolved to the person who last spoke; that is, the person who made the request is the recipient of the give action. But what object does the system give to the speaker? The NLU abstraction only picked out the words that belong to the RE for the object to be given, but it didn’t actually resolve which object it was. This is an additional step that is sometimes a subcomponent of NLU, but in any case it is the component that we are concerned with: a component for resolving REs. A frame with a resolved referent would be more like the example shown in (12), where the identity of the object has been resolved uniquely from the other visually present objects (e.g., object with ID 3 out of a potential scene with 12 objects, where the component assigns each perceived object a unique ID).
(12) a. DIALOGUE-ACT request
   REFERENT o3
   ACTION give(REFERENT,speaker)
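The step from the unresolved frame in (11-c) to the resolved frame in (12) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the scene contents, the property names, and the simple symbolic matching in the hypothetical `resolve_referent` function. It is not the resolution model developed in this thesis; it only shows where the resolution step sits relative to the frame.

```python
# Hypothetical scene: each perceived object has a unique ID and a few
# symbolic properties (all invented for this illustration).
scene = {
    "o1": {"colour": "blue", "shape": "cross"},
    "o2": {"colour": "green", "shape": "circle"},
    "o3": {"colour": "red", "shape": "cross"},
}

def resolve_referent(description, scene):
    """Return the IDs of all objects matching every word in the description.

    The description uses the 'red+thing' notation of (11-c); the
    placeholder word 'thing' carries no constraint and is ignored.
    """
    words = [w for w in description.split("+") if w != "thing"]
    return [obj_id for obj_id, props in scene.items()
            if all(w in props.values() for w in words)]

# The unresolved NLU frame from (11-c).
frame = {
    "DIALOGUE-ACT": "request",
    "REFERENT": "red+thing",
    "ACTION": "give(REFERENT,speaker)",
}

# Replace the description with an object ID only if it is unambiguous.
candidates = resolve_referent(frame["REFERENT"], scene)
if len(candidates) == 1:
    frame["REFERENT"] = candidates[0]
```

With this toy scene the REFERENT slot ends up holding o3, as in (12). If several objects matched, the sketch would leave the slot unresolved rather than guess.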
In non-situated dialogue systems, the NLU component doesn’t necessarily need to perform this kind of RR. For example, we mentioned in the previous chapter the Air Travel Information System (ATIS) (Dahl et al., 1994; Hemphill et al., 1990), which is a commonly used data set for NLU research. An example utterance and corresponding (annotated) NLU frame are shown in Example (13). The corpus has 17 different intents (i.e., dialogue acts; e.g., flight means the user wishes to book a flight; this makes up the goal slot in (13-b)). Filling the slots in the frame can be treated as a tagging task known as concept tagging, where the slot names are the tags and the slot values are the corresponding words in the utterance.
(13) a. What flights are there arriving in Chicago after 11pm?
b. GOAL flight
   TOLOC.CITY_NAME Chicago
   ARRIVE_TIME.TIME_RELATIVE after
   ARRIVE_TIME.TIME 11pm
Once each word is tagged, a database query can be created (by a set of rules) and executed, which returns the desired information. This kind of NLU task is quite useful for speech-based
corpus (Bonneau-Maynard et al., 2006) as well as transportation information in the Polish
LUNA corpus (Marciniak et al., 2010) and the Let’s Go! corpus (Raux et al., 2005), which are all collected telephone conversations, and are hence non-situated (in the way that situated has been defined in Chapter 1).
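The path from tagged words to an executed query can be sketched as follows for the utterance in (13-a). This is a minimal illustration under stated assumptions, not the actual ATIS pipeline: the per-word tags are written by hand rather than predicted by a tagger, slot names are written with underscores, the `flights` table, its columns and its rows are invented, and `build_frame`, `to_24h` and `frame_to_query` are hypothetical helper names.

```python
import sqlite3

# Hand-written concept tags for (13-a); "O" marks words that fill no slot.
tagged = [
    ("What", "O"), ("flights", "O"), ("are", "O"), ("there", "O"),
    ("arriving", "O"), ("in", "O"),
    ("Chicago", "TOLOC.CITY_NAME"),
    ("after", "ARRIVE_TIME.TIME_RELATIVE"),
    ("11pm", "ARRIVE_TIME.TIME"),
]

def build_frame(tagged_words, goal):
    """Collect tagged words into a frame; the goal (intent) is given."""
    frame = {"GOAL": goal}
    for word, tag in tagged_words:
        if tag != "O":
            frame[tag] = word
    return frame

def to_24h(time_word):
    """Convert a whole-hour expression such as '11pm' to 2300."""
    hour = int(time_word[:-2]) % 12
    if time_word.endswith("pm"):
        hour += 12
    return hour * 100

def frame_to_query(frame):
    """Rule-based mapping from a filled frame to a parameterised query."""
    comparison = {"after": ">", "before": "<"}[frame["ARRIVE_TIME.TIME_RELATIVE"]]
    sql = ("SELECT flight_id FROM flights "
           f"WHERE to_city = ? AND arrival_time {comparison} ? "
           "ORDER BY flight_id")
    params = (frame["TOLOC.CITY_NAME"], to_24h(frame["ARRIVE_TIME.TIME"]))
    return sql, params

# Toy in-memory database with invented rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flights (flight_id TEXT, to_city TEXT, arrival_time INTEGER)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?, ?)",
    [("F1", "Chicago", 2230), ("F2", "Chicago", 2315), ("F3", "Boston", 2345)])

frame = build_frame(tagged, goal="flight")
sql, params = frame_to_query(frame)
results = [row[0] for row in conn.execute(sql, params)]
```

Note that the word Chicago is passed to the database as a plain string constant; nothing in the sketch represents what a city is, which is the sense in which such systems work without word meanings.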
Most approaches to NLU using concept tagging have applied and compared various machine learning methods, where features are generally words (and, in some cases, part-of-speech tags) within a certain context. Meurs et al. (2009b, 2008b) applied dynamic Bayesian networks (DBN), a graphical model approach, to NLU in the MEDIA corpus, where words and phrases were related to each other in the DBN structure. This work was extended in Lefevre (2007) and Meurs et al. (2009a), which used multiple levels of DBNs to produce input for a conditional random field (CRF) to predict the slots. Markov logic networks (MLN), another graphical model approach, were also applied to the MEDIA data in Meurs et al. (2008a). Meza-Ruiz et al. (2008) applied MLNs to ATIS, yielding respectable results when considering data across the entire utterance. Hahn et al. (2011) provided a comparison of various machine learning methods and applied them to several tasks (including MEDIA and LUNA). Dinarelli et al. (2012) attempted to perform NLU on several tasks by adding long-distance dependencies, re-ranking the output of a typical NLU model using features from dependency information and an SVM classifier.
Chinaei et al. (2009) applied a more involved method of NLU using unsupervised Hidden Topic Markov Models (HTMM) for recovering the user intention. More recently, Tur et al. (2012) applied deep convex networks for semantic utterance classification, a task similar to NLU (where the utterance domain is determined, rather than the intent). Another approach to discriminative classification was applied to ATIS in Mairesse et al. (2009), where semantic tuple classifiers were used.
Another recent and novel approach also deserves mention. Henderson et al. (2014) applied “deep learning” recurrent neural networks, not to the task of NLU directly; rather, they attempted to treat the semantic frame as a latent variable and directly predict a dialogue decision. Such an approach is feasible in a non-situated, minimally-interactive task such as this, where the words of the entire utterance provide the observed variable which is used to predict the dialogue decision.7
7 Despite the original definition that NLU provides an interpretation of the intention, it doesn’t provide or make use of individual word meanings, which is something we are interested in. For example, in (13), we see that the word Chicago is tagged as TOLOC.CITY_NAME. This tells us that Chicago is a city’s name, but not which city, nor does it tell us what a city name actually is. The city name is a pre-defined constant (e.g., a database table column) that the system can use. Meaning is by no means represented anywhere, though this kind of NLU approach works for practical database lookup systems.
A task that required understanding of fairly complicated speech was presented in Liang et al. (2013). The task was to retrieve facts about United States geography (facts stored in a database called the GEO corpus), using a hand-typed utterance (not speech, in this case); e.g., state with the largest area should return Alaska.8 To get the proper result, their approach applied a semantic abstraction over the typed utterance that the authors called dependency-based compositional semantics (DCS). More formally, a probabilistic model learns the mapping from a question x to a latent logical form z, which is then evaluated with respect to a world w (in this case a database of facts) to produce an answer y; see Figure 3.13. The resulting DCS tree can then be traversed to generate queries.9
Figure 3.13: Example DCS parse for the utterance state with the largest area, taken from Liang et al. (2013)
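The relationship between the question x, the logical form z, the world w and the answer y can be made concrete with a small sketch. It makes several assumptions: the logical form is written by hand as a nested tuple rather than predicted by a probabilistic model, the tuple notation is our own and is not the DCS tree formalism of Liang et al. (2013), and the world contains only three states with approximate, rounded areas.

```python
# World w: a tiny database of facts. Areas are approximate total areas
# in square kilometres, rounded for illustration.
world = {
    "state": {
        "Alaska": {"area": 1_723_000},
        "Texas": {"area": 696_000},
        "California": {"area": 424_000},
    }
}

# Question x, and a hand-written logical form z standing in for the
# latent form that a learned model would have to predict.
x = "state with the largest area"
z = ("argmax", ("entities", "state"), "area")

def evaluate(form, world):
    """Evaluate a nested-tuple logical form against the world."""
    operation = form[0]
    if operation == "entities":
        # All entities of the given type, with their attributes.
        return world[form[1]]
    if operation == "argmax":
        # The entity whose named attribute has the largest value.
        entities = evaluate(form[1], world)
        attribute = form[2]
        return max(entities, key=lambda name: entities[name][attribute])
    raise ValueError(f"unknown operation: {operation}")

# Answer y: the denotation of z in w.
y = evaluate(z, world)
```

The answer depends on both z and w: the same logical form evaluated against a different database could denote a different entity, which is why the approach presupposes that the world is available as a pre-defined database.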
The answer is an entity such as a city, state, or naturally occurring entity such as a river, which in some ways amounts to resolving a referring expression (albeit in an attributive way); the intention of the speaker is always a question, and the desired result is always an answer in the form of the name of an entity. This task is somewhat different from the task presented here, however, where we are interested in directing someone’s attention to a visually present object. We already looked into some situated approaches to NLU in Section 3.1 that, of necessity, had some kind of component (or part of the NLU model) that resolved referring expressions (or the entire model of NLU was in fact a model of RR), such as the map direction task approaches. In some cases, these were not grounded approaches: a direct mapping from U to W was done via rules. In other cases, a grounded word meaning was learned from data.
8 No, it’s not Texas.
9 In fact, when traversing a DCS tree to produce a query, nodes in the tree map directly to database operations (e.g., join, aggregate, execute, etc.). This kind of semantic parser is quite effective, provided the world is represented in such a pre-defined database, and that questions and answers can be provided. This, along with the concept-tagging approaches to NLU, seems to show that approaches to NLU depend heavily on how the data they work with is represented (an insight credited to Jana Götze).
Efforts have been made to model NLU incrementally. This amounts to filling a frame (either pre-defined or one that is dynamically being built) as the utterance unfolds, word for word. Traum et al. (2012) presented a fully-functional dialogue system that handled incremental understanding and feedback in a situated, multi-party setting. Their NLU component produced semantic representations and predictions of final-utterance meaning when given (partial) ASR output. Their approach was restart-incremental, as information that was processed in a previous increment was re-processed (see Chapter 2 for a definition of types of incremental systems). Though entities in the discourse and world needed to be resolved to concepts in the NLU, this generally amounted to pronouns (e.g., I, we, etc.) and not to visually present objects.
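Restart-incremental processing can be sketched in a few lines. The sketch is only an illustration of the processing regime and is not the system of Traum et al. (2012): the keyword lexicon, the slot names and the functions `interpret` and `restart_incremental` are assumptions, and the toy interpreter fills slots with unresolved descriptions rather than object IDs.

```python
# Hypothetical keyword lexicon mapping words to (slot, value) pairs.
LEXICON = {
    "rotate": ("ACTION", "rotate"),
    "snake": ("OBJECT", "snake"),
    "right": ("RESULT", "clockwise"),
}

def interpret(words):
    """Build a frame from scratch for a (possibly partial) word sequence."""
    frame = {}
    for word in words:
        if word in LEXICON:
            slot, value = LEXICON[word]
            frame[slot] = value
    return frame

def restart_incremental(utterance):
    """Yield one frame per increment, reprocessing the whole prefix each time.

    Nothing is carried over between increments: every new word triggers
    a fresh interpretation of all the words heard so far.
    """
    words = utterance.split()
    for end in range(1, len(words) + 1):
        yield interpret(words[:end])

# One frame per word of the utterance, growing as slots become fillable.
frames = list(restart_incremental("rotate the snake to the right"))
```

The cost of this regime is visible in the loop: the first word is processed once per increment, so the total work grows quadratically with utterance length, whereas a fully incremental interpreter would update the previous frame with only the newest word.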
The approach to NLU presented in Kennington and Schlangen (2012, 2014) was also (restart) incremental, and for part of the NLU frame, reference was made to visually present objects. Reference could be made via definite descriptions or exophoric pronouns to PENTO objects on a virtual screen (similar to 3.12). Example utterances (in German; this corpus of data will be explained in greater detail in the next chapter) and corresponding NLU frames are shown in (14) and (15), with REs depicted in bold typeface. The utterance in (15) follows directly after the utterance in (14); both the definite description and the pronoun refer to the same object in the scene, identified as o4.
(14) a. drehe die Schlange nach rechts
b. rotate the snake to the right
c. ACTION rotate
   OBJECT o4
   RESULT clockwise
(15) a. drehe sie nochmal
b. rotate it again
c. ACTION rotate
   OBJECT o4
   RESULT clockwise
This is closer in spirit to the work done in this thesis, although the model (which applied Markov logic networks) was too slow to be realised in a real-time dialogue system. It is useful to note, and we will explore this further in a later chapter, that in reality all three slots (a pre-defined set) were filled with separate models; each model was essentially performing an RR