
2.3 Diagnostics Dataset and Methodology

2.3.2 Diagnostics Methodology

Our work is part of a recent research trend that aims at analyzing, interpreting, and evaluating neural models by means of auxiliary tasks besides the task they have been trained for (Adi et al., 2017; Linzen et al., 2016; Alishahi et al., 2017; Zhang and Bowman, 2018; Conneau et al., 2018). Adi et al. (2017) propose a methodology that facilitates comparing sentence embeddings at a finer-grained level; specifically, they focus on the length of the sentence, the presence of a word in the sentence, and the order of words in the sentence. Linzen et al. (2016) evaluate the sentence representations of neural network architectures on inputs of varying grammatical complexity, focusing on subject-verb agreement, and find that, given explicit supervision, LSTMs can learn to approximate structure-sensitive dependencies. Alishahi et al. (2017) analyze how phonemes are represented and encoded by an RNN trained on a grounded speech signal, showing that phoneme information is saliently present in the lower layers of the RNN. Zhang and Bowman (2018) study the effect of the pre-training task on the type of linguistic knowledge learned by the model. They find that representations learned with bidirectional language models capture syntactic information better than those learned with a translation model, and they also show that a randomly initialized model performs reasonably well. Conneau et al. (2018) introduce ten probing tasks to infer the type of information stored in a sentence embedding vector, and show that sentence embeddings capture a wide range of linguistic knowledge.
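
A common thread across these studies is the diagnostic setup itself: a lightweight auxiliary classifier is trained on frozen representations to predict a linguistic property, and its accuracy is read as evidence of how much of that property the representations encode. The snippet below is a minimal sketch of this setup in PyTorch; the dimensionalities, the number of property classes, and the embeddings and labels tensors are illustrative placeholders rather than data or settings from any of the cited works.

```python
import torch
import torch.nn as nn

# Minimal probing classifier: a linear layer trained on top of *frozen*
# sentence embeddings to predict an auxiliary property (e.g. sentence length).
# All dimensions and data below are illustrative placeholders.
EMB_DIM, NUM_CLASSES = 512, 10

probe = nn.Linear(EMB_DIM, NUM_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# embeddings: (N, EMB_DIM) sentence vectors produced by the model under analysis
# labels:     (N,) gold values of the probed property
embeddings = torch.randn(1000, EMB_DIM)          # stand-in for real embeddings
labels = torch.randint(0, NUM_CLASSES, (1000,))  # stand-in for real labels

for epoch in range(10):
    optimizer.zero_grad()
    logits = probe(embeddings)       # the encoder is never updated: only the
    loss = loss_fn(logits, labels)   # probe's parameters receive gradients
    loss.backward()
    optimizer.step()

# High probe accuracy suggests the probed property is (linearly) decodable
# from the frozen representations.
```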

3 Jointly Learning to See, Ask, and GuessWhat

In this chapter, a grounded dialogue state encoder is proposed which addresses a foundational issue: how to integrate visual grounding with dialogue system components. The proposed visually-grounded encoder leverages synergies between guessing and asking questions, as it is trained jointly using multi-task learning. The model is further enriched via cooperative learning. We show that the introduction of both the joint architecture and cooperative learning leads to accuracy improvements over the baseline system, and we provide an in-depth analysis showing that the linguistic skills of the models differ dramatically despite their comparable performance levels. This points to the importance of analyzing the linguistic output of competing systems beyond a numeric comparison based solely on task success.1

1 Part of the work in this chapter will appear in NAACL 2019 as: Ravi Shekhar, Aashish Venkatesh, Tim Baumgärtner, Elia Bruni, Barbara Plank, Raffaella Bernardi, and Raquel Fernández, “Beyond task success: A closer look at jointly learning to see, ask, and GuessWhat”, in Proc. of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.


Figure 3.1: Our proposed Questioner model with a decision making component.

3.1 Introduction

Over the last few decades, substantial progress has been made in developing dialogue systems that address the abilities that need to be put to work during conversations: understanding and generating natural language, planning actions, and tracking the information exchanged by the dialogue participants. The latter is particularly critical since, for communication to be effective, participants need to represent the state of the dialogue and the common ground established through the conversation (Stalnaker, 1978; Lewis, 1979; Clark, 1996).

In addition to the challenges above, dialogue is often situated in a perceptual environment. In this study, we develop a dialogue agent that builds a representation of the context and the dialogue state by integrating information from both the visual and the linguistic modalities. We take the GuessWhat?! game (de Vries et al., 2017) as our test-bed and model the agent in the Questioner’s role.

To model the Questioner, previous work relies on two independent models to learn to ask questions and to guess the target object, each equipped with its own encoder (de Vries et al., 2017; Strub et al., 2017a; Zhu et al., 2017; Lee et al.,

2017;Shekhar et al., 2018;Zhang et al.,2018). In contrast, we propose an end- to-end architecture with a singlevisually-grounded dialogue state encoder with a decision making(DM) component (Figure 3.1). As shown in Figure 3.1, both visual and textual (QA-pair) is first merged to form a dialogue state (details in Section 3.3). Using this dialogue state, the DM makes the decision to ‘ask’, further question using question generator, or ‘guess’, the target object using the Guesser. Our system is trained jointly in a supervised learning setup, extended with cooperative learning (CL) regime: By letting the model play the game with self-generated dialogues, the components of the Questioner agent learn to better perform the overall Questioner’s task in a cooperative manner. Das et al.

(2017b) have explored the use of CL to train two visual dialogue agents that receive joint rewards when they play a game successfully. To our knowledge, ours is the first approach where cooperative learning is applied to the internal components of a grounded conversational agent.
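
To make the control flow concrete, the following is a minimal sketch of a visually-grounded dialogue state encoder with a decision making head. The module names, dimensionalities, fusion scheme, and the two-way ask/guess rule are illustrative assumptions for this sketch, not the actual implementation described in Section 3.3.

```python
import torch
import torch.nn as nn

class DialogueStateEncoder(nn.Module):
    """Sketch of a visually-grounded dialogue state encoder with a
    decision making (DM) head. Dimensions are illustrative placeholders."""

    def __init__(self, vis_dim=2048, emb_dim=300, hid_dim=512):
        super().__init__()
        self.qa_encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True)  # encodes QA-pair tokens
        self.vis_proj = nn.Linear(vis_dim, hid_dim)                    # projects image features
        self.fusion = nn.Linear(2 * hid_dim, hid_dim)                  # merges the two modalities
        self.dm_head = nn.Linear(hid_dim, 2)                           # DM scores: 'ask' vs 'guess'

    def forward(self, vis_feats, qa_embeds):
        # vis_feats: (B, vis_dim) image features
        # qa_embeds: (B, T, emb_dim) embedded tokens of the last QA pair
        _, (h, _) = self.qa_encoder(qa_embeds)
        qa_state = h[-1]                                   # (B, hid_dim)
        vis_state = torch.tanh(self.vis_proj(vis_feats))   # (B, hid_dim)
        dialogue_state = torch.tanh(
            self.fusion(torch.cat([vis_state, qa_state], dim=-1)))
        decision_logits = self.dm_head(dialogue_state)
        return dialogue_state, decision_logits

# The dialogue state would be shared with the Question Generator and the
# Guesser; the DM logits determine which of the two modules acts next.
encoder = DialogueStateEncoder()
state, logits = encoder(torch.randn(1, 2048), torch.randn(1, 12, 300))
next_action = 'ask' if logits.argmax(dim=-1).item() == 0 else 'guess'
```

In this sketch the encoder is the single shared component, which is what allows the Question Generator, the Guesser, and the DM to be trained jointly on the same dialogue state.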