The resulting architecture is fully differentiable. In addition, the GDSE agent faces a multi-task optimization problem: while the QGen optimizes $L_Q$ and the Guesser optimizes $L_G$, the parameters of the Encoder ($W$, $\mathrm{LSTM}_e$) are optimized via both $L_Q$ and $L_G$. Hence, both tasks faced by the Questioner agent contribute to the optimization of the dialogue state $h_t$, and thus to a more efficient encoding of the input context.
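
To make the parameter sharing concrete, the sketch below shows one joint update in PyTorch in which both losses backpropagate into the shared encoder. The module interfaces, the batch keys, and the simple sum of the two losses are illustrative assumptions, not the exact implementation used here.

```python
import torch

def training_step(encoder, qgen, guesser, batch, optimizer):
    """One joint update: gradients from both L_Q and L_G reach the shared encoder."""
    # Dialogue state h_t from the shared encoder (linguistic + visual input).
    h_t = encoder(batch["dialogue_tokens"], batch["image_features"])

    # L_Q: cross-entropy of QGen's next-word predictions against the human question.
    q_logits = qgen(h_t, batch["question_tokens"])            # (batch, seq, vocab)
    loss_q = torch.nn.functional.cross_entropy(
        q_logits.flatten(0, 1), batch["question_targets"].flatten()
    )

    # L_G: cross-entropy of the Guesser's prediction over the candidate objects.
    obj_logits = guesser(h_t, batch["object_features"])       # (batch, num_objects)
    loss_g = torch.nn.functional.cross_entropy(obj_logits, batch["target_object"])

    # Both losses backpropagate into the encoder parameters (W, LSTM_e).
    optimizer.zero_grad()
    (loss_q + loss_g).backward()
    optimizer.step()
    return loss_q.item(), loss_g.item()
```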

Decision Making. We further extend the GDSE with a decision-making component (DM) (cf. Figure 3.1) that determines, after each question/answer pair, whether QGen should ask another question or whether the Guesser should guess the target object. We treat this decision problem as a binary classification task, for which we use an MLP followed by a Softmax function that outputs probabilities for the two classes of interest: ask and guess. The argmax function then determines the class of the next action. With this approach, we bypass the need to specify any decision thresholds and instead let the model learn whether enough evidence has been accumulated during the dialogue so far to let the Guesser pick up a referent.
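
As a rough illustration of this component, the sketch below implements such a binary ask/guess classifier in PyTorch; the hidden sizes, the single hidden layer, and the class ordering (0 = ask, 1 = guess) are assumptions rather than details taken from the thesis.

```python
import torch
import torch.nn as nn

class DecisionMaker(nn.Module):
    """Binary ask/guess classifier over the dialogue state h_t (sizes are illustrative)."""

    def __init__(self, hidden_size=512, mlp_size=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, mlp_size),
            nn.ReLU(),
            nn.Linear(mlp_size, 2),            # two classes: 0 = ask, 1 = guess
        )

    def forward(self, h_t):
        probs = torch.softmax(self.mlp(h_t), dim=-1)   # class probabilities
        action = probs.argmax(dim=-1)                  # next action via argmax
        return probs, action
```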

3.4 Learning Approach

We first introduce the supervised learning approach used to train both BL and GDSE, then our cooperative learning regime, and finally the reinforcement learning approach we compare to.

3.4.1 Supervised learning

In the baseline model, the QGen and the Guesser modules are trained autonomously with supervised learning (SL): QGen is trained to replicate human questions and, independently, the Guesser is trained to predict the target object.

Our new architecture with a common dialogue state encoder allows us to formulate these two tasks as a multi-task problem, with two different losses (Eq. 3.4 and 3.6 in Section 3.3).

These two tasks are not equally difficult: while the Guesser has to learn the probability distribution over the set of possible objects in the image, QGen needs to fit the distribution of natural language words. Thus, QGen has a harder task to optimize and requires more parameters and training iterations. We address this issue by making the learning schedule task-dependent. We call this setup modulo-n training, where n indicates after how many epochs of QGen training the Guesser is updated together with QGen.
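
A schematic version of this schedule is sketched below. The helper loss functions and module interfaces are hypothetical placeholders, and the default n = 7 anticipates the value selected on the validation set later in this section.

```python
def train_modulo_n(encoder, qgen, guesser, loader, optimizer, n=7, epochs=100):
    """Modulo-n schedule: QGen is optimized every epoch, while the Guesser
    joins the update only every n-th epoch (helpers are placeholders)."""
    for epoch in range(1, epochs + 1):
        update_guesser = (epoch % n == 0)
        for batch in loader:
            h_t = encoder(batch["dialogue_tokens"], batch["image_features"])
            loss = qgen_loss(qgen, h_t, batch)                   # L_Q, every epoch
            if update_guesser:
                loss = loss + guesser_loss(guesser, h_t, batch)  # L_G, every n-th epoch
            optimizer.zero_grad()
            loss.backward()                                      # through the shared encoder
            optimizer.step()
```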

As there are no labels for the DM about when to ask more questions and when to guess the object, we follow the label generation procedure introduced by Shekhar et al. (2018). The labels for the Decider are generated by annotating the last question-answer pair of each game with guess and all other question-answer pairs with ask. We thus have an unbalanced dataset for the Decider module, in which the guess label accounts for only 20% of the examples. We address this class imbalance by adding a weighting factor, $\alpha$, to the loss. The balanced loss is given by

$$ L_D = \alpha_{\text{target}} \cdot \left( -\log p(\text{dec}_{\text{target}}) \right) \qquad (3.7) $$

where $\alpha_{guess} = 0.8$ and $\alpha_{ask} = 0.2$. During inference, we continue to ask questions until the Decider chooses to end the conversation or the maximum number of questions has been reached. The architecture of the Decider consists only of the $\mathrm{MLP}_d$.
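
The sketch below is one way to implement Eq. 3.7 as a class-weighted negative log-likelihood in PyTorch; the class indices (0 = ask, 1 = guess) and tensor shapes are assumptions.

```python
import torch

# Assumed class indices: 0 = ask, 1 = guess; alpha_ask = 0.2, alpha_guess = 0.8.
alpha = torch.tensor([0.2, 0.8])

def decider_loss(logits, targets):
    """logits: (batch, 2) decider scores; targets: (batch,) long tensor of labels."""
    log_probs = torch.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # -log p(dec_target)
    return (alpha[targets] * nll).mean()                         # alpha_target weighting
```

This is essentially a class-weighted cross-entropy; PyTorch's built-in CrossEntropyLoss with a weight argument computes the same quantity up to how the mean is normalized.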

Using the validation set, we experimented with n from 5 to 15 and found that updating the Guesser every 7 epochs worked best. With this optimal configuration, we then train GDSE for 100 epochs (batch size of 1024, Adam, learning rate of 0.0001) and select the Questioner module performing best on the validation set (henceforth, GDSE-SL or simply SL; with decision making, GDSE-SL-DM or SL-DM).

3.4.2 Cooperative learning

Once the model has been trained with SL, new training data can be generated by letting the agent play new games. Given an image from the training set used in the SL phase, we generate a new training instance by randomly sampling a target object from all objects in the image. We then let our Questioner agent and the Oracle play the game with that object as the target, and further train the common encoder on the generated dialogues by backpropagating the error with gradient descent through the Guesser. After training the Guesser and the encoder with generated dialogues, QGen needs to ‘re-adapt’ to the newly arranged encoder parameters. To achieve this, we re-train QGen on the human data with SL, but using the new encoder states. Here, too, the error is backpropagated with gradient descent through the common encoder. To update the parameters of the DM, we use the Guesser's output as the label to decide ‘ask’ or ‘guess’: when the Guesser is successful, the DM is trained to ‘guess’; otherwise it is trained to ‘ask’ further questions.
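
The loop below sketches one cycle of this cooperative regime at a high level. All helper functions (game play, parameter updates, success check) are hypothetical placeholders standing in for the components described above.

```python
def cooperative_epoch(images, questioner, oracle, human_loader, train_guesser):
    """One epoch of cooperative learning (all helpers are hypothetical placeholders)."""
    if train_guesser:
        for image in images:
            target = sample_random_object(image)            # new training instance
            dialogue = play_game(questioner, oracle, image, target)
            # Backpropagate the Guesser loss on the generated dialogue
            # through the shared encoder.
            update_guesser_and_encoder(dialogue, target)
            # DM label comes from the Guesser's outcome on this game.
            dm_label = "guess" if guesser_succeeded(dialogue, target) else "ask"
            update_decider(dialogue, dm_label)
    else:
        # Re-adapt QGen to the updated encoder states on human dialogues (SL),
        # again backpropagating through the common encoder.
        for batch in human_loader:
            update_qgen_and_encoder(batch)
```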

Regarding modulo-n, in this case QGen is updated at every n-th epoch, while the Guesser is updated at all other epochs; we experimented with n from 3 to 7 and set it to the optimal value of 5.

The GDSE previously trained with SL is further trained with this cooperative learning regime for 100 epochs (batch size of 256, Adam, learning rate of 0.0001), and we select the Questioner module performing best on the validation set (henceforth, GDSE-CL or simply CL; with decision making, GDSE-CL-DM or CL-DM).

3.4.3 Reinforcement learning

Strub et al. (2017a) proposed the first extension of BL (de Vries et al., 2017) with deep reinforcement learning (RL). They present an architecture for end-to-end training using an RL policy. First, the Oracle, Guesser, and QGen models are trained independently using supervised learning. Then, QGen is further trained using a policy gradient.
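
As a generic illustration of this second stage (not the exact objective of Strub et al., 2017a), the sketch below shows a REINFORCE-style update in which the game's success signal serves as the reward for the sampled questions.

```python
import torch

def reinforce_step(log_probs, reward, optimizer, baseline=0.0):
    """log_probs: summed log-probabilities of the sampled question tokens (a tensor
    that requires grad); reward: 1.0 if the Guesser picked the target, else 0.0."""
    loss = -(reward - baseline) * log_probs   # maximize expected game success
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```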

We use the publicly available code and the pre-trained model based on Sampling (Strub et al., 2017a), which resulted in the performance closest to what was reported by the authors.2 This is the RL model we use throughout the rest of the chapter.