6. Conclusion and future work
6.2. Future work
This work presented four different multimodal model architectures, including a combined approach.
More architectures could be developed and examined based on multimodal encoder-decoder models from MNMT. Focusing on the attention mechanism, as Calixto et al. (2017) did by attending to text and image separately, is a promising approach, especially when the size and diversity of the image dataset increase. Fine-tuning the pretrained CNN, i.e. not fixing all of its weights after a few training epochs, could enhance object recognition and thereby circumvent the observed image processing bottleneck.
While this thesis showed that multiple modalities can be used to improve performance on a goal-oriented dialog task, future work exploring multimodal dialog systems should aim at widening the setting beyond the closed-world kitchen robot example used here. As a first step, the kitchen robot could be upgraded to a more general-purpose robot with more diversity in command structure and action types. More ambiguous situations should be added, going beyond color and size towards more complex or indirect ambiguities. Furthermore, expanding the architecture to support parallel video and text streams would open up a much wider range of real-world applications. For instance, a goal-oriented personal assistant dialog system could be trained with a visual sense to react to spoken phrases while considering the emotional state of the user, detected using CNNs like the one of Jain et al. (2018).
To conclude, despite currently being at a very early stage, combining visual and textual features into a multimodal dialog system offers tremendous opportunities to improve the user experience for goal-oriented tasks and to build the next generation of human-computer interaction.
Appendices
A. Implementation
A.1. Image acquisition
Scene. To collect images of possible dialog scenes in a structured way, the concept of a “scene” first had to be defined. In the ARMAR-III example used throughout this thesis, a scene is a set of objects and drinks placed on the kitchen table in front of the robot. Objects are composed of a name (e.g. “cup”), a color (“red”) and a size (“big”). The set of object properties is designed to be easily extensible, so that other attributes such as material can be added later. Additionally, each object has a flag stating whether drinks can be poured into it. The set of drinks is simply a set of strings representing their names (e.g. “water”). A simple script, read_available_objs.py, was implemented to parse a text file of a predefined syntax into a Scene object containing all available objects.
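The scene structure described above could be modeled roughly as follows. This is only a minimal sketch: the class name SceneObject and all field names are assumptions made for illustration, and the actual Scene type produced by read_available_objs.py may look different.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SceneObject:
    """One object on the table (field names are illustrative assumptions)."""
    name: str               # e.g. "cup"
    color: str              # e.g. "red"
    size: str               # e.g. "big"
    pourable: bool = False  # True if drinks can be poured into the object


@dataclass
class Scene:
    """A scene: the objects and drinks placed in front of the robot."""
    objects: set = field(default_factory=set)
    drinks: set = field(default_factory=set)  # drink names as plain strings


scene = Scene(
    objects={SceneObject("cup", "red", "big", pourable=True),
             SceneObject("plate", "yellow", "big")},
    drinks={"water"},
)
vessels = sorted(o.name for o in scene.objects if o.pourable)
print(vessels)  # ['cup']
```

Using a frozen dataclass keeps objects hashable, so a scene can hold them in a set; a further attribute such as material would only require one more field.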
Image plans. Since manually labeling all the captured images in hindsight would require a tremendous amount of time, plans for taking the images were created in advance. An image acquisition plan is simply a list of scenes to capture. To ease the process of capturing the images, an easy-to-read text file format was defined for representing image plans. In this format, each line represents one scene to capture. To avoid listing all objects of a scene each time, only the difference to the previous scene is written down. Listing A.1 shows an example of this.
Listing A.1: Example of an image plan file. The + and - direct to add or remove the given item, while (d) indicates the given object can be used to pour drinks in.
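To illustrate the diff idea, the following sketch expands a plan into the full content of every scene. The concrete line syntax used here (comma-separated items, each starting with + or -) is an assumption, since only the meaning of +, - and (d) is stated above; the function name apply_plan is likewise hypothetical.

```python
def apply_plan(lines):
    """Expand a diff-style plan into the full item set of each scene.

    Assumed syntax (not taken from the thesis): one scene per line,
    items separated by commas, each item prefixed with + or -.
    """
    current = set()
    scenes = []
    for line in lines:
        for item in (part.strip() for part in line.split(",")):
            if not item:
                continue
            op, name = item[0], item[1:].strip()
            if op == "+":
                current.add(name)
            elif op == "-":
                current.discard(name)
            else:
                raise ValueError(f"item must start with + or -: {item!r}")
        scenes.append(frozenset(current))  # snapshot of the scene after this line
    return scenes


plan = [
    "+ red big cup (d), + water",
    "+ yellow big plate",
    "- water, + milk",
]
scenes = apply_plan(plan)
for scene in scenes:
    print(sorted(scene))
```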
Before each image acquisition session, plans were generated by a custom script. Initially, this script generates a list of interesting scenes based on object count, combination of vessels and drinks, and ambiguity situations (e.g. the same object in different colors). See section 5.1 for more details on “interesting” scenes. Afterwards, a greedy algorithm is used to sort the scenes so that the number of objects to change between consecutive scenes is minimized.¹ Finally, an image diff-plan as explained above is created based on the sorted list of scenes.
During image sessions, these image plans were mostly followed. Nevertheless, since the diff-plan format allows for easy manipulation, this opportunity was sometimes used to spontaneously take some challenging images that the generated plan did not include (e.g. with three drinks at once).
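The greedy sorting step mentioned above can be sketched as a nearest-neighbour heuristic: always continue with the remaining scene that differs least from the current one. The thesis does not state which greedy rule was used, so this is one plausible variant under that assumption; the function names are hypothetical.

```python
def greedy_order(scenes):
    """Order scenes so that consecutive scenes differ in few items.

    Nearest-neighbour heuristic; like any greedy approach it is not
    guaranteed to find the optimal order.
    """
    remaining = [frozenset(s) for s in scenes]
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        last = ordered[-1]
        # Symmetric difference = items to add plus items to remove.
        nxt = min(remaining, key=lambda s: len(s ^ last))
        remaining.remove(nxt)
        ordered.append(nxt)
    return ordered


def total_changes(ordered):
    """Number of items added or removed over the whole session."""
    return sum(len(a ^ b) for a, b in zip(ordered, ordered[1:]))


example = [{"cup"}, {"mug", "plate", "milk"}, {"cup", "milk"}, {"mug", "plate"}]
print(total_changes([frozenset(s) for s in example]))  # 11
print(total_changes(greedy_order(example)))            # 5
```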
A.2. Dialog generation
Grammar. To generate sufficiently complex dialogs, a small formal grammar framework was written. Its base is a context-free grammar library consisting of the base type Grammar and its subtypes Terminal, Or and Concat. The framework heavily uses operator overloading to improve grammar definition readability, i.e. instead of Concat(expr1, expr2) or Or(expr1, expr2) one can simply write expr1 + expr2 or expr1 | expr2, respectively. Generation of all words defined by a context-free grammar is a simple task solved recursively.

¹ This algorithm does not find optimal solutions; however, this was not required either.
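The operator-overloading idea and the recursive word generation can be sketched in a few lines. The class names follow the text, but the method name words and the helper _lift are assumptions; this is an illustration of the technique, not the thesis implementation.

```python
import itertools


class Grammar:
    """Base type: + builds a Concat, | builds an Or."""

    def __add__(self, other):
        return Concat(self, _lift(other))

    def __radd__(self, other):
        return Concat(_lift(other), self)

    def __or__(self, other):
        return Or(self, _lift(other))

    def __ror__(self, other):
        return Or(_lift(other), self)

    def words(self):
        raise NotImplementedError


class Terminal(Grammar):
    def __init__(self, text):
        self.text = text

    def words(self):
        yield self.text


class Or(Grammar):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def words(self):
        # Every word of either alternative is a word of the Or.
        yield from self.left.words()
        yield from self.right.words()


class Concat(Grammar):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def words(self):
        # Every left word combined with every right word.
        for a, b in itertools.product(self.left.words(), self.right.words()):
            yield a + b


def _lift(value):
    """Wrap plain strings so they can be mixed with grammar objects."""
    return value if isinstance(value, Grammar) else Terminal(value)


greeting = (Terminal("Hello") | "Hi") + " " + (Terminal("robot") | "ARMAR")
print(list(greeting.words()))
```

Because plain strings are lifted to Terminal objects, only the first operand of an expression has to be wrapped explicitly.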
Since a context-free grammar is not sufficient for generating multi-step dialogs, the framework was extended to allow back-references and context-dependent term choice. To this end, the Grammar subtype Variable with its specializations IndependentVariable and BoundVariable was introduced. An IndependentVariable behaves similarly to an Or, except that each occurrence of the variable yields the same string, whereas an Or yields all permutations of its options if occurring multiple times. Bound variables can change their set of possible options depending on other variables. With this extension, the following steps are applied to generate all words of the grammar:
1. Sort the occurring variables topologically according to their dependencies.
2. For each variable, choose the next of its possible options (in nested loops, so that all permutations are hit once). Note that the set of possible options may change for bound variables depending on the value chosen for their dependencies, hence the topological sorting.
3. With all variables fixed to one value, the resulting grammar behaves like a context-free one. All words in this grammar are generated recursively.
The following example shows a simple usage of the described grammar framework:
T, V = Terminal, Variable
places = T("table") | "ground" | "fridge"
objects = T("cup") | "mug" | "plate"
place = V("place", places)  # First argument is the variable name
obj = V("object", objects)
move_cmd = "Please move the " + obj + " to the " + place
move_answer = "Ok, I am going to move the " + obj + " to the " + place
move_dialog = move_cmd + "\n" + move_answer
for w in generate_all_word(move_dialog):
    print(w)
Listing A.2: Example of grammar framework usage. Note that the occurrences of the variables will yield the same values, respectively. This example will generate dialogs of the form “Please move the cup to the table \n Ok, I am going to move the cup to the table” with all nine possible combinations of places and objects.
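The three generation steps for grammars with variables can be sketched independently of the framework. In this simplified illustration, each variable is described by its dependencies and a function returning its options, and a plain format string stands in for the context-free remainder of step 3. The dictionary layout, the dependency rule for places and all names are assumptions.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# name -> (dependencies, function returning the options for the values chosen so far)
variables = {
    "object": ((), lambda env: ["cup", "plate"]),
    # Bound variable: the allowed places depend on the chosen object.
    "place": (("object",),
              lambda env: ["table", "fridge"] if env["object"] == "cup" else ["table"]),
}


def assignments(variables):
    """Yield every consistent assignment of values to all variables."""
    graph = {name: set(deps) for name, (deps, _) in variables.items()}
    order = list(TopologicalSorter(graph).static_order())  # step 1: dependencies first

    def choose(index, env):
        if index == len(order):
            yield dict(env)
            return
        name = order[index]
        for option in variables[name][1](env):  # step 2: nested loops via recursion
            env[name] = option
            yield from choose(index + 1, env)
        env.pop(name, None)

    yield from choose(0, {})


# Step 3: with all variables fixed, every occurrence yields the same value.
template = ("Please move the {object} to the {place}\n"
            "Ok, I am going to move the {object} to the {place}")
dialogs = [template.format(**env) for env in assignments(variables)]
for dialog in dialogs:
    print(dialog, end="\n\n")
```

The topological order matters because the options of a bound variable can only be computed once the variables it depends on have a value.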
A.3. Multimodal models
The multimodal encoder-decoder models were implemented as an extension of the OpenNMT framework by Klein et al. (2017). Initially, the preprocess.py script of OpenNMT was modified to allow two simultaneous input modalities. For space and computation efficiency as well as improved flexibility, the component responsible for converting images to tensors was made configurable to enable the definition of a custom image to tensor callback function. This mechanism was used to place the image features, detected classes or ground-truth classes in the dataset when implementing the different approaches. As a result, the dataset files yielded by the preprocess step contained triples of source sentence, target sentence and second modality tensor.
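The triple structure and the role of the callback can be illustrated with a small sketch. It deliberately avoids OpenNMT and PyTorch: a list of floats stands in for the tensor, and the functions build_examples and fake_class_probabilities are hypothetical names that do not appear in the modified preprocess.py.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Features = List[float]  # stand-in for a tensor in this sketch


def build_examples(
    sources: Iterable[str],
    targets: Iterable[str],
    second_paths: Iterable[str],
    to_features: Callable[[str], Features],
) -> Iterator[Tuple[str, str, Features]]:
    """Pair each sentence pair with the features of its second-modality file."""
    for src, tgt, path in zip(sources, targets, second_paths):
        yield src, tgt, to_features(path)


def fake_class_probabilities(path: str) -> Features:
    """Hypothetical callback; a real one would load the file and run a CNN."""
    return [0.9, 0.1] if "cup" in path else [0.2, 0.8]


examples = list(build_examples(
    ["Please give me the cup"],
    ["API_CALL grab cup"],
    ["scene_cup.jpg"],
    fake_class_probabilities,
))
print(examples[0])
```

Swapping the callback is enough to store image features, detected classes or ground-truth classes without touching the rest of the pipeline.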
The next step was to modify train.py to support multimodal models. This was done by additionally passing the second modality tensor to the model if present in the dataset. An abstract class MultiModalModel was defined with an interface equivalent to NMTModel. The different model architectures described in section 4.3 were implemented as subclasses of MultiModalModel, with the FVTL+GM class being a subclass of both FVTL and GM. Finally, translate.py was adjusted to support multimodal models and datasets.
All changes made to OpenNMT were implemented in a generic way, neither specific to goal-oriented dialogs nor to images as second modality. In fact, the image to tensor callback described above could be used to convert any file type to tensors, enabling the use of any other second modality such as audio. Furthermore, MultiModalModel and its subclasses can be used for any sequence-to-sequence task, including the original purpose of OpenNMT, machine translation. To demonstrate this generality, the framework as used in this thesis was trained on the multimodal machine translation dataset Multi30K by Elliott et al. (2016) and tested on the ambiguous COCO test dataset created for WMT17 (Elliott et al., 2017), reaching a BLEU score of 20.8 compared to the WMT-provided baseline of 18.7.
All changes made to the OpenNMT framework will be open-sourced to make a contribution to research on multimodal sequence-to-sequence tasks. The following command line options were added to preprocess.py and translate.py:
Image loading
-img_to_tensor_fn A custom image file to tensor function which will be used to load images. Its type must be (str -> Tensor). The function should be given by its full name ("module.submodule.function"). Make sure the corresponding module is in PYTHONPATH! If this is not given, a default implementation (PIL + torchvision) will be used to read image files to tensors. (default: None)

Multimodal data
-second_data_type Type of the second source input. If this is not present, multimodal input is not used. If you supply this, you also have to supply second_train_src, second_valid_src and src_dir (for the images). Options are [img]. (default: None)
-second_train_src Path to the second modality training source data (listing file names inside of src_dir)
-second_valid_src Path to the second modality validation source data (listing file names inside of src_dir)

The following options were added to train.py:

Multimodal model
-multimodal_type The multimodal model implementation to use. If not specified, no multimodal model is used at all. Options:
• hsm = hidden state merge
• fvtl = first view, then listen
• gm = generator merge
• fvtl+gm = first view, then listen, finally view
-second_dim_in The input dimension of the second encoder.
-second_dim The output dimension of the second encoder.
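The -img_to_tensor_fn option above takes a function by its full dotted name. One common way to turn such a name into a callable is shown below; this is a sketch using only the standard library, the helper name resolve_function is an assumption, and the scripts themselves may resolve the name differently.

```python
import importlib


def resolve_function(full_name):
    """Import "package.module.function" and return the function object."""
    module_name, _, attr = full_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.function', got {full_name!r}")
    # The module must be importable, i.e. reachable via PYTHONPATH.
    module = importlib.import_module(module_name)
    func = getattr(module, attr)
    if not callable(func):
        raise TypeError(f"{full_name} is not callable")
    return func


# Demonstration with a standard-library function instead of an image loader.
basename = resolve_function("os.path.basename")
print(basename("/data/images/scene_001.jpg"))  # scene_001.jpg
```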
B. Dataset
B.1. Examples
The following section presents four example images with corresponding exemplary dialogs from train and test datasets. At the bottom of each scene, the total number of dialogs without and with all possible word and sentence variants is shown. User utterances are shown in italic, robot utterances in normal letters.
Ok I am going to give you the plate

Please give me that thing over there
What type of object are you talking about
The plate please
API_CALL grab plate
Ok I am going to give you the plate

Please move the plate to the table
API_CALL move plate table
Ok I am going to move the plate to the table

Total dialog count (simple / full): 5 / 1736
Dataset: Test
Camera: Canon EOS 7D
Scene: orange small cup, orange big cup, milk

Please give me the cup
Do you want to have the big or the small cup
The small one please
API_CALL grab cup small
Ok I am going to give you the small cup

Please give me that thing over there
What type of object are you talking about
The cup please
Do you want to have the big or the small cup
The small one please
API_CALL grab cup small
Ok I am going to give you the small cup

Please give me that thing over there
What type of object are you talking about
The small cup please
API_CALL grab cup small
Ok I am going to give you the small cup

Total dialog count (simple / full): 10 / 4752
Dataset: Train
Camera: ARMAR-III left eye
Scene: orange big cup, red big mug, green big mug, yellow big plate

Please give me the cup
API_CALL grab cup
Ok I am going to give you the cup

Please move the plate next to the cup
API_CALL move_next_to plate cup
Ok I am going to move the plate next to the cup

Please move the mug next to the plate
Do you want me to move the green or the red mug
The red one please
API_CALL move_next_to mug red plate
Ok I am going to move the red mug next to the plate

Please move the cup next to the mug
Do you want the cup next to the the green or the red mug
The green one please
API_CALL move_next_to cup mug green
Ok I am going to move the cup next to the green mug

Total dialog count (simple / full): 28 / 8544
Dataset: Test
Camera: iPhone 6
Scene: orange small cup, red big mug, yellow big mug, milk, beer

Please give me something to drink
Do you want to drink beer or milk
I want the beer please
Do you want the beer served in the cup or mug
The cup please
API_CALL pour beer cup
Ok I am going to pour the beer into the cup

Please give me something to drink
Do you want to drink beer or milk
I want the beer please
Do you want the beer served in the cup or mug
The mug please
Do you want me to use the red or the yellow mug
The red one please
API_CALL pour beer mug red
Ok I am going to pour the beer into the red mug

Total dialog count (simple / full): 23 / 966408
B.2. Statistics
Supplementary figures for dataset analysis. See section 5.1.
C. Results
C.1. Train curves
The following graphs show the course of validation accuracy and perplexity during training of the multimodal model architectures using the different approaches. Each model was trained five times; the graphs show an arbitrary one of these training runs. For test evaluation, the result of the epoch with the highest validation accuracy was chosen.
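The epoch selection rule stated above amounts to an argmax over the validation accuracies. The sketch below uses made-up numbers purely for illustration; breaking ties in favour of the earlier epoch is an assumption, as the text does not specify it.

```python
def best_epoch(validation_accuracy):
    """Return the 1-based epoch with the highest validation accuracy.

    On ties the earliest epoch wins (an assumption of this sketch).
    """
    best_index = max(range(len(validation_accuracy)),
                     key=validation_accuracy.__getitem__)
    return best_index + 1


# Made-up accuracies for illustration only, not results from this thesis.
history = [0.41, 0.55, 0.63, 0.61, 0.63]
print(best_epoch(history))  # 3
```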
C.2. Evaluation details
Table C.1 shows the detailed evaluation results for the multimodal model experiments. All nets were trained using the following parameters:
CNN (does not apply to the raw model): se_resnext50_32x4d
Second modality dimension (S from section 4.3): 256
Architecture  Approach           l       %     d       %  d_excl       %  d_unknown      %     a     q     w
Total                        14264          6536            5420                630
Raw                           8880  62.3 %  2308  35.3 %    1859  34.3 %         18  2.9 %  1440   792  2232
HSM           features        9006  63.1 %  2365  36.2 %    2061  38.0 %         18  2.9 %   826   874  1700
              probabilities   8862  62.1 %  2381  36.4 %    2039  37.6 %         18  2.9 %   612   990  1602
              classes         8880  62.3 %  2291  35.1 %    2039  37.6 %         18  2.9 %   612   990  1602
              ground-truth    8718  61.1 %  2147  32.8 %    1895  35.0 %         18  2.9 %   612  1062  1674
FVTL          features        9210  64.6 %  2461  37.7 %    2058  38.0 %         18  2.9 %  1162   678  1840
              probabilities   8826  61.9 %  2503  38.3 %    2223  41.0 %         18  2.9 %   640  1115  1755
              classes         9145  64.1 %  2480  37.9 %    2226  41.1 %         18  2.9 %   636   995  1631
              ground-truth    9126  64.0 %  2734  41.8 %    2357  43.5 %         18  2.9 %   648  1008  1656
GM            features        9652  67.7 %  2824  43.2 %    2461  45.4 %         18  2.9 %   772   660  1432
              probabilities   9443  66.2 %  2647  40.5 %    2297  42.4 %         18  2.9 %   790   827  1617
              classes         9300  65.2 %  2651  40.6 %    2277  42.0 %         18  2.9 %   976   827  1803
              ground-truth   10206  71.6 %  3242  49.6 %    2749  50.7 %         18  2.9 %   864   612  1476
FVTL+GM       features        9437  66.2 %  2819  43.1 %    2412  44.5 %         18  2.9 %  1124   602  1726
              probabilities   9344  65.5 %  2760  42.2 %    2343  43.2 %         18  2.9 %  1187   574  1761
              classes         9204  64.5 %  2679  41.0 %    2251  41.5 %         19  3.0 %  1203   640  1843
              ground-truth   11024  77.3 %  3854  59.0 %    3351  61.8 %         18  2.9 %   648   540  1188

Table C.1: Overview of the multimodal model experiment results (on the test dataset). l: correct lines, d: correct dialogs, d_excl: correct dialogs excluding scenes with unknown objects, d_unknown: correct dialogs about unknown objects, a: API-calls even though ambiguous, q: questions even though unambiguous, w: wrong number of ambiguity detections (= a + q). The best of each approach is highlighted.
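The derived columns of Table C.1 can be recomputed from the raw counts, which is a quick way to check a row. The sketch below does this for three rows copied from the table; the helper name percent is an assumption.

```python
TOTAL_LINES = 14264    # "Total" row, column l
TOTAL_DIALOGS = 6536   # "Total" row, column d

rows = {
    # name: (correct lines l, correct dialogs d, a, q), copied from Table C.1
    "Raw": (8880, 2308, 1440, 792),
    "GM features": (9652, 2824, 772, 660),
    "FVTL+GM ground-truth": (11024, 3854, 648, 540),
}


def percent(part, whole):
    """Share of `part` in `whole` in percent, rounded to one decimal place."""
    return round(100 * part / whole, 1)


for name, (lines, dialogs, a, q) in rows.items():
    # w is defined as a + q
    print(name, percent(lines, TOTAL_LINES), percent(dialogs, TOTAL_DIALOGS), a + q)
```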
Multimodal goal-oriented dialog using Encoder-Decoder-Networks
Bibliography
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. (2015). VQA: Visual question answering. CoRR, abs/1505.00468.
Asfour, T., Regenstein, K., Azad, P., Schroder, J., Bierbaum, A., Vahrenkamp, N., and Dillmann, R. (2007). ARMAR-III: An integrated humanoid platform for sensory-motor control. pages 169–175.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, Berlin, Heidelberg.
Bordes, A. and Weston, J. (2016). Learning end-to-end goal-oriented dialog. CoRR, abs/1605.07683.
Britz, D., Goldie, A., Luong, M.-T., and Le, Q. (2017). Massive exploration of neural machine translation architectures. CoRR, abs/1703.03906.
Caglayan, O., Barrault, L., and Bougares, F. (2016). Multimodal attention for neural machine translation. CoRR, abs/1609.03976.
Calixto, I., Liu, Q., and Campbell, N. (2017). Doubly-attentive decoder for multi-modal neural machine translation. CoRR, abs/1702.01287.
Chung, J., Gülçehre, Ç., Cho, K., and Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. CoRR, abs/1412.3555.
Constantin, S., Niehues, J., and Waibel, A. (2018). An end-to-end goal-oriented dialog system with a generative natural language response generation. CoRR, abs/1803.02279.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09.
Elliott, D., Frank, S., Barrault, L., Bougares, F., and Specia, L. (2017). Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description. In Proceedings of the Second Conference on Machine Translation, Copenhagen, Denmark. Association for Computational Linguistics.
Elliott, D., Frank, S., Sima’an, K., and Specia, L. (2016). Multi30K: Multilingual English-German image descriptions. In Proceedings of the 5th Workshop on Vision and Language, pages 70–74. Association for Computational Linguistics.
Hastie, T., Friedman, J., and Tibshirani, R. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York, New York, NY, USA.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume abs/1512.03385. IEEE.
Hochreiter, S. (1991). Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für Informatik, Lehrstuhl Prof. Brauer, Technische Universität München.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hu, J., Shen, L., and Sun, G. (2018). Squeeze-and-excitation networks.
Huang, P.-Y., Liu, F., Shiang, S.-R., Oh, J., and Dyer, C. (2016). Attention-based multimodal neural machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers. Association for Computational Linguistics.
Jain, N., Kumar, S., Kumar, A., Shamsolmoali, P., and Zareapoor, M. (2018). Hybrid deep neural networks for face emotion recognition. Pattern Recognition Letters.
Klein, G., Kim, Y., Deng, Y., Senellart, J., and Rush, A. M. (2017). OpenNMT: Open-source toolkit for neural machine translation. In Proc. ACL.
Krizhevsky, A. (2014). One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997.
Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. In Proceedings of the IEEE, volume 86, pages 2278–2324. Institute of Electrical and Electronics Engineers (IEEE).
Libovický, J., Helcl, J., Tlustý, M., Bojar, O., and Pecina, P. (2016). CUNI system for WMT16 automatic post-editing and multimodal translation tasks. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, volume abs/1606.07481. Association for Computational Linguistics.
Malinowski, M., Rohrbach, M., and Fritz, M. (2015). Ask your neurons: A neural-based approach to answering questions about images. CoRR, abs/1505.01121.
Mysid and Dave (2006). Neural network. https://commons.wikimedia.org/w/index.php?curid=1412126.
Olah, C. (2015). Understanding LSTM networks. http://colah.github.io/posts/2015-08-Understanding-LSTMs.
Olah, C. and Carter, S. (2016). Attention and augmented recurrent neural networks. Distill, http://distill.pub/2016/augmented-rnns.
Ren, S., He, K., Girshick, R., and Sun, J. (2017). Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149.
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.
Szegedy, C., Ioffe, S., and Vanhoucke, V. (2016). Inception-v4, inception-resnet and the impact of residual connections on learning. CoRR, abs/1602.07261.
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2015). Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567.
Taspinar, A. (2016). The perceptron. http://ataspinar.com/2016/12/22/the-perceptron.
Vinyals, O. and Le, Q. V. (2015). A neural conversational model. CoRR, abs/1506.05869.
Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., and Lang, K. (1990). Phoneme recognition using time-delay neural networks. In Waibel, A. and Lee, K.-F., editors, Readings in Speech Recognition, pages 393–404. Elsevier, San Francisco.
Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014). How transferable are features in deep neural networks? In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 3320–3328, Cambridge, MA, USA. MIT Press.
Young, S., Gasic, M., Thomson, B., and Williams, J. D. (2013). POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.