In this work we proposed a composition of approaches for advancing visual questioning agents. We looked at the specific cases of visual question generation and visual dialog.
On the visual question generation side, we combined the generative strength of variational autoencoders with LSTM language representations. In the future we plan to use more structured reasoning for this task and also look into convolutional methods [65,66,67,68,69,70,71,72].
On the visual dialog side, we introduced a reformulation of the visual dialog dataset for a more effective evaluation of dialog agents. We introduced a simple baseline which improved over existing complex models. We also demonstrated how questioning and answering models can communicate to create dialog sequences. Going forward we plan to combine visual dialog and textual grounding [73,74,75,76].
These works have helped later works to understand language and vision tasks better [23,77,37,78]. Wang et al. [23] improved our VAE-LSTM model by introducing Gaussian Mixture Model (GMM) and Additive Gaussian (AG) priors to the latent space. Li et al. [77] establish VQG as a dual task of VQA and utilize it to boost VQA performance.
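For orientation, the two priors can be contrasted as below. This is a sketch under our reading of [23], not a restatement of its exact formulation. All symbols are introduced here for illustration: $z$ is the latent variable, $K$ the number of components, $c_k$ nonnegative weights summing to one, $\mu_k$ and $\sigma_k$ per-component parameters, and $\sigma$ a shared standard deviation.

```latex
p_{\mathrm{GMM}}(z \mid c) = \sum_{k=1}^{K} c_k \, \mathcal{N}\!\left(z \mid \mu_k, \sigma_k^2 I\right),
\qquad
p_{\mathrm{AG}}(z \mid c) = \mathcal{N}\!\left(z \,\Big|\, \sum_{k=1}^{K} c_k \mu_k, \; \sigma^2 I\right).
```

Under these assumptions the GMM prior can be multimodal, with up to one mode per component, whereas the AG prior stays unimodal and shifts its mean according to the weights.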
We see visual questioning playing a key role in building AI agents that communicate. Recent advances towards combining language-vision with navigation and robotics [79, 80, 81] stand to benefit from better conversational abilities. Das et al. [79] and Gordon et al. [80] introduce tasks where agents have to travel in an unseen room to answer a given question. Combined with questioning, interactive agents which seamlessly participate in a dialog are a plausible next step for the community.
REFERENCES
[1] U. Jain∗, Z. Zhang∗, and A. G. Schwing, “Creativity: Generating Diverse Questions using Variational Autoencoders,” in CVPR, 2017,∗ equal contribution. iv,4
[2] U. Jain, S. Lazebnik, and A. G. Schwing, “Two can play this Game: Visual Dialog with Discriminative Question Generation and Answering,” in CVPR, 2018. iv,4
[3] Y. Bengio, E. Thibodeau-Laufer, G. Alain, and J. Yosinski, “Deep Generative Stochastic Networks trainable by Backprop,” in JMLR, 2014. 1
[4] Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep learning,” Nature, 2015. 1
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012. 1
[6] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in NIPS, 2014. 1
[7] G. E. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-R. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, “Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups,” IEEE Signal Processing Magazine, 2012. 1
[8] N. Mostafazadeh, I. Misra, J. Devlin, M. Mitchell, X. He, and L. Vanderwende, “Generating natural questions about an image,” in ACL, 2016. 1,6,7,14,15,16,17,18
[9] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,” in ICLR, 2014. 1,9,10
[10] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in NIPS, 2014. 1,9
[11] A. Das, S. Kottur, K. Gupta, A. Singh, D. Yadav, J. M. Moura, D. Parikh, and D. Batra, “Visual Dialog,” in CVPR, 2017. 2,3,6,7,21,23,24,25,26,27,28,29,31
[12] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra, “Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model,” NIPS, 2017. 3,7,24,25,26,29,31
[13] G. Kulkarni, V. Premraj, V. Ordonez, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg, “Babytalk: Understanding and generating simple image descriptions,” 2011. 5
[14] S. Li, G. Kulkarni, T. L. Berg, A. C. Berg, and Y. Choi, “Composing simple image descriptions using web-scale n-grams,” in CoNLL, 2011. 5
[15] M. Mitchell, X. Han, J. Dodge, A. Mensch, A. Goyal, A. Berg, K. Yamaguchi, T. Berg, K. Stratos, and H. Daumé III, “Midge: Generating image descriptions from computer vision detections,” EACL, 2012. 5
[16] P. Kuznetsova, V. Ordonez, A. C. Berg, T. L. Berg, and Y. Choi, “Collective generation of natural image descriptions,” in ACL, 2012. 5
[17] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi, “Treetalk: Composition and compression of trees for image descriptions,” 2014. 5
[18] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent neural network based language model,” in INTERSPEECH, 2010. 5
[19] J. Mao, W. Xu, Y. Yang, J. Wang, and A. L. Yuille, “Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN),” CoRR, vol. abs/1412.6632, 2014. 5
[20] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” in CVPR, 2015. 5,6,16
[21] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in ICML, 2015. 5
[22] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in CVPR, 2015. 5
[23] L. Wang, A. G. Schwing, and S. Lazebnik, “Diverse and Accurate Image Description Using a Variational Auto-Encoder with an Additive Gaussian Encoding Space,” in Proc. NIPS, 2017. 5,33
[24] M. Malinowski and M. Fritz, “A Multi-World Approach to Question Answering about Real-World Scenes based on Uncertain Input,” in NIPS, 2014. 6
[25] M. Ren, R. Kiros, and R. Zemel, “Exploring models and data for image question answering,” in NIPS, 2015. 6
[26] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L. Zitnick, and D. Parikh, “VQA: Visual question answering,” in ICCV, 2015. 6,14,28
[27] H. Gao, J. Mao, J. Zhou, Z. Huang, L. Wang, and W. Xu, “Are you talking to a machine? Dataset and Methods for Multilingual Image Question Answering,” in NIPS, 2015. 6
[28] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei, “Visual7W: Grounded Question Answering in Images,” in CVPR, 2016. 6
[29] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick, “Clevr: A diagnostic dataset for compositional language and elementary visual reasoning,” in CVPR, 2017. 6
[30] J. Lu, J. Yang, D. Batra, and D. Parikh, “Hierarchical question-image co-attention for visual question answering,” in NIPS, 2016. 6
[31] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola, “Stacked attention networks for image question answering,” in CVPR, 2016. 6
[32] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Deep compositional question answering with neural module networks,” in CVPR, 2016. 6
[33] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra, “Human attention in visual question answering: Do humans and deep networks look at the same regions?” in EMNLP, 2016. 6
[34] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, “Multimodal compact bilinear pooling for visual question answering and visual grounding,” in EMNLP, 2016. 6
[35] K. J. Shih, S. Singh, and D. Hoiem, “Where to look: Focus regions for visual question answering,” in CVPR, 2016. 6
[36] H. Xu and K. Saenko, “Ask, attend and answer: Exploring question-guided spatial attention for visual question answering,” in ECCV, 2016. 6
[37] I. Schwartz, A. G. Schwing, and T. Hazan, “High-Order Attention Models for Visual Question Answering,” in Proc. NIPS, 2017. 6,33
[38] H. Ben-younes, R. Cadene, M. Cord, and N. Thome, “Mutan: Multimodal tucker fusion for visual question answering,” in ICCV, 2017. 6
[39] M. Malinowski, M. Rohrbach, and M. Fritz, “Ask your neurons: A neural-based approach to answering questions about images,” in ICCV, 2015. 6
[40] L. Ma, Z. Lu, and H. Li, “Learning to answer questions from image using convolutional neural network,” in AAAI, 2016. 6
[41] C. Xiong, S. Merity, and R. Socher, “Dynamic memory networks for visual and textual question answering,” in ICML, 2016. 6
[42] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh, “Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering,” in CVPR, 2017. 6
[43] A. K. Vijayakumar, M. Cogswell, R. R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra, “Diverse Beam Search: Decoding Diverse Solutions from Neural Sequence Models,” in AAAI, 2018. 6,7,16
[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, 1997. 6,11
[45] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” 2014. 6
[46] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with gumbel-softmax,” in https://arxiv.org/abs/1611.01144, 2016. 7
[47] K. Gimpel, D. Batra, G. Shakhnarovich, and C. Dyer, “A Systematic Exploration of Diversity in Machine Translation,” in EMNLP, 2013. 7
[48] D. Batra, P. Yadollahpour, A. Guzman-Rivera, and G. Shakhnarovich, “Diverse M-Best Solutions in Markov Random Fields,” in ECCV, 2012. 7
[49] A. Jabri, A. Joulin, and L. van der Maaten, “Revisiting visual question answering baselines,” in ECCV, 2016. 7
[50] A. Das, S. Kottur, J. M. Moura, S. Lee, and D. Batra, “Learning cooperative visual dialog agents with deep reinforcement learning,” in ICCV, 2017. 7
[51] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved Techniques for Training GANs,” in NIPS, 2016. 9
[52] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” in ICLR, 2016. 9
[53] Y. Burda, R. Grosse, and R. R. Salakhutdinov, “Importance Weighted Autoencoders,” in ICLR, 2016. 9
[54] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015. 11,14,24
[55] L. Wang, Y. Li, and S. Lazebnik, “Learning Deep Structure-Preserving Image-Text Embeddings,” in CVPR, 2016. 11
[56] Y. Gong, L. Wang, M. Hodosh, J. Hockenmaier, and S. Lazebnik, “Improving Image-Sentence Embeddings Using Large Weakly Annotated Photo Collections,” in ECCV, 2014. 11
[57] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014. 14,28
[58] R. Vedantam, C. Lawrence Zitnick, and D. Parikh, “Cider: Consensus-based image description evaluation,” in CVPR, 2015. 16
[59] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in ICML, 2015. 24
[60] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in ICLR, 2015. 25
[61] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in ICCV, 2015. 25
[62] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in AISTATS, 2010. 25
[63] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representa- tion,” in EMNLP, 2014. 26
[64] C.-W. Liu, R. Lowe, I. V. Serban, M. Noseworthy, L. Charlin, and J. Pineau, “How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation,” in EMNLP, 2016. 28
[65] L.-C. Chen∗, A. G. Schwing∗, A. L. Yuille, and R. Urtasun, “Learning Deep Structured Models,” in ICML, 2015,∗ equal contribution. 33
[66] A. G. Schwing and R. Urtasun, “Fully Connected Deep Structured Networks,” in https://arxiv.org/abs/1503.02351, 2015. 33
[67] B. London∗ and A. G. Schwing∗, “Generative Adversarial Structured Networks,” in NIPS Workshop on Adversarial Training, 2016,∗ equal contribution. 33
[68] O. Meshi, M. Mahdavi, and A. G. Schwing, “Smooth and Strong: MAP Inference with Linear Convergence,” in NIPS, 2015. 33
[69] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun, “Globally Convergent Parallel MAP LP Relaxation Solver using the Frank-Wolfe Algorithm,” in ICML, 2014. 33
[70] A. G. Schwing, T. Hazan, M. Pollefeys, and R. Urtasun, “Globally Convergent Dual MAP LP Relaxation Solvers using Fenchel-Young Margins,” in NIPS, 2012. 33
[71] A. Deshpande, J. Aneja, L. Wang, A. Schwing, and D. Forsyth, “Diverse and controllable image captioning with part-of-speech guidance,” https://arxiv.org/abs/1805.12589, 2018. 33
[72] J. Aneja, A. Deshpande, and A. Schwing, “Convolutional image captioning,” CVPR, 2018. 33
[73] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” in Proc. ICCV, 2015. 33
[74] B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik, “Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models,” IJCV, 2017. 33
[75] R. A. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. G. Schwing, “Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts,” in Proc. NIPS, 2017. 33
[76] R. A. Yeh, M. Do, and A. G. Schwing, “Unsupervised Textual Grounding: Linking Words to Image Concepts,” in Proc. CVPR, 2018. 33
[77] Y. Li, N. Duan, B. Zhou, X. Chu, W. Ouyang, and X. Wang, “Visual Question Generation as Dual Task of Visual Question Answering,” in https://arxiv.org/abs/1709.07192, 2017. 33
[78] D. Massiceti, N. Siddharth, P. K. Dokania, and P. H. Torr, “Flipdial: A generative model for two-way visual dialogue,” 2018. 33
[79] A. Das, S. Datta, G. Gkioxari, S. Lee, D. Parikh, and D. Batra, “Embodied Question Answering,” in CVPR, 2018. 33
[80] D. Gordon, A. Kembhavi, M. Rastegari, J. Redmon, D. Fox, and A. Farhadi, “Iqa: Visual question answering in interactive environments,” in CVPR, 2018. 33
[81] Y. Bisk, K. J. Shih, Y. Choi, and D. Marcu, “Learning interpretable spatial operations in a rich 3d blocks world,” https://arxiv.org/abs/1712.03463, 2017. 33