Top 20 Dynamic Capsule Attention for Visual Question Answering

Dynamic Capsule Attention for Visual Question Answering

... “FRCNN”. When using VGG-16, the dimensions of LSTM and CapsAtt are set to 512. For ResNet and FRCNN features, we set the dimensions of LSTM and CapsAtt both to 1,024. During training, the learning rate is 4e-4 for the ...

Analyzing the Behavior of Visual Question Answering Models

... pay attention to specific spatial regions in an image) produce the same response for at least half the images for fewer questions (42% for the ATT model, 40% for the MCB ...

Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

... image-based question-answering, existing question-answering methods mainly focus on using an LSTM network to encode the question sequence and fusing the question representation ...

Stacking with Auxiliary Features for Visual Question Answering

... Most VQA systems have a single underlying method that optimizes a specific loss function and do not leverage the advantage of using multiple diverse models. One recent ensembling approach to VQA (Fukui et al., 2016) ...

Generating Question Relevant Captions to Aid Visual Question Answering

... 3.5 Training and Implementation Details. We train our joint model using the AdaMax optimizer (Kingma and Ba, 2015) with a batch size of 384 and a learning rate of 0.002 as suggested by Teney et al. (2017). We use the ...

Faithful Multimodal Explanation for Visual Question Answering

... both visual and multimodal ... of attention values and source identifiers’ values in Eq. 2 over time (t) and assign the accumulated attention weight to each corresponding segmentation ... normalize ...

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding

... the visual grounding task. For open-ended question answering, we present an architecture for VQA which uses MCB twice, once to predict spatial attention and the second time to predict the ...

Human Attention in Visual Question Answering: Do Humans and Deep Networks look at the same regions?

... state-of-the-art attention-based VQA models (Yang et ... machine-generated attention maps from the most accurate VQA model have a mean rank-correlation of ... VQA attention models do not seem to be ...

Visual TTR: Modelling Visual Question Answering in Type Theory with Records

... In this paper, I pay particular attention to the formal core of this system. A necessary aspect of such a model that I have glossed over is the parser. There is no off-the-shelf English to TTR parser, so the model ...

Structured Two-Stream Attention Network for Video Question Answering

... The attention mechanism has also been widely used in video ... semantic attention mechanism, which detects concepts from the video first and then fuses them with text encoding/decoding to infer an ... focus ...

BLOCK: Bilinear Superdiagonal Fusion for Visual Question Answering and Visual Relationship Detection

... the question, words are preprocessed and then fed into a pretrained Skip-thought encoder (Kiros et ... whole question, as in (Yu et ... The question vector is used as a context to guide the ...

Improving Visual Question Answering by Referring to Generated Paragraph Captions

... as visual question ... with visual information present in the image because it can discuss both more abstract concepts and more explicit, intermediate symbolic information about objects, events, and ...

Cross-Modal Multistep Fusion Network with Co-Attention for Visual Question Answering

... Our overall framework is illustrated in Figure 1. It can be divided into three parts: the first part (in blue color) is the Multimodal Representation. To increase the consistency of the image and text, we propose a novel ...

Segmentation Guided Attention Networks for Visual Question Answering

... of Visual Question Answering by using a novel segmentation-guided attention-based network which we call SegAttend- ... our attention maps and use these refined attention maps to ...

Multi-grained Attention with Object-level Grounding for Visual Question Answering

... Visual Question Answering (VQA) to search for visual clues related to the ... train attention models from a coarse-grained association between sentences and images, which tends to fail on ...

Data Augmentation for Visual Question Answering

... Many algorithms have been proposed for VQA. Some notable formulations include attention-based methods (Yang et al., 2016; Xiong et al., 2016; Lu et al., 2016; Fukui et al., 2016), Bayesian frameworks (Kafle and ...

Differential Networks for Visual Question Answering

... categories: attention-based models (Yang et ... them, attention-based models, which select the useful part of input information with attention mechanisms, achieve significant performances on the VQA ...

Aspect Sentiment Classification Towards Question Answering with Reinforced Bidirectional Attention Network

... tional attention network approach to tackle the above two ... Bidirectional Attention Network (RBAN) approach to ASC-QA, which employs two fundamental RAWS modules to perform word selection over the ...

Arabic Question Answering: A Study on Challenges, Systems, and Techniques

... the question answering is and how to build it in Arabic, there was a need to include only the English and Arabic papers that explained the question answering processes as well as sub ...

Adversarial Regularization for Visual Question Answering: Strengths, Shortcomings, and Side Effects

... Visual question answering (VQA) models have been shown to over-rely on linguistic biases in VQA datasets, answering questions “blindly” without considering visual ... on ...
