Visual dialog is a challenging vision-language task in which an agent answers a series of questions grounded in a given image. Solving the task requires a high-level understanding of various multimodal inputs (e.g., question, dialog history, image, and answer). Specifically, an agent must 1) understand the question-relevant dialog history and 2) focus on the question-relevant visual content among the diverse visual content in a given image. In this paper, we propose the Multi-View Attention Network (MVAN), which considers complementary views of multimodal inputs based on attention mechanisms. MVAN effectively captures question-relevant information from the dialog history with two different textual views (i.e., Topic Aggregation and Context Matching), and integrates multimodal representations with a two-step fusion process. Experimental results on the VisDial v1.0 and v0.9 benchmarks show the effectiveness of our proposed model, which outperforms the previous state-of-the-art methods with respect to all evaluation metrics.
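
As a rough illustration of what attending to question-relevant dialog history can look like, the following is a minimal sketch of question-guided attention over history turns. It is not MVAN's Topic Aggregation or Context Matching module: the scaled dot-product scoring, the toy three-dimensional vectors, and the names `softmax`, `attend`, `question`, and `history` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(question, history):
    """Question-guided attention over dialog-history vectors (illustrative only).

    question: list of floats, a hypothetical question embedding.
    history:  list of lists of floats, one hypothetical embedding per past turn.
    Returns (weights, context), where context is the weighted sum of the turns.
    """
    dim = len(question)
    scale = math.sqrt(dim)
    # Assumed scoring function: scaled dot product between question and each turn.
    scores = [sum(q * h for q, h in zip(question, turn)) / scale for turn in history]
    weights = softmax(scores)
    # Weighted sum of the history vectors, one component at a time.
    context = [sum(w * turn[i] for w, turn in zip(weights, history)) for i in range(dim)]
    return weights, context


# Toy embeddings (made up): the first turn is most similar to the question.
question = [1.0, 0.0, 1.0]
history = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
]
weights, context = attend(question, history)
```

In this toy setup the first history turn receives the largest weight because it has the highest dot product with the question, so it dominates the resulting context vector.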

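This paper aims to address the challenges of the visual dialog task by proposing the Multi-View Attention Network (MVAN). Built on attention mechanisms, the model uses multiple views to process heterogeneous inputs and constructs multimodal representations through a sequential alignment process, which allows it to better capture question-relevant information from the dialog history. It achieves the best results on the VisDial v1.0 dataset.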

Visual Dialog Based on a Multi-View Attention Network