Visual dialogue is a challenging task because it requires answering a series
of coherent questions based on an understanding of the visual environment.
Previous studies explore multimodal co-reference only implicitly, by attending
to spatial or object-level image features, but neglect the importance of
explicitly locating the objects in the visual content that are associated with
entities in the textual content.
Therefore, in this paper we propose a {\bf M}ultimodal {\bf I}ncremental {\bf
T}ransformer with {\bf V}isual {\bf G}rounding, named MITVG, which consists of
two key parts: visual grounding and multimodal incremental transformer. Visual
grounding aims to explicitly locate related objects in the image guided by
textual entities, which helps the model exclude the visual content that does
not need attention. On the basis of visual grounding, the multimodal
incremental transformer encodes the multi-turn dialogue history combined with
the visual scene step by step, following the order of the dialogue, and then
generates a contextually and visually coherent response. Experimental results
on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the
proposed model, which achieves comparable performance.

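This study proposes a Multimodal Incremental Transformer with Visual Grounding (MITVG). The method explicitly locates the image objects related to textual entities, which helps the model exclude visual content that does not need attention, and then generates a consistent and coherent response based on the multi-turn dialogue history and the visual scene. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the model's superior performance.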