Visual Dialog (VD) is a task where an agent answers a series of image-related questions based on a multi-round dialog history. However, previous VD methods often treat the entire dialog history as a simple text input, disregarding the inherent conversational information flows at the round level. In this paper, we introduce the Multi-round Dialogue State Tracking model (MDST), a framework that addresses this limitation by leveraging the dialogue state learned from the dialog history to answer questions. MDST captures each round of the dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations. These representations effectively ground the current question, enabling the generation of accurate answers. Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting. Furthermore, through a series of human studies, we validate the effectiveness of MDST in generating long, consistent, and human-like answers while consistently answering a series of questions correctly.

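This paper addresses the neglect of information flow in the dialog history in the visual dialog task and proposes the Multi-round Dialogue State Tracking model (MDST), which answers questions using the state learned from the dialog history. Experimental results show that MDST achieves new state-of-the-art performance in the generative setting, and human studies validate its effectiveness in generating long, consistent, and human-like answers.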

Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations