Video-grounded dialogue generation (VDG) requires the system to generate a
fluent and accurate answer based on multimodal knowledge. However, the
difficulty of utilizing multimodal knowledge causes serious hallucinations in
VDG models in practice. Although previous work mitigates hallucination in a
variety of ways, it largely overlooks the importance of multimodal knowledge
anchor answer tokens. In this paper, we reveal via perplexity that
different VDG models experience varying hallucinations and exhibit diverse
anchor tokens. Based on this observation, we propose M2K-VDG, a model-adaptive
multimodal knowledge anchor enhancement framework for hallucination reduction.
Furthermore, we introduce the counterfactual effect for more accurate anchor
token detection. Experimental results on three popular benchmarks show that
our approach outperforms state-of-the-art methods, demonstrating its
effectiveness in reducing hallucinations.
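
The abstract does not say how anchor tokens are detected, so the following is only an illustrative sketch under stated assumptions. It assumes that an anchor token is an answer token whose likelihood depends strongly on the multimodal input, and it approximates the counterfactual effect as the drop in per-token negative log-likelihood when the multimodal knowledge is supplied. All function names, the threshold, and the probabilities are hypothetical and are not taken from the paper.

```python
import math


def token_nll(probs):
    """Negative log-likelihood of each gold answer token."""
    return [-math.log(p) for p in probs]


def perplexity(probs):
    """Perplexity of an answer: exp of the mean per-token NLL."""
    nll = token_nll(probs)
    return math.exp(sum(nll) / len(nll))


def counterfactual_effect(probs_with, probs_without):
    """Per-token NLL without multimodal knowledge minus NLL with it.

    A large positive value means the token is much harder to predict
    once the multimodal input is removed (assumed proxy, not the
    paper's definition).
    """
    nll_with = token_nll(probs_with)
    nll_without = token_nll(probs_without)
    return [b - a for a, b in zip(nll_with, nll_without)]


def detect_anchor_tokens(tokens, probs_with, probs_without, threshold=1.0):
    """Return tokens whose counterfactual effect exceeds the threshold."""
    effects = counterfactual_effect(probs_with, probs_without)
    return [tok for tok, e in zip(tokens, effects) if e > threshold]


# Hypothetical probabilities a model assigns to each gold answer token.
tokens = ["a", "man", "is", "cooking", "in", "the", "kitchen"]
probs_with = [0.9, 0.6, 0.9, 0.5, 0.9, 0.9, 0.4]       # video available
probs_without = [0.9, 0.5, 0.9, 0.05, 0.9, 0.9, 0.04]  # video removed

anchors = detect_anchor_tokens(tokens, probs_with, probs_without)
print(anchors)
```

Because the probabilities come from the model under study, the same answer can yield different anchor tokens for different models, which is the sense in which such a detection step would be model-adaptive.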

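By computing perplexity, we reveal that different video-grounded dialogue generation (VDG) models experience different hallucinations and exhibit diverse anchor tokens. Based on this observation, we propose M2K-VDG, a model-adaptive multimodal knowledge anchor enhancement framework for reducing hallucination. Furthermore, we introduce the counterfactual effect to detect anchor tokens more accurately. Experimental results on three popular benchmarks show that our method outperforms existing methods, demonstrating its effectiveness in reducing hallucinations.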

M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation

We study video-grounded dialogue generation, where a response is generated
based on the dialogue context and the associated video. The primary challenges
of this task lie in (1) the difficulty of integrating video data into
pre-trained language models (PLMs), which hinders exploiting the power of
large-scale pre-training; and (2) the necessity of taking into account
the complementarity of various modalities throughout the reasoning process.
Although existing methods have made remarkable progress in video-grounded
dialogue generation, they still fall short of integrating with PLMs in a
way that allows information from different modalities to complement each other.
To alleviate these issues, we first propose extracting pertinent information
from videos and turning it into reasoning paths that are acceptable to PLMs.
Additionally, we propose a multi-agent reinforcement learning method to
collaboratively perform reasoning on different modalities (i.e., video and
dialogue context). Experimental results on two public datasets indicate
that the proposed model outperforms state-of-the-art models by large margins
on both automatic and human evaluations.

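This paper studies video-grounded dialogue generation and proposes a method that integrates video data into pre-trained language models and uses multimodal reasoning so that the modalities provide complementary information. Experimental results show that the model significantly outperforms existing state-of-the-art models on both automatic and human evaluations.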