Video-grounded Dialogue (VGD) aims to answer questions regarding a given
multi-modal input comprising video, audio, and dialogue history. Although there
have been numerous efforts in developing VGD systems to improve the quality of
their responses, existing systems are competent only at incorporating
information from the video and text, and tend to struggle to extract the
necessary information from the audio when generating appropriate responses to
the question. The VGD system seems to be deaf, and thus we coin the term
"deaf response" for this symptom of current systems ignoring audio data. To
overcome the deaf response problem, the Hearing Enhanced Audio Response (HEAR)
framework is proposed to perform sensible listening by selectively attending to
audio
whenever the question requires it. The HEAR framework enhances the accuracy and
audibility of VGD systems in a model-agnostic manner. HEAR is validated on VGD
datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows effectiveness with various
VGD systems.

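The Hearing Enhanced Audio Response (HEAR) framework is proposed to address the deaf response problem in Video-grounded Dialogue systems, improving their audibility and accuracy by selectively attending to audio.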

HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue

Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question
regarding a given video and dialogue context. Despite the recent success of
multi-modal reasoning to generate answer sentences, existing dialogue systems
still suffer from a text hallucination problem, which denotes indiscriminate
text-copying from input texts without an understanding of the question. This is
due to learning spurious correlations from the fact that answer sentences in
the dataset usually include words from the input texts, so the VGD system
relies excessively on copying words from input texts in the hope that those
words overlap with the ground-truth texts. Hence, we design the Text
Hallucination Mitigating (THAM) framework, which incorporates a Text
Hallucination Regularization (THR) loss derived from the proposed
information-theoretic text hallucination measurement approach. Applying THAM to
current dialogue systems validates its effectiveness on VGD benchmarks (i.e.,
AVSD@DSTC7 and AVSD@DSTC8) and shows enhanced interpretability.

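This study designs the Text Hallucination Mitigating (THAM) framework and, by applying it to current dialogue systems, validates its effectiveness and improved interpretability for Video-grounded Dialogue on benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8).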

Information-Theoretic Text Hallucination Reduction for Video-grounded Dialogue

Pre-trained language models have shown remarkable success in improving
various downstream NLP tasks due to their ability to capture dependencies in
textual data and generate natural responses. In this paper, we leverage the
power of pre-trained language models for improving video-grounded dialogue,
which is very challenging and involves complex features of different dynamics:
(1) Video features which can extend across both spatial and temporal
dimensions; and (2) Dialogue features which involve semantic dependencies over
multiple dialogue turns. We propose a framework by extending GPT-2 models to
tackle these challenges by formulating video-grounded dialogue tasks as a
sequence-to-sequence task, combining both visual and textual representations
into a structured sequence, and fine-tuning a large pre-trained GPT-2 network.
Our framework allows fine-tuning language models to capture dependencies across
multiple modalities over different levels of information: spatio-temporal level
in video and token-sentence level in dialogue context. We achieve promising
improvement on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark from
DSTC7, which supports a potential direction in this line of research.

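This paper proposes a framework based on GPT-2 models that combines video and text representations into a single structured sequence and uses fine-tuning to tackle the challenges of video-grounded dialogue, achieving a promising improvement on the Audio-Visual Scene-Aware Dialogues benchmark.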