Pre-trained language models have shown remarkable success in improving
various downstream NLP tasks due to their ability to capture dependencies in
textual data and generate natural responses. In this paper, we leverage the
power of pre-trained language models for improving video-grounded dialogue,
which is very challenging and involves complex features of different dynamics:
(1) Video features which can extend across both spatial and temporal
dimensions; and (2) Dialogue features which involve semantic dependencies over
multiple dialogue turns. We propose a framework that extends GPT-2 models to
tackle these challenges by formulating video-grounded dialogue as a
sequence-to-sequence task, combining visual and textual representations
into a structured sequence, and fine-tuning a large pre-trained GPT-2 network.
Our framework allows fine-tuning language models to capture dependencies across
multiple modalities over different levels of information: the spatio-temporal
level in video and the token-sentence level in dialogue context. We achieve
promising improvements on the Audio-Visual Scene-Aware Dialogues (AVSD)
benchmark from DSTC7, which supports further research in this direction.

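Summary: This paper proposes a GPT-2-based framework that combines video and text representations into a single structured sequence and fine-tunes the model on it to address the challenges of video-grounded dialogue, achieving a marked improvement on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark.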
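
To make the idea of "combining visual and textual representations into a structured sequence" concrete, the following is a minimal sketch of one way such a sequence could be assembled. It is an illustration under stated assumptions, not the authors' implementation: the function name `build_multimodal_sequence`, the `(T, S, D_v)` feature layout, the linear projection `video_proj`, and the use of modality, temporal, and spatial index arrays are all hypothetical choices made for this example.

```python
import numpy as np

def build_multimodal_sequence(video_feats, token_ids, token_embedding, video_proj):
    """Flatten video features and concatenate them with text token embeddings.

    All names and shapes here are illustrative assumptions:
      video_feats:     (T, S, D_v) array, T temporal steps x S spatial regions
      token_ids:       (N,) integer array of dialogue token ids
      token_embedding: (V, D) embedding table standing in for the language model's
      video_proj:      (D_v, D) linear projection into the embedding space
    """
    T, S, D_v = video_feats.shape
    n_text = len(token_ids)

    # One sequence position per (temporal step, spatial region) pair,
    # projected into the same dimension as the text embeddings.
    video_seq = video_feats.reshape(T * S, D_v) @ video_proj  # (T*S, D)
    text_seq = token_embedding[token_ids]                     # (N, D)

    # Video positions first, then dialogue tokens.
    sequence = np.concatenate([video_seq, text_seq], axis=0)  # (T*S + N, D)

    # Index arrays that record the structure of each position.
    # Modality: 0 = video, 1 = text. Text positions get -1 for the
    # temporal and spatial indices, since those do not apply to them.
    modality_ids = np.concatenate([np.zeros(T * S, dtype=np.int64),
                                   np.ones(n_text, dtype=np.int64)])
    temporal_ids = np.concatenate([np.repeat(np.arange(T), S),
                                   np.full(n_text, -1)])
    spatial_ids = np.concatenate([np.tile(np.arange(S), T),
                                  np.full(n_text, -1)])
    return sequence, modality_ids, temporal_ids, spatial_ids
```

In a fine-tuning setup, a sequence like this could be passed to a pre-trained Transformer as input embeddings, with the index arrays mapped to learned embeddings that are added to each position. Whether the original framework encodes structure in exactly this way is not stated in the text above.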