There have been many attempts to build multimodal dialog systems that can
respond to a question about given audio-visual information, and the
representative task for such systems is the Audio Visual Scene-Aware Dialog
(AVSD). Most conventional AVSD models adopt a Convolutional Neural Network
(CNN)-based video feature extractor to understand visual information. While a
CNN tends to capture temporally and spatially local information, global
information is also crucial for video understanding because AVSD requires
long-term temporal visual dependencies and the full visual context. In
this study, we apply a Transformer-based video feature that can capture both
temporally and spatially global representations more efficiently than a
CNN-based feature. Our AVSD model with its Transformer-based feature attains
higher objective performance scores for answer generation. In addition, our
model achieves a subjective score close to that of human answers in DSTC10. We
observed that the Transformer-based visual feature is beneficial for the AVSD
task because our model tends to correctly answer the questions that need a
temporally and spatially broad range of visual information.
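
The contrast between local and global aggregation described above can be illustrated with a minimal sketch. It is not the feature extractor used in the paper: it assumes single-head scaled dot-product attention with identity query/key projections over toy per-frame features, and it models the CNN side as a uniform width-3 temporal filter. The function names and feature values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention_weights(tokens):
    """Scaled dot-product attention weights (assumed identity Q/K projections)."""
    scale = math.sqrt(len(tokens[0]))
    return [softmax([dot(q, k) / scale for k in tokens]) for q in tokens]

def local_window_weights(num_tokens, kernel=3):
    """Uniform averaging weights of a width-`kernel` temporal filter (CNN-like)."""
    half = kernel // 2
    rows = []
    for i in range(num_tokens):
        lo, hi = max(0, i - half), min(num_tokens, i + half + 1)
        rows.append([1.0 / (hi - lo) if lo <= j < hi else 0.0
                     for j in range(num_tokens)])
    return rows

# Toy per-frame features (hypothetical values): 6 frames, 2 dimensions each.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.1], [0.9, 0.0]]
attn = self_attention_weights(frames)
local = local_window_weights(len(frames))

# Number of frames that contribute to the representation of frame 0.
reach_global = sum(1 for w in attn[0] if w > 0)
reach_local = sum(1 for w in local[0] if w > 0)
print(reach_global, reach_local)  # prints: 6 2
```

In this toy setting, frame 0 draws on all six frames under attention, including the similar but distant frame 5, whereas the local filter mixes it only with its immediate neighbor. Stacking many local layers widens that reach, but a single attention layer already spans the whole clip.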

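This study explores using a Transformer-based video feature extractor in Audio Visual Scene-Aware Dialog (AVSD) to address long-term temporal visual dependency and global visual information, and attains higher objective performance scores for answer generation.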

Audio Visual Scene-Aware Dialog Generation with Transformer-based Video Representations

Audio Visual Scene-aware Dialog (AVSD) is the task of generating a response
to a question given a scene, video, audio, and the history of previous
turns in the dialog. Existing systems for this task employ transformer- or
recurrent neural network-based architectures within an encoder-decoder framework.
Even though these techniques perform well on this task, they have
significant limitations: the model easily overfits, merely memorizing
grammatical patterns, and it follows the prior distribution of the
vocabulary in the dataset. To alleviate these problems, we propose a Multimodal
Semantic Transformer Network. It employs a transformer-based architecture with
an attention-based word embedding layer that generates words by querying word
embeddings. With this design, our model keeps considering the meaning of the
words at the generation stage. The empirical results show that our proposed
model outperforms most previous work on the AVSD task.
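
The idea of generating words by querying word embeddings can be sketched as follows. This is a minimal illustration, not the layer from the paper: it assumes the decoder state acts as a query, each word embedding acts as a key, and plain dot products followed by a softmax give the word distribution. The vocabulary, embedding values, and function name are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def generate_word_distribution(hidden, embeddings):
    """Score each vocabulary word by the dot product between the decoder
    state (query) and that word's embedding (key), then normalize."""
    vocab = list(embeddings)
    scores = [sum(h * e for h, e in zip(hidden, embeddings[w])) for w in vocab]
    return dict(zip(vocab, softmax(scores)))

# Hypothetical 3-dimensional embeddings for a tiny vocabulary.
embeddings = {
    "dog":   [0.9, 0.1, 0.0],
    "puppy": [0.8, 0.2, 0.0],
    "table": [0.0, 0.1, 0.9],
}
hidden = [1.0, 0.0, 0.0]  # decoder state pointing toward the "dog" region
probs = generate_word_distribution(hidden, embeddings)
best = max(probs, key=probs.get)
print(best)  # prints: dog
```

Because scores come from the embedding space itself, a semantically close word such as "puppy" receives more probability than an unrelated one such as "table", which is one way a generator can keep word meaning in play at the output stage.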

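Proposes a Multimodal Semantic Transformer Network, a transformer architecture with an attention-based word embedding layer that generates words by querying word embeddings. The model achieves strong performance on the AVSD task.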

DSTC8-AVSD: Multimodal Semantic Transformer Network with Retrieval Style Word Generator

Understanding dynamic scenes and dialogue contexts in order to converse with
users has been challenging for multimodal dialogue systems. The 8th Dialog
System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog
(AVSD) task, which contains multiple modalities including audio, vision, and
language, to evaluate how dialogue systems understand different modalities and
respond to users. In this paper, we propose a multi-step joint-modality
attention network (JMAN) based on a recurrent neural network (RNN) to reason on
videos. Our model performs a multi-step attention mechanism and jointly
considers both visual and textual representations in each reasoning step to
better integrate information from the two modalities. Compared to the
baseline released by the AVSD organizers, our model achieves relative
improvements of 12.1% and 22.4% on the ROUGE-L and CIDEr scores, respectively.
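
The multi-step, joint-modality reasoning described above might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not JMAN itself: the RNN components are omitted, attention is plain dot-product attention, and the query update (averaging the previous query with the visual and textual contexts) is an assumed rule chosen for brevity. All names and feature values are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, features):
    """Dot-product attention: weighted sum of `features` given `query`."""
    weights = softmax([sum(q * f for q, f in zip(query, feat))
                       for feat in features])
    dim = len(query)
    return [sum(w * feat[d] for w, feat in zip(weights, features))
            for d in range(dim)]

def multi_step_joint_attention(question, visual, textual, steps=3):
    """Refine the query over several steps, reading both modalities each time."""
    query = list(question)
    for _ in range(steps):
        v_ctx = attend(query, visual)    # context from visual features
        t_ctx = attend(query, textual)   # context from textual features
        # Assumed update rule: average the old query with both contexts.
        query = [(q + v + t) / 3.0 for q, v, t in zip(query, v_ctx, t_ctx)]
    return query

# Hypothetical 2-dimensional features for each modality.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual = [[0.5, 0.5], [1.0, 0.0]]
question = [1.0, 0.0]
out = multi_step_joint_attention(question, visual, textual, steps=3)
print(out)
```

Each step conditions the next round of attention on what was read from both modalities in the previous round, which is the sense in which the visual and textual representations are considered jointly.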

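This paper proposes a multi-step joint-modality attention network (JMAN) based on a recurrent neural network to reason on videos. In each reasoning step, the model jointly considers visual and textual representations to better integrate information from the two modalities. Compared to the baseline released by the AVSD organizers, our model achieves relative improvements of 12.1% and 22.4% on the ROUGE-L and CIDEr scores, respectively.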