In previous work, we proposed the Audio-Visual Scene-Aware Dialog (AVSD)
task, collected an AVSD dataset, developed AVSD technologies, and hosted an
AVSD challenge track at both the 7th and 8th Dialog System Technology
Challenges (DSTC7, DSTC8). In these challenges, the best-performing systems
relied heavily on human-generated descriptions of the video content, which were
available in the datasets but would be unavailable in real-world applications.
To promote further advancements for real-world applications, we proposed a
third AVSD challenge, at DSTC10, with two modifications: 1) the human-created
description is unavailable at inference time, and 2) systems must demonstrate
temporal reasoning by finding evidence from the video to support each answer.
This paper introduces the new task that includes temporal reasoning and our new
extension of the AVSD dataset for DSTC10, for which we collected
human-generated temporal reasoning data. We also introduce a baseline system
built using an AV-transformer, which we released along with the new dataset.
Finally, this paper introduces a new system that extends our baseline system
with attentional multimodal fusion, joint student-teacher learning (JSTL), and
model combination techniques, achieving state-of-the-art performance on the
AVSD datasets for DSTC7, DSTC8, and DSTC10. We also propose two temporal
reasoning methods for AVSD: one attention-based, and one based on a time-domain
region proposal network.
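
To make the student-teacher idea above concrete, the following is a minimal sketch of a distillation-style training loss. It assumes a teacher model that sees the human-written description during training and a student model that does not; the abstract does not give the exact objective, so the function names, the `alpha` weight, and the use of plain cross-entropy terms are all illustrative assumptions rather than the authors' formulation.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target and a predicted distribution."""
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))


def student_teacher_loss(student_logits, teacher_logits, gold_index, alpha=0.5):
    """Hypothetical loss for one output token.

    Mixes a hard-label term (student vs. ground-truth token) with a
    soft-label term (student vs. teacher distribution). `alpha` is an
    assumed mixing weight, not a value taken from the paper.
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    one_hot = [1.0 if i == gold_index else 0.0 for i in range(len(student))]
    hard = cross_entropy(one_hot, student)
    soft = cross_entropy(teacher, student)
    return alpha * hard + (1.0 - alpha) * soft


# Example: a three-word vocabulary where the teacher favors token 0.
loss = student_teacher_loss([2.0, 0.5, 0.1], [4.0, 0.0, 0.0], gold_index=0)
```

In this sketch the student is penalized both for missing the reference token and for diverging from the teacher, which is one common way to transfer knowledge from a model with privileged inputs to one without them.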

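This paper introduces the third AVSD challenge, which includes a temporal reasoning task and a new dataset with human-generated temporal reasoning data. It presents a baseline system built on an AV-transformer and extends it with attentional multimodal fusion, joint student-teacher learning, and model combination techniques, improving performance on the AVSD datasets. It also proposes two temporal reasoning methods for AVSD: one attention-based and one based on a time-domain region proposal network.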

Audio-Visual Scene-Aware Dialog and Reasoning using Audio-Visual Transformers with Joint Student-Teacher Learning

Now that anyone can easily record videos and the volume of recorded video
keeps growing, research on improved video retrieval methods is increasingly
important. When a target video must be identified within a large personal
collection, the system needs enough information to distinguish the correct
video from many similar items in the database. The purpose of this research is
to retrieve target videos in such cases by introducing an interaction, or a
dialog, between the system and the user. We propose a system that retrieves
videos by asking questions about their content and leveraging the user's
responses to those questions. Additionally, we confirmed the usefulness of the
proposed system through experiments on the AVSD dataset, which includes videos
and dialogs about the videos.
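
As a rough illustration of how user responses can narrow a search among similar videos, the sketch below reranks candidates by word overlap between each candidate's text description and the answers collected so far. This is a toy stand-in, not the proposed system: the video IDs, descriptions, and the overlap score are all hypothetical, and a real system would presumably use learned representations of the video and dialog.

```python
def overlap_score(candidate_text, answers):
    """Count words shared between a candidate description and each answer."""
    candidate_words = set(candidate_text.lower().split())
    return sum(len(candidate_words & set(answer.lower().split()))
               for answer in answers)


def rerank(candidates, answers):
    """Return video IDs ordered from best to worst match.

    `candidates` maps a (hypothetical) video ID to a text description;
    `answers` is the list of user responses gathered during the dialog.
    """
    return sorted(candidates,
                  key=lambda vid: overlap_score(candidates[vid], answers),
                  reverse=True)


# Hypothetical collection of similar videos.
candidates = {
    "v1": "a man cooks in the kitchen",
    "v2": "a woman reads a book on the sofa",
    "v3": "a man reads a newspaper in the kitchen",
}

# Each dialog turn adds one user answer; the ranking is recomputed each time.
answers = ["a man", "he reads in the kitchen"]
ranking = rerank(candidates, answers)
```

Each additional answer adds evidence, so candidates that looked identical after the first turn can be separated by later ones.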

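This study introduces a video retrieval system based on interactive dialog that helps users quickly and accurately find a target video among many similar videos, and demonstrates the system's effectiveness through experiments on the AVSD dataset.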

Interactive Video Retrieval with Dialog

We introduce the task of scene-aware dialog. Our goal is to generate a
complete and natural response to a question about a scene, given video and
audio of the scene and the history of previous turns in the dialog. To answer
successfully, agents must ground concepts from the question in the video while
leveraging contextual cues from the dialog history. To benchmark this task, we
introduce the Audio Visual Scene-Aware Dialog (AVSD) Dataset. For each of more
than 11,000 videos of human actions from the Charades dataset, our dataset
contains a dialog about the video, plus a final summary of the video by one of
the dialog participants. We train several baseline systems for this task and
evaluate the performance of the trained models using both qualitative and
quantitative metrics. Our results indicate that models must utilize all the
available inputs (video, audio, question, and dialog history) to perform best
on this dataset.

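This paper introduces the task of scene-aware dialog, in which questions about a scene are answered using the scene's video and audio together with contextual cues from the dialog history. It also presents the AVSD dataset and evaluates baseline models with both quantitative and qualitative metrics; the results indicate that models must use all available inputs (video, audio, question, and dialog history) to perform best on this dataset.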