We introduce the task of scene-aware dialog. Given a follow-up question in an ongoing dialog about a video, our goal is to generate a complete and natural response to a question given (a) an input video, and (b) the history of previous turns in the dialog. To succeed, agents must ground the semantics in the video and leverage contextual cues from the history of the dialog to answer the question. To benchmark this task, we introduce the Audio Visual Scene-Aware Dialog (AVSD) dataset. For each of more than 11,000 videos of human actions for the Charades dataset. Our dataset contains a dialog about the video, plus a final summary of the video by one of the dialog participants. We train several baseline systems for this task and evaluate the performance of the trained models using several qualitative and quantitative metrics. Our results indicate that the models must comprehend all the available inputs (video, audio, question and dialog history) to perform well on this dataset.

本论文介绍了场景感知对话任务，通过视频和音频研究场景，并在对话历史中利用上下文线索，以回答关于场景的问题；同时提出了AVSD数据集，并通过多项定量和定性指标评估了基础模型的表现，结果表明模型必须充分利用所有可用输入（视频、音频、问题和对话历史）才能在该数据集上取得最佳表现。

视听场景感知对话