Designed for tracking user goals in dialogues, a dialogue state tracker is an
essential component in a dialogue system. However, research on dialogue
state tracking has largely been limited to a single modality, in which slots
and slot values are constrained by knowledge domains (e.g., a restaurant domain
with slots for restaurant name and price range) and defined by a specific
database schema.
In this paper, we propose to extend the definition of dialogue state tracking
to multimodality. Specifically, we introduce a novel dialogue state tracking
task to track the information of visual objects that are mentioned in
video-grounded dialogues. Each new dialogue utterance may introduce a new video
segment, new visual objects, or new object attributes, and a state tracker is
required to update these information slots accordingly. We created a new
synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer
Network (VDTN), for this task. VDTN combines both object-level features and
segment-level features and learns contextual dependencies between videos and
dialogues to generate multimodal dialogue states. We optimized VDTN for a state
generation task as well as a self-supervised video understanding task that
recovers video segment or object representations. Finally, we trained VDTN to
use the decoded states in a response prediction task. Through
comprehensive ablation and qualitative analyses, we derive
insights towards building more capable multimodal dialogue systems.
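
The state update described above can be illustrated with a minimal sketch. It is not the VDTN model or the benchmark's actual schema: the `MultimodalState` class, the `update_state` helper, the slot names (`shape`, `color`), and the segment units are all hypothetical, and carrying object slots over when the segment changes is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class MultimodalState:
    """Hypothetical state: one video segment plus per-object attribute slots."""
    segment: Optional[Tuple[int, int]] = None  # (start, end); units are assumed
    objects: Dict[str, Dict[str, str]] = field(default_factory=dict)


def update_state(state, segment=None, objects=None):
    """Return a new state with one turn's updates merged into the old state."""
    # Copy existing slots so the previous state is left untouched.
    merged = {obj_id: dict(attrs) for obj_id, attrs in state.objects.items()}
    # A turn may add new objects or new attributes of known objects.
    for obj_id, attrs in (objects or {}).items():
        merged.setdefault(obj_id, {}).update(attrs)
    # A turn may also move to a new video segment (assumption: objects persist).
    new_segment = segment if segment is not None else state.segment
    return MultimodalState(segment=new_segment, objects=merged)


state = MultimodalState()
# Turn 1: a segment and a first object are introduced.
state = update_state(state, segment=(0, 40), objects={"obj1": {"shape": "cube"}})
# Turn 2: a new attribute for obj1 and a second object.
state = update_state(state, objects={"obj1": {"color": "red"}, "obj2": {"shape": "cone"}})
# Turn 3: only the video segment changes.
state = update_state(state, segment=(40, 90))
```

In the paper's setting the state is generated by a learned model from video and dialogue features; the sketch only shows what "updating information slots" means at the data-structure level.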

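This paper proposes a new multimodal dialogue state tracking task that tracks information about visual objects mentioned in video-grounded dialogues, and introduces the Video-Dialogue Transformer Network (VDTN) as a baseline model for the task.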

Multimodal Dialogue State Tracking

Compared to traditional visual question answering, video-grounded dialogues
require additional reasoning over dialogue context to answer questions in a
multi-turn setting. Previous approaches to video-grounded dialogues mostly use
dialogue context as a simple text input without modelling the inherent
information flows at the turn level. In this paper, we propose a novel
framework of Reasoning Paths in Dialogue Context (PDC). The PDC model discovers
information flows among dialogue turns through a semantic graph constructed
from lexical components in each question and answer. The PDC model then learns
to predict reasoning paths over this semantic graph. Our path prediction model
predicts a path from the current turn through past dialogue turns that contain
additional visual cues to answer the current question. Our reasoning model
sequentially processes both visual and textual information through this
reasoning path and the propagated features are used to generate the answer. Our
experimental results demonstrate the effectiveness of our method and provide
additional insights on how models use semantic dependencies in a dialogue
context to retrieve visual cues.
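
The idea of a turn-level semantic graph and a path through past turns can be sketched as follows. This is a simplified stand-in, not the PDC model: whitespace tokenization with a small hand-picked stopword list replaces the paper's lexical components, and a greedy rule (follow the most recent linked turn) replaces the learned path predictor. The names `content_words`, `build_graph`, and `greedy_path` are hypothetical.

```python
from typing import Dict, List, Set

# Assumed stopword list, chosen only to keep the toy example readable.
STOPWORDS = {
    "the", "a", "an", "is", "it", "what", "where", "does", "do", "he", "she",
    "to", "of", "in", "on", "and", "there", "any", "yes", "no", "up",
}


def content_words(text: str) -> Set[str]:
    """Lowercase, strip punctuation, and drop stopwords."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def build_graph(turns: List[str]) -> Dict[int, List[int]]:
    """Link each turn to every earlier turn that shares a content word."""
    bags = [content_words(t) for t in turns]
    return {i: [j for j in range(i) if bags[i] & bags[j]] for i in range(len(turns))}


def greedy_path(graph: Dict[int, List[int]], start: int) -> List[int]:
    """Walk back from `start`, always taking the most recent linked turn."""
    path = [start]
    while graph[path[-1]]:
        path.append(max(graph[path[-1]]))
    return path


# Each entry is one question-answer turn of a made-up dialogue.
turns = [
    "Is there a man in the video? Yes, a man walks into the kitchen.",
    "Is there any music? No, only background noise.",
    "What does the man pick up? He picks up a cup.",
    "Where does he put the cup? On the table.",
]
graph = build_graph(turns)
path = greedy_path(graph, start=3)
```

Here the last turn links back through the turn that mentions the cup to the turn that introduces the man, while the unrelated turn about music is skipped. That is the kind of dependency a reasoning path is meant to capture.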

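The PDC model builds a semantic graph over dialogue turns and predicts reasoning paths on it, reasoning over the dialogue context to retrieve visual cues and answer questions in multi-turn video-grounded dialogues.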

Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues

Video-grounded dialogues are very challenging due to (i) the complexity of
videos which contain both spatial and temporal variations, and (ii) the
complexity of user utterances which query different segments and/or different
objects in videos over multiple dialogue turns. However, existing approaches to
video-grounded dialogues often focus on superficial temporal-level visual cues,
but neglect more fine-grained spatial signals from videos. To address this
drawback, we propose Bi-directional Spatio-Temporal Learning (BiST), a
vision-language neural framework for high-resolution queries in videos based on
textual cues. Specifically, our approach not only exploits both spatial and
temporal-level information, but also learns dynamic information diffusion
between the two feature spaces through spatial-to-temporal and
temporal-to-spatial reasoning. The bidirectional strategy aims to tackle the
evolving semantics of user queries in the dialogue setting. The retrieved
visual cues are used as contextual information to construct relevant responses
to the users. Our empirical results and comprehensive qualitative analysis show
that BiST achieves competitive performance and generates reasonable responses
on the large-scale AVSD benchmark. We also adapt our BiST models to the Video QA
setting, and substantially outperform prior approaches on the TGIF-QA
benchmark.
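
The two reasoning directions can be illustrated with a toy sketch. It is not the BiST architecture: plain dot-product attention with a single query vector stands in for the learned attention layers, the `video[t][s]` layout (frame `t`, region `s`) is an assumed feature format, and fusing by concatenation is an arbitrary choice made for illustration.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, items: List[Vector]) -> Vector:
    """Dot-product attention: average of items weighted by similarity to query."""
    weights = softmax([sum(q * x for q, x in zip(query, item)) for item in items])
    dim = len(items[0])
    return [sum(w * item[d] for w, item in zip(weights, items)) for d in range(dim)]


def spatial_to_temporal(query: Vector, video: List[List[Vector]]) -> Vector:
    """Pool regions within each frame first, then pool the frames over time."""
    per_frame = [attend(query, frame) for frame in video]
    return attend(query, per_frame)


def temporal_to_spatial(query: Vector, video: List[List[Vector]]) -> Vector:
    """Pool each region over time first, then pool across regions."""
    num_regions = len(video[0])
    per_region = [
        attend(query, [frame[s] for frame in video]) for s in range(num_regions)
    ]
    return attend(query, per_region)


# Toy video: 3 frames, 2 regions per frame, 2-dimensional features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 2.0]],
    [[2.0, 0.0], [1.0, 1.0]],
]
query = [1.0, 0.0]  # stands in for an encoded user utterance

s2t = spatial_to_temporal(query, video)
t2s = temporal_to_spatial(query, video)
fused = s2t + t2s  # list concatenation of the two views
```

The two directions generally give different vectors because the order of pooling changes which features the query emphasizes, which is the motivation for using both.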

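This paper proposes Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos based on textual cues. Results show that BiST achieves competitive performance and generates reasonable responses on the AVSD benchmark. In addition, BiST models outperform prior approaches on the TGIF-QA benchmark.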