Incorporating visual knowledge into text-only dialogue systems has become a
potential direction to imitate the way humans think, imagine, and communicate.
However, existing multimodal dialogue systems are confined either by the scale
and quality of available datasets or by a coarse notion of visual knowledge. To
address these issues, we provide a new paradigm for constructing multimodal
dialogues, as well as two datasets extended from text-only dialogues under this
paradigm (ReSee-WoW, ReSee-DD). We propose to explicitly split the visual
knowledge into finer granularity ("turn-level" and "entity-level"). To
further boost the accuracy and diversity of the augmented visual information, we
retrieve it from the Internet or a large image dataset. To demonstrate the
superiority and universality of the provided visual knowledge, we propose a
simple but effective framework, ReSee, which adds visual representations to
vanilla dialogue models by modality concatenation. We also conduct extensive
experiments and ablations w.r.t. different model configurations and visual
knowledge settings. Encouraging empirical results not only demonstrate the
effectiveness of introducing visual knowledge at both the entity and turn levels
but also verify that the proposed model, ReSee, outperforms several
state-of-the-art methods on automatic and human evaluations. By leveraging text and vision
knowledge, ReSee can produce informative responses with real-world visual
concepts.

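This paper proposes a method for integrating visual knowledge into text-based dialogue systems. By splitting visual knowledge into finer granularity and retrieving the augmented visual information from the Internet or a large image dataset, it constructs two datasets (ReSee-WoW, ReSee-DD). Extensive experiments and ablations with the proposed dialogue model (ReSee) show that it outperforms several existing state-of-the-art methods on both automatic and human evaluations.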

ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue
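
The modality concatenation mentioned in the ReSee abstract can be illustrated with a minimal sketch. The abstract only states that visual representations are added to a dialogue model by concatenation; everything else here is an assumption made for illustration: the function names, the feature sizes, the random projection matrix, the ordering of segments, and the segment labels are hypothetical and are not taken from the paper.

```python
import random


def linear(vec, weight):
    """Project a vector with an (out_dim x in_dim) weight matrix."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def random_matrix(rows, cols, rng):
    """Stand-in for learned parameters or extracted features (hypothetical)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def concat_modalities(text_emb, turn_img_feats, entity_img_feats, proj):
    """Build one input sequence: turn-level images, entity-level images, text.

    Image features are projected to the text embedding size so that all
    vectors can sit in the same sequence. A parallel list of segment labels
    records where each position came from.
    """
    sequence, segments = [], []
    for label, feats in (("turn", turn_img_feats), ("entity", entity_img_feats)):
        for feat in feats:
            sequence.append(linear(feat, proj))
            segments.append(label)
    for emb in text_emb:
        sequence.append(list(emb))
        segments.append("text")
    return sequence, segments


rng = random.Random(0)
TEXT_DIM, IMG_DIM = 8, 16                      # hypothetical sizes
proj = random_matrix(TEXT_DIM, IMG_DIM, rng)   # assumed image-to-text projection
text_emb = random_matrix(5, TEXT_DIM, rng)     # 5 dialogue tokens
turn_imgs = random_matrix(1, IMG_DIM, rng)     # 1 image retrieved for the turn
entity_imgs = random_matrix(2, IMG_DIM, rng)   # 2 images retrieved for entities

seq, seg = concat_modalities(text_emb, turn_imgs, entity_imgs, proj)
print(len(seq), seg)
```

The concatenated sequence would then be fed to an otherwise unchanged dialogue model; keeping segment labels is one simple way to let the model distinguish turn-level images, entity-level images, and text.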

The demand for multimodal dialogue systems has been rising in various
domains, emphasizing the importance of interpreting multimodal inputs from
conversational and situational contexts. We explore three methods to tackle
this problem and evaluate them on the largest situated dialogue dataset, SIMMC
2.1. Our best method, scene-dialogue alignment, improves performance by
~20% in F1 score compared to the SIMMC 2.1 baselines. We provide analysis and
discussion of the limitations of our methods and potential directions
for future work. Our code is publicly available at
this https URL.

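We explore three methods and evaluate them on the SIMMC 2.1 dataset. Our most effective method, scene-dialogue alignment, improves the F1 score by about 20% over the SIMMC 2.1 baselines. We also analyze and discuss the limitations of our methods and potential directions for future research.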

Which One Are You Referring To? Multimodal Object Identification in Situated Dialogue
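
The F1 score reported for object identification can be made concrete with a small sketch. It assumes that each dialogue turn has a set of predicted object ids and a set of gold object ids, and that counts are pooled over all turns (micro-averaging). The abstract does not describe the scoring procedure, so the function name `object_f1`, the pooling choice, and the toy data are assumptions made for illustration.

```python
def object_f1(predictions, references):
    """Micro-averaged precision, recall, and F1 over per-turn sets of object ids."""
    n_pred = n_true = n_correct = 0
    for pred, gold in zip(predictions, references):
        pred, gold = set(pred), set(gold)
        n_pred += len(pred)
        n_true += len(gold)
        n_correct += len(pred & gold)   # objects both predicted and referenced
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_true if n_true else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data (hypothetical): three turns with predicted and gold object ids.
preds = [[1, 2], [5], []]
golds = [[1], [5, 6], [7]]
p, r, f = object_f1(preds, golds)
print(round(p, 3), round(r, 3), round(f, 3))
```

In this toy case 2 of the 3 predicted objects are correct and 2 of the 4 gold objects are found, so precision is 2/3, recall is 1/2, and F1 is 4/7.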

Understanding dynamic scenes and dialogue contexts in order to converse with
users has been challenging for multimodal dialogue systems. The 8th Dialog
System Technology Challenge (DSTC8) proposed an Audio Visual Scene-Aware Dialog
(AVSD) task, which contains multiple modalities including audio, vision, and
language, to evaluate how dialogue systems understand different modalities and
respond to users. In this paper, we propose a multi-step joint-modality
attention network (JMAN) based on a recurrent neural network (RNN) to reason on
videos. Our model performs a multi-step attention mechanism and jointly
considers both visual and textual representations in each reasoning process to
better integrate information from the two different modalities. Compared to the
baseline released by the AVSD organizers, our model achieves relative improvements
of 12.1% and 22.4% on ROUGE-L score and CIDEr score, respectively.

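This paper proposes a multi-step joint-modality attention network (JMAN), based on a recurrent neural network, for reasoning on videos. The model jointly considers visual and textual representations in each reasoning step to better integrate information from the two modalities. Compared to the baseline released by the AVSD organizers, our model achieves relative improvements of 12.1% and 22.4% on ROUGE-L and CIDEr, respectively.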
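
The idea of attending to both modalities at every reasoning step can be sketched as follows. This is not the JMAN architecture: the abstract describes an RNN-based model, whereas this sketch swaps in plain dot-product attention and an additive query update to keep the example short. The function names, the update rule, and the toy vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Dot-product attention: weighted sum of features scored against the query."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(len(query))
    ]


def multi_step_joint_attention(question, visual_feats, text_feats, steps=2):
    """Refine a query over several steps, reading both modalities at each step."""
    query = list(question)
    for _ in range(steps):
        v_ctx = attend(query, visual_feats)   # what the video frames contribute
        t_ctx = attend(query, text_feats)     # what the dialogue text contributes
        # Joint update (assumed): both contexts shape the next step's query.
        query = [q + v + t for q, v, t in zip(query, v_ctx, t_ctx)]
    return query


# Toy inputs (hypothetical): a 2-d question vector, two frames, one text vector.
question = [1.0, 0.0]
visual = [[1.0, 0.0], [0.0, 1.0]]
text = [[0.5, 0.5]]
print(multi_step_joint_attention(question, visual, text, steps=2))
```

Because the query is updated from both contexts before the next step, later attention over the video depends on what was read from the text, and vice versa, which is the sense in which the steps are joint.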