Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU)
aims to decode the semantic information manifested in a multimodal
conversational history, while inferring the emotions and intents simultaneously
for the current utterance. MC-EIU is an enabling technology for many
human-computer interfaces. However, existing datasets fall short in terms of
annotation, modality, language diversity, and accessibility. In this
work, we propose an MC-EIU dataset, which features 7 emotion categories, 9
intent categories, 3 modalities, i.e., textual, acoustic, and visual content,
and two languages, i.e., English and Mandarin. Furthermore, it is completely
open-source for free access. To our knowledge, MC-EIU is the first
comprehensive and rich emotion and intent joint understanding dataset for
multimodal conversation. Together with the release of the dataset, we also
develop an Emotion and Intent Interaction (EI$^2$) network as a reference
system by modeling the deep correlation between emotion and intent in the
multimodal conversation. With comparative experiments and ablation studies, we
demonstrate the effectiveness of the proposed EI$^2$ method on the MC-EIU
dataset. The dataset and code will be made available at:
this https URL

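This paper describes Emotion and Intent Joint Understanding in Multimodal Conversation (MC-EIU), a task that aims to decode the semantic information in a multimodal conversational history while simultaneously inferring the emotion and intent of the current utterance. It also proposes the MC-EIU dataset, which covers 7 emotion categories, 9 intent categories, 3 modalities (textual, acoustic, and visual content), and two languages, English and Mandarin.
In addition, an Emotion and Intent Interaction (EI$^2$) network is developed as a reference system that models the deep correlation between emotion and intent in multimodal conversation. Experiments demonstrate the effectiveness of the proposed EI$^2$ method on the MC-EIU dataset.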

Emotion and Intent Joint Understanding in Multimodal Conversation: A Benchmarking Dataset

Dynamically synthesizing talking speech that actively responds to a listening
head is critical during face-to-face interaction. For example, the speaker
could take advantage of the listener's facial expression to adjust the tones,
stressed syllables, or pauses. In this work, we present a new visual-aware
text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual
inputs and sequential visual feedback (e.g., nod, smile) of the listener in
face-to-face communication. Unlike traditional text-to-speech, VA-TTS
highlights the impact of the visual modality. On this newly minted task, we devise
a baseline model to fuse phoneme linguistic information and listener visual
signals for speech synthesis. Extensive experiments on the multimodal conversation
dataset ViCo-X verify that our proposal generates more natural audio with
scenario-appropriate rhythm and prosody.

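This paper proposes a new visual-aware text-to-speech (VA-TTS) task that conditions speech generation on the text input and on the listener's visual feedback, such as facial expressions, in face-to-face communication. Experiments show that the method generates more natural, rhythmically appropriate audio across a range of scenarios.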