In this paper we study the problem of segmenting an image given a natural
language description, i.e., a referring expression. Existing works tackle this
problem by first modeling images and sentences independently and then
segmenting images by combining the two types of representations. We argue
that learning word-to-image interaction is a more natural way to jointly
model the two modalities for the image segmentation task, and we propose a
convolutional multimodal LSTM to encode the sequential interactions between
individual words, visual information, and spatial information. We show that our
proposed model outperforms the baseline model on benchmark datasets. In
addition, we analyze the intermediate output of the proposed multimodal LSTM
approach and empirically explain how this approach enforces a more effective
word-to-image interaction.
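
The word-to-image interaction described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes that the multimodal LSTM is applied with the same weights at every spatial location (equivalent to a 1x1 convolution), and that its input at each word step is the concatenation of the current word embedding, the local visual feature, and the local spatial coordinates. The function name `conv_multimodal_lstm`, the gate ordering, and all dimensions are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_multimodal_lstm(words, visual, spatial, W, b):
    """Run one LSTM, shared across all locations, over the word sequence.

    words:   (T, Dw)     word embeddings, one per word
    visual:  (H, Wd, Dv) visual feature map
    spatial: (H, Wd, Ds) per-location spatial coordinates
    W:       (Dw + Dv + Ds + Dh, 4 * Dh) gate weights (assumed layout)
    b:       (4 * Dh,)   gate biases
    Returns the final hidden state, shape (H, Wd, Dh).
    """
    T, Dw = words.shape
    H, Wd, _ = visual.shape
    Dh = b.shape[0] // 4
    h = np.zeros((H, Wd, Dh))
    c = np.zeros((H, Wd, Dh))
    for t in range(T):
        # Tile the current word over the whole feature map.
        word_map = np.broadcast_to(words[t], (H, Wd, Dw))
        # Every location sees the word, its visual feature, its position,
        # and its own previous hidden state.
        x = np.concatenate([word_map, visual, spatial, h], axis=-1)
        gates = x @ W + b  # same weights at every location
        i, f, o, g = np.split(gates, 4, axis=-1)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h

# Usage with random placeholder data (shapes are arbitrary).
rng = np.random.default_rng(0)
T, H, Wd, Dw, Dv, Ds, Dh = 5, 4, 6, 8, 16, 2, 10
words = rng.normal(size=(T, Dw))
visual = rng.normal(size=(H, Wd, Dv))
spatial = rng.normal(size=(H, Wd, Ds))
W = rng.normal(scale=0.1, size=(Dw + Dv + Ds + Dh, 4 * Dh))
b = np.zeros(4 * Dh)
h_final = conv_multimodal_lstm(words, visual, spatial, W, b)
# A hypothetical 1x1 projection turns the hidden state into mask logits.
mask_logits = h_final @ rng.normal(size=(Dh,))  # shape (H, Wd)
```

Because the hidden state is kept per location and updated once per word, each word can change the evidence at each pixel, in contrast to encoding the whole sentence first and combining it with the image afterwards.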

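This work studies image segmentation from natural language descriptions. It proposes a convolutional multimodal LSTM that encodes the sequential interactions among words, visual information, and spatial information, and shows that it outperforms the baseline model on benchmark datasets.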

Recurrent Multimodal Interaction for Referring Image Segmentation

Speaker identification refers to the task of localizing the face of the person
whose identity matches the ongoing voice in a video. This task requires not
only collective perception over both visual and auditory signals but also
robustness to severe quality degradations and unconstrained content
variations. In this paper, we describe a novel
multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies
both visual and auditory modalities from the beginning of each sequence input.
The key idea is to extend the conventional LSTM by not only sharing weights
across time steps, but also sharing weights across modalities. We show that
modeling the temporal dependency across face and voice can significantly
improve the robustness to content quality degradations and variations. We also
find that our multimodal LSTM is robust to distractors, namely the
non-speaking identities. We apply our multimodal LSTM to The Big Bang Theory
dataset and show that our system outperforms state-of-the-art systems in
speaker identification, with a lower false alarm rate and higher recognition
accuracy.
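
The idea of sharing weights across modalities as well as across time steps can be sketched as follows. The abstract does not say which parameters are shared, so this sketch makes an explicit assumption: each modality keeps its own input projection (`W_face`, `W_voice`) and its own hidden and cell states, while the recurrent weights `U_shared` and the bias `b` are shared by both streams. All names, the gate ordering, and the dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W_in, U_shared, b):
    """One LSTM step; U_shared and b are reused by every modality."""
    gates = x @ W_in + h @ U_shared + b
    i, f, o, g = np.split(gates, 4, axis=-1)
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def multimodal_lstm(face_seq, voice_seq, W_face, W_voice, U_shared, b):
    """Process aligned face and voice sequences from the first time step.

    face_seq:  (T, Df) face features per frame
    voice_seq: (T, Dv) voice features per frame
    Returns the concatenated final hidden states, shape (2 * Dh,).
    """
    Dh = U_shared.shape[0]
    h_f, c_f = np.zeros(Dh), np.zeros(Dh)
    h_v, c_v = np.zeros(Dh), np.zeros(Dh)
    for x_f, x_v in zip(face_seq, voice_seq):
        # Modality-specific input weights, shared recurrent weights.
        h_f, c_f = lstm_step(x_f, h_f, c_f, W_face, U_shared, b)
        h_v, c_v = lstm_step(x_v, h_v, c_v, W_voice, U_shared, b)
    return np.concatenate([h_f, h_v])

# Usage with random placeholder data (shapes are arbitrary).
rng = np.random.default_rng(0)
T, Df, Dv, Dh = 7, 12, 5, 6
face_seq = rng.normal(size=(T, Df))
voice_seq = rng.normal(size=(T, Dv))
W_face = rng.normal(scale=0.1, size=(Df, 4 * Dh))
W_voice = rng.normal(scale=0.1, size=(Dv, 4 * Dh))
U_shared = rng.normal(scale=0.1, size=(Dh, 4 * Dh))
b = np.zeros(4 * Dh)
joint = multimodal_lstm(face_seq, voice_seq, W_face, W_voice, U_shared, b)
```

In a speaker identification setting, a classifier on top of `joint` would decide whether the face and the voice belong to the same identity; that classifier is outside the scope of this sketch.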

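This paper proposes a novel multimodal Long Short-Term Memory (MLSTM) architecture that seamlessly integrates visual and auditory information from video sequences and models the temporal dependency between faces and voices, thereby improving the robustness and recognition accuracy of speaker identification.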