In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching images and sentences. Our m-CNN provides an end-to-end framework with convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN that encodes the image content and one matching CNN that learns the joint representation of image and sentence. The matching CNN composes words into different semantic fragments and learns the inter-modal relations between the image and the composed fragments at different levels, thus fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases for bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs can effectively capture the information necessary for image and sentence matching. Specifically, our proposed m-CNNs significantly outperform the state-of-the-art approaches for bidirectional image and sentence retrieval on the Flickr8K and Flickr30K databases.
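To make the two-part design above more concrete, the following is a minimal sketch of how a matching CNN might score an image against a sentence. It is not the authors' implementation. It assumes that the image CNN has already produced a fixed-length vector, that each word is a small embedding vector, and that matching happens at a single (word-window) level only. The function names, the window size of 3, the ReLU nonlinearity, max pooling, and the linear scoring layer are all illustrative assumptions that the abstract does not specify.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


def multimodal_conv(words, image_vec, weights, bias, window=3):
    """Slide a window over the word vectors (assumed layout).

    Each window of `window` consecutive word vectors is flattened,
    concatenated with the image vector, and passed through a bank of
    linear filters followed by ReLU. Returns one feature vector per window.
    """
    outputs = []
    for start in range(len(words) - window + 1):
        fragment = [v for w in words[start:start + window] for v in w]
        fragment += list(image_vec)  # inject the image into every window
        outputs.append([
            relu(sum(wi * xi for wi, xi in zip(row, fragment)) + b)
            for row, b in zip(weights, bias)
        ])
    return outputs


def max_pool_over_time(feature_maps):
    """Keep the strongest response of each filter across all windows."""
    return [max(column) for column in zip(*feature_maps)]


def matching_score(words, image_vec, weights, bias, out_weights, window=3):
    """Hypothetical scoring head: a linear layer on the pooled features."""
    pooled = max_pool_over_time(
        multimodal_conv(words, image_vec, weights, bias, window)
    )
    return sum(w * p for w, p in zip(out_weights, pooled))


def rank_sentences(image_vec, sentences, params):
    """Image-to-sentence retrieval: sentence indices, best match first."""
    scores = [matching_score(s, image_vec, *params) for s in sentences]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    word_dim, image_dim, n_filters, window = 4, 6, 5, 3
    in_dim = window * word_dim + image_dim
    # Untrained random parameters, for shape illustration only.
    weights = [[rng.uniform(-1, 1) for _ in range(in_dim)] for _ in range(n_filters)]
    bias = [0.0] * n_filters
    out_weights = [rng.uniform(-1, 1) for _ in range(n_filters)]
    image_vec = [rng.uniform(-1, 1) for _ in range(image_dim)]
    sentences = [
        [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(n_words)]
        for n_words in (5, 7, 4)
    ]
    print(rank_sentences(image_vec, sentences, (weights, bias, out_weights)))
```

Sentence-to-image retrieval would use the same scoring function with the roles swapped: fix the sentence and rank a set of image vectors. The sketch omits padding for sentences shorter than the window, the training objective, and the matching at other composition levels that the abstract mentions.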

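This paper proposes multimodal convolutional neural networks (m-CNNs) for matching images and sentences. The network uses convolutional architectures to exploit image representation, word composition, and the matching relations between the two modalities. Experimental results show that our m-CNNs effectively capture the information needed for image and sentence matching and achieve state-of-the-art performance on bidirectional image and sentence retrieval on the Flickr30K and Microsoft COCO databases.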

Multimodal Convolutional Neural Networks: Image and Text Matching