In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12, Flickr 8K, and Flickr 30K). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.

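This paper proposes a model based on a multimodal recurrent neural network (m-RNN) that generates sentence descriptions of image content. The model comprises two sub-networks, a deep recurrent neural network for sentences and a deep convolutional network for images, joined by a multimodal layer. Validated on three benchmark datasets, it outperforms the previous state-of-the-art generative method. It can also be applied to image or sentence retrieval, where it achieves a significant improvement over existing methods that directly optimize a ranking objective function.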
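
The generation procedure described in the abstract (predict a distribution over the next word from the previous words and the image, then sample from it) can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the class name `ToyMRNN`, all layer sizes, the choice of activations, the additive form of the multimodal layer, and the random untrained weights are hypothetical, and the image feature vector stands in for the output of a convolutional network.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class ToyMRNN:
    """Toy next-word model: recurrent sentence state + image feature -> multimodal layer.

    All shapes, activations, and the additive fusion are illustrative assumptions.
    """

    def __init__(self, vocab_size, embed_dim, hidden_dim, image_dim, multimodal_dim, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(0.0, 0.1, size=shape)
        self.E = init(vocab_size, embed_dim)        # word embeddings
        self.W_xh = init(hidden_dim, embed_dim)     # embedding -> recurrent state
        self.W_hh = init(hidden_dim, hidden_dim)    # previous state -> recurrent state
        self.V_w = init(multimodal_dim, embed_dim)  # word -> multimodal layer
        self.V_r = init(multimodal_dim, hidden_dim) # recurrent state -> multimodal layer
        self.V_i = init(multimodal_dim, image_dim)  # image feature -> multimodal layer
        self.U = init(vocab_size, multimodal_dim)   # multimodal layer -> vocabulary scores
        self.hidden_dim = hidden_dim

    def step(self, word_id, h_prev, image_feat):
        # Returns P(next word | previous words, image) and the updated recurrent state.
        w = self.E[word_id]
        h = np.maximum(0.0, self.W_xh @ w + self.W_hh @ h_prev)
        m = np.tanh(self.V_w @ w + self.V_r @ h + self.V_i @ image_feat)
        return softmax(self.U @ m), h

    def sample(self, image_feat, start_id, end_id, max_len=20, seed=0):
        # Generate word ids by repeatedly sampling from the predicted distribution.
        rng = np.random.default_rng(seed)
        h = np.zeros(self.hidden_dim)
        word, out = start_id, []
        for _ in range(max_len):
            p, h = self.step(word, h, image_feat)
            word = int(rng.choice(len(p), p=p))
            if word == end_id:
                break
            out.append(word)
        return out

# Usage with made-up sizes; id 0 is assumed to be the start token and id 1 the end token.
model = ToyMRNN(vocab_size=10, embed_dim=4, hidden_dim=5, image_dim=6, multimodal_dim=7)
word_ids = model.sample(np.ones(6), start_id=0, end_id=1, max_len=8)
```

With untrained weights the sampled ids are meaningless; the sketch only shows how the sentence state and the image feature meet in one layer before the softmax over the vocabulary.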

Explain Images with Multimodal Recurrent Neural Networks