In this paper we explore the bi-directional mapping between images and their sentence-based descriptions. We propose learning this mapping using a recurrent neural network. Unlike previous approaches that map both sentences and images to a common embedding, we enable the generation of novel sentences given an image. Using the same model, we can also reconstruct the visual features associated with an image given its visual description. We use a novel recurrent visual memory that automatically learns to remember long-term visual concepts to aid in both sentence generation and visual feature reconstruction. We evaluate our approach on several tasks, including sentence generation, sentence retrieval and image retrieval. State-of-the-art results are shown for the task of generating novel image descriptions. When compared to human-generated captions, our automatically generated captions are preferred by humans over $19.8\%$ of the time. Results are better than or comparable to state-of-the-art results on the image and sentence retrieval tasks for methods using similar visual features.

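This work explores the bi-directional mapping between images and their sentence-based descriptions and proposes learning that mapping with a recurrent neural network. We use the same model both to generate novel descriptive sentences and to reconstruct the visual features associated with an image, and we use a novel recurrent visual memory to aid both sentence generation and visual feature reconstruction. On the task of generating novel image descriptions, our automatically generated captions are preferred by humans over $19.8\%$ of the time. Compared with methods that use similar visual features, our results are comparable or better on the image and sentence retrieval tasks.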

Learning a Recurrent Visual Representation for Image Caption Generation