In this paper we propose a model to learn multimodal multilingual
representations for matching images and sentences in different languages, with
the aim of advancing multilingual versions of image search and image
understanding. Our model learns a common representation for images and t