Recent models for cross-modal retrieval have benefited from an increasingly rich understanding of visual scenes, afforded by scene graphs and object interactions, to mention a few. This has improved the matching between the visual representation of an image and the textual representation of its caption. Yet current visual representations overlook a key aspect: the text appearing in images, which may contain crucial information for retrieval. In this paper, we first propose a new dataset that allows exploration of cross-modal retrieval where images contain scene-text instances. Then, armed with this dataset, we describe several approaches that leverage scene text, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene, and reconciles them in a common embedding space. Extensive experiments confirm that cross-modal retrieval approaches benefit from scene text and highlight interesting research questions worth exploring further. Dataset and code are available at http://europe.naverlabs.com/stacmr.

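This paper proposes a new dataset that enables the exploration of cross-modal retrieval where images contain scene-text instances. We present several approaches, including a better scene-text aware cross-modal retrieval method that uses specialized representations for text from the captions and text from the visual scene and reconciles them in a common embedding space. Extensive experiments confirm that these approaches benefit from scene text and highlight interesting research questions worth exploring further. The dataset and code proposed in this paper are available at http://europe.naverlabs.com/stacmr.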

StacMR: Scene-Text Aware Cross-Modal Retrieval