We propose Localized Narratives, a new form of multimodal image annotation
connecting vision and language. We ask annotators to describe an image with
their voice while simultaneously hovering their mouse over the region they are
describing. Since the voice and the mouse pointer are synchronized, we can
localize every single word in the description. This dense visual grounding
takes the form of a mouse trace segment per word and is unique to our data. We
annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and
ADE20K datasets, and 671k images of Open Images, all of which we make publicly
available. We provide an extensive analysis of these annotations showing they
are diverse, accurate, and efficient to produce. We also demonstrate their
utility on the application of controlled image captioning.
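
As an illustration of how synchronized voice and pointer data can yield one mouse trace segment per word, the following minimal sketch slices a timestamped trace by each word's spoken interval. The `TracePoint` and `TimedWord` types, their field names, the use of seconds as the time unit, and the inclusive interval test are all assumptions made for this example; they do not describe the released data format or the authors' alignment procedure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TracePoint:
    """One mouse sample (hypothetical layout): position plus timestamp in seconds."""
    x: float
    y: float
    t: float


@dataclass
class TimedWord:
    """One spoken word (hypothetical layout) with its start and end time in seconds."""
    word: str
    start: float
    end: float


def trace_segments_per_word(
    words: List[TimedWord], trace: List[TracePoint]
) -> List[Tuple[str, List[TracePoint]]]:
    """Pair each word with the trace points recorded while it was spoken.

    Assumption: a point belongs to a word if its timestamp lies in the
    closed interval [start, end] of that word.
    """
    segments = []
    for w in words:
        segment = [p for p in trace if w.start <= p.t <= w.end]
        segments.append((w.word, segment))
    return segments


# Toy input: two words and four mouse samples.
words = [TimedWord("a", 0.0, 0.35), TimedWord("dog", 0.4, 1.0)]
trace = [
    TracePoint(0.10, 0.10, 0.0),
    TracePoint(0.20, 0.10, 0.3),
    TracePoint(0.50, 0.50, 0.6),
    TracePoint(0.60, 0.50, 0.9),
]
segments = trace_segments_per_word(words, trace)
for word, seg in segments:
    print(word, [(p.x, p.y) for p in seg])
```

Each word ends up with the part of the trace drawn while it was being spoken, which is the per-word grounding the abstract refers to. A real pipeline would also have to handle words with no samples in their interval and any offset between the audio and pointer clocks.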

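We propose Localized Narratives, a new form of multimodal image annotation that connects vision and language. Annotators describe an image with their voice while hovering the mouse pointer over the region being described, which lets us localize every word. Extensive analysis shows that the annotations are accurate and efficient to produce, and that they are useful for controlled image captioning.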