December 2019

Connecting Vision and Language with Localized Narratives

Jordi Pont-Tuset, Jasper Uijlings, Soravit Changpinyo, Radu Soricut, Vittorio Ferrari

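TL;DR: We propose Localized Narratives, a new form of multimodal image annotation that connects vision and language. We ask annotators to describe an image with their voice while hovering their mouse pointer over the region they are describing, which localizes every word of the description. The method has been thoroughly analyzed and validated against external data, is highly accurate and efficient to produce, and is useful for controlled image captioning.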
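To make the word-level localization idea concrete, here is a minimal sketch in Python. It assumes that each spoken word already carries start and end timestamps and that the mouse trace is a list of timestamped points on the same clock. The names `TracePoint`, `TimedWord`, `trace_segment`, `bounding_box`, and `localize`, the normalized coordinates, and the choice to summarize a trace segment as a bounding box are all illustrative assumptions, not the paper's data format or method.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass
class TracePoint:
    """One mouse sample (hypothetical layout): time in seconds, position in [0, 1]."""
    t: float
    x: float
    y: float


@dataclass
class TimedWord:
    """One spoken word with assumed start/end times on the same clock as the trace."""
    word: str
    start: float
    end: float


def trace_segment(word: TimedWord, trace: List[TracePoint]) -> List[TracePoint]:
    """Return the mouse samples recorded while the word was being spoken."""
    return [p for p in trace if word.start <= p.t <= word.end]


def bounding_box(points: List[TracePoint]) -> Optional[Box]:
    """Summarize a trace segment as a tight box; None if the segment is empty."""
    if not points:
        return None
    xs = [p.x for p in points]
    ys = [p.y for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


def localize(words: List[TimedWord], trace: List[TracePoint]) -> List[Tuple[str, Optional[Box]]]:
    """Pair every word with the region the pointer covered during that word."""
    return [(w.word, bounding_box(trace_segment(w, trace))) for w in words]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    trace = [
        TracePoint(0.0, 0.1, 0.1),
        TracePoint(0.5, 0.2, 0.3),
        TracePoint(1.0, 0.6, 0.6),
        TracePoint(1.5, 0.8, 0.7),
    ]
    words = [TimedWord("dog", 0.0, 0.6), TimedWord("frisbee", 0.9, 1.6)]
    for word, box in localize(words, trace):
        print(word, box)
```

A word with no mouse samples inside its time span maps to `None` in this sketch; a real pipeline would need some policy for such cases, for example borrowing the nearest samples in time.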

Abstract

We propose localized narratives, an efficient way to collect image captions with dense visual grounding. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the