Image captioning has been shown to be an effective pretraining method, similar to
contrastive pretraining. However, the incorporation of location-aware
information into visual pretraining remains an area with limited research. In
this paper, we propose a simple visual pretraining method with location-aware
captioners (LocCa). LocCa uses a simple image captioner task interface to
teach a model to read out rich information, such as bounding box coordinates and
captions, conditioned on the image pixel input. Thanks to the multitask
capabilities of an encoder-decoder architecture, we show that an image
captioner can easily handle multiple tasks during pretraining. Our experiments
demonstrate that LocCa significantly outperforms standard captioners on
downstream localization tasks while maintaining comparable performance on
holistic tasks.
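
To make the idea of reading out both captions and box coordinates through a single captioning interface concrete, the sketch below shows one way such decoder targets could be serialized as text. It is an illustration only: the task names, the `<task>` prefix format, the `:` separator, and the number of coordinate bins are assumptions and are not specified in this abstract.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

NUM_BINS = 500  # hypothetical quantization resolution, not taken from the abstract


def quantize_box(box: Box, width: int, height: int,
                 num_bins: int = NUM_BINS) -> Tuple[int, int, int, int]:
    """Map pixel coordinates to integer bins in [0, num_bins - 1]."""
    scale = num_bins - 1

    def to_bin(value: float, size: int) -> int:
        # Normalize by the image size, scale to the bin range, and clamp.
        return max(0, min(scale, round(value / size * scale)))

    x_min, y_min, x_max, y_max = box
    return (to_bin(x_min, width), to_bin(y_min, height),
            to_bin(x_max, width), to_bin(y_max, height))


def build_target(task: str, caption: str, box: Box, width: int, height: int,
                 num_bins: int = NUM_BINS) -> str:
    """Build the text a decoder would be trained to emit for one (hypothetical) task."""
    prefix = f"<{task}>"
    if task == "caption":
        # Plain captioning: no location information in the target.
        return f"{prefix} {caption}"
    coords = " ".join(str(b) for b in quantize_box(box, width, height, num_bins))
    if task == "caption_to_box":
        # Given a region description, read out its coordinates.
        return f"{prefix} {caption} : {coords}"
    if task == "box_to_caption":
        # Given coordinates, read out a description of the region.
        return f"{prefix} {coords} : {caption}"
    raise ValueError(f"unknown task: {task}")
```

Because every task reduces to predicting a token sequence conditioned on the image, the same encoder-decoder model can in principle be trained on all of them by mixing examples with different prefixes.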
