Vision-language pre-training (VLP) has shown impressive performance on a wide
range of cross-modal tasks, where VLP models without reliance on object
detectors are becoming mainstream due to their superior computational
efficiency and competitive performance. However, removing object
detectors also deprives VLP models of the capability for explicit object
modeling, which is essential to various position-sensitive vision-language (VL)
tasks, such as referring expression comprehension and visual commonsense
reasoning. To address this challenge, we introduce PEVL, which enhances the
pre-training and prompt tuning of VLP models with explicit object position
modeling. Specifically, PEVL reformulates discretized object positions and
language in a unified language modeling framework, which facilitates explicit
VL alignment during pre-training, and also enables flexible prompt tuning for
various downstream tasks. We show that PEVL enables state-of-the-art
performance of detector-free VLP models on position-sensitive tasks such as
referring expression comprehension and phrase grounding, and also improves the
performance on position-insensitive tasks with grounded inputs. We make the
data and code for this paper publicly available at
this https URL.
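
The idea of expressing discretized object positions and language in one token sequence can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the bin count `NUM_BINS`, the `[pos_k]` token naming, the `<` and `>` delimiters, and the helper names `discretize_box` and `ground_text`. Under those assumptions, a bounding box is mapped to four position tokens that are placed right after the word they ground, so a language model can read or predict them like ordinary text tokens.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels

NUM_BINS = 512  # assumed number of position bins per axis (hypothetical)


def discretize_box(box: Box, width: int, height: int,
                   num_bins: int = NUM_BINS) -> List[str]:
    """Map a pixel-space box to four discrete position tokens."""
    def to_bin(value: float, size: int) -> int:
        # Scale the coordinate to [0, num_bins) and clip to a valid bin index.
        index = int(value / size * num_bins)
        return min(max(index, 0), num_bins - 1)

    x_min, y_min, x_max, y_max = box
    bins = [to_bin(x_min, width), to_bin(y_min, height),
            to_bin(x_max, width), to_bin(y_max, height)]
    # The "[pos_k]" token format is a hypothetical naming choice.
    return [f"[pos_{b}]" for b in bins]


def ground_text(tokens: List[str], grounded: Dict[int, Box],
                width: int, height: int) -> List[str]:
    """Insert position tokens after each grounded word.

    `grounded` maps a token index to the box of the object it refers to.
    """
    output: List[str] = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i in grounded:
            output.append("<")
            output.extend(discretize_box(grounded[i], width, height))
            output.append(">")
    return output


if __name__ == "__main__":
    sequence = ground_text(["a", "dog", "runs"],
                           {1: (0, 0, 320, 240)}, width=640, height=480)
    print(" ".join(sequence))
```

With a sequence in this form, masking the position tokens and asking the model to recover them would be one way to set up a grounding objective, and a template that ends in masked position slots would be one way to phrase a prompt for a position-sensitive task; both are readings of the abstract, not a description of the released code.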

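This work proposes PEVL, an explicit object position modeling method that improves the performance of VLP models on position-sensitive vision-language tasks such as referring expression comprehension and visual commonsense reasoning. The method integrates discretized object positions and language into a single language modeling framework, which enables explicit vision-language alignment during pre-training and provides flexible prompt tuning for various downstream tasks. Experiments show that PEVL enables detector-free VLP models to achieve state-of-the-art performance on position-sensitive tasks, and that it also improves performance on position-insensitive tasks with grounded inputs.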