Grounding-based vision and language models have been successfully applied to
low-level vision tasks, aiming to precisely locate objects referred to in
captions. The effectiveness of grounding representation learning heavily relies
on the scale of the training dataset. Despite being a useful data enrichment
strategy, data augmentation has received minimal attention in existing vision
and language tasks as augmentation for image-caption pairs is non-trivial. In
this study, we propose a robust phrase grounding model trained with
text-conditioned and text-unconditioned data augmentations. Specifically, we
apply text-conditioned color jittering and horizontal flipping to ensure
semantic consistency between images and captions. To guarantee image-caption
correspondence in the training samples, we modify the captions according to
pre-defined keywords when applying horizontal flipping. Additionally, inspired
by recent masked signal reconstruction, we propose to use pixel-level masking
as a novel form of data augmentation. While we demonstrate our data
augmentation method with the MDETR framework, the proposed approach is applicable
to common grounding-based vision and language tasks with other frameworks.
Finally, we show that an image encoder pretrained on large-scale image and
language datasets (such as CLIP) can further improve the results. Through
extensive experiments on three commonly used datasets (Flickr30k, referring
expressions, and GQA), our method demonstrates improved performance over the
state of the art across various metrics. Code can be found at
this https URL
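
The caption-aware horizontal flip and the pixel-level masking described above can be illustrated with a minimal sketch. It is not the authors' implementation: the keyword table (only "left"/"right"), the function names, the nested-list image representation, and the (x_min, y_min, x_max, y_max) box format are all assumptions made for illustration.

```python
import random
import re

# Hypothetical keyword table: the abstract only says captions are modified
# "according to pre-defined keywords"; the left/right pair is an assumption.
SWAP = {"left": "right", "right": "left"}
_PATTERN = re.compile(r"\b(left|right)\b", re.IGNORECASE)


def swap_direction_words(caption):
    """Swap whole-word 'left'/'right' so the caption matches a mirrored image."""
    def repl(match):
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Keep a leading capital ("Left" -> "Right").
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, caption)


def hflip_pair(image, caption, boxes):
    """Mirror an image (list of pixel rows), its boxes, and its caption."""
    width = len(image[0])
    flipped = [row[::-1] for row in image]
    # Boxes are (x_min, y_min, x_max, y_max) in pixel coordinates.
    new_boxes = [(width - x1, y0, width - x0, y1) for (x0, y0, x1, y1) in boxes]
    return flipped, swap_direction_words(caption), new_boxes


def mask_pixels(image, ratio, fill=0, seed=None):
    """Replace a random fraction `ratio` of pixels with `fill`."""
    rng = random.Random(seed)
    height, width = len(image), len(image[0])
    n_mask = int(round(ratio * height * width))
    chosen = set(rng.sample(range(height * width), n_mask))
    return [
        [fill if (r * width + c) in chosen else image[r][c] for c in range(width)]
        for r in range(height)
    ]
```

For example, `hflip_pair(image, "A dog to the left of a man", boxes)` returns the mirrored image, the caption with "left" replaced by "right", and mirrored boxes. A plain word swap like this would also rewrite non-spatial uses such as "right now", so a real keyword list would need more care.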

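Using data augmentation and an image encoder pretrained on large-scale image and language datasets (such as CLIP), the authors propose a robust phrase grounding model that locates the objects referred to by caption keywords in low-level vision tasks, and demonstrate advanced performance on commonly used datasets across multiple metrics.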

Augment the Pairs: Semantics-Preserving Image-Caption Pair Augmentation for Grounding-Based Vision and Language Models

In this paper, we present a model which takes as input a corpus of images
with relevant spoken captions and finds a correspondence between the two
modalities. We employ a pair of convolutional neural networks to model visual
objects and speech signals at the word level, and tie the networks together
with an embedding and alignment model which learns a joint semantic space over
both modalities. We evaluate our model using image search and annotation tasks
on the Flickr8k dataset, which we augmented by collecting a corpus of 40,000
spoken captions using Amazon Mechanical Turk.
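
As a rough illustration of image search in a joint semantic space, the sketch below ranks image embeddings against a caption embedding and computes a recall@k score. The embeddings are assumed to come from the two networks; cosine similarity, recall@k, and every function name here are assumptions for illustration rather than details stated in the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_images(caption_vec, image_vecs):
    """Return image indices sorted from most to least similar to the caption."""
    scores = [cosine(caption_vec, v) for v in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(caption_vecs, image_vecs, k):
    """Fraction of captions whose paired image (same index) is in the top k."""
    hits = sum(
        1 for i, c in enumerate(caption_vecs) if i in rank_images(c, image_vecs)[:k]
    )
    return hits / len(caption_vecs)
```

Annotation (retrieving captions for a given image) would follow the same pattern with the roles of the two embedding sets swapped.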

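This paper proposes a model that takes images and their associated spoken descriptions as input and finds correspondences between the two modalities. A pair of convolutional neural networks models visual objects and speech signals at the word level, and an embedding and alignment model ties the two networks together to learn a joint semantic space across both modalities. The model is evaluated on image search and annotation tasks on the Flickr8k dataset.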