In this paper, we introduce a new approach to novel object captioning that
employs relative contrastive learning to learn visual and semantic alignment.
Our approach maximizes compatibility between regions and object tags in a
contrastive manner. To set up a proper contrastive learning objective, for each
image, we augment tags by leveraging the relative nature of positive and
negative pairs obtained from foundation models such as CLIP. We then use the
rank of each augmented tag in a list as a relative relevance label to contrast
each top-ranked tag with a set of lower-ranked tags. This learning objective
encourages the top-ranked tags to be more compatible with their image and text
context than lower-ranked tags, thus improving the discriminative ability of
the learned multi-modality representation. We evaluate our approach, RCA-NOC,
on two datasets and show that it outperforms state-of-the-art methods by a
large margin, demonstrating its effectiveness in improving vision-language
representation for novel object captioning.
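
The rank-based objective described above can be illustrated with a minimal sketch. The abstract does not give the exact loss, so the InfoNCE-style form below is an assumption, as are the names `rank_tags`, `relative_contrastive_loss`, `top_k`, and `temperature`. Here `relevance` stands in for per-tag scores from a foundation model such as CLIP, and `compat` for the captioning model's own compatibility scores between each tag and its image and text context.

```python
import math


def rank_tags(relevance):
    """Return tag indices sorted from most to least relevant."""
    return sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)


def relative_contrastive_loss(compat, relevance, top_k=2, temperature=0.1):
    """InfoNCE-style loss contrasting each top-ranked tag with lower-ranked tags.

    Hypothetical formulation (not taken from the paper): the rank of each tag
    acts as a relative relevance label. Each of the top_k tags is treated as a
    positive, and all tags ranked below the top_k act as its negatives.
    """
    order = rank_tags(relevance)
    top, lower = order[:top_k], order[top_k:]
    if not top or not lower:
        return 0.0  # nothing to contrast

    neg_logits = [compat[j] / temperature for j in lower]
    total = 0.0
    for i in top:
        logits = [compat[i] / temperature] + neg_logits
        # Numerically stable log-sum-exp over the positive and its negatives.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[0]  # -log softmax of the positive
    return total / len(top)


# Toy example: four augmented tags for one image.
relevance = [0.9, 0.7, 0.2, 0.1]       # assumed CLIP-derived relevance
aligned = [0.8, 0.7, 0.1, 0.0]         # model agrees with the ranking
misaligned = [0.0, 0.1, 0.7, 0.8]      # model contradicts the ranking
loss_aligned = relative_contrastive_loss(aligned, relevance)
loss_misaligned = relative_contrastive_loss(misaligned, relevance)
```

In this sketch the loss is small when the model scores top-ranked tags above lower-ranked ones and grows when the order is reversed, which matches the stated goal of making top-ranked tags more compatible with their context.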

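Using relative contrastive learning, this work proposes a new method for learning visual and semantic alignment for novel object captioning. For each image, tags are augmented by exploiting the relative nature of CLIP-based positive and negative samples to set up a suitable contrastive learning objective, and the rank of each augmented tag in the list serves as a relative relevance label for contrasting each top-ranked tag with a set of lower-ranked tags. This objective makes top-ranked tags more compatible with the image and text context than lower-ranked tags, which improves the discriminative ability of the learned multimodal representation. The method is evaluated on two datasets, where RCA-NOC shows a significant advantage in improving vision-language representation for novel object captioning, demonstrating its effectiveness.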

RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning

Several multi-modality representation learning approaches such as LXMERT and
ViLBERT have been proposed recently. Such approaches can achieve superior
performance due to the high-level semantic information captured during
large-scale multimodal pretraining. However, ViLBERT and LXMERT adopt visual
region regression and classification losses, and so often suffer from domain
gap and noisy label problems, because their visual features are pretrained
on the Visual Genome dataset. To overcome these issues, we propose unbiased
Contrastive Visual-Linguistic Pretraining (CVLP), which constructs a visual
self-supervised loss built upon contrastive learning. We evaluate CVLP on
several downstream tasks, including VQA, GQA and NLVR2, to validate the
superiority of contrastive learning on multi-modality representation learning.
Our code is available at: this https URL
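
To make the idea of a contrastive visual loss concrete, here is a minimal sketch of a generic InfoNCE loss over region features. The abstract does not specify how CVLP builds its positives and negatives, so everything here is an assumption: `query` stands in for a predicted region feature, `positive` for the target feature of the same region, and `negatives` for features of other regions. The function names and the `temperature` value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(query, positive, negatives, temperature=0.07):
    """Generic InfoNCE loss: pull query toward positive, away from negatives.

    Unlike region regression or classification, this needs no ground-truth
    feature values or class labels, only a notion of which pairs match.
    """
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp; the positive sits at index 0.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Toy 2-D features: a matching pair versus a mismatched pair.
loss_match = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_mismatch = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

The loss is close to zero when the query is most similar to its positive and grows when a negative is more similar, which is the behaviour a contrastive objective relies on.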

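This paper proposes an unbiased vision-language pretraining method based on contrastive learning that achieves better performance in multimodal representation learning, with good results on the VQA, GQA and NLVR2 validation sets.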