Fine-grained supervision based on object annotations has been widely used for
vision and language pre-training (VLP). However, in real-world application
scenarios, aligned multi-modal data is usually in the image-caption format,
which only provides coarse-grained supervision. It is costly to collect
object annotations and build object annotation pre-extractors for different
scenarios. In this paper, we propose a fine-grained self-supervision signal
without object annotations from a replacement perspective. First, we propose a
homonym sentence rewriting (HSR) algorithm to provide token-level supervision.
The algorithm replaces a verb, noun, adjective, or quantifier in the caption
with one of its homonyms from WordNet. Correspondingly, we propose a replacement
vision-language modeling (RVLM) framework to exploit the token-level
supervision. Two replaced modeling tasks, i.e., replaced language contrastive
(RLC) and replaced language modeling (RLM), are proposed to learn the
fine-grained alignment. Extensive experiments on several downstream tasks
demonstrate the superior performance of the proposed method.
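
The replacement step described above can be illustrated with a minimal sketch.
It assumes a toy lexicon in place of WordNet, and the names TOY_LEXICON and
rewrite_caption are hypothetical; it is not the authors' HSR implementation,
only an example of how replacing one word yields token-level labels.

```python
import random

# Hypothetical stand-in for a WordNet lookup: word -> candidate replacements.
TOY_LEXICON = {
    "dog": ["cat", "horse"],
    "red": ["blue", "green"],
    "two": ["three", "four"],
    "runs": ["sits", "sleeps"],
}


def rewrite_caption(tokens, lexicon, rng):
    """Replace one replaceable token and return (new_tokens, labels).

    labels[i] is 1 where the token was replaced and 0 elsewhere, which is
    the kind of token-level signal the abstract describes.
    """
    # Positions whose token has at least one candidate replacement.
    candidates = [i for i, tok in enumerate(tokens) if tok in lexicon]
    if not candidates:
        # Nothing to replace: return a copy with all-zero labels.
        return list(tokens), [0] * len(tokens)
    pos = rng.choice(candidates)
    new_tokens = list(tokens)
    new_tokens[pos] = rng.choice(lexicon[tokens[pos]])
    labels = [1 if i == pos else 0 for i in range(len(tokens))]
    return new_tokens, labels


caption = "a red dog runs".split()
rewritten, labels = rewrite_caption(caption, TOY_LEXICON, random.Random(0))
print(rewritten, labels)
```

A replaced-token detection objective could then train the model to predict
labels from the image and the rewritten caption; how RLC and RLM use the
signal is specified in the paper, not in this sketch.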

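This paper proposes a fine-grained self-supervision signal that requires no object annotations. A homonym sentence rewriting (HSR) algorithm provides token-level supervision, and a replacement vision-language modeling (RVLM) framework learns fine-grained alignment through two tasks, replaced language contrastive (RLC) and replaced language modeling (RLM). Extensive experiments on several downstream tasks demonstrate the superior performance of the proposed method.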

Replacement as a Self-supervision for Fine-grained Vision-language Pre-training

In recent years, vision and language pre-training (VLP) models have advanced
the state-of-the-art results in a variety of cross-modal downstream tasks.
Aligning cross-modal semantics is claimed to be one of the essential
capabilities of VLP models. However, the inner working mechanism of alignment
in VLP models remains unclear. In this paper, we propose a new
probing method that is based on image captioning to first empirically study the
cross-modal semantics alignment of VLP models. Our probing method is built upon
the fact that given an image-caption pair, the VLP models will give a score,
indicating how well two modalities are aligned; maximizing such scores will
generate sentences that VLP models believe are of good alignment. Analyzing
these sentences thus reveals in what way different modalities are aligned
and how good these alignments are in VLP models. We apply our probing method to
five popular VLP models, including UNITER, ROSITA, ViLBERT, CLIP, and LXMERT,
and provide a comprehensive analysis of the generated captions guided by these
models. Our results show that VLP models (1) focus mainly on aligning
objects with visual words, while neglecting global semantics; (2) prefer fixed
sentence patterns, thus ignoring more important textual information including
fluency and grammar; and (3) deem captions with more visual words to be
better aligned with images. These findings indicate that VLP models still have
weaknesses in cross-modal semantics alignment and we hope this work will draw
researchers' attention to such problems when designing a new VLP model.
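
The idea of generating a caption by maximizing an alignment score can be
sketched with a toy example. The scorer below is a hypothetical stand-in that
simply counts object words; it is not a VLP model, and greedy search is used
here only for illustration, not as the search procedure of the paper.

```python
def greedy_probe(score_fn, vocab, max_len):
    """Greedily append the word that most increases score_fn(caption)."""
    caption = []
    for _ in range(max_len):
        # Pick the word whose addition gives the highest score.
        best_word = max(vocab, key=lambda w: score_fn(caption + [w]))
        # Stop as soon as no word strictly improves the score.
        if score_fn(caption + [best_word]) <= score_fn(caption):
            break
        caption.append(best_word)
    return caption


# Hypothetical toy scorer: counts distinct caption words that name an
# object assumed to be present in the image.
IMAGE_OBJECTS = {"dog", "ball", "grass"}


def toy_score(caption):
    return len(set(caption) & IMAGE_OBJECTS)


VOCAB = ["a", "dog", "plays", "with", "ball", "on", "grass"]
probe = greedy_probe(toy_score, VOCAB, max_len=5)
print(probe)
```

With a scorer that rewards only object words, the search returns a bare list
of objects rather than a fluent sentence, which mirrors the kind of behavior
the findings above attribute to object-focused alignment.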

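This paper proposes a new probing method based on image captioning to study the inner mechanism of cross-modal semantic alignment in vision-language pre-training models. It finds that VLP models mainly align objects with visual words while neglecting global semantics, and that they prefer fixed sentence patterns, ignoring grammar and fluency.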

Probing Cross-modal Semantics Alignment Capability from the Textual Perspective

Real-world recognition systems often encounter the challenge of unseen
labels. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL)
focuses on transferring knowledge via a pre-trained textual label embedding
(e.g., GloVe). However, such methods only exploit single-modal knowledge from a
language model, while ignoring the rich semantic information inherent in
image-text pairs. Instead, recently developed open-vocabulary (OV) based
methods succeed in exploiting such information of image-text pairs in object
detection, and achieve impressive performance. Inspired by the success of
OV-based methods, we propose a novel open-vocabulary framework, named
multi-modal knowledge transfer (MKT), for multi-label classification.
Specifically, our method exploits multi-modal knowledge of image-text pairs
based on a vision and language pre-training (VLP) model. To facilitate
transferring the image-text matching ability of the VLP model, knowledge
distillation is employed to guarantee the consistency of image and label
embeddings, along with prompt tuning to further update the label embeddings. To
further enable the recognition of multiple objects, a simple but effective
two-stream module is developed to capture both local and global features.
Extensive experimental results show that our method significantly outperforms
state-of-the-art methods on public benchmark datasets. The source code is
available at this https URL
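
One way to picture a two-stream design that combines global and local
features is the sketch below. It assumes cosine similarity against label
embeddings, a max over patches for the local stream, and an equal-weight
average of the two streams; these choices and the name two_stream_scores are
illustrative assumptions, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def two_stream_scores(global_feat, patch_feats, label_embs):
    """Score each label from a global stream and a local (per-patch) stream."""
    scores = {}
    for label, emb in label_embs.items():
        # Global stream: one image-level feature compared with the label.
        global_score = cosine(global_feat, emb)
        # Local stream: best-matching patch (assumed aggregation rule).
        local_score = max(cosine(p, emb) for p in patch_feats)
        # Equal-weight fusion of the two streams (assumed).
        scores[label] = 0.5 * (global_score + local_score)
    return scores


# Toy 2-D embeddings, chosen by hand for illustration only.
label_embs = {"dog": [1.0, 0.0], "ball": [0.0, 1.0]}
global_feat = [1.0, 0.2]                   # dominated by the dog
patch_feats = [[1.0, 0.0], [0.1, 1.0]]     # one dog patch, one ball patch
scores = two_stream_scores(global_feat, patch_feats, label_embs)
print(scores)
```

In this toy setup the global feature alone gives "ball" a low score, while
the local stream finds the patch that matches it, so the fused score for the
smaller object rises.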

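This work proposes a new open-vocabulary multi-modal knowledge transfer (MKT) framework that exploits the multi-modal knowledge of a vision and language pre-training model, using knowledge distillation and a two-stream module for multi-label classification and multi-object recognition, and significantly outperforms existing methods on public benchmark datasets.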