Humans learn language via multi-modal knowledge. However, due to the
text-only pre-training scheme, most existing pre-trained language models (PLMs)
lack access to multi-modal information.
To inject visual knowledge into PLMs, existing methods incorporate either the
text or image encoder of vision-language models (VLMs) to encode the visual
information and update all the original parameters of PLMs for knowledge
fusion.
In this paper, we propose a new plug-and-play module, X-adapter, to flexibly
leverage the aligned visual and textual knowledge learned in pre-trained VLMs
and efficiently inject it into PLMs.
Specifically, we insert X-adapters into PLMs, and only the added parameters
are updated during adaptation.
To fully exploit the potential of VLMs, X-adapters consist of two
sub-modules, V-expert and T-expert, to fuse VLMs' image and text
representations, respectively.
Different sub-modules can be activated depending on the downstream
task.
Experimental results show that our method can significantly improve the
performance on object-color reasoning and natural language understanding (NLU)
tasks compared with PLM baselines.
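
The two-expert design described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each expert is a parameter-free dot-product attention over VLM features whose output is added residually to a PLM hidden state. The abstract does not specify the experts' internals, so the names `attend` and `x_adapter`, the flags `use_v` and `use_t`, and the omission of learned projections and of the frozen PLM weights are all assumptions rather than the authors' implementation.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, features):
    # Single-head scaled dot-product attention of one query over feature vectors.
    scale = math.sqrt(len(query))
    weights = softmax(
        [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    )
    return [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(len(query))
    ]

def x_adapter(hidden, image_feats=None, text_feats=None, use_v=True, use_t=True):
    # Hypothetical adapter: residually add attended VLM features to a PLM state.
    out = list(hidden)
    if use_v and image_feats:
        v = attend(hidden, image_feats)  # stand-in for V-expert (image features)
        out = [o + x for o, x in zip(out, v)]
    if use_t and text_feats:
        t = attend(hidden, text_feats)   # stand-in for T-expert (text features)
        out = [o + x for o, x in zip(out, t)]
    return out

# Made-up toy vectors, for illustration only.
hidden = [1.0, 0.0]
image_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[0.5, 0.5]]

text_only = x_adapter(hidden, image_feats, text_feats, use_v=False)
both_off = x_adapter(hidden, image_feats, text_feats, use_v=False, use_t=False)
```

With both flags off the hidden state passes through unchanged, which mirrors the plug-and-play idea of selecting sub-modules per downstream task.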

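This paper proposes X-adapter, a new plug-in module that flexibly injects the aligned visual and textual knowledge of pre-trained VLMs into PLMs to improve performance on object-color reasoning and natural language understanding (NLU) tasks.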

Versatile and Efficient Injection of Visual Knowledge into Pre-trained Language Models Based on Cross-Modal Adapters

Towards Versatile and Efficient Visual Knowledge Injection into Pre-trained Language Models with Cross-Modal Adapters

Text-image person re-identification (TIReID) aims to retrieve the image corresponding to a given text query from
a pool of candidate images. Existing methods employ prior knowledge from
single-modality pre-training to facilitate learning, but lack multi-modal
correspondences. Besides, due to the substantial gap between modalities,
existing methods embed the original modal features into the same latent space
for cross-modal alignment. However, feature embedding may lead to intra-modal
information distortion. Recently, CLIP has attracted extensive attention from
researchers due to its powerful semantic concept learning capacity and rich
multi-modal knowledge, which can help us solve the above problems. Accordingly,
in this paper, we propose a CLIP-driven Fine-grained information excavation
framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
To transfer the multi-modal knowledge effectively, we perform fine-grained
information excavation to mine intra-modal discriminative clues and inter-modal
correspondences. Specifically, we first design a multi-grained global feature
learning module to fully mine intra-modal discriminative local information,
which can emphasize identity-related discriminative clues by enhancing the
interactions between global image (text) and informative local patches (words).
Secondly, cross-grained feature refinement (CFR) and fine-grained
correspondence discovery (FCD) modules are proposed to establish the
cross-grained and fine-grained interactions between modalities, which can
filter out non-modality-shared image patches/words and mine cross-modal
correspondences from coarse to fine. CFR and FCD are removed during inference
to save computational costs. Note that the above process is performed in the
original modality space without further feature embedding. Extensive
experiments on multiple benchmarks demonstrate the superior performance of our
method on TIReID.
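
The retrieval step itself, ranking candidate images against a text query, can be sketched as follows. The sketch assumes that each image and the query have already been encoded into vectors of the same length and that cosine similarity is the ranking score. The abstract states neither, so the functions `cosine` and `rank_gallery` and the toy embeddings are hypothetical and do not reproduce CFine.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length, non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(text_emb, gallery):
    # Return image ids sorted from most to least similar to the text query.
    scored = [(cosine(text_emb, emb), img_id) for img_id, emb in gallery.items()]
    return [img_id for _, img_id in sorted(scored, reverse=True)]

# Made-up embeddings, for illustration only.
gallery = {
    "img_a": [0.9, 0.1, 0.0],
    "img_b": [0.0, 1.0, 0.0],
    "img_c": [0.5, 0.5, 0.7],
}
query = [1.0, 0.0, 0.0]
ranking = rank_gallery(query, gallery)
```

Because CFR and FCD are removed at inference, only a comparison of this kind between the two encoders' outputs would remain at test time under the assumptions above.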

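This paper proposes a CLIP-driven fine-grained information excavation framework (CFine) that brings CLIP's multi-modal knowledge to TIReID, establishes cross-modal alignment through fine-grained information mining, and shows superior performance on multiple benchmarks.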

CLIP-Based Fine-grained Text-Image Person Re-identification

CLIP-Driven Fine-grained Text-Image Person Re-identification

Referring image segmentation aims to segment a referent via a natural
linguistic expression. Due to the distinct data properties of text and
images, it is challenging for a network to align text and pixel-level
features well. Existing approaches use pretrained models to facilitate learning, yet
separately transfer the language/vision knowledge from pretrained models,
ignoring the multi-modal corresponding information. Inspired by the recent
advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we
propose an end-to-end CLIP-Driven Referring Image Segmentation framework
(CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to
vision-language decoding and contrastive learning for achieving the
text-to-pixel alignment. More specifically, we design a vision-language decoder
to propagate fine-grained semantic information from textual representations to
each pixel-level activation, which promotes consistency between the two
modalities. In addition, we present text-to-pixel contrastive learning to
explicitly enforce the text feature to be similar to the related pixel-level
features and dissimilar to irrelevant ones. Experimental results on three benchmark
datasets demonstrate that our proposed framework significantly outperforms
state-of-the-art methods without any post-processing. The code will be
released.
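
The text-to-pixel objective can be sketched under one plausible reading: the dot product between the text feature and each pixel feature is passed through a sigmoid and scored with binary cross-entropy against the ground-truth mask. The abstract does not give the loss formula, so this form, the function name `text_to_pixel_loss`, and the toy features are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def text_to_pixel_loss(text_feat, pixel_feats, mask):
    # Mean binary cross-entropy between sigmoid(text . pixel) and a 0/1 mask.
    total = 0.0
    for feat, label in zip(pixel_feats, mask):
        p = sigmoid(sum(t * f for t, f in zip(text_feat, feat)))
        # Pixels on the referent are pulled toward the text feature;
        # all other pixels are pushed away from it.
        total += -math.log(p) if label == 1 else -math.log(1.0 - p)
    return total / len(mask)

# Made-up features, for illustration only: two pixels, one on the referent.
text = [1.0, 0.0]
pixels = [[4.0, 0.0], [-4.0, 0.0]]
aligned = text_to_pixel_loss(text, pixels, [1, 0])
misaligned = text_to_pixel_loss(text, pixels, [0, 1])
```

The loss is small when the pixels inside the mask already point in the direction of the text feature and large when the mask is swapped.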

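This paper proposes CRIS, an end-to-end CLIP-driven referring image segmentation framework that uses a vision-language decoder and contrastive learning to align text with pixel-level features; experiments on three benchmark datasets show that it significantly outperforms existing methods.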

CRIS: CLIP-Based Referring Image Segmentation

CRIS: CLIP-Driven Referring Image Segmentation

Visual question answering (VQA) models may rely on language bias as a shortcut and thus fail to
sufficiently learn the multi-modal knowledge from both vision and language.
Recent debiasing methods propose excluding the language prior during
inference. However, they fail to disentangle the "good" language context and
"bad" language bias from the whole. In this paper, we investigate how to
mitigate language bias in VQA. Motivated by causal effects, we propose a novel
counterfactual inference framework, which enables us to capture the language
bias as the direct causal effect of questions on answers and reduce the
language bias by subtracting the direct language effect from the total causal
effect. Experiments demonstrate that our proposed counterfactual inference
framework 1) is general to various VQA backbones and fusion strategies, and 2)
achieves competitive performance on the language-bias-sensitive VQA-CP dataset
while performing robustly on the balanced VQA v2 dataset without any augmented
data. The code is available at this https URL

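This paper proposes a new causal inference framework to mitigate language bias in visual question answering models by reducing the direct effect of the question on the answer. Experiments show that the framework applies to a variety of VQA models, performs robustly on the balanced VQA v2 dataset, and achieves competitive performance on the language-bias-sensitive VQA-CP dataset.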