Referring image segmentation (RIS) is a fundamental vision-language task that
aims to segment a target object from an image based on a given natural
language expression. Because image and text data have fundamentally different
properties, most existing methods either introduce complex designs
for fine-grained vision-language alignment or lack the required dense
alignment, resulting in scalability issues or mis-segmentation problems such as
over- or under-segmentation. To achieve effective and efficient fine-grained
feature alignment in the RIS task, we explore the potential of masked
multimodal modeling coupled with self-distillation and propose a novel
cross-modality masked self-distillation framework named CM-MaskSD. Our
method inherits the transferred knowledge of image-text semantic alignment from
the CLIP model to realize fine-grained patch-word feature alignment for better
segmentation accuracy. Moreover, our CM-MaskSD framework can considerably boost
model performance in a nearly parameter-free manner, since it shares weights
between the main segmentation branch and the introduced masked
self-distillation branches, and introduces only negligible parameters for
coordinating the multimodal features. Comprehensive experiments on three
benchmark datasets (i.e., RefCOCO, RefCOCO+, G-Ref) for the RIS task
convincingly demonstrate the superiority of our proposed framework over
previous state-of-the-art methods.
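
The weight-sharing idea described above can be illustrated with a toy sketch. It is not the CM-MaskSD architecture: the one-layer encoder, the zero-masking scheme, the mean-squared-error loss, and all names (`encode`, `mask_tokens`, `distill_loss`) are assumptions made for illustration. The only point it shows is that the full-input branch and the masked branch reuse the same weights, so the distillation branch adds no encoder parameters.

```python
import random


def encode(tokens, weights):
    """Toy shared encoder (assumption): one linear layer applied to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weights]
            for tok in tokens]


def mask_tokens(tokens, ratio, rng):
    """Zero out a random subset of tokens (hypothetical masking scheme)."""
    n_mask = int(len(tokens) * ratio)
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [[0.0] * len(tok) if i in masked_idx else list(tok)
              for i, tok in enumerate(tokens)]
    return masked, masked_idx


def distill_loss(student, teacher):
    """Mean squared error between masked-branch and full-branch features."""
    total = sum((s - t) ** 2
                for s_vec, t_vec in zip(student, teacher)
                for s, t in zip(s_vec, t_vec))
    count = sum(len(vec) for vec in teacher)
    return total / count


rng = random.Random(0)
patches = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
# A single weight matrix, used by both branches.
weights = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]

teacher = encode(patches, weights)             # main branch, full input
masked, masked_idx = mask_tokens(patches, 0.5, rng)
student = encode(masked, weights)              # masked branch, same weights
loss = distill_loss(student, teacher)
print(sorted(masked_idx), loss)
```

In a real training loop the full-input features would be treated as a fixed target (no gradient), and the loss would be added to the segmentation objective; both details are assumptions here, since the abstract does not specify them.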

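This paper proposes CM-MaskSD, a cross-modality masked self-distillation framework that uses knowledge from the CLIP model to achieve fine-grained image-text alignment and introduces only a small number of parameters to coordinate multimodal features. It outperforms existing methods on three benchmark datasets for segmenting the referred object in an image.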

CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation

Large pre-trained multimodal models have demonstrated significant success in
a range of downstream tasks, such as image captioning, image-text retrieval,
and visual question answering (VQA). However, many of these methods rely on
image-text pairs collected from the web as pre-training data and unfortunately
overlook the need for fine-grained feature alignment between vision and
language modalities, which requires detailed understanding of images and
language expressions. While integrating VQA and dense captioning (DC) into
pre-training can address this issue, acquiring image-question-answer as well as
image-location-caption triplets is challenging and time-consuming.
Additionally, publicly available datasets for VQA and dense captioning are
typically limited in scale due to manual data collection and labeling efforts.
In this paper, we propose a novel method called Joint QA and DC GEneration
(JADE), which utilizes a pre-trained multimodal model and easily-crawled
image-text pairs to automatically generate and filter large-scale VQA and dense
captioning datasets. We apply this method to the Conceptual Captions (CC3M)
dataset to generate a new dataset called CC3M-QA-DC. Experiments show that when
used for pre-training in a multi-task manner, CC3M-QA-DC can improve
performance across various backbones and downstream tasks. Furthermore,
our generated CC3M-QA-DC can be combined with larger image-text datasets (e.g.,
CC15M) and achieve competitive results compared with models using much more
data. Code and dataset will be released.
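
The generate-then-filter idea can be sketched as a small loop. This is a sketch under stated assumptions, not the JADE implementation: the abstract does not say how candidates are proposed or filtered, so the thresholded score, the function names (`generate_and_filter`, `toy_propose`, `toy_score`), and the record layout are all hypothetical. Only the question-answer side is shown; dense captions would follow the same pattern with region-caption candidates.

```python
from typing import Callable, Dict, Iterable, List, Tuple

QAPair = Tuple[str, str]  # (question, answer)


def generate_and_filter(
    pairs: Iterable[Tuple[str, str]],
    propose: Callable[[str, str], List[QAPair]],
    score: Callable[[str, str, str, str], float],
    threshold: float = 0.5,
) -> List[Dict[str, str]]:
    """Keep only the proposed QA pairs whose score reaches the threshold."""
    kept = []
    for image_id, caption in pairs:
        for question, answer in propose(image_id, caption):
            if score(image_id, caption, question, answer) >= threshold:
                kept.append(
                    {"image": image_id, "question": question, "answer": answer}
                )
    return kept


# Stand-ins for a pre-trained multimodal model (hypothetical, illustration only).
def toy_propose(image_id: str, caption: str) -> List[QAPair]:
    words = caption.split()
    return [("What is in the image?", words[-1]),
            ("What color is it?", "purple")]


def toy_score(image_id: str, caption: str, question: str, answer: str) -> float:
    # Assumed filter: accept an answer only if the caption supports it.
    return 1.0 if answer in caption.split() else 0.0


pairs = [("img_001", "a brown dog"), ("img_002", "a purple flower")]
dataset = generate_and_filter(pairs, toy_propose, toy_score)
for record in dataset:
    print(record)
```

Passing the proposer and scorer as callables keeps the loop independent of any particular model, so the stand-ins can be swapped for real model calls without changing the filtering logic.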

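This paper proposes a new method called Joint QA and DC Generation (JADE), which uses a pre-trained multimodal model and easily crawled image-text pairs to generate and filter large-scale visual question answering and dense captioning datasets. Applying the method to the Conceptual Captions (CC3M) dataset yields a new dataset called CC3M-QA-DC. When used for multi-task pre-training, CC3M-QA-DC improves the performance of various backbones on various downstream tasks, and when combined with larger image-text datasets (e.g., CC15M), it achieves competitive results under the same computational conditions compared with models that use more data.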

Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner

Unsupervised domain adaptation methods aim to alleviate performance
degradation caused by domain shift by learning domain-invariant
representations. Existing deep domain adaptation methods focus on holistic
feature alignment by matching source and target holistic feature distributions,
without considering local features and their multi-mode statistics. We show
that the learned local feature patterns are more generic and transferable, and
that additionally matching local feature distributions enables fine-grained
feature alignment. In this paper, we present a method for learning domain-invariant
local feature patterns and jointly aligning holistic and local feature
statistics. Comparisons to the state-of-the-art unsupervised domain adaptation
methods on two popular benchmark datasets demonstrate the superiority of our
approach and its effectiveness in alleviating negative transfer.
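
Joint alignment of holistic and local feature statistics can be illustrated with a deliberately simple discrepancy. The sketch below assumes first-moment (mean) matching, which equals the squared maximum mean discrepancy under a linear kernel; the abstract does not state which statistics or distance the method uses, so this is a swapped-in stand-in, and `mean_discrepancy`, `joint_alignment_loss`, and `local_weight` are hypothetical names.

```python
def mean_vec(features):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def mean_discrepancy(source, target):
    """Squared Euclidean distance between source and target feature means."""
    return sum((s - t) ** 2
               for s, t in zip(mean_vec(source), mean_vec(target)))


def joint_alignment_loss(src_holistic, tgt_holistic,
                         src_local, tgt_local, local_weight=1.0):
    """Holistic discrepancy plus a weighted local discrepancy (assumed form)."""
    holistic = mean_discrepancy(src_holistic, tgt_holistic)
    local = mean_discrepancy(src_local, tgt_local)
    return holistic + local_weight * local


# One holistic vector per image; local features pooled over image regions.
src_holistic = [[0.0, 0.0], [2.0, 2.0]]
tgt_holistic = [[1.0, 1.0], [3.0, 3.0]]
src_local = [[0.0], [1.0]]
tgt_local = [[1.0], [2.0]]

loss = joint_alignment_loss(src_holistic, tgt_holistic,
                            src_local, tgt_local, local_weight=0.5)
print(loss)
```

Matching only the means ignores multi-mode structure, which the abstract names as a property of local features; a faithful implementation would need a richer statistic than the one assumed here.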

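This paper presents a method that learns domain-invariant local feature patterns and jointly aligns holistic and local feature statistics, thereby achieving fine-grained feature alignment. Comparisons with existing unsupervised domain adaptation methods on two popular benchmark datasets demonstrate the method's superiority and its effectiveness in alleviating negative transfer.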