Referring expression comprehension (REC) is a vision-language task that locates
a target object in an image based on a language expression. Fully fine-tuning
general-purpose pre-trained models for REC yields impressive performance but
becomes increasingly costly. Parameter-efficient transfer learning (PETL)
methods have shown strong performance with fewer tunable parameters. However,
applying PETL to REC faces two challenges: (1) insufficient interaction between
pre-trained vision and language encoders, and (2) high GPU memory usage due to
gradients passing through both heavy encoders. To address these issues, we
present M$^2$IST (Multi-Modal Interactive Side-Tuning) with M$^3$ISAs (Mixture
of Multi-Modal Interactive Side-Adapters). During fine-tuning, we keep the
pre-trained vision and language encoders fixed and update M$^3$ISAs on side
networks to establish connections between them, thereby achieving parameter-
and memory-efficient tuning for REC. Empirical results on three benchmarks show
M$^2$IST achieves the best performance-parameter-memory trade-off compared to
full fine-tuning and other PETL methods, with only 3.14M tunable parameters
(2.11% of full fine-tuning) and 15.44GB GPU memory usage (39.61% of full
fine-tuning). Source code will soon be publicly available.
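
As a quick sanity check on the efficiency figures quoted above, the short sketch below backs out the full fine-tuning baseline implied by the reported shares. The constants are the numbers from the abstract; the helper name `implied_total` is hypothetical, and because the percentages are rounded, the derived totals are approximations rather than values reported by the authors.

```python
# Figures quoted in the abstract above.
TUNABLE_PARAMS_M = 3.14   # tunable parameters, in millions
PARAM_SHARE = 0.0211      # 2.11% of full fine-tuning
MEMORY_GB = 15.44         # GPU memory usage, in GB
MEMORY_SHARE = 0.3961     # 39.61% of full fine-tuning


def implied_total(part: float, share: float) -> float:
    """Back out the full fine-tuning figure from a part and its share (hypothetical helper)."""
    if not 0.0 < share <= 1.0:
        raise ValueError("share must be in (0, 1]")
    return part / share


# Derived values: approximate, because the quoted percentages are rounded.
full_params_m = implied_total(TUNABLE_PARAMS_M, PARAM_SHARE)
full_memory_gb = implied_total(MEMORY_GB, MEMORY_SHARE)

print(f"implied full fine-tuning parameters: ~{full_params_m:.1f}M")
print(f"implied full fine-tuning GPU memory: ~{full_memory_gb:.1f}GB")
```

Under these figures, full fine-tuning would involve roughly 149M tunable parameters and roughly 39GB of GPU memory.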

M$^2$IST is a parameter- and memory-efficient transfer learning method for referring expression comprehension that uses M$^3$ISAs to connect frozen pre-trained vision and language encoders.

M$^2$IST: Multi-Modal Interactive Side-Tuning for Memory-efficient Referring Expression Comprehension

We introduce Visual Caption Restoration (VCR), a novel vision-language task
that challenges models to accurately restore partially obscured texts using
pixel-level hints within images. This task stems from the observation that text
embedded in images is intrinsically different from common visual elements and
natural language due to the need to align the modalities of vision, text, and
text embedded in images. While numerous works have integrated text embedded in
images into visual question-answering tasks, approaches to these tasks
generally rely on optical character recognition or masked language modeling,
thus reducing the task to mainly text-based processing. However, text-based
processing becomes ineffective in VCR as accurate text restoration depends on
the combined information from provided images, context, and subtle cues from
the tiny exposed areas of masked texts. We develop a pipeline to generate
synthetic images for the VCR task using image-caption pairs, with adjustable
caption visibility to control the task difficulty. With this pipeline, we
construct a dataset for VCR called VCR-Wiki using images with captions from
Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and
hard split variants. Our results reveal that current vision-language models
lag significantly behind human performance in the VCR task, and merely
fine-tuning the models on our dataset does not lead to notable improvements. We
release VCR-Wiki and the data construction code to facilitate future research.
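
The idea of adjustable caption visibility can be illustrated with a minimal sketch. It assumes, purely for illustration, that a rendered caption is a small binary bitmap and that difficulty is controlled by how many pixel rows stay exposed at the top and bottom of the text line. The names `occlude_center` and `visible_ink_fraction` are hypothetical, and the actual VCR-Wiki pipeline may mask text differently.

```python
from typing import List

Bitmap = List[List[int]]  # rows of pixels; 1 = ink, 0 = background


def occlude_center(bitmap: Bitmap, exposed_rows: int, fill: int = 0) -> Bitmap:
    """Hide every row except `exposed_rows` rows at the top and at the bottom."""
    if exposed_rows < 0:
        raise ValueError("exposed_rows must be non-negative")
    height = len(bitmap)
    masked = []
    for y, row in enumerate(bitmap):
        keep = y < exposed_rows or y >= height - exposed_rows
        masked.append(list(row) if keep else [fill] * len(row))
    return masked


def visible_ink_fraction(original: Bitmap, masked: Bitmap) -> float:
    """Share of the original ink pixels that are still visible after masking."""
    total = sum(sum(row) for row in original)
    kept = sum(p * q for ro, rm in zip(original, masked) for p, q in zip(ro, rm))
    return kept / total if total else 0.0


# Toy 6x4 "caption" made entirely of ink, for illustration only.
caption = [[1] * 4 for _ in range(6)]
hard = occlude_center(caption, exposed_rows=1)  # thin slivers remain
easy = occlude_center(caption, exposed_rows=2)  # more of each glyph remains
print(visible_ink_fraction(caption, hard), visible_ink_fraction(caption, easy))
```

A smaller `exposed_rows` value leaves fewer pixel-level hints, which corresponds to a harder split in this toy setting.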

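We introduce Visual Caption Restoration (VCR), a novel vision-language task that requires models to accurately restore partially obscured text using pixel-level hints within images. We develop a pipeline to generate synthetic images for the VCR task and construct a dataset called VCR-Wiki from Wikipedia image-caption pairs, comprising 2.11M English and 346K Chinese entities in easy and hard variants. Our results show that current vision-language models lag significantly behind human performance on the VCR task, and that merely fine-tuning on our dataset does not yield notable improvements. We release the VCR-Wiki dataset and the data construction code to facilitate future research.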

VCR: Visual Caption Restoration

Text-to-point-cloud cross-modal localization is an emerging vision-language
task critical for future robot-human collaboration. It seeks to localize a
position from a city-scale point cloud scene based on a few natural language
instructions. In this paper, we address two key limitations of existing
approaches: 1) their reliance on ground-truth instances as input; and 2) their
neglect of the relative positions among potential instances. Our proposed model
follows a two-stage pipeline, including a coarse stage for text-cell retrieval
and a fine stage for position estimation. In both stages, we introduce an
instance query extractor, in which the cells are encoded by a 3D sparse
convolution U-Net to generate the multi-scale point cloud features, and a set
of queries iteratively attend to these features to represent instances. In the
coarse stage, a row-column relative position-aware self-attention (RowColRPA)
module is designed to capture the spatial relations among the instance queries.
In the fine stage, a multi-modal relative position-aware cross-attention (RPCA)
module is developed to fuse the text and point cloud features along with
spatial relations to improve fine position estimation. Experimental results on
the KITTI360Pose dataset demonstrate that our model achieves performance
competitive with state-of-the-art models without taking ground-truth
instances as input.
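
To make the notion of relative position-aware attention concrete, here is a minimal, generic sketch and not the paper's RowColRPA or RPCA module. It assumes each instance query has a 2D center and adds a hand-set penalty on the row and column offsets to the usual scaled dot-product score, where a real model would learn that bias. The function names and the `alpha` parameter are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relative_position_attention(
    queries: Sequence[Sequence[float]],
    centers: Sequence[Tuple[float, float]],
    alpha: float = 1.0,
) -> List[List[float]]:
    """Self-attention weights with a bias that depends on relative positions.

    The bias -alpha * (|dx| + |dy|) is an illustrative stand-in for a
    learned relative-position term.
    """
    dim = len(queries[0])
    weights = []
    for q_i, (x_i, y_i) in zip(queries, centers):
        scores = []
        for k_j, (x_j, y_j) in zip(queries, centers):
            content = sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(dim)
            bias = -alpha * (abs(x_i - x_j) + abs(y_i - y_j))
            scores.append(content + bias)
        weights.append(softmax(scores))
    return weights


# Three instance queries with identical content but different positions.
attn = relative_position_attention(
    queries=[[1.0, 0.0]] * 3,
    centers=[(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)],
)
print([round(w, 3) for w in attn[0]])
```

Because the content features are identical in this toy input, the differences in the first row of weights come entirely from the positional bias, so nearer instances receive more attention.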

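We propose a new model for text-to-point-cloud cross-modal localization, which localizes a position in a city-scale point cloud scene from a few natural language instructions. The model addresses two main limitations of existing methods: their reliance on ground-truth instances as input and their neglect of the relative positions among potential instances. Experimental results show that the model achieves performance competitive with state-of-the-art models on the KITTI360Pose dataset without requiring ground-truth instances as input.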

Instance-free Text to Point Cloud Localization with Relative Position Awareness

Visual dialog is a challenging vision-language task, which requires the agent
to answer multi-round questions about an image. It typically needs to address
two major problems: (1) How to answer visually-grounded questions, which is the
core challenge in visual question answering (VQA); (2) How to infer the
co-reference between questions and the dialog history. An example of visual
co-reference is: pronouns (e.g., ``they'') in the question (e.g., ``Are they on
or off?'') are linked with nouns (e.g., ``lamps'') appearing in the dialog
history (e.g., ``How many lamps are there?'') and the object grounded in the
image. In this work, to resolve the visual co-reference for visual dialog, we
propose a novel attention mechanism called Recursive Visual Attention (RvA).
Specifically, our dialog agent browses the dialog history until the agent has
sufficient confidence in the visual co-reference resolution, and refines the
visual attention recursively. The quantitative and qualitative experimental
results on the large-scale VisDial v0.9 and v1.0 datasets demonstrate that the
proposed RvA not only outperforms the state-of-the-art methods, but also
achieves reasonable recursion and interpretable attention maps without
additional annotations. The code is available at
https://github.com/yuleiniu/rva.
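
The recursive browsing described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the released RvA implementation: it assumes each round already has a question-only attention distribution over image regions and a scalar confidence that the question is unambiguous, and it steps back exactly one round at a time. The function name, the `threshold` parameter, and the convex-combination rule are hypothetical.

```python
from typing import List, Sequence


def recursive_attention(
    t: int,
    question_attn: Sequence[Sequence[float]],
    confidence: Sequence[float],
    threshold: float = 0.5,
) -> List[float]:
    """Attention over image regions for round t, refined from the dialog history.

    question_attn[t]: attention computed from question t alone (assumed given).
    confidence[t]: value in [0, 1] that question t needs no co-reference.
    """
    current = list(question_attn[t])
    # Stop at the first round, or once the question is confidently self-contained.
    if t == 0 or confidence[t] >= threshold:
        return current
    # Otherwise browse one round back and blend with the attention found there.
    previous = recursive_attention(t - 1, question_attn, confidence, threshold)
    lam = confidence[t]
    return [lam * a + (1.0 - lam) * b for a, b in zip(current, previous)]


# Toy dialog with three rounds and three image regions.
question_attn = [
    [1.0, 0.0, 0.0],        # round 0, first question
    [0.2, 0.3, 0.5],        # round 1, e.g. "How many lamps are there?"
    [1 / 3, 1 / 3, 1 / 3],  # round 2, e.g. "Are they on or off?" (ambiguous alone)
]
confidence = [1.0, 0.9, 0.25]
refined = recursive_attention(2, question_attn, confidence)
print([round(w, 3) for w in refined])
```

In this toy run, round 2 has low confidence, so it inherits most of its attention from round 1, where the recursion stops because that question is confidently self-contained.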

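This paper proposes a novel attention mechanism called Recursive Visual Attention (RvA) to resolve visual co-reference in visual dialog. Experiments on the large-scale VisDial v0.9 and v1.0 datasets show that RvA not only outperforms state-of-the-art methods but also achieves reasonable recursion and interpretable attention maps without additional annotations.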