Referring Image Segmentation (RIS) is a challenging task that requires an
algorithm to segment objects referred by free-form language expressions.
Despite significant progress in recent years, most state-of-the-art (SOTA)
methods still suffer from a considerable language-image modality gap at the pixel
and word level. These methods generally 1) rely on sentence-level language
features for language-image alignment and 2) lack explicit training supervision
for fine-grained visual grounding. Consequently, they exhibit weak object-level
correspondence between visual and language features. Without well-grounded
features, prior methods struggle to understand complex expressions that require
strong reasoning over relationships among multiple objects, especially when
dealing with rarely used or ambiguous clauses. To tackle this challenge, we
introduce a novel Mask Grounding auxiliary task that significantly improves
visual grounding within language features, by explicitly teaching the model to
learn fine-grained correspondence between masked textual tokens and their
matching visual objects. Mask Grounding can be directly used on prior RIS
methods and consistently brings improvements. Furthermore, to holistically
address the modality gap, we also design a cross-modal alignment loss and an
accompanying alignment module. These additions work synergistically with Mask
Grounding. With all these techniques, our comprehensive approach culminates in
MagNet (Mask-grounded Network), an architecture that significantly outperforms
prior art on three key benchmarks (RefCOCO, RefCOCO+ and G-Ref), demonstrating
our method's effectiveness in addressing current limitations of RIS algorithms.
Our code and pre-trained weights will be released.

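Summary: This paper proposes MagNet, a comprehensive approach for improving
referring image segmentation that introduces the Mask Grounding auxiliary task
together with a cross-modal alignment loss and an accompanying alignment module.
By teaching the model to learn fine-grained correspondence between masked
textual tokens and their matching visual objects, the method significantly
outperforms existing algorithms on three key benchmarks (RefCOCO, RefCOCO+ and
G-Ref) and addresses limitations of current referring image segmentation
algorithms.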
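
The Mask Grounding idea described above (mask some words of the expression,
then ask the model to recover them from the matching visual object) can be
illustrated with a small sketch. This is not the authors' implementation: the
helper names (mask_tokens, masked_average_pool, mask_grounding_loss), the
[MASK] placeholder, the linear classifier and the toy inputs are all
assumptions made for illustration, and the sketch predicts masked words from
the pooled object feature alone, whereas a real model would presumably also
condition on the unmasked words.

```python
import math
import random

MASK = "[MASK]"  # hypothetical placeholder token


def mask_tokens(tokens, mask_prob, rng):
    """Randomly replace tokens with MASK; return masked tokens and positions."""
    masked, positions = [], []
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(MASK)
            positions.append(i)
        else:
            masked.append(tok)
    if not positions:  # guarantee at least one masked token
        i = rng.randrange(len(tokens))
        masked[i] = MASK
        positions.append(i)
    return masked, positions


def masked_average_pool(features, seg_mask):
    """Average per-pixel feature vectors over pixels where seg_mask is 1."""
    dim = len(features[0][0])
    total, count = [0.0] * dim, 0
    for feature_row, mask_row in zip(features, seg_mask):
        for vec, inside in zip(feature_row, mask_row):
            if inside:
                count += 1
                for d in range(dim):
                    total[d] += vec[d]
    return [t / max(count, 1) for t in total]


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single prediction."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_z - logits[target]


def mask_grounding_loss(tokens, positions, vocab, object_feature, classifier):
    """Mean cross-entropy of predicting each masked word from the object feature.

    Assumption: `classifier` is a plain weight matrix (one row per vocabulary
    word) standing in for whatever predictor a real model would use.
    """
    logits = [sum(w * x for w, x in zip(row, object_feature)) for row in classifier]
    losses = [cross_entropy(logits, vocab[tokens[p]]) for p in positions]
    return sum(losses) / len(losses)


# Toy usage: a 2x2 feature map with 2-dimensional features and a 3-word vocabulary.
rng = random.Random(0)
tokens = ["the", "red", "cup"]
vocab = {"the": 0, "red": 1, "cup": 2}
masked, positions = mask_tokens(tokens, 0.3, rng)

features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 0.0], [1.0, 0.0]]]
seg_mask = [[1, 0],
            [1, 1]]  # ground-truth mask of the referred object
object_feature = masked_average_pool(features, seg_mask)

classifier = [[0.0, 0.0] for _ in vocab]  # untrained: uniform predictions
loss = mask_grounding_loss(tokens, positions, vocab, object_feature, classifier)
```

With the untrained all-zero classifier every word is equally likely, so the
loss equals the log of the vocabulary size; training would lower it by pushing
the object feature to carry information about the masked words.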

Mask Grounding for Referring Image Segmentation

Referring expression segmentation (RES) aims at segmenting the foreground
masks of the entities that match the descriptive natural language expression.
Previous datasets and methods for the classic RES task heavily rely on the prior
assumption that one expression must refer to object-level targets. In this
paper, we take a step further to the finer-grained part-level RES task. To promote
the object-level RES task towards finer-grained vision-language understanding,
we put forward a new multi-granularity referring expression segmentation (MRES)
task and construct an evaluation benchmark called RefCOCOm by manual
annotations. By employing our automatic model-assisted data engine, we build
the largest visual grounding dataset, namely MRES-32M, which comprises over
32.2M high-quality masks and captions on the provided 1M images. Besides, a
simple yet strong model named UniRES is designed to accomplish the unified
object-level and part-level grounding task. Extensive experiments on our
RefCOCOm for MRES and three datasets (i.e., RefCOCO(+/g)) for the classic RES task
demonstrate the superiority of our method over previous state-of-the-art
methods. To foster future research into fine-grained visual grounding, our
benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly
available at this https URL

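Summary: This paper proposes a multi-granularity referring expression
segmentation (MRES) task, constructs the RefCOCOm evaluation benchmark and the
high-quality MRES-32M dataset of over 32.2M masks and captions, and designs the
UniRES model for unified object-level and part-level visual grounding.
Experiments on RefCOCOm and RefCOCO(+/g) demonstrate the superiority of the
method.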