Referring image segmentation, the task of segmenting arbitrary entities
described in free-form text, opens up a variety of vision applications.
However, manual labeling of training data for this task is prohibitively
costly, leading to a lack of labeled data for training. We address this issue with
a weakly supervised learning approach that uses text descriptions of training
images as the only source of supervision. To this end, we first present a new
model that discovers semantic entities in the input image and then combines the
entities relevant to the text query to predict the mask of the referent. We also
present a new loss function that allows the model to be trained without any
further supervision. Our method was evaluated on four public benchmarks for
referring image segmentation, where it clearly outperformed the existing method
for the same task and recent open-vocabulary segmentation models on all the
benchmarks.
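
As a rough illustration of the "combine entities relevant to the text query" step, the sketch below weights per-entity soft masks by their similarity to a text embedding and sums them. It is a minimal toy under stated assumptions, not the model from the paper: the function names (`cosine`, `softmax`, `gather_mask`), the cosine-plus-softmax relevance score, and the `temperature` parameter are all hypothetical choices made for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def gather_mask(entity_masks, entity_feats, text_feat, temperature=0.1):
    """Combine per-entity soft masks, weighted by relevance to the text query.

    Assumption: relevance is a softmax over cosine similarities between each
    entity feature and the text feature. The paper may score relevance differently.
    """
    weights = softmax([cosine(f, text_feat) for f in entity_feats], temperature)
    height, width = len(entity_masks[0]), len(entity_masks[0][0])
    return [
        [sum(w * m[y][x] for w, m in zip(weights, entity_masks)) for x in range(width)]
        for y in range(height)
    ]

# Toy example: two entities on a 2x2 image; the query matches entity 0.
entity_masks = [
    [[1.0, 0.0], [1.0, 0.0]],  # entity 0 covers the left column
    [[0.0, 1.0], [0.0, 1.0]],  # entity 1 covers the right column
]
entity_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feat = [1.0, 0.0]
mask = gather_mask(entity_masks, entity_feats, text_feat)
```

In this toy case the predicted mask is close to 1 on the left column and close to 0 on the right, because the query embedding aligns with the first entity only.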

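In this work, we use a weakly supervised learning approach that takes text descriptions of training images as the only source of supervision, addressing the high cost of labeling training data. We propose a new model that discovers semantic entities in the input image and combines the entities relevant to the text query to predict the mask of the referent. We also propose a new loss function that allows the model to be trained without further supervision. Our method was evaluated on four public benchmark datasets, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models.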

Shatter and Gather: Learning Referring Image Segmentation with Text Supervision

We consider the problem of localizing a spatio-temporal tube in a video
corresponding to a given text query. This is a challenging task that requires
the joint and efficient modeling of temporal, spatial and multi-modal
interactions. To address this task, we propose TubeDETR, a transformer-based
architecture inspired by the recent success of such models for text-conditioned
object detection. Our model notably includes: (i) an efficient video and text
encoder that models spatial multi-modal interactions over sparsely sampled
frames and (ii) a space-time decoder that jointly performs spatio-temporal
localization. We demonstrate the advantage of our proposed components through
an extensive ablation study. We also evaluate our full approach on the
spatio-temporal video grounding task and demonstrate improvements over the
state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and
trained models are publicly available at
this https URL.
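
To make the notion of a spatio-temporal tube concrete, here is a small sketch of one possible representation: a box per frame over a frame range, plus a helper that measures how much two tubes overlap in time. The `Tube` class, the `(x_min, y_min, x_max, y_max)` box layout, and the `temporal_iou` helper are assumptions made for illustration and are not taken from the TubeDETR code.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]

@dataclass
class Tube:
    """A spatio-temporal tube: one box per frame over a frame range."""
    boxes: Dict[int, Box]  # frame index -> box

    @property
    def start(self) -> int:
        """First frame index covered by the tube."""
        return min(self.boxes)

    @property
    def end(self) -> int:
        """Last frame index covered by the tube (inclusive)."""
        return max(self.boxes)

    def box_at(self, frame: int) -> Optional[Box]:
        """Return the box at a frame, or None if the tube has no box there."""
        return self.boxes.get(frame)

def temporal_iou(a: Tube, b: Tube) -> float:
    """Intersection over union of two tubes' inclusive frame ranges."""
    inter = max(0, min(a.end, b.end) - max(a.start, b.start) + 1)
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union

# Hypothetical prediction and ground truth that overlap on frames 15..19.
pred = Tube({t: (0.0, 0.0, 10.0, 10.0) for t in range(10, 20)})
gt = Tube({t: (2.0, 2.0, 12.0, 12.0) for t in range(15, 25)})
overlap = temporal_iou(pred, gt)
```

A query-conditioned localizer would output such a tube for each text query: the frame range answers "when", and the per-frame boxes answer "where".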

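This work proposes TubeDETR, a Transformer-based model that efficiently models spatio-temporal and multi-modal interactions to localize, in space and time, the video content corresponding to a given text query, and it performs strongly on the video grounding task.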