In this paper, we propose a simple yet effective approach for self-supervised
video object segmentation (VOS). Our key insight is that the inherent
structural dependencies present in DINO-pretrained Transformers can be
leveraged to establish robust spatio-temporal correspondences in videos.
Furthermore, simple clustering on this correspondence cue is sufficient to
yield competitive segmentation results. Previous self-supervised VOS techniques
mostly rely on auxiliary modalities or on iterative slot attention to
assist in object discovery, which restricts their general applicability and
imposes higher computational requirements. To address these limitations, we
develop a simplified architecture that capitalizes on the emergent objectness
of DINO-pretrained Transformers, bypassing the need for additional modalities
or slot attention. Specifically, we first introduce a single spatio-temporal
Transformer block to process the frame-wise DINO features and establish
spatio-temporal dependencies in the form of self-attention. Subsequently,
utilizing these attention maps, we implement hierarchical clustering to
generate object segmentation masks. To train the spatio-temporal block in a
fully self-supervised manner, we employ semantic and dynamic motion consistency
coupled with entropy normalization. Our method demonstrates state-of-the-art
performance across multiple unsupervised VOS benchmarks and particularly excels
in complex real-world multi-object video segmentation tasks such as
DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will
be released at this https URL
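
The two-stage idea described above (self-attention over the tokens of all frames, then hierarchical clustering of the attention maps) can be sketched on toy data. This is a minimal illustration, not the authors' implementation: the identity query/key projections, the Euclidean distance between attention rows, the average-linkage rule, and the names `self_attention` and `agglomerative` are all assumptions made for the example, and the hand-written 2-D vectors stand in for real DINO patch features.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Attention of every token over all tokens of all frames.

    Assumption: queries and keys are the raw features (identity projections).
    """
    scale = math.sqrt(len(tokens[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in tokens])
        for q in tokens
    ]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def agglomerative(rows, num_clusters):
    """Average-linkage agglomerative clustering; returns one label per row."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist(rows[a], rows[b])
                        for a in clusters[i] for b in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]  # merge the closest pair
        del clusters[j]
    labels = [0] * len(rows)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels

# Toy "video": 2 frames x 3 patches with hypothetical 2-D features.
# Patches 0 and 1 of each frame belong to object A, patch 2 to object B.
frame0 = [[4.0, 0.0], [3.8, 0.2], [0.0, 4.0]]
frame1 = [[3.9, 0.1], [4.1, 0.0], [0.1, 3.9]]
tokens = frame0 + frame1  # flatten space and time into one token sequence

attn = self_attention(tokens)                 # one attention row per token
labels = agglomerative(attn, num_clusters=2)  # cluster tokens by attention row
print(labels)
```

Tokens that attend to the same set of patches across frames end up in the same cluster, so the labels are consistent over time without any explicit tracking step. A practical system would replace the O(n^3) loop with a library routine and choose the number of clusters from the data.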

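We propose a simple yet effective approach to self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies in DINO-pretrained Transformers can be used to establish robust spatio-temporal correspondences in videos. Moreover, simple clustering on this correspondence cue is enough to produce competitive segmentation results. We develop a simplified architecture that exploits the emergent objectness of DINO-pretrained Transformers, avoiding the need for additional modalities or slot attention. Our method achieves state-of-the-art performance on multiple unsupervised VOS benchmarks and performs particularly well on complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19.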

Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation

Large Multimodal Models (LMMs) extend Large Language Models to the vision
domain. Initial efforts towards LMMs used holistic images and text prompts to
generate ungrounded textual responses. Very recently, region-level LMMs have
been used to generate visually grounded responses. However, they can refer to
only a single object category at a time, require users to specify the
regions in inputs, or cannot offer dense pixel-wise object grounding. In this
work, we present Grounding LMM (GLaMM), the first model that can generate
natural language responses seamlessly intertwined with corresponding object
segmentation masks. GLaMM not only grounds objects appearing in the
conversations but is flexible enough to accept both textual and optional visual
prompts (region of interest) as input. This empowers users to interact with the
model at various levels of granularity, both in textual and visual domains. Due
to the lack of standard benchmarks for the novel setting of generating visually
grounded detailed conversations, we introduce a comprehensive evaluation
protocol with our curated grounded conversations. Our proposed Grounded
Conversation Generation (GCG) task requires densely grounded concepts in
natural scenes at a large scale. To this end, we propose a densely annotated
Grounding-anything Dataset (GranD) using our proposed automated annotation
pipeline that encompasses 7.5M unique concepts grounded in a total of 810M
regions available with segmentation masks. Besides GCG, GLaMM also performs
effectively on several downstream tasks, e.g., referring expression
segmentation, image and region-level captioning, and vision-language
conversations. Project Page: this https URL

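GLaMM is the first model that can generate natural language responses seamlessly interleaved with corresponding object segmentation masks. Users can interact with the model at different levels of granularity in both the visual and textual domains, and GLaMM also performs effectively on several other tasks, including referring expression segmentation, image and region-level captioning, and vision-language conversations.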

GLaMM: Pixel Grounding Large Multimodal Model

We introduce Diff-DOPE, a 6-DoF pose refiner that takes as input an image, a
3D textured model of an object, and an initial pose of the object. The method
uses differentiable rendering to update the object pose to minimize the visual
error between the image and the projection of the model. We show that this
simple yet effective idea achieves state-of-the-art results on pose
estimation datasets. Our approach is a departure from recent methods in which
the pose refiner is a deep neural network trained on a large synthetic dataset
to map inputs to refinement steps. Rather, our use of differentiable rendering
allows us to avoid training altogether. Our approach performs multiple gradient
descent optimizations in parallel with different random learning rates to avoid
local minima caused by symmetric objects, similar appearances, or a poorly
chosen step size.
Various modalities can be used, e.g., RGB, depth, intensity edges, and object
segmentation masks. We present experiments examining the effect of various
choices, showing that the best results are found when the RGB image is
accompanied by an object mask and depth image to guide the optimization
process.
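
The multi-learning-rate strategy can be illustrated with a toy one-dimensional problem. This is a rough sketch under stated assumptions, not the Diff-DOPE implementation: the analytic `loss` and `grad` functions stand in for the differentiable-rendering error and its gradient, the scalar "pose" replaces a 6-DoF pose, the log-uniform sampling range for the learning rates is invented for the example, and keeping the lowest-loss result across runs is an assumed selection rule. The runs are executed one after another here; because they are independent, they could equally be batched.

```python
import math
import random

def loss(x):
    # Hypothetical non-convex error: global minimum at x = 2,
    # with additional local minima created by the cosine term.
    u = x - 2.0
    return u * u + 1.5 * (1.0 - math.cos(4.0 * u))

def grad(x):
    # Analytic derivative of `loss` (stand-in for a rendered gradient).
    u = x - 2.0
    return 2.0 * u + 6.0 * math.sin(4.0 * u)

def refine(x0, lr, steps=200):
    """Plain gradient descent that remembers the best pose it visited."""
    x, best_x, best_loss = x0, x0, loss(x0)
    for _ in range(steps):
        x -= lr * grad(x)
        cur = loss(x)
        if cur < best_loss:
            best_x, best_loss = x, cur
    return best_x, best_loss

def multi_rate_refine(x0, num_runs=16, lr_min=0.01, lr_max=0.3, seed=0):
    """Run independent optimizations with random learning rates, keep the best."""
    rng = random.Random(seed)
    results = []
    for _ in range(num_runs):
        # Assumption: learning rates are drawn log-uniformly from [lr_min, lr_max].
        lr = math.exp(rng.uniform(math.log(lr_min), math.log(lr_max)))
        results.append((lr,) + refine(x0, lr))
    best = min(results, key=lambda r: r[2])
    return best, results

initial_pose = 3.5
(best_lr, best_pose, best_loss), runs = multi_rate_refine(initial_pose)
print(f"best lr={best_lr:.3f} pose={best_pose:.3f} loss={best_loss:.4f}")
```

Small learning rates tend to settle into the basin nearest the initial pose, while larger ones can step over a barrier into a different basin, so sampling several rates gives the search more than one chance without any training. Nothing in this sketch guarantees that the global minimum is found.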

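We introduce Diff-DOPE, a 6-DoF pose refiner whose inputs are an image, a 3D textured model of an object, and an initial pose of the object. The method uses differentiable rendering to update the object pose so as to reduce the visual error between the image and the projection of the model, and we show that this simple yet effective idea achieves state-of-the-art results on pose estimation datasets. Our approach departs from recent methods, in which the pose refiner is a deep neural network trained on a large synthetic dataset to map inputs to refinement steps; our use of differentiable rendering lets us avoid training altogether. The method runs multiple gradient descent optimizations simultaneously with different random learning rates to avoid local minima caused by symmetric objects, similar appearances, or a wrong step size. Various modalities can be used, e.g., RGB, depth, intensity edges, and object segmentation masks. We run experiments examining the effect of various choices, which show that the best results are obtained when the RGB image is accompanied by an object mask and a depth image to guide the optimization process.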