We introduce the task of localizing a flexible number of objects in
real-world 3D scenes using natural language descriptions. Existing 3D visual
grounding tasks focus on localizing a unique object given a text description.
However, such a strict setting is unnatural as localizing potentially multiple
objects is a common need in real-world scenarios and robotic tasks (e.g.,
visual navigation and object rearrangement). To address this setting, we propose
Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains
61926 descriptions of 11609 objects, where zero, single or multiple target
objects are referenced by each description. We also introduce a new evaluation
metric and benchmark methods from prior work to enable further investigation of
multi-modal 3D scene understanding. Furthermore, we develop a better baseline
leveraging 2D features from CLIP by rendering object proposals online with
contrastive learning, which outperforms the state of the art on the ScanRefer
benchmark.
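
The abstract does not spell out the new evaluation metric. Purely as an illustration, one plausible way to score a description that refers to zero, one, or several objects is an F1 score over predicted and ground-truth boxes matched at an IoU threshold. The axis-aligned boxes, the greedy matching, the 0.5 threshold, and the names `iou_3d` and `f1_at_iou` are all assumptions made for this sketch, not the paper's definition.

```python
from typing import List, Tuple

# Axis-aligned 3D box: (xmin, ymin, zmin, xmax, ymax, zmax). Assumed format.
Box = Tuple[float, float, float, float, float, float]


def iou_3d(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])
        hi = min(a[axis + 3], b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = (a[3] - a[0]) * (a[4] - a[1]) * (a[5] - a[2])
    vol_b = (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    return inter / (vol_a + vol_b - inter)


def f1_at_iou(pred: List[Box], gt: List[Box], thresh: float = 0.5) -> float:
    """F1 between predicted and ground-truth boxes (hypothetical scoring)."""
    if not pred and not gt:
        return 1.0  # zero-target description, nothing predicted
    if not pred or not gt:
        return 0.0  # one side is empty, the other is not
    # Greedy one-to-one matching, highest IoU first (an assumption).
    pairs = sorted(
        ((iou_3d(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < thresh:
            break
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, an empty prediction for a zero-target description scores 1.0, and any spurious prediction for such a description scores 0.0, so the same function covers all three cases.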

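We introduce the task of localizing multiple objects in real-world 3D scenes using natural language descriptions. We propose Multi3DRefer, which extends the ScanRefer dataset and task, and introduce a new evaluation metric and benchmark methods to further the study of multi-modal 3D scene understanding. In addition, we build a stronger baseline that uses 2D features from CLIP, rendering object proposals online with contrastive learning, and this baseline outperforms the state of the art on the ScanRefer benchmark.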

Multi3DRefer: Associating Text Descriptions with Multiple 3D Objects

Multi3DRefer: Grounding Text Description to Multiple 3D Objects

We present LOWA, a novel method for localizing objects with attributes
effectively in the wild. It aims to address the insufficiency of current
open-vocabulary object detectors, which are limited by the lack of
instance-level attribute classification and rare class names. To train LOWA, we
propose a hybrid vision-language training strategy to learn object detection
and recognition with class names as well as attribute information. With LOWA,
users can not only detect objects with class names, but also localize
objects by attributes. LOWA is built on top of a two-tower vision-language
architecture and consists of a standard vision transformer as the image encoder
and a similar transformer as the text encoder. To learn the alignment between
visual and text inputs at the instance level, we train LOWA with three training
steps: object-level training, attribute-aware learning, and free-text joint
training of objects and attributes. This hybrid training strategy first ensures
correct object detection, then incorporates instance-level attribute
information, and finally balances the object class and attribute sensitivity.
We evaluate our model's performance on attribute classification and attribute
localization on the Open-Vocabulary Attribute Detection (OVAD) benchmark and
the Visual Attributes in the Wild (VAW) dataset, and experiments indicate
strong zero-shot performance. Ablation studies additionally demonstrate the
effectiveness of each training step of our approach.
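
As a minimal sketch of how a two-tower model can serve both class-name and attribute queries, the example below ranks image regions by cosine similarity to a text embedding. The three-dimensional vectors are hand-made stand-ins for encoder outputs, and `cosine` and `rank_regions` are hypothetical helpers. None of this is LOWA's actual code or training procedure.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def rank_regions(region_embs: List[Sequence[float]],
                 text_emb: Sequence[float]) -> List[int]:
    """Return region indices sorted by descending similarity to the query."""
    scores = [cosine(r, text_emb) for r in region_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for image-tower outputs, one embedding per detected region.
# Dimensions loosely mean (dog-ness, cat-ness, red-ness); purely illustrative.
region_embs = [
    [0.9, 0.1, 0.8],  # region 0: a red dog
    [0.9, 0.1, 0.1],  # region 1: a dog that is not red
    [0.1, 0.9, 0.3],  # region 2: a cat
]

# Toy stand-ins for text-tower outputs, covering class names and an attribute.
text_embs: Dict[str, List[float]] = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "red": [0.0, 0.0, 1.0],
}

best_for_dog = rank_regions(region_embs, text_embs["dog"])[0]
best_for_red = rank_regions(region_embs, text_embs["red"])[0]
```

Because class names and attributes share one text embedding space in this sketch, the same ranking function answers both kinds of query.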

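This paper proposes LOWA, a new method built on a vision-language training strategy and a transformer architecture, which aims to address the shortcomings of current open-vocabulary object detectors. Users can not only detect objects but also localize them by attributes. Evaluation on the OVAD benchmark and the VAW dataset shows strong zero-shot performance, and the effectiveness of each training step of the method is also demonstrated.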

Localizing Objects in the Wild Using Attributes

LOWA: Localize Objects in the Wild with Attributes

Localizing objects in image collections without supervision can help to avoid
expensive annotation campaigns. We propose a simple approach to this problem
that leverages the activation features of a vision transformer pre-trained in a
self-supervised manner. Our method, LOST, does not require any external object
proposal nor any exploration of the image collection; it operates on a single
image. Yet, we outperform state-of-the-art object discovery methods by up to 8
CorLoc points on PASCAL VOC 2012. We also show that training a class-agnostic
detector on the discovered objects boosts results by another 7 points.
Moreover, we show promising results on the unsupervised object discovery task.
The code to reproduce our results can be found at
this https URL
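
For readers unfamiliar with the CorLoc numbers quoted above: CorLoc is commonly computed as the percentage of images in which the predicted box overlaps at least one ground-truth box with an IoU of 0.5 or more. The sketch below follows that common definition. The box format and the names `iou_2d` and `corloc` are assumptions for illustration and are not taken from the LOST code.

```python
from typing import List, Tuple

# 2D box: (xmin, ymin, xmax, ymax). Assumed format.
Box2D = Tuple[float, float, float, float]


def iou_2d(a: Box2D, b: Box2D) -> float:
    """Intersection over union of two axis-aligned 2D boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def corloc(predictions: List[Box2D],
           ground_truths: List[List[Box2D]],
           thresh: float = 0.5) -> float:
    """Percentage of images whose single predicted box hits any GT box."""
    if not predictions or len(predictions) != len(ground_truths):
        raise ValueError("need one prediction and one GT list per image")
    hits = sum(
        1
        for pred, gts in zip(predictions, ground_truths)
        if any(iou_2d(pred, gt) >= thresh for gt in gts)
    )
    return 100.0 * hits / len(predictions)
```

With this definition, a gain of 8 CorLoc points means that 8 more images out of every 100 have their single predicted box count as a correct localization.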

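This paper proposes a simple method, called LOST, for localizing objects in image collections without expensive annotation campaigns. The method leverages the activation features of a vision transformer pre-trained in a self-supervised manner, and experiments on PASCAL VOC 2012 show that it outperforms state-of-the-art object discovery methods by up to 8 CorLoc points. We also show that training a class-agnostic detector on the discovered objects yields a further gain of 7 points, and we report promising results on the unsupervised object discovery task.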