Learning generalizable visual representations from Internet data has yielded
promising results for robotics. Yet, prevailing approaches focus on
pre-training 2D representations, which are sub-optimal for handling occlusions and
accurately localizing objects in complex 3D scenes. Meanwhile, 3D representation
learning has been limited to single-object understanding. To address these
limitations, we introduce a novel 3D pre-training framework for robotics named
SUGAR that captures semantic, geometric and affordance properties of objects
through 3D point clouds. We underscore the importance of cluttered scenes in 3D
representation learning, and automatically construct a multi-object dataset
benefiting from cost-free supervision in simulation. SUGAR employs a versatile
transformer-based model to jointly address five pre-training tasks, namely
cross-modal knowledge distillation for semantic learning, masked point modeling
for understanding geometric structures, grasping pose synthesis for object
affordance, and 3D instance segmentation and referring expression grounding for
analyzing cluttered scenes. We evaluate our learned representation on three
robotics-related tasks, namely, zero-shot 3D object recognition, referring
expression grounding, and language-driven robotic manipulation. Experimental
results show that SUGAR's 3D representation outperforms state-of-the-art 2D and
3D representations.

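SUGAR, a new 3D pre-training framework, captures the semantic, geometric, and affordance properties of objects, addressing the shortcomings of 2D representations in handling occlusions and accurately localizing objects in complex 3D scenes. It uses a versatile transformer-based model to jointly address five pre-training tasks: cross-modal knowledge distillation for semantic learning, masked point modeling for understanding geometric structures, grasping pose synthesis for object affordance, and 3D instance segmentation and referring expression grounding for cluttered scenes. Experimental results show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations.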

SUGAR: Pre-training 3D Visual Representations for Robotics

Affordance knowledge is a fundamental aspect of commonsense knowledge. Recent
findings indicate that world knowledge emerges through large-scale
self-supervised pretraining, motivating our exploration of acquiring affordance
knowledge from the visual domain. To this end, we augment an existing
instructional video resource to create the new Causal Action-Effect (CAE)
dataset and design two novel pretraining tasks -- Masked Action Modeling (MAM)
and Masked Effect Modeling (MEM) -- promoting the acquisition of two affordance
properties in models: behavior and entity equivalence, respectively. We
empirically demonstrate the effectiveness of our proposed methods in learning
affordance properties. Furthermore, we show that a model pretrained on both
tasks outperforms a strong image-based visual-linguistic foundation model
(FLAVA) as well as pure linguistic models on a zero-shot physical reasoning
probing task.

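This work acquires affordance knowledge about action effects from the visual domain through self-supervised pretraining. It shows that the two pretraining tasks, Masked Action Modeling and Masked Effect Modeling, are effective for learning affordance properties, and that a model pretrained on both outperforms an image-based vision-language model and pure language models on a zero-shot physical reasoning probing task.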

Implicit Affordance Acquisition via Causal Action-Effect Modeling in the Video Domain

Affordance detection, which refers to perceiving objects with potential
action possibilities in images, is a challenging task since the possible
affordance depends on the person's purpose in real-world application scenarios.
Existing works mainly extract the inherent human-object dependencies from
images or videos to accommodate affordance properties that change dynamically. In
this paper, we explore perceiving affordance from a vision-language
perspective and consider the challenging phrase-based affordance detection
problem, i.e., given a set of phrases describing the action purposes, all the
object regions in a scene with the same affordance should be detected. To this
end, we propose a cyclic bilateral consistency enhancement network (CBCE-Net)
to align language and vision features progressively. Specifically, the
presented CBCE-Net consists of a mutually guided vision-language module that
updates the common features of vision and language in a progressive manner, and
a cyclic interaction module (CIM) that facilitates the perception of possible
interaction with objects in a cyclic manner. In addition, we extend the public
Purpose-driven Affordance Dataset (PAD) by annotating affordance categories
with short phrases. Comparative experimental results demonstrate the
superiority of our method over nine typical methods from four relevant fields
in terms of both objective metrics and visual quality. The related code and
dataset will be released at https://github.com/lulsheng/CBCE-Net.

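This paper proposes a cyclic bilateral consistency enhancement network (CBCE-Net) that detects affordances from a vision-language perspective, and extends the public Purpose-driven Affordance Dataset (PAD) by annotating affordance categories with short phrases. Experimental results show that the method outperforms nine typical methods from related fields in terms of both objective metrics and visual quality.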