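TL;DR: Grounding natural language to the physical world is a topic with wide-ranging applications. This paper proposes a multi-view feature fusion strategy based on object-centric priors to improve natural language grounding and language-guided robotic grasping from 2D and 3D images.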
Abstract
Grounding natural language to the physical world is a ubiquitous topic with a
wide range of applications in computer vision and robotics. Recently, 2D
vision-language models such as CLIP have been widely popularized, due to their
impressive capabilities for open-vocabulary grounding in