Recent vision-language-action (VLA) models rely on 2D inputs, lacking
integration with the broader realm of the 3D physical world. Furthermore, they
perform action prediction by learning a direct mapping from perception to
action, neglecting the vast dynamics of the world and the relations between
actions and dynamics. In contrast, human beings possess world models that
imagine future scenarios and plan actions accordingly. To
this end, we propose 3D-VLA by introducing a new family of embodied foundation
models that seamlessly link 3D perception, reasoning, and action through a
generative world model. Specifically, 3D-VLA is built on top of a 3D-based
large language model (LLM), and a set of interaction tokens is introduced to
engage with the embodied environment. Furthermore, to inject generation
abilities into the model, we train a series of embodied diffusion models and
align them into the LLM for predicting the goal images and point clouds. To
train our 3D-VLA, we curate a large-scale 3D embodied instruction dataset by
extracting vast 3D-related information from existing robotics datasets. Our
experiments on held-in datasets demonstrate that 3D-VLA significantly improves
the reasoning, multimodal generation, and planning capabilities in embodied
environments, showcasing its potential in real-world applications.
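
As a rough illustration of the interaction-token idea described above, the sketch below extends a toy vocabulary with special tokens and pulls tagged spans out of generated text. The token names (`<obj>`, `<action>`), the helper functions, and the assumption that vocabulary ids are contiguous from 0 are all hypothetical and not taken from the abstract; the actual model may define and handle its tokens differently.

```python
import re

# Hypothetical interaction tokens; the names are illustrative assumptions.
INTERACTION_TOKENS = ["<obj>", "</obj>", "<action>", "</action>"]


def extend_vocab(vocab, new_tokens):
    """Return a copy of `vocab` with unseen tokens appended at fresh ids.

    Assumes existing ids are contiguous and start at 0, so the next free
    id equals the current vocabulary size.
    """
    extended = dict(vocab)
    for tok in new_tokens:
        if tok not in extended:
            extended[tok] = len(extended)
    return extended


def extract_spans(text, tag):
    """Return the stripped contents of every <tag>...</tag> span in `text`."""
    pattern = re.compile(rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>", re.DOTALL)
    return [span.strip() for span in pattern.findall(text)]


if __name__ == "__main__":
    vocab = extend_vocab({"pick": 0, "up": 1}, INTERACTION_TOKENS)
    output = "pick up <obj> red mug </obj> then <action> grasp </action>"
    print(vocab["<obj>"], extract_spans(output, "obj"), extract_spans(output, "action"))
```

In a real system the new token ids would also need embedding rows in the language model; that step is omitted here.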

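We propose 3D-VLA, a model built on a generative world model that links 3D perception, reasoning, and action. It introduces a set of interaction tokens to engage with the embodied environment and trains a series of generative diffusion models, integrated with a large-scale 3D language model, to predict goal images and point clouds. Experiments on large-scale datasets show that 3D-VLA significantly improves reasoning, multimodal generation, and planning capabilities, demonstrating its potential for real-world applications.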

3D-VLA: A 3D Vision-Language-Action Generative World Model

Learning to ground natural language queries to target objects or regions in
3D point clouds is essential for 3D scene understanding. Nevertheless,
existing 3D visual grounding approaches require a substantial number of
bounding box annotations for text queries, which is time-consuming and
labor-intensive to obtain. In this paper, we propose 3D-VLA, a weakly
supervised approach for 3D visual grounding based on Visual
Linguistic Alignment. Our 3D-VLA exploits the superior
ability of current large-scale vision-language models (VLMs) on aligning the
semantics between texts and 2D images, as well as the naturally existing
correspondences between 2D images and 3D point clouds, and thus implicitly
constructs correspondences between texts and 3D point clouds with no need for
fine-grained box annotations in the training procedure. During the inference
stage, the learned text-3D correspondence will help us ground the text queries
to the 3D target objects even without 2D images. To the best of our knowledge,
this is the first work to investigate 3D visual grounding in a weakly
supervised manner by involving large-scale vision-language models, and
extensive experiments on the ReferIt3D and ScanRefer datasets demonstrate that
our 3D-VLA achieves results comparable, and even superior, to fully supervised
methods.
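
The chain of correspondences described above (text to 2D image through a VLM, 2D image to 3D point cloud through geometry) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes each 3D proposal already has an embedding and a paired 2D-crop embedding in the VLM's space, and it swaps in plain cosine similarity for whatever objective the paper actually uses. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def alignment_loss(feats_3d, feats_2d):
    """Training-time signal (assumed): mean cosine distance between each 3D
    proposal embedding and the embedding of its paired 2D image crop.
    No box annotations for the text query are involved."""
    distances = [1.0 - cosine(p, q) for p, q in zip(feats_3d, feats_2d)]
    return sum(distances) / len(distances)


def ground_query(text_emb, feats_3d):
    """Inference: pick the 3D proposal most similar to the text embedding.
    Only 3D features are needed here; no 2D images are used."""
    sims = [cosine(text_emb, p) for p in feats_3d]
    best = max(range(len(sims)), key=sims.__getitem__)
    return best, sims


if __name__ == "__main__":
    proposals = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query = [0.0, 2.0]
    print(ground_query(query, proposals)[0])
```

Because the 3D embeddings are pulled toward the image embeddings, and the VLM already aligns images with text, the text-3D similarity used at inference comes for free from the shared space.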

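A weakly supervised method based on large-scale vision-language models. It exploits the naturally existing correspondences between 2D images and 3D point clouds and, without fine-grained bounding box annotations, learns text-3D correspondences that link text queries to 3D target objects. Experiments on the ReferIt3D and ScanRefer datasets show that 3D-VLA achieves results comparable to, or even better than, fully supervised methods.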