The ability to understand and reason about the 3D real world is a crucial milestone
towards artificial general intelligence. The current common practice is to
finetune Large Language Models (LLMs) with 3D data and texts to enable 3D
understanding. Despite their effectiveness, these approaches are inherently
limited by the scale and diversity of the available 3D data. As an alternative,
this work introduces Agent3D-Zero, an innovative 3D-aware agent framework that
addresses 3D scene understanding in a zero-shot manner. Our approach recasts
3D scene perception as a process of understanding and synthesizing insights
from multiple images, inspired by how humans attempt to understand 3D scenes.
Building on this idea, we propose a novel way to use a Large Visual Language
Model (VLM) by actively selecting and analyzing a series of viewpoints for 3D
understanding. Specifically, given an input 3D scene,
Agent3D-Zero first processes a bird's-eye view image with custom-designed
visual prompts, then iteratively chooses the next viewpoints to observe and
summarize the underlying knowledge. A distinctive advantage of Agent3D-Zero is
the introduction of novel visual prompts, which significantly improve the VLM's
ability to identify the most informative viewpoints and thus make it easier to
observe 3D scenes. Extensive experiments demonstrate the effectiveness of the
proposed framework in understanding diverse and previously unseen 3D
environments.
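
The viewpoint-selection loop described above can be sketched roughly as
follows. This is a minimal illustration under stated assumptions, not the
authors' implementation: the `Viewpoint` type, the `VLM` interface, the
grid-based candidate layout used as a visual prompt, and the farthest-first
stub that stands in for a real model are all hypothetical names introduced
here.

```python
from dataclasses import dataclass
from math import dist
from typing import Callable, List, Protocol, Sequence


@dataclass(frozen=True)
class Viewpoint:
    """Hypothetical camera pose: a position on the bird's-eye-view image."""
    x: float
    y: float


class VLM(Protocol):
    """Hypothetical interface to a vision-language model (not a real API)."""

    def choose_next(self, bev_image: str, candidates: Sequence[Viewpoint],
                    visited: Sequence[Viewpoint],
                    notes: Sequence[str]) -> Viewpoint: ...

    def describe(self, image: str, question: str) -> str: ...


def grid_candidates(width: int, height: int, step: int) -> List[Viewpoint]:
    """Assumed visual prompt: a grid over the bird's-eye view, one candidate
    viewpoint at the centre of each cell."""
    return [Viewpoint(x + step / 2, y + step / 2)
            for y in range(0, height, step)
            for x in range(0, width, step)]


def explore_scene(vlm: VLM, bev_image: str,
                  render: Callable[[Viewpoint], str],
                  candidates: Sequence[Viewpoint],
                  question: str, max_views: int = 4) -> List[str]:
    """Iteratively pick a viewpoint, render it, and record what the model says."""
    visited: List[Viewpoint] = []
    notes: List[str] = []
    for _ in range(min(max_views, len(candidates))):
        remaining = [c for c in candidates if c not in visited]
        view = vlm.choose_next(bev_image, remaining, visited, notes)
        visited.append(view)
        notes.append(vlm.describe(render(view), question))
    return notes


class FarthestFirstStub:
    """Stand-in for a real VLM so the loop runs offline: it picks the
    candidate farthest from every viewpoint visited so far."""

    def choose_next(self, bev_image, candidates, visited, notes):
        if not visited:
            return candidates[0]
        return max(candidates,
                   key=lambda c: min(dist((c.x, c.y), (v.x, v.y))
                                     for v in visited))

    def describe(self, image, question):
        return f"observed {image}"


def fake_render(view: Viewpoint) -> str:
    """Placeholder renderer: returns a label instead of an image."""
    return f"view@({view.x:.0f},{view.y:.0f})"
```

In a real system, `FarthestFirstStub` would be replaced by calls to an actual
VLM, and `fake_render` by a renderer that produces an image of the 3D scene
from the chosen camera pose; the collected notes would then be summarized to
answer the question.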

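By introducing the Agent3D-Zero framework, we can address 3D scene
understanding in a zero-shot manner: the framework selects and analyzes a
series of viewpoints to facilitate 3D understanding and uses custom-designed
visual prompts to strengthen the model's capabilities. Extensive experiments
demonstrate the framework's effectiveness in understanding a variety of
previously unseen 3D environments.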

Agent3D-Zero: An Agent for Zero-shot 3D Understanding

Training a Large Visual Language Model (LVLM) such as GPT-4 from scratch is
resource-intensive. As an alternative, we propose LMEye, a plug-and-play
Interactive Perception Network for Large Language Models (LLMs) that aims to
improve the accuracy of the LVLM's image understanding. Previous
methods that infuse visual information into LLMs utilize a static visual
mapping network, but lack dynamic interaction between the LLMs and visual
information. LMEye addresses this issue by allowing the LLM to incorporate the
visual information that is aligned with the human instruction. Specifically,
the LMEye network includes a static visual mapping network that provides the
basic perception of an image to the LLM. It also contains additional linear
layers responsible for acquiring requests from the LLM, decomposing image
features, and transmitting the interleaved information to the LLM,
respectively. In this way, the LLM is in charge of understanding human
instructions, sending them to the interactive perception network, and
generating the response based on the interleaved multimodal information. We
evaluate LMEye through extensive
experiments on multimodal question answering and reasoning tasks, demonstrating
that it significantly improves the zero-shot performance of LLMs on multimodal
tasks compared to previous methods.
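
The data flow described above (static mapping, request acquisition, feature
decomposition, transmission) can be illustrated with a toy sketch. Everything
here is an assumption made for illustration: the class and method names are
hypothetical, the weights are random and untrained, and the softmax-weighted
pooling used to select image features is a stand-in, since the abstract does
not specify how the features are decomposed.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def make_linear(n_in: int, n_out: int, rng: random.Random) -> Matrix:
    """Random weights for a bias-free linear layer (illustrative, untrained)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
            for _ in range(n_out)]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Apply a linear layer: one dot product per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class InteractivePerceptionSketch:
    """Hypothetical toy model of the described data flow, not LMEye itself."""

    def __init__(self, llm_dim: int, img_dim: int, seed: int = 0):
        rng = random.Random(seed)
        # Static visual mapping: image features -> LLM embedding space.
        self.static_map = make_linear(img_dim, llm_dim, rng)
        # Request acquisition: LLM request vector -> image feature space.
        self.acquire_request = make_linear(llm_dim, img_dim, rng)
        # Transmission: selected image features -> LLM embedding space.
        self.transmit = make_linear(img_dim, llm_dim, rng)

    def basic_perception(self, patches: Matrix) -> Matrix:
        """Project every patch feature so the LLM gets a first look."""
        return [linear(self.static_map, p) for p in patches]

    def respond(self, request: Vector, patches: Matrix) -> Vector:
        """Return request-conditioned visual information for the LLM."""
        query = linear(self.acquire_request, request)
        # Assumed selection step: weight patches by similarity to the query.
        weights = softmax([sum(q * f for q, f in zip(query, p))
                           for p in patches])
        selected = [sum(w * p[i] for w, p in zip(weights, patches))
                    for i in range(len(query))]
        return linear(self.transmit, selected)
```

In this sketch the LLM would first receive the output of `basic_perception`,
emit a request vector after reading the human instruction, and then receive
the output of `respond` before generating its answer.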

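This paper proposes LMEye, an interactive perception network that aims to
improve the image understanding accuracy of Large Visual Language Models. The
LMEye network comprises a static visual mapping network and several linear
layers responsible for acquiring requests, decomposing image features, and
transmitting interleaved information. Through extensive experiments on
multimodal question answering and reasoning tasks, we show that LMEye
significantly improves the zero-shot performance of LLMs on multimodal tasks.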