Large language models (LLMs) have achieved impressive progress on several
open-world tasks, and using LLMs to build embodied agents has recently become
an active research topic. In this paper, we propose STEVE, a comprehensive and
visionary embodied agent in the Minecraft virtual environment. STEVE consists of
three key components: vision perception, language instruction, and code action.
Vision perception interprets visual information in the environment, which is
then passed to the LLM component together with the agent state and the task
instruction. Language instruction is responsible for iterative reasoning and
for decomposing complex tasks into manageable guidelines. Code action generates
executable skill actions by retrieving from a skill database, enabling the
agent to interact effectively within the Minecraft environment. We
also collect the STEVE-21K dataset, which includes 600$+$ vision-environment
pairs, 20K knowledge question-answering pairs, and 200$+$ skill-code pairs. We
evaluate performance on continuous block search, knowledge question answering,
and tech tree mastery. Extensive experiments show that STEVE unlocks key tech
trees up to $1.5 \times$ faster and completes block search tasks up to
$2.5 \times$ faster than previous state-of-the-art methods.
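
The retrieval step behind the code action component can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the abstract does not specify how retrieval is implemented, so the bag-of-words cosine similarity stands in for whatever embedding the system actually uses, and the skill names, descriptions, and `SKILL_DB` layout are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a simple stand-in for a learned text embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_skills(query, skill_db, top_k=1):
    """Return the names of the top_k skills whose descriptions best match the query."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        skill_db,
        key=lambda name: cosine(query_vec, bag_of_words(skill_db[name]["description"])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical skill database: each entry pairs a description with a code placeholder.
SKILL_DB = {
    "mineWoodLog": {
        "description": "mine a wood log from a nearby tree",
        "code": "<skill code placeholder>",
    },
    "craftWoodenPickaxe": {
        "description": "craft a wooden pickaxe using planks and sticks",
        "code": "<skill code placeholder>",
    },
    "smeltIronOre": {
        "description": "smelt iron ore in a furnace to get an iron ingot",
        "code": "<skill code placeholder>",
    },
}

if __name__ == "__main__":
    # A decomposed guideline from the language instruction stage (hypothetical).
    for name in retrieve_skills("craft a wooden pickaxe", SKILL_DB):
        print(name, "->", SKILL_DB[name]["code"])
```

In this sketch the guideline text is the query and the retrieved entry's code would be handed to the agent for execution; a real system would likely replace `bag_of_words` with a dense embedding model.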

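STEVE is a comprehensive and visionary LLM-based embodied agent in the Minecraft virtual environment. Its three key components are vision perception, language instruction, and code action. By interpreting visual information, reasoning iteratively, and generating executable skill actions, STEVE unlocks skills faster and completes block search tasks faster in the Minecraft environment.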

See and Think: Embodied Agent in Virtual Environment

Modern methods for vision-centric autonomous driving perception widely adopt
the bird's-eye-view (BEV) representation to describe a 3D scene. Although BEV
is more efficient than a voxel representation, a single plane has difficulty
describing the fine-grained 3D structure of a scene. To address this, we
propose a tri-perspective view (TPV) representation which accompanies BEV with
two additional perpendicular planes. We model each point in the 3D space by
summing its projected features on the three planes. To lift image features to
the 3D TPV space, we further propose a transformer-based TPV encoder
(TPVFormer) to obtain the TPV features effectively. We employ the attention
mechanism to aggregate the image features corresponding to each query in each
TPV plane. Experiments show that our model trained with sparse supervision
effectively predicts the semantic occupancy for all voxels. We demonstrate for
the first time that a model using only camera inputs can achieve performance
comparable to LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code:
this https URL.
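
The point-feature rule stated above (each 3D point is modeled by summing its projected features on the three planes) can be sketched in a few lines. This is only an illustration under assumed conventions: the axis names `h`, `w`, `d`, the plane layouts `plane_hw[h][w]`, `plane_dh[d][h]`, `plane_wd[w][d]`, and the toy `make_plane` helper are all hypothetical, and nearest-cell lookup is used where a real model might interpolate.

```python
def make_plane(rows, cols, channels, offset):
    """Toy plane of shape rows x cols x channels with easily checked values."""
    return [[[offset + 10 * r + c] * channels for c in range(cols)]
            for r in range(rows)]


def tpv_point_feature(plane_hw, plane_dh, plane_wd, h, w, d):
    """Feature of voxel (h, w, d): element-wise sum of its three plane projections."""
    f_hw = plane_hw[h][w]  # projection onto the H-W plane (the BEV plane here)
    f_dh = plane_dh[d][h]  # projection onto the D-H plane
    f_wd = plane_wd[w][d]  # projection onto the W-D plane
    return [a + b + c for a, b, c in zip(f_hw, f_dh, f_wd)]


def tpv_volume(plane_hw, plane_dh, plane_wd):
    """Dense H x W x D grid of summed features, one vector per voxel."""
    H, W, D = len(plane_hw), len(plane_wd), len(plane_dh)
    return [[[tpv_point_feature(plane_hw, plane_dh, plane_wd, h, w, d)
              for d in range(D)] for w in range(W)] for h in range(H)]


if __name__ == "__main__":
    H, W, D, C = 2, 3, 4, 2
    hw = make_plane(H, W, C, 100)
    dh = make_plane(D, H, C, 200)
    wd = make_plane(W, D, C, 300)
    print(tpv_point_feature(hw, dh, wd, h=1, w=2, d=3))
```

The design point the sketch makes concrete is storage cost: the three planes hold on the order of HW + DH + WD feature vectors, yet every one of the HWD voxels still receives its own feature, which is what allows a per-voxel prediction head to be applied to all voxels.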

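Describing 3D scenes in autonomous driving with a bird's-eye-view (BEV) representation makes it difficult to capture fine-grained 3D structure, so we propose the tri-perspective view (TPV) representation and use an attention-based TPV encoder to achieve significant improvements. The model can effectively predict semantic occupancy with sparse supervision, and using only camera inputs it achieves performance comparable to LiDAR-based methods on the LiDAR segmentation task.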