Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access to only object detection and segmentation vision models. We study how well a single task-agnostic prompt, without any in-context examples, motion primitives, or external trajectory optimisers, can perform across 26 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and we investigate which design choices in this prompt are the most effective. Our conclusions raise the assumed limit of LLMs for robotics, and we reveal for the first time that LLMs do indeed possess an understanding of low-level robot control sufficient for a range of common tasks, and that they can additionally detect failures and then re-plan trajectories accordingly. Videos, code, and prompts are available at: https://www.robot-learning.uk/language-models-trajectory-generators.

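Large language models (LLMs) have shown promise as high-level planners for robots, but it is often assumed that LLMs lack sufficient knowledge for low-level trajectory planning. This paper examines that assumption in depth, studying whether an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills when it has access only to object detection and segmentation vision models. We study how a single task-agnostic prompt performs across 26 real-world language-based tasks, such as "open the bottle cap" and "wipe the plate with the sponge", and investigate which design choices in this prompt are the most effective. Our conclusions raise the assumed limit of LLMs in robotics, revealing for the first time that LLMs do possess an understanding of low-level robot control sufficient for common tasks, and that they can also detect failures and re-plan trajectories accordingly.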

Language Models as Zero-Shot Trajectory Generators