Large Language Model (LLM)-based agents have shown promise in autonomously
completing tasks across various domains, e.g., robotics, games, and web
navigation. However, these agents typically require elaborate design and expert
prompts to solve tasks in specific domains, which limits their adaptability. We
introduce AutoManual, a framework enabling LLM agents to autonomously build
their understanding through interaction and adapt to new environments.
AutoManual categorizes environmental knowledge into diverse rules and optimizes
them in an online fashion by two agents: 1) The Planner codes actionable plans
based on current rules for interacting with the environment. 2) The Builder
updates the rules through a well-structured rule system that facilitates online
rule management and essential detail retention. To mitigate hallucinations in
managing rules, we introduce a \textit{case-conditioned prompting} strategy for
the Builder. Finally, the Formulator agent compiles these rules into a
comprehensive manual. The self-generated manual can not only improve
adaptability but also guide the planning of smaller LLMs while being
human-readable. Given only one simple demonstration, AutoManual significantly
improves task success rates, achieving 97.4\% with GPT-4-turbo and 86.2\% with
GPT-3.5-turbo on ALFWorld benchmark tasks. The source code will be available
soon.
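
The Planner, Builder, and Formulator roles described above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `RuleSystem`, `learn_online`, `formulate_manual`, and the toy planner and builder are hypothetical names, and plain Python functions stand in for what would be LLM calls. The toy builder's branch on success or failure is only a loose nod to conditioning on the case; the actual prompting strategy is not specified here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class RuleSystem:
    """Hypothetical rule store: id -> (kind, text), with add/update/remove."""
    rules: Dict[int, Tuple[str, str]] = field(default_factory=dict)
    next_id: int = 0

    def add(self, kind: str, text: str) -> int:
        rule_id = self.next_id
        self.rules[rule_id] = (kind, text)
        self.next_id += 1
        return rule_id

    def update(self, rule_id: int, text: str) -> None:
        kind, _ = self.rules[rule_id]
        self.rules[rule_id] = (kind, text)

    def remove(self, rule_id: int) -> None:
        del self.rules[rule_id]


# Assumed signatures; in a real system both would wrap LLM calls.
Planner = Callable[[str, RuleSystem], Tuple[bool, str]]  # -> (success, log)
Builder = Callable[[str, bool, str, RuleSystem], None]   # edits rules in place


def learn_online(tasks: List[str], planner: Planner, builder: Builder,
                 rules: RuleSystem) -> int:
    """Alternate planning and rule updates; return the number of successes."""
    successes = 0
    for task in tasks:
        success, log = planner(task, rules)   # act using the current rules
        builder(task, success, log, rules)    # revise rules from the outcome
        successes += int(success)
    return successes


def formulate_manual(rules: RuleSystem) -> str:
    """Group rules by kind into a plain-text manual."""
    grouped: Dict[str, List[str]] = {}
    for kind, text in rules.rules.values():
        grouped.setdefault(kind, []).append(text)
    lines: List[str] = []
    for kind in sorted(grouped):
        lines.append(f"## {kind}")
        lines.extend(f"- {text}" for text in grouped[kind])
    return "\n".join(lines)


def toy_planner(task: str, rules: RuleSystem) -> Tuple[bool, str]:
    # Toy stand-in: succeeds only if some rule already mentions the task.
    known = any(task in text for _, text in rules.rules.values())
    return known, f"attempted {task}"


def toy_builder(task: str, success: bool, log: str, rules: RuleSystem) -> None:
    # Toy stand-in: record a rule only when the attempt failed.
    if not success:
        rules.add("error", f"{task}: failed once, revise the plan")
```

With these toy stand-ins, running the same task twice fails the first time, records one rule, and succeeds the second time; `formulate_manual` then renders that rule under its kind heading.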

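By automatically generating rules and improving adaptability, the AutoManual framework enables Large Language Model (LLM) based agents to autonomously build their own understanding and adapt to new environments. On ALFWorld benchmark tasks, AutoManual significantly improves task success rates with GPT-4-turbo and GPT-3.5-turbo and produces a comprehensive, human-readable manual.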

AutoManual: Generating Instruction Manuals by LLM Agents via Interactive Environmental Learning

In this paper, we study imitation learning under the challenging setting of:
(1) only a single demonstration, (2) no further data collection, and (3) no
prior task or object knowledge. We show how, with these constraints, imitation
learning can be formulated as a combination of trajectory transfer and unseen
object pose estimation. To explore this idea, we provide an in-depth study on
how state-of-the-art unseen object pose estimators perform for one-shot
imitation learning on ten real-world tasks, and we take a deep dive into the
effects that camera calibration, pose estimation error, and spatial
generalisation have on task success rates. For videos, please visit
this https URL
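
The combination of trajectory transfer and object pose estimation described above can be illustrated with a planar sketch. It assumes poses are rigid transforms in the plane (SE(2)) written as 3x3 homogeneous matrices, and that the object poses at demonstration and test time come from some pose estimator; the helper names `se2`, `matmul`, `inverse`, and `transfer_trajectory` are hypothetical and are not taken from the paper. The idea is that each demonstrated end-effector waypoint is moved by the same rigid transform that carries the object from its demonstration pose to its test pose.

```python
import math
from typing import List

Matrix = List[List[float]]


def se2(x: float, y: float, theta: float) -> Matrix:
    """Homogeneous 3x3 matrix for a planar pose (x, y, heading theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, x], [s, c, y], [0.0, 0.0, 1.0]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def inverse(t: Matrix) -> Matrix:
    """Inverse of a rigid transform: rotation R^T, translation -R^T p."""
    c, s = t[0][0], t[1][0]
    x, y = t[0][2], t[1][2]
    return [[c, s, -(c * x + s * y)],
            [-s, c, s * x - c * y],
            [0.0, 0.0, 1.0]]


def transfer_trajectory(demo_waypoints: List[Matrix],
                        object_pose_demo: Matrix,
                        object_pose_test: Matrix) -> List[Matrix]:
    """Apply the object's demo-to-test motion to every demo waypoint."""
    delta = matmul(object_pose_test, inverse(object_pose_demo))
    return [matmul(delta, w) for w in demo_waypoints]


# Demo: object at (1, 0), gripper waypoint 0.5 above it in the object frame.
demo_object = se2(1.0, 0.0, 0.0)
demo_waypoints = [se2(1.0, 0.5, 0.0)]
# Test: the object has moved to (2, 1) and rotated by 90 degrees.
test_object = se2(2.0, 1.0, math.pi / 2)
new_waypoints = transfer_trajectory(demo_waypoints, demo_object, test_object)
```

In this sketch any error in the estimated test pose propagates directly into every transferred waypoint, which is one way to see why pose estimation error matters for task success.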

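This paper studies imitation learning in the challenging setting of a single demonstration, no further data collection, and no prior task or object knowledge, and shows how, under these constraints, imitation learning can be formulated as a combination of trajectory transfer and unseen object pose estimation. Through one-shot imitation learning on ten real-world tasks, we examine in depth how state-of-the-art unseen object pose estimators perform and how camera calibration, pose estimation error, and spatial generalisation affect task success rates.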

One-Shot Imitation Learning: A Pose Estimation Perspective

Large language models encode a vast amount of semantic knowledge and possess
remarkable understanding and reasoning capabilities. Previous research has
explored how to ground language models in robotic tasks to ensure that the
sequences generated by the language model are both logically correct and
practically executable. However, low-level execution may deviate from the
high-level plan due to environmental perturbations or imperfect controller
design. In this paper, we propose DoReMi, a novel language model grounding
framework that enables immediate Detection and Recovery from Misalignments
between plan and execution. Specifically, during low-level skill execution, we
use a visual question answering (VQA) model to regularly detect plan-execution
misalignments. If a misalignment occurs, our method calls the language model
to re-plan and recover from it. Experiments
on various complex tasks with robot arms and humanoid robots demonstrate
that our method can lead to higher task success rates and shorter task
completion times. Videos of DoReMi are available at
this https URL
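
The detect-and-recover loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DoReMi implementation: `execute_with_recovery` and `ToyArm` are hypothetical names, the `plan` callable stands in for the language model, and the `misaligned` callable stands in for the VQA-based detector. The toy world lets the object slip out of the gripper once, so the effect of re-planning is visible.

```python
from typing import Callable, List


def execute_with_recovery(
    goal: str,
    plan: Callable[[str, List[str]], List[str]],  # stand-in for the LLM planner
    step: Callable[[str], bool],       # one low-level step; True when skill is done
    misaligned: Callable[[str], bool],  # stand-in for the VQA-based detector
    check_every: int = 5,
    max_steps: int = 1000,
) -> List[str]:
    """Run skills in order, checking periodically and re-planning on misalignment."""
    completed: List[str] = []
    queue = plan(goal, completed)
    steps = 0
    while queue and steps < max_steps:
        skill = queue[0]
        done = step(skill)
        steps += 1
        if steps % check_every == 0 and misaligned(skill):
            queue = plan(goal, completed)  # re-plan from the current state
            continue
        if done:
            completed.append(queue.pop(0))
    return completed


class ToyArm:
    """Toy world: the object slips out of the gripper once, at a fixed tick."""

    def __init__(self, slip_tick: int = 2) -> None:
        self.holding = False
        self.placed = False
        self.ticks = 0
        self.place_progress = 0
        self.slip_tick = slip_tick

    def plan(self, goal: str, completed: List[str]) -> List[str]:
        return ["place"] if self.holding else ["pick", "place"]

    def step(self, skill: str) -> bool:
        self.ticks += 1
        if skill == "pick":
            self.holding = True
            self.place_progress = 0
            return True
        if self.ticks == self.slip_tick:
            self.holding = False  # perturbation: the object is dropped
            return False
        self.place_progress += 1
        if self.place_progress >= 2:
            self.placed = self.holding  # only counts if still in hand
            return True
        return False

    def misaligned(self, skill: str) -> bool:
        return skill == "place" and not self.holding


arm = ToyArm()
log = execute_with_recovery("put the object on the shelf", arm.plan, arm.step,
                            arm.misaligned, check_every=1)
```

In this toy run the detector notices the dropped object during `place`, the planner is called again, and the arm picks the object up a second time before placing it. With the detector disabled, the same run finishes `place` without the object in hand, so the placement silently fails.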

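This paper proposes DoReMi, a novel language model grounding framework that detects and recovers from misalignments between plan and execution. Experiments show that, compared with other models, DoReMi achieves higher task success rates and shorter task completion times.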