Solving long-horizon, temporally-extended tasks using Reinforcement Learning
(RL) is challenging, compounded by the common practice of learning without
prior knowledge (or tabula rasa learning). Humans can generate and execute
plans with temporally-extended actions and quickly learn to perform new tasks
because we almost never solve problems from scratch. We want autonomous agents
to have this same ability. Recently, LLMs have been shown to encode a
tremendous amount of knowledge about the world and to perform impressive
in-context learning and reasoning. However, using LLMs to solve real-world
problems is hard because they are not grounded in the current task. In this
paper we exploit the planning capabilities of LLMs while using RL to provide
learning from the environment, resulting in a hierarchical agent that uses LLMs
to solve long-horizon tasks. Rather than relying entirely on LLMs, we use them
to guide a high-level policy, making learning significantly more sample efficient. This
approach is evaluated in simulation environments such as MiniGrid, SkillHack,
and Crafter, and on a real robot arm in block manipulation tasks. We show that
agents trained using our approach outperform other baseline methods and, once
trained, do not need access to LLMs during deployment.

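Summary: Solving long-horizon, temporally extended tasks with reinforcement learning is challenging, especially without prior knowledge. To improve sample efficiency, this paper combines the planning capabilities of LLMs with RL's learning from the environment to build a hierarchical agent for long-horizon tasks. The method is validated in simulation environments such as MiniGrid, SkillHack, and Crafter, and on block manipulation tasks with a real robot arm, where it outperforms the baselines; once trained, the agent does not depend on LLMs for deployment.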

LLM Augmented Hierarchical Agents
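
The abstract above says that the LLM guides a high-level policy during training and is not needed at deployment, but it does not say how the guidance is applied. One way this could look is sketched below, purely as an assumption: an LLM-derived prior over skills is mixed with the learned policy's distribution, and the mixing weight is annealed to zero so the trained policy stands alone. The names `guided_skill_distribution`, `annealed_weight`, and the prior values are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def guided_skill_distribution(policy_logits, llm_prior, guidance_weight):
    """Mix the learned high-level policy with an LLM-derived prior over skills.

    ASSUMPTION: linear mixing is an illustrative choice, not the paper's rule.
    guidance_weight = 1.0 follows the prior only; 0.0 uses the policy alone.
    llm_prior is assumed to be a probability distribution over the same skills.
    """
    policy_probs = softmax(policy_logits)
    mixed = [
        (1.0 - guidance_weight) * p + guidance_weight * q
        for p, q in zip(policy_probs, llm_prior)
    ]
    total = sum(mixed)
    return [value / total for value in mixed]

def annealed_weight(step, total_steps, start=1.0):
    """Linearly decay the guidance weight to zero over training (assumed schedule)."""
    return max(0.0, start * (1.0 - step / total_steps))

# Hypothetical usage: three skills, with a prior that favours the first one.
skills = ["go_to_key", "pick_up_key", "open_door"]
prior = [0.7, 0.2, 0.1]
early = guided_skill_distribution([0.0, 0.0, 0.0], prior, annealed_weight(0, 100))
late = guided_skill_distribution([0.0, 0.0, 0.0], prior, annealed_weight(100, 100))
```

With the weight at zero, as in `late`, the distribution depends only on the policy logits, which is consistent with the claim that no LLM access is needed after training.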

Hindsight Experience Replay (HER) is a technique used in reinforcement
learning (RL) that has proven to be very efficient for training off-policy
RL-based agents to solve goal-based robotic manipulation tasks using sparse
rewards. Even though HER improves the sample efficiency of RL-based agents by
learning from mistakes made in past experiences, it does not provide any
guidance while exploring the environment. This leads to very long training
times due to the volume of experience required to train an agent using this
replay strategy. In this paper, we propose a method that uses primitive
behaviours that have been previously learned to solve simple tasks in order to
guide the agent toward more rewarding actions during exploration while learning
other more complex tasks. This guidance, however, is not executed by a manually
designed curriculum, but rather using a critic network to decide at each
timestep whether or not to use the actions proposed by the previously-learned
primitive policies. We evaluate our method by comparing its performance against
HER and other more efficient variations of this algorithm in several block
manipulation tasks. We demonstrate that agents can learn a successful policy
faster when using our proposed method, both in terms of sample efficiency and
computation time. Code is available at this https URL

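Summary: Hindsight Experience Replay (HER) improves the sample efficiency of RL agents on goal-based robotic manipulation tasks by learning from past experience. This paper proposes a method that uses primitive behaviours previously learned on simple tasks to guide the agent toward more rewarding actions during exploration. Comparisons against HER and more efficient variants of it on several block manipulation tasks show that the proposed method learns a successful policy faster, in terms of both sample efficiency and computation time.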

Learning and reusing primitive behaviours to improve Hindsight Experience Replay sample efficiency
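
The abstract above describes a critic that decides, at each timestep, whether to use an action proposed by a previously learned primitive policy. A minimal sketch of one possible reading follows: the critic scores the agent's own action alongside each primitive's proposal, and the highest-scoring candidate is executed. The function `select_action`, the toy one-dimensional state, and the toy critic are hypothetical and only illustrate the gating idea; the paper's actual networks and selection rule may differ.

```python
def select_action(state, goal, agent_policy, primitive_policies, critic):
    """Pick the candidate action that the critic scores highest.

    ASSUMPTION: greedy selection over critic scores is an illustrative rule.
    Returns (action, source_index), where source_index 0 is the agent's own
    policy and i >= 1 is the i-th primitive policy.
    """
    candidates = [agent_policy(state, goal)]
    candidates += [primitive(state, goal) for primitive in primitive_policies]
    scores = [critic(state, goal, action) for action in candidates]
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return candidates[best], best

# Toy 1D example: the state is a position and an action is a displacement.
untrained_agent = lambda state, goal: 0.0              # stays put early in training
reach_primitive = lambda state, goal: goal - state     # previously learned "reach"
toy_critic = lambda state, goal, action: -abs(state + action - goal)

action, source = select_action(0.0, 2.0, untrained_agent, [reach_primitive], toy_critic)
```

In this toy case the critic prefers the primitive's proposal because it lands on the goal, so `source` identifies the reach primitive rather than the agent's own policy.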

Learning from human demonstrations (behavior cloning) is a cornerstone of
robot learning. However, most behavior cloning algorithms require a large
number of demonstrations to learn a task, especially for general tasks that
have a large variety of initial conditions. Humans, however, can learn to
complete tasks, even complex ones, after only seeing one or two demonstrations.
Our work seeks to emulate this ability, using behavior cloning to learn a task
given only a single human demonstration. We achieve this goal by using linear
transforms to augment the single demonstration, generating a set of
trajectories for a wide range of initial conditions. With these demonstrations,
we are able to train a behavior cloning agent to successfully complete three
block manipulation tasks. Additionally, we developed a novel addition to the
temporal ensembling method used by action chunking agents during inference. By
incorporating the standard deviation of the action predictions into the
ensembling method, our approach is more robust to unforeseen changes in the
environment, resulting in significant performance improvements.

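Summary: This work uses behavior cloning to learn a task from a single human demonstration. Linear transforms augment the single demonstration into a set of trajectories covering a wide range of initial conditions, which is enough to train a behavior cloning agent that completes three block manipulation tasks. Incorporating the standard deviation of the action predictions into the ensembling method at inference time makes the approach more robust to unforeseen changes in the environment, yielding significant performance improvements.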
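
The augmentation step described above, using linear transforms to turn one demonstration into trajectories for many initial conditions, can be illustrated with a small sketch. The abstract does not say which transforms are used, so the example assumes planar rotation and uniform scaling applied to a demonstration stored as `(x, y)` end-effector waypoints. The names `transform_trajectory` and `augment_demonstration` are hypothetical.

```python
import math

def transform_trajectory(trajectory, angle_rad, scale):
    """Apply a 2D rotation and uniform scaling to every waypoint.

    ASSUMPTION: rotation plus scaling about the origin is an illustrative
    choice of linear transform, not the paper's stated set.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        (scale * (c * x - s * y), scale * (s * x + c * y))
        for x, y in trajectory
    ]

def augment_demonstration(trajectory, angles, scales):
    """Generate one transformed trajectory per (angle, scale) pair."""
    return [
        transform_trajectory(trajectory, angle, scale)
        for angle in angles
        for scale in scales
    ]

# Hypothetical single demonstration: move from the origin to a block at (1, 0).
demo = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]
augmented = augment_demonstration(demo, angles=[0.0, math.pi / 2], scales=[1.0, 2.0])
```

Each transformed trajectory ends at a different block position, so the four results in `augmented` stand in for four distinct initial conditions derived from the same demonstration.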