Large language models (LLMs) have recently demonstrated their impressive
ability to provide context-aware responses via text. This ability could
be used to predict plausible solutions in sequential decision-making tasks
pertaining to pattern completion. For example, by observing a
partial stack of cubes, LLMs can predict the correct sequence in which the
remaining cubes should be stacked by extrapolating the observed patterns (e.g.,
cube sizes, colors or other attributes) in the partial stack. In this work, we
introduce LaGR (Language-Guided Reinforcement learning), which uses this
predictive ability of LLMs to propose solutions to tasks that have been
partially completed by a primary reinforcement learning (RL) agent, in order to
subsequently guide the latter's training. However, because RL training is
generally not sample-efficient, this approach would require querying the LLM
repeatedly for solutions, which can be expensive or even infeasible. To address
this issue, we introduce SEQ (Sample-Efficient Querying), in which we
simultaneously train a secondary RL agent to decide when
the LLM should be queried for solutions. Specifically, we use the quality of
the solutions emanating from the LLM as the reward to train this agent. We show
that our proposed framework LaGR-SEQ enables more efficient primary RL
training, while simultaneously minimizing the number of queries to the LLM. We
demonstrate our approach on a series of tasks and highlight its advantages,
limitations, and potential directions for future research.
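
The query-gating idea behind SEQ can be illustrated with a minimal sketch. It is not the authors' implementation: the tabular epsilon-greedy learner, the `QueryAgent` class, the `toy_llm_quality` scoring function, the use of task progress as the state, and the fixed `query_cost` are all assumptions made for illustration. The only element taken from the abstract is that the secondary agent decides when to query and is rewarded according to the quality of the returned solution.

```python
import random
from collections import defaultdict


class QueryAgent:
    """Hypothetical secondary agent: tabular epsilon-greedy learner over {skip, query}."""

    SKIP, QUERY = 0, 1

    def __init__(self, lr=0.5, epsilon=0.1, seed=0):
        # One pair of action values [skip, query] per observed state.
        self.q = defaultdict(lambda: [0.0, 0.0])
        self.lr = lr
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def act(self, state):
        # Explore with probability epsilon, otherwise pick the greedy action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice((self.SKIP, self.QUERY))
        skip_value, query_value = self.q[state]
        return self.QUERY if query_value > skip_value else self.SKIP

    def update(self, state, action, reward):
        # One-step (bandit-style) update toward the observed reward.
        self.q[state][action] += self.lr * (reward - self.q[state][action])


def toy_llm_quality(progress):
    """Stand-in for scoring an LLM's proposed completion; not a real LLM call.

    Assumption: the proposal is only useful once enough of the pattern
    (two or more placed cubes) is visible.
    """
    return 1.0 if progress >= 2 else 0.0


def train(agent, episodes=500, query_cost=0.3):
    # Assumption: the reward is solution quality minus a fixed cost per query.
    for _ in range(episodes):
        for progress in range(4):
            action = agent.act(progress)
            if action == agent.QUERY:
                reward = toy_llm_quality(progress) - query_cost
            else:
                reward = 0.0
            agent.update(progress, action, reward)
    return agent
```

In this toy setting the agent learns to skip queries while too little of the pattern is visible and to query once the stand-in LLM can produce a useful proposal. A full system would replace `toy_llm_quality` with an actual evaluation of the LLM's proposed solution and would feed that solution to the primary RL agent.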

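By leveraging the predictive ability of large language models, we introduce LaGR (Language-Guided Reinforcement learning) and SEQ (Sample-Efficient Querying), two frameworks that propose solutions to partially completed tasks while reducing the number of queries to the language model, enabling more efficient training of the primary reinforcement learning agent.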

LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying

This paper investigates the idea of encoding object-centered representations
in the design of the reward function and policy architectures of a
language-guided reinforcement learning agent. This is done using a combination
of object-wise permutation-invariant networks inspired by Deep Sets and
gated-attention mechanisms. In a procedurally generated 2D world where agents
pursue goals expressed in natural language by navigating and interacting with
objects, we show that these architectures generalize strongly to
out-of-distribution goals. We study the generalization to varying numbers of
objects at test time and further extend the object-centered architectures to
goals involving relational reasoning.
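
To make the architectural idea concrete, the following is a minimal pure-Python sketch of a Deep Sets-style scorer with goal-conditioned gating. It is an illustration under stated assumptions, not the paper's architecture: the `ObjectCentricScorer` class, the layer sizes, the random untrained weights, and the specific choice of a sigmoid gate computed from the goal embedding and applied to each object's features are all hypothetical. What it is meant to show is the structural property the abstract relies on: a shared per-object encoder followed by sum pooling yields an output that does not depend on object order and accepts any number of objects.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    return [max(0.0, v) for v in vector]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ObjectCentricScorer:
    """Hypothetical scorer: gate each object by the goal, encode, sum-pool, read out."""

    def __init__(self, object_dim, goal_dim, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.w_gate = random_matrix(object_dim, goal_dim, rng)   # goal -> gate
        self.w_phi = random_matrix(hidden_dim, object_dim, rng)  # shared per-object encoder
        self.w_rho = random_matrix(1, hidden_dim, rng)           # readout after pooling

    def __call__(self, objects, goal):
        # Gated attention (assumed form): the goal embedding produces a
        # per-feature gate in (0, 1) that rescales every object's features.
        gate = [sigmoid(g) for g in linear(self.w_gate, goal)]
        pooled = [0.0] * len(self.w_phi)
        for obj in objects:
            gated = [g * x for g, x in zip(gate, obj)]
            encoded = relu(linear(self.w_phi, gated))
            # Sum pooling makes the result independent of object order.
            pooled = [p + e for p, e in zip(pooled, encoded)]
        return linear(self.w_rho, pooled)[0]
```

Calling the scorer on the same objects in a different order gives the same value up to floating-point rounding, and the same weights can be applied to scenes with fewer or more objects, which is the property that makes testing on varying object counts possible.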

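This paper studies the idea of encoding object-centered representations in the reward function and policy architectures of a reinforcement learning agent guided by natural language. Using a combination of object-wise permutation-invariant networks inspired by Deep Sets and gated-attention mechanisms, we show, in a procedurally generated 2D world, that these architectures generalize strongly to out-of-distribution goals. We also study generalization to the number of objects at test time and extend the object-centered architectures to goals involving relational reasoning.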