Large language models (LLMs) provide excellent text-generation capabilities,
but standard prompting and generation methods generally do not lead to
intentional or goal-directed agents and might necessitate considerable prompt
tuning. This becomes particularly apparent in multi-turn conversations: even
the best current LLMs rarely ask clarifying questions, engage in explicit
information gathering, or take actions now that lead to better decisions after
multiple turns. Reinforcement learning has the potential to leverage the
powerful modeling capabilities of LLMs, as well as their internal
representation of textual interactions, to create capable goal-directed
language agents. Such agents could carry out intentional and temporally
extended interactions, for example with humans through coordinated persuasion
and carefully crafted questions, or in goal-directed play of text games, to
bring about desired final outcomes. However, enabling this requires the community to
develop stable and reliable reinforcement learning algorithms that can
effectively train LLMs. Developing such algorithms requires tasks that can
gauge progress on algorithm design, provide accessible and reproducible
evaluations for multi-turn interactions, and cover a range of task properties
and challenges relevant to improving reinforcement learning algorithms. Our paper
introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs,
together with an open-source research framework containing a basic toolkit for
getting started on multi-turn RL with offline value-based and policy-based RL
methods. Our benchmark consists of 8 different language tasks, which require
multiple rounds of language interaction and cover a range of tasks in
open-ended dialogue and text games.

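Combining large language models with reinforcement learning has the potential to create goal-directed agents, but doing so requires stable and reliable reinforcement learning algorithms. This work introduces the LMRL-Gym benchmark for evaluating multi-turn RL for LLMs, together with an open-source research framework containing a basic toolkit for getting started on multi-turn RL with offline value-based and policy-based RL methods. The benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.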