Large Language Models (LLMs) encapsulate an extensive amount of world
knowledge, and this has enabled their application in various domains to improve
the performance of a variety of Natural Language Processing (NLP) tasks. This
has also facilitated a more accessible paradigm of conversation-based
interactions between humans and AI systems to solve intended problems. However,
one interesting avenue that shows untapped potential is the use of LLMs as
Reinforcement Learning (RL) agents to enable conversational RL problem solving.
Therefore, in this study, we explore the concept of formulating Markov Decision
Process-based RL problems as LLM prompting tasks. We demonstrate how LLMs can
be iteratively prompted to learn and optimize policies for specific RL tasks.
In addition, we leverage the introduced prompting technique for episode
simulation and Q-Learning, facilitated by LLMs. We then show the practicality
of our approach through two detailed case studies for "Research Scientist" and
"Legal Matter Intake" workflows.

使用大型语言模型作为强化学习代理以解决对话式强化学习问题，通过提出的提示技术，演示了如何迭代引导语言模型学习和优化特定强化学习任务的策略，并通过两个具体案例研究展示了该方法的实用性。

大规模语言模型的强化学习问题解决

Reinforcement Learning Problem Solving with Large Language Models

For sophisticated reinforcement learning (RL) systems to interact usefully
with real-world environments, we need to communicate complex goals to these
systems. In this work, we explore goals defined in terms of (non-expert) human
preferences between pairs of trajectory segments. We show that this approach
can effectively solve complex RL tasks without access to the reward function,
including Atari games and simulated robot locomotion, while providing feedback
on less than one percent of our agent's interactions with the environment. This
reduces the cost of human oversight far enough that it can be practically
applied to state-of-the-art RL systems. To demonstrate the flexibility of our
approach, we show that we can successfully train complex novel behaviors with
about an hour of human time. These behaviors and environments are considerably
more complex than any that have been previously learned from human feedback.

本文研究了使用非专家人类偏好来定义复杂目标的强化学习系统的方法，并且证明此方法可实现许多复杂的强化学习任务，包括 Atari 游戏和模拟机器人，同时也大幅降低了人类监督成本，以及展示了本方法的灵活性，并可成功使用较短时间完成复杂的新颖行为的训练，同时也采用了前人的人类反馈信息和环境。

深度强化学习从人类偏好中学习

Deep reinforcement learning from human preferences

There has been a recent explosion in the capabilities of game-playing
artificial intelligence. Many classes of RL tasks, from Atari games to motor
control to board games, are now solvable by fairly generic algorithms, based on
deep learning, that learn to play from experience with minimal knowledge of the
specific domain of interest. In this work, we will investigate the performance
of these methods on Super Smash Bros. Melee (SSBM), a popular console fighting
game. The SSBM environment has complex dynamics and partial observability,
making it challenging for human and machine alike. The multi-player aspect
poses an additional challenge, as the vast majority of recent advances in RL
have focused on single-agent environments. Nonetheless, we will show that it is
possible to train agents that are competitive against and even surpass human
professionals, a new result for the multi-player video game setting.

研究了在多人游戏环境中采用强化学习 (RL) 和深度学习的方法，成功训练了一个超越人类专业玩家的自适应智能体，成果在多人视频游戏环境中具有里程碑意义。

使用深度强化学习击败世界级的超级 Smash Bros

Beating the World's Best at Super Smash Bros. with Deep Reinforcement  Learning

We describe an iterative procedure for optimizing policies, with guaranteed
monotonic improvement. By making several approximations to the
theoretically-justified procedure, we develop a practical algorithm, called
Trust Region Policy Optimization (TRPO). This algorithm is similar to natural
policy gradient methods and is effective for optimizing large nonlinear
policies such as neural networks. Our experiments demonstrate its robust
performance on a wide variety of tasks: learning simulated robotic swimming,
hopping, and walking gaits; and playing Atari games using images of the screen
as input. Despite its approximations that deviate from the theory, TRPO tends
to give monotonic improvement, with little tuning of hyperparameters.

本文提出了一种名为 TRPO 的实用算法，通过优化政策来达到保证单调改善的目的，并通过一系列实验展示了其在学习模拟机器人的 Swimming、Hopping 以及 Walking，并使用屏幕图像玩 Atari 游戏等众多方面的优越表现。