Safety is critical to broadening the application of reinforcement learning
(RL). Often, we train RL agents in a controlled environment, such as a
laboratory, before deploying them in the real world. However, the real-world
target task might be unknown prior to deployment. Reward-free RL trains an
agent without the reward to adapt quickly once the reward is revealed. We
consider the constrained reward-free setting, where an agent (the guide) learns
to explore safely without the reward signal. This agent is trained in a
controlled environment, which allows unsafe interactions and still provides the
safety signal. After the target task is revealed, safety violations are not
allowed anymore. Thus, the guide is leveraged to compose a safe behaviour
policy. Drawing from transfer learning, we also regularize a target policy (the
student) towards the guide while the student is unreliable and gradually
eliminate the influence of the guide as training progresses. The empirical
analysis shows that this method can achieve safe transfer learning and helps
the student solve the target task faster.

安全是扩展强化学习应用的关键。我们提出了一种约束无奖励强化学习方法，通过在受控环境中训练引导智能体以安全探索，最终实现有效的安全传输学习，帮助学生机器人更快地解决目标任务。

引导安全探索的强化学习

Reinforcement Learning by Guided Safe Exploration

In constrained reinforcement learning (RL), a learning agent seeks to not
only optimize the overall reward but also satisfy the additional safety,
diversity, or budget constraints. Consequently, existing constrained RL
solutions require several new algorithmic ingredients that are notably
different from standard RL. On the other hand, reward-free RL is independently
developed in the unconstrained literature, which learns the transition dynamics
without using the reward information, and thus naturally capable of addressing
RL with multiple objectives under the common dynamics. This paper bridges
reward-free RL and constrained RL. Particularly, we propose a simple
meta-algorithm such that given any reward-free RL oracle, the approachability
and constrained RL problems can be directly solved with negligible overheads in
sample complexity. Utilizing the existing reward-free RL solvers, our framework
provides sharp sample complexity results for constrained RL in the tabular MDP
setting, matching the best existing results up to a factor of horizon
dependence; our framework directly extends to a setting of tabular two-player
Markov games, and gives a new result for constrained RL with linear function
approximation.

本文探讨奖励自由强化学习和受限制的强化学习之间的联系，在标记 MDP 设置中，我们提出了一种简单的元算法，利用现有的奖励自由 RL 解算器，对受限制的强化学习问题进行直接求解， 在现有结果的基础上匹配最佳结果，同时在线性函数近似下，我们直接将其扩展到标记二人马尔可夫博弈的设置中，并提供了一个新的受限制的 RL 结果。