Reinforcement Learning (RL) serves as a versatile framework for sequential
decision-making, finding applications across diverse domains such as robotics,
autonomous driving, recommendation systems, supply chain optimization, biology,
mechanics, and finance. The primary objective in these applications is to
maximize the average reward. Real-world scenarios often necessitate adherence
to specific constraints during the learning process.
This monograph focuses on the exploration of various model-based and
model-free approaches for Constrained RL within the context of average reward
Markov Decision Processes (MDPs). It begins with model-based strategies,
covering two foundational methods: optimism in the face of uncertainty and
posterior sampling. Subsequently, the
discussion transitions to parametrized model-free approaches, where the
primal-dual policy gradient-based algorithm is explored as a solution for
constrained MDPs. The monograph provides regret guarantees and analyzes
constraint violation for each of the discussed setups.
For the above exploration, we assume the underlying MDP to be ergodic.
Further, this monograph extends its discussion to encompass results tailored
for weakly communicating MDPs, thereby broadening the scope of its findings and
their relevance to a wider range of practical scenarios.
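
To make the primal-dual idea concrete, here is a minimal sketch on a hypothetical single-state MDP, where the average reward and average cost of a softmax policy are simply its expected per-step reward and cost. The toy problem, the function names, and the step sizes are assumptions for illustration and are not taken from the monograph. The primal step ascends the Lagrangian in the policy parameters, and the dual step raises the multiplier when the average cost exceeds the budget.

```python
import math

def softmax(theta):
    """Numerically stable softmax over a list of action preferences."""
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def primal_dual_step(theta, lam, rewards, costs, budget,
                     lr_theta=0.1, lr_lam=0.1):
    """One primal-dual update for: max avg reward s.t. avg cost <= budget.

    Hypothetical toy setting: one state, so averages are expectations
    under the softmax policy. Gradients are exact, not sampled.
    """
    pi = softmax(theta)
    # Per-action Lagrangian payoff: reward minus lambda-weighted cost.
    payoff = [r - lam * c for r, c in zip(rewards, costs)]
    baseline = sum(p * g for p, g in zip(pi, payoff))
    # Exact softmax policy gradient: dL/dtheta_a = pi_a * (payoff_a - baseline).
    new_theta = [th + lr_theta * p * (g - baseline)
                 for th, p, g in zip(theta, pi, payoff)]
    # Dual ascent on the constraint violation, projected onto lambda >= 0.
    avg_cost = sum(p * c for p, c in zip(pi, costs))
    new_lam = max(0.0, lam + lr_lam * (avg_cost - budget))
    return new_theta, new_lam

# Toy usage: action 0 pays more but incurs cost; the budget caps its use.
theta, lam = [0.0, 0.0], 0.0
for _ in range(1000):
    theta, lam = primal_dual_step(theta, lam, [1.0, 0.2], [1.0, 0.0], budget=0.5)
pi = softmax(theta)
```

Because the Lagrangian in this toy problem is bilinear in the policy and the multiplier, the last iterates of plain gradient ascent-descent can oscillate around the saddle point, which is one reason analyses of such methods often reason about averaged or regularized iterates.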

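This monograph systematically studies model-based and model-free methods for constrained Reinforcement Learning. It focuses on the foundational optimism and posterior sampling methods and on parametrized model-free approaches for average reward Markov Decision Processes, and it provides regret guarantees and constraint violation analysis for constrained MDPs. It also covers results for weakly communicating MDPs, broadening the applicability of the findings.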

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Deep Reinforcement Learning (RL) has demonstrated impressive results in
solving complex robotic tasks such as quadruped locomotion. Yet, current
solvers fail to produce efficient policies respecting hard constraints. In this
work, we advocate for integrating constraints into robot learning and present
Constraints as Terminations (CaT), a novel constrained RL algorithm. Departing
from classical constrained RL formulations, we reformulate constraints through
stochastic terminations during policy learning: any violation of a constraint
triggers a probability of terminating potential future rewards the RL agent
could attain. We propose an algorithmic approach to this formulation, by
minimally modifying widely used off-the-shelf RL algorithms in robot learning
(such as Proximal Policy Optimization). Our approach leads to excellent
constraint adherence without introducing undue complexity and computational
overhead, thus mitigating barriers to broader adoption. Through empirical
evaluation on the real quadruped robot Solo crossing challenging obstacles, we
demonstrate that CaT provides a compelling solution for incorporating
constraints into RL frameworks. Videos and code are available at
this https URL
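
As a rough illustration of treating a constraint violation as a stochastic termination, the sketch below maps a violation to a termination probability and computes discounted returns in which the future after each step survives with probability one minus that value. The linear clipping rule, the function names, and the parameters `max_violation` and `p_max` are assumptions made for this example; the abstract does not specify them, and this is not presented as the authors' implementation.

```python
def termination_probability(violation, max_violation, p_max=1.0):
    """Map a constraint violation to a probability in [0, p_max].

    Assumed rule (hypothetical): violations <= 0 mean the constraint is
    satisfied; positive violations scale linearly up to max_violation.
    """
    if max_violation <= 0:
        raise ValueError("max_violation must be positive")
    ratio = min(max(violation / max_violation, 0.0), 1.0)
    return p_max * ratio

def expected_returns(rewards, term_probs, gamma=0.99):
    """Discounted returns under stochastic termination, in expectation.

    The reward at step t is always collected; everything after step t is
    kept with probability 1 - term_probs[t]:
        G_t = r_t + gamma * (1 - delta_t) * G_{t+1}
    """
    returns = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * (1.0 - term_probs[t]) * future
        returns[t] = future
    return returns

# Toy usage: a full violation at step 1 cuts off the reward from step 2.
probs = [termination_probability(v, max_violation=1.0) for v in [-0.3, 2.0, 0.0]]
print(expected_returns([1.0, 1.0, 1.0], probs, gamma=1.0))
```

Returns of this form can feed the advantage estimates of an otherwise unchanged policy-gradient learner, which is consistent with the abstract's point that only a minimal modification to off-the-shelf algorithms is needed.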

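By treating constraints as termination conditions, we propose a new way to integrate constraints into deep Reinforcement Learning that adheres to constraints effectively without introducing undue complexity or computational overhead, and that shows promise for broad adoption.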

CaT: Constraints as Terminations for Legged Locomotion Reinforcement Learning

A popular framework for enforcing safe actions in Reinforcement Learning (RL)
is Constrained RL, where trajectory-based constraints on expected cost (or
other cost measures) are employed to enforce safety and, more importantly, are
enforced while maximizing expected reward. Most recent approaches for solving
Constrained RL convert the trajectory-based cost constraint into a surrogate
problem that can be solved using minor modifications to RL methods. A key
drawback of such approaches is that they over- or underestimate the cost
constraint at each state. Therefore, we provide an approach that does not
modify the trajectory-based cost constraint and instead imitates "good"
trajectories and avoids "bad" trajectories generated from incrementally
improving policies. We employ an oracle that utilizes a reward threshold
(which is varied with learning) and the overall cost constraint to label
trajectories as "good" or "bad". A key advantage of our approach is
that we are able to work from any starting policy or set of trajectories and
improve on it. In an exhaustive set of experiments, we demonstrate that our
approach is able to outperform top benchmark approaches for solving Constrained
RL problems, with respect to expected cost, CVaR cost, or even unknown cost
constraints.
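
The labeling oracle can be sketched as follows. The rule used here (a trajectory is "good" when its total reward reaches the current threshold and its total cost stays within the limit) is an assumption for illustration, as are the `Trajectory` type and the function names; the abstract states only that the oracle uses a reward threshold and the overall cost constraint.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    rewards: List[float]  # per-step rewards
    costs: List[float]    # per-step costs

def label_trajectory(traj: Trajectory, reward_threshold: float,
                     cost_limit: float) -> str:
    """Hypothetical oracle: 'good' iff reward is high enough and cost is within the limit."""
    total_reward = sum(traj.rewards)
    total_cost = sum(traj.costs)
    if total_reward >= reward_threshold and total_cost <= cost_limit:
        return "good"
    return "bad"

def split_trajectories(trajs: List[Trajectory], reward_threshold: float,
                       cost_limit: float) -> Tuple[List[Trajectory], List[Trajectory]]:
    """Partition trajectories into (good, bad) sets for imitation and avoidance."""
    good, bad = [], []
    for traj in trajs:
        label = label_trajectory(traj, reward_threshold, cost_limit)
        (good if label == "good" else bad).append(traj)
    return good, bad

# Toy usage: the second trajectory earns more reward but exceeds the cost limit.
batch = [Trajectory([1.0, 1.0], [0.1, 0.1]), Trajectory([2.0, 2.0], [1.0, 1.0])]
good, bad = split_trajectories(batch, reward_threshold=1.5, cost_limit=0.5)
```

Raising `reward_threshold` as learning progresses would shrink the "good" set toward higher-reward feasible trajectories, which matches the abstract's description of a threshold that varies with learning.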

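Using imitation learning and trajectory labeling, this work addresses constraints in Reinforcement Learning and demonstrates superior performance in experiments.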