Reinforcement Learning (RL) serves as a versatile framework for sequential
decision-making, finding applications across diverse domains such as robotics,
autonomous driving, recommendation systems, supply chain optimization, biology,
mechanics, and finance. The primary objective in these applications is to
maximize the average reward. Real-world scenarios often necessitate adherence
to specific constraints during the learning process.
This monograph focuses on the exploration of various model-based and
model-free approaches for Constrained RL within the context of average reward
Markov Decision Processes (MDPs). The investigation begins with model-based
strategies, covering two foundational methods: optimism in the face of
uncertainty and posterior sampling. Subsequently, the
discussion transitions to parametrized model-free approaches, where the
primal-dual policy gradient-based algorithm is explored as a solution for
constrained MDPs. The monograph provides regret guarantees and analyzes
constraint violation for each of the discussed setups.
For the above exploration, we assume the underlying MDP to be ergodic.
Further, this monograph extends its discussion to encompass results tailored
for weakly communicating MDPs, thereby broadening the scope of its findings and
their relevance to a wider range of practical scenarios.
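
As a rough illustration of the average reward objective with a constraint, the sketch below evaluates one fixed policy on a small ergodic chain. The transition matrix, the per-state reward and cost values, and the cost budget are all hypothetical numbers chosen for illustration; none of them come from the monograph, and the power-iteration evaluation is a generic technique rather than one of the monograph's algorithms.

```python
def stationary_distribution(P, iters=1000):
    """Approximate the stationary distribution of an ergodic chain by power iteration."""
    n = len(P)
    d = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iters):
        # One step of d <- d P (row vector times transition matrix).
        d = [sum(d[s] * P[s][t] for s in range(n)) for t in range(n)]
    return d


def long_run_average(P, values):
    """Long-run average of a per-state signal under the stationary distribution."""
    d = stationary_distribution(P)
    return sum(ds * v for ds, v in zip(d, values))


# Hypothetical 2-state chain induced by some fixed policy (assumed numbers).
P = [[0.9, 0.1],
     [0.5, 0.5]]
reward = [1.0, 0.0]   # assumed per-state reward under the policy
cost = [0.2, 1.0]     # assumed per-state cost under the policy
cost_budget = 0.4     # assumed constraint: average cost must not exceed this

avg_reward = long_run_average(P, reward)
avg_cost = long_run_average(P, cost)
feasible = avg_cost <= cost_budget
print(avg_reward, avg_cost, feasible)
```

Under ergodicity the long-run averages do not depend on the starting state, which is why a single stationary distribution suffices here. A constrained RL algorithm would search over policies for the largest average reward among those that keep the average cost within budget.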

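This monograph systematically studies model-based and model-free approaches to constrained reinforcement learning (RL). It focuses on two foundational methods for average reward Markov decision processes, optimism and posterior sampling, as well as parametrized model-free approaches, and provides regret guarantees and constraint violation analysis for constrained MDPs. It also covers results for weakly communicating MDPs, broadening the applicability of the findings.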

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Bayesian reinforcement learning (RL) offers a principled and elegant approach
for sequential decision making under uncertainty. Most notably, Bayesian agents
do not face an exploration/exploitation dilemma, a major pathology of
frequentist methods. A key challenge for Bayesian RL is the computational
complexity of learning Bayes-optimal policies, which is only tractable in toy
domains. In this paper we propose a novel model-free approach to address this
challenge. Rather than modelling uncertainty in high-dimensional state
transition distributions as model-based approaches do, we model uncertainty in
a one-dimensional Bellman operator. Our theoretical analysis reveals that
existing model-free approaches either do not propagate epistemic uncertainty
through the MDP or optimise over a set of contextual policies instead of all
history-conditioned policies. Both approximations yield policies that can be
arbitrarily Bayes-suboptimal. To overcome these issues, we introduce the
Bayesian exploration network (BEN) which uses normalising flows to model both
the aleatoric uncertainty (via density estimation) and epistemic uncertainty
(via variational inference) in the Bellman operator. In the limit of complete
optimisation, BEN learns true Bayes-optimal policies but, as in variational
expectation-maximisation, partial optimisation renders our approach tractable.
Empirical results demonstrate that BEN can learn true Bayes-optimal policies in
tasks where existing model-free approaches fail.
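
To make the aleatoric/epistemic distinction concrete, here is a minimal sketch that splits predictive variance using the law of total variance. It is not BEN's method: it swaps in an equally weighted set of hypothetical Gaussian predictions of a Bellman target, one per sampled parameter setting, in place of normalising flows and variational inference. The function name and all numbers are assumptions made for illustration.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(means, variances):
    """Split the variance of an equally weighted mixture into two parts.

    means[i], variances[i]: mean and variance of the target predicted under
    the i-th sampled parameter setting (hypothetical posterior samples).
    Returns (aleatoric, epistemic), whose sum is the total mixture variance.
    """
    aleatoric = fmean(variances)   # average noise within each prediction
    epistemic = pvariance(means)   # disagreement between the predictions
    return aleatoric, epistemic


# Hypothetical predictions of one Bellman target from three parameter samples.
means = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]

aleatoric, epistemic = decompose_uncertainty(means, variances)
print(aleatoric, epistemic, aleatoric + epistemic)
```

More data should shrink the epistemic term as the parameter samples come to agree, while the aleatoric term reflects randomness in the environment and remains.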

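Bayesian reinforcement learning offers a principled and elegant approach to sequential decision making under uncertainty, but its main challenge is the computational complexity of modelling uncertainty in high-dimensional state transition distributions. This paper proposes a novel model-free approach to address this challenge: it models uncertainty in a one-dimensional Bellman operator and introduces the Bayesian exploration network (BEN), which uses normalising flows to model aleatoric uncertainty in the Bellman operator and variational inference to model epistemic uncertainty. Experiments show that BEN can learn true Bayes-optimal policies in tasks where existing model-free approaches fail.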