In safety-critical RL settings, the inclusion of an additional cost function
is often favoured over the arduous task of modifying the reward function to
ensure the agent's safe behaviour. However, designing or evaluating such a cost
function can be prohibitively expensive. For instance, in the domain of
self-driving, designing a cost function that encompasses all unsafe behaviours
(e.g. aggressive lane changes) is inherently complex. In such scenarios, the
cost function can be learned from feedback collected offline in between
training rounds. This feedback can be system generated or elicited from a human
observing the training process. Previous approaches have not scaled to complex
environments and are restricted to state-level feedback, which can be expensive
to collect. To address this, we introduce an approach that scales to more
complex domains and extends beyond state-level feedback, thus reducing the
burden on the evaluator. Inferring the cost function in such
settings poses challenges, particularly in assigning credit to individual
states based on trajectory-level feedback. To address this, we propose a
surrogate objective that transforms the problem into a state-level supervised
classification task with noisy labels, which can be solved efficiently.
Additionally, it is often infeasible to collect feedback on every trajectory
generated by the agent; hence, two fundamental questions arise: (1) Which
trajectories should be presented to the human? and (2) How many trajectories
are necessary for effective learning? To address these questions, we introduce
\textit{novelty-based sampling}, which selectively involves the evaluator only
when the agent encounters a \textit{novel} trajectory. We showcase the
efficiency of our method through experimentation on several benchmark Safety
Gymnasium environments and realistic self-driving scenarios.
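
The abstract does not give the exact surrogate objective or novelty criterion,
so the following is only a minimal sketch under stated assumptions. It assumes
that a trajectory-level label (1 = unsafe, 0 = safe) is copied to every state
of that trajectory, producing noisy state-level labels for a binary classifier,
and that a trajectory counts as novel when at least one of its states lies
farther than a threshold from every previously seen state. The function names
`state_level_dataset` and `is_novel`, the Euclidean distance, and the threshold
are all hypothetical choices made for illustration.

```python
import numpy as np


def state_level_dataset(trajectories, trajectory_labels):
    """Copy each trajectory-level label onto all states of that trajectory.

    trajectories: list of arrays, each of shape (T_i, state_dim).
    trajectory_labels: list of 0/1 labels, one per trajectory (1 = unsafe).
    Returns (states, labels). The labels are noisy at the state level,
    because an unsafe trajectory may still contain many safe states.
    """
    states = np.concatenate(trajectories, axis=0)
    labels = np.concatenate([
        np.full(len(traj), label, dtype=np.float64)
        for traj, label in zip(trajectories, trajectory_labels)
    ])
    return states, labels


def is_novel(trajectory, seen_states, threshold):
    """Assumed novelty test (not the paper's definition).

    A trajectory is novel if any of its states is farther than `threshold`
    (Euclidean distance) from every state seen so far.
    """
    if len(seen_states) == 0:
        return True
    # Pairwise distances between trajectory states and seen states: (T, N).
    diffs = trajectory[:, None, :] - seen_states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    return bool((dists.min(axis=1) > threshold).any())
```

In this sketch, only trajectories for which `is_novel` returns `True` would be
shown to the evaluator, and the resulting `(states, labels)` pairs could be fed
to any standard binary classifier that serves as the learned cost function.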

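In safety-critical reinforcement learning settings, introducing an additional
cost function to ensure the agent's safe behaviour is favoured over the arduous
task of modifying the reward function. However, designing or evaluating such a
cost function can be very expensive. To address this, we propose an approach
that scales to complex environments and extends beyond state-level feedback,
thereby reducing the burden on the evaluator. We introduce a surrogate
objective that transforms the problem into a state-level supervised
classification task with noisy labels, which addresses the challenge of
assigning credit to individual states based on trajectory-level feedback. In
addition, because feedback cannot be collected on every trajectory generated by
the agent, we propose a novelty-based sampling method that selectively involves
the evaluator only when the agent encounters a "novel" trajectory. We
demonstrate the efficiency of our method through experiments on several
benchmark Safety Gymnasium environments and realistic self-driving scenarios.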

Safety through feedback in constrained reinforcement learning

Safety through feedback in Constrained RL

In contextual optimization, a decision-maker observes historical samples of
uncertain variables and associated concurrent covariates, without knowing their
joint distribution. Given an additional covariate observation, the goal is to
choose a decision that minimizes some operational cost. A prevalent issue here
is covariate shift, where the marginal distribution of the new covariate
differs from that of the historical samples, so that decision performance
varies when nonparametric or parametric estimators are used. To address this,
we propose a distributionally robust approach whose ambiguity set is the
intersection of two Wasserstein balls, each centered on a typical nonparametric
or parametric distribution estimator. Computationally, we establish a tractable
reformulation of this distributionally robust optimization problem.
Statistically, we provide guarantees for our Wasserstein ball intersection
approach under covariate shift by analyzing the measure concentration of the
estimators. Furthermore, to reduce computational complexity, we employ a
surrogate objective that maintains similar generalization guarantees. Through
synthetic and empirical case studies on income prediction and portfolio
optimization, we demonstrate the strong empirical performance of our proposed
models.
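
As a reading aid, the intersection-based ambiguity set described above can be
written out as follows. The notation is assumed rather than taken from the
source: $\hat{P}_{\mathrm{NP}}$ and $\hat{P}_{\mathrm{P}}$ stand for the
nonparametric and parametric estimators of the distribution of the uncertain
variable $Y$ given the new covariate, $W$ is a Wasserstein distance,
$\varepsilon_1, \varepsilon_2$ are the ball radii, $z$ is the decision, and
$c$ is the operational cost.

```latex
% Assumed notation, for illustration only.
\[
  \mathcal{A}
  = \bigl\{ Q : W(Q, \hat{P}_{\mathrm{NP}}) \le \varepsilon_1 \bigr\}
    \cap
    \bigl\{ Q : W(Q, \hat{P}_{\mathrm{P}}) \le \varepsilon_2 \bigr\},
  \qquad
  \min_{z \in \mathcal{Z}} \; \sup_{Q \in \mathcal{A}} \;
  \mathbb{E}_{Q}\bigl[ c(z, Y) \bigr].
\]
```

Under this reading, a distribution must be close to both estimators to be
considered, so the intersection is never larger than either ball on its own.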

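In contextual optimization, a decision-maker observes historical samples of
uncertain variables and the associated concurrent covariates without knowing
their joint distribution. Given an additional covariate observation, the goal
is to choose a decision that minimizes some operational cost. A prevalent issue
here is covariate shift, where the marginal distribution of the new covariate
differs from that of the historical samples, causing decision performance to
vary with nonparametric or parametric estimators. To address this, we propose a
distributionally robust approach that uses, as its ambiguity set, the
intersection of two Wasserstein balls, each centered on a typical nonparametric
or parametric distribution estimator. Computationally, we establish a tractable
reformulation of this distributionally robust optimization problem.
Statistically, by analyzing the measure concentration of the estimators, we
provide guarantees for our Wasserstein ball intersection approach under
covariate shift. Furthermore, to reduce computational complexity, we employ a
surrogate objective that maintains similar generalization guarantees. Through
synthetic and empirical case studies on income prediction and portfolio
optimization, we demonstrate the strong empirical performance of our proposed
models.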

A robust approach to contextual optimization under covariate shift: via intersecting Wasserstein balls

Contextual Optimization under Covariate Shift: A Robust Approach by Intersecting Wasserstein Balls

Proximal Policy Optimization (PPO) methods learn a policy by iteratively
performing multiple mini-batch optimization epochs of a surrogate objective
with one set of sampled data. Ratio clipping PPO is a popular variant that
clips the probability ratios between the target policy and the policy used to
collect samples. Ratio clipping yields a pessimistic estimate of the original
surrogate objective, and has been shown to be crucial for strong performance.
We show in this paper that such ratio clipping may not be a good option as it
can fail to effectively bound the ratios. Instead, one can directly optimize
the original surrogate objective for multiple epochs; the key is to find a
proper condition for stopping the optimization epochs early in each iteration.
Our theoretical analysis sheds light on how to determine when to stop the
optimization epochs, and we call the resulting algorithm Early Stopping Policy
Optimization (ESPO). We compare ESPO with PPO across many continuous control
tasks and show that ESPO significantly outperforms PPO. Furthermore, we show
that ESPO can be easily scaled up to distributed training with many workers,
delivering strong performance as well.
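
To make the contrast concrete, here is a minimal sketch that compares the
standard clipped surrogate with the unclipped one and adds an early-stopping
check. The abstract does not state the stopping condition, so the rule used
here (stop once the mean absolute deviation of the ratios from 1 exceeds a
threshold `delta`) is an assumption for illustration, as are the function names
and default values.

```python
import numpy as np


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate (to be maximized).

    Taking the elementwise minimum of the clipped and unclipped terms makes
    this a pessimistic estimate of the unclipped surrogate.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    return float(np.mean(np.minimum(unclipped, clipped)))


def unclipped_surrogate(ratio, advantage):
    """Original surrogate objective without ratio clipping."""
    return float(np.mean(ratio * advantage))


def should_stop_early(ratio, delta=0.25):
    """Assumed stopping rule (not confirmed by the abstract).

    Returns True once the mean absolute deviation of the probability
    ratios from 1 exceeds `delta`.
    """
    return float(np.mean(np.abs(ratio - 1.0))) > delta
```

In a training loop built on this sketch, each iteration would keep taking
mini-batch steps on `unclipped_surrogate` and break out of the epoch loop as
soon as `should_stop_early` returns `True` for the current ratios.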

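This paper examines the shortcomings of ratio-clipping PPO and proposes an
early-stopping policy optimization algorithm named ESPO. Comparisons on many
continuous control tasks show that ESPO significantly outperforms PPO and can
easily be scaled up to distributed training with many workers.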