Ensuring safe exploration of reinforcement learning (RL) agents during
training is a critical challenge in deploying RL agents in many
real-world scenarios. Training RL agents in unknown, black-box environments
poses an even greater safety risk when prior knowledge of the domain/task is
unavailable. We introduce ADVICE (Adaptive Shielding with a Contrastive
Autoencoder), a novel post-shielding technique that distinguishes safe and
unsafe features of state-action pairs during training, thus protecting the RL
agent from executing actions that yield potentially hazardous outcomes. Our
comprehensive experimental evaluation against state-of-the-art safe RL
exploration techniques demonstrates how ADVICE can significantly reduce safety
violations during training while maintaining a competitive outcome reward.
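
To make the post-shielding idea concrete, the following is a minimal sketch, not the ADVICE implementation. It assumes a hypothetical `is_safe(state, action)` predicate standing in for the learned contrastive-autoencoder classifier, and a hypothetical ranked list of alternative actions supplied by the agent. Falling back to the agent's own proposal when no alternative is judged safe is also an assumption made for illustration.

```python
from typing import Callable, Sequence, Tuple

State = Tuple[float, ...]
Action = int


def post_shield(
    state: State,
    proposed: Action,
    ranked_alternatives: Sequence[Action],
    is_safe: Callable[[State, Action], bool],
) -> Action:
    """Return the proposed action if judged safe, else the first safe alternative."""
    if is_safe(state, proposed):
        return proposed
    for action in ranked_alternatives:
        if is_safe(state, action):
            return action
    # Assumption: with no safe alternative, keep the agent's original choice.
    return proposed


def toy_is_safe(state: State, action: Action) -> bool:
    """Toy stand-in for a learned classifier: positions at or beyond 3.0 are hazardous."""
    return state[0] + action < 3.0


# The agent proposes moving right (+1) from position 2.0; the shield overrides it.
chosen = post_shield((2.0,), 1, [0, -1], toy_is_safe)
print(chosen)
```

In this toy setting the shield intervenes only when the proposed state-action pair is classified as unsafe, so the agent's behaviour is unchanged elsewhere.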

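During training, an adaptive shielding technique called ADVICE identifies the safe and unsafe features of state-action pairs, protecting the RL agent from executing actions that could lead to hazardous outcomes and effectively reducing the risk of safety violations.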

Adaptive Shielding for Safe Reinforcement Learning in Black-Box Environments

Safe Reinforcement Learning in Black-Box Environments via Adaptive Shielding

In this paper, we propose a model-based offline reinforcement learning method
that integrates count-based conservatism, named $\texttt{Count-MORL}$. Our
method utilizes the count estimates of state-action pairs to quantify model
estimation error, making it, to the best of our knowledge, the first algorithm
to demonstrate the efficacy of count-based conservatism in model-based offline
deep RL. For our proposed method, we first show that the estimation error is
inversely proportional to the frequency of state-action pairs. Secondly, we
demonstrate that the learned policy under the count-based conservative model
offers near-optimal performance guarantees. Through extensive numerical
experiments, we validate that $\texttt{Count-MORL}$ with hash code
implementation significantly outperforms existing offline RL algorithms on the
D4RL benchmark datasets. The code is available at
https://github.com/oh-lab/Count-MORL.
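
As a rough illustration of count-based conservatism with hash codes, the sketch below counts state-action pairs through a SimHash-style binary code and subtracts a count-dependent penalty from the reward. It is not taken from the Count-MORL repository: the class `HashCounter`, the function `penalized_reward`, the penalty form `beta / sqrt(n + 1)`, and all parameter values are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


class HashCounter:
    """Counts state-action pairs via the sign pattern of a random projection."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # Fixed random Gaussian projection; each row yields one bit of the code.
        self.projection = [
            [rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)
        ]
        self.counts = Counter()

    def code(self, state, action):
        x = list(state) + list(action)
        return tuple(
            sum(w * v for w, v in zip(row, x)) >= 0.0 for row in self.projection
        )

    def update(self, state, action):
        self.counts[self.code(state, action)] += 1

    def count(self, state, action):
        return self.counts[self.code(state, action)]


def penalized_reward(reward, n, beta=1.0):
    """Hypothetical conservative reward: the penalty shrinks as the count n grows."""
    return reward - beta / math.sqrt(n + 1)


counter = HashCounter(dim=3)
state, action = (0.5, -1.0), (0.2,)
for _ in range(3):
    counter.update(state, action)

n_seen = counter.count(state, action)
print(n_seen, penalized_reward(1.0, n_seen), penalized_reward(1.0, 0))
```

Pairs that appear often in the offline data receive a small penalty, while rarely seen pairs are penalized heavily, which discourages the learned policy from relying on poorly estimated parts of the model.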

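This paper proposes $\texttt{Count-MORL}$, a model-based offline reinforcement learning method that uses count estimates of state-action pairs to quantify model estimation error, and is the first to demonstrate the effectiveness of count-based conservatism in model-based offline deep RL. Extensive numerical experiments verify that $\texttt{Count-MORL}$ with a hash-code implementation significantly outperforms existing offline RL algorithms on the D4RL benchmark datasets.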

Model-Based Offline Reinforcement Learning with Count-Based Conservatism

Model-based Offline Reinforcement Learning with Count-based Conservatism

Discount regularization, using a shorter planning horizon when calculating
the optimal policy, is a popular choice to restrict planning to a less complex
set of policies when estimating an MDP from sparse or noisy data (Jiang et al.,
2015). It is commonly understood that discount regularization functions by
de-emphasizing or ignoring delayed effects. In this paper, we reveal an
alternate view of discount regularization that exposes unintended consequences.
We demonstrate that planning under a lower discount factor produces an
identical optimal policy to planning using any prior on the transition matrix
that has the same distribution for all states and actions. In fact, it
functions like a prior with stronger regularization on state-action pairs with
more transition data. This leads to poor performance when the transition matrix
is estimated from data sets with uneven amounts of data across state-action
pairs. Our equivalence theorem leads to an explicit formula to set
regularization parameters locally for individual state-action pairs rather than
globally. We demonstrate the failures of discount regularization and how we
remedy them using our state-action-specific method across simple empirical
examples as well as a medical cancer simulator.
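
The equivalence between a lower discount factor and a shared prior can be checked numerically on a small tabular MDP. The sketch below is an illustration under stated assumptions, not the paper's code: the prior is modelled as mixing every transition row with one shared distribution `mu` at weight `eps`, and the relation `eps = 1 - gamma_p / gamma` is derived here for this mixing form rather than quoted from the source. The MDP itself is random and hypothetical.

```python
import random


def value_iteration(T, R, gamma, iters=2000):
    """Return the greedy policy of a tabular MDP after value iteration.

    T[s][a] is a list of next-state probabilities; R[s][a] is a scalar reward.
    """
    n_states, n_actions = len(R), len(R[0])

    def q_values(s, V):
        return [
            R[s][a] + gamma * sum(p * v for p, v in zip(T[s][a], V))
            for a in range(n_actions)
        ]

    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(q_values(s, V)) for s in range(n_states)]
    policy = []
    for s in range(n_states):
        q = q_values(s, V)
        policy.append(q.index(max(q)))
    return policy


def random_distribution(rng, n):
    weights = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(weights)
    return [w / total for w in weights]


def mix_with_shared_distribution(T, mu, eps):
    """Blend every transition row with the same distribution mu (assumed prior form)."""
    return [
        [[(1 - eps) * p + eps * m for p, m in zip(row, mu)] for row in rows]
        for rows in T
    ]


rng = random.Random(0)
n_states, n_actions = 5, 3
T = [[random_distribution(rng, n_states) for _ in range(n_actions)]
     for _ in range(n_states)]
R = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
mu = random_distribution(rng, n_states)

gamma, gamma_p = 0.95, 0.80
eps = 1 - gamma_p / gamma  # chosen so that gamma * (1 - eps) equals gamma_p

policy_discount = value_iteration(T, R, gamma_p)
policy_prior = value_iteration(mix_with_shared_distribution(T, mu, eps), R, gamma)
print(policy_discount, policy_prior)
```

Under this mixing form, the shared distribution adds the same constant to the value of every state-action pair, so the greedy action in each state is unchanged and the two printed policies should match, barring exact ties between actions.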

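This paper introduces a method for setting regularization parameters per state-action pair. It addresses the shortcomings of discount regularization in planning and copes better with data sets in which the amount of data is uneven across state-action pairs.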