Safe reinforcement learning (RL) trains a constraint-satisfying policy by
interacting with the environment. We aim to tackle a more challenging problem:
learning a safe policy from an offline dataset. We study the offline safe RL
problem from a novel multi-objective optimization perspective and propose the
$\epsilon$-reducible concept to characterize problem difficulties. The inherent
trade-offs between safety and task performance inspire us to propose the
constrained decision transformer (CDT) approach, which can dynamically adjust
the trade-offs during deployment. Extensive experiments show the advantages of
the proposed method in learning an adaptive, safe, robust, and high-reward
policy. CDT outperforms its variants and strong offline safe RL baselines by a
large margin with the same hyperparameters across all tasks, while keeping the
zero-shot adaptation capability to different constraint thresholds, making our
approach more suitable for real-world RL under constraints.

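This paper studies how to learn a safe policy from an offline dataset. It takes a multi-objective optimization perspective and uses the $\epsilon$-reducible concept to quantify problem difficulty. Observing a trade-off between safety and task performance, it proposes the constrained decision transformer (CDT) method and evaluates it experimentally. The results show that the proposed method is safer and achieves higher reward than other methods across the tasks tested.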