We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR), which can robustly learn policies that improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, ARMOR is designed to optimize the worst-case performance relative to the baseline when facing uncertainty. In theory, we prove that the policy learned by ARMOR never degrades the performance of the baseline policy for any admissible hyperparameter, and that it can learn to compete with the best policy within data coverage when the hyperparameter is well tuned and the baseline policy is supported by the data. This robust policy improvement property makes ARMOR especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit learning can bring.
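To make relative pessimism concrete, the following is a minimal toy sketch rather than the authors' algorithm. It assumes a small, hypothetical finite set of candidate models and policies, with each policy's return under each model given in a table. The names `relative_pessimism_policy`, `model_a`, `model_b`, `ref`, `cautious`, and `risky`, and all numeric values, are illustrative assumptions. The sketch selects the policy whose worst-case return gap over the baseline, taken across all candidate models, is largest.

```python
from typing import Dict, Tuple

# values[model][policy] is the return of `policy` under `model`.
# All models, policies, and numbers here are hypothetical.
Values = Dict[str, Dict[str, float]]


def relative_pessimism_policy(values: Values, reference: str) -> Tuple[str, float]:
    """Return the policy maximizing the worst-case gap over the reference policy."""
    policies = list(next(iter(values.values())).keys())

    def worst_case_gap(policy: str) -> float:
        # An adversary picks the model that makes `policy` look worst
        # relative to the reference policy.
        return min(v[policy] - v[reference] for v in values.values())

    best = max(policies, key=worst_case_gap)
    return best, worst_case_gap(best)


values: Values = {
    "model_a": {"ref": 1.0, "cautious": 1.5, "risky": 3.0},
    "model_b": {"ref": 1.0, "cautious": 1.25, "risky": 0.25},
}

best_policy, gap = relative_pessimism_policy(values, reference="ref")
print(best_policy, gap)
```

Because the baseline itself is a candidate and its gap is zero under every model, the selected policy's worst-case gap can never be negative in this toy setting. This mirrors the intuition behind the no-degradation property: if the true environment is among the candidate models, the chosen policy is at least as good as the baseline. Here `risky` looks best under `model_a` but is rejected because it underperforms the baseline under `model_b`.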

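We propose ARMOR, a new model-based offline RL framework that optimizes the worst-case relative performance under uncertainty and achieves robust policy improvement, never degrading the baseline policy for any admissible hyperparameter, which makes it especially suitable for building real-world learning systems.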

ARMOR: a model-based framework for improving an arbitrary baseline policy using offline data