We present a lower bound on policy performance for model-based offline
reinforcement learning that explicitly captures dynamics model misspecification
and distribution mismatch, and we propose an empirical algorithm for optimal
offline policy selection. Theoretically, we prove a novel safe policy improvement
theorem by establishing pessimistic approximations to the value function. Our key
insight is to select jointly over dynamics models and policies: as
long as a dynamics model can accurately represent the dynamics of the
state-action pairs visited by a given policy, it is possible to approximate the
value of that particular policy. We analyze our lower bound in the linear
quadratic regulator (LQR) setting and show that it performs competitively with
previous lower bounds for policy selection across a set of D4RL tasks.
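
The joint selection over dynamics models and policies can be read schematically as follows. This is an illustrative sketch, not the bound proved in the paper: the symbols \Pi (candidate policies), \mathcal{M} (candidate dynamics models), \hat{V}^{\pi}_{M} (value of \pi estimated under model M), and \epsilon(M, \pi) (a penalty for the error of M on the state-action pairs visited by \pi) are notation assumed here.

\[
(\hat{\pi}, \hat{M}) \in \arg\max_{\pi \in \Pi,\; M \in \mathcal{M}}
\left[ \hat{V}^{\pi}_{M} - \epsilon(M, \pi) \right]
\]

Under this reading, a model that is inaccurate in parts of the state space can still yield a tight pessimistic estimate for a policy, provided the penalty \epsilon(M, \pi) is small on the region that policy visits.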
