In offline model-based reinforcement learning (offline MBRL), we learn a dynamic model from historically collected data, and subsequently utilize the learned model and fixed datasets for policy learning, without further interacting with the environment. Offline MBRL algorithms can improve the efficiency and stability of policy learning over the model-free algorithms. However, in most of the existing offline MBRL algorithms, the learning objectives for the dynamic models and the policies are isolated from each other. Such an objective mismatch may lead to inferior performance of the learned agents. In this paper, we address this issue by developing an iterative offline MBRL framework, where we maximize a lower bound of the true expected return, by alternating between dynamic-model training and policy learning. With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets. Source code is publicly released.

本文提出了一种迭代离线模型学习(MBRL)框架，其中通过交替进行动态模型训练和策略学习来最大化真实预期回报的下限，从而解决了动态模型和策略学习之间的目标不匹配问题，从而在广泛的连续控制离线强化学习数据集上实现了竞争性能。

交替离线模型训练和策略学习的统一框架