Current model-based reinforcement learning approaches use the model simply as a learned black-box simulator to augment the data for policy optimization or value function learning. In this paper, we show how to make more effective use of the model by exploiting its differentiability. We construct a policy optimization algorithm that uses the pathwise derivative of the learned model and policy across future timesteps. Instabilities of learning across many timesteps are prevented by using a terminal value function, learning the policy in an actor-critic fashion. Furthermore, we present a derivation of the monotonic improvement of our objective in terms of the gradient error in the model and value function. We show that our approach (i) is consistently more sample efficient than existing state-of-the-art model-based algorithms, (ii) matches the asymptotic performance of model-free algorithms, and (iii) scales to long horizons, a regime where past model-based approaches have typically struggled.

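This paper introduces a new model-based reinforcement learning algorithm that builds its policy optimization on the pathwise derivative of the learned model and policy across multiple timesteps. The policy is learned in an actor-critic fashion, with a terminal value function preventing the instabilities that arise when learning across many timesteps. The results show that the method is consistently more sample efficient than existing state-of-the-art model-based algorithms, matches the asymptotic performance of model-free algorithms, and scales to long horizons.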
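
To make the idea of a pathwise derivative with a terminal value function concrete, the following is a minimal sketch under strong simplifying assumptions, not the paper's implementation. It assumes a deterministic scalar setting with a hypothetical linear model `s' = A*s + B*a`, a linear policy `a = theta*s`, a quadratic reward `-(s^2 + c*a^2)`, and a quadratic terminal value `-p*s^2`; all of these names and constants are illustrative. A hand-written reverse pass stands in for the automatic differentiation that a learned neural model and policy would use.

```python
def rollout(theta, s0, horizon, A=0.9, B=0.5):
    """Roll the hypothetical linear model forward under the policy a = theta * s."""
    states = [s0]
    for _ in range(horizon):
        s = states[-1]
        states.append(A * s + B * theta * s)
    return states


def objective(theta, s0, horizon, gamma=0.9, A=0.9, B=0.5, c=0.1, p=1.0):
    """Discounted H-step model return plus a discounted terminal value."""
    states = rollout(theta, s0, horizon, A, B)
    total = 0.0
    for t in range(horizon):
        s = states[t]
        a = theta * s
        total -= gamma ** t * (s * s + c * a * a)   # reward r(s, a) = -(s^2 + c a^2)
    s_H = states[horizon]
    return total - gamma ** horizon * p * s_H * s_H  # terminal value V(s) = -p s^2


def pathwise_gradient(theta, s0, horizon, gamma=0.9, A=0.9, B=0.5, c=0.1, p=1.0):
    """d(objective)/d(theta), backpropagated through model, policy, and value."""
    states = rollout(theta, s0, horizon, A, B)
    # Adjoint of the final state comes from the terminal value function.
    adj_s = gamma ** horizon * (-2.0 * p * states[horizon])
    grad = 0.0
    for t in reversed(range(horizon)):
        s = states[t]
        a = theta * s
        adj_next = adj_s
        # Action adjoint: direct reward term plus the path through the model.
        adj_a = gamma ** t * (-2.0 * c * a) + adj_next * B
        grad += adj_a * s                            # da/dtheta = s
        # State adjoint: reward term, model path, and path through the policy.
        adj_s = gamma ** t * (-2.0 * s) + adj_next * A + adj_a * theta
    return grad


if __name__ == "__main__":
    theta, s0, H = 0.0, 1.0, 5
    g = pathwise_gradient(theta, s0, H)
    # One step of gradient ascent on the policy parameter.
    print(objective(theta, s0, H), objective(theta + 0.01 * g, s0, H))
```

In this sketch the terminal value supplies the initial adjoint of the backward pass, so the gradient accounts for returns beyond the horizon `H` without differentiating through a longer model rollout.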

Model-Augmented Actor-Critic: Backpropagating through Paths