Reinforcement learning provides a mathematical framework for learning-based control, whose success largely depends on the amount of data it can utilize. The efficient utilization of historical trajectories obtained from previous policies is essential for expediting policy optimization. Empirical evidence has shown that policy gradient methods based on importance sampling work well. However, existing literature often neglect the interdependence between trajectories from different iterations, and the good empirical performance lacks a rigorous theoretical justification. In this paper, we study a variant of the natural policy gradient method with reusing historical trajectories via importance sampling. We show that the bias of the proposed estimator of the gradient is asymptotically negligible, the resultant algorithm is convergent, and reusing past trajectories helps improve the convergence rate. We further apply the proposed estimator to popular policy optimization algorithms such as trust region policy optimization. Our theoretical results are verified on classical benchmarks.

本文研究了一种重用历史轨迹的自然策略梯度方法变体，并证明了所提梯度估计器的偏差在渐近上是可以忽略的，算法收敛且重用过去的轨迹有助于提高收敛速度。我们进一步将所提估计器应用于流行的策略优化算法，如信任区域策略优化，并在经典基准测试上验证了我们的理论结果。

通过重要性采样在自然策略梯度中重新使用历史轨迹：收敛性和收敛速率