Precisely Determining Policy Improvement Bounds for MDPs

Jul, 2021

Refined Policy Improvement Bounds for MDPs

J. G. Dai, Mark Gluzman

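TL;DR: This paper proposes a new, continuous policy improvement bound that fixes the problem existing bounds have as the discount factor approaches 1, broadening the applicability of the TRPO algorithm to long-run average rewards.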

Abstract

The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-regio