BriefGPT.xyz
Jul, 2021
Precisely Determining Policy Improvement Bounds for MDPs
Refined Policy Improvement Bounds for MDPs
J. G. Dai, Mark Gluzman
TL;DR
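This paper proposes a new, refined policy improvement bound that is continuous in the discount factor. It resolves the problem that current bounds exhibit as the discount factor approaches 1, and it extends the applicability of the TRPO algorithm to long-run average rewards.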
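To make the problem near a discount factor of 1 concrete, here is a minimal sketch. It assumes the "current bounds" have the form of the TRPO bound in Schulman et al. (2015), whose penalty coefficient is C = 4εγ/(1−γ)². The function name and the default ε = 1 are illustrative choices and do not come from this page.

```python
def trpo_penalty_coefficient(gamma: float, epsilon: float = 1.0) -> float:
    """Penalty coefficient C = 4 * epsilon * gamma / (1 - gamma)^2.

    Assumption: this is the coefficient of the policy improvement bound in
    Schulman et al. (2015), where epsilon is the maximum absolute advantage.
    The refined bound proposed in the paper is not reproduced here.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    return 4.0 * epsilon * gamma / (1.0 - gamma) ** 2


# The coefficient grows without limit as gamma approaches 1, so the
# guaranteed improvement becomes vacuous in that regime.
for gamma in (0.9, 0.99, 0.999):
    print(f"gamma={gamma}: C={trpo_penalty_coefficient(gamma):.1f}")
```

Since the coefficient scales as 1/(1−γ)², each additional "9" in γ multiplies the penalty by roughly 100. This is the sense in which such a bound fails to carry over to the long-run average reward setting.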
Abstract
The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-regio
→