TL;DR本文提出了第一个适用于off-policy learning的policy gradient定理,并通过使用emphatic weightings导出了简化的梯度公式,并使用Actor Critic with Emphatic weightings (ACE)算法验证了该定理的正确性。
Abstract
policy gradient methods are widely used for control in reinforcement learning, particularly for the continuous action setting. There have been a host of theoretically sound algorithms proposed for the on-policy s