Generalized Off-Policy Actor-Critic

Mar, 2019

Generalized Off-Policy Actor-Critic

Shangtong Zhang, Wendelin Boehmer, Shimon Whiteson

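TL;DR: The paper proposes a new objective function, the counterfactual objective, to address problems with off-policy policy gradient algorithms in the continuing reinforcement learning setting. From this objective the authors derive the Generalized Off-Policy Policy Gradient Theorem and develop the Generalized Off-Policy Actor-Critic (Geoff-PAC) algorithm, which outperforms existing algorithms in simulated robot experiments.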

Abstract

We propose a new objective, the counterfactual objective, unifying existing objectives for off-policy policy gradient algorithms in the continuin