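TL;DR: This paper aims to improve value estimation in Deep Policy Gradient primitives, and thereby sample efficiency and returns, by introducing a value function search algorithm that uses perturbed value networks to search for a better approximation.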
Abstract
Deep policy gradient (PG) algorithms employ value networks to drive the learning of parameterized policies and reduce the variance of the gradient estimates. However, →