ReParameterization (RP) Policy Gradient Methods (PGMs) have been widely
adopted for continuous control tasks in robotics and computer graphics.
However, recent studies have revealed that, when applied to long-term
reinforcement learning problems, model-based RP PGMs may experience chaotic and
non-smooth optimization landscapes with exploding gradient variance, which
leads to slow convergence. This is in contrast to the conventional belief that
reparameterization methods have low gradient estimation variance in problems
such as training deep generative models. To understand this phenomenon, we
examine model-based RP PGMs theoretically and look for remedies to these
optimization difficulties. Specifically, we analyze the
convergence of the model-based RP PGMs and pinpoint the smoothness of function
approximators as a major factor that affects the quality of gradient
estimation. Based on our analysis, we propose a spectral normalization method
to mitigate the exploding variance issue caused by long model unrolls. Our
experimental results demonstrate that proper normalization significantly
reduces the gradient variance of model-based RP PGMs. As a result, the
performance of the proposed method is comparable or superior to that of other
gradient estimators, such as the Likelihood Ratio (LR) gradient estimator. Our code is
available at this https URL
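
As a rough illustration of the spectral normalization idea mentioned above, the following Python sketch estimates the largest singular value of a weight matrix by power iteration and rescales the matrix to unit spectral norm. It is a minimal toy under stated assumptions, not the authors' implementation: the function names (`spectral_norm`, `spectral_normalize`, `unrolled_gain`) and the purely linear "model" are hypothetical, chosen only to show why bounding the spectral norm keeps quantities propagated through a long unroll from growing exponentially.

```python
import math
import random


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def spectral_norm(W, n_iters=200, seed=0):
    """Estimate the largest singular value of W by power iteration on W^T W."""
    rng = random.Random(seed)
    Wt = transpose(W)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(W[0]))]
    n = norm(v)
    v = [x / n for x in v]
    for _ in range(n_iters):
        v = matvec(Wt, matvec(W, v))
        n = norm(v)
        if n == 0.0:  # W maps the iterate to zero; treat the norm as zero
            return 0.0
        v = [x / n for x in v]
    # v approximates the top right-singular vector, so ||W v|| approximates sigma_max
    return norm(matvec(W, v))


def spectral_normalize(W):
    """Rescale W so that its estimated spectral norm is 1 (W unchanged if it is 0)."""
    sigma = spectral_norm(W)
    if sigma == 0.0:
        return [row[:] for row in W]
    return [[w / sigma for w in row] for row in W]


def unrolled_gain(W, v0, horizon):
    """Norm of v0 after applying the linear map W `horizon` times (toy unroll)."""
    v = list(v0)
    for _ in range(horizon):
        v = matvec(W, v)
    return norm(v)
```

For example, with `W = [[2.0, 0.0], [0.0, 0.5]]`, repeatedly applying `W` to `[1.0, 0.0]` doubles its norm at every step, whereas applying `spectral_normalize(W)` keeps the norm close to 1 regardless of the horizon.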

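When model-based ReParameterization Policy Gradient Methods are applied to long-term reinforcement learning problems, they may run into optimization difficulties caused by exploding gradient variance. By analyzing their convergence and the smoothness of the function approximators, we propose a spectral normalization method to mitigate the variance problem caused by long model unrolls. Experimental results show that proper normalization significantly reduces the gradient variance of model-based ReParameterization Policy Gradient Methods. Compared with other gradient estimators, such as the Likelihood Ratio gradient estimator, our method performs comparably or better.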