Understanding the learning dynamics of the policy is important for unveiling the
mysteries of Reinforcement Learning (RL). It is especially crucial yet
challenging for Deep RL (DRL), where it could lead to remedies for notorious
issues such as sample inefficiency and learning instability. In this paper, we
study how the policy networks of typical DRL agents evolve during the learning
process by empirically investigating several kinds of temporal change for each
policy parameter. On typical MuJoCo and DeepMind Control Suite (DMC)
benchmarks, we find common phenomena for TD3 and RAD agents: 1) the activity of
policy network parameters is highly asymmetric, and policy networks advance
monotonically along very few major parameter directions; 2) severe detours
occur in parameter updates, and harmonic-like changes are observed for all minor
parameter directions. By performing a novel temporal SVD along the policy
learning path, we identify the major and minor parameter directions as the
columns of the right unitary matrix associated with dominant and insignificant
singular values, respectively. Driven by these discoveries, we propose a simple
and
effective method, called Policy Path Trimming and Boosting (PPTB), as a general
plug-in improvement to DRL algorithms. The key idea of PPTB is to periodically
trim the policy learning path by canceling the policy updates in minor
parameter directions, while boosting the learning path by encouraging advances
in major directions. In experiments, PPTB brings general and significant
performance improvements when combined with TD3 and RAD in MuJoCo and DMC
environments, respectively.
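
To make the temporal SVD and the trim-and-boost idea concrete, the following is
a minimal NumPy sketch, not the authors' implementation. It assumes the policy
learning path is stored as a matrix with one flattened parameter vector per
row. The function names, the `n_major` and `boost` parameters, the choice to
reset minor coordinates to their value at the start of the window, and the
linear extrapolation used for boosting are all illustrative assumptions.

```python
import numpy as np


def temporal_svd(path):
    """Decompose a policy learning path of shape (T, D).

    Each row of `path` is one flattened policy parameter snapshot.
    Returns U, S, Vt with path = U @ diag(S) @ Vt. The rows of Vt
    (columns of the right unitary matrix V) are parameter directions,
    ordered from the largest to the smallest singular value.
    """
    return np.linalg.svd(path, full_matrices=False)


def trim_and_boost(path, n_major=2, boost=0.1):
    """Return a modified copy of the latest parameter vector (hypothetical helper).

    Assumption: "trimming" resets the coordinates along minor directions
    to their values at the first snapshot of the window, and "boosting"
    extrapolates the net advance along the first `n_major` directions.
    """
    U, S, Vt = temporal_svd(path)
    coords = U * S                 # coordinates of every snapshot along each direction
    latest = coords[-1].copy()     # coordinates of the most recent snapshot

    # Trim: cancel the net update accumulated along the minor directions.
    latest[n_major:] = coords[0, n_major:]

    # Boost: move further along the major directions.
    latest[:n_major] += boost * (coords[-1, :n_major] - coords[0, :n_major])

    # Map the coordinates back to parameter space.
    return latest @ Vt
```

In such a sketch, `trim_and_boost` would be called periodically on a buffer of
recent policy snapshots, and the returned vector would be written back into the
policy network. How often to call it and how many directions count as major
are left open here.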

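This paper studies how the policy networks of deep reinforcement learning agents evolve during learning, finds that parameter updates follow major and minor directions, proposes a simple and effective method based on this finding, Policy Path Trimming and Boosting (PPTB), and shows that combining it with TD3 and RAD in MuJoCo and DMC environments improves performance.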