We study the roots of algorithmic progress in deep policy gradient algorithms
through a case study on two popular algorithms: Proximal Policy Optimization
(PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate
the consequences of "code-level optimizations": algorithm augmentations found
only in implementations or described as auxiliary details to the core
algorithm. Seemingly of secondary importance, such optimizations turn out to
have a major impact on agent behavior. Our results show that they (a) are
responsible for most of PPO's gain in cumulative reward over TRPO, and (b)
fundamentally change how RL methods function. These insights show the
difficulty and importance of attributing performance gains in deep
reinforcement learning. Code for reproducing our results is available at
this https URL.

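Through a case study of two popular algorithms (PPO and TRPO), we study the
roots of algorithmic progress in deep policy gradient algorithms and
investigate the consequences of "code-level optimizations": augmentations that
appear only in implementations or are described as auxiliary details to the
core algorithm. They seem to be of secondary importance but in fact strongly
affect agent behavior. Our results show that they (a) account for most of PPO's
gain in cumulative reward over TRPO, and (b) fundamentally change how RL
methods function.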
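
To make the distinction between a core algorithm and a code-level optimization concrete, the sketch below places PPO's clipped surrogate objective next to value-function clipping, which is used here as one example of an augmentation commonly found in open-source PPO implementations. The abstract itself does not name specific optimizations, so this choice is an assumption, as are the function names and the default `eps=0.2`. It is an illustration in plain Python, not the authors' code.

```python
def ppo_clipped_surrogate(ratios, advantages, eps=0.2):
    """Core PPO objective: mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios: probability ratios pi_new(a|s) / pi_old(a|s), one per sample.
    advantages: advantage estimates, one per sample.
    """
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped_r = max(1.0 - eps, min(1.0 + eps, r))
        total += min(r * a, clipped_r * a)
    return total / len(ratios)


def clipped_value_loss(values, old_values, targets, eps=0.2):
    """Illustrative code-level optimization: value-function clipping.

    The new value prediction is kept within eps of the previous one, and the
    larger of the clipped and unclipped squared errors is used. This mirrors
    a common pattern in PPO implementations (an assumption for this sketch).
    """
    total = 0.0
    for v, v_old, t in zip(values, old_values, targets):
        v_clipped = v_old + max(-eps, min(eps, v - v_old))
        total += max((v - t) ** 2, (v_clipped - t) ** 2)
    return total / len(values)
```

Only the first function is part of PPO's stated objective. The second changes how the value network is trained without appearing in the headline algorithm, which is the kind of detail whose effect the study sets out to measure.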

Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO

We study how the behavior of deep policy gradient algorithms reflects the
conceptual framework motivating their development. To this end, we propose a
fine-grained analysis of state-of-the-art methods based on key elements of this
framework: gradient estimation, value prediction, and optimization landscapes.
Our results show that the behavior of deep policy gradient algorithms often
deviates from what their motivating framework would predict: the surrogate
objective does not match the true reward landscape, learned value estimators
fail to fit the true value function, and gradient estimates poorly correlate
with the "true" gradient. The mismatch between predicted and empirical behavior
we uncover highlights our poor understanding of current methods, and indicates
the need to move beyond current benchmark-centric evaluation methods.

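We study how the behavior of deep policy gradient algorithms reflects the
conceptual framework motivating their development, and propose a fine-grained
analysis of state-of-the-art methods. The results show that the behavior of
these algorithms often deviates from what their motivating framework would
predict, which highlights our poor understanding of current methods and
indicates the need to move beyond current benchmark-centric evaluation.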
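
One way to picture the claim about gradient estimates is to measure the cosine similarity between a sampled estimate and a reference gradient as the sample count grows. The sketch below does this on a toy problem. The Gaussian noise model, the 50-dimensional all-ones "true" gradient, and the helper names `cosine_similarity` and `estimate_gradient` are assumptions made for illustration, not the experimental setup of the paper.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def estimate_gradient(true_grad, num_samples, noise_std, rng):
    """Average num_samples noisy per-sample gradients.

    Each per-sample gradient is the true gradient plus independent Gaussian
    noise (a toy stand-in for the variance of policy gradient samples).
    """
    dim = len(true_grad)
    sums = [0.0] * dim
    for _ in range(num_samples):
        for i in range(dim):
            sums[i] += true_grad[i] + rng.gauss(0.0, noise_std)
    return [s / num_samples for s in sums]


rng = random.Random(0)       # fixed seed so runs are repeatable
true_grad = [1.0] * 50       # hypothetical reference gradient
few = estimate_gradient(true_grad, 10, 5.0, rng)
many = estimate_gradient(true_grad, 2000, 5.0, rng)
print("10 samples:  ", cosine_similarity(few, true_grad))
print("2000 samples:", cosine_similarity(many, true_grad))
```

In this toy setting the estimate built from few samples points in a noticeably different direction from the reference, while the large-sample estimate is closely aligned with it. How many samples a real agent needs to reach that regime is an empirical question that a sketch like this does not answer.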