We study the roots of algorithmic progress in deep policy gradient algorithms
through a case study on two popular algorithms: Proximal Policy Optimization
(PPO) and Trust Region Policy Optimization (TRPO). Specifically, we investigate
the consequences of "code-level optimizations": algorithm augmentations found
only in implementations or described as auxiliary details to the core
algorithm. Seemingly of secondary importance, such optimizations turn out to
have a major impact on agent behavior. Our results show that they (a) are
responsible for most of PPO's gain in cumulative reward over TRPO, and (b)
fundamentally change how RL methods function. These insights show the
difficulty and importance of attributing performance gains in deep
reinforcement learning. Code for reproducing our results is available at
this https URL.

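Through a case study of two popular algorithms (PPO and TRPO), we study the roots of algorithmic progress in deep policy gradient algorithms and investigate the consequences of "code-level optimizations": augmentations that appear only in implementations or are described as auxiliary details of the core algorithm. They seem to be of secondary importance, yet they strongly affect agent behavior. Our results show that they (a) account for most of PPO's gain in cumulative reward over TRPO, and (b) fundamentally change how RL methods function.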
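
To make the distinction between a core algorithm and a code-level optimization concrete, here is a minimal Python sketch. It contrasts PPO's clipped surrogate term with a clipped value loss, which is one augmentation found in some open-source PPO implementations. The abstract itself names no specific optimization, so this choice of example, the function names, the exact form of the clipped value loss, and the default `eps=0.2` are all assumptions made for illustration.

```python
# Illustrative sketch only: per-sample quantities in plain Python.
# Function names, the eps default, and the clipped value-loss form are
# assumptions for this example, not taken from the abstract.

def clipped_surrogate(ratio, advantage, eps=0.2):
    """Core PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def value_loss_unclipped(value, target):
    """Plain squared-error value loss."""
    return (value - target) ** 2


def value_loss_clipped(value, old_value, target, eps=0.2):
    """Squared-error value loss where the new estimate is also evaluated
    after being clipped to within eps of the previous estimate; the larger
    of the two errors is used (assumed form of this optimization)."""
    clipped_value = old_value + max(-eps, min(eps, value - old_value))
    return max((value - target) ** 2, (clipped_value - target) ** 2)


if __name__ == "__main__":
    # A large policy ratio with positive advantage is capped at 1 + eps.
    print(clipped_surrogate(1.5, 1.0))
    # The same value prediction gives different losses with and without clipping.
    print(value_loss_unclipped(2.0, 2.0), value_loss_clipped(2.0, 1.0, 2.0))
```

The two value losses take the same inputs but can return different numbers, which is the kind of implementation-level difference that makes performance gains hard to attribute to the core algorithm alone.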

Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO

Three factors drive the advance of AI: algorithmic innovation, data, and the
amount of compute available for training. Algorithmic progress has
traditionally been more difficult to quantify than compute and data. In this
work, we argue that algorithmic progress has an aspect that is both
straightforward to measure and interesting: reductions over time in the compute
needed to reach past capabilities. We show that the number of floating-point
operations required to train a classifier to AlexNet-level performance on
ImageNet decreased by a factor of 44 between 2012 and 2019. This
corresponds to algorithmic efficiency doubling every 16 months over a period of
7 years. By contrast, Moore's Law would only have yielded an 11x cost
improvement. We observe that hardware and algorithmic efficiency gains multiply
and can be on a similar scale over meaningful horizons, which suggests that a
good model of AI progress should integrate measures from both.

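By examining reductions in required compute and gains in algorithmic efficiency, the paper addresses how to quantify algorithmic progress. It argues that hardware and algorithmic efficiency gains multiply, so both should be considered together when assessing progress in AI.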
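
The figures in the abstract can be sanity-checked with a short calculation. The sketch below assumes the period is exactly 7 years (84 months) and that Moore's Law means a doubling every 24 months; the abstract states neither, and the function names are hypothetical. Under these assumptions a 44-fold gain implies a doubling time of about 15.4 months, close to the quoted 16 months, and a 24-month doubling gives about 11.3x, matching the quoted 11x. The abstract does not say how the exact interval or rounding was chosen, which may explain the small gap.

```python
import math


def doubling_time_months(gain, months):
    """Doubling time implied by a total multiplicative gain over a period."""
    return months / math.log2(gain)


def gain_from_doubling(months, doubling_months):
    """Total multiplicative gain from repeated doubling over a period."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    months = 7 * 12  # assumed: exactly 7 years

    # Algorithmic efficiency: 44x over the period.
    print(round(doubling_time_months(44, months), 1))  # 15.4

    # Hardware: assumed 24-month doubling period for Moore's Law.
    print(round(gain_from_doubling(months, 24), 1))  # 11.3

    # Because the two gains multiply, the combined factor is their product.
    print(44 * gain_from_doubling(months, 24))
```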