Although Deep Reinforcement Learning (DRL) methods can learn effective
policies for challenging problems such as Atari games and robotics tasks,
the algorithms are complex and training times are often long. This study
investigates how evolution strategies (ES) perform compared to gradient-based
deep reinforcement learning methods. We use ES to optimize the weights of a
neural network via neuroevolution, performing direct policy search. We
benchmark both regular networks and policy networks consisting of a single
linear layer from observations to actions, using three classical ES methods and
three gradient-based methods, including PPO. Our results reveal that ES can
find effective linear policies for many RL benchmark tasks, in contrast to DRL
methods that can only find successful policies using much larger networks,
suggesting that current benchmarks are easier to solve than previously assumed.
Interestingly, ES also achieves results comparable to gradient-based DRL
algorithms on higher-complexity tasks. Furthermore, we find that by directly
accessing the memory state of the game, ES is able to find successful policies
in Atari, outperforming DQN. While gradient-based methods have dominated the
field in recent years, ES offers an alternative that is easy to implement,
parallelize, understand, and tune.

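This study uses evolution strategies (ES) to optimize the weights of a neural network via neuroevolution, performing direct policy search. The results show that ES can find effective linear policies for many reinforcement learning benchmark tasks, that ES achieves results comparable to gradient-based deep reinforcement learning algorithms on more complex tasks, and that, by directly accessing the memory state of the game, ES outperforms DQN on Atari.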
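To make the idea of direct policy search over a single linear layer concrete, here is a minimal sketch of an evolution strategy with truncation selection. Everything in it is an assumption made for illustration: the toy point-mass task, the function names `rollout` and `evolve`, and the fixed step size `sigma`. It is not one of the three ES variants benchmarked in the study, which adapt their step size and are evaluated on standard RL benchmarks.

```python
import random

def rollout(weights, steps=50):
    """Return of the linear policy a = w0*x + w1*v on a toy point-mass task (hypothetical)."""
    x, v, total = 1.0, 0.0, 0.0
    for _ in range(steps):
        # Single linear layer from observation (x, v) to a clipped action.
        action = max(-1.0, min(1.0, weights[0] * x + weights[1] * v))
        v += 0.1 * action
        x += 0.1 * v
        total -= x * x + 0.1 * v * v  # reward: stay near the origin, at rest
    return total

def evolve(generations=30, lam=16, mu=4, sigma=0.5, seed=0):
    """(mu, lambda)-style ES with a fixed step size; tracks the best policy seen."""
    rng = random.Random(seed)
    mean = [0.0, 0.0]
    best_w, best_ret = list(mean), rollout(mean)
    for _ in range(generations):
        # Sample lam candidate weight vectors around the current mean.
        population = [[m + sigma * rng.gauss(0.0, 1.0) for m in mean]
                      for _ in range(lam)]
        scored = sorted(((rollout(w), w) for w in population), reverse=True)
        # Recombine: the new mean is the average of the mu best candidates.
        elite = [w for _, w in scored[:mu]]
        mean = [sum(w[i] for w in elite) / mu for i in range(2)]
        if scored[0][0] > best_ret:
            best_ret, best_w = scored[0]
    return best_w, best_ret

baseline = rollout([0.0, 0.0])        # return of the do-nothing policy
best_weights, best_return = evolve()
print(baseline, best_return)
```

The policy here has only two parameters, so the search needs nothing beyond sampling, ranking, and averaging. No gradients or value functions are involved, which is the property the abstract describes as easy to implement and parallelize.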

Solving Deep Reinforcement Learning Benchmarks with Linear Policy Networks

We study derivative-free methods for policy optimization over the class of
linear policies. We focus on characterizing the convergence rate of these
methods when applied to linear-quadratic systems, and study various settings of
driving noise and reward feedback. We show that these methods provably converge
to within any pre-specified tolerance of the optimal policy with a number of
zero-order evaluations that is an explicit polynomial of the error tolerance,
dimension, and curvature properties of the problem. Our analysis reveals some
interesting differences between the settings of additive driving noise and
random initialization, as well as the settings of one-point and two-point
reward feedback. Our theory is corroborated by extensive simulations of
derivative-free methods on these systems. Along the way, we derive convergence
rates for stochastic zero-order optimization algorithms when applied to a
certain class of non-convex problems.

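This paper studies derivative-free policy optimization methods over the class of linear policies. It characterizes their convergence rates on linear-quadratic systems under various settings of driving noise and reward feedback, showing that they converge to within any pre-specified tolerance of the optimal policy using a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. The analysis reveals interesting differences between the settings of driving noise and of reward feedback, and the theory is corroborated by extensive simulations on these systems. Along the way, convergence rates are derived for stochastic zero-order optimization algorithms applied to a certain class of non-convex problems.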
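A two-point zero-order update of the kind analyzed here can be sketched as follows. The system (a discretized double integrator), the cost weights, the fixed initial state, and the names `lqr_cost` and `two_point_step` are all assumptions chosen for illustration, and the sketch is noise-free, so it does not reproduce the driving-noise or random-initialization settings studied in the paper. Each step evaluates the cost at two gains perturbed in opposite directions and moves along that direction in proportion to the finite difference.

```python
import math
import random

def lqr_cost(gain, horizon=100):
    """Finite-horizon quadratic cost of u = -(k1*x1 + k2*x2) on a toy double integrator."""
    x1, x2, cost = 1.0, 0.0, 0.0
    for _ in range(horizon):
        u = -(gain[0] * x1 + gain[1] * x2)
        cost += x1 * x1 + x2 * x2 + 0.1 * u * u
        x1, x2 = x1 + 0.1 * x2, x2 + 0.1 * u
    return cost

def two_point_step(gain, rng, radius=0.01, step=0.005):
    """One derivative-free update using two cost evaluations along a random direction."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    direction = (math.cos(angle), math.sin(angle))  # uniform on the unit circle
    plus = [k + radius * d for k, d in zip(gain, direction)]
    minus = [k - radius * d for k, d in zip(gain, direction)]
    # Central difference approximates the directional derivative of the cost.
    slope = (lqr_cost(plus) - lqr_cost(minus)) / (2.0 * radius)
    dim = len(gain)  # scaling by the dimension makes the estimate unbiased on average
    return [k - step * dim * slope * d for k, d in zip(gain, direction)]

rng = random.Random(0)
gain = [1.0, 1.0]                     # a stabilizing initial gain for this toy system
initial_cost = lqr_cost(gain)
for _ in range(400):
    gain = two_point_step(gain, rng)
final_cost = lqr_cost(gain)
print(initial_cost, final_cost)
```

Because both evaluations share the same initial state, their difference isolates the effect of the perturbation. A one-point scheme would use a single evaluation per step and typically yields a noisier estimate.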

Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems

A common belief in model-free reinforcement learning is that methods based on
random search in the parameter space of policies exhibit significantly worse
sample complexity than those that explore the space of actions. We dispel such
beliefs by introducing a random search method for training static, linear
policies for continuous control problems, matching state-of-the-art sample
efficiency on the benchmark MuJoCo locomotion tasks. Our method also finds a
nearly optimal controller for a challenging instance of the Linear Quadratic
Regulator, a classical problem in control theory, when the dynamics are not
known. Computationally, our random search algorithm is at least 15 times more
efficient than the fastest competing model-free methods on these benchmarks. We
take advantage of this computational efficiency to evaluate the performance of
our method over hundreds of random seeds and many different hyperparameter
configurations for each benchmark task. Our simulations highlight a high
variability in performance in these benchmark tasks, suggesting that commonly
used estimations of sample efficiency do not adequately evaluate the
performance of RL algorithms.

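By introducing a random search method, we show that random search in the parameter space of policies does not have significantly worse sample efficiency than methods that explore the action space. The method trains static, linear policies for continuous control problems and matches state-of-the-art sample efficiency on the MuJoCo locomotion benchmark tasks. It also finds a nearly optimal controller for a challenging instance of the Linear Quadratic Regulator, a classical problem in control theory, when the dynamics are unknown, and it is at least 15 times more computationally efficient than the fastest competing model-free methods on these benchmarks.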
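The following is a minimal sketch of parameter-space random search for a static linear policy, repeated over several random seeds to expose run-to-run variability. The toy environment, the names `episode_return` and `random_search`, and all hyperparameter values are assumptions for illustration. The sketch omits any refinements the authors' method may use and does not involve MuJoCo.

```python
import random
import statistics

def episode_return(theta, rng, steps=40):
    """Return of the static linear policy a = theta[0]*x + theta[1]*v from a random start."""
    x, v, total = rng.uniform(-1.0, 1.0), 0.0, 0.0
    for _ in range(steps):
        a = max(-1.0, min(1.0, theta[0] * x + theta[1] * v))
        v += 0.1 * a
        x += 0.1 * v
        total -= x * x  # reward: stay close to the origin
    return total

def random_search(seed, iterations=50, directions=4, nu=0.1, alpha=0.02):
    """Basic random search: perturb the policy parameters, compare returns, step."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(iterations):
        update = [0.0, 0.0]
        for _ in range(directions):
            delta = [rng.gauss(0.0, 1.0) for _ in theta]
            r_plus = episode_return([t + nu * d for t, d in zip(theta, delta)], rng)
            r_minus = episode_return([t - nu * d for t, d in zip(theta, delta)], rng)
            # Weight each direction by the difference in returns it produced.
            update = [u + (r_plus - r_minus) * d for u, d in zip(update, delta)]
        theta = [t + alpha / directions * u for t, u in zip(theta, update)]
    # Evaluate the final policy on fresh episodes with a separate generator.
    eval_rng = random.Random(10_000 + seed)
    return statistics.mean(episode_return(theta, eval_rng) for _ in range(20))

scores = [random_search(seed) for seed in range(10)]
print(min(scores), statistics.median(scores), max(scores))
```

Reporting the minimum, median, and maximum across seeds, rather than a single run, reflects the abstract's point that performance on these tasks can vary widely between seeds.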