Deep reinforcement learning algorithms that learn policies by trial-and-error
must learn from limited amounts of data collected by actively interacting with
the environment. While many prior works have shown that proper regularization
techniques are crucial for enabling data-efficient RL, a general understanding
of the bottlenecks in data-efficient RL has remained elusive. Consequently, it
has been difficult to devise a universal technique that works well across all
domains. In this paper, we attempt to understand the primary bottleneck in
sample-efficient deep RL by examining several potential hypotheses such as
non-stationarity, excessive action distribution shift, and overfitting. We
perform a thorough empirical analysis on state-based DeepMind control suite (DMC)
tasks in a controlled and systematic way to show that high temporal-difference
(TD) error on the validation set of transitions is the main culprit that
severely affects the performance of deep RL algorithms, and that prior methods
that lead to good performance do, in fact, keep the validation TD error low.
This observation gives us a robust principle for making deep RL efficient: we
can hill-climb on the validation TD error by utilizing any regularization
technique from supervised learning. We show that a simple
online model selection method that targets the validation TD error is effective
across state-based DMC and Gym tasks.
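
As a rough illustration of the quantity discussed above, the sketch below computes a mean squared TD error on held-out transitions and picks the candidate with the lowest value. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `validation_td_error`, `select_by_validation_td`, `q`, and `next_value`, and the `(s, a, r, s_next, done)` transition layout, are all hypothetical.

```python
from typing import Callable, Dict, Hashable, Sequence, Tuple

# Hypothetical transition layout: (state, action, reward, next_state, done).
Transition = Tuple[Hashable, Hashable, float, Hashable, bool]
QFn = Callable[[Hashable, Hashable], float]   # Q(s, a)
VFn = Callable[[Hashable], float]             # bootstrap value of s_next


def validation_td_error(q: QFn, next_value: VFn,
                        transitions: Sequence[Transition],
                        gamma: float = 0.99) -> float:
    """Mean squared TD error over held-out transitions (never trained on)."""
    total = 0.0
    for s, a, r, s_next, done in transitions:
        # One-step bootstrapped target; no bootstrap at terminal transitions.
        target = r + (0.0 if done else gamma * next_value(s_next))
        total += (q(s, a) - target) ** 2
    return total / len(transitions)


def select_by_validation_td(candidates: Dict[str, Tuple[QFn, VFn]],
                            transitions: Sequence[Transition],
                            gamma: float = 0.99):
    """Return the name of the candidate with the lowest validation TD error,
    together with every candidate's score."""
    scores = {
        name: validation_td_error(q, v, transitions, gamma)
        for name, (q, v) in candidates.items()
    }
    best = min(scores, key=scores.get)
    return best, scores
```

In an online setting, each candidate could be an agent trained with a different regularizer, with the selection repeated periodically as new held-out transitions arrive; how often to re-select is a design choice this sketch leaves open.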

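This paper studies the bottlenecks of data-efficient RL through a controlled, systematic analysis of DeepMind control suite tasks and finds that high TD error on the validation set is the main culprit that severely degrades the performance of deep RL algorithms. Hill-climbing on the validation TD error with any form of regularization from supervised learning is therefore a robust principle for making deep RL efficient. A simple online model selection method that targets the validation TD error is also effective on state-based DMC and Gym tasks.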

Efficient Deep Reinforcement Learning Requires Controlling Overfitting

Efficient Deep Reinforcement Learning Requires Regulating Overfitting

Prioritized Experience Replay (PER) is a deep reinforcement learning
technique in which agents learn from transitions sampled with non-uniform
probability proportionate to their temporal-difference error. We show that any
loss function evaluated with non-uniformly sampled data can be transformed into
another uniformly sampled loss function with the same expected gradient.
Surprisingly, we find that in some environments PER can be replaced entirely by
this new loss function without impact on empirical performance. Furthermore, this
relationship suggests a new branch of improvements to PER by correcting its
uniformly sampled loss function equivalent. We demonstrate the effectiveness of
our proposed modifications to PER and the equivalent loss function in several
MuJoCo and Atari environments.
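
The equal-expected-gradient claim can be checked numerically on a toy buffer. The sketch below is a minimal illustration, assuming a squared TD loss `0.5 * delta**2` (whose gradient with respect to the prediction is `delta`) and priorities proportional to `|delta|**alpha`; the function names are hypothetical, and the reweighting factor `n * p` is treated as a constant rather than differentiated through.

```python
from typing import List, Sequence


def per_probabilities(td_errors: Sequence[float], alpha: float = 1.0) -> List[float]:
    """Sampling probabilities proportional to |delta|^alpha.
    Assumes at least one TD error is non-zero."""
    priorities = [abs(d) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def expected_grad_prioritized(td_errors: Sequence[float],
                              probs: Sequence[float]) -> float:
    """Exact expectation of the per-sample gradient when index i is drawn
    with probability probs[i]."""
    return sum(p * d for p, d in zip(probs, td_errors))


def expected_grad_uniform_reweighted(td_errors: Sequence[float],
                                     probs: Sequence[float]) -> float:
    """Exact expectation under uniform sampling of the reweighted loss,
    where sample i carries the constant weight n * probs[i]."""
    n = len(td_errors)
    return sum((n * p) * d for p, d in zip(probs, td_errors)) / n
```

Both functions return the same value for any buffer, because the uniform probability `1 / n` cancels the factor `n` in the weight. The expectations match, but the variance of a sampled minibatch gradient generally does not.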

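This work studies prioritized experience replay (PER), in which transitions are sampled non-uniformly in deep reinforcement learning. It shows that a loss function evaluated on non-uniformly sampled data can be transformed into a uniformly sampled loss function with the same expected gradient, and it validates the proposed modifications in MuJoCo and Atari environments.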

The Equivalence of Loss Functions and Non-Uniform Sampling in Experience Replay

An Equivalence between Loss Functions and Non-Uniform Sampling in Experience Replay

We introduce learning and planning algorithms for average-reward MDPs,
including 1) the first general proven-convergent off-policy model-free control
algorithm without reference states, 2) the first proven-convergent off-policy
model-free prediction algorithm, and 3) the first off-policy learning algorithm
that converges to the actual value function rather than to the value function
plus an offset. All of our algorithms are based on using the
temporal-difference error rather than the conventional error when updating the
estimate of the average reward. Our proof techniques are a slight
generalization of those by Abounadi, Bertsekas, and Borkar (2001). In
experiments with an Access-Control Queuing Task, we show some of the
difficulties that can arise when using methods that rely on reference states
and argue that our new algorithms can be significantly easier to use.
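
To make the role of the temporal-difference error concrete, the sketch below shows one tabular control update in which the average-reward estimate is moved by the TD error instead of by the conventional error `r - avg_reward`. It is a minimal sketch, assuming a Q-learning-style differential update; the function name, the step-size names `alpha` and `eta`, and the nested-list table layout are hypothetical and not taken from the text above.

```python
from typing import List, Tuple


def differential_q_step(q: List[List[float]], avg_reward: float,
                        s: int, a: int, r: float, s_next: int,
                        alpha: float = 0.1, eta: float = 1.0
                        ) -> Tuple[float, float]:
    """Apply one update in place to the table q and return
    (new average-reward estimate, TD error)."""
    # Differential TD error: the reward is measured relative to the
    # current average-reward estimate, and there is no discount factor.
    td_error = r - avg_reward + max(q[s_next]) - q[s][a]
    q[s][a] += alpha * td_error
    # The average-reward estimate follows the TD error, not (r - avg_reward).
    avg_reward += eta * alpha * td_error
    return avg_reward, td_error
```

No reference state appears anywhere in this update: the average-reward estimate is a separate scalar learned alongside the table.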

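This work proposes learning and planning algorithms for average-reward MDPs, including the first general proven-convergent off-policy model-free control algorithm without reference states, the first proven-convergent off-policy model-free prediction algorithm, and the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of the algorithms update the estimate of the average reward using the temporal-difference error rather than the conventional error.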