In deep reinforcement learning, estimating the value function to evaluate the
quality of states and actions is essential. The value function is often trained
using the least squares method, which implicitly assumes a Gaussian error
distribution. However, a recent study suggested that the error distribution for
training the value function is often skewed because of the properties of the
Bellman operator, which violates the implicit assumption of a normal error
distribution in the least squares method. To address this, we proposed a method
called Symmetric Q-learning, in which synthetic noise generated from a
zero-mean distribution is added to the target values to produce a Gaussian
error distribution. We evaluated the proposed method on continuous control
benchmark tasks in MuJoCo. It improved the sample efficiency of a
state-of-the-art reinforcement learning method by reducing the skewness of the
error distribution.
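
As a rough illustration of the idea in this abstract, the sketch below adds zero-mean synthetic noise to a batch of targets and compares the skewness of the error before and after. It is a toy example under stated assumptions, not the paper's algorithm: the abstract does not say how the noise distribution is chosen, so the choice here (negated, mean-centred, shuffled copies of the observed errors) is purely hypothetical, as are the names `skewness` and `symmetrize_targets` and the synthetic exponential errors.

```python
import numpy as np


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return float(np.mean(centered**3) / np.mean(centered**2) ** 1.5)


def symmetrize_targets(targets, predictions, rng):
    """Add zero-mean synthetic noise to the targets.

    Hypothetical noise choice (not from the paper): shuffle the observed
    errors, centre them, and flip their sign. The noise has zero sample
    mean and a skew opposite to that of the errors.
    """
    errors = targets - predictions
    noise = -(rng.permutation(errors) - errors.mean())
    return targets + noise


rng = np.random.default_rng(0)
predictions = np.zeros(10_000)
# Synthetic right-skewed errors with mean close to zero (assumption).
targets = rng.exponential(scale=1.0, size=10_000) - 1.0

noisy_targets = symmetrize_targets(targets, predictions, rng)
skew_before = skewness(targets - predictions)
skew_after = skewness(noisy_targets - predictions)
print(f"skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
```

Because the noise has zero mean, the expected target is unchanged; only the shape of the error distribution seen by the least squares loss is altered.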

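In deep reinforcement learning, Symmetric Q-learning adds synthetic noise drawn from a zero-mean distribution to the target values so that the error distribution becomes Gaussian. This mitigates the skewed error distribution that arises when training the value function and improves the sample efficiency of an existing reinforcement learning method on continuous control tasks.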

Symmetric Q-learning: Reducing Skewness of Bellman Error in Online Reinforcement Learning

The family of temporal difference (TD) methods spans a spectrum from
computationally frugal linear methods like TD(λ) to data-efficient
least squares methods. Least squares methods make the best use of available data
by directly computing the TD solution and thus do not require tuning a typically
highly sensitive learning rate parameter, but require quadratic computation and
storage. Recent algorithmic developments have yielded several sub-quadratic
methods that use an approximation to the least squares TD solution, but incur
bias. In this paper, we propose a new family of accelerated gradient TD (ATD)
methods that (1) provide data efficiency benefits similar to those of least
squares methods at a fraction of the computation and storage, (2) significantly reduce
parameter sensitivity compared to linear TD methods, and (3) are asymptotically
unbiased. We illustrate these claims with a proof of convergence in expectation
and experiments on several benchmark domains and a large-scale industrial
energy allocation domain.
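
To make the two ends of the spectrum described above concrete, the sketch below contrasts linear TD(0), which does O(d) work per sample but needs a step size, with least squares TD, which accumulates a d x d matrix and needs no step size. It does not implement ATD, whose details the abstract does not give. The function names, the 3-state cyclic chain, the one-hot features, and the small ridge term `reg` are all assumptions made for illustration.

```python
import numpy as np


def td0(transitions, d, gamma, alpha):
    """Linear TD(0): O(d) work per sample, requires a step size alpha."""
    w = np.zeros(d)
    for x, r, x_next in transitions:
        delta = r + gamma * (w @ x_next) - (w @ x)  # TD error
        w += alpha * delta * x
    return w


def lstd(transitions, d, gamma, reg=1e-6):
    """Least squares TD: stores a d x d matrix, no step size to tune."""
    A = reg * np.eye(d)  # small ridge term keeps A invertible (assumption)
    b = np.zeros(d)
    for x, r, x_next in transitions:
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)


# Toy deterministic 3-state cycle with one-hot features (hypothetical).
features = np.eye(3)
rewards = np.array([0.0, 0.0, 1.0])  # reward received on leaving each state
gamma = 0.9

transitions = []
s = 0
for _ in range(3000):
    s_next = (s + 1) % 3
    transitions.append((features[s], rewards[s], features[s_next]))
    s = s_next

w_td = td0(transitions, d=3, gamma=gamma, alpha=0.1)
w_lstd = lstd(transitions, d=3, gamma=gamma)
print("TD(0):", np.round(w_td, 3))
print("LSTD: ", np.round(w_lstd, 3))
```

On this toy problem both estimates approach the same value function; the difference lies in the cost per sample and in whether a step size has to be chosen.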

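This paper proposes ATD, a new family of TD methods that greatly reduces computation and storage while retaining data efficiency, reducing parameter sensitivity, and remaining asymptotically unbiased. Convergence in expectation is proved, and experiments are run on several benchmark domains and a large-scale industrial energy allocation domain.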