We propose expected policy gradients (EPG), which unify stochastic policy
gradients (SPG) and deterministic policy gradients (DPG) for reinforcement
learning. Inspired by expected sarsa, EPG integrates across the action when
estimating the gradient, instead of relying only on the action in the sampled
trajectory. We establish a new general policy gradient theorem, of which the
stochastic and deterministic policy gradient theorems are special cases. We
also prove that EPG reduces the variance of the gradient estimates without
requiring deterministic policies and, for the Gaussian case, with no
computational overhead. Finally, we show that it is optimal in a certain sense
to explore with a Gaussian policy such that the covariance is proportional to
the exponential of the scaled Hessian of the critic with respect to the
actions. We present empirical results confirming that this new form of
exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic
in four challenging MuJoCo domains.

论文提出了一种集成了随机策略梯度和确定性策略梯度的预期策略梯度，通过对动作的积分来估算梯度，证明了其可以降低梯度估算的方差，对于高斯探索，通过设置动作的海森矩阵的指数作为协方差比标准探索更优，在四个 MuJoCo 域中明显优于使用奥恩斯坦 - 乌伦贝克启发式的确定性策略梯度.