We consider the problem of learning a set of probability distributions from
the empirical Bellman dynamics in distributional reinforcement learning (RL), a
class of state-of-the-art methods that estimate the distribution, as opposed to
only the expectation, of the total return. We formulate a method that learns a
finite set of statistics from each return distribution via neural networks, as
in (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b). Existing
distributional RL methods, however, constrain the learned statistics to
\emph{predefined} functional forms of the return distribution, which both
restricts representation and makes the predefined statistics difficult to
maintain. Instead, we learn \emph{unrestricted} statistics, i.e.,
deterministic (pseudo-)samples, of the return distribution by leveraging a
technique from hypothesis testing known as maximum mean discrepancy (MMD),
which leads to a simpler objective amenable to backpropagation. Our method can
be interpreted as implicitly matching all orders of moments between a return
distribution and its Bellman target. We establish sufficient conditions for the
contraction of the distributional Bellman operator and provide finite-sample
analysis for the deterministic samples in distribution approximation.
Experiments on the suite of Atari games show that our method outperforms the
standard distributional RL baselines and sets a new record in the Atari games
for non-distributed agents.
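
To make the MMD objective concrete, the following is a minimal sketch of the standard biased empirical estimate of squared MMD between a set of return particles and their Bellman targets. It is an illustration under stated assumptions, not the implementation evaluated in the experiments: the Gaussian kernel, its bandwidth, the scalar particles, and the names `gaussian_kernel`, `mmd_squared`, and `bellman_target_particles` are all hypothetical choices made here.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars. The bandwidth is an assumed hyperparameter."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd_squared(particles, targets, kernel=gaussian_kernel):
    """Biased (V-statistic) estimate of squared MMD between two particle sets."""
    n, m = len(particles), len(targets)
    # Average kernel value within the first set, within the second set, and across sets.
    k_pp = sum(kernel(a, b) for a in particles for b in particles) / (n * n)
    k_tt = sum(kernel(a, b) for a in targets for b in targets) / (m * m)
    k_pt = sum(kernel(a, b) for a in particles for b in targets) / (n * m)
    return k_pp + k_tt - 2.0 * k_pt


def bellman_target_particles(reward, discount, next_particles):
    """Shift and scale next-state particles: r + discount * z' (assumed target form)."""
    return [reward + discount * z for z in next_particles]


# Hypothetical numbers, for illustration only.
current = [0.0, 1.0, 2.0]
targets = bellman_target_particles(reward=1.0, discount=0.9, next_particles=[0.0, 1.0, 2.0])
loss = mmd_squared(current, targets)
```

The estimate is zero when the two particle sets coincide and positive when they differ, so minimizing it with respect to the current particles pulls them toward the target distribution. In a learning setting the particles would be network outputs and the targets would be treated as constants during backpropagation.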

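This paper proposes a method for learning unrestricted statistics that uses neural networks and maximum mean discrepancy to match the return distribution to its Bellman target; the method applies to distributional RL and achieves strong performance on the Atari games.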