In this paper, we propose the Quantile Option Architecture (QUOTA) for
exploration based on recent advances in distributional reinforcement learning
(RL). In QUOTA, decision making is based on quantiles of a value distribution,
not only the mean. QUOTA provides a new dimension for exploration via making
use of both optimism and pessimism of a value distribution. We demonstrate the
performance advantage of QUOTA in both challenging video games and physical
robot simulators.

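This paper proposes the Quantile Option Architecture (QUOTA) for exploration, building on recent advances in distributional reinforcement learning. QUOTA opens a new dimension for exploration by using both the optimism and the pessimism of a value distribution. We demonstrate QUOTA's performance advantage in challenging video games and physical robot simulators.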
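
As a rough illustration of what "decision making based on quantiles, not only the mean" can look like, the sketch below compares a mean-greedy choice with choices that are greedy with respect to a window of low (pessimistic) or high (optimistic) quantiles. It assumes each action's value distribution is summarized by a short array of quantile estimates; the function names, the windowing scheme, and the numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def greedy_by_mean(quantiles):
    """Pick the action whose quantile estimates have the highest mean."""
    return int(np.argmax(quantiles.mean(axis=1)))

def greedy_by_quantile_window(quantiles, lo, hi):
    """Pick the action with the highest average over quantile indices lo..hi-1."""
    return int(np.argmax(quantiles[:, lo:hi].mean(axis=1)))

# Rows are actions, columns are quantile estimates sorted from low to high.
# These values are made up for illustration.
quantiles = np.array([
    [-2.0, 0.0, 2.0, 6.0],   # action 0: wide spread, mean 1.5
    [1.6, 1.8, 2.0, 2.2],    # action 1: narrow spread, mean 1.9
])

mean_choice = greedy_by_mean(quantiles)
pessimistic_choice = greedy_by_quantile_window(quantiles, 0, 2)  # lower half
optimistic_choice = greedy_by_quantile_window(quantiles, 2, 4)   # upper half
```

With these numbers the mean-greedy and pessimistic choices agree on the low-variance action, while the optimistic window prefers the high-variance one, which is the kind of behavioural difference an exploration scheme can exploit.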

QUOTA: The Quantile Option Architecture for Reinforcement Learning

In reinforcement learning an agent interacts with the environment by taking
actions and observing the next state and reward. When sampled
probabilistically, these state transitions, rewards, and actions can all induce
randomness in the observed long-term return. Traditionally, reinforcement
learning algorithms average over this randomness to estimate the value
function. In this paper, we build on recent work advocating a distributional
approach to reinforcement learning in which the distribution over returns is
modeled explicitly instead of only estimating the mean. That is, we examine
methods of learning the value distribution instead of the value function. We
give results that close a number of gaps between the theoretical and
algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we
extend existing results to the approximate distribution setting. Second, we
present a novel distributional reinforcement learning algorithm consistent with
our theoretical formulation. Finally, we evaluate this new algorithm on the
Atari 2600 games, observing that it significantly outperforms many of the
recent improvements on DQN, including the related distributional algorithm C51.

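This paper presents a distributional reinforcement learning method that explicitly models the distribution over returns instead of estimating only its mean. It closes several gaps between the theoretical and algorithmic results of Bellemare, Dabney, and Munos (2017). On the Atari 2600 games, the algorithm significantly outperforms many recent improvements on DQN, including the related distributional algorithm C51.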
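
The title of this entry refers to quantile regression. As a minimal sketch of that building block, the code below evaluates the standard quantile (pinball) loss and checks that minimizing it over a set of candidates recovers an empirical quantile. It is not the paper's algorithm: there is no neural network, no Bellman update, and no Huber smoothing, and the helper names and the use of quantile midpoints as targets are assumptions made for illustration.

```python
import numpy as np

def quantile_midpoints(n):
    """Midpoints of n equal-probability bins: (2i + 1) / (2n) for i = 0..n-1."""
    return (2 * np.arange(n) + 1) / (2 * n)

def pinball_loss(theta, samples, tau):
    """Average quantile-regression loss of the estimate theta at level tau."""
    u = samples - theta
    # Positive errors are weighted by tau, negative errors by (1 - tau).
    weight = tau - (u < 0).astype(float)
    return float(np.mean(u * weight))

def best_candidate(samples, tau):
    """Return the sample value that minimizes the pinball loss at level tau."""
    losses = [pinball_loss(c, samples, tau) for c in samples]
    return float(samples[int(np.argmin(losses))])

samples = np.arange(1.0, 100.0)        # the values 1, 2, ..., 99
median_estimate = best_candidate(samples, 0.5)
upper_estimate = best_candidate(samples, 0.9)
```

The asymmetric weighting is what makes the minimizer a quantile instead of the mean: for `tau = 0.9`, underestimates are penalized nine times as heavily as overestimates, which pushes the estimate toward the upper tail.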

Distributional Reinforcement Learning with Quantile Regression

In this paper we argue for the fundamental importance of the value
distribution: the distribution of the random return received by a reinforcement
learning agent. This is in contrast to the common approach to reinforcement
learning which models the expectation of this return, or value. Although there
is an established body of literature studying the value distribution, thus far
it has always been used for a specific purpose such as implementing risk-aware
behaviour. We begin with theoretical results in both the policy evaluation and
control settings, exposing a significant distributional instability in the
latter. We then use the distributional perspective to design a new algorithm
which applies Bellman's equation to the learning of approximate value
distributions. We evaluate our algorithm using the suite of games from the
Arcade Learning Environment. We obtain both state-of-the-art results and
anecdotal evidence demonstrating the importance of the value distribution in
approximate reinforcement learning. Finally, we combine theoretical and
empirical evidence to highlight the ways in which the value distribution
impacts learning in the approximate setting.

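This paper argues for the importance of the value distribution, proposes a learning algorithm based on it, and presents empirical results demonstrating the algorithm's effectiveness.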
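
To make "applying Bellman's equation to approximate value distributions" more concrete, the sketch below performs one distributional backup on a categorical distribution: the atoms of a fixed, evenly spaced support are shifted by `reward + gamma * z`, and the probability mass is redistributed onto the neighbouring atoms of the original support. The fixed-support representation and every name here are assumptions for illustration; this is a single-transition toy, not the full algorithm from the paper.

```python
import numpy as np

def project_target(probs, reward, gamma, support):
    """Project the distribution of reward + gamma * Z back onto `support`.

    `probs[i]` is the probability that Z equals `support[i]`; the support is
    assumed to be evenly spaced and sorted in increasing order.
    """
    v_min, v_max = support[0], support[-1]
    delta = support[1] - support[0]
    # Shift and scale each atom, keeping it inside the support range.
    shifted = np.clip(reward + gamma * support, v_min, v_max)
    # Fractional index of each shifted atom on the original grid.
    pos = np.clip((shifted - v_min) / delta, 0, len(support) - 1)
    lower = np.floor(pos).astype(int)
    upper = np.ceil(pos).astype(int)
    out = np.zeros_like(probs)
    # Split mass between the two neighbouring atoms in proportion to distance;
    # if the shifted atom lands exactly on a grid point, it keeps all its mass.
    np.add.at(out, lower, probs * (upper - pos + (lower == upper)))
    np.add.at(out, upper, probs * (pos - lower))
    return out

support = np.linspace(0.0, 4.0, 5)            # atoms at 0, 1, 2, 3, 4
probs = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # all mass on the return 2
target = project_target(probs, reward=0.5, gamma=1.0, support=support)
```

In this example the shifted return 2.5 falls halfway between the atoms 2 and 3, so each receives half of the mass and the mean of the projected distribution is still 2.5.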