Classical reinforcement learning (RL) aims to optimize the expected
cumulative reward. In this work, we consider the RL setting where the goal is
to optimize the quantile of the cumulative reward. We parameterize the policy
controlling actions by neural networks, and propose a novel policy gradient
algorithm called Quantile-Based Policy Optimization (QPO) and its variant
Quantile-Based Proximal Policy Optimization (QPPO) for solving deep RL problems
with quantile objectives. QPO uses two coupled iterations running at different
timescales for simultaneously updating quantiles and policy parameters, whereas
QPPO is an off-policy version of QPO that allows multiple updates of parameters
during one simulation episode, leading to improved algorithm efficiency. Our
numerical results indicate that the proposed algorithms outperform the existing
baseline algorithms under the quantile criterion.
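The two coupled iterations described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy two-action environment, the softmax policy with a single logit, the step-size schedules, and all function names are assumptions made for illustration. The quantile estimate uses the larger step size (the faster timescale), and the policy parameter uses the smaller one.

```python
import math
import random


def quantile_step(q, ret, alpha, step):
    """One stochastic-approximation step moving q toward the alpha-quantile of the return."""
    indicator = 1.0 if ret <= q else 0.0
    return q + step * (alpha - indicator)


def policy_step(theta, action, ret, q, step):
    """One ascent step for a two-action softmax policy with logits (theta, 0).

    Assumed update direction: -1{ret <= q} * d/dtheta log pi(action).
    Returns at or below the current quantile estimate lower the
    probability of the action that produced them.
    """
    p0 = 1.0 / (1.0 + math.exp(-theta))          # probability of action 0
    grad_log_pi = (1.0 - p0) if action == 0 else -p0
    indicator = 1.0 if ret <= q else 0.0
    return theta - step * indicator * grad_log_pi


def run(alpha=0.1, episodes=20000, seed=0):
    """Run both recursions together on a hypothetical one-step environment."""
    rng = random.Random(seed)
    q, theta = 0.0, 0.0
    for k in range(1, episodes + 1):
        p0 = 1.0 / (1.0 + math.exp(-theta))
        action = 0 if rng.random() < p0 else 1
        # Hypothetical rewards: action 0 is safe, action 1 has a higher
        # mean but a much larger variance.
        ret = rng.gauss(1.0, 0.1) if action == 0 else rng.gauss(2.0, 3.0)
        q = quantile_step(q, ret, alpha, step=1.0 / k ** 0.6)      # faster timescale
        theta = policy_step(theta, action, ret, q, step=1.0 / k)   # slower timescale
    return q, theta


if __name__ == "__main__":
    q, theta = run()
    print("quantile estimate:", q, "policy logit:", theta)
```

In this toy setting the high-variance action has the larger expected reward but the lower 0.1-quantile, so an expectation-based criterion and a quantile-based criterion would rank the two actions differently.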

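This work considers the problem of optimizing the quantile of the cumulative reward in reinforcement learning. With the policy parameterized by neural networks, the Quantile-Based Policy Optimization (QPO) and Quantile-Based Proximal Policy Optimization (QPPO) algorithms are proposed for solving deep reinforcement learning problems. Experimental results show that the method outperforms existing baseline algorithms under the quantile criterion.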

Quantile-Based Deep Reinforcement Learning using Two-Timescale Policy Gradient Algorithms

Classical reinforcement learning (RL) aims to optimize the expected
cumulative rewards. In this work, we consider the RL setting where the goal is
to optimize the quantile of the cumulative rewards. We parameterize the policy
controlling actions by neural networks and propose a novel policy gradient
algorithm called Quantile-Based Policy Optimization (QPO) and its variant
Quantile-Based Proximal Policy Optimization (QPPO) to solve deep RL problems
with quantile objectives. QPO uses two coupled iterations running at different
timescales for simultaneously estimating quantiles and policy parameters, and
is shown to converge to the globally optimal policy under certain conditions. Our
numerical results demonstrate that the proposed algorithms outperform the
existing baseline algorithms under the quantile criterion.
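
For reference, the quantile objective described above can be written with the standard definition of an α-quantile. The symbols R (cumulative reward), θ (policy parameters), α (quantile level), and F_R (distribution function of R under the policy) are notation assumed here for illustration and do not appear in the abstract.

```latex
% Classical RL objective: maximize the expected cumulative reward
\max_{\theta} \; \mathbb{E}_{\theta}[R]

% Quantile objective: maximize the alpha-quantile of the cumulative reward
q_\alpha(\theta) = \inf\{\, r \in \mathbb{R} : F_R(r;\theta) \ge \alpha \,\},
\qquad F_R(r;\theta) = \Pr\nolimits_{\theta}(R \le r),
\qquad \theta^{\ast} \in \arg\max_{\theta} \; q_\alpha(\theta)
```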

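This paper proposes an RL algorithm called Quantile-Based Policy Optimization (QPO), which performs better than existing algorithms under quantile objectives. The algorithm parameterizes the policy with a neural network and uses two coupled iterations to estimate the quantile and the policy parameters.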