Offline reinforcement learning (RL) aims to learn a near-optimal policy
from recorded offline experience, without online exploration. Current offline RL
research comprises two parts: 1) generative modeling, i.e., approximating a
policy using fixed data; and 2) learning the state-action value function. While
most research focuses on the state-action value function, reducing the
bootstrapping error in value function approximation induced by the distribution
shift of training data, the effects of error propagation in generative modeling
have been neglected. In this paper, we analyze the error in generative
modeling. We propose AQL (action-conditioned Q-learning), a residual generative
model to reduce policy approximation error for offline RL. We show that our
method can learn more accurate policy approximations on different benchmark
datasets. In addition, we show that the proposed offline RL method can learn
more competitive AI agents in complex control tasks in the multiplayer
online battle arena (MOBA) game Honor of Kings.

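This paper studies generative modeling and state-action value function learning in offline reinforcement learning, and proposes AQL, a new residual generative model that targets policy approximation error in offline RL. Experiments show that AQL learns more accurate policy approximations on benchmark datasets of varying quality. In addition, the offline RL method learns more competitive AI agents in the multiplayer online battle arena game Honor of Kings.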

Boosting Offline Reinforcement Learning with Residual Generative Modeling

Applying Q-learning to high-dimensional or continuous action spaces can be
difficult due to the required maximization over the set of possible actions.
Motivated by techniques from amortized inference, we replace the expensive
maximization over all actions with a maximization over a small subset of
possible actions sampled from a learned proposal distribution. The resulting
approach, which we dub Amortized Q-learning (AQL), is able to handle discrete,
continuous, or hybrid action spaces while maintaining the benefits of
Q-learning. Our experiments on continuous control tasks with up to
21-dimensional actions show that AQL outperforms D3PG (Barth-Maron et al., 2018)
and QT-Opt (Kalashnikov et al., 2018). Experiments on structured discrete action
spaces demonstrate that AQL can efficiently learn good policies in spaces with
thousands of discrete actions.

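This paper proposes Amortized Q-learning (AQL), which uses techniques akin to amortized inference: instead of an expensive maximization over all actions, it maximizes over a small subset of possible actions sampled from a learned proposal distribution. This lets it handle discrete, continuous, or hybrid action spaces while keeping the benefits of Q-learning. Experiments show that on continuous control tasks with up to 21-dimensional actions, AQL outperforms D3PG (Barth-Maron et al., 2018) and QT-Opt (Kalashnikov et al., 2018). In experiments on structured discrete action spaces, AQL efficiently learns good policies.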
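
The sampled maximization described for Amortized Q-learning can be illustrated with a minimal sketch. It is not the authors' implementation: the toy `q_value` function, the Gaussian `sample_proposal` standing in for a learned proposal distribution, and all parameter names and defaults are hypothetical choices made only for illustration. The sketch shows the core idea of replacing the maximization over all actions with a maximization over a small sampled candidate set, and how that sampled maximum could enter a standard Q-learning target.

```python
import random


def q_value(state, action):
    # Hypothetical toy Q-function: largest (zero) when action equals state.
    return -(action - state) ** 2


def sample_proposal(state, rng, n_samples, scale=0.5):
    # Hypothetical stand-in for a learned proposal distribution:
    # a Gaussian centered on the state (assumption, not from the paper).
    return [rng.gauss(state, scale) for _ in range(n_samples)]


def amortized_max(state, candidates):
    # Maximize Q over the sampled candidates only, not the full action space.
    best = max(candidates, key=lambda a: q_value(state, a))
    return best, q_value(state, best)


def td_target(reward, next_state, rng, gamma=0.99, n_samples=16):
    # One-step Q-learning target with the max taken over sampled actions.
    candidates = sample_proposal(next_state, rng, n_samples)
    _, best_q = amortized_max(next_state, candidates)
    return reward + gamma * best_q


if __name__ == "__main__":
    rng = random.Random(0)
    candidates = sample_proposal(1.0, rng, 16)
    action, value = amortized_max(1.0, candidates)
    print("chosen action:", action, "Q:", value)
    print("TD target:", td_target(1.0, 1.0, random.Random(0)))
```

The cost of the maximization in this sketch scales with the number of sampled candidates rather than with the size of the action space, which is why the same pattern applies whether actions are discrete, continuous, or mixed.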