Offline reinforcement learning (RL) aims to learn a near-optimal policy
from recorded offline experience, without online exploration. Current offline RL
research includes: 1) generative modeling, i.e., approximating a policy using
fixed data; and 2) learning the state-action value function. While most
research focuses on the state-action value function, reducing the
bootstrapping error in value function approximation induced by the distribution
shift of training data, the effects of error propagation in generative modeling
have been neglected. In this paper, we analyze the error in generative
modeling. We propose AQL (action-conditioned Q-learning), a residual generative
model to reduce policy approximation error for offline RL. We show that our
method learns more accurate policy approximations on different benchmark
datasets. In addition, we show that the proposed offline RL method learns
more competitive AI agents in complex control tasks in the multiplayer
online battle arena (MOBA) game Honor of Kings.

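This paper studies generative modeling and state-action value function learning in offline RL, and proposes AQL, a new residual generative model that targets policy approximation error in offline RL. Experiments show that AQL learns more accurate policy approximations on test datasets of varying quality. The offline RL method also learns more competitive AI agents in the multiplayer online battle arena game Honor of Kings.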

Boosting Offline Reinforcement Learning with Residual Generative Modeling

We consider off-policy evaluation (OPE), which evaluates the performance of a
new policy from observed data collected from previous experiments, without
requiring the execution of the new policy. This finds important applications in
areas with high execution cost or safety concerns, such as medical diagnosis,
recommendation systems and robotics. In practice, due to the limited
information from off-policy data, it is highly desirable to construct rigorous
confidence intervals, not just point estimates, for the policy performance. In
this work, we propose a new variational framework which reduces the problem of
calculating tight confidence bounds in OPE to an optimization problem over a
feasible set that contains the true state-action value function with high
probability. The feasible set is constructed by leveraging statistical
properties of a recently proposed kernel Bellman loss (Feng et al., 2019). We
design an efficient computational approach for calculating our bounds, and
extend it to perform post-hoc diagnosis and correction for existing estimators.
Empirical results show that our method yields tight confidence intervals in
different settings.

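This paper proposes a new variational framework that turns the computation of tight confidence intervals in OPE into an optimization problem over a feasible set, which is constructed from the statistical properties of the recently proposed kernel Bellman loss. Empirical results show that the method yields tight confidence intervals in different settings.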

Accountable Off-Policy Evaluation With Kernel Bellman Statistics
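
As a rough illustration of the kernel Bellman loss on which the abstract above builds its feasible set, the following sketch estimates that loss with a V-statistic on a toy tabular problem. The two-state MDP, the Gaussian kernel over (state, action) features, the bandwidth, and all function and variable names are assumptions made for illustration and are not taken from the paper. The confidence-bound optimization itself is not shown.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """RBF kernel between two feature tuples (assumed kernel choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def kernel_bellman_loss(q, transitions, policy, gamma, bandwidth=1.0):
    """V-statistic estimate: mean over all pairs (i, j) of R_i * K(x_i, x_j) * R_j,
    where R_i is the sampled Bellman residual of transition i under `policy`."""
    residuals = [
        r + gamma * q[(s_next, policy[s_next])] - q[(s, a)]
        for s, a, r, s_next in transitions
    ]
    feats = [(float(s), float(a)) for s, a, _, _ in transitions]
    n = len(transitions)
    total = 0.0
    for i in range(n):
        for j in range(n):
            k_ij = gaussian_kernel(feats[i], feats[j], bandwidth)
            total += residuals[i] * k_ij * residuals[j]
    return total / (n * n)


# Hypothetical toy MDP: action a moves to state a; reward is 1 on entering state 1.
transitions = [(0, 0, 0.0, 0), (0, 1, 1.0, 1), (1, 0, 0.0, 0), (1, 1, 1.0, 1)]
policy = {0: 1, 1: 1}  # deterministic target policy: always take action 1
gamma = 0.5

# Q function of the target policy for this toy MDP, solved by hand.
true_q = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 2.0}
perturbed_q = dict(true_q)
perturbed_q[(0, 0)] += 0.5

loss_true = kernel_bellman_loss(true_q, transitions, policy, gamma)
loss_bad = kernel_bellman_loss(perturbed_q, transitions, policy, gamma)
print(loss_true, loss_bad)
```

In this deterministic toy problem the loss vanishes at the true Q function and is positive for the perturbed one. That is the property a set of the form {Q : loss(Q) <= threshold} relies on to retain the true Q function.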

Value-based methods constitute a fundamental methodology in planning and deep
reinforcement learning (RL). In this paper, we propose to exploit the
underlying structures of the state-action value function, i.e., Q function, for
both planning and deep RL. In particular, if the underlying system dynamics
lead to some global structures of the Q function, one should be able to
infer the function better by leveraging such structures. Specifically, we
investigate the low-rank structure, which is common in large data matrices.
We verify empirically the existence of low-rank Q functions in the context of
control and deep RL tasks. As our key contribution, by leveraging Matrix
Estimation (ME) techniques, we propose a general framework to exploit the
underlying low-rank structure in Q functions. This leads to a more efficient
planning procedure for classical control, and additionally, a simple scheme
that can be applied to any value-based RL technique to consistently achieve
better performance on "low-rank" tasks. Extensive experiments on control tasks
and Atari games confirm the efficacy of our approach. Code is available at
this https URL

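Using matrix estimation techniques, the paper proposes a scheme that exploits the global low-rank structure of Q functions to improve the performance of classical control and deep reinforcement learning. Experiments on control tasks and Atari games confirm the effectiveness of the method.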
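
To make the low-rank idea concrete, here is a minimal sketch of recovering unobserved entries of a Q matrix (rows as states, columns as actions) from a subset of observed entries. It uses rank-one alternating least squares as a simple stand-in for the matrix estimation techniques the abstract refers to. The matrix, the set of hidden entries, and all names are assumptions for illustration only.

```python
def complete_rank_one(observed, n_rows, n_cols, n_iters=200):
    """Fit q[i][j] ~ u[i] * v[j] to the observed entries by alternating
    least squares, then return the full reconstructed matrix.

    `observed` maps (row, col) -> value. Every row and column is assumed
    to have at least one observed entry.
    """
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(n_iters):
        # Update each row factor with the column factors held fixed.
        for i in range(n_rows):
            num = sum(val * v[j] for (r, j), val in observed.items() if r == i)
            den = sum(v[j] ** 2 for (r, j) in observed if r == i)
            u[i] = num / den
        # Update each column factor with the row factors held fixed.
        for j in range(n_cols):
            num = sum(val * u[i] for (i, c), val in observed.items() if c == j)
            den = sum(u[i] ** 2 for (i, c) in observed if c == j)
            v[j] = num / den
    return [[u[i] * v[j] for j in range(n_cols)] for i in range(n_rows)]


# Hypothetical rank-one Q matrix with 4 states and 3 actions.
row_factor = [1.0, 2.0, 3.0, 4.0]
col_factor = [1.0, 0.5, 2.0]
true_q = [[a * b for b in col_factor] for a in row_factor]

# Pretend these (state, action) pairs were never evaluated.
hidden = {(0, 2), (1, 0), (2, 2), (3, 1)}
observed = {
    (i, j): true_q[i][j]
    for i in range(4)
    for j in range(3)
    if (i, j) not in hidden
}

estimate = complete_rank_one(observed, 4, 3)
max_error = max(
    abs(estimate[i][j] - true_q[i][j]) for i in range(4) for j in range(3)
)
print(max_error)
```

Because the toy matrix is exactly rank one and noise-free, the hidden entries are recovered almost exactly. Real Q functions are at best approximately low-rank, so a practical method would need a higher-rank or regularized estimator.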