In this work, we present a scalable reinforcement learning method for
training multi-task policies from large offline datasets that can leverage both
human demonstrations and autonomously collected data. Our method uses a
Transformer to provide a scalable representation for Q-functions trained via
offline temporal difference backups. We therefore refer to the method as
Q-Transformer. By discretizing each action dimension and representing the
Q-value of each action dimension as separate tokens, we can apply effective
high-capacity sequence modeling techniques for Q-learning. We present several
design decisions that enable good performance with offline RL training, and
show that Q-Transformer outperforms prior offline RL algorithms and imitation
learning techniques on a large, diverse real-world robotic manipulation task
suite. The project's website and videos can be found at
this https URL

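This paper presents a scalable reinforcement learning method for training multi-task policies that can leverage both human demonstrations and autonomously collected data. The method uses a Transformer as a scalable representation for Q-functions trained via offline temporal difference backups, and is therefore called Q-Transformer. By discretizing each action dimension and representing the Q-value of each action dimension as a separate token, effective high-capacity sequence modeling techniques can be applied to Q-learning. The authors also describe several design decisions that enable good performance in offline RL training, and show that Q-Transformer outperforms prior offline RL algorithms and imitation learning techniques on a large, diverse real-world robotic manipulation task suite. The project's website and videos can be found at the linked URL.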
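The per-dimension discretization described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: uniform binning, greedy dimension-by-dimension selection, and the names `discretize`, `bin_center`, `greedy_action_bins`, and `toy_q_fn` are all assumptions made here for illustration. The hypothetical `q_fn` stands in for a learned sequence model that, given the state and the bins already chosen for earlier action dimensions, returns one Q-value per bin of the next dimension.

```python
from typing import Callable, List, Sequence


def discretize(value: float, low: float, high: float, num_bins: int) -> int:
    """Map a continuous action value in [low, high] to a uniform bin index."""
    if not low < high:
        raise ValueError("low must be smaller than high")
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper edge belongs to the last bin, so cap the index.
    return min(int(fraction * num_bins), num_bins - 1)


def bin_center(index: int, low: float, high: float, num_bins: int) -> float:
    """Return the continuous value at the center of a bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Hypothetical signature: (state, bins chosen so far) -> Q-value per bin
# of the next action dimension.
QFunction = Callable[[Sequence[float], Sequence[int]], List[float]]


def greedy_action_bins(q_fn: QFunction, state: Sequence[float],
                       num_dims: int) -> List[int]:
    """Pick the highest-valued bin for each action dimension in turn."""
    chosen: List[int] = []
    for _ in range(num_dims):
        q_values = q_fn(state, chosen)
        best = max(range(len(q_values)), key=q_values.__getitem__)
        chosen.append(best)
    return chosen


def toy_q_fn(state: Sequence[float], chosen: Sequence[int]) -> List[float]:
    """Toy stand-in for a learned model (assumption, 4 bins per dimension)."""
    num_bins = 4
    target = len(chosen) % num_bins
    return [-float(abs(b - target)) for b in range(num_bins)]


if __name__ == "__main__":
    bins = greedy_action_bins(toy_q_fn, state=[0.0], num_dims=3)
    print(bins, [bin_center(b, -1.0, 1.0, 4) for b in bins])
```

Treating each dimension as its own token means the maximization runs over `num_bins` values per step rather than over every combination of bins across all dimensions.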

Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

Extracting as much learning signal as possible from each trajectory has been a
key problem in reinforcement learning (RL), where sample inefficiency has posed
serious challenges for practical applications. Recent works have shown that
using expressive policy function approximators and conditioning on future
trajectory information -- such as future states in hindsight experience replay
or returns-to-go in Decision Transformer (DT) -- enables efficient learning of
multi-task policies, where at times online RL is fully replaced by offline
behavioral cloning, e.g. sequence modeling. We demonstrate that all these
approaches perform hindsight information matching (HIM) -- training policies
that can output the rest of a trajectory that matches some statistics of future
state information. We present Generalized Decision Transformer (GDT) for
solving any HIM problem, and show how different choices for the feature
function and the anti-causal aggregator not only recover DT as a special case,
but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for
matching different statistics of the future. For evaluating CDT and BDT, we
define offline multi-task state-marginal matching (SMM) and imitation learning
(IL) as two generic HIM problems, propose a Wasserstein distance loss as a
metric for both, and empirically study them on MuJoCo continuous control
benchmarks. CDT, which simply replaces anti-causal summation with anti-causal
binning in DT, enables the first effective offline multi-task SMM algorithm
that generalizes well to unseen and even synthetic multi-modal state-feature
distributions. BDT, which uses an anti-causal second transformer as the
aggregator, can learn to model any statistics of the future and outperforms DT
variants in offline multi-task IL. Our generalized formulations from HIM and
GDT greatly expand the role of powerful sequence modeling architectures in
modern RL.
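The difference between anti-causal summation and anti-causal binning can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the paper's code: the scalar per-step feature, the uniform bin range, the normalization of the histogram, and the function names `anti_causal_sum` and `anti_causal_binning` are all choices made here. With the per-step reward as the feature, the summation yields undiscounted returns-to-go, the quantity DT conditions on; the binning variant instead summarizes the distribution of future feature values.

```python
from typing import List, Sequence


def anti_causal_sum(features: Sequence[float]) -> List[float]:
    """For every step t, sum the features from t to the end of the trajectory."""
    out = [0.0] * len(features)
    running = 0.0
    for t in range(len(features) - 1, -1, -1):
        running += features[t]
        out[t] = running
    return out


def anti_causal_binning(features: Sequence[float], low: float, high: float,
                        num_bins: int) -> List[List[float]]:
    """For every step t, return a normalized histogram of features from t onward."""
    if not low < high:
        raise ValueError("low must be smaller than high")
    counts = [0] * num_bins
    out: List[List[float]] = [[] for _ in features]
    for t in range(len(features) - 1, -1, -1):
        clipped = min(max(features[t], low), high)
        idx = min(int((clipped - low) / (high - low) * num_bins), num_bins - 1)
        counts[idx] += 1
        remaining = len(features) - t  # number of steps from t to the end
        out[t] = [c / remaining for c in counts]
    return out


if __name__ == "__main__":
    rewards = [1.0, 0.0, 2.0]
    print(anti_causal_sum(rewards))
    print(anti_causal_binning([0.1, 0.9, 0.9], low=0.0, high=1.0, num_bins=2))
```

Both functions scan the trajectory backward once, so each step's target depends only on that step and the ones after it.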

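The paper proposes the Generalized Decision Transformer (GDT) for solving HIM problems, a method that can extract multi-task policies from trajectory data. GDT not only recovers the Decision Transformer (DT) as a special case, but also introduces the new Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future, and it performs well on MuJoCo continuous control benchmarks.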
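The abstract names a Wasserstein distance as the evaluation metric but does not say which variant is used. As a rough sketch, assuming one-dimensional features and two empirical samples of equal size, the 1-Wasserstein distance reduces to the mean absolute difference between the sorted samples. The function name `wasserstein_1d` is hypothetical.

```python
from typing import Sequence


def wasserstein_1d(xs: Sequence[float], ys: Sequence[float]) -> float:
    """1-Wasserstein distance between two equal-size 1D empirical samples."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("samples must be non-empty and of equal size")
    # With uniform weights and equal sizes, the optimal transport plan
    # matches the i-th smallest value of xs to the i-th smallest of ys.
    paired = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in paired) / len(xs)


if __name__ == "__main__":
    target_features = [0.0, 1.0, 2.0]
    rollout_features = [2.0, 1.0, 0.0]
    print(wasserstein_1d(target_features, rollout_features))
```

A distance of zero means the rollout's feature values match the target sample exactly as a multiset, regardless of the order in which they were visited.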