In this work, we present a scalable reinforcement learning method for
training multi-task policies from large offline datasets that can leverage both
human demonstrations and autonomously collected data. Our method uses a
Transformer to provide a scalable representation for Q-functions trained via
offline temporal difference backups. We therefore refer to the method as
Q-Transformer. By discretizing each action dimension and representing the
Q-value of each action dimension as a separate token, we can apply effective
high-capacity sequence modeling techniques for Q-learning. We present several
design decisions that enable good performance with offline RL training, and
show that Q-Transformer outperforms prior offline RL algorithms and imitation
learning techniques on a large, diverse real-world robotic manipulation task
suite. The project's website and videos can be found at
this https URL
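
To make the per-dimension tokenization concrete, the following is a minimal sketch and not the authors' implementation. It assumes actions lie in [-1, 1], uses a uniform grid of 8 bins per dimension, and replaces the Transformer with a hypothetical stand-in, toy_q_fn, that returns one Q-value per bin for the next action dimension given the tokens already chosen. All function names, the bin count, and the action range are illustrative assumptions.

```python
from typing import Callable, List, Sequence

NUM_BINS = 8  # assumed number of bins per action dimension (illustrative)


def discretize(action: Sequence[float], low: float = -1.0, high: float = 1.0,
               num_bins: int = NUM_BINS) -> List[int]:
    """Map each continuous action dimension to a bin index (one token each)."""
    tokens = []
    for a in action:
        clipped = min(max(a, low), high)
        frac = (clipped - low) / (high - low)
        # The upper edge of the range falls into the last bin.
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def undiscretize(tokens: Sequence[int], low: float = -1.0, high: float = 1.0,
                 num_bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to the centers of their bins."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical interface: given a state and the action tokens chosen so far,
# return one Q-value per bin for the next action dimension.
QFn = Callable[[Sequence[float], Sequence[int]], List[float]]


def greedy_action(q_fn: QFn, state: Sequence[float], num_dims: int) -> List[int]:
    """Pick the arg-max bin one dimension at a time, conditioning on the prefix."""
    prefix: List[int] = []
    for _ in range(num_dims):
        q_values = q_fn(state, prefix)
        best = max(range(len(q_values)), key=q_values.__getitem__)
        prefix.append(best)
    return prefix


def toy_q_fn(state: Sequence[float], prefix: Sequence[int]) -> List[float]:
    """Stand-in for a learned model: prefers the bin equal to the dimension index."""
    target = len(prefix)
    return [-abs(b - target) for b in range(NUM_BINS)]


if __name__ == "__main__":
    print(discretize([-1.0, 0.0, 1.0]))
    print(greedy_action(toy_q_fn, state=[0.0], num_dims=3))
```

Selecting one dimension at a time keeps each maximization over a single set of bins rather than over every combination of bins across all dimensions, which is what makes a sequence model a natural fit for this representation.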
