Large transformer models powered by diverse data and model scale have
dominated natural language modeling and computer vision and pushed the frontier
of multiple AI areas. In reinforcement learning (RL), despite many efforts on
transformer-based policies, a key limitation remains: current
transformer-based policies cannot learn by directly combining information from
multiple sub-optimal trials. In this work, we address this issue using the recently
proposed chain of hindsight to relabel experience, where we train a transformer
on a sequence of trajectories sorted in ascending order of their
total rewards. Our method consists of relabelling the target return of each
trajectory to the maximum total reward among the trajectories in the sequence and
training an autoregressive model to predict actions conditioned on past
states, actions, rewards, target returns, and task completion tokens. The
resulting model, Agentic Transformer (AT), can learn to improve upon itself
both at training and test time. As we show on D4RL and ExoRL benchmarks, to the
best of our knowledge, this is the first time that a simple transformer-based
model performs competitively with both temporal-difference and
imitation-learning-based approaches, even from sub-optimal data. Our Agentic
Transformer also shows a promising scaling trend: bigger models
consistently improve results.
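
The relabelling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trajectory` class, the `build_hindsight_chain` function, and the record field names are hypothetical, states and actions are toy scalars, and the rule that the completion flag is 1 only for the final (highest-return) trial is an assumption, since the abstract does not say how completion tokens are assigned.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """One trial: per-step states, actions, and rewards (toy scalar types)."""
    states: List[float]
    actions: List[int]
    rewards: List[float]

    def total_reward(self) -> float:
        return sum(self.rewards)


def build_hindsight_chain(trajectories: List[Trajectory]) -> List[Dict]:
    """Sort trials by total reward (ascending) and relabel their target return."""
    if not trajectories:
        raise ValueError("need at least one trajectory")
    chain = sorted(trajectories, key=lambda t: t.total_reward())
    # Every trajectory in the chain is relabelled with the best total reward.
    target_return = chain[-1].total_reward()
    records = []
    for position, traj in enumerate(chain):
        records.append({
            "states": traj.states,
            "actions": traj.actions,
            "rewards": traj.rewards,
            "target_return": target_return,
            # Assumption: the completion flag is 1 only for the final (best) trial.
            "completed": int(position == len(chain) - 1),
        })
    return records
```

Concatenating the returned records in order would give one training sequence in which worse trials precede better ones, all labelled with the same target return.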

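This paper uses the "chain of hindsight" method to train a transformer model for reinforcement learning that can directly combine information from multiple trajectories, and demonstrates its competitiveness and scalability through its performance on the D4RL and ExoRL benchmarks.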

Emergent Agentic Transformer from Chain of Hindsight Experience

Recent work has shown the promise of creating generalist transformer-based
policies for language, vision, and sequential decision-making problems. To
create such models, we generally require centralized training objectives, data,
and compute. It is of interest whether we can create generalist
policies more flexibly by merging multiple task-specific, individually trained
policies. In this work, we take a preliminary step in this direction by
merging, or averaging, in weight space subsets of Decision Transformers trained
on different MuJoCo locomotion problems, forming multi-task models without
centralized training. We also propose that when merging policies, we can obtain
better results if all policies start from common, pre-trained initializations
and are co-trained on shared auxiliary tasks during problem-specific
finetuning. In general, we believe research in this direction can help
democratize and distribute the process of forming generally capable agents.
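
Averaging policies in weight space, as described above, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's procedure: the `average_weights` function and the `Params` alias are hypothetical, parameters are plain Python lists keyed by name rather than real Decision Transformer tensors, and every parameter is averaged uniformly, whereas the abstract mentions merging subsets without specifying which.

```python
from typing import Dict, List

# Hypothetical parameter container: parameter name -> flat list of weights.
Params = Dict[str, List[float]]


def average_weights(policies: List[Params]) -> Params:
    """Uniformly average same-shaped parameters across several policies."""
    if not policies:
        raise ValueError("need at least one policy")
    names = policies[0].keys()
    for params in policies:
        if params.keys() != names:
            raise ValueError("all policies must share the same parameter names")
    merged: Params = {}
    for name in names:
        size = len(policies[0][name])
        if any(len(params[name]) != size for params in policies):
            raise ValueError(f"parameter {name!r} has mismatched sizes")
        # Element-wise mean of this parameter over all policies.
        merged[name] = [
            sum(values) / len(policies)
            for values in zip(*(params[name] for params in policies))
        ]
    return merged
```

The check on names and sizes reflects the point made above about common initializations: averaging is only well defined when the policies share an architecture, and starting from the same pre-trained weights is what the abstract proposes for better merged results.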

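This paper explores a preliminary approach to creating generalist policies more flexibly by merging subsets of Decision Transformers trained on different MuJoCo locomotion problems into multi-task models without centralized training. It also proposes that merging yields better results when the policies share a common pre-trained initialization and are co-trained on shared auxiliary tasks during problem-specific finetuning, which could help democratize and distribute the process of forming generally capable agents.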