Recently, methods such as Decision Transformer that reduce reinforcement
learning to a prediction task and solve it via supervised learning (RvS) have
become popular due to their simplicity, robustness to hyperparameters, and
strong overall performance on offline RL tasks. However, simply conditioning a
probabilistic model on a desired return and taking the predicted action can
fail dramatically in stochastic environments since trajectories that result in
a return may have only achieved that return due to luck. In this work, we
describe the limitations of RvS approaches in stochastic environments and
propose a solution. Rather than simply conditioning on the return of a single
trajectory as is standard practice, our proposed method, ESPER, learns to
cluster trajectories and conditions on average cluster returns, which are
independent from environment stochasticity. Doing so allows ESPER to achieve
strong alignment between target return and expected performance in real
environments. We demonstrate this in several challenging stochastic offline-RL
tasks including the challenging puzzle game 2048, and Connect Four playing
against a stochastic opponent. In all tested domains, ESPER achieves
significantly better alignment between the target return and achieved return
than simply conditioning on returns. ESPER also achieves higher maximum
performance than even the value-based baselines.

本文介绍了基于预测任务的强化学习方法在随机环境下的局限性，并提出了一种名为 ESPER 的解决方案，该方法学习轨迹聚类并以平均聚类收益进行条件约束，从而在真实环境中实现了目标收益和预期性能的强对齐。ESPER 在多项挑战性的离线 RL 任务中展现出了更好的表现。

不能只依赖运气：决策 Transformer 和 RvS 在随机环境中的失败

You Can't Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments

Large language models readily adapt to novel settings, even without
task-specific training data. Can their zero-shot capacity be extended to
multimodal inputs? In this work, we propose ESPER which extends language-only
zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to
language model generations without direct supervision: for example, in the
image case our reward optimization relies only on cosine similarity derived
from CLIP, and thus requires no additional explicitly paired (image, caption)
data. Because the parameters of the language model are left unchanged, the
model maintains its capacity for zero-shot generalization. Experiments
demonstrate that ESPER outperforms baselines and prior work on a variety of
zero-shot tasks; these include a new benchmark we collect+release, ESP dataset,
which tasks models with generating several diversely-styled captions for each
image.

本论文提出了一种名为 ESPER 的方法，将仅基于语言的零 - shot 模型扩展到未见过的多模态任务，如图像和音频字幕生成，采用强化学习来无需直接监督地将多模态输入与语言模型生成对齐，实验表明该方法胜过了基线和之前工作的新基准测试。