Large transformer models trained on diverse datasets have shown a remarkable
ability to learn in-context, achieving high few-shot performance on tasks they
were not explicitly trained to solve. In this paper, we study the in-context
learning capabilities of transformers in decision-making problems, i.e.,
reinforcement learning (RL) for bandits and Markov decision processes. To do
so, we introduce and study Decision-Pretrained Transformer (DPT), a supervised
pretraining method where the transformer predicts an optimal action given a
query state and an in-context dataset of interactions, across a diverse set of
tasks. This procedure, while simple, produces a model with several surprising
capabilities. We find that the pretrained transformer can be used to solve a
range of RL problems in-context, exhibiting both exploration online and
conservatism offline, despite not being explicitly trained to do so. The model
also generalizes beyond the pretraining distribution to new tasks and
automatically adapts its decision-making strategies to unknown structure.
Theoretically, we show DPT can be viewed as an efficient implementation of
Bayesian posterior sampling, a provably sample-efficient RL algorithm. We
further leverage this connection to provide guarantees on the regret of the
in-context algorithm yielded by DPT, and prove that it can learn faster than
algorithms used to generate the pretraining data. These results suggest a
promising yet simple path towards instilling strong in-context decision-making
abilities in transformers.

在这篇论文中，我们通过引入和研究 Decision-Pretrained Transformer（DPT）并展示它在上下文感知机器人决策中的运用，证明了大型变形机模型在多个数据集上的上下文学习能力，同时实现了对决策问题的研究及基于贝叶斯后验采样的跨任务性能。

监督预训练可学习上下文强化学习

Supervised Pretraining Can Learn In-Context Reinforcement Learning

Differentially private collaborative filtering is a challenging task, both in
terms of accuracy and speed. We present a simple algorithm that is provably
differentially private, while offering good performance, using a novel
connection of differential privacy to Bayesian posterior sampling via
Stochastic Gradient Langevin Dynamics. Due to its simplicity the algorithm
lends itself to efficient implementation. By careful systems design and by
exploiting the power law behavior of the data to maximize CPU cache bandwidth
we are able to generate 1024 dimensional models at a rate of 8.5 million
recommendations per second on a single PC.

提出了一种简单的算法来实现可证明偏差私有性以及良好性能的差异性私人协作过滤。通过差分隐私和贝叶斯后验采样的新型连接方式，该算法可有效实现。同时，通过精细的系统设计和利用数据的幂律行为最大化 CPU 缓存带宽，我们可以在单个 PC 上以 8.5 百万每秒的速率生成 1024 维模型实现推荐。