This work presents In-Context Policy Iteration (ICPI), an algorithm for
performing Reinforcement Learning (RL) in context using foundation models. While the
application of foundation models to RL has received considerable attention,
most approaches rely on either (1) the curation of expert demonstrations
(either through manual design or task-specific pretraining) or (2) adaptation
to the task of interest using gradient methods (either fine-tuning or training
of adapter layers). Both of these techniques have drawbacks. Collecting
demonstrations is labor-intensive, and algorithms that rely on them do not
outperform the experts from which the demonstrations were derived. All gradient
techniques are inherently slow, sacrificing the "few-shot" quality that made
in-context learning attractive to begin with. In this work, we present an
algorithm, ICPI, that learns to perform RL tasks without expert demonstrations
or gradients. Instead, ICPI is a policy-iteration method in which the prompt
content is the entire locus of learning. ICPI iteratively updates the contents
of the prompt from which it derives its policy through trial-and-error
interaction with an RL environment. In order to eliminate the role of
in-weights learning (on which approaches like Decision Transformer rely
heavily), we demonstrate our algorithm using Codex, a language model with no
prior knowledge of the domains on which we evaluate it.
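
The loop described above, in which a policy is derived from a prompt and the
prompt is then updated with newly observed transitions, can be sketched as
follows. This is only an illustrative sketch under strong assumptions, not the
implementation evaluated in this work: the foundation model is replaced by a
hypothetical lookup function (predict_from_prompt) that copies the most recent
matching example from the prompt, the rollout policy (rollout_policy) simply
imitates recent behavior, and the chain environment (step) is a made-up toy.
All function names and parameters are introduced here for illustration.

```python
import random

ACTIONS = (0, 1)  # toy action set (assumption): 0 = move left, 1 = move right

def step(state, action, goal=4):
    """Hypothetical toy chain environment with a reward of 1.0 at the goal."""
    nxt = max(0, min(goal, state + (1 if action == 1 else -1)))
    done = nxt == goal
    return (1.0 if done else 0.0), nxt, done

def predict_from_prompt(prompt, state, action):
    """Stand-in for a foundation-model query (assumption, not a real model).

    Copies the most recent transition in the prompt with the same state and
    action. If none exists, it predicts zero reward and ends the rollout.
    """
    for s, a, r, s2, done in reversed(prompt):
        if s == state and a == action:
            return r, s2, done
    return 0.0, state, True

def rollout_policy(prompt, state, rng):
    """Imitate the most recent action taken in this state, else act randomly."""
    for s, a, *_ in reversed(prompt):
        if s == state:
            return a
    return rng.choice(ACTIONS)

def q_estimate(prompt, state, action, rng, gamma=0.9, horizon=8):
    """Estimate Q(state, action) by a rollout simulated from the prompt alone."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        r, state, done = predict_from_prompt(prompt, state, action)
        total += discount * r
        if done:
            break
        discount *= gamma
        action = rollout_policy(prompt, state, rng)
    return total

def run(episodes=30, max_steps=20, seed=0):
    """Act greedily on prompt-derived Q-values; learn only by growing the prompt."""
    rng = random.Random(seed)
    prompt, returns = [], []  # no weights are updated anywhere in this loop
    for _ in range(episodes):
        state, done, ret, t = 0, False, 0.0, 0
        while not done and t < max_steps:
            qs = [q_estimate(prompt, state, a, rng) for a in ACTIONS]
            best = max(qs)
            # Break ties at random so that early behavior explores.
            action = rng.choice([a for a, q in zip(ACTIONS, qs) if q == best])
            r, nxt, done = step(state, action)
            # Trial-and-error experience is written into the prompt. A real
            # system would have to truncate this to fit a context window.
            prompt.append((state, action, r, nxt, done))
            state, ret, t = nxt, ret + r, t + 1
        returns.append(ret)
    return prompt, returns
```

Calling run() returns the accumulated prompt and the per-episode returns. In
this sketch, any improvement in behavior comes only from the growing list of
transitions, which mirrors the idea that the prompt content is the locus of
learning.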

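This paper proposes an algorithm called ICPI that uses foundation models to perform reinforcement learning tasks in context, updating the prompt contents through trial-and-error interaction so that RL tasks can be learned without expert demonstrations or gradients.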