In-context learning is a promising approach for offline reinforcement
learning (RL) to handle online tasks, which can be achieved by providing task
prompts. Recent works demonstrated that in-context RL could emerge with
self-improvement in a trial-and-error manner when treating RL tasks as an
across-episodic sequential prediction problem. Although this self-improvement
requires no gradient updates, current methods still incur high computational
costs as the across-episodic sequence grows with the task horizon. To this
end, we propose an In-context Decision Transformer (IDT) to achieve
self-improvement in a high-level trial-and-error manner. Specifically, IDT is
inspired by the efficient hierarchical structure of human decision-making and
thus reconstructs the sequence to consist of high-level decisions instead of
low-level actions that interact with environments. As one high-level decision
can guide multi-step low-level actions, IDT naturally avoids excessively long
sequences and solves online tasks more efficiently. Experimental results show
that IDT achieves state-of-the-art performance on long-horizon tasks compared
with current in-context RL methods. In particular, online evaluation with IDT
is 36 times faster than the baselines on the D4RL benchmark and 27 times
faster on the Grid World benchmark.
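
As a rough illustration of why grouping low-level steps under high-level decisions shortens the context, the sketch below counts sequence tokens under assumed values. The function names, the three-tokens-per-step layout, and the ten-steps-per-decision ratio are illustrative assumptions and are not taken from the paper.

```python
import math


def low_level_tokens(episodes, horizon, tokens_per_step=3):
    """Tokens in an across-episodic sequence that stores every low-level step.

    tokens_per_step=3 assumes one token each for state, action, and reward.
    """
    return episodes * horizon * tokens_per_step


def high_level_tokens(episodes, horizon, steps_per_decision, tokens_per_decision=3):
    """Tokens when each stored decision stands for several low-level steps."""
    decisions_per_episode = math.ceil(horizon / steps_per_decision)
    return episodes * decisions_per_episode * tokens_per_decision


# Assumed setting: 4 episodes in context, 1000 steps each,
# one high-level decision per 10 low-level steps.
low = low_level_tokens(episodes=4, horizon=1000)
high = high_level_tokens(episodes=4, horizon=1000, steps_per_decision=10)
print(low, high, low / high)  # 12000 1200 10.0
```

Under these assumptions the context shrinks by the number of steps each decision covers. Because self-attention cost grows faster than linearly with sequence length, the saving in compute can be larger than the saving in tokens.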

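This paper proposes a high-level trial-and-error method that realizes in-context learning for offline reinforcement learning by providing task prompts; it solves online tasks more efficiently and achieves state-of-the-art results on long-horizon tasks.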

In-Context Decision Transformer: Reinforcement Learning via Hierarchical Chain-of-Thought

Computational complexity is a core theory of computer science, which dictates
the degree of difficulty of computation. There are many problems with high
complexity that we have to deal with, especially in AI. This raises a
big question: Is there a better way to deal with these highly complex problems
than remaining bounded by computational complexity? We believe that ideas and
methods from intelligence science can be applied to these problems and help us
to exceed computational complexity. In this paper, we try to clarify concepts,
and we propose definitions such as unparticularized computing, particularized
computing, computing agents, and dynamic search. We also propose and discuss a
framework, i.e., trial-and-error + dynamic search. The Number Partition Problem is
a well-known NP-complete problem, and we use this problem as an example to
illustrate the ideas discussed.
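
For readers unfamiliar with the Number Partition Problem, the sketch below shows a generic trial-and-error local search for it: each number is assigned to one of two subsets, and single reassignments are kept only when they do not widen the gap between the subset sums. This is a standard randomized hill climb written for illustration, with hypothetical function names. It is not the framework proposed in the paper and it does not guarantee an optimal partition.

```python
import random


def partition_difference(numbers, signs):
    """Absolute difference between the two subset sums.

    signs[i] is +1 or -1 and says which subset numbers[i] belongs to.
    """
    return abs(sum(n * s for n, s in zip(numbers, signs)))


def trial_and_error_partition(numbers, trials=10_000, seed=0):
    """Flip one assignment at a time, keeping flips that do not hurt."""
    rng = random.Random(seed)
    signs = [rng.choice((-1, 1)) for _ in numbers]
    best = partition_difference(numbers, signs)
    for _ in range(trials):
        if best == 0:  # perfect partition found, nothing left to improve
            break
        i = rng.randrange(len(numbers))
        signs[i] = -signs[i]  # trial: move one number to the other subset
        diff = partition_difference(numbers, signs)
        if diff <= best:
            best = diff  # keep the move
        else:
            signs[i] = -signs[i]  # error: undo the move
    return signs, best


numbers = [3, 1, 1, 2, 2, 1]
signs, best = trial_and_error_partition(numbers)
print(signs, best)
```

The returned `best` always equals the difference of the returned assignment, so the result can be checked independently with `partition_difference`.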

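This paper examines computational complexity in computer science, proposes ideas and methods drawn from intelligence science, and uses the NP-complete Number Partition Problem as a case study for a framework of trial-and-error plus dynamic search.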

Exceeding Computational Complexity Trial-and-Error Dynamic Action and Intelligence

This work presents In-Context Policy Iteration, an algorithm for performing
Reinforcement Learning (RL), in-context, using foundation models. While the
application of foundation models to RL has received considerable attention,
most approaches rely on either (1) the curation of expert demonstrations
(either through manual design or task-specific pretraining) or (2) adaptation
to the task of interest using gradient methods (either fine-tuning or training
of adapter layers). Both of these techniques have drawbacks. Collecting
demonstrations is labor-intensive, and algorithms that rely on them do not
outperform the experts from which the demonstrations were derived. All gradient
techniques are inherently slow, sacrificing the "few-shot" quality that made
in-context learning attractive to begin with. In this work, we present an
algorithm, ICPI, that learns to perform RL tasks without expert demonstrations
or gradients. Instead, we present a policy-iteration method in which the prompt
content is the entire locus of learning. ICPI iteratively updates the contents
of the prompt from which it derives its policy through trial-and-error
interaction with an RL environment. In order to eliminate the role of
in-weights learning (on which approaches like Decision Transformer rely
heavily), we demonstrate our algorithm using Codex, a language model with no
prior knowledge of the domains on which we evaluate it.
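
The idea that the prompt alone can carry what is learned can be illustrated with a toy loop in which the only thing that changes between episodes is a growing list of past transitions. Everything in this sketch is an assumption made for illustration: `stand_in_model` is a hypothetical, hand-written lookup that stands in for a language model, the environment is a made-up task where action `state % 2` is rewarded, and the loop is a simplification and not the ICPI algorithm itself.

```python
def stand_in_model(prompt, state, actions):
    """Hypothetical stand-in for a language model conditioned on the prompt.

    It has no trainable weights; its choice depends only on the
    (state, action, reward) records currently in the prompt.
    """
    tried = {}
    for s, a, r in prompt:
        if s == state:
            tried[a] = max(r, tried.get(a, r))
    rewarded = [a for a, r in tried.items() if r > 0]
    if rewarded:
        return rewarded[0]  # reuse an action that earned reward before
    untried = [a for a in actions if a not in tried]
    return untried[0] if untried else actions[0]  # otherwise try something new


def run(episodes=3, n_states=4, actions=(0, 1)):
    """Interact by trial and error, updating only the prompt contents."""
    prompt = []
    returns = []
    for _ in range(episodes):
        total = 0.0
        for state in range(n_states):
            action = stand_in_model(prompt, state, actions)
            reward = 1.0 if action == state % 2 else 0.0  # made-up toy task
            prompt.append((state, action, reward))  # the only "learning" step
            total += reward
        returns.append(total)
    return returns


print(run())  # [2.0, 4.0, 4.0]
```

The episode return improves without any gradient step because later decisions are conditioned on a prompt that contains the outcomes of earlier trials.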

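This paper proposes an algorithm called ICPI that uses foundation models to perform reinforcement learning tasks in context, updating the prompt contents through trial-and-error interaction so that RL tasks are learned without expert demonstrations or gradients.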

In-Context Policy Iteration

The high probability of hardware failures prevents many advanced robots
(e.g., legged robots) from being confidently deployed in real-world situations
(e.g., post-disaster rescue). Instead of attempting to diagnose the failures,
robots could adapt by trial-and-error in order to be able to complete their
tasks. In this situation, damage recovery can be seen as a Reinforcement
Learning (RL) problem. However, the best RL algorithms for robotics require the
robot and the environment to be reset to an initial state after each episode,
that is, the robot is not learning autonomously. In addition, most of the RL
methods for robotics do not scale well with complex robots (e.g., walking
robots) and either cannot be used at all or take too long to converge to a
solution (e.g., hours of learning). In this paper, we introduce a novel
learning algorithm called "Reset-free Trial-and-Error" (RTE) that (1) breaks
the complexity by pre-generating hundreds of possible behaviors with a dynamics
simulator of the intact robot, and (2) allows complex robots to quickly recover
from damage while completing their tasks and taking the environment into
account. We evaluate our algorithm on a simulated wheeled robot, a simulated
six-legged robot, and a real six-legged walking robot that are damaged in
several ways (e.g., a missing leg, a shortened leg, or a faulty motor) and
whose objective is to reach a sequence of targets in an arena. Our experiments
show that the robots can recover most of their locomotion abilities in an
environment with obstacles, and without any human intervention.

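This paper proposes a new learning algorithm called "Reset-free Trial-and-Error" that addresses the problem of complex robots being unable to recover their locomotion abilities after hardware damage; the algorithm learns autonomously and adapts quickly across different environments.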