Recent work has shown that offline reinforcement learning (RL) can be
formulated as a sequence modeling problem (Chen et al., 2021; Janner et al.,
2021) and solved via approaches similar to large-scale language modeling.
However, any practical instantiation of RL also involves an online component,
where policies pretrained on passive offline datasets are finetuned via
task-specific interactions with the environment. We propose Online Decision
Transformers (ODT), an RL algorithm based on sequence modeling that blends
offline pretraining with online finetuning in a unified framework. Our
framework uses sequence-level entropy regularizers in conjunction with
autoregressive modeling objectives for sample-efficient exploration and
finetuning. Empirically, we show that ODT is competitive with the
state-of-the-art in absolute performance on the D4RL benchmark but shows much
more significant gains during the finetuning procedure.
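
As a rough illustration of how an autoregressive action-modeling loss can be combined with a sequence-level entropy term, the sketch below averages the negative log-likelihood and the policy entropy over the steps of one sequence and joins them with a penalty weight. The one-dimensional Gaussian policy, the function names, and the parameters `lam` and `target_entropy` are assumptions made for illustration; they are not taken from the abstract and are not the paper's implementation.

```python
import math

def gaussian_nll(action, mean, std):
    """Negative log-likelihood of a scalar action under N(mean, std^2)."""
    var = std * std
    return 0.5 * math.log(2.0 * math.pi * var) + (action - mean) ** 2 / (2.0 * var)

def gaussian_entropy(std):
    """Differential entropy of a scalar Gaussian with standard deviation std."""
    return 0.5 * math.log(2.0 * math.pi * math.e * std * std)

def sequence_loss(actions, means, stds, lam, target_entropy):
    """Hypothetical sequence-level loss: mean NLL plus a penalty that grows
    when the mean entropy over the sequence falls below target_entropy.
    Returns (loss, mean_nll, mean_entropy)."""
    k = len(actions)
    nll = sum(gaussian_nll(a, m, s) for a, m, s in zip(actions, means, stds)) / k
    entropy = sum(gaussian_entropy(s) for s in stds) / k
    loss = nll + lam * (target_entropy - entropy)
    return loss, nll, entropy

# Usage: a three-step sequence of predicted action distributions (made-up numbers).
loss, nll, entropy = sequence_loss(
    actions=[0.1, -0.2, 0.0],
    means=[0.0, 0.0, 0.0],
    stds=[1.0, 1.0, 1.0],
    lam=0.5,
    target_entropy=1.0,
)
```

Because the entropy is averaged over the whole sequence rather than enforced per step, individual steps can be nearly deterministic as long as the sequence as a whole stays sufficiently stochastic.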

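This paper proposes the Online Decision Transformer (ODT), a sequence-modeling-based algorithm that combines sequence-level entropy regularization with autoregressive modeling objectives across offline pretraining and online finetuning to achieve efficient exploration and finetuning. Experiments on the D4RL benchmark show that ODT is competitive with state-of-the-art methods in absolute performance and shows more significant gains during finetuning.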

Online Decision Transformer

Sample-efficient exploration is crucial not only for discovering rewarding
experiences but also for adapting to environment changes in a task-agnostic
fashion. A principled treatment of the problem of optimal input synthesis for
system identification is provided within the framework of sequential Bayesian
experimental design. In this paper, we present an effective
trajectory-optimization-based approximate solution of this otherwise
intractable problem that models optimal exploration in an unknown Markov
decision process (MDP). By interleaving episodic exploration with Bayesian
nonlinear system identification, our algorithm takes advantage of the inductive
bias to explore in a directed manner, without assuming prior knowledge of the
MDP. Empirical evaluations indicate a clear advantage of the proposed algorithm
in terms of the rate of convergence and the final model fidelity when compared
to intrinsic-motivation-based algorithms employing exploration bonuses such as
prediction error and information gain. Moreover, our method maintains a
computational advantage over a recent model-based active exploration (MAX)
algorithm, by focusing on the information gain along trajectories instead of
seeking a global exploration policy. A reference implementation of our
algorithm and the conducted experiments is publicly available.
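
To make "information gain along trajectories" concrete, here is a minimal sketch for the simplest possible case: a Bayesian linear model with known Gaussian noise, where the information gain of one observation with feature vector phi is 0.5 * log(1 + phi^T Sigma phi / noise_var) and the posterior covariance Sigma is updated with a rank-one (Sherman-Morrison) step. The linear-Gaussian model, the function names, and the hand-written candidate feature sequences are assumptions for illustration only; the abstract describes nonlinear system identification and trajectory optimization, which this sketch does not implement.

```python
import math

def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def info_gain_step(cov, phi, noise_var):
    """Information gain of one observation with features phi, plus the
    updated posterior covariance (cov is assumed symmetric)."""
    cov_phi = mat_vec(cov, phi)
    s = sum(p * c for p, c in zip(phi, cov_phi))  # phi^T cov phi
    gain = 0.5 * math.log(1.0 + s / noise_var)
    denom = noise_var + s
    n = len(phi)
    new_cov = [[cov[i][j] - cov_phi[i] * cov_phi[j] / denom for j in range(n)]
               for i in range(n)]
    return gain, new_cov

def trajectory_info_gain(cov, features, noise_var):
    """Total information gain accumulated along a sequence of feature vectors."""
    total = 0.0
    for phi in features:
        gain, cov = info_gain_step(cov, phi, noise_var)
        total += gain
    return total

# Usage: compare two hypothetical candidate trajectories under a unit prior.
prior = [[1.0, 0.0], [0.0, 1.0]]
repeat_same = [[1.0, 0.0], [1.0, 0.0]]   # visits the same direction twice
visit_both = [[1.0, 0.0], [0.0, 1.0]]    # visits two different directions
gain_repeat = trajectory_info_gain(prior, repeat_same, noise_var=1.0)
gain_both = trajectory_info_gain(prior, visit_both, noise_var=1.0)
```

In this linear-Gaussian setting the gain depends only on the visited features, not on the observed outcomes, so candidate trajectories can be scored before they are executed. A receding-horizon scheme would execute part of the best candidate, update the posterior with the real data, and plan again.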

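This paper proposes a machine learning algorithm for unknown Markov decision processes. Working within the framework of sequential Bayesian experimental design, it handles the optimal exploration problem with a trajectory-optimization-based approximation, exploring the unknown environment without prior knowledge and performing system identification through optimal input synthesis. Compared with other intrinsic-motivation-based algorithms, it shows a clear advantage in convergence rate and final model fidelity. Compared with the recent model-based active exploration algorithm, it focuses on the information gain along trajectories and has a clear computational advantage.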

Receding Horizon Curiosity

Efficient exploration is a long-standing problem in sensorimotor learning.
Major advances have been demonstrated in noise-free, non-stochastic domains
such as video games and simulation. However, most of these formulations either
get stuck in environments with stochastic dynamics or are too inefficient to be
scalable to real robotics setups. In this paper, we propose a formulation for
exploration inspired by work in the active learning literature. Specifically,
we train an ensemble of dynamics models and incentivize the agent to explore
such that the disagreement among the ensemble members is maximized. This allows the
agent to learn skills by exploring in a self-supervised manner without any
external reward. Notably, we further leverage the disagreement objective to
optimize the agent's policy in a differentiable manner, without using
reinforcement learning, which results in sample-efficient exploration. We
demonstrate the efficacy of this formulation across a variety of benchmark
environments including stochastic-Atari, MuJoCo and Unity. Finally, we
implement our differentiable exploration on a real robot which learns to
interact with objects completely from scratch. Project videos and code are at
this https URL
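
The disagreement signal described above can be sketched as the variance of next-state predictions across an ensemble. In the sketch below, each "dynamics model" is a random linear map standing in for a trained network, and the agent greedily picks the candidate action with the highest disagreement. The model class, the function names, and the greedy selection are assumptions for illustration; the abstract describes optimizing the policy in a differentiable manner, which is not shown here.

```python
import random
from statistics import pvariance

def make_linear_model(seed, state_dim, action_dim):
    """Hypothetical stand-in for one learned dynamics model: a fixed random
    linear map from (state, action) to a predicted next state."""
    rng = random.Random(seed)
    n_in = state_dim + action_dim
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(state_dim)]

    def predict(state, action):
        x = list(state) + list(action)
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

    return predict

def disagreement(models, state, action):
    """Intrinsic reward: variance of the ensemble's next-state predictions,
    averaged over state dimensions."""
    predictions = [model(state, action) for model in models]
    per_dim = [pvariance(column) for column in zip(*predictions)]
    return sum(per_dim) / len(per_dim)

# Usage: score a few candidate actions and pick the most "disputed" one.
ensemble = [make_linear_model(seed, state_dim=2, action_dim=1) for seed in range(5)]
state = [0.5, -1.0]
candidates = [[-1.0], [0.0], [1.0]]
best_action = max(candidates, key=lambda a: disagreement(ensemble, state, a))
```

Where the models have seen enough data to agree, the variance shrinks even if the transition itself is noisy, which is one reason an ensemble-disagreement signal is less likely to get stuck on stochastic dynamics than a raw prediction-error bonus.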

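This paper proposes an exploration method inspired by the active learning literature. The method trains an ensemble of dynamics models and incentivizes the agent to maximize the disagreement among the ensemble members, so that the agent learns skills in a self-supervised manner without external reward. The same disagreement objective is also used to optimize the agent's policy without reinforcement learning.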