Offline reinforcement learning (RL) aims to learn policies from static
datasets of previously collected trajectories. Existing methods for offline RL
either constrain the learned policy to the support of offline data or utilize
model-based virtual environments to generate simulated rollouts. However, these
methods suffer from (i) poor generalization to unseen states and (ii) marginal
improvement from low-quality simulated rollouts. In this paper, we propose
offline trajectory generalization through world transformers for offline
reinforcement learning (OTTO). Specifically, we use causal Transformers, a.k.a.
World Transformers, to predict state dynamics and the immediate reward. Then we
propose four strategies that use World Transformers to generate high-reward
simulated trajectories by perturbing the offline data. Finally, we jointly use
offline data with simulated data to train an offline RL algorithm. OTTO serves
as a plug-in module and can be integrated with existing offline RL methods to
enhance them with the better generalization capability of Transformers and
high-reward data augmentation. Through extensive experiments on D4RL
benchmark datasets, we verify that OTTO significantly outperforms
state-of-the-art offline RL methods.
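
The pipeline described above (perturb offline data, roll it out in a learned
world model, keep high-reward simulations, train on the mixture) can be
illustrated with a minimal sketch. Everything named here is an assumption made
for illustration: `toy_world_model` is a hand-written stand-in for a learned
World Transformer, and the single Gaussian perturbation of the initial state
with a "beats the source return" filter is not one of the paper's four
strategies, which the abstract does not specify.

```python
import random


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned World Transformer.

    Returns (next_state, reward) for 1-D toy dynamics.
    """
    next_state = state + action
    reward = -abs(next_state)  # reward is highest near the origin
    return next_state, reward


def simulate_from_perturbation(trajectory, world_model, noise_std, rng):
    """Perturb the first state, then replay the logged actions in the model."""
    state = trajectory[0][0] + rng.gauss(0.0, noise_std)
    simulated = []
    for _, action, _ in trajectory:
        next_state, reward = world_model(state, action)
        simulated.append((state, action, reward))
        state = next_state
    return simulated


def total_reward(trajectory):
    """Sum of rewards over (state, action, reward) tuples."""
    return sum(reward for _, _, reward in trajectory)


def augment(dataset, world_model, n_samples=8, noise_std=0.1, seed=0):
    """Return offline trajectories plus simulated ones that beat their source."""
    rng = random.Random(seed)
    augmented = list(dataset)
    for trajectory in dataset:
        for _ in range(n_samples):
            candidate = simulate_from_perturbation(
                trajectory, world_model, noise_std, rng)
            # Assumed filter: keep a simulation only if its return is higher.
            if total_reward(candidate) > total_reward(trajectory):
                augmented.append(candidate)
    return augmented
```

The list returned by `augment` could then be passed to any offline RL learner
in place of the original dataset, which is the sense in which the abstract
describes OTTO as a plug-in module.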

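Offline trajectory generalization through World Transformers for offline reinforcement learning (OTTO) is validated on D4RL benchmark datasets, where it shows a significant advantage over state-of-the-art offline RL methods.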

Offline Trajectory Generalization for Offline Reinforcement Learning

In this paper, we propose a model-based offline reinforcement learning method
that integrates count-based conservatism, named $\texttt{Count-MORL}$. Our
method utilizes the count estimates of state-action pairs to quantify model
estimation error; to the best of our knowledge, it is the first algorithm to
demonstrate the efficacy of count-based conservatism in model-based offline
deep RL. We first show that the estimation error is
inversely proportional to the frequency of state-action pairs. Second, we
demonstrate that the learned policy under the count-based conservative model
offers near-optimality performance guarantees. Through extensive numerical
experiments, we validate that $\texttt{Count-MORL}$ with hash code
implementation significantly outperforms existing offline RL algorithms on the
D4RL benchmark datasets. The code is accessible at
https://github.com/oh-lab/Count-MORL.
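
To make the idea of count-based conservatism with hash codes concrete, the
following is a minimal sketch and not the authors' implementation. It assumes a
sign-of-random-projection hash over concatenated state-action vectors and a
reward penalty of the form `beta / (n + 1)`, chosen only because it shrinks as
the count `n` grows; the class `HashCounter`, the function `penalized_reward`,
and the parameter `beta` are hypothetical names.

```python
import random
from collections import Counter


class HashCounter:
    """Counts state-action pairs via sign-of-random-projection hash codes."""

    def __init__(self, dim, n_bits=16, seed=0):
        rng = random.Random(seed)
        # One random Gaussian projection vector per hash bit.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.counts = Counter()

    def code(self, state, action):
        x = list(state) + list(action)
        # Each bit records which side of a random hyperplane x falls on.
        return tuple(sum(w * v for w, v in zip(plane, x)) >= 0.0
                     for plane in self.planes)

    def update(self, state, action):
        self.counts[self.code(state, action)] += 1

    def count(self, state, action):
        # Counter returns 0 for codes that were never seen.
        return self.counts[self.code(state, action)]


def penalized_reward(reward, n, beta=1.0):
    """Assumed penalty shape: rarely seen pairs (small n) are penalized more."""
    return reward - beta / (n + 1)
```

In a model-based loop, one would call `update` on every state-action pair in
the offline dataset and then apply `penalized_reward` to rewards predicted by
the learned model, so that rollouts drifting into rarely observed regions
receive lower rewards.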

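This paper proposes a model-based offline reinforcement learning method, $\texttt{Count-MORL}$, which uses count estimates of state-action pairs to quantify model estimation error and is the first to demonstrate the effectiveness of count-based conservatism in model-based offline deep RL. Through extensive numerical experiments, we validate that $\texttt{Count-MORL}$ with a hash-code implementation significantly outperforms existing offline RL algorithms on the D4RL benchmark datasets.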