Offline Reinforcement Learning (RL) methods leverage previously collected
experience to learn policies that outperform the behavior policy used to
collect the data.
In contrast to behavior cloning, which assumes the data is collected from
expert demonstrations, offline RL can work with non-expert data and multimodal
behavior policies. However, offline RL algorithms face challenges in handling
distribution shifts and effectively representing policies due to the lack of
online interaction during training. Prior work on offline RL uses conditional
diffusion models to obtain expressive policies that can represent the
multimodal behavior in the dataset. However, these methods are not designed to
improve generalization to out-of-distribution states. We introduce a novel
method that incorporates state reconstruction feature learning into the recent
class of diffusion policies to address this out-of-distribution generalization
problem. The state reconstruction loss encourages the policy to learn more
descriptive state representations, which alleviates the distribution shift
incurred by out-of-distribution states. We design a 2D Multimodal Contextual
Bandit environment to demonstrate
and evaluate our proposed model. We assess the performance of our model not
only in this new environment but also on several D4RL benchmark tasks,
achieving state-of-the-art results.
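
As a rough illustration of the idea above, the sketch below combines a diffusion-policy denoising loss with a state reconstruction loss computed on shared state features. It is a minimal sketch under stated assumptions, not the authors' implementation: the DDPM-style noise-prediction objective, the weighted-sum combination, the linear stand-ins for neural networks, and every name in it (`combined_loss`, `recon_weight`, `W_enc`, `W_dec`, `W_eps`, the dimensions) are hypothetical and do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
STATE_DIM, ACTION_DIM, FEAT_DIM = 2, 2, 8

# Hypothetical linear parameters standing in for neural networks.
W_enc = rng.normal(scale=0.1, size=(STATE_DIM, FEAT_DIM))   # state encoder
W_dec = rng.normal(scale=0.1, size=(FEAT_DIM, STATE_DIM))   # state decoder
W_eps = rng.normal(scale=0.1,
                   size=(FEAT_DIM + ACTION_DIM + 1, ACTION_DIM))  # noise predictor


def combined_loss(states, actions, alpha_bar, recon_weight=1.0):
    """Return (total, denoising, reconstruction) losses for one batch.

    Assumption: total = denoising + recon_weight * reconstruction.
    """
    batch = states.shape[0]

    # Sample a diffusion step per example and noise the dataset actions
    # (DDPM-style forward process).
    t = rng.integers(0, len(alpha_bar), size=batch)
    a_bar = alpha_bar[t][:, None]
    noise = rng.normal(size=actions.shape)
    noisy_actions = np.sqrt(a_bar) * actions + np.sqrt(1.0 - a_bar) * noise

    # Shared state features feed both the noise predictor and the decoder.
    feats = np.tanh(states @ W_enc)
    t_scaled = (t / len(alpha_bar))[:, None]
    eps_input = np.concatenate([feats, noisy_actions, t_scaled], axis=1)
    eps_pred = eps_input @ W_eps

    denoise = np.mean((eps_pred - noise) ** 2)        # policy (denoising) term
    recon = np.mean((feats @ W_dec - states) ** 2)    # state reconstruction term
    return denoise + recon_weight * recon, denoise, recon


# Usage on random data standing in for an offline dataset.
states = rng.normal(size=(16, STATE_DIM))
actions = rng.normal(size=(16, ACTION_DIM))
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 50))
total, denoise, recon = combined_loss(states, actions, alpha_bar, recon_weight=0.5)
```

Sharing the encoder between the two terms is the point of the sketch: the reconstruction term only affects the policy through the features it shapes. How the reconstruction term is actually weighted and wired into the policy network is not specified in the text above.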
